
Real-Time Change Data Capture at Scale: Engineering Openflow's Database Replication Architecture

Every transaction in your operational database tells a story — a customer purchase, inventory update or user interaction. Yet most organizations still wait hours or even days to act on these insights, running analytics on yesterday's data while opportunities slip away in real time.

We've seen this pain point countless times: engineering teams struggling with operational databases trapped behind firewalls, fragile ETL pipelines that break on schema changes, and batch processing windows that leave hours of latency between when business events occur and when they become available for analysis.

With Snowflake’s recent acquisition of Crunchy Data, we're setting a vision to unify operational and analytical data in a single system through Snowflake Postgres — eliminating the need to move your transactional data elsewhere for analytics. But we also recognize that many organizations will continue using existing OLTP databases outside of Snowflake. That's where Openflow's change data capture (CDC) capabilities become essential, offering a new approach to traditional CDC by seamlessly streaming changes from operational databases into the Snowflake AI Data Cloud for real-time analytics and AI applications.
 

The technical challenge: From database logs to analytical insights

Traditional CDC implementations present a complex engineering challenge. You need to tap into database-specific transaction logs — PostgreSQL's Write-Ahead Log (WAL), MySQL's binary log, SQL Server's Change Tracking, Oracle’s redo logs — while handling the intricacies of each database's replication protocol. Then you must stream these changes reliably to your data platform, handle schema evolution gracefully, and maintain exactly-once delivery semantics.

The core problems we set out to solve with Openflow's database connectors include:

  • Security and network isolation: Most transactional databases sit securely behind firewalls, inside virtual private clouds, with no external internet exposure.

  • Performance at scale: Processing 20,000+ change events per second across hundreds of tables simultaneously.

  • Schema evolution: Automatic adaptation to DDL changes, like column additions, renames or drops.

  • Operational complexity: Maintaining CDC infrastructure, handling failures and ensuring data consistency.
     

Openflow's CDC architecture: Native protocols meet Snowpipe Streaming

We designed Openflow's database connectors to combine database-native CDC protocols with Snowflake's high-performance Snowpipe Streaming API. At its core, each connector is implemented as an Apache NiFi data flow that can be deployed either in Snowflake's managed infrastructure or directly in customer VPCs via our BYOC (Bring Your Own Cloud) model, which we detail later in this post.

Database-native change capture

Rather than parsing SQL statements or polling for changes, Openflow connectors tap directly into each database's native replication streams:

  • PostgreSQL: Logical replication via write-ahead log (WAL) with configurable wal_level set to logical

  • MySQL: Binary log (binlog) replication for real-time change capture

  • SQL Server: Change Tracking (CT) API for incremental data extraction

  • Oracle (coming soon): XStream API for high-performance change data capture

This approach delivers several technical advantages. First, it provides true real-time capture; changes are detected the moment they're committed to the database. Second, it reduces impact on source database performance by using efficient native replication mechanisms rather than querying production tables.

-- Example publication configuration in PostgreSQL
CREATE PUBLICATION snowflake_publication FOR ALL TABLES;

-- Alternative: Create publication for specific tables
ALTER PUBLICATION snowflake_publication 
ADD TABLE orders, customers, products;
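
Before the connector can consume logical changes, the PostgreSQL source must have logical decoding enabled. The snippet below is a minimal sketch of that source-side preparation; the slot name snowflake_slot is illustrative, and whether you pre-create the slot manually or let the connector manage it depends on your setup.

-- Enable logical decoding (a server restart is required for this to take effect)
ALTER SYSTEM SET wal_level = logical;

-- Optionally pre-create a logical replication slot using the built-in pgoutput plugin
-- ('snowflake_slot' is an illustrative name)
SELECT pg_create_logical_replication_slot('snowflake_slot', 'pgoutput');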

The connector configuration in Openflow supports flexible table selection patterns:

{
  "Included Table Names": "public.orders,public.customers,inventory.products",
  "Included Table Regex": "public\\.order_.*|inventory\\..*",
  "Filter JSON": [
    {
      "schema": "public",
      "table": "customers",
      "included": ["customer_id", "name", "email", "created_at"],
      "excluded": ["internal_notes", "ssn"]
    },
    {
      "schema": "inventory", 
      "table": "products",
      "includedPattern": "^(product_|sku_).*",
      "excludedPattern": ".*_internal$"
    }
  ]
}

High-performance streaming architecture

The magic happens in how Openflow bridges database-native CDC streams with Snowflake's Snowpipe Streaming API. Our connectors implement a buffering and streaming mechanism that sustains the 20,000+ change events per second noted above while maintaining at-least-once delivery semantics.

Each Openflow connector operates as a multi-threaded NiFi data flow with several key components, including a configurable schedule that controls how often captured changes are merged into destination tables. The schedule is expressed as a cron expression:

# Example: Run merges every hour during business hours
0 0 8-17 * * ?

# Alternative: Continuous merging for real-time updates  
* * * * * ?
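
These schedules use Quartz-style cron expressions, the format NiFi uses for cron-driven scheduling: the six fields are seconds, minutes, hours, day of month, month and day of week, so the first example fires at the top of every hour from 08:00 through 17:00.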

Dual-table architecture for change history

One of our key innovations is the dual-table approach to CDC data management. For each source table, we create:

  1. Destination table: Contains the current state of data, continuously updated via MERGE operations.

  2. Journal table: An append-only table that contains the full history of record changes, along with metadata and payload columns that enable SCD Type 2 analysis. Changes are delivered to the journal with at-least-once guarantees and are compacted during merge operations.

The journal table captures complete change history:

-- Example journal table structure
CREATE OR REPLACE TABLE "Orders_JOURNAL_1749144055_1" (
    -- Primary key of the source row
    "PRIMARY_KEY__OrderID" NUMBER(38,0),
    -- Payload columns mirror the source table's columns
    "PAYLOAD__OrderID" NUMBER(38,0),
    "PAYLOAD__CustomerID" VARCHAR(16777216),
    "PAYLOAD__OrderDate" TIMESTAMP_NTZ(9),
    "PAYLOAD__ShippedDate" TIMESTAMP_NTZ(9),
    "PAYLOAD__Freight" NUMBER(19,4),
    -- Position of the change in the source's replication log, used to order events
    LEAST_SIGNIFICANT_POSITION NUMBER(38,0),
    MOST_SIGNIFICANT_POSITION NUMBER(38,0),
    -- INSERT, UPDATE or DELETE
    EVENT_TYPE VARCHAR(16777216),
    -- When the change was captured
    SEEN_AT TIMESTAMP_NTZ(9),
    -- Additional change metadata
    SF_METADATA VARIANT
);
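
Conceptually, the scheduled MERGE operations keep the destination table current by applying the most recent journal entry for each primary key. The statement below is a simplified sketch of that pattern using the column names above, not the connector's actual generated SQL; the destination table name "Orders" and its columns are assumptions.

-- Simplified sketch: apply the latest journal entry per key to the destination table
MERGE INTO "Orders" dst
USING (
    SELECT *
    FROM "Orders_JOURNAL_1749144055_1"
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY "PRIMARY_KEY__OrderID"
        ORDER BY MOST_SIGNIFICANT_POSITION DESC, LEAST_SIGNIFICANT_POSITION DESC
    ) = 1
) src
ON dst."OrderID" = src."PRIMARY_KEY__OrderID"
WHEN MATCHED AND src.EVENT_TYPE = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET
    dst."CustomerID" = src."PAYLOAD__CustomerID",
    dst."OrderDate" = src."PAYLOAD__OrderDate",
    dst."ShippedDate" = src."PAYLOAD__ShippedDate",
    dst."Freight" = src."PAYLOAD__Freight"
WHEN NOT MATCHED AND src.EVENT_TYPE != 'DELETE' THEN INSERT
    ("OrderID", "CustomerID", "OrderDate", "ShippedDate", "Freight")
    VALUES (src."PAYLOAD__OrderID", src."PAYLOAD__CustomerID",
            src."PAYLOAD__OrderDate", src."PAYLOAD__ShippedDate", src."PAYLOAD__Freight");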

This architecture enables several powerful use cases:

  • Point-in-time analysis: Query data as it existed at any moment in history

  • Change auditing: Track what changed and when

  • Event sourcing: Reconstruct business events from change streams

  • Compliance reporting: Maintain immutable audit trails

Example: Track an order's complete lifecycle through the journal table.

-- See the complete history of an order
SELECT 
    SEEN_AT,
    EVENT_TYPE,
    "PAYLOAD__OrderID",
    "PAYLOAD__CustomerID",
    "PAYLOAD__OrderDate",
    "PAYLOAD__RequiredDate",
    "PAYLOAD__ShippedDate",
    "PAYLOAD__Freight"
FROM "Orders_JOURNAL_1749144055_1"
WHERE "PAYLOAD__OrderID" = 10248
ORDER BY SEEN_AT;

-- Results might show the order lifecycle:
-- 2024-12-01 10:00:00 | INSERT | 10248 | VINET | 2024-12-01 | 2024-12-15 | NULL     | 32.38
-- 2024-12-01 14:20:00 | UPDATE | 10248 | VINET | 2024-12-01 | 2024-12-15 | 2024-12-01 | 32.38
-- 2024-12-03 09:15:00 | UPDATE | 10248 | VINET | 2024-12-01 | 2024-12-15 | 2024-12-03 | 32.38
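
The same journal also supports the point-in-time analysis use case listed above: keep only the latest change per key at or before a chosen timestamp, and you can reconstruct the table as it existed at that moment. A sketch, reusing the journal structure from the example:

-- Reconstruct each order as it existed at a chosen point in time
SELECT
    "PRIMARY_KEY__OrderID",
    "PAYLOAD__CustomerID",
    "PAYLOAD__OrderDate",
    "PAYLOAD__ShippedDate",
    "PAYLOAD__Freight"
FROM "Orders_JOURNAL_1749144055_1"
WHERE SEEN_AT <= '2024-12-02 00:00:00'::TIMESTAMP_NTZ
QUALIFY ROW_NUMBER() OVER (PARTITION BY "PRIMARY_KEY__OrderID" ORDER BY SEEN_AT DESC) = 1
    AND EVENT_TYPE != 'DELETE';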

Addressing enterprise-scale challenges
 

Security: BYOC deployment for enhanced control

For OLTP CDC, a common data integration challenge is reaching databases inside the customer network securely while retaining flexibility and control. The BYOC deployment option is a Goldilocks solution for this: Openflow creates infrastructure in customer VPCs, with private subnets hosting the EKS cluster, so your sensitive database connections never leave your network perimeter. The control plane remains in Snowflake for management and monitoring, but all data movement happens within your security boundary.

The BYOC deployment provides:

  • Network isolation: Database connectors run in your private subnets

  • Encryption in transit: All communications use TLS encryption

  • Key management: Integration with AWS Secrets Manager or HashiCorp Vault

  • Compliance: Address regulatory requirements for data residency

Figure 1: Openflow's BYOC (Bring Your Own Cloud) deployment architecture, illustrating secure and flexible data integration from source databases to Snowflake, maintaining data within the customer's security boundary.

Schema evolution: Automatic adaptation

One of our most powerful features is automatic adaptation to schema changes in source tables. When DDL changes occur, we detect schema modifications automatically, update destination tables to match new schema, create new journal tables for the updated schema, and continue replication seamlessly without manual intervention.

This automatic schema evolution supports common database operations like adding, renaming or deleting columns — eliminating one of the biggest operational pain points in traditional CDC implementations.
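
For example, when a column is added on the source, the connector adapts on its own. The sketch below walks through the idea; the source table public.orders is carried over from the earlier publication example, and the destination-side steps paraphrase the behavior described above rather than showing the connector's literal output.

-- On the source PostgreSQL database: a routine DDL change
ALTER TABLE public.orders ADD COLUMN delivery_notes TEXT;

-- Openflow then, without manual intervention:
--   1. detects the schema modification,
--   2. adds a matching column to the destination table in Snowflake,
--   3. creates a new journal table generation for the updated schema, and
--   4. continues replication where it left off.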
 

Beyond traditional ETL: AI-ready real-time data

The real power of Openflow's CDC architecture becomes apparent when you consider modern AI and analytics use cases. Openflow connects enterprise data to AI with near real-time, bidirectional data flows, no matter where a model or agent lives.

With change data streaming in near real-time, you can:

  • Build real-time ML feature stores: Continuously update machine learning features with up-to-the-minute data for more accurate and timely model predictions (see the sketch after this list).

  • Enable event-driven analytics: Automatically trigger analyses and alerts the moment business events occur, allowing for immediate insights and actions.

  • Achieve real-time personalization: Power recommendation engines with immediate customer behavior data, delivering highly relevant and current user experiences.

  • Drive operational AI: Fuel intelligent automation with the most current system state, enabling more responsive and effective autonomous operations.
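
To illustrate the feature-store use case, a Snowflake Dynamic Table can continuously derive features from a CDC-fed destination table. This is a sketch under assumptions: the destination table "Orders" and its columns follow the earlier journal example, and customer_order_features and transform_wh are hypothetical names.

-- Keep per-customer features fresh as CDC changes land in the destination table
CREATE OR REPLACE DYNAMIC TABLE customer_order_features
    TARGET_LAG = '1 minute'
    WAREHOUSE = transform_wh
    AS
    SELECT
        "CustomerID" AS customer_id,
        COUNT(*) AS order_count,
        SUM("Freight") AS total_freight,
        MAX("OrderDate") AS last_order_at
    FROM "Orders"
    GROUP BY "CustomerID";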

Example: Event-driven analytics with Snowflake Alerts.

-- Create alert for low inventory using real-time data
-- (assumes an email notification integration named 'alerts_email_int' already exists)
CREATE OR REPLACE ALERT low_inventory_alert
WAREHOUSE = alerts_wh
SCHEDULE = '1 minute'
IF (EXISTS (
    SELECT 1 FROM inventory_realtime
    WHERE current_stock < reorder_threshold
    AND last_updated > DATEADD('minute', -2, CURRENT_TIMESTAMP())
))
THEN
    CALL SYSTEM$SEND_EMAIL(
        'alerts_email_int',            -- notification integration to send through
        '[email protected]',  -- recipient (placeholder)
        'Low Inventory Alert',
        'Products below reorder threshold detected'
    );
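
Newly created alerts start out suspended, so remember to resume the alert before it begins evaluating on its schedule:

ALTER ALERT low_inventory_alert RESUME;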

Looking forward: The future of data movement

Openflow represents a fundamental shift in how we think about data integration. By combining database-native CDC protocols with cloud-native streaming architectures, we've eliminated the traditional trade-offs between performance, security and operational simplicity.

As organizations increasingly deploy AI agents and real-time applications, having fresh, consistent data available within seconds rather than hours becomes a competitive necessity. We are continuing to invest in Snowflake Openflow, with capabilities including Oracle database support, multicloud deployments, an expanded connector catalog and Apache Iceberg table support — all building toward a future where real-time data integration is as simple and reliable as running a SQL query.

The age of batch processing is ending. The future belongs to architectures that can move data at the speed of business — securely, reliably and at massive scale. That future is available today with Openflow.
