New product launch | Managed Airflow is now generally available. Learn more →

CDC: from zero to hero

Written by: Andrei Tserakhau, DoubleCloud Tech Lead, and Tim Brunko, Engineer

Modern microservice architectures often have a need for caches, full-text search indexes, replicas, and reactive interaction between components. Addressing each of those tasks individually can be a challenge, but it turns out there’s a single mechanism that can handle them all. Meet change data capture.

What is change data capture (CDC)?

CDC is a set of software development patterns that let you establish a reactive infrastructure, simplify microservices architecture, and break monoliths down into microservices.

In scenarios like these, you typically start with an online transaction processing (OLTP) database. With CDC, you obtain a stream of events regarding the addition, modification, and deletion of rows that you can then process. That transforms the database into an event-driven system.

A properly configured CDC cuts event responsiveness to a matter of milliseconds. That’s why it’s frequently employed when there’s a need for real-time analysis of a section of data moved to external systems.

CDC exemplars

Below are examples of SQL commands for a PostgreSQL database and the CDC events they generate. The format or protocol is not as important given that they’re just examples. In this notation, “op” stands for operation, which can be “c” for create (insert), “u” for update, or “d” for delete.

Database

CREATE TABLE my_little_pony (
  id INT PRIMARY KEY, name TEXT, description TEXT
);
INSERT INTO my_little_pony (id, name, description) 
VALUES 
  (1, 'Twilight Sparkle', ''), 
  (2, 'Rainbow Dash', ''), 
  (3, 'Spike', 'Spike goes BRRR');

Serialized CDC

{
     "op":"c", 
     "after":{
          "id":1,
          "name":"Twilight Sparkle",
          "description":""
      }
}
{
      "op":"c",
      "after":{
          "id":2,
          "name":"Rainbow Dash",
          "description":""
       }
}
{
      "op":"c",
      "after":{
          "id":3,
          "name":"Spike",
          "description":"Spike goes BRRR..."
       }
 }

Database

  UPDATE 
    my_little_pony 
  SET 
    description = '_' 
  WHERE 
   id < 3;

Serialized CDC

{
    "op":"u",
    "after":{
        "id":1,"name":
        "Twilight Sparkle",
        "description":"_"
     }
}
{
     "op":"u",
     "after":{
         "id":2,
         "name":"Rainbow Dash",
         "description":"_"
      }
}

Database

 DELETE FROM 
   my_little_pony 
 WHERE 
   name = 'Rainbow Dash';

Serialized CDC

{
  "op":"d",
  "before":{
      "id":2
   },
}

As seen in the example, CDC events have the following properties:

  • The database-specific details disappear in the CDC stream (in practice, it’s reduced to almost zero, but some information can still be obtained).

  • All commands are transformed into a set of modified rows, including the new state. Depending on the settings, the previous state can also be obtained.

Where CDC shines

The ability of the CDC to swiftly and efficiently move data in small portions makes it invaluable for real-time responsiveness to changes in the source database. There are two main areas where the CDC is applied:

  • Enabling reactive interactions between infrastructure components.
  • Implementing the patterns needed to build microservice architecture.

Let’s look at those two areas separately.

Reactive interaction between infrastructure components

In modern architectures, it’s common to work with only a portion of the database, and CDC is typically applied to specific tables. By having a sequence of modifying events for a table, you can obtain an asynchronous replica. And the fact that events have already been decoupled from the source database in a format independent of the database itself means that the receiving end can be anything: OLTP/OLAP/cache/full-text search.

If everything is set up correctly, you can, for instance, achieve low-latency caching without modifying the business logic for your services. Let’s explore how that works with specific replication examples.

Replication to a DWH

By applying a stream of changes to a data warehouse (DWH), you can obtain an asynchronous replica of your production OLTP database. That solution addresses multiple issues at once:

  • Analysts can work without impacting the production OLTP database.
  • Data is transferred from the transactional database to the analytical one.
  • Analysts can work with analytical databases.

For example, you can automatically obtain an asynchronous replica of your production PostgreSQL database in ClickHouse, Elasticsearch/OpenSearch, or S3.
Here’s an example of delivering a replication stream from MySQL to ClickHouse.

Cache invalidation

By applying a stream of changes to a cache, you can obtain it directly from your automatically updated (optionally aggregated/transformed) production OLTP database. The RedisCDC project for Redis is an excellent example.
It’s a distributed platform that simplifies the delivery of CDC streams to Redis.

Update search index (Elasticsearch/Solr)

By sending the relevant portion of the change stream to a full-text search engine, you can obtain an automatically updated full-text search index. Here’s an example of delivering a replication stream from MySQL to Elasticsearch.

Replication to the same OLTP storage

Having a stream of changes in the database allows you to obtain an asynchronous logical replica, for example, for a development environment (dev-stand). Since events are usually transmitted logically, you can also perform logical transformations on them along the way. One example is excluding columns with some sensitive data from the replica and encrypting other sensitive data.

Replication to another OLTP storage

If you want to replicate your data to another database, for example, so your production data in MySQL can be used in PostgreSQL or vice versa, CDC is your best friend. The replication might be necessitated by tool limitations.
Here’s an example of delivering data from MySQL to PostgreSQL.

Auditing what happens in your storage

CDC provides the ability to conduct a comprehensive audit of what’s happening in the database. By storing events for changes in the database as is, you can obtain audit logs of everything happening with your production database. You can also achieve real-time auditing when processing the message stream.

Here’s an example of a scenario with a similar audit.

Sharing change events between services

Dispatching events to a queue means you can notify other components or commands about them. While there’s a more systematic approach to this in the section about microservice patterns, you can inform your colleagues about events without relying on specific patterns.

Providing cross-region failovers

If you have a cross-region queue, you can duplicate readers in two different data centers by dispatching CDC to the queue, thereby delivering a regional failure guarantee (in case one region goes offline).

Reducing the load on the main OLTP storage

CDC lets you reduce the load on the master transactional database or service, especially when dealing with a microservice database. Serializing the CDC stream in a queue allows multiple receivers to work with the master’s information, burdening it only once. You can activate all the mentioned receivers, even doubling them to include copies at each data center, without changing the load on the master since they’ll all be reading the message queue.

CDC patterns for microservices

CDC is an approach that allows for simultaneous multiple microservice architecture patterns. From the perspective of microservice design patterns, CDC enables the concurrent implementation of event-driven processing between microservices, lets microservices be decoupled, and assists in breaking monoliths down into microservices. We won’t get into the differences between those patterns here, but there will be a list of useful references at the end for anyone who’s interested.

Outbox pattern

The outbox pattern, also known as application events, prescribes combining state changes with notification sending to other services. That can be achieved by either saving messages to be sent in adjacent tables within the same transaction (see transactional outbox) or subsequently sending messages through CDC.

When stacking the CDC stream into a queue, there’s a side effect: event sourcing with domain events on a bus like Apache Kafka and service interaction through Kafka. That partly enables microservice decoupling: services no longer need to actively notify others since anyone needing events can subscribe to them. The service only works with its own database. Additionally, some free benefits include the option to skip service discovery, reproducibility, and the ease of storing the entire stream, for example, in Elasticsearch.

CQRS pattern

Command query responsibility segregation (CQRS) is a pattern in which an application is conceptually divided into two parts: one modifying the state and the other reading it. CDC allows you to direct the part that modifies the state through a queue to a separate database where reading operations are performed.

Strangler pattern

Strangler is a pattern for breaking monoliths down into microservices. A properly prepared CDC is completely transparent to the legacy application and doesn’t require any changes to the inherited data model, which greatly facilitates the decomposition of the monolith and makes integration into the monolith as non-invasive as possible.

How to implement CDC

In theory, there are three ways to implement CDC: timestamps on rows, triggers on tables, and logical replication. In this section, we will look at the methods, their advantages, and their disadvantages.

Query-based CDC

For example, you might have a table in your database with a dedicated column (let’s call it “updated_at”) that gets filled with the value of now () at every insert/update.

You can regularly retrieve new or updated rows through polling with a straightforward query:

 SELECT 
    ...
 FROM 
    my_table 
 WHERE 
    updated_at > saved_val;

In this basic setup, however, you won’t be able to distinguish between an insert and an update, and you won’t receive notifications about deleted rows. Complicating the schema can teach you how to differentiate between inserts and updates (for example, by introducing a “created_at” field) and receive notifications for row deletions (for instance, by replacing deletions with a flag like “is_deleted”).

Pros

Cons

It works

Non-obvious interaction requirements
with the database

It doesn’t store the change history
separately (a trigger-based approach,
which will be discussed below, does)

Schema and business logic modifications
are needed for timestamp storage

Involves polling, which requires
an external process with regular queries
that may slow the database down

Simple implementations lack the capability
to capture row deletion events

There are modifications to this approach that mitigate some drawbacks while complicating the schema, but the core is the same. Here’s what you can add in rows:

  • Version numbers
  • State indicators
  • Time/version/status

Table triggers

In this case, you create a dedicated table for the history and set up triggers that activate when rows are modified or deleted.

Pros

Cons

It works

Triggers burden and slow
the database

Changes are captured instantly,
enabling real-time processing

Triggers modify the database

Triggers can capture all event types,
deletions included

Requires an external process
to read the change history

Triggers can add valuable metadata

Creating and maintaining triggers can
be complicated, especially considering
schema evolution

Logical replication

Every database provides ACID guarantees with a write-ahead log, a log of changes encoded in binary format. It’s called “wal” in PostgreSQL, “binlog” in MySQL, “redo-log” in Oracle, “oplog” in MongoDB, and “transaction log” in MSSQL.

The log already has what we need, so we just have to decode it and send it to a message queue.

But limitations do come up on occasion. In the case of PostgreSQL, for example, that approach isn’t compatible with pg_repack.

Pros

Cons

Processing happens in real time

It’s complicated

All change types are captured:
insert, update, and delete

It often requires a separate role

An additional data storage mechanism
isn’t required, unlike triggers, since
all data is stored in the wal log

The mechanism may not exist
in very old database versions

It’s mostly not polling, so it
doesn’t create a significant additional
load on the database (all that’s needed
is to sequentially read the wal
from the disk, decode it, and deliver it,
which is not typically noticeable)

If replication is not configured
correctly (not limiting wal growth,
for example), it can be
dangerous (logs can eat all disks)

How we do CDC at DoubleCloud Transfer

DoubleCloud Transfer tackled the cloud migration problem right from the beginning. A database snapshot is transferred, after which all changes that occur in the source database from that point on are applied to the target database. An asynchronous replica of the database is created in the cloud, and all that’s left is to disable write operations in the source database, wait a few seconds for the replica to catch up with the master, and enable operations in the cloud database.

That approach minimizes downtime when migrating to the cloud.
Over time, in addition to migration, the service was introduced for data transfer between different databases, such as from transactional to analytical databases, as well as between message queues and databases. If you put all supported transfer options in one matrix, you can see that it works with snapshots and data replication streams, scaling horizontally, and sharding.

In the service’s terminology, there are two main entities:

  • Endpoint: these are connection settings along with additional configurations. An endpoint can be either a source from which data is extracted or a target where data is loaded.

  • Transfer: this connects a source endpoint to a target endpoint. It includes its own settings, primarily the type of transfer: snapshot, replication, or snapshot + replication.

Once a transfer is created, it can be activated as a one-time action or scheduled for activation. The data is transferred in the case of a table snapshot, and then the transfer is automatically deactivated. For replication, an ongoing process is initiated to continuously move new data from the source to the target.

DoubleCloud Transfer

A cloud agnostic service for aggregating, collecting, and migrating data from various sources.

Transfer is written in Go and operates as a cloud-native solution. It has a control plane responsible for API and data plane management. When activating user transfers, data planes are created in runtimes. One example of a supported runtime is Kubernetes (K8s), where pods are launched to activate the transfer, making it a standalone service running on top of a specific runtime. EC2 and GCP runtime are also supported. When a transfer is activated, DoubleCloud Transfer creates virtual machines in the runtime as needed and starts moving data inside of it. It only takes a few clicks to configure data delivery clocking in at gigabytes per second.

The service has been successfully used at scale in production by multiple teams for several years. Besides its inherent horizontal scalability, we’ve parallelized wherever possible. Nearly two thousand data streams currently run in our internal infrastructure, moving multiple gigabytes per second.

We’ve also implemented the most frequently requested ELT transformations, and ELT capabilities will continue to expand. Configuration can be done through a user-friendly UI, API, and Terraform.

All of that has made DoubleCloud Transfer a versatile data transfer service from any source to any destination (with ELT processing capabilities) for both snapshots and data streams. Users have found numerous creative ways to utilize the service beyond its primary purpose, with CDC just one application.

Even more detail

Since DoubleCloud Transfer has been receiving replication streams at the logical replication level since its inception, the natural step was to make that available to users. And working through wal log processing is what turned DoubleCloud Transfer into a CDC solution.

We began shipping logical replication events to a queue based on user requests, and the decision was made to generate events in the Debezium format so we could become a drop-in replacement for Debezium. That way, it would be possible to replace one CDC product with another in a configured pipeline while keeping the pipeline. This approach would provide useful integrations without introducing a new data format.

As a result, we had to create converters from our internal objects to the serialized Debezium format and cover it with extensive testing.

DoubleCloud Transfer currently has the serializer implemented for PostgreSQL and MySQL sources, with MongoDB source support recently added. You can now easily configure CDC delivery from your YDB tables to Apache Kafka, EventHub, or Kinesis in Debezium format with just a few clicks.

That made DoubleCloud Transfer a drop-in Debezium replacement for MySQL and PostgreSQL. Some notable differences include DoubleCloud Transfer’s ability to handle master migration in PostgreSQL, when a replica becomes the master, with the pg_tm_aux plugin enabled. Additionally, the service can transfer user-defined types in PostgreSQL.

DoubleCloud Transfer also implemented the ability to organize query-based CDC, which the documentation refers to formally as incremental tables and informally as increments.

Real-world use cases for DoubleCloud Transfer

Let’s talk about popular use cases.

OLTP to a DWH

The most common use case is using CDC to create replicas of transactional databases (PostgreSQL, MySQL, MongoDB) in analytical data storages (ClickHouse, S3). Storing events in a queue unloads data from the transactional database once, and there are typically two analytical recipient databases: one in each data center to guarantee survival during regional outages.

The problem

Production processes run in transactional databases, and the business side needs analytical reports conveniently generated through analytical databases. Since HTAP databases are still considered exotic, data needs to be somehow transferred from the transactional database to the analytical one.

Before CDC

After CDC

Teams usually transfer data from
the transactional database to the analytical
one using their own ad-hoc scripts or
in- house solutions with snapshots,
opting for query-based CDC in the best
case scenario. That is not typically
formalized into a product and is
inconvenient to use.

CDC (at DoubleCloud Transfer)
allows for convenient replication of
the transactional database to
the analytical one, for example,
through a user interface, with
a replication lag of just
a few seconds
.

Regional analytics

This is the same as above, only with an extra requirement regarding regional outages.

The problem

Production data needs to be moved to a regional analytical DWH with a sub-second delay.

Before CDC

After CDC

A custom Java service was used to
export a snapshot once a day.

Afterward, a series of scripts were
written for exporting, and then another
internal service was used to perform
query-based CDC.

Introducing a new data
delivery process was challenging,
however, and delays happened at
the beginning and end of the day.

The team set up delivery for
incremental tables to a cross-datacenter
queue with regular execution.

Two other replication transfers
extract data from the queue to the DWH.
That ensures regional compliance
since each of the two transfers
delivers data to a specific regional
DWH installation in a separate data
center.

The queue itself provides
an additional benefit: only one
trip is made to the transactional
database even as the data ends up
in two analytical databases.

As a result, the team has three
transfers: one that runs automatically
on a schedule and exports new rows
and two more that run continuously
in replication mode, extracting data
from the queue into analytical storage.

CDC to a search engine

A customer from another team started using CDC to update Elasticsearch search indexes, and then applied CDC for reactive interaction between components. CDC ultimately became an integral part of the project.

The problem

When data in MySQL was modified, the Elasticsearch indexes had to be updated.

Before CDC

After CDC

Before CDC, a transactional output
pattern would have had to be
implemented.

But since DoubleCloud Transfer
already had CDC when the task appeared,
we implemented event shipping to the queue
and created a script to process events
from the queue.

DoubleCloud Transfer allowed us
to organize the process for updating
search indexes, after which our
colleagues, having assessed the
convenience of setting up a reactive
infrastructure, found a variety of
applications for CDC.

They use CDC to send push notifications,
proxy data to an external CRM, schedule
tasks, reactively respond to promotional
codes, and much more.

Analyzing a raw CDC stream

A customer transformed MySQL/PostgreSQL into a streaming database using CDC events on top of a DWH. The tool they created is called incremental materialized views.

The problem

MySQL is set up in the customer system such that many services access it to retrieve the configurations, so they need a full trail of config changes for observability purposes.

Before CDC

After CDC

Running all services in MySQL wasn’t
an option given the significant load.

Initially, caching services were set up:
caching services periodically fetched
settings from MySQL, and all other
services accessed the caching services.

The caching services began to consume
more and more resources, however, sometimes
lagging by up to dozens of minutes.

Meanwhile, services that accessed
the settings began taking too long
to restart because they took a long
time to initialize their caches.

After setting up a CDC stream with
DoubleCloud Transfer, a streaming
database was developed as the equivalent
of the airflow+dbt combination, where
DAGs only recalculate derivative data
for the changed rows, reactively
responding to CDC events.

Caches cut lag times to an average
of five seconds instead of the
dozens that had been the case.

Services were able to get rid of
local caches, saving lots of
RAM, and they also began restarting
quickly. We hope to receive more detail
about this solution in the future.

Summary

This article thoroughly explored CDC: its history, theory, use cases, open-source practices, and corporate practices in addition to real user stories.

To learn more about CDC, read Martin Kleppmann’s article and experiment with Debezium and DoubleCloud Transfer.

It’s worth noting that DoubleCloud Transfer is a fully managed cloud-native solution that’s easy to onboard and master. Feel free to come and use it, give us feedback, and request features.

Get started with DoubleCloud

Sign in to save this post