Database Intermediate Series: Change Data Capture(I)
Change Data Capture (CDC) refers to identifying and capturing changes made to data in a database and then delivering these changes in a format that other systems can utilize.
Change Data Capture (CDC) refers to identifying and capturing changes made to data in a database and then delivering these changes in a format that other systems can utilize.
Scope of Changes:
We’d want to capture both DML and DDL changes on the table. So, our CDC process should capture INSERT
, UPDATE
, or DELETE
operations on the table, as well as any creations or modifications to the schema of the tables.
Implementing CDC
Now that you know that capturing changes to a table would come under the scope of CDC, let’s see how it would work internally.
Triggers
One common way of implementing CDC is through the use of database triggers. Triggers are SQL commands that automatically execute in response to certain events on a particular table, such as INSERT
, UPDATE
, or DELETE
actions or even DDL operations.
Storage: The change data captured by triggers is typically written into separate audit or change tables within the same database. These tables are structured to log not only the change data but also metadata like the timestamp of the change and the type of operation.
Example:
The above script can be used to create a trigger that would insert records into the orders_change_log
table whenever a new record is inserted in the orders
table.
Advantages of Triggers:
Immediate Capture: Since database systems offer triggers, they are synchronous with your DML and DDL operations.
Temporarily Suspend: You can disable a trigger if you expect bulk operations to happen, allowing you to skip capturing changes in cases you’re backfilling.
Disadvantages of Triggers:
Performance Overhead: Triggers can significantly impact database performance by increasing the processing needed for each transaction, as the trigger is invoked synchronously with DML/DDL operations.
Handling Schema Changes: You always need to ensure that any breaking change to the schema of the source table is also made in the change table, else operations on the source table will start to fail.
Log-Based Change Capture
As the term suggests, in this approach, we monitor the transaction logs/binlogs of the databases for changes and capture them.
Every relational database management system (RDBMS) maintains transaction logs to keep track of all changes. For instance, MySQL has binary logs, PostgreSQL uses write-ahead logging (WAL), and SQL Server utilizes transaction logs.
The log can be asynchronously read, and the consumer system can process the committed changes in the log. Subsequent action can be taken, like persisting it in another table or putting data on Kafka/SQS.
Advantages of Log-Based Change Capture:
Minimal Performance Impact: Since log-based CDC reads transaction logs asynchronously, it minimally impacts the database performance compared to trigger-based methods.
High Throughput: This is efficient for environments with high transaction volumes, as it can handle large amounts of data changes without degrading the source system’s performance.
Implementation Challenges:
Complexity of Log Formats: Transaction logs are typically complex and proprietary to each database vendor, requiring specialized understanding and tools to interpret them. You can leverage solutions like Debezium to overcome this problem.
Handling Schema Changes: Log-based CDC must also manage schema changes in the database, like adding new columns or changing data types. This implies that the downstream system where the changes are being dumped need to be aware of the schema, typically using some sort of Schema Registry to handle schema updates.
This is the first post, to get you started with CDC with an example of tracking changes using Triggers. In our next post, we’ll see how to set up Log-Based CDC with examples. Do let me know if you have any questions.
Fun Fact: I wrote about CDC almost 10 years ago, but that was specific to how CDC works within Microsoft SQL Server. You can find it here.
thanks Pratik, my team at Flipkart was using debezium to capture CDC.
Waiting for part II :)