Database Intermediate Series: Change Data Capture(II)

Change Data Capture (CDC) refers to identifying and capturing changes made to data in a database and then delivering these changes in a format that other systems can utilize.

Pratik Pandey

May 28, 2024

Database Intermediate Series: Change Data Capture(I)

Pratik Pandey

April 22, 2024

Read full story

Our previous post discussed Change Data Capture and how to implement it using triggers. In this post, we’ll explore how to implement Log-Based CDC.

Log-Based Change Capture

Before proceeding with log-based change capture, we must understand the log we’ll use for Change Capture. To do that, let’s create a master replica setup and dive into understanding things in detail.

Setting up Docker Compose

Few things to highlight from the docker-compose:

You’d see a server-id allocated to each MYSQL instance/container. That’s required if you want to set up replication, as MYSQL uses this id to identify each instance in the replication topology.
The log-bin parameter enables binary logging on the Primary instance, which is crucial for replication.

Setting up Replication

You can then run the following script to set up replication. The script performs the following tasks:

Connects to the primary MYSQL instance to create the replication user. A replication user is a specialized database user specifically for setting up and managing replication processes between MySQL servers.

CREATE USER 'replicator'@'%' IDENTIFIED WITH mysql_native_password BY 'replicapass';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;

We must inform the replica about the current binary log file name and position from the primary server to set up replication. The replica server will use this information to replicate from the correct point in the primary server’s binary log.
To configure the replica server with the necessary information to begin replication. It sets the primary server’s host, user, password, and the starting point in the primary server’s binary log.
Then, we set up the database schema on the primary server by running SQL statements contained in init.sql

CREATE DATABASE IF NOT EXISTS mydb;

USE mydb;
CREATE TABLE IF NOT EXISTS mytable (
    id INT NOT NULL PRIMARY KEY ,
    value VARCHAR(10)
);

INSERT INTO mytable VALUES (1, 'Pratik');
INSERT INTO mytable VALUES (2, 'Demo');

After doing this, log into the replica container, and you’ll see that the records have also been replicated on the replica server.

Understanding the binlog

As an optional exercise, lets dive into the binlog and read it’s contents. You could find the binlogs on the MYSQL Primary instance under /var/lib/mysql/ path. You can copy the binlog to your local system and examine it using the mysqlbinlog tool.

# Copy the binlog to local
docker cp  mysql-primary:/var/lib/mysql/mysql-bin.000003 ./

# Examine the binlog file
mysqlbinlog --base64-output=DECODE-ROWS -vv ./mysql-bin.000003

The mysqlbinlog will output the contents of the binlog in a structured format, so let’s understand some of the binlog events.

Binary Log Metadata:

# at 4
#240528 22:02:37 server id 1 end_log_pos 126 CRC32 0x89fe215b Start: binlog v 4, server v 8.0.36 created 240528 22:02:37
# Warning: this binlog is either in use or was not closed properly.

These entries provide metadata about the binlog itself, including the server ID, position in the binlog and checksum for integrity verification.

GTID Metadata:

# at 157
#240528 22:04:18 server id 1  end_log_pos 236 CRC32 0xc701b68d 	GTID	last_committed=0	sequence_number=1	rbr_only=yes	original_committed_timestamp=1716926658410521	immediate_commit_timestamp=1716926658410521	transaction_length=291
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=1716926658410521 (2024-05-28 22:04:18.410521 CEST)
# immediate_commit_timestamp=1716926658410521 (2024-05-28 22:04:18.410521 CEST)
/*!80001 SET @@session.original_commit_timestamp=1716926658410521*//*!*/;
/*!80014 SET @@session.original_server_version=80036*//*!*/;
/*!80014 SET @@session.immediate_server_version=80036*//*!*/;
SET @@SESSION.GTID_NEXT= 'eb8e7e17-1c5c-11ef-8711-0242ac150002:1'/*!*/;

You could find a lot of useful information from the GTID metadata:

server id: The server on which the transaction was executed, which is the primary server in our case and hence id of 1.
last_committed: The last_committed field is used to determine the dependencies between transactions, particularly in parallel replication scenarios. It helps the replica determine which transactions must be committed before others, ensuring transactional consistency.
sequence_number: It is a unique incremental id attached to each transaction and used to maintain the correct order of transactions(as on primary) when they are applied on the replica.
rbr_only: This indicates that the event should be replicated in a row-based format only. MySQL supports three types of binary logging formats, SBR(Statement based), RBR(Row Based) and MBR(Mixed Based).
original_committed_timestamp and immediate_commit_timestamp: These are used in case of Delayed Replication where a replica server deliberately executes transactions later than the source by at least a specified amount of time.

Transaction Start and Table Mapping Metadata:

BEGIN;
# at 311
#240528 22:04:18 server id 1  end_log_pos 372 CRC32 0xb6299d31  Table_map: `mydb`.`mytable` mapped to number 88
# has_generated_invisible_primary_key=0

The above block provides you with the following information:

BEGIN: Even though we never explicitly gave any BEGIN TRAN command, you could see that the binlog contains a BEGIN statement, implying an implicit transaction was started by MYSQL.
Table_map: This event maps a table name to a numerical identifier (88 in this case), which is used in subsequent row events to refer to the table efficiently. This mapping is crucial for row-based replication where table identifiers are needed to apply row changes.

DML Events:

# at 372
#240528 22:04:18 server id 1  end_log_pos 417 CRC32 0x15cb7fa7  Write_rows: table id 88 flags: STMT_END_F
### INSERT INTO `mydb`.`mytable`
### SET
###   @1=2 /* INT meta=0 nullable=0 is_null=0 */
###   @2='Demo' /* VARSTRING(40) meta=40 nullable=1 is_null=0 */
# at 417
#240528 22:04:18 server id 1  end_log_pos 448 CRC32 0xec7d4c36 	Xid = 56
COMMIT/*!*/;

The above is an example of an insert statement that we executed in our script.

Write_rows: This event type is used in row-based logging to record changes made to rows of a table. You’d notice that the table is reference by id 88, which is the table mapping created in previous step.
INSERT: This event represents the insertion of a new row into mydb.mytable where the first column (@1) is an integer with the value 2, and the second column (@2) is a variable-length string, which is nullable, containing the value 'Demo'.
COMMIT: Commit marks the end of the transaction, ensuring that all operations performed since the BEGIN; are permanently applied to the database.

Hopefully the above gives you a peek into MYSQL binlog. I’ve not covered all different event types, just some to give a good high level idea on how things are structured in the binlog.

You must wonder why I’m discussing replication and binlog in a Change Data Capture post. That’s because the underlying mechanism of CDC is nothing but a special case of replication. In case of replication, the MYSQL replica reads the binlog and takes the action of persisting the changes locally.

In case of CDC, an external process reads the binlog, consuming records just like a MYSQL replica would do and take certain actions like dumping data to Kafka or processing and persisting to another data store.

CDC is more generic where the source could be different data stores, so the analogy was more for cases where the source is a RDBMS, as most CDC implementations use the transactional logs.

👉 Connect with me here: Pratik Pandey on LinkedIn

💻 Checkout the code on Github.