CDC (Change Data Capture) Complete Guide 2025: Debezium, Kafka, Data Sync, Outbox Pattern
- Author: Youngju Kim (@fjvbn20031)
TL;DR
- CDC = game changer for data sync: change event streams instead of polling. Real-time, efficient.
- Debezium is the de facto standard: supports PostgreSQL, MySQL, MongoDB, SQL Server, etc.
- WAL/binlog based: reads DB internal logs to capture changes. Non-invasive.
- Outbox Pattern: solves the dual-write problem. Standard for microservice events.
- Use cases: DW sync, search index update, cache invalidation, event-driven architecture.
1. Why CDC
1.1 Traditional Data Sync
Polling:
while True:
    new_records = db.query("SELECT * FROM users WHERE updated_at > ?", last_sync)
    sync_to_warehouse(new_records)
    last_sync = now()
    sleep(60)
Problems:
- Latency: 1-minute granularity — not real-time
- Load: queries DB every time
- Misses DELETE: WHERE updated_at > ? cannot detect deletions
- Gaps: transaction timing can cause missed rows
- Non-scalable: queries slow down as data grows
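The DELETE blind spot is easy to demonstrate with a toy in-memory table standing in for the database (names and rows are illustrative; no real client involved):

```python
from datetime import datetime

# In-memory stand-in for a users table; each row carries an updated_at timestamp.
now = datetime(2025, 1, 1, 12, 0, 0)
users = {
    1: {"id": 1, "name": "Alice", "updated_at": now},
    2: {"id": 2, "name": "Bob", "updated_at": now},
}

def poll_changes(table, last_sync):
    """The polling query: SELECT * FROM users WHERE updated_at > last_sync."""
    return [row for row in table.values() if row["updated_at"] > last_sync]

last_sync = now

# A row is deleted after the last sync...
del users[2]

# ...but the next poll returns nothing: the deleted row is simply gone,
# so nothing matches updated_at > last_sync. The warehouse keeps a stale copy.
missed = poll_changes(users, last_sync)
print(missed)  # []
```

A log-based CDC pipeline sees the DELETE as an explicit event instead, because it reads the change record rather than the table's current state.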
1.2 The CDC Promise
[DB] -> [WAL/binlog] -> [CDC tool] -> [Kafka] -> [Consumers]
                            ^
                 real-time, non-invasive
Advantages:
- Real-time: millisecond latency
- No DB load: only reads logs
- All changes: captures INSERT, UPDATE, DELETE
- Order preserved: transaction order
- Scalable: independent of data volume
1.3 Use Cases
| Use Case | Description |
|---|---|
| Data warehouse sync | OLTP to Snowflake/BigQuery in real time |
| Search index update | DB change reflected immediately in Elasticsearch |
| Cache invalidation | DB update invalidates Redis cache |
| Event-driven | DB changes become domain events |
| Microservice sync | One service's data consumed by another |
| DB migration | Zero-downtime migration (old to new DB) |
| Audit log | Full change history retained |
2. How CDC Works
2.1 PostgreSQL WAL
WAL (Write-Ahead Log): log of every change written to disk. Foundation of crash recovery.
INSERT INTO users (name) VALUES ('Alice');
-> [WAL]: tx_id=123, op=INSERT, table=users, data={name:'Alice'}
-> [Disk page write]
CDC reads this WAL to capture changes.
2.2 Logical vs Physical Replication
Physical Replication (Streaming):
- Disk-page-level replication
- Standby is byte-identical to primary
- Pros: fast, exact
- Cons: same version/schema only
Logical Replication:
- Replicates logical changes (INSERT/UPDATE/DELETE)
- Different schemas and versions allowed
- Pros: flexible, suitable for CDC
- Cons: slight overhead
2.3 PostgreSQL Logical Replication Setup
# postgresql.conf
wal_level = logical
max_wal_senders = 10
max_replication_slots = 10
-- Create user
CREATE USER cdc_user REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;
-- Create publication (which tables to replicate)
CREATE PUBLICATION my_pub FOR TABLE users, orders;
-- Or all tables
CREATE PUBLICATION my_pub FOR ALL TABLES;
-- Create replication slot (holds changes)
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');
2.4 MySQL binlog
MySQL uses binlog:
# my.cnf
log_bin = mysql-bin
binlog_format = ROW # ROW is required for CDC
binlog_row_image = FULL
server_id = 1
expire_logs_days = 7
binlog_format options:
- STATEMENT: SQL text (non-deterministic on replay)
- ROW: per-row changes (suitable for CDC)
- MIXED: both
2.5 MongoDB Change Streams
const changeStream = db.collection('users').watch()
changeStream.on('change', (change) => {
  console.log(change)
  // { operationType: 'insert', fullDocument: {...}, ... }
})
MongoDB Change Streams are built on the oplog (the replica set's operation log).
3. Debezium — the CDC Standard
3.1 What is Debezium?
Debezium = open-source CDC platform. Started by Red Hat.
Architecture:
[PostgreSQL] -> [Debezium connector] -> [Kafka Connect] -> [Kafka topics] -> [Consumers]
Core features:
- Supports many DBs (PG, MySQL, MongoDB, Oracle, SQL Server, Cassandra)
- Built on Kafka Connect
- At-least-once delivery (exactly-once achievable with idempotent consumers or Kafka transactions)
- Schema change detection
- Filtering and transformation
3.2 Debezium Install (Docker Compose)
version: '3.7'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
  connect:
    image: debezium/connect:2.5
    depends_on: [kafka, postgres]
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
  postgres:
    image: debezium/example-postgres:2.5
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
3.3 Register PostgreSQL Connector
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres",
      "database.port": "5432",
      "database.user": "cdc_user",
      "database.password": "secret",
      "database.dbname": "mydb",
      "topic.prefix": "mydb",
      "table.include.list": "public.users,public.orders",
      "plugin.name": "pgoutput",
      "publication.name": "my_pub",
      "slot.name": "my_slot"
    }
  }'
Debezium now auto-publishes PostgreSQL changes to Kafka.
3.4 Kafka Topic Structure
Debezium creates one topic per table:
mydb.public.users <- changes to users
mydb.public.orders <- changes to orders
Message format:
{
  "before": null,
  "after": {
    "id": 123,
    "name": "Alice",
    "email": "alice@example.com"
  },
  "source": {
    "version": "2.5",
    "connector": "postgresql",
    "name": "mydb",
    "ts_ms": 1681545600000,
    "snapshot": "false",
    "db": "mydb",
    "schema": "public",
    "table": "users",
    "txId": 12345,
    "lsn": 23456789
  },
  "op": "c",
  "ts_ms": 1681545600000
}
op values: c=create, u=update, d=delete, r=read (snapshot).
3.5 Kafka Consumer
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'mydb.public.users',
    bootstrap_servers=['kafka:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    event = message.value
    op = event['op']
    if op == 'c':
        print(f"new user: {event['after']}")
    elif op == 'u':
        print(f"update: {event['before']} -> {event['after']}")
    elif op == 'd':
        print(f"delete: {event['before']}")
4. Outbox Pattern — Solving Dual-Write
4.1 The Dual-Write Problem
Scenario: creating an order saves to DB + publishes Kafka event.
def create_order(order):
    db.save(order)               # 1. database write
    kafka.send("orders", order)  # 2. broker publish
Problem: what if a crash happens between 1 and 2?
- Only 1 succeeds: exists in DB but event never sent — downstream inconsistency
- Only 2 succeeds: event sent but not in DB — ghost event
Dual-write problem: atomic writes across two systems are impossible.
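The failure window can be simulated with in-memory stand-ins for the database and broker (illustrative names, not real clients):

```python
# Minimal simulation of the dual-write failure mode.
# "db" and "broker" are in-memory stand-ins, not real clients.
db, broker = [], []

def create_order(order, crash_after_db=False):
    db.append(order)  # write 1: database
    if crash_after_db:
        # Process dies between the two writes
        raise RuntimeError("process crashed before publish")
    broker.append(order)  # write 2: event broker

try:
    create_order({"id": 1}, crash_after_db=True)
except RuntimeError:
    pass

# The two systems have diverged: the order exists, but its event was never sent.
print(len(db), len(broker))  # 1 0
```

No ordering of the two writes fixes this: publish-first produces the opposite inconsistency (a ghost event with no DB row).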
4.2 Outbox Pattern
Key idea: inside the DB transaction, also save the event to an outbox table.
CREATE TABLE outbox (
    id UUID PRIMARY KEY,
    aggregate_type VARCHAR(255),
    aggregate_id VARCHAR(255),
    event_type VARCHAR(255),
    payload JSONB,
    published BOOLEAN DEFAULT false,  -- used by the polling publisher option
    created_at TIMESTAMP DEFAULT NOW()
);

def create_order(order):
    with db.transaction():
        db.save(order)
        db.save(OutboxEvent(
            aggregate_type='Order',
            aggregate_id=order.id,
            event_type='OrderCreated',
            payload=order.to_json()
        ))
Atomicity: both INSERTs in the same transaction succeed or fail together.
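The same transaction can be sketched end to end with stdlib sqlite3 standing in for PostgreSQL (a sketch under that assumption, not a production implementation; table names follow the schema above):

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL; the schema mirrors the outbox table above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    aggregate_type TEXT, aggregate_id TEXT,
    event_type TEXT, payload TEXT)""")

def create_order(order):
    # One transaction: both INSERTs commit together or roll back together.
    with conn:
        conn.execute(
            "INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            ("Order", order["id"], "OrderCreated", json.dumps(order)))
        conn.execute("INSERT INTO orders VALUES (?, ?)",
                     (order["id"], order["total"]))

create_order({"id": "o-1", "total": 99.5})

# A failed order INSERT (duplicate PK) rolls back the outbox row too.
try:
    create_order({"id": "o-1", "total": 0.0})
except sqlite3.IntegrityError:
    pass

orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
events = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(orders, events)  # 1 1 -- the counts stay in lockstep
```

The `with conn:` block is what provides the atomicity: on any exception inside it, both INSERTs are rolled back, so the order and its event can never diverge.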
4.3 Outbox -> Kafka
Option 1: Polling Publisher (simple)
while True:
    events = db.query("""
        SELECT * FROM outbox
        WHERE published = false
        ORDER BY created_at
        LIMIT 100
    """)
    for event in events:
        kafka.send(event.aggregate_type, event.payload)
        db.execute("UPDATE outbox SET published = true WHERE id = ?", event.id)
    sleep(1)
Downsides: polling overhead, slight delay.
Option 2: CDC (recommended)
Debezium watches the outbox table and auto-publishes to Kafka.
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "outbox-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "table.include.list": "public.outbox",
      "transforms": "outbox",
      "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
      "transforms.outbox.route.by.field": "aggregate_type",
      "transforms.outbox.table.field.event.id": "id",
      "transforms.outbox.table.field.event.payload": "payload"
    }
  }'
Effect: no polling code needed, near real-time.
5. Data Warehouse Sync
5.1 Scenario
OLTP (PostgreSQL) -> DW (Snowflake/BigQuery) in real time.
5.2 Limits of Traditional ETL
[Extract] -> [Transform] -> [Load]
(batch, daily) (slow) (bulk)
Problems: 24-hour delay, heavy DB load, reprocessing all data.
5.3 CDC-Based ELT
[PostgreSQL] -> [Debezium] -> [Kafka] -> [Stream Processor] -> [Snowflake]
Advantages: real-time (seconds), minimal DB load, only changes processed.
5.4 Tools
| Tool | Notes |
|---|---|
| Debezium + Kafka Connect | Open-source, self-hosted |
| Fivetran | Managed, expensive, easy |
| Airbyte | Open-source + managed |
| Stitch | Simple, managed |
| AWS DMS | AWS-integrated |
| Striim | Enterprise |
| Hevo | No-code |
5.5 Schema Evolution
Problem: what about added/dropped columns in the source DB?
Debezium behavior:
- New column: auto-detected, included in events
- Dropped column: simply stops appearing in subsequent events
- Type change: publishes new schema
Schema Registry (Confluent): stores all schema versions, validates compatibility, lets consumers read safely.
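On the consumer side, additive changes are commonly tolerated by projecting each event onto the fields the consumer knows. A minimal sketch (field names are illustrative, not a fixed Debezium contract):

```python
# Fields this consumer understands, with defaults for rows produced
# before a column existed. Unknown (newly added) fields are ignored.
EXPECTED = {"id": None, "name": None, "email": None}

def normalize(after):
    """Project a Debezium 'after' image onto the known fields."""
    return {field: after.get(field, default)
            for field, default in EXPECTED.items()}

# Event produced after the source table gained a 'phone' column:
new_schema_row = {"id": 1, "name": "Alice", "email": "a@x.io", "phone": "555"}
# Event produced before the 'email' column existed:
old_schema_row = {"id": 2, "name": "Bob"}

print(normalize(new_schema_row))  # {'id': 1, 'name': 'Alice', 'email': 'a@x.io'}
print(normalize(old_schema_row))  # {'id': 2, 'name': 'Bob', 'email': None}
```

A Schema Registry formalizes the same idea: compatibility rules (backward/forward) guarantee that this kind of projection is always safe.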
6. Search Index Sync
6.1 Scenario
PostgreSQL -> Elasticsearch real-time indexing.
6.2 Architecture
[PostgreSQL] -> [Debezium] -> [Kafka] -> [Elasticsearch Sink Connector] -> [Elasticsearch]
6.3 Elasticsearch Sink Connector
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "es-sink",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "topics": "mydb.public.products",
      "connection.url": "http://elasticsearch:9200",
      "type.name": "_doc",
      "key.ignore": "false",
      "schema.ignore": "true",
      "behavior.on.malformed.documents": "warn"
    }
  }'
6.4 Effect
- New product -> searchable in 1-2s
- Price update -> search results refresh automatically
- Delete -> immediately removed from index
A legacy batch reindex might run only hourly and take hours to complete; with CDC, each change reaches the index in near real time.
7. Cache Invalidation
7.1 Problem
def get_user(user_id):
    user = cache.get(f"user:{user_id}")
    if user is None:
        user = db.query(...)
        cache.set(f"user:{user_id}", user, ttl=3600)
    return user

def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id=?", user_id)
    cache.delete(f"user:{user_id}")  # what if forgotten? what if another service updates?
Fix: automatic invalidation via CDC.
7.2 CDC-Based
consumer = KafkaConsumer('mydb.public.users')
for message in consumer:
    event = message.value
    user_id = event['after']['id'] if event['op'] != 'd' else event['before']['id']
    cache.delete(f"user:{user_id}")
Advantages: no code changes, all changes captured, centralized.
8. Zero-Downtime Migration
8.1 Scenario
Legacy MySQL -> new PostgreSQL migration.
8.2 Steps
Step 1: CDC setup
[MySQL (current)] -> [Debezium] -> [Kafka] -> [PostgreSQL (new)]
Step 2: Initial snapshot — Debezium copies all MySQL data to PG once.
Step 3: Real-time sync — only changes flow to PG; both DBs stay nearly in sync.
Step 4: Shift read traffic — route some reads to PG (verify), then 100%.
Step 5: Shift write traffic — reverse CDC direction (PG -> MySQL) for rollback safety, move all writes to PG, then retire MySQL.
8.3 Advantages
- Zero downtime: no user impact
- Rollback: revert instantly on issues
- Incremental: risk distributed
- Verification: compare both DBs for consistency
9. Common Pitfalls and Fixes
9.1 Replication Slot Accumulation
Unused PostgreSQL replication slots cause unbounded WAL growth — disk fills up.
-- Check active slots
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;
-- Drop unused slot
SELECT pg_drop_replication_slot('unused_slot');
Must monitor: alerts on WAL size.
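The alerting side can be sketched in a few lines, assuming slot rows shaped like the query above plus a precomputed retained-WAL figure (e.g. from pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)); names and thresholds are illustrative:

```python
# Flag replication slots that are inactive or holding too much WAL.
# Input rows are assumed to carry slot_name, active, and retained_bytes.
THRESHOLD_BYTES = 10 * 1024**3  # alert above 10 GiB of retained WAL

def slots_to_alert(slots, threshold=THRESHOLD_BYTES):
    return [s["slot_name"] for s in slots
            if not s["active"] or s["retained_bytes"] > threshold]

slots = [
    {"slot_name": "my_slot", "active": True, "retained_bytes": 512 * 1024**2},
    {"slot_name": "unused_slot", "active": False, "retained_bytes": 40 * 1024**3},
]
print(slots_to_alert(slots))  # ['unused_slot']
```

An inactive slot is the dangerous case: PostgreSQL retains WAL for it indefinitely, so it should page someone long before the disk fills.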
9.2 Large Transactions
Processing millions of rows in one transaction blows up Debezium memory.
Fixes: split transactions, tune max.queue.size, use heartbeats.
9.3 Schema Changes
ALTER TABLE can break CDC.
Prevention: additive changes only, small DDL units, use Debezium's schema history.
9.4 Exactly-Once Delivery
Debezium is at-least-once (duplicates possible).
Fixes: consumer idempotency, Kafka transactions API, message ID tracking.
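Consumer idempotency can be sketched as an event-ID dedup filter; in production the seen-ID set would be persisted in the target store (ideally in the same transaction as the side effect), not held in memory:

```python
# Idempotent handling of at-least-once delivery: each event carries an ID,
# and redelivered duplicates become no-ops.
seen = set()
applied = []  # stand-in for the real side effect (DB write, cache update, ...)

def handle(event):
    if event["id"] in seen:  # duplicate redelivery -> skip
        return False
    seen.add(event["id"])
    applied.append(event["payload"])  # side effect applied exactly once
    return True

events = [
    {"id": "e1", "payload": "create user 1"},
    {"id": "e2", "payload": "update user 1"},
    {"id": "e1", "payload": "create user 1"},  # redelivered duplicate
]
results = [handle(e) for e in events]
print(results)  # [True, True, False]
print(applied)  # ['create user 1', 'update user 1']
```

With Debezium, the source LSN/txId in the event envelope can serve as the dedup key.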
9.5 Monitoring
# Connector status
curl http://connect:8083/connectors/postgres-connector/status
# Metrics (JMX)
debezium_metrics_QueueRemainingCapacity
debezium_metrics_NumberOfEventsFiltered
Key signals: Lag (source DB delta), Queue size, Error rate.
10. CDC vs Alternatives
10.1 CDC vs Polling
| | CDC | Polling |
|---|---|---|
| Latency | real-time (ms) | minutes/hours |
| DB load | very low | high |
| DELETE detection | yes | no |
| Complexity | high | low |
| Infra | Kafka, etc. | none |
10.2 CDC vs Triggers
Triggers: DB trigger writes to another table. Simple, but hurts DB performance and adds transaction cost.
CDC: external tool reads WAL. Non-invasive, but requires extra infra.
10.3 CDC vs Application Events
App-level: code publishes events directly. Integrates with business logic but suffers dual-write (solved by Outbox).
CDC: directly from DB. Safe and complete, but limited to data-model level.
Combination: Outbox + CDC = best of both worlds.
Quiz
1. Difference between Logical and Physical Replication?
Answer: Physical Replication copies at the disk-page level; the standby is byte-identical to the primary. Fast but requires the same version and schema — mainly used for HA. Logical Replication replicates logical changes (INSERT/UPDATE/DELETE); different schemas and versions work. Suitable for CDC, supported in PostgreSQL 10+. CDC is built on logical replication, enabled via wal_level = logical.
2. How does the Outbox Pattern solve dual-write?
Answer: inside the DB transaction, the event is also stored in an outbox table. Both INSERTs share one transaction, so they atomically succeed or fail together. A separate process (Polling or CDC) then forwards events from the outbox to Kafka. Result: atomicity between DB and broker. Debezium + Outbox Event Router is the standard implementation — almost mandatory for microservices.
3. Why did Debezium become the de facto standard?
Answer: (1) Broad DB support — PostgreSQL, MySQL, MongoDB, Oracle, SQL Server, Cassandra, (2) Built on Kafka Connect — proven infra, (3) Open-source — free, (4) Rich features — schema change detection, filtering, transformation, (5) Red Hat backing — stable development. Alternatives (Maxwell, Canal) support only specific DBs or lack features. Cloud-managed (Fivetran, Airbyte) costs more but reduces ops burden.
4. Benefits of CDC-based cache invalidation?
Answer: (1) Automatic — no missed cache.delete(), (2) Complete — captures all DB changes including those from other services, (3) Centralized — invalidation logic in one place, (4) Reliable — fires at commit time. Downsides: slight delay (usually under 1s), extra infra. Especially useful when multiple microservices share a DB.
5. Role of CDC in zero-downtime DB migration?
Answer: (1) Initial snapshot: Debezium copies all data to the new DB once. (2) Real-time sync: reads WAL and applies changes to the new DB immediately, keeping both DBs nearly synchronized. (3) Incremental traffic shift: move reads first, then writes. (4) Rollback: reverse CDC direction on issues. Enables safe, zero-downtime migration over months. Stripe and Airbnb used this for PostgreSQL migration.
References
- Debezium — official docs
- PostgreSQL Logical Replication
- Kafka Connect
- Debezium Outbox Pattern
- Designing Data-Intensive Applications — Martin Kleppmann
- Confluent CDC
- Stripe's Live Migration
- GitHub Engineering: MySQL Schema Changes
- Airbyte — Debezium alternative
- Fivetran — managed CDC
- Maxwell's Daemon — MySQL-only CDC