CDC (Change Data Capture) Complete Guide 2025: Debezium, Kafka, Data Sync, Outbox Pattern
TL;DR

  • CDC = game changer for data sync: change event streams instead of polling. Real-time, efficient.
  • Debezium is the de facto standard: supports PostgreSQL, MySQL, MongoDB, SQL Server, etc.
  • WAL/binlog based: reads DB internal logs to capture changes. Non-invasive.
  • Outbox Pattern: solves the dual-write problem. Standard for microservice events.
  • Use cases: DW sync, search index update, cache invalidation, event-driven architecture.

1. Why CDC

1.1 Traditional Data Sync

Polling:

# Naive polling loop (pseudocode — db, sync_to_warehouse, now, sleep are stand-ins)
while True:
    new_records = db.query("SELECT * FROM users WHERE updated_at > ?", last_sync)
    sync_to_warehouse(new_records)
    last_sync = now()   # set AFTER the query: rows updated in between can be skipped
    sleep(60)           # 1-minute granularity

Problems:

  1. Latency: 1-minute granularity — not real-time
  2. Load: queries DB every time
  3. Misses DELETE: WHERE updated_at > ? cannot detect deletions
  4. Gaps: transaction timing can cause missed rows
  5. Non-scalable: queries slow down as data grows
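The DELETE problem above is easy to demonstrate in isolation. A minimal sketch using an in-memory SQLite table as a stand-in for the OLTP source (the table and column names are illustrative, not from the original):

```python
import sqlite3

# In-memory DB standing in for the OLTP source (illustrative only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at REAL)")
db.execute("INSERT INTO users VALUES (1, 'Alice', 100.0), (2, 'Bob', 100.0)")
last_sync = 50.0

# First poll sees both rows.
rows = db.execute("SELECT id FROM users WHERE updated_at > ?", (last_sync,)).fetchall()
assert [r[0] for r in rows] == [1, 2]
last_sync = 100.0

# A DELETE leaves no row behind, so the next poll returns nothing:
# the warehouse copy of Bob silently goes stale.
db.execute("DELETE FROM users WHERE id = 2")
rows = db.execute("SELECT id FROM users WHERE updated_at > ?", (last_sync,)).fetchall()
assert rows == []  # the deletion is invisible to updated_at polling
```

Because the deleted row no longer exists, no `WHERE updated_at > ?` predicate can ever see it — which is exactly the gap log-based CDC closes.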

1.2 The CDC Promise

[DB] -> [WAL/binlog] -> [CDC tool] -> [Kafka] -> [Consumers]
                             ^
                  real-time, non-invasive

Advantages:

  1. Real-time: millisecond latency
  2. No DB load: only reads logs
  3. All changes: captures INSERT, UPDATE, DELETE
  4. Order preserved: transaction order
  5. Scalable: independent of data volume

1.3 Use Cases

  • Data warehouse sync: OLTP to Snowflake/BigQuery in real time
  • Search index update: DB change reflected immediately in Elasticsearch
  • Cache invalidation: DB update invalidates Redis cache
  • Event-driven: DB changes become domain events
  • Microservice sync: one service's data consumed by another
  • DB migration: zero-downtime migration (old to new DB)
  • Audit log: full change history retained

2. How CDC Works

2.1 PostgreSQL WAL

WAL (Write-Ahead Log): log of every change written to disk. Foundation of crash recovery.

INSERT INTO users (name) VALUES ('Alice');
-> [WAL]: tx_id=123, op=INSERT, table=users, data={name:'Alice'}
-> [Disk page write]

CDC reads this WAL to capture changes.

2.2 Logical vs Physical Replication

Physical Replication (Streaming):

  • Disk-page-level replication
  • Standby is byte-identical to primary
  • Pros: fast, exact
  • Cons: same version/schema only

Logical Replication:

  • Replicates logical changes (INSERT/UPDATE/DELETE)
  • Different schemas and versions allowed
  • Pros: flexible, suitable for CDC
  • Cons: slight overhead

2.3 PostgreSQL Logical Replication Setup

# postgresql.conf
wal_level = logical
max_wal_senders = 10
max_replication_slots = 10

-- Create user
CREATE USER cdc_user REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;

-- Create publication (which tables to replicate)
CREATE PUBLICATION my_pub FOR TABLE users, orders;
-- Or all tables
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- Create replication slot (holds changes)
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');

2.4 MySQL binlog

MySQL uses binlog:

# my.cnf
log_bin = mysql-bin
binlog_format = ROW  # ROW is required for CDC
binlog_row_image = FULL
server_id = 1
expire_logs_days = 7  # deprecated in MySQL 8.0+; use binlog_expire_logs_seconds

binlog_format:

  • STATEMENT: SQL text (non-deterministic on replay)
  • ROW: per-row changes (suitable for CDC)
  • MIXED: both

2.5 MongoDB Change Streams

const changeStream = db.collection('users').watch()
changeStream.on('change', (change) => {
  console.log(change)
  // { operationType: 'insert', fullDocument: {...}, ... }
})

Change Streams are built on MongoDB's oplog (the replica set operation log), so they require a replica set or sharded cluster.


3. Debezium — the CDC Standard

3.1 What is Debezium?

Debezium = open-source CDC platform. Started by Red Hat.

Architecture:

[PostgreSQL] -> [Debezium connector] -> [Kafka Connect] -> [Kafka topics] -> [Consumers]

Core features:

  • Supports many DBs (PG, MySQL, MongoDB, Oracle, SQL Server, Cassandra)
  • Built on Kafka Connect
  • Exactly-once delivery (with Kafka guarantees)
  • Schema change detection
  • Filtering and transformation

3.2 Debezium Install (Docker Compose)

version: '3.7'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092

  connect:
    image: debezium/connect:2.5
    depends_on: [kafka, postgres]
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses

  postgres:
    image: debezium/example-postgres:2.5
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres

3.3 Register PostgreSQL Connector

curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres",
      "database.port": "5432",
      "database.user": "cdc_user",
      "database.password": "secret",
      "database.dbname": "mydb",
      "topic.prefix": "mydb",
      "table.include.list": "public.users,public.orders",
      "plugin.name": "pgoutput",
      "publication.name": "my_pub",
      "slot.name": "my_slot"
    }
  }'

Debezium now auto-publishes PostgreSQL changes to Kafka.

3.4 Kafka Topic Structure

Debezium creates one topic per table:

mydb.public.users   <- changes to users
mydb.public.orders  <- changes to orders

Message format:

{
  "before": null,
  "after": {
    "id": 123,
    "name": "Alice",
    "email": "alice@example.com"
  },
  "source": {
    "version": "2.5",
    "connector": "postgresql",
    "name": "mydb",
    "ts_ms": 1681545600000,
    "snapshot": "false",
    "db": "mydb",
    "schema": "public",
    "table": "users",
    "txId": 12345,
    "lsn": 23456789
  },
  "op": "c",
  "ts_ms": 1681545600000
}

op values: c=create, u=update, d=delete, r=read (snapshot).
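A small helper can unpack this envelope for downstream handling. `interpret` is a hypothetical name, and this assumes the JSON converter with schemas disabled (with schemas enabled, the envelope sits under a `payload` key):

```python
# Map Debezium op codes to readable names.
OPS = {"c": "create", "u": "update", "d": "delete", "r": "snapshot-read"}

def interpret(event):
    """Return (operation, row_state) from a JSON-decoded Debezium envelope."""
    op = OPS[event["op"]]
    # Deletes carry only "before"; every other op has the new row in "after".
    row = event["before"] if event["op"] == "d" else event["after"]
    return op, row

sample = {"op": "c", "before": None,
          "after": {"id": 123, "name": "Alice", "email": "alice@example.com"}}
assert interpret(sample) == ("create",
    {"id": 123, "name": "Alice", "email": "alice@example.com"})
```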

3.5 Kafka Consumer

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'mydb.public.users',
    bootstrap_servers=['kafka:9092'],
    # Tombstone records (emitted after deletes) have a null value — guard the decode
    value_deserializer=lambda m: json.loads(m.decode('utf-8')) if m else None
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone
        continue
    op = event['op']
    if op == 'c':
        print(f"new user: {event['after']}")
    elif op == 'u':
        print(f"update: {event['before']} -> {event['after']}")
    elif op == 'd':
        print(f"delete: {event['before']}")

4. Outbox Pattern — Solving Dual-Write

4.1 The Dual-Write Problem

Scenario: creating an order saves to DB + publishes Kafka event.

def create_order(order):
    db.save(order)              # 1
    kafka.send("orders", order) # 2

Problem: what if a crash happens between 1 and 2?

  • Only 1 succeeds: exists in DB but event never sent — downstream inconsistency
  • Only 2 succeeds: event sent but not in DB — ghost event

Dual-write problem: atomic writes across two systems are impossible.

4.2 Outbox Pattern

Key idea: inside the DB transaction, also save the event to an outbox table.

CREATE TABLE outbox (
  id UUID PRIMARY KEY,
  aggregate_type VARCHAR(255),
  aggregate_id VARCHAR(255),
  event_type VARCHAR(255),
  payload JSONB,
  published BOOLEAN DEFAULT FALSE,  -- used by the polling publisher (Option 1 in 4.3)
  created_at TIMESTAMP DEFAULT NOW()
);

def create_order(order):
    with db.transaction():
        db.save(order)
        db.save(OutboxEvent(
            aggregate_type='Order',
            aggregate_id=order.id,
            event_type='OrderCreated',
            payload=order.to_json()
        ))

Atomicity: both INSERTs in the same transaction succeed or fail together.
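The atomicity claim can be verified end to end with an in-memory SQLite database standing in for the real store (table layout and order fields are simplified stand-ins, not the original schema):

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregate_type TEXT, "
           "aggregate_id TEXT, event_type TEXT, payload TEXT)")

def create_order(order_id, total):
    # Both inserts share one transaction: they commit or roll back together.
    with db:  # sqlite3 connection as context manager = one transaction
        db.execute("INSERT INTO outbox VALUES (?, 'Order', ?, 'OrderCreated', ?)",
                   (str(uuid.uuid4()), order_id,
                    json.dumps({"id": order_id, "total": total})))
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))

create_order("o-1", 99.5)
assert db.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1
assert db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0] == 1

# A failure inside the transaction rolls back BOTH writes: the outbox insert
# succeeds first, then the duplicate order PK fails, and the rollback
# removes the pending outbox row — no ghost event.
try:
    create_order("o-1", 10.0)
except sqlite3.IntegrityError:
    pass
assert db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0] == 1
```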

4.3 Outbox -> Kafka

Option 1: Polling Publisher (simple)

# Polling publisher (pseudocode): forwards unpublished outbox rows to Kafka.
while True:
    events = db.query("""
        SELECT * FROM outbox
        WHERE published = false
        ORDER BY created_at
        LIMIT 100
    """)
    for event in events:
        kafka.send(event.aggregate_type, event.payload)
        # Marking AFTER the send gives at-least-once delivery: a crash
        # between the two lines means the event is sent again on restart.
        db.execute("UPDATE outbox SET published = true WHERE id = ?", event.id)
    sleep(1)

Downsides: polling overhead, slight delay.

Option 2: CDC (recommended)

Debezium watches the outbox table and auto-publishes to Kafka.

curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "outbox-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "table.include.list": "public.outbox",
      "transforms": "outbox",
      "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
      "transforms.outbox.route.by.field": "aggregate_type",
      "transforms.outbox.table.field.event.id": "id",
      "transforms.outbox.table.field.event.payload": "payload"
    }
  }'

(The database.* connection settings from section 3.3 are omitted here for brevity.)

Effect: no polling code needed, near real-time.


5. Data Warehouse Sync

5.1 Scenario

OLTP (PostgreSQL) -> DW (Snowflake/BigQuery) in real time.

5.2 Limits of Traditional ETL

[Extract]      ->  [Transform]  ->  [Load]
(daily batch)      (slow)           (bulk)

Problems: 24-hour delay, heavy DB load, reprocessing all data.

5.3 CDC-Based ELT

[PostgreSQL] -> [Debezium] -> [Kafka] -> [Stream Processor] -> [Snowflake]

Advantages: real-time (seconds), minimal DB load, only changes processed.

5.4 Tools

  • Debezium + Kafka Connect: open-source, self-hosted
  • Fivetran: managed, expensive, easy
  • Airbyte: open-source + managed
  • Stitch: simple, managed
  • AWS DMS: AWS-integrated
  • Striim: enterprise
  • Hevo: no-code

5.5 Schema Evolution

Problem: what about added/dropped columns in the source DB?

Debezium behavior:

  • New column: auto-detected and included in subsequent events
  • Dropped column: the field simply disappears from subsequent events (consumers must tolerate missing fields)
  • Type change: a new schema version is published with the events

Schema Registry (Confluent): stores all schema versions, validates compatibility, lets consumers read safely.
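Backward compatibility in the registry sense means: a consumer holding the new schema can still read data written with the old one. A toy check under the simplifying assumption that fields either exist in the old schema or carry a default (this is a sketch, not the Confluent implementation):

```python
# Toy backward-compatibility check: every field the new (reader) schema
# expects must exist in the old (writer) schema or have a default value.
def backward_compatible(old_fields, new_fields):
    old_names = {f["name"] for f in old_fields}
    return all(f["name"] in old_names or "default" in f for f in new_fields)

v1 = [{"name": "id"}, {"name": "name"}]
v2 = [{"name": "id"}, {"name": "name"}, {"name": "email", "default": None}]
v3 = [{"name": "id"}, {"name": "name"}, {"name": "email"}]  # no default

assert backward_compatible(v1, v2) is True   # added field has a default: OK
assert backward_compatible(v1, v3) is False  # added required field: breaks old data
```

This is why the common advice for CDC pipelines is "additive changes with defaults only": they pass this check, so old events remain readable.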


6. Search Index Sync

6.1 Scenario

PostgreSQL -> Elasticsearch real-time indexing.

6.2 Architecture

[PostgreSQL] -> [Debezium] -> [Kafka] -> [Elasticsearch Sink Connector] -> [Elasticsearch]

6.3 Elasticsearch Sink Connector

curl -X POST http://connect:8083/connectors \
  -d '{
    "name": "es-sink",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "topics": "mydb.public.products",
      "connection.url": "http://elasticsearch:9200",
      "type.name": "_doc",
      "key.ignore": "false",
      "schema.ignore": "true",
      "behavior.on.malformed.documents": "warn"
    }
  }'

6.4 Effect

  • New product -> searchable in 1-2s
  • Price update -> search results refresh automatically
  • Delete -> immediately removed from index

A legacy batch reindex might run hourly and itself take hours to complete; CDC keeps the index fresh in near real time.


7. Cache Invalidation

7.1 Problem

def get_user(user_id):
    user = cache.get(f"user:{user_id}")
    if user is None:
        user = db.query(...)
        cache.set(f"user:{user_id}", user, ttl=3600)
    return user

def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id=?", user_id)
    cache.delete(f"user:{user_id}")  # what if forgotten? what if another service updates?

Fix: automatic invalidation via CDC.

7.2 CDC-Based

consumer = KafkaConsumer(
    'mydb.public.users',
    bootstrap_servers=['kafka:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')) if m else None
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone
        continue
    user_id = event['after']['id'] if event['op'] != 'd' else event['before']['id']
    cache.delete(f"user:{user_id}")

Advantages: no code changes, all changes captured, centralized.
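The id-extraction step is worth isolating, since delete events carry only `before` while every other op carries `after`. A small testable helper (the function name and key prefix are illustrative):

```python
def invalidation_key(event, prefix="user"):
    """Derive the cache key to delete from a Debezium change event."""
    # Deletes carry the old row in "before"; everything else has "after".
    row = event["before"] if event["op"] == "d" else event["after"]
    return f"{prefix}:{row['id']}"

assert invalidation_key({"op": "u", "before": {"id": 7}, "after": {"id": 7}}) == "user:7"
assert invalidation_key({"op": "d", "before": {"id": 9}, "after": None}) == "user:9"
```

Deleting a cache key is idempotent, so redelivered events (at-least-once semantics, see section 9.4) are harmless here.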


8. Zero-Downtime Migration

8.1 Scenario

Legacy MySQL -> new PostgreSQL migration.

8.2 Steps

Step 1: CDC setup

[MySQL (current)] -> [Debezium] -> [Kafka] -> [PostgreSQL (new)]

Step 2: Initial snapshot — Debezium copies all MySQL data to PG once.

Step 3: Real-time sync — only changes flow to PG; both DBs stay nearly in sync.

Step 4: Shift read traffic — route some reads to PG (verify), then 100%.

Step 5: Shift write traffic — reverse CDC direction (PG -> MySQL) for rollback safety, move all writes to PG, then retire MySQL.

8.3 Advantages

  • Zero downtime: no user impact
  • Rollback: revert instantly on issues
  • Incremental: risk distributed
  • Verification: compare both DBs for consistency
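The verification point can be made concrete: before shifting writes, compare content digests of both sides. A minimal order-insensitive sketch (the helper name is hypothetical; a real check would page through rows in chunks and compare per chunk):

```python
import hashlib

def table_digest(rows):
    """Order-insensitive digest: hash each row, XOR the hashes together."""
    acc = 0
    for row in rows:
        h = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        acc ^= int.from_bytes(h, "big")
    return acc

old_db = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
new_db = [{"id": 2, "name": "Bob"}, {"id": 1, "name": "Alice"}]  # same data, any order
assert table_digest(old_db) == table_digest(new_db)

new_db_drifted = [{"id": 1, "name": "Alice"}]  # CDC lag or a missed row
assert table_digest(old_db) != table_digest(new_db_drifted)
```

XOR-combining makes the digest independent of row order, which matters because the two databases will rarely return rows in the same physical order.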

9. Common Pitfalls and Fixes

9.1 Replication Slot Accumulation

Unused PostgreSQL replication slots cause unbounded WAL growth — disk fills up.

-- Check active slots
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;

-- Drop unused slot
SELECT pg_drop_replication_slot('unused_slot');

Must monitor: alerts on WAL size.

9.2 Large Transactions

Processing millions of rows in one transaction blows up Debezium memory.

Fixes: split transactions, tune max.queue.size, use heartbeats.

9.3 Schema Changes

ALTER TABLE can break CDC.

Prevention: additive changes only, small DDL units, use Debezium's schema history.

9.4 Exactly-Once Delivery

Debezium is at-least-once (duplicates possible).

Fixes: consumer idempotency, Kafka transactions API, message ID tracking.
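Consumer idempotency is the simplest of these fixes. A minimal sketch: remember processed event IDs and skip repeats (in production the seen-set lives in the same store as the side effect, e.g. a `processed_events` table updated in the same transaction — the class and names here are illustrative):

```python
class IdempotentHandler:
    """Deduplicates at-least-once deliveries by event ID."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, event_id, payload):
        if event_id in self.seen:
            return False              # duplicate delivery: ignore
        self.seen.add(event_id)
        self.applied.append(payload)  # the real side effect goes here
        return True

h = IdempotentHandler()
assert h.handle("evt-1", {"op": "c"}) is True
assert h.handle("evt-1", {"op": "c"}) is False  # redelivered: no double effect
assert len(h.applied) == 1
```

For Debezium events, the source LSN (PostgreSQL) or binlog position (MySQL) makes a natural dedup key, since it uniquely identifies a change.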

9.5 Monitoring

# Connector status
curl http://connect:8083/connectors/postgres-connector/status

# Metrics (JMX)
debezium_metrics_QueueRemainingCapacity
debezium_metrics_NumberOfEventsFiltered

Key signals: Lag (source DB delta), Queue size, Error rate.


10. CDC vs Alternatives

10.1 CDC vs Polling

                  CDC               Polling
Latency           real-time (ms)    minutes/hours
DB load           very low          high
DELETE detection  yes               no
Complexity        high              low
Infra             Kafka, etc.       none

10.2 CDC vs Triggers

Triggers: DB trigger writes to another table. Simple, but hurts DB performance and adds transaction cost.

CDC: external tool reads WAL. Non-invasive, but requires extra infra.

10.3 CDC vs Application Events

App-level: code publishes events directly. Integrates with business logic but suffers dual-write (solved by Outbox).

CDC: directly from DB. Safe and complete, but limited to data-model level.

Combination: Outbox + CDC = best of both worlds.


Quiz

1. Difference between Logical and Physical Replication?

Answer: Physical Replication copies at the disk-page level; the standby is byte-identical to the primary. Fast but requires the same version and schema — mainly used for HA. Logical Replication replicates logical changes (INSERT/UPDATE/DELETE); different schemas and versions work. Suitable for CDC, supported in PostgreSQL 10+. CDC is built on logical replication, enabled via wal_level = logical.

2. How does the Outbox Pattern solve dual-write?

Answer: inside the DB transaction, the event is also stored in an outbox table. Both INSERTs share one transaction, so they atomically succeed or fail together. A separate process (Polling or CDC) then forwards events from the outbox to Kafka. Result: atomicity between DB and broker. Debezium + Outbox Event Router is the standard implementation — almost mandatory for microservices.

3. Why did Debezium become the de facto standard?

Answer: (1) Broad DB support — PostgreSQL, MySQL, MongoDB, Oracle, SQL Server, Cassandra, (2) Built on Kafka Connect — proven infra, (3) Open-source — free, (4) Rich features — schema change detection, filtering, transformation, (5) Red Hat backing — stable development. Alternatives (Maxwell, Canal) support only specific DBs or lack features. Cloud-managed (Fivetran, Airbyte) costs more but reduces ops burden.

4. Benefits of CDC-based cache invalidation?

Answer: (1) Automatic — no missed cache.delete(), (2) Complete — captures all DB changes including those from other services, (3) Centralized — invalidation logic in one place, (4) Reliable — fires at commit time. Downsides: slight delay (usually under 1s), extra infra. Especially useful when multiple microservices share a DB.

5. Role of CDC in zero-downtime DB migration?

Answer: (1) Initial snapshot: Debezium copies all data to the new DB once. (2) Real-time sync: reads WAL and applies changes to the new DB immediately, keeping both DBs nearly synchronized. (3) Incremental traffic shift: move reads first, then writes. (4) Rollback: reverse CDC direction on issues. Enables safe, zero-downtime migration over months. Stripe and Airbnb used this for PostgreSQL migration.

