CDC & Data Integration 2026 Deep Dive — Debezium 3 · Estuary · Flink CDC · Airbyte · Fivetran · Sling · Hightouch · Census · Sequin
Prologue — "So where does our data actually meet?"
A 2026 data-platform team meeting.
PM: "We need realtime payment data for the new model."
DBA: "Again? The recs team asked the same thing last week."
Platform: "Just add one more topic to Debezium."
DBA: "We already have five replication slots open and lag is 30 minutes."
That short exchange contains everything about 2026 data integration. **OLTP is the source of truth**, but **analytics, AI, and SaaS want that truth in different shapes**. CDC (Change Data Capture) and data integration fill the gap.
The space matured quickly over the last five years. On one side sit open source projects like Debezium and Flink CDC; on the other, managed SaaS like Fivetran, Airbyte, and Estuary. Reverse ETL tools such as Hightouch and Census sit on top, and lighter tools like Sling and Sequin fill the cracks.
This post maps the whole terrain — from the changes in Debezium 3.0 to Estuary Flow's managed ambitions, how Flink CDC evolved into an ELT tool, the real difference between Airbyte and Fivetran, DB-native mechanisms like Postgres logical replication and MongoDB Change Streams, the outbox pattern, and why reverse ETL became its own category.
Chapter 1 · CDC Fundamentals — Log vs Trigger vs Polling
CDC is anything that detects DB changes and pipes them somewhere. There are only three implementation styles.
| Style | Mechanism | Pro | Con |
| --- | --- | --- | --- |
| Polling | `SELECT ... WHERE updated_at > ?` | Simplest | High latency, no deletes, DB load |
| Trigger | DB trigger → side table | Works on any DB | Hurts txn performance |
| Log-based | Read WAL / binlog / redo | Near-zero overhead | DB privileges, complexity |
Serious CDC in 2026 is almost entirely **log-based**. MySQL exposes the binlog, Postgres has WAL plus logical replication, MongoDB has the oplog and Change Streams, SQL Server has CDC tables, and Oracle has redo logs read through LogMiner or XStream.
Polling is still fine for tiny systems. Triggers are a last resort when other options aren't available and change frequency is very low.
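To make the polling row of the table concrete, here is a minimal sketch using Python's built-in `sqlite3`. The `users` table, its integer `updated_at` column, and the `poll_changes` helper are assumptions made for illustration. It shows both the appeal (one query) and the main gap: a deleted row leaves nothing behind for the query to find.

```python
import sqlite3

def poll_changes(conn, last_seen):
    """Return rows changed since last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, email, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "a@x.com", 100), (2, "b@x.com", 101)],
)

first_batch, mark = poll_changes(conn, 0)        # picks up both inserts
conn.execute("UPDATE users SET email = 'c@x.com', updated_at = 102 WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 2")   # leaves no row to select
second_batch, mark = poll_changes(conn, mark)    # sees the update, never the delete
```

The second poll returns only the updated row for `id = 1`. Nothing tells the consumer that `id = 2` is gone, which is why polling setups usually fall back to soft deletes or periodic full reconciliation.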
The other key axis is **snapshot + incremental**. If a table already has 100M rows when you start CDC, you first read all of them (snapshot) and then follow the changes made since that point (incremental). Debezium's `incremental snapshot` reads the table in chunks while streaming continues; Flink CDC covers the same ground with its `parallel snapshot`.
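The chunking idea behind a snapshot can be sketched in a few lines. This is a simplified illustration, not the algorithm Debezium or Flink CDC actually implement: it assumes a dense integer primary key, and `split_chunks` and `chunk_query` are hypothetical helper names.

```python
def split_chunks(min_id, max_id, chunk_size):
    """Split the inclusive key range [min_id, max_id] into half-open chunks."""
    chunks = []
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size, max_id + 1)
        chunks.append((start, end))
        start = end
    return chunks

def chunk_query(table, key, chunk):
    """Build the bounded SELECT a snapshot worker would run for one chunk."""
    lo, hi = chunk
    return f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} < {hi}"

chunks = split_chunks(1, 10, 4)
queries = [chunk_query("users", "id", c) for c in chunks]
```

Each chunk is an independent bounded query, so chunks can be read in parallel or resumed one at a time after a failure. Real implementations also have to reconcile rows that change while a chunk is being read, which this sketch leaves out.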
Chapter 2 · Why CDC Right Now — Four Use Cases
Why did CDC suddenly become mandatory infrastructure? Four use cases exploded at the same time.
1. **Realtime analytics / operational analytics** — revenue dashboards with 5-minute latency vs 24-hour batch.
2. **Microservice integration** — service A's DB changes need to reach service B, but direct API calls are too tightly coupled. Publish events instead.
3. **Audit / compliance** — keep an immutable log of who changed what when.
4. **AI training data / feature store sync** — keep training datasets, vector stores, and recommendation features in sync with OLTP.
Number 4 exploded between 2024 and 2026. Models retrain more often, RAG systems require freshness SLAs. "A document added yesterday isn't searchable today" is no longer acceptable.
All four use cases share one constraint: **propagate changes without hammering OLTP**. That is why log-based CDC won.
Chapter 3 · Debezium 3.0 (Red Hat, Apache 2.0) — King of Kafka Connect
[Debezium](https://debezium.io/) is effectively the open source CDC standard. Sponsored by Red Hat, licensed Apache 2.0. Debezium 3.0 shipped in October 2024 and the 3.x line is actively updated in 2026.
Supported DBs:
- MySQL, MariaDB
- PostgreSQL (logical replication via pgoutput / decoderbufs; wal2json support was removed in Debezium 2.0)
- MongoDB (Change Streams, replica set required)
- SQL Server (native CDC)
- Oracle (LogMiner)
- IBM Db2
- Apache Cassandra
- Vitess
- Spanner (preview)
Traditionally Debezium runs on **Kafka Connect**, serializing to JSON / Avro / Protobuf and pushing to Kafka topics. Message key is the PK, value is a `before` / `after` / `source` / `op` envelope.
    {
      "before": { "id": 42, "email": "a@x.com" },
      "after": { "id": 42, "email": "b@x.com" },
      "op": "u",
      "source": { "db": "shop", "table": "users", "lsn": 12345 },
      "ts_ms": 1731700000000
    }
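As a rough sketch of how a consumer might use this envelope, the snippet below replays a few events into an in-memory dict. It assumes the primary key column is `id` and relies on Debezium's `op` codes (`c` create, `u` update, `d` delete, `r` snapshot read); `apply_event` is a hypothetical helper, not part of any Debezium library.

```python
import json

def apply_event(table, event):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):          # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                    # delete: only `before` is populated
        table.pop(event["before"]["id"], None)
    return table

raw_events = [
    '{"before": null, "after": {"id": 42, "email": "a@x.com"}, "op": "c"}',
    '{"before": {"id": 42, "email": "a@x.com"},'
    ' "after": {"id": 42, "email": "b@x.com"}, "op": "u"}',
    '{"before": {"id": 42, "email": "b@x.com"}, "after": null, "op": "d"}',
]

users = {}
history = []                           # table state after each event
for raw in raw_events:
    apply_event(users, json.loads(raw))
    history.append(dict(users))
```

After the update the replica holds the new email, and after the delete it is empty again. This is the same upsert-or-delete logic most sinks apply, whatever the destination.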
3.0 highlights:
- **Kafka 3.x / 4.0 compatibility** — including share groups
- **Java 17+ required**
- **Schema history improvements** — external storage backends
- **Stable incremental snapshot** — adding tables during live operation
- **Debezium UI** — dedicated management console
Strengths: proven stability, big community, every major DB.
Weaknesses: Kafka Connect operational overhead, JVM dependency, memory hungry.
Chapter 4 · Debezium Server / Engine — Embedded Modes
Not every team can run Kafka Connect. Debezium offers two lighter modes.
**Debezium Server** — standalone container that can ship directly to non-Kafka sinks.
- Kinesis, Pub/Sub, Pulsar
- EventHubs, Event Grid
- HTTP, NATS, Redis Streams
- File, S3
    debezium:
      sink:
        type: pubsub
        pubsub:
          project.id: my-project
      source:
        connector.class: io.debezium.connector.postgresql.PostgresConnector
        database.hostname: pg.internal
        database.dbname: shop
        plugin.name: pgoutput
        slot.name: debezium
        publication.name: debezium_pub
        topic.prefix: shop
**Debezium Engine** — Java library embedded inside your app, consuming CDC events directly. Great for low-traffic single-node use cases. No built-in HA.
In 2026 Debezium Server is gaining traction under the "Debezium without Kafka" pitch. Combined with NATS JetStream or Redis Streams it is particularly fast.
Chapter 5 · Estuary Flow — Managed CDC SaaS
[Estuary Flow](https://estuary.dev/) is a managed CDC + integration platform founded in 2020. It grew fast in 2024 - 2025 and is now a serious Fivetran alternative.
Highlights:
- **Realtime CDC** — sub-100ms latency advertised
- **Millions of rows per second** for very large workloads
- **200+ sources / 100+ destinations** and growing fast
- **Source-available core** (the Flow runtime is under the Business Source License; connectors are open source) + **managed hosting**
- **Materializations** — not just copy, but continuously updated materialized views at the sink
Pricing is GB-throughput based, more predictable than Fivetran's MAR (Monthly Active Rows).
    # flow.yaml
    captures:
      shop/postgres:
        endpoint:
          connector:
            image: ghcr.io/estuary/source-postgres:dev
            config:
              address: pg.internal:5432
              database: shop
              user: estuary
        bindings:
          - resource:
              stream: public.users
              mode: Normal
            target: shop/users
    materializations:
      shop/snowflake:
        endpoint:
          connector:
            image: ghcr.io/estuary/materialize-snowflake:dev
            config:
              account: xy12345.us-east-1
              database: ANALYTICS
              warehouse: COMPUTE_WH
        bindings:
          - source: shop/users
            resource: { table: USERS }
Strengths: true realtime, predictable pricing, source-available core with open source connectors.
Weaknesses: younger ecosystem, smaller connector catalog than Airbyte.
Chapter 6 · Apache Flink CDC 3.x — ELT on top of Flink
[Flink CDC](https://github.com/apache/flink-cdc) (formerly Ververica CDC connectors) was donated to the Apache Flink project in 2024 and, since **Flink CDC 3.0**, has been a **full ELT pipeline tool**, not just a set of connectors.
Core concepts:
- **Source** — MySQL, Postgres, MongoDB, Oracle, SQL Server, TiDB, OceanBase
- **Sink** — Doris, StarRocks, Iceberg, Paimon, Kafka, Elasticsearch
- **Pipeline** — define Source → Transformation → Sink in a single YAML
- **Schema evolution** — auto propagate ALTER TABLE when the sink supports it
- **Parallel snapshot** — split big tables into chunks and read in parallel. 100M-row tables in under an hour.
    # pipeline.yaml
    source:
      type: mysql
      hostname: mysql.internal
      username: flinkcdc
      password: ${MYSQL_PASS}
      tables: shop.\.*
      server-id: 5400-5407  # the range needs at least `parallelism` ids
    sink:
      type: paimon
      catalog.properties.metastore: filesystem
      catalog.properties.warehouse: s3://lake/paimon
    route:
      - source-table: shop.\.*
        sink-table: ods_shop.<>
    pipeline:
      name: shop-to-paimon
      parallelism: 8
The killer feature is **lakehouse integration**. Sink into Iceberg or Paimon and Trino / Spark / Doris can query it directly. Trino + Iceberg + Flink CDC is a popular 2026 modern data stack combo.
Chapter 7 · Airbyte 1.x (YC W20) — Open Source ELT Standard
[Airbyte](https://airbyte.com/) graduated from YC W20, hit 1.0 in 2024, and is in late 1.x in 2026. It is the open source ELT standard.
Highlights:
- **350+ connectors** — almost every SaaS and DB
- **Open source (MIT / ELv2 mix)** plus **Cloud managed**
- **Airbyte Protocol** — anyone can build connectors against a standard interface
- **CDK** in Python / Java for custom connectors, plus a low-code CDK
- **Embedded** — bundle Airbyte inside your own product
CDC support covers MySQL, Postgres, MongoDB, SQL Server, and others. Internally it historically wrapped Debezium and currently mixes Debezium with Airbyte's own implementations.
Strengths: massive connector catalog, open source, self-hostable.
Weaknesses: realtime CDC is weak (batch / micro-batch oriented), operations are not lightweight.
In 2024 a controversial managed pricing change pushed many teams toward self-hosting. The 2025 pricing reset stabilized things, but self-hosted Airbyte remains very attractive.
Chapter 8 · Fivetran — The Managed ELT Throne
[Fivetran](https://www.fivetran.com/) is the household name for managed ELT. Closed source, notoriously expensive, but it "just works" with an operational stability that won over the enterprise.
Highlights:
- **600+ connectors** — Salesforce, Hubspot, Stripe, Zendesk, NetSuite, Workday, you name it
- **MAR pricing** — Monthly Active Rows. Anything modified once counts. Widely criticized as unpredictable
- **HVR acquisition** — strengthened enterprise CDC (Oracle, SAP HANA)
- **dbt integration** via Transformations
- **Hubspot integration** — strong on marketing data
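To see why MAR is hard to predict, it helps to count it by hand. The sketch below takes "anything modified once counts" literally and counts distinct rows touched per month. This is a simplified reading for illustration only, not Fivetran's actual billing formula, and `monthly_active_rows` is a hypothetical helper.

```python
from collections import defaultdict

def monthly_active_rows(changes):
    """Count distinct (table, primary key) pairs touched in each month."""
    seen = defaultdict(set)
    for month, table, pk in changes:
        seen[month].add((table, pk))
    return {month: len(keys) for month, keys in seen.items()}

changes = [
    ("2026-01", "orders", 1),
    ("2026-01", "orders", 1),   # same row updated again: still one active row
    ("2026-01", "orders", 2),
    ("2026-01", "users", 1),
    ("2026-02", "orders", 1),   # the counter starts over in a new month
]
mar = monthly_active_rows(changes)
```

Under this reading, a nightly job that touches every row in a wide table makes the whole table active each month, regardless of how little data actually changed. Running a count like this against a sample of your change log is a cheap way to estimate the bill before committing.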
The 2025 - 2026 trend: enterprises stay on Fivetran, while startups and platform teams move to Airbyte / Estuary / Sling.
Chapter 9 · Stitch (Talend → Qlik) / Hevo / Meltano — The Rest of ELT
**Stitch** (Singer based, acquired by Talend, then Qlik via Talend) announced EOL in 2025, pushing users toward Airbyte / Fivetran / Estuary. The Singer protocol survives as open source.
**Hevo Data** — India-based ELT with 150+ connectors and competitive pricing. Strong in India and Southeast Asia.
**Meltano** — open source ELT spun out of GitLab, Singer based, CLI first, tight dbt integration. Popular with "ELT as code" teams.
**Rivery** — Israel-based ELT combined with workflow. Smaller but loyal niche.
The common thread: SaaS connector breadth is the product, CDC is a side feature.
Chapter 10 · Sling CLI — Fast Local / CLI Integration
[Sling](https://slingdata.io/) is a single-binary CLI for moving data. Written in Go, fast, and lets data engineers script integration on the spot.
    # Postgres → Snowflake
    sling run \
      --src-conn POSTGRES \
      --src-stream "public.users" \
      --tgt-conn SNOWFLAKE \
      --tgt-object "ANALYTICS.PUBLIC.USERS" \
      --mode full-refresh
Highlights:
- 100+ DBs and cloud storages
- YAML pipeline definitions (`replication.yaml`)
- Sling Cloud (managed) plus OSS CLI
- Incremental loads keyed on an update column or primary key (not log-based CDC)
Strengths: blazing fast, simple, naturally self-hosted.
Weaknesses: no true CDC, transformations live elsewhere (dbt).
For "move this DB to that DB once a day" inside CI/CD it is unbeatable.
Chapter 11 · Sequin — Postgres to Anywhere
[Sequin](https://sequinstream.com/) is a Postgres-focused open source CDC tool. Tagline: "use Postgres like Kafka."
- Reads Postgres logical replication
- Sinks: Kafka, webhook (HTTP POST), Redis Streams, SQS, GCP Pub/Sub, NATS
- Built in Elixir on the BEAM VM
- Managed and self-hosted options
    # sequin.yaml
    streams:
      - name: orders_stream
        source:
          type: postgres
          database: orders
          tables: ["public.orders", "public.order_items"]
        consumers:
          - name: webhook_to_shipping
            type: http_push
            endpoint: https://shipping.internal/webhook
            max_ack_pending: 100
          - name: kafka_to_warehouse
            type: kafka
            topic: orders-cdc
            bootstrap_servers: kafka.internal:9092
Sequin is strong for teams who think "Debezium is too heavy and I do not want to run Kafka just for one Postgres." It grew rapidly in 2024 - 2025.
Chapter 12 · Striim / Qlik Replicate / Oracle GoldenGate — Enterprise
The enterprise market is its own thing.
**Striim** — enterprise streaming integration. Covers SAP, Oracle, SQL Server, mainframes. Expensive but standard in big enterprises.
**Qlik Replicate** (formerly Attunity, acquired by Qlik in 2019) — Oracle, SQL Server, SAP HANA, Db2. Common in financial services.
**Oracle GoldenGate** — the Oracle CDC standard. Oracle-to-Oracle replication is the bread and butter, but heterogeneous works too. Pricing is brutal.
**IBM InfoSphere Data Replication (CDC)** — mainframe and Db2.
**SAP Data Services / SAP SLT** — for SAP-only stacks.
You rarely see these in startups, but they are still standard in finance, telco, and government.
Chapter 13 · AWS DMS · GCP Datastream · Azure Data Factory — Cloud Native
All three major clouds offer managed CDC.
**AWS DMS (Database Migration Service)** — Oracle / SQL Server / MySQL / Postgres → RDS / Aurora / Redshift / S3 for migration plus continuous CDC. Named "migration" but often used as pure CDC. Serverless DMS hit GA in 2023.
**AWS DMS + Kinesis Data Streams + Lambda** — DMS pushes change events to Kinesis and Lambda processes them.
**GCP Datastream** — Oracle / MySQL / Postgres / SQL Server → BigQuery / Cloud Storage. Realtime CDC into BigQuery is the main use case. One click in the console.
**GCP Datastream + Dataflow** — transform the stream via Dataflow (Apache Beam) before landing in BigQuery.
**Azure Data Factory + Synapse Link** — Synapse Link for SQL / Dataverse / Cosmos DB connects OLTP to analytics in one click. Synapse Link for Cosmos DB mirrors NoSQL into the analytical store.
**Azure Database Migration Service** — DMS equivalent on Azure.
Strengths: well integrated with the cloud, fast console setup.
Weaknesses: hits limits fast when multi-cloud or on-prem is mixed in.
Chapter 14 · Postgres Logical Replication · MySQL Binlog · MongoDB Change Streams
CDC builds on database features. These three are worth knowing.
**Postgres logical replication**:
- Requires `wal_level = logical`
- Tune `max_replication_slots`, `max_wal_senders`
- Publication / subscription model
- `pgoutput` (built-in), `wal2json`, `decoderbufs` (Debezium plugin)
- A logical slot **eats disk** if you do not advance it. In a healthy setup the consumer must keep moving the LSN forward.
    -- wal_level is not reloadable: restart Postgres after this statement.
    ALTER SYSTEM SET wal_level = logical;
    SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');
    CREATE PUBLICATION dbz_pub FOR ALL TABLES;
Traps: `pg_repack`, `VACUUM FULL`, and some ALTER TABLE statements conflict with logical replication. When rows stop flowing in prod it is almost always a stuck slot.
**MySQL binlog**:
- `log_bin = ON`, `binlog_format = ROW`
- `gtid_mode = ON` is strongly recommended for easier failover
- A too-short `binlog_expire_logs_seconds` (or the deprecated `expire_logs_days`) loses data when consumers fall behind.
**MongoDB Change Streams**:
- **Replica set required**, never standalone
- Subscribe with `db.collection.watch()`
- Resume tokens let you continue after a crash
Chapter 15 · Reverse ETL — Hightouch · Census · Polytomic · Grouparoo
Where traditional ETL is SaaS / DB → warehouse, **reverse ETL is warehouse → SaaS**. As "data team output must flow back to sales / marketing tools" demand exploded, this became its own category.
**Hightouch** — the leader. Snowflake / BigQuery / Databricks / Redshift → Salesforce / Hubspot / Iterable / Segment / ad platforms. Define audiences in SQL, multiple sync modes (upsert, mirror, update-only).
**Census** — the strongest Hightouch competitor. Similar positioning, differentiating on pricing, observability, and audit.
**Polytomic** — younger, emphasizes AI / ML model output sync.
**Grouparoo** — open source reverse ETL acquired by Airbyte in 2022, effectively EOL. The attempt to absorb reverse ETL into Airbyte did not stick — the market settled on it as a separate category.
**Workato** — broader iPaaS that includes reverse ETL but lives more in RPA / workflow automation.
The big 2026 shift: **pushing AI recommendation output into operational tools** is now a major use case. Models produce lead scores, churn risk, and next-best-action; reverse ETL sends them to Salesforce.
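The sync modes mentioned above differ mainly in what they are allowed to create and delete. Here is a minimal sketch of that difference. The mode names come from the text, but the exact semantics are assumed for illustration and `plan_sync` is a hypothetical helper rather than any vendor's API.

```python
def plan_sync(source, destination, mode):
    """Return (to_write, to_delete) for one sync run, keyed by record id."""
    to_write = {}
    for key, row in source.items():
        exists = key in destination
        if mode == "update-only" and not exists:
            continue                     # never create new records
        if not exists or destination[key] != row:
            to_write[key] = row          # new or changed since the last run
    to_delete = []
    if mode == "mirror":                 # destination must match the source exactly
        to_delete = sorted(k for k in destination if k not in source)
    return to_write, to_delete

warehouse = {"u1": {"score": 90}, "u2": {"score": 40}, "u3": {"score": 75}}
crm = {"u1": {"score": 80}, "u2": {"score": 40}, "u9": {"score": 10}}

upsert_plan = plan_sync(warehouse, crm, "upsert")
mirror_plan = plan_sync(warehouse, crm, "mirror")
update_plan = plan_sync(warehouse, crm, "update-only")
```

With this data, upsert writes `u1` and `u3`, mirror does the same and also removes `u9`, and update-only touches `u1` alone. Diffing against the last known destination state is also what keeps API call volume down, which matters because most SaaS destinations are rate limited.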
Chapter 16 · dbt + Warehouse — The Transformation Layer
If CDC and ELT move data, **dbt** transforms it. The T in modern data stack.
- **dbt Core** (open source, Apache 2.0)
- **dbt Cloud** (managed)
- Models defined in SQL, jinja templating, automatic lineage
- Warehouses: Snowflake, BigQuery, Databricks, Redshift, Postgres, DuckDB, Trino, Athena, and more
The canonical modern data stack flow:
    Source (Postgres, Salesforce, Stripe)
      → CDC/ELT (Airbyte/Fivetran/Estuary/Debezium)
      → Warehouse (Snowflake/BigQuery/Databricks)
      → Transform (dbt)
      → BI (Looker/Mode/Hex)
      → Reverse ETL (Hightouch/Census) → SaaS
In January 2025 dbt Labs acquired **SDF Labs** to strengthen its SQL compiler, and later in 2025 announced the **Fusion** engine, a Rust-based rewrite of the engine behind dbt Core.
Chapter 17 · Pipeline Orchestration — Airflow · Dagster · Prefect · Mage · Kestra
CDC itself is streaming, but consuming that stream to run dbt and trigger ML training is batch / scheduled work. You need an orchestrator.
**Apache Airflow** — the de facto standard. Airflow 3.0 shipped in 2025 with major refactors: better UI, decoupled scheduler, smarter dynamic DAGs.
**Dagster** — software-defined assets. Data assets as first-class citizens. Rising fast with modern data teams.
**Prefect** — Python first, dynamic workflows, lightweight.
**Mage** — UI-oriented ETL, accessible to non-developers.
**Kestra** — YAML orchestration spanning data pipelines and general automation.
The 2026 trend: **Dagster is eating Airflow's data engineering share**. Airflow remains the biggest installed base, but new projects increasingly pick Dagster.
Chapter 18 · Schema Registry — Confluent · Karapace · Apicurio
CDC events evolve their schemas. Avro / Protobuf / JSON Schema have to live somewhere.
**Confluent Schema Registry** — the Kafka standard. License is Confluent Community License, which has been criticized as "fake open source."
**Karapace** — Aiven's Apache 2.0 Schema Registry, drop-in compatible. A real OSS alternative to Confluent.
**Apicurio Registry** — Red Hat. Apache 2.0. Manages schemas plus OpenAPI / AsyncAPI.
Compatibility policies (Confluent convention):
- `BACKWARD` — the new schema must read older data
- `FORWARD` — older schemas must read newer data
- `FULL` — both directions
- `NONE` — no checks (dangerous)
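The classic CDC outage: someone drops a column, the registry policy is BACKWARD only (which permits removing fields), the new schema registers cleanly, and consumers still reading with the old schema break without warning.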
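The compatibility rules can be sketched with a deliberately simplified model in which a schema is a dict of field names and a field may carry a `default`. This only approximates the real Avro resolution rules (type changes and aliases are ignored), and `can_read` and `is_compatible` are hypothetical helpers, not a registry API.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be decoded using `reader`.

    Every field the reader expects must be present in the written data
    or have a default to fall back on.
    """
    return all(name in writer or "default" in spec for name, spec in reader.items())

def is_compatible(new, old, policy):
    if policy == "BACKWARD":             # new schema reads old data
        return can_read(new, old)
    if policy == "FORWARD":              # old schema reads new data
        return can_read(old, new)
    if policy == "FULL":
        return can_read(new, old) and can_read(old, new)
    if policy == "NONE":
        return True
    raise ValueError(f"unknown policy: {policy}")

v1 = {"id": {}, "email": {}, "phone": {}}
v2 = {"id": {}, "email": {}}             # `phone` dropped, no default in v1

drop_backward = is_compatible(v2, v1, "BACKWARD")
drop_forward = is_compatible(v2, v1, "FORWARD")
drop_full = is_compatible(v2, v1, "FULL")
```

Dropping `phone` passes BACKWARD but fails FORWARD and FULL. That asymmetry is the mechanism behind the outage: the registry accepts the change while consumers on the old schema cannot decode the new records.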
Chapter 19 · CDC Patterns — Outbox · Transactional Outbox · Saga
The most common patterns when CDC powers microservice integration.
**Outbox pattern**: put an "outbox" table in the domain DB and write events into it within the same transaction as the business write. CDC then reads the outbox table and ships messages to the broker. Because the business write and the event commit atomically, an event exists if and only if the change did, and the CDC reader then delivers it **at least once**.
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    INSERT INTO outbox (aggregate_id, event_type, payload)
      VALUES (1, 'AccountDebited', '{"amount":100}');
    COMMIT;
CDC tracks only the `outbox` table and forwards to Kafka. Debezium ships an **outbox event router** built in.
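The whole loop can be sketched end to end with Python's built-in `sqlite3`. In this sketch a polling `relay` function with a `published` flag stands in for the log-based CDC reader, so the `seq` and `published` columns and both function names are assumptions for illustration. A real Debezium setup would tail the log instead of polling.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        aggregate_id INTEGER, event_type TEXT, payload TEXT,
        published INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO accounts VALUES (1, 500);
""")

def debit(conn, account_id, amount):
    """Business write and event insert commit or roll back together."""
    with conn:  # one transaction: commit on success, rollback on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (account_id, "AccountDebited", json.dumps({"amount": amount})),
        )

def relay(conn, publish):
    """Stand-in for the CDC reader: publish first, then mark as published."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # a crash after this line means a redelivery
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))

sent = []
debit(conn, 1, 100)
relay(conn, lambda kind, body: sent.append((kind, body)))
relay(conn, lambda kind, body: sent.append((kind, body)))  # nothing left to send
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

Publishing before marking is what makes delivery at-least-once rather than at-most-once: a crash between the two steps produces a duplicate, never a lost event. Consumers therefore need to be idempotent.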
**Transactional Outbox** — same idea, popularized by libraries like Eventuate.io.
**Saga pattern** — an alternative to distributed transactions. A sequence of local transactions plus compensations. CDC carries the saga step transitions.
**Change feed** — the native change stream feature in DocumentDB, Cosmos DB, DynamoDB. Same concept under different names.
**Dual write antipattern** — code that writes to the DB and then writes to Kafka. Either side can fail. Forbidden. The outbox pattern is the answer.
Chapter 20 · Reliability — Exactly-once · Ordering · Schema Evolution
Three things you eventually hit when CDC enters production.
**Exactly-once semantics**:
- Debezium is **at-least-once** by default. Duplicates are possible.
- Practical exactly-once is achievable by combining Kafka Connect's exactly-once source support (built on Kafka transactions, available since Kafka 3.3) with idempotent consumers.
- Flink offers exactly-once out of the box via checkpoints and 2PC.
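An idempotent consumer is the simplest of these to illustrate. The sketch below assumes a single, totally ordered stream whose events carry a monotonically increasing `source.lsn`, as in the envelope shown earlier. `IdempotentSink` is a hypothetical class, and in a real system the rows and the last applied LSN would have to be persisted in the same transaction.

```python
class IdempotentSink:
    """Applies each change at most once by remembering the last applied LSN."""

    def __init__(self):
        self.rows = {}
        self.last_lsn = -1
        self.skipped = 0

    def handle(self, event):
        lsn = event["source"]["lsn"]
        if lsn <= self.last_lsn:         # already applied: this is a redelivery
            self.skipped += 1
            return False
        if event["op"] == "d":
            self.rows.pop(event["before"]["id"], None)
        else:
            self.rows[event["after"]["id"]] = event["after"]
        self.last_lsn = lsn
        return True

events = [
    {"op": "c", "before": None, "after": {"id": 1, "balance": 500},
     "source": {"lsn": 10}},
    {"op": "u", "before": {"id": 1, "balance": 500}, "after": {"id": 1, "balance": 400},
     "source": {"lsn": 11}},
]

sink = IdempotentSink()
for event in events + events[1:]:        # the last event is delivered twice
    sink.handle(event)
```

The duplicate is counted and dropped, so at-least-once delivery plus this check behaves like exactly-once from the sink's point of view.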
**Ordering guarantees**:
- Events for the same PK must hit the same partition. Producers must use PK as the message key.
- Order across different PKs is not guaranteed.
- Postgres logical replication preserves commit order within a single replication slot.
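The first two points follow directly from how keyed partitioning works, as this small sketch shows. It uses `zlib.crc32` as a stand-in hash (Kafka's default partitioner uses murmur2, so actual partition numbers will differ), and `partition_for` is a hypothetical helper.

```python
import zlib

def partition_for(key, num_partitions):
    """Stable hash of the message key, so one PK always maps to one partition."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

events = [(42, "insert"), (7, "insert"), (42, "update"), (42, "delete"), (7, "update")]

partitions = {p: [] for p in range(4)}
for pk, op in events:
    partitions[partition_for(pk, 4)].append((pk, op))

# Per-key order as seen by whichever consumer owns that key's partition.
order_for_42 = [op for part in partitions.values() for pk, op in part if pk == 42]
order_for_7 = [op for part in partitions.values() for pk, op in part if pk == 7]
```

Every event for a given key lands in one partition in its original order, while nothing relates the positions of keys 42 and 7 to each other. Changing the partition count remaps keys, which is why repartitioning a CDC topic needs care.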
**Schema evolution**:
- Avro + Confluent Schema Registry + BACKWARD is the typical baseline.
- Adding a column: safe.
- Dropping a column: blast radius downstream, deprecate gradually.
- Changing a column type: the most dangerous. New column plus migration is the safe path.
Chapter 21 · CDC for AI — Feature Store / Vector Store Sync
The biggest 2024 - 2026 use case shift is **freshness of AI training and inference data**.
**Feature store sync** — Tecton, Feast, and Hopsworks all need to stay in sync with OLTP and events. Realtime features (for example revenue in the last hour) are computed via CDC + Flink or Kafka Streams.
**Vector store sync** — when the RAG source of truth lives in OLTP, how do you keep Pinecone / Weaviate / pgvector updated? CDC → embedding pipeline → vector store is now the canonical architecture.
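A minimal sketch of the CDC → embedding → vector store step is shown below. The in-memory dict stands in for a real vector store and `fake_embed` stands in for a real embedding model; both are placeholders, as are the `body` column and the `VectorSync` class. The content hash is one common way to avoid paying for an embedding call when an update did not touch the text.

```python
import hashlib

def fake_embed(text):
    """Placeholder for a real embedding model: a deterministic 4-number vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]

class VectorSync:
    def __init__(self, embed):
        self.embed = embed
        self.store = {}                  # doc id -> (content hash, vector)
        self.embed_calls = 0

    def handle(self, event):
        if event["op"] == "d":           # deletes must remove the vector too
            self.store.pop(event["before"]["id"], None)
            return
        doc = event["after"]
        content_hash = hashlib.sha256(doc["body"].encode("utf-8")).hexdigest()
        cached = self.store.get(doc["id"])
        if cached and cached[0] == content_hash:
            return                       # body unchanged: skip the embedding call
        self.embed_calls += 1
        self.store[doc["id"]] = (content_hash, self.embed(doc["body"]))

sync = VectorSync(fake_embed)
sync.handle({"op": "c", "before": None,
             "after": {"id": 1, "title": "v1", "body": "hello"}})
sync.handle({"op": "u", "before": {"id": 1},
             "after": {"id": 1, "title": "v2", "body": "hello"}})        # title only
sync.handle({"op": "u", "before": {"id": 1},
             "after": {"id": 1, "title": "v2", "body": "hello world"}})  # body changed
calls_before_delete = sync.embed_calls
sync.handle({"op": "d", "before": {"id": 1}, "after": None})
```

Three upserts cost two embedding calls, and the delete removes the vector so stale documents stop appearing in retrieval. Handling deletes is the part that polling-based sync tends to miss.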
**LLM training data freshness SLAs** — "documents added yesterday must be searchable today." CDC is the foundation.
**ML model training** — batch CDC dumps land in S3 and Spark / Ray trains on them.
This is the fastest-growing area in 2026. Estuary, Sequin, and Striim all lead their marketing with "AI."
Chapter 22 · Korea Case Studies — Coupang / Kakao / Naver / Woowa Brothers
**Coupang** — well-known for Kafka + Debezium. Order and inventory changes flow through Kafka so many microservices can subscribe.
**Kakao** — multiple tech-blog posts covering Debezium + Kafka Connect in production. Kakao Bank and Kakao Pay run tight CDC operations.
**Naver** — internal data platform powered by binlog-based CDC. NCloud's data integration services also exist.
**Woowa Brothers (Baemin)** — multiple tech-blog posts on adopting Debezium plus operational lessons.
**NCsoft / Nexon** — game log and save-data CDC. Kafka plus in-house tools.
**Toss** — scattered CDC stories across the Toss tech blog. Postgres logical replication with internal consumers is common.
The pattern: most domestic teams run Debezium + Kafka themselves; managed services like Fivetran appear mostly inside global SaaS branches.
Chapter 23 · Japan Case Studies — Mercari / LINE / Cookpad / ZOZO
**Mercari** — CDC is core to the ML platform. Engineering blog posts cover Debezium with GCP Pub/Sub.
**LINE (now LY Corporation after the Yahoo Japan merger)** — the data engineering blog has many Kafka + Debezium + Hadoop / Trino lakehouse stories. Operates very large CDC fleets.
**Cookpad** — CDC for recipe data and user logs. Aurora + DMS + Redshift is a typical combo.
**ZOZO** — fashion EC. Mixes Fivetran with in-house tools for Snowflake ingestion.
**CyberAgent** — ads, games, and media businesses. Kafka + Debezium is the house standard.
**Recruit / Indeed** — internal CDC infrastructure inside the data platform.
Japan is accelerating Snowflake and Databricks adoption, which is pulling Airbyte and Fivetran share up.
Chapter 24 · Decision Checklist — Which Tool Should We Pick
The checklist:
1. **One DB, no realtime requirement** → Sling CLI + cron / dbt
2. **One Postgres, realtime required** → Sequin or Debezium Server
3. **Multiple DBs and Kafka is already there** → Debezium + Kafka Connect
4. **Already on Flink or willing to adopt all at once** → Flink CDC 3.x + Iceberg / Paimon
5. **Many SaaS connectors needed, budget OK** → Fivetran
6. **Many SaaS connectors needed, tight budget** → Airbyte (self-hosted or Cloud)
7. **Managed + true realtime + predictable price** → Estuary Flow
8. **Push warehouse data back into SaaS** → Hightouch / Census (reverse ETL)
9. **AWS only** → AWS DMS + Kinesis
10. **GCP only, BigQuery** → GCP Datastream
11. **Enterprise Oracle / SAP** → Striim / Qlik Replicate / GoldenGate
Antipatterns to avoid:
- **Dual write** — code that writes the DB and then Kafka directly. Never. Use the outbox pattern.
- **Polling with minute-level lag** — pressures the OLTP. Move to log-based.
- **MAR explosion** — simulate Fivetran's pricing before your CFO sees the invoice.
- **Ignoring BACKWARD compatibility** — one DROP COLUMN can wipe out every downstream system.
Chapter 25 · Operational Lore — Problems You Will Hit
**A Postgres logical slot keeps retaining WAL**:
    SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;
If `retained_wal` grows into GB territory the consumer is lagging or dead. The response path:
1. Restart the consumer.
2. If that fails, drop the slot, recreate it, and rerun the initial snapshot.
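For alerting outside the database, the same lag arithmetic is easy to reproduce. A Postgres LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position, so the difference between two LSNs is the retained WAL in bytes. The helper names and the 1 GiB threshold below are assumptions for illustration.

```python
def parse_lsn(lsn):
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the server must keep for a slot at restart_lsn."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)

def lagging_slots(slots, current_lsn, threshold_bytes):
    """Names of slots retaining more WAL than the threshold."""
    return [
        name
        for name, restart_lsn in slots
        if retained_wal_bytes(current_lsn, restart_lsn) > threshold_bytes
    ]

slots = [("debezium", "0/1000000"), ("sequin", "2/0")]
alerts = lagging_slots(slots, "2/1000", threshold_bytes=1 << 30)
```

Here the `debezium` slot is roughly 8 GiB behind and gets flagged, while `sequin` is only 4 KiB behind. Feeding the query result into a check like this turns a silent disk-full incident into an early warning.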
**Kafka consumer group rebalance takes 5 minutes**: enable the KIP-848 next-gen rebalance protocol and set static membership via `group.instance.id`.
**Debezium lag explodes at the start**: it is snapshotting. Tune `incremental.snapshot.chunk.size` and `snapshot.fetch.size`. Flink CDC has less of this problem thanks to parallel snapshot.
**MySQL binlog expires too quickly** → data loss when consumers lag. Increase `binlog_expire_logs_seconds` or add an S3 archival path.
**Snowflake bill explodes** — CDC ingests in micro-batches and small files explode. Tune compaction frequency and warehouse size; clean it up with dbt incremental models.
Epilogue — Data Is Not Just Moved, It Is Promised
In 2026 CDC and data integration boil down to **contracts**. Schema evolution, ordering, exactly-once, freshness SLAs — all are contracts between producer and consumer.
The tools are abundant. Debezium is great. Estuary is great. Flink CDC is great. Airbyte is great. But a modest tool run well beats the right tool run badly.
The question order matters:
1. **Who owns this data?** — where is the source of truth
2. **Who needs it, how fast?** — freshness SLA
3. **Is it replayable?** — retention and replay cost
4. **How does the schema evolve?** — who approves contract changes
5. **How do we detect and recover from incidents?** — observability, lag monitoring
Teams that can answer those five can pick any tool and succeed. Teams that cannot will only start on governance when Fivetran sends the invoice.
CDC is not infrastructure. It is the expression of your data contracts. Treat it that way.
References
1. [Debezium Official Documentation](https://debezium.io/documentation/)
2. [Debezium 3.0 Release Notes](https://debezium.io/blog/2025/01/30/debezium-3-0-final-released/)
3. [Debezium Server Guide](https://debezium.io/documentation/reference/stable/operations/debezium-server.html)
4. [Estuary Flow Docs](https://docs.estuary.dev/)
5. [Apache Flink CDC](https://github.com/apache/flink-cdc)
6. [Flink CDC 3.x Documentation](https://nightlies.apache.org/flink/flink-cdc-docs-stable/)
7. [Airbyte Documentation](https://docs.airbyte.com/)
8. [Airbyte Protocol Spec](https://docs.airbyte.com/understanding-airbyte/airbyte-protocol)
9. [Fivetran Docs](https://fivetran.com/docs)
10. [Sling Documentation](https://docs.slingdata.io/)
11. [Sequin Docs](https://sequinstream.com/docs)
12. [Meltano Hub](https://hub.meltano.com/)
13. [Striim Platform](https://www.striim.com/docs/)
14. [Qlik Replicate](https://www.qlik.com/us/products/qlik-replicate)
15. [Oracle GoldenGate](https://www.oracle.com/integration/goldengate/)
16. [AWS DMS Documentation](https://docs.aws.amazon.com/dms/)
17. [GCP Datastream](https://cloud.google.com/datastream/docs)
18. [Azure Data Factory CDC](https://learn.microsoft.com/azure/data-factory/concepts-change-data-capture)
19. [Hightouch Documentation](https://hightouch.com/docs)
20. [Census Documentation](https://docs.getcensus.com/)
21. [dbt Documentation](https://docs.getdbt.com/)
22. [Apache Airflow](https://airflow.apache.org/docs/)
23. [Dagster Docs](https://docs.dagster.io/)
24. [Confluent Schema Registry](https://docs.confluent.io/platform/current/schema-registry/index.html)
25. [Karapace OSS Schema Registry](https://github.com/Aiven-Open/karapace)
26. [Apicurio Registry](https://www.apicur.io/registry/docs/)
27. [Postgres Logical Replication](https://www.postgresql.org/docs/current/logical-replication.html)
28. [MongoDB Change Streams](https://www.mongodb.com/docs/manual/changeStreams/)
29. [Outbox Pattern (microservices.io)](https://microservices.io/patterns/data/transactional-outbox.html)
30. [Mercari Engineering Blog](https://engineering.mercari.com/en/)