- Published on
Zero-Downtime Data Migration — Moving Data with Dual-Write, Backfill, and CDC
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction
- Why Data Migration Is Hard
- Pattern 1: Dual-Write
- Pattern 2: Backfill
- Pattern 3: CDC (Change Data Capture)
- The Full Picture Combining All Three Tools
- Verification
- Shadow Read
- Cutover
- Rollback
- Idempotency & Reprocessing
- Case Studies
- Common Pitfalls
- Migration Checklist
- Conclusion
- References
Introduction
Data migration looks like "just running one INSERT SELECT from one table to another," but on a live service it is an entirely different level of difficulty. If the data sat still, you could simply copy it. But a real service keeps taking writes the whole time the migration runs. Users place orders, payments happen, profiles get edited. At the very moment you move the data, the source keeps changing.
That is why the essence of zero-downtime migration is not "moving frozen data" but rather "moving a flowing river into a different riverbed." You have to reroute the water without damming it. This post covers how to solve that problem by combining three core tools.
- Dual-Write: the application writes to both the old and new stores at once
- Backfill: moving historical data in batches
- CDC (Change Data Capture): reading the change log and syncing it to the new store in real time
It then lays out how to combine these tools in stages to transition safely as "read cutover then write cutover," how to verify the result, and how to roll back when something goes wrong.
The scope of migration covered here is broad. Splitting a table from PostgreSQL to PostgreSQL, moving from MySQL to a new sharded cluster, separating a monolithic DB into a microservice-specific DB, even moving from an RDBMS to a different kind of store — the same principles apply.
Why Data Migration Is Hard
The Consistency Problem
The first wall you hit is consistency. Keeping the source and the target in the same state at any given moment is harder than it sounds. While Backfill moves historical data, the source keeps changing. If a row you just copied is modified one second later, the target keeps the older version.
time t0: source row(id=42) = {amount: 1000}
time t1: backfill copies row(id=42) to target → target = {amount: 1000}
time t2: source row(id=42) is UPDATEd to {amount: 2000}
result: source = 2000, target = 1000 (mismatch!)
To prevent this mismatch you must somehow re-apply the changes that happened after the backfill to the target. This is exactly where CDC or Dual-Write becomes necessary.
The Ordering Problem
The second wall is ordering. When multiple changes happen to the same row, if the order in which they are applied to the target differs from the order at the source, the final state is wrong.
source order: UPDATE amount=2000 → UPDATE amount=3000
target apply: UPDATE amount=3000 → UPDATE amount=2000 (reversed!)
result: source = 3000, target = 2000
If a CDC pipeline has multiple partitions, or if Dual-Write does not guarantee the order in which the two stores are written, this kind of reversal appears. So you must design things so changes to the same key are processed in the same partition and order.
Other Difficulties
| Problem | Description |
|---|---|
| Schema differences | Column types, constraints, and indexes may differ between source and target |
| Data volume | You must not overload the source while moving hundreds of millions of rows |
| Transaction boundaries | The meaning of grouping rows in one transaction may break at the target |
| Handling deletes | Rows DELETEd at the source must also be removed at the target |
| Handling duplicates | The same change must not be applied twice on reprocessing (idempotency) |
There is no magic that solves all of these at once. Instead you combine Dual-Write, Backfill, and CDC to reduce risk step by step.
Pattern 1: Dual-Write
Concept
Dual-Write means that when the application receives a write request, it writes to both the source and the target. New data automatically lands on both sides, so data from the migration point onward also exists in the target.
+-----------------+
write ---> | Application |
+--------+--------+
|
+----------+----------+
| |
v v
+-----------+ +-----------+
| source | | target |
| (old DB) | | (new DB) |
+-----------+ +-----------+
A Simple Implementation
func CreateOrder(ctx context.Context, o Order) error {
// 1) write to source first (the source of truth)
if err := sourceDB.Insert(ctx, o); err != nil {
return fmt.Errorf("source write failed: %w", err)
}
// 2) also write to target
if err := targetDB.Insert(ctx, o); err != nil {
// how do we handle a target failure? (discussed below)
log.Warn("target write failed", "order_id", o.ID, "err", err)
metrics.Inc("dual_write.target_failure")
}
return nil
}
The Risks of Dual-Write
Dual-Write is intuitive but has many traps. The core issue is that "writing to two stores is not atomic."
case A: source succeeds, target fails → data missing in target
case B: source succeeds, target succeeds, but the process dies in between → partial apply
case C: concurrency — two requests reach the two DBs in different orders → ordering reversal
Wrapping it in a distributed transaction (2PC) gives you atomicity but costs latency and availability. If one DB slows down, all writes slow down, and if the coordinator dies, the transaction stalls. So in practice teams usually avoid 2PC and pick one of the following.
| Strategy | Description | Drawback |
|---|---|---|
| Best-effort dual-write | Only log target failures and ignore them | Holes appear in the target |
| Source-first + CDC fixup | Source is truth, target filled by CDC | Dual-write becomes unnecessary |
| Outbox pattern | Record an outbox row in the same source transaction | Needs a separate relay |
In reality, using Dual-Write alone accumulates losses from case A. That is why many teams use CDC instead of, or alongside, Dual-Write. The safest variant of Dual-Write is the Outbox pattern.
The Outbox Pattern
The Outbox pattern secures atomicity by recording business data and the "event" in the same transaction.
BEGIN;
INSERT INTO orders (id, amount, status)
VALUES (42, 2000, 'CREATED');
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES (42, 'OrderCreated', '{"id":42,"amount":2000}');
COMMIT;
Because it is the same transaction, both commit or both roll back. Then a separate relay (or CDC) reads the outbox table and propagates it to the target. This eliminates the "wrote to source but the event never went out" situation.
Pattern 2: Backfill
Concept
Backfill is the work of moving, in batches, the historical data that already existed at the start of the migration. Dual-Write and CDC only capture "changes from now on," so historical data must be moved separately.
Batch Strategy
If you SELECT * over hundreds of millions of rows at once, the source DB stalls. You must slice it by key range and process it in batches.
func Backfill(ctx context.Context, batchSize int) error {
var lastID int64 = 0
for {
rows, err := sourceDB.Query(ctx, `
SELECT id, amount, status, updated_at
FROM orders
WHERE id > $1
ORDER BY id
LIMIT $2`, lastID, batchSize)
if err != nil {
return err
}
batch := scanRows(rows)
if len(batch) == 0 {
break // done
}
// UPSERT into target (idempotent)
if err := targetDB.UpsertBatch(ctx, batch); err != nil {
return err
}
lastID = batch[len(batch)-1].ID
time.Sleep(throttle) // throttle source load
}
return nil
}
Backfill Design Principles
- Keyset pagination: use
id > lastIDinstead of OFFSET so it does not get slower over time - Use UPSERT: so that running the same batch twice is safe (idempotency)
- Throttling: adjust speed while watching source CPU and replication lag
- Restartable: store the progress checkpoint to resume after an interruption
- Use a read replica: if possible, read from the source's read replica to spread the load
Contention Between Backfill and Real-Time Changes
Suppose Backfill copies row(id=42) at t1, and at t2 CDC delivers an UPDATE for the same row. If the CDC event is processed before the backfill, the backfill's stale value could overwrite it later. The common ways to prevent this are as follows.
1) Version/timestamp comparison: ignore if target.updated_at >= incoming.updated_at
2) Backfill uses INSERT ... ON CONFLICT DO NOTHING (preserve the CDC value if it exists)
3) CDC always UPSERTs (overwrite with the latest value)
In other words, if you split roles so that backfill only "fills the empty slots" and CDC owns real-time changes, the contention becomes simple.
Pattern 3: CDC (Change Data Capture)
Concept
CDC is the technique of reading a database's change log (WAL, binlog, and so on) and turning it into a stream of change events. It captures every INSERT/UPDATE/DELETE at the source without touching application code. The representative tool is Debezium.
+-----------+ WAL/binlog +-----------+ events +-----------+
| source | ---------------> | Debezium | -----------> | Kafka |
| (DB) | | connector | | topic |
+-----------+ +-----------+ +-----+-----+
|
v
+-----------+
| consumer |
| -> target |
+-----------+
CDC Based on PostgreSQL Logical Replication
PostgreSQL provides logical replication and replication slots. Debezium operates on top of this.
-- configuration for logical replication (postgresql.conf)
-- wal_level = logical
-- create a publication
CREATE PUBLICATION orders_pub FOR TABLE orders;
-- inspect replication slots
SELECT slot_name, plugin, active
FROM pg_replication_slots;
Replication slots are powerful but dangerous. If the consumer stalls, WAL is not cleaned up and piles up, which can fill the disk. You must monitor slot lag.
-- check how much WAL the replication slot is holding
SELECT slot_name,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal
FROM pg_replication_slots;
Example Debezium Connector Configuration
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "source-db",
"database.port": "5432",
"database.user": "debezium",
"database.dbname": "shop",
"topic.prefix": "cdc",
"plugin.name": "pgoutput",
"slot.name": "orders_slot",
"publication.name": "orders_pub",
"table.include.list": "public.orders",
"snapshot.mode": "initial"
}
}
Here snapshot.mode matters. With initial, Debezium first takes a snapshot of existing data when it starts, then switches to the change stream. This snapshot can effectively take over the backfill role. On very large tables, though, you should consider an incremental snapshot.
The Shape of a CDC Event
A Debezium event carries both the before and after state of a change.
{
"op": "u",
"before": { "id": 42, "amount": 1000, "status": "CREATED" },
"after": { "id": 42, "amount": 2000, "status": "PAID" },
"source": { "lsn": 123456789, "ts_ms": 1718500000000 },
"ts_ms": 1718500000123
}
op: c (create), u (update), d (delete), r (read = snapshot)before/after: the row before and after the changesource.lsn: the log position, used to judge ordering
Idempotent Application in the Consumer
A CDC consumer may receive the same event twice (at-least-once). So application must be idempotent.
func ApplyEvent(ctx context.Context, e CDCEvent) error {
switch e.Op {
case "c", "u", "r":
// ignore older events by comparing LSN
_, err := targetDB.Exec(ctx, `
INSERT INTO orders (id, amount, status, _lsn)
VALUES ($1, $2, $3, $4)
ON CONFLICT (id) DO UPDATE
SET amount = EXCLUDED.amount,
status = EXCLUDED.status,
_lsn = EXCLUDED._lsn
WHERE orders._lsn < EXCLUDED._lsn`,
e.After.ID, e.After.Amount, e.After.Status, e.LSN)
return err
case "d":
_, err := targetDB.Exec(ctx,
`DELETE FROM orders WHERE id = $1 AND _lsn < $2`,
e.Before.ID, e.LSN)
return err
}
return nil
}
With an _lsn column you can add a guard that says "ignore if a newer change has already been applied," which blocks ordering reversal and reprocessing at the same time.
The Full Picture Combining All Three Tools
In practice you usually use the Backfill + CDC combination. Dual-Write is used as a supplement in environments where CDC is hard to introduce, or in its Outbox variant. The typical flow is as follows.
[1] Create the CDC slot (start capturing changes from this point)
|
v
[2] Start Backfill (fill the target with historical data in batches)
| meanwhile CDC applies new changes to the target
v
[3] Backfill complete + CDC lag ~ 0 (source == target converges)
|
v
[4] Verification (row count, checksum, sample comparison)
|
v
[5] Shadow read (read the target but discard results, compare only)
|
v
[6] Read cutover (switch reads to the target)
|
v
[7] Write cutover (switch writes to the target, stop the source)
|
v
[8] After stabilization, clean up source/CDC
Order matters. You must create the CDC slot before backfill so you do not miss changes during backfill. And thanks to idempotent application, the final state converges even if backfill and CDC touch the same row.
Verification
Before cutover you must confirm that the source and target are truly the same. If you proceed on "they're probably the same," you will discover data loss in production.
Step 1: Row Count Comparison
The cheapest and fastest check. But equal row counts can still hide different contents, so use it only as a first gate.
-- source
SELECT count(*) FROM orders;
-- target
SELECT count(*) FROM orders;
Step 2: Checksum Comparison
Compute and compare checksums per row or per range. Comparing everything at once is expensive, so split it by key range.
-- checksum per range (PostgreSQL)
SELECT
(id / 100000) AS bucket,
count(*) AS cnt,
md5(string_agg(
id || ':' || amount || ':' || status,
',' ORDER BY id
)) AS checksum
FROM orders
GROUP BY (id / 100000)
ORDER BY bucket;
Run the same query on source and target, pick only the buckets whose checksums differ, and do a precise comparison on those — that is efficient.
Step 3: Sample Comparison
When a full comparison is hard, draw a random sample and compare whole rows. In particular, prioritize recently changed rows and boundary values (the largest id, rows with many NULLs).
Verification priority:
1) Most recently modified rows (detect omissions from CDC lag)
2) Oldest rows (detect early backfill bugs)
3) Boundary / unusual values (NULL, negative, empty string)
4) N random rows
Summary of Verification
| Check | Cost | What it catches | Limitation |
|---|---|---|---|
| Row count | Very low | Mass omission/duplication | Misses content differences |
| Checksum (range) | Medium | Content mismatch ranges | Sensitive to sort/type differences |
| Sample comparison | Low | Concrete row differences | No full guarantee |
| Shadow read | Medium | Real query result differences | Increases operational load |
Shadow Read
A technique that raises verification one more level is shadow read. You read the target with real production traffic, but you do not hand the result to the user — you only compare it against the source result.
func GetOrder(ctx context.Context, id int64) (Order, error) {
// the truth is still the source
o, err := sourceDB.GetOrder(ctx, id)
if err != nil {
return Order{}, err
}
// shadow: read the target too and compare (async, failures ignored)
go func() {
shadow, serr := targetDB.GetOrder(context.Background(), id)
if serr != nil {
metrics.Inc("shadow.read_error")
return
}
if !shadow.Equal(o) {
metrics.Inc("shadow.mismatch")
log.Warn("shadow mismatch", "id", id,
"source", o, "target", shadow)
}
}()
return o, nil
}
Shadow read catches differences in actual query patterns (joins, filters, sorts). Even if row count or checksum passes, you can discover here a case where a specific query returns different results due to index or schema differences. But operational load roughly doubles, so adjust the sampling ratio.
Cutover
Read Cutover First
The reason to switch reads first is that the risk is low. Reads do not change data, so even if you read incorrectly from the target, the source data is intact. If you spot a problem, you can immediately revert to the source (a read rollback is lossless).
read_from_target = feature_flag("orders.read_target") // 0% -> 1% -> 10% -> 50% -> 100%
Raise the percentage gradually while watching shadow mismatches and the error rate.
Write Cutover Carefully
Switching writes is hard to undo. After write cutover, data that landed only in the target does not exist in the source, so a naive rollback loses that data. That is why, before write cutover, you usually prepare the following.
1) Prepare reverse CDC: configure target -> source sync in advance
(to bring the target's new data back to the source on rollback)
2) Short write-freeze window (optional): briefly block writes,
drive CDC lag to 0, then switch — this is the safest
3) Guarantee idempotency keys: so retries right after the switch do not create duplicates
Example Cutover Sequence
T-0 take writes at source while CDC applies to target (keep lag ~0)
T-1 switch reads 100% to target, observe for 24 hours
T-2 enable reverse CDC (target->source)
T-3 (optional) pause writes for 5 seconds, confirm both-direction lag is 0
T-4 switch writes to target
T-5 block writes to source, target is the single truth
T-6 after a week of stabilization, clean up forward CDC, the slot, and source
Rollback
The most important part of a migration plan is "the way back." A cutover with no rollback path is a gamble.
| Stage | Rollback method | Data loss |
|---|---|---|
| During backfill | Just stop, discard the target | None |
| Read cutover | Flip the flag back to source | None |
| Right after write cutover | Apply target changes to source via reverse CDC, then switch | Lossless only if reverse sync exists |
| Long after write cutover | Depends on whether reverse CDC is kept | Loss risk if absent |
The key lesson: if you turn on reverse CDC at the moment of write cutover, even if a problem appears right after the switch, you can bring the new data accumulated in the target back to the source, enabling a lossless rollback. Keep this reverse path for a few days and clean it up once safety is confirmed.
Idempotency & Reprocessing
Both CDC and Backfill require idempotency — "applying the same change multiple times must yield the same result." That is because consumers can restart, events can be delivered more than once, and backfill can be re-run.
How to Guarantee Idempotency
1) UPSERT (ON CONFLICT): update if it exists, insert if not → prevents duplicate INSERTs
2) Version guard: ignore stale changes via _lsn / updated_at / version columns
3) Natural or deterministic key: the same input always maps to the same PK
4) Idempotency token: store the message ID in a processed-records table to block duplicates
Example Dedupe Table
CREATE TABLE processed_events (
event_id TEXT PRIMARY KEY,
processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- inside the application transaction
BEGIN;
INSERT INTO processed_events (event_id)
VALUES ('evt-abc-123')
ON CONFLICT (event_id) DO NOTHING;
-- if the INSERT above affected 0 rows, the event was already processed → skip the work
-- ... apply the actual data ...
COMMIT;
Reprocessing is not something to fear — it is a normal part of operations. When idempotency is guaranteed, "re-running from the start yields the same result," so it is safe to rewind the CDC consumer offset to the past and re-apply a suspect range.
Case Studies
Case 1: Splitting a Large Table into Two
Suppose the orders table got too large and you need to split it into orders and order_items.
[before] orders(id, user_id, item_json, amount, ...)
[goal] orders(id, user_id, amount, ...)
order_items(id, order_id, sku, qty, price)
Procedure:
- Create the two new tables, attach a CDC slot to the existing
orders - Backfill: read the existing
ordersand insert transformed rows into the two new tables (parseitem_jsonto createorder_itemsrows) - The CDC consumer applies the same transform logic (route new/updated orders into the two tables)
- Verify: compare order count, amount sum, and item-count sum between source and target
- Read cutover → Write cutover
Here the transform logic must be identical on both the backfill and CDC sides. If the transform is applied to only one side, the data diverges. The key is to extract the transform function into a shared module so both use the same code.
Case 2: Splitting a Service Out of a Monolithic DB
This is the case of carving the payment feature out into a separate microservice and moving it to a dedicated DB.
[before] monolith_db.payments (the monolith accesses it directly)
[goal] payment_service_db.payments (only the payment service accesses it)
In this case the application boundary changes, so Dual-Write (or Outbox) is useful.
- The payment service starts writing to the new DB while the monolith reads the existing DB
- Sync new payments to both sides via Outbox + CDC
- Backfill historical payments
- After verification, switch the monolith's payment reads to new-service API calls (read cutover)
- Delegate the monolith's payment writes to the new service (write cutover)
- Remove the monolith's direct DB access
In a service split, what differs from a table split is that you must move not only the data but also the "access path."
Common Pitfalls
1. Creating the CDC Slot After Backfill
If you create the slot late, you miss the changes in between. Always start the slot/CDC first and do backfill later.
2. Letting Replication Slot Lag Linger
If the consumer stalls, PostgreSQL's WAL is not cleaned up and piles up, filling the source disk. In the worst case the source DB goes down. A slot-lag alarm is mandatory.
3. Forgetting to Guarantee Ordering
If changes to the same key are applied in a different order, the final state is wrong. Route the same key to the same partition and put an LSN/version guard on application.
4. Missing DELETEs
If you only care about INSERT/UPDATE and forget to move DELETEs, ghost rows remain in the target. Always handle CDC's op=d.
5. Schema Changes and Migration at the Same Time
If you add a column to the source during the migration, CDC and the transform logic can break. It is safer to freeze the source schema during the migration window.
6. Verifying Only Right Before Cutover
Verification should run continuously, not once. If you only look right before cutover, you miss mismatches that arose in between.
7. Not Rehearsing Rollback
If the rollback procedure lives only in a document and you never actually do it, it will not work when it matters most. You must actually run the rollback in staging.
8. Timezone/Type Mismatches
If the timestamp type, timezone, or numeric precision differ between source and target, checksums will fail every time. Normalize types before comparing.
Migration Checklist
[ Preparation ]
[ ] Target schema defined and indexes designed
[ ] Transform logic written as a shared module used by both backfill and CDC
[ ] CDC slot/connector configured, snapshot.mode decided
[ ] Idempotent application guaranteed (UPSERT + version guard)
[ ] Dedupe table or idempotency token prepared
[ Execution ]
[ ] Create the CDC slot before backfill
[ ] Backfill with keyset pagination + throttling
[ ] Save backfill checkpoints (for restart)
[ ] Monitor CDC lag, replication slot lag, source load
[ Verification ]
[ ] Row count comparison (first gate)
[ ] Range-by-range checksum comparison
[ ] Recent/oldest/boundary sample comparison
[ ] Compare real query results with shadow read
[ Cutover ]
[ ] Gradually switch read cutover by percentage
[ ] Prepare reverse CDC (target->source)
[ ] (Optional) secure lag 0 with a short write-freeze window
[ ] Block source writes after write cutover
[ Rollback / Cleanup ]
[ ] Document the rollback procedure + rehearse in staging
[ ] Keep reverse CDC during the stabilization period
[ ] Clean up slot/connector/source after confirming safety
Conclusion
The core of zero-downtime data migration is not fancy tooling but staged risk reduction. Move the past with Backfill, catch up to the present with CDC, absorb conflicts with idempotency, gain confidence with verification, switch reads first and writes later, and always keep a way back open.
Three things to remember in particular. First, create the CDC slot before backfill so you do not miss changes. Second, make every application idempotent so you do not fear reprocessing. Third, turn on reverse CDC before write cutover to secure a lossless rollback path. Just keeping these three will let most migrations finish without incident.
Data is a flowing river. Do not dam the river — slowly dig the new riverbed, and once the current is stable, fill in the old one.