Zero-Downtime Data Migration — Moving Data with Dual-Write, Backfill, and CDC

Introduction
Why Data Migration Is Hard
Pattern 1: Dual-Write
Pattern 2: Backfill
Pattern 3: CDC (Change Data Capture)
The Full Picture Combining All Three Tools
Verification
Shadow Read
Cutover
Rollback
Idempotency & Reprocessing
- How to Guarantee Idempotency
- Example Dedupe Table
Case Studies
- Case 1: Splitting a Large Table into Two
- Case 2: Splitting a Service Out of a Monolithic DB
Common Pitfalls
Migration Checklist
Conclusion
References

Introduction

Data migration looks like "just running one INSERT SELECT from one table to another," but on a live service it is an entirely different level of difficulty. If the data sat still, you could simply copy it. But a real service keeps taking writes the whole time the migration runs. Users place orders, payments happen, profiles get edited. At the very moment you move the data, the source keeps changing.

That is why the essence of zero-downtime migration is not "moving frozen data" but rather "moving a flowing river into a different riverbed." You have to reroute the water without damming it. This post covers how to solve that problem by combining three core tools.

Dual-Write: the application writes to both the old and new stores at once
Backfill: moving historical data in batches
CDC (Change Data Capture): reading the change log and syncing it to the new store in real time

It then lays out how to combine these tools in stages to transition safely as "read cutover then write cutover," how to verify the result, and how to roll back when something goes wrong.

The scope of migration covered here is broad. Splitting a table from PostgreSQL to PostgreSQL, moving from MySQL to a new sharded cluster, separating a monolithic DB into a microservice-specific DB, even moving from an RDBMS to a different kind of store — the same principles apply.

Why Data Migration Is Hard

The Consistency Problem

The first wall you hit is consistency. Keeping the source and the target in the same state at any given moment is harder than it sounds. While Backfill moves historical data, the source keeps changing. If a row you just copied is modified one second later, the target keeps the older version.

time t0: source row(id=42) = {amount: 1000}
time t1: backfill copies row(id=42) to target → target = {amount: 1000}
time t2: source row(id=42) is UPDATEd to {amount: 2000}
result:  source = 2000, target = 1000  (mismatch!)

To prevent this mismatch you must somehow re-apply the changes that happened after the backfill to the target. This is exactly where CDC or Dual-Write becomes necessary.

The Ordering Problem

The second wall is ordering. When multiple changes happen to the same row, if the order in which they are applied to the target differs from the order at the source, the final state is wrong.

source order:  UPDATE amount=2000  →  UPDATE amount=3000
target apply:  UPDATE amount=3000  →  UPDATE amount=2000  (reversed!)
result:        source = 3000, target = 2000

If a CDC pipeline has multiple partitions, or if Dual-Write does not guarantee the order in which the two stores are written, this kind of reversal appears. So you must design things so changes to the same key are processed in the same partition and order.

Other Difficulties

Problem	Description
Schema differences	Column types, constraints, and indexes may differ between source and target
Data volume	You must not overload the source while moving hundreds of millions of rows
Transaction boundaries	The meaning of grouping rows in one transaction may break at the target
Handling deletes	Rows DELETEd at the source must also be removed at the target
Handling duplicates	The same change must not be applied twice on reprocessing (idempotency)

There is no magic that solves all of these at once. Instead you combine Dual-Write, Backfill, and CDC to reduce risk step by step.

Pattern 1: Dual-Write

Concept

Dual-Write means that when the application receives a write request, it writes to both the source and the target. New data automatically lands on both sides, so data from the migration point onward also exists in the target.

                +-----------------+
   write  --->  |  Application    |
                +--------+--------+
                         |
              +----------+----------+
              |                     |
              v                     v
        +-----------+        +-----------+
        |  source   |        |  target   |
        |  (old DB) |        |  (new DB) |
        +-----------+        +-----------+

A Simple Implementation

func CreateOrder(ctx context.Context, o Order) error {
    // 1) write to source first (the source of truth)
    if err := sourceDB.Insert(ctx, o); err != nil {
        return fmt.Errorf("source write failed: %w", err)
    }
    // 2) also write to target
    if err := targetDB.Insert(ctx, o); err != nil {
        // how do we handle a target failure? (discussed below)
        log.Warn("target write failed", "order_id", o.ID, "err", err)
        metrics.Inc("dual_write.target_failure")
    }
    return nil
}

The Risks of Dual-Write

Dual-Write is intuitive but has many traps. The core issue is that "writing to two stores is not atomic."

case A: source succeeds, target fails → data missing in target
case B: source succeeds, target succeeds, but the process dies in between → partial apply
case C: concurrency — two requests reach the two DBs in different orders → ordering reversal

Wrapping it in a distributed transaction (2PC) gives you atomicity but costs latency and availability. If one DB slows down, all writes slow down, and if the coordinator dies, the transaction stalls. So in practice teams usually avoid 2PC and pick one of the following.

Strategy	Description	Drawback
Best-effort dual-write	Only log target failures and ignore them	Holes appear in the target
Source-first + CDC fixup	Source is truth, target filled by CDC	Dual-write becomes unnecessary
Outbox pattern	Record an outbox row in the same source transaction	Needs a separate relay

In reality, using Dual-Write alone accumulates losses from case A. That is why many teams use CDC instead of, or alongside, Dual-Write. The safest variant of Dual-Write is the Outbox pattern.

The Outbox Pattern

The Outbox pattern secures atomicity by recording business data and the "event" in the same transaction.

BEGIN;
  INSERT INTO orders (id, amount, status)
  VALUES (42, 2000, 'CREATED');

  INSERT INTO outbox (aggregate_id, event_type, payload)
  VALUES (42, 'OrderCreated', '{"id":42,"amount":2000}');
COMMIT;

Because it is the same transaction, both commit or both roll back. Then a separate relay (or CDC) reads the outbox table and propagates it to the target. This eliminates the "wrote to source but the event never went out" situation.

Pattern 2: Backfill

Concept

Backfill is the work of moving, in batches, the historical data that already existed at the start of the migration. Dual-Write and CDC only capture "changes from now on," so historical data must be moved separately.

Batch Strategy

If you SELECT * over hundreds of millions of rows at once, the source DB stalls. You must slice it by key range and process it in batches.

func Backfill(ctx context.Context, batchSize int) error {
    var lastID int64 = 0
    for {
        rows, err := sourceDB.Query(ctx, `
            SELECT id, amount, status, updated_at
            FROM orders
            WHERE id > $1
            ORDER BY id
            LIMIT $2`, lastID, batchSize)
        if err != nil {
            return err
        }
        batch := scanRows(rows)
        if len(batch) == 0 {
            break // done
        }
        // UPSERT into target (idempotent)
        if err := targetDB.UpsertBatch(ctx, batch); err != nil {
            return err
        }
        lastID = batch[len(batch)-1].ID
        time.Sleep(throttle) // throttle source load
    }
    return nil
}

Backfill Design Principles

Keyset pagination: use id > lastID instead of OFFSET so it does not get slower over time
Use UPSERT: so that running the same batch twice is safe (idempotency)
Throttling: adjust speed while watching source CPU and replication lag
Restartable: store the progress checkpoint to resume after an interruption
Use a read replica: if possible, read from the source's read replica to spread the load

Contention Between Backfill and Real-Time Changes

Suppose Backfill copies row(id=42) at t1, and at t2 CDC delivers an UPDATE for the same row. If the CDC event is processed before the backfill, the backfill's stale value could overwrite it later. The common ways to prevent this are as follows.

1) Version/timestamp comparison: ignore if target.updated_at >= incoming.updated_at
2) Backfill uses INSERT ... ON CONFLICT DO NOTHING (preserve the CDC value if it exists)
3) CDC always UPSERTs (overwrite with the latest value)

In other words, if you split roles so that backfill only "fills the empty slots" and CDC owns real-time changes, the contention becomes simple.

Pattern 3: CDC (Change Data Capture)

Concept

CDC is the technique of reading a database's change log (WAL, binlog, and so on) and turning it into a stream of change events. It captures every INSERT/UPDATE/DELETE at the source without touching application code. The representative tool is Debezium.

   +-----------+    WAL/binlog    +-----------+    events    +-----------+
   |  source   | ---------------> | Debezium  | -----------> |  Kafka    |
   |  (DB)     |                  | connector |              |  topic    |
   +-----------+                  +-----------+              +-----+-----+
                                                                   |
                                                                   v
                                                            +-----------+
                                                            | consumer  |
                                                            | -> target |
                                                            +-----------+

CDC Based on PostgreSQL Logical Replication

PostgreSQL provides logical replication and replication slots. Debezium operates on top of this.

-- configuration for logical replication (postgresql.conf)
-- wal_level = logical

-- create a publication
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- inspect replication slots
SELECT slot_name, plugin, active
FROM pg_replication_slots;

Replication slots are powerful but dangerous. If the consumer stalls, WAL is not cleaned up and piles up, which can fill the disk. You must monitor slot lag.

-- check how much WAL the replication slot is holding
SELECT slot_name,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;

Example Debezium Connector Configuration

{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "source-db",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "shop",
    "topic.prefix": "cdc",
    "plugin.name": "pgoutput",
    "slot.name": "orders_slot",
    "publication.name": "orders_pub",
    "table.include.list": "public.orders",
    "snapshot.mode": "initial"
  }
}

Here snapshot.mode matters. With initial, Debezium first takes a snapshot of existing data when it starts, then switches to the change stream. This snapshot can effectively take over the backfill role. On very large tables, though, you should consider an incremental snapshot.

The Shape of a CDC Event

A Debezium event carries both the before and after state of a change.

{
  "op": "u",
  "before": { "id": 42, "amount": 1000, "status": "CREATED" },
  "after":  { "id": 42, "amount": 2000, "status": "PAID" },
  "source": { "lsn": 123456789, "ts_ms": 1718500000000 },
  "ts_ms": 1718500000123
}

op: c (create), u (update), d (delete), r (read = snapshot)
before/after: the row before and after the change
source.lsn: the log position, used to judge ordering

Idempotent Application in the Consumer

A CDC consumer may receive the same event twice (at-least-once). So application must be idempotent.

func ApplyEvent(ctx context.Context, e CDCEvent) error {
    switch e.Op {
    case "c", "u", "r":
        // ignore older events by comparing LSN
        _, err := targetDB.Exec(ctx, `
            INSERT INTO orders (id, amount, status, _lsn)
            VALUES ($1, $2, $3, $4)
            ON CONFLICT (id) DO UPDATE
              SET amount = EXCLUDED.amount,
                  status = EXCLUDED.status,
                  _lsn   = EXCLUDED._lsn
            WHERE orders._lsn < EXCLUDED._lsn`,
            e.After.ID, e.After.Amount, e.After.Status, e.LSN)
        return err
    case "d":
        _, err := targetDB.Exec(ctx,
            `DELETE FROM orders WHERE id = $1 AND _lsn < $2`,
            e.Before.ID, e.LSN)
        return err
    }
    return nil
}

With an _lsn column you can add a guard that says "ignore if a newer change has already been applied," which blocks ordering reversal and reprocessing at the same time.

The Full Picture Combining All Three Tools

In practice you usually use the Backfill + CDC combination. Dual-Write is used as a supplement in environments where CDC is hard to introduce, or in its Outbox variant. The typical flow is as follows.

[1] Create the CDC slot (start capturing changes from this point)
        |
        v
[2] Start Backfill (fill the target with historical data in batches)
        |   meanwhile CDC applies new changes to the target
        v
[3] Backfill complete + CDC lag ~ 0  (source == target converges)
        |
        v
[4] Verification (row count, checksum, sample comparison)
        |
        v
[5] Shadow read (read the target but discard results, compare only)
        |
        v
[6] Read cutover (switch reads to the target)
        |
        v
[7] Write cutover (switch writes to the target, stop the source)
        |
        v
[8] After stabilization, clean up source/CDC

Order matters. You must create the CDC slot before backfill so you do not miss changes during backfill. And thanks to idempotent application, the final state converges even if backfill and CDC touch the same row.

Verification

Before cutover you must confirm that the source and target are truly the same. If you proceed on "they're probably the same," you will discover data loss in production.

Step 1: Row Count Comparison

The cheapest and fastest check. But equal row counts can still hide different contents, so use it only as a first gate.

-- source
SELECT count(*) FROM orders;
-- target
SELECT count(*) FROM orders;

Step 2: Checksum Comparison

Compute and compare checksums per row or per range. Comparing everything at once is expensive, so split it by key range.

-- checksum per range (PostgreSQL)
SELECT
  (id / 100000) AS bucket,
  count(*) AS cnt,
  md5(string_agg(
    id || ':' || amount || ':' || status,
    ',' ORDER BY id
  )) AS checksum
FROM orders
GROUP BY (id / 100000)
ORDER BY bucket;

Run the same query on source and target, pick only the buckets whose checksums differ, and do a precise comparison on those — that is efficient.

Step 3: Sample Comparison

When a full comparison is hard, draw a random sample and compare whole rows. In particular, prioritize recently changed rows and boundary values (the largest id, rows with many NULLs).

Verification priority:
1) Most recently modified rows (detect omissions from CDC lag)
2) Oldest rows (detect early backfill bugs)
3) Boundary / unusual values (NULL, negative, empty string)
4) N random rows

Summary of Verification

Check	Cost	What it catches	Limitation
Row count	Very low	Mass omission/duplication	Misses content differences
Checksum (range)	Medium	Content mismatch ranges	Sensitive to sort/type differences
Sample comparison	Low	Concrete row differences	No full guarantee
Shadow read	Medium	Real query result differences	Increases operational load

Shadow Read

A technique that raises verification one more level is shadow read. You read the target with real production traffic, but you do not hand the result to the user — you only compare it against the source result.

func GetOrder(ctx context.Context, id int64) (Order, error) {
    // the truth is still the source
    o, err := sourceDB.GetOrder(ctx, id)
    if err != nil {
        return Order{}, err
    }
    // shadow: read the target too and compare (async, failures ignored)
    go func() {
        shadow, serr := targetDB.GetOrder(context.Background(), id)
        if serr != nil {
            metrics.Inc("shadow.read_error")
            return
        }
        if !shadow.Equal(o) {
            metrics.Inc("shadow.mismatch")
            log.Warn("shadow mismatch", "id", id,
                "source", o, "target", shadow)
        }
    }()
    return o, nil
}

Shadow read catches differences in actual query patterns (joins, filters, sorts). Even if row count or checksum passes, you can discover here a case where a specific query returns different results due to index or schema differences. But operational load roughly doubles, so adjust the sampling ratio.

Cutover

Read Cutover First

The reason to switch reads first is that the risk is low. Reads do not change data, so even if you read incorrectly from the target, the source data is intact. If you spot a problem, you can immediately revert to the source (a read rollback is lossless).

read_from_target = feature_flag("orders.read_target")  // 0% -> 1% -> 10% -> 50% -> 100%

Raise the percentage gradually while watching shadow mismatches and the error rate.

Write Cutover Carefully

Switching writes is hard to undo. After write cutover, data that landed only in the target does not exist in the source, so a naive rollback loses that data. That is why, before write cutover, you usually prepare the following.

1) Prepare reverse CDC: configure target -> source sync in advance
   (to bring the target's new data back to the source on rollback)
2) Short write-freeze window (optional): briefly block writes,
   drive CDC lag to 0, then switch — this is the safest
3) Guarantee idempotency keys: so retries right after the switch do not create duplicates

Example Cutover Sequence

T-0   take writes at source while CDC applies to target (keep lag ~0)
T-1   switch reads 100% to target, observe for 24 hours
T-2   enable reverse CDC (target->source)
T-3   (optional) pause writes for 5 seconds, confirm both-direction lag is 0
T-4   switch writes to target
T-5   block writes to source, target is the single truth
T-6   after a week of stabilization, clean up forward CDC, the slot, and source

Rollback

The most important part of a migration plan is "the way back." A cutover with no rollback path is a gamble.

Stage	Rollback method	Data loss
During backfill	Just stop, discard the target	None
Read cutover	Flip the flag back to source	None
Right after write cutover	Apply target changes to source via reverse CDC, then switch	Lossless only if reverse sync exists
Long after write cutover	Depends on whether reverse CDC is kept	Loss risk if absent

The key lesson: if you turn on reverse CDC at the moment of write cutover, even if a problem appears right after the switch, you can bring the new data accumulated in the target back to the source, enabling a lossless rollback. Keep this reverse path for a few days and clean it up once safety is confirmed.

Idempotency & Reprocessing

Both CDC and Backfill require idempotency — "applying the same change multiple times must yield the same result." That is because consumers can restart, events can be delivered more than once, and backfill can be re-run.

How to Guarantee Idempotency

1) UPSERT (ON CONFLICT): update if it exists, insert if not → prevents duplicate INSERTs
2) Version guard: ignore stale changes via _lsn / updated_at / version columns
3) Natural or deterministic key: the same input always maps to the same PK
4) Idempotency token: store the message ID in a processed-records table to block duplicates

Example Dedupe Table

CREATE TABLE processed_events (
  event_id   TEXT PRIMARY KEY,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- inside the application transaction
BEGIN;
  INSERT INTO processed_events (event_id)
  VALUES ('evt-abc-123')
  ON CONFLICT (event_id) DO NOTHING;
  -- if the INSERT above affected 0 rows, the event was already processed → skip the work
  -- ... apply the actual data ...
COMMIT;

Reprocessing is not something to fear — it is a normal part of operations. When idempotency is guaranteed, "re-running from the start yields the same result," so it is safe to rewind the CDC consumer offset to the past and re-apply a suspect range.

Case Studies

Case 1: Splitting a Large Table into Two

Suppose the orders table got too large and you need to split it into orders and order_items.

[before] orders(id, user_id, item_json, amount, ...)
[goal]   orders(id, user_id, amount, ...)
         order_items(id, order_id, sku, qty, price)

Procedure:

Create the two new tables, attach a CDC slot to the existing orders
Backfill: read the existing orders and insert transformed rows into the two new tables (parse item_json to create order_items rows)
The CDC consumer applies the same transform logic (route new/updated orders into the two tables)
Verify: compare order count, amount sum, and item-count sum between source and target
Read cutover → Write cutover

Here the transform logic must be identical on both the backfill and CDC sides. If the transform is applied to only one side, the data diverges. The key is to extract the transform function into a shared module so both use the same code.

Case 2: Splitting a Service Out of a Monolithic DB

This is the case of carving the payment feature out into a separate microservice and moving it to a dedicated DB.

[before] monolith_db.payments  (the monolith accesses it directly)
[goal]   payment_service_db.payments  (only the payment service accesses it)

In this case the application boundary changes, so Dual-Write (or Outbox) is useful.

The payment service starts writing to the new DB while the monolith reads the existing DB
Sync new payments to both sides via Outbox + CDC
Backfill historical payments
After verification, switch the monolith's payment reads to new-service API calls (read cutover)
Delegate the monolith's payment writes to the new service (write cutover)
Remove the monolith's direct DB access

In a service split, what differs from a table split is that you must move not only the data but also the "access path."

Common Pitfalls

1. Creating the CDC Slot After Backfill

If you create the slot late, you miss the changes in between. Always start the slot/CDC first and do backfill later.

2. Letting Replication Slot Lag Linger

If the consumer stalls, PostgreSQL's WAL is not cleaned up and piles up, filling the source disk. In the worst case the source DB goes down. A slot-lag alarm is mandatory.

3. Forgetting to Guarantee Ordering

If changes to the same key are applied in a different order, the final state is wrong. Route the same key to the same partition and put an LSN/version guard on application.

4. Missing DELETEs

If you only care about INSERT/UPDATE and forget to move DELETEs, ghost rows remain in the target. Always handle CDC's op=d.

5. Schema Changes and Migration at the Same Time

If you add a column to the source during the migration, CDC and the transform logic can break. It is safer to freeze the source schema during the migration window.

6. Verifying Only Right Before Cutover

Verification should run continuously, not once. If you only look right before cutover, you miss mismatches that arose in between.

7. Not Rehearsing Rollback

If the rollback procedure lives only in a document and you never actually do it, it will not work when it matters most. You must actually run the rollback in staging.

8. Timezone/Type Mismatches

If the timestamp type, timezone, or numeric precision differ between source and target, checksums will fail every time. Normalize types before comparing.

Migration Checklist

[ Preparation ]
[ ] Target schema defined and indexes designed
[ ] Transform logic written as a shared module used by both backfill and CDC
[ ] CDC slot/connector configured, snapshot.mode decided
[ ] Idempotent application guaranteed (UPSERT + version guard)
[ ] Dedupe table or idempotency token prepared

[ Execution ]
[ ] Create the CDC slot before backfill
[ ] Backfill with keyset pagination + throttling
[ ] Save backfill checkpoints (for restart)
[ ] Monitor CDC lag, replication slot lag, source load

[ Verification ]
[ ] Row count comparison (first gate)
[ ] Range-by-range checksum comparison
[ ] Recent/oldest/boundary sample comparison
[ ] Compare real query results with shadow read

[ Cutover ]
[ ] Gradually switch read cutover by percentage
[ ] Prepare reverse CDC (target->source)
[ ] (Optional) secure lag 0 with a short write-freeze window
[ ] Block source writes after write cutover

[ Rollback / Cleanup ]
[ ] Document the rollback procedure + rehearse in staging
[ ] Keep reverse CDC during the stabilization period
[ ] Clean up slot/connector/source after confirming safety

Conclusion

The core of zero-downtime data migration is not fancy tooling but staged risk reduction. Move the past with Backfill, catch up to the present with CDC, absorb conflicts with idempotency, gain confidence with verification, switch reads first and writes later, and always keep a way back open.

Three things to remember in particular. First, create the CDC slot before backfill so you do not miss changes. Second, make every application idempotent so you do not fear reprocessing. Third, turn on reverse CDC before write cutover to secure a lossless rollback path. Just keeping these three will let most migrations finish without incident.

Data is a flowing river. Do not dam the river — slowly dig the new riverbed, and once the current is stable, fill in the old one.

Introduction

Why Data Migration Is Hard

The Consistency Problem

The Ordering Problem

Other Difficulties

Pattern 1: Dual-Write

Concept

A Simple Implementation

The Risks of Dual-Write

The Outbox Pattern

Pattern 2: Backfill

Concept

Batch Strategy

Backfill Design Principles

Contention Between Backfill and Real-Time Changes

Pattern 3: CDC (Change Data Capture)

Concept

CDC Based on PostgreSQL Logical Replication

Example Debezium Connector Configuration

The Shape of a CDC Event

Idempotent Application in the Consumer

The Full Picture Combining All Three Tools

Verification

Step 1: Row Count Comparison

Step 2: Checksum Comparison

Step 3: Sample Comparison

Summary of Verification

Shadow Read

Cutover

Read Cutover First

Write Cutover Carefully

Example Cutover Sequence

Rollback

Idempotency & Reprocessing

How to Guarantee Idempotency

Example Dedupe Table

Case Studies

Case 1: Splitting a Large Table into Two

Case 2: Splitting a Service Out of a Monolithic DB

Common Pitfalls

1. Creating the CDC Slot After Backfill

2. Letting Replication Slot Lag Linger

3. Forgetting to Guarantee Ordering

4. Missing DELETEs

5. Schema Changes and Migration at the Same Time

6. Verifying Only Right Before Cutover

7. Not Rehearsing Rollback

8. Timezone/Type Mismatches

Migration Checklist

Conclusion

References