Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction — Migration Is Not Copying Data

The first time you take on a heterogeneous data migration, it is easy to think of it as "copying data." Just move the PostgreSQL table into a MongoDB collection, right? But you soon realize: relational and document stores have fundamentally different philosophies for how to view data, and ignoring that difference produces a system that "works but is slow, loses consistency, and is hell to operate."

The essence of heterogeneous migration is data-model redesign. Do you bundle several normalized tables into one document, or keep them as references? Do you give up strong consistency and accept eventual consistency? Decisions like these matter far more than the copying.

This article covers the essence of model differences, direction-specific migration strategies, incremental migration via strangler, dual write, and CDC, the pitfalls of schema evolution, verification, secondary-store synchronization, and consistency trade-offs.

The Essence of Model Differences — Normalization vs Document/KV

First, let us lay out how each model thinks.

+----------------+------------------------+----------------------------+

| Model | Data layout | Best workload |

+----------------+------------------------+----------------------------+

| Relational | normalized, join | complex queries, txn integ|

| Document | denormalized, bundled | aggregate-level read/write|

| Key-Value | direct key access | cache, session, lookups |

| Wide-column | partition-key sharded | bulk writes, time-series |

| Search (ES) | inverted index | full-text search, facets |

+----------------+------------------------+----------------------------+

Relational follows the normalization philosophy: "store data once and assemble it with joins." Document follows the denormalization philosophy: "data read together is stored together." Even for the same "order and order items," RDB splits them across two tables and joins, while in MongoDB it is natural to embed an items array inside a single order document.

This difference decides everything about the migration.

Direction-Specific Migration Strategies

From RDB to Document

When moving normalized tables to documents, the key question is "embed or reference?"

RDB:

orders(id, user_id, total)

order_items(id, order_id, product, qty)

MongoDB (embedded):

{

"_id": "...",

"user_id": "...",

"total": 50000,

"items": [

{ "product": "keyboard", "qty": 1 },

{ "product": "mouse", "qty": 2 }

]

}

The criterion for embedding is "read together, change together, and does not grow without bound." Order items are read with the order, rarely change once finalized, and are limited in count, so embedding fits. Conversely, a relationship that grows without bound, like "all orders of one user," should be a reference or a separate collection.

// Transformation logic (pseudo)

async function migrateOrder(order) {

const items = await pg.query(

"SELECT product, qty FROM order_items WHERE order_id = $1",

[order.id]

);

await mongo.collection("orders").insertOne({

user_id: order.user_id,

total: order.total,

items: items.rows,

});

}

From Document to RDB

The reverse direction is trickier, because a schemaless document may have ragged fields. Some documents have `phone`, some do not. To move to RDB you must first discover the actual schema.

1. Infer schema: scan all documents, aggregate field presence and type distribution

- phone: 73% present, all string

- tags: 41% present, array of string

2. Design tables: core fields as columns, variable fields as a JSONB column or a side table

3. Embedded arrays -> normalize into child tables

4. Missing fields become NULL, type conflicts get conversion rules

PostgreSQL's JSONB column is a great safety valve for this migration. Pull clearly normalizable fields into columns, and keep ambiguous or rare fields in JSONB, so you can normalize incrementally without losing data.

Data-Model Redesign

Migration is a good chance to rethink the model. Do not simply move the old structure as-is; redesign to leverage the new store's strengths.

- Access patterns first: decide "which query is most frequent" first, and shape the model so that query is fast. Document DBs especially demand access-pattern-driven design.

- Denormalization trade-off: storing data redundantly for read performance means you must update several places on write. It favors read-heavy workloads.

- Aggregate boundaries: bundle data that changes together and must stay consistent together into one aggregate (document). The aggregate concept from domain-driven design applies directly.

Incremental Migration — The Strangler Pattern

A big-bang migration that rips everything out at once is risky. Instead, use the strangler pattern: the new system gradually "strangles" the old one, taking over functionality piece by piece.

[client]

+-----------------+

| routing layer | branch old/new by feature

+--------+--------+

| |

(old feat) (migrated feat)

| |

v v

[RDB] [MongoDB]

At first every request goes to the old system (RDB), but as you move features one by one to the new system (MongoDB), you change the routing. Move user profiles first; once verified, orders; then payments. Each step is small, so if something breaks you revert just that feature.

Dual Write and CDC

The key techniques of incremental migration are dual write and CDC (Change Data Capture).

Dual Write

During the transition, the application writes to both the old and new stores.

application

+--> RDB (old, still authoritative)

+--> MongoDB (new, under verification)

The pitfall of dual write is that the two writes are not atomic. If the RDB write succeeds but the MongoDB write fails, you get inconsistency. So use dual write only as a "verification phase," and reinforce consistency with a separate reconciliation job, or replace it with the safer CDC.

CDC — Change Data Capture

CDC reads the source DB's change log (PostgreSQL's WAL, MySQL's binlog) and streams the changes out. Debezium is the canonical tool.

[RDB] --WAL--> [Debezium] --> [Kafka] --> [consumer] --> [MongoDB]

source capture stream transform target

CDC's advantage is that it reflects every source change into the target without touching application code, and without missing anything. Move existing data with an initial snapshot, then have CDC catch up the changes since that point, and you synchronize the target without stopping the source. It is the core technique of zero-downtime migration.

1. Snapshot: copy the source's current data wholesale to the target

2. Start CDC: catch up changes since the snapshot point as a stream

3. Reach sync: the target follows the source nearly real time (lag ~0)

4. Cutover: switch writes to the target, retire the source

Schema Evolution — The Pitfall of Schemaless

People often say document DBs are "schemaless," but more precisely it means "the schema lives in the application code, not the database." The schema does not disappear; it just moves into code.

The pitfall is that over time, the same collection ends up mixing documents of different shapes.

v1 doc: { name, email }

v2 doc: { name, email, phone }

v3 doc: { fullName, email, phone } <- name -> fullName change

Two strategies handle this.

- Schema version field: put a `schemaVersion` on each document and transform on read according to version (lazy migration). Gradually upgrade to the latest version as you read.

- Bulk migration script: iterate the whole collection, converting old shapes into new ones. For large volumes, run in batches.

Schemaless does not mean abandoning validation. Tools like MongoDB's JSON Schema validation, applying minimal constraints at the DB level, help long-term operations.

Verification

Verifying a heterogeneous migration takes more than a simple row-count comparison.

- Row/document count match: confirm the source and target have the same number of records. The most basic check, but mandatory.

- Sample reconciliation: pull random samples and compare values field by field. Catches bugs in the transform logic.

- Checksum/hash: compute hashes of core fields on both sides and compare. Useful when a full comparison is impractical.

- Business invariants: verify domain rules like "every order's total equals the sum of its items" hold in the target too.

Verification levels:

L1 count : count(source) == count(target)

L2 sample : extract random 1%, reconcile fields

L3 checksum : hash core columns after sorting, compare

L4 invariant : re-verify domain rules (totals, referential integrity)

Synchronizing Secondary Stores — ElasticSearch, Redis

When you move the primary store, the secondary stores derived from it (search index, cache) must be repopulated too.

- ElasticSearch: keep the primary store as the source of truth and rebuild the search index via CDC or an indexing pipeline. Since an index can always be rebuilt from the primary store, a full reindex after migration is cleanest.

- Redis: a cache is generally not the source of truth, so after migration, empty it and refill (cache warming). If data existed only in the cache, persist it to the primary store first.

[primary store migration done]

+--> ElasticSearch full reindex (from primary store)

+--> Redis flush and warm (from primary store)

The core principle is "a secondary store must always be reconstructible from the primary store." That keeps migrations simple.

Consistency Trade-offs

The hardest decision in heterogeneous migration is the consistency level. A team used to RDB's strong consistency and ACID transactions often has to accept eventual consistency when moving to a document/distributed store.

Strong consistency (RDB) Eventual consistency (distributed NoSQL)

------------------------ ----------------------------------------

everyone sees the new value all nodes converge after a moment

immediately on write

multi-row atomicity via txn atomicity per document (aggregate)

joins guarantee integrity application manages integrity

scalability limits easy horizontal scaling

Questions to check at transition time: does this data truly need strong consistency? Separate parts that require strong consistency, like payments and inventory, from parts where eventual consistency suffices, like like-counts and view-counts. Demand strong consistency for everything and you lose NoSQL's advantages; leave everything eventually consistent and you get bugs where money disappears.

Cutover Strategy — When and How to Switch

The most nerve-wracking moment of a migration is the cutover, the instant you hand the source of truth from the old store to the new one. Cutover styles divide by risk level into a few kinds.

+--------------------+------------------------------------------------+

| Style | Characteristics |

+--------------------+------------------------------------------------+

| Big-bang cutover | switch everything at one point. Simple but |

| | hard to roll back |

| Gradual (reads) | shift reads to the new store bit by bit (canary)|

| Gradual (writes) | shift writes to the new store by feature |

| Shadow reads | read the new store but discard, compare to old|

+--------------------+------------------------------------------------+

The safest approach is to start with shadow reads. Read the new store too, but show users the old store's result, and compare the two in the background. Once mismatches drop enough, then shift reads gradually to the new store. Write cutover is last.

1. Shadow reads: compare new-store read results to old (users see old)

2. Canary reads: shift traffic 1% -> 10% -> 50% -> 100% to the new store

3. Write cutover: make the new store the source of truth

4. Retire old store: keep it as a read-only backup for a while, then remove

The key is leaving a sufficient observation period between each step. Not rushing is the fastest path.

Continuous Reconciliation — Even After Migration

Even after cutover, it is safer to run the old and new stores in parallel for a while and reconcile them continuously. While keeping dual write, run a reconciliation job that periodically compares the two stores.

[reconciliation job (hourly)]

+-- compare old- and new-store records by key

+-- on mismatch: log + alert + (auto) correct the new store

+-- observe the mismatch-rate trend on a dashboard

Once the mismatch rate converges to zero and stabilizes for a sufficient period, only then do you retire the old store. This "overlap period" is the insurance that guarantees migration safety. Delete the old store too soon and, when a hidden bug in the new store surfaces, you have nowhere to fall back to.

Migration Tools — AWS DMS and CDC Pipelines

It is hard to do a heterogeneous migration with hand-written scripts alone. Dedicated tools help.

AWS DMS

AWS Database Migration Service is the canonical managed tool for heterogeneous migration. It works even when source and target are of different kinds (e.g., Oracle to PostgreSQL, MySQL to DynamoDB). Three core concepts:

+----------------+------------------------------------------------+

| Concept | Role |

+----------------+------------------------------------------------+

| Replication | the compute that actually performs migration |

| instance | |

| Endpoint | source/target DB connection info |

| Migration task| full load (whole copy) + CDC (track changes) |

+----------------+------------------------------------------------+

A DMS migration task is usually composed of two phases. Full load moves existing data wholesale, then CDC catches up the changes since that point. When heterogeneous, you must specify type mapping (what the source's NUMBER becomes in the target) with transformation rules.

Pitfalls of Transformation Rules

The trickiest part of heterogeneous work is type and encoding conversion.

Conversions to watch:

- Numeric precision: Oracle NUMBER(38) -> PostgreSQL numeric (preserve precision)

- Date/timezone: beware losing timezone info (unify on timestamptz)

- Character encoding: garbling when converting Latin-1 -> UTF-8

- NULL vs empty string: Oracle treats '' and NULL alike, Postgres differently

- Case: Oracle identifiers default uppercase, Postgres lowercase

Automatic tools handle most of these conversions, but you must catch the edge cases with verification. The difference between NULL and empty string in particular is a classic trap that silently corrupts data.

Migrating to Specialized Stores — Time-Series, Graph

Beyond document and search, you sometimes move to a specialized store that fits the data's nature.

- Time-series DB (TimescaleDB, InfluxDB): when moving data whose time axis is central, like logs and metrics, design time-based partitioning and downsampling policy together. Compress/roll up old data and keep only recent data at high resolution.

- Graph DB (Neo4j): data where relationships are first-class citizens (social graphs, recommendations) becomes far simpler to query when moved out of RDB's join hell into a graph. In migration you convert foreign keys to edges and rows to nodes.

RDB relationships -> graph

users row -> (:User) node

follows(a, b) -> (:User)-[:FOLLOWS]->(:User) edge

complex N-hop join -> variable-length path query (simple)

The key is choosing a store that fits the data's intrinsic shape. Instead of cramming everything into RDB, move time-series to a time-series DB and relationships to a graph, and you leverage each one's strengths.

A Case Study — From Monolithic RDB to Document + Search

Let us walk through a hypothetical e-commerce team's case.

Initial: one PostgreSQL for everything (users, products, orders; search via LIKE)

Problem: product search slow (LIKE %keyword%), catalog schema changes often

(needs flexible attributes), single DB overloaded by growing traffic

Strategy:

1. Product catalog to MongoDB (flexible attributes, fits document model)

2. Product search to ElasticSearch (full-text, facets)

3. Orders/payments stay on PostgreSQL (need strong consistency)

Migration:

- CDC (Debezium) syncs PostgreSQL products -> MongoDB

- MongoDB -> ElasticSearch indexing pipeline

- Strangler gradually shifts product-read traffic

- After verification, cutover; keep old product table read-only for a while

The key is that not everything was moved. Orders and payments, which need strong consistency, stayed on the RDB; only the parts needing flexibility and search were moved. In heterogeneous environments, polyglot persistence — picking "the right store for each kind of data" — is often the right answer.

The Operational Burden of Polyglot

Moving heterogeneous gains you the benefit of polyglot persistence — using the right store for each kind of data — but the operational burden rises along with it. You must honestly acknowledge this trade-off.

The costs of polyglot:

- Operational complexity: backup/monitoring/tuning/incident response

differ per DB

- Consistency boundaries: you must manage cross-store sync/integrity

- Cognitive load: the team must know several data models and query languages

- No transactions: there is no atomicity spanning stores

So "can move" and "should move" are different. Move to a new store only when the gains (performance, flexibility, scalability) clearly outweigh the added operational burden. Often the best answer is "PostgreSQL's JSONB is enough" or "hold with RDB for now and move when a real bottleneck appears." Migration is a means, not an end.

The decision criteria, as a table:

+---------------------------+--------------------------------------+

| Move when | Stay when |

+---------------------------+--------------------------------------+

| current store is a clear | there is still performance headroom |

| bottleneck | |

| data model fundamentally | absorbable via JSONB etc. |

| does not fit | |

| you have ops capability | team has never operated the new |

| for the new store | store |

| gains clearly exceed the | "because it is the latest tech" is |

| operational burden | the only reason |

+---------------------------+--------------------------------------+

Common Pitfalls

- Big-bang migration: trying to move everything at once lands you in an unrollback-able state. Split with strangler.

- Copying the model as-is: a 1:1 copy of an RDB table into documents yields a slow document DB that still needs joins. Redesign by access pattern.

- Mistaking dual-write atomicity: the two writes are not atomic. You need a reconciliation mechanism for mismatches.

- Verifying by row count only: cases where the count matches but the values are wrong are common. Verify down to samples, checksums, and invariants.

- Over/under-demanding consistency: failing to distinguish the required consistency level per part costs you performance or breaks data.

- Forgetting secondary stores: forget the search index and cache and, after migration, search fails or stale data lingers in the cache.

Checklist

- [ ] Did you define access patterns first and redesign the model around them?

- [ ] Did you make the embed-vs-reference criterion explicit?

- [ ] Did you plan incremental migration with strangler?

- [ ] Did you set up a zero-downtime sync path via CDC or dual write?

- [ ] Did you decide a schema-evolution strategy (version field or bulk transform)?

- [ ] Did you prepare four-level verification: count, sample, checksum, invariant?

- [ ] Is there a plan to rebuild secondary stores (search, cache)?

- [ ] Did you distinguish the required consistency level per data type?

- [ ] Is there a rollback path if cutover fails?

Closing

Heterogeneous data migration is not about moving data but about remodeling it. The shift from relational normalization thinking to document aggregate thinking, from strong to eventual consistency, is not a mere tool swap but a shift in design philosophy. Split small with strangler, move with zero downtime via CDC, verify in four levels, and pick the right consistency for each kind of data. Keep these principles and heterogeneous migration becomes not a risky gamble but a controlled evolution.

References

- MongoDB Data Modeling docs: https://www.mongodb.com/docs/manual/core/data-modeling-introduction/

- MongoDB Schema Validation: https://www.mongodb.com/docs/manual/core/schema-validation/

- Debezium (CDC): https://debezium.io/documentation/

- PostgreSQL Logical Replication (WAL): https://www.postgresql.org/docs/current/logical-replication.html

- Elasticsearch Reindex API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

- Redis docs: https://redis.io/docs/latest/

- AWS DMS docs: https://docs.aws.amazon.com/dms/

- TimescaleDB docs: https://docs.timescale.com/

- Neo4j data import docs: https://neo4j.com/docs/getting-started/data-import/

- Strangler Fig pattern (Martin Fowler): https://martinfowler.com/bliki/StranglerFigApplication.html