
AWS DMS in Practice: Migration and Continuous Replication (CDC)


Opening — The Hard Problem of Zero-Downtime Migration

Database migration is a rite of passage almost every infrastructure engineer faces at some point. Whether you are moving on-premises Oracle to the cloud, lifting a self-managed MySQL onto RDS, or escaping an expensive commercial engine for PostgreSQL, you eventually run into the same question. How do you move the data without stopping the service and without losing a single row?

The simplest approach is a dump and restore. But not many organizations can afford to halt a service while a production database of hundreds of gigabytes or several terabytes is dumped and restored. The real challenge is how to catch up on the changes that occur between the moment the dump starts and the moment the restore finishes.

AWS Database Migration Service, or DMS for short, tackles exactly this problem head-on. It combines a full load that moves the initial data in bulk with change data capture (CDC) that follows the changes in near real time, keeping the source and target almost continuously in sync. Once you are in that state, all you need is a short cut-over window to achieve a near-zero-downtime switch.

In this article I walk through DMS from a practitioner perspective: its components, how full load and CDC work, heterogeneous migration and SCT integration, LOB handling and data validation, monitoring and cut-over strategy, and a comparison with native logical replication.

What DMS Is — Three Components

In one sentence, DMS is a managed replication service that moves data from a source database to a target database and keeps them in sync. To understand how it works, you first need to know its three components.

- Replication instance: the EC2-based managed compute that actually performs the replication work. The engine that reads data from the source, transforms it, and writes it to the target runs here.

- Endpoint: the connection information for the source and target databases. It holds the engine type, host, port, credentials, and extra connection attributes.

- Replication task: the unit that defines which tables to move and how (full load, CDC, or both). Table mappings and task settings attach to it.

The relationship among the three looks like this.

```text
+------------------+        +------------------------------+        +------------------+
|    Source DB     |        |     Replication instance     |        |    Target DB     |
|  (on-prem/RDS)   |        |        (managed EC2)         |        |   (RDS/Aurora)   |
|                  |  read  |  +----------------------+    |  write |                  |
|  Oracle/MySQL/   | -----> |  | Replication task     |    | -----> |  PostgreSQL/     |
|  PostgreSQL/...  |        |  | - table mappings     |    |        |  Aurora/...      |
|                  |        |  | - full load + CDC    |    |        |                  |
|  change log      |        |  | - transform rules    |    |        |                  |
|  (redo/binlog/   |        |  +----------------------+    |        |                  |
|   WAL)           |        |                              |        |                  |
+------------------+        +------------------------------+        +------------------+
```

The replication instance is the relay engine wedged between source and target. It reads the source transaction log (Oracle redo log, MySQL binlog, PostgreSQL WAL) to capture changes, transforms them into a form the target engine understands, and applies them. AWS manages patching and monitoring for this instance, and handles failover if you deploy it as Multi-AZ.

Full Load and CDC — Combining Two Phases

A DMS task operates in one of three modes depending on the migration type.

- full-load only: it moves the data as of the current moment and stops. Good for a frozen source or a one-time copy.

- cdc only: it follows changes after a given point. Used when the initial data was already moved by another method.

- full-load-and-cdc: the most common zero-downtime migration pattern. It records the changes that occur while the initial data is being moved, applies those changes once the full load completes to catch up, and then switches to ongoing real-time replication.

The full flow of full-load-and-cdc, drawn along a time axis, looks like this.

```text
time ───────────────────────────────────────────────────────────────────────────────>

[full load start]                 [full load done]           [cached changes applied]
|                                 |                          |
v                                 v                          v
+---------------------------------+--------------------------+----------------------+
| full load in progress           | applying cached changes  | ongoing CDC          |
| (parallel copy per table)       | (changes piled up)       | (real-time catch-up) |
+---------------------------------+--------------------------+----------------------+
|                                                            |
+-- source changes in this window are -----------------------+
    tracked via internal cache or log and applied later
```

The key point is that DMS does not miss source changes even while the full load is running. DMS remembers the log position at the start of the full load and caches subsequent changes or re-reads them from the log, applying them once the full load finishes. This is what lets you reach a consistent state without ever stopping the source.
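The catch-up idea can be illustrated with a toy simulation in plain Python. This is only a sketch of the principle, not how DMS is implemented internally: the `Source` class, the `migrate` function, and the dict-based "databases" are all hypothetical stand-ins, and the source log is modeled as a simple list.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Toy source database: a row store plus an append-only change log."""
    rows: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # entries: (op, key, value)

    def write(self, key, value):
        self.rows[key] = value
        self.log.append(("upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append(("delete", key, None))


def apply_changes(target, changes):
    """Replay logged changes on the target in log order."""
    for op, key, value in changes:
        if op == "upsert":
            target[key] = value
        else:
            target.pop(key, None)


def migrate(source, writes_during_load):
    """Full load, then replay of every change logged since the load began."""
    start_position = len(source.log)     # remember the log position first
    keys = sorted(source.rows)           # fixed list of keys to copy
    target = {}
    for i, key in enumerate(keys):
        if key in source.rows:           # the row may have been deleted meanwhile
            target[key] = source.rows[key]
        if i < len(writes_during_load):  # the source keeps taking writes
            writes_during_load[i](source)
    # Cached-changes phase: replay everything logged after start_position.
    apply_changes(target, source.log[start_position:])
    return target


src = Source()
for k in (1, 2, 3):
    src.write(k, f"v{k}")

tgt = migrate(src, [
    lambda s: s.write(1, "v1-new"),  # update a row that was already copied
    lambda s: s.delete(3),           # delete a row that was not yet copied
    lambda s: s.write(4, "v4"),      # insert a row the copy never saw
])
```

After the replay, `tgt` holds the same rows as `src.rows`, even though the source changed during the copy. The replay is idempotent (an upsert or delete applied twice gives the same result), which is why re-applying a change the copy already picked up does no harm.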

Homogeneous and Heterogeneous — Two Branches of Migration

DMS migrations split into two broad branches. This distinction determines the difficulty of the work and the tools you will need.

- Homogeneous migration: source and target run the same engine. For example, moving on-premises PostgreSQL to RDS PostgreSQL, or self-managed MySQL to Aurora MySQL. Since the schema structure is identical, almost no conversion is needed.

- Heterogeneous migration: source and target engines differ. Moving from Oracle to PostgreSQL, or from SQL Server to Aurora. Because data types, functions, stored procedures, and sequences differ per engine, you need a schema conversion tool.

In a homogeneous migration, DMS alone can handle both data and changes. That said, by default DMS creates only the tables and primary keys it needs and then moves the data; it does not faithfully carry over secondary objects such as indexes, foreign keys, triggers, or sequences. So in practice it is common to move the schema first with native tools and move only the data with DMS.

In a heterogeneous migration, the AWS Schema Conversion Tool (SCT) or DMS Schema Conversion enters the picture.

SCT Integration — Schema Conversion

The first step of a heterogeneous migration is schema conversion. This is the work of turning Oracle NUMBER into PostgreSQL numeric, PL/SQL procedures into PL/pgSQL, and sequences and triggers into target syntax. Doing this by hand is endless, so AWS provides the Schema Conversion Tool.

The SCT workflow looks roughly like this.

```text
+---------------------+      +------------------------+      +----------------------+
| source schema       |      | SCT / DMS schema conv  |      | target schema        |
| (Oracle, etc.)      | ---> |                        | ---> | (PostgreSQL, etc.)   |
|                     |      | - automatic conversion |      |                      |
| tables / types /    |      | - report of failures   |      | converted DDL +      |
| procedures / trigs  |      | - difficulty rating    |      | manual-fix list      |
+---------------------+      +------------------------+      +----------------------+
```

The real value of SCT lies less in the automatic conversion itself than in the report it produces of items that cannot be converted. The parts that convert automatically need no attention, but engine-specific features (Oracle packages, autonomous transactions, certain hints) must be redesigned by a person. SCT assigns a difficulty rating to such items so you can estimate the migration effort.

Once schema conversion is done and the objects exist in the target, DMS takes over the actual data movement. In other words, SCT builds the container, and DMS fills it with content.

Hands-on Configuration — Endpoints and Tasks

Now let us get into real configuration. The skeleton for creating a source endpoint, a target endpoint, and a replication task via the CLI looks like this.

```shell
# 1. Create a source endpoint (Oracle example)
aws dms create-endpoint \
  --endpoint-identifier source-oracle \
  --endpoint-type source \
  --engine-name oracle \
  --server-name oracle.internal.example.com \
  --port 1521 \
  --database-name ORCL \
  --username dms_user \
  --password "REPLACE_WITH_SECRET"

# 2. Create a target endpoint (PostgreSQL example)
aws dms create-endpoint \
  --endpoint-identifier target-postgres \
  --endpoint-type target \
  --engine-name postgres \
  --server-name mydb.cluster-abc.ap-northeast-2.rds.amazonaws.com \
  --port 5432 \
  --database-name appdb \
  --username dms_user \
  --password "REPLACE_WITH_SECRET"

# 3. Test the connection
aws dms test-connection \
  --replication-instance-arn arn:aws:dms:ap-northeast-2:111122223333:rep:ABCDEF \
  --endpoint-arn arn:aws:dms:ap-northeast-2:111122223333:endpoint:SOURCE
```

Table mapping is the JSON that defines which schemas and tables to move and how to transform them. It is built from selection rules and transformation rules.

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-app-schema",
      "object-locator": {
        "schema-name": "APP",
        "table-name": "%"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "schema-to-lowercase",
      "rule-target": "schema",
      "object-locator": {
        "schema-name": "APP"
      },
      "rule-action": "convert-lowercase"
    },
    {
      "rule-type": "transformation",
      "rule-id": "3",
      "rule-name": "table-to-lowercase",
      "rule-target": "table",
      "object-locator": {
        "schema-name": "APP",
        "table-name": "%"
      },
      "rule-action": "convert-lowercase"
    }
  ]
}
```

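Oracle folds unquoted identifiers to uppercase while PostgreSQL folds them to lowercase. Without a rule like the one above, the migrated objects keep their uppercase names and must be double-quoted in every PostgreSQL query. Converting schema and table names to lowercase is therefore nearly mandatory in a heterogeneous migration.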
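When several schemas need the same treatment, writing the mapping by hand gets repetitive. The sketch below generates the same three rules per schema with unique rule IDs. The function name `build_table_mappings` and the rule naming scheme are illustrative choices, not part of DMS; only the rule keys and values already shown in the JSON above are used.

```python
import json


def build_table_mappings(schemas, table_pattern="%"):
    """Return table-mapping JSON that includes each schema and lowercases names."""
    rules = []

    def add(rule):
        # Rule IDs must be unique within one mapping document.
        rule["rule-id"] = str(len(rules) + 1)
        rules.append(rule)

    for schema in schemas:
        tag = schema.lower()
        add({
            "rule-type": "selection",
            "rule-name": f"include-{tag}",
            "object-locator": {"schema-name": schema, "table-name": table_pattern},
            "rule-action": "include",
        })
        add({
            "rule-type": "transformation",
            "rule-name": f"{tag}-schema-to-lowercase",
            "rule-target": "schema",
            "object-locator": {"schema-name": schema},
            "rule-action": "convert-lowercase",
        })
        add({
            "rule-type": "transformation",
            "rule-name": f"{tag}-tables-to-lowercase",
            "rule-target": "table",
            "object-locator": {"schema-name": schema, "table-name": table_pattern},
            "rule-action": "convert-lowercase",
        })
    return json.dumps({"rules": rules}, indent=2)


mapping_json = build_table_mappings(["APP", "BILLING"])
```

The resulting string can be saved to a file and passed to the task as its table mappings. Generating it from a list keeps the rule IDs consistent when schemas are added later.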

Task Settings — The Essentials

The fine-grained behavior of a replication task is controlled by the task settings JSON. There are many entries, but the essentials you actually touch in practice come down to the following.

```json
{
  "TargetMetadata": {
    "SupportLobs": true,
    "FullLobMode": false,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 64,
    "BatchApplyEnabled": true
  },
  "FullLoadSettings": {
    "TargetTablePrepMode": "DROP_AND_CREATE",
    "MaxFullLoadSubTasks": 8,
    "CommitRate": 10000
  },
  "Logging": {
    "EnableLogging": true,
    "LogComponents": [
      { "Id": "SOURCE_CAPTURE", "Severity": "LOGGER_SEVERITY_DEFAULT" },
      { "Id": "TARGET_APPLY", "Severity": "LOGGER_SEVERITY_DEFAULT" }
    ]
  },
  "ValidationSettings": {
    "EnableValidation": true,
    "ThreadCount": 5
  },
  "ErrorBehavior": {
    "DataErrorPolicy": "LOG_ERROR",
    "TableErrorPolicy": "SUSPEND_TABLE"
  }
}
```

A few of these entries deserve explanation. Setting TargetTablePrepMode to DROP_AND_CREATE drops and recreates the target table before the full load. If you already built the schema with SCT, TRUNCATE_BEFORE_LOAD or DO_NOTHING is safer, because dropping the table would discard the converted definition. LobMaxSize is expressed in kilobytes, so the 64 above means 64 KB. MaxFullLoadSubTasks is the number of tables loaded concurrently, tuned to the headroom on source and target. Turning on BatchApplyEnabled applies CDC changes in batches rather than one by one, which greatly raises throughput.

LOB Handling — The Trap of Large Objects

The topic that trips people up most often in DMS is LOB (Large Object) handling, that is, large objects such as BLOB and CLOB. DMS handles LOBs in three modes.

| Mode | Behavior | Pros | Cons |
| --- | --- | --- | --- |
| Full LOB Mode | Moves every LOB with no size limit | No data loss | Slow, heavy memory pressure |
| Limited LOB Mode | Moves only up to a set max size | Fast, predictable | Anything over the limit is truncated |
| Inline LOB Mode | Small LOBs inline, large ones handled separately | Good balance | Somewhat complex to configure |

The most common choice in practice is Limited LOB Mode. Set LobMaxSize large enough, but you must be aware that LOBs exceeding that limit may be truncated. Set the limit too small and data is silently cut off; set it too large and performance plummets.

Tables with LOBs carry one more constraint. For DMS to move a LOB, the table must have a primary or unique key. Without a key, the LOB column is loaded as NULL or the whole table is skipped. So checking whether LOB tables have keys before migration is important.

```text
LOB handling decision flow

Does the table have a LOB column?
    |
    +-- no  ----> Limited/Full setting irrelevant, proceed quickly
    |
   yes
    |
Does it have a primary key?
    |
    +-- no  ----> Add a key first or handle separately (risk of loss if left as is)
    |
   yes
    |
Do you know the max LOB size?
    |
    +-- yes ----> Limited LOB Mode + LobMaxSize with headroom
    |
    +-- no  ----> Full LOB Mode (slow but safe) or consider Inline
```
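The decision flow above can be written down as a small helper that turns what you know about a table into LOB settings. This is a sketch under assumptions: `choose_lob_settings` is a hypothetical name, the 1.5x headroom factor is an arbitrary illustration rather than a recommendation, and the only DMS-specific detail relied on is that LobMaxSize is given in kilobytes.

```python
import math


def choose_lob_settings(has_lob, has_primary_key, max_lob_bytes=None, headroom=1.5):
    """Map the decision flow to TargetMetadata-style LOB settings.

    max_lob_bytes is the largest LOB measured on the source, or None if unknown.
    Returns a dict of settings; raises ValueError if the table needs a key first.
    """
    if not has_lob:
        return {}  # LOB settings are irrelevant for this table
    if not has_primary_key:
        raise ValueError("add a primary key or handle this table separately")
    if max_lob_bytes is None:
        # Size unknown: fall back to the slow but safe mode.
        return {"SupportLobs": True, "FullLobMode": True,
                "LimitedSizeLobMode": False}
    # LobMaxSize is in KB; add headroom, then round up to a whole KB.
    lob_max_kb = math.ceil(max_lob_bytes * headroom / 1024)
    return {"SupportLobs": True, "FullLobMode": False,
            "LimitedSizeLobMode": True, "LobMaxSize": lob_max_kb}


limited = choose_lob_settings(True, True, max_lob_bytes=40_000)
unknown = choose_lob_settings(True, True)
```

Raising an error for keyless LOB tables, instead of silently returning settings, mirrors the point of the flow: that case needs a human decision before the task is created.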

Data Validation — The Real Start Comes After the Move

Moving the data is not the end. You must verify that the source and target data truly match. DMS has a built-in data validation feature for this: when you enable validation in the task settings, it compares source and target row by row.

Validation targets rows that have completed full load and CDC, hashing or directly comparing column values to find mismatches. Results show up in the task statistics and a dedicated validation table.

```shell
# Check per-table statistics and validation state for a task
aws dms describe-table-statistics \
  --replication-task-arn arn:aws:dms:ap-northeast-2:111122223333:task:MYTASK \
  --query "TableStatistics[].{Table:TableName,State:TableState,Validation:ValidationState,Failed:ValidationFailedRecords}"
```

A validation state of "Validated" means a match; "Mismatched records" means a discrepancy. Common causes of mismatches are timezone handling, floating-point precision differences, character encoding, and the LOB truncation mentioned earlier. Enabling validation adds replication load, so when the operational impact is large, teams adjust the validation thread count or split it into a separate validation task.
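To see why causes like floating-point precision produce mismatches, it helps to look at what a row-by-row comparison does in principle. The following is a simplified illustration, not the algorithm DMS uses: `row_digest` and `compare_tables` are hypothetical helpers, and rows are modeled as plain dicts keyed by primary key.

```python
import hashlib


def row_digest(row):
    """Hash a row's column values in a fixed (sorted) column order."""
    canonical = "|".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_tables(source_rows, target_rows):
    """Compare two {primary_key: row_dict} maps and report discrepancies."""
    report = {"missing_in_target": [], "extra_in_target": [], "mismatched": []}
    for key, row in source_rows.items():
        if key not in target_rows:
            report["missing_in_target"].append(key)
        elif row_digest(row) != row_digest(target_rows[key]):
            report["mismatched"].append(key)
    report["extra_in_target"] = [k for k in target_rows if k not in source_rows]
    return report


source = {
    1: {"name": "alice", "amount": 0.1 + 0.2},  # stored with binary float error
    2: {"name": "bob", "amount": 5.0},
}
target = {
    1: {"name": "alice", "amount": 0.3},        # rounded somewhere on the way
}
report = compare_tables(source, target)
```

Here row 1 is reported as mismatched because `0.1 + 0.2` and `0.3` are different floating-point values, and row 2 is reported as missing. A comparison like this needs a key to pair rows up, which is one more reason keyless tables are hard to validate.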

Validation is the safety net that bridges the gap between "moved" and "moved correctly." The scariest thing in a migration is not failure but silent data corruption, so skipping validation is not advisable.

Monitoring and Restart — The Reality of Operations

A DMS task is not something you launch and forget; it needs steady watching. The key metrics are exposed through CloudWatch.

| Metric | Meaning | When to pay attention |
| --- | --- | --- |
| CDCLatencySource | Lag reading changes from the source | A growing value means a source log read bottleneck |
| CDCLatencyTarget | Lag applying changes to the target | A growing value means a target write bottleneck |
| FullLoadThroughputRowsTarget | Full load rows per second | Check full load speed |
| FreeableMemory | Available memory on the replication instance | Scale up if it runs short |
| CDCIncomingChanges | Number of pending changes | If it keeps piling up, apply cannot keep up |

It is important to read the two latency metrics separately. High source latency means the step that reads the source transaction log is slow; high target latency means the step that applies to the target is slow. Different causes call for different remedies. Target latency can be eased by turning on BatchApplyEnabled or temporarily disabling target indexes during migration.
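The reading of the two metrics can be captured in a small triage function, for example inside an alarm handler. This is a minimal sketch: `diagnose_cdc_lag` is a hypothetical name, the 60-second threshold is an arbitrary placeholder to tune for your workload, and the inputs are assumed to be the latest CDCLatencySource and CDCLatencyTarget values in seconds.

```python
def diagnose_cdc_lag(latency_source_s, latency_target_s, threshold_s=60):
    """Classify where CDC lag originates from the two latency metrics."""
    if latency_source_s >= threshold_s:
        # Reading the source log is already behind, so look at the source
        # first, whatever the target value says.
        return "source-capture"
    if latency_target_s >= threshold_s:
        # Capture keeps up but apply does not: a target write bottleneck.
        return "target-apply"
    return "healthy"


status = diagnose_cdc_lag(latency_source_s=3, latency_target_s=240)
```

Checking the source side first is a deliberate choice: when capture itself is behind, the target lag is at least partly a downstream symptom, so tuning the target alone would not help.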

You also need a restart strategy ready for when a task stops. A DMS task can be restarted from the beginning or resumed from where it stopped.

```shell
# Resume from the stopping point (preserving CDC position)
aws dms start-replication-task \
  --replication-task-arn arn:aws:dms:ap-northeast-2:111122223333:task:MYTASK \
  --start-replication-task-type resume-processing

# Start CDC from a specific point (checkpoint based)
aws dms start-replication-task \
  --replication-task-arn arn:aws:dms:ap-northeast-2:111122223333:task:MYTASK \
  --start-replication-task-type start-replication \
  --cdc-start-position "checkpoint:V1#34#..."
```

Resume continues from the last checkpoint and does not move data again. By contrast, rerunning a full load from scratch can overwrite target changes that piled up in the meantime, so you must understand the meaning of the restart type before choosing it.

Cut-over Strategy — The Moment of Switching

The climax of a migration is the cut-over, the moment you switch the application from source to target. If full-load-and-cdc has the two databases nearly in real-time sync, the cut-over can finish within a short window.

The recommended cut-over procedure is as follows.

cut-over procedure

1. Confirm CDC lag is near zero (monitor CDCLatencyTarget)

2. Briefly block writes coming into the source (read-only mode or traffic halt)

3. Wait until all remaining changes are applied to the target (pending changes 0)

4. Final data validation check (row counts, validation state)

5. Correct sequence/auto-increment values (DMS does not move sequence current values)

6. Switch the application connection string to the target

7. Confirm normal operation on the target, then open up traffic

8. Keep the source for a period so you can roll back if there is a problem

I want to emphasize step 5, which people often skip. DMS moves a table's row data but does not move the "next value" state of sequences or auto-increment columns. Cutting over without correcting this leads to primary key collisions when inserting new rows on the target. Just before cut-over, querying the current maximum value of each sequence and resetting the target sequence above it is essential.
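For a PostgreSQL target, the correction in step 5 usually comes down to one `setval` call per sequence. The sketch below only generates the SQL text; it assumes you have already queried the current maximum for each sequence from the source, and the sequence names, the `sequence_fix_statements` helper, and the safety gap of 1000 are all illustrative.

```python
def sequence_fix_statements(max_values, safety_gap=1000):
    """Build PostgreSQL setval() statements from {sequence_name: current_max}.

    Names are interpolated directly into SQL, so pass only trusted
    identifiers taken from your own catalog queries.
    """
    statements = []
    for seq, current_max in sorted(max_values.items()):
        new_value = current_max + safety_gap
        # is_called = true: the next nextval() returns a value after new_value.
        statements.append(f"SELECT setval('{seq}', {new_value}, true);")
    return statements


stmts = sequence_fix_statements({
    "app.orders_id_seq": 41500,   # hypothetical sequence names and maxima
    "app.customers_id_seq": 980,
})
```

The safety gap leaves room for any rows that slip in between measuring the maximum and blocking writes. Skipping a range of IDs is harmless, while reusing one is a primary key collision.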

Keeping a rollback path open, as in step 8, also matters. If an unexpected problem arises after cut-over, you need to be able to revert quickly to the source, so do not discard the source immediately; keep it for a period. More conservative teams configure reverse-direction CDC, replicating from target back to source, in advance.

Limits and Cost — Things to Know Before Using It

DMS is powerful but not all-powerful. There are limits you should go in aware of.

First, DMS moves data, not the entire schema. It will create base tables and primary keys, but secondary indexes, foreign keys, triggers, stored procedures, views, and sequences are not moved by default. These must be handled separately with SCT or native tools.

Second, foreign keys and triggers get in the way during full load. So you typically disable the target foreign keys and triggers during the full load and re-enable them afterward. Otherwise, the table load order causes constraint violations.

Third, CDC depends on the source transaction log. If you have not enabled the change logging each engine requires (Oracle supplemental logging, MySQL binlog in ROW format, PostgreSQL logical replication settings), CDC will not work. Missing this prerequisite leads to the baffling situation where full load works but CDC does not.

Cost has two main axes: the hourly charge for the replication instance and the data transfer charge. The replication instance scales with instance class and running time, so once the migration is done, remember to clean up the task and the instance. Transfer within the same region is cheap, but cross-region or internet-bound transfer adds cost. Roughly expressed, cost breaks down like this.

```text
Main components of DMS cost

replication instance charge = instance class rate x running time
storage charge              = allocated log/cache storage x GB-month
data transfer charge        = cross-region/internet transfer x per-GB rate
(transfer within the same AZ/region is generally free or cheap)
```
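The breakdown above is simple arithmetic, and plugging in numbers shows how quickly an idle instance dominates the bill. In this sketch every rate is a made-up placeholder, not an AWS price; look up current pricing for your region and instance class before relying on any result. The function name `estimate_dms_cost` is likewise hypothetical.

```python
def estimate_dms_cost(instance_rate_per_hour, hours,
                      storage_gb, storage_rate_per_gb_month, months,
                      transfer_gb, transfer_rate_per_gb):
    """Sum the three cost components; all rates are caller-supplied."""
    instance = instance_rate_per_hour * hours
    storage = storage_gb * storage_rate_per_gb_month * months
    transfer = transfer_gb * transfer_rate_per_gb
    return {
        "instance": round(instance, 2),
        "storage": round(storage, 2),
        "transfer": round(transfer, 2),
        "total": round(instance + storage + transfer, 2),
    }


# Placeholder rates: one month (720 h) of an instance left running.
cost = estimate_dms_cost(
    instance_rate_per_hour=0.5, hours=720,
    storage_gb=100, storage_rate_per_gb_month=0.125, months=1,
    transfer_gb=2000, transfer_rate_per_gb=0.25,
)
```

With these placeholder inputs the instance accounts for 360.0 of the 872.5 total, and it keeps accruing every hour after the migration is done, whereas the transfer charge stops once the data stops moving.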

The most common waste is leaving the replication instance running after the migration is done. Unless you intend to keep using CDC for permanent replication, it is wise to put cleanup of resources right after cut-over validation into your checklist.

Alternative Comparison — Native Logical Replication

DMS is not the only answer. For homogeneous migration, especially moving between PostgreSQL instances, the engine's built-in logical replication is often simpler and faster. The two approaches compare like this.

| Criterion | AWS DMS | Native logical replication |
| --- | --- | --- |
| Heterogeneous support | Strong (cross-engine conversion) | Nearly impossible (same engine) |
| Homogeneous performance | Good | Generally faster and more faithful |
| Schema conversion | Integrates with SCT | Handle yourself |
| Operational burden | Managed, console integrated | Manual setup and monitoring |
| Data validation | Built-in validation | Needs a separate tool |
| Transform rules | Rich mapping rules | Limited |
| Cost | Instance charge applies | Engine feature, little extra cost |
| Best fit | Heterogeneous, complex transforms | Homogeneous, simple replication |

Roughly summarized: for a heterogeneous migration from Oracle or SQL Server to PostgreSQL, the combination of DMS and SCT is effectively the standard. For a homogeneous migration from PostgreSQL to PostgreSQL or MySQL to MySQL, native replication is often more faithful and cheaper, because it works with the engine's own data types and needs no intermediate conversion. MySQL binlog replication also carries DDL along with the data. PostgreSQL publication and subscription, however, replicates neither DDL nor sequence values, so the schema must still be created on the target first and the sequence correction at cut-over still applies.

In practice people mix the two. For example, move the schema and secondary objects with a native dump, and handle only the zero-downtime movement of large data with DMS. Tools are meant to be combined to fit the goal; there is no need to solve everything with one.

Common Pitfalls

Finally, here is a collection of pitfalls encountered again and again in the field.

First, forgetting supplemental logging. For CDC to work you must enable appropriate logging on the source, and discovering this on migration day throws off the schedule. Oracle needs supplemental logging, MySQL needs binlog ROW format, and PostgreSQL needs wal_level set to logical.

Second, overlooking tables without primary keys. Without a primary key, CDC UPDATE and DELETE do not work properly, and LOBs may be dropped. List the keyless tables before migration and decide on a remedy.

Third, setting the LOB limit wrong. In Limited LOB Mode, setting LobMaxSize too small silently truncates data. Investigate the actual maximum LOB size on the source in advance.

Fourth, not disabling foreign keys and triggers. If the target constraints and triggers stay alive during full load, you get failures from load order or unintended side effects.

Fifth, forgetting sequence correction. It is the classic cause of primary key collisions after cut-over.

Sixth, not cleaning up resources. Leaving finished tasks and replication instances around often leaks cost.

Practical Checklist

Here is a checklist so you can verify the migration step by step.

[ Preparation ]

[ ] Confirm source/target engine versions and DMS support

[ ] Enable supplemental logging on source (supplemental/binlog/wal_level)

[ ] List keyless tables and decide on a remedy

[ ] Investigate the actual max size of LOB columns

[ ] Create a dedicated DMS account with least privilege

[ ] Check network path and security groups/firewall

[ Schema preparation ]

[ ] (heterogeneous) Convert schema with SCT and handle unconvertible items

[ ] Create tables/types on the target

[ ] Plan to disable foreign keys/triggers for full load

[ Task configuration ]

[ ] Decide replication instance size and multi-AZ

[ ] Write table mappings (selection/transformation rules)

[ ] Set LOB mode and validation in task settings

[ ] Configure logging and CloudWatch alarms

[ Run and validate ]

[ ] Confirm the connection test passes

[ ] Monitor full load progress and throughput

[ ] Monitor CDC latency (source/target)

[ ] Check data validation state and investigate mismatches

[ Cut-over ]

[ ] Confirm CDC lag is near zero

[ ] Block source writes and wait for remaining changes to apply

[ ] Correct sequence/auto-increment values

[ ] Re-enable foreign keys/triggers

[ ] Switch the connection string and confirm normal operation

[ ] Keep a rollback path (preserve the source)

[ Wrap-up ]

[ ] Stop and delete the replication task

[ ] Delete the replication instance

[ ] Post-migration monitoring and performance check

Closing

The value of DMS lies not in flashy technology but in a precise problem definition. The essence of zero-downtime migration is "how do you catch up on the changes that occur while moving the initial data," and DMS solves this cleanly by bundling full load and CDC into one tool. Add SCT for heterogeneous conversion, data validation for integrity, and monitoring and restart for operations, and it supports the entire migration process.

That said, DMS is not magic. If you do not attend to details like supplemental logging, primary keys, LOB limits, and sequence correction, you run into the scariest outcome of all: silent data corruption. And it is worth remembering that for a homogeneous migration, native logical replication may be the better choice. Know the tool's strengths and limits precisely, and attend to the details with a checklist. In the end, a safe migration is the result of that diligence.

References

- AWS DMS official documentation: https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html

- AWS DMS product page: https://aws.amazon.com/dms/

- DMS task settings reference: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.html

- DMS data validation: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Validating.html

- DMS LOB support: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.LOBSupport.html

- AWS Schema Conversion Tool docs: https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP_Welcome.html

- PostgreSQL logical replication docs: https://www.postgresql.org/docs/current/logical-replication.html

- MySQL replication docs: https://dev.mysql.com/doc/refman/8.0/en/replication.html

- AWS Database blog: https://aws.amazon.com/blogs/database/

- DMS CloudWatch monitoring: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Monitoring.html
