Skip to content

필사 모드: An Introduction to Database Migration Strategy — Schema, Data, and Zero Downtime

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

Database migrations are something almost every service eventually has to deal with. To add a new feature, you add a column to a table, create an index, and sometimes move billions of rows into a different shape. When a code deployment goes wrong, you can simply roll back to the previous version. Databases are not that simple. A DDL statement, once executed, is hard to undo, and data that has been moved incorrectly may be lost forever.

In one service I operated, an attempt to add an index to a large table during peak traffic locked the entire table and caused a 30-minute outage. In another case, a migration that changed a column type was applied to production without verification, and some data ended up truncated. All of these incidents stemmed from treating migrations lightly, as if they were just a matter of running some SQL.

In this article, I want to treat database migrations as an engineering discipline in their own right, and walk through migration types, core principles, versioned migrations, zero-downtime strategies, transactional DDL, backups and dry runs, environment promotion, team process, and an incident-prevention checklist. I will focus on principles that apply anywhere, rather than tying everything to a specific tool.

Types of Migrations

The word migration covers a very broad range. Because the nature of the risk and the way you respond change completely depending on what you are moving, it is important to distinguish the types first.

Schema Migrations

A schema migration changes the structure of the database. Creating or dropping tables, adding or removing columns, and altering indexes or constraints all belong here. They are expressed with DDL (Data Definition Language) statements and are the first area most migration tools address.

-- Add a column (schema change)

ALTER TABLE users ADD COLUMN last_login_at TIMESTAMPTZ;

-- Add an index (schema change)

CREATE INDEX idx_users_email ON users (email);

-- Add a constraint (schema change)

ALTER TABLE orders ADD CONSTRAINT chk_amount_positive CHECK (amount > 0);

Data Migrations

A data migration moves or transforms the data itself rather than the structure. Filling a new column with default values, normalizing data from one table into several tables, or converting encodings all fall into this category. They are expressed with DML (Data Manipulation Language), and performance and lock management become crucial when handling large numbers of rows.

-- A data migration that fills a new column

UPDATE users SET display_name = username WHERE display_name IS NULL;

-- A migration that moves data into another table

INSERT INTO user_profiles (user_id, bio)

SELECT id, bio FROM users WHERE bio IS NOT NULL;

Engine Migrations

An engine migration upgrades the version of the same database product. Going from PostgreSQL 14 to 16, or MySQL 5.7 to 8.0, are typical examples. You must account for SQL compatibility, changes to default settings, and shifts in performance characteristics.

Platform Migrations

A platform migration is the largest kind of change, where you replace the database product itself or move from on-premises to the cloud. Examples include migrating from Oracle to PostgreSQL, or moving a self-hosted MySQL to Amazon RDS. These usually require a dedicated tool such as AWS DMS and a long period of parallel operation.

Comparison by Type

| Type | Target | Common tools | Risk | Reversibility |

| --- | --- | --- | --- | --- |

| Schema | Tables, columns, indexes | Flyway, Liquibase, migrate | Medium | Relatively easy |

| Data | Rows, values, formats | Batch scripts, ETL | High | Difficult |

| Engine | DBMS version | pg_upgrade, mysql_upgrade | High | Very difficult |

| Platform | DBMS product, infra | AWS DMS, dedicated migrators | Very high | Nearly impossible |

Core Principles: Reversible, Small, and Verified

There are three core principles for handling migrations safely. These hold true regardless of the tool you use.

Reversible

Whenever possible, every migration should be designed to be reversible. If you add a column, prepare the reverse script that removes it; if you create an index, prepare the script that drops it. That said, recognize that operations which delete data or convert types with loss are inherently irreversible, and treat such operations with even more care.

Small

A single migration should be broken into the smallest possible unit. A giant migration that changes ten tables at once is hard to diagnose when it fails and tricky to recover from when it fails partway. Small migrations are easy to review, and even when something goes wrong, the blast radius stays narrow.

Verified

You must verify a migration before applying it to production. Run it against data that is close to real data in development and staging environments, and measure the execution time and lock impact. The assumption "it ran fine locally, so it must be fine" is the most dangerous one.

The three pillars of a safe migration

Reversible Small Verified

| | |

reverse ready split into units run on staging

| | |

+----------------+-----------------+

|

safe deployment

Versioned Migrations

The heart of modern migrations is managing every change as a versioned file. Connecting directly to the database console and running SQL by hand is untraceable and hard to reproduce. Instead, keep migrations in version control alongside your code.

File Naming Conventions

Most tools follow a naming convention that uses a timestamp or sequence number as a prefix. This makes the execution order of migrations unambiguous.

migrations/

V20260601120000__create_users_table.sql

V20260602093000__add_email_index.sql

V20260603140000__add_last_login_column.sql

V20260604101500__backfill_display_name.sql

Tools like golang-migrate manage up and down files as pairs.

migrations/

000001_create_users_table.up.sql

000001_create_users_table.down.sql

000002_add_email_index.up.sql

000002_add_email_index.down.sql

Example Migration Files

Each migration file holds only one logical change. Below is an example of an up file that creates a users table and a down file that reverses it.

-- 000001_create_users_table.up.sql

CREATE TABLE users (

id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,

email VARCHAR(255) NOT NULL UNIQUE,

username VARCHAR(100) NOT NULL,

created_at TIMESTAMPTZ NOT NULL DEFAULT now(),

updated_at TIMESTAMPTZ NOT NULL DEFAULT now()

);

CREATE INDEX idx_users_username ON users (username);

-- 000001_create_users_table.down.sql

DROP TABLE IF EXISTS users;

The Migration History Table

To track which versions have been applied, migration tools keep a dedicated history table inside the database. Flyway uses a table called flyway_schema_history, and Liquibase uses one called databasechangelog. Thanks to this table, the tool can run only the migrations that have not yet been applied.

flyway_schema_history (conceptual)

installed_rank | version | description | success

---------------+---------+----------------------+--------

1 | 1 | create users table | true

2 | 2 | add email index | true

3 | 3 | add last login col | true

Forward-Only vs Rollback

There are broadly two philosophies for reversing migrations.

The Rollback Approach

The rollback approach prepares a reverse script (down) for every migration, and runs it to return to the previous state when something goes wrong. It is intuitive, but it has a trap. A migration that deletes or transforms data often cannot be perfectly restored by a down script, because, for example, dropping a column and adding it back will not bring back the data it contained.

The Forward-Only Approach

The forward-only approach never goes backward. When something goes wrong, you add a new migration that fixes it. If you created a bad column, instead of reverting it you write a new migration that cleans it up. Many teams prefer forward-only in large production environments because the safety of down scripts is hard to guarantee.

| Aspect | Rollback approach | Forward-only approach |

| --- | --- | --- |

| How to undo | Run the down script | Add a new migration |

| Data-loss risk | down may be incomplete | Handled explicitly |

| Operational complexity | Looks simple but has traps | Consistent and predictable |

| Recommended for | Small, early-stage | Large production |

Transactional DDL

Transactional DDL is the ability to run DDL statements inside a transaction, guaranteeing that if the migration fails partway, all changes are rolled back together. PostgreSQL can handle most DDL inside a transaction, which makes it very strong in this respect. MySQL, on the other hand, triggers an implicit commit for many DDL statements, so you cannot expect transactional DDL there.

Transactional DDL in PostgreSQL

If you wrap several DDL statements in a single transaction, then if even one of them fails midway, the entire set is cleanly reverted.

BEGIN;

ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';

ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;

CREATE INDEX idx_orders_status ON orders (status);

-- If an error occurs here, all three changes above are rolled back.

COMMIT;

Operations to Avoid Inside a Transaction

There is a caveat. In PostgreSQL, CREATE INDEX CONCURRENTLY cannot run inside a transaction block. This command is meant to build an index without locking the table, but inside a transaction its lock-avoidance mechanism cannot operate. Therefore, when adding an index to a large table without downtime, you must run it separately, outside any transaction.

-- Run on its own, outside a transaction (zero-downtime index build)

CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

Zero-Downtime Migration Strategies

Changing the schema without stopping the service is the hardest part of migrations. The core idea is to split a risky change into multiple steps so that, at each step, both the old code and the new code keep working correctly.

The Expand and Contract Pattern

The most widely used zero-downtime pattern is Expand and Contract. Even something as simple-looking as renaming a column requires several steps to do without downtime.

Renaming a column (username -> handle) without downtime

Step 1 Expand : add the new column handle (keep username)

Step 2 Backfill : copy username values into handle

Step 3 Dual-write : the application writes to both columns

Step 4 Read swap : deploy the app to read from handle

Step 5 Contract : after stabilizing, drop the username column

The key to this pattern is that at no single step do the old and new versions of the application become incompatible. Even when the two versions coexist during a rolling deployment, data consistency is never broken.

Risky Operations and Safe Alternatives

| Risky operation | Problem | Safe alternative |

| --- | --- | --- |

| Add a NOT NULL column instantly | Full table rewrite, locks | Add as nullable, backfill, add constraint later |

| Plain CREATE INDEX | Table write lock | Use CREATE INDEX CONCURRENTLY |

| Change column type instantly | Full rewrite, long lock | Add a new column, backfill gradually |

| One huge UPDATE | Long transaction, lock contention | Process in batches |

Adding a NOT NULL Constraint Safely

Adding a NOT NULL column to a large table all at once can lock the whole table. Splitting it into steps as below is safe.

-- Step 1: add as a nullable column (fast, short lock)

ALTER TABLE users ADD COLUMN phone VARCHAR(20);

-- Step 2: backfill existing rows in batches

UPDATE users SET phone = '' WHERE phone IS NULL AND id BETWEEN 1 AND 100000;

-- (repeat, shifting the range)

-- Step 3: add a verifiable NOT NULL constraint as NOT VALID first

ALTER TABLE users ADD CONSTRAINT users_phone_not_null

CHECK (phone IS NOT NULL) NOT VALID;

-- Step 4: validate separately (without a full table lock)

ALTER TABLE users VALIDATE CONSTRAINT users_phone_not_null;

Splitting a Bulk UPDATE into Batches

Processing millions of rows in a single UPDATE creates a long transaction, causing lock contention and replication lag. Processing in small batches is safer.

-- Example of repeating in batch units

UPDATE orders

SET status = 'archived'

WHERE id IN (

SELECT id FROM orders

WHERE status = 'old' AND archived = false

LIMIT 5000

);

Backups and Dry Runs

For any migration, two things must be in place before applying it to production: a backup and a dry run.

Backups

Before a high-risk migration, always secure a backup. Understand the characteristics of logical and physical backups and choose according to the situation. Above all, the most important thing is to verify in advance that the backup is actually restorable. A backup you have never restored is not a backup.

PostgreSQL logical backup (dump a specific database)

pg_dump --format=custom --file=backup_before_migration.dump mydb

Verify restore (restore into a separate temporary database)

pg_restore --dbname=mydb_restore_test backup_before_migration.dump

MySQL logical backup

mysqldump --single-transaction --routines --triggers mydb > backup_before_migration.sql

Dry Runs

A dry run is the procedure of confirming what a migration will do in advance, without actually applying the change. Many tools provide this feature. Running it for real in a staging environment with data at a scale similar to production, while measuring elapsed time and lock impact, is also a dry run in the broad sense.

Liquibase: print SQL without actually running it (dry run)

liquibase update-sql

Flyway: check the list of migrations to be applied

flyway info

golang-migrate: check the current version

migrate -path ./migrations -database "$DATABASE_URL" version

Checking the Execution Plan

For large data migrations, you can check the execution plan in advance with EXPLAIN to catch unintended full scans or inefficiencies.

EXPLAIN ANALYZE

UPDATE orders SET status = 'archived'

WHERE created_at < now() - INTERVAL '1 year';

Environment Promotion: from dev through stg to prod

A migration should never be applied directly to production. It is verified by passing through development (dev), staging (stg), and production (prod) in turn. The same migration file must be applied identically across all environments to prevent schema drift between them.

Environment promotion pipeline

[dev] ---- verify ----> [stg] ---- verify ----> [prod]

| | |

developer local production-like live service

fast iteration data rehearsal careful apply

| | |

write schema measure run time deploy after backup

check lock impact monitor

The Role of Each Environment

| Environment | Purpose | Data | Key activities |

| --- | --- | --- | --- |

| dev | Writing migrations, fast iteration | Small seed data | Schema design, unit verification |

| stg | Production rehearsal | Production-like scale | Measure run time, lock impact |

| prod | Real application | Real data | Backup, deploy, monitor |

A staging environment is only meaningful if it has a data scale as close to production as possible. A migration that ran fine on a staging environment with only a thousand rows can take hours on a production system with hundreds of millions of rows.

Team Process

Migrations are the work of the team, not of one person. A safe migration culture requires a clear process.

Migrations Are Subject to Code Review

Every migration file goes through a pull request and code review just like application code. Reviewers check lock impact, reversibility, backfill strategy, and whether batching is used.

Verification in the CI Pipeline

It is good to automatically apply migrations to a temporary database in the CI pipeline and even verify that they can be reversed afterward.

Example of verifying apply-then-rollback in CI

migrate -path ./migrations -database "$DATABASE_URL" up

migrate -path ./migrations -database "$DATABASE_URL" down 1

migrate -path ./migrations -database "$DATABASE_URL" up

The Order of Code Deployment and Migration

The order of application code deployment and migration must be decided carefully. In general, run the expand step that adds a column before the code deployment, and the contract step that removes a column after the code deployment. This way, even when old and new code coexist during deployment, it stays safe.

Safe deployment order (adding a column)

1. Apply migration (add new column) - old code ignores the new column

2. Deploy new code (use the new column)

3. Stabilize after monitoring

Safe deployment order (removing a column)

1. Deploy new code (stop using the column)

2. Stabilize after monitoring

3. Apply migration (drop the column)

Incident-Prevention Checklist

Finally, here is a checklist to review right before a production migration. Checking this list as a habit can prevent most incidents.

Pre-production migration checklist

[ ] Is the migration version-controlled and has it passed code review?

[ ] Has it been run on staging with production-like data?

[ ] Have you measured execution time and lock impact?

[ ] Is a reversal strategy (down or forward fix) ready?

[ ] Is a recent backup secured and its restorability verified?

[ ] Are large-table indexes built with CONCURRENTLY?

[ ] Is the bulk UPDATE split into batches?

[ ] Are lock-risk operations (NOT NULL, type changes) split into steps?

[ ] Is the order of code deployment and migration decided?

[ ] Is the apply window a low-traffic period?

[ ] Are monitoring and rollback owners assigned during the apply?

Common Incident Types

| Incident type | Cause | Prevention |

| --- | --- | --- |

| Lock-induced outage | Instant index on a big table, type change | CONCURRENTLY, split into steps |

| Data loss | Unverified type conversion, deletion | Backup, staging verification |

| Replication lag | One giant transaction | Batch processing |

| Schema drift | Manual per-environment apply | Version control, automated promotion |

| Deployment conflict | Wrong code/migration order | Expand and Contract |

Conclusion

A database migration is not merely a matter of running SQL; it is an engineering discipline of managing risk. The core is to handle changes in a reversible, small, and verified state, to track every change as a versioned file, and to promote it through environments in turn.

Zero-downtime migration begins by splitting a risky change into multiple steps so that the old and new versions can coexist. The Expand and Contract pattern, transactional DDL, CONCURRENTLY indexes, and batch processing are all tools for this goal. Above all, backups and dry runs, together with the team's code-review culture, become the final safety net that prevents incidents.

I hope the principles and checklist covered in this article help make your next migration a little safer. In the end, the best migration is the one that no one notices ever happened.

References

- [PostgreSQL ALTER TABLE official documentation](https://www.postgresql.org/docs/current/sql-altertable.html)

- [MySQL official documentation](https://dev.mysql.com/doc/)

- [AWS Database Migration Service documentation](https://docs.aws.amazon.com/dms/)

- [Flyway official documentation](https://flywaydb.org/documentation/)

- [Liquibase official documentation](https://docs.liquibase.com/)

- [golang-migrate repository](https://github.com/golang-migrate/migrate)

- [Alembic official documentation](https://alembic.sqlalchemy.org/)

- [Redgate product documentation](https://documentation.red-gate.com/)

현재 단락 (1/198)

Database migrations are something almost every service eventually has to deal with. To add a new fea...

작성 글자: 0원문 글자: 16,499작성 단락: 0/198