- Published on
An Introduction to Database Migration Strategy — Schema, Data, and Zero Downtime
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction
- Types of Migrations
- Core Principles: Reversible, Small, and Verified
- Versioned Migrations
- Forward-Only vs Rollback
- Transactional DDL
- Zero-Downtime Migration Strategies
- Backups and Dry Runs
- Environment Promotion: from dev through stg to prod
- Team Process
- Incident-Prevention Checklist
- Conclusion
- References
Introduction
Database migrations are something almost every service eventually has to deal with. To add a new feature, you add a column to a table, create an index, and sometimes move billions of rows into a different shape. When a code deployment goes wrong, you can simply roll back to the previous version. Databases are not that simple. A DDL statement, once executed, is hard to undo, and data that has been moved incorrectly may be lost forever.
In one service I operated, an attempt to add an index to a large table during peak traffic locked the entire table and caused a 30-minute outage. In another case, a migration that changed a column type was applied to production without verification, and some data ended up truncated. All of these incidents stemmed from treating migrations lightly, as if they were just a matter of running some SQL.
In this article, I want to treat database migrations as an engineering discipline in their own right, and walk through migration types, core principles, versioned migrations, zero-downtime strategies, transactional DDL, backups and dry runs, environment promotion, team process, and an incident-prevention checklist. I will focus on principles that apply anywhere, rather than tying everything to a specific tool.
Types of Migrations
The word migration covers a very broad range. Because the nature of the risk and the way you respond change completely depending on what you are moving, it is important to distinguish the types first.
Schema Migrations
A schema migration changes the structure of the database. Creating or dropping tables, adding or removing columns, and altering indexes or constraints all belong here. They are expressed with DDL (Data Definition Language) statements and are the first area most migration tools address.
-- Add a column (schema change)
ALTER TABLE users ADD COLUMN last_login_at TIMESTAMPTZ;
-- Add an index (schema change)
CREATE INDEX idx_users_email ON users (email);
-- Add a constraint (schema change)
ALTER TABLE orders ADD CONSTRAINT chk_amount_positive CHECK (amount > 0);
Data Migrations
A data migration moves or transforms the data itself rather than the structure. Filling a new column with default values, normalizing data from one table into several tables, or converting encodings all fall into this category. They are expressed with DML (Data Manipulation Language), and performance and lock management become crucial when handling large numbers of rows.
-- A data migration that fills a new column
UPDATE users SET display_name = username WHERE display_name IS NULL;
-- A migration that moves data into another table
INSERT INTO user_profiles (user_id, bio)
SELECT id, bio FROM users WHERE bio IS NOT NULL;
Engine Migrations
An engine migration upgrades the version of the same database product. Going from PostgreSQL 14 to 16, or MySQL 5.7 to 8.0, are typical examples. You must account for SQL compatibility, changes to default settings, and shifts in performance characteristics.
Platform Migrations
A platform migration is the largest kind of change, where you replace the database product itself or move from on-premises to the cloud. Examples include migrating from Oracle to PostgreSQL, or moving a self-hosted MySQL to Amazon RDS. These usually require a dedicated tool such as AWS DMS and a long period of parallel operation.
Comparison by Type
| Type | Target | Common tools | Risk | Reversibility |
|---|---|---|---|---|
| Schema | Tables, columns, indexes | Flyway, Liquibase, migrate | Medium | Relatively easy |
| Data | Rows, values, formats | Batch scripts, ETL | High | Difficult |
| Engine | DBMS version | pg_upgrade, mysql_upgrade | High | Very difficult |
| Platform | DBMS product, infra | AWS DMS, dedicated migrators | Very high | Nearly impossible |
Core Principles: Reversible, Small, and Verified
There are three core principles for handling migrations safely. These hold true regardless of the tool you use.
Reversible
Whenever possible, every migration should be designed to be reversible. If you add a column, prepare the reverse script that removes it; if you create an index, prepare the script that drops it. That said, recognize that operations which delete data or convert types with loss are inherently irreversible, and treat such operations with even more care.
Small
A single migration should be broken into the smallest possible unit. A giant migration that changes ten tables at once is hard to diagnose when it fails and tricky to recover from when it fails partway. Small migrations are easy to review, and even when something goes wrong, the blast radius stays narrow.
Verified
You must verify a migration before applying it to production. Run it against data that is close to real data in development and staging environments, and measure the execution time and lock impact. The assumption "it ran fine locally, so it must be fine" is the most dangerous one.
The three pillars of a safe migration
Reversible Small Verified
| | |
reverse ready split into units run on staging
| | |
+----------------+-----------------+
|
safe deployment
Versioned Migrations
The heart of modern migrations is managing every change as a versioned file. Connecting directly to the database console and running SQL by hand is untraceable and hard to reproduce. Instead, keep migrations in version control alongside your code.
File Naming Conventions
Most tools follow a naming convention that uses a timestamp or sequence number as a prefix. This makes the execution order of migrations unambiguous.
migrations/
V20260601120000__create_users_table.sql
V20260602093000__add_email_index.sql
V20260603140000__add_last_login_column.sql
V20260604101500__backfill_display_name.sql
Tools like golang-migrate manage up and down files as pairs.
migrations/
000001_create_users_table.up.sql
000001_create_users_table.down.sql
000002_add_email_index.up.sql
000002_add_email_index.down.sql
Example Migration Files
Each migration file holds only one logical change. Below is an example of an up file that creates a users table and a down file that reverses it.
-- 000001_create_users_table.up.sql
CREATE TABLE users (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
username VARCHAR(100) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_users_username ON users (username);
-- 000001_create_users_table.down.sql
DROP TABLE IF EXISTS users;
The Migration History Table
To track which versions have been applied, migration tools keep a dedicated history table inside the database. Flyway uses a table called flyway_schema_history, and Liquibase uses one called databasechangelog. Thanks to this table, the tool can run only the migrations that have not yet been applied.
flyway_schema_history (conceptual)
installed_rank | version | description | success
---------------+---------+----------------------+--------
1 | 1 | create users table | true
2 | 2 | add email index | true
3 | 3 | add last login col | true
Forward-Only vs Rollback
There are broadly two philosophies for reversing migrations.
The Rollback Approach
The rollback approach prepares a reverse script (down) for every migration, and runs it to return to the previous state when something goes wrong. It is intuitive, but it has a trap. A migration that deletes or transforms data often cannot be perfectly restored by a down script, because, for example, dropping a column and adding it back will not bring back the data it contained.
The Forward-Only Approach
The forward-only approach never goes backward. When something goes wrong, you add a new migration that fixes it. If you created a bad column, instead of reverting it you write a new migration that cleans it up. Many teams prefer forward-only in large production environments because the safety of down scripts is hard to guarantee.
| Aspect | Rollback approach | Forward-only approach |
|---|---|---|
| How to undo | Run the down script | Add a new migration |
| Data-loss risk | down may be incomplete | Handled explicitly |
| Operational complexity | Looks simple but has traps | Consistent and predictable |
| Recommended for | Small, early-stage | Large production |
Transactional DDL
Transactional DDL is the ability to run DDL statements inside a transaction, guaranteeing that if the migration fails partway, all changes are rolled back together. PostgreSQL can handle most DDL inside a transaction, which makes it very strong in this respect. MySQL, on the other hand, triggers an implicit commit for many DDL statements, so you cannot expect transactional DDL there.
Transactional DDL in PostgreSQL
If you wrap several DDL statements in a single transaction, then if even one of them fails midway, the entire set is cleanly reverted.
BEGIN;
ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;
CREATE INDEX idx_orders_status ON orders (status);
-- If an error occurs here, all three changes above are rolled back.
COMMIT;
Operations to Avoid Inside a Transaction
There is a caveat. In PostgreSQL, CREATE INDEX CONCURRENTLY cannot run inside a transaction block. This command is meant to build an index without locking the table, but inside a transaction its lock-avoidance mechanism cannot operate. Therefore, when adding an index to a large table without downtime, you must run it separately, outside any transaction.
-- Run on its own, outside a transaction (zero-downtime index build)
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);
Zero-Downtime Migration Strategies
Changing the schema without stopping the service is the hardest part of migrations. The core idea is to split a risky change into multiple steps so that, at each step, both the old code and the new code keep working correctly.
The Expand and Contract Pattern
The most widely used zero-downtime pattern is Expand and Contract. Even something as simple-looking as renaming a column requires several steps to do without downtime.
Renaming a column (username -> handle) without downtime
Step 1 Expand : add the new column handle (keep username)
Step 2 Backfill : copy username values into handle
Step 3 Dual-write : the application writes to both columns
Step 4 Read swap : deploy the app to read from handle
Step 5 Contract : after stabilizing, drop the username column
The key to this pattern is that at no single step do the old and new versions of the application become incompatible. Even when the two versions coexist during a rolling deployment, data consistency is never broken.
Risky Operations and Safe Alternatives
| Risky operation | Problem | Safe alternative |
|---|---|---|
| Add a NOT NULL column instantly | Full table rewrite, locks | Add as nullable, backfill, add constraint later |
| Plain CREATE INDEX | Table write lock | Use CREATE INDEX CONCURRENTLY |
| Change column type instantly | Full rewrite, long lock | Add a new column, backfill gradually |
| One huge UPDATE | Long transaction, lock contention | Process in batches |
Adding a NOT NULL Constraint Safely
Adding a NOT NULL column to a large table all at once can lock the whole table. Splitting it into steps as below is safe.
-- Step 1: add as a nullable column (fast, short lock)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Step 2: backfill existing rows in batches
UPDATE users SET phone = '' WHERE phone IS NULL AND id BETWEEN 1 AND 100000;
-- (repeat, shifting the range)
-- Step 3: add a verifiable NOT NULL constraint as NOT VALID first
ALTER TABLE users ADD CONSTRAINT users_phone_not_null
CHECK (phone IS NOT NULL) NOT VALID;
-- Step 4: validate separately (without a full table lock)
ALTER TABLE users VALIDATE CONSTRAINT users_phone_not_null;
Splitting a Bulk UPDATE into Batches
Processing millions of rows in a single UPDATE creates a long transaction, causing lock contention and replication lag. Processing in small batches is safer.
-- Example of repeating in batch units
UPDATE orders
SET status = 'archived'
WHERE id IN (
SELECT id FROM orders
WHERE status = 'old' AND archived = false
LIMIT 5000
);
Backups and Dry Runs
For any migration, two things must be in place before applying it to production: a backup and a dry run.
Backups
Before a high-risk migration, always secure a backup. Understand the characteristics of logical and physical backups and choose according to the situation. Above all, the most important thing is to verify in advance that the backup is actually restorable. A backup you have never restored is not a backup.
# PostgreSQL logical backup (dump a specific database)
pg_dump --format=custom --file=backup_before_migration.dump mydb
# Verify restore (restore into a separate temporary database)
pg_restore --dbname=mydb_restore_test backup_before_migration.dump
# MySQL logical backup
mysqldump --single-transaction --routines --triggers mydb > backup_before_migration.sql
Dry Runs
A dry run is the procedure of confirming what a migration will do in advance, without actually applying the change. Many tools provide this feature. Running it for real in a staging environment with data at a scale similar to production, while measuring elapsed time and lock impact, is also a dry run in the broad sense.
# Liquibase: print SQL without actually running it (dry run)
liquibase update-sql
# Flyway: check the list of migrations to be applied
flyway info
# golang-migrate: check the current version
migrate -path ./migrations -database "$DATABASE_URL" version
Checking the Execution Plan
For large data migrations, you can check the execution plan in advance with EXPLAIN to catch unintended full scans or inefficiencies.
EXPLAIN ANALYZE
UPDATE orders SET status = 'archived'
WHERE created_at < now() - INTERVAL '1 year';
Environment Promotion: from dev through stg to prod
A migration should never be applied directly to production. It is verified by passing through development (dev), staging (stg), and production (prod) in turn. The same migration file must be applied identically across all environments to prevent schema drift between them.
Environment promotion pipeline
[dev] ---- verify ----> [stg] ---- verify ----> [prod]
| | |
developer local production-like live service
fast iteration data rehearsal careful apply
| | |
write schema measure run time deploy after backup
check lock impact monitor
The Role of Each Environment
| Environment | Purpose | Data | Key activities |
|---|---|---|---|
| dev | Writing migrations, fast iteration | Small seed data | Schema design, unit verification |
| stg | Production rehearsal | Production-like scale | Measure run time, lock impact |
| prod | Real application | Real data | Backup, deploy, monitor |
A staging environment is only meaningful if it has a data scale as close to production as possible. A migration that ran fine on a staging environment with only a thousand rows can take hours on a production system with hundreds of millions of rows.
Team Process
Migrations are the work of the team, not of one person. A safe migration culture requires a clear process.
Migrations Are Subject to Code Review
Every migration file goes through a pull request and code review just like application code. Reviewers check lock impact, reversibility, backfill strategy, and whether batching is used.
Verification in the CI Pipeline
It is good to automatically apply migrations to a temporary database in the CI pipeline and even verify that they can be reversed afterward.
# Example of verifying apply-then-rollback in CI
migrate -path ./migrations -database "$DATABASE_URL" up
migrate -path ./migrations -database "$DATABASE_URL" down 1
migrate -path ./migrations -database "$DATABASE_URL" up
The Order of Code Deployment and Migration
The order of application code deployment and migration must be decided carefully. In general, run the expand step that adds a column before the code deployment, and the contract step that removes a column after the code deployment. This way, even when old and new code coexist during deployment, it stays safe.
Safe deployment order (adding a column)
1. Apply migration (add new column) - old code ignores the new column
2. Deploy new code (use the new column)
3. Stabilize after monitoring
Safe deployment order (removing a column)
1. Deploy new code (stop using the column)
2. Stabilize after monitoring
3. Apply migration (drop the column)
Incident-Prevention Checklist
Finally, here is a checklist to review right before a production migration. Checking this list as a habit can prevent most incidents.
Pre-production migration checklist
[ ] Is the migration version-controlled and has it passed code review?
[ ] Has it been run on staging with production-like data?
[ ] Have you measured execution time and lock impact?
[ ] Is a reversal strategy (down or forward fix) ready?
[ ] Is a recent backup secured and its restorability verified?
[ ] Are large-table indexes built with CONCURRENTLY?
[ ] Is the bulk UPDATE split into batches?
[ ] Are lock-risk operations (NOT NULL, type changes) split into steps?
[ ] Is the order of code deployment and migration decided?
[ ] Is the apply window a low-traffic period?
[ ] Are monitoring and rollback owners assigned during the apply?
Common Incident Types
| Incident type | Cause | Prevention |
|---|---|---|
| Lock-induced outage | Instant index on a big table, type change | CONCURRENTLY, split into steps |
| Data loss | Unverified type conversion, deletion | Backup, staging verification |
| Replication lag | One giant transaction | Batch processing |
| Schema drift | Manual per-environment apply | Version control, automated promotion |
| Deployment conflict | Wrong code/migration order | Expand and Contract |
Conclusion
A database migration is not merely a matter of running SQL; it is an engineering discipline of managing risk. The core is to handle changes in a reversible, small, and verified state, to track every change as a versioned file, and to promote it through environments in turn.
Zero-downtime migration begins by splitting a risky change into multiple steps so that the old and new versions can coexist. The Expand and Contract pattern, transactional DDL, CONCURRENTLY indexes, and batch processing are all tools for this goal. Above all, backups and dry runs, together with the team's code-review culture, become the final safety net that prevents incidents.
I hope the principles and checklist covered in this article help make your next migration a little safer. In the end, the best migration is the one that no one notices ever happened.