An Introduction to Database Migration Strategy — Schema, Data, and Zero Downtime

Introduction
Types of Migrations
Core Principles: Reversible, Small, and Verified
Versioned Migrations
Forward-Only vs Rollback
- The Rollback Approach
- The Forward-Only Approach
Transactional DDL
- Transactional DDL in PostgreSQL
- Operations to Avoid Inside a Transaction
Zero-Downtime Migration Strategies
Backups and Dry Runs
Environment Promotion: from dev through stg to prod
- The Role of Each Environment
Team Process
Incident-Prevention Checklist
- Common Incident Types
Conclusion
References

Introduction

Database migrations are something almost every service eventually has to deal with. To add a new feature, you add a column to a table, create an index, and sometimes move billions of rows into a different shape. When a code deployment goes wrong, you can simply roll back to the previous version. Databases are not that simple. A DDL statement, once executed, is hard to undo, and data that has been moved incorrectly may be lost forever.

In one service I operated, an attempt to add an index to a large table during peak traffic locked the entire table and caused a 30-minute outage. In another case, a migration that changed a column type was applied to production without verification, and some data ended up truncated. All of these incidents stemmed from treating migrations lightly, as if they were just a matter of running some SQL.

In this article, I want to treat database migrations as an engineering discipline in their own right, and walk through migration types, core principles, versioned migrations, zero-downtime strategies, transactional DDL, backups and dry runs, environment promotion, team process, and an incident-prevention checklist. I will focus on principles that apply anywhere, rather than tying everything to a specific tool.

Types of Migrations

The word migration covers a very broad range. Because the nature of the risk and the way you respond change completely depending on what you are moving, it is important to distinguish the types first.

Schema Migrations

A schema migration changes the structure of the database. Creating or dropping tables, adding or removing columns, and altering indexes or constraints all belong here. They are expressed with DDL (Data Definition Language) statements and are the first area most migration tools address.

-- Add a column (schema change)
ALTER TABLE users ADD COLUMN last_login_at TIMESTAMPTZ;

-- Add an index (schema change)
CREATE INDEX idx_users_email ON users (email);

-- Add a constraint (schema change)
ALTER TABLE orders ADD CONSTRAINT chk_amount_positive CHECK (amount > 0);

Data Migrations

A data migration moves or transforms the data itself rather than the structure. Filling a new column with default values, normalizing data from one table into several tables, or converting encodings all fall into this category. They are expressed with DML (Data Manipulation Language), and performance and lock management become crucial when handling large numbers of rows.

-- A data migration that fills a new column
UPDATE users SET display_name = username WHERE display_name IS NULL;

-- A migration that moves data into another table
INSERT INTO user_profiles (user_id, bio)
SELECT id, bio FROM users WHERE bio IS NOT NULL;

Engine Migrations

An engine migration upgrades the version of the same database product. Going from PostgreSQL 14 to 16, or MySQL 5.7 to 8.0, are typical examples. You must account for SQL compatibility, changes to default settings, and shifts in performance characteristics.

Platform Migrations

A platform migration is the largest kind of change, where you replace the database product itself or move from on-premises to the cloud. Examples include migrating from Oracle to PostgreSQL, or moving a self-hosted MySQL to Amazon RDS. These usually require a dedicated tool such as AWS DMS and a long period of parallel operation.

Comparison by Type

Type	Target	Common tools	Risk	Reversibility
Schema	Tables, columns, indexes	Flyway, Liquibase, migrate	Medium	Relatively easy
Data	Rows, values, formats	Batch scripts, ETL	High	Difficult
Engine	DBMS version	pg_upgrade, mysql_upgrade	High	Very difficult
Platform	DBMS product, infra	AWS DMS, dedicated migrators	Very high	Nearly impossible

Core Principles: Reversible, Small, and Verified

There are three core principles for handling migrations safely. These hold true regardless of the tool you use.

Reversible

Whenever possible, every migration should be designed to be reversible. If you add a column, prepare the reverse script that removes it; if you create an index, prepare the script that drops it. That said, recognize that operations which delete data or convert types with loss are inherently irreversible, and treat such operations with even more care.

Small

A single migration should be broken into the smallest possible unit. A giant migration that changes ten tables at once is hard to diagnose when it fails and tricky to recover from when it fails partway. Small migrations are easy to review, and even when something goes wrong, the blast radius stays narrow.

Verified

You must verify a migration before applying it to production. Run it against data that is close to real data in development and staging environments, and measure the execution time and lock impact. The assumption "it ran fine locally, so it must be fine" is the most dangerous one.

The three pillars of a safe migration

   Reversible          Small           Verified
        |                |                 |
   reverse ready    split into units   run on staging
        |                |                 |
        +----------------+-----------------+
                         |
                  safe deployment

Versioned Migrations

The heart of modern migrations is managing every change as a versioned file. Connecting directly to the database console and running SQL by hand is untraceable and hard to reproduce. Instead, keep migrations in version control alongside your code.

File Naming Conventions

Most tools follow a naming convention that uses a timestamp or sequence number as a prefix. This makes the execution order of migrations unambiguous.

migrations/
  V20260601120000__create_users_table.sql
  V20260602093000__add_email_index.sql
  V20260603140000__add_last_login_column.sql
  V20260604101500__backfill_display_name.sql

Tools like golang-migrate manage up and down files as pairs.

migrations/
  000001_create_users_table.up.sql
  000001_create_users_table.down.sql
  000002_add_email_index.up.sql
  000002_add_email_index.down.sql

Example Migration Files

Each migration file holds only one logical change. Below is an example of an up file that creates a users table and a down file that reverses it.

-- 000001_create_users_table.up.sql
CREATE TABLE users (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email       VARCHAR(255) NOT NULL UNIQUE,
    username    VARCHAR(100) NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_users_username ON users (username);

-- 000001_create_users_table.down.sql
DROP TABLE IF EXISTS users;

The Migration History Table

To track which versions have been applied, migration tools keep a dedicated history table inside the database. Flyway uses a table called flyway_schema_history, and Liquibase uses one called databasechangelog. Thanks to this table, the tool can run only the migrations that have not yet been applied.

flyway_schema_history (conceptual)

installed_rank | version | description          | success
---------------+---------+----------------------+--------
1              | 1       | create users table   | true
2              | 2       | add email index      | true
3              | 3       | add last login col   | true

Forward-Only vs Rollback

There are broadly two philosophies for reversing migrations.

The Rollback Approach

The rollback approach prepares a reverse script (down) for every migration, and runs it to return to the previous state when something goes wrong. It is intuitive, but it has a trap. A migration that deletes or transforms data often cannot be perfectly restored by a down script, because, for example, dropping a column and adding it back will not bring back the data it contained.

The Forward-Only Approach

The forward-only approach never goes backward. When something goes wrong, you add a new migration that fixes it. If you created a bad column, instead of reverting it you write a new migration that cleans it up. Many teams prefer forward-only in large production environments because the safety of down scripts is hard to guarantee.

Aspect	Rollback approach	Forward-only approach
How to undo	Run the down script	Add a new migration
Data-loss risk	down may be incomplete	Handled explicitly
Operational complexity	Looks simple but has traps	Consistent and predictable
Recommended for	Small, early-stage	Large production

Transactional DDL

Transactional DDL is the ability to run DDL statements inside a transaction, guaranteeing that if the migration fails partway, all changes are rolled back together. PostgreSQL can handle most DDL inside a transaction, which makes it very strong in this respect. MySQL, on the other hand, triggers an implicit commit for many DDL statements, so you cannot expect transactional DDL there.

Transactional DDL in PostgreSQL

If you wrap several DDL statements in a single transaction, then if even one of them fails midway, the entire set is cleanly reverted.

BEGIN;

ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMPTZ;
CREATE INDEX idx_orders_status ON orders (status);

-- If an error occurs here, all three changes above are rolled back.
COMMIT;

Operations to Avoid Inside a Transaction

There is a caveat. In PostgreSQL, CREATE INDEX CONCURRENTLY cannot run inside a transaction block. This command is meant to build an index without locking the table, but inside a transaction its lock-avoidance mechanism cannot operate. Therefore, when adding an index to a large table without downtime, you must run it separately, outside any transaction.

-- Run on its own, outside a transaction (zero-downtime index build)
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

Zero-Downtime Migration Strategies

Changing the schema without stopping the service is the hardest part of migrations. The core idea is to split a risky change into multiple steps so that, at each step, both the old code and the new code keep working correctly.

The Expand and Contract Pattern

The most widely used zero-downtime pattern is Expand and Contract. Even something as simple-looking as renaming a column requires several steps to do without downtime.

Renaming a column (username -> handle) without downtime

Step 1 Expand   : add the new column handle (keep username)
Step 2 Backfill : copy username values into handle
Step 3 Dual-write : the application writes to both columns
Step 4 Read swap : deploy the app to read from handle
Step 5 Contract : after stabilizing, drop the username column

The key to this pattern is that at no single step do the old and new versions of the application become incompatible. Even when the two versions coexist during a rolling deployment, data consistency is never broken.

Risky Operations and Safe Alternatives

Risky operation	Problem	Safe alternative
Add a NOT NULL column instantly	Full table rewrite, locks	Add as nullable, backfill, add constraint later
Plain CREATE INDEX	Table write lock	Use CREATE INDEX CONCURRENTLY
Change column type instantly	Full rewrite, long lock	Add a new column, backfill gradually
One huge UPDATE	Long transaction, lock contention	Process in batches

Adding a NOT NULL Constraint Safely

Adding a NOT NULL column to a large table all at once can lock the whole table. Splitting it into steps as below is safe.

-- Step 1: add as a nullable column (fast, short lock)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

-- Step 2: backfill existing rows in batches
UPDATE users SET phone = '' WHERE phone IS NULL AND id BETWEEN 1 AND 100000;
-- (repeat, shifting the range)

-- Step 3: add a verifiable NOT NULL constraint as NOT VALID first
ALTER TABLE users ADD CONSTRAINT users_phone_not_null
    CHECK (phone IS NOT NULL) NOT VALID;

-- Step 4: validate separately (without a full table lock)
ALTER TABLE users VALIDATE CONSTRAINT users_phone_not_null;

Splitting a Bulk UPDATE into Batches

Processing millions of rows in a single UPDATE creates a long transaction, causing lock contention and replication lag. Processing in small batches is safer.

-- Example of repeating in batch units
UPDATE orders
SET status = 'archived'
WHERE id IN (
    SELECT id FROM orders
    WHERE status = 'old' AND archived = false
    LIMIT 5000
);

Backups and Dry Runs

For any migration, two things must be in place before applying it to production: a backup and a dry run.

Backups

Before a high-risk migration, always secure a backup. Understand the characteristics of logical and physical backups and choose according to the situation. Above all, the most important thing is to verify in advance that the backup is actually restorable. A backup you have never restored is not a backup.

# PostgreSQL logical backup (dump a specific database)
pg_dump --format=custom --file=backup_before_migration.dump mydb

# Verify restore (restore into a separate temporary database)
pg_restore --dbname=mydb_restore_test backup_before_migration.dump

# MySQL logical backup
mysqldump --single-transaction --routines --triggers mydb > backup_before_migration.sql

Dry Runs

A dry run is the procedure of confirming what a migration will do in advance, without actually applying the change. Many tools provide this feature. Running it for real in a staging environment with data at a scale similar to production, while measuring elapsed time and lock impact, is also a dry run in the broad sense.

# Liquibase: print SQL without actually running it (dry run)
liquibase update-sql

# Flyway: check the list of migrations to be applied
flyway info

# golang-migrate: check the current version
migrate -path ./migrations -database "$DATABASE_URL" version

Checking the Execution Plan

For large data migrations, you can check the execution plan in advance with EXPLAIN to catch unintended full scans or inefficiencies.

EXPLAIN ANALYZE
UPDATE orders SET status = 'archived'
WHERE created_at < now() - INTERVAL '1 year';

Environment Promotion: from dev through stg to prod

A migration should never be applied directly to production. It is verified by passing through development (dev), staging (stg), and production (prod) in turn. The same migration file must be applied identically across all environments to prevent schema drift between them.

Environment promotion pipeline

  [dev] ---- verify ----> [stg] ---- verify ----> [prod]
    |                       |                        |
  developer local       production-like           live service
  fast iteration        data rehearsal            careful apply
    |                       |                        |
  write schema          measure run time          deploy after backup
                        check lock impact         monitor

The Role of Each Environment

Environment	Purpose	Data	Key activities
dev	Writing migrations, fast iteration	Small seed data	Schema design, unit verification
stg	Production rehearsal	Production-like scale	Measure run time, lock impact
prod	Real application	Real data	Backup, deploy, monitor

A staging environment is only meaningful if it has a data scale as close to production as possible. A migration that ran fine on a staging environment with only a thousand rows can take hours on a production system with hundreds of millions of rows.

Team Process

Migrations are the work of the team, not of one person. A safe migration culture requires a clear process.

Migrations Are Subject to Code Review

Every migration file goes through a pull request and code review just like application code. Reviewers check lock impact, reversibility, backfill strategy, and whether batching is used.

Verification in the CI Pipeline

It is good to automatically apply migrations to a temporary database in the CI pipeline and even verify that they can be reversed afterward.

# Example of verifying apply-then-rollback in CI
migrate -path ./migrations -database "$DATABASE_URL" up
migrate -path ./migrations -database "$DATABASE_URL" down 1
migrate -path ./migrations -database "$DATABASE_URL" up

The Order of Code Deployment and Migration

The order of application code deployment and migration must be decided carefully. In general, run the expand step that adds a column before the code deployment, and the contract step that removes a column after the code deployment. This way, even when old and new code coexist during deployment, it stays safe.

Safe deployment order (adding a column)

  1. Apply migration (add new column) - old code ignores the new column
  2. Deploy new code (use the new column)
  3. Stabilize after monitoring

Safe deployment order (removing a column)

  1. Deploy new code (stop using the column)
  2. Stabilize after monitoring
  3. Apply migration (drop the column)

Incident-Prevention Checklist

Finally, here is a checklist to review right before a production migration. Checking this list as a habit can prevent most incidents.

Pre-production migration checklist

[ ] Is the migration version-controlled and has it passed code review?
[ ] Has it been run on staging with production-like data?
[ ] Have you measured execution time and lock impact?
[ ] Is a reversal strategy (down or forward fix) ready?
[ ] Is a recent backup secured and its restorability verified?
[ ] Are large-table indexes built with CONCURRENTLY?
[ ] Is the bulk UPDATE split into batches?
[ ] Are lock-risk operations (NOT NULL, type changes) split into steps?
[ ] Is the order of code deployment and migration decided?
[ ] Is the apply window a low-traffic period?
[ ] Are monitoring and rollback owners assigned during the apply?

Common Incident Types

Incident type	Cause	Prevention
Lock-induced outage	Instant index on a big table, type change	CONCURRENTLY, split into steps
Data loss	Unverified type conversion, deletion	Backup, staging verification
Replication lag	One giant transaction	Batch processing
Schema drift	Manual per-environment apply	Version control, automated promotion
Deployment conflict	Wrong code/migration order	Expand and Contract

Conclusion

A database migration is not merely a matter of running SQL; it is an engineering discipline of managing risk. The core is to handle changes in a reversible, small, and verified state, to track every change as a versioned file, and to promote it through environments in turn.

Zero-downtime migration begins by splitting a risky change into multiple steps so that the old and new versions can coexist. The Expand and Contract pattern, transactional DDL, CONCURRENTLY indexes, and batch processing are all tools for this goal. Above all, backups and dry runs, together with the team's code-review culture, become the final safety net that prevents incidents.

I hope the principles and checklist covered in this article help make your next migration a little safer. In the end, the best migration is the one that no one notices ever happened.