Skip to content
Published on

Migration Rollback and Verification — Designing a Safety Net

Authors

Introduction

Rolling back a deployment is usually simple. You bring up the previous container image again, or shift traffic back to the previous deployment. Code is idempotent, stateless, and can be swapped cleanly at any time. Database migrations, however, betray this intuition head-on.

A migration mutates state. Drop a column, and the data inside it is gone. Change a type, and you may lose precision during the conversion. Fail halfway through backfilling millions of rows, and you are left with a half-finished state where some rows hold the new value and others hold the old one. A code rollback means "return to the previous state," but a data rollback often means "make what already happened un-happen." And in the world of data, "un-happening" is almost always impossible.

This post is about designing a safety net so you can handle migrations safely. There is one central question: "When this change goes wrong, how do we survive?" We will walk through how to write migrations so that rollback is possible, how to capture backups and recovery points, how to verify schema and data and performance and the application in stages, how to spread risk with canaries and feature flags, and finally the runbook and postmortem you follow when something does go wrong.

Here is what this post covers.

  • Why rollback is inherently hard, and the irreversibility of data loss
  • Distinguishing reversible changes from irreversible ones
  • Patterns for writing rollback-able migrations
  • Strategies for backups and recovery points
  • Multi-stage verification: schema, data, performance, application smoke
  • Canary deployments and gradual rollout
  • Separating code from schema with feature flags
  • Automated data verification using checksums and re-aggregation
  • Incident response runbooks, postmortems, and a final checklist

Why Rollback Is Hard

Data loss is irreversible

Much of software engineering is reversible. Bad code is undone with git revert; a wrongly launched server is taken down again. But data is different. A row once deleted, a value once overwritten, a string once truncated cannot be restored without the original.

Nothing demonstrates this more clearly than dropping a column.

-- This migration cannot be "rolled back."
ALTER TABLE users DROP COLUMN legacy_phone;

Syntactically you can write a "down" migration like this.

-- This only restores the column "structure," not the data.
ALTER TABLE users ADD COLUMN legacy_phone VARCHAR(20);

But this down migration is a lie. The column comes back, but every phone number that lived inside it is gone forever. The migration tool knows only that "a down script exists"; it has no idea whether that script performs a meaningful restoration. This is the first trap that makes rollback dangerous. The "reversal" a tool guarantees and the "reversal" that actually preserves data are entirely different concepts.

As time passes, new data accumulates

The second factor that makes rollback hard is time. If you catch a problem within five minutes of deploying a migration, not much data has come in during that window. But if you catch it six hours later, users have generated tens of thousands of transactions on top of the new schema in the meantime.

Now, returning to the old schema means deciding what to do with the new data accumulated over those six hours. Discard it? Transform it? Values written into a column that exists only in the new schema have nowhere to go in the old one. It is no longer a simple "reversal" but a data consistency problem between two schemas.

T0   migration applied (column added, NOT NULL)
T0+5m  deploy complete, traffic normalized
T0+10m users start writing data into the new column
...
T0+6h  performance issue detected → decision to roll back
       now, what about six hours of accumulated new data?
       a naive down migration ignores or destroys it

Locks and large tables

The third factor is the physical constraints of a production environment. The moment you run an ALTER on a table with hundreds of millions of rows, depending on the database engine the entire table may be locked, and every read and write may halt during that time. The rollback migration itself can trigger another long lock, turning the rollback into a bigger incident than the original.

So a good migration strategy starts not from "how to roll back well," but from designing changes so that rollback is unnecessary in the first place, or at least safe.

Reversible Changes vs Irreversible Changes

If you place every schema change on a single axis, at one end sit perfectly reversible changes, and at the other end sit completely irreversible ones.

Change typeReversibilityRiskNotes
Add column (nullable)HighLowJust drop it; no impact on existing data
Add indexHighLowDrop to restore; only performance impact
Add tableHighLowJust drop it
Rename columnMediumMediumMust sync with application code
Add column (NOT NULL + default)MediumMediumNeeds backfill; lock risk if large
Type change (widening)MediumMediumint to bigint, usually safe
Type change (narrowing)LowHighPossible precision loss; verify data
Drop columnVery lowHighPermanent data loss
Drop tableVery lowVery highPermanent data loss
Data transform backfillLowHighIrreversible if original not preserved

The most important lesson this table offers is this: decompose an irreversible change into a sequence of reversible changes. Instead of dropping a column, first make the code stop reading it, observe for a while, and then drop it only once you judge it sufficiently safe. Every step in between is reversible.

The Expand-Contract pattern

The formalization of this decomposition is the expand-and-contract pattern, also called parallel change. As Martin Fowler captured it in evolutionary database design, this pattern consists of three phases.

1. Expand
   - add the new structure while leaving the old one in place
   - e.g., add a new column, keep the existing column
   - code writes to both, or reads the new one first with the old as fallback

2. Migrate
   - backfill existing data into the new structure
   - record incoming data into both structures (dual write)
   - verify: confirm the data in both structures matches

3. Contract
   - confirm the code no longer references the old structure
   - after a sufficient observation period, remove the old structure

The heart of this pattern is that each phase is deployed independently and each phase is reversible. If the expand phase has a problem, drop the new column. If the migrate phase has a problem, turn off dual write and use only the old column. The most dangerous contract phase is executed only after every preceding phase has been confirmed stable.

Writing Rollback-able Migrations

Always pair up and down

Most migration tools have you write up (apply) and down (revert) scripts as a pair. golang-migrate expresses this through a file naming convention.

000012_add_user_status.up.sql
000012_add_user_status.down.sql

The up script looks like this.

-- 000012_add_user_status.up.sql
ALTER TABLE users ADD COLUMN status VARCHAR(20);

The down script looks like this.

-- 000012_add_user_status.down.sql
ALTER TABLE users DROP COLUMN status;

Here the status column was newly added, so dropping it loses no data. This is a genuinely reversible migration.

Flyway's Undo migrations

Flyway expresses undo migrations with the U prefix. The revert script corresponding to versioned migration V12 is U12.

-- V12__add_user_status.sql
ALTER TABLE users ADD COLUMN status VARCHAR(20);
-- U12__add_user_status.sql
ALTER TABLE users DROP COLUMN status;

But as the Flyway documentation itself emphasizes, undo is not a cure-all. No undo can revive DROPped data, and undo scripts must be reviewed carefully.

Liquibase's automatic and manual rollback

Liquibase handles rollback at the changeset level. Some changes can have their rollback method inferred automatically; otherwise the developer writes an explicit rollback clause.

databaseChangeLog:
  - changeSet:
      id: add-user-status
      author: youngjukim
      changes:
        - addColumn:
            tableName: users
            columns:
              - column:
                  name: status
                  type: varchar(20)
      rollback:
        - dropColumn:
            tableName: users
            columnName: status

Turning irreversible changes into reversible ones

The truly dangerous changes are the ones that destroy data. If you must drop a column, do not DROP it outright; decompose it as follows.

-- Step 1: don't delete immediately; rename to mark it "deprecated"
ALTER TABLE users RENAME COLUMN legacy_phone TO legacy_phone_deprecated;
-- Step 2 (days later, after confirming no one references it): preserve in an archive table
CREATE TABLE archive_users_legacy_phone AS
SELECT id, legacy_phone_deprecated
FROM users
WHERE legacy_phone_deprecated IS NOT NULL;
-- Step 3 (after sufficient observation): the real delete
ALTER TABLE users DROP COLUMN legacy_phone_deprecated;

This way, steps 1 and 2 are both reversible, and the data is safely preserved in the archive table until step 3 runs. Even if a problem surfaces after step 3, you can restore the data from the archive table.

Patterns for avoiding locks in PostgreSQL

Adding a NOT NULL column to a large table is risky. In PostgreSQL, you minimize locking in the following order.

-- 1. add as nullable (metadata-only change, fast)
ALTER TABLE orders ADD COLUMN region VARCHAR(10);
-- 2. backfill in batches (don't lock the whole thing at once)
UPDATE orders SET region = 'UNKNOWN'
WHERE region IS NULL AND id BETWEEN 1 AND 100000;
-- repeat, shifting the range
-- 3. add the NOT NULL constraint as NOT VALID first
ALTER TABLE orders ADD CONSTRAINT orders_region_not_null
CHECK (region IS NOT NULL) NOT VALID;
-- 4. validate separately (this step avoids the ACCESS EXCLUSIVE lock)
ALTER TABLE orders VALIDATE CONSTRAINT orders_region_not_null;

Indexes likewise use the CONCURRENTLY option to avoid locking the table.

CREATE INDEX CONCURRENTLY idx_orders_region ON orders (region);

Backups and Recovery Points

If a rollback script can be a lie, what is the real safety net? It is backups. On the premise that any migration can go wrong, we must always secure a recovery point that lets us return to the state just before the change.

Recovery Point Objective and Recovery Time Objective

You need to understand two key metrics.

MetricMeaningIn migration context
RPO (Recovery Point Objective)How much data you can afford to loseIf a pre-migration snapshot exists, RPO is that point
RTO (Recovery Time Objective)How long recovery may takeThe time to restore from backup

If you take a snapshot before running the migration, in the worst case you can restore from that snapshot and guarantee an RPO at the migration's start time. But you lose all data that came in afterward, so this is truly a last resort.

Pre-migration snapshots

Just before running a large migration in production, always take a snapshot.

# logical backup (small tables, specific schema)
pg_dump --format=custom --table=users --table=orders \
  --file=/backups/pre_migration_$(date +%Y%m%d_%H%M%S).dump \
  mydb
# when you want to quickly preserve only a specific table (CTAS)
psql mydb -c "CREATE TABLE users_backup_20260616 AS TABLE users;"

For a cloud-managed database, using the snapshot feature is far faster. The point is that the first step of a migration runbook should always be "confirm the recovery point is secured."

Point-in-time recovery (PITR)

PostgreSQL's point-in-time recovery preserves the WAL (write-ahead log) so you can restore precisely to a particular moment. If a migration corrupted data at T0+30 minutes, you can restore to just before T0. To use PITR, a base backup and continuous WAL archiving must be configured in advance.

recovery flow:
  base backup (T-1d) + WAL archive (T-1d to now)
  → recovery_target_time = 'just before T0'
  → replay all transactions up to that moment to restore

Multi-Stage Verification

After a migration is applied, what should you confirm before declaring "success"? Verification should not be a single pass/fail but several layers.

┌─────────────────────────────────────────┐
│  4. Application smoke test               │  do core user flows work
├─────────────────────────────────────────┤
│  3. Performance verification             │  query plans, latency, locks
├─────────────────────────────────────────┤
│  2. Data verification                    │  row counts, checksums, integrity
├─────────────────────────────────────────┤
│  1. Schema verification                  │  did the structure change as intended
└─────────────────────────────────────────┘
        verify from the bottom up

Stage 1: Schema verification

First, confirm the schema changed as intended. Query the system catalog to check whether the column was added, the type is correct, and the constraints and indexes exist.

-- whether the column exists with the expected type
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'orders' AND column_name = 'region';
-- whether the index was created
SELECT indexname FROM pg_indexes
WHERE tablename = 'orders' AND indexname = 'idx_orders_region';

Stage 2: Data verification

A correct schema does not mean correct data. Confirm the backfill processed every row, no NULLs remain, and the transformed values are correct.

-- whether the backfill finished: NULLs should be 0
SELECT count(*) AS remaining_nulls
FROM orders WHERE region IS NULL;
-- whether the row count is preserved before and after the migration
SELECT count(*) AS total_rows FROM orders;
-- look at the distribution to check for abnormal values
SELECT region, count(*) FROM orders GROUP BY region ORDER BY 2 DESC;

Stage 3: Performance verification

Even with correct schema and data, if a new index is missing and queries fall back to full scans, you get an outage in production. Check the execution plan of core queries.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE region = 'KR' ORDER BY created_at DESC LIMIT 50;

Confirm it uses the new index, the estimated cost is reasonable, and the actual execution time falls within your baseline. Also watch whether lock waits are piling up.

-- check the current lock-wait situation
SELECT pid, state, wait_event_type, query
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL;

Stage 4: Application smoke test

Finally, from the actual application's perspective, confirm that core user flows work. It is common for the database to look fine while the application code fails to understand the new schema.

# smoke check on core endpoints
curl -fsS https://api.example.com/health
curl -fsS https://api.example.com/orders/recent | jq '.items | length'

Automate these four stages and make them part of the migration pipeline, and the definition of "success" becomes clear. It is success only after passing all four stages.

Canaries and Gradual Rollout

Canaries are familiar from code deployment. You deploy the new version to a subset of instances first, observe the metrics, then expand gradually. You can apply similar thinking to database migrations.

A schema change itself is usually all-or-nothing, but the code paths that use that schema can be turned on gradually. Combined with the expand-contract pattern, you apply adding a new column to everyone, but you turn on the logic that actually reads and writes that new column for a small slice of traffic first.

Phase 1: schema expand (applied to all, low risk)
         add new column — no one uses it yet

Phase 2: backfill + dual write (applied to all, observe)
         record new data into both, backfill existing data

Phase 3: enable the new path for 1% of traffic (canary)
         observe metrics — error rate, latency, data consistency

Phase 4: ramp 5% → 25% → 50% → 100% gradually
         pause and verify at each step

Phase 5: disable the old path, contract (remove the old column)

What should you observe at each canary step? Application error rate and latency are table stakes, but you must also watch whether the data produced by the new path stays consistent with the data from the old path. This consistency check ties directly into the automated data verification in the next section.

Separating with Feature Flags

The trickiest part of a migration is that the deployment timing of the schema change and the code change drift apart. Change the schema first, and the old code may break; change the code first, and the new code may reference a schema that does not yet exist.

A feature flag (or feature toggle) is a powerful tool for separating the two. Expand the schema in advance, deploy all the code, but keep the new behavior off behind a flag. Then, once everything is ready, flip the flag on. Turning a flag on and off is instantaneous and reversible, so you effectively get a "rollback that never touches data."

# separate the new schema path with a feature flag
def get_order_region(order, flags):
    if flags.is_enabled("use_new_region_column", order.user_id):
        # new path: read the new column directly
        return order.region
    else:
        # old path: compute with the existing logic
        return derive_region_from_address(order.shipping_address)
# on the write path, safely dual-write
def save_order_region(order, region, flags):
    order.region = region  # always write to the new column (expand phase)
    if flags.is_enabled("backfill_legacy_region"):
        order.legacy_region_code = to_legacy_code(region)  # keep the old column too

The advantage of this structure is that if a problem is found on the new path, you flip the flag off and instantly return to the old path without touching the database. You substitute the dangerous work of a migration rollback with the safe work of a flag toggle.

There is one caveat, though. Leave flags around forever and the code fills with branches and grows complex. Flags are temporary devices. Once the new path stabilizes, you must always leave a follow-up task in the backlog to clean up the flag and the old path together.

Automated Data Verification

In a large data migration, having a human eyeball every row is impossible. You need automated data verification. The two key techniques are checksums and re-aggregation.

Confirming data match with checksums

In the expand-contract pattern, when the old and new structures should hold the same data, you can quickly confirm whether they match by comparing the checksums of both structures.

-- compare old and new columns by checksum to confirm they are semantically equal
SELECT
  md5(string_agg(legacy_region_code, ',' ORDER BY id)) AS old_checksum,
  md5(string_agg(to_legacy_code(region), ',' ORDER BY id)) AS new_checksum
FROM orders;

If the two checksums are equal, it is strong evidence that the transformation was applied consistently across every row. If they differ, you can narrow in on which rows diverged.

-- find only the mismatching rows
SELECT id, region, legacy_region_code, to_legacy_code(region) AS expected
FROM orders
WHERE legacy_region_code IS DISTINCT FROM to_legacy_code(region);

Confirming preservation with re-aggregation

In a migration that transforms data, aggregate values like sums and counts often must be preserved. Comparing aggregates before and after the migration confirms that data was neither lost nor duplicated.

-- before the migration, store a baseline aggregate
CREATE TABLE migration_baseline AS
SELECT
  count(*) AS row_count,
  sum(amount) AS total_amount,
  count(distinct user_id) AS distinct_users
FROM orders;
-- after the migration, recompute the same aggregate and compare
SELECT
  b.row_count = a.row_count AS row_count_ok,
  b.total_amount = a.total_amount AS amount_ok,
  b.distinct_users = a.distinct_users AS users_ok
FROM migration_baseline b
CROSS JOIN (
  SELECT count(*) AS row_count,
         sum(amount) AS total_amount,
         count(distinct user_id) AS distinct_users
  FROM orders
) a;

If all three columns are true, the row count, the amount total, and the distinct user count are all preserved. Such aggregate verification is light and fast, and far more practical than a full comparison even on large tables.

Sample verification and full verification

There is a balance between verification cost and confidence.

MethodCostConfidenceWhen it fits
Sample verificationLowMediumQuick first check, large volume
Aggregate verificationLowMedium-highConfirm sum/count preservation
Checksum verificationMediumHighColumn-level consistency
Full comparisonHighVery highIntegrity-critical domains like finance

In practice you usually run aggregate verification and checksum verification by default, and add full comparison only for areas you suspect.

Incident Response Runbook

No matter how well you prepare, a migration can go wrong. Improvising when it does makes things worse. With a pre-written runbook, you can follow a defined procedure even under pressure.

=== Migration Incident Runbook ===

[0] Detect
    - alarms: rising error rate, increasing latency, spiking lock waits
    - immediately designate who the Incident Commander (IC) is

[1] Assess (within 5 minutes)
    - is the change reversible? (see the reversibility table above)
    - is data being corrupted, or is it merely stuck?
    - blast radius: all users, or a subset?

[2] Stop the bleeding
    - if a feature flag exists, turn it OFF immediately (fastest and safest)
    - if a backfill job is running, pause it
    - shift traffic to the old path if needed

[3] Decide on recovery
    - flag OFF is enough → don't touch data (preferred)
    - only the schema needs reverting → run the down migration
    - data is corrupted → consider restoring from backup/PITR

[4] Execute recovery
    - confirm the recovery point (pre-migration snapshot exists?)
    - IC approval before executing
    - re-run multi-stage verification after recovery

[5] Confirm stabilization
    - the 4-stage verification (schema/data/performance/smoke) passes
    - observe whether metrics return to baseline

[6] Communicate
    - update the status page, share with stakeholders
    - record the timeline (postmortem material)

The most important principle in the runbook is that recovery which touches data is a last resort. Where possible, stop the bleeding first with a method that does not touch data, such as flag OFF or traffic shifting. A backup restore itself means another data loss (the data after the restore point), so it must be decided with the greatest care.

Postmortem

Resolving the incident is not the end. To keep the same thing from recurring, you write a postmortem. The key is a blameless attitude. Instead of "who made the mistake," you ask "what systemic conditions made this mistake possible."

A good postmortem contains the following.

  • Timeline: what happened and when. The times of detection, assessment, and recovery.
  • Impact: how many users, for how long, and in what way they were affected.
  • Root cause: the structural cause beneath the surface symptom, not the symptom itself. Asking "why" five times.
  • What went well: record the parts where the safety net worked, too. If a feature flag let you stop quickly, that is a lesson as well.
  • Action items: concrete action items with an owner and a deadline.

For example, if "the backfill was too slow" is a symptom, the root cause might be "there was no batch-size setting, so the whole thing was locked in a single transaction," and the action item becomes "apply a batch-processing template to all backfill migrations, and make a pre-rehearsal mandatory for large-table migrations."

Pitfalls

Let me restate the content so far as pitfalls you often meet in practice.

Pitfall 1: Believing it is safe because a down script exists. As we saw, the existence of a down script does not guarantee data restoration. A down that reverts a DROP only revives the column structure, not the data. Always suspect whether a down script is genuinely reversible.

Pitfall 2: Changing schema and code at once. Change the schema and code in the same deployment, and an outage occurs in the brief inconsistency window between them. Separate them with expand-contract, and control the timing with feature flags.

Pitfall 3: Running it for the first time in production. A migration not rehearsed on production-scale data meets its lock time or backfill time for the first time in production. A pre-rehearsal on a production replica tells you the expected duration and lock risk in advance.

Pitfall 4: Backfilling in a single transaction. Processing millions of rows in one UPDATE produces a huge lock, a long transaction, and a WAL explosion. Always split it into batches.

Pitfall 5: Declaring success without verification. A migration ending without errors is not success. Success is passing the multi-stage verification. In particular, do not skip data verification and the application smoke test.

Pitfall 6: Proceeding without a recovery point. Without a pre-migration snapshot or PITR configuration, in the worst case you have nowhere to return to. Step 0 of the runbook is always confirming the recovery point is secured.

Pitfall 7: Leaving feature flags forever. Flags are temporary devices. Fail to clean up the flag and the old path after stabilization, and code complexity and another kind of risk accumulate.

Final Checklist

Before applying a migration to production, check the following.

[ Preparation ]
[ ] classify the change's reversibility (reversible/partial/irreversible)
[ ] decompose irreversible changes with expand-contract
[ ] do up/down (or undo) scripts exist as a pair
[ ] does the down script genuinely preserve data (or state that it cannot)
[ ] rehearsed on a production replica (measured duration/locks)

[ Safety net ]
[ ] pre-migration snapshot or PITR recovery point secured
[ ] large backfill written as batch processing
[ ] used CONCURRENTLY for index creation
[ ] separated the new behavior with a feature flag

[ Verification ]
[ ] schema verification queries ready (column/type/index/constraint)
[ ] data verification queries ready (NULL/row count/distribution)
[ ] aggregate baseline stored and post-comparison automated
[ ] checksum comparison (old/new structure consistency)
[ ] performance verification (EXPLAIN, lock-wait check)
[ ] application smoke test ready

[ Rollout ]
[ ] canary/gradual rollout plan (1% → 100%)
[ ] metrics to observe at each step defined (error rate/latency/consistency)

[ Response ]
[ ] incident runbook written and shared
[ ] agreed on how to designate the IC (Incident Commander)
[ ] stating that stop-the-bleeding priority 1 = flag OFF
[ ] postmortem template ready

[ Cleanup ]
[ ] register flag/old-path cleanup task in the backlog after stabilization

Conclusion

The real craft in database migrations is not "rolling back well," but designing changes so that a dangerous rollback is unnecessary in the first place. Decompose irreversible changes into a sequence of reversible steps, separate schema changes from code changes with feature flags, secure a recovery point just before the change, and verify in multiple layers after applying.

What all these devices create is a single safety net. The purpose of a safety net is not to never fall, but to survive when you do. A migration will go wrong someday. What we will rely on then is not heroic improvisation, but the reversibility, backups, verification, and runbook we designed in advance. Build the safety net first, and change on top of it.

References