- Published on
Migration Rollback and Verification — Designing a Safety Net
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction
- Why Rollback Is Hard
- Reversible Changes vs Irreversible Changes
- Writing Rollback-able Migrations
- Backups and Recovery Points
- Multi-Stage Verification
- Canaries and Gradual Rollout
- Separating with Feature Flags
- Automated Data Verification
- Incident Response Runbook
- Postmortem
- Pitfalls
- Final Checklist
- Conclusion
- References
Introduction
Rolling back a deployment is usually simple. You bring up the previous container image again, or shift traffic back to the previous deployment. Code is idempotent, stateless, and can be swapped cleanly at any time. Database migrations, however, betray this intuition head-on.
A migration mutates state. Drop a column, and the data inside it is gone. Change a type, and you may lose precision during the conversion. Fail halfway through backfilling millions of rows, and you are left with a half-finished state where some rows hold the new value and others hold the old one. A code rollback means "return to the previous state," but a data rollback often means "make what already happened un-happen." And in the world of data, "un-happening" is almost always impossible.
This post is about designing a safety net so you can handle migrations safely. There is one central question: "When this change goes wrong, how do we survive?" We will walk through how to write migrations so that rollback is possible, how to capture backups and recovery points, how to verify schema and data and performance and the application in stages, how to spread risk with canaries and feature flags, and finally the runbook and postmortem you follow when something does go wrong.
Here is what this post covers.
- Why rollback is inherently hard, and the irreversibility of data loss
- Distinguishing reversible changes from irreversible ones
- Patterns for writing rollback-able migrations
- Strategies for backups and recovery points
- Multi-stage verification: schema, data, performance, application smoke
- Canary deployments and gradual rollout
- Separating code from schema with feature flags
- Automated data verification using checksums and re-aggregation
- Incident response runbooks, postmortems, and a final checklist
Why Rollback Is Hard
Data loss is irreversible
Much of software engineering is reversible. Bad code is undone with git revert; a wrongly launched server is taken down again. But data is different. A row once deleted, a value once overwritten, a string once truncated cannot be restored without the original.
Nothing demonstrates this more clearly than dropping a column.
-- This migration cannot be "rolled back."
ALTER TABLE users DROP COLUMN legacy_phone;
Syntactically you can write a "down" migration like this.
-- This only restores the column "structure," not the data.
ALTER TABLE users ADD COLUMN legacy_phone VARCHAR(20);
But this down migration is a lie. The column comes back, but every phone number that lived inside it is gone forever. The migration tool knows only that "a down script exists"; it has no idea whether that script performs a meaningful restoration. This is the first trap that makes rollback dangerous. The "reversal" a tool guarantees and the "reversal" that actually preserves data are entirely different concepts.
As time passes, new data accumulates
The second factor that makes rollback hard is time. If you catch a problem within five minutes of deploying a migration, not much data has come in during that window. But if you catch it six hours later, users have generated tens of thousands of transactions on top of the new schema in the meantime.
Now, returning to the old schema means deciding what to do with the new data accumulated over those six hours. Discard it? Transform it? Values written into a column that exists only in the new schema have nowhere to go in the old one. It is no longer a simple "reversal" but a data consistency problem between two schemas.
T0 migration applied (column added, NOT NULL)
T0+5m deploy complete, traffic normalized
T0+10m users start writing data into the new column
...
T0+6h performance issue detected → decision to roll back
now, what about six hours of accumulated new data?
a naive down migration ignores or destroys it
Locks and large tables
The third factor is the physical constraints of a production environment. The moment you run an ALTER on a table with hundreds of millions of rows, depending on the database engine the entire table may be locked, and every read and write may halt during that time. The rollback migration itself can trigger another long lock, turning the rollback into a bigger incident than the original.
So a good migration strategy starts not from "how to roll back well," but from designing changes so that rollback is unnecessary in the first place, or at least safe.
Reversible Changes vs Irreversible Changes
If you place every schema change on a single axis, at one end sit perfectly reversible changes, and at the other end sit completely irreversible ones.
| Change type | Reversibility | Risk | Notes |
|---|---|---|---|
| Add column (nullable) | High | Low | Just drop it; no impact on existing data |
| Add index | High | Low | Drop to restore; only performance impact |
| Add table | High | Low | Just drop it |
| Rename column | Medium | Medium | Must sync with application code |
| Add column (NOT NULL + default) | Medium | Medium | Needs backfill; lock risk if large |
| Type change (widening) | Medium | Medium | int to bigint, usually safe |
| Type change (narrowing) | Low | High | Possible precision loss; verify data |
| Drop column | Very low | High | Permanent data loss |
| Drop table | Very low | Very high | Permanent data loss |
| Data transform backfill | Low | High | Irreversible if original not preserved |
The most important lesson this table offers is this: decompose an irreversible change into a sequence of reversible changes. Instead of dropping a column, first make the code stop reading it, observe for a while, and then drop it only once you judge it sufficiently safe. Every step in between is reversible.
The Expand-Contract pattern
The formalization of this decomposition is the expand-and-contract pattern, also called parallel change. As Martin Fowler captured it in evolutionary database design, this pattern consists of three phases.
1. Expand
- add the new structure while leaving the old one in place
- e.g., add a new column, keep the existing column
- code writes to both, or reads the new one first with the old as fallback
2. Migrate
- backfill existing data into the new structure
- record incoming data into both structures (dual write)
- verify: confirm the data in both structures matches
3. Contract
- confirm the code no longer references the old structure
- after a sufficient observation period, remove the old structure
The heart of this pattern is that each phase is deployed independently and each phase is reversible. If the expand phase has a problem, drop the new column. If the migrate phase has a problem, turn off dual write and use only the old column. The most dangerous contract phase is executed only after every preceding phase has been confirmed stable.
Writing Rollback-able Migrations
Always pair up and down
Most migration tools have you write up (apply) and down (revert) scripts as a pair. golang-migrate expresses this through a file naming convention.
000012_add_user_status.up.sql
000012_add_user_status.down.sql
The up script looks like this.
-- 000012_add_user_status.up.sql
ALTER TABLE users ADD COLUMN status VARCHAR(20);
The down script looks like this.
-- 000012_add_user_status.down.sql
ALTER TABLE users DROP COLUMN status;
Here the status column was newly added, so dropping it loses no data. This is a genuinely reversible migration.
Flyway's Undo migrations
Flyway expresses undo migrations with the U prefix. The revert script corresponding to versioned migration V12 is U12.
-- V12__add_user_status.sql
ALTER TABLE users ADD COLUMN status VARCHAR(20);
-- U12__add_user_status.sql
ALTER TABLE users DROP COLUMN status;
But as the Flyway documentation itself emphasizes, undo is not a cure-all. No undo can revive DROPped data, and undo scripts must be reviewed carefully.
Liquibase's automatic and manual rollback
Liquibase handles rollback at the changeset level. Some changes can have their rollback method inferred automatically; otherwise the developer writes an explicit rollback clause.
databaseChangeLog:
- changeSet:
id: add-user-status
author: youngjukim
changes:
- addColumn:
tableName: users
columns:
- column:
name: status
type: varchar(20)
rollback:
- dropColumn:
tableName: users
columnName: status
Turning irreversible changes into reversible ones
The truly dangerous changes are the ones that destroy data. If you must drop a column, do not DROP it outright; decompose it as follows.
-- Step 1: don't delete immediately; rename to mark it "deprecated"
ALTER TABLE users RENAME COLUMN legacy_phone TO legacy_phone_deprecated;
-- Step 2 (days later, after confirming no one references it): preserve in an archive table
CREATE TABLE archive_users_legacy_phone AS
SELECT id, legacy_phone_deprecated
FROM users
WHERE legacy_phone_deprecated IS NOT NULL;
-- Step 3 (after sufficient observation): the real delete
ALTER TABLE users DROP COLUMN legacy_phone_deprecated;
This way, steps 1 and 2 are both reversible, and the data is safely preserved in the archive table until step 3 runs. Even if a problem surfaces after step 3, you can restore the data from the archive table.
Patterns for avoiding locks in PostgreSQL
Adding a NOT NULL column to a large table is risky. In PostgreSQL, you minimize locking in the following order.
-- 1. add as nullable (metadata-only change, fast)
ALTER TABLE orders ADD COLUMN region VARCHAR(10);
-- 2. backfill in batches (don't lock the whole thing at once)
UPDATE orders SET region = 'UNKNOWN'
WHERE region IS NULL AND id BETWEEN 1 AND 100000;
-- repeat, shifting the range
-- 3. add the NOT NULL constraint as NOT VALID first
ALTER TABLE orders ADD CONSTRAINT orders_region_not_null
CHECK (region IS NOT NULL) NOT VALID;
-- 4. validate separately (this step avoids the ACCESS EXCLUSIVE lock)
ALTER TABLE orders VALIDATE CONSTRAINT orders_region_not_null;
Indexes likewise use the CONCURRENTLY option to avoid locking the table.
CREATE INDEX CONCURRENTLY idx_orders_region ON orders (region);
Backups and Recovery Points
If a rollback script can be a lie, what is the real safety net? It is backups. On the premise that any migration can go wrong, we must always secure a recovery point that lets us return to the state just before the change.
Recovery Point Objective and Recovery Time Objective
You need to understand two key metrics.
| Metric | Meaning | In migration context |
|---|---|---|
| RPO (Recovery Point Objective) | How much data you can afford to lose | If a pre-migration snapshot exists, RPO is that point |
| RTO (Recovery Time Objective) | How long recovery may take | The time to restore from backup |
If you take a snapshot before running the migration, in the worst case you can restore from that snapshot and guarantee an RPO at the migration's start time. But you lose all data that came in afterward, so this is truly a last resort.
Pre-migration snapshots
Just before running a large migration in production, always take a snapshot.
# logical backup (small tables, specific schema)
pg_dump --format=custom --table=users --table=orders \
--file=/backups/pre_migration_$(date +%Y%m%d_%H%M%S).dump \
mydb
# when you want to quickly preserve only a specific table (CTAS)
psql mydb -c "CREATE TABLE users_backup_20260616 AS TABLE users;"
For a cloud-managed database, using the snapshot feature is far faster. The point is that the first step of a migration runbook should always be "confirm the recovery point is secured."
Point-in-time recovery (PITR)
PostgreSQL's point-in-time recovery preserves the WAL (write-ahead log) so you can restore precisely to a particular moment. If a migration corrupted data at T0+30 minutes, you can restore to just before T0. To use PITR, a base backup and continuous WAL archiving must be configured in advance.
recovery flow:
base backup (T-1d) + WAL archive (T-1d to now)
→ recovery_target_time = 'just before T0'
→ replay all transactions up to that moment to restore
Multi-Stage Verification
After a migration is applied, what should you confirm before declaring "success"? Verification should not be a single pass/fail but several layers.
┌─────────────────────────────────────────┐
│ 4. Application smoke test │ do core user flows work
├─────────────────────────────────────────┤
│ 3. Performance verification │ query plans, latency, locks
├─────────────────────────────────────────┤
│ 2. Data verification │ row counts, checksums, integrity
├─────────────────────────────────────────┤
│ 1. Schema verification │ did the structure change as intended
└─────────────────────────────────────────┘
verify from the bottom up
Stage 1: Schema verification
First, confirm the schema changed as intended. Query the system catalog to check whether the column was added, the type is correct, and the constraints and indexes exist.
-- whether the column exists with the expected type
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'orders' AND column_name = 'region';
-- whether the index was created
SELECT indexname FROM pg_indexes
WHERE tablename = 'orders' AND indexname = 'idx_orders_region';
Stage 2: Data verification
A correct schema does not mean correct data. Confirm the backfill processed every row, no NULLs remain, and the transformed values are correct.
-- whether the backfill finished: NULLs should be 0
SELECT count(*) AS remaining_nulls
FROM orders WHERE region IS NULL;
-- whether the row count is preserved before and after the migration
SELECT count(*) AS total_rows FROM orders;
-- look at the distribution to check for abnormal values
SELECT region, count(*) FROM orders GROUP BY region ORDER BY 2 DESC;
Stage 3: Performance verification
Even with correct schema and data, if a new index is missing and queries fall back to full scans, you get an outage in production. Check the execution plan of core queries.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE region = 'KR' ORDER BY created_at DESC LIMIT 50;
Confirm it uses the new index, the estimated cost is reasonable, and the actual execution time falls within your baseline. Also watch whether lock waits are piling up.
-- check the current lock-wait situation
SELECT pid, state, wait_event_type, query
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL;
Stage 4: Application smoke test
Finally, from the actual application's perspective, confirm that core user flows work. It is common for the database to look fine while the application code fails to understand the new schema.
# smoke check on core endpoints
curl -fsS https://api.example.com/health
curl -fsS https://api.example.com/orders/recent | jq '.items | length'
Automate these four stages and make them part of the migration pipeline, and the definition of "success" becomes clear. It is success only after passing all four stages.
Canaries and Gradual Rollout
Canaries are familiar from code deployment. You deploy the new version to a subset of instances first, observe the metrics, then expand gradually. You can apply similar thinking to database migrations.
A schema change itself is usually all-or-nothing, but the code paths that use that schema can be turned on gradually. Combined with the expand-contract pattern, you apply adding a new column to everyone, but you turn on the logic that actually reads and writes that new column for a small slice of traffic first.
Phase 1: schema expand (applied to all, low risk)
add new column — no one uses it yet
Phase 2: backfill + dual write (applied to all, observe)
record new data into both, backfill existing data
Phase 3: enable the new path for 1% of traffic (canary)
observe metrics — error rate, latency, data consistency
Phase 4: ramp 5% → 25% → 50% → 100% gradually
pause and verify at each step
Phase 5: disable the old path, contract (remove the old column)
What should you observe at each canary step? Application error rate and latency are table stakes, but you must also watch whether the data produced by the new path stays consistent with the data from the old path. This consistency check ties directly into the automated data verification in the next section.
Separating with Feature Flags
The trickiest part of a migration is that the deployment timing of the schema change and the code change drift apart. Change the schema first, and the old code may break; change the code first, and the new code may reference a schema that does not yet exist.
A feature flag (or feature toggle) is a powerful tool for separating the two. Expand the schema in advance, deploy all the code, but keep the new behavior off behind a flag. Then, once everything is ready, flip the flag on. Turning a flag on and off is instantaneous and reversible, so you effectively get a "rollback that never touches data."
# separate the new schema path with a feature flag
def get_order_region(order, flags):
if flags.is_enabled("use_new_region_column", order.user_id):
# new path: read the new column directly
return order.region
else:
# old path: compute with the existing logic
return derive_region_from_address(order.shipping_address)
# on the write path, safely dual-write
def save_order_region(order, region, flags):
order.region = region # always write to the new column (expand phase)
if flags.is_enabled("backfill_legacy_region"):
order.legacy_region_code = to_legacy_code(region) # keep the old column too
The advantage of this structure is that if a problem is found on the new path, you flip the flag off and instantly return to the old path without touching the database. You substitute the dangerous work of a migration rollback with the safe work of a flag toggle.
There is one caveat, though. Leave flags around forever and the code fills with branches and grows complex. Flags are temporary devices. Once the new path stabilizes, you must always leave a follow-up task in the backlog to clean up the flag and the old path together.
Automated Data Verification
In a large data migration, having a human eyeball every row is impossible. You need automated data verification. The two key techniques are checksums and re-aggregation.
Confirming data match with checksums
In the expand-contract pattern, when the old and new structures should hold the same data, you can quickly confirm whether they match by comparing the checksums of both structures.
-- compare old and new columns by checksum to confirm they are semantically equal
SELECT
md5(string_agg(legacy_region_code, ',' ORDER BY id)) AS old_checksum,
md5(string_agg(to_legacy_code(region), ',' ORDER BY id)) AS new_checksum
FROM orders;
If the two checksums are equal, it is strong evidence that the transformation was applied consistently across every row. If they differ, you can narrow in on which rows diverged.
-- find only the mismatching rows
SELECT id, region, legacy_region_code, to_legacy_code(region) AS expected
FROM orders
WHERE legacy_region_code IS DISTINCT FROM to_legacy_code(region);
Confirming preservation with re-aggregation
In a migration that transforms data, aggregate values like sums and counts often must be preserved. Comparing aggregates before and after the migration confirms that data was neither lost nor duplicated.
-- before the migration, store a baseline aggregate
CREATE TABLE migration_baseline AS
SELECT
count(*) AS row_count,
sum(amount) AS total_amount,
count(distinct user_id) AS distinct_users
FROM orders;
-- after the migration, recompute the same aggregate and compare
SELECT
b.row_count = a.row_count AS row_count_ok,
b.total_amount = a.total_amount AS amount_ok,
b.distinct_users = a.distinct_users AS users_ok
FROM migration_baseline b
CROSS JOIN (
SELECT count(*) AS row_count,
sum(amount) AS total_amount,
count(distinct user_id) AS distinct_users
FROM orders
) a;
If all three columns are true, the row count, the amount total, and the distinct user count are all preserved. Such aggregate verification is light and fast, and far more practical than a full comparison even on large tables.
Sample verification and full verification
There is a balance between verification cost and confidence.
| Method | Cost | Confidence | When it fits |
|---|---|---|---|
| Sample verification | Low | Medium | Quick first check, large volume |
| Aggregate verification | Low | Medium-high | Confirm sum/count preservation |
| Checksum verification | Medium | High | Column-level consistency |
| Full comparison | High | Very high | Integrity-critical domains like finance |
In practice you usually run aggregate verification and checksum verification by default, and add full comparison only for areas you suspect.
Incident Response Runbook
No matter how well you prepare, a migration can go wrong. Improvising when it does makes things worse. With a pre-written runbook, you can follow a defined procedure even under pressure.
=== Migration Incident Runbook ===
[0] Detect
- alarms: rising error rate, increasing latency, spiking lock waits
- immediately designate who the Incident Commander (IC) is
[1] Assess (within 5 minutes)
- is the change reversible? (see the reversibility table above)
- is data being corrupted, or is it merely stuck?
- blast radius: all users, or a subset?
[2] Stop the bleeding
- if a feature flag exists, turn it OFF immediately (fastest and safest)
- if a backfill job is running, pause it
- shift traffic to the old path if needed
[3] Decide on recovery
- flag OFF is enough → don't touch data (preferred)
- only the schema needs reverting → run the down migration
- data is corrupted → consider restoring from backup/PITR
[4] Execute recovery
- confirm the recovery point (pre-migration snapshot exists?)
- IC approval before executing
- re-run multi-stage verification after recovery
[5] Confirm stabilization
- the 4-stage verification (schema/data/performance/smoke) passes
- observe whether metrics return to baseline
[6] Communicate
- update the status page, share with stakeholders
- record the timeline (postmortem material)
The most important principle in the runbook is that recovery which touches data is a last resort. Where possible, stop the bleeding first with a method that does not touch data, such as flag OFF or traffic shifting. A backup restore itself means another data loss (the data after the restore point), so it must be decided with the greatest care.
Postmortem
Resolving the incident is not the end. To keep the same thing from recurring, you write a postmortem. The key is a blameless attitude. Instead of "who made the mistake," you ask "what systemic conditions made this mistake possible."
A good postmortem contains the following.
- Timeline: what happened and when. The times of detection, assessment, and recovery.
- Impact: how many users, for how long, and in what way they were affected.
- Root cause: the structural cause beneath the surface symptom, not the symptom itself. Asking "why" five times.
- What went well: record the parts where the safety net worked, too. If a feature flag let you stop quickly, that is a lesson as well.
- Action items: concrete action items with an owner and a deadline.
For example, if "the backfill was too slow" is a symptom, the root cause might be "there was no batch-size setting, so the whole thing was locked in a single transaction," and the action item becomes "apply a batch-processing template to all backfill migrations, and make a pre-rehearsal mandatory for large-table migrations."
Pitfalls
Let me restate the content so far as pitfalls you often meet in practice.
Pitfall 1: Believing it is safe because a down script exists. As we saw, the existence of a down script does not guarantee data restoration. A down that reverts a DROP only revives the column structure, not the data. Always suspect whether a down script is genuinely reversible.
Pitfall 2: Changing schema and code at once. Change the schema and code in the same deployment, and an outage occurs in the brief inconsistency window between them. Separate them with expand-contract, and control the timing with feature flags.
Pitfall 3: Running it for the first time in production. A migration not rehearsed on production-scale data meets its lock time or backfill time for the first time in production. A pre-rehearsal on a production replica tells you the expected duration and lock risk in advance.
Pitfall 4: Backfilling in a single transaction. Processing millions of rows in one UPDATE produces a huge lock, a long transaction, and a WAL explosion. Always split it into batches.
Pitfall 5: Declaring success without verification. A migration ending without errors is not success. Success is passing the multi-stage verification. In particular, do not skip data verification and the application smoke test.
Pitfall 6: Proceeding without a recovery point. Without a pre-migration snapshot or PITR configuration, in the worst case you have nowhere to return to. Step 0 of the runbook is always confirming the recovery point is secured.
Pitfall 7: Leaving feature flags forever. Flags are temporary devices. Fail to clean up the flag and the old path after stabilization, and code complexity and another kind of risk accumulate.
Final Checklist
Before applying a migration to production, check the following.
[ Preparation ]
[ ] classify the change's reversibility (reversible/partial/irreversible)
[ ] decompose irreversible changes with expand-contract
[ ] do up/down (or undo) scripts exist as a pair
[ ] does the down script genuinely preserve data (or state that it cannot)
[ ] rehearsed on a production replica (measured duration/locks)
[ Safety net ]
[ ] pre-migration snapshot or PITR recovery point secured
[ ] large backfill written as batch processing
[ ] used CONCURRENTLY for index creation
[ ] separated the new behavior with a feature flag
[ Verification ]
[ ] schema verification queries ready (column/type/index/constraint)
[ ] data verification queries ready (NULL/row count/distribution)
[ ] aggregate baseline stored and post-comparison automated
[ ] checksum comparison (old/new structure consistency)
[ ] performance verification (EXPLAIN, lock-wait check)
[ ] application smoke test ready
[ Rollout ]
[ ] canary/gradual rollout plan (1% → 100%)
[ ] metrics to observe at each step defined (error rate/latency/consistency)
[ Response ]
[ ] incident runbook written and shared
[ ] agreed on how to designate the IC (Incident Commander)
[ ] stating that stop-the-bleeding priority 1 = flag OFF
[ ] postmortem template ready
[ Cleanup ]
[ ] register flag/old-path cleanup task in the backlog after stabilization
Conclusion
The real craft in database migrations is not "rolling back well," but designing changes so that a dangerous rollback is unnecessary in the first place. Decompose irreversible changes into a sequence of reversible steps, separate schema changes from code changes with feature flags, secure a recovery point just before the change, and verify in multiple layers after applying.
What all these devices create is a single safety net. The purpose of a safety net is not to never fall, but to survive when you do. A migration will go wrong someday. What we will rely on then is not heroic improvisation, but the reversibility, backups, verification, and runbook we designed in advance. Build the safety net first, and change on top of it.
References
- Flyway Documentation — Undo Migrations
- Flyway official site
- golang-migrate/migrate (GitHub)
- Liquibase — Rollback
- Liquibase official site
- Martin Fowler — Evolutionary Database Design
- Martin Fowler — Parallel Change (expand and contract)
- Martin Fowler — Feature Toggles (Feature Flags)
- PostgreSQL Documentation — ALTER TABLE
- PostgreSQL Documentation — Continuous Archiving and Point-in-Time Recovery
- PostgreSQL Documentation — Backup and Restore