Skip to content

필사 모드: Migration Rollback and Verification — Designing a Safety Net

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

Rolling back a deployment is usually simple. You bring up the previous container image again, or shift traffic back to the previous deployment. Code is idempotent, stateless, and can be swapped cleanly at any time. Database migrations, however, betray this intuition head-on.

A migration mutates state. Drop a column, and the data inside it is gone. Change a type, and you may lose precision during the conversion. Fail halfway through backfilling millions of rows, and you are left with a half-finished state where some rows hold the new value and others hold the old one. A code rollback means "return to the previous state," but a data rollback often means "make what already happened un-happen." And in the world of data, "un-happening" is almost always impossible.

This post is about designing a safety net so you can handle migrations safely. There is one central question: "When this change goes wrong, how do we survive?" We will walk through how to write migrations so that rollback is possible, how to capture backups and recovery points, how to verify schema and data and performance and the application in stages, how to spread risk with canaries and feature flags, and finally the runbook and postmortem you follow when something does go wrong.

Here is what this post covers.

- Why rollback is inherently hard, and the irreversibility of data loss

- Distinguishing reversible changes from irreversible ones

- Patterns for writing rollback-able migrations

- Strategies for backups and recovery points

- Multi-stage verification: schema, data, performance, application smoke

- Canary deployments and gradual rollout

- Separating code from schema with feature flags

- Automated data verification using checksums and re-aggregation

- Incident response runbooks, postmortems, and a final checklist

Why Rollback Is Hard

Data loss is irreversible

Much of software engineering is reversible. Bad code is undone with git revert; a wrongly launched server is taken down again. But data is different. A row once deleted, a value once overwritten, a string once truncated cannot be restored without the original.

Nothing demonstrates this more clearly than dropping a column.

-- This migration cannot be "rolled back."

ALTER TABLE users DROP COLUMN legacy_phone;

Syntactically you can write a "down" migration like this.

-- This only restores the column "structure," not the data.

ALTER TABLE users ADD COLUMN legacy_phone VARCHAR(20);

But this down migration is a lie. The column comes back, but every phone number that lived inside it is gone forever. The migration tool knows only that "a down script exists"; it has no idea whether that script performs a meaningful restoration. This is the first trap that makes rollback dangerous. The "reversal" a tool guarantees and the "reversal" that actually preserves data are entirely different concepts.

As time passes, new data accumulates

The second factor that makes rollback hard is time. If you catch a problem within five minutes of deploying a migration, not much data has come in during that window. But if you catch it six hours later, users have generated tens of thousands of transactions on top of the new schema in the meantime.

Now, returning to the old schema means deciding what to do with the new data accumulated over those six hours. Discard it? Transform it? Values written into a column that exists only in the new schema have nowhere to go in the old one. It is no longer a simple "reversal" but a data consistency problem between two schemas.

T0 migration applied (column added, NOT NULL)

T0+5m deploy complete, traffic normalized

T0+10m users start writing data into the new column

...

T0+6h performance issue detected → decision to roll back

now, what about six hours of accumulated new data?

a naive down migration ignores or destroys it

Locks and large tables

The third factor is the physical constraints of a production environment. The moment you run an ALTER on a table with hundreds of millions of rows, depending on the database engine the entire table may be locked, and every read and write may halt during that time. The rollback migration itself can trigger another long lock, turning the rollback into a bigger incident than the original.

So a good migration strategy starts not from "how to roll back well," but from designing changes so that rollback is unnecessary in the first place, or at least safe.

Reversible Changes vs Irreversible Changes

If you place every schema change on a single axis, at one end sit perfectly reversible changes, and at the other end sit completely irreversible ones.

| Change type | Reversibility | Risk | Notes |

| --- | --- | --- | --- |

| Add column (nullable) | High | Low | Just drop it; no impact on existing data |

| Add index | High | Low | Drop to restore; only performance impact |

| Add table | High | Low | Just drop it |

| Rename column | Medium | Medium | Must sync with application code |

| Add column (NOT NULL + default) | Medium | Medium | Needs backfill; lock risk if large |

| Type change (widening) | Medium | Medium | int to bigint, usually safe |

| Type change (narrowing) | Low | High | Possible precision loss; verify data |

| Drop column | Very low | High | Permanent data loss |

| Drop table | Very low | Very high | Permanent data loss |

| Data transform backfill | Low | High | Irreversible if original not preserved |

The most important lesson this table offers is this: decompose an irreversible change into a sequence of reversible changes. Instead of dropping a column, first make the code stop reading it, observe for a while, and then drop it only once you judge it sufficiently safe. Every step in between is reversible.

The Expand-Contract pattern

The formalization of this decomposition is the expand-and-contract pattern, also called parallel change. As Martin Fowler captured it in evolutionary database design, this pattern consists of three phases.

1. Expand

- add the new structure while leaving the old one in place

- e.g., add a new column, keep the existing column

- code writes to both, or reads the new one first with the old as fallback

2. Migrate

- backfill existing data into the new structure

- record incoming data into both structures (dual write)

- verify: confirm the data in both structures matches

3. Contract

- confirm the code no longer references the old structure

- after a sufficient observation period, remove the old structure

The heart of this pattern is that each phase is deployed independently and each phase is reversible. If the expand phase has a problem, drop the new column. If the migrate phase has a problem, turn off dual write and use only the old column. The most dangerous contract phase is executed only after every preceding phase has been confirmed stable.

Writing Rollback-able Migrations

Always pair up and down

Most migration tools have you write up (apply) and down (revert) scripts as a pair. golang-migrate expresses this through a file naming convention.

000012_add_user_status.up.sql

000012_add_user_status.down.sql

The up script looks like this.

-- 000012_add_user_status.up.sql

ALTER TABLE users ADD COLUMN status VARCHAR(20);

The down script looks like this.

-- 000012_add_user_status.down.sql

ALTER TABLE users DROP COLUMN status;

Here the status column was newly added, so dropping it loses no data. This is a genuinely reversible migration.

Flyway's Undo migrations

Flyway expresses undo migrations with the U prefix. The revert script corresponding to versioned migration V12 is U12.

-- V12__add_user_status.sql

ALTER TABLE users ADD COLUMN status VARCHAR(20);

-- U12__add_user_status.sql

ALTER TABLE users DROP COLUMN status;

But as the Flyway documentation itself emphasizes, undo is not a cure-all. No undo can revive DROPped data, and undo scripts must be reviewed carefully.

Liquibase's automatic and manual rollback

Liquibase handles rollback at the changeset level. Some changes can have their rollback method inferred automatically; otherwise the developer writes an explicit rollback clause.

databaseChangeLog:

- changeSet:

id: add-user-status

author: youngjukim

changes:

- addColumn:

tableName: users

columns:

- column:

name: status

type: varchar(20)

rollback:

- dropColumn:

tableName: users

columnName: status

Turning irreversible changes into reversible ones

The truly dangerous changes are the ones that destroy data. If you must drop a column, do not DROP it outright; decompose it as follows.

-- Step 1: don't delete immediately; rename to mark it "deprecated"

ALTER TABLE users RENAME COLUMN legacy_phone TO legacy_phone_deprecated;

-- Step 2 (days later, after confirming no one references it): preserve in an archive table

CREATE TABLE archive_users_legacy_phone AS

SELECT id, legacy_phone_deprecated

FROM users

WHERE legacy_phone_deprecated IS NOT NULL;

-- Step 3 (after sufficient observation): the real delete

ALTER TABLE users DROP COLUMN legacy_phone_deprecated;

This way, steps 1 and 2 are both reversible, and the data is safely preserved in the archive table until step 3 runs. Even if a problem surfaces after step 3, you can restore the data from the archive table.

Patterns for avoiding locks in PostgreSQL

Adding a NOT NULL column to a large table is risky. In PostgreSQL, you minimize locking in the following order.

-- 1. add as nullable (metadata-only change, fast)

ALTER TABLE orders ADD COLUMN region VARCHAR(10);

-- 2. backfill in batches (don't lock the whole thing at once)

UPDATE orders SET region = 'UNKNOWN'

WHERE region IS NULL AND id BETWEEN 1 AND 100000;

-- repeat, shifting the range

-- 3. add the NOT NULL constraint as NOT VALID first

ALTER TABLE orders ADD CONSTRAINT orders_region_not_null

CHECK (region IS NOT NULL) NOT VALID;

-- 4. validate separately (this step avoids the ACCESS EXCLUSIVE lock)

ALTER TABLE orders VALIDATE CONSTRAINT orders_region_not_null;

Indexes likewise use the CONCURRENTLY option to avoid locking the table.

CREATE INDEX CONCURRENTLY idx_orders_region ON orders (region);

Backups and Recovery Points

If a rollback script can be a lie, what is the real safety net? It is backups. On the premise that any migration can go wrong, we must always secure a recovery point that lets us return to the state just before the change.

Recovery Point Objective and Recovery Time Objective

You need to understand two key metrics.

| Metric | Meaning | In migration context |

| --- | --- | --- |

| RPO (Recovery Point Objective) | How much data you can afford to lose | If a pre-migration snapshot exists, RPO is that point |

| RTO (Recovery Time Objective) | How long recovery may take | The time to restore from backup |

If you take a snapshot before running the migration, in the worst case you can restore from that snapshot and guarantee an RPO at the migration's start time. But you lose all data that came in afterward, so this is truly a last resort.

Pre-migration snapshots

Just before running a large migration in production, always take a snapshot.

logical backup (small tables, specific schema)

pg_dump --format=custom --table=users --table=orders \

--file=/backups/pre_migration_$(date +%Y%m%d_%H%M%S).dump \

mydb

when you want to quickly preserve only a specific table (CTAS)

psql mydb -c "CREATE TABLE users_backup_20260616 AS TABLE users;"

For a cloud-managed database, using the snapshot feature is far faster. The point is that the first step of a migration runbook should always be "confirm the recovery point is secured."

Point-in-time recovery (PITR)

PostgreSQL's point-in-time recovery preserves the WAL (write-ahead log) so you can restore precisely to a particular moment. If a migration corrupted data at T0+30 minutes, you can restore to just before T0. To use PITR, a base backup and continuous WAL archiving must be configured in advance.

recovery flow:

base backup (T-1d) + WAL archive (T-1d to now)

→ recovery_target_time = 'just before T0'

→ replay all transactions up to that moment to restore

Multi-Stage Verification

After a migration is applied, what should you confirm before declaring "success"? Verification should not be a single pass/fail but several layers.

┌─────────────────────────────────────────┐

│ 4. Application smoke test │ do core user flows work

├─────────────────────────────────────────┤

│ 3. Performance verification │ query plans, latency, locks

├─────────────────────────────────────────┤

│ 2. Data verification │ row counts, checksums, integrity

├─────────────────────────────────────────┤

│ 1. Schema verification │ did the structure change as intended

└─────────────────────────────────────────┘

verify from the bottom up

Stage 1: Schema verification

First, confirm the schema changed as intended. Query the system catalog to check whether the column was added, the type is correct, and the constraints and indexes exist.

-- whether the column exists with the expected type

SELECT column_name, data_type, is_nullable

FROM information_schema.columns

WHERE table_name = 'orders' AND column_name = 'region';

-- whether the index was created

SELECT indexname FROM pg_indexes

WHERE tablename = 'orders' AND indexname = 'idx_orders_region';

Stage 2: Data verification

A correct schema does not mean correct data. Confirm the backfill processed every row, no NULLs remain, and the transformed values are correct.

-- whether the backfill finished: NULLs should be 0

SELECT count(*) AS remaining_nulls

FROM orders WHERE region IS NULL;

-- whether the row count is preserved before and after the migration

SELECT count(*) AS total_rows FROM orders;

-- look at the distribution to check for abnormal values

SELECT region, count(*) FROM orders GROUP BY region ORDER BY 2 DESC;

Stage 3: Performance verification

Even with correct schema and data, if a new index is missing and queries fall back to full scans, you get an outage in production. Check the execution plan of core queries.

EXPLAIN (ANALYZE, BUFFERS)

SELECT * FROM orders WHERE region = 'KR' ORDER BY created_at DESC LIMIT 50;

Confirm it uses the new index, the estimated cost is reasonable, and the actual execution time falls within your baseline. Also watch whether lock waits are piling up.

-- check the current lock-wait situation

SELECT pid, state, wait_event_type, query

FROM pg_stat_activity

WHERE wait_event_type IS NOT NULL;

Stage 4: Application smoke test

Finally, from the actual application's perspective, confirm that core user flows work. It is common for the database to look fine while the application code fails to understand the new schema.

smoke check on core endpoints

curl -fsS https://api.example.com/health

curl -fsS https://api.example.com/orders/recent | jq '.items | length'

Automate these four stages and make them part of the migration pipeline, and the definition of "success" becomes clear. It is success only after passing all four stages.

Canaries and Gradual Rollout

Canaries are familiar from code deployment. You deploy the new version to a subset of instances first, observe the metrics, then expand gradually. You can apply similar thinking to database migrations.

A schema change itself is usually all-or-nothing, but the code paths that use that schema can be turned on gradually. Combined with the expand-contract pattern, you apply adding a new column to everyone, but you turn on the logic that actually reads and writes that new column for a small slice of traffic first.

Phase 1: schema expand (applied to all, low risk)

add new column — no one uses it yet

Phase 2: backfill + dual write (applied to all, observe)

record new data into both, backfill existing data

Phase 3: enable the new path for 1% of traffic (canary)

observe metrics — error rate, latency, data consistency

Phase 4: ramp 5% → 25% → 50% → 100% gradually

pause and verify at each step

Phase 5: disable the old path, contract (remove the old column)

What should you observe at each canary step? Application error rate and latency are table stakes, but you must also watch whether the data produced by the new path stays consistent with the data from the old path. This consistency check ties directly into the automated data verification in the next section.

Separating with Feature Flags

The trickiest part of a migration is that the deployment timing of the schema change and the code change drift apart. Change the schema first, and the old code may break; change the code first, and the new code may reference a schema that does not yet exist.

A feature flag (or feature toggle) is a powerful tool for separating the two. Expand the schema in advance, deploy all the code, but keep the new behavior off behind a flag. Then, once everything is ready, flip the flag on. Turning a flag on and off is instantaneous and reversible, so you effectively get a "rollback that never touches data."

separate the new schema path with a feature flag

def get_order_region(order, flags):

if flags.is_enabled("use_new_region_column", order.user_id):

new path: read the new column directly

return order.region

else:

old path: compute with the existing logic

return derive_region_from_address(order.shipping_address)

on the write path, safely dual-write

def save_order_region(order, region, flags):

order.region = region # always write to the new column (expand phase)

if flags.is_enabled("backfill_legacy_region"):

order.legacy_region_code = to_legacy_code(region) # keep the old column too

The advantage of this structure is that if a problem is found on the new path, you flip the flag off and instantly return to the old path without touching the database. You substitute the dangerous work of a migration rollback with the safe work of a flag toggle.

There is one caveat, though. Leave flags around forever and the code fills with branches and grows complex. Flags are temporary devices. Once the new path stabilizes, you must always leave a follow-up task in the backlog to clean up the flag and the old path together.

Automated Data Verification

In a large data migration, having a human eyeball every row is impossible. You need automated data verification. The two key techniques are checksums and re-aggregation.

Confirming data match with checksums

In the expand-contract pattern, when the old and new structures should hold the same data, you can quickly confirm whether they match by comparing the checksums of both structures.

-- compare old and new columns by checksum to confirm they are semantically equal

SELECT

md5(string_agg(legacy_region_code, ',' ORDER BY id)) AS old_checksum,

md5(string_agg(to_legacy_code(region), ',' ORDER BY id)) AS new_checksum

FROM orders;

If the two checksums are equal, it is strong evidence that the transformation was applied consistently across every row. If they differ, you can narrow in on which rows diverged.

-- find only the mismatching rows

SELECT id, region, legacy_region_code, to_legacy_code(region) AS expected

FROM orders

WHERE legacy_region_code IS DISTINCT FROM to_legacy_code(region);

Confirming preservation with re-aggregation

In a migration that transforms data, aggregate values like sums and counts often must be preserved. Comparing aggregates before and after the migration confirms that data was neither lost nor duplicated.

-- before the migration, store a baseline aggregate

CREATE TABLE migration_baseline AS

SELECT

count(*) AS row_count,

sum(amount) AS total_amount,

count(distinct user_id) AS distinct_users

FROM orders;

-- after the migration, recompute the same aggregate and compare

SELECT

b.row_count = a.row_count AS row_count_ok,

b.total_amount = a.total_amount AS amount_ok,

b.distinct_users = a.distinct_users AS users_ok

FROM migration_baseline b

CROSS JOIN (

SELECT count(*) AS row_count,

sum(amount) AS total_amount,

count(distinct user_id) AS distinct_users

FROM orders

) a;

If all three columns are true, the row count, the amount total, and the distinct user count are all preserved. Such aggregate verification is light and fast, and far more practical than a full comparison even on large tables.

Sample verification and full verification

There is a balance between verification cost and confidence.

| Method | Cost | Confidence | When it fits |

| --- | --- | --- | --- |

| Sample verification | Low | Medium | Quick first check, large volume |

| Aggregate verification | Low | Medium-high | Confirm sum/count preservation |

| Checksum verification | Medium | High | Column-level consistency |

| Full comparison | High | Very high | Integrity-critical domains like finance |

In practice you usually run aggregate verification and checksum verification by default, and add full comparison only for areas you suspect.

Incident Response Runbook

No matter how well you prepare, a migration can go wrong. Improvising when it does makes things worse. With a pre-written runbook, you can follow a defined procedure even under pressure.

=== Migration Incident Runbook ===

[0] Detect

- alarms: rising error rate, increasing latency, spiking lock waits

- immediately designate who the Incident Commander (IC) is

[1] Assess (within 5 minutes)

- is the change reversible? (see the reversibility table above)

- is data being corrupted, or is it merely stuck?

- blast radius: all users, or a subset?

[2] Stop the bleeding

- if a feature flag exists, turn it OFF immediately (fastest and safest)

- if a backfill job is running, pause it

- shift traffic to the old path if needed

[3] Decide on recovery

- flag OFF is enough → don't touch data (preferred)

- only the schema needs reverting → run the down migration

- data is corrupted → consider restoring from backup/PITR

[4] Execute recovery

- confirm the recovery point (pre-migration snapshot exists?)

- IC approval before executing

- re-run multi-stage verification after recovery

[5] Confirm stabilization

- the 4-stage verification (schema/data/performance/smoke) passes

- observe whether metrics return to baseline

[6] Communicate

- update the status page, share with stakeholders

- record the timeline (postmortem material)

The most important principle in the runbook is that recovery which touches data is a last resort. Where possible, stop the bleeding first with a method that does not touch data, such as flag OFF or traffic shifting. A backup restore itself means another data loss (the data after the restore point), so it must be decided with the greatest care.

Postmortem

Resolving the incident is not the end. To keep the same thing from recurring, you write a postmortem. The key is a blameless attitude. Instead of "who made the mistake," you ask "what systemic conditions made this mistake possible."

A good postmortem contains the following.

- **Timeline**: what happened and when. The times of detection, assessment, and recovery.

- **Impact**: how many users, for how long, and in what way they were affected.

- **Root cause**: the structural cause beneath the surface symptom, not the symptom itself. Asking "why" five times.

- **What went well**: record the parts where the safety net worked, too. If a feature flag let you stop quickly, that is a lesson as well.

- **Action items**: concrete action items with an owner and a deadline.

For example, if "the backfill was too slow" is a symptom, the root cause might be "there was no batch-size setting, so the whole thing was locked in a single transaction," and the action item becomes "apply a batch-processing template to all backfill migrations, and make a pre-rehearsal mandatory for large-table migrations."

Pitfalls

Let me restate the content so far as pitfalls you often meet in practice.

**Pitfall 1: Believing it is safe because a down script exists.** As we saw, the existence of a down script does not guarantee data restoration. A down that reverts a DROP only revives the column structure, not the data. Always suspect whether a down script is genuinely reversible.

**Pitfall 2: Changing schema and code at once.** Change the schema and code in the same deployment, and an outage occurs in the brief inconsistency window between them. Separate them with expand-contract, and control the timing with feature flags.

**Pitfall 3: Running it for the first time in production.** A migration not rehearsed on production-scale data meets its lock time or backfill time for the first time in production. A pre-rehearsal on a production replica tells you the expected duration and lock risk in advance.

**Pitfall 4: Backfilling in a single transaction.** Processing millions of rows in one UPDATE produces a huge lock, a long transaction, and a WAL explosion. Always split it into batches.

**Pitfall 5: Declaring success without verification.** A migration ending without errors is not success. Success is passing the multi-stage verification. In particular, do not skip data verification and the application smoke test.

**Pitfall 6: Proceeding without a recovery point.** Without a pre-migration snapshot or PITR configuration, in the worst case you have nowhere to return to. Step 0 of the runbook is always confirming the recovery point is secured.

**Pitfall 7: Leaving feature flags forever.** Flags are temporary devices. Fail to clean up the flag and the old path after stabilization, and code complexity and another kind of risk accumulate.

Final Checklist

Before applying a migration to production, check the following.

[ Preparation ]

[ ] classify the change's reversibility (reversible/partial/irreversible)

[ ] decompose irreversible changes with expand-contract

[ ] do up/down (or undo) scripts exist as a pair

[ ] does the down script genuinely preserve data (or state that it cannot)

[ ] rehearsed on a production replica (measured duration/locks)

[ Safety net ]

[ ] pre-migration snapshot or PITR recovery point secured

[ ] large backfill written as batch processing

[ ] used CONCURRENTLY for index creation

[ ] separated the new behavior with a feature flag

[ Verification ]

[ ] schema verification queries ready (column/type/index/constraint)

[ ] data verification queries ready (NULL/row count/distribution)

[ ] aggregate baseline stored and post-comparison automated

[ ] checksum comparison (old/new structure consistency)

[ ] performance verification (EXPLAIN, lock-wait check)

[ ] application smoke test ready

[ Rollout ]

[ ] canary/gradual rollout plan (1% → 100%)

[ ] metrics to observe at each step defined (error rate/latency/consistency)

[ Response ]

[ ] incident runbook written and shared

[ ] agreed on how to designate the IC (Incident Commander)

[ ] stating that stop-the-bleeding priority 1 = flag OFF

[ ] postmortem template ready

[ Cleanup ]

[ ] register flag/old-path cleanup task in the backlog after stabilization

Conclusion

The real craft in database migrations is not "rolling back well," but designing changes so that a dangerous rollback is unnecessary in the first place. Decompose irreversible changes into a sequence of reversible steps, separate schema changes from code changes with feature flags, secure a recovery point just before the change, and verify in multiple layers after applying.

What all these devices create is a single safety net. The purpose of a safety net is not to never fall, but to survive when you do. A migration will go wrong someday. What we will rely on then is not heroic improvisation, but the reversibility, backups, verification, and runbook we designed in advance. Build the safety net first, and change on top of it.

References

- [Flyway Documentation — Undo Migrations](https://documentation.red-gate.com/fd/undo-migrations-184127627.html)

- [Flyway official site](https://flywaydb.org/)

- [golang-migrate/migrate (GitHub)](https://github.com/golang-migrate/migrate)

- [Liquibase — Rollback](https://docs.liquibase.com/workflows/liquibase-community/using-rollback.html)

- [Liquibase official site](https://www.liquibase.com/)

- [Martin Fowler — Evolutionary Database Design](https://martinfowler.com/articles/evodb.html)

- [Martin Fowler — Parallel Change (expand and contract)](https://martinfowler.com/bliki/ParallelChange.html)

- [Martin Fowler — Feature Toggles (Feature Flags)](https://martinfowler.com/articles/feature-toggles.html)

- [PostgreSQL Documentation — ALTER TABLE](https://www.postgresql.org/docs/current/sql-altertable.html)

- [PostgreSQL Documentation — Continuous Archiving and Point-in-Time Recovery](https://www.postgresql.org/docs/current/continuous-archiving.html)

- [PostgreSQL Documentation — Backup and Restore](https://www.postgresql.org/docs/current/backup.html)

현재 단락 (1/325)

Rolling back a deployment is usually simple. You bring up the previous container image again, or shi...

작성 글자: 0원문 글자: 24,332작성 단락: 0/325