Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction — The Most Expensive Lessons Come from Others Incidents

Outages are expensive teachers. The lesson you learn at 3 a.m. while recovering a production database stays with you for life, but its tuition is paid in lost revenue, eroded trust, and team burnout. Fortunately you do not have to learn every lesson through your own incident. Learn from others postmortems and you can dodge the same traps in advance.

Migrations rank near the top of outage statistics, because a system that ran fine for ages collapses in the single moment you change its schema. This article lays out typical migration failure modes, shows the reasoning "from symptom to cause to lesson" through three hypothetical postmortems, and closes with prevention patterns and comprehensive checklists.

Common Migration Failure Modes

First, let us catalog the patterns that recur often.

+---------------------+--------------------------------------------------+

| Failure mode | What goes wrong |

+---------------------+--------------------------------------------------+

| Lock storm | ALTER grabs a strong lock, all queries wait |

| Replication lag | bulk change makes replicas fall behind, reads |

| | go inconsistent |

| Data loss | bad transform/delete loses unrecoverable data |

| No rollback | problem found after an irreversible change |

| Timeout | migration runs too long, deploy/conn times out |

| Double-run | migration runs twice at once, state corrupted |

+---------------------+--------------------------------------------------+

Each mode has a different root cause, but they share one thing: they all "looked fine when tested at small scale." The scale effects produced by production data volume, traffic, and replication topology are what bring on the incident.

Lock Storm

`ALTER TABLE`, depending on its kind, grabs a strong lock like ACCESS EXCLUSIVE. That lock blocks all reads and writes on the table. On a small table it is instantaneous, but on a hundred-million-row table it takes minutes, and meanwhile the piled-up queries fill the connection pool and cascade into a wider outage.

ALTER TABLE (acquires strong lock, takes 5 min)

[query1] wait... [query2] wait... [query3] wait...

connection pool exhausted -> new requests rejected -> full outage

Replication Lag

A bulk UPDATE or backfill is fast on the primary, but replicas must apply the changes sequentially to catch up. If you spread reads across replicas, then during the lag users see stale data or fail to read what they just wrote (broken read-your-writes).

Data Loss

The most fatal. You DROP a column whose data was backed up nowhere, a bug in the transform logic overwrites values incorrectly, or an UPDATE missing its `WHERE` clause corrupts every row. Data loss does not come back with a code rollback.

Hypothetical Postmortem 1 — The Midnight NOT NULL

Let us follow it from symptom to cause to lesson.

[Symptom]

02:14 deploy starts (add country column to users table)

02:15 API response times spike, timeouts surge

02:17 connection pool exhaustion alert

02:31 migration force-stopped, recovery begins

02:48 recovery complete (34 min outage total)

[Cause]

Migration:

ALTER TABLE users ADD COLUMN country VARCHAR(2) NOT NULL DEFAULT 'KR';

The users table has 80 million rows. On an older database, adding a

NOT NULL column with a DEFAULT rewrites every row, holding an

ACCESS EXCLUSIVE lock the whole time. Staging had 50k rows, so it

finished in 0.1s and the problem went undetected.

[Lessons]

1. Rehearse the migration in an environment with production-like volume.

2. Split NOT NULL + DEFAULT into add nullable -> backfill -> add constraint.

3. Use migration linting to catch dangerous patterns in the PR.

The core lesson is "if staging's data volume differs from production, passing staging does not guarantee safety." Locks grow in proportion to data volume.

Hypothetical Postmortem 2 — The Vanished Address

[Symptom]

The day after deploy, a flood of support tickets: "my shipping address

is gone." Some users address field shows up empty.

[Cause]

A migration changing address from a single string to a structured object:

- add a new address_json column

- parse the old address string and fill address_json

- DROP the old address column

The parser could not handle a certain format (commaless addresses), so

about 3% became empty objects. Because the old column was DROPped in the

same deploy, the original was gone and recovery impossible.

[Lessons]

1. Do not put the transform and the deletion of old data in one deploy

(expand-contract).

2. Drop the old column only after a sufficient verification period.

3. Run sample/invariant verification before and after the transform

(check the empty-value ratio).

4. Always back up before destructive work, and confirm the PITR point.

The true cause was not the parser bug itself but that the old data was deleted in the same deploy without verification. Had the old column remained, recovery would have taken 30 minutes.

Hypothetical Postmortem 3 — The Twice-Run Migration

[Symptom]

Duplicate records found in the balance-change history table.

Some users points were credited twice.

[Cause]

During a Kubernetes rollout, the migration ran via an init container.

A data-backfill migration (recomputing points for past transactions)

started in two pods almost simultaneously. Because the backfill SQL was

not idempotent (INSERT only, no dup check), it inserted the same data

twice. The migration version lock covered the DDL, but part of the

backfill logic was written to run outside that lock.

[Lessons]

1. Write every migration, backfills included, to be idempotent

(INSERT ... ON CONFLICT, or check-exists-then-insert).

2. Guarantee single execution for migrations (Job pattern, explicit lock).

3. Avoid the concurrent init-container pattern.

4. For money-related backfills, dry-run to confirm affected row counts first.

Idempotency is a migration's seatbelt. Whether run once or ten times, the result must be the same to withstand the realities of retries and concurrency in a distributed environment.

Hypothetical Postmortem 4 — The Replica That Could Not Keep Up

[Symptom]

After a bulk-backfill deploy, some users complain "the setting I just

changed does not show up." Read-only APIs return stale data.

[Cause]

A bulk UPDATE (recompute every user's tier) ran at once on a 300M-row

table. It finished in 8 minutes on the primary, but the three read

replicas applied this change sequentially and fell behind by up to

22 minutes. Read traffic was spread across replicas, so during that

time users saw stale data.

[Lessons]

1. Split bulk changes into batches to give replicas time to catch up.

2. Monitor replication lag between batches and wait when over threshold.

3. For paths where read consistency matters, read from the primary or

guarantee session consistency.

4. Keep replication-lag alerts on during a backfill.

The pitfall of bulk changes is that success on the primary is not success for the system. It is truly done only when the replicas have converged too. The key is to proceed while checking replication lag between batches.

-- Batched backfill (10k rows each, with waits, mindful of replication lag)

DO $$

DECLARE

rows_affected int;

BEGIN

LOOP

UPDATE users SET tier = compute_tier(score)

WHERE id IN (

SELECT id FROM users WHERE tier IS NULL LIMIT 10000

);

GET DIAGNOSTICS rows_affected = ROW_COUNT;

EXIT WHEN rows_affected = 0;

PERFORM pg_sleep(0.5); -- time for replicas to catch up

END LOOP;

END $$;

Hypothetical Postmortem 5 — The Timeout Cascade

[Symptom]

The migration step in the CD pipeline fails on a timeout after 10 minutes.

But the migration keeps running in the DB. A retry starts a second

migration, causing lock contention.

[Cause]

Index creation took longer than expected, exceeding the pipeline timeout

(10 min). The pipeline declared failure, but the DB session stayed alive

and kept building the index. An automatic retry started the same

migration again -> lock conflict -> both stuck.

[Lessons]

1. Set the migration timeout generously above the actual duration.

2. On timeout, definitively clean up (cancel) the DB session before retry.

3. Turn off automatic retry, or block double-runs with idempotency + locks.

4. Run long operations (index creation) asynchronously outside the migration.

A timeout is not merely a "slow" signal; it exposes gaps in failure handling. The pattern where a zombie session lingers after a timeout and conflicts with the next run is common and dangerous.

Safe Online DDL Tools

There are dedicated tools for changing a large table's schema with zero downtime. Instead of ALTERing the original table directly, they create a new table, copy the data over incrementally, and swap it in.

+------------------+------------------+--------------------------------+

| Tool | Target | Method |

+------------------+------------------+--------------------------------+

| gh-ost | MySQL | binlog-based, no triggers |

| pt-online-schema| MySQL | trigger-based shadow table |

| -change | | |

| pg_repack | PostgreSQL | removes bloat, rewrites table |

+------------------+------------------+--------------------------------+

The operating principle is commonly as follows.

1. Create a new table with the same structure (with the desired change)

2. Copy the original data into the new table in chunks

3. Track changes that occur during the copy and apply them to the new

table (triggers or binlog)

4. Once caught up, swap the table names under a brief lock (atomic swap)

5. Clean up the old table

The advantage is changing a large table without a long lock. The downsides are complexity, double the disk usage, and trickiness when foreign keys or triggers exist. For production-grade large-table changes, such tools are effectively the standard.

Prevention Patterns — Small, Reversible, Verified, Observed

Let us distill the prevention principles common to all three postmortems.

Small

Break a big migration into small steps. Small changes run fast, hold locks briefly, and have a narrow blast radius on failure.

Bad: ALTER TABLE ... (rewrites the whole table at once, 5-min lock)

Good: add nullable (fast) -> batched backfill (10k rows each) -> add constraint (fast)

Reversible

Make every change undoable with expand-contract. Do not delete the old structure immediately; keep it until the new structure is verified.

expand : add the new structure (coexists with old, easy to revert)

migrate : write to both + backfill

switch : switch reads to the new structure

contract : drop the old structure (after it is well settled, separate deploy)

Verified

Verify the data before and after the transform. Automatically check count, sample, invariants, and empty-value ratio.

-- Check empty-value ratio after the transform (invariant verification)

SELECT

count(*) FILTER (WHERE address_json = '{}'::jsonb) AS empty_count,

count(*) AS total,

round(100.0 * count(*) FILTER (WHERE address_json = '{}'::jsonb) / count(*), 2) AS empty_pct

FROM users;

-- Stop the deploy if empty_pct exceeds a threshold

Observed

Watch the migration's progress in real time. Watch duration, lock waits, and replication lag on a dashboard, and be able to abort immediately on anomaly.

Signals to observe:

- migration elapsed time (vs estimate)

- lock wait-queue length (pg_locks, SHOW PROCESSLIST)

- replication lag (replica lag)

- connection pool utilization

- error rate, response time

Rehearsing Large Migrations

The single most powerful tool against production-grade incidents is rehearsal. Run the migration exactly as-is against a copy of production data (or production-scale synthetic data).

1. Restore a production snapshot into an isolated environment (same volume)

2. Run the migration for real and measure the duration

3. Observe lock duration, replication lag, disk usage

4. Rehearse the rollback/undo procedure too

5. Record the expected duration and risks in the runbook

If your rehearsal tells you in advance that "this migration takes 12 minutes in production and locks the table the whole time," you can decide ahead of time to switch to a zero-downtime strategy or schedule a maintenance window.

Communication and Approval

Half of migration incidents are not a technology problem but a communication problem.

- Advance notice: warn the relevant teams ahead of a risky migration. "Tonight there is a users-table migration; writes may slow for about 10 minutes."

- Approval gate: production migrations get a second person's review. The SQL someone types alone at 3 a.m. is the most dangerous.

- Rollback decision-maker: decide in advance who calls "stop and roll back." Deciding in the middle of an incident is too late.

- Status sharing: share the status of an in-progress migration in a channel in real time.

Maintenance Window vs Zero Downtime

Migration strategy splits broadly into two.

+----------------------+----------------------------------------------+

| Maintenance window | Zero downtime (online) |

+----------------------+----------------------------------------------+

| pause the service | keep the service up, apply incrementally |

| simple, risky ops OK | complex, expand-contract required |

| planned downtime | no downtime |

| fits small/internal | fits large/24x7 services |

+----------------------+----------------------------------------------+

Zero downtime is not always right. For an internal tool or a low-traffic time window, a short maintenance window can be safer and simpler than a complex zero-downtime migration. The key is to first agree on "how much downtime this service can tolerate."

Observation Queries During Migration

While a migration runs in production, check its state in real time with these queries.

-- PostgreSQL: inspect the current lock-wait situation

SELECT

blocked.pid AS blocked_pid,

blocking.pid AS blocking_pid,

blocked.query AS blocked_query,

blocking.query AS blocking_query

FROM pg_stat_activity blocked

JOIN pg_stat_activity blocking

ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

-- PostgreSQL: check replication lag (on the primary)

SELECT

client_addr,

state,

pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes

FROM pg_stat_replication;

-- inspect long-running queries in progress

SELECT pid, now() - query_start AS duration, query

FROM pg_stat_activity

WHERE state = 'active' AND now() - query_start > interval '1 minute'

ORDER BY duration DESC;

Wire these queries into a dashboard or alerts and you instantly know when a migration holds a lock too long or replication falls behind. "If you cannot see it, you cannot respond to it" is the first principle of observability.

Comprehensive Pre/Post Checklists

Before the Migration (Pre)

- [ ] Did you rehearse at production-grade data volume?

- [ ] Did you measure expected duration and lock duration?

- [ ] Did large-table changes use CONCURRENTLY/batching?

- [ ] Is it designed reversibly with expand-contract?

- [ ] Did you confirm a backup and the PITR point before destructive work?

- [ ] Is the backfill idempotent and single-execution guaranteed?

- [ ] Did you prepare verification queries (count/sample/invariant)?

- [ ] Did you notify and get approval from relevant teams?

- [ ] Did you set the rollback procedure and decision-maker?

During/After the Migration (Post)

- [ ] Are you observing progress (elapsed time, locks, replication lag)?

- [ ] Did you confirm data integrity with verification queries?

- [ ] Did you confirm replicas caught up (lag recovered)?

- [ ] Did error rate/response time return to normal?

- [ ] Did you share and record the result in a channel?

- [ ] (On anomaly) did you stop/roll back per the agreed procedure?

- [ ] If there was an incident, did you write a blameless postmortem?

How to Write a Blameless Postmortem

If an incident happened, the postmortem is what turns it into a learning asset. The key is the principle of being "blameless." You ask not "who typed it wrong" but "what systemic gap made that mistake possible."

Structure of a good postmortem:

1. Summary - what happened, when, how much impact (one paragraph)

2. Timeline - what happened, minute by minute

3. Root cause - "5 whys" past the surface to the underlying cause

4. Impact - the real impact on users/revenue/data

5. What went well - record fast detection, good response, too

6. Action items - concrete, deadlined items to prevent recurrence

Why a blameless culture matters is simple: blame people and incidents get hidden, and once hidden, learning stops. The goal is to accept that anyone can make a mistake at 3 a.m. and to fix the system so that mistake does not lead to an outage. "Why could that migration reach production without review?" is a far more productive question than "why did you type it?"

A 5-Whys Example

Incident: a NOT NULL migration froze production

Why? -> it rewrote 80M rows and held a lock

Why? -> NOT NULL + DEFAULT triggers a full rewrite

Why? -> we did not know that pattern was dangerous, review missed it

Why? -> migration linting was not in the pipeline

Why? -> there was no culture of validating migrations like code

Root cause: absence of tooling (lint) and culture (validation). Not a person's mistake.

Action: add lint + production-grade rehearsal + document risky patterns

Closing

The lessons of migration incidents mostly converge to the same place: split small, make it reversible, rehearse at production scale, verify, observe, and do not type it alone at 3 a.m. These principles are not new, but they are rarely followed until you have lived through an incident. Fortunately we can learn from others postmortems. If the three hypothetical incidents in this article prevent even one real incident of yours, you have paid the most expensive tuition at the cheapest price. Good engineering comes not from heroic recovery but from the boring discipline of making recovery unnecessary in the first place.

References

- Google SRE Book — Postmortem Culture: https://sre.google/sre-book/postmortem-culture/

- PostgreSQL ALTER TABLE lock info: https://www.postgresql.org/docs/current/sql-altertable.html

- PostgreSQL explicit locking docs: https://www.postgresql.org/docs/current/explicit-locking.html

- Flyway documentation: https://flywaydb.org/documentation/

- Liquibase documentation: https://docs.liquibase.com/

- AWS DMS docs: https://docs.aws.amazon.com/dms/

- GitHub — Online schema migrations (gh-ost): https://github.com/github/gh-ost

- Stripe Engineering — Online migrations at scale: https://stripe.com/blog/online-migrations