Skip to content

필사 모드: The Expand-Contract Pattern — The Right Way to Change Schemas with Zero Downtime

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

Changing the schema of a database that is actively serving traffic is always nerve-wracking. While requests are flowing and dozens of application instances are reading and writing the same table, a seemingly trivial change such as dropping a column or renaming one can bring the whole service to a halt.

Most teams have lived through this at least once. Right before a deploy, you rename a column with `ALTER TABLE`, ship the new code, and because the rolling deploy has not finished yet, the older instances reference the column that no longer exists and errors explode. Or you slap a `NOT NULL` constraint on a large table all at once, the entire table locks, and every write query times out in the meantime.

This article lays out the **Expand-Contract pattern** (also known as Parallel Change) that structurally prevents these incidents. The pattern starts from a single principle: "never change the schema and the code at the same time." We will walk through the three-phase flow needed to honor that principle, the real SQL for each scenario, and exactly how to coordinate with application code.

Why destructive changes are dangerous

Let us first define what a "destructive change" is. A destructive change is one that **instantly breaks the schema contract the existing code depends on**. The classic examples are:

- Renaming a column (`RENAME COLUMN`)

- Dropping a column (`DROP COLUMN`)

- Changing a column type (`ALTER COLUMN ... TYPE`)

- Applying a `NOT NULL` constraint immediately

- Renaming a table

These changes are dangerous for two main reasons.

1. Deploys are not atomic

Modern services use rolling deploys or blue-green deploys. That means there is always a window of time during which **the new version and the old version of the code are running simultaneously**. It may be tens of seconds; it may be several minutes.

If a schema change ignores this window, it will inevitably break. Change the schema first and the old code breaks; change the code first and the new code references a schema that does not exist yet. Whichever you do first, one side breaks. Resolving this contradiction is exactly what the Expand-Contract pattern is for.

2. DDL takes locks

Depending on the operation, `ALTER TABLE` takes a strong lock (`ACCESS EXCLUSIVE LOCK`). In PostgreSQL, for example, adding a column with a volatile default or adding a constraint that is validated immediately scans the whole table while holding the lock. During that time every incoming query piles up in a queue, the connection pool eventually drains, and the entire service stalls.

time ──────────────────────────────────────▶

ALTER TABLE (full scan + ACCESS EXCLUSIVE LOCK)

┌──────────────────────────────┐

│ table locked │

└──────────────────────────────┘

read/write queries ─ ─ ─ ─ ─ ─ ─ ─ waiting ▶ timeout

So a zero-downtime schema change must satisfy two axes at once: compatibility with the running code, and avoidance of blocking locks.

The three phases of the Expand-Contract pattern

The Expand-Contract pattern splits a single change into three phases. The key is that **between every phase you can always stop safely and roll back**.

Phase overview

| Phase | Schema state | Code state | Purpose |

| --- | --- | --- | --- |

| 1. Expand | old and new structures both exist | unchanged or compatible with both | add the new structure without breaking the old |

| 2. Migrate | both structures coexist | gradual shift to the new structure | move data and let code use the new structure |

| 3. Contract | old structure removed | only the new structure used | clean up the old structure no longer needed |

ASCII timeline

Expand Migrate Contract

┌─────────┐ ┌──────────────┐ ┌──────────┐

│ add new │ │ backfill + │ │ drop old │

│ column │ ───▶ │ double write │ ───▶ │ column │

│ (compat)│ │ + code shift │ │ (cleanup)│

└─────────┘ └──────────────┘ └──────────┘

│ │ │

▼ ▼ ▼

rollback ok rollback ok to undo you

(just ignore) (old column alive) must re-expand

◀── safe window: old and new code run side by side ──▶

The most important point on this timeline is that during the Expand and Migrate phases, **both old and new versions of the code can be running at the same time and all of them work correctly**. That is the essence of zero downtime.

A hands-on guide per scenario

Now let us look at each change scenario through the Expand-Contract lens, with concrete SQL. The examples are PostgreSQL-based, and I will call out MySQL differences separately.

Scenario 1: Adding a column

This is the simplest, but it still has a trap: adding a new column with a default and `NOT NULL` all at once.

In PostgreSQL 11 and later, adding a column with a constant default was optimized into a fast metadata-only operation. But if you use a volatile function as the default, the whole table has to be rewritten, so be careful.

Expand phase — add the new column as nullable.

-- safe: add a nullable column (metadata change only)

ALTER TABLE orders

ADD COLUMN discount_code text;

If you need a default, a constant default is safe but a function-based default should be avoided.

-- safe (PostgreSQL 11+): constant default

ALTER TABLE orders

ADD COLUMN status text NOT NULL DEFAULT 'pending';

-- risky: a volatile-function default can force a full table rewrite

-- ALTER TABLE orders ADD COLUMN created_at timestamptz DEFAULT now();

On the code side, deploy the code that reads and writes the new column only after the column actually exists. Because the column is nullable, the old code can simply ignore it and will not break.

Scenario 2: Renaming a column

Renaming is where incidents happen most often. It looks like a single line of `RENAME COLUMN`, but that one line instantly breaks the old code. In the Expand-Contract pattern, a rename is decomposed into **add + backfill + switch + drop** as four steps.

Say we want to rename `users.username` to `users.handle`.

Expand: add the new column

ALTER TABLE users

ADD COLUMN handle text;

Migrate: backfill and double write

Move the existing data into the new column. For a large table, a single `UPDATE` causes locking and a WAL explosion, so you must split it into batches.

-- batch backfill: process a fixed number of rows at a time

UPDATE users

SET handle = username

WHERE handle IS NULL

AND id IN (

SELECT id FROM users

WHERE handle IS NULL

ORDER BY id

LIMIT 5000

);

-- repeat the query above until the number of affected rows is 0

At the same time as the backfill, the application performs a **double write** to both columns. This way data that arrives during the backfill is reflected in both columns.

application double write (pseudocode)

def update_user_handle(user_id, new_value):

db.execute(

"UPDATE users SET username = :v, handle = :v WHERE id = :id",

v=new_value, id=user_id,

)

def read_user_handle(row):

read prefers the new column, falling back to the old one

return row["handle"] if row["handle"] is not None else row["username"]

Contract: drop the old column

Only after the backfill is finished and every instance is running code that reads and writes the new column do you drop the old one.

ALTER TABLE users

DROP COLUMN username;

If you keep this order, the service stays healthy even if a deploy stops or is rolled back at any point.

Scenario 3: Dropping a column

Dropping a column is also dangerous if done immediately, because old code that still references the column may be running. A drop is always done **after the references are removed from the code first**.

The order is:

1. Deploy code that neither reads nor writes that column first.

2. Wait until every instance has been replaced with the new code (observation window).

3. Drop the column.

-- after confirming no code references this column any longer

ALTER TABLE users

DROP COLUMN legacy_score;

It is best to leave a generous observation window before the drop. Instead of dropping the column immediately, first make the application ignore it, run for a few days while monitoring whether anyone really still uses it, and only then drop it.

Scenario 4: Adding a NOT NULL constraint

Applying `SET NOT NULL` directly to a large table takes an `ACCESS EXCLUSIVE LOCK` along with a full scan. In PostgreSQL there is a workaround: add a `CHECK` constraint as `NOT VALID` first, then validate it later.

Expand: add a CHECK constraint that defers validation

-- NOT VALID: does not check existing rows, only checks new/changed rows

ALTER TABLE orders

ADD CONSTRAINT orders_amount_not_null

CHECK (amount IS NOT NULL) NOT VALID;

`NOT VALID` takes only a short lock and returns immediately. From this point on, newly arriving data must satisfy the constraint.

Migrate: backfill existing data, then validate

-- 1) fill in existing NULL rows (in batches)

UPDATE orders SET amount = 0 WHERE amount IS NULL AND id IN (

SELECT id FROM orders WHERE amount IS NULL ORDER BY id LIMIT 5000

);

-- 2) validate the constraint: full check under a weak lock, not ACCESS EXCLUSIVE

ALTER TABLE orders

VALIDATE CONSTRAINT orders_amount_not_null;

`VALIDATE CONSTRAINT` reads the whole table but takes only a weak lock at the `SHARE UPDATE EXCLUSIVE` level, so it does not block concurrent reads and writes. That is the key difference.

Scenario 5: Changing a column type

A type change (`ALTER COLUMN ... TYPE`) almost always requires rewriting the table, so it is the heaviest operation. The safe approach is the same as a rename: **create a new column and move into it**.

Here is an example of changing `price` from `integer` to `bigint`.

-- Expand: add a column of the new type

ALTER TABLE products

ADD COLUMN price_v2 bigint;

-- Migrate: batch backfill + application double write

UPDATE products

SET price_v2 = price

WHERE price_v2 IS NULL AND id IN (

SELECT id FROM products WHERE price_v2 IS NULL ORDER BY id LIMIT 5000

);

-- (after code shift) Contract: drop the old column, optionally tidy the name

ALTER TABLE products DROP COLUMN price;

ALTER TABLE products RENAME COLUMN price_v2 TO price;

The final `RENAME` only changes metadata so it is fast, but at that instant the column name changes again, so be careful to synchronize with the code that reads this name. To be conservative, you can also skip the final `RENAME` and keep the name `price_v2` as is.

Add indexes with CONCURRENTLY

Schema changes include adding indexes. A plain `CREATE INDEX` takes a write lock on the table, but the `CONCURRENTLY` option builds the index without that lock.

-- build an index without blocking writes

CREATE INDEX CONCURRENTLY idx_orders_discount_code

ON orders (discount_code);

Note that `CONCURRENTLY` cannot run inside a transaction block, and on failure it can leave an invalid index behind, so you need cleanup logic for the failure case.

A closer look at double write and gradual backfill

The two pillars of the Migrate phase are double write and gradual backfill. They must work together for data consistency to be guaranteed.

Why you need both

If you backfill but do not double write, new writes that arrive during the backfill are reflected only in the old column, leaving the new column empty. If you double write but do not backfill, the historical data from before double write began will have the new column empty forever. So you use both to cover both "the past (backfill) and the future (double write)."

double write starts here

past data │ future data

◀─────────┼──────────▶

filled by │ filled automatically

backfill │ by double write

The batch backfill pattern

The backfill must be split into small batches with a brief pause between them, to relieve replication lag and lock contention.

gradual backfill loop (pseudocode)

BATCH = 5000

while True:

rows = db.execute("""

UPDATE users SET handle = username

WHERE handle IS NULL

AND id IN (

SELECT id FROM users WHERE handle IS NULL

ORDER BY id LIMIT :n

)

RETURNING id

""", n=BATCH)

if len(rows) == 0:

break

sleep(0.2) # relieve replication lag / lock contention

This lets even an enormous table fill up slowly without affecting the service.

Coordinating with application code

The Expand-Contract pattern is not solely a database concern; it must keep time with application deploys. Here is what state the code should be in at each phase.

| Point in time | Schema | Read code | Write code |

| --- | --- | --- | --- |

| right after Expand | new column added | reads old column | writes old column only |

| double-write deploy | both exist | new preferred, old fallback | writes both |

| new-column switch deploy | both exist | reads new column only | writes new column only |

| Contract | old column dropped | reads new column only | writes new column only |

The core rule is that **between every deploy you must wait "until all instances are replaced."** If you move to the next phase without confirming that the rolling deploy fully finished, old code may still be around and your assumptions will break.

If you use a migration tool (Flyway, Liquibase, golang-migrate, and the like), it is best to put only one safe phase in each release. Separating the Expand migration and the Contract migration into different releases lets you slip a code deploy and an observation window between them.

Keeping rollback possible at all times

The biggest advantage of Expand-Contract is that **each phase can be rolled back independently**.

- **Rolling back the Expand phase**: the newly added nullable column is used by no one, so you can leave it or drop it later. Rolling back the code is safe too.

- **Rolling back the Migrate phase**: the old column is still alive and kept current by double write, so reverting the code to the old version leaves the data intact.

- **Rolling back the Contract phase**: this phase requires care. If you have already dropped the old column, a simple code rollback cannot bring it back. So Contract is done last, only after the new structure has stabilized enough.

Because of this asymmetry there is a field maxim: "Expand freely, contract carefully." Expansion is easy to undo, but contraction is hard to undo.

Common pitfalls

Here are the pitfalls you run into most often in practice.

1. **Putting Expand and Contract in the same release.** This is the most common mistake. A code deploy and an observation window need to fit between them, and handling them at once breaks the zero-downtime guarantee.

2. **Backfilling in a single UPDATE.** Updating millions of rows in one transaction brings locking, a WAL explosion, and replication lag. Always split into batches.

3. **Forgetting the double write.** New data that arrives during the backfill goes missing from the new column. This is the chief suspect when the new column mysteriously still has NULLs after the backfill is "done."

4. **Adding a column with a function default.** A volatile default like `DEFAULT now()` can force a full table rewrite. Use a constant default, or add the column and then fill it with a backfill.

5. **Adding an immediately validated constraint.** Applying `NOT NULL` or `CHECK` directly means a full scan plus a strong lock. Split it into `NOT VALID` followed by `VALIDATE`.

6. **Running CONCURRENTLY inside a transaction.** `CREATE INDEX CONCURRENTLY` works only outside a transaction block. You must configure your migration tool so it does not automatically wrap the statement in a transaction.

7. **Not waiting for the rolling deploy to finish.** Moving to the next phase before every instance is replaced with the new code breaks your assumptions.

Zero-downtime schema change checklist

It is worth checking these before you deploy.

- [ ] Did you split this change into the three phases Expand / Migrate / Contract?

- [ ] Did you separate Expand and Contract into different releases?

- [ ] Is the new column added as nullable? (or using only a constant default)

- [ ] Is the backfill split into batches with a pause between them?

- [ ] Does double write cover both past and future data?

- [ ] Did you split the constraint into `NOT VALID` then `VALIDATE`?

- [ ] Are indexes created with `CONCURRENTLY`?

- [ ] Do you wait for the rolling deploy to finish between each phase?

- [ ] Did you keep the old structure until just before Contract to preserve a rollback path?

- [ ] Did you leave a generous observation window before dropping a column?

Conclusion

The essence of the Expand-Contract pattern is "do not change everything at once." You decompose a single destructive change into small steps that are each safe and reversible, so that the window during which old and new code coexist is always crossed safely.

At first the number of steps can feel cumbersome. But once you have experienced a midnight deploy where a single column froze the service, you understand how valuable an insurance policy this cumbersomeness is. Zero downtime is built from procedure, not luck.

References

- PostgreSQL ALTER TABLE documentation: https://www.postgresql.org/docs/current/sql-altertable.html

- PostgreSQL CREATE INDEX (CONCURRENTLY) documentation: https://www.postgresql.org/docs/current/sql-createindex.html

- MySQL Online DDL documentation: https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl.html

- AWS Database Migration Service documentation: https://docs.aws.amazon.com/dms/

- Flyway documentation: https://flywaydb.org/documentation/

- Liquibase documentation: https://docs.liquibase.com/

- golang-migrate: https://github.com/golang-migrate/migrate

- Martin Fowler, Evolutionary Database Design: https://martinfowler.com/articles/evodb.html

현재 단락 (1/190)

Changing the schema of a database that is actively serving traffic is always nerve-wracking. While r...

작성 글자: 0원문 글자: 14,644작성 단락: 0/190