필사 모드: The Expand-Contract Pattern — The Right Way to Change Schemas with Zero Downtime
EnglishIntroduction
Changing the schema of a database that is actively serving traffic is always nerve-wracking. While requests are flowing and dozens of application instances are reading and writing the same table, a seemingly trivial change such as dropping a column or renaming one can bring the whole service to a halt.
Most teams have lived through this at least once. Right before a deploy, you rename a column with `ALTER TABLE`, ship the new code, and because the rolling deploy has not finished yet, the older instances reference the column that no longer exists and errors explode. Or you slap a `NOT NULL` constraint on a large table all at once, the entire table locks, and every write query times out in the meantime.
This article lays out the **Expand-Contract pattern** (also known as Parallel Change) that structurally prevents these incidents. The pattern starts from a single principle: "never change the schema and the code at the same time." We will walk through the three-phase flow needed to honor that principle, the real SQL for each scenario, and exactly how to coordinate with application code.
Why destructive changes are dangerous
Let us first define what a "destructive change" is. A destructive change is one that **instantly breaks the schema contract the existing code depends on**. The classic examples are:
- Renaming a column (`RENAME COLUMN`)
- Dropping a column (`DROP COLUMN`)
- Changing a column type (`ALTER COLUMN ... TYPE`)
- Applying a `NOT NULL` constraint immediately
- Renaming a table
These changes are dangerous for two main reasons.
1. Deploys are not atomic
Modern services use rolling deploys or blue-green deploys. That means there is always a window of time during which **the new version and the old version of the code are running simultaneously**. It may be tens of seconds; it may be several minutes.
If a schema change ignores this window, it will inevitably break. Change the schema first and the old code breaks; change the code first and the new code references a schema that does not exist yet. Whichever you do first, one side breaks. Resolving this contradiction is exactly what the Expand-Contract pattern is for.
2. DDL takes locks
Depending on the operation, `ALTER TABLE` takes a strong lock (`ACCESS EXCLUSIVE LOCK`). In PostgreSQL, for example, adding a column with a volatile default or adding a constraint that is validated immediately scans the whole table while holding the lock. During that time every incoming query piles up in a queue, the connection pool eventually drains, and the entire service stalls.
time ──────────────────────────────────────▶
ALTER TABLE (full scan + ACCESS EXCLUSIVE LOCK)
┌──────────────────────────────┐
│ table locked │
└──────────────────────────────┘
read/write queries ─ ─ ─ ─ ─ ─ ─ ─ waiting ▶ timeout
So a zero-downtime schema change must satisfy two axes at once: compatibility with the running code, and avoidance of blocking locks.
The three phases of the Expand-Contract pattern
The Expand-Contract pattern splits a single change into three phases. The key is that **between every phase you can always stop safely and roll back**.
Phase overview
| Phase | Schema state | Code state | Purpose |
| --- | --- | --- | --- |
| 1. Expand | old and new structures both exist | unchanged or compatible with both | add the new structure without breaking the old |
| 2. Migrate | both structures coexist | gradual shift to the new structure | move data and let code use the new structure |
| 3. Contract | old structure removed | only the new structure used | clean up the old structure no longer needed |
ASCII timeline
Expand Migrate Contract
┌─────────┐ ┌──────────────┐ ┌──────────┐
│ add new │ │ backfill + │ │ drop old │
│ column │ ───▶ │ double write │ ───▶ │ column │
│ (compat)│ │ + code shift │ │ (cleanup)│
└─────────┘ └──────────────┘ └──────────┘
│ │ │
▼ ▼ ▼
rollback ok rollback ok to undo you
(just ignore) (old column alive) must re-expand
◀── safe window: old and new code run side by side ──▶
The most important point on this timeline is that during the Expand and Migrate phases, **both old and new versions of the code can be running at the same time and all of them work correctly**. That is the essence of zero downtime.
A hands-on guide per scenario
Now let us look at each change scenario through the Expand-Contract lens, with concrete SQL. The examples are PostgreSQL-based, and I will call out MySQL differences separately.
Scenario 1: Adding a column
This is the simplest, but it still has a trap: adding a new column with a default and `NOT NULL` all at once.
In PostgreSQL 11 and later, adding a column with a constant default was optimized into a fast metadata-only operation. But if you use a volatile function as the default, the whole table has to be rewritten, so be careful.
Expand phase — add the new column as nullable.
-- safe: add a nullable column (metadata change only)
ALTER TABLE orders
ADD COLUMN discount_code text;
If you need a default, a constant default is safe but a function-based default should be avoided.
-- safe (PostgreSQL 11+): constant default
ALTER TABLE orders
ADD COLUMN status text NOT NULL DEFAULT 'pending';
-- risky: a volatile-function default can force a full table rewrite
-- ALTER TABLE orders ADD COLUMN created_at timestamptz DEFAULT now();
On the code side, deploy the code that reads and writes the new column only after the column actually exists. Because the column is nullable, the old code can simply ignore it and will not break.
Scenario 2: Renaming a column
Renaming is where incidents happen most often. It looks like a single line of `RENAME COLUMN`, but that one line instantly breaks the old code. In the Expand-Contract pattern, a rename is decomposed into **add + backfill + switch + drop** as four steps.
Say we want to rename `users.username` to `users.handle`.
Expand: add the new column
ALTER TABLE users
ADD COLUMN handle text;
Migrate: backfill and double write
Move the existing data into the new column. For a large table, a single `UPDATE` causes locking and a WAL explosion, so you must split it into batches.
-- batch backfill: process a fixed number of rows at a time
UPDATE users
SET handle = username
WHERE handle IS NULL
AND id IN (
SELECT id FROM users
WHERE handle IS NULL
ORDER BY id
LIMIT 5000
);
-- repeat the query above until the number of affected rows is 0
At the same time as the backfill, the application performs a **double write** to both columns. This way data that arrives during the backfill is reflected in both columns.
application double write (pseudocode)
def update_user_handle(user_id, new_value):
db.execute(
"UPDATE users SET username = :v, handle = :v WHERE id = :id",
v=new_value, id=user_id,
)
def read_user_handle(row):
read prefers the new column, falling back to the old one
return row["handle"] if row["handle"] is not None else row["username"]
Contract: drop the old column
Only after the backfill is finished and every instance is running code that reads and writes the new column do you drop the old one.
ALTER TABLE users
DROP COLUMN username;
If you keep this order, the service stays healthy even if a deploy stops or is rolled back at any point.
Scenario 3: Dropping a column
Dropping a column is also dangerous if done immediately, because old code that still references the column may be running. A drop is always done **after the references are removed from the code first**.
The order is:
1. Deploy code that neither reads nor writes that column first.
2. Wait until every instance has been replaced with the new code (observation window).
3. Drop the column.
-- after confirming no code references this column any longer
ALTER TABLE users
DROP COLUMN legacy_score;
It is best to leave a generous observation window before the drop. Instead of dropping the column immediately, first make the application ignore it, run for a few days while monitoring whether anyone really still uses it, and only then drop it.
Scenario 4: Adding a NOT NULL constraint
Applying `SET NOT NULL` directly to a large table takes an `ACCESS EXCLUSIVE LOCK` along with a full scan. In PostgreSQL there is a workaround: add a `CHECK` constraint as `NOT VALID` first, then validate it later.
Expand: add a CHECK constraint that defers validation
-- NOT VALID: does not check existing rows, only checks new/changed rows
ALTER TABLE orders
ADD CONSTRAINT orders_amount_not_null
CHECK (amount IS NOT NULL) NOT VALID;
`NOT VALID` takes only a short lock and returns immediately. From this point on, newly arriving data must satisfy the constraint.
Migrate: backfill existing data, then validate
-- 1) fill in existing NULL rows (in batches)
UPDATE orders SET amount = 0 WHERE amount IS NULL AND id IN (
SELECT id FROM orders WHERE amount IS NULL ORDER BY id LIMIT 5000
);
-- 2) validate the constraint: full check under a weak lock, not ACCESS EXCLUSIVE
ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_not_null;
`VALIDATE CONSTRAINT` reads the whole table but takes only a weak lock at the `SHARE UPDATE EXCLUSIVE` level, so it does not block concurrent reads and writes. That is the key difference.
Scenario 5: Changing a column type
A type change (`ALTER COLUMN ... TYPE`) almost always requires rewriting the table, so it is the heaviest operation. The safe approach is the same as a rename: **create a new column and move into it**.
Here is an example of changing `price` from `integer` to `bigint`.
-- Expand: add a column of the new type
ALTER TABLE products
ADD COLUMN price_v2 bigint;
-- Migrate: batch backfill + application double write
UPDATE products
SET price_v2 = price
WHERE price_v2 IS NULL AND id IN (
SELECT id FROM products WHERE price_v2 IS NULL ORDER BY id LIMIT 5000
);
-- (after code shift) Contract: drop the old column, optionally tidy the name
ALTER TABLE products DROP COLUMN price;
ALTER TABLE products RENAME COLUMN price_v2 TO price;
The final `RENAME` only changes metadata so it is fast, but at that instant the column name changes again, so be careful to synchronize with the code that reads this name. To be conservative, you can also skip the final `RENAME` and keep the name `price_v2` as is.
Add indexes with CONCURRENTLY
Schema changes include adding indexes. A plain `CREATE INDEX` takes a write lock on the table, but the `CONCURRENTLY` option builds the index without that lock.
-- build an index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_discount_code
ON orders (discount_code);
Note that `CONCURRENTLY` cannot run inside a transaction block, and on failure it can leave an invalid index behind, so you need cleanup logic for the failure case.
A closer look at double write and gradual backfill
The two pillars of the Migrate phase are double write and gradual backfill. They must work together for data consistency to be guaranteed.
Why you need both
If you backfill but do not double write, new writes that arrive during the backfill are reflected only in the old column, leaving the new column empty. If you double write but do not backfill, the historical data from before double write began will have the new column empty forever. So you use both to cover both "the past (backfill) and the future (double write)."
double write starts here
│
past data │ future data
◀─────────┼──────────▶
filled by │ filled automatically
backfill │ by double write
The batch backfill pattern
The backfill must be split into small batches with a brief pause between them, to relieve replication lag and lock contention.
gradual backfill loop (pseudocode)
BATCH = 5000
while True:
rows = db.execute("""
UPDATE users SET handle = username
WHERE handle IS NULL
AND id IN (
SELECT id FROM users WHERE handle IS NULL
ORDER BY id LIMIT :n
)
RETURNING id
""", n=BATCH)
if len(rows) == 0:
break
sleep(0.2) # relieve replication lag / lock contention
This lets even an enormous table fill up slowly without affecting the service.
Coordinating with application code
The Expand-Contract pattern is not solely a database concern; it must keep time with application deploys. Here is what state the code should be in at each phase.
| Point in time | Schema | Read code | Write code |
| --- | --- | --- | --- |
| right after Expand | new column added | reads old column | writes old column only |
| double-write deploy | both exist | new preferred, old fallback | writes both |
| new-column switch deploy | both exist | reads new column only | writes new column only |
| Contract | old column dropped | reads new column only | writes new column only |
The core rule is that **between every deploy you must wait "until all instances are replaced."** If you move to the next phase without confirming that the rolling deploy fully finished, old code may still be around and your assumptions will break.
If you use a migration tool (Flyway, Liquibase, golang-migrate, and the like), it is best to put only one safe phase in each release. Separating the Expand migration and the Contract migration into different releases lets you slip a code deploy and an observation window between them.
Keeping rollback possible at all times
The biggest advantage of Expand-Contract is that **each phase can be rolled back independently**.
- **Rolling back the Expand phase**: the newly added nullable column is used by no one, so you can leave it or drop it later. Rolling back the code is safe too.
- **Rolling back the Migrate phase**: the old column is still alive and kept current by double write, so reverting the code to the old version leaves the data intact.
- **Rolling back the Contract phase**: this phase requires care. If you have already dropped the old column, a simple code rollback cannot bring it back. So Contract is done last, only after the new structure has stabilized enough.
Because of this asymmetry there is a field maxim: "Expand freely, contract carefully." Expansion is easy to undo, but contraction is hard to undo.
Common pitfalls
Here are the pitfalls you run into most often in practice.
1. **Putting Expand and Contract in the same release.** This is the most common mistake. A code deploy and an observation window need to fit between them, and handling them at once breaks the zero-downtime guarantee.
2. **Backfilling in a single UPDATE.** Updating millions of rows in one transaction brings locking, a WAL explosion, and replication lag. Always split into batches.
3. **Forgetting the double write.** New data that arrives during the backfill goes missing from the new column. This is the chief suspect when the new column mysteriously still has NULLs after the backfill is "done."
4. **Adding a column with a function default.** A volatile default like `DEFAULT now()` can force a full table rewrite. Use a constant default, or add the column and then fill it with a backfill.
5. **Adding an immediately validated constraint.** Applying `NOT NULL` or `CHECK` directly means a full scan plus a strong lock. Split it into `NOT VALID` followed by `VALIDATE`.
6. **Running CONCURRENTLY inside a transaction.** `CREATE INDEX CONCURRENTLY` works only outside a transaction block. You must configure your migration tool so it does not automatically wrap the statement in a transaction.
7. **Not waiting for the rolling deploy to finish.** Moving to the next phase before every instance is replaced with the new code breaks your assumptions.
Zero-downtime schema change checklist
It is worth checking these before you deploy.
- [ ] Did you split this change into the three phases Expand / Migrate / Contract?
- [ ] Did you separate Expand and Contract into different releases?
- [ ] Is the new column added as nullable? (or using only a constant default)
- [ ] Is the backfill split into batches with a pause between them?
- [ ] Does double write cover both past and future data?
- [ ] Did you split the constraint into `NOT VALID` then `VALIDATE`?
- [ ] Are indexes created with `CONCURRENTLY`?
- [ ] Do you wait for the rolling deploy to finish between each phase?
- [ ] Did you keep the old structure until just before Contract to preserve a rollback path?
- [ ] Did you leave a generous observation window before dropping a column?
Conclusion
The essence of the Expand-Contract pattern is "do not change everything at once." You decompose a single destructive change into small steps that are each safe and reversible, so that the window during which old and new code coexist is always crossed safely.
At first the number of steps can feel cumbersome. But once you have experienced a midnight deploy where a single column froze the service, you understand how valuable an insurance policy this cumbersomeness is. Zero downtime is built from procedure, not luck.
References
- PostgreSQL ALTER TABLE documentation: https://www.postgresql.org/docs/current/sql-altertable.html
- PostgreSQL CREATE INDEX (CONCURRENTLY) documentation: https://www.postgresql.org/docs/current/sql-createindex.html
- MySQL Online DDL documentation: https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl.html
- AWS Database Migration Service documentation: https://docs.aws.amazon.com/dms/
- Flyway documentation: https://flywaydb.org/documentation/
- Liquibase documentation: https://docs.liquibase.com/
- golang-migrate: https://github.com/golang-migrate/migrate
- Martin Fowler, Evolutionary Database Design: https://martinfowler.com/articles/evodb.html
현재 단락 (1/190)
Changing the schema of a database that is actively serving traffic is always nerve-wracking. While r...