Skip to content
Published on

The Expand-Contract Pattern — The Right Way to Change Schemas with Zero Downtime

Authors

Introduction

Changing the schema of a database that is actively serving traffic is always nerve-wracking. While requests are flowing and dozens of application instances are reading and writing the same table, a seemingly trivial change such as dropping a column or renaming one can bring the whole service to a halt.

Most teams have lived through this at least once. Right before a deploy, you rename a column with ALTER TABLE, ship the new code, and because the rolling deploy has not finished yet, the older instances reference the column that no longer exists and errors explode. Or you slap a NOT NULL constraint on a large table all at once, the entire table locks, and every write query times out in the meantime.

This article lays out the Expand-Contract pattern (also known as Parallel Change) that structurally prevents these incidents. The pattern starts from a single principle: "never change the schema and the code at the same time." We will walk through the three-phase flow needed to honor that principle, the real SQL for each scenario, and exactly how to coordinate with application code.

Why destructive changes are dangerous

Let us first define what a "destructive change" is. A destructive change is one that instantly breaks the schema contract the existing code depends on. The classic examples are:

  • Renaming a column (RENAME COLUMN)
  • Dropping a column (DROP COLUMN)
  • Changing a column type (ALTER COLUMN ... TYPE)
  • Applying a NOT NULL constraint immediately
  • Renaming a table

These changes are dangerous for two main reasons.

1. Deploys are not atomic

Modern services use rolling deploys or blue-green deploys. That means there is always a window of time during which the new version and the old version of the code are running simultaneously. It may be tens of seconds; it may be several minutes.

If a schema change ignores this window, it will inevitably break. Change the schema first and the old code breaks; change the code first and the new code references a schema that does not exist yet. Whichever you do first, one side breaks. Resolving this contradiction is exactly what the Expand-Contract pattern is for.

2. DDL takes locks

Depending on the operation, ALTER TABLE takes a strong lock (ACCESS EXCLUSIVE LOCK). In PostgreSQL, for example, adding a column with a volatile default or adding a constraint that is validated immediately scans the whole table while holding the lock. During that time every incoming query piles up in a queue, the connection pool eventually drains, and the entire service stalls.

time ──────────────────────────────────────▶
       ALTER TABLE (full scan + ACCESS EXCLUSIVE LOCK)
      ┌──────────────────────────────┐
      │  table locked                │
      └──────────────────────────────┘
   read/write queries ─ ─ ─ ─ ─ ─ ─ ─ waiting ▶ timeout

So a zero-downtime schema change must satisfy two axes at once: compatibility with the running code, and avoidance of blocking locks.

The three phases of the Expand-Contract pattern

The Expand-Contract pattern splits a single change into three phases. The key is that between every phase you can always stop safely and roll back.

Phase overview

PhaseSchema stateCode statePurpose
1. Expandold and new structures both existunchanged or compatible with bothadd the new structure without breaking the old
2. Migrateboth structures coexistgradual shift to the new structuremove data and let code use the new structure
3. Contractold structure removedonly the new structure usedclean up the old structure no longer needed

ASCII timeline

   Expand               Migrate                    Contract
 ┌─────────┐        ┌──────────────┐            ┌──────────┐
 │ add new │        │ backfill +   │            │ drop old │
 │ column  │  ───▶  │ double write │   ───▶     │ column   │
 │ (compat)│        │ + code shift │            │ (cleanup)│
 └─────────┘        └──────────────┘            └──────────┘
      │                    │                          │
      ▼                    ▼                          ▼
  rollback ok         rollback ok               to undo you
  (just ignore)     (old column alive)          must re-expand

 ◀── safe window: old and new code run side by side ──▶

The most important point on this timeline is that during the Expand and Migrate phases, both old and new versions of the code can be running at the same time and all of them work correctly. That is the essence of zero downtime.

A hands-on guide per scenario

Now let us look at each change scenario through the Expand-Contract lens, with concrete SQL. The examples are PostgreSQL-based, and I will call out MySQL differences separately.

Scenario 1: Adding a column

This is the simplest, but it still has a trap: adding a new column with a default and NOT NULL all at once.

In PostgreSQL 11 and later, adding a column with a constant default was optimized into a fast metadata-only operation. But if you use a volatile function as the default, the whole table has to be rewritten, so be careful.

Expand phase — add the new column as nullable.

-- safe: add a nullable column (metadata change only)
ALTER TABLE orders
  ADD COLUMN discount_code text;

If you need a default, a constant default is safe but a function-based default should be avoided.

-- safe (PostgreSQL 11+): constant default
ALTER TABLE orders
  ADD COLUMN status text NOT NULL DEFAULT 'pending';

-- risky: a volatile-function default can force a full table rewrite
-- ALTER TABLE orders ADD COLUMN created_at timestamptz DEFAULT now();

On the code side, deploy the code that reads and writes the new column only after the column actually exists. Because the column is nullable, the old code can simply ignore it and will not break.

Scenario 2: Renaming a column

Renaming is where incidents happen most often. It looks like a single line of RENAME COLUMN, but that one line instantly breaks the old code. In the Expand-Contract pattern, a rename is decomposed into add + backfill + switch + drop as four steps.

Say we want to rename users.username to users.handle.

Expand: add the new column

ALTER TABLE users
  ADD COLUMN handle text;

Migrate: backfill and double write

Move the existing data into the new column. For a large table, a single UPDATE causes locking and a WAL explosion, so you must split it into batches.

-- batch backfill: process a fixed number of rows at a time
UPDATE users
SET handle = username
WHERE handle IS NULL
  AND id IN (
    SELECT id FROM users
    WHERE handle IS NULL
    ORDER BY id
    LIMIT 5000
  );
-- repeat the query above until the number of affected rows is 0

At the same time as the backfill, the application performs a double write to both columns. This way data that arrives during the backfill is reflected in both columns.

# application double write (pseudocode)
def update_user_handle(user_id, new_value):
    db.execute(
        "UPDATE users SET username = :v, handle = :v WHERE id = :id",
        v=new_value, id=user_id,
    )

def read_user_handle(row):
    # read prefers the new column, falling back to the old one
    return row["handle"] if row["handle"] is not None else row["username"]

Contract: drop the old column

Only after the backfill is finished and every instance is running code that reads and writes the new column do you drop the old one.

ALTER TABLE users
  DROP COLUMN username;

If you keep this order, the service stays healthy even if a deploy stops or is rolled back at any point.

Scenario 3: Dropping a column

Dropping a column is also dangerous if done immediately, because old code that still references the column may be running. A drop is always done after the references are removed from the code first.

The order is:

  1. Deploy code that neither reads nor writes that column first.
  2. Wait until every instance has been replaced with the new code (observation window).
  3. Drop the column.
-- after confirming no code references this column any longer
ALTER TABLE users
  DROP COLUMN legacy_score;

It is best to leave a generous observation window before the drop. Instead of dropping the column immediately, first make the application ignore it, run for a few days while monitoring whether anyone really still uses it, and only then drop it.

Scenario 4: Adding a NOT NULL constraint

Applying SET NOT NULL directly to a large table takes an ACCESS EXCLUSIVE LOCK along with a full scan. In PostgreSQL there is a workaround: add a CHECK constraint as NOT VALID first, then validate it later.

Expand: add a CHECK constraint that defers validation

-- NOT VALID: does not check existing rows, only checks new/changed rows
ALTER TABLE orders
  ADD CONSTRAINT orders_amount_not_null
  CHECK (amount IS NOT NULL) NOT VALID;

NOT VALID takes only a short lock and returns immediately. From this point on, newly arriving data must satisfy the constraint.

Migrate: backfill existing data, then validate

-- 1) fill in existing NULL rows (in batches)
UPDATE orders SET amount = 0 WHERE amount IS NULL AND id IN (
  SELECT id FROM orders WHERE amount IS NULL ORDER BY id LIMIT 5000
);

-- 2) validate the constraint: full check under a weak lock, not ACCESS EXCLUSIVE
ALTER TABLE orders
  VALIDATE CONSTRAINT orders_amount_not_null;

VALIDATE CONSTRAINT reads the whole table but takes only a weak lock at the SHARE UPDATE EXCLUSIVE level, so it does not block concurrent reads and writes. That is the key difference.

Scenario 5: Changing a column type

A type change (ALTER COLUMN ... TYPE) almost always requires rewriting the table, so it is the heaviest operation. The safe approach is the same as a rename: create a new column and move into it.

Here is an example of changing price from integer to bigint.

-- Expand: add a column of the new type
ALTER TABLE products
  ADD COLUMN price_v2 bigint;

-- Migrate: batch backfill + application double write
UPDATE products
SET price_v2 = price
WHERE price_v2 IS NULL AND id IN (
  SELECT id FROM products WHERE price_v2 IS NULL ORDER BY id LIMIT 5000
);

-- (after code shift) Contract: drop the old column, optionally tidy the name
ALTER TABLE products DROP COLUMN price;
ALTER TABLE products RENAME COLUMN price_v2 TO price;

The final RENAME only changes metadata so it is fast, but at that instant the column name changes again, so be careful to synchronize with the code that reads this name. To be conservative, you can also skip the final RENAME and keep the name price_v2 as is.

Add indexes with CONCURRENTLY

Schema changes include adding indexes. A plain CREATE INDEX takes a write lock on the table, but the CONCURRENTLY option builds the index without that lock.

-- build an index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_discount_code
  ON orders (discount_code);

Note that CONCURRENTLY cannot run inside a transaction block, and on failure it can leave an invalid index behind, so you need cleanup logic for the failure case.

A closer look at double write and gradual backfill

The two pillars of the Migrate phase are double write and gradual backfill. They must work together for data consistency to be guaranteed.

Why you need both

If you backfill but do not double write, new writes that arrive during the backfill are reflected only in the old column, leaving the new column empty. If you double write but do not backfill, the historical data from before double write began will have the new column empty forever. So you use both to cover both "the past (backfill) and the future (double write)."

   double write starts here
  past data  │   future data
 ◀─────────┼──────────▶
  filled by  │  filled automatically
  backfill   │  by double write

The batch backfill pattern

The backfill must be split into small batches with a brief pause between them, to relieve replication lag and lock contention.

# gradual backfill loop (pseudocode)
BATCH = 5000
while True:
    rows = db.execute("""
        UPDATE users SET handle = username
        WHERE handle IS NULL
          AND id IN (
            SELECT id FROM users WHERE handle IS NULL
            ORDER BY id LIMIT :n
          )
        RETURNING id
    """, n=BATCH)
    if len(rows) == 0:
        break
    sleep(0.2)  # relieve replication lag / lock contention

This lets even an enormous table fill up slowly without affecting the service.

Coordinating with application code

The Expand-Contract pattern is not solely a database concern; it must keep time with application deploys. Here is what state the code should be in at each phase.

Point in timeSchemaRead codeWrite code
right after Expandnew column addedreads old columnwrites old column only
double-write deployboth existnew preferred, old fallbackwrites both
new-column switch deployboth existreads new column onlywrites new column only
Contractold column droppedreads new column onlywrites new column only

The core rule is that between every deploy you must wait "until all instances are replaced." If you move to the next phase without confirming that the rolling deploy fully finished, old code may still be around and your assumptions will break.

If you use a migration tool (Flyway, Liquibase, golang-migrate, and the like), it is best to put only one safe phase in each release. Separating the Expand migration and the Contract migration into different releases lets you slip a code deploy and an observation window between them.

Keeping rollback possible at all times

The biggest advantage of Expand-Contract is that each phase can be rolled back independently.

  • Rolling back the Expand phase: the newly added nullable column is used by no one, so you can leave it or drop it later. Rolling back the code is safe too.
  • Rolling back the Migrate phase: the old column is still alive and kept current by double write, so reverting the code to the old version leaves the data intact.
  • Rolling back the Contract phase: this phase requires care. If you have already dropped the old column, a simple code rollback cannot bring it back. So Contract is done last, only after the new structure has stabilized enough.

Because of this asymmetry there is a field maxim: "Expand freely, contract carefully." Expansion is easy to undo, but contraction is hard to undo.

Common pitfalls

Here are the pitfalls you run into most often in practice.

  1. Putting Expand and Contract in the same release. This is the most common mistake. A code deploy and an observation window need to fit between them, and handling them at once breaks the zero-downtime guarantee.

  2. Backfilling in a single UPDATE. Updating millions of rows in one transaction brings locking, a WAL explosion, and replication lag. Always split into batches.

  3. Forgetting the double write. New data that arrives during the backfill goes missing from the new column. This is the chief suspect when the new column mysteriously still has NULLs after the backfill is "done."

  4. Adding a column with a function default. A volatile default like DEFAULT now() can force a full table rewrite. Use a constant default, or add the column and then fill it with a backfill.

  5. Adding an immediately validated constraint. Applying NOT NULL or CHECK directly means a full scan plus a strong lock. Split it into NOT VALID followed by VALIDATE.

  6. Running CONCURRENTLY inside a transaction. CREATE INDEX CONCURRENTLY works only outside a transaction block. You must configure your migration tool so it does not automatically wrap the statement in a transaction.

  7. Not waiting for the rolling deploy to finish. Moving to the next phase before every instance is replaced with the new code breaks your assumptions.

Zero-downtime schema change checklist

It is worth checking these before you deploy.

  • Did you split this change into the three phases Expand / Migrate / Contract?
  • Did you separate Expand and Contract into different releases?
  • Is the new column added as nullable? (or using only a constant default)
  • Is the backfill split into batches with a pause between them?
  • Does double write cover both past and future data?
  • Did you split the constraint into NOT VALID then VALIDATE?
  • Are indexes created with CONCURRENTLY?
  • Do you wait for the rolling deploy to finish between each phase?
  • Did you keep the old structure until just before Contract to preserve a rollback path?
  • Did you leave a generous observation window before dropping a column?

Conclusion

The essence of the Expand-Contract pattern is "do not change everything at once." You decompose a single destructive change into small steps that are each safe and reversible, so that the window during which old and new code coexist is always crossed safely.

At first the number of steps can feel cumbersome. But once you have experienced a midnight deploy where a single column froze the service, you understand how valuable an insurance policy this cumbersomeness is. Zero downtime is built from procedure, not luck.

References