Database MVCC & Isolation Levels Complete Guide 2025: Read Phenomena, Snapshot Isolation, SSI, PostgreSQL/InnoDB Internals
Introduction: Why Are Isolation Levels So Confusing?
The Trap of Official Documentation
The SQL standard defines four isolation levels: Read Uncommitted, Read Committed, Repeatable Read, and Serializable. It also distinguishes them by three kinds of anomalies.
| Isolation Level | Dirty Read | Non-repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | Allowed | Allowed | Allowed |
| Read Committed | Prevented | Allowed | Allowed |
| Repeatable Read | Prevented | Prevented | Allowed |
| Serializable | Prevented | Prevented | Prevented |
Memorize the table and you think you understand it. But actual databases don't behave according to this table:
- PostgreSQL's Repeatable Read prevents Phantom Read (the table says it should allow it).
- MySQL InnoDB's Repeatable Read is Snapshot Isolation, and Next-Key Locks also block phantoms.
- Oracle's Serializable is actually Snapshot Isolation.
- Write Skew is an anomaly not in the SQL standard, but it's the central problem of Snapshot Isolation.
In other words, "Repeatable Read" means different things in different DBs even though the name is the same. In this article, we look at how each DB actually works.
1. Transactions and ACID Revisited
What ACID Really Means
- Atomicity: All succeed or all fail.
- Consistency: Database invariants are preserved after the transaction.
- Isolation: Concurrently running transactions appear not to affect each other.
- Durability: Committed data is persistent.
Of these, Isolation is the hardest and most subtle. Perfect Isolation (= Serializable) guarantees the same result as running transactions serially, but performance would be terrible. That's why various weaker isolation levels are offered.
The Essence of Isolation Levels
An isolation level is a choice of "which anomalies will we allow." Weaker isolation levels are fast but require extra programmer effort to avoid bugs. Stronger isolation levels are safe but slower.
Serializable <------- Safe, Slow
^
Snapshot Isolation
^
Repeatable Read
^
Read Committed
^
Read Uncommitted <---- Dangerous, Fast
2. A Complete Tour of the Read Phenomena
Dirty Read: Reading Uncommitted Data
Example:
-- Initial state: account = 100
-- T1
BEGIN;
UPDATE account SET balance = 200; -- Not yet committed
-- Suddenly ROLLBACK
-- T2 (after T1's UPDATE, before ROLLBACK)
BEGIN;
SELECT balance FROM account; -- Returns 200! (dirty read)
Since T1 ends up rolling back, the 200 T2 read is a value that never existed. This is extremely dangerous.
In practice: many modern DBs do not actually implement Read Uncommitted. Even when it is set, it often behaves like Read Committed (PostgreSQL, for example, silently upgrades it).
Non-repeatable Read: The Same Row Changes
-- T1
BEGIN;
SELECT balance FROM account WHERE id = 1; -- 100
-- T2
BEGIN;
UPDATE account SET balance = 200 WHERE id = 1;
COMMIT;
-- T1 continues
SELECT balance FROM account WHERE id = 1; -- 200 (different value!)
COMMIT;
The same SELECT within a transaction returns different results. Consistency breaks down in complex report queries.
Phantom Read: The Same WHERE Result Changes
-- T1
BEGIN;
SELECT COUNT(*) FROM orders WHERE user_id = 5; -- 10
-- T2
BEGIN;
INSERT INTO orders (user_id, ...) VALUES (5, ...);
COMMIT;
-- T1 continues
SELECT COUNT(*) FROM orders WHERE user_id = 5; -- 11 (phantom row!)
COMMIT;
A row that didn't exist suddenly appears. This is called a Phantom. It's similar to non-repeatable read, but the target is the result of a range query, not an individual row.
Lost Update: Writes Disappear
-- Initial: counter = 0
-- T1 and T2 simultaneously:
BEGIN;
SELECT counter FROM stats; -- Both see 0
-- Application computes +1
UPDATE stats SET counter = 1;
COMMIT;
-- Result: counter = 1 (one update is lost!)
When two transactions read the same value and each updates it, the later write overwrites the earlier one.
Solutions:
- `SELECT ... FOR UPDATE` (pessimistic lock)
- `UPDATE stats SET counter = counter + 1` (atomic operation, as sketched below)
- Optimistic locking (`WHERE version = ?`)
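As a concrete illustration of the atomic-operation fix, here is a minimal Python sketch using psycopg2 (the DSN is a placeholder; the `stats` table is the one from the example). Pushing the arithmetic into the UPDATE itself means there is no read-modify-write window to race through:

```python
import psycopg2

conn = psycopg2.connect("dbname=demo")  # placeholder DSN

with conn, conn.cursor() as cur:
    # The increment happens inside the UPDATE, so two concurrent
    # transactions serialize on the row lock instead of overwriting
    # each other's result.
    cur.execute("UPDATE stats SET counter = counter + 1")
# `with conn` commits on success and rolls back on exception (psycopg2).
```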
Write Skew: Snapshot's Fatal Weakness
-- Hospital scheduling: at least 1 doctor must be on call
-- Initial state: both Alice and Bob are on call
-- T1 (Alice requests off)
BEGIN;
SELECT COUNT(*) FROM doctors WHERE on_call = true; -- 2
-- 2 doctors, so 1 going off is fine
UPDATE doctors SET on_call = false WHERE name = 'Alice';
COMMIT;
-- T2 (Bob requests off, concurrently)
BEGIN;
SELECT COUNT(*) FROM doctors WHERE on_call = true; -- 2 (snapshot before Alice's change)
UPDATE doctors SET on_call = false WHERE name = 'Bob';
COMMIT;
-- Result: both off! Zero doctors on call!
Since the two transactions modified different rows, there's no write conflict. But the result is that the invariant (at least 1 on call) is broken. This is called Write Skew.
Write Skew can occur under Snapshot Isolation and is fully prevented only under true Serializable.
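If you must stay on Snapshot Isolation, one common workaround is to make the conflict explicit by locking the rows the decision was based on. A minimal sketch, assuming psycopg2 and the `doctors` table above (placeholder DSN):

```python
import psycopg2

conn = psycopg2.connect("dbname=demo")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Lock every on-call row first. Alice's and Bob's transactions now
    # contend on the same rows: one blocks until the other commits, then
    # either sees the new state (Read Committed) or gets a serialization
    # error to retry (Repeatable Read, in PostgreSQL).
    cur.execute("SELECT name FROM doctors WHERE on_call = true FOR UPDATE")
    if cur.rowcount > 1:
        cur.execute("UPDATE doctors SET on_call = false WHERE name = %s",
                    ("Alice",))
```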
Read Skew: Inconsistent Aggregation
-- Accounts A, B each with 100
-- T1 (read)
BEGIN;
SELECT balance FROM accounts WHERE id = 'A'; -- 100
-- T2 (transfer A -> B)
BEGIN;
UPDATE accounts SET balance = 50 WHERE id = 'A';
UPDATE accounts SET balance = 150 WHERE id = 'B';
COMMIT;
-- T1 continues
SELECT balance FROM accounts WHERE id = 'B'; -- 150
COMMIT;
T1 reads A=100, B=150 and gets a total of 250, which is wrong. Actually it should be A=50, B=150 or A=100, B=100.
Snapshot Isolation solves this perfectly: T1 sees a snapshot from its start time and reads A=100, B=100.
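You can watch this yourself with two connections. A hedged sketch against PostgreSQL with psycopg2 (placeholder DSNs, `accounts` table from the example): the reader runs at Repeatable Read, so its snapshot is fixed by its first query.

```python
import psycopg2

reader = psycopg2.connect("dbname=demo")  # placeholder DSNs
writer = psycopg2.connect("dbname=demo")

reader.set_session(isolation_level="REPEATABLE READ")
rc = reader.cursor()
rc.execute("SELECT balance FROM accounts WHERE id = 'A'")
print(rc.fetchone())  # (100,) -- the snapshot is taken here

with writer, writer.cursor() as wc:  # the transfer commits in between
    wc.execute("UPDATE accounts SET balance = 50 WHERE id = 'A'")
    wc.execute("UPDATE accounts SET balance = 150 WHERE id = 'B'")

rc.execute("SELECT balance FROM accounts WHERE id = 'B'")
print(rc.fetchone())  # (100,) -- still the old value: total stays 200
reader.commit()
```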
3. MVCC: Multi-Version Concurrency Control
The Basic Idea of MVCC
Traditional DBs managed concurrency with locks. Reads took locks that blocked writes, writes took locks that blocked reads. This led to the famous "Readers block writers, writers block readers" problem.
MVCC solves this:
"Keep multiple versions of each row, and each transaction reads the version appropriate for itself."
Result: Reads don't block writes, and writes don't block reads.
The Core Structure of MVCC
Metadata added to each row:
- xmin: The transaction ID that created this version.
- xmax: The transaction ID that deleted this version (0 or infinity if none).
Visibility rule:
- Transaction T can see row version V if and only if `xmin <= T` AND (`xmax == 0` OR `xmax > T`).
That is, only versions that already existed at T's start and are still alive are visible to T.
4. Deep Dive into PostgreSQL's MVCC
The Actual Storage Structure of Rows
In PostgreSQL, every row (tuple) has the following header:
typedef struct HeapTupleHeaderData {
    TransactionId   t_xmin;  // Creating transaction
    TransactionId   t_xmax;  // Deleting/updating transaction
    CommandId       t_cmin;  // Command ID within the same transaction
    CommandId       t_cmax;  // Command ID of the deleting command
    ItemPointerData t_ctid;  // Location of this tuple (or pointer to next version)
    // ... (simplified; the real header overlays some of these fields)
} HeapTupleHeaderData;
How UPDATE Actually Works
In PostgreSQL, UPDATE is not in-place:
- Set xmax on the old row (mark deleted).
- Insert a new row with xmin = current transaction.
- Link the old row's `ctid` to point to the new row.
Result: multiple versions of the same logical row coexist on the page.
Before UPDATE:
[xmin=100, xmax=0, id=1, balance=100]
After UPDATE by tx 150:
[xmin=100, xmax=150, id=1, balance=100] <- old version
[xmin=150, xmax=0, id=1, balance=200] <- new version
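You can observe this version chain directly: PostgreSQL exposes `xmin`, `xmax`, and `ctid` as hidden system columns on every table. A small sketch with psycopg2 (placeholder DSN, `account` table from the example):

```python
import psycopg2

conn = psycopg2.connect("dbname=demo")  # placeholder DSN
cur = conn.cursor()

cur.execute("SELECT xmin, xmax, ctid, balance FROM account WHERE id = 1")
print(cur.fetchone())  # e.g. ('812', '0', '(0,1)', 100)

cur.execute("UPDATE account SET balance = 200 WHERE id = 1")

# Same logical row, but a brand-new tuple: fresh xmin, a new ctid slot.
cur.execute("SELECT xmin, xmax, ctid, balance FROM account WHERE id = 1")
print(cur.fetchone())  # e.g. ('813', '0', '(0,2)', 200)
conn.commit()
```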
Snapshot
A PostgreSQL transaction obtains a snapshot at start. The snapshot contains:
- `xmin`: oldest transaction ID still active at snapshot time.
- `xmax`: the first not-yet-assigned transaction ID (anything `>= xmax` is invisible).
- `xip[]`: list of transaction IDs in progress at snapshot time.
Determining whether tuple T is visible:
def is_visible(tuple_xmin, tuple_xmax, snapshot):
    # Simplified: ignores aborted transactions and the current
    # transaction's own changes.
    # Creator was still in progress when the snapshot was taken
    if tuple_xmin in snapshot.xip:
        return False
    # Creator started after the snapshot was taken
    if tuple_xmin >= snapshot.xmax:
        return False
    # Deleted, and the deleter is neither in progress nor in the future
    if tuple_xmax != 0 and tuple_xmax not in snapshot.xip:
        if tuple_xmax < snapshot.xmax:
            return False  # delete committed before the snapshot
    return True
VACUUM: Garbage Collection
A side effect of MVCC is that old row versions keep piling up. Cleaning them is the job of VACUUM:
- VACUUM: reclaims dead tuples no transaction can see anymore.
- VACUUM FULL: rewrites the table to also return disk space (requires an ACCESS EXCLUSIVE lock).
- Autovacuum: triggered automatically.
Neglecting VACUUM causes table bloat and sharply degrades performance.
The Truth About PostgreSQL Isolation Levels
PostgreSQL does not support Read Uncommitted. Even when set, it behaves like Read Committed. Actual implementation:
| Setting | Actual Behavior |
|---|---|
| Read Uncommitted | = Read Committed |
| Read Committed (default) | New snapshot per query |
| Repeatable Read | Snapshot fixed at transaction start (= Snapshot Isolation) |
| Serializable | SSI (Serializable Snapshot Isolation) |
Note: PostgreSQL's Repeatable Read is stronger than the standard. It also prevents Phantom Read. But Write Skew can still occur. To fully prevent it, raise to Serializable.
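In client code the level is chosen per session or per transaction. A minimal psycopg2 sketch (placeholder DSN); note that asking for READ UNCOMMITTED would silently give you READ COMMITTED, exactly as the table says:

```python
import psycopg2

conn = psycopg2.connect("dbname=demo")  # placeholder DSN
conn.set_session(isolation_level="SERIALIZABLE")

with conn, conn.cursor() as cur:
    cur.execute("SHOW transaction_isolation")
    print(cur.fetchone())  # ('serializable',)
```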
5. MySQL InnoDB's MVCC
Undo-Log-Based MVCC
Where PostgreSQL appends new row versions into the table itself, InnoDB uses in-place update + undo log:
- Only the latest version is stored in the actual data page.
- Previous versions are stored in the undo log.
- When an older transaction needs to read an old version, it walks the undo log to reconstruct it.
Data page:
[id=1, balance=200, DB_TRX_ID=150, DB_ROLL_PTR -> undo]
|
v
Undo Log:
[id=1, balance=100, DB_TRX_ID=100]
Pros and Cons Compared
| Aspect | PostgreSQL | InnoDB |
|---|---|---|
| UPDATE method | New row insert + mark old row | In-place update + undo log |
| VACUUM needed | Yes (bloat management) | No (undo log auto-cleaned) |
| Index impact | All indexes updated (except HOT) | Only clustered index if index value unchanged |
| Old-version lookup | Instant (same page) | Must walk undo log |
| Long-transaction impact | Causes bloat | Undo log explosion |
InnoDB Repeatable Read = Snapshot Isolation
InnoDB's default isolation level is Repeatable Read, and it creates a consistent read view at the transaction's first read (or at `START TRANSACTION WITH CONSISTENT SNAPSHOT`). This is effectively identical to Snapshot Isolation.
Next-Key Lock: Phantom Prevention
InnoDB prevents phantoms in Repeatable Read with Next-Key Locks:
-- id column index: 5, 10, 20
-- T1
BEGIN;
SELECT * FROM t WHERE id > 8 AND id < 15 FOR UPDATE;
-- This query locks the range (5, 10], (10, 20)
-- T2
INSERT INTO t VALUES (12); -- Waits! (due to range lock)
A Next-Key Lock locks an index entry plus the gap before it. This achieves Repeatable Read + phantom prevention.
Gap Lock, Record Lock, Next-Key Lock
- Record Lock: lock on a specific index entry.
- Gap Lock: lock on the "gap" between index entries.
- Next-Key Lock: Record Lock + the Gap Lock before it.
Combinations of these implement the various isolation levels.
6. Snapshot Isolation and Its Limits
The Appeal of Snapshot Isolation
Snapshot Isolation (SI) is sold as "Repeatable Read" or even "Serializable" in many DBs. Why?
- Reads are fast: no locks needed, just a snapshot.
- Prevents Read Skew: sees consistent point-in-time data.
- Simple to implement: layers right on top of MVCC.
The First-Committer-Wins (FCW) Rule
Under SI, when multiple transactions try to modify the same row, the one that commits first wins. The other transactions must be rolled back with a "write conflict" error.
-- Both T1 and T2 try to modify the same row
-- T1: UPDATE row SET ... WHERE id = 1
-- T2: UPDATE row SET ... WHERE id = 1
-- T1 commits first -> success
-- T2 attempts commit -> ERROR: could not serialize access
The application must catch this error and retry the transaction.
Write Skew: SI's Fatal Flaw
The doctor on-call example earlier is Write Skew. When two transactions modify different rows, FCW doesn't apply. The result is a broken invariant.
The common Write Skew pattern:
- Transaction reads data to check a condition.
- Based on the condition, modifies a different row.
- Another transaction checks the same condition with the same logic and modifies.
- Both transactions succeed, but the overall condition is violated.
The Relationship Between Phantom and Write Skew
Phantom is a special case of Write Skew:
- General Write Skew: reads existing rows.
- Phantom Write Skew: the read observes that "no rows match" the condition, and the write then inserts one.
-- Example: ensuring username uniqueness
-- T1: checks that username "alice" does not exist -> inserts
-- T2: checks that username "alice" does not exist -> inserts
-- Result: same name inserted twice (possible under SI!)
Preventing this requires actual Serializable or reliance on unique constraints.
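The unique-constraint route is often the pragmatic fix. A sketch, assuming a `users` table with `UNIQUE (username)` (psycopg2, placeholder DSN): the loser of the race gets a constraint error instead of silently inserting a duplicate.

```python
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=demo")  # placeholder DSN

try:
    with conn, conn.cursor() as cur:
        # Both racing transactions can pass the "does alice exist?" check,
        # but the unique index enforces the invariant at write time anyway.
        cur.execute("INSERT INTO users (username) VALUES (%s)", ("alice",))
except errors.UniqueViolation:
    print("username already taken")
```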
7. Serializable Snapshot Isolation (SSI)
The Problem
Snapshot Isolation is fast but allows Write Skew. True Serializable is safe but slow due to locks. Is there a better way?
The 2008 paper by Michael Cahill, Uwe Röhm, and Alan Fekete, "Serializable Isolation for Snapshot Databases", provided the answer: SSI (Serializable Snapshot Isolation).
SSI's Key Idea
"Behave like SI, but detect non-serializable patterns and roll them back."
Track read/write dependency graphs between transactions, and if a cycle is detected, abort one side.
Dangerous Structure
Cahill et al. proved that every non-serializable execution under SI contains a "dangerous structure": two consecutive rw-dependency edges in the transaction dependency graph:
T1 --rw--> T2 --rw--> T3
An rw-dependency T1 --rw--> T2 means T1 read a version of some data that T2 then overwrote. Two such edges in a row are the danger signal, and this is exactly the pattern SSI detects.
PostgreSQL's SSI Implementation
Since PostgreSQL 9.1, SERIALIZABLE is implemented as SSI:
- Acquire SI snapshot at transaction start.
- Track predicate locks on every read/write operation.
- On detecting a dangerous pattern, roll back one side with `ERROR: could not serialize access`.
- The application retries the rolled-back transaction.
Pros: performance close to SI, safety of Serializable. Cons: requires rollback-handling logic; errors occur at counterintuitive moments.
Usage Example
-- PostgreSQL
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- Doctor on-call example
SELECT COUNT(*) FROM doctors WHERE on_call = true;
UPDATE doctors SET on_call = false WHERE name = 'Alice';
-- If a concurrent transaction modifies Bob,
-- one will be rolled back at commit:
-- ERROR: could not serialize access due to read/write dependencies
COMMIT;
Application pattern:
def do_doctor_off(doctor_name):
    while True:
        try:
            with db.transaction(isolation="serializable"):
                count = db.execute(
                    "SELECT COUNT(*) FROM doctors WHERE on_call = true")
                if count > 1:
                    db.execute(
                        "UPDATE doctors SET on_call = false WHERE name = %s",
                        doctor_name)
                else:
                    raise ValueError("Cannot go off, you're the only one on call")
            return
        except SerializationError:  # e.g. psycopg2.errors.SerializationFailure
            continue  # SSI rolled us back: just retry
8. A Practical Guide to Each Isolation Level
Read Committed (Is the Default Enough?)
The default for most web applications. Fine when it is enough for each individual query to see consistent data:
- Simple CRUD
- User profile read/update
- Blog posts
Caution: Non-repeatable read is allowed, so code that reads the same row twice in a transaction needs care.
Repeatable Read / Snapshot Isolation
Good for report generation and complex queries:
- Dashboard aggregations
- When a consistent view is needed across a long transaction
- Read-heavy analytical queries
Caution: Be aware of the Write Skew possibility; consider explicit locks or Serializable when important invariants are involved.
Serializable (SSI)
When absolute correctness is essential:
- Financial transactions
- Resource allocation like doctor on-call or flight booking
- Complex conditions beyond unique constraints
Trade-offs:
- Performance cost (rollback overhead)
- Retry logic required
- Contention escalates in high-concurrency environments
How to Choose?
- Start with Read Committed by default.
- As inconsistency bugs surface, raise only that transaction to Serializable.
- Don't make everything Serializable. Apply it selectively to critical paths.
9. Optimistic vs Pessimistic Locking
Beyond isolation levels, there are concurrency control strategies.
Pessimistic Locking
"Conflicts will happen often. Prevent them up front."
BEGIN;
SELECT * FROM account WHERE id = 1 FOR UPDATE; -- Acquire lock
-- compute...
UPDATE account SET balance = ... WHERE id = 1;
COMMIT; -- Release lock
Pros: no conflicts, simple logic. Cons: lock contention, deadlocks, performance degradation when locks are held long.
Optimistic Locking
"Conflicts are rare. Detect and retry."
-- Read
SELECT balance, version FROM account WHERE id = 1; -- balance=100, version=5
-- Application computes
new_balance = 100 - 30
-- Write (version check)
UPDATE account
SET balance = 70, version = version + 1
WHERE id = 1 AND version = 5;
-- If affected rows == 0, another transaction modified it first -> retry
Pros: no lock overhead, good for high concurrency. Cons: application-side retry logic required; slower when conflicts are frequent.
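Put together as application code, the optimistic pattern looks roughly like this (a psycopg2 sketch; `account` has the `version` column from the example above, and the caller supplies the connection):

```python
import psycopg2

def withdraw(conn, account_id, amount, max_retries=5):
    """Optimistic locking: retry while the version check fails."""
    for _ in range(max_retries):
        with conn, conn.cursor() as cur:
            cur.execute("SELECT balance, version FROM account WHERE id = %s",
                        (account_id,))
            balance, version = cur.fetchone()
            cur.execute(
                "UPDATE account SET balance = %s, version = version + 1 "
                "WHERE id = %s AND version = %s",
                (balance - amount, account_id, version))
            if cur.rowcount == 1:  # our version was still current
                return True
        # rowcount == 0: someone else committed first; re-read and retry
    return False
```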
Selection Guide
| Situation | Recommended |
|---|---|
| Read-heavy, few writes | Optimistic |
| Write-heavy, frequent conflicts | Pessimistic |
| User-facing latency matters | Optimistic |
| Retry logic hard to build | Pessimistic |
| Distributed systems | Optimistic (lock propagation cost) |
10. Distributed Transactions and Isolation
The Limits of Single-DB Isolation
Everything so far has been isolation within a single DB. What about transactions across multiple DBs or services?
Two-Phase Commit (2PC)
The traditional distributed transaction protocol:
- Prepare Phase: Coordinator asks all participants "ready?"
- Commit Phase: If all are ready, send "commit"; otherwise "rollback".
Problems:
- Blocking on coordinator failure.
- Performance hit (two network round-trips).
- Unavailability grows with the number of participants.
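The protocol is visible even at the driver level: psycopg2 implements the DB-API two-phase-commit extensions on top of PostgreSQL's `PREPARE TRANSACTION`. A sketch with two databases standing in for two participants (placeholder DSNs and schemas; the server needs `max_prepared_transactions > 0`):

```python
import psycopg2

orders = psycopg2.connect("dbname=orders")      # placeholder DSNs
payments = psycopg2.connect("dbname=payments")

xid_o = orders.xid(0, "txn-42", "orders")
xid_p = payments.xid(0, "txn-42", "payments")

orders.tpc_begin(xid_o)
payments.tpc_begin(xid_p)
orders.cursor().execute("INSERT INTO orders (id) VALUES (42)")           # hypothetical schemas
payments.cursor().execute("INSERT INTO payments (order_id) VALUES (42)")

# Phase 1: each participant votes "ready" and durably records its vote.
orders.tpc_prepare()
payments.tpc_prepare()

# Phase 2: only after all prepares succeed does the coordinator commit.
orders.tpc_commit()
payments.tpc_commit()
```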
Saga Pattern
An alternative to 2PC. Recovers from failures with compensating transactions:
Order Service: create_order(ID)
Payment Service: charge(ID)
Inventory: reserve(ID)
On failure, in reverse order:
Inventory: unreserve(ID)
Payment Service: refund(ID)
Order Service: cancel_order(ID)
Pros: no blocking, service independence. Cons: compensating logic is a burden, weak isolation (intermediate states may be externally visible).
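A saga orchestrator can be sketched in a few lines; the service calls here are stand-in stubs, with `reserve` failing on purpose so the compensations run in reverse:

```python
def create_order(oid): print("create_order", oid)
def cancel_order(oid): print("cancel_order", oid)
def charge(oid):       print("charge", oid)
def refund(oid):       print("refund", oid)
def reserve(oid):      raise RuntimeError("out of stock")  # simulated failure
def unreserve(oid):    print("unreserve", oid)

STEPS = [(create_order, cancel_order), (charge, refund), (reserve, unreserve)]

def run_saga(order_id):
    done = []
    try:
        for action, compensate in STEPS:
            action(order_id)
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate(order_id)  # compensating transactions, newest first
        raise

try:
    run_saga(42)
except RuntimeError:
    pass  # refund(42) and cancel_order(42) have already run
```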
Percolator / Spanner
Google's distributed DBs aim for true Serializable:
- Percolator (on top of Bigtable): general-purpose distributed transactions via 2PC + MVCC.
- Spanner: TrueTime API for global clock sync + 2PC + Paxos.
TiDB is an open-source implementation of the Percolator model.
11. Real-World Troubleshooting
Problem 1: "deadlock detected"
Cause: Two transactions waiting on each other's locks.
Solution:
- Always acquire locks in the same order (e.g., ascending id; see the sketch below).
- Shorten lock hold times (shorter transactions).
- Use `SELECT ... FOR UPDATE SKIP LOCKED` to skip already-locked rows.
- Implement retry logic.
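The lock-ordering rule from the first bullet, sketched for a transfer between two accounts (psycopg2; `account` table from earlier):

```python
import psycopg2

def transfer(conn, src, dst, amount):
    # Lock both rows in ascending-id order: two concurrent transfers can
    # no longer hold one lock each while waiting for the other (AB/BA).
    first, second = sorted((src, dst))
    with conn, conn.cursor() as cur:
        cur.execute("SELECT 1 FROM account WHERE id = %s FOR UPDATE", (first,))
        cur.execute("SELECT 1 FROM account WHERE id = %s FOR UPDATE", (second,))
        cur.execute("UPDATE account SET balance = balance - %s WHERE id = %s",
                    (amount, src))
        cur.execute("UPDATE account SET balance = balance + %s WHERE id = %s",
                    (amount, dst))
```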
Problem 2: "could not serialize access"
Cause: SSI rollback.
Solution:
- Retry in the application.
- Exponential backoff.
- Consider switching hot paths to explicit locks.
Problem 3: Bloat from Long Transactions (PostgreSQL)
Symptoms: disk usage spikes, queries slow down.
Cause: A long-open transaction pins an old snapshot, so VACUUM cannot reclaim any dead tuples that transaction might still see.
Solution:
-- Find long-running transactions
SELECT pid, state, query, xact_start, now() - xact_start AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_start;
-- Manual VACUUM
VACUUM (ANALYZE, VERBOSE) table_name;
Prevention:
- Set `idle_in_transaction_session_timeout`.
- Keep transactions short.
- Don't leave transactions open and idle in the application.
Problem 4: Lost Update in Web Applications
Symptom: When users edit concurrently, one user's changes disappear.
Solution:
- A `version` column + optimistic locking.
- Or CRDT-based collaborative editing.
- Or user feedback ("Someone else edited first. Merge?").
Quiz Review
Q1. What's the difference between PostgreSQL's Repeatable Read and the SQL standard's Repeatable Read?
A. In the SQL standard, Repeatable Read allows Phantom Read. However, PostgreSQL's Repeatable Read is implemented as Snapshot Isolation and prevents phantoms too. But Write Skew can still occur. To fully prevent it, raise to Serializable.
Q2. How is Write Skew different from Lost Update?
A. Lost Update is when two transactions modify the same row and one update disappears. It can be prevented by the first-committer-wins rule or atomic UPDATE. Write Skew is when transactions modify different rows so there's no write conflict, but decisions based on read data break the overall invariant. Snapshot Isolation cannot prevent it; Serializable is required.
Q3. What's the biggest implementation difference between PostgreSQL's and InnoDB's MVCC?
A. PostgreSQL appends new versions into the table itself (copy-on-write). Dead tuples accumulate and must be cleaned by VACUUM. InnoDB uses in-place update + undo log. The latest version is in the data page and past versions are in the undo log. InnoDB doesn't need VACUUM, but long transactions can cause undo log explosion.
Q4. How does SSI (Serializable Snapshot Isolation) achieve Serializable while maintaining performance?
A. SSI behaves like Snapshot Isolation by default (no locks, fast reads). It tracks read/write dependencies between transactions. When a "non-serializable double rw cycle" is detected, it rolls back one transaction. This is an optimistic approach: without conflicts it keeps SI's performance, and penalties only occur during conflicts.
Q5. Why is SELECT ... FOR UPDATE still useful in an MVCC environment?
A. MVCC provides read consistency but doesn't guarantee verification of state just before write. FOR UPDATE takes a row lock on the current latest committed version and reads it. A subsequent UPDATE is based on this version, preventing Lost Update. This is needed when you want to manually block Write Skew under Snapshot Isolation or explicitly prevent Lost Update.
Conclusion: Isolation-Level Pitfalls and Remedies
Key Lessons
- Don't trust the name "Repeatable Read". Each DB implements it differently. Check the docs and actual behavior.
- Snapshot Isolation is not a cure-all. Write Skew is subtle but fatal. Consider Serializable for invariant-sensitive logic.
- Perfect Serializable is expensive. When performance matters, a practical strategy is to keep the whole system on SI and promote only risky paths to Serializable.
- MVCC is not free. PostgreSQL has VACUUM, InnoDB has the undo log. Long transactions are poison for both.
- Isolation level is one layer of defense. Use unique constraints, foreign keys, and application-side validation together.
Real-World Checklist
- What's the default isolation level of our team's DB?
- Is there any logic where Write Skew could occur?
- Are long transactions straining VACUUM/undo log?
- Are we using an isolation level that needs retry logic?
- Is deadlock detection and retry ready?
Much of the difficulty of distributed systems is "data consistency." Isolation levels sit right in the middle of that. Understanding actual behavior rather than memorizing names is what prevents bugs.
References
- A Critique of ANSI SQL Isolation Levels (Berenson et al., 1995) - Flaws in isolation levels
- Serializable Isolation for Snapshot Databases (Cahill et al., 2008) - The original SSI paper
- PostgreSQL Concurrency Control - Official docs
- MySQL InnoDB Locking and Transaction Model
- Designing Data-Intensive Applications, Ch.7 - Martin Kleppmann's classic
- Jepsen: PostgreSQL Analysis - Verification report
- Hermitage: Isolation Level Test Suite - Comparison of actual DB behavior