Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

PostgreSQL is a robust, extensible open-source relational database refined over a long time. It is more than a tool that runs SQL; concurrency control, storage structure, query optimization, replication, and extensibility are all tied together by one coherent design philosophy.

This article unpacks PostgreSQL's internals around the questions "what do the processes and memory look like, how do transactions avoid seeing each other, and what path does a query take to execute?" We use many diagrams to build intuition first. Because details can vary by version, please confirm exact behavior against the official documentation (postgresql.org).

1. Process and Memory Structure

PostgreSQL adopts a multi-process architecture. Each time a client connects, a dedicated backend process is spawned, and several background processes cooperate around shared memory.

┌──────────────────────────────────┐

client ──────▶│ Postmaster (main process) │

client ──────▶│ - accept connections, fork backend│

└───────────────┬──────────────────┘

│ fork

┌─────────────────────────┼─────────────────────────┐

▼ ▼ ▼

┌───────────┐ ┌───────────┐ ┌───────────┐

│ Backend 1 │ │ Backend 2 │ │ Backend N │

│ (per sess)│ │ (per sess)│ │ (per sess)│

└─────┬─────┘ └─────┬─────┘ └─────┬─────┘

│ │ │

└────────────┬───────────┴────────────┬────────────┘

▼ ▼

┌───────────────────────────────────────────────┐

│ Shared Memory │

│ ┌─────────────────┐ ┌─────────────────────┐ │

│ │ Shared Buffers │ │ WAL Buffers │ │

│ │ (table/idx pages)│ │ (change log buffer) │ │

│ └─────────────────┘ └─────────────────────┘ │

│ ┌─────────────────┐ ┌─────────────────────┐ │

│ │ Lock / CLOG │ │ other shared state │ │

│ └─────────────────┘ └─────────────────────┘ │

└───────────────────────────────────────────────┘

▲ ▲

┌────────────┴───────────┬─────────────┴────────────┐

▼ ▼ ▼

┌───────────┐ ┌──────────────┐ ┌──────────────┐

│ Background│ │ WAL Writer │ │ Checkpointer │

│ Writer │ │ (WAL to disk) │ │ │

└───────────┘ └──────────────┘ └──────────────┘

┌───────────┐ ┌──────────────┐

│ Autovacuum│ │ Archiver etc. │

└───────────┘ └──────────────┘

Core Memory Areas

- **Shared Buffers**: a shared pool that caches table/index pages from disk in memory. Most reads/writes happen here, forming the first line of defense against disk I/O.

- **WAL Buffers**: a buffer that briefly accumulates WAL (Write-Ahead Log) records that record changes first.

- **work_mem / maintenance_work_mem**: working memory used per operation by a backend, such as sorts, hashes, and VACUUM.

Read path: query ─▶ Shared Buffers hit? ─Yes─▶ return from memory

│ No

▼

load page from disk ─▶ place in Shared Buffers ─▶ return

Write path: change ─▶ mark the page in Shared Buffers dirty

─▶ record the change in WAL (first!)

─▶ later the Background Writer/Checkpointer flush to disk

Writing the WAL first (Write-Ahead Logging) is the key. Because the log is safely written before the data files, even after a failure you can replay the log to recover consistency.

2. MVCC — Multi-Version Concurrency Control

PostgreSQL's concurrency model is **MVCC (Multi-Version Concurrency Control)**. The core idea is "reads do not block writes, and writes do not block reads." To achieve this, multiple versions of the same row are kept.

Each row version (tuple) carries invisible system columns.

Visibility info in the tuple header

┌──────────┬─────────────────────────────────────┐

│ xmin │ transaction ID that created this ver │

│ xmax │ transaction ID that deleted/updated it│

│ ctid │ physical location (block, offset) │

└──────────┴─────────────────────────────────────┘

An UPDATE does not modify the existing row in place; it **adds a new version and marks xmax on the old one**. This is why a PostgreSQL UPDATE effectively behaves close to "delete + insert."

Before UPDATE:

[v1: xmin=100, xmax=0 ] ← currently visible version

After UPDATE (txid 150):

[v1: xmin=100, xmax=150] ← old version, expired by 150

[v2: xmin=150, xmax=0 ] ← new version

Determining Visibility

Which transaction can see which version is decided by a **snapshot**. A snapshot captures "which transactions are committed as of this moment."

Visibility rules (simplified)

For a version to be visible:

1) the xmin transaction is committed and precedes my snapshot, and

2) xmax is absent, or the xmax transaction is not yet committed

──▶ thus two concurrently running transactions

can see different "worlds of versions."

VACUUM — Cleaning Up Dead Tuples

The cost of MVCC is the accumulation of "dead tuples." Old versions no longer visible to any snapshot only take up space. Reclaiming them is the job of **VACUUM**.

dead tuples accumulate ──▶ table bloat ──▶ performance drop

│

▼

VACUUM: reclaim dead-tuple space into a reusable free list

VACUUM FULL: rewrite the table to physically shrink it (heavy lock)

Autovacuum: background cleanup that runs automatically at a threshold

Also, because transaction IDs are finite, "freezing" the IDs of old tuples to prevent the **wraparound** problem is another important role of VACUUM. Tuning autovacuum is very important in operations.

3. Index Structures

PostgreSQL provides several index types to match data and query patterns. Choosing the right index is performance.

| Index | Suited queries | Typical use |

| --- | --- | --- |

| B-tree | equality/range/sort | primary keys, general columns |

| Hash | equality only | simple equality |

| GIN | multi-value containment | arrays, JSONB, full-text |

| GiST | geometric/nearest/range | spatial data, range types |

| BRIN | physically sorted large data | time series, logs |

B-tree (most common)

[ 50 ]

/ \

[20 35] [70 90]

/ | \ / | \

leaf nodes (sorted keys + row pointers) are linked together

──▶ range scans are very efficient

BRIN (Block Range Index)

groups table blocks into ranges, storing only "min/max of this range"

┌──────────┬──────────┬──────────┐

│ blk 0-127│ blk128-255│ blk256-..│

│ min/max │ min/max │ min/max │

└──────────┴──────────┴──────────┘

──▶ the index is very small; suited to naturally sorted large data

GIN is powerful when one row holds many keys (e.g., JSONB documents, arrays, word tokens). GiST suits spatial/range queries like "nearness" or "overlap."

4. Query Execution — Planner and Executor

A single line of SQL passes through several stages before it becomes a result.

SQL text

│

▼

[ Parser ] ──▶ syntax tree (parse tree)

│

▼

[ Analyze/Rewrite ] ──▶ expand views, apply rules

│

▼

[ Planner/Optimizer ]

│ - generate possible execution plans

│ - estimate each plan's "cost" using statistics (pg_statistic)

│ - pick the cheapest plan

▼

[ Executor ] ──▶ pull tuples following the plan tree

│

▼

result

Cost-Based Optimization

The optimizer estimates cost from statistics. Even for the same query, depending on data distribution an index scan may win, or a sequential scan may win.

WHERE status = 'active'

Case 1) 'active' is 1% of the total ──▶ index scan is cheaper

Case 2) 'active' is 90% of the total ──▶ sequential scan is cheaper

(sequential I/O beats random I/O)

──▶ keep statistics fresh with ANALYZE so the optimizer chooses right.

Join Strategies

Nested Loop : favorable when one side is small and indexed

Hash Join : equality-joining two large tables (builds a hash table)

Merge Join : efficient when both sides are sorted

`EXPLAIN ANALYZE` shows the estimated plan alongside actual execution, making it the most important tool for diagnosing the gap between estimate and reality.

EXPLAIN (ANALYZE, BUFFERS)

SELECT * FROM orders WHERE customer_id = 42 AND status = 'paid';

5. Replication — Streaming and Logical

PostgreSQL's replication is based on the WAL. The key distinction is "physical replication (streaming)" versus "logical replication."

Streaming replication (physical)

Primary ──── WAL record stream ───▶ Standby (byte-level replica)

│ │

writable read-only (hot standby)

──▶ replicates the entire cluster; same version/structure required

Logical replication (publish/subscribe)

Publisher ── changes (row-level INSERT/UPDATE/DELETE) ──▶ Subscriber

──▶ can selectively replicate per table

──▶ good for across major versions, partial replication, consolidation

Synchronous vs Asynchronous

Async (default): Primary commits first, Standby follows

──▶ fast, but a small tail may be lost on failure

Sync: Standby must confirm WAL receipt before commit is final

──▶ safer but adds latency

Replication is the foundation of high availability (HA) and read scaling. You spread read load to standbys and secure availability via failover (promotion) during incidents.

6. Partitioning

A technique that treats a large table logically as one while physically splitting it into multiple pieces (partitions).

orders (parent, partition key: order_date)

┌──────────────┬──────────────┬──────────────┐

▼ ▼ ▼ ▼

orders_2026q1 orders_2026q2 orders_2026q3 orders_2026q4

(Jan-Mar) (Apr-Jun) (Jul-Sep) (Oct-Dec)

CREATE TABLE orders (

id bigint,

order_date date NOT NULL,

amount numeric

) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2026q1 PARTITION OF orders

FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');

The biggest benefit of partitioning is **partition pruning**. When the query condition points to a specific partition only, the rest are not scanned at all.

WHERE order_date >= '2026-07-01'

Planner: prune q1, q2 partitions ─▶ scan only q3, q4

──▶ reduced scan range improves performance

Beyond RANGE, LIST and HASH partitioning are supported, and you can DETACH an old partition wholesale to simplify data lifecycle management.

7. The Extension Ecosystem — Centered on pgvector

A great strength of PostgreSQL is the ability to broaden core features with **extensions**. A prime example, prominent in the AI era, is `pgvector`.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (

id bigserial PRIMARY KEY,

content text,

embedding vector(1536) -- embedding dimension

);

Vector similarity search is used to find "semantically near" documents. You query nearest neighbors with a distance operator.

-- the 5 nearest by cosine distance

SELECT id, content

FROM documents

ORDER BY embedding <=> '[0.12, -0.04, ...]'

LIMIT 5;

At scale, approximate nearest neighbor (ANN) indexes provide speed.

Exact search (brute force) : compute distance to every vector ──▶ exact but slow

ANN index (HNSW/IVF) : narrow candidates via graph/cluster structures

──▶ a small approximation for a big speedup

Its place in a RAG pipeline

document ─▶ embed ─▶ [store in pgvector table]

question ─▶ embed ─▶ [search similar docs via <=>] ─▶ inject into LLM context

Beyond this, a rich extension ecosystem—PostGIS (spatial), pg_stat_statements (query stats), various foreign data wrappers (FDW), and more—makes PostgreSQL a "programmable data platform."

8. Common Operational Pitfalls

[ ] Disabling or neglecting Autovacuum ─▶ table bloat, wraparound risk

[ ] Stale statistics (no ANALYZE) ─▶ optimizer misjudges, bad plans

[ ] Index overuse ─▶ higher write cost, disk/maintenance burden

[ ] Neglected long-running transactions ─▶ VACUUM cannot clean dead tuples

[ ] Connection count explosion ─▶ per-process memory burden, consider a pooler (PgBouncer)

[ ] Oversized work_mem ─▶ memory blowup under concurrency

In particular, a "long-open transaction" is a silent killer. If a transaction stays open for a long time, its snapshot keeps "needing" old versions, so VACUUM cannot reclaim dead tuples and bloat accumulates.

9. Checkpoints and Crash Recovery

The WAL records changes first, but the moment they actually land in the data files is separate. That synchronization point is the **checkpoint**.

time ───────────────────────────────────────────▶

│ │ │ │

checkpoint A checkpoint B checkpoint C

│◀── changes in this span recorded only in WAL ──▶│

│

at checkpoint: flush dirty pages to disk

──▶ WAL before this point is unneeded for recovery

When a crash occurs, you only need to replay the WAL after the last checkpoint. Frequent checkpoints make recovery faster but increase I/O load; infrequent ones do the opposite. Tuning this trade-off is one axis of operations.

crash occurs ──▶ restart

│

▼

find the last checkpoint location

│

▼

replay the WAL records after it in order (REDO)

│

▼

clean up uncommitted transactions ──▶ recovered to a consistent state

PITR (Point-In-Time Recovery)

By archiving the WAL, you can go beyond simple recovery to PITR, rewinding to a "specific moment." It replays the WAL up to the desired time on top of a base backup.

base backup (snapshot) ──▶ + replay archived WAL ──▶ state at the target time

──▶ useful for recovering to just before data was accidentally deleted

This mechanism is a core foundation of backup/recovery strategy and high-availability design; in production it is recommended to run regular base backups together with WAL archiving.

Conclusion

PostgreSQL's architecture is woven by one coherent philosophy. MVCC multiplies versions for concurrency, and VACUUM pays that cost back. The WAL becomes the foundation of durability and replication, and the cost-based optimizer fuels itself on statistics to pick the best path. Indexes and partitioning are tools matched to the shape of data, and extensions stretch the core into new domains.

Once you understand these mechanisms, you can answer operational questions like "why is this query slow" or "why does the disk keep filling up" structurally. Details and defaults can vary by version, so consult the official docs alongside this when tuning and operating in practice.

References

- [PostgreSQL Documentation](https://www.postgresql.org/docs/)

- [PostgreSQL — MVCC Concurrency Control](https://www.postgresql.org/docs/current/mvcc.html)

- [PostgreSQL — Write-Ahead Logging](https://www.postgresql.org/docs/current/wal-intro.html)

- [PostgreSQL — Index Types](https://www.postgresql.org/docs/current/indexes-types.html)

- [PostgreSQL — Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)

- [PostgreSQL — Logical Replication](https://www.postgresql.org/docs/current/logical-replication.html)

- [PostgreSQL — Routine Vacuuming](https://www.postgresql.org/docs/current/routine-vacuuming.html)

- [pgvector repository](https://github.com/pgvector/pgvector)