
PostgreSQL Internals Deep Dive — MVCC, VACUUM, WAL, Query Planner, Index, Partitioning, pgvector (2025)


"Postgres is not a fashion. It's a philosophy." — Tom Lane (PostgreSQL major contributor, 20+ years)

In the 2024 Stack Overflow Developer Survey, PostgreSQL held the #1 spot it took from MySQL. Not a coincidence. Over the last decade Postgres evolved into a monster: "handles JSON better than MongoDB, vector search better than Pinecone, analytics close to ClickHouse."

PostgreSQL started in 1986 as the POSTGRES project at UC Berkeley (Michael Stonebraker). SQL support in 1996 turned it into PostgreSQL, and for 30 years the philosophy has been ACID correctness, extensibility, standards compliance. This post is a map for those moving from "just using Postgres" to "understanding and tuning its internals."


1. MVCC — The Heart of PostgreSQL

Why MVCC was revolutionary

Traditional DBs used read locks: while transaction A reads, B cannot write. OLTP performance killer.

MVCC (Multi-Version Concurrency Control): "each transaction sees a snapshot at its point in time." Reads and writes don't block each other.

"Readers don't block writers, writers don't block readers."

Oracle vs PostgreSQL implementation

Oracle: old versions in Undo Segment, current version in the main table. PostgreSQL: all versions stored in the table, dead tuples cleaned by VACUUM.

Postgres's approach is simpler but has a cost: tables inevitably bloat with dead tuples. Cleaning them up is VACUUM's job.

Tuple attributes — xmin, xmax

Each row (tuple) has hidden system columns:

t_xmin   — transaction ID that created this tuple
t_xmax   — transaction ID that deleted/updated it (0 means still alive)
t_cmin   — command sequence within the same transaction

A transaction uses its snapshot (active XIDs at start) to decide:

  • t_xmin committed and visible to the snapshot, and t_xmax absent, aborted, or not yet committed → visible
  • otherwise → invisible

Read consistency without locks.
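You can watch these system columns directly; they are hidden from SELECT * but selectable by name (the users table here is hypothetical):

```sql
-- xmin/xmax/ctid are hidden system columns; select them explicitly
SELECT xmin, xmax, ctid, id, name
FROM users
WHERE id = 1;

-- After an UPDATE in another session, the same logical row shows a new
-- ctid and a new xmin: the old tuple version was logically deleted
-- (its xmax set) and a fresh version inserted elsewhere on the heap.
```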

The truth about UPDATE — no "in-place update"

UPDATE users SET name = 'Alice' WHERE id = 1;

Actually:

  1. Set current TX ID into t_xmax of old tuple (logical delete)
  2. Insert new tuple, t_xmin = current TX ID
  3. All related indexes also get new pointer entries (HOT update exception)

PostgreSQL UPDATE costs as much as INSERT, causing write amplification. HOT (Heap-Only Tuple) avoids index updates when index keys don't change by placing the new version on the same page.
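Whether HOT is actually kicking in can be checked from the standard pg_stat_user_tables counters:

```sql
-- n_tup_hot_upd / n_tup_upd ≈ share of updates that avoided index writes
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
```

A low hot_pct on a heavily updated table usually means an indexed column is changing, or pages are too full for same-page versions (consider a lower fillfactor).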


2. VACUUM — The Inescapable Burden

What VACUUM does

  1. Reclaim dead tuples — table/index space
  2. Update Visibility Map — enables Index-Only Scan
  3. Update Free Space Map — tracks space for INSERT
  4. Prevent XID wraparound — mark tuples frozen
  5. Collect statistics (with ANALYZE)

VACUUM FULL vs VACUUM

  • VACUUM — marks space reusable, table size unchanged
  • VACUUM FULL — rewrites the table, shrinks size, takes AccessExclusiveLock (outage-level)

In production, use pg_repack extension for online repack.

XID Wraparound — the 32-bit curse

Transaction IDs are 32-bit (~4.2 billion). Exhausting them makes past/future indistinguishable and the DB goes read-only.

Defenses:

  • VACUUM marks old tuples FrozenXID → visible forever
  • autovacuum_freeze_max_age (default 200M) triggers forced vacuum freeze

Wraparound incidents were reported at large services through the 2020s; 64-bit XID discussions intensified from PG 17. PG 18 is still 32-bit.
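Wraparound distance can be monitored with age(); alert well below the 200M forced-freeze trigger:

```sql
-- How far each database's oldest unfrozen XID has drifted
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- Same check per table
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```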

autovacuum tuning — where most people fail

Defaults assume small DBs. For tables with tens of millions of rows:

autovacuum_vacuum_scale_factor = 0.02   # default 0.2 (20% of the table)
autovacuum_naptime = 15s                # default 1min → check more often
autovacuum_max_workers = 6              # scale with CPU
autovacuum_vacuum_cost_limit = 2000     # default -1 (inherits vacuum_cost_limit = 200), raise I/O budget

Dead-tuple ratio monitoring is key:

SELECT relname, n_dead_tup, n_live_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables ORDER BY dead_pct DESC NULLS LAST;

3. WAL — The Elegance of Write-Ahead Log

The WAL principle

"Log changes before modifying data files."

That one rule underpins:

  • Crash recovery
  • Replication
  • Point-in-Time Recovery (PITR)
  • Logical Decoding

WAL write flow

  1. Transaction modifies data
  2. Write WAL record to WAL buffer
  3. On COMMIT, fsync WAL to disk
  4. Data pages themselves flushed later by CHECKPOINT

On commit, data pages need not be on disk yet. Recovery works as long as WAL is durable.

CHECKPOINT

  • Periodically flushes dirty pages
  • Too frequent → I/O spike
  • Too rare → WAL grows, recovery time rises
  • Controlled by checkpoint_timeout (default 5min) and max_wal_size (default 1GB)

Replication — Physical vs Logical

Physical Replication (Streaming):

  • Sends WAL byte-identical to standby
  • Whole-cluster scope
  • Sync / async / quorum supported

Logical Replication (10+):

  • Decodes WAL into logical SQL changes
  • Table/column level selection
  • Replication across different PG versions possible
  • Essential for upgrade/migration (see "Zero-Downtime DB Migration")

Synchronous replication setup

synchronous_commit = on
synchronous_standby_names = 'FIRST 2 (replica1, replica2, replica3)'

Waits for 2 of 3 ACKs. Commits continue when one replica fails.
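Standby health for this setup is visible in pg_stat_replication (the lag columns exist since PG 10):

```sql
SELECT application_name,
       state,
       sync_state,      -- sync / async / quorum / potential
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```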


4. Query Planner — "Why did the same query suddenly get slow?"

Parse → Rewrite → Plan → Execute

  1. Parser: SQL → AST
  2. Rewriter: expand views, apply rules
  3. Planner: pick minimum-cost plan ← the core
  4. Executor: run it

Cost model

Cost units per operation:

  • seq_page_cost = 1.0 (sequential)
  • random_page_cost = 4.0 (random)
  • cpu_tuple_cost = 0.01
  • cpu_index_tuple_cost = 0.005
  • cpu_operator_cost = 0.0025

On SSD, lowering random_page_cost to 1.1 is almost always a win.
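The seq-scan cost is simple arithmetic over these constants, so you can approximate the planner's estimate by hand (orders is a hypothetical table; the real formula adds per-row operator costs on top):

```sql
-- cost(Seq Scan) ≈ relpages * seq_page_cost + reltuples * cpu_tuple_cost
SELECT relpages,
       reltuples,
       relpages * current_setting('seq_page_cost')::float8
         + reltuples * current_setting('cpu_tuple_cost')::float8 AS est_seqscan_cost
FROM pg_class
WHERE relname = 'orders';
```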

Statistics are everything

Planner estimates cardinalities from pg_statistic:

  • n_distinct — distinct count estimate
  • most_common_vals — top frequencies
  • histogram_bounds — distribution

Without ANALYZE, stats go stale and the Planner misfires.
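The per-column numbers the planner works from are exposed through the pg_stats view (table name here is illustrative):

```sql
SELECT attname, n_distinct, null_frac,
       most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders';

-- Refresh them when they drift
ANALYZE orders;
```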

Reading EXPLAIN ANALYZE

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM orders WHERE user_id = 123 AND status = 'paid';

Index Scan using idx_user_status on orders
  (cost=0.43..125.67 rows=12 width=80)
  (actual time=0.045..0.312 rows=15 loops=1)
  Index Cond: ((user_id = 123) AND (status = 'paid'::text))
  Buffers: shared hit=8 read=0

Read in order:

  1. Node type: Index Scan, Seq Scan, Hash Join, Merge Join, Nested Loop
  2. cost: estimated start-end
  3. rows: estimated vs actual — 10x gap means stats issue
  4. Buffers: hit (cache) / read (disk) — cache efficiency
  5. loops: Nested Loop inner iterations

Common problem patterns

Symptom                                    Cause                      Fix
Seq Scan over many rows                    stale stats / no index     ANALYZE / CREATE INDEX
Row estimate 1, actual tens of thousands   correlated WHERE columns   CREATE STATISTICS
Nested Loop with a large inner side        planner misjudgment        verify with SET enable_nestloop = off
Sort spills to disk                        low work_mem               raise work_mem
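The correlated-WHERE case is exactly what extended statistics (PG 10+) fix; the table and column names below are illustrative:

```sql
-- Tell the planner that city and zip_code are correlated,
-- so it stops multiplying their selectivities independently
CREATE STATISTICS addr_stats (dependencies, ndistinct)
  ON city, zip_code FROM addresses;

ANALYZE addresses;   -- extended stats are populated by ANALYZE
```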

5. Index — Master the 6 Types

B-Tree — the default and the king

  • General purpose: =, <, >, BETWEEN, ORDER BY
  • Multi-column index usable from the leftmost key
  • Since PG 13, deduplication cuts size by 30%+

Hash

  • Supports = only, no range
  • Crash-safe only from PG 10 (WAL logging added)
  • Rarely useful (B-Tree is almost always better)

GiST — Generalized Search Tree

  • Geometry (PostGIS), range types, full text
  • Pluggable — write your own index for your type
  • e.g. CREATE INDEX ON events USING GIST (during); — where during is a tsrange column

GIN — Generalized Inverted Index

  • Optimal for multi-valued (arrays, JSONB, tsvector)
  • Inverted index: each "token" → documents it appears in
  • Default for JSONB indexing
  • Costly writes, mitigated with fastupdate

BRIN — Block Range Index

  • Designed for physically sorted huge tables (time series, logs)
  • Stores only min/max per block range → extremely small
  • BRIN on 10TB time series can be under 100MB
  • Requires data to be insert-ordered
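A minimal BRIN setup for an append-only log table (table and column names assumed):

```sql
-- One summary entry per 128 heap pages by default; tune pages_per_range
CREATE INDEX ON logs USING brin (created_at)
  WITH (pages_per_range = 64);

-- BRIN only helps if physical row order tracks created_at;
-- out-of-order backfills degrade it. New ranges can be summarized
-- outside VACUUM with brin_summarize_new_values('index_name').
```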

HNSW (pgvector) — the vector search standard

  • Not core, via pgvector extension (0.5.0+)
  • Hierarchical Navigable Small World graph
  • ANN (Approximate Nearest Neighbor) search
  • De facto RAG standard since 2024
CREATE EXTENSION vector;
CREATE TABLE items (id bigint, embedding vector(1536));
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
SELECT id FROM items ORDER BY embedding <=> '[...]'::vector LIMIT 10;

Index choice guide

Workload                        Choice
General OLTP lookup             B-Tree
JSONB field search              GIN (jsonb_path_ops)
Full-text search (FTS)          GIN (tsvector)
Geographic coords               GiST (PostGIS)
Huge append-only time series    BRIN
AI embedding similarity         HNSW (pgvector)
Range types (tsrange)           GiST

Partial & Expression Index

-- Conditional index
CREATE INDEX ON orders (user_id) WHERE status = 'paid';

-- Expression index
CREATE INDEX ON users (lower(email));

-- Covering index (PG 11+)
CREATE INDEX ON orders (user_id) INCLUDE (total, created_at);

Partial indexes dramatically shrink index size.


6. Partitioning — Sharding in a Relational DB

Declarative Partitioning — from PG 10

CREATE TABLE events (
  id bigserial,
  user_id bigint,
  occurred_at timestamp
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2026_04 PARTITION OF events
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

Partitioning strategies

  • RANGE — time, ranges (most common)
  • LIST — categories (country, region)
  • HASH — even distribution (sharding)

Partition pruning

WHERE referencing the partition key → scan only relevant partitions.

WHERE occurred_at >= '2026-04-15'
  → scan only events_2026_04

Automated partition management — pg_partman

Creating partitions monthly by hand doesn't scale. pg_partman auto-creates and drops them.
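A typical pg_partman setup looks roughly like this (the create_parent signature shown matches pg_partman 5.x; older versions take extra arguments such as a partition type):

```sql
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- Manage events by month; pre-create 4 future partitions
SELECT partman.create_parent(
  p_parent_table => 'public.events',
  p_control      => 'occurred_at',
  p_interval     => '1 month',
  p_premake      => 4
);

-- Run periodically (cron / pg_cron) to create new and drop expired partitions
SELECT partman.run_maintenance();
```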

Citus — horizontal sharding for Postgres

Acquired by Microsoft in 2019. Distributed tables across multiple Postgres nodes:

  • Auto distribution by shard key
  • Distributed query execution
  • Parallel INSERT/SELECT

Citus is the engine behind Azure Cosmos DB for PostgreSQL (launched in 2022). Self-hosting is also possible.


7. Connection Management — Why Postgres Connections Are Expensive

Process-based model

PostgreSQL uses one process per connection (not threads). Pros:

  • Isolation — one crash doesn't affect others
  • Stability — decades proven

Cons:

  • Memory — 10-20MB per connection
  • Connect cost — milliseconds
  • 1000+ connections overload

pgBouncer — the connection pool standard

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
default_pool_size = 20
max_client_conn = 1000

pool_mode:

  • session — client conn = server conn (1:1, pool pointless)
  • transaction — held only during a transaction (most common)
  • statement — per query (very limited)

Transaction-mode caveats

  • Prepared statement incompatibility — mitigated since pgBouncer 1.21 (protocol-level support via max_prepared_statements)
  • Only SET LOCAL is safe; plain SET leaks state across clients
  • LISTEN/NOTIFY unavailable
  • Driver tuning needed (e.g. PreparedStatementCacheSize=0, or prepareThreshold=0 on JDBC — the exact parameter depends on the driver)

PgCat & Supavisor

  • PgCat — Rust pgBouncer alternative, sharding support
  • Supavisor — Supabase's Elixir-based, millions of connections

Rules of thumb

  • Pool size ~ core_count * 2 + spindle_count (PostgreSQL wiki)
  • 20-50 is typical sweet spot
  • Raising more only increases latency, not throughput

8. JSONB — The Truth About "Postgres Does Everything"

JSON vs JSONB

Property         JSON            JSONB
Storage          raw text        parsed binary
Size             smaller         slightly larger
Insert speed     faster          slightly slower
Query speed      slower          faster
Indexing         partial         GIN supported
Key order        preserved       not preserved
Duplicate keys   allowed         last wins

Almost always JSONB.

GIN index strategy

-- General (large, flexible)
CREATE INDEX ON docs USING GIN (data);

-- Path operator (`@>`) only (smaller)
CREATE INDEX ON docs USING GIN (data jsonb_path_ops);

jsonb_path_ops is ~30% smaller and enough for @> queries. Other operators (?, ?|, ?&) need the general form.
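Concretely, what jsonb_path_ops does and does not cover (the docs table and its data are hypothetical):

```sql
-- Served by GIN (data jsonb_path_ops): containment
SELECT * FROM docs
WHERE data @> '{"status": "active", "plan": "pro"}';

-- NOT served by jsonb_path_ops: key-existence operators
-- need the default GIN opclass on data
SELECT * FROM docs
WHERE data ? 'status';
```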

MongoDB replacement?

  • Schema flexibility: JSONB comes close
  • Performance: small docs similar, giant docs favor Mongo
  • Aggregation: Postgres SQL far more powerful
  • Transactions: Postgres wins (Mongo 4.0+ caught up but with limits)
  • Write scale: Mongo sharding more mature (Citus exists)

Verdict: unless scale is extreme, Postgres handles "document + relational + analytics" in one DB.


9. Postgres in the AI Era — pgvector & pg_duckdb

pgvector (exploded post-2023)

  • 2021 by Andrew Kane
  • 2023 HNSW addition made it production-ready
  • 2024 pgvectorscale (Timescale) improved perf 10x
  • 2025 bundled by default on Supabase, Neon, RDS

DiskANN & Binary Quantization

  • DiskANN index (pgvectorscale) — search vector sets larger than memory
  • Binary Quantization — float32 → 1bit, 32x memory reduction

pg_duckdb (late 2024)

Embeds DuckDB in PostgreSQL. Push analytics queries to DuckDB.

SELECT duckdb.query('
  SELECT date_trunc(''hour'', ts), count(*)
  FROM read_parquet(''s3://logs/*.parquet'')
  GROUP BY 1
');

OLTP with Postgres, OLAP with DuckDB — in one DB. Dark horse of the 2025 HTAP war.

AI-native PG ecosystem

  • Neon — serverless, branching, copy-on-write
  • Supabase — Postgres + pgvector + Realtime + Auth bundle
  • Timescale — time-series + pgvector + AI-optimized
  • MotherDuck + DuckDB — OLAP offload

10. PostgreSQL 18 (2025) New Features

AIO — Asynchronous I/O

  • Sync I/O until now
  • From 18, io_uring (Linux) support → 30-50% faster sequential scan
  • Improved auto read-ahead

Direct I/O

  • Bypass OS page cache, PG manages directly (grow shared_buffers)
  • debug_io_direct = data option (still flagged as a developer setting)

Native UUIDv7

  • UUIDv4 is random → index fragmentation
  • UUIDv7 is time-sorted → B-Tree friendly
  • uuidv7() function built in
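Assuming PG 18's built-in function (on earlier versions an extension such as pg_uuidv7 provides the equivalent):

```sql
CREATE TABLE orders (
  id uuid PRIMARY KEY DEFAULT uuidv7(),
  created_at timestamptz DEFAULT now()
);

-- IDs are time-ordered, so new rows append to the right edge of the
-- B-Tree instead of splitting random pages (the UUIDv4 problem).
```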

Logical replication improvements

  • DDL replication (even CREATE TABLE)
  • Foundations for bidirectional (BDR-like)

Skip Scan

  • Use multi-column index even without leading column predicate
  • "Loose Index Scan" — Oracle/MySQL already had this

11. Monitoring — Must-Watch Metrics

pg_stat_statements

Cumulative stats per query. The most important extension.

SELECT query, calls, total_exec_time/1000 AS total_sec,
       mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;

Key metrics

Metric              Source                        Alert threshold
Cache hit ratio     pg_stat_database              below 99%
Dead tuple %        pg_stat_user_tables           above 20% per table
Replication lag     pg_stat_replication           above 1MB
Long-running TX     pg_stat_activity              over 5 min
Lock waits          pg_locks + pg_stat_activity   over 30s
Connections         pg_stat_activity count        over 80% of max_connections
WAL generation      pg_stat_wal                   vs baseline
Autovacuum lag      n_dead_tup + last_vacuum      no vacuum in 24h+

The curse of long-running transactions

Long-open TX:

  • Blocks VACUUM (tuples visible to that TX can't be cleaned)
  • Dead tuples explode → table bloat
  • Defend with idle_in_transaction_session_timeout
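Both the detection query and the defense are one-liners (the thresholds are examples, not recommendations):

```sql
-- Find transactions open longer than 5 minutes
SELECT pid, state, xact_start,
       now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY duration DESC;

-- Terminate sessions that sit idle inside an open transaction
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
```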

12. Top 10 Anti-Patterns

  1. ORM N+1 queries — detect with query logging / EXPLAIN, fix with JOINs or eager loading
  2. Idle after BEGIN — long idle-in-transaction sessions block VACUUM
  3. Many short-lived connections — connecting directly instead of through pgBouncer
  4. Not running ANALYZE — stale stats, Planner misfires
  5. VACUUM FULL in prod — AccessExclusive lock, outage
  6. Indexing every column — write throughput collapses under index maintenance cost
  7. Prepared statements + pgBouncer transaction mode — incompatible
  8. UUIDv4 PK abuse — index fragmentation (→ UUIDv7)
  9. SELECT * — lose planner optimization chances
  10. Full UPDATE on huge JSONB — TOAST rewrite explosion

13. Checklist for Using Postgres Wisely

  • Tune autovacuum — scale_factor, cost_limit
  • Keep pg_stat_statements always on
  • Verify ANALYZE cadence, adjust stats target
  • Deploy pgBouncer (transaction mode)
  • Lower random_page_cost for SSD
  • Raise work_mem carefully (allocated per sort/hash operation, per connection)
  • Set shared_buffers to 25% of RAM
  • Monitor replication lag — watch slot buildup
  • Set long-running TX timeout
  • Alert on pg_stat_user_tables dead_pct
  • WAL archiving — enable PITR
  • Save EXPLAIN ANALYZE baselines of key queries

Closing — The Real Meaning of "Postgres Does Everything"

The "Postgres everything" meme is real. OLTP, OLAP (pg_duckdb), vector search (pgvector), time series (Timescale), geospatial (PostGIS), graph (Apache AGE), full text (GIN), even message queues (pg_later).

But this doesn't mean "dump every workload into Postgres." It means "one DB covers most work, cutting operational complexity." Extreme scale (tens of TB OLAP, 1M events/s) still needs specialized systems.

Using Postgres means understanding the MVCC tuple model, the cost of VACUUM, the role of WAL, the planner's dependence on statistics. Debugging "why is it slow?" without these is cave exploration without a map.


If Postgres is "king of the structured world," Elasticsearch/OpenSearch is "king of unstructured text." Next post:

  • The essence of Inverted Index — why Lucene beats GIN
  • Lucene internals — Segment, Commit, Merge policies
  • BM25 vs TF-IDF — the math of scoring
  • Sharding & Replication — primary/replica, routing
  • Ingest pipelines — Logstash, Beats, OpenTelemetry → ES
  • The Query DSL forest — bool, match, term, function_score
  • Vector Search in ES — kNN, HNSW, hybrid retrieval
  • The 2021 license fight — Elastic vs AWS, OpenSearch fork
  • Operational pain — Heap, GC, split brain, circuit breakers
  • Hybrid Search & RAG — BM25 + Vector = the answer

A sweep through the history and future of search.


"The first rule of Postgres: never fight the Planner. Give it good statistics, and it will give you good plans. The second rule: know where the bodies are buried — MVCC, VACUUM, WAL. Everything else is commentary." — Bruce Momjian (PostgreSQL Global Development Group)
