PostgreSQL Production Operations Complete Guide 2025: Extensions, Tuning, Monitoring, HA

TL;DR

  • Extensions are PostgreSQL's true power: pg_stat_statements (query analysis), pgvector (AI), PostGIS (geo), TimescaleDB (time-series), pg_partman (partitioning)
  • Three key tuning parameters: shared_buffers (25% RAM), work_mem (sort/hash memory), effective_cache_size (50-75% RAM)
  • Autovacuum is critical: misconfiguration leads to table bloat → disk growth → slow queries. Tuning autovacuum_vacuum_scale_factor is essential
  • Patroni: De facto standard for PostgreSQL HA. Auto failover within 30 seconds
  • Backup: pgBackRest far outpaces pg_dump on large databases (parallel, incremental backups). WAL archiving + PITR is essential

1. Why PostgreSQL Remains Loved in 2025

1.1 Five Key Reasons

  1. Extensibility — hundreds of extensions: vector, time-series, geo, and more
  2. JSONB performance — competitive with MongoDB for JSON workloads
  3. MVCC — readers don't block writers (excellent concurrency)
  4. Standards compliant — among the most faithful ANSI SQL implementations
  5. Open source — zero license costs vs Oracle/SQL Server

1.2 Why PostgreSQL Is Standard Even in NewSQL Era

CockroachDB, YugabyteDB, and Spanner have emerged, but most of them speak the PostgreSQL wire protocol for compatibility. Far from being displaced, PostgreSQL has become the interface standard of the NewSQL era.


2. Essential Extensions Top 10

2.1 Operations Must-Have: pg_stat_statements

-- postgresql.conf (requires a server restart)
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000

-- then, in each database to be tracked
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Usage:

-- TOP 10 slowest queries
SELECT
  substring(query, 1, 80) as query,
  calls,
  total_exec_time::int as total_ms,
  mean_exec_time::int as avg_ms,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

In practice this one query surfaces the bulk of performance problems: you immediately see which statements consume the most total execution time.
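
The counters accumulate from the last reset, so a query that was slow last month can dominate the view forever. To measure a specific window (e.g. before and after a load test), reset the statistics first:

-- Reset pg_stat_statements counters to start a fresh measurement window
SELECT pg_stat_statements_reset();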

2.2 AI Era Essential: pgvector

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(768)
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

SELECT id, content, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

Why pgvector instead of dedicated Vector DB?

  • Leverage existing PostgreSQL ops knowledge
  • ACID transactions (vector + metadata consistency)
  • Cost savings (no separate infrastructure)
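
The HNSW index trades recall against speed at query time. pgvector exposes this via the hnsw.ef_search setting (default 40); raising it scans more of the graph for better recall at the cost of latency:

-- Raise HNSW search breadth for this session (pgvector default is 40)
SET hnsw.ef_search = 100;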

2.3 Time-Series: TimescaleDB

CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE metrics (
  time TIMESTAMPTZ NOT NULL,
  device_id TEXT,
  temperature FLOAT
);

SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '1 day');

SELECT add_retention_policy('metrics', INTERVAL '30 days');
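
Once data lands in the hypertable, TimescaleDB's time_bucket() is the workhorse for downsampling. A sketch against the metrics table above:

-- 5-minute average temperature per device
SELECT time_bucket('5 minutes', time) AS bucket,
       device_id,
       avg(temperature) AS avg_temp
FROM metrics
GROUP BY bucket, device_id
ORDER BY bucket;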

2.4 Other Core Extensions

  • PostGIS — geographic data and spatial queries
  • pg_partman — automatic partition management
  • pg_cron — in-database job scheduler
  • pgaudit — audit logging
  • pglogical — logical replication
  • citus — horizontal sharding
  • pg_repack — lock-free table reorganization

3. Core Performance Tuning

3.1 Memory Parameters

Parameter              Recommended (16GB RAM)   Description
shared_buffers         4GB (25% of RAM)         PostgreSQL's own buffer cache
effective_cache_size   12GB (75% of RAM)        Planner's estimate of available cache
work_mem               64MB                     Per sort/hash operation
maintenance_work_mem   1GB                      For VACUUM / CREATE INDEX
wal_buffers            16MB                     WAL buffer

Rule of thumb (worst case — each connection may allocate work_mem at least once): shared_buffers + (work_mem × max_connections) ≤ RAM × 0.5
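
Plugging the 16GB numbers into this rule of thumb shows why work_mem and max_connections must be chosen together (a worked sketch; max_connections = 100 is an assumed value):

# Budget: shared_buffers + work_mem × max_connections ≤ 8GB (half of 16GB)
# 4GB + 64MB × 100 = 10.4GB  → over budget
# 4GB + 40MB × 100 =  8.0GB  → fits
max_connections = 100
work_mem = 40MB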

3.2 Checkpoint Tuning

checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 1GB
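
Whether these values are adequate can be verified from the statistics views: timed checkpoints are the healthy case, requested ones mean WAL volume is forcing them early.

-- PostgreSQL ≤16: pg_stat_bgwriter; in 17+ use pg_stat_checkpointer (num_timed, num_requested)
SELECT checkpoints_timed, checkpoints_req
FROM pg_stat_bgwriter;
-- checkpoints_req growing faster than checkpoints_timed → raise max_wal_size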

3.3 Query Planner

random_page_cost = 1.1              # SSD (HDD is 4.0)
effective_io_concurrency = 200      # SSD/NVMe
default_statistics_target = 100

4. VACUUM and Autovacuum

4.1 Why VACUUM Is Needed

PostgreSQL's MVCC handles UPDATE as "INSERT new row + mark old row". Old rows aren't deleted immediately but remain as "dead tuples". VACUUM cleans them.

Without VACUUM:

  1. Table size keeps growing (disk explosion)
  2. Indexes point to dead tuples → slow queries
  3. Transaction ID wraparound — data corruption risk!
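
The wraparound risk in particular is worth watching directly. Transaction ID age per database can be checked with age(); autovacuum forces a freeze at autovacuum_freeze_max_age (200 million by default):

-- Databases closest to transaction ID wraparound
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;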

4.2 Autovacuum Monitoring

SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0), 2) AS dead_ratio,
  last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

A dead_ratio above 0.2 is a strong sign that autovacuum can't keep up.

4.3 Autovacuum Tuning

autovacuum = on
autovacuum_max_workers = 5
autovacuum_naptime = 30s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.02
autovacuum_vacuum_cost_limit = 2000
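
Global scale factors are often too coarse for one or two write-heavy tables. Per-table storage parameters override them (the table name here is hypothetical):

-- Vacuum this hot table at ~1% dead tuples instead of the global 5%
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_cost_limit = 4000
);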

5. Monitoring and Debugging

5.1 Real-time Activity

-- Currently running queries
SELECT pid, usename, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
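
When something hangs, the usual culprit is lock contention. pg_blocking_pids() shows, for each waiting session, which backends hold the locks it needs:

-- Sessions waiting on locks, and the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       now() - query_start AS runtime,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;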

5.2 EXPLAIN ANALYZE

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345;

Key indicators:

  • Seq Scan vs Index Scan — index usage
  • actual rows vs estimated rows — statistics accuracy
  • Buffers: shared hit/read — cache hit rate
  • Execution Time — actual time

5.3 pgBadger — Log Analysis

-- postgresql.conf: log statements slower than 1 second
log_min_duration_statement = 1000

# shell: build an HTML report with 4 parallel jobs
pgbadger -j 4 /var/log/postgresql/postgresql-*.log -o report.html

6. High Availability — Patroni

6.1 How Patroni Works

  1. Each PostgreSQL node runs Patroni daemon
  2. Patroni renews leader lock in etcd
  3. Primary fails → lock expires → Patroni elects new Primary
  4. HAProxy/PgBouncer auto-routes to new Primary
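
The steps above map directly onto Patroni's configuration. A minimal sketch (cluster scope, node name, addresses, and paths are all placeholder values):

# patroni.yml — minimal sketch; hosts and paths are illustrative
scope: pg-cluster
name: node1
etcd3:
  hosts: 10.0.0.1:2379,10.0.0.2:2379,10.0.0.3:2379
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008
bootstrap:
  dcs:
    ttl: 30            # leader-lock TTL — drives the ~30s failover window
    loop_wait: 10
    retry_timeout: 10
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432
  data_dir: /var/lib/postgresql/17/main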

6.2 Patroni Alternatives

  • Patroni — most popular; needs a DCS (etcd, Consul, ZooKeeper, or Kubernetes)
  • repmgr — from EDB; simple
  • CloudNativePG — Kubernetes-native operator (CNCF)
  • Stolon — container-friendly
  • pg_auto_failover — from Microsoft; simple

7. Backup and Recovery

7.1 pg_dump vs pgBackRest

                 pg_dump                          pgBackRest
Speed            Slower (logical dump; parallel   Fast (parallel, physical)
                 only with directory format)
Incremental      No                               Yes
Compression      gzip (lz4/zstd since PG 16)      lz4 / zstd / gzip
PITR             No                               Yes
S3 integration   No                               Yes
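
PITR only works if WAL archiving was wired up beforehand. A minimal sketch of the two config files involved (stanza name "main" matches the restore example below; paths are placeholders):

# postgresql.conf
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'

# /etc/pgbackrest/pgbackrest.conf
[main]
pg1-path=/var/lib/postgresql/17/main

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
compress-type=zst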

7.2 PITR (Point-in-Time Recovery)

pgbackrest --stanza=main \
  --type=time \
  --target="2025-04-15 14:30:00" \
  restore

This enables responses like "rewind the database to 2:30 PM on April 15".


8. Common Pitfalls

8.1 Index Bloat

-- Unused indexes
SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS unused_index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
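
Once an index is confirmed unused (check replicas too — idx_scan is tracked per node), it can be removed, and bloated but needed indexes rebuilt, without blocking writes. Index names here are hypothetical:

-- Drop a confirmed-unused index without blocking writes
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_legacy;

-- Rebuild a bloated index in place (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;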

8.2 N+1 Queries

The most common ORM pitfall. Use pg_stat_statements to spot near-identical queries called thousands of times.

8.3 Connection Explosion

PostgreSQL connection = separate process (~10MB each). Solution: PgBouncer (transaction mode recommended).
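
A minimal pgbouncer.ini sketch for transaction pooling (addresses, database name, and pool sizes are illustrative):

; pgbouncer.ini — minimal sketch
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
default_pool_size = 20
max_client_conn = 1000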


9. PostgreSQL 17 New Features (2024-09)

  • Improved incremental sort
  • Logical replication: failover slot sync
  • pg_basebackup: incremental backup support
  • VACUUM: memory-based dead tuple tracking (faster)
  • MERGE command: RETURNING clause added

Quiz

1. Why shouldn't shared_buffers be set to 50% of RAM?

Answer: PostgreSQL relies on both its own shared_buffers and the OS file-system cache. Setting shared_buffers too high (1) shrinks the OS cache, so the same pages end up cached twice, (2) makes buffer replacement less efficient, (3) slows backups and checkpoints. 25% of RAM is generally a good starting point.

2. Signs that autovacuum can't keep up?

Answer: When n_dead_tup exceeds 20% of n_live_tup in pg_stat_user_tables. Solutions: (1) increase autovacuum_max_workers, (2) decrease autovacuum_vacuum_scale_factor (0.2 → 0.05), (3) decrease autovacuum_naptime, (4) per-table settings via ALTER TABLE.

3. Why is PgBouncer needed?

Answer: Each PostgreSQL connection is a separate process (~10MB). 1,000 connections = 10GB+ memory. Process creation/destruction is also costly. PgBouncer maintains a small pool of real connections and multiplexes client connections onto it. Transaction pooling mode is most efficient, though prepared statements need care (PgBouncer supports protocol-level prepared statements in transaction mode only from version 1.21).

4. How does WAL archiving enable PITR?

Answer: WAL (Write-Ahead Log) records all data changes sequentially. WAL archiving copies these logs to external storage (S3, etc.). For recovery: (1) restore nearest base backup, (2) replay WAL up to the desired time. Result: point-in-time recovery. pgBackRest's --type=time automates this.

5. Difference between pg_repack and VACUUM FULL?

Answer: VACUUM FULL takes an ACCESS EXCLUSIVE lock for the entire rebuild, making it unusable on busy production tables. pg_repack reorganizes tables with only brief locks — it copies data to a new table while replaying concurrent changes from a log table, then swaps the two. Essential for removing bloat from large tables in production.

