PostgreSQL Performance Tuning Practical Guide — Query Optimization, Index Strategy, EXPLAIN Analysis

PostgreSQL Performance Tuning

Introduction

"The query is slow" — every backend developer hears this at least once. PostgreSQL is a powerful RDBMS, but without proper tuning, performance degrades rapidly as data grows. This article covers PostgreSQL performance tuning techniques you can apply immediately in production, step by step from EXPLAIN analysis to index strategies and query refactoring.


1. EXPLAIN — Reading Query Execution Plans

EXPLAIN vs EXPLAIN ANALYZE

-- Estimated execution plan (does NOT actually execute)
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';

-- Actual execution + measured time included (Warning: actually executes the query)
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';

-- Most detailed analysis (includes buffer usage)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM users WHERE email = 'test@example.com';
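Because EXPLAIN ANALYZE really executes the statement, a common safeguard when profiling write queries is to wrap them in a transaction and roll back. A sketch (the table and column names are illustrative):

```sql
BEGIN;

-- The UPDATE actually runs and timing data is captured...
EXPLAIN ANALYZE
UPDATE users SET last_seen = now() WHERE id = 123;

-- ...but its effects are discarded
ROLLBACK;
```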

How to Read EXPLAIN Output

EXPLAIN ANALYZE
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2026-01-01';
Hash Join  (cost=1250.00..5680.50 rows=15000 width=48)
           (actual time=12.5..89.3 rows=14800 loops=1)
  Hash Cond: (o.user_id = u.id)
  -> Seq Scan on orders o  (cost=0.00..3500.00 rows=15000 width=16)
                           (actual time=0.02..45.1 rows=14800 loops=1)
       Filter: (created_at > '2026-01-01')
       Rows Removed by Filter: 85200
  -> Hash  (cost=1000.00..1000.00 rows=50000 width=36)
           (actual time=12.3..12.3 rows=50000 loops=1)
       -> Seq Scan on users u  (cost=0.00..1000.00 rows=50000 width=36)
                               (actual time=0.01..6.8 rows=50000 loops=1)
Planning Time: 0.25 ms
Execution Time: 95.8 ms

Key Metrics Interpretation

| Metric | Meaning | Notes |
| --- | --- | --- |
| cost | Estimated cost (startup..total) | Relative units, not absolute values |
| rows | Estimated row count | If far from actual, update statistics |
| actual time | Actual elapsed time (ms) | startup..total |
| loops | Number of iterations | actual time x loops = real total time |
| Rows Removed by Filter | Rows discarded by the filter | A large value suggests a missing index |
| Buffers: shared hit/read | Cache hits / disk reads | Many reads suggest insufficient memory |

Warning Signs

# Seq Scan on large table -> index needed
Seq Scan on orders  (rows=1000000)

# Nested Loop with large rows -> change join strategy
Nested Loop  (actual loops=50000)

# Sort with external disk usage -> insufficient work_mem
Sort Method: external merge  Disk: 128000kB

# Estimated rows vs actual rows mismatch -> ANALYZE needed
(rows=100) ... (actual rows=50000)

2. Index Strategy — Designing the Right Indexes

B-Tree Index (Default)

-- Single column index
CREATE INDEX idx_users_email ON users(email);

-- Composite index (order matters!)
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC);

-- Partial index (conditional index)
CREATE INDEX idx_orders_active
ON orders(user_id)
WHERE status = 'active';

-- Unique index
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

Column Order in Composite Indexes

Column order in composite indexes has a decisive impact on performance:

-- Index: (user_id, created_at, status)

-- Can use index (matches from left)
WHERE user_id = 1
WHERE user_id = 1 AND created_at > '2026-01-01'
WHERE user_id = 1 AND created_at > '2026-01-01' AND status = 'active'

-- Cannot fully use the index (middle column skipped)
WHERE user_id = 1 AND status = 'active'  -- uses the user_id prefix only; status becomes a filter
WHERE created_at > '2026-01-01'          -- missing leading column user_id
WHERE status = 'active'                  -- missing leading column

Principle: Place equality (=) condition columns first, and range (greater than, less than, BETWEEN) conditions last.
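As a sketch of this principle (the index and column names are illustrative), consider a query filtering by equality on status and a range on created_at:

```sql
-- Query: WHERE status = 'active' AND created_at > '2026-01-01'

-- Good: equality column first, range column last;
-- the scan reads one contiguous slice of the index
CREATE INDEX idx_orders_status_created ON orders(status, created_at);

-- Less effective: the range column comes first, so matching
-- status values are scattered across the created_at range
CREATE INDEX idx_orders_created_status ON orders(created_at, status);
```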

GIN Index (Full-Text Search, JSONB, Arrays)

-- JSONB field index
CREATE INDEX idx_products_metadata
ON products USING GIN(metadata);

-- JSONB search is now fast
SELECT * FROM products
WHERE metadata @> '{"category": "electronics"}';

-- Array index
CREATE INDEX idx_posts_tags
ON posts USING GIN(tags);

SELECT * FROM posts WHERE tags @> ARRAY['postgresql'];
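When only containment (@>) queries are needed on a JSONB column, the jsonb_path_ops operator class typically yields a smaller, faster GIN index than the default, at the cost of not supporting key-existence operators like ?. The index name is illustrative:

```sql
-- Supports @> only, but is usually smaller than the default jsonb_ops
CREATE INDEX idx_products_metadata_path
ON products USING GIN (metadata jsonb_path_ops);
```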

Cases Where Indexes Are Not Used

-- Index ignored when function is applied
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';
-- Create a functional index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

-- Type mismatch
SELECT * FROM users WHERE id = '123';  -- id is integer but comparing with string
-- Correct type
SELECT * FROM users WHERE id = 123;

-- Leading wildcard in LIKE
SELECT * FROM users WHERE name LIKE '%kim%';  -- Full Scan
-- Fixed prefix
SELECT * FROM users WHERE name LIKE 'kim%';  -- Index can be used

-- OR across different columns (often prevents a plain index scan,
-- though the planner can sometimes combine two indexes with a BitmapOr)
SELECT * FROM users WHERE email = 'a@b.com' OR name = 'Kim';
-- Split with UNION (UNION, not UNION ALL, so a row matching
-- both conditions is not returned twice)
SELECT * FROM users WHERE email = 'a@b.com'
UNION
SELECT * FROM users WHERE name = 'Kim';
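For the leading-wildcard LIKE case above, the pg_trgm extension offers an alternative: a trigram GIN index can serve LIKE '%kim%' style searches. The index name is illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_users_name_trgm
ON users USING GIN (name gin_trgm_ops);

-- This can now use the trigram index instead of a full scan
SELECT * FROM users WHERE name LIKE '%kim%';
```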

Monitoring Index Usage

-- Find unused indexes
-- (pg_stat_user_indexes exposes relname/indexrelname, not tablename/indexname)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey%'
ORDER BY pg_relation_size(indexrelid) DESC;

-- Index size per table
SELECT tablename,
       pg_size_pretty(pg_total_relation_size(tablename::regclass)) as total_size,
       pg_size_pretty(pg_indexes_size(tablename::regclass)) as index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(tablename::regclass) DESC;

3. Query Optimization Patterns

Eliminating N+1 Queries

-- N+1 pattern (common with ORMs)
-- Query 1: SELECT * FROM users LIMIT 100;
-- Query 2-101: SELECT * FROM orders WHERE user_id = ?;

-- Use JOIN to fetch at once
SELECT u.name, o.total, o.created_at
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2026-01-01'
LIMIT 100;

-- Or use subquery + IN
SELECT * FROM orders
WHERE user_id IN (
    SELECT id FROM users WHERE created_at > '2026-01-01'
);
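When you need only the top N child rows per parent rather than all of them, a LATERAL join is a useful sketch (the column names are illustrative):

```sql
-- Latest 3 orders per user, in a single query
SELECT u.name, recent.total, recent.created_at
FROM users u
CROSS JOIN LATERAL (
    SELECT total, created_at
    FROM orders o
    WHERE o.user_id = u.id
    ORDER BY created_at DESC
    LIMIT 3
) recent;
```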

Pagination Optimization

-- OFFSET approach (slower as pages get deeper)
SELECT * FROM orders ORDER BY id DESC OFFSET 100000 LIMIT 20;

-- Keyset Pagination (cursor-based)
SELECT * FROM orders
WHERE id < 900000  -- last id of the previous page
ORDER BY id DESC
LIMIT 20;

-- Covering index + subquery
SELECT o.* FROM orders o
JOIN (
    SELECT id FROM orders ORDER BY id DESC OFFSET 100000 LIMIT 20
) sub ON o.id = sub.id;
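When the sort key is not unique (e.g., created_at), keyset pagination needs a tiebreaker column, and a row-value comparison handles the cursor cleanly. The cursor values below are illustrative:

```sql
-- Cursor = (created_at, id) of the last row on the previous page
SELECT * FROM orders
WHERE (created_at, id) < ('2026-01-15 10:00:00', 900000)
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Backed by a matching index:
-- CREATE INDEX idx_orders_created_id ON orders(created_at DESC, id DESC);
```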

EXISTS vs IN

-- EXISTS is often more efficient for large datasets
-- (correlated subquery that stops at the first matching row)
-- EXISTS
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id AND o.total > 10000
);

-- IN (when subquery result is large)
SELECT * FROM users
WHERE id IN (
    SELECT user_id FROM orders WHERE total > 10000
);
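A related pitfall: NOT IN behaves surprisingly when the subquery can return NULL (the whole predicate becomes unknown and no rows match). NOT EXISTS avoids this:

```sql
-- Risky: returns zero rows if any orders.user_id is NULL
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM orders);

-- Safe: NULLs in the subquery do not affect the result
SELECT * FROM users u
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);
```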

COUNT Optimization

-- Exact COUNT (scans every row, via the table or an index)
SELECT COUNT(*) FROM orders;

-- Approximate value when sufficient
SELECT reltuples::bigint AS estimate
FROM pg_class WHERE relname = 'orders';

-- Conditional COUNT optimization
-- Index Only Scan possible with an index
SELECT COUNT(*) FROM orders WHERE status = 'completed';
-- Required: CREATE INDEX idx_orders_status ON orders(status);
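When several conditional counts are needed at once, the FILTER clause computes them in a single pass instead of issuing multiple COUNT queries (the status values are illustrative):

```sql
SELECT COUNT(*) AS total,
       COUNT(*) FILTER (WHERE status = 'completed') AS completed,
       COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled
FROM orders;
```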

CTE vs Subquery

-- In PostgreSQL 12+, CTEs are inlined by default
-- MATERIALIZED hint forces materialization

-- Auto-inlined (optimizer optimizes)
WITH active_users AS (
    SELECT id, name FROM users WHERE status = 'active'
)
SELECT * FROM active_users WHERE name LIKE 'K%';

-- Forced materialization (prevents redundant execution)
WITH active_users AS MATERIALIZED (
    SELECT id, name FROM users WHERE status = 'active'
)
SELECT * FROM active_users au
JOIN orders o ON au.id = o.user_id;
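The opposite hint also exists: NOT MATERIALIZED forces inlining even in cases where the planner would otherwise materialize the CTE (for example, when it is referenced more than once):

```sql
WITH active_users AS NOT MATERIALIZED (
    SELECT id, name FROM users WHERE status = 'active'
)
SELECT * FROM active_users WHERE name LIKE 'K%';
```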

4. Configuration Tuning — postgresql.conf

Memory Settings

# shared_buffers: 25% of total RAM (most important!)
# Based on 16GB RAM
shared_buffers = 4GB

# effective_cache_size: Expected memory including OS cache (75% of RAM)
effective_cache_size = 12GB

# work_mem: memory per sort/hash operation (a single query can use several at once)
# Caution: worst-case usage is roughly connections x operations-per-query x work_mem
work_mem = 256MB

# maintenance_work_mem: Used for VACUUM, CREATE INDEX
maintenance_work_mem = 1GB
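work_mem can also be raised for a single session or transaction, which is often safer than a high global value; for example, before one heavy reporting query:

```sql
-- Applies only to the current session
SET work_mem = '512MB';

-- Or only to the current transaction
BEGIN;
SET LOCAL work_mem = '512MB';
-- ... run the heavy query here ...
COMMIT;
```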

WAL and Checkpoints

# WAL size (write-heavy workloads)
wal_buffers = 64MB
max_wal_size = 4GB
min_wal_size = 1GB

# Checkpoint interval
checkpoint_completion_target = 0.9

Query Planner

# Enable parallel queries
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

# Random I/O cost (lower for SSD)
random_page_cost = 1.1  # HDD: 4.0, SSD: 1.1

# Statistics collection precision
default_statistics_target = 200  # Default: 100
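These parameters can also be changed without editing postgresql.conf by hand: ALTER SYSTEM writes to postgresql.auto.conf, and planner settings like this one take effect after a reload, no restart needed:

```sql
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Verify the effective value
SHOW random_page_cost;
```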

5. VACUUM and Statistics Management

Why VACUUM Is Needed

PostgreSQL uses MVCC (Multi-Version Concurrency Control). When rows are UPDATEd or DELETEd, existing rows are not immediately removed but left as "dead tuples." VACUUM cleans them up.

-- Check dead tuples per table
SELECT schemaname, relname,
       n_live_tup, n_dead_tup,
       ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) as dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Manual VACUUM (normally handled by autovacuum)
VACUUM ANALYZE orders;

-- VACUUM FULL (caution: table lock! — reclaims disk space)
VACUUM FULL orders;  -- Do NOT use in production!

Autovacuum Tuning

# Autovacuum default settings
autovacuum = on
autovacuum_max_workers = 5

# Per-table settings for large tables
ALTER TABLE orders SET (
    autovacuum_vacuum_threshold = 1000,
    autovacuum_vacuum_scale_factor = 0.01,  -- Run when 1% are dead tuples
    autovacuum_analyze_threshold = 500,
    autovacuum_analyze_scale_factor = 0.005
);
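To confirm autovacuum is actually keeping up, pg_stat_user_tables records when each table was last vacuumed and analyzed, both manually and automatically:

```sql
-- Tables that autovacuum has never touched sort first
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST;
```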

Updating Statistics

-- When EXPLAIN estimated rows differ significantly from actual
ANALYZE orders;

-- Update statistics for entire DB
ANALYZE;

-- Increase statistics precision for a specific column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

6. Practical Troubleshooting — Finding Slow Queries

Discovering Slow Queries with pg_stat_statements

-- Enable the extension
-- (also requires shared_preload_libraries = 'pg_stat_statements' in
--  postgresql.conf and a server restart before the view returns data)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 slowest queries
SELECT query,
       calls,
       ROUND(total_exec_time::numeric, 2) as total_time_ms,
       ROUND(mean_exec_time::numeric, 2) as avg_time_ms,
       rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Queries consuming the most total execution time
SELECT query, calls,
       ROUND(total_exec_time::numeric, 2) as total_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
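The statistics accumulate indefinitely, so it helps to reset them after deploying a fix and compare fresh numbers:

```sql
-- Clears all collected statement statistics
SELECT pg_stat_statements_reset();
```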

Checking Currently Running Queries

-- Long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state
FROM pg_stat_activity
WHERE state != 'idle'
  AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY duration DESC;

-- Queries waiting for locks
-- (pg_blocking_pids(), available since PostgreSQL 9.6, also catches row-level
--  blocking via transaction-id locks, which a join on pg_locks.relation misses)
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
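Once a problematic pid is identified, it can be stopped: pg_cancel_backend cancels the current query, while pg_terminate_backend kills the whole connection. The pid below is illustrative:

```sql
-- Gently cancel the running query
SELECT pg_cancel_backend(12345);

-- Forcefully terminate the backend if cancelling is not enough
SELECT pg_terminate_backend(12345);
```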

7. Performance Checklist

Items to verify before going to production:

  • Verify execution plans for key queries with EXPLAIN ANALYZE
  • Check that Seq Scan is not occurring on large tables
  • Verify composite index column order matches query patterns
  • Clean up unused indexes (prevent write performance degradation)
  • Set up slow query monitoring with pg_stat_statements
  • Confirm autovacuum is working properly
  • Review memory settings like shared_buffers, work_mem
  • Check connection pooling (PgBouncer, etc.) is in place
  • Keep statistics up to date with regular ANALYZE runs

Quiz

Q1: What is the difference between EXPLAIN and EXPLAIN ANALYZE? EXPLAIN only shows the estimated execution plan, while EXPLAIN ANALYZE actually executes the query and shows measured times and row counts. Since ANALYZE actually executes, caution is needed with INSERT/UPDATE/DELETE.

Q2: Why does a query with WHERE a = 1 AND c = 3 on a composite index (a, b, c) fail to fully utilize the index?

B-Tree composite indexes match from left to right in order. If you skip the middle column (b), the index for column c cannot be used. Only column a uses the index.

Q3: Why is OFFSET-based pagination slow on deep pages? OFFSET N reads and discards N rows. With OFFSET 100000, it reads 100,000 rows, discards them, and returns from the next row onward, so deeper pages require reading more rows.

Q4: Why should random_page_cost be lowered to 1.1 on SSDs? SSDs have almost no difference between random I/O and sequential I/O performance. The default value of 4.0 is based on HDDs, so lowering it to 1.1 on SSDs encourages the planner to choose index scans more aggressively.

Q5: Why do dead tuples occur in PostgreSQL? Due to the MVCC architecture. When UPDATE/DELETE occurs, existing rows are not immediately deleted but left as "dead tuples" so other transactions can still reference them. VACUUM cleans them up.

Q6: What is the solution when WHERE LOWER(email) = 'test@example.com' cannot use an index?

Create a functional index: CREATE INDEX idx_users_email_lower ON users(LOWER(email));

Q7: What is the recommended setting for shared_buffers? Approximately 25% of total system RAM. Example: 16GB RAM -> shared_buffers = 4GB.

Q8: What does a very large "Rows Removed by Filter" value in EXPLAIN output mean?

It means many rows are being read from the table and then most are discarded by the filter. Adding an appropriate index can reduce unnecessary row reads.

Q9: Why should VACUUM FULL not be used in production? VACUUM FULL acquires an ACCESS EXCLUSIVE lock on the table, blocking all reads and writes to that table until the operation completes.

Q10: Between EXISTS and IN, which is more efficient for large subqueries? Generally, EXISTS is more efficient. EXISTS stops immediately when it finds the first matching row, while IN may need to generate the entire subquery result first.