Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

TL;DR

B-Tree covers 90%: fits most queries. Supports sort, range, exact match.
GIN is king for search: arrays, JSON, full-text. PostgreSQL's secret weapon.
Composite column order is decisive: WHERE a=? AND b=? differs from WHERE b=? AND a=? in usability.
Covering indexes skip table access: leverage the INCLUDE clause.
Know the 10 reasons an index is not used — statistics, functions, casts, patterns, etc.

1. Why Are Indexes Fast?

1.1 Without an index

SELECT * FROM users WHERE email = 'alice@example.com';

Without an index: Sequential Scan — read and compare every row. 100M rows = 100M comparisons.

100M rows x 100 nanoseconds = 10 seconds

1.2 With B-Tree

100M rows -> log2(10^8) ~= 27 comparisons
27 x 100 ns = 2.7 microseconds

About 4 million times faster — the magic of indexes.

1.3 The cost of an index

Not free:

Storage: indexes consume disk space
Slower writes: INSERT/UPDATE/DELETE must also update the index
Maintenance: VACUUM, REINDEX

Rule: weigh the read-to-write ratio. 100:1 favors an index; 1:100 demands care.

2. B-Tree Internals

2.1 Balanced tree

                [50]
               /    \
          [25]      [75]
         /    \    /    \
      [10]  [40] [60]  [90]

Properties:

All leaves at the same depth (balanced)
Each node stores many keys (hundreds)
Sorted order maintained

2.2 Page-based storage

PostgreSQL B-Tree page (8KB)
+----------------------------------+
| Header                           |
+----------------------------------+
| Item Pointer 1 -> key1 + ctid    |
| Item Pointer 2 -> key2 + ctid    |
+----------------------------------+
| Free Space                       |
+----------------------------------+
| Items (low to high)              |
+----------------------------------+

ctid is PostgreSQL's row location (page number, offset).

2.3 Tree depth math

Assuming 200 keys per 8KB page:

Depth 1: 200 keys
Depth 2: 40,000
Depth 3: 8,000,000
Depth 4: 1,600,000,000

100M rows = depth 4 -> only 4 disk reads per lookup.

3. Index Types (PostgreSQL)

3.1 B-Tree (default)

CREATE INDEX idx_users_email ON users(email);

Good for: exact match, range, ORDER BY, prefix LIKE 'abc%', IS NULL.

Bad for: LIKE '%abc', LIKE '%abc%', wrapped functions like LOWER(email) (need expression index).

3.2 Hash

CREATE INDEX idx_users_email_hash ON users USING hash(email);

Exact match only. Usually B-Tree is preferred; PG 10+ made hash WAL-safe, so it is occasionally useful (slightly smaller).

3.3 GIN (Generalized Inverted Index)

-- array
CREATE INDEX idx_tags ON posts USING gin(tags);

-- JSONB
CREATE INDEX idx_metadata ON products USING gin(metadata);

-- full-text
CREATE INDEX idx_content_fts ON articles USING gin(to_tsvector('english', content));

Inverted index — each key maps to a list of rows.

"redis"    -> [row1, row5, row100, ...]
"kafka"    -> [row3, row7, row50, ...]
"postgres" -> [row1, row2, row100, ...]

Good for: array containment (tags @> ARRAY['redis']), JSONB key/value, full-text, multi-value columns.

SELECT * FROM posts WHERE tags @> ARRAY['redis'];
SELECT * FROM posts WHERE tags @> ARRAY['redis', 'database'];
SELECT * FROM products WHERE metadata @> '{"category": "electronics"}';

Downside: slow build. Use fastupdate=on to defer UPDATE cost.

3.4 GiST (Generalized Search Tree)

CREATE INDEX idx_locations ON places USING gist(location);
CREATE INDEX idx_periods ON events USING gist(period);

Geography, range types, nearest-neighbor search, overlap checks.

3.5 BRIN (Block Range Index)

CREATE INDEX idx_logs_created ON logs USING brin(created_at);

Stores min/max per page range. Tiny index. Great for naturally ordered huge tables (time-series, logs). Example: 100M-row log table — BRIN is a few MB vs several GB for B-Tree.

3.6 SP-GiST

CREATE INDEX idx_ips ON connections USING spgist(ip_addr);

Non-balanced tree; IP ranges, some PostGIS types.

3.7 Comparison

Type	Exact	Range	Pattern	Array/JSON	Geo	Size	Build
B-Tree	yes	yes	prefix only	no	no	medium	fast
Hash	yes	no	no	no	no	small	fast
GIN	yes	no	full-text	yes	no	large	slow
GiST	yes	yes	no	no	yes	medium	medium
BRIN	no	yes*	no	no	no	tiny	fast
SP-GiST	yes	yes	no	no	yes	medium	medium

*BRIN works only when data is physically ordered.

4. Composite Indexes

4.1 Column order is decisive

CREATE INDEX idx_users_country_age ON users(country, age);

Usable:

WHERE country = 'KR'
WHERE country = 'KR' AND age = 30
WHERE country = 'KR' AND age > 25

Not usable:

WHERE age = 30    -- country is leading
WHERE age > 25

Rule: leftmost prefix.

4.2 Choosing column order

Priorities:

Most frequently used columns first
Higher selectivity first (filters more)
Equality before range

Example:

WHERE country = 'KR' AND age > 25 AND created_at > '2024-01-01'

Selectivity: country=30%, created_at=50%, age=70%. Optimal order puts the most selective column first:

CREATE INDEX idx_users_search ON users(country, created_at, age);

4.3 Equality + range

WHERE country = 'KR' AND age BETWEEN 25 AND 35

Best: (country, age) — equality column first.

5. Specialized Index Patterns

5.1 Partial Index

CREATE INDEX idx_active_users ON users(email)
WHERE status = 'active';

Smaller index, faster build and maintenance, better cache efficiency. Great for frequently filtered subsets.

5.2 Expression Index

CREATE INDEX idx_users_email_lower ON users(LOWER(email));

SELECT * FROM users WHERE LOWER(email) = LOWER('Alice@example.com');

Common: LOWER, UPPER, EXTRACT(year FROM ...), JSON extraction (metadata->>'category').

5.3 Covering Index (INCLUDE)

-- PostgreSQL 11+
CREATE INDEX idx_users_country_email ON users(country) INCLUDE (email, name);

Returns results without touching the table (Index Only Scan).

SELECT email, name FROM users WHERE country = 'KR';

5.4 Unique Index

CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

Prevents duplicates and enables optimizer shortcuts.

6. Mastering EXPLAIN ANALYZE

6.1 Usage

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM users WHERE country = 'KR' AND age > 25;

6.2 Reading output

Index Scan using idx_users_country_age on users
  (cost=0.43..2547.39 rows=10000 width=120)
  (actual time=0.025..15.234 rows=9847 loops=1)
  Index Cond: ((country)::text = 'KR'::text)
  Filter: (age > 25)
  Rows Removed by Filter: 153
  Buffers: shared hit=1234 read=56
Planning Time: 0.123 ms
Execution Time: 15.567 ms

cost: optimizer estimate
rows vs actual rows: estimate vs reality
Buffers: cache hit/read

6.3 Estimated vs actual

Large divergence (e.g. 1000x) indicates stale stats. Run ANALYZE and consider raising default_statistics_target.

6.4 Node types

Node	Meaning	Good?
`Seq Scan`	full scan	only for small tables
`Index Scan`	index + heap lookup	yes
`Index Only Scan`	index only	best
`Bitmap Index Scan`	bitmap combine	yes
`Nested Loop`	small joins	small only
`Hash Join`	large joins	yes
`Merge Join`	sorted joins	yes
`Sort`	in-memory or on-disk	watch for disk

6.5 Red flags

Rows Removed by Filter: 100000+
Disk: external merge
estimated/actual > 100
Seq Scan on a large table

7. 10 Reasons an Index Is Not Used

7.1 Function/expression wrap

-- bad
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- good
CREATE INDEX idx_email_lower ON users(LOWER(email));

7.2 Type casting

-- bad: varchar vs int
SELECT * FROM users WHERE phone = 12345678;
-- good
SELECT * FROM users WHERE phone = '12345678';

7.3 Leading wildcard

-- bad
SELECT * FROM users WHERE name LIKE '%alice%';
-- good
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_users_name_trgm ON users USING gin(name gin_trgm_ops);

7.4 NULL comparison

CREATE INDEX idx_active ON users(id) WHERE deleted_at IS NULL;

7.5 OR conditions

Prefer IN or UNION over OR across columns.

7.6 Stale statistics

ANALYZE users;

7.7 Too many rows returned

Above ~10% of rows the planner may choose a seq scan — use partial indexes or different query strategies.

7.8 No optimizer hints

PostgreSQL lacks hints by design; MySQL supports USE INDEX. Use pg_hint_plan if needed.

7.9 Missing leading column

CREATE INDEX idx_users_country_age ON users(country, age);
-- bad: country missing
SELECT * FROM users WHERE age > 25;

7.10 Implicit conversion

-- bad on some DBs
SELECT * FROM orders WHERE order_id = 12345;
-- good
SELECT * FROM orders WHERE order_id = 12345::bigint;

8. Real-World Tuning Cases

8.1 Slow search

SELECT * FROM products
WHERE category = 'electronics'
  AND price BETWEEN 100 AND 500
ORDER BY created_at DESC
LIMIT 20;

Before: Seq Scan + Sort (10s). Fix:

CREATE INDEX idx_products_search ON products(category, price, created_at DESC);

After: Index Scan, 5ms — 2000x faster.

8.2 Slow JOIN

SELECT u.*, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.country = 'KR'
GROUP BY u.id;

Fix:

CREATE INDEX idx_orders_user_id ON orders(user_id);

8.3 Slow JSON query

CREATE INDEX idx_events_metadata ON events USING gin(metadata);

8.4 Composite vs single

Given WHERE a=? AND b=? AND c=? — one composite index on (a, b, c) almost always beats three single-column indexes combined with BitmapAnd.

9. Index Maintenance

9.1 Find unused indexes

SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS unused_index,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size,
  idx_scan AS index_scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

9.2 Index bloat

REINDEX INDEX CONCURRENTLY idx_users_email;

9.3 Redundant indexes

If idx_ab(a, b) exists, idx_a(a) is redundant (leftmost rule).

9.4 Monitoring

SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

10. MySQL vs PostgreSQL

Feature	MySQL (InnoDB)	PostgreSQL
Default index	B+Tree	B-Tree
Clustered index	yes (PRIMARY KEY)	no
Hash index	Memory engine only	yes
Full-text	FULLTEXT	tsvector + GIN
JSON indexing	Generated columns	GIN on JSONB
Partial index	limited (8.0+)	yes
Expression index	8.0+	yes
Covering index	yes	yes (INCLUDE)
BRIN	no	yes
GiST	no	yes

Quiz

1. Which queries can use a composite index on (a, b, c)?

Answer: leftmost prefix rule — usable: WHERE a=?, WHERE a=? AND b=?, WHERE a=? AND b=? AND c=?, partially WHERE a=? AND c=?. Not usable: WHERE b=?, WHERE c=?, WHERE b=? AND c=?.

2. When is a GIN index appropriate?

Answer: multi-valued columns — arrays tags @> ARRAY['redis'], JSONB metadata @> '{"key": "value"}', full-text tsvector @@ tsquery, and trigram acceleration via pg_trgm for LIKE-with-leading-wildcard.

3. Estimated and actual rows differ 1000x — what now?

Answer: statistics are stale. Run ANALYZE, raise default_statistics_target (100 -> 1000), or tune per column with SET STATISTICS. Also consider lowering autovacuum's analyze_scale_factor.

4. Benefits of a Partial Index?

Answer: smaller index, faster build and maintenance, better cache efficiency. Example: CREATE INDEX idx_active ON users(email) WHERE status='active' indexes only active users.

5. Why is Index Only Scan faster than Index Scan?

Answer: a regular Index Scan fetches ctid then reads the heap row. Index Only Scan has all needed columns in the index (use INCLUDE), avoiding table access. Requires an up-to-date visibility map (VACUUM).