Skip to content
Published on

PostgreSQL Advanced Indexing Guide: GIN, GiST, BRIN, and Partial Index in Practice

Authors
  • Name
    Twitter
PostgreSQL Advanced Indexing

Introduction

PostgreSQL offers six index types: B-tree, Hash, GIN, GiST, SP-GiST, and BRIN. While most tutorials stop at B-tree, real-world applications frequently encounter problems that B-tree alone cannot efficiently solve -- JSONB queries, full-text search, spatial queries, and time-series tables with billions of rows.

This article goes beyond B-tree to explore the internal structures of GIN, GiST, BRIN, Partial Index, and Expression Index, covering when and how to apply each with EXPLAIN ANALYZE-based performance data.

PostgreSQL Index Types Overview

A quick summary of PostgreSQL index types:

Index TypeInternal StructureBest Use CasesSizeWrite Cost
B-treeBalanced treeEquality, range, sort, UNIQUEMediumLow
GINInverted index (Posting List/Tree)JSONB, arrays, tsvectorLargeHigh
GiSTGeneralized search treeSpatial data, ranges, proximityMediumMedium
BRINBlock range summaryTime-series, append-only large tablesVery smallVery low
HashHash tablePure equality lookupsSmallLow
SP-GiSTSpace-partitioned treePhone numbers, IP addresses, unbalanced treesMediumMedium

B-tree Limitations and the Need for Advanced Indexes

B-tree is optimized for scalar value equality comparisons and range searches. However, it becomes inefficient or unusable in the following scenarios:

  • JSONB containment queries: Operators like @> in WHERE metadata @> '...' cannot be indexed with B-tree.
  • Full-text search: to_tsvector()-based search requires GIN indexes.
  • Spatial queries: PostGIS functions like ST_DWithin() and ST_Contains() need GiST indexes.
  • Billion-row time-series tables: The B-tree index itself grows to tens of GB, causing memory pressure.

PostgreSQL provides specialized index types to address these challenges.

GIN Index Deep Dive

Internal Structure

GIN (Generalized Inverted Index) uses an inverted index structure. Internally, it builds a B-tree over key values, and each leaf node stores a Posting List or Posting Tree -- a list of TIDs (Tuple Identifiers) for rows containing that key.

For example, if a JSONB column contains "tags": ["python", "database"], the GIN index registers both "python" and "database" as keys and adds the row TID to each key's Posting List.

JSONB Indexing

Creating a GIN index on a JSONB column enables index scans for operators like @>, ?, ?|, and ?&.

-- Create table
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    metadata JSONB NOT NULL
);

-- Insert 1 million test rows
INSERT INTO products (name, metadata)
SELECT
    'product_' || i,
    jsonb_build_object(
        'category', (ARRAY['electronics', 'clothing', 'food', 'toys'])[1 + (i % 4)],
        'price', (random() * 1000)::int,
        'tags', jsonb_build_array(
            (ARRAY['sale', 'new', 'popular', 'limited'])[1 + (i % 4)],
            (ARRAY['premium', 'budget', 'mid-range'])[1 + (i % 3)]
        ),
        'in_stock', (i % 2 = 0)
    )
FROM generate_series(1, 1000000) AS i;

-- GIN index with default jsonb_ops
CREATE INDEX idx_products_metadata_gin ON products USING gin (metadata);

-- GIN index with jsonb_path_ops (smaller index, @> only)
CREATE INDEX idx_products_metadata_path ON products USING gin (metadata jsonb_path_ops);

-- Containment query with EXPLAIN ANALYZE
EXPLAIN ANALYZE
SELECT id, name
FROM products
WHERE metadata @> '{"category": "electronics", "in_stock": true}';

Example output:

Bitmap Heap Scan on products  (cost=52.01..2084.18 rows=500 width=20)
  (actual time=1.234..5.678 rows=125000 loops=1)
  Recheck Cond: (metadata @> '{"category": "electronics", "in_stock": true}'::jsonb)
  Heap Blocks: exact=8334
  ->  Bitmap Index Scan on idx_products_metadata_gin  (cost=0.00..51.88 rows=500 width=0)
      (actual time=0.891..0.891 rows=125000 loops=1)
        Index Cond: (metadata @> '{"category": "electronics", "in_stock": true}'::jsonb)
Planning Time: 0.152 ms
Execution Time: 12.345 ms

jsonb_ops vs jsonb_path_ops comparison:

Propertyjsonb_ops (default)jsonb_path_ops
Supported operators@>, ?, ?|, ?&, @@, @?@>, @@, @?
Index sizeLarge (indexes keys + paths)Small (stores path hashes only)
Key existence checksSupportedNot supported
Containment search speedFastFaster
-- Full-text search table with GIN index
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL,
    tsv tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', title), 'A') ||
        setweight(to_tsvector('english', body), 'B')
    ) STORED
);

-- Create GIN index
CREATE INDEX idx_articles_tsv ON articles USING gin (tsv);

-- Full-text search query
EXPLAIN ANALYZE
SELECT id, title, ts_rank(tsv, q) AS rank
FROM articles, to_tsquery('english', 'postgresql & indexing') AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 10;

Array Indexing

-- GIN index on tag arrays
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    title TEXT,
    tags TEXT[]
);

CREATE INDEX idx_posts_tags_gin ON posts USING gin (tags);

-- Array containment search
EXPLAIN ANALYZE
SELECT * FROM posts WHERE tags @> ARRAY['postgresql', 'performance'];

-- Array overlap search
EXPLAIN ANALYZE
SELECT * FROM posts WHERE tags && ARRAY['database', 'backend'];

GIN Pending List and fastupdate

By default, GIN indexes are configured with fastupdate=on. When new rows are inserted, they are not immediately merged into the index -- instead, they are temporarily stored in a Pending List and batch-merged during VACUUM or when the Pending List exceeds its size limit.

  • Advantage: Improved write performance, especially during bulk INSERTs
  • Disadvantage: CPU/IO spikes during Pending List merge; search queries must also scan the Pending List

In production, tune the gin_pending_list_limit parameter or manually call SELECT gin_clean_pending_list('idx_name') before peak hours.

GiST Index -- Spatial Data and Range Types

Internal Structure

GiST (Generalized Search Tree) is an extensible balanced tree framework. Unlike B-tree, GiST supports various data types and search strategies through operator class methods: consistent, union, penalty, picksplit, and others.

Internally, it uses an R-tree structure that wraps spatial data in bounding boxes (MBR: Minimum Bounding Rectangle) and organizes them hierarchically.

Spatial Data Indexing (PostGIS)

-- Install PostGIS extension
CREATE EXTENSION IF NOT EXISTS postgis;

-- Spatial data table
CREATE TABLE stores (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    location GEOMETRY(Point, 4326) NOT NULL
);

-- Insert 1 million test rows (random coordinates around Seoul)
INSERT INTO stores (name, location)
SELECT
    'store_' || i,
    ST_SetSRID(ST_MakePoint(
        126.9 + random() * 0.2,   -- longitude
        37.4 + random() * 0.2     -- latitude
    ), 4326)
FROM generate_series(1, 1000000) AS i;

-- Create GiST index
CREATE INDEX idx_stores_location_gist ON stores USING gist (location);

-- Find stores within 1km radius
EXPLAIN ANALYZE
SELECT id, name,
       ST_Distance(location::geography,
                   ST_SetSRID(ST_MakePoint(127.0, 37.5), 4326)::geography) AS distance_m
FROM stores
WHERE ST_DWithin(location::geography,
                 ST_SetSRID(ST_MakePoint(127.0, 37.5), 4326)::geography,
                 1000)
ORDER BY distance_m
LIMIT 20;

Example output:

Limit  (cost=8.45..8.50 rows=20 width=44)
  (actual time=2.345..2.678 rows=20 loops=1)
  ->  Sort  (cost=8.45..8.52 rows=25 width=44)
      (actual time=2.340..2.350 rows=20 loops=1)
        Sort Key: (st_distance(...))
        Sort Method: top-N heapsort  Memory: 27kB
        ->  Index Scan using idx_stores_location_gist on stores
            (cost=0.42..7.89 rows=25 width=44)
            (actual time=0.234..2.123 rows=785 loops=1)
              Index Cond: (location && ...)
              Filter: st_dwithin(...)
Planning Time: 0.456 ms
Execution Time: 2.789 ms

Without the GiST index, a Sequential Scan across all 1 million rows would take several seconds.

Range Type Indexing

-- Overlapping time range search for a reservation system
CREATE TABLE reservations (
    id SERIAL PRIMARY KEY,
    room_id INT NOT NULL,
    period TSTZRANGE NOT NULL,
    guest_name TEXT
);

-- GiST index for range overlap optimization
CREATE INDEX idx_reservations_period_gist ON reservations USING gist (period);

-- Find overlapping reservations
EXPLAIN ANALYZE
SELECT * FROM reservations
WHERE period && tstzrange('2026-03-10 14:00', '2026-03-10 18:00', '[)');

-- EXCLUDE constraint to prevent overlaps (requires GiST)
ALTER TABLE reservations
ADD CONSTRAINT no_overlap
EXCLUDE USING gist (room_id WITH =, period WITH &&);

GiST supports range operators: && (overlap), @> (contains), <@ (contained by), << (left of), >> (right of), -|- (adjacent).

GiST indexes support KNN searches that handle ORDER BY distance patterns at the index level.

-- Find the 10 nearest stores using index scan
SELECT id, name, location <-> ST_SetSRID(ST_MakePoint(127.0, 37.5), 4326) AS dist
FROM stores
ORDER BY location <-> ST_SetSRID(ST_MakePoint(127.0, 37.5), 4326)
LIMIT 10;

The <-> operator combined with ORDER BY ... LIMIT performs efficient KNN search through the GiST index. Rather than sorting the entire table, it traverses the index tree and returns results in nearest-first order.

BRIN Index -- Time-Series and Large Tables

Internal Structure

BRIN (Block Range Index) stores only summary information -- minimum and maximum values -- for each physical block range (default: 128 pages) in the table. This makes the index extremely small, typically just a few MB even for tables with billions of rows.

The key prerequisite: there must be a strong correlation between column values and their physical storage order. Timestamp columns in time-series data are the classic example.

Time-Series Data Usage

-- Event log table (time-series)
CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    event_type TEXT NOT NULL,
    payload JSONB
);

-- Insert 50 million test rows
INSERT INTO events (created_at, event_type, payload)
SELECT
    '2025-01-01'::timestamptz + (i || ' seconds')::interval,
    (ARRAY['click', 'view', 'purchase', 'signup'])[1 + (i % 4)],
    jsonb_build_object('user_id', (i % 100000), 'value', random() * 100)
FROM generate_series(1, 50000000) AS i;

-- B-tree index for comparison
CREATE INDEX idx_events_created_btree ON events (created_at);

-- BRIN index
CREATE INDEX idx_events_created_brin ON events USING brin (created_at)
WITH (pages_per_range = 128);

-- Compare index sizes
SELECT
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size
FROM pg_indexes
WHERE tablename = 'events' AND indexname LIKE 'idx_events_created%';

Size comparison example:

       indexname              | index_size
------------------------------+------------
 idx_events_created_btree     | 1071 MB
 idx_events_created_brin      | 128 kB

The BRIN index is approximately 8,500 times smaller than B-tree while handling the same range queries.

-- Range query using BRIN index
EXPLAIN ANALYZE
SELECT count(*) FROM events
WHERE created_at BETWEEN '2025-06-01' AND '2025-06-30';

Example output:

Aggregate  (cost=456789.12..456789.13 rows=1 width=8)
  (actual time=234.567..234.568 rows=1 loops=1)
  ->  Bitmap Heap Scan on events  (cost=48.12..445678.90 rows=2592000 width=0)
      (actual time=12.345..198.765 rows=2592000 loops=1)
        Recheck Cond: (created_at >= ... AND created_at <= ...)
        Rows Removed by Recheck: 45678
        Heap Blocks: lossy=19200
        ->  Bitmap Index Scan on idx_events_created_brin  (cost=0.00..47.50 rows=2600000 width=0)
            (actual time=0.234..0.234 rows=192000 loops=1)
Planning Time: 0.123 ms
Execution Time: 256.789 ms

pages_per_range Tuning

The pages_per_range value is a trade-off between accuracy and index size.

pages_per_rangeIndex SizeAccuracyBest For
32LargerHighFrequent small-range queries
128 (default)MediumMediumGeneral time-series data
256+Very smallLowPrimarily large-range queries

When BRIN Is Not Suitable

  • Data is inserted randomly with no correlation between physical and logical order
  • UPDATE operations move rows to different pages (HOT failure), breaking the correlation
  • Point queries (single row lookups) are the primary workload

Check the correlation value in pg_stats to assess BRIN suitability. Values closer to 1 (absolute) are ideal.

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname = 'created_at';
-- correlation above 0.95 indicates good BRIN suitability

Partial Index and Expression Index

Partial Index

Create an index on only a subset of table rows to reduce index size and improve write performance.

-- Order table: index only active orders
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    total_amount NUMERIC(10,2),
    created_at TIMESTAMPTZ DEFAULT now()
);

-- Full index vs Partial index size comparison
CREATE INDEX idx_orders_status_full ON orders (status, created_at);
CREATE INDEX idx_orders_status_partial ON orders (created_at)
    WHERE status IN ('pending', 'processing');

-- Query that uses the partial index
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE status = 'pending'
  AND created_at > now() - interval '7 days'
ORDER BY created_at DESC;

The key to partial indexes is that the query WHERE clause must imply the index predicate. The PostgreSQL optimizer can recognize simple inequality implications, but may not recognize complex expressions.

Unique Partial Index for Conditional Uniqueness

-- Allow only one active email per user
CREATE UNIQUE INDEX idx_unique_active_email
ON users (email) WHERE is_active = true;

-- Rows where is_active = false can have duplicate emails
-- Rows where is_active = true enforce email uniqueness

Expression Index

Create an index on the result of transforming column values. The query must use the exact same expression for the index to be utilized.

-- Expression index for case-insensitive search
CREATE INDEX idx_users_email_lower ON users (lower(email));

-- This query uses the index
EXPLAIN ANALYZE
SELECT * FROM users WHERE lower(email) = 'user@example.com';

-- Date extraction expression index
CREATE INDEX idx_orders_created_date ON orders ((created_at::date));

-- Used for date-based aggregations
EXPLAIN ANALYZE
SELECT created_at::date AS order_date, count(*)
FROM orders
WHERE created_at::date = '2026-03-10'
GROUP BY created_at::date;

-- Expression index on a specific JSONB key (smaller than full GIN)
CREATE INDEX idx_products_category ON products ((metadata->>'category'));

-- B-tree based, so equality and range searches work
EXPLAIN ANALYZE
SELECT * FROM products WHERE metadata->>'category' = 'electronics';

Combining Partial + Expression

-- Optimize email search for recently active users
CREATE INDEX idx_recent_active_users_email
ON users (lower(email))
WHERE last_login_at > now() - interval '7 days'
  AND is_active = true;

This combination dramatically reduces index size while providing optimal performance for specific query patterns.

EXPLAIN ANALYZE for Index Performance Analysis

The most reliable way to verify index usage is EXPLAIN ANALYZE.

-- Basic EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM products
WHERE metadata @> '{"category": "electronics"}';

-- Key points to check in the execution plan:
-- 1. Scan type: Index Scan vs Bitmap Index Scan vs Seq Scan
-- 2. actual time: Real elapsed time
-- 3. rows: Estimated vs actual row count discrepancy
-- 4. Buffers: shared hit (cache) vs shared read (disk)

Diagnosing Index Non-Usage

Common reasons why an existing index is not used:

  1. Stale statistics: ANALYZE not run, causing the optimizer to estimate incorrect cardinality
  2. Low selectivity: When results exceed 10-15% of the table, Sequential Scan is more efficient
  3. Type mismatch: Query condition data type differs from the indexed column type
  4. Expression mismatch: Expression Index and query expression do not match exactly
  5. enable_indexscan = off: Index scan disabled at the session level
-- Find unused indexes
SELECT
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read
FROM pg_stat_user_indexes i
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND NOT indisunique
  AND NOT indisprimary
ORDER BY pg_relation_size(i.indexrelid) DESC;

Index Bloat Management and REINDEX

What Is Index Bloat

Due to PostgreSQL MVCC architecture, UPDATE is internally handled as DELETE + INSERT. Index entries for deleted tuples occupy space until VACUUM cleans them up. When this accumulates, index bloat occurs -- the index grows abnormally large.

Measuring Bloat

-- Measure bloat using pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT
    avg_leaf_density,
    leaf_pages,
    empty_pages,
    deleted_pages,
    round(100 - avg_leaf_density, 1) AS bloat_pct
FROM pgstatindex('idx_orders_status_full');

-- Consider REINDEX when bloat_pct exceeds 20%

REINDEX CONCURRENTLY

In production, use REINDEX CONCURRENTLY to rebuild indexes without locking the table.

-- Rebuild a specific index (non-blocking)
REINDEX INDEX CONCURRENTLY idx_orders_status_full;

-- Rebuild all indexes on a table
REINDEX TABLE CONCURRENTLY orders;

-- Rebuild all indexes in a schema
REINDEX SCHEMA CONCURRENTLY public;

Important caveats:

  • If REINDEX CONCURRENTLY fails midway, it leaves behind an invalid index with a _ccnew suffix. You must check for and drop these.
  • REINDEX CONCURRENTLY holds the xmin horizon during execution, which can delay dead tuple cleanup by other VACUUM processes.
-- Check for invalid indexes
SELECT indexrelid::regclass AS index_name,
       indisvalid
FROM pg_index
WHERE NOT indisvalid;

-- Drop invalid indexes
-- DROP INDEX CONCURRENTLY idx_name_ccnew;

Automated Management Strategy

-- Monitoring query to find indexes with bloat exceeding 20%
SELECT
    schemaname || '.' || tablename AS table,
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS size,
    round(100 - (pgstatindex(indexname)).avg_leaf_density, 1) AS bloat_pct
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY bloat_pct DESC NULLS LAST;

Index Type Comparison Table

PropertyB-treeGINGiSTBRIN
Best data typesScalar valuesJSONB, arrays, tsvectorSpatial, rangesTime-series, sequential
Index sizeMediumLarge (60-80% of table)MediumExtremely small (KB)
Write overheadLowHighMediumVery low
Point queriesOptimalSupportedSupportedInefficient
Range queriesOptimalNot supportedSupportedOptimal (high correlation)
Full-text searchNot supportedOptimalNot supportedNot supported
Spatial queriesNot supportedNot supportedOptimalLimited
KNN searchNot supportedNot supportedOptimalNot supported
Unique constraintsSupportedNot supportedNot supportedNot supported
Index-only scanSupportedNot supportedNot supportedNot supported
CONCURRENTLY createSupportedSupportedSupportedSupported
Physical sort requiredNoNoNoYes

Operations Checklist

Pre-Index Creation Checklist

  • Analyze query patterns: Check high-frequency queries in pg_stat_statements
  • Verify the target column data type and operators (determine if B-tree suffices)
  • Check the current execution plan with EXPLAIN ANALYZE
  • Estimate table size and expected index size
  • Plan for non-blocking creation with the CONCURRENTLY option

Regular Monitoring Items

  • Unused indexes (pg_stat_user_indexes.idx_scan = 0)
  • Index bloat percentage (pgstatindex function)
  • Index size relative to table size
  • BRIN index correlation (pg_stats.correlation)
  • GIN Pending List size (pg_stat_all_indexes.idx_tup_insert)

Index Type Selection Flow

  1. Equality / range / sort queries -> B-tree
  2. JSONB containment, arrays, full-text search -> GIN
  3. Spatial data, range overlaps, KNN -> GiST
  4. Time-series / append-only large table range queries -> BRIN
  5. Frequently querying only specific row subsets -> Partial Index
  6. Searching on function/transformation results -> Expression Index

Conclusion

Each PostgreSQL advanced index has a clear design purpose and optimal scenario. The key is to select the right index type based on your data characteristics and query patterns.

  • If you have many JSONB or full-text search queries, consider GIN -- but account for write overhead and index size.
  • For spatial data or range overlap queries, GiST is practically mandatory.
  • For billion-row time-series tables, BRIN can reduce index size by over 99%.
  • Partial Index and Expression Index can be combined with any index type for additional optimization.

Most importantly, always validate actual execution plans with EXPLAIN ANALYZE and continuously monitor index bloat in production. Indexes are not fire-and-forget assets -- they are operational objects that require ongoing maintenance.