ClickHouse Internals Deep Dive — MergeTree, Vectorized Execution, Distributed Queries, Keeper (2025)

TL;DR

  • ClickHouse: built by Yandex in 2009 for internal analytics, open-sourced in 2016. Written in C++, the flagship columnar OLAP engine.
  • Columnar storage: values of the same column stored contiguously. For queries like SELECT sum(x) WHERE y > 0 only that column is read — 10-100x faster.
  • MergeTree: ClickHouse's core engine. LSM-like "parts" with background merges, but stored by column rather than by row.
  • Sparse Primary Index: one entry per granule (default 8192 rows), not per row. Index fits entirely in memory.
  • Vectorized Execution: processes tens of thousands of rows at a time in loops. SIMD-friendly — compilers exploit AVX2/AVX-512 automatically.
  • Compression: per-column LZ4/ZSTD by default. Time-series data uses Delta + Gorilla for aggressive compression.
  • Distributed Engine: sharding + replication. Query fans out to shards and aggregates.
  • Keeper: ZooKeeper replacement. Raft-based, reimplemented in C++. Optimized for ClickHouse workloads.
  • Materialized View: results precomputed at insert time. Essential for real-time aggregation.
  • 2025 competitors: Druid, Pinot (real-time aggregation), DuckDB (embedded), Snowflake/BigQuery (managed).

1. Why ClickHouse Is Fast

1.1 OLAP vs OLTP

OLTP (Online Transaction Processing):

  • Reads/writes few rows (single-row inserts/updates).
  • O(log n) with an index.
  • Postgres, MySQL, Oracle.

OLAP (Online Analytical Processing):

  • Scans millions to billions of rows.
  • Only a few columns needed.
  • Aggregations (sum, count, avg, distinct).
  • ClickHouse, Snowflake, BigQuery, Druid.

1.2 Typical OLAP Query

SELECT user_country, SUM(revenue)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_country
ORDER BY SUM(revenue) DESC
LIMIT 10;

This query:

  • Scans billions of rows by date range.
  • Needs only user_country and revenue (not full rows).
  • Returns a small aggregated result.

Why is Postgres slow here? Row storage forces reading all columns even if unused, indexes do not help for large scans, and per-row processing accumulates function-call overhead.

1.3 Three Weapons

  1. Columnar storage: read only the columns you need.
  2. Vectorized execution: process tens of thousands of rows at a time with SIMD.
  3. Heavy compression: save disk I/O bandwidth.

Combined, these make ClickHouse 100-1000x faster than Postgres for OLAP.


2. History

Yandex built ClickHouse for Metrica, their Google Analytics competitor. MySQL collapsed under billions of events per day; a team led by Alexey Milovidov rewrote the engine in C++. Prototype in 2009, production in 2012. Open-sourced in 2016 to explosive adoption: CloudFlare, Uber, Bloomberg, GitLab, Deutsche Bank. ClickHouse Inc. spun out in 2021 with $250M+ in funding and launched ClickHouse Cloud. As of 2025: 35,000+ GitHub stars, thousands of production users, monthly releases.


3. Columnar Storage Basics

3.1 Row vs Column

Row-based (Postgres, MySQL):

Row 1: [id=1, name="Alice", age=30, country="US"]
Row 2: [id=2, name="Bob",   age=25, country="UK"]

On disk: 1|Alice|30|US|2|Bob|25|UK|...

Column-based (ClickHouse):

Column "id":      [1, 2, 3, ...]
Column "name":    ["Alice", "Bob", ...]
Column "age":     [30, 25, ...]
Column "country": ["US", "UK", ...]

Each column is a separate file.

3.2 Why Columnar Is Faster for Analytics

I/O reduction: SELECT AVG(age) reads only age.bin. A 100-column table with a 1-column query means 100x less I/O.

Compression: values in one column share type and distribution. A country column of "US", "UK", "CN" compresses 10x+ with LZ4. Mixed-type rows compress poorly.

Cache-friendly: sequential column scans keep L1 hot.

SIMD: same-type contiguous values let AVX process 8 int32 values per 256-bit register.

3.3 When Row-Based Wins

  • Point lookup WHERE id = 42.
  • Update-heavy workloads.
  • Small result, many columns returned.

Do not use ClickHouse for OLTP.


4. MergeTree — The Heart of ClickHouse

4.1 Core Idea

Inspired by LSM-trees but redesigned for columnar storage.

Insert -> memory buffer -> disk "part" (columnar files)
                                  |
                          background merge
                                  |
                          larger sorted part

4.2 Part Structure

A part is a directory:

/var/lib/clickhouse/data/mydb/mytable/
  20240101_1_1_0/              # partition_minBlock_maxBlock_level
    checksums.txt
    columns.txt
    count.txt
    primary.idx                # sparse primary index
    user_id.bin                # column file
    user_id.mrk2               # marks (offsets)
    event_time.bin
    event_time.mrk2
    revenue.bin
    revenue.mrk2

Each column has a .bin and .mrk2 pair. Marks hold granule offsets in the .bin file.

4.3 Granules and the Sparse Primary Index

MergeTree's key insight: no per-row index entry.

  • Default 8192 rows = one granule.
  • Primary index points only to the first row of each granule.

For WHERE user_id = 12345: binary-search the primary index in memory, find the target granule, read it from disk (~8192 rows), scan for the value. The index is 1/8192 the size of a traditional one. A billion-row table's index is a few MB — entirely in RAM.
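
A quick way to watch the pruning (a sketch; it assumes the events table defined in 4.5, where event_time leads the ORDER BY, and the indexes option of EXPLAIN available in recent releases):

EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE event_time >= '2024-06-01 00:00:00';

The PrimaryKey section of the output reports how many granules were selected out of the total.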

4.4 Why Sparse Is Fine

OLAP typically aggregates many rows. Reading one granule (8192 rows) maps well to disk I/O page units — several pages per granule. Point lookups are supported but are not ClickHouse's strength.

4.5 ORDER BY Is the Primary Key

CREATE TABLE events (
    event_time DateTime,
    user_id UInt64,
    event_type String,
    revenue Decimal(10, 2)
) ENGINE = MergeTree
ORDER BY (event_time, user_id)
PARTITION BY toYYYYMM(event_time);

Data is physically sorted by (event_time, user_id); the primary index is a sparse sample of that ordering.

4.6 Partitioning

PARTITION BY toYYYYMM(event_time) places monthly data in separate directories. Benefits: partition pruning, near-instant DROP PARTITION, partition-unit backup/replication. Avoid too many partitions — monthly or weekly is typical.
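
Partition maintenance is then a metadata operation. A sketch against the events table above:

ALTER TABLE events DROP PARTITION 202401;   -- removes January 2024 almost instantly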


5. Write Path

5.1 INSERT Journey

  1. Input batched.
  2. Per-column vectors built in memory.
  3. Columns sorted by ORDER BY.
  4. Each column written to a separate compressed file.
  5. Part directory created, manifest updated.

A new part is created; old parts continue to coexist.

5.2 Background Merges

A background thread periodically merges small parts into larger ones (same partition only). Merges happen online; old parts are dropped when done.

5.3 Why Merging Matters

Too many parts (default limit 300 per partition) slows queries and triggers the "Too many parts" error. UPDATE/DELETE mutations are actually applied during merges.
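
A mutation sketch (standard ALTER ... DELETE syntax; the statement returns quickly and the affected parts are rewritten in the background):

ALTER TABLE events DELETE WHERE user_id = 12345;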

5.4 Insert Optimization

Bad: one INSERT per row — each creates a new part. Good: batched INSERT of thousands to tens of thousands of rows. For streaming use Kafka table engine or an external ingester that batches.
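
A minimal sketch of a batched insert, plus server-side batching via async_insert (column names follow the events table from 4.5; values are made up):

INSERT INTO events (event_time, user_id, event_type, revenue) VALUES
    ('2024-06-01 12:00:00', 101, 'view', 0.00),
    ('2024-06-01 12:00:01', 102, 'click', 0.00),
    ('2024-06-01 12:00:02', 103, 'purchase', 9.99);

-- Or let the server buffer and batch small inserts
SET async_insert = 1, wait_for_async_insert = 0;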


6. Read Path and Vectorized Execution

6.1 Query Stages

  1. Parse and plan.
  2. Partition pruning.
  3. Granule selection via primary index.
  4. Read only required columns.
  5. Vectorized execution.
  6. Aggregate (hash table for GROUP BY).
  7. Sort and return.

6.2 Vectorized Execution

Traditional DBs use the Volcano model (one row per next() call) — heavy function-call overhead and branch mispredictions.

ClickHouse processes blocks (~65536 rows):

while (block = scan.next_block()) {   // 65536 rows
    auto mask = filter_block(block);  // SIMD
    aggregate_block(block, mask);     // SIMD
}

Compilers auto-vectorize these tight loops into AVX2.

6.3 SIMD Example

#include <immintrin.h>  // AVX2 intrinsics

// Naive scalar loop
int64_t sum = 0;
for (int i = 0; i < n; i++) sum += revenue[i];

// AVX2 version: revenue is int64_t*, n assumed divisible by 4 (no tail handling)
__m256i sum_vec = _mm256_setzero_si256();
for (int i = 0; i < n; i += 4) {                               // 4 x int64 per 256-bit register
    __m256i v = _mm256_loadu_si256((const __m256i*)&revenue[i]);
    sum_vec = _mm256_add_epi64(sum_vec, v);
}
int64_t lanes[4];
_mm256_storeu_si256((__m256i*)lanes, sum_vec);                 // spill the four lanes
int64_t simd_sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];  // horizontal reduction

4-8x speedup measured. ClickHouse mostly relies on the compiler — code is written in a vectorization-friendly style.

6.4 Runtime CPU Dispatch

ClickHouse detects CPU features (SSE2, SSE4.2, AVX2, AVX-512) at runtime and picks the best implementation via CPUID.


7. Data Skipping Indexes

The primary index only helps with filters on ORDER BY columns. For other columns, ClickHouse offers data skipping indexes:

7.1 Bloom Filter

ALTER TABLE events
ADD INDEX idx_user_email user_email TYPE bloom_filter GRANULARITY 4;

Skips granules when the bloom filter reports "definitely not here".

7.2 MinMax

ALTER TABLE events
ADD INDEX idx_revenue revenue TYPE minmax GRANULARITY 1;

Cheap and effective when the column's values correlate with the sort order; a good default choice.

7.3 Set

ALTER TABLE events
ADD INDEX idx_country user_country TYPE set(100) GRANULARITY 4;

Good for low-cardinality columns.

7.4 N-gram

ALTER TABLE events
ADD INDEX idx_log_msg log_message TYPE ngrambf_v1(4, 1024, 3, 0) GRANULARITY 1;

Accelerates LIKE '%error%'-style queries on log strings.

7.5 Pitfalls

Granule-level, not row-level. Overhead on writes and part metadata. Choose carefully.


8. Compression

Per-column codecs, independently chosen.

8.1 Base Codecs

  • LZ4: default, fast, moderate ratio.
  • ZSTD: better ratio, slightly slower; levels 1-22. Good for cold data.

CREATE TABLE events (
    event_time DateTime CODEC(DoubleDelta, ZSTD(3)),
    user_id UInt64 CODEC(T64, LZ4)
) ENGINE = MergeTree
ORDER BY user_id;

8.2 Delta / DoubleDelta

Delta stores the difference from the previous value: raw values [1000, 1001, 1003, 1005, 1010, 1012] become [1000, 1, 2, 2, 5, 2]. DoubleDelta (delta-of-delta) reduces regular timestamps to near-zero values — often 10x+ compression.

8.3 Gorilla

Facebook's time-series trick: XOR successive floats; zero-heavy results compress well. 5-10x on metric columns.

8.4 T64

Bit-packs integers into the minimum bit width used.

8.5 Codec Chains

event_time DateTime CODEC(DoubleDelta, LZ4)

Transform first, then general compression.

8.6 Realistic Ratios

Raw CSV 100 GB -> row DB ~80 GB (LZ4) -> ClickHouse 10-15 GB. Columnar plus specialized codecs gives 7-10x vs raw.


9. MergeTree Family

9.1 ReplacingMergeTree

Keeps only the latest row per key (by version column) during merges.

CREATE TABLE users (
    user_id UInt64,
    name String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

Use FINAL to force deduplication at query time (slower).
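
A query-time dedup sketch (FINAL collapses duplicates on the fly, which is slower on large tables):

SELECT user_id, name, updated_at
FROM users FINAL
WHERE user_id = 42;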

9.2 SummingMergeTree

Sums numeric columns for rows sharing the ORDER BY key.

CREATE TABLE daily_stats (
    date Date,
    user_id UInt64,
    pageviews UInt64,
    clicks UInt64
) ENGINE = SummingMergeTree()
ORDER BY (date, user_id);

Always query with explicit GROUP BY + sum() for correctness before merges complete.
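
A correctness-safe read, as a sketch against the daily_stats table above:

SELECT date, user_id, sum(pageviews) AS pageviews, sum(clicks) AS clicks
FROM daily_stats
GROUP BY date, user_id;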

9.3 AggregatingMergeTree

Stores intermediate states of arbitrary aggregate functions.

CREATE TABLE hourly_stats (
    hour DateTime,
    country String,
    unique_users AggregateFunction(uniq, UInt64),
    avg_revenue AggregateFunction(avg, Decimal(10, 2))
) ENGINE = AggregatingMergeTree()
ORDER BY (hour, country);

Merges combine states — pairs with Materialized Views.
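
Reading the stored states back requires the matching -Merge combinators (a sketch against hourly_stats above):

SELECT
    hour,
    country,
    uniqMerge(unique_users) AS unique_users,
    avgMerge(avg_revenue) AS avg_revenue
FROM hourly_stats
GROUP BY hour, country;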

9.4 CollapsingMergeTree

Uses a sign Int8 column (+1/-1) to cancel prior state — useful for replicating MySQL binlogs.
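
A minimal sketch (table and column names are illustrative):

CREATE TABLE user_balance (
    user_id UInt64,
    balance Decimal(10, 2),
    sign Int8
) ENGINE = CollapsingMergeTree(sign)
ORDER BY user_id;

To change a row, insert the previous values with sign = -1 and the new values with sign = +1; merges cancel the pair.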

9.5 VersionedCollapsingMergeTree

CollapsingMergeTree with explicit version for out-of-order merges.

9.6 GraphiteMergeTree

Graphite-compatible time-series metrics with rule-based rollup (downsampling) applied during merges.


10. Materialized Views

MVs behave like insert triggers.

10.1 Basic

CREATE MATERIALIZED VIEW hourly_pageviews_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, page)
AS
SELECT
    toStartOfHour(event_time) AS hour,
    page,
    count() AS views
FROM events
GROUP BY hour, page;

Each insert into events triggers the SELECT; results are inserted into the MV. Queries run against the much smaller MV — 100x+ faster.
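
Dashboards then query the MV instead of the raw table, for example (a sketch):

SELECT hour, page, sum(views) AS views
FROM hourly_pageviews_mv
GROUP BY hour, page
ORDER BY views DESC
LIMIT 10;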

10.2 AggregateFunction MV

CREATE MATERIALIZED VIEW user_metrics_mv
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id)
AS
SELECT
    toDate(event_time) AS date,
    user_id,
    uniqState(session_id) AS unique_sessions,
    sumState(revenue) AS total_revenue
FROM events
GROUP BY date, user_id;

Query with uniqMerge / sumMerge to combine states — supports distinct counts and other complex aggregates.
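
A read sketch using the -Merge combinators:

SELECT
    date,
    user_id,
    uniqMerge(unique_sessions) AS unique_sessions,
    sumMerge(total_revenue) AS total_revenue
FROM user_metrics_mv
GROUP BY date, user_id;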

10.3 Fan-out

One insert can drive many MVs (hourly, daily, country, user). Pick the right MV per query.

10.4 Caveats

Inserts to the source table and its MVs are not atomic, so a partial failure can leave them inconsistent. MVs see only data arriving at insert time; existing rows are not backfilled. Heavy JOINs inside an MV slow down every insert. TTLs don't propagate from source to MV.


11. Projections

An evolution of MVs, available since the ClickHouse 21.x releases. A projection stores an alternate sort order (optionally pre-aggregated) inside the same table.

ALTER TABLE events
ADD PROJECTION p_country
(
    SELECT event_time, user_country, revenue
    ORDER BY (user_country, event_time)
);

ALTER TABLE events MATERIALIZE PROJECTION p_country;

ClickHouse picks the projection automatically when the query fits.
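
For example, a query shaped like this (a sketch; columns follow the projection definition above) can be served from p_country's ordering instead of the base table's:

SELECT user_country, sum(revenue)
FROM events
WHERE user_country = 'DE'
GROUP BY user_country;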

Aspect            Materialized View    Projection
Storage           Separate table       Inside source table
Usage             Explicit name        Auto-selected
Deletion/TTL      Independent          Follows source
Complex queries   Yes                  Limited

12. Distributed Engine and Sharding

ClickHouse scales horizontally by sharding.

12.1 Concept

  • Local table: each shard stores its own data.
  • Distributed table: routes queries across shards.

12.2 Cluster Configuration

<clickhouse>
  <remote_servers>
    <my_cluster>
      <shard>
        <replica>
          <host>shard1-r1</host>
          <port>9000</port>
        </replica>
        <replica>
          <host>shard1-r2</host>
          <port>9000</port>
        </replica>
      </shard>
      <shard>
        <replica>
          <host>shard2-r1</host>
        </replica>
      </shard>
    </my_cluster>
  </remote_servers>
</clickhouse>

12.3 Distributed Table

CREATE TABLE events_distributed ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, default, events, rand());

12.4 Query Execution

Coordinator fans out the query, each shard aggregates locally, then the coordinator merges partial results. Map-reduce style with minimal network overhead.
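
The query from 1.2 only needs the table name swapped (a sketch; column names follow that example):

SELECT user_country, SUM(revenue)
FROM events_distributed
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_country
ORDER BY SUM(revenue) DESC
LIMIT 10;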

12.5 Sharding Key

Good: high-cardinality, evenly distributed, used in filters (enabling local queries). Bad: country (skewed), timestamp (hot shard for latest data).

12.6 Replication

ReplicatedMergeTree replicates within a shard via Keeper/ZooKeeper:

CREATE TABLE events (...) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}'
)
ORDER BY ...;

13. Keeper — ZooKeeper Replacement

ZooKeeper (Java) is memory-heavy with GC pauses and isn't optimal for ClickHouse's access patterns. ClickHouse Keeper (2021) is a C++ reimplementation with Raft consensus, ZK wire-protocol compatible, 1.5-2x faster, half the memory. Can be embedded in the ClickHouse binary or run standalone. Most new deployments use Keeper; ZK is legacy.

<clickhouse>
  <keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <raft_configuration>
      <server><id>1</id><hostname>keeper1</hostname><port>9234</port></server>
      <server><id>2</id><hostname>keeper2</hostname><port>9234</port></server>
      <server><id>3</id><hostname>keeper3</hostname><port>9234</port></server>
    </raft_configuration>
  </keeper_server>
</clickhouse>

14. Practical Tuning

  • Insert batching: 10K-100K rows. Use Buffer tables, Kafka table engine, or async_insert.
  • ORDER BY design: low cardinality first, then higher. Three to five columns is typical.
  • Part management: monitor system.parts; default limit 300 per partition. Increase background_pool_size if needed.
  • Memory limits: max_memory_usage; enable max_bytes_before_external_group_by to spill to disk.
  • Observability: system.parts, system.query_log, system.metrics, system.events, system.asynchronous_metrics.

Example: the slowest queries recorded today.

SELECT query, query_duration_ms, memory_usage
FROM system.query_log
WHERE event_date = today() AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;
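
To keep an eye on part counts per partition, a sketch using system.parts:

SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;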

15. ClickHouse vs Alternatives

15.1 DuckDB

Embedded columnar DB (SQLite-style). Single process, no distribution. Great for local analysis up to TB scale. Same vectorized philosophy, different deployment.

15.2 Druid / Pinot

Real-time ingestion and dashboards. Druid "segments" resemble parts; Druid requires rollup definition upfront (less flexible). ClickHouse has richer SQL.

15.3 Snowflake / BigQuery

Fully managed. Higher cost, zero ops. ClickHouse is cheaper and often faster, but requires an ops team.

15.4 StarRocks / Doris

China-led alternatives with MySQL-compatible frontends and strong join performance. Smaller ecosystem.


16. Learning Path

  1. Start with the official docs and the NYC Taxi tutorial.
  2. Practice ORDER BY and partitioning design; follow Altinity Knowledge Base.
  3. Explore system.* tables; watch "ClickHouse under the hood" talks by Alexey Milovidov.
  4. Design distributed queries, run Keeper, build Materialized View patterns.

17. Cheat Sheet

Columnar storage:
  - One file per column
  - Read only needed columns
  - Better compression
  - SIMD-friendly

MergeTree:
  - Part = directory of column .bin + .mrk2
  - Granule = 8192 rows
  - Sparse primary index
  - Background merges
  - Partition pruning

Family:
  MergeTree (base), Replacing (dedup), Summing (sum),
  Aggregating (generic), Collapsing (state cancel), Replicated

Vectorization:
  - Block-wise (~65536 rows)
  - Auto SIMD (AVX2/AVX-512)
  - Runtime CPU dispatch

Indexes:
  Primary: sparse, granule starts
  Skipping: minmax / bloom / set / ngrambf

Compression:
  LZ4 (default), ZSTD (cold)
  Delta, DoubleDelta (timestamps)
  Gorilla (float), T64 (bit packing)

Materialized View:
  - Insert trigger
  - Pre-aggregation
  - AggregateFunction state

Distributed:
  - Shard x Replica
  - Distributed engine = query router
  - ReplicatedMergeTree + Keeper

Projection:
  Alternate sort inside same table, auto-selected

vs Competitors:
  DuckDB (embedded), Druid/Pinot (real-time),
  Snowflake/BQ (managed)

18. Quiz

Q1. Why does columnar storage make OLAP queries fast?

A. Three reasons: (1) I/O reduction — only needed columns are read, so a 100-column table with a 1-column query reads ~100x less. (2) Compression — uniform type/distribution in each column (e.g., country repeating "US"/"UK") compresses 10x+ with LZ4. (3) SIMD — contiguous same-type values load directly into AVX registers for 8x int32 parallelism, plus cache-line efficiency. Combined: 100-1000x faster than Postgres for analytics.

Q2. What is the sparse primary index and why is it acceptable?

A. One index entry per granule (default 8192 rows), not per row. A billion-row table has only a few MB of index — fully RAM-resident. Point lookups require a full-granule scan, but OLAP aggregates many rows, so granule-level reads match disk page I/O well. Opposite philosophy to OLTP dense indexes.

Q3. Why is vectorized execution faster than the Volcano model?

A. Eliminates per-row function-call overhead and enables SIMD. Volcano calls next() per row — tens of ns each, devastating over billions of rows, plus branch mispredictions. Vectorized runs blocks of ~65536 rows, so one call amortizes over thousands of values, and contiguous same-type data lets compilers emit AVX2/AVX-512 automatically. 10-100x measured speedup.

Q4. Why place low-cardinality columns first in ORDER BY?

A. Data is physically sorted by ORDER BY. Putting low-cardinality columns like country first creates long contiguous runs per value, so WHERE country = 'US' skips huge granule ranges. Putting high-cardinality columns first scatters values across granules, defeating pruning. Rule: frequently filtered low-cardinality columns first, time or high-cardinality range columns after.

Q5. SummingMergeTree vs AggregatingMergeTree?

A. SummingMergeTree only sums numeric columns per key — simple and fast, but limited. AggregatingMergeTree stores intermediate states (e.g., AggregateFunction(uniq, ...)) and merges them; combined with -State/-Merge suffixes it supports distinct counts, quantiles, and other complex aggregates. Use Summing for counters, Aggregating for richer analytics (typically via Materialized Views).

Q6. Why did ClickHouse Keeper replace ZooKeeper?

A. ZooKeeper's Java runtime is memory-heavy with GC pauses and not optimized for ClickHouse's high-frequency metadata patterns. Keeper is a C++ rewrite with Raft, wire-compatible with ZK, about 1.5-2x faster with half the memory. It can be embedded in the ClickHouse binary, simplifying ops. Most post-2023 deployments default to Keeper.

Q7. Why do ClickHouse and DuckDB differ in use case despite sharing columnar vectorized design?

A. Deployment model. ClickHouse is a server: networked, distributed, replicated, TB-to-PB scale, tens of thousands of QPS. DuckDB is embedded — a library running in-process (SQLite-style) on a single node with file-based .duckdb storage. DuckDB fits notebook analysis and ETL scripts; ClickHouse fits Kafka-fed real-time dashboards at scale. Different problems, not better/worse.


If this was useful, check out:

  • "RocksDB and LSM-Tree Deep Dive" — row-oriented LSM contrast.
  • "Columnar Storage: Parquet/ORC/Arrow/Dremel" — column format foundations.
  • "Snowflake Architecture Deep Dive" — another columnar OLAP approach.
  • "Apache Spark Catalyst and Tungsten" — distributed analytics alternative.