✍️ Transcription Mode: ClickHouse Internals Deep Dive — MergeTree, Vectorized Execution, Distributed Queries, Keeper (2025)
TL;DR
- ClickHouse: built by Yandex in 2009 for internal analytics, open-sourced in 2016. Written in C++, the flagship columnar OLAP engine.
- Columnar storage: values of the same column stored contiguously. For a query like SELECT sum(x) FROM t WHERE y > 0, only the referenced columns are read — 10-100x faster.
- MergeTree: ClickHouse's core engine. LSM-like "parts" with background merges, but stored by column rather than by row.
- Sparse Primary Index: one entry per granule (default 8192 rows), not per row. Index fits entirely in memory.
- Vectorized Execution: processes tens of thousands of rows at a time in loops. SIMD-friendly — compilers exploit AVX2/AVX-512 automatically.
- Compression: per-column LZ4/ZSTD by default. Time-series data uses Delta + Gorilla for aggressive compression.
- Distributed Engine: sharding + replication. Query fans out to shards and aggregates.
- Keeper: ZooKeeper replacement. Raft-based, reimplemented in C++. Optimized for ClickHouse workloads.
- Materialized View: results precomputed at insert time. Essential for real-time aggregation.
- 2025 competitors: Druid, Pinot (real-time aggregation), DuckDB (embedded), Snowflake/BigQuery (managed).
1. Why ClickHouse Is Fast
1.1 OLAP vs OLTP
OLTP (Online Transaction Processing):
- Reads/writes few rows (single-row inserts/updates).
- Lookups are O(log n) with an index.
- Postgres, MySQL, Oracle.
OLAP (Online Analytical Processing):
- Scans millions to billions of rows.
- Only a few columns needed.
- Aggregations (sum, count, avg, distinct).
- ClickHouse, Snowflake, BigQuery, Druid.
1.2 Typical OLAP Query
SELECT user_country, SUM(revenue)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_country
ORDER BY SUM(revenue) DESC
LIMIT 10;
This query:
- Scans billions of rows by date range.
- Needs only user_country and revenue (not full rows).
- Returns a small aggregated result.
Why is Postgres slow here? Row storage forces reading all columns even if unused, indexes do not help for large scans, and per-row processing accumulates function-call overhead.
1.3 Three Weapons
- Columnar storage: read only the columns you need.
- Vectorized execution: process tens of thousands of rows at a time with SIMD.
- Heavy compression: save disk I/O bandwidth.
Combined, these make ClickHouse 100-1000x faster than Postgres for OLAP.
2. History
Yandex built ClickHouse for Metrica, their Google Analytics competitor. MySQL collapsed under billions of events per day; a team led by Alexey Milovidov rewrote the engine in C++. Prototype in 2009, production in 2012. Open-sourced in 2016 to explosive adoption: Cloudflare, Uber, Bloomberg, GitLab, Deutsche Bank. ClickHouse Inc. spun out in 2021 with $250M+ in funding and launched ClickHouse Cloud. As of 2025: 35,000+ GitHub stars, thousands of production users, monthly releases.
3. Columnar Storage Basics
3.1 Row vs Column
Row-based (Postgres, MySQL):
Row 1: [id=1, name="Alice", age=30, country="US"]
Row 2: [id=2, name="Bob", age=25, country="UK"]
On disk: 1|Alice|30|US|2|Bob|25|UK|...
Column-based (ClickHouse):
Column "id": [1, 2, 3, ...]
Column "name": ["Alice", "Bob", ...]
Column "age": [30, 25, ...]
Column "country": ["US", "UK", ...]
Each column is a separate file.
3.2 Why Columnar Is Faster for Analytics
I/O reduction: SELECT AVG(age) reads only age.bin. A 100-column table with a 1-column query means 100x less I/O.
Compression: values in one column share type and distribution. A country column of "US", "UK", "CN" compresses 10x+ with LZ4. Mixed-type rows compress poorly.
Cache-friendly: sequential column scans keep L1 hot.
SIMD: same-type contiguous values let AVX process 8 int32 values per 256-bit register.
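The compression point is easy to verify on a live table. A minimal sketch, assuming a table named mydb.events, that reads per-column compressed and uncompressed sizes from system.columns:
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'mydb' AND table = 'events'
ORDER BY data_compressed_bytes DESC;
Low-cardinality string columns typically show the highest ratios, precisely because each file holds one homogeneous value stream.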
3.3 When Row-Based Wins
- Point lookups (WHERE id = 42).
- Update-heavy workloads.
- Small result, many columns returned.
Do not use ClickHouse for OLTP.
4. MergeTree — The Heart of ClickHouse
4.1 Core Idea
Inspired by LSM-trees but redesigned for columnar storage.
Insert -> memory buffer -> disk "part" (columnar files)
                               |
                        background merge
                               |
                        larger sorted part
4.2 Part Structure
A part is a directory:
/var/lib/clickhouse/data/mydb/mytable/
20240101_1_1_0/ # partition_minBlock_maxBlock_level
checksums.txt
columns.txt
count.txt
primary.idx # sparse primary index
user_id.bin # column file
user_id.mrk2 # marks (offsets)
event_time.bin
event_time.mrk2
revenue.bin
revenue.mrk2
Each column has a .bin and .mrk2 pair. Marks hold granule offsets in the .bin file.
4.3 Granules and the Sparse Primary Index
MergeTree's key insight: no per-row index entry.
- Default 8192 rows = one granule.
- Primary index points only to the first row of each granule.
For WHERE user_id = 12345: binary-search the primary index in memory, find the target granule, read it from disk (~8192 rows), scan for the value. The index is 1/8192 the size of a traditional one. A billion-row table's index is a few MB — entirely in RAM.
4.4 Why Sparse Is Fine
OLAP typically aggregates many rows. Reading one granule (8192 rows) maps well to disk I/O page units — several pages per granule. Point lookups are supported but are not ClickHouse's strength.
4.5 ORDER BY Is the Primary Key
CREATE TABLE events (
event_time DateTime,
user_id UInt64,
event_type String,
revenue Decimal(10, 2)
) ENGINE = MergeTree
ORDER BY (event_time, user_id)
PARTITION BY toYYYYMM(event_time);
Data is physically sorted by (event_time, user_id); the primary index is a sparse sample of that ordering.
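Granule pruning can be observed directly: EXPLAIN with indexes = 1 reports how many parts and granules survive primary-index analysis. A sketch against the events table above; the literals are placeholders.
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE event_time >= '2024-06-01' AND user_id = 12345;
-- the PrimaryKey section of the output shows selected parts and granules out of the totals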
4.6 Partitioning
PARTITION BY toYYYYMM(event_time) places monthly data in separate directories. Benefits: partition pruning, near-instant DROP PARTITION, partition-unit backup/replication. Avoid too many partitions — monthly or weekly is typical.
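A sketch of the near-instant drop; the partition value 202401 (January 2024 under toYYYYMM) is just an example.
ALTER TABLE events DROP PARTITION 202401;  -- removes that month's directories, no row-by-row delete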
5. Write Path
5.1 INSERT Journey
- Input batched.
- Per-column vectors built in memory.
- Columns sorted by ORDER BY.
- Each column written to a separate compressed file.
- Part directory created, manifest updated.
A new part is created; old parts continue to coexist.
5.2 Background Merges
A background thread periodically merges small parts into larger ones (same partition only). Merges happen online; old parts are dropped when done.
5.3 Why Merging Matters
Too many parts (default limit 300 per partition) slows queries and triggers the "Too many parts" error. UPDATE/DELETE mutations are actually applied during merges.
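The merge-driven nature of mutations is visible from system.mutations; a sketch with a hypothetical delete.
ALTER TABLE events DELETE WHERE user_id = 42;  -- returns immediately; parts are rewritten in the background

SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'events' AND NOT is_done;        -- parts still waiting to be rewritten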
5.4 Insert Optimization
Bad: one INSERT per row — each creates a new part. Good: batched INSERT of thousands to tens of thousands of rows. For streaming use Kafka table engine or an external ingester that batches.
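When client-side batching is impractical, server-side batching with async_insert is a common fallback; a sketch against the events table from section 4.5 (the values are placeholders).
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 42, 'click', 0.00);  -- the server buffers many small inserts and flushes them as one part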
6. Read Path and Vectorized Execution
6.1 Query Stages
- Parse and plan.
- Partition pruning.
- Granule selection via primary index.
- Read only required columns.
- Vectorized execution.
- Aggregate (hash table for GROUP BY).
- Sort and return.
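These stages can be inspected for a concrete query: EXPLAIN PIPELINE prints the processor graph ClickHouse plans to run. A sketch reusing the query from section 1.2.
EXPLAIN PIPELINE
SELECT user_country, sum(revenue)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_country;
-- the output lists the chain of processors (source reads, filtering, aggregating transforms)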
6.2 Vectorized Execution
Traditional DBs use the Volcano model (one row per next() call) — heavy function-call overhead and branch mispredictions.
ClickHouse processes blocks (~65536 rows):
while (block = scan.next_block()) {    // 65536 rows
    auto mask = filter_block(block);   // SIMD
    aggregate_block(block, mask);      // SIMD
}
Compilers auto-vectorize these tight loops into AVX2.
6.3 SIMD Example
// Naive
int64_t sum = 0;
for (int i = 0; i < n; i++) sum += revenue[i];
// AVX2 intrinsics (assumes int64_t revenue[] and n a multiple of 4)
__m256i sum_vec = _mm256_setzero_si256();
for (int i = 0; i < n; i += 4) {                           // 4 x int64 per 256-bit register
    __m256i v = _mm256_loadu_si256((__m256i*)&revenue[i]); // unaligned load of 4 values
    sum_vec = _mm256_add_epi64(sum_vec, v);                // lane-wise accumulate
}
// finish by horizontally summing the 4 lanes of sum_vec (and any leftover elements)
4-8x speedup measured. ClickHouse mostly relies on the compiler — code is written in a vectorization-friendly style.
6.4 Runtime CPU Dispatch
ClickHouse detects CPU features (SSE2, SSE4.2, AVX2, AVX-512) at runtime and picks the best implementation via CPUID.
7. Data Skipping Indexes
Primary index only helps on ORDER BY columns. For others:
7.1 Bloom Filter
ALTER TABLE events
ADD INDEX idx_user_email user_email TYPE bloom_filter GRANULARITY 4;
Skip granules when the bloom says "definitely not here".
7.2 MinMax
ADD INDEX idx_revenue revenue TYPE minmax GRANULARITY 1;
Cheap to store and very effective when the column's values correlate with the table's sort order; a good first choice among skip indexes.
7.3 Set
ADD INDEX idx_country user_country TYPE set(100) GRANULARITY 4;
Good for low-cardinality columns.
7.4 N-gram
ADD INDEX idx_log_msg log_message TYPE ngrambf_v1(4, 1024, 3, 0) GRANULARITY 1;
Accelerates LIKE '%error%'-style queries on log strings.
7.5 Pitfalls
Granule-level, not row-level. Overhead on writes and part metadata. Choose carefully.
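Two practical checks, sketched for the indexes above: ALTER ... ADD INDEX only covers newly written parts until you materialize it, and system.data_skipping_indices lists what a table currently has.
ALTER TABLE events MATERIALIZE INDEX idx_user_email;  -- build the index for existing parts

SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE table = 'events';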
8. Compression
Per-column codecs, independently chosen.
8.1 Base Codecs
- LZ4: default, fast, moderate ratio.
- ZSTD: better ratio, slightly slower; levels 1-22. Good for cold data.
CREATE TABLE events (
event_time DateTime CODEC(DoubleDelta, ZSTD(3)),
user_id UInt64 CODEC(T64, LZ4)
)
8.2 Delta / DoubleDelta
Delta stores differences between consecutive values: raw [1000, 1001, 1003, 1005, 1010, 1012] becomes [1000, 1, 2, 2, 5, 2]. DoubleDelta (delta-of-delta) reduces regular timestamps to near-zero values — often 10x+ compression.
8.3 Gorilla
Facebook's time-series trick: XOR successive floats; zero-heavy results compress well. 5-10x on metric columns.
8.4 T64
Bit-packs integers into the minimum bit width used.
8.5 Codec Chains
event_time DateTime CODEC(DoubleDelta, LZ4)
Transform first, then general compression.
8.6 Realistic Ratios
Raw CSV 100 GB -> row DB ~80 GB (LZ4) -> ClickHouse 10-15 GB. Columnar plus specialized codecs gives 7-10x vs raw.
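Your own ratio is easy to measure, since system.parts tracks compressed and uncompressed bytes; a sketch aggregated per table.
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS on_disk,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(data_compressed_bytes) DESC;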
9. MergeTree Family
9.1 ReplacingMergeTree
Keeps only the latest row per key (by version column) during merges.
CREATE TABLE users (
user_id UInt64,
name String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;
Use FINAL to force deduplication at query time (slower).
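Two common read patterns against the users table above, sketched here: FINAL deduplicates at read time, while argMax picks the latest value per key without FINAL.
SELECT * FROM users FINAL WHERE user_id = 42;     -- merge-on-read: always correct, slower

SELECT user_id, argMax(name, updated_at) AS name  -- latest name per user, no FINAL needed
FROM users
GROUP BY user_id;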
9.2 SummingMergeTree
Sums numeric columns for rows sharing the ORDER BY key.
CREATE TABLE daily_stats (
date Date,
user_id UInt64,
pageviews UInt64,
clicks UInt64
) ENGINE = SummingMergeTree()
ORDER BY (date, user_id);
Always query with explicit GROUP BY + sum() for correctness before merges complete.
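A sketch of that pattern for the daily_stats table above.
SELECT date, user_id, sum(pageviews) AS pageviews, sum(clicks) AS clicks
FROM daily_stats
GROUP BY date, user_id;  -- correct even while unmerged parts still contain duplicate keys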
9.3 AggregatingMergeTree
Stores intermediate states of arbitrary aggregate functions.
CREATE TABLE hourly_stats (
hour DateTime,
country String,
unique_users AggregateFunction(uniq, UInt64),
avg_revenue AggregateFunction(avg, Decimal(10, 2))
) ENGINE = AggregatingMergeTree()
ORDER BY (hour, country);
Merges combine states — pairs with Materialized Views.
9.4 CollapsingMergeTree
Uses a sign Int8 column (+1/-1) to cancel prior state — useful for replicating MySQL binlogs.
9.5 VersionedCollapsingMergeTree
CollapsingMergeTree with explicit version for out-of-order merges.
9.6 GraphiteMergeTree
Graphite-compatible time-series metrics with TTL-based downsampling.
10. Materialized Views
MVs behave like insert triggers.
10.1 Basic
CREATE MATERIALIZED VIEW hourly_pageviews_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, page)
AS
SELECT
toStartOfHour(event_time) AS hour,
page,
count() AS views
FROM events
GROUP BY hour, page;
Each insert into events triggers the SELECT; results are inserted into the MV. Queries run against the much smaller MV — 100x+ faster.
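Because the MV target here is itself a SummingMergeTree, reads should still wrap the counter in sum(); a sketch.
SELECT hour, page, sum(views) AS views
FROM hourly_pageviews_mv
WHERE hour >= now() - INTERVAL 1 DAY
GROUP BY hour, page
ORDER BY views DESC;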
10.2 AggregateFunction MV
CREATE MATERIALIZED VIEW user_metrics_mv
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id)
AS
SELECT
toDate(event_time) AS date,
user_id,
uniqState(session_id) AS unique_sessions,
sumState(revenue) AS total_revenue
FROM events
GROUP BY date, user_id;
Query with uniqMerge / sumMerge to combine states — supports distinct counts and other complex aggregates.
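A sketch of the read side for the MV above.
SELECT
    date,
    user_id,
    uniqMerge(unique_sessions) AS unique_sessions,
    sumMerge(total_revenue) AS total_revenue
FROM user_metrics_mv
GROUP BY date, user_id;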
10.3 Fan-out
One insert can drive many MVs (hourly, daily, country, user). Pick the right MV per query.
10.4 Caveats
Inserts are not atomic across the source table and its MVs: a failure partway can leave them inconsistent. MVs see only data at insert time. Heavy JOINs inside MVs can be slow. TTLs don't propagate from source to MV.
11. Projections
Evolution of MVs (ClickHouse 21+). Stores alternate sort orders (and optionally pre-aggregations) inside the parts of the same table.
ALTER TABLE events
ADD PROJECTION p_country
(
SELECT event_time, user_country, revenue
ORDER BY (user_country, event_time)
);
ALTER TABLE events MATERIALIZE PROJECTION p_country;
ClickHouse picks the projection automatically when the query fits.
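One hedged way to confirm a projection is actually chosen: the force_optimize_projection setting makes the query fail if no projection can be used (the filter value is a placeholder).
SELECT sum(revenue)
FROM events
WHERE user_country = 'US'
SETTINGS force_optimize_projection = 1;  -- throws if the query cannot be served from a projection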
| Aspect | Materialized View | Projection |
|---|---|---|
| Storage | Separate table | Inside source table |
| Usage | Explicit name | Auto-selected |
| Deletion/TTL | Independent | Follows source |
| Complex queries | Yes | Limited |
12. Distributed Engine and Sharding
ClickHouse scales horizontally by sharding.
12.1 Concept
- Local table: each shard stores its own data.
- Distributed table: routes queries across shards.
12.2 Cluster Configuration
<clickhouse>
<remote_servers>
<my_cluster>
<shard>
<replica>
<host>shard1-r1</host>
<port>9000</port>
</replica>
<replica>
<host>shard1-r2</host>
<port>9000</port>
</replica>
</shard>
<shard>
<replica>
<host>shard2-r1</host>
</replica>
</shard>
</my_cluster>
</remote_servers>
</clickhouse>
12.3 Distributed Table
CREATE TABLE events_distributed ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, default, events, rand());
12.4 Query Execution
Coordinator fans out the query, each shard aggregates locally, then the coordinator merges partial results. Map-reduce style with minimal network overhead.
12.5 Sharding Key
Good: high-cardinality, evenly distributed, used in filters (enabling local queries). Bad: country (skewed), timestamp (hot shard for latest data).
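A sketch of a hash-based key as an alternative to the rand() example in 12.3; the table name is hypothetical. cityHash64 of the user ID keeps shards balanced and co-locates each user's rows on one shard.
CREATE TABLE events_dist_by_user ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, default, events, cityHash64(user_id));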
12.6 Replication
ReplicatedMergeTree replicates within a shard via Keeper/ZooKeeper:
CREATE TABLE events (...) ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events',
'{replica}'
)
ORDER BY ...;
13. Keeper — ZooKeeper Replacement
ZooKeeper (Java) is memory-heavy with GC pauses and isn't optimal for ClickHouse's access patterns. ClickHouse Keeper (2021) is a C++ reimplementation with Raft consensus, ZK wire-protocol compatible, 1.5-2x faster, half the memory. Can be embedded in the ClickHouse binary or run standalone. Most new deployments use Keeper; ZK is legacy.
<clickhouse>
<keeper_server>
<tcp_port>9181</tcp_port>
<server_id>1</server_id>
<raft_configuration>
<server><id>1</id><hostname>keeper1</hostname><port>9234</port></server>
<server><id>2</id><hostname>keeper2</hostname><port>9234</port></server>
<server><id>3</id><hostname>keeper3</hostname><port>9234</port></server>
</raft_configuration>
</keeper_server>
</clickhouse>
14. Practical Tuning
- Insert batching: 10K-100K rows. Use Buffer tables, the Kafka table engine, or async_insert.
- ORDER BY design: low cardinality first, then higher. Three to five columns is typical.
- Part management: monitor system.parts; default limit 300 per partition. Increase background_pool_size if needed.
- Memory limits: set max_memory_usage; enable max_bytes_before_external_group_by to spill to disk.
- Observability: system.parts, system.query_log, system.metrics, system.events, system.asynchronous_metrics.
SELECT query, query_duration_ms, memory_usage
FROM system.query_log
WHERE event_date = today() AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;
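A companion sketch for the part-count limit: active parts per partition, highest first.
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;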
15. ClickHouse vs Alternatives
15.1 DuckDB
Embedded columnar DB (SQLite-style). Single process, no distribution. Great for local analysis up to TB scale. Same vectorized philosophy, different deployment.
15.2 Druid / Pinot
Real-time ingestion and dashboards. Druid "segments" resemble parts; Druid requires rollup definition upfront (less flexible). ClickHouse has richer SQL.
15.3 Snowflake / BigQuery
Fully managed. Higher cost, zero ops. ClickHouse is cheaper and often faster, but requires an ops team.
15.4 StarRocks / Doris
China-led alternatives with MySQL-compatible frontends and strong join performance. Smaller ecosystem.
16. Learning Path
- Start with the official docs and the NYC Taxi tutorial.
- Practice ORDER BY and partitioning design; follow Altinity Knowledge Base.
- Explore system.* tables; watch Alexey Milovidov's "ClickHouse under the hood" talks.
- Design distributed queries, run Keeper, build Materialized View patterns.
17. Cheat Sheet
Columnar storage:
- One file per column
- Read only needed columns
- Better compression
- SIMD-friendly
MergeTree:
- Part = directory of column .bin + .mrk2
- Granule = 8192 rows
- Sparse primary index
- Background merges
- Partition pruning
Family:
MergeTree (base), Replacing (dedup), Summing (sum),
Aggregating (generic), Collapsing (state cancel), Replicated
Vectorization:
- Block-wise (~65536 rows)
- Auto SIMD (AVX2/AVX-512)
- Runtime CPU dispatch
Indexes:
Primary: sparse, granule starts
Skipping: minmax / bloom / set / ngrambf
Compression:
LZ4 (default), ZSTD (cold)
Delta, DoubleDelta (timestamps)
Gorilla (float), T64 (bit packing)
Materialized View:
- Insert trigger
- Pre-aggregation
- AggregateFunction state
Distributed:
- Shard x Replica
- Distributed engine = query router
- ReplicatedMergeTree + Keeper
Projection:
Alternate sort inside same table, auto-selected
vs Competitors:
DuckDB (embedded), Druid/Pinot (real-time),
Snowflake/BQ (managed)
18. Quiz
Q1. Why does columnar storage make OLAP queries fast?
A. Three reasons: (1) I/O reduction — only needed columns are read, so a 100-column table with a 1-column query reads ~100x less. (2) Compression — uniform type/distribution in each column (e.g., country repeating "US"/"UK") compresses 10x+ with LZ4. (3) SIMD — contiguous same-type values load directly into AVX registers for 8x int32 parallelism, plus cache-line efficiency. Combined: 100-1000x faster than Postgres for analytics.
Q2. What is the sparse primary index and why is it acceptable?
A. One index entry per granule (default 8192 rows), not per row. A billion-row table has only a few MB of index — fully RAM-resident. Point lookups require a full-granule scan, but OLAP aggregates many rows, so granule-level reads match disk page I/O well. Opposite philosophy to OLTP dense indexes.
Q3. Why is vectorized execution faster than the Volcano model?
A. Eliminates per-row function-call overhead and enables SIMD. Volcano calls next() per row — tens of ns each, devastating over billions of rows, plus branch mispredictions. Vectorized runs blocks of ~65536 rows, so one call amortizes over thousands of values, and contiguous same-type data lets compilers emit AVX2/AVX-512 automatically. 10-100x measured speedup.
Q4. Why place low-cardinality columns first in ORDER BY?
A. Data is physically sorted by ORDER BY. Putting low-cardinality columns like country first creates long contiguous runs per value, so WHERE country = 'US' skips huge granule ranges. Putting high-cardinality columns first scatters values across granules, defeating pruning. Rule: frequently filtered low-cardinality columns first, time or high-cardinality range columns after.
Q5. SummingMergeTree vs AggregatingMergeTree?
A. SummingMergeTree only sums numeric columns per key — simple and fast, but limited. AggregatingMergeTree stores intermediate states (e.g., AggregateFunction(uniq, ...)) and merges them; combined with -State/-Merge suffixes it supports distinct counts, quantiles, and other complex aggregates. Use Summing for counters, Aggregating for richer analytics (typically via Materialized Views).
Q6. Why did ClickHouse Keeper replace ZooKeeper?
A. ZooKeeper's Java runtime is memory-heavy with GC pauses and not optimized for ClickHouse's high-frequency metadata patterns. Keeper is a C++ rewrite with Raft, wire-compatible with ZK, about 1.5-2x faster with half the memory. It can be embedded in the ClickHouse binary, simplifying ops. Most post-2023 deployments default to Keeper.
Q7. Why do ClickHouse and DuckDB differ in use case despite sharing columnar vectorized design?
A. Deployment model. ClickHouse is a server: networked, distributed, replicated, TB-to-PB scale, tens of thousands of QPS. DuckDB is embedded — a library running in-process (SQLite-style) on a single node with file-based .duckdb storage. DuckDB fits notebook analysis and ETL scripts; ClickHouse fits Kafka-fed real-time dashboards at scale. Different problems, not better/worse.
If this was useful, check out:
- "RocksDB and LSM-Tree Deep Dive" — row-oriented LSM contrast.
- "Columnar Storage: Parquet/ORC/Arrow/Dremel" — column format foundations.
- "Snowflake Architecture Deep Dive" — another columnar OLAP approach.
- "Apache Spark Catalyst and Tungsten" — distributed analytics alternative.