ClickHouse MergeTree — The Complete Guide to Columnar Storage, Sparse Indexes, Parts, and Materialized Views (2025)

Prologue — How the Throne of OLAP Changed Hands

Through the early 2020s, "large-scale analytics" had a predictable menu: cloud warehouses like Redshift, BigQuery, and Snowflake, or self-managed engines like Druid, Presto/Trino, and Impala. Each had its trade-offs — cost, latency, query dialect, operational complexity.

Then ClickHouse arrived.

Yandex (Russia's Google) had been building it since 2009 for its internal web analytics product, Yandex.Metrica, and open-sourced it in 2016. As of 2026, ClickHouse is the de facto standard for open-source OLAP. Uber, Cloudflare, Spotify, and eBay use it. A new generation of observability tools (SigNoz, PostHog) is built on top of it. ClickHouse, Inc. (the company behind ClickHouse Cloud) is a unicorn.

Why did it win? In one word: speed. A GROUP BY over a billion-row table comes back in roughly a second. Work that takes Postgres minutes on the same hardware finishes in under a second here. That 10×–1000× gap is not the result of a single optimization — it is the cumulative effect of deliberate choices at every layer: storage, indexing, and the execution engine.

This post dissects the internals of ClickHouse, especially the MergeTree engine family, across 1,400 lines: the byte layout of columnar storage, the structure of parts and granules, how the sparse primary index differs from a B-tree, how materialized views produce real-time aggregates, how distributed tables and replication work, and how the vectorized execution engine keeps the CPU cache humming.

This post is a sequel to the LSM-Tree post. MergeTree is the OLAP cousin of LSM, and many of its design decisions reuse LSM principles. At the same time, it stands on the opposite side of the Postgres post — evidence that OLTP and OLAP are too different to share the word "database" comfortably.


1. Columnar Storage — Why Store by Column

1.1 Row Store vs. Column Store

Traditional databases store data row by row. All the column values for a single row sit contiguously on disk.

Row layout:
(id=1, name="Alice", age=30, city="Seoul")
(id=2, name="Bob",   age=25, city="Tokyo")
(id=3, name="Carol", age=35, city="Seoul")

On disk:

1, Alice, 30, Seoul | 2, Bob, 25, Tokyo | 3, Carol, 35, Seoul

This layout is optimal for OLTP. Reading or writing all columns of one row in a single operation is fast.

Columnar flips it completely:

id:   [1, 2, 3]
name: [Alice, Bob, Carol]
age:  [30, 25, 35]
city: [Seoul, Tokyo, Seoul]

Each column lives in its own file/region.

1.2 Why It Wins for OLAP

Consider the query SELECT city, AVG(age) FROM users GROUP BY city.

  • Row store: reads every column of every row. With 10 equally sized columns, that is 5× the I/O of reading just the two columns this query needs.
  • Column store: reads only the city and age column files. I/O drops by 80%.

More benefits:

  1. Compression ratio: values of the same type lie next to each other, which compresses dramatically well. Delta encoding for integer columns, dictionary encoding for strings. Typically 5–20× compression.
  2. CPU cache: processing values of the same type one batch at a time benefits CPU pipelining and SIMD.
  3. Selective scan: reading fewer columns means scanning less data.

Downsides:

  1. Row reconstruction cost: to see all values of a single row, you must read the same index from multiple column files.
  2. Writes are expensive: a single INSERT touches many files. So OLAP prefers batch INSERTs.
  3. UPDATE/DELETE is very expensive: ClickHouse handles them as asynchronous mutations (ALTER TABLE ... UPDATE/DELETE) and essentially gives up on live row-level updates.
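
Both sides are easy to see on the toy table above. A minimal sketch, assuming a users table with exactly those four columns:

-- Columnar favors batched writes: one multi-row INSERT creates one part
-- and touches each column file once, instead of once per row.
INSERT INTO users (id, name, age, city) VALUES
  (1, 'Alice', 30, 'Seoul'),
  (2, 'Bob',   25, 'Tokyo'),
  (3, 'Carol', 35, 'Seoul');

-- The aggregation from 1.2 reads only the city and age column files.
SELECT city, avg(age) FROM users GROUP BY city;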

1.3 ClickHouse's Column Files

Each column of a MergeTree table lives on disk as two files:

column_name.bin    # actual data (compressed)
column_name.mrk2   # marks file — granule boundary offsets

The .bin file is compressed in blocks (LZ4 by default, ZSTD optionally). Compressed block boundaries are governed by min_compress_block_size (default 64KB) and max_compress_block_size (default 1MB).

.mrk2 records "at what offset in the .bin file each granule begins." We will unpack granules next.


2. Parts and Granules — The Units of MergeTree

2.1 Part — MergeTree's SSTable

A part in MergeTree plays the same role as an SSTable in LSM. An immutable, sorted chunk of data.

An INSERT creates a new part. One INSERT INTO t VALUES (...) becomes one part. Background merges combine multiple parts into larger parts.

Disk layout:

/var/lib/clickhouse/data/<db>/<table>/
├── 202604_1_1_0/                     # part name
│   ├── checksums.txt
│   ├── columns.txt                   # column list
│   ├── count.txt                     # row count
│   ├── primary.idx                   # sparse primary index
│   ├── minmax_created_at.idx         # min/max of the partition key column for this part
│   ├── partition.dat
│   ├── id.bin, id.mrk2
│   ├── name.bin, name.mrk2
│   ├── age.bin, age.mrk2
│   └── ...
├── 202604_2_2_0/
└── 202604_1_2_1/                     # merge result

What the name 202604_1_2_1 means:

  • 202604: partition id (here a month, from PARTITION BY toYYYYMM).
  • 1_2: min_block_number = 1, max_block_number = 2 (the range of insert blocks this part covers).
  • _1: merge level, i.e. how many merges produced this part (0 = freshly inserted).

2.2 Granule — The Smallest Indexing Unit

A granule is the smallest unit that the index points to inside a part. The default is 8192 rows per granule.

Why granules? A B-tree can hold one index entry per row. But for a billion rows that index would take gigabytes. ClickHouse uses a sparse index: one index entry per 8192 rows. That keeps the primary index of a billion-row table in the low MBs.

granule 0: rows 0     - 8191
granule 1: rows 8192  - 16383
granule 2: rows 16384 - 24575
...

The primary key of the first row of each granule goes into the primary.idx file:

primary.idx:
granule 0: (2026-04-01 00:00:00)
granule 1: (2026-04-01 00:15:00)
granule 2: (2026-04-01 00:32:00)
...

2.3 Sparse Primary Index — The Decisive Difference from a B-tree

How does WHERE created_at = '2026-04-01 00:20:00' get processed?

  1. Binary-search the in-memory primary.idx.
  2. The value falls between 00:15:00 < value < 00:32:00 → it might be in granule 1.
  3. Look up the .bin offset for granule 1 in the marks file created_at.mrk2.
  4. Decompress the corresponding block of the .bin file.
  5. Linear-scan within the granule (up to 8192 rows).

In other words, we do not know exactly where a value is, but we know which granule it could be in. The rest is a scan.
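
How much the sparse index actually prunes can be inspected with EXPLAIN. A sketch against this section's running example (the counts in the comment are illustrative):

EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE created_at = '2026-04-01 00:20:00';

-- The PrimaryKey section of the output reports how many parts and granules
-- survived pruning, e.g. "Parts: 1/3" and "Granules: 1/122070".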

Advantages:

  • The index is tiny → entirely loadable into memory.
  • Index maintenance cost is low → writes are fast.
  • Range queries benefit — you just read a contiguous run of granules.

Drawback:

  • A point lookup still scans at least one full granule (up to 8192 rows) → unsuitable for OLTP.

2.4 Tuning index_granularity

Default is 8192. Depending on workload:

  • Very selective queries (finding rare rows most of the time) → lower to 1024 or 2048.
  • Large scans are the norm → raise to 16384 or 32768. The index shrinks further.

CREATE TABLE events (
  event_time DateTime,
  user_id UInt64,
  event_type LowCardinality(String),
  data String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
SETTINGS index_granularity = 8192;

There is also index_granularity_bytes (default 10MB), the adaptive-granularity setting: if a fixed-row-count granule grows too large in bytes, the granule is cut at the byte limit instead.


3. ORDER BY — The Heart of MergeTree

3.1 ORDER BY Is a Clustered Index

ORDER BY (event_time, user_id) means the data is physically sorted in that order, similar to MySQL's clustered index. Unless you declare PRIMARY KEY separately, those columns also become the primary key.

This single choice largely determines query performance. The more your query's WHERE clause matches the ORDER BY columns from the left, the more dramatic the speedup.

3.2 Good vs. Bad ORDER BY

Suppose your service frequently runs queries like WHERE user_id = 123 AND event_time > '2026-04-01'.

Bad: ORDER BY (event_time, user_id)

  • Sorted by event_time → the user_id filter forces a full scan.

Good: ORDER BY (user_id, event_time)

  • Sorted by user_id → only the contiguous block for that user_id is read.
  • Within that block event_time is sorted, giving another fast range scan.

Rule of thumb: lower-cardinality columns first. Put things like country code or event type first, and time columns later.
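
A minimal sketch of the "good" layout for this access pattern (table and column names are illustrative):

CREATE TABLE user_events (
  user_id UInt64,
  event_time DateTime,
  payload String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Reads only the contiguous granules for user 123, then a sorted
-- range scan on event_time inside that block.
SELECT count()
FROM user_events
WHERE user_id = 123 AND event_time > '2026-04-01';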

3.3 PARTITION BY — Physical Splitting

PARTITION BY toYYYYMM(event_time) separates parts by month. Key points:

  • Parts from different partitions are never merged together.
  • Drop an old partition wholesale with ALTER TABLE t DROP PARTITION '202501'.
  • Queries with WHERE event_time > ... skip unrelated partitions (partition pruning).

Choose the partition key poorly and you ruin performance:

  • Too fine (e.g., hourly) → part count explodes → merge storm.
  • Too coarse (e.g., yearly) → impossible to drop old data gracefully.

Monthly is the sweet spot. Daily is fine too but demands part-count management.

3.4 Separating Primary Key and ORDER BY

Unlike Postgres, ClickHouse's primary key is not a unique constraint. It is purely the basis for the sparse index.

ENGINE = MergeTree
PRIMARY KEY (country, event_type)
ORDER BY (country, event_type, event_time)

Separating them like this means:

  • The index covers only (country, event_type) → small.
  • Actual sort order extends to (country, event_type, event_time) → range queries by event_time are still fast.

4. Data Skipping Index — Additional Indexes

4.1 Why We Need Them

The primary index is efficient only for ORDER BY columns. Filtering by other columns forces a full-granule scan. Data skipping indexes cover that gap.

4.2 Index Types

ALTER TABLE events ADD INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 4;

GRANULARITY 4 means "one skip-index entry per 4 granules." That is, one bloom filter per 32768-row block.

Supported types:

  • minmax: min/max value per block. Good for range queries, ideal for numbers/dates.
  • set(N): stores up to N unique values. Good for equality queries on low-cardinality columns.
  • bloom_filter: a bloom filter. Good for equality queries on high-cardinality columns.
  • ngrambf_v1: n-gram bloom filter. Good for string substring matching (LIKE '%foo%').
  • tokenbf_v1: token-level bloom filter. Good for word-level search.

4.3 A Real Example

User-ID-based queries are frequent, but putting user_id at the front of ORDER BY is hard:

ALTER TABLE events ADD INDEX idx_user_bf user_id TYPE bloom_filter(0.01) GRANULARITY 4;

Now WHERE user_id = 12345 can skip most granule groups via the bloom filter.

Caveat: skipping indexes only pay off when the indexed values have some locality in the sort order. If events for the same user_id are scattered evenly across the table, the bloom filter reports a possible match for almost every granule group and the index barely helps. They complement the primary index: the standard strategy is to add a skipping index for a column that could not go at the front of ORDER BY, then verify it actually skips granules.


5. Compression — The Secret Behind 5–20×

5.1 Block-Level Compression

The .bin file is compressed in blocks. The default block size is min_compress_block_size = 65536 (64KB). A block can include multiple granules.

LZ4 is the default — fast (several GB/s decompression) with a reasonable ratio. For disk-space-sensitive deployments, switch to ZSTD via per-column codecs (below) or the server-wide compression config. Compression block sizes are also tunable per table:

CREATE TABLE t (...)
ENGINE = MergeTree
ORDER BY id
SETTINGS
  min_compress_block_size = 65536,
  max_compress_block_size = 1048576;

Or per-column codecs:

CREATE TABLE t (
  id UInt64 CODEC(Delta, ZSTD),           -- optimal for incremental values
  created_at DateTime CODEC(DoubleDelta, ZSTD),  -- optimal for time series
  value Float64 CODEC(Gorilla, LZ4),      -- optimal for time-series floats (Facebook Gorilla)
  category LowCardinality(String),        -- low-cardinality string
  big_text String CODEC(ZSTD(3))          -- long text
)
ENGINE = MergeTree
ORDER BY id;

5.2 Specialized Codecs

  • Delta: x_n - x_{n-1}. Perfect for monotonically increasing integers.
  • DoubleDelta: delta of deltas. Optimal for time-series timestamps. With regular intervals, nearly zero bytes.
  • Gorilla: developed by Facebook for time-series floats. A few bits per value when changes are small.
  • T64: transposes a block of 64 integer values as a bit matrix and crops the unused high bits.
  • LZ4, ZSTD: general purpose. ZSTD is slower than LZ4 but ~30% denser.

For time-series tables (metrics, logs), the right codec combination alone commonly yields 20–50× compression.
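
Whether a codec is paying off can be checked per column in system.columns. A sketch (database and table names are assumptions):

SELECT
  name,
  formatReadableSize(data_compressed_bytes) AS compressed,
  formatReadableSize(data_uncompressed_bytes) AS uncompressed,
  round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'default' AND table = 'events' AND data_compressed_bytes > 0
ORDER BY ratio DESC;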

5.3 LowCardinality — Dictionary Encoding

event_type LowCardinality(String)

ClickHouse builds the dictionary automatically. Under the hood values are stored as UInt8/UInt16 and restored through a dictionary table. Ideal for string columns with 10K or fewer distinct values. Roughly 10× smaller + 5–10× faster queries.

Status enums, category names — wrap nearly every low-cardinality string in LowCardinality by default.


6. Merge — Parts Keep Merging

6.1 Why Merge

Same reason as LSM. Every INSERT creates a new part → part count explodes → every read scans every part. So a background thread periodically merges small parts into bigger ones.

A merge is a k-way merge: many sorted inputs into one sorted output. O(N log k) — because inputs are already sorted.

6.2 Merge Scheduling

ClickHouse caps part counts:

  • parts_to_throw_insert = 300 by default. Over 300 parts per partition and INSERTs error out.
  • parts_to_delay_insert = 150 by default. Over 150 and INSERTs slow down.

If you hit these limits, INSERTs are too frequent or merges cannot keep up.
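
One way to check is the system.merges table, which lists in-flight merges. A sketch:

-- Long-running merges, or an active part count that keeps growing,
-- mean merging is falling behind the INSERT rate.
SELECT table, elapsed, progress, num_parts, result_part_name
FROM system.merges;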

Fixes:

  • Batch INSERTs. Thousands of INSERTs per second are a disaster. Instead, push ~100K rows every few seconds.
  • Use the Buffer engine or AsyncInsert (covered below).

6.3 Sizing Merges

max_bytes_to_merge_at_max_space_in_pool = 150GB   # max single merge size
max_bytes_to_merge_at_min_space_in_pool = 1MB
background_pool_size = 16                          # number of merge threads

Raising background_pool_size increases merge throughput. But it steals CPU from queries, so around half the CPU cores is usually sensible.

6.4 OPTIMIZE — Force a Merge

OPTIMIZE TABLE events PARTITION '202604' FINAL;

Merges every part in that partition into one. Used after batch jobs to maximize scan performance. Expensive — use carefully in production.

The FINAL keyword forces the rewrite even if the partition is already a single part. For engines like ReplacingMergeTree it also performs deduplication.


7. The MergeTree Variants — The Engine Family

7.1 ReplacingMergeTree

At merge time, only the latest row survives among those with the same primary key.

CREATE TABLE user_state (
  user_id UInt64,
  name String,
  version UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY user_id;

Instead of UPDATE, you INSERT a new state. The row with the larger version wins.

Caveat: duplicates linger until a merge runs. Adding FINAL to queries resolves duplicates at query time (expensive). A common workaround is a query like SELECT argMax(name, version) FROM user_state GROUP BY user_id.
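
Spelled out as a full query (the GROUP BY key has to be selected as well):

-- Latest state per user without FINAL: for each user_id, take the name
-- that belongs to the maximum version.
SELECT
  user_id,
  argMax(name, version) AS name
FROM user_state
GROUP BY user_id;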

7.2 SummingMergeTree

At merge time, numeric columns are summed across rows sharing the same primary key.

CREATE TABLE metrics_hourly (
  metric_name String,
  hour DateTime,
  value UInt64,
  count UInt64
)
ENGINE = SummingMergeTree((value, count))
ORDER BY (metric_name, hour);

Insert incremental values and merges sum them automatically. Great for pre-aggregation tables.

7.3 AggregatingMergeTree

The most powerful one. Stores the intermediate state of custom aggregate functions.

CREATE TABLE metrics_agg (
  metric_name String,
  hour DateTime,
  unique_users AggregateFunction(uniq, UInt64),
  p99_latency AggregateFunction(quantile(0.99), Float64)
)
ENGINE = AggregatingMergeTree
ORDER BY (metric_name, hour);

INSERT INTO metrics_agg SELECT
  metric_name,
  toStartOfHour(event_time),
  uniqState(user_id),
  quantileState(0.99)(latency)
FROM events
GROUP BY metric_name, toStartOfHour(event_time);

SELECT
  metric_name,
  hour,
  uniqMerge(unique_users),
  quantileMerge(0.99)(p99_latency)
FROM metrics_agg
GROUP BY metric_name, hour;

The "-State" / "-Merge" function pair is the trick. State stores sketches like HyperLogLog, Merge combines the sketches into the final value.

It lets a GROUP BY over a multi-TB source table answer in seconds.

7.4 CollapsingMergeTree / VersionedCollapsingMergeTree

A sign column toggles "insert" and "delete". On merge, opposites cancel each other.

CREATE TABLE user_state (
  user_id UInt64,
  name String,
  sign Int8  -- +1 = insert, -1 = cancel
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY user_id;

INSERT INTO user_state VALUES (1, 'Alice', 1);
-- Alice changes: cancel the old row then add the new one
INSERT INTO user_state VALUES (1, 'Alice', -1);
INSERT INTO user_state VALUES (1, 'Alice_v2', 1);

After a merge the matched sign=-1 and sign=1 rows cancel. Useful for state-machine storage, but it is hard to use correctly.

7.5 ReplicatedMergeTree — Replication

Prefix any of the engines above with Replicated and it becomes replicated. ZooKeeper (or ClickHouse Keeper) coordinates metadata.

CREATE TABLE events_local (
  event_time DateTime,
  user_id UInt64,
  ...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id);

The shard/replica placeholders in the path are substituted from config. Replicas in the same shard sync at the part level — the replica that received the INSERT produces a part, records it in ZooKeeper, and the others pull it.


8. Distributed Table — Sharding

8.1 Structure

ClickHouse is shared-nothing. Each shard is an independent node holding part of the data. A Distributed virtual table sits on top.

CREATE TABLE events ON CLUSTER my_cluster (...)
ENGINE = ReplicatedMergeTree(...)
ORDER BY (...);

CREATE TABLE events_distributed ON CLUSTER my_cluster
AS events
ENGINE = Distributed(my_cluster, default, events, rand());

Distributed(cluster, db, table, sharding_key):

  • cluster: the cluster name defined in config.
  • sharding_key: decides which shard an INSERT targets.

8.2 Query Distribution

SELECT ... FROM events_distributed WHERE ... goes like this:

  1. The coordinator node fans out the query to every shard.
  2. Each shard produces partial results from its local parts.
  3. The coordinator gathers the partial results and aggregates the final answer.

GROUP BY is automatically two-phase: partial aggregate per shard, final aggregate on the coordinator.

8.3 Sharding Key Design

rand(): even distribution. Hard to JOIN.

cityHash64(user_id): the same user_id lands on the same shard. Ideal for per-user JOIN/GROUP BY. Risk: skew.

Custom: per your business domain. For example, tenant_id % N.

Principle: keep joins and aggregations inside a single shard. Cross-shard joins are lethal.
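
A sketch of the co-locating variant, reusing the cluster and local table from 8.1 (the distributed table name is illustrative):

-- All rows for a given user_id land on the same shard, so per-user
-- GROUP BY and JOIN never have to cross shards.
CREATE TABLE events_by_user ON CLUSTER my_cluster
AS events
ENGINE = Distributed(my_cluster, default, events, cityHash64(user_id));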

8.4 Replication + Sharding

A canonical setup: 3 shards × 2 replicas = 6 nodes.

shard1:  node1 (replica1), node2 (replica2)
shard2:  node3 (replica1), node4 (replica2)
shard3:  node5 (replica1), node6 (replica2)

Data is 3-way sharded, and each shard is 2-way replicated.


9. ClickHouse Keeper — Goodbye ZooKeeper

9.1 The Weight of ZooKeeper

For a long time ClickHouse used ZooKeeper to coordinate metadata — replication, DDL, etc. But ZK is JVM-based, meaning high operational complexity, and performance degrades with a very large number of znodes.

9.2 Enter Keeper

In 2021 ClickHouse Keeper arrived. Wire-protocol compatible with ZooKeeper. Written in C++, embeddable inside the ClickHouse server or runnable standalone.

Advantages:

  • SSD-friendly (Raft-based snapshots).
  • No JVM dependency.
  • Handles multi-GB of metadata smoothly.

Keeper is a real-world application of Raft. Under the hood it is built on the NuRaft library.


10. Materialized View — Real-Time Aggregation

10.1 Concept

A materialized view in ClickHouse is essentially an INSERT trigger. When you INSERT into the source table, aggregated rows are automatically INSERTed into the target table.

CREATE MATERIALIZED VIEW events_hourly
ENGINE = SummingMergeTree
ORDER BY (hour, event_type)
AS SELECT
  toStartOfHour(event_time) AS hour,
  event_type,
  count() AS cnt,
  sum(value) AS sum_value
FROM events
GROUP BY hour, event_type;

Now every time you INSERT into events:

  1. The raw INSERT is stored as-is.
  2. A partial aggregate by hour/type is INSERTed into events_hourly.
  3. SummingMergeTree sums them in the background.

Query directly from events_hourly:

SELECT hour, event_type, sum(cnt), sum(sum_value)
FROM events_hourly
WHERE hour >= now() - INTERVAL 7 DAY
GROUP BY hour, event_type;

Hourly aggregates on a billion-row events table answer in seconds.

10.2 Caveats

  • A materialized view runs only at INSERT time. Historical data must be backfilled with a one-off INSERT ... SELECT, or by creating the view with POPULATE (which cannot be combined with TO).
  • INSERTs into the source table and the MV's target are not atomic: if the MV's INSERT fails, source and target can diverge.
  • Too many MVs slow INSERTs down.
  • An MV is not a "view" — it is an actual table. It consumes disk and produces its own parts.

10.3 The TO Clause

CREATE TABLE events_hourly (...) ENGINE = SummingMergeTree ORDER BY (...);

CREATE MATERIALIZED VIEW events_hourly_mv TO events_hourly AS
SELECT ... FROM events;

The TO clause points at an existing table. More explicit and recommended.
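
With the TO form, backfilling history is an ordinary INSERT ... SELECT into the target table. A sketch, assuming the target has the hour/event_type/cnt/sum_value columns from 10.1 and an illustrative cutoff date:

-- One-off backfill of pre-existing rows; new rows keep arriving through
-- the materialized view trigger.
INSERT INTO events_hourly
SELECT
  toStartOfHour(event_time) AS hour,
  event_type,
  count() AS cnt,
  sum(value) AS sum_value
FROM events
WHERE event_time < '2026-04-01 00:00:00'
GROUP BY hour, event_type;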

10.4 Multi-Stage Aggregation

-- Stage 1: hourly
CREATE MATERIALIZED VIEW events_hourly TO events_hourly_dest AS
SELECT toStartOfHour(event_time) AS hour, ... FROM events GROUP BY hour, ...;

-- Stage 2: daily (on top of hourly)
CREATE MATERIALIZED VIEW events_daily TO events_daily_dest AS
SELECT toDate(hour) AS day, ... FROM events_hourly_dest GROUP BY day, ...;

A cascading aggregation. Very effective for BI dashboards.


11. Projection — Multiple Indexes on the Same Table

11.1 The Problem

A MergeTree has only one ORDER BY. But what if you want both WHERE user_id = ... and WHERE event_time BETWEEN ... to be fast?

Historically the answer was two MVs: the same data with a different ORDER BY. Storage doubles.

11.2 Projection

A projection stores the same data with a different sort order and/or aggregation. It lives inside the same part.

ALTER TABLE events ADD PROJECTION user_proj (
  SELECT * ORDER BY user_id
);

ALTER TABLE events MATERIALIZE PROJECTION user_proj;

The optimizer now decides on the fly: "this is a WHERE user_id ... query, use user_proj."

Storage grows (1.5–2×) but operations stay single-table.

11.3 Aggregating Projection

Works similarly to an AggregatingMergeTree MV:

ALTER TABLE events ADD PROJECTION hourly_agg (
  SELECT
    toStartOfHour(event_time) AS hour,
    event_type,
    count(),
    sum(value)
  GROUP BY hour, event_type
);

12. Vectorized Execution Engine — The Real Secret of Speed

12.1 Row-at-a-Time vs. Vectorized

A traditional DB engine (the Volcano model) processes one row at a time:

while (row = scan->next()) {
  if (filter(row)) {
    project(row);
    aggregate(row);
  }
}

Each operator is a function call. Branch mispredictions, CPU cache misses.

Vectorized execution processes one batch at a time (tens of thousands of rows; 65536 in this sketch):

while (batch = scan->next_batch(65536)) {
  mask = filter(batch);              // SIMD filter
  projected = project(batch, mask);  // transform the whole column
  aggregate_batch(projected);        // SIMD accumulation
}

  • Function-call overhead amortizes (one call per 65536-row batch instead of one per row).
  • Leverages SIMD.
  • CPU-cache friendly (a column batch stays in L1/L2).

DuckDB uses a similar model — covered in the DuckDB post.

12.2 ClickHouse's Block

The internal unit in ClickHouse is the Block — a bundle of columns. Default max_block_size = 65505. A scan step reads one block and passes it to the next.

Block[0] = { col1: [val, val, ...], col2: [val, val, ...], ... }  // 65505 rows
Block[1] = ...

Operator tree: ReadFromMergeTree → Filter → Projection → Aggregator → .... Each stage consumes/produces blocks.

12.3 JIT Compilation

ClickHouse uses LLVM for runtime JIT compilation. Example:

SELECT (a + b) * c - d FROM t;

Running this expression per row via an interpreter is slow. JIT converts it into native code in one shot:

; vectorized assembly
vpaddd  ymm0, ymm1, ymm2        ; a + b
vpmulld ymm0, ymm0, ymm3        ; * c
vpsubd  ymm0, ymm0, ymm4        ; - d

10–100× faster. Enabled by compile_expressions = 1 (default).

12.4 Parallelism

How a single query uses CPU cores:

  1. Part parallelism: parts the query needs to read are distributed across cores.
  2. Pipeline parallelism: the operator pipeline runs across multiple threads.
  3. Shard parallelism: with a Distributed table, work spreads across nodes.

The max_threads parameter caps threads. Default is the number of cores.
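
The cap can also be set per query, which helps keep heavy ad-hoc work from starving concurrent traffic. A sketch:

-- Only this query is limited to 8 threads; other queries keep the default.
SELECT event_type, count()
FROM events
GROUP BY event_type
SETTINGS max_threads = 8;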


13. AsyncInsert — Salvation for Small INSERTs

13.1 The Problem

ClickHouse prefers batch INSERTs. But Kafka consumers and event streams emit many small INSERTs. Each INSERT becomes a new part → merge storm → collapse.

13.2 The Fix

INSERT INTO events SETTINGS async_insert=1 VALUES ...;

The server buffers multiple INSERTs in memory. When a condition is met, it creates a part in one shot:

  • async_insert_max_data_size (default 1MB)
  • async_insert_busy_timeout_ms (default 200ms)

With wait_for_async_insert=1 the client waits until the buffer is actually flushed to disk (durability guaranteed). With 0 the server acknowledges as soon as the data is buffered (faster, but anything still in the buffer is lost on a crash).
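
Putting the settings together for a producer that emits many small writes, a sketch (the values match the events columns from section 2.4):

-- Buffered server-side and flushed as one part when the size or timeout
-- threshold is hit; with wait_for_async_insert = 1 the client blocks
-- until that flush happens.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2026-04-01 00:00:00', 123, 'click', '{}');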

13.3 Buffer Engine (Legacy)

CREATE TABLE events_buffer AS events
ENGINE = Buffer('default', 'events', 16, 10, 60, 10000, 1000000, 10485760, 1073741824);

The Buffer engine plays a similar role: an in-memory buffer with periodic flushes to the target table. Buffered data is lost if the server crashes. AsyncInsert is the recommended approach today.


14. Running It in Production

14.1 Good Schema Checklist

CREATE TABLE events (
  event_time DateTime CODEC(DoubleDelta, ZSTD),
  user_id UInt64,
  event_type LowCardinality(String),     -- low cardinality
  country LowCardinality(String),
  device LowCardinality(String),
  url String CODEC(ZSTD(3)),
  duration_ms UInt32 CODEC(T64, LZ4),
  session_id UUID,
  data String CODEC(ZSTD(3))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)        -- monthly
ORDER BY (country, event_type, event_time, user_id)  -- low cardinality first
SETTINGS index_granularity = 8192;

ALTER TABLE events ADD INDEX idx_user user_id TYPE bloom_filter GRANULARITY 4;

  • DoubleDelta codec on DateTime.
  • Low-cardinality strings wrapped in LowCardinality.
  • Low-cardinality columns first in ORDER BY.
  • Skipping index for columns that did not fit at the front.

14.2 Monitoring Queries

Part count check:

SELECT database, table, count() AS parts, sum(rows) AS rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY parts DESC;

Thousands of parts is a warning sign.

Compression ratio:

SELECT
  database, table,
  formatReadableSize(sum(bytes_on_disk)) AS disk,
  formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
  sum(data_uncompressed_bytes) / sum(bytes_on_disk) AS ratio
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY ratio;

A ratio under 5 warrants codec/schema review.

Slow queries:

SELECT query_duration_ms, query, user, memory_usage
FROM system.query_log
WHERE event_date >= today() AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;

14.3 Anti-Patterns

  • Frequent UPDATE/DELETE: in ClickHouse an UPDATE or DELETE is an asynchronous mutation (an ALTER that rewrites whole parts), with seconds-to-minutes of delay. Avoid it on hot paths; see the sketch after this list.
  • Small INSERTs: thousands of INSERTs per second → disaster. Batch or use AsyncInsert.
  • Too many columns: tables with hundreds of columns explode write cost. Consider wide → narrow table splits.
  • High-cardinality column at the front of ORDER BY: the primary index becomes useless.
  • Partitions too fine: hourly partitions are almost always a disaster. Never go finer than daily.
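
What the first anti-pattern looks like in practice, as a sketch (the predicate is illustrative):

-- A DELETE is a mutation: it is queued and rewrites the affected parts
-- in the background instead of deleting rows in place.
ALTER TABLE events DELETE WHERE user_id = 42;

-- Track progress; parts_to_do shrinks toward 0 as parts are rewritten.
SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'events' AND NOT is_done;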

14.4 Migration Strategy

Moving OLAP queries out of an existing Postgres/MySQL OLTP into ClickHouse:

  1. CDC with Debezium + Kafka. Capture changes from the source OLTP.
  2. ClickHouse Kafka engine consumes.
  3. Materialized view loads the Kafka table into a MergeTree table.

This pipeline keeps OLAP analytics within sub-minute latency of OLTP.
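
A minimal sketch of steps 2 and 3, trimmed to the four columns from section 2.4 (broker address, topic, consumer group, and message format are assumptions):

-- Step 2: a Kafka-engine table consumes the CDC topic.
CREATE TABLE events_kafka (
  event_time DateTime,
  user_id UInt64,
  event_type String,
  data String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'cdc_events',
         kafka_group_name = 'clickhouse-cdc',
         kafka_format = 'JSONEachRow';

-- Step 3: a materialized view moves every consumed batch into MergeTree.
CREATE MATERIALIZED VIEW events_kafka_mv TO events AS
SELECT event_time, user_id, event_type, data
FROM events_kafka;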


15. The Limits of ClickHouse — Not the Answer to Everything

15.1 Workloads That Fit Poorly

  • OLTP: point updates, single-row lookups. Possible but slow.
  • Join-heavy: one or two joins in a star schema are fine, but 6–10-way joins hurt.
  • Ad-hoc exploration: Trino/Presto have the edge (querying Parquet/Iceberg directly).
  • Real-time lookup of individual events: individual record lookup requires a granule scan → slow.

15.2 The Competitive Landscape

  • Snowflake / BigQuery: fully managed, almost all SQL. Expensive. ClickHouse is 10–100× cheaper.
  • Druid / Pinot: real-time ingest + query. Closest competitors. More operationally complex.
  • DuckDB: embedded, single-node, small-to-medium data. A different role.
  • StarRocks / Doris: MPP databases from China. Claim better join performance than ClickHouse.

15.3 The Current Value Proposition

  • Open source.
  • Terabyte-scale on a single node, petabyte-scale on a cluster.
  • Decent SQL compatibility.
  • Scans tens of billions of rows per second.
  • Very low operational cost.

That is why it became the standard for observability (logs/metrics/traces), web analytics, real-time dashboards, CDN log analysis, and more.


Closing — A Billion Rows in One Second

The ClickHouse secret in one line: "columnar storage + sparse index + vectorized execution". The three amplify each other into a 10–1000× speedup.

Three things an operator should check reflexively:

  1. Is ORDER BY designed well? This one line dictates performance. Low-cardinality columns first.
  2. Are small INSERTs blocked? Stabilize with AsyncInsert, batching, or the Kafka engine.
  3. Are you watching the part count? Thousands of parts means merge cannot keep up. Investigate.

Get those three right and ClickHouse feels almost magical.

Zooming out, the last three posts have covered the storage-engine spectrum:

  • Postgres: row storage + B-tree + MVCC. The king of OLTP.
  • LSM / RocksDB: row storage + LSM + KV. The write-heavy distributed systems engine.
  • ClickHouse: columnar storage + MergeTree + vectorized execution. The analytics champion.

Three systems share the word "database" but are entirely different organisms. Understanding each one's genome is the first step to picking the right tool.

The next post covers the internals of TLS 1.3 and QUIC. We will unwind the hidden dance behind every HTTPS request at the byte level. Cryptography, networking, and performance all converge on one topic.