Split View: ClickHouse Internals Deep Dive — MergeTree, 벡터화 실행, 분산 쿼리, Keeper 완전 정복 (2025)

ClickHouse Internals Deep Dive — MergeTree, 벡터화 실행, 분산 쿼리, Keeper 완전 정복 (2025)

TL;DR

ClickHouse는 Yandex가 2009년 내부 분석용으로 만들어 2016년 오픈소스화. C++로 작성, 컬럼나 OLAP 엔진의 대표.
컬럼나 저장: 같은 컬럼의 값들을 연속 저장. 분석 쿼리가 SELECT sum(x) WHERE y > 0 같은 패턴이면 해당 컬럼만 읽으면 된다 → 10-100x 빠름.
MergeTree: ClickHouse의 핵심 엔진. LSM처럼 "part"를 쓰고 백그라운드 병합. 하지만 로우가 아닌 컬럼별로 저장.
Sparse Primary Index: 모든 행이 아닌 granule(기본 8192행)마다 한 번. 인덱스가 작아 완전히 메모리에 상주.
벡터화 실행: 한 번에 수만 행을 루프로 처리. SIMD 친화적. 컴파일러가 AVX2/AVX-512 자동 활용.
압축: 컬럼별 LZ4/ZSTD 기본. 시계열은 Delta + Gorilla로 더 공격적 압축.
Distributed Engine: 샤딩 + 복제. 쿼리를 shard에 뿌리고 집계.
Keeper: ZooKeeper 대체. Raft 기반, C++로 재구현. ClickHouse 전용 최적화.
Materialized View: 삽입 시점에 미리 계산된 결과 저장. 실시간 집계에 필수.
2025 경쟁자: Druid, Pinot (실시간 집계), DuckDB (임베디드), Snowflake/BigQuery (매니지드).

1. ClickHouse는 왜 빠른가

1.1 OLAP vs OLTP

먼저 두 워크로드의 차이를 이해해야 한다.

OLTP (Online Transaction Processing):

적은 행을 읽고 쓴다 (한 행 insert, 한 행 update).
인덱스가 있으면 O(log n).
Postgres, MySQL, Oracle.

OLAP (Online Analytical Processing):

수백만~수십억 행 스캔.
적은 컬럼만 필요.
집계 (sum, count, avg, distinct).
ClickHouse, Snowflake, BigQuery, Druid.

1.2 전형적 OLAP 쿼리

SELECT user_country, SUM(revenue)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_country
ORDER BY SUM(revenue) DESC
LIMIT 10;

이 쿼리는:

날짜 범위로 수십억 행 스캔.
user_country와 revenue 두 컬럼만 필요 (row 전체 X).
집계 후 작은 결과 반환.

이 패턴에서 Postgres는 왜 느린가?

로우 기반 저장 → 모든 컬럼을 같이 읽어야 함 (사용 안 해도).
인덱스도 도움 안 됨 (전체 스캔이 아니라 수백 M 행).
행마다 따로 처리 (함수 호출 오버헤드 누적).

ClickHouse는 이 세 가지 모두 해결한다.

1.3 세 가지 무기

컬럼나 저장: 필요한 컬럼만 디스크에서 읽음.
벡터화 실행: 한 번에 수만 행을 SIMD로 처리.
고압축: 디스크 I/O 대역폭 절약.

이 셋이 합쳐져 Postgres 대비 100-1000x 빠른 OLAP 쿼리.

2. 역사

2.1 Yandex Metrica (2009)

Yandex는 Google Analytics의 경쟁 서비스 Metrica를 운영. 매일 수십억 이벤트를 저장하고 분석해야 했다.

초기엔 MySQL에 저장 → 터졌다. 그 다음 MyISAM 기반 커스텀 솔루션 → 부족. 다양한 접근 실패 후 Alexey Milovidov가 이끄는 팀이 C++로 새 엔진을 처음부터 작성.

목표:

매일 수십억 이벤트 ingest.
100ms 안에 수억 행 집계.
저비용 (수백 대 서버로 운영).

결과: ClickHouse. 2009년 프로토타입, 2012년 프로덕션.

2.2 오픈소스 공개 (2016)

Yandex는 2016년 GitHub에 ClickHouse를 공개했다. 반응은 폭발적:

CloudFlare: 전 세계 트래픽 분석 ("한 달에 수조 이벤트").
Uber: 운영 메트릭.
Bloomberg: 시장 데이터.
GitLab: 내부 분석.
Deutsche Bank: 금융 데이터.

2021년 **ClickHouse Inc.**가 설립됐다. Yandex에서 spin-off. $250M+ 시리즈 펀딩. ClickHouse Cloud(매니지드) 출시.

2.3 2025 현황

GitHub ⭐: 35,000+ (오픈소스 DB 중 최상위).
사용 기업: 수천 개.
주요 릴리스: 거의 매달.
컨트리뷰터: 전 세계 1,000+.

3. 컬럼나 저장의 기초

3.1 로우 vs 컬럼

같은 데이터를 두 가지 방식으로 저장:

로우 기반 (Postgres, MySQL):

Row 1: [id=1, name="Alice", age=30, country="US"]
Row 2: [id=2, name="Bob",   age=25, country="UK"]
Row 3: [id=3, name="Carol", age=35, country="US"]

디스크 상:

1|Alice|30|US|2|Bob|25|UK|3|Carol|35|US

컬럼 기반 (ClickHouse):

Column "id":      [1, 2, 3, 4, ...]
Column "name":    ["Alice", "Bob", "Carol", ...]
Column "age":     [30, 25, 35, ...]
Column "country": ["US", "UK", "US", ...]

각 컬럼이 별도 파일.

3.2 왜 컬럼나가 분석에 빠른가

이유 1: I/O 감소.

SELECT AVG(age) FROM users 쿼리:

Row: 모든 컬럼 읽어야 → id, name, country 다 읽음.
Column: age 컬럼만 읽음. 다른 컬럼 파일은 건드리지 않음.

테이블에 100개 컬럼이 있고 쿼리가 1개만 쓴다면? 100배 적은 I/O.

이유 2: 압축 효율.

같은 컬럼의 값은 비슷한 타입, 비슷한 분포 → 압축률 훨씬 좋다. 예를 들어 country 컬럼은 대부분 "US", "UK", "CN" 같은 몇 개 값의 반복 → LZ4로 10배 이상 압축 가능.

로우 기반은 서로 다른 타입이 섞여 있어서 압축이 어렵다.

이유 3: 캐시 친화적.

CPU가 한 번 로드한 캐시 라인을 효율적으로 사용. age 컬럼만 순회하면 그 데이터가 L1 캐시에 잘 들어감.

이유 4: SIMD 최적화.

같은 타입의 값이 연속 저장 → sum, min, max 같은 연산을 SIMD로 병렬 처리 가능. int32 8개를 AVX 256-bit 레지스터 하나에 로드.

3.3 언제 로우 기반이 낫나

Point lookup: WHERE id = 42. 한 행의 모든 컬럼 필요.
Update-heavy: 컬럼 기반은 한 행 업데이트가 여러 파일 건드림.
Small result + many columns: 10 행 SELECT *.

OLTP는 여전히 로우 기반이 맞다. ClickHouse는 OLTP에 쓰지 말 것.

4. MergeTree — ClickHouse의 심장

4.1 기본 아이디어

MergeTree는 LSM-tree에서 영감을 받았지만 컬럼나에 맞게 재설계.

Insert → 메모리 버퍼 → 디스크에 "part" 생성 (작은 컬럼나 파일들)
                                ↓
                       백그라운드에서 부분들을 병합
                                ↓
                      큰 정렬된 part

"Merge"는 엔진 이름에서 나온 것 — part들을 병합한다는 의미.

4.2 Part 구조

Part는 디렉토리:

/var/lib/clickhouse/data/mydb/mytable/
├── 20240101_1_1_0/              # part name: partition_minBlock_maxBlock_level
│   ├── checksums.txt
│   ├── columns.txt              # 컬럼 메타
│   ├── count.txt                # 행 수
│   ├── primary.idx              # sparse primary index
│   ├── user_id.bin              # 컬럼 파일
│   ├── user_id.mrk2             # 마크 (offset)
│   ├── event_time.bin
│   ├── event_time.mrk2
│   ├── revenue.bin
│   ├── revenue.mrk2
│   └── ...
├── 20240101_2_2_0/
└── 20240102_1_1_0/

각 컬럼이 .bin과 .mrk2(marks) 파일 쌍. Marks는 각 granule의 .bin 파일 offset.

4.3 Granule과 Sparse Primary Index

MergeTree의 정수: 모든 행에 인덱스 엔트리를 두지 않는다.

기본 8192행이 하나의 granule.
Primary index는 각 granule의 첫 행만 가리킴.

Row index:  0   8192  16384  24576  32768 ...
Primary idx: k0   k1    k2     k3     k4  ...

WHERE user_id = 12345 쿼리:

Primary index(메모리)에서 binary search → "granule 5" 해당.
Granule 5의 .bin offset을 .mrk2에서 조회.
Granule 5(8192행)를 디스크에서 읽기.
그 안에서 12345 찾기.

인덱스 크기가 전통적 인덱스의 1/8192. 10억 행 테이블의 인덱스가 수 MB. 메모리에 완전히 상주.

4.4 왜 sparse가 괜찮은가

OLTP는 한 행을 정확히 찾아야 한다 → dense index 필수.

OLAP는 보통 여러 행을 집계한다 → granule 하나(8192행) 읽는 것이 오히려 효율적. 디스크 I/O가 4-32 KB 페이지 단위라서 한 granule이 여러 페이지.

"정확히 1 행만 필요"한 경우는 OLAP의 주요 패턴 아님. ClickHouse는 point lookup을 지원하지만 강점이 아니다.

4.5 ORDER BY가 인덱스

ClickHouse MergeTree는 ORDER BY가 primary key다.

CREATE TABLE events (
    event_time DateTime,
    user_id UInt64,
    event_type String,
    revenue Decimal(10, 2)
) ENGINE = MergeTree
ORDER BY (event_time, user_id)
PARTITION BY toYYYYMM(event_time);

데이터가 (event_time, user_id) 순서로 물리 저장.
Primary index는 이 정렬 순서의 sparse 샘플.
WHERE event_time > ... 쿼리가 최적.

별도 PRIMARY KEY 절도 쓸 수 있지만 거의 ORDER BY와 같게 사용.

4.6 Partitioning

PARTITION BY toYYYYMM(event_time)

이 선언으로 월별로 part가 별도 디렉토리. 장점:

WHERE event_time >= '2024-06-01': 2024-06 파티션만 스캔. Partition pruning.
오래된 파티션 삭제: ALTER TABLE DROP PARTITION '202301'. Truncate 수준 빠름.
백업, 복제도 파티션 단위.

주의: 너무 많은 파티션은 part 수 폭증 → 성능 저하. 월 또는 주 단위 권장.

5. Write Path 상세

5.1 INSERT의 여정

INSERT INTO events VALUES (...), (...), (...);

Step 1: ClickHouse가 입력을 batch로 받음. 단일 insert도 내부적으로 batch.

Step 2: 메모리에 각 컬럼별 벡터 구축:

user_id: [1, 2, 3]
event_time: [t1, t2, t3]
...

Step 3: 컬럼 정렬 (ORDER BY 기준).

Step 4: 각 컬럼을 별도 파일로 write, 압축 적용.

Step 5: Part 디렉토리 생성, manifest 업데이트.

결과: 새 part 하나 생성. 기존 part와 병행 존재.

5.2 Background Merge

ClickHouse는 백그라운드 스레드가 주기적으로 작은 part들을 큰 part로 병합.

[Part A: 1000 rows]
[Part B: 1200 rows]  → [Merged Part: 2200 rows]
[Part C:  950 rows]  → [Merged Part:  950 rows]

같은 파티션 안에서만.
병합 중에도 읽기/쓰기 가능.
완료 후 옛 part 삭제.

5.3 왜 Merge가 필수인가

Part 수 제한: 기본 수천 part 이상이면 쿼리 속도 저하. Too many parts 에러.
Sparse index 효율: part 수 많으면 각 part마다 index 조회 → 오버헤드 증가.
Mutation 적용: UPDATE/DELETE가 merge 중 실제 적용됨.

5.4 Insert 최적화

BAD:

for row in data:
    INSERT INTO events VALUES (row)

한 행씩 insert하면 매번 새 part 생성 → part 수 폭발 → merge 따라잡지 못함 → "Too many parts".

GOOD:

INSERT INTO events VALUES (row1), (row2), ..., (row10000)

한 번에 수천~수만 행. 하나의 part로 합쳐짐.

Kafka / 스트리밍: 수집해서 batch로 insert. ClickHouse Kafka table engine 또는 외부 ingester.

6. Read Path와 벡터화 실행

6.1 쿼리 단계

SELECT user_country, SUM(revenue)
FROM events
WHERE event_time > '2024-01-01'
GROUP BY user_country

Step 1: Parse & Plan SQL → AST → 쿼리 plan.

Step 2: Part 선택 Partition pruning: event_time > '2024-01-01'에 해당하는 파티션만.

Step 3: Granule 선택 (각 part에서) Primary index로 해당 granule 찾기.

Step 4: 컬럼 읽기 필요한 컬럼(event_time, user_country, revenue)만.

Step 5: 벡터화 실행 한 번에 수만 행씩 처리.

Step 6: 집계 GROUP BY 결과를 해시 테이블로.

Step 7: 정렬 & 반환

6.2 벡터화 실행 상세

전통적 DB는 한 행씩 처리 (Volcano model):

while (row = scan.next()) {
    if (filter(row)) {
        aggregator.add(row);
    }
}

매 행마다 함수 호출 오버헤드. 현대 CPU는 이런 패턴에서 분기 예측 실패와 pipeline stall이 많다.

ClickHouse는 블록 단위 처리 (vectorized):

while (block = scan.next_block()) {  // 65536 rows
    auto mask = filter_block(block);  // SIMD
    aggregate_block(block, mask);     // SIMD
}

한 block = ~65536 행. 컴파일러가 루프를 SIMD로 자동 최적화. AVX2로 한 번에 8개 int32 처리.

6.3 SIMD 실제 예

SUM(revenue) 구현:

// Naive
int64_t sum = 0;
for (int i = 0; i < n; i++) sum += revenue[i];

// AVX2 (C++ intrinsics)
__m256i sum_vec = _mm256_setzero_si256();
for (int i = 0; i < n; i += 4) {
    __m256i v = _mm256_loadu_si256((__m256i*)&revenue[i]);
    sum_vec = _mm256_add_epi64(sum_vec, v);
}
// 수평 합산...

실측: SIMD 버전이 4-8배 빠름.

ClickHouse는 대부분 컴파일러에 의존. 코드 작성 시 벡터화 친화적으로 작성 → GCC/Clang이 자동 SIMD화.

6.4 런타임 CPU Dispatch

x86_64에도 CPU 종류가 다양하다. 각자 지원하는 SIMD 다름:

SSE2: 모든 64-bit x86.
SSE4.2: 2008+.
AVX2: 2013+.
AVX-512: 2016+ (고급 서버).

ClickHouse는 런타임에 CPU 감지하고 최적 구현 선택. 바이너리 안에 여러 버전 포함, CPUID로 분기.

7. Data Skipping Index

Primary index는 ORDER BY 컬럼에만 유용. 다른 컬럼에 대한 필터는?

7.1 Bloom Filter Index

ALTER TABLE events
ADD INDEX idx_user_email user_email TYPE bloom_filter GRANULARITY 4;

매 4 granule마다 bloom filter 저장.
WHERE user_email = 'x@y.com' 쿼리 시 bloom으로 "이 블록에 있을 가능성이 있는가"를 먼저 확인.
없으면 granule 건너뛰기.

RocksDB의 bloom filter와 같은 원리.

7.2 MinMax Index

ADD INDEX idx_revenue revenue TYPE minmax GRANULARITY 1;

각 granule의 min/max 값 저장.
WHERE revenue > 1000 쿼리: granule의 max가 1000보다 작으면 건너뛰기.

매우 효과적이며 저렴. 기본 권장.

7.3 Set Index

ADD INDEX idx_country user_country TYPE set(100) GRANULARITY 4;

각 granule에 distinct 값 집합(최대 100개) 저장.
WHERE user_country IN ('US', 'UK'): 해당 값이 있는 granule만.

카디널리티가 낮은 컬럼에 좋다.

7.4 N-gram Index

ADD INDEX idx_log_msg log_message TYPE ngrambf_v1(4, 1024, 3, 0) GRANULARITY 1;

로그 메시지 같은 문자열 필터 용. 4-gram을 bloom에 저장. LIKE '%error%' 쿼리 가속.

7.5 Skipping Index의 함정

Granule 단위: 개별 행 제외 불가. 실제로는 한 granule이 수천 행이라 "대부분의 granule 제외"만 효율.
메모리 오버헤드: 많이 만들면 part 메타데이터 크기 증가.
쓰기 비용: insert/merge마다 갱신.

"만능 해결사 아님 — 신중히 선택".

8. 압축

ClickHouse의 압축은 컬럼별 독립. 각 컬럼에 다른 코덱 적용 가능.

8.1 기본 코덱

LZ4 (기본): 빠른 압축/해제, 중간 압축률. 일반적 선택.

ZSTD: 더 나은 압축률, 약간 느림. 레벨 1-22 설정 가능. 콜드 데이터에.

CREATE TABLE events (
    event_time DateTime CODEC(DoubleDelta, ZSTD(3)),
    user_id UInt64 CODEC(T64, LZ4),
    ...
)

8.2 Delta 인코딩

연속된 값의 차이만 저장.

원본: [1000, 1001, 1003, 1005, 1010, 1012]
Delta: [1000, 1, 2, 2, 5, 2]

ID, 타임스탬프 같은 단조 증가 값에 효과적. 더 작은 값 → 더 잘 압축.

8.3 DoubleDelta

Delta의 Delta. 시간 간격이 비슷한 데이터(1초 간격 등)에서 대부분 값이 0.

원본: [t, t+1, t+2, t+3, t+4]
Delta: [t, 1, 1, 1, 1]
DoubleDelta: [t, 1, 0, 0, 0]

타임스탬프 컬럼에 거의 필수. 10배 이상 압축.

8.4 Gorilla

Facebook의 시계열 DB에서 유래. 부동소수점 값을 XOR 기반으로 압축.

현재 값 XOR 이전 값 → 대부분 0
0이 많으면 압축률 ↑

Float64 메트릭(CPU %, latency 등)에 효과적. 5-10배 압축.

8.5 T64

정수 컬럼의 "비트 패킹". 작은 숫자는 더 적은 비트로.

값: [5, 3, 7, 2, 6]  (32-bit 정수지만 실제론 4-bit면 충분)
T64: 각 값을 4비트로 → 8배 압축

8.6 코덱 체인

event_time DateTime CODEC(DoubleDelta, LZ4)

DoubleDelta 먼저 (delta로 변환), 그 다음 LZ4 (추가 압축). 체인으로 더 좋은 비율.

8.7 실제 압축률

YouTube 영상 시청 로그 (약 10개 컬럼):

Raw CSV:       100 GB
Row DB:         80 GB (LZ4)
ClickHouse:     10-15 GB

컬럼나 + 최적 코덱 조합 = Raw 대비 7-10배. 이것이 ClickHouse가 저렴한 이유.

9. MergeTree Family

기본 MergeTree 외에 특수 변형들.

9.1 ReplacingMergeTree

중복 키를 병합 시 제거.

CREATE TABLE users (
    user_id UInt64,
    name String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

INSERT INTO users VALUES (1, 'Alice', now());
INSERT INTO users VALUES (1, 'Alice Updated', now());

같은 user_id의 행이 있으면 최신 updated_at만 유지 (병합 후).

주의: Merge는 비동기라 쿼리 시점에는 중복이 보일 수 있다. FINAL 키워드로 쿼리 시점에 강제 병합:

SELECT * FROM users FINAL WHERE user_id = 1;

FINAL은 느리다. 데이터가 "eventually unique"한 경우에 적합.

9.2 SummingMergeTree

같은 키의 숫자 컬럼을 병합 시 합산.

CREATE TABLE daily_stats (
    date Date,
    user_id UInt64,
    pageviews UInt64,
    clicks UInt64
) ENGINE = SummingMergeTree()
ORDER BY (date, user_id);

INSERT INTO daily_stats VALUES ('2024-01-01', 1, 10, 2);
INSERT INTO daily_stats VALUES ('2024-01-01', 1, 5, 1);

-- 쿼리 (결과):
-- 2024-01-01, 1, 15, 3

사전 집계. 실시간 대시보드에 유용. Merge가 완료되기 전엔 여러 행이 보일 수 있으므로 쿼리 시 GROUP BY + sum() 권장:

SELECT date, user_id, sum(pageviews), sum(clicks)
FROM daily_stats
GROUP BY date, user_id;

9.3 AggregatingMergeTree

더 일반화된 집계. 임의의 aggregate function 저장.

CREATE TABLE hourly_stats (
    hour DateTime,
    country String,
    unique_users AggregateFunction(uniq, UInt64),
    avg_revenue AggregateFunction(avg, Decimal(10, 2))
) ENGINE = AggregatingMergeTree()
ORDER BY (hour, country);

AggregateFunction(uniq, ...) = uniq의 중간 상태를 저장. 병합 시 상태들을 merge.

Materialized View와 같이 쓰인다 (아래).

9.4 CollapsingMergeTree

같은 키의 이전 상태를 "+1 / -1"로 취소.

CREATE TABLE orders (
    order_id UInt64,
    status String,
    sign Int8
) ENGINE = CollapsingMergeTree(sign)
ORDER BY order_id;

INSERT INTO orders VALUES (1, 'pending', 1);  -- 주문 생성
INSERT INTO orders VALUES (1, 'pending', -1); -- 이전 상태 취소
INSERT INTO orders VALUES (1, 'shipped', 1);  -- 새 상태

-- Merge 후: (1, 'shipped', 1)만 남음

변경 이력을 유지하면서 최종 상태를 유지하는 트릭. MySQL의 bin log를 ClickHouse에 복제할 때 유용.

9.5 VersionedCollapsingMergeTree

CollapsingMergeTree의 version 명시 버전. Merge 순서 보장.

9.6 GraphiteMergeTree

시계열 메트릭 (Graphite 호환) 전용. TTL로 해상도 저감.

10. Materialized View

ClickHouse의 MV는 insert trigger와 같다.

10.1 기본

CREATE MATERIALIZED VIEW hourly_pageviews_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, page)
AS
SELECT
    toStartOfHour(event_time) AS hour,
    page,
    count() AS views
FROM events
GROUP BY hour, page;

작동: events에 insert할 때마다, 해당 행들에 SELECT이 실행되어 결과가 hourly_pageviews_mv에 삽입.

쿼리 시 MV를 원본 테이블처럼 사용:

SELECT hour, page, sum(views)
FROM hourly_pageviews_mv
WHERE hour >= '2024-01-01'
GROUP BY hour, page;

원본 10억 행 대신 수백만 행 집계 → 100배 이상 빠름.

10.2 AggregateFunction MV

더 강력한 패턴. AggregatingMergeTree와 결합:

CREATE MATERIALIZED VIEW user_metrics_mv
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id)
AS
SELECT
    toDate(event_time) AS date,
    user_id,
    uniqState(session_id) AS unique_sessions,
    sumState(revenue) AS total_revenue
FROM events
GROUP BY date, user_id;

uniqState는 uniq의 중간 상태. 쿼리 시:

SELECT date, uniqMerge(unique_sessions), sumMerge(total_revenue)
FROM user_metrics_mv
GROUP BY date;

중간 상태들을 merge해서 최종 결과. DISTINCT 집계도 가능.

10.3 Fan-out

하나의 insert가 여러 MV를 트리거:

events (원본)
  ├→ hourly_mv (시간별)
  ├→ daily_mv  (일별)
  ├→ country_mv (국가별)
  └→ user_mv   (사용자별)

각 MV가 서로 다른 dimension으로 사전 집계. 쿼리 시 적절한 MV 선택.

10.4 주의사항

원본 insert 실패 = MV insert도 실패 (원자적).
MV 시점에 원본 row가 사라지면 안 됨 (MV는 insert 시점의 데이터만 봄).
JOIN은 제한적. MV 안에서 큰 조인은 성능 문제.
TTL 관리 독립적. 원본의 TTL이 MV로 전파되지 않음.

11. Projections

Materialized View의 진화. ClickHouse 21+.

11.1 개념

MV와 비슷하지만 같은 테이블 안에 다른 정렬 순서의 데이터를 저장.

ALTER TABLE events
ADD PROJECTION p_country
(
    SELECT event_time, user_country, revenue
    ORDER BY (user_country, event_time)
);

ALTER TABLE events MATERIALIZE PROJECTION p_country;

이제 테이블에 두 정렬본:

원본: ORDER BY event_time.
Projection: ORDER BY user_country.

쿼리가 WHERE user_country = 'US'를 사용하면 ClickHouse가 자동으로 projection 사용.

11.2 MV vs Projection

항목	Materialized View	Projection
저장 위치	별도 테이블	같은 테이블
사용	명시적 이름	자동 선택
삭제/TTL	독립	원본 따라감
복잡 쿼리	가능	제한적

**"자동 선택"**이 projection의 매력. 개발자가 어느 정렬을 써야 할지 몰라도 ClickHouse가 알아서.

12. Distributed Engine과 샤딩

ClickHouse는 샤딩(수평 분할)으로 확장.

12.1 개념

Local table: 각 shard가 자체 데이터 저장.
Distributed table: 쿼리를 여러 shard에 분산.

           ┌─────────────────────┐
           │ Distributed Table   │
           │  events_distributed │
           └──────────┬──────────┘
                      │
      ┌───────────────┼───────────────┐
      ↓               ↓               ↓
  ┌─────────┐    ┌─────────┐    ┌─────────┐
  │ Shard 1 │    │ Shard 2 │    │ Shard 3 │
  │ events  │    │ events  │    │ events  │
  │ (local) │    │ (local) │    │ (local) │
  └─────────┘    └─────────┘    └─────────┘

12.2 Cluster 설정

<clickhouse>
  <remote_servers>
    <my_cluster>
      <shard>
        <replica>
          <host>shard1-r1</host>
          <port>9000</port>
        </replica>
        <replica>
          <host>shard1-r2</host>
          <port>9000</port>
        </replica>
      </shard>
      <shard>
        <replica>
          <host>shard2-r1</host>
        </replica>
      </shard>
    </my_cluster>
  </remote_servers>
</clickhouse>

2 shard × 2 replica = 4 노드.

12.3 Distributed Table

CREATE TABLE events_distributed ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, default, events, rand());

원본 events (local table)를 각 노드에 배치.
events_distributed가 query router 역할.
rand()는 샤딩 키 — 어느 shard로 insert할지.

12.4 쿼리 실행

SELECT user_country, sum(revenue)
FROM events_distributed
GROUP BY user_country;

요청 받은 노드가 쿼리를 모든 shard에 전송.
각 shard가 로컬에서 집계 (부분 결과).
Coordinator 노드가 부분 결과들을 merge하고 최종 반환.

MapReduce와 유사. 단 네트워크 제외하면 매우 효율.

12.5 샤딩 키 선택

좋은 샤딩 키:

고카디널리티 (user_id, session_id).
균등 분포.
쿼리에 자주 사용됨 → 로컬 쿼리 가능.

나쁜 선택:

country → 소수 shard에 몰림.
timestamp → hot shard (최신 데이터에 집중).

12.6 Replication

ReplicatedMergeTree 엔진으로 shard 내 복제:

CREATE TABLE events (...) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',  -- ZooKeeper 경로
    '{replica}'                            -- replica 이름
)
ORDER BY ...;

같은 shard의 replica들이 ZooKeeper(or Keeper)를 통해 동기화.
한 replica가 insert 받으면 나머지가 복제.
자동 fail-over.

13. Keeper — ZooKeeper를 대체

13.1 ZooKeeper의 문제

ClickHouse는 원래 메타데이터 관리에 ZooKeeper를 썼다. Replicated table의 정합성, 분산 DDL 등.

문제:

Java: 메모리 사용량 많음, GC pause.
성능: ClickHouse 사용 패턴에 덜 최적.
운영 복잡도: 별도 스택.

13.2 ClickHouse Keeper

2021년 ClickHouse Keeper 등장. 특징:

C++로 재구현: ClickHouse 팀이 직접.
ZooKeeper 호환: 같은 API, 클라이언트 변경 불필요.
Raft 합의: 직접 구현.
성능: ZK 대비 1.5-2배 빠름.
메모리: 절반 사용.
단일 바이너리: ClickHouse 안에 내장 가능.

13.3 배포

<clickhouse>
  <keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <raft_configuration>
      <server>
        <id>1</id>
        <hostname>keeper1</hostname>
        <port>9234</port>
      </server>
      <server>
        <id>2</id>
        <hostname>keeper2</hostname>
        <port>9234</port>
      </server>
      <server>
        <id>3</id>
        <hostname>keeper3</hostname>
        <port>9234</port>
      </server>
    </raft_configuration>
  </keeper_server>
</clickhouse>

3노드 Raft cluster. ClickHouse와 같은 서버에 배포 가능 (embedded) 또는 독립.

13.4 ZK → Keeper 마이그레이션

기존 ZK 클러스터에서 Keeper로 옮기는 도구 제공. 무중단 이관 가능.

대부분의 새 배포는 Keeper 기본. ZK는 legacy.

14. 실무 튜닝

14.1 Insert 최적화

Batch size: 10,000 - 100,000 행.
Buffer table: 짧은 버퍼 레이어로 소량 insert를 묶기.
Kafka table engine: Kafka에서 자동 consume.
async_insert: 여러 클라이언트의 insert를 서버가 batch.

14.2 ORDER BY 선택

카디널리티 낮은 → 높은 순: (country, user_id, event_time) → (event_time, user_id, country)보다 좋음 (scan 시 prefix 활용).
필터에 자주 쓰는 컬럼 앞.
너무 많이 넣지 말 것: 보통 3-5 컬럼.

14.3 Part 수 관리

-- Part 상태 확인
SELECT count() FROM system.parts WHERE active;

-- 병합 강제 (조심)
OPTIMIZE TABLE events FINAL;

기본 허용: 300 part / partition. 많으면 "Too many parts" 에러.
Background merge 속도: background_pool_size 증가.

14.4 메모리 제한

SET max_memory_usage = 10000000000;  -- 10 GB

쿼리 메모리 제한. 초과 시 실패. GROUP BY with high cardinality가 특히 주의.

SET max_bytes_before_external_group_by = 5000000000;

"메모리 초과 시 디스크로 spill". 느리지만 실패 대신.

14.5 주요 메트릭

system. 테이블들:

system.parts: 모든 part 정보.
system.query_log: 모든 쿼리 기록.
system.metrics: 현재 메트릭.
system.events: 누적 이벤트 (쿼리 수 등).
system.asynchronous_metrics: CPU, 메모리 등.

SELECT
    query,
    elapsed,
    memory_usage
FROM system.query_log
WHERE event_date = today()
ORDER BY elapsed DESC
LIMIT 10;

15. ClickHouse vs 대안

15.1 DuckDB

"임베디드 ClickHouse". 단일 프로세스, SQLite 같은 사용 방식.

항목	ClickHouse	DuckDB
배포	서버	임베디드
분산	Yes	No (단일 노드)
쓰기 성능	매우 높음	보통
읽기 성능	매우 높음	매우 높음
용도	수 TB~PB, 대규모	GB~TB, 로컬 분석

같은 벡터화 실행 철학. 둘 다 우수. 사용 목적이 다름.

15.2 Druid / Pinot

실시간 분석 + 시계열 특화.

ClickHouse: 범용 OLAP, SQL 풍부.
Druid/Pinot: 실시간 ingestion, 대시보드 최적.

Druid는 segment 개념이 part와 비슷. Druid는 집계 정책을 미리 정의(rollup) → 저장 효율 좋지만 유연성 낮음.

15.3 Snowflake / BigQuery

매니지드, 완전 클라우드.

ClickHouse: 자체 호스팅 가능, 훨씬 저렴, 더 빠른 일부 쿼리.
Snowflake/BQ: 관리 없음, 확장 자동.

"1 PB 데이터, 팀 5명" → Snowflake. "성능 최우선, 운영팀 있음" → ClickHouse.

15.4 StarRocks / Doris

중국 기업이 주도하는 ClickHouse 대안. MySQL 호환 프론트엔드, 일부 조인 성능 우수. 아직 생태계 작음.

15.5 Apache Druid vs Apache Pinot

둘 다 유사. Pinot이 LinkedIn 출신, Druid가 Metamarkets 출신. 집계 기능과 운영 방식에 차이.

16. 학습 로드맵

1단계: 시작

https://clickhouse.com/docs — 공식 문서.
공식 tutorial 따라하기 (NYC Taxi 데이터).
clickhouse-client로 대화형 실험.

2단계: 설계

"Altinity Knowledge Base" — 실무 팁.
YouTube의 Altinity, ClickHouse 공식 채널.
ORDER BY, partitioning 설계 연습.

3단계: 내부

ClickHouse GitHub 소스.
"ClickHouse under the hood" 발표 (Alexey Milovidov).
system.* 테이블 탐험.

4단계: 스케일

분산 쿼리 설계.
Keeper 운영.
Materialized view 패턴.

책:

"ClickHouse: The Real Warehouse Experience" — Altinity.
"Fundamentals of Data Engineering" — 일반 개념.

컨퍼런스:

ClickHouse Meetup.
Open Source Summit.

17. 요약 — 한 장 정리

┌─────────────────────────────────────────────────────┐
│            ClickHouse Cheat Sheet                    │
├─────────────────────────────────────────────────────┤
│ 컬럼나 저장:                                          │
│   - 컬럼별 파일                                        │
│   - 필요 컬럼만 읽음                                   │
│   - 압축 효율 ↑                                       │
│   - SIMD 친화적                                       │
│                                                       │
│ MergeTree:                                            │
│   - Part = 디렉토리(컬럼별 .bin + .mrk2)              │
│   - Granule = 8192 행                                 │
│   - Sparse primary index                              │
│   - Background merge                                  │
│   - Partition pruning                                 │
│                                                       │
│ Family:                                               │
│   MergeTree: 기본                                     │
│   ReplacingMergeTree: 중복 제거                       │
│   SummingMergeTree: 합산                              │
│   AggregatingMergeTree: 일반 집계                      │
│   CollapsingMergeTree: 상태 취소                      │
│   ReplicatedMergeTree: 복제                          │
│                                                       │
│ 벡터화:                                                │
│   - Block 단위 처리 (~65536 행)                        │
│   - SIMD 자동 (AVX2/AVX-512)                          │
│   - 런타임 CPU dispatch                                │
│                                                       │
│ 인덱스:                                                │
│   Primary: sparse, granule 시작                       │
│   Data Skipping: minmax/bloom/set/ngrambf            │
│                                                       │
│ 압축:                                                  │
│   LZ4 (기본), ZSTD (고압축)                            │
│   Delta, DoubleDelta (timestamp)                      │
│   Gorilla (float)                                     │
│   T64 (비트 패킹)                                      │
│                                                       │
│ Materialized View:                                    │
│   - INSERT trigger                                    │
│   - 사전 집계                                          │
│   - AggregateFunction state                           │
│                                                       │
│ Distributed:                                          │
│   - Shard × Replica                                   │
│   - Distributed engine = query router                 │
│   - ReplicatedMergeTree                               │
│   - Keeper (ZK 대체)                                  │
│                                                       │
│ Projection:                                           │
│   같은 테이블에 다른 정렬                                │
│   자동 선택                                            │
│                                                       │
│ 튜닝:                                                  │
│   - Insert batch 크게 (10K+)                          │
│   - ORDER BY 카디널리티 낮은 → 높은                     │
│   - Part 수 모니터링                                   │
│   - system.query_log 분석                             │
│                                                       │
│ vs 경쟁:                                               │
│   DuckDB: 임베디드                                    │
│   Druid/Pinot: 실시간                                 │
│   Snowflake/BQ: 매니지드                               │
└─────────────────────────────────────────────────────┘

18. 퀴즈

Q1. ClickHouse의 "컬럼나 저장"이 OLAP 쿼리를 빠르게 하는 이유 3가지는?

A. (1) I/O 감소 — SELECT AVG(age)에서 age 컬럼 파일만 읽으면 되고 다른 100개 컬럼은 건드리지 않는다(100배 I/O 절감). (2) 압축률 증가 — 같은 컬럼은 타입과 분포가 비슷해서 country 같은 컬럼은 수 개 값의 반복으로 10배+ 압축. 서로 다른 타입이 섞인 row보다 훨씬 효율적. (3) SIMD 최적화 — 같은 타입 값이 연속 저장돼 AVX2로 한 번에 8개 int32 처리, 벡터 레지스터에 그대로 로드. 추가로 캐시 라인 효율도 좋다. 이 셋이 합쳐져 Postgres 대비 100-1000배 빠른 분석 쿼리.

Q2. MergeTree의 "sparse primary index"는 무엇이며 왜 괜찮은가?

A. 모든 행이 아니라 granule(기본 8192 행)마다 하나의 인덱스 엔트리만 저장. 10억 행 테이블의 인덱스가 수 MB로 줄어 메모리에 완전 상주한다. 쿼리 시 binary search로 해당 granule을 찾고 그 granule(8192 행)을 통째로 스캔. "1 행을 정확히 찾기"는 느리지만 OLAP의 전형적 패턴은 수십만~수억 행 집계이므로 granule 단위 스캔이 오히려 효율적(디스크 I/O가 page 단위라 granule 하나가 이미 여러 페이지). OLTP의 dense index 철학과 정반대의 설계.

Q3. 벡터화 실행(Vectorized Execution)이 Volcano 모델보다 빠른 이유는?

A. 함수 호출 오버헤드 제거와 SIMD 활용. Volcano 모델은 한 행씩 next() 호출 → 함수 호출당 수십 ns 오버헤드 × 수억 행 = 치명적 누적. 또한 분기 예측 실패와 pipeline stall로 현대 CPU를 낭비. 벡터화는 **블록 단위(~65536 행)**로 처리 — 루프 안에서 여러 행을 한 번에 처리하므로 함수 호출 오버헤드는 블록 당 한 번만. 게다가 연속 메모리에 같은 타입이 있어 컴파일러가 SIMD(AVX2/AVX-512)로 자동 최적화. 실측 10-100배 빠름. ClickHouse, DuckDB, Snowflake, BigQuery 모두 이 패턴.

Q4. ORDER BY (A, B, C) 선택 시 카디널리티 낮은 → 높은 순으로 배치하라는 이유는?

A. Prefix 기반 데이터 스킵 효율 극대화. ClickHouse는 ORDER BY 순서대로 물리 저장되고, Primary index는 이 정렬의 sparse 샘플. 카디널리티 낮은 컬럼(예: country, 200개)을 앞에 두면 같은 국가 값이 긴 연속 범위를 차지 → WHERE country = 'US' 쿼리가 큰 granule 범위를 한 번에 skip. 만약 event_time(카디널리티 높음)을 먼저 두면 같은 국가가 퍼져 있어 WHERE country = ... 쿼리가 거의 전체 스캔. 반대로 event_time 같은 시간 컬럼은 범위 쿼리에 쓰이므로 그 뒤에 두면 두 컬럼 모두에서 효율. 실무 원칙: 필터에 자주 쓰이는 저카디널리티 컬럼 먼저.

Q5. SummingMergeTree와 AggregatingMergeTree의 차이는?

A. 집계 함수의 범위. SummingMergeTree는 숫자 컬럼의 합산만 한다 — 같은 key의 행이 병합될 때 숫자 컬럼이 자동으로 sum. 간단하고 빠름. 한계: sum 외의 집계(avg, uniq, quantile)는 불가. AggregatingMergeTree는 임의의 aggregate function의 중간 상태(AggregateFunction(uniq, ...) 등)를 저장하고 병합 시 state끼리 merge. uniqState/uniqMerge 패턴으로 distinct count 같은 복잡 집계도 사전 계산 가능. 더 유연하지만 쿼리 시 -Merge suffix 필요. 실무: 단순 카운터/합산엔 Summing, 고급 집계엔 Aggregating + Materialized View 조합.

Q6. ClickHouse Keeper가 ZooKeeper를 대체한 이유는?

A. 주된 이유는 운영 복잡도와 ClickHouse 워크로드 특화. ZooKeeper는 Java라 (1) 메모리 사용량이 크고 (2) GC pause가 발생. 또한 ClickHouse 사용 패턴(많은 작은 알림, 고빈도 metadata update)에 덜 최적이었다. ClickHouse Keeper는 C++로 재작성 + Raft 직접 구현 → ZK 대비 1.5-2배 빠름, 메모리 절반. ZK API 호환이라 클라이언트 코드 변경 불필요. 단일 바이너리라 ClickHouse와 같은 프로세스에 embedded 가능 → 스택 단순화. 2023년 이후 대부분의 새 배포가 Keeper 기본. ZK는 legacy로 취급.

Q7. ClickHouse와 DuckDB는 모두 컬럼나 + 벡터화이지만 용도가 다른 이유는?

A. 배포 모델과 확장성의 차이. ClickHouse는 서버 아키텍처 — 네트워크 서비스, 분산, 복제, 수 TB~PB 스케일. 팀이 운영하고 수만 QPS 처리. DuckDB는 임베디드 아키텍처 — 프로세스 안에 라이브러리로 내장(SQLite처럼), 단일 노드, 파일 기반(.duckdb). 로컬 개발/노트북 분석/ETL 스크립트에 적합. 둘 다 훌륭한 벡터화 엔진이지만, "분산 시스템이 필요한가"로 갈린다. 예: Jupyter notebook에서 10GB Parquet 분석 → DuckDB. Kafka로부터 초당 100만 이벤트 수집 + 실시간 대시보드 → ClickHouse. "ClickHouse가 더 좋다"가 아니라 "다른 문제를 푼다".

이 글이 도움이 됐다면 다음 포스트도 확인해 보세요:

"RocksDB & LSM-Tree Deep Dive" — 로우 기반 LSM과의 비교.
"Columnar Storage Parquet/ORC/Arrow/Dremel" — 컬럼 포맷의 기반.
"Snowflake Architecture Deep Dive" — 또 다른 컬럼나 OLAP의 접근.
"Apache Spark Catalyst & Tungsten" — 분산 분석 엔진의 대안.

ClickHouse Internals Deep Dive — MergeTree, Vectorized Execution, Distributed Queries, Keeper (2025)

TL;DR

ClickHouse: built by Yandex in 2009 for internal analytics, open-sourced in 2016. Written in C++, the flagship columnar OLAP engine.
Columnar storage: values of the same column stored contiguously. For queries like SELECT sum(x) WHERE y > 0 only that column is read — 10-100x faster.
MergeTree: ClickHouse's core engine. LSM-like "parts" with background merges, but stored by column rather than by row.
Sparse Primary Index: one entry per granule (default 8192 rows), not per row. Index fits entirely in memory.
Vectorized Execution: processes tens of thousands of rows at a time in loops. SIMD-friendly — compilers exploit AVX2/AVX-512 automatically.
Compression: per-column LZ4/ZSTD by default. Time-series data uses Delta + Gorilla for aggressive compression.
Distributed Engine: sharding + replication. Query fans out to shards and aggregates.
Keeper: ZooKeeper replacement. Raft-based, reimplemented in C++. Optimized for ClickHouse workloads.
Materialized View: results precomputed at insert time. Essential for real-time aggregation.
2025 competitors: Druid, Pinot (real-time aggregation), DuckDB (embedded), Snowflake/BigQuery (managed).

1. Why ClickHouse Is Fast

1.1 OLAP vs OLTP

OLTP (Online Transaction Processing):

Reads/writes few rows (single-row inserts/updates).
O(log n) with an index.
Postgres, MySQL, Oracle.

OLAP (Online Analytical Processing):

Scans millions to billions of rows.
Only a few columns needed.
Aggregations (sum, count, avg, distinct).
ClickHouse, Snowflake, BigQuery, Druid.

1.2 Typical OLAP Query

SELECT user_country, SUM(revenue)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY user_country
ORDER BY SUM(revenue) DESC
LIMIT 10;

This query:

Scans billions of rows by date range.
Needs only user_country and revenue (not full rows).
Returns a small aggregated result.

Why is Postgres slow here? Row storage forces reading all columns even if unused, indexes do not help for large scans, and per-row processing accumulates function-call overhead.

1.3 Three Weapons

Columnar storage: read only the columns you need.
Vectorized execution: process tens of thousands of rows at a time with SIMD.
Heavy compression: save disk I/O bandwidth.

Combined, these make ClickHouse 100-1000x faster than Postgres for OLAP.

2. History

Yandex built ClickHouse for Metrica, their Google Analytics competitor. MySQL collapsed under billions of events per day; a team led by Alexey Milovidov rewrote the engine in C++. Prototype in 2009, production in 2012. Open-sourced in 2016 to explosive adoption: CloudFlare, Uber, Bloomberg, GitLab, Deutsche Bank. ClickHouse Inc. spun out in 2021 with $250M+ in funding and launched ClickHouse Cloud. As of 2025: 35,000+ GitHub stars, thousands of production users, monthly releases.

3. Columnar Storage Basics

3.1 Row vs Column

Row-based (Postgres, MySQL):

Row 1: [id=1, name="Alice", age=30, country="US"]
Row 2: [id=2, name="Bob",   age=25, country="UK"]

On disk: 1|Alice|30|US|2|Bob|25|UK|...

Column-based (ClickHouse):

Column "id":      [1, 2, 3, ...]
Column "name":    ["Alice", "Bob", ...]
Column "age":     [30, 25, ...]
Column "country": ["US", "UK", ...]

Each column is a separate file.

3.2 Why Columnar Is Faster for Analytics

I/O reduction: SELECT AVG(age) reads only age.bin. A 100-column table with a 1-column query means 100x less I/O.

Compression: values in one column share type and distribution. A country column of "US", "UK", "CN" compresses 10x+ with LZ4. Mixed-type rows compress poorly.

Cache-friendly: sequential column scans keep L1 hot.

SIMD: same-type contiguous values let AVX process 8 int32 values per 256-bit register.

3.3 When Row-Based Wins

Point lookup WHERE id = 42.
Update-heavy workloads.
Small result, many columns returned.

Do not use ClickHouse for OLTP.

4. MergeTree — The Heart of ClickHouse

4.1 Core Idea

Inspired by LSM-trees but redesigned for columnar storage.

Insert -> memory buffer -> disk "part" (columnar files)
                                  |
                          background merge
                                  |
                          larger sorted part

4.2 Part Structure

A part is a directory:

/var/lib/clickhouse/data/mydb/mytable/
  20240101_1_1_0/              # partition_minBlock_maxBlock_level
    checksums.txt
    columns.txt
    count.txt
    primary.idx                # sparse primary index
    user_id.bin                # column file
    user_id.mrk2               # marks (offsets)
    event_time.bin
    event_time.mrk2
    revenue.bin
    revenue.mrk2

Each column has a .bin and .mrk2 pair. Marks hold granule offsets in the .bin file.

4.3 Granules and the Sparse Primary Index

MergeTree's key insight: no per-row index entry.

Default 8192 rows = one granule.
Primary index points only to the first row of each granule.

For WHERE user_id = 12345: binary-search the primary index in memory, find the target granule, read it from disk (~8192 rows), scan for the value. The index is 1/8192 the size of a traditional one. A billion-row table's index is a few MB — entirely in RAM.

4.4 Why Sparse Is Fine

OLAP typically aggregates many rows. Reading one granule (8192 rows) maps well to disk I/O page units — several pages per granule. Point lookups are supported but are not ClickHouse's strength.

4.5 ORDER BY Is the Primary Key

CREATE TABLE events (
    event_time DateTime,
    user_id UInt64,
    event_type String,
    revenue Decimal(10, 2)
) ENGINE = MergeTree
ORDER BY (event_time, user_id)
PARTITION BY toYYYYMM(event_time);

Data is physically sorted by (event_time, user_id); the primary index is a sparse sample of that ordering.

4.6 Partitioning

PARTITION BY toYYYYMM(event_time) places monthly data in separate directories. Benefits: partition pruning, near-instant DROP PARTITION, partition-unit backup/replication. Avoid too many partitions — monthly or weekly is typical.

5. Write Path

5.1 INSERT Journey

Input batched.
Per-column vectors built in memory.
Columns sorted by ORDER BY.
Each column written to a separate compressed file.
Part directory created, manifest updated.

A new part is created; old parts continue to coexist.

5.2 Background Merges

A background thread periodically merges small parts into larger ones (same partition only). Merges happen online; old parts are dropped when done.

5.3 Why Merging Matters

Too many parts (default limit 300 per partition) slows queries and triggers the "Too many parts" error. UPDATE/DELETE mutations are actually applied during merges.

5.4 Insert Optimization

Bad: one INSERT per row — each creates a new part. Good: batched INSERT of thousands to tens of thousands of rows. For streaming use Kafka table engine or an external ingester that batches.

6. Read Path and Vectorized Execution

6.1 Query Stages

Parse and plan.
Partition pruning.
Granule selection via primary index.
Read only required columns.
Vectorized execution.
Aggregate (hash table for GROUP BY).
Sort and return.

6.2 Vectorized Execution

Traditional DBs use the Volcano model (one row per next() call) — heavy function-call overhead and branch mispredictions.

ClickHouse processes blocks (~65536 rows):

while (block = scan.next_block()) {   // 65536 rows
    auto mask = filter_block(block);  // SIMD
    aggregate_block(block, mask);     // SIMD
}

Compilers auto-vectorize these tight loops into AVX2.

6.3 SIMD Example

// Naive
int64_t sum = 0;
for (int i = 0; i < n; i++) sum += revenue[i];

// AVX2 intrinsics
__m256i sum_vec = _mm256_setzero_si256();
for (int i = 0; i < n; i += 4) {
    __m256i v = _mm256_loadu_si256((__m256i*)&revenue[i]);
    sum_vec = _mm256_add_epi64(sum_vec, v);
}

4-8x speedup measured. ClickHouse mostly relies on the compiler — code is written in a vectorization-friendly style.

6.4 Runtime CPU Dispatch

ClickHouse detects CPU features (SSE2, SSE4.2, AVX2, AVX-512) at runtime and picks the best implementation via CPUID.

7. Data Skipping Indexes

Primary index only helps on ORDER BY columns. For others:

7.1 Bloom Filter

ALTER TABLE events
ADD INDEX idx_user_email user_email TYPE bloom_filter GRANULARITY 4;

Skip granules when the bloom says "definitely not here".

7.2 MinMax

ADD INDEX idx_revenue revenue TYPE minmax GRANULARITY 1;

Cheap and very effective — recommended by default.

7.3 Set

ADD INDEX idx_country user_country TYPE set(100) GRANULARITY 4;

Good for low-cardinality columns.

7.4 N-gram

ADD INDEX idx_log_msg log_message TYPE ngrambf_v1(4, 1024, 3, 0) GRANULARITY 1;

Accelerates LIKE '%error%'-style queries on log strings.

7.5 Pitfalls

Granule-level, not row-level. Overhead on writes and part metadata. Choose carefully.

8. Compression

Per-column codecs, independently chosen.

8.1 Base Codecs

LZ4: default, fast, moderate ratio.
ZSTD: better ratio, slightly slower; levels 1-22. Good for cold data.

CREATE TABLE events (
    event_time DateTime CODEC(DoubleDelta, ZSTD(3)),
    user_id UInt64 CODEC(T64, LZ4)
)

8.2 Delta / DoubleDelta

Delta stores [1000, 1, 2, 2, 5, 2] instead of raw values. DoubleDelta (delta-of-delta) reduces regular timestamps to near-zero values — often 10x+ compression.

8.3 Gorilla

Facebook's time-series trick: XOR successive floats; zero-heavy results compress well. 5-10x on metric columns.

8.4 T64

Bit-packs integers into the minimum bit width used.

8.5 Codec Chains

event_time DateTime CODEC(DoubleDelta, LZ4)

Transform first, then general compression.

8.6 Realistic Ratios

Raw CSV 100 GB -> row DB ~80 GB (LZ4) -> ClickHouse 10-15 GB. Columnar plus specialized codecs gives 7-10x vs raw.

9. MergeTree Family

9.1 ReplacingMergeTree

Keeps only the latest row per key (by version column) during merges.

CREATE TABLE users (
    user_id UInt64,
    name String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

Use FINAL to force deduplication at query time (slower).

9.2 SummingMergeTree

Sums numeric columns for rows sharing the ORDER BY key.

CREATE TABLE daily_stats (
    date Date,
    user_id UInt64,
    pageviews UInt64,
    clicks UInt64
) ENGINE = SummingMergeTree()
ORDER BY (date, user_id);

Always query with explicit GROUP BY + sum() for correctness before merges complete.

9.3 AggregatingMergeTree

Stores intermediate states of arbitrary aggregate functions.

CREATE TABLE hourly_stats (
    hour DateTime,
    country String,
    unique_users AggregateFunction(uniq, UInt64),
    avg_revenue AggregateFunction(avg, Decimal(10, 2))
) ENGINE = AggregatingMergeTree()
ORDER BY (hour, country);

Merges combine states — pairs with Materialized Views.

9.4 CollapsingMergeTree

Uses a sign Int8 column (+1/-1) to cancel prior state — useful for replicating MySQL binlogs.

9.5 VersionedCollapsingMergeTree

CollapsingMergeTree with explicit version for out-of-order merges.

9.6 GraphiteMergeTree

Graphite-compatible time-series metrics with TTL-based downsampling.

10. Materialized Views

MVs behave like insert triggers.

10.1 Basic

CREATE MATERIALIZED VIEW hourly_pageviews_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, page)
AS
SELECT
    toStartOfHour(event_time) AS hour,
    page,
    count() AS views
FROM events
GROUP BY hour, page;

Each insert into events triggers the SELECT; results are inserted into the MV. Queries run against the much smaller MV — 100x+ faster.

10.2 AggregateFunction MV

CREATE MATERIALIZED VIEW user_metrics_mv
ENGINE = AggregatingMergeTree()
ORDER BY (date, user_id)
AS
SELECT
    toDate(event_time) AS date,
    user_id,
    uniqState(session_id) AS unique_sessions,
    sumState(revenue) AS total_revenue
FROM events
GROUP BY date, user_id;

Query with uniqMerge / sumMerge to combine states — supports distinct counts and other complex aggregates.

10.3 Fan-out

One insert can drive many MVs (hourly, daily, country, user). Pick the right MV per query.

10.4 Caveats

Insert atomicity — source failure means MV failure. MVs see only data at insert time. Heavy JOINs inside MVs can be slow. TTLs don't propagate from source to MV.

11. Projections

Evolution of MVs (ClickHouse 21+). Stores alternate sort orderings inside the same table.

ALTER TABLE events
ADD PROJECTION p_country
(
    SELECT event_time, user_country, revenue
    ORDER BY (user_country, event_time)
);

ALTER TABLE events MATERIALIZE PROJECTION p_country;

ClickHouse picks the projection automatically when the query fits.

Aspect	Materialized View	Projection
Storage	Separate table	Inside source table
Usage	Explicit name	Auto-selected
Deletion/TTL	Independent	Follows source
Complex queries	Yes	Limited

12. Distributed Engine and Sharding

ClickHouse scales horizontally by sharding.

12.1 Concept

Local table: each shard stores its own data.
Distributed table: routes queries across shards.

12.2 Cluster Configuration

<clickhouse>
  <remote_servers>
    <my_cluster>
      <shard>
        <replica>
          <host>shard1-r1</host>
          <port>9000</port>
        </replica>
        <replica>
          <host>shard1-r2</host>
          <port>9000</port>
        </replica>
      </shard>
      <shard>
        <replica>
          <host>shard2-r1</host>
        </replica>
      </shard>
    </my_cluster>
  </remote_servers>
</clickhouse>

12.3 Distributed Table

CREATE TABLE events_distributed ON CLUSTER my_cluster AS events
ENGINE = Distributed(my_cluster, default, events, rand());

12.4 Query Execution

Coordinator fans out the query, each shard aggregates locally, then the coordinator merges partial results. Map-reduce style with minimal network overhead.

12.5 Sharding Key

Good: high-cardinality, evenly distributed, used in filters (enabling local queries). Bad: country (skewed), timestamp (hot shard for latest data).

12.6 Replication

ReplicatedMergeTree replicates within a shard via Keeper/ZooKeeper:

CREATE TABLE events (...) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}'
)
ORDER BY ...;

13. Keeper — ZooKeeper Replacement

ZooKeeper (Java) is memory-heavy with GC pauses and isn't optimal for ClickHouse's access patterns. ClickHouse Keeper (2021) is a C++ reimplementation with Raft consensus, ZK wire-protocol compatible, 1.5-2x faster, half the memory. Can be embedded in the ClickHouse binary or run standalone. Most new deployments use Keeper; ZK is legacy.

<clickhouse>
  <keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <raft_configuration>
      <server><id>1</id><hostname>keeper1</hostname><port>9234</port></server>
      <server><id>2</id><hostname>keeper2</hostname><port>9234</port></server>
      <server><id>3</id><hostname>keeper3</hostname><port>9234</port></server>
    </raft_configuration>
  </keeper_server>
</clickhouse>

14. Practical Tuning

Insert batching: 10K-100K rows. Use Buffer tables, Kafka table engine, or async_insert.
ORDER BY design: low cardinality first, then higher. Three to five columns is typical.
Part management: monitor system.parts; default limit 300 per partition. Increase background_pool_size if needed.
Memory limits: max_memory_usage; enable max_bytes_before_external_group_by to spill to disk.
Observability: system.parts, system.query_log, system.metrics, system.events, system.asynchronous_metrics.

SELECT query, elapsed, memory_usage
FROM system.query_log
WHERE event_date = today()
ORDER BY elapsed DESC
LIMIT 10;

15. ClickHouse vs Alternatives

15.1 DuckDB

Embedded columnar DB (SQLite-style). Single process, no distribution. Great for local analysis up to TB scale. Same vectorized philosophy, different deployment.

15.2 Druid / Pinot

Real-time ingestion and dashboards. Druid "segments" resemble parts; Druid requires rollup definition upfront (less flexible). ClickHouse has richer SQL.

15.3 Snowflake / BigQuery

Fully managed. Higher cost, zero ops. ClickHouse is cheaper and often faster, but requires an ops team.

15.4 StarRocks / Doris

China-led alternatives with MySQL-compatible frontends and strong join performance. Smaller ecosystem.

16. Learning Path

Start with the official docs and the NYC Taxi tutorial.
Practice ORDER BY and partitioning design; follow Altinity Knowledge Base.
Explore system.* tables; watch "ClickHouse under the hood" talks by Alexey Milovidov.
Design distributed queries, run Keeper, build Materialized View patterns.

17. Cheat Sheet

Columnar storage:
  - One file per column
  - Read only needed columns
  - Better compression
  - SIMD-friendly

MergeTree:
  - Part = directory of column .bin + .mrk2
  - Granule = 8192 rows
  - Sparse primary index
  - Background merges
  - Partition pruning

Family:
  MergeTree (base), Replacing (dedup), Summing (sum),
  Aggregating (generic), Collapsing (state cancel), Replicated

Vectorization:
  - Block-wise (~65536 rows)
  - Auto SIMD (AVX2/AVX-512)
  - Runtime CPU dispatch

Indexes:
  Primary: sparse, granule starts
  Skipping: minmax / bloom / set / ngrambf

Compression:
  LZ4 (default), ZSTD (cold)
  Delta, DoubleDelta (timestamps)
  Gorilla (float), T64 (bit packing)

Materialized View:
  - Insert trigger
  - Pre-aggregation
  - AggregateFunction state

Distributed:
  - Shard x Replica
  - Distributed engine = query router
  - ReplicatedMergeTree + Keeper

Projection:
  Alternate sort inside same table, auto-selected

vs Competitors:
  DuckDB (embedded), Druid/Pinot (real-time),
  Snowflake/BQ (managed)

18. Quiz

Q1. Why does columnar storage make OLAP queries fast?

A. Three reasons: (1) I/O reduction — only needed columns are read, so a 100-column table with a 1-column query reads ~100x less. (2) Compression — uniform type/distribution in each column (e.g., country repeating "US"/"UK") compresses 10x+ with LZ4. (3) SIMD — contiguous same-type values load directly into AVX registers for 8x int32 parallelism, plus cache-line efficiency. Combined: 100-1000x faster than Postgres for analytics.

Q2. What is the sparse primary index and why is it acceptable?

A. One index entry per granule (default 8192 rows), not per row. A billion-row table has only a few MB of index — fully RAM-resident. Point lookups require a full-granule scan, but OLAP aggregates many rows, so granule-level reads match disk page I/O well. Opposite philosophy to OLTP dense indexes.

Q3. Why is vectorized execution faster than the Volcano model?

A. Eliminates per-row function-call overhead and enables SIMD. Volcano calls next() per row — tens of ns each, devastating over billions of rows, plus branch mispredictions. Vectorized runs blocks of ~65536 rows, so one call amortizes over thousands of values, and contiguous same-type data lets compilers emit AVX2/AVX-512 automatically. 10-100x measured speedup.

Q4. Why place low-cardinality columns first in ORDER BY?

A. Data is physically sorted by ORDER BY. Putting low-cardinality columns like country first creates long contiguous runs per value, so WHERE country = 'US' skips huge granule ranges. Putting high-cardinality columns first scatters values across granules, defeating pruning. Rule: frequently filtered low-cardinality columns first, time or high-cardinality range columns after.

Q5. SummingMergeTree vs AggregatingMergeTree?

A. SummingMergeTree only sums numeric columns per key — simple and fast, but limited. AggregatingMergeTree stores intermediate states (e.g., AggregateFunction(uniq, ...)) and merges them; combined with -State/-Merge suffixes it supports distinct counts, quantiles, and other complex aggregates. Use Summing for counters, Aggregating for richer analytics (typically via Materialized Views).

Q6. Why did ClickHouse Keeper replace ZooKeeper?

A. ZooKeeper's Java runtime is memory-heavy with GC pauses and not optimized for ClickHouse's high-frequency metadata patterns. Keeper is a C++ rewrite with Raft, wire-compatible with ZK, about 1.5-2x faster with half the memory. It can be embedded in the ClickHouse binary, simplifying ops. Most post-2023 deployments default to Keeper.

Q7. Why do ClickHouse and DuckDB differ in use case despite sharing columnar vectorized design?

A. Deployment model. ClickHouse is a server: networked, distributed, replicated, TB-to-PB scale, tens of thousands of QPS. DuckDB is embedded — a library running in-process (SQLite-style) on a single node with file-based .duckdb storage. DuckDB fits notebook analysis and ETL scripts; ClickHouse fits Kafka-fed real-time dashboards at scale. Different problems, not better/worse.

If this was useful, check out:

"RocksDB and LSM-Tree Deep Dive" — row-oriented LSM contrast.
"Columnar Storage: Parquet/ORC/Arrow/Dremel" — column format foundations.
"Snowflake Architecture Deep Dive" — another columnar OLAP approach.
"Apache Spark Catalyst and Tungsten" — distributed analytics alternative.