Split View: 확률적 자료구조 완전 가이드 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, 실전 활용

✨ Learn with Quiz

확률적 자료구조 완전 가이드 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, 실전 활용

TL;DR

확률적 자료구조의 본질: 100% 정확도를 포기하고 메모리/속도를 극단적으로 절약
Bloom Filter: "이 원소가 집합에 있을 수 있다" — false positive O, false negative X. RocksDB, Cassandra, BigTable의 핵심
HyperLogLog: 12KB로 수십억 원소의 고유 카운트. Redis PFADD/PFCOUNT, BigQuery, Druid 사용
Count-Min Sketch: 스트림에서 원소 빈도 추정. 핫 키 탐지, DDoS 감지
트레이드오프: 정확도 vs 메모리 vs 속도 — 정확한 답이 필요 없을 때 폭발적 효율

1. 확률적 자료구조란?

1.1 핵심 아이디어

"100% 정확한 답이 필요 없다면, 메모리와 속도를 극단적으로 절약할 수 있다."

전통적 자료구조:

HashMap으로 100만 원소 저장 → 약 50 MB
"이 원소가 있는가?" → 정확한 답

확률적 자료구조:

Bloom Filter로 100만 원소 저장 → 약 1 MB (50배 절약!)
"이 원소가 있는가?" → "있을 수도 있다" 또는 "확실히 없다"

트레이드오프: 메모리 50배 절약 + 약간의 false positive (1%).

1.2 왜 필요한가?

대규모 시스템에서:

메모리는 비싸다 — RAM은 디스크의 100배+
L1/L2 캐시에 들어가야 빠름 → 작을수록 좋음
수십억 원소 — HashMap이 불가능
분산 시스템 — 작은 데이터가 네트워크 효율적

1.3 정확도가 필요 없는 경우

문제	정확한 답 필요?	확률적 OK?
사용자 인증	✅	❌
"이 URL을 본 적 있나?"	❌	✅
일일 활성 사용자 수	❌ (대략 1000만이면 충분)	✅
핫 페이지 탐지	❌	✅
캐시 사전 검증	❌	✅

2. Bloom Filter — 가장 유명한 확률적 자료구조

2.1 원리

비트 배열 + 여러 해시 함수.

삽입:

원소를 K개의 해시 함수에 통과 → K개의 인덱스
각 인덱스의 비트를 1로 설정

조회:

같은 K개 해시 함수
모든 인덱스의 비트가 1이면 → "있을 수도"
하나라도 0이면 → "확실히 없음"

2.2 시각화

비트 배열 (8비트):  [0, 0, 0, 0, 0, 0, 0, 0]

"apple" 삽입 (해시 → [1, 4]):
                  [0, 1, 0, 0, 1, 0, 0, 0]

"banana" 삽입 (해시 → [3, 6]):
                  [0, 1, 0, 1, 1, 0, 1, 0]

"apple" 조회 (해시 → [1, 4]): 둘 다 1 → "있을 수도" ✅
"grape" 조회 (해시 → [2, 5]): 0이 있음 → "확실히 없음" ❌
"cherry" 조회 (해시 → [1, 6]): 우연히 둘 다 1 → "있을 수도" ⚠️ (false positive)

2.3 false positive율 계산

수식:

P(fp) = (1 - e^(-k*n/m))^k

n: 원소 수
m: 비트 수
k: 해시 함수 수

최적 k:

k = (m/n) * ln(2)

1% false positive가 목표라면:

m/n ≈ 9.6 비트
k ≈ 7

100만 원소 = 1.2 MB + 7 해시 함수 = HashMap의 50배 작음.

2.4 Python 구현

import hashlib
import math

class BloomFilter:
    def __init__(self, n, fp_rate):
        self.size = -int(n * math.log(fp_rate) / (math.log(2) ** 2))
        self.hash_count = int(self.size / n * math.log(2))
        self.bits = [0] * self.size

    def _hashes(self, item):
        for i in range(self.hash_count):
            h = hashlib.md5(f"{item}{i}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def contains(self, item):
        return all(self.bits[h] for h in self._hashes(item))

# 사용
bf = BloomFilter(n=1000000, fp_rate=0.01)
bf.add("alice@example.com")
print(bf.contains("alice@example.com"))  # True
print(bf.contains("bob@example.com"))    # 거의 항상 False

2.5 Bloom Filter의 한계

삭제 불가능 — 비트를 0으로 하면 다른 원소에 영향
resize 어려움 — 새 필터 만들고 모든 원소 재삽입 필요
count 불가능 — 몇 개 들어있는지 모름

해결책: Counting Bloom Filter, Cuckoo Filter (아래 참조).

2.6 Bloom Filter의 실전 활용

RocksDB / Cassandra / BigTable

LSM 트리 기반 DB에서, SSTable마다 Bloom Filter:

쿼리: "key123"

SSTable 1의 Bloom Filter: "확실히 없음" → 디스크 안 읽음 ✅
SSTable 2의 Bloom Filter: "확실히 없음" → 디스크 안 읽음 ✅
SSTable 3의 Bloom Filter: "있을 수도"  → 디스크 읽음 (실제 검색)

효과: 디스크 I/O 99% 감소. 전체 SSTable을 안 읽음.

CDN 캐시

요청: GET /image/123.jpg

Bloom Filter: "캐시에 없음" → 즉시 origin 요청 (캐시 lookup 생략)
            "있을 수도"   → 캐시 lookup

효과: 캐시 미스 시 시간 절약.

비밀번호 누출 검사

HaveIBeenPwned의 비밀번호 1억+ 데이터베이스 → 클라이언트가 직접 확인 가능 (Bloom Filter 다운로드).

Chrome 악성 URL 검사

Google이 알려진 악성 URL 데이터베이스 → Bloom Filter로 클라이언트에 전송 → 브라우저가 즉시 확인.

3. Counting Bloom Filter

3.1 문제 해결

기본 Bloom Filter는 삭제 불가능. Counting Bloom Filter는 비트 대신 카운터:

배열: [0, 0, 0, 0, 0, 0, 0, 0]  (각 셀이 4비트 카운터)

"apple" 삽입 ([1, 4]):
        [0, 1, 0, 0, 1, 0, 0, 0]

"apple" 또 삽입:
        [0, 2, 0, 0, 2, 0, 0, 0]

"apple" 삭제:
        [0, 1, 0, 0, 1, 0, 0, 0]

3.2 단점

메모리 4배 — 1비트 → 4비트 (오버플로 방지 위해)
여전히 false positive 있음

3.3 Cuckoo Filter (대안)

Cuckoo Filter는 Bloom Filter보다 우수한 대안:

삭제 가능
false positive율 더 낮음
메모리 효율 비슷
"Cuckoo hashing" 기반

# pycuckoofilter 예시
from cuckoofilter import CuckooFilter

cf = CuckooFilter(capacity=1000000, error_rate=0.01)
cf.insert("alice")
print(cf.contains("alice"))  # True
cf.delete("alice")  # 가능!

언제 Cuckoo > Bloom: 삭제가 필요할 때, 더 낮은 false positive 필요할 때.

4. HyperLogLog — 카디널리티 추정

4.1 문제

"오늘 우리 사이트에 몇 명의 고유 방문자가 왔나?"

전통적 방법: HashSet에 모든 방문자 ID 저장 → 1000만 방문자 = 800 MB.

HyperLogLog: 12 KB로 수십억 원소까지 추정. 1% 오차.

4.2 기본 아이디어

원소를 해시 → 이진수의 선행 0의 개수 추적.

"alice" → hash → 00000010110...  (5개의 선행 0)
"bob"   → hash → 00100110...     (2개의 선행 0)
"carol" → hash → 0001110...      (3개의 선행 0)

관찰: 무작위 해시에서 K개의 선행 0이 나올 확률 = 1/2^K.

→ "최대 5개의 선행 0이 나왔다" → 약 2^5 = 32개의 고유 원소 추정.

현실: 단일 측정은 부정확. 여러 버킷에 나누어 평균 → 정확도 향상. 이게 HyperLogLog.

4.3 정확도

원소 수	메모리	표준 오차
1,000	12 KB	0.81%
1,000,000	12 KB	0.81%
1,000,000,000	12 KB	0.81%

핵심: 메모리는 원소 수에 무관! 1억 원소나 100만 원소나 같은 메모리.

4.4 Redis HyperLogLog

PFADD visitors:2025-04-15 "user1" "user2" "user3"
PFADD visitors:2025-04-15 "user1"  # 중복 무시
PFCOUNT visitors:2025-04-15  # 약 3 (2일 수도, 3일 수도)

# 여러 HLL 병합
PFMERGE visitors:week visitors:2025-04-14 visitors:2025-04-15

4.5 HyperLogLog의 실전 활용

일일 활성 사용자 (DAU)

import redis

r = redis.Redis()

def track_user(user_id):
    today = datetime.now().strftime("%Y-%m-%d")
    r.pfadd(f"dau:{today}", user_id)

def get_dau(date):
    return r.pfcount(f"dau:{date}")

수억 사용자도 12 KB로 처리 가능.

고유 IP 카운트 (DDoS 분석)

PFADD unique_ips:2025-04-15 "1.2.3.4" "5.6.7.8"
PFCOUNT unique_ips:2025-04-15

Big Data — BigQuery, Druid

SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

내부적으로 HyperLogLog 사용. 정확한 COUNT(DISTINCT) 대비 100배+ 빠름.

5. Count-Min Sketch — 빈도 추정

5.1 문제

스트림에서 "각 원소가 몇 번 나타났나?"

전통적: HashMap → 메모리 폭증.

Count-Min Sketch: 작은 메모리로 빈도 추정.

5.2 원리

2D 배열 (d × w) + d개의 해시 함수.

삽입:

각 해시 함수로 인덱스 계산
해당 카운터 증가

조회:

같은 해시 함수
모든 카운터 중 최솟값 반환 (그래서 "Count-Min")

5.3 시각화

배열 (3 x 5):
[0, 0, 0, 0, 0]  ← hash1
[0, 0, 0, 0, 0]  ← hash2
[0, 0, 0, 0, 0]  ← hash3

"apple" 삽입 5번:
hash1("apple") = 1 → 카운터 [0, 5, 0, 0, 0]
hash2("apple") = 3 → 카운터 [0, 0, 0, 5, 0]
hash3("apple") = 0 → 카운터 [5, 0, 0, 0, 0]

"banana" 삽입 2번:
hash1("banana") = 4 → [0, 5, 0, 0, 2]
hash2("banana") = 1 → [0, 2, 0, 5, 0]
hash3("banana") = 2 → [5, 0, 2, 0, 0]

"apple" 조회: min(5, 5, 5) = 5 ✅
"banana" 조회: min(2, 2, 2) = 2 ✅

collision 발생 시: 다른 원소와 충돌하면 카운터가 부풀려질 수 있지만, 최솟값을 취하므로 항상 진짜 카운트보다 ≥ (overestimate).

5.4 정확도

오차 ε, 신뢰도 1-δ
필요 크기:
  w = ⌈e/ε⌉
  d = ⌈ln(1/δ)⌉

예: 1% 오차, 99% 신뢰도 → w=272, d=5 → 1360 카운터 = 5 KB. 수십억 원소 처리 가능.

5.5 실전 활용

핫 키 탐지 (Heavy Hitters)

# 가장 자주 본 키 찾기
sketch = CountMinSketch(width=272, depth=5)

for query in stream:
    sketch.add(query)

# Top-K 찾기 (separate min-heap 사용)

Memcached의 LRU 대안

tinylfu 캐시는 Count-Min Sketch로 빈도 추적 → 자주 쓰이는 키만 캐시에 유지.

DDoS 감지

각 IP의 요청 수를 추정 → 임계값 초과 시 차단.

Apache Spark, Flink

대규모 스트림 처리에서 빈도 분석.

6. MinHash — Jaccard 유사도

6.1 문제

"두 집합이 얼마나 비슷한가?" (Jaccard 유사도)

J(A, B) = |A ∩ B| / |A ∪ B|

전통적 방법: 모든 원소 비교 → O(|A| × |B|).

6.2 MinHash 원리

각 집합의 "최소 해시 값"이 같을 확률 = Jaccard 유사도.

def minhash(set_, num_hashes=128):
    return [min(hash(item, seed=i) for item in set_) for i in range(num_hashes)]

# 두 집합의 MinHash 비교
mh1 = minhash(set1)
mh2 = minhash(set2)

similarity = sum(a == b for a, b in zip(mh1, mh2)) / len(mh1)
# similarity ≈ Jaccard(set1, set2)

6.3 활용

중복 문서 탐지

수십억 웹 페이지에서 거의 같은 페이지 찾기. Google이 사용.

표절 탐지

논문, 코드 비교.

클러스터링

LSH (Locality Sensitive Hashing)와 결합하여 비슷한 항목 그룹화.

7. Skip List — 확률적 균형 트리

7.1 정렬된 자료구조

균형 이진 트리(AVL, Red-Black) 대신 확률적 접근.

7.2 원리

다층 연결 리스트. 상위 레벨은 sparse, 하위 레벨은 dense.

Level 3:  1 ───────────────→ 9
Level 2:  1 ────→ 4 ────→ 7 ─→ 9
Level 1:  1 → 3 → 4 → 5 → 7 → 8 → 9

검색: 상위에서 시작, 너무 크면 한 레벨 내려감.

삽입 시 레벨: 동전 던지기로 결정. 50% 확률로 다음 레벨.

7.3 성능

작업	평균	최악
검색	O(log n)	O(n)
삽입	O(log n)	O(n)
삭제	O(log n)	O(n)

장점: 균형 트리보다 구현 단순, lock-free 동시성 쉬움.

7.4 활용

Redis Sorted Set: ZADD/ZRANGE의 내부 구조
LevelDB/RocksDB의 memtable
Java ConcurrentSkipListMap

8. T-Digest — 분위수 추정

8.1 문제

대규모 스트림에서 p99 latency 같은 분위수를 어떻게 계산하나?

전통적: 모든 값 저장 후 정렬 → 메모리 폭증.

T-Digest: 분포의 "스케치"를 작은 메모리로 유지.

8.2 원리

값을 buckets로 그룹화. 분포의 끝(p99, p99.9)에 더 많은 buckets, 중간(p50)에 적게.

근거: 우리는 보통 **꼬리(tail latency)**에 관심. 중간값은 대략적이어도 OK.

8.3 사용

from tdigest import TDigest

t = TDigest()
for value in stream:
    t.update(value)

p50 = t.percentile(50)
p95 = t.percentile(95)
p99 = t.percentile(99)
p999 = t.percentile(99.9)

8.4 활용

모니터링: Prometheus, Datadog, New Relic
데이터베이스: Druid, Apache Pinot
스트림 처리: Apache Beam, Flink

9. 트레이드오프 비교

9.1 모든 자료구조 비교

자료구조	답하는 질문	메모리	정확도	사용
HashSet	정확한 멤버십	O(n)	100%	작은 데이터
Bloom Filter	"있을 수도?"	O(n) very small	FP O, FN X	캐시 사전 검증
Cuckoo Filter	멤버십 + 삭제	O(n) very small	FP 더 낮음	Bloom 대안
Counter	정확한 카운트	O(n)	100%	작은 데이터
HyperLogLog	"고유 원소 수?"	12 KB 고정	~1%	DAU, 카디널리티
Count-Min Sketch	"X가 몇 번?"	small	overestimate	핫 키, DDoS
MinHash + LSH	"비슷한 집합?"	small	Jaccard 추정	중복 탐지
T-Digest	"분위수?"	small	분포 추정	latency 모니터링
Skip List	정렬된 셋	O(n)	100%	Redis Sorted Set

9.2 메모리 비교 (1억 고유 사용자)

방법	메모리
HashSet (UUID)	~3.6 GB
Bitmap (sequential ID)	12.5 MB
Bloom Filter (1% FP)	120 MB
HyperLogLog	12 KB

HyperLogLog가 30만 배 작습니다.

10. 실전 — Kafka의 Heavy Hitter 탐지

10.1 시나리오

Kafka 토픽에 초당 100만 메시지. "어떤 user_id가 가장 활발한가?" (스팸 탐지)

10.2 구현

from countminsketch import CountMinSketch
import heapq

# Count-Min Sketch + Top-K 힙
sketch = CountMinSketch(width=10000, depth=5)
top_k = []  # min-heap of (count, user_id)
TOP_K_SIZE = 100

def process_message(user_id):
    sketch.add(user_id)
    estimated_count = sketch.estimate(user_id)
    
    if len(top_k) < TOP_K_SIZE:
        heapq.heappush(top_k, (estimated_count, user_id))
    elif estimated_count > top_k[0][0]:
        heapq.heapreplace(top_k, (estimated_count, user_id))

# 메모리: ~200 KB로 1억 사용자 처리 가능

10.3 결과

전통적 HashMap = 수 GB → CMS = 200 KB (10,000배 절약).

약간의 부정확성을 받아들이면 메모리 효율이 폭발적입니다.

퀴즈

1. Bloom Filter는 왜 false negative가 없나요?

답: 원소를 삽입할 때 K개의 비트를 모두 1로 설정합니다. 같은 원소를 조회할 때 K개의 같은 비트를 검사하면 반드시 모두 1입니다. 따라서 "없음"이라고 답하는 것은 비트가 0이라는 의미 = 그 원소는 절대 삽입된 적 없음. False negative 불가능. 단, 다른 원소들이 우연히 같은 비트를 1로 만들면 false positive 발생.

2. HyperLogLog가 12 KB로 수십억을 처리할 수 있는 이유는?

답: HyperLogLog는 원소 자체를 저장하지 않습니다. 대신 해시의 선행 0 개수의 분포를 추적합니다. 1억 원소든 1조 원소든 같은 분포 정보를 저장하므로 메모리는 일정합니다. 12 KB 안에 1024개의 6비트 버킷이 들어가고, 이걸로 평균 0.81% 오차의 카디널리티 추정이 가능합니다. 메모리는 데이터 크기와 무관하다는 것이 핵심.

3. Count-Min Sketch가 항상 overestimate인 이유는?

답: 여러 원소가 같은 카운터 위치에 충돌할 수 있습니다. 충돌 시 카운터가 여러 원소의 합이 되어 실제보다 큰 값. 최솟값을 취해도 충돌의 영향이 남아있을 수 있어 항상 ≥ 진짜 값입니다. 반대로 underestimate는 불가능 — 어떤 위치도 진짜 카운트보다 작을 수 없기 때문입니다.

4. Cuckoo Filter가 Bloom Filter보다 좋은 점은?

답: (1) 삭제 가능 — Bloom Filter는 비트를 0으로 못 만듭니다 (다른 원소 영향), Cuckoo는 안전하게 삭제. (2) 더 낮은 false positive율 — 같은 메모리에서 더 정확. (3) 더 좋은 캐시 성능 — 메모리 접근 패턴이 더 효율적. 단점: 구현이 조금 더 복잡, 만수 시 (capacity 초과) 삽입 실패 가능.

5. p99 latency 측정에 T-Digest가 유리한 이유는?

답: 전통적 방법은 모든 latency 값을 저장한 후 정렬 → 1억 요청이면 800 MB+ 메모리. T-Digest는 분포의 "스케치"를 수 KB로 저장합니다. 핵심 트릭: 꼬리(p99, p99.9)에 더 많은 buckets, 중간(p50)에 적게. 모니터링은 보통 꼬리에 관심이 있으므로 정확도 손실 없이 메모리 절약. Datadog, Prometheus 등이 사용합니다.

참고 자료

Bloom Filter 원논문 — Burton Bloom (1970)
HyperLogLog 원논문 — Flajolet et al.
Count-Min Sketch 원논문 — Cormode & Muthukrishnan
Cuckoo Filter — Fan et al.
Probabilistic Data Structures and Algorithms — Andrii Gakhov 책
Redis HyperLogLog
RocksDB Bloom Filter
Apache DataSketches — 고품질 구현
stream-lib — Java 라이브러리
t-digest — Ted Dunning 구현
pdsa.gakhov.com — 시각화 + 시뮬레이션

The Complete Guide to Probabilistic Data Structures 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, and Real-World Usage

TL;DR

Essence of probabilistic data structures: Trade 100% accuracy for extreme savings in memory/speed
Bloom Filter: "This element may be in the set" — false positive O, false negative X. Core of RocksDB, Cassandra, BigTable
HyperLogLog: Count unique elements among billions with 12KB. Used by Redis PFADD/PFCOUNT, BigQuery, Druid
Count-Min Sketch: Estimate element frequency in streams. Hot-key detection, DDoS detection
Tradeoff: accuracy vs memory vs speed — explosive efficiency when exact answers are unnecessary

1. What Are Probabilistic Data Structures?

1.1 Core Idea

"If you don't need a 100% accurate answer, you can save memory and speed extremely."

Traditional data structures:

Storing 1M elements in a HashMap → about 50 MB
"Does this element exist?" → exact answer

Probabilistic data structures:

Storing 1M elements in a Bloom Filter → about 1 MB (50x savings!)
"Does this element exist?" → "maybe" or "definitely not"

Tradeoff: 50x memory savings + a small false positive rate (1%).

1.2 Why Do We Need Them?

In large-scale systems:

Memory is expensive — RAM is 100x+ more expensive than disk
Must fit in L1/L2 cache to be fast → smaller is better
Billions of elements — HashMap becomes infeasible
Distributed systems — smaller data is more network-efficient

1.3 Cases Where Accuracy Is Unnecessary

Problem	Need Exact Answer?	Probabilistic OK?
User authentication	Yes	No
"Have I seen this URL before?"	No	Yes
Daily active users count	No (roughly 10M is enough)	Yes
Hot page detection	No	Yes
Cache pre-check	No	Yes

2. Bloom Filter — The Most Famous Probabilistic Data Structure

2.1 Principle

Bit array + multiple hash functions.

Insertion:

Pass the element through K hash functions → K indices
Set the bit at each index to 1

Query:

Same K hash functions
If all bits are 1 → "maybe present"
If any bit is 0 → "definitely not present"

2.2 Visualization

Bit array (8 bits):  [0, 0, 0, 0, 0, 0, 0, 0]

Insert "apple" (hash → [1, 4]):
                  [0, 1, 0, 0, 1, 0, 0, 0]

Insert "banana" (hash → [3, 6]):
                  [0, 1, 0, 1, 1, 0, 1, 0]

Query "apple" (hash → [1, 4]): both 1 → "maybe present" OK
Query "grape" (hash → [2, 5]): has a 0 → "definitely not present" NO
Query "cherry" (hash → [1, 6]): coincidentally both 1 → "maybe present" (false positive)

2.3 Computing the False Positive Rate

Formula:

P(fp) = (1 - e^(-k*n/m))^k

n: number of elements
m: number of bits
k: number of hash functions

Optimal k:

k = (m/n) * ln(2)

If you target 1% false positive:

m/n ≈ 9.6 bits
k ≈ 7

1M elements = 1.2 MB + 7 hash functions = 50x smaller than HashMap.

2.4 Python Implementation

import hashlib
import math

class BloomFilter:
    def __init__(self, n, fp_rate):
        self.size = -int(n * math.log(fp_rate) / (math.log(2) ** 2))
        self.hash_count = int(self.size / n * math.log(2))
        self.bits = [0] * self.size

    def _hashes(self, item):
        for i in range(self.hash_count):
            h = hashlib.md5(f"{item}{i}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def contains(self, item):
        return all(self.bits[h] for h in self._hashes(item))

# Usage
bf = BloomFilter(n=1000000, fp_rate=0.01)
bf.add("alice@example.com")
print(bf.contains("alice@example.com"))  # True
print(bf.contains("bob@example.com"))    # Almost always False

2.5 Limitations of Bloom Filters

No deletion — setting bits to 0 would affect other elements
Hard to resize — must create a new filter and reinsert all elements
No count — cannot know how many elements are inside

Solutions: Counting Bloom Filter, Cuckoo Filter (see below).

2.6 Real-World Usage of Bloom Filters

RocksDB / Cassandra / BigTable

In LSM-tree-based DBs, there is a Bloom Filter per SSTable:

Query: "key123"

SSTable 1 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 2 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 3 Bloom Filter: "maybe present"         → read disk (actual search)

Effect: 99% reduction in disk I/O. Entire SSTables are never read.

CDN Cache

Request: GET /image/123.jpg

Bloom Filter: "not in cache" → go straight to origin (skip cache lookup)
              "maybe"        → cache lookup

Effect: Saves time on cache miss.

Password Leak Checks

HaveIBeenPwned's 100M+ leaked password database → clients can check locally (download the Bloom Filter).

Chrome Malicious URL Check

Google's known malicious URL database → sent to clients as a Bloom Filter → browser checks immediately.

3. Counting Bloom Filter

3.1 Solving the Problem

A basic Bloom Filter does not support deletion. A Counting Bloom Filter uses counters instead of bits:

Array: [0, 0, 0, 0, 0, 0, 0, 0]  (each cell is a 4-bit counter)

Insert "apple" ([1, 4]):
        [0, 1, 0, 0, 1, 0, 0, 0]

Insert "apple" again:
        [0, 2, 0, 0, 2, 0, 0, 0]

Delete "apple":
        [0, 1, 0, 0, 1, 0, 0, 0]

3.2 Drawbacks

4x the memory — 1 bit → 4 bits (to prevent overflow)
Still has false positives

3.3 Cuckoo Filter (Alternative)

The Cuckoo Filter is a superior alternative to the Bloom Filter:

Supports deletion
Lower false positive rate
Comparable memory efficiency
Based on "Cuckoo hashing"

# pycuckoofilter example
from cuckoofilter import CuckooFilter

cf = CuckooFilter(capacity=1000000, error_rate=0.01)
cf.insert("alice")
print(cf.contains("alice"))  # True
cf.delete("alice")  # Possible!

When Cuckoo > Bloom: when deletion is needed, or when a lower false positive rate is required.

4. HyperLogLog — Cardinality Estimation

4.1 The Problem

"How many unique visitors came to our site today?"

Traditional method: Store all visitor IDs in a HashSet → 10M visitors = 800 MB.

HyperLogLog: Estimate up to billions of elements with 12 KB. 1% error.

4.2 Basic Idea

Hash each element → track the number of leading zeros in the binary representation.

"alice" → hash → 00000010110...  (5 leading zeros)
"bob"   → hash → 00100110...     (2 leading zeros)
"carol" → hash → 0001110...      (3 leading zeros)

Observation: The probability of K leading zeros in a random hash is 1/2^K.

→ "We saw at most 5 leading zeros" → estimated about 2^5 = 32 unique elements.

Reality: Single measurements are inaccurate. Split across multiple buckets and average → improved accuracy. This is HyperLogLog.

4.3 Accuracy

Elements	Memory	Standard Error
1,000	12 KB	0.81%
1,000,000	12 KB	0.81%
1,000,000,000	12 KB	0.81%

Key point: Memory is independent of the number of elements! 100M elements and 1M elements take the same memory.

4.4 Redis HyperLogLog

PFADD visitors:2025-04-15 "user1" "user2" "user3"
PFADD visitors:2025-04-15 "user1"  # duplicates ignored
PFCOUNT visitors:2025-04-15  # about 3 (could be 2 or 3)

# Merge multiple HLLs
PFMERGE visitors:week visitors:2025-04-14 visitors:2025-04-15

4.5 Real-World Usage of HyperLogLog

Daily Active Users (DAU)

import redis

r = redis.Redis()

def track_user(user_id):
    today = datetime.now().strftime("%Y-%m-%d")
    r.pfadd(f"dau:{today}", user_id)

def get_dau(date):
    return r.pfcount(f"dau:{date}")

Hundreds of millions of users can be handled with 12 KB.

Unique IP Count (DDoS Analysis)

PFADD unique_ips:2025-04-15 "1.2.3.4" "5.6.7.8"
PFCOUNT unique_ips:2025-04-15

Big Data — BigQuery, Druid

SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

Uses HyperLogLog internally. 100x+ faster than exact COUNT(DISTINCT).

5. Count-Min Sketch — Frequency Estimation

5.1 The Problem

In a stream, "how many times has each element appeared?"

Traditional: HashMap → memory explosion.

Count-Min Sketch: Estimate frequencies with small memory.

5.2 Principle

2D array (d × w) + d hash functions.

Insert:

Compute an index with each hash function
Increment the corresponding counter

Query:

Same hash functions
Return the minimum counter value among them (hence "Count-Min")

5.3 Visualization

Array (3 x 5):
[0, 0, 0, 0, 0]  ← hash1
[0, 0, 0, 0, 0]  ← hash2
[0, 0, 0, 0, 0]  ← hash3

Insert "apple" 5 times:
hash1("apple") = 1 → counter [0, 5, 0, 0, 0]
hash2("apple") = 3 → counter [0, 0, 0, 5, 0]
hash3("apple") = 0 → counter [5, 0, 0, 0, 0]

Insert "banana" 2 times:
hash1("banana") = 4 → [0, 5, 0, 0, 2]
hash2("banana") = 1 → [0, 2, 0, 5, 0]
hash3("banana") = 2 → [5, 0, 2, 0, 0]

Query "apple": min(5, 5, 5) = 5 OK
Query "banana": min(2, 2, 2) = 2 OK

On collision: if another element collides, the counter may be inflated, but since we take the minimum, it is always >= the true count (overestimate).

5.4 Accuracy

Error ε, confidence 1-δ
Required size:
  w = ⌈e/ε⌉
  d = ⌈ln(1/δ)⌉

Example: 1% error, 99% confidence → w=272, d=5 → 1360 counters = 5 KB. Can handle billions of elements.

5.5 Real-World Usage

Heavy Hitters (Hot Key Detection)

# Find the most frequent keys
sketch = CountMinSketch(width=272, depth=5)

for query in stream:
    sketch.add(query)

# Top-K discovery (uses a separate min-heap)

Alternative to Memcached LRU

The tinylfu cache uses Count-Min Sketch to track frequency → keeps only frequently used keys in cache.

DDoS Detection

Estimate the request count per IP → block when exceeding a threshold.

Apache Spark, Flink

Frequency analysis in large-scale stream processing.

6. MinHash — Jaccard Similarity

6.1 The Problem

"How similar are two sets?" (Jaccard similarity)

J(A, B) = |A ∩ B| / |A ∪ B|

Traditional method: compare all elements → O(|A| × |B|).

6.2 MinHash Principle

The probability that the "minimum hash value" of each set is equal = Jaccard similarity.

def minhash(set_, num_hashes=128):
    return [min(hash(item, seed=i) for item in set_) for i in range(num_hashes)]

# Compare MinHashes of two sets
mh1 = minhash(set1)
mh2 = minhash(set2)

similarity = sum(a == b for a, b in zip(mh1, mh2)) / len(mh1)
# similarity ≈ Jaccard(set1, set2)

6.3 Usage

Duplicate Document Detection

Finding near-identical pages among billions of web pages. Used by Google.

Plagiarism Detection

Comparison of papers and code.

Recommendation Systems

Finding "similar users".

Clustering

Combined with LSH (Locality Sensitive Hashing) to group similar items.

7. Skip List — Probabilistic Balanced Tree

7.1 Sorted Data Structure

A probabilistic alternative to balanced binary trees (AVL, Red-Black).

7.2 Principle

Multi-level linked list. Upper levels are sparse, lower levels dense.

Level 3:  1 ───────────────→ 9
Level 2:  1 ────→ 4 ────→ 7 ─→ 9
Level 1:  1 → 3 → 4 → 5 → 7 → 8 → 9

Search: start from the top, drop a level if too large.

Level on insert: decided by coin flip. 50% chance of going to the next level.

7.3 Performance

Operation	Average	Worst
Search	O(log n)	O(n)
Insert	O(log n)	O(n)
Delete	O(log n)	O(n)

Advantages: simpler to implement than balanced trees, easy lock-free concurrency.

7.4 Usage

Redis Sorted Set: internal structure of ZADD/ZRANGE
LevelDB/RocksDB memtable
Java ConcurrentSkipListMap

8. T-Digest — Quantile Estimation

8.1 The Problem

How do you compute quantiles like p99 latency in a large-scale stream?

Traditional: store all values then sort → memory explosion.

T-Digest: maintain a "sketch" of the distribution with small memory.

8.2 Principle

Group values into buckets. More buckets at the distribution's tail (p99, p99.9), fewer in the middle (p50).

Rationale: we usually care about the tail (tail latency). The median can be approximate.

8.3 Usage

from tdigest import TDigest

t = TDigest()
for value in stream:
    t.update(value)

p50 = t.percentile(50)
p95 = t.percentile(95)
p99 = t.percentile(99)
p999 = t.percentile(99.9)

8.4 Usage

Monitoring: Prometheus, Datadog, New Relic
Databases: Druid, Apache Pinot
Stream processing: Apache Beam, Flink

9. Tradeoff Comparison

9.1 Comparison of All Data Structures

Structure	Question it Answers	Memory	Accuracy	Usage
HashSet	Exact membership	O(n)	100%	small data
Bloom Filter	"Maybe present?"	O(n) very small	FP O, FN X	cache pre-check
Cuckoo Filter	Membership + deletion	O(n) very small	Lower FP	Bloom alternative
Counter	Exact count	O(n)	100%	small data
HyperLogLog	"Number of unique elements?"	12 KB fixed	~1%	DAU, cardinality
Count-Min Sketch	"How many times X?"	small	overestimate	hot keys, DDoS
MinHash + LSH	"Similar sets?"	small	Jaccard estimate	duplicate detection
T-Digest	"Quantiles?"	small	distribution estimate	latency monitoring
Skip List	Sorted set	O(n)	100%	Redis Sorted Set

9.2 Memory Comparison (100M Unique Users)

Method	Memory
HashSet (UUID)	~3.6 GB
Bitmap (sequential ID)	12.5 MB
Bloom Filter (1% FP)	120 MB
HyperLogLog	12 KB

HyperLogLog is 300,000x smaller.

10. Real-World — Heavy Hitter Detection in Kafka

10.1 Scenario

1M messages/sec on a Kafka topic. "Which user_id is most active?" (spam detection)

10.2 Implementation

from countminsketch import CountMinSketch
import heapq

# Count-Min Sketch + Top-K heap
sketch = CountMinSketch(width=10000, depth=5)
top_k = []  # min-heap of (count, user_id)
TOP_K_SIZE = 100

def process_message(user_id):
    sketch.add(user_id)
    estimated_count = sketch.estimate(user_id)
    
    if len(top_k) < TOP_K_SIZE:
        heapq.heappush(top_k, (estimated_count, user_id))
    elif estimated_count > top_k[0][0]:
        heapq.heapreplace(top_k, (estimated_count, user_id))

# Memory: ~200 KB handles 100M users

10.3 Result

Traditional HashMap = several GB → CMS = 200 KB (10,000x savings).

Accepting a bit of inaccuracy yields explosive memory efficiency.

Quiz

1. Why does the Bloom Filter have no false negatives?

Answer: When inserting an element, all K bits are set to 1. When querying the same element, the same K bits are checked and they must all be 1. Therefore, answering "not present" means some bit was 0, meaning the element was never inserted. False negatives are impossible. However, other elements coincidentally setting the same bits to 1 causes false positives.

2. Why can HyperLogLog handle billions with 12 KB?

Answer: HyperLogLog does not store the elements themselves. Instead, it tracks the distribution of the number of leading zeros in the hashes. Whether it is 100M or 1T elements, the same distribution information is stored, so memory is constant. 12 KB holds 1024 six-bit buckets, enabling cardinality estimation with an average error of 0.81%. The key is that memory is independent of data size.

3. Why does Count-Min Sketch always overestimate?

Answer: Multiple elements can collide at the same counter position. On collision, the counter becomes the sum of multiple elements' counts, yielding a larger-than-real value. Even taking the minimum may leave collision effects, so the result is always >= the true value. Conversely, underestimates are impossible because no position can be smaller than the true count.

4. What are the advantages of the Cuckoo Filter over the Bloom Filter?

Answer: (1) Deletable — the Bloom Filter cannot turn bits back to 0 (affects other elements), while the Cuckoo Filter can safely delete. (2) Lower false positive rate — more accurate at the same memory. (3) Better cache performance — more efficient memory access patterns. Downsides: slightly more complex to implement, and insertion can fail when saturated (capacity exceeded).

5. Why is T-Digest advantageous for p99 latency measurement?

Answer: Traditional methods store all latency values then sort → 800 MB+ memory for 100M requests. T-Digest stores a "sketch" of the distribution in a few KB. The key trick: more buckets at the tail (p99, p99.9), fewer in the middle (p50). Since monitoring usually focuses on the tail, memory is saved without loss of accuracy where it matters. Used by Datadog, Prometheus, and others.

References

Bloom Filter Original Paper — Burton Bloom (1970)
HyperLogLog Original Paper — Flajolet et al.
Count-Min Sketch Original Paper — Cormode & Muthukrishnan
Cuckoo Filter — Fan et al.
Probabilistic Data Structures and Algorithms — Andrii Gakhov (book)
Redis HyperLogLog
RocksDB Bloom Filter
Apache DataSketches — high-quality implementations
stream-lib — Java library
t-digest — Ted Dunning's implementation
pdsa.gakhov.com — visualizations + simulations

확률적 자료구조 완전 가이드 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, 실전 활용

TL;DR

1. 확률적 자료구조란?

1.1 핵심 아이디어

1.2 왜 필요한가?

1.3 정확도가 필요 없는 경우

2. Bloom Filter — 가장 유명한 확률적 자료구조

2.1 원리

2.2 시각화

2.3 false positive율 계산

2.4 Python 구현

2.5 Bloom Filter의 한계

2.6 Bloom Filter의 실전 활용

RocksDB / Cassandra / BigTable

CDN 캐시

비밀번호 누출 검사

Chrome 악성 URL 검사

3. Counting Bloom Filter

3.1 문제 해결

3.2 단점

3.3 Cuckoo Filter (대안)

4. HyperLogLog — 카디널리티 추정

4.1 문제

4.2 기본 아이디어

4.3 정확도

4.4 Redis HyperLogLog

4.5 HyperLogLog의 실전 활용

일일 활성 사용자 (DAU)

고유 IP 카운트 (DDoS 분석)

Big Data — BigQuery, Druid

5. Count-Min Sketch — 빈도 추정

5.1 문제

5.2 원리

5.3 시각화

5.4 정확도

5.5 실전 활용

핫 키 탐지 (Heavy Hitters)

Memcached의 LRU 대안

DDoS 감지

Apache Spark, Flink

6. MinHash — Jaccard 유사도

6.1 문제

6.2 MinHash 원리

6.3 활용

중복 문서 탐지

표절 탐지

추천 시스템

클러스터링

7. Skip List — 확률적 균형 트리

7.1 정렬된 자료구조

7.2 원리

7.3 성능

7.4 활용

8. T-Digest — 분위수 추정

8.1 문제

8.2 원리

8.3 사용

8.4 활용

9. 트레이드오프 비교

9.1 모든 자료구조 비교

9.2 메모리 비교 (1억 고유 사용자)

10. 실전 — Kafka의 Heavy Hitter 탐지

10.1 시나리오

10.2 구현

10.3 결과

퀴즈

참고 자료

The Complete Guide to Probabilistic Data Structures 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, and Real-World Usage

TL;DR

1. What Are Probabilistic Data Structures?

1.1 Core Idea

1.2 Why Do We Need Them?

1.3 Cases Where Accuracy Is Unnecessary

2. Bloom Filter — The Most Famous Probabilistic Data Structure

2.1 Principle

2.2 Visualization

2.3 Computing the False Positive Rate

2.4 Python Implementation

2.5 Limitations of Bloom Filters

2.6 Real-World Usage of Bloom Filters