The Complete Guide to Probabilistic Data Structures 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, and Real-World Usage
TL;DR

  • Essence of probabilistic data structures: Trade 100% accuracy for extreme savings in memory/speed
  • Bloom Filter: "This element may be in the set" — false positive O, false negative X. Core of RocksDB, Cassandra, BigTable
  • HyperLogLog: Count unique elements among billions with 12KB. Used by Redis PFADD/PFCOUNT, BigQuery, Druid
  • Count-Min Sketch: Estimate element frequency in streams. Hot-key detection, DDoS detection
  • Tradeoff: accuracy vs memory vs speed — explosive efficiency when exact answers are unnecessary

1. What Are Probabilistic Data Structures?

1.1 Core Idea

"If you don't need a 100% accurate answer, you can save memory and speed extremely."

Traditional data structures:

  • Storing 1M elements in a HashMap → about 50 MB
  • "Does this element exist?" → exact answer

Probabilistic data structures:

  • Storing 1M elements in a Bloom Filter → about 1 MB (50x savings!)
  • "Does this element exist?" → "maybe" or "definitely not"

Tradeoff: 50x memory savings + a small false positive rate (1%).

1.2 Why Do We Need Them?

In large-scale systems:

  • Memory is expensive — RAM is 100x+ more expensive than disk
  • Must fit in L1/L2 cache to be fast → smaller is better
  • Billions of elements — HashMap becomes infeasible
  • Distributed systems — smaller data is more network-efficient

1.3 Cases Where Accuracy Is Unnecessary

| Problem                        | Need Exact Answer?           | Probabilistic OK? |
|--------------------------------|------------------------------|-------------------|
| User authentication            | Yes                          | No                |
| "Have I seen this URL before?" | No                           | Yes               |
| Daily active users count       | No (roughly 10M is enough)   | Yes               |
| Hot page detection             | No                           | Yes               |
| Cache pre-check                | No                           | Yes               |

2. Bloom Filter — The Most Famous Probabilistic Data Structure

2.1 Principle

Bit array + multiple hash functions.

Insertion:

  1. Pass the element through K hash functions → K indices
  2. Set the bit at each index to 1

Query:

  1. Same K hash functions
  2. If all bits are 1 → "maybe present"
  3. If any bit is 0 → "definitely not present"

2.2 Visualization

Bit array (8 bits):  [0, 0, 0, 0, 0, 0, 0, 0]

Insert "apple" (hash → [1, 4]):
                  [0, 1, 0, 0, 1, 0, 0, 0]

Insert "banana" (hash → [3, 6]):
                  [0, 1, 0, 1, 1, 0, 1, 0]

Query "apple" (hash → [1, 4]): both bits are 1 → "maybe present" OK
Query "grape" (hash → [2, 5]): a bit is 0 → "definitely not present" NO
Query "cherry" (hash → [1, 6]): coincidentally both 1 → "maybe present" (false positive)

2.3 Computing the False Positive Rate

Formula:

P(fp) = (1 - e^(-k*n/m))^k
  • n: number of elements
  • m: number of bits
  • k: number of hash functions

Optimal k:

k = (m/n) * ln(2)

If you target 1% false positive:

m/n ≈ 9.6 bits
k ≈ 7

1M elements = 1.2 MB + 7 hash functions = 50x smaller than HashMap.
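
These formulas are easy to check numerically. A small helper (function name is illustrative):

```python
import math

def bloom_params(n, fp_rate):
    """Standard Bloom filter sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2."""
    m = math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_params(1_000_000, 0.01)
print(m / 1_000_000)          # ~9.59 bits per element
print(k)                      # 7 hash functions
print(m / 8 / 1024 / 1024)    # ~1.14 MB total
```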

2.4 Python Implementation

import hashlib
import math

class BloomFilter:
    def __init__(self, n, fp_rate):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
        self.size = -int(n * math.log(fp_rate) / (math.log(2) ** 2))
        self.hash_count = round(self.size / n * math.log(2))
        self.bits = [0] * self.size

    def _hashes(self, item):
        for i in range(self.hash_count):
            h = hashlib.md5(f"{item}{i}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def contains(self, item):
        return all(self.bits[h] for h in self._hashes(item))

# Usage
bf = BloomFilter(n=1000000, fp_rate=0.01)
bf.add("alice@example.com")
print(bf.contains("alice@example.com"))  # True
print(bf.contains("bob@example.com"))    # Almost always False

2.5 Limitations of Bloom Filters

  • No deletion — setting bits to 0 would affect other elements
  • Hard to resize — must create a new filter and reinsert all elements
  • No count — cannot know how many elements are inside

Solutions: Counting Bloom Filter, Cuckoo Filter (see below).

2.6 Real-World Usage of Bloom Filters

RocksDB / Cassandra / BigTable

In LSM-tree-based DBs, there is a Bloom Filter per SSTable:

Query: "key123"

SSTable 1 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 2 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 3 Bloom Filter: "maybe present"         → read disk (actual search)

Effect: up to 99% reduction in disk I/O — SSTables that cannot contain the key are never read from disk.

CDN Cache

Request: GET /image/123.jpg

Bloom Filter: "not in cache" → go straight to origin (skip cache lookup)
              "maybe"        → cache lookup

Effect: Saves time on cache miss.

Password Leak Checks

HaveIBeenPwned's 100M+ leaked password database → clients can check locally (download the Bloom Filter).

Chrome Malicious URL Check

Google's known malicious URL database → sent to clients as a Bloom Filter → browser checks immediately.


3. Counting Bloom Filter

3.1 Solving the Problem

A basic Bloom Filter does not support deletion. A Counting Bloom Filter uses counters instead of bits:

Array: [0, 0, 0, 0, 0, 0, 0, 0]  (each cell is a 4-bit counter)

Insert "apple" ([1, 4]):
        [0, 1, 0, 0, 1, 0, 0, 0]

Insert "apple" again:
        [0, 2, 0, 0, 2, 0, 0, 0]

Delete "apple":
        [0, 1, 0, 0, 1, 0, 0, 0]
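
The counter scheme above can be sketched in a few lines. This is a toy version (real implementations pack 4-bit counters; here plain Python ints are used):

```python
import hashlib

class CountingBloomFilter:
    """Counters instead of bits, so deletion becomes possible."""
    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.counters = [0] * size

    def _hashes(self, item):
        for i in range(self.hash_count):
            h = hashlib.md5(f"{item}{i}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.counters[h] += 1

    def remove(self, item):
        # Only safe if the item was actually inserted
        for h in self._hashes(item):
            if self.counters[h] > 0:
                self.counters[h] -= 1

    def contains(self, item):
        return all(self.counters[h] > 0 for h in self._hashes(item))

cbf = CountingBloomFilter()
cbf.add("apple")
cbf.add("apple")
cbf.remove("apple")
print(cbf.contains("apple"))  # True — one insertion remains
```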

3.2 Drawbacks

  • 4x the memory — 1 bit → 4 bits (to prevent overflow)
  • Still has false positives

3.3 Cuckoo Filter (Alternative)

The Cuckoo Filter is a superior alternative to the Bloom Filter:

  • Supports deletion
  • Lower false positive rate
  • Comparable memory efficiency
  • Based on "Cuckoo hashing"

# Illustrative example — constructor and method names vary by library
from cuckoofilter import CuckooFilter

cf = CuckooFilter(capacity=1000000, error_rate=0.01)
cf.insert("alice")
print(cf.contains("alice"))  # True
cf.delete("alice")  # Deletion is supported!

When Cuckoo > Bloom: when deletion is needed, or when a lower false positive rate is required.


4. HyperLogLog — Cardinality Estimation

4.1 The Problem

"How many unique visitors came to our site today?"

Traditional method: Store all visitor IDs in a HashSet → 10M visitors = 800 MB.

HyperLogLog: Estimate up to billions of elements with 12 KB. 1% error.

4.2 Basic Idea

Hash each element → track the number of leading zeros in the binary representation.

"alice" → hash → 0000010110...  (5 leading zeros)
"bob"   → hash → 00100110...    (2 leading zeros)
"carol" → hash → 0001110...     (3 leading zeros)

Observation: The probability of K leading zeros in a random hash is 1/2^K.

→ "We saw at most 5 leading zeros" → estimated about 2^5 = 32 unique elements.

Reality: A single measurement is noisy. Split the elements across many buckets and combine the per-bucket estimates (HyperLogLog uses a harmonic mean) → much better accuracy. This is HyperLogLog.
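
The bucket-and-combine idea can be sketched as follows. This is a simplified toy (real implementations add bias and small-range corrections, and Redis uses 16,384 registers rather than the 1,024 here):

```python
import hashlib

class HyperLogLog:
    """Minimal sketch: one max-rank register per bucket, harmonic-mean combine."""
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b              # 2^10 = 1024 buckets
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        bucket = h & (self.m - 1)    # low b bits select the bucket
        rest = h >> self.b
        # Rank = position of the first 1-bit (geometric, like leading zeros)
        rank = 1
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return int(alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user{i}")
print(hll.count())  # roughly 100,000 (a few percent error at 1024 buckets)
```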

4.3 Accuracy

| Elements      | Memory | Standard Error |
|---------------|--------|----------------|
| 1,000         | 12 KB  | 0.81%          |
| 1,000,000     | 12 KB  | 0.81%          |
| 1,000,000,000 | 12 KB  | 0.81%          |

Key point: Memory is independent of the number of elements! 100M elements and 1M elements take the same memory.

4.4 Redis HyperLogLog

PFADD visitors:2025-04-15 "user1" "user2" "user3"
PFADD visitors:2025-04-15 "user1"  # duplicates ignored
PFCOUNT visitors:2025-04-15  # 3 (estimation error only matters at large cardinalities)

# Merge multiple HLLs
PFMERGE visitors:week visitors:2025-04-14 visitors:2025-04-15

4.5 Real-World Usage of HyperLogLog

Daily Active Users (DAU)

import redis
from datetime import datetime

r = redis.Redis()

def track_user(user_id):
    today = datetime.now().strftime("%Y-%m-%d")
    r.pfadd(f"dau:{today}", user_id)

def get_dau(date):
    return r.pfcount(f"dau:{date}")

Hundreds of millions of users can be handled with 12 KB.

Unique IP Count (DDoS Analysis)

PFADD unique_ips:2025-04-15 "1.2.3.4" "5.6.7.8"
PFCOUNT unique_ips:2025-04-15

Big Data — BigQuery, Druid

SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

Uses HyperLogLog internally. 100x+ faster than exact COUNT(DISTINCT).


5. Count-Min Sketch — Frequency Estimation

5.1 The Problem

In a stream, "how many times has each element appeared?"

Traditional: HashMap → memory explosion.

Count-Min Sketch: Estimate frequencies with small memory.

5.2 Principle

2D array (d × w) + d hash functions.

Insert:

  • Compute an index with each hash function
  • Increment the corresponding counter

Query:

  • Same hash functions
  • Return the minimum counter value among them (hence "Count-Min")

5.3 Visualization

Array (3 x 5):
[0, 0, 0, 0, 0]  ← hash1
[0, 0, 0, 0, 0]  ← hash2
[0, 0, 0, 0, 0]  ← hash3

Insert "apple" 5 times:
hash1("apple") = 1 → counter [0, 5, 0, 0, 0]
hash2("apple") = 3 → counter [0, 0, 0, 5, 0]
hash3("apple") = 0 → counter [5, 0, 0, 0, 0]

Insert "banana" 2 times:
hash1("banana") = 4 → [0, 5, 0, 0, 2]
hash2("banana") = 1 → [0, 2, 0, 5, 0]
hash3("banana") = 2 → [5, 0, 2, 0, 0]

Query "apple": min(5, 5, 5) = 5 OK
Query "banana": min(2, 2, 2) = 2 OK

On collision: if another element collides, the counter may be inflated, but since we take the minimum, it is always >= the true count (overestimate).
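
The d × w structure described above fits in a short class. A minimal sketch, using MD5-derived per-row hashes (real implementations use faster pairwise-independent hash families):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; query returns the minimum across rows."""
    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Minimum across rows: always >= the true count (overestimate only)
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.add("apple")
cms.add("banana", 2)
print(cms.estimate("apple"))   # 5
print(cms.estimate("banana"))  # 2
```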

5.4 Accuracy

Error ε, confidence 1-δ
Required size:
  w = ⌈e/ε⌉
  d = ⌈ln(1/δ)⌉

Example: 1% error, 99% confidence → w=272, d=5 → 1360 counters = 5 KB. Can handle billions of elements.

5.5 Real-World Usage

Heavy Hitters (Hot Key Detection)

# Find the most frequent keys
sketch = CountMinSketch(width=272, depth=5)

for query in stream:
    sketch.add(query)

# Top-K discovery (uses a separate min-heap)

Alternative to Memcached LRU

The TinyLFU admission policy (used by caches such as Caffeine) uses a Count-Min Sketch to track access frequency → only frequently used keys stay in the cache.

DDoS Detection

Estimate the request count per IP → block when exceeding a threshold.

Frequency analysis in large-scale stream processing.


6. MinHash — Jaccard Similarity

6.1 The Problem

"How similar are two sets?" (Jaccard similarity)

J(A, B) = |A ∩ B| / |A ∪ B|

Traditional method: compare all elements → O(|A| × |B|).

6.2 MinHash Principle

The probability that the "minimum hash value" of each set is equal = Jaccard similarity.

import hashlib

def seeded_hash(item, seed):
    # Python's built-in hash() takes no seed, so derive one hash per seed
    return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)

def minhash(set_, num_hashes=128):
    return [min(seeded_hash(item, i) for item in set_) for i in range(num_hashes)]

# Compare MinHashes of two sets
mh1 = minhash(set1)
mh2 = minhash(set2)

similarity = sum(a == b for a, b in zip(mh1, mh2)) / len(mh1)
# similarity ≈ Jaccard(set1, set2)

6.3 Usage

Duplicate Document Detection

Finding near-identical pages among billions of web pages. Used by Google.

Plagiarism Detection

Comparison of papers and code.

Recommendation Systems

Finding "similar users".

Clustering

Combined with LSH (Locality Sensitive Hashing) to group similar items.


7. Skip List — Probabilistic Balanced Tree

7.1 Sorted Data Structure

A probabilistic alternative to balanced binary trees (AVL, Red-Black).

7.2 Principle

Multi-level linked list. Upper levels are sparse, lower levels dense.

Level 3:  1 ───────────────→ 9
Level 2:  1 ────→ 4 ────→ 7 ─→ 9
Level 1:  1 → 3 → 4 → 5 → 7 → 8 → 9

Search: start at the top level and move right; when the next node's value is too large, drop down a level.

Level on insert: decided by coin flip. 50% chance of going to the next level.
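
The coin-flip promotion rule can be sketched as:

```python
import random

def random_level(max_level=16, p=0.5):
    """Coin-flip promotion: each additional level is reached with probability p."""
    level = 1
    while random.random() < p and level < max_level:
        level += 1
    return level

# Expected distribution: ~50% of nodes at level 1, ~25% at level 2, ...
levels = [random_level() for _ in range(100_000)]
print(sum(l == 1 for l in levels) / len(levels))  # ~0.5
```

This geometric distribution is what makes the expected search path O(log n) without any rebalancing.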

7.3 Performance

| Operation | Average  | Worst |
|-----------|----------|-------|
| Search    | O(log n) | O(n)  |
| Insert    | O(log n) | O(n)  |
| Delete    | O(log n) | O(n)  |

Advantages: simpler to implement than balanced trees, easy lock-free concurrency.

7.4 Usage

  • Redis Sorted Set: internal structure of ZADD/ZRANGE
  • LevelDB/RocksDB memtable
  • Java ConcurrentSkipListMap

8. T-Digest — Quantile Estimation

8.1 The Problem

How do you compute quantiles like p99 latency in a large-scale stream?

Traditional: store all values then sort → memory explosion.

T-Digest: maintain a "sketch" of the distribution with small memory.

8.2 Principle

Group values into buckets. More buckets at the distribution's tail (p99, p99.9), fewer in the middle (p50).

Rationale: we usually care about the tail (tail latency). The median can be approximate.

8.3 Example

from tdigest import TDigest

t = TDigest()
for value in stream:
    t.update(value)

p50 = t.percentile(50)
p95 = t.percentile(95)
p99 = t.percentile(99)
p999 = t.percentile(99.9)

8.4 Usage

  • Monitoring: Prometheus, Datadog, New Relic
  • Databases: Druid, Apache Pinot
  • Stream processing: Apache Beam, Flink

9. Tradeoff Comparison

9.1 Comparison of All Data Structures

| Structure        | Question it Answers          | Memory          | Accuracy              | Usage               |
|------------------|------------------------------|-----------------|-----------------------|---------------------|
| HashSet          | Exact membership             | O(n)            | 100%                  | small data          |
| Bloom Filter     | "Maybe present?"             | O(n) very small | FP O, FN X            | cache pre-check     |
| Cuckoo Filter    | Membership + deletion        | O(n) very small | Lower FP              | Bloom alternative   |
| Counter          | Exact count                  | O(n)            | 100%                  | small data          |
| HyperLogLog      | "Number of unique elements?" | 12 KB fixed     | ~1%                   | DAU, cardinality    |
| Count-Min Sketch | "How many times X?"          | small           | overestimate          | hot keys, DDoS      |
| MinHash + LSH    | "Similar sets?"              | small           | Jaccard estimate      | duplicate detection |
| T-Digest         | "Quantiles?"                 | small           | distribution estimate | latency monitoring  |
| Skip List        | Sorted set                   | O(n)            | 100%                  | Redis Sorted Set    |

9.2 Memory Comparison (100M Unique Users)

| Method                 | Memory  |
|------------------------|---------|
| HashSet (UUID)         | ~3.6 GB |
| Bitmap (sequential ID) | 12.5 MB |
| Bloom Filter (1% FP)   | 120 MB  |
| HyperLogLog            | 12 KB   |

HyperLogLog is 300,000x smaller.
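
The figures above follow from back-of-envelope arithmetic (the 36-bytes-per-UUID-entry figure is an assumption about string storage, ignoring hash-table overhead, which would make it even larger):

```python
# Rough memory estimates for 100M unique users
n = 100_000_000

hashset = n * 36        # ~36 bytes per UUID string (assumption)
bitmap = n / 8          # 1 bit per sequential ID
bloom = n * 9.6 / 8     # ~9.6 bits per element at 1% FP
hll = 12 * 1024         # fixed-size register array

print(f"HashSet ~{hashset / 1e9:.1f} GB")  # ~3.6 GB
print(f"Bitmap  ~{bitmap / 1e6:.1f} MB")   # ~12.5 MB
print(f"Bloom   ~{bloom / 1e6:.0f} MB")    # ~120 MB
print(f"HLL     ~{hll / 1024:.0f} KB")     # ~12 KB
```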


10. Real-World — Heavy Hitter Detection in Kafka

10.1 Scenario

1M messages/sec on a Kafka topic. "Which user_id is most active?" (spam detection)

10.2 Implementation

# Illustrative sketch — package and method names (add/estimate) vary by CMS library
from countminsketch import CountMinSketch
import heapq

# Count-Min Sketch + Top-K heap
sketch = CountMinSketch(width=10000, depth=5)
top_k = []  # min-heap of (count, user_id)
TOP_K_SIZE = 100

def process_message(user_id):
    sketch.add(user_id)
    estimated_count = sketch.estimate(user_id)

    if len(top_k) < TOP_K_SIZE:
        heapq.heappush(top_k, (estimated_count, user_id))
    elif estimated_count > top_k[0][0]:
        # Simplified: the same user_id can occupy several heap slots;
        # production code would deduplicate or update entries in place
        heapq.heapreplace(top_k, (estimated_count, user_id))

# Memory: ~200 KB handles 100M users

10.3 Result

Traditional HashMap = several GB → CMS = 200 KB (10,000x savings).

Accepting a bit of inaccuracy yields explosive memory efficiency.


Quiz

1. Why does the Bloom Filter have no false negatives?

Answer: When inserting an element, all K bits are set to 1. When querying the same element, the same K bits are checked and they must all be 1. Therefore, answering "not present" means some bit was 0, meaning the element was never inserted. False negatives are impossible. However, other elements coincidentally setting the same bits to 1 causes false positives.

2. Why can HyperLogLog handle billions with 12 KB?

Answer: HyperLogLog does not store the elements themselves. Instead, it tracks the distribution of leading-zero counts in the hashes. Whether there are 100M or 1T elements, the same distribution summary is stored, so memory is constant. Redis's 12 KB holds 16,384 six-bit registers, enabling cardinality estimation with a standard error of 0.81%. The key is that memory is independent of data size.

3. Why does Count-Min Sketch always overestimate?

Answer: Multiple elements can collide at the same counter position. On collision, the counter becomes the sum of multiple elements' counts, yielding a larger-than-real value. Even taking the minimum may leave collision effects, so the result is always >= the true value. Conversely, underestimates are impossible because no position can be smaller than the true count.

4. What are the advantages of the Cuckoo Filter over the Bloom Filter?

Answer: (1) Deletable — the Bloom Filter cannot turn bits back to 0 (affects other elements), while the Cuckoo Filter can safely delete. (2) Lower false positive rate — more accurate at the same memory. (3) Better cache performance — more efficient memory access patterns. Downsides: slightly more complex to implement, and insertion can fail when saturated (capacity exceeded).

5. Why is T-Digest advantageous for p99 latency measurement?

Answer: Traditional methods store all latency values then sort → 800 MB+ memory for 100M requests. T-Digest stores a "sketch" of the distribution in a few KB. The key trick: more buckets at the tail (p99, p99.9), fewer in the middle (p50). Since monitoring usually focuses on the tail, memory is saved without loss of accuracy where it matters. Used by Datadog, Prometheus, and others.

