The Complete Guide to Probabilistic Data Structures 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, and Real-World Usage
TL;DR

  • Essence of probabilistic data structures: Trade 100% accuracy for extreme savings in memory/speed
  • Bloom Filter: "This element may be in the set" — false positive O, false negative X. Core of RocksDB, Cassandra, BigTable
  • HyperLogLog: Count unique elements among billions with 12KB. Used by Redis PFADD/PFCOUNT, BigQuery, Druid
  • Count-Min Sketch: Estimate element frequency in streams. Hot-key detection, DDoS detection
  • Tradeoff: accuracy vs memory vs speed — explosive efficiency when exact answers are unnecessary

1. What Are Probabilistic Data Structures?

1.1 Core Idea

"If you don't need a 100% accurate answer, you can save memory and speed extremely."

Traditional data structures:

  • Storing 1M elements in a HashMap → about 50 MB
  • "Does this element exist?" → exact answer

Probabilistic data structures:

  • Storing 1M elements in a Bloom Filter → about 1 MB (50x savings!)
  • "Does this element exist?" → "maybe" or "definitely not"

Tradeoff: 50x memory savings + a small false positive rate (1%).

1.2 Why Do We Need Them?

In large-scale systems:

  • Memory is expensive — RAM is 100x+ more expensive than disk
  • Must fit in L1/L2 cache to be fast → smaller is better
  • Billions of elements — HashMap becomes infeasible
  • Distributed systems — smaller data is more network-efficient

1.3 Cases Where Accuracy Is Unnecessary

| Problem                        | Need Exact Answer?           | Probabilistic OK? |
|--------------------------------|------------------------------|-------------------|
| User authentication            | Yes                          | No                |
| "Have I seen this URL before?" | No                           | Yes               |
| Daily active users count       | No (roughly 10M is enough)   | Yes               |
| Hot page detection             | No                           | Yes               |
| Cache pre-check                | No                           | Yes               |

2. Bloom Filter — The Most Famous Probabilistic Data Structure

2.1 Principle

Bit array + multiple hash functions.

Insertion:

  1. Pass the element through K hash functions → K indices
  2. Set the bit at each index to 1

Query:

  1. Same K hash functions
  2. If all bits are 1 → "maybe present"
  3. If any bit is 0 → "definitely not present"

2.2 Visualization

Bit array (8 bits):  [0, 0, 0, 0, 0, 0, 0, 0]

Insert "apple" (hash → [1, 4]):
                  [0, 1, 0, 0, 1, 0, 0, 0]

Insert "banana" (hash → [3, 6]):
                  [0, 1, 0, 1, 1, 0, 1, 0]

Query "apple" (hash → [1, 4]): both bits are 1 → "maybe present" OK
Query "grape" (hash → [2, 5]): a bit is 0 → "definitely not present" NO
Query "cherry" (hash → [1, 6]): coincidentally both 1 → "maybe present" (false positive)

2.3 Computing the False Positive Rate

Formula:

P(fp) = (1 - e^(-k*n/m))^k
  • n: number of elements
  • m: number of bits
  • k: number of hash functions

Optimal k:

k = (m/n) * ln(2)

If you target 1% false positive:

m/n ≈ 9.6 bits
k ≈ 7

1M elements = 1.2 MB + 7 hash functions = 50x smaller than HashMap.
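
These formulas are easy to check numerically. A small helper (function name is illustrative):

```python
import math

def bloom_params(n, fp_rate):
    """Standard Bloom filter sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2."""
    m = math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_params(1_000_000, 0.01)
print(m / 1_000_000)          # ~9.59 bits per element
print(k)                      # 7 hash functions
print(m / 8 / 1024 / 1024)    # ~1.14 MB total
```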

2.4 Python Implementation

import hashlib
import math

class BloomFilter:
    def __init__(self, n, fp_rate):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
        self.size = -int(n * math.log(fp_rate) / (math.log(2) ** 2))
        self.hash_count = round(self.size / n * math.log(2))
        self.bits = [0] * self.size

    def _hashes(self, item):
        for i in range(self.hash_count):
            h = hashlib.md5(f"{item}{i}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def contains(self, item):
        return all(self.bits[h] for h in self._hashes(item))

# Usage
bf = BloomFilter(n=1000000, fp_rate=0.01)
bf.add("alice@example.com")
print(bf.contains("alice@example.com"))  # True
print(bf.contains("bob@example.com"))    # Almost always False

2.5 Limitations of Bloom Filters

  • No deletion — setting bits to 0 would affect other elements
  • Hard to resize — must create a new filter and reinsert all elements
  • No count — cannot know how many elements are inside

Solutions: Counting Bloom Filter, Cuckoo Filter (see below).

2.6 Real-World Usage of Bloom Filters

RocksDB / Cassandra / BigTable

In LSM-tree-based DBs, there is a Bloom Filter per SSTable:

Query: "key123"

SSTable 1 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 2 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 3 Bloom Filter: "maybe present"         → read disk (actual search)

Effect: up to 99% reduction in disk I/O — SSTables that cannot contain the key are never read from disk.

CDN Cache

Request: GET /image/123.jpg

Bloom Filter: "not in cache" → go straight to origin (skip cache lookup)
              "maybe"        → cache lookup

Effect: Saves time on cache miss.

Password Leak Checks

HaveIBeenPwned's 100M+ leaked password database → clients can check locally (download the Bloom Filter).

Chrome Malicious URL Check

Google's known malicious URL database → sent to clients as a Bloom Filter → browser checks immediately.


3. Counting Bloom Filter

3.1 Solving the Problem

A basic Bloom Filter does not support deletion. A Counting Bloom Filter uses counters instead of bits:

Array: [0, 0, 0, 0, 0, 0, 0, 0]  (each cell is a 4-bit counter)

Insert "apple" ([1, 4]):
        [0, 1, 0, 0, 1, 0, 0, 0]

Insert "apple" again:
        [0, 2, 0, 0, 2, 0, 0, 0]

Delete "apple":
        [0, 1, 0, 0, 1, 0, 0, 0]
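
The counter scheme above can be sketched in a few lines. This is a toy version (real implementations pack 4-bit counters; here plain Python ints are used):

```python
import hashlib

class CountingBloomFilter:
    """Counters instead of bits, so deletion becomes possible."""
    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.counters = [0] * size

    def _hashes(self, item):
        for i in range(self.hash_count):
            h = hashlib.md5(f"{item}{i}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.counters[h] += 1

    def remove(self, item):
        # Only safe if the item was actually inserted
        for h in self._hashes(item):
            if self.counters[h] > 0:
                self.counters[h] -= 1

    def contains(self, item):
        return all(self.counters[h] > 0 for h in self._hashes(item))

cbf = CountingBloomFilter()
cbf.add("apple")
cbf.add("apple")
cbf.remove("apple")
print(cbf.contains("apple"))  # True — one insertion remains
```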

3.2 Drawbacks

  • 4x the memory — 1 bit → 4 bits (to prevent overflow)
  • Still has false positives

3.3 Cuckoo Filter (Alternative)

The Cuckoo Filter is a superior alternative to the Bloom Filter:

  • Supports deletion
  • Lower false positive rate
  • Comparable memory efficiency
  • Based on "Cuckoo hashing"

# Illustrative example — constructor and method names vary by library
from cuckoofilter import CuckooFilter

cf = CuckooFilter(capacity=1000000, error_rate=0.01)
cf.insert("alice")
print(cf.contains("alice"))  # True
cf.delete("alice")  # Deletion is supported!

When Cuckoo > Bloom: when deletion is needed, or when a lower false positive rate is required.


4. HyperLogLog — Cardinality Estimation

4.1 The Problem

"How many unique visitors came to our site today?"

Traditional method: Store all visitor IDs in a HashSet → 10M visitors = 800 MB.

HyperLogLog: Estimate up to billions of elements with 12 KB. 1% error.

4.2 Basic Idea

Hash each element → track the number of leading zeros in the binary representation.

"alice" → hash → 0000010110...  (5 leading zeros)
"bob"   → hash → 00100110...    (2 leading zeros)
"carol" → hash → 0001110...     (3 leading zeros)

Observation: The probability of K leading zeros in a random hash is 1/2^K.

→ "We saw at most 5 leading zeros" → estimated about 2^5 = 32 unique elements.

Reality: A single measurement is noisy. Split the elements across many buckets and combine the per-bucket estimates (HyperLogLog uses a harmonic mean) → much better accuracy. This is HyperLogLog.
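
The bucket-and-combine idea can be sketched as follows. This is a simplified toy (real implementations add bias and small-range corrections, and Redis uses 16,384 registers rather than the 1,024 here):

```python
import hashlib

class HyperLogLog:
    """Minimal sketch: one max-rank register per bucket, harmonic-mean combine."""
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b              # 2^10 = 1024 buckets
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        bucket = h & (self.m - 1)    # low b bits select the bucket
        rest = h >> self.b
        # Rank = position of the first 1-bit (geometric, like leading zeros)
        rank = 1
        while rest & 1 == 0 and rank < 64:
            rank += 1
            rest >>= 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return int(alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user{i}")
print(hll.count())  # roughly 100,000 (a few percent error at 1024 buckets)
```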

4.3 Accuracy

| Elements      | Memory | Standard Error |
|---------------|--------|----------------|
| 1,000         | 12 KB  | 0.81%          |
| 1,000,000     | 12 KB  | 0.81%          |
| 1,000,000,000 | 12 KB  | 0.81%          |

Key point: Memory is independent of the number of elements! 100M elements and 1M elements take the same memory.

4.4 Redis HyperLogLog

PFADD visitors:2025-04-15 "user1" "user2" "user3"
PFADD visitors:2025-04-15 "user1"  # duplicates ignored
PFCOUNT visitors:2025-04-15  # 3 (estimation error only matters at large cardinalities)

# Merge multiple HLLs
PFMERGE visitors:week visitors:2025-04-14 visitors:2025-04-15

4.5 Real-World Usage of HyperLogLog

Daily Active Users (DAU)

import redis
from datetime import datetime

r = redis.Redis()

def track_user(user_id):
    today = datetime.now().strftime("%Y-%m-%d")
    r.pfadd(f"dau:{today}", user_id)

def get_dau(date):
    return r.pfcount(f"dau:{date}")

Hundreds of millions of users can be handled with 12 KB.

Unique IP Count (DDoS Analysis)

PFADD unique_ips:2025-04-15 "1.2.3.4" "5.6.7.8"
PFCOUNT unique_ips:2025-04-15

Big Data — BigQuery, Druid

SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

Uses HyperLogLog internally. 100x+ faster than exact COUNT(DISTINCT).


5. Count-Min Sketch — Frequency Estimation

5.1 The Problem

In a stream, "how many times has each element appeared?"

Traditional: HashMap → memory explosion.

Count-Min Sketch: Estimate frequencies with small memory.

5.2 Principle

2D array (d × w) + d hash functions.

Insert:

  • Compute an index with each hash function
  • Increment the corresponding counter

Query:

  • Same hash functions
  • Return the minimum counter value among them (hence "Count-Min")

5.3 Visualization

Array (3 x 5):
[0, 0, 0, 0, 0]  ← hash1
[0, 0, 0, 0, 0]  ← hash2
[0, 0, 0, 0, 0]  ← hash3

Insert "apple" 5 times:
hash1("apple") = 1 → counter [0, 5, 0, 0, 0]
hash2("apple") = 3 → counter [0, 0, 0, 5, 0]
hash3("apple") = 0 → counter [5, 0, 0, 0, 0]

Insert "banana" 2 times:
hash1("banana") = 4 → [0, 5, 0, 0, 2]
hash2("banana") = 1 → [0, 2, 0, 5, 0]
hash3("banana") = 2 → [5, 0, 2, 0, 0]

Query "apple": min(5, 5, 5) = 5 OK
Query "banana": min(2, 2, 2) = 2 OK

On collision: if another element collides, the counter may be inflated, but since we take the minimum, it is always >= the true count (overestimate).
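
The d × w structure described above fits in a short class. A minimal sketch, using MD5-derived per-row hashes (real implementations use faster pairwise-independent hash families):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; query returns the minimum across rows."""
    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Minimum across rows: always >= the true count (overestimate only)
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.add("apple")
cms.add("banana", 2)
print(cms.estimate("apple"))   # 5
print(cms.estimate("banana"))  # 2
```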

5.4 Accuracy

Error ε, confidence 1-δ
Required size:
  w = ⌈e/ε⌉
  d = ⌈ln(1/δ)⌉

Example: 1% error, 99% confidence → w=272, d=5 → 1360 counters = 5 KB. Can handle billions of elements.

5.5 Real-World Usage

Heavy Hitters (Hot Key Detection)

# Find the most frequent keys
sketch = CountMinSketch(width=272, depth=5)

for query in stream:
    sketch.add(query)

# Top-K discovery (uses a separate min-heap)

Alternative to Memcached LRU

The TinyLFU admission policy (used by caches such as Caffeine) uses a Count-Min Sketch to track access frequency → only frequently used keys stay in the cache.

DDoS Detection

Estimate the request count per IP → block when exceeding a threshold.

Frequency analysis in large-scale stream processing.


6. MinHash — Jaccard Similarity

6.1 The Problem

"How similar are two sets?" (Jaccard similarity)

J(A, B) = |A ∩ B| / |A ∪ B|

Traditional method: compare all elements → O(|A| × |B|).

6.2 MinHash Principle

The probability that the "minimum hash value" of each set is equal = Jaccard similarity.

import hashlib

def seeded_hash(item, seed):
    # Python's built-in hash() takes no seed, so derive one hash per seed
    return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)

def minhash(set_, num_hashes=128):
    return [min(seeded_hash(item, i) for item in set_) for i in range(num_hashes)]

# Compare MinHashes of two sets
mh1 = minhash(set1)
mh2 = minhash(set2)

similarity = sum(a == b for a, b in zip(mh1, mh2)) / len(mh1)
# similarity ≈ Jaccard(set1, set2)

6.3 Usage

Duplicate Document Detection

Finding near-identical pages among billions of web pages. Used by Google.

Plagiarism Detection

Comparison of papers and code.

Recommendation Systems

Finding "similar users".

Clustering

Combined with LSH (Locality Sensitive Hashing) to group similar items.


7. Skip List — Probabilistic Balanced Tree

7.1 Sorted Data Structure

A probabilistic alternative to balanced binary trees (AVL, Red-Black).

7.2 Principle

Multi-level linked list. Upper levels are sparse, lower levels dense.

Level 3:  1 ───────────────→ 9
Level 2:  1 ────→ 4 ────→ 7 ─→ 9
Level 1:  1 → 3 → 4 → 5 → 7 → 8 → 9

Search: start at the top level and move right; when the next node's value is too large, drop down a level.

Level on insert: decided by coin flip. 50% chance of going to the next level.
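
The coin-flip promotion rule can be sketched as:

```python
import random

def random_level(max_level=16, p=0.5):
    """Coin-flip promotion: each additional level is reached with probability p."""
    level = 1
    while random.random() < p and level < max_level:
        level += 1
    return level

# Expected distribution: ~50% of nodes at level 1, ~25% at level 2, ...
levels = [random_level() for _ in range(100_000)]
print(sum(l == 1 for l in levels) / len(levels))  # ~0.5
```

This geometric distribution is what makes the expected search path O(log n) without any rebalancing.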

7.3 Performance

| Operation | Average  | Worst |
|-----------|----------|-------|
| Search    | O(log n) | O(n)  |
| Insert    | O(log n) | O(n)  |
| Delete    | O(log n) | O(n)  |

Advantages: simpler to implement than balanced trees, easy lock-free concurrency.

7.4 Usage

  • Redis Sorted Set: internal structure of ZADD/ZRANGE
  • LevelDB/RocksDB memtable
  • Java ConcurrentSkipListMap

8. T-Digest — Quantile Estimation

8.1 The Problem

How do you compute quantiles like p99 latency in a large-scale stream?

Traditional: store all values then sort → memory explosion.

T-Digest: maintain a "sketch" of the distribution with small memory.

8.2 Principle

Group values into buckets. More buckets at the distribution's tail (p99, p99.9), fewer in the middle (p50).

Rationale: we usually care about the tail (tail latency). The median can be approximate.

8.3 Example

from tdigest import TDigest

t = TDigest()
for value in stream:
    t.update(value)

p50 = t.percentile(50)
p95 = t.percentile(95)
p99 = t.percentile(99)
p999 = t.percentile(99.9)

8.4 Usage

  • Monitoring: Prometheus, Datadog, New Relic
  • Databases: Druid, Apache Pinot
  • Stream processing: Apache Beam, Flink

9. Tradeoff Comparison

9.1 Comparison of All Data Structures

| Structure        | Question it Answers          | Memory          | Accuracy              | Usage               |
|------------------|------------------------------|-----------------|-----------------------|---------------------|
| HashSet          | Exact membership             | O(n)            | 100%                  | small data          |
| Bloom Filter     | "Maybe present?"             | O(n) very small | FP O, FN X            | cache pre-check     |
| Cuckoo Filter    | Membership + deletion        | O(n) very small | Lower FP              | Bloom alternative   |
| Counter          | Exact count                  | O(n)            | 100%                  | small data          |
| HyperLogLog      | "Number of unique elements?" | 12 KB fixed     | ~1%                   | DAU, cardinality    |
| Count-Min Sketch | "How many times X?"          | small           | overestimate          | hot keys, DDoS      |
| MinHash + LSH    | "Similar sets?"              | small           | Jaccard estimate      | duplicate detection |
| T-Digest         | "Quantiles?"                 | small           | distribution estimate | latency monitoring  |
| Skip List        | Sorted set                   | O(n)            | 100%                  | Redis Sorted Set    |

9.2 Memory Comparison (100M Unique Users)

| Method                 | Memory  |
|------------------------|---------|
| HashSet (UUID)         | ~3.6 GB |
| Bitmap (sequential ID) | 12.5 MB |
| Bloom Filter (1% FP)   | 120 MB  |
| HyperLogLog            | 12 KB   |

HyperLogLog is 300,000x smaller.
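
The figures above follow from back-of-envelope arithmetic (the 36-bytes-per-UUID-entry figure is an assumption about string storage, ignoring hash-table overhead, which would make it even larger):

```python
# Rough memory estimates for 100M unique users
n = 100_000_000

hashset = n * 36        # ~36 bytes per UUID string (assumption)
bitmap = n / 8          # 1 bit per sequential ID
bloom = n * 9.6 / 8     # ~9.6 bits per element at 1% FP
hll = 12 * 1024         # fixed-size register array

print(f"HashSet ~{hashset / 1e9:.1f} GB")  # ~3.6 GB
print(f"Bitmap  ~{bitmap / 1e6:.1f} MB")   # ~12.5 MB
print(f"Bloom   ~{bloom / 1e6:.0f} MB")    # ~120 MB
print(f"HLL     ~{hll / 1024:.0f} KB")     # ~12 KB
```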


10. Real-World — Heavy Hitter Detection in Kafka

10.1 Scenario

1M messages/sec on a Kafka topic. "Which user_id is most active?" (spam detection)

10.2 Implementation

# Illustrative sketch — package and method names (add/estimate) vary by CMS library
from countminsketch import CountMinSketch
import heapq

# Count-Min Sketch + Top-K heap
sketch = CountMinSketch(width=10000, depth=5)
top_k = []  # min-heap of (count, user_id)
TOP_K_SIZE = 100

def process_message(user_id):
    sketch.add(user_id)
    estimated_count = sketch.estimate(user_id)

    if len(top_k) < TOP_K_SIZE:
        heapq.heappush(top_k, (estimated_count, user_id))
    elif estimated_count > top_k[0][0]:
        # Simplified: the same user_id can occupy several heap slots;
        # production code would deduplicate or update entries in place
        heapq.heapreplace(top_k, (estimated_count, user_id))

# Memory: ~200 KB handles 100M users

10.3 Result

Traditional HashMap = several GB → CMS = 200 KB (10,000x savings).

Accepting a bit of inaccuracy yields explosive memory efficiency.


Quiz

1. Why does the Bloom Filter have no false negatives?

Answer: When inserting an element, all K bits are set to 1. When querying the same element, the same K bits are checked and they must all be 1. Therefore, answering "not present" means some bit was 0, meaning the element was never inserted. False negatives are impossible. However, other elements coincidentally setting the same bits to 1 causes false positives.

2. Why can HyperLogLog handle billions with 12 KB?

Answer: HyperLogLog does not store the elements themselves. Instead, it tracks the distribution of leading-zero counts in the hashes. Whether there are 100M or 1T elements, the same distribution summary is stored, so memory is constant. Redis's 12 KB holds 16,384 six-bit registers, enabling cardinality estimation with a standard error of 0.81%. The key is that memory is independent of data size.

3. Why does Count-Min Sketch always overestimate?

Answer: Multiple elements can collide at the same counter position. On collision, the counter becomes the sum of multiple elements' counts, yielding a larger-than-real value. Even taking the minimum may leave collision effects, so the result is always >= the true value. Conversely, underestimates are impossible because no position can be smaller than the true count.

4. What are the advantages of the Cuckoo Filter over the Bloom Filter?

Answer: (1) Deletable — the Bloom Filter cannot turn bits back to 0 (affects other elements), while the Cuckoo Filter can safely delete. (2) Lower false positive rate — more accurate at the same memory. (3) Better cache performance — more efficient memory access patterns. Downsides: slightly more complex to implement, and insertion can fail when saturated (capacity exceeded).

5. Why is T-Digest advantageous for p99 latency measurement?

Answer: Traditional methods store all latency values then sort → 800 MB+ memory for 100M requests. T-Digest stores a "sketch" of the distribution in a few KB. The key trick: more buckets at the tail (p99, p99.9), fewer in the middle (p50). Since monitoring usually focuses on the tail, memory is saved without loss of accuracy where it matters. Used by Datadog, Prometheus, and others.

