The Complete Guide to Probabilistic Data Structures 2025: Bloom Filter, HyperLogLog, Count-Min Sketch, and Real-World Usage
Author: Youngju Kim (@fjvbn20031)
TL;DR
- Essence of probabilistic data structures: Trade 100% accuracy for extreme savings in memory/speed
- Bloom Filter: "This element may be in the set" — false positives possible, false negatives impossible. Core of RocksDB, Cassandra, BigTable
- HyperLogLog: Count unique elements among billions with 12 KB. Used by Redis PFADD/PFCOUNT, BigQuery, Druid
- Count-Min Sketch: Estimate element frequency in streams. Hot-key detection, DDoS detection
- Tradeoff: accuracy vs memory vs speed — dramatic efficiency gains when exact answers are unnecessary
1. What Are Probabilistic Data Structures?
1.1 Core Idea
"If you don't need a 100% accurate answer, you can save memory and speed extremely."
Traditional data structures:
- Storing 1M elements in a HashMap → about 50 MB
- "Does this element exist?" → exact answer
Probabilistic data structures:
- Storing 1M elements in a Bloom Filter → about 1 MB (50x savings!)
- "Does this element exist?" → "maybe" or "definitely not"
Tradeoff: 50x memory savings + a small false positive rate (1%).
1.2 Why Do We Need Them?
In large-scale systems:
- Memory is expensive — RAM is 100x+ more expensive than disk
- Must fit in L1/L2 cache to be fast → smaller is better
- Billions of elements — HashMap becomes infeasible
- Distributed systems — smaller data is more network-efficient
1.3 Cases Where Accuracy Is Unnecessary
| Problem | Need Exact Answer? | Probabilistic OK? |
|---|---|---|
| User authentication | Yes | No |
| "Have I seen this URL before?" | No | Yes |
| Daily active users count | No (roughly 10M is enough) | Yes |
| Hot page detection | No | Yes |
| Cache pre-check | No | Yes |
2. Bloom Filter — The Most Famous Probabilistic Data Structure
2.1 Principle
Bit array + multiple hash functions.
Insertion:
- Pass the element through K hash functions → K indices
- Set the bit at each index to 1
Query:
- Same K hash functions
- If all bits are 1 → "maybe present"
- If any bit is 0 → "definitely not present"
2.2 Visualization
Bit array (8 bits): [0, 0, 0, 0, 0, 0, 0, 0]
Insert "apple" (hash → [1, 4]):
[0, 1, 0, 0, 1, 0, 0, 0]
Insert "banana" (hash → [3, 6]):
[0, 1, 0, 1, 1, 0, 1, 0]
Query "apple" (hash → [1, 4]): both 1 → "maybe present" OK
Query "grape" (hash → [2, 5]): has a 0 → "definitely not present" NO
Query "cherry" (hash → [1, 6]): coincidentally both 1 → "maybe present" (false positive)
2.3 Computing the False Positive Rate
Formula:
P(fp) = (1 - e^(-k*n/m))^k
- n: number of elements
- m: number of bits
- k: number of hash functions
Optimal k:
k = (m/n) * ln(2)
If you target 1% false positive:
m/n ≈ 9.6 bits
k ≈ 7
1M elements ≈ 1.2 MB with 7 hash functions — roughly 50x smaller than a HashMap.
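A quick sanity check of these numbers in Python, plugging n = 1,000,000 and a 1% target into the formulas above:

import math

n, p = 1_000_000, 0.01
m = -n * math.log(p) / (math.log(2) ** 2)   # optimal number of bits
k = (m / n) * math.log(2)                    # optimal number of hash functions
print(m / n, round(k), m / 8 / 1e6)          # ≈ 9.6 bits/element, 7 hashes, ≈ 1.2 MB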
2.4 Python Implementation
import hashlib
import math

class BloomFilter:
    def __init__(self, n, fp_rate):
        # Optimal sizes from the formulas above: m = -n*ln(p)/ln(2)^2, k = (m/n)*ln(2)
        self.size = int(-n * math.log(fp_rate) / (math.log(2) ** 2))
        self.hash_count = int(self.size / n * math.log(2))
        self.bits = [0] * self.size

    def _hashes(self, item):
        # Derive k indices by salting the item with the hash index
        for i in range(self.hash_count):
            h = hashlib.md5(f"{item}{i}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def contains(self, item):
        # All bits set -> "maybe present"; any bit 0 -> "definitely not present"
        return all(self.bits[h] for h in self._hashes(item))

# Usage
bf = BloomFilter(n=1000000, fp_rate=0.01)
bf.add("alice@example.com")
print(bf.contains("alice@example.com"))  # True
print(bf.contains("bob@example.com"))    # Almost always False
2.5 Limitations of Bloom Filters
- No deletion — setting bits to 0 would affect other elements
- Hard to resize — must create a new filter and reinsert all elements
- No count — cannot know how many elements are inside
Solutions: Counting Bloom Filter, Cuckoo Filter (see below).
2.6 Real-World Usage of Bloom Filters
RocksDB / Cassandra / BigTable
In LSM-tree-based DBs, there is a Bloom Filter per SSTable:
Query: "key123"
SSTable 1 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 2 Bloom Filter: "definitely not present" → skip disk read OK
SSTable 3 Bloom Filter: "maybe present" → read disk (actual search)
Effect: up to ~99% reduction in disk I/O — SSTables that definitely do not contain the key are never read from disk.
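A minimal sketch of this read path, reusing the BloomFilter class from section 2.4 (the SSTable class and the in-memory dict standing in for the on-disk file are hypothetical, not RocksDB's actual API):

class SSTable:
    def __init__(self, entries):
        self.data = dict(entries)                        # stand-in for the on-disk file
        self.bloom = BloomFilter(n=max(len(self.data), 1), fp_rate=0.01)
        for key in self.data:
            self.bloom.add(key)

def lsm_get(sstables, key):
    # Newest SSTable first; skip any file whose filter says "definitely not present"
    for sst in sstables:
        if not sst.bloom.contains(key):
            continue                                     # no disk read needed
        if key in sst.data:                              # simulated disk lookup
            return sst.data[key]
    return None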
CDN Cache
Request: GET /image/123.jpg
Bloom Filter: "not in cache" → go straight to origin (skip cache lookup)
"maybe" → cache lookup
Effect: Saves time on cache miss.
Password Leak Checks
HaveIBeenPwned's 100M+ leaked password database → clients can check locally (download the Bloom Filter).
Chrome Malicious URL Check
Google's known malicious URL database → sent to clients as a Bloom Filter → browser checks immediately.
3. Counting Bloom Filter
3.1 Solving the Problem
A basic Bloom Filter does not support deletion. A Counting Bloom Filter uses counters instead of bits:
Array: [0, 0, 0, 0, 0, 0, 0, 0] (each cell is a 4-bit counter)
Insert "apple" ([1, 4]):
[0, 1, 0, 0, 1, 0, 0, 0]
Insert "apple" again:
[0, 2, 0, 0, 2, 0, 0, 0]
Delete "apple":
[0, 1, 0, 0, 1, 0, 0, 0]
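A minimal Counting Bloom Filter sketch, reusing the hashing scheme of the BloomFilter class from section 2.4 (illustrative only; real implementations cap each counter at 4 bits):

class CountingBloomFilter(BloomFilter):
    # Same layout as BloomFilter, but each cell is a counter instead of a bit
    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] += 1

    def remove(self, item):
        # Only safe if the item was actually inserted earlier
        for h in self._hashes(item):
            if self.bits[h] > 0:
                self.bits[h] -= 1

    def contains(self, item):
        return all(self.bits[h] > 0 for h in self._hashes(item))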
3.2 Drawbacks
- 4x the memory — 1 bit → 4 bits (to prevent overflow)
- Still has false positives
3.3 Cuckoo Filter (Alternative)
The Cuckoo Filter is a superior alternative to the Bloom Filter:
- Supports deletion
- Lower false positive rate
- Comparable memory efficiency
- Based on "Cuckoo hashing"
# pycuckoofilter-style example; package names and constructor arguments vary between libraries
from cuckoofilter import CuckooFilter

cf = CuckooFilter(capacity=1000000, error_rate=0.01)
cf.insert("alice")
print(cf.contains("alice"))  # True
cf.delete("alice")           # Possible!
When Cuckoo > Bloom: when deletion is needed, or when a lower false positive rate is required.
4. HyperLogLog — Cardinality Estimation
4.1 The Problem
"How many unique visitors came to our site today?"
Traditional method: Store all visitor IDs in a HashSet → 10M visitors = 800 MB.
HyperLogLog: Estimate up to billions of elements with 12 KB. 1% error.
4.2 Basic Idea
Hash each element → track the number of leading zeros in the binary representation.
"alice" → hash → 00000010110... (5 leading zeros)
"bob" → hash → 00100110... (2 leading zeros)
"carol" → hash → 0001110... (3 leading zeros)
Observation: The probability of K leading zeros in a random hash is 1/2^K.
→ "We saw at most 5 leading zeros" → estimated about 2^5 = 32 unique elements.
Reality: Single measurements are inaccurate. Split across multiple buckets and average → improved accuracy. This is HyperLogLog.
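A toy estimator in the spirit described above — a simplified HyperLogLog without the small-range and large-range corrections that real implementations apply:

import hashlib

def rough_hll_count(items, b=10):
    # m = 2^b registers; each register keeps the max "leading zeros + 1" seen
    m = 1 << b
    registers = [0] * m
    for item in items:
        x = int(hashlib.md5(str(item).encode()).hexdigest(), 16)  # 128-bit hash
        bucket = x & (m - 1)                       # low b bits pick the register
        w = x >> b                                 # remaining bits
        rank = (128 - b) - w.bit_length() + 1      # leading zeros in w, plus 1
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)               # bias-correction constant
    return int(alpha * m * m / sum(2.0 ** -r for r in registers))

print(rough_hll_count(f"user{i}" for i in range(100_000)))  # roughly 100,000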
4.3 Accuracy
| Elements | Memory | Standard Error |
|---|---|---|
| 1,000 | 12 KB | 0.81% |
| 1,000,000 | 12 KB | 0.81% |
| 1,000,000,000 | 12 KB | 0.81% |
Key point: Memory is independent of the number of elements! 100M elements and 1M elements take the same memory.
4.4 Redis HyperLogLog
PFADD visitors:2025-04-15 "user1" "user2" "user3"
PFADD visitors:2025-04-15 "user1" # duplicates ignored
PFCOUNT visitors:2025-04-15 # about 3 (could be 2 or 3)
# Merge multiple HLLs
PFMERGE visitors:week visitors:2025-04-14 visitors:2025-04-15
4.5 Real-World Usage of HyperLogLog
Daily Active Users (DAU)
import redis
from datetime import datetime

r = redis.Redis()

def track_user(user_id):
    today = datetime.now().strftime("%Y-%m-%d")
    r.pfadd(f"dau:{today}", user_id)

def get_dau(date):
    return r.pfcount(f"dau:{date}")
Hundreds of millions of users can be handled with 12 KB.
Unique IP Count (DDoS Analysis)
PFADD unique_ips:2025-04-15 "1.2.3.4" "5.6.7.8"
PFCOUNT unique_ips:2025-04-15
Big Data — BigQuery, Druid
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;
Uses HyperLogLog internally. 100x+ faster than exact COUNT(DISTINCT).
5. Count-Min Sketch — Frequency Estimation
5.1 The Problem
In a stream, "how many times has each element appeared?"
Traditional: HashMap → memory explosion.
Count-Min Sketch: Estimate frequencies with small memory.
5.2 Principle
2D array (d × w) + d hash functions.
Insert:
- Compute an index with each hash function
- Increment the corresponding counter
Query:
- Same hash functions
- Return the minimum counter value among them (hence "Count-Min")
5.3 Visualization
Array (3 x 5):
[0, 0, 0, 0, 0] ← hash1
[0, 0, 0, 0, 0] ← hash2
[0, 0, 0, 0, 0] ← hash3
Insert "apple" 5 times:
hash1("apple") = 1 → counter [0, 5, 0, 0, 0]
hash2("apple") = 3 → counter [0, 0, 0, 5, 0]
hash3("apple") = 0 → counter [5, 0, 0, 0, 0]
Insert "banana" 2 times:
hash1("banana") = 4 → [0, 5, 0, 0, 2]
hash2("banana") = 1 → [0, 2, 0, 5, 0]
hash3("banana") = 2 → [5, 0, 2, 0, 0]
Query "apple": min(5, 5, 5) = 5 OK
Query "banana": min(2, 2, 2) = 2 OK
On collision: if another element collides, the counter may be inflated, but since we take the minimum, it is always >= the true count (overestimate).
5.4 Accuracy
Error ε, confidence 1-δ
Required size:
w = ⌈e/ε⌉
d = ⌈ln(1/δ)⌉
Example: 1% error, 99% confidence → w=272, d=5 → 1360 counters = 5 KB. Can handle billions of elements.
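To make the mechanics concrete, here is a minimal Count-Min Sketch sketch in Python, assuming one salted MD5 hash per row (the CountMinSketch used in the snippets below can be a class like this or a library equivalent):

import hashlib

class CountMinSketch:
    def __init__(self, width=272, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Minimum over rows: always >= the true count
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch(width=272, depth=5)
for word in ["apple"] * 5 + ["banana"] * 2:
    cms.add(word)
print(cms.estimate("apple"), cms.estimate("banana"))  # 5 2 (possibly overestimated)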
5.5 Real-World Usage
Heavy Hitters (Hot Key Detection)
# Find the most frequent keys (using the CountMinSketch sketched above or a library equivalent)
sketch = CountMinSketch(width=272, depth=5)
for query in stream:
    sketch.add(query)
# Top-K discovery uses a separate min-heap (see section 10)
Alternative to Memcached LRU
TinyLFU (used by the Caffeine cache) keeps a Count-Min-Sketch-style frequency sketch of accesses → only frequently used keys are admitted to the cache.
DDoS Detection
Estimate the request count per IP → block when exceeding a threshold.
Apache Spark, Flink
Frequency analysis in large-scale stream processing.
6. MinHash — Jaccard Similarity
6.1 The Problem
"How similar are two sets?" (Jaccard similarity)
J(A, B) = |A ∩ B| / |A ∪ B|
Traditional method: compare all elements → O(|A| × |B|).
6.2 MinHash Principle
The probability that the "minimum hash value" of each set is equal = Jaccard similarity.
import hashlib

def minhash(set_, num_hashes=128):
    # Signature: for each seed, keep the minimum hash value over the set
    def h(item, seed):
        return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
    return [min(h(item, i) for item in set_) for i in range(num_hashes)]

# Compare MinHash signatures of two sets
mh1 = minhash(set1)
mh2 = minhash(set2)
similarity = sum(a == b for a, b in zip(mh1, mh2)) / len(mh1)
# similarity ≈ Jaccard(set1, set2)
6.3 Usage
Duplicate Document Detection
Finding near-identical pages among billions of web pages. Used by Google.
Plagiarism Detection
Comparison of papers and code.
Recommendation Systems
Finding "similar users".
Clustering
Combined with LSH (Locality Sensitive Hashing) to group similar items.
7. Skip List — Probabilistic Balanced Tree
7.1 Sorted Data Structure
A probabilistic alternative to balanced binary trees (AVL, Red-Black).
7.2 Principle
Multi-level linked list. Upper levels are sparse, lower levels dense.
Level 3: 1 ───────────────→ 9
Level 2: 1 ────→ 4 ────→ 7 ─→ 9
Level 1: 1 → 3 → 4 → 5 → 7 → 8 → 9
Search: start at the top level and move right; when the next node's value is too large, drop down a level.
Level on insert: decided by coin flips — each element is promoted to the next level with 50% probability. A sketch follows below.
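A compact sketch of search and insert with coin-flip level promotion (illustrative only, not the Redis or Java implementation):

import random

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level            # one next-pointer per level

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        # Coin flips: each additional level is reached with probability 1/2
        level = 1
        while random.random() < 0.5 and level < self.MAX_LEVEL:
            level += 1
        return level

    def contains(self, value):
        node = self.head
        for i in range(self.level - 1, -1, -1):  # top level down
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]           # move right
        node = node.forward[0]
        return node is not None and node.value == value

    def insert(self, value):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node                     # last node visited on level i
        level = self._random_level()
        self.level = max(self.level, level)
        new = Node(value, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

sl = SkipList()
for v in [3, 7, 1, 9, 4]:
    sl.insert(v)
print(sl.contains(7), sl.contains(5))  # True False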
7.3 Performance
| Operation | Average | Worst |
|---|---|---|
| Search | O(log n) | O(n) |
| Insert | O(log n) | O(n) |
| Delete | O(log n) | O(n) |
Advantages: simpler to implement than balanced trees, easy lock-free concurrency.
7.4 Usage
- Redis Sorted Set: internal structure of ZADD/ZRANGE
- LevelDB/RocksDB memtable
- Java ConcurrentSkipListMap
8. T-Digest — Quantile Estimation
8.1 The Problem
How do you compute quantiles like p99 latency in a large-scale stream?
Traditional: store all values then sort → memory explosion.
T-Digest: maintain a "sketch" of the distribution with small memory.
8.2 Principle
Group values into buckets. More buckets at the distribution's tail (p99, p99.9), fewer in the middle (p50).
Rationale: we usually care about the tail (tail latency). The median can be approximate.
8.3 Example
from tdigest import TDigest

t = TDigest()
for value in stream:
    t.update(value)

p50 = t.percentile(50)
p95 = t.percentile(95)
p99 = t.percentile(99)
p999 = t.percentile(99.9)
8.4 Usage
- Monitoring: Prometheus, Datadog, New Relic
- Databases: Druid, Apache Pinot
- Stream processing: Apache Beam, Flink
9. Tradeoff Comparison
9.1 Comparison of All Data Structures
| Structure | Question it Answers | Memory | Accuracy | Usage |
|---|---|---|---|---|
| HashSet | Exact membership | O(n) | 100% | small data |
| Bloom Filter | "Maybe present?" | O(n), very small | FP possible, FN impossible | cache pre-check |
| Cuckoo Filter | Membership + deletion | O(n), very small | lower FP | Bloom alternative |
| Counter | Exact count | O(n) | 100% | small data |
| HyperLogLog | "Number of unique elements?" | 12 KB fixed | ~1% | DAU, cardinality |
| Count-Min Sketch | "How many times X?" | small | overestimate | hot keys, DDoS |
| MinHash + LSH | "Similar sets?" | small | Jaccard estimate | duplicate detection |
| T-Digest | "Quantiles?" | small | distribution estimate | latency monitoring |
| Skip List | Sorted set | O(n) | 100% | Redis Sorted Set |
9.2 Memory Comparison (100M Unique Users)
| Method | Memory |
|---|---|
| HashSet (UUID) | ~3.6 GB |
| Bitmap (sequential ID) | 12.5 MB |
| Bloom Filter (1% FP) | 120 MB |
| HyperLogLog | 12 KB |
HyperLogLog is 300,000x smaller.
10. Real-World — Heavy Hitter Detection in Kafka
10.1 Scenario
1M messages/sec on a Kafka topic. "Which user_id is most active?" (spam detection)
10.2 Implementation
from countminsketch import CountMinSketch   # or the minimal class sketched in section 5.4
import heapq

# Count-Min Sketch + Top-K heap (simplified: stale/duplicate user_ids may remain in the heap)
sketch = CountMinSketch(width=10000, depth=5)
top_k = []          # min-heap of (count, user_id)
TOP_K_SIZE = 100

def process_message(user_id):
    sketch.add(user_id)
    estimated_count = sketch.estimate(user_id)
    if len(top_k) < TOP_K_SIZE:
        heapq.heappush(top_k, (estimated_count, user_id))
    elif estimated_count > top_k[0][0]:
        heapq.heapreplace(top_k, (estimated_count, user_id))

# Memory: ~200 KB handles 100M users
10.3 Result
Traditional HashMap = several GB → CMS = 200 KB (10,000x savings).
Accepting a bit of inaccuracy yields dramatic memory savings.
Quiz
1. Why does the Bloom Filter have no false negatives?
Answer: When inserting an element, all K bits are set to 1. When querying the same element, the same K bits are checked and they must all be 1. Therefore, answering "not present" means some bit was 0, meaning the element was never inserted. False negatives are impossible. However, other elements coincidentally setting the same bits to 1 causes false positives.
2. Why can HyperLogLog handle billions with 12 KB?
Answer: HyperLogLog does not store the elements themselves. Instead, it tracks the distribution of the number of leading zeros in the hashes. Whether there are 100M or 1T elements, the same kind of distribution information is stored, so memory stays constant. 12 KB holds 16,384 six-bit registers, enabling cardinality estimation with a standard error of about 0.81%. The key is that memory is independent of data size.
3. Why does Count-Min Sketch always overestimate?
Answer: Multiple elements can collide at the same counter position. On a collision, the counter holds the sum of several elements' counts, which is larger than any single element's true count. Even after taking the minimum across rows, some collision effect may remain, so the result is always >= the true value. Conversely, underestimates are impossible because no counter is ever smaller than the true count.
4. What are the advantages of the Cuckoo Filter over the Bloom Filter?
Answer: (1) Deletable — the Bloom Filter cannot turn bits back to 0 (affects other elements), while the Cuckoo Filter can safely delete. (2) Lower false positive rate — more accurate at the same memory. (3) Better cache performance — more efficient memory access patterns. Downsides: slightly more complex to implement, and insertion can fail when saturated (capacity exceeded).
5. Why is T-Digest advantageous for p99 latency measurement?
Answer: Traditional methods store all latency values then sort → 800 MB+ memory for 100M requests. T-Digest stores a "sketch" of the distribution in a few KB. The key trick: more buckets at the tail (p99, p99.9), fewer in the middle (p50). Since monitoring usually focuses on the tail, memory is saved without loss of accuracy where it matters. Used by Datadog, Prometheus, and others.
References
- Bloom Filter Original Paper — Burton Bloom (1970)
- HyperLogLog Original Paper — Flajolet et al.
- Count-Min Sketch Original Paper — Cormode & Muthukrishnan
- Cuckoo Filter — Fan et al.
- Probabilistic Data Structures and Algorithms — Andrii Gakhov (book)
- Redis HyperLogLog
- RocksDB Bloom Filter
- Apache DataSketches — high-quality implementations
- stream-lib — Java library
- t-digest — Ted Dunning's implementation
- pdsa.gakhov.com — visualizations + simulations