Consistent Hashing Complete Guide 2025: Virtual Nodes, Jump Hash, Rendezvous, and the Core Building Block of Distributed Systems
Introduction: Why Do We Need Consistent Hashing?

The Problem: The Trap of Hash-Based Sharding

Suppose you operate four cache servers and distribute keys with a simple hash:

def get_server(key, num_servers):
    return hash(key) % num_servers

# Distribute across 4 servers
get_server("user:1234", 4)  # server 2
get_server("user:5678", 4)  # server 0
get_server("user:9012", 4)  # server 3

Works fine in steady state. But what happens when traffic grows and you scale up to 5 servers?

get_server("user:1234", 5)  # server 4 (was 2 → now 4)
get_server("user:5678", 5)  # server 3 (was 0 → now 3)
get_server("user:9012", 5)  # server 2 (was 3 → now 2)

Almost every key moves. Cache misses explode, traffic floods the database, and the service collapses. This is called the Rehashing Storm.

The Promise of Consistent Hashing

Consistent Hashing, published in 1997 by David Karger and colleagues at MIT, solves this elegantly:

When the node count grows from N to N+1, on average only K/(N+1) keys need to move. (K is the total number of keys.)

So scaling from 4 to 5 servers moves only about 1/5 of keys. The other 4/5 stay put. This principle became the foundation of countless distributed systems: Amazon DynamoDB, Apache Cassandra, Riak, Memcached, Akamai CDN, and more.
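The claim is easy to check empirically. A quick sketch (using hashlib so the hash is stable across processes; the key names are illustrative) counts how many of 100,000 keys change servers when going from hash % 4 to hash % 5:

```python
import hashlib

def stable_hash(key: str) -> int:
    # MD5-based hash: stable across processes, unlike built-in hash()
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

total = 100_000
moved = sum(
    1 for i in range(total)
    if stable_hash(f"user:{i}") % 4 != stable_hash(f"user:{i}") % 5
)
print(f"{moved / total:.0%} of keys moved")
```

A key stays put only when its hash gives the same remainder mod 4 and mod 5, i.e. when hash mod 20 falls in {0, 1, 2, 3} — so about 80% of keys move.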

graph LR
    A[Key] --> B{Hashing method}
    B -->|modulo| C[80% keys move<br/>Disaster]
    B -->|Consistent Hash| D[20% keys move<br/>Normal]

1. The Core Idea: The Hash Ring

Place Nodes and Keys on a Circular Space

The core idea of Consistent Hashing is surprisingly simple:

  1. Treat the output space of the hash function as a circle (0 to 2^32-1).
  2. Hash each node (server) and place it on a point on the ring.
  3. Hash each key (data) and place it on a point on the ring.
  4. A key is stored on the nearest node in the clockwise direction.
              0 / 2^32
                 |
              [NodeA]
            /         \
      [NodeD]         [NodeB]
            \         /
              [NodeC]

      (keys travel clockwise to the next node)

Basic Python Implementation

import hashlib
from bisect import bisect_right, insort

class ConsistentHashBasic:
    def __init__(self):
        self.ring = {}       # hash → node
        self.sorted_keys = []

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        h = self._hash(node)
        self.ring[h] = node
        insort(self.sorted_keys, h)

    def remove_node(self, node: str):
        h = self._hash(node)
        del self.ring[h]
        self.sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        if not self.ring:
            return None
        h = self._hash(key)
        # Find the nearest node clockwise
        idx = bisect_right(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0  # Wrap around the ring
        return self.ring[self.sorted_keys[idx]]

# Example usage
ch = ConsistentHashBasic()
for node in ["server-A", "server-B", "server-C", "server-D"]:
    ch.add_node(node)

print(ch.get_node("user:1234"))  # server-C
print(ch.get_node("user:5678"))  # server-A

Impact of Adding/Removing Nodes

When you add node E:

  1. E's hash value lands somewhere between existing nodes.
  2. Only keys between E and the preceding node move to E.
  3. The rest of the keys stay put.

On average only K/(N+1) of the keys move (here, 1/5). That's the magic.
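This can be verified with the class above (condensed here so the sketch runs standalone): every key that moves when server-E joins must move to server-E, never between the old nodes:

```python
import hashlib
from bisect import bisect_right, insort

class Ring:
    """Condensed ConsistentHashBasic, restated so this runs standalone."""
    def __init__(self, nodes):
        self.ring, self.points = {}, []
        for n in nodes:
            h = self._hash(n)
            self.ring[h] = n
            insort(self.points, h)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # Nearest node clockwise, wrapping around via modulo
        idx = bisect_right(self.points, self._hash(key))
        return self.ring[self.points[idx % len(self.points)]]

before = Ring(["server-A", "server-B", "server-C", "server-D"])
after = Ring(["server-A", "server-B", "server-C", "server-D", "server-E"])

moved = sum(1 for i in range(100_000)
            if before.get_node(f"user:{i}") != after.get_node(f"user:{i}"))
print(f"{moved / 100_000:.1%} of keys moved, all of them to server-E")
```

With a single point per node, the moved fraction equals whatever arc server-E happens to claim — which is exactly the imbalance problem the next section tackles.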


2. Problems With the Basic Approach: Imbalance

Basic Consistent Hashing isn't perfect. The biggest issue is uneven key distribution.

Problem 1: Skew With Few Nodes

With only 4 nodes on the ring, even if hash values are uniformly distributed, the size of the "arc" each node owns varies wildly:

NodeA: 0 ~ 5 (5%)
NodeB: 5 ~ 40 (35%)  ← heavy load
NodeC: 40 ~ 50 (10%)
NodeD: 50 ~ 100 (50%) ← half the ring

Result: NodeD gets half of all traffic, and NodeA sits nearly idle.

Problem 2: Worsening Skew on Node Removal

If NodeB goes down, all of B's keys pile onto NodeC. A hotspot forms.

Problem 3: Hard to Reflect Heterogeneous Server Capacity

Some servers might have 16 cores and 64GB RAM; others 4 cores and 16GB. The basic scheme can't express capacity ratios.


3. The Fix: Virtual Nodes (vnodes)

The Idea

Represent each physical node as multiple virtual nodes scattered on the ring. For example, with 150 vnodes per node:

  • 10 physical nodes × 150 = 1,500 virtual nodes distributed on the ring.
  • By the law of large numbers, each physical node's arc becomes uniform.
4 physical nodes (150 vnodes each)

Distribution on ring:
A-1, B-7, C-3, A-99, D-42, B-15, C-88, D-1, A-15, ...
(nearly uniformly interleaved)

Virtual Node Implementation

class ConsistentHashWithVNodes:
    def __init__(self, vnodes_per_node: int = 150):
        self.vnodes_per_node = vnodes_per_node
        self.ring = {}
        self.sorted_keys = []

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str, weight: int = 1):
        # Higher weight → more vnodes
        count = self.vnodes_per_node * weight
        for i in range(count):
            vnode_key = f"{node}#{i}"
            h = self._hash(vnode_key)
            self.ring[h] = node
            insort(self.sorted_keys, h)

    def remove_node(self, node: str, weight: int = 1):
        # weight must match the value passed to add_node
        count = self.vnodes_per_node * weight
        for i in range(count):
            vnode_key = f"{node}#{i}"
            h = self._hash(vnode_key)
            del self.ring[h]
            self.sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect_right(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

# Example: servers with different capacities
ch = ConsistentHashWithVNodes(vnodes_per_node=150)
ch.add_node("server-A", weight=2)  # 2x performance → 300 vnodes
ch.add_node("server-B", weight=1)  # standard → 150 vnodes
ch.add_node("server-C", weight=1)
ch.add_node("server-D", weight=1)

# Now A gets ~40% of keys, the rest get ~20% each

Vnode Count Selection Guide

Vnode count   Std. deviation   Memory overhead
10            ±30%             low
100           ±10%             medium
160           ±6%              balanced (Ketama default)
1000          ±2%              high

In practice, 100-200 is the sweet spot. Memcached's libketama uses 160; Cassandra's num_tokens defaulted to 256 for years, though Cassandra 4.0 lowered the default to 16.


4. Benchmark: How Even Is the Actual Distribution?

import random
import statistics

def benchmark(vnodes_per_node: int, num_nodes: int, num_keys: int):
    ch = ConsistentHashWithVNodes(vnodes_per_node)
    for i in range(num_nodes):
        ch.add_node(f"node-{i}")

    counts = {f"node-{i}": 0 for i in range(num_nodes)}
    for k in range(num_keys):
        key = f"key-{random.random()}"
        node = ch.get_node(key)
        counts[node] += 1

    values = list(counts.values())
    avg = statistics.mean(values)
    stddev = statistics.stdev(values)
    max_val = max(values)
    min_val = min(values)

    print(f"vnodes={vnodes_per_node}: "
          f"min={min_val}, max={max_val}, "
          f"stddev={stddev:.0f} ({stddev/avg*100:.1f}% of avg)")

# Distribute 1,000,000 keys across 10 nodes
benchmark(vnodes_per_node=1,   num_nodes=10, num_keys=1_000_000)
benchmark(vnodes_per_node=10,  num_nodes=10, num_keys=1_000_000)
benchmark(vnodes_per_node=100, num_nodes=10, num_keys=1_000_000)
benchmark(vnodes_per_node=500, num_nodes=10, num_keys=1_000_000)

Expected output:

vnodes=1:   min=30000, max=250000, stddev=70000 (70.0% of avg)
vnodes=10:  min=60000, max=180000, stddev=35000 (35.0% of avg)
vnodes=100: min=92000, max=110000, stddev=5800  (5.8% of avg)
vnodes=500: min=97000, max=103000, stddev=2000  (2.0% of avg)

With just 100 virtual nodes you can achieve ±6% balance.


5. Jump Consistent Hash: Uniform Without Virtual Nodes

In 2014, John Lamping and Eric Veach at Google published Jump Consistent Hash, which achieves O(1) space and nearly perfect uniform distribution without virtual nodes.

Striking Properties

  • Memory: O(1) (no node list needed!)
  • Time: O(log N) (N is node count)
  • Distribution: nearly perfect
  • Downside: inconvenient node removal (only the last node can be removed)

The Jump Hash Algorithm

def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """
    Google Jump Consistent Hash
    key: hashed integer
    num_buckets: number of buckets (nodes)
    """
    b = -1
    j = 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

# Example
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

num_nodes = 10
key = hash_key("user:1234")
bucket = jump_consistent_hash(key, num_nodes)
print(f"user:1234 → bucket {bucket}")

Why Does This Work?

The core of Jump Hash is a pseudo-random probabilistic skip. Each iteration probabilistically decides the "next jump position." Mathematically proven properties:

  • When the node count changes from N to N+1, on average only K/(N+1) keys move, and all of them move to the new node.
  • Distribution is perfectly uniform (no virtual nodes required).

Limitation: Node Removal

Jump Hash can only remove the last node (the highest index). Removing a middle node shifts the indices of every node after it, remapping far more keys than just the removed node's.

To work around this, add a Lookup Table or use Rendezvous Hashing, described next.
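The growth direction, at least, is easy to verify numerically. This sketch (restating the functions above so it runs standalone) checks that when buckets grow from 10 to 11, every key that moves lands on the new bucket, and only about 1/11 of keys move:

```python
import hashlib

def jump_consistent_hash(key: int, num_buckets: int) -> int:
    # Restated from above so the sketch runs standalone
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

moved = 0
for i in range(100_000):
    h = hash_key(f"user:{i}")
    old, new = jump_consistent_hash(h, 10), jump_consistent_hash(h, 11)
    if old != new:
        assert new == 10, "moved keys may only land on the new bucket"
        moved += 1
print(f"{moved / 100_000:.1%} moved (theory: {1 / 11:.1%})")
```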


6. Rendezvous Hashing (HRW: Highest Random Weight)

In 1997, David Thaler and Chinya Ravishankar proposed Rendezvous Hashing, another elegant solution.

The Idea

For each key, compute the hash with every node and pick the node with the highest (or lowest) value.

def rendezvous_hash(key: str, nodes: list) -> str:
    """
    Assign each key to the node with the highest weight
    """
    if not nodes:
        return None

    max_weight = -1
    selected_node = None

    for node in nodes:
        combined = f"{key}:{node}"
        h = int(hashlib.md5(combined.encode()).hexdigest(), 16)
        if h > max_weight:
            max_weight = h
            selected_node = node

    return selected_node

# Example
nodes = ["server-A", "server-B", "server-C", "server-D"]
print(rendezvous_hash("user:1234", nodes))  # e.g., "server-C"

Pros and Cons

Pros:

  • Extremely simple (just needs a hash function)
  • Perfectly uniform distribution
  • Only K/N keys move on node add/remove
  • Unlike Jump Hash, any node can be removed
  • Great for replica selection: picking the Top-K nodes immediately gives K replica placements.

Cons:

  • O(N) time complexity: slow with many nodes.
  • Not suitable beyond 1000 nodes.

Replica Selection

Where Rendezvous Hashing shines is Top-K node selection:

def rendezvous_hash_topk(key: str, nodes: list, k: int) -> list:
    weights = []
    for node in nodes:
        h = int(hashlib.md5(f"{key}:{node}".encode()).hexdigest(), 16)
        weights.append((h, node))
    weights.sort(reverse=True)
    return [node for _, node in weights[:k]]

# Pick 3 nodes for 3 replicas
replicas = rendezvous_hash_topk("user:1234", nodes, k=3)
print(replicas)  # ["server-C", "server-A", "server-D"]

This is similar to the idea used in Cassandra's replica placement strategy.


7. Anchor Hashing: A Modern Algorithm (2018)

Anchor Hashing, published in 2018 by Gal Mendelson et al., provides both Jump Hash's speed and Rendezvous's flexible node removal.

Key properties:

  • O(log N) time complexity
  • O(N) space (constant memory per node)
  • Arbitrary node removal
  • Perfectly uniform distribution

The real implementation is complex, so we only mention it here. Maglev Hash (Google) falls in the same category.


8. How Is This Used in Real Systems?

DynamoDB: Basic Consistent Hashing + vnodes

Amazon DynamoDB follows the design of the Amazon Dynamo paper (2007):

  • Circular 128-bit hash space.
  • Each physical node holds many virtual nodes (tokens).
  • Each key is replicated to the next N clockwise nodes (Preference List).
  • On node add/remove, data only moves between adjacent nodes.
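The preference-list walk can be sketched in a few lines (a hypothetical toy ring, not DynamoDB's actual code): start at the key's position and collect the first N distinct physical nodes clockwise, skipping extra vnodes of nodes already chosen:

```python
import hashlib
from bisect import bisect_right, insort

# Toy sketch of a Dynamo-style preference list.
def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

ring, points = {}, []
for node in ["A", "B", "C", "D"]:
    for i in range(8):  # a few vnodes per node, for brevity
        h = _hash(f"{node}#{i}")
        ring[h] = node
        insort(points, h)

def preference_list(key: str, n: int) -> list:
    # Assumes n <= number of distinct physical nodes
    idx = bisect_right(points, _hash(key))
    chosen = []
    while len(chosen) < n:
        node = ring[points[idx % len(points)]]
        if node not in chosen:
            chosen.append(node)
        idx += 1
    return chosen

print(preference_list("user:1234", n=3))
```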

Cassandra: Token Ring

Cassandra has a similar structure but uses slightly different terminology:

  • Token = virtual node (Murmur3 64-bit hash)
  • num_tokens defaulted to 256 for years (16 since Cassandra 4.0) → each physical node owns that many tokens.
  • Replication is configured via Replication Strategy (SimpleStrategy, NetworkTopologyStrategy).
# cassandra.yaml
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner

Memcached (libketama): The Virtual Node Standard

libketama, built by Last.fm, is a Memcached client library that became the de facto standard for Consistent Hashing:

  • 160 virtual nodes per physical node.
  • MD5 hash (128 bits truncated to 32 bits).
  • Server info string → MD5 → four 32-bit integers → four virtual nodes.
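The point derivation can be sketched as follows (this is my reading of libketama's little-endian byte packing; the server-string format here is illustrative):

```python
import hashlib

def ketama_points(server: str, replica: int) -> list:
    # One MD5 digest -> four 32-bit ring positions (little-endian
    # byte packing, following libketama's scheme)
    digest = hashlib.md5(f"{server}-{replica}".encode()).digest()
    return [
        digest[4 * i]
        | (digest[4 * i + 1] << 8)
        | (digest[4 * i + 2] << 16)
        | (digest[4 * i + 3] << 24)
        for i in range(4)
    ]

# 40 replicas x 4 points each = 160 virtual nodes per server
points = [p for r in range(40) for p in ketama_points("10.0.0.1:11211", r)]
print(len(points))  # 160
```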

Nginx: upstream consistent hash

Nginx supports Consistent Hashing for upstream load balancing:

upstream backend {
    hash $request_uri consistent;
    server backend1.example.com;
    server backend2.example.com;
    server backend3.example.com;
}

The consistent keyword enables Ketama-based Consistent Hashing.

Envoy / HAProxy / Istio

Envoy's Ring Hash LB Policy and Maglev LB Policy implement Ketama-style and Google Maglev-style Consistent Hashing respectively.

# Envoy config
load_balancing_policy:
  policies:
  - typed_extension_config:
      name: envoy.load_balancing_policies.ring_hash
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.load_balancing_policies.ring_hash.v3.RingHash
        minimum_ring_size: 1024

9. Common Mistakes and Pitfalls

Pitfall 1: Choice of Hash Function

Bad choice:

def bad_hash(key):
    return hash(key)  # Python's built-in hash() changes on restart!

Python's built-in hash() uses a different seed per process for security reasons. In a distributed system, having the mapping change on every restart is catastrophic.
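The instability is easy to demonstrate by running hash() in separate interpreter processes (with PYTHONHASHSEED left unset, which is the default):

```python
import os
import subprocess
import sys

# Run hash('user:1234') in five fresh interpreter processes.
# PYTHONHASHSEED is explicitly removed from the environment so each
# process picks its own random seed (the default behavior).
env = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}
outputs = {
    subprocess.run(
        [sys.executable, "-c", "print(hash('user:1234'))"],
        capture_output=True, text=True, env=env,
    ).stdout.strip()
    for _ in range(5)
}
print(f"{len(outputs)} distinct hash values from 5 runs")
```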

Good choices:

  • MD5, SHA-1: slow but good distribution.
  • MurmurHash3: fast with good distribution (used in Cassandra).
  • xxHash: very fast (increasingly preferred).

Pitfall 2: Too Few Virtual Nodes

4 nodes × 1 vnode = 4 → severely unbalanced distribution. Always use at least 100 virtual nodes per node.

Pitfall 3: String Format of Node IDs

If vnode name format isn't consistent, the mapping isn't reproducible:

# Bad
f"{node}-{i}"   # node1-0, node1-1, ...
f"{node}_{i}"   # node1_0, node1_1, ... (different system)

# Good: document team/org standard
f"{node}#{i}"   # explicitly documented

Pitfall 4: Hot Keys

Consistent Hashing only works well when key distribution is uniform. If one key (e.g., user:celebrity_account) gets huge traffic, one node is overloaded.

Solutions:

  • Key Salting: user:celebrity_account:{random_suffix} to spread across shards.
  • Local Cache: reduce distributed cache calls with in-app caching.
  • Read Replica: add read-only replicas.
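A minimal key-salting sketch (the shard count and key format are illustrative, not a standard):

```python
import random

NUM_SHARDS = 8  # illustrative; size to the hot key's traffic

def salted_write_keys(key: str) -> list:
    # Writes fan out to every salted variant so all shards stay fresh
    return [f"{key}:{i}" for i in range(NUM_SHARDS)]

def salted_read_key(key: str) -> str:
    # Reads pick one variant at random, spreading load across shards
    return f"{key}:{random.randrange(NUM_SHARDS)}"

hot = "user:celebrity_account"
print(salted_write_keys(hot)[:3])
print(salted_read_key(hot))
```

The salted variants hash to different ring positions, so the hot key's traffic lands on several nodes instead of one.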

Pitfall 5: Rapid Cluster Churn

In environments like Kubernetes where nodes are frequently created/destroyed, data movement happens every time. Rebalancing Throttling is needed:

  • Warm-up period: new nodes gradually take traffic.
  • Cooldown: wait for data to fully migrate after node removal.

10. Going Further: Weighted Consistent Hashing

Maglev (Google)

Google's Maglev was designed with these requirements:

  1. Perfect uniformity (across similarly weighted servers)
  2. Minimum disruption (on node changes)
  3. O(1) lookup (fast runtime queries)

Maglev computes a preference permutation for each server and uses it to build a lookup table of size M. Lookup is lookup_table[hash(key) % M] in O(1).

# Conceptual pseudocode
def build_maglev_table(servers, table_size=65537):
    # table_size should be a prime
    permutations = {
        s: generate_permutation(s, table_size) for s in servers
    }
    table = [None] * table_size
    next_idx = {s: 0 for s in servers}

    filled = 0
    while filled < table_size:
        for s in servers:
            while table[permutations[s][next_idx[s]]] is not None:
                next_idx[s] += 1
            table[permutations[s][next_idx[s]]] = s
            next_idx[s] += 1
            filled += 1
            if filled == table_size:
                break
    return table

Maglev is available as Envoy's maglev load balancer policy and is used internally by Google Cloud Load Balancer.

Assigning Weights

With heterogeneous servers (e.g., 16-core vs. 4-core), reflect weights by:

  1. Vnode approach: vnodes = base * weight.
  2. Maglev approach: give heavier servers proportionally more turns when filling the lookup table.
  3. Rendezvous weighting: map the hash to a uniform h in (0, 1) and score each node as -weight / ln(h); the highest score wins.
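A common concrete form of weighted rendezvous is the logarithmic method, scoring each node as -weight / ln(h) with h uniform in (0, 1). A sketch (node names and weights are illustrative):

```python
import hashlib
import math

def weighted_rendezvous(key: str, weights: dict) -> str:
    # score = -w / ln(h): a node with weight 2 wins ~2x as often
    best, best_score = None, float("-inf")
    for node, w in weights.items():
        digest = hashlib.md5(f"{key}:{node}".encode()).hexdigest()
        h = (int(digest, 16) + 1) / (2**128 + 1)  # uniform in (0, 1)
        score = -w / math.log(h)
        if score > best_score:
            best, best_score = node, score
    return best

weights = {"server-A": 2, "server-B": 1, "server-C": 1, "server-D": 1}
counts = {n: 0 for n in weights}
for i in range(20_000):
    counts[weighted_rendezvous(f"key-{i}", weights)] += 1
print(counts)  # server-A should get roughly 2x the others
```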

11. Performance Comparison: Which Algorithm to Choose?

Algorithm            Time          Space   Uniformity           Node removal     Main uses
Modulo Hash          O(1)          O(1)    Perfect              Full rebalance   Simple sharding
Consistent + vnode   O(log V)      O(V)    Good (higher V)      Arbitrary        Cassandra, DynamoDB
Jump Hash            O(log N)      O(1)    Perfect              Last only        Google internal
Rendezvous Hash      O(N)          O(N)    Perfect              Arbitrary        Small clusters
Maglev               O(1) lookup   O(M)    Near perfect         Arbitrary        Google Cloud LB
Anchor Hashing       O(log N)      O(N)    Perfect              Arbitrary        Modern systems

Selection Guide

  • Nodes < 100, frequent churn: Rendezvous Hashing (simple, flexible)
  • Large-scale cache/DB cluster: Consistent Hashing + vnode (battle-tested)
  • Static node set, extreme performance: Jump Hash (O(log N))
  • L4/L7 load balancer: Maglev (O(1) lookup)
  • Heterogeneous capacity servers: vnode weights or weighted Maglev

12. Operational Tips: Lessons From Production

Tip 1: Never Change the Vnode Count

Changing num_tokens from 150 to 200 in production causes nearly full data rebalance. Pick carefully at initial cluster design time.

Tip 2: Monitor Distribution With Logs

Regularly measure how uniform the actual key distribution is:

# Prometheus metrics
consistent_hash_node_keys{node="server-A"} 12050
consistent_hash_node_keys{node="server-B"} 10800
consistent_hash_node_keys{node="server-C"} 11200
consistent_hash_node_keys{node="server-D"} 12500

If standard deviation spikes suddenly, suspect a hot key.
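The check itself can be sketched with the sample counts above (the alert threshold is an illustrative starting point, not a standard):

```python
import statistics

# Per-node key counts as scraped from the metrics above
counts = {"server-A": 12050, "server-B": 10800,
          "server-C": 11200, "server-D": 12500}

avg = statistics.mean(counts.values())
skew = statistics.stdev(counts.values()) / avg
print(f"stddev is {skew:.1%} of the mean")
if skew > 0.15:  # illustrative alert threshold
    print("imbalance: suspect a hot key or a mis-weighted node")
```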

Tip 3: Graceful Shutdown

Don't kill a node abruptly when removing it:

  1. Mark it "draining" (remove from ring).
  2. Wait for existing connections to complete.
  3. Verify data migration.
  4. Terminate the process.

Tip 4: Hash Function Speed vs. Quality

At hundreds of thousands of lookups per second, MD5 can be slow. Benchmark:

Hash function   Speed (MB/s)   Distribution quality
xxHash3         ~30,000        Good
MurmurHash3     ~6,000         Good
CityHash        ~10,000        Good
MD5             ~500           Excellent
SHA-256         ~400           Excellent

For cache sharding, xxHash or Murmur is appropriate.


Quiz Review

Q1. When scaling from 4 to 5 servers, what percentage of keys move under plain modulo hashing (hash % N)?

A. About 80% of keys move, because the results of hash % 4 and hash % 5 are largely unrelated. Consistent Hashing, in contrast, moves only about 20% (K/(N+1) = K/5).

Q2. Why are virtual nodes needed in basic Consistent Hashing?

A. With few nodes, the distribution of nodes on the ring is uneven, causing some nodes to get disproportionate keys. Scattering each physical node as 100-200 virtual nodes gives nearly uniform distribution by the law of large numbers. It also lets you assign weights by tuning vnode counts per server capacity.

Q3. What are the biggest pros and cons of Jump Consistent Hash?

A. Pros: O(1) space and perfectly uniform distribution. Works well without virtual nodes and is very fast. Cons: Only the last (highest-index) node can be removed. Removing a middle node breaks the entire mapping. If arbitrary node removal is required, use Rendezvous Hashing or Anchor Hashing.

Q4. Why is Rendezvous Hashing useful for selecting N replicas?

A. Rendezvous Hashing computes the weight (hash) of every node for each key. Sorting them and picking the top K naturally determines K replica placements. This has the same effect as "next K clockwise nodes" on a Consistent Hashing ring, but with a more uniform distribution.

Q5. How many virtual nodes per node does Memcached's libketama use by default, and why?

A. Ketama uses 160 vnodes per node. It's a balance between balance (std. dev ~5%) and memory overhead. In practice, it splits the MD5 result into four 4-byte integers, creating 4 virtual nodes per "replica." So 40 replicas × 4 = 160.


Closing: Lessons From Consistent Hashing

Consistent Hashing was conceived in 1997 at MIT to solve the CDN web-cache problem. It birthed Akamai, and has been a core building block of distributed systems for over 20 years.

Five Core Principles

  1. Circular hash space: placing nodes and keys in the same space to define adjacency.
  2. Clockwise assignment: keys go to the nearest clockwise node.
  3. Virtual nodes: uniform distribution via the law of large numbers.
  4. Minimum disruption: only K/N keys move on node add/remove.
  5. Weight flexibility: vnode counts reflect capacity differences.

Modern Evolution

  • Jump Hash: O(1) space for extreme efficiency.
  • Rendezvous Hashing: the pinnacle of simplicity.
  • Maglev: O(1) lookup + weighted LB.
  • Anchor Hashing: the latest improvement.

Remember

"The answer to 'where should this data go' in distributed systems" is almost always a variant of Consistent Hashing.

Redis Cluster, DynamoDB, Cassandra, Memcached, and the CDNs you use today all run on this nearly 30-year-old idea. Knowing the fundamentals changes the depth of your debugging and design.

Next time you ask "why does this key go to that server?", picture the hash ring in your head. The answer is there.

