DB Sharding & Partitioning Complete Guide 2025: Horizontal Scaling, Range/Hash/Directory, Resharding, Vitess
TL;DR
- Sharding = the answer to horizontal scaling: essential at a scale where a single DB cannot cope
- Shard key selection is destiny: wrong choice leads to hot spots and resharding hell
- Three main strategies: Range, Hash, Directory (Lookup) — each has trade-offs
- Distributed transactions are hard: 2PC is slow, Saga is practical
- Vitess is the standard: used by YouTube/Slack/GitHub. A distributed layer on top of MySQL
1. What is Sharding?
1.1 Limits of a Single DB
[Single DB]
├─ CPU: up to one machine's worth
├─ RAM: 256GB? 512GB?
├─ Disk: 30TB
└─ Network: 25 Gbps
→ Eventually hits the limit
Vertical Scaling:
- Bigger machine
- Has limits (largest EC2 = 24TB RAM, $20+/hour)
- Single point of failure
Horizontal Scaling:
- Multiple machines
- Effectively unlimited scaling
- Increased complexity
1.2 Can Read Replicas Solve It?
[Primary] (writes)
↓ replication
[Replica 1] [Replica 2] [Replica 3] (reads)
Helps: distributes read traffic.
Limitation: writes still go through a single Primary. If writes are the bottleneck, this does not help.
→ Sharding is required.
1.3 Definition of Sharding
Sharding = distributing data across multiple independent DBs (shards).
[users table]
↓ shard by user_id
[Shard 1: user_id 1-1M]
[Shard 2: user_id 1M-2M]
[Shard 3: user_id 2M-3M]
...
Each shard is a completely independent DB with its own CPU, RAM, and disk.
1.4 Effects of Sharding
| | Single DB | 100 Shards |
|---|---|---|
| Data | 1 TB | 100 TB |
| Write TPS | 10,000 | 1,000,000 |
| Cost | $10K/month | $50K/month |
| Complexity | Low | Very High |
100x performance + 5x cost = good ROI. However, complexity is a serious problem.
2. Shard Key Selection — the Most Important Decision
2.1 Conditions for a Good Shard Key
- Even distribution — data spread evenly across all shards
- Immutable — rarely changes once set
- Matches query patterns — data queried together lives on the same shard
- High cardinality — enough unique values
- Not monotonically increasing — auto-increment IDs risk hot spots
2.2 Common Shard Key Candidates
user_id:
- Groups a user's data on one shard (related data together)
- Large users become hot spots
- Cross-user JOINs are hard
created_at:
- Monotonically increasing → all new data goes to one shard (hot spot)
- OK for time-series data
hash(id):
- Even distribution
- No range queries possible
- Related data lands on different shards
country:
- Geographic efficiency
- Very uneven (US is huge, Monaco is tiny)
organization_id:
- Ideal for B2B SaaS
- Large organizations become hot spots
2.3 Slack's Case
Slack's shard key = workspace_id.
Why:
- All data from the same workspace on one shard → fast lookups
- Almost no cross-workspace JOINs
- Natural isolation per workspace
Downside: a massive enterprise workspace overwhelms a single shard → the "Whale shard" problem.
Solution: huge workspaces get their own dedicated shard.
3. Sharding Strategies
3.1 Range Sharding
Partition data by key range.
Shard 1: id 1 ~ 1,000,000
Shard 2: id 1,000,001 ~ 2,000,000
Shard 3: id 2,000,001 ~ 3,000,000
Pros:
- Efficient range queries (WHERE id BETWEEN 500000 AND 600000)
- Simple routing
- Easy to add new shards
Cons:
- Hot spot risk: with monotonically increasing keys, new data lands entirely on the last shard
- Possible uneven distribution
Use cases: time-series data, BigTable, HBase
3.2 Hash Sharding
Partition by the hash of the shard key.
shard_id = hash(user_id) % num_shards
Pros:
- Even distribution — assuming a good hash
- Safe even for monotonic keys
Cons:
- No range queries (WHERE id BETWEEN ...)
- Adding a shard forces nearly all data to be redistributed (because % num_shards changes)
Solution: Consistent Hashing — only some data moves when shards are added or removed.
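To make the trade-off concrete, here is a minimal sketch (hypothetical key space, md5 used only as a stable hash) of modulo routing and of how much data moves when the shard count changes:

```python
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    # Use a stable hash (Python's built-in hash() is salted per process)
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# How many of 100,000 keys move when we grow from 4 to 5 shards?
moved = sum(1 for uid in range(100_000)
            if shard_for(uid, 4) != shard_for(uid, 5))
print(f"{moved / 100_000:.0%} of keys change shards")  # roughly 80%
```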
3.3 Consistent Hashing
[Circular hash space]
        Shard A
       /       \
  Shard D     Shard B
       \       /
        Shard C
Each shard and data item is mapped to the same hash space. Data goes to the nearest shard clockwise.
When adding a shard: only the affected section is redistributed (1/N of the total).
Virtual Nodes: represent each shard with multiple virtual nodes → more even distribution.
Used by: Cassandra, DynamoDB, Riak, Memcached.
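A minimal consistent-hash ring sketch, assuming md5 as the ring hash and illustrative shard names; virtual nodes smooth out the distribution, and lookups walk clockwise to the next position on the ring:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = {}           # ring position -> physical shard
        self.sorted_keys = []
        for node in nodes:
            for i in range(vnodes):          # virtual nodes per physical shard
                pos = self._hash(f"{node}#{i}")
                self.ring[pos] = node
                bisect.insort(self.sorted_keys, pos)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # Clockwise lookup: first ring position >= hash(key), wrapping around
        idx = bisect.bisect(self.sorted_keys, self._hash(key)) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
print(ring.get_node("user:12345"))  # adding/removing a shard remaps only ~1/N of keys
```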
3.4 Directory-Based Sharding
Use a lookup table to explicitly map which data lives on which shard.
[Lookup]
user_1 → Shard A
user_2 → Shard B
user_3 → Shard A
...
Pros:
- Flexible: arbitrary mappings possible
- Handles uneven distributions
- Easier migrations
Cons:
- The lookup itself becomes a bottleneck (caching is mandatory)
- Extra infrastructure (Redis, Consul)
- Single point of failure
Used by: Vitess (a variation of this approach)
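A directory-based router in sketch form. The SHARD_MAP dict stands in for the real lookup store (Redis, Consul, or a metadata table), and an in-process cache keeps the directory off the hot path:

```python
import functools

# Stand-in for the real directory (Redis / Consul / a metadata table)
SHARD_MAP = {"user_1": "shard-a", "user_2": "shard-b", "user_3": "shard-a"}

@functools.lru_cache(maxsize=100_000)   # cache lookups; the directory is otherwise a bottleneck
def shard_for(user_key: str) -> str:
    shard = SHARD_MAP.get(user_key)
    if shard is None:
        raise KeyError(f"no shard mapping for {user_key}")
    return shard

# Moving a user is just a metadata update, not a rehash of the whole keyspace
SHARD_MAP["user_3"] = "shard-c"
shard_for.cache_clear()                 # invalidate the cache after remapping
print(shard_for("user_3"))              # shard-c
```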
3.5 Geographic Sharding
Shard by location:
- US users → US shard
- EU users → EU shard
- Asia users → Asia shard
Pros:
- Close to users → lower latency
- Compliance (GDPR — EU data stays in EU)
- Natural partitioning
Cons:
- Hard when users move
- Global queries are hard
- Very uneven (different user counts per region)
3.6 Comparison Table
| Strategy | Evenness | Range Queries | Adding Shards | Query Pattern |
|---|---|---|---|---|
| Range | Low | Yes | Easy | Range |
| Hash | High | No | Hard | Exact |
| Consistent Hash | High | No | Easy | Exact |
| Directory | High | Partial | Easy | Flexible |
| Geographic | Low | Partial | Moderate | Regional |
4. The Hot Spot Problem
4.1 What is a Hot Spot?
Traffic or data concentrating on a specific shard.
Causes:
- Wrong shard key (popular users, popular products)
- Monotonically increasing keys (created_at)
- Uneven distribution (a few giant organizations)
4.2 Impact
[Shard 1: 10% traffic]
[Shard 2: 5% traffic]
[Shard 3: 80% traffic] ← Hot spot! Dying
[Shard 4: 5% traffic]
The entire system is bottlenecked by the weakest shard.
4.3 Detection
-- Data volume per shard
SELECT shard_id, COUNT(*) as row_count
FROM users
GROUP BY shard_id;
-- Query count per shard (application metrics)
Warning signs:
- One shard is 5x+ larger than the average
- One shard's CPU/IO is at 100%
4.4 Solutions
1. Change the shard key
- Best but hardest (requires resharding)
2. Composite Key
shard_key = hash(user_id + timestamp_bucket)
Combines time-based + user-based factors (see the sketch after this list).
3. Sub-sharding
- Split hot shards into finer pieces
- Handle "Whale shards"
4. Caching
- Move hot data to Redis
- Reduce DB load
5. Read Replicas
- Distribute read load
- Writes remain hot
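A sketch of approach 2 (composite key), with illustrative names and an hourly bucket: a single hot user's writes spread across shards over time, at the cost of fanning reads for that user out to several shards.

```python
import hashlib
import time

NUM_SHARDS = 16

def shard_for(user_id: int, ts: float, bucket_seconds: int = 3600) -> int:
    # Combine the user with an hourly time bucket so one hot user's writes
    # do not all land on the same shard
    bucket = int(ts // bucket_seconds)
    key = f"{user_id}:{bucket}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

now = time.time()
print(shard_for(42, now))           # this hour's writes for user 42
print(shard_for(42, now - 86_400))  # yesterday's writes may sit on a different shard
```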
4.5 Example: Twitter's Celebrity Problem
Problem: a celebrity user (say, Elon Musk) tweets → must update timelines of hundreds of millions of followers.
Solution:
- Fan-out on write (regular users): pre-populate follower timelines on tweet
- Fan-in on read (celebrities): do not build timelines on tweet, fetch on read
- Hybrid: fan-out for active followers, fan-in for inactive ones
5. Difficulty of Distributed Queries
5.1 Single-Shard Queries
SELECT * FROM users WHERE user_id = 12345;
→ determine shard by user_id → query only that shard. Very fast.
5.2 Cross-Shard Queries
SELECT COUNT(*) FROM users WHERE country = 'KR';
→ query all shards → aggregate results. Slow + heavy load.
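What the router or application must do for the query above, in sketch form (the shard_connections list of DB-API connections is hypothetical): scatter the query to every shard in parallel, then aggregate the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def count_kr_users(shard_connections) -> int:
    """Scatter-gather: run the same query on every shard and sum the partial counts."""
    def count_on(conn):
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM users WHERE country = %s", ("KR",))
        return cur.fetchone()[0]

    # Query shards in parallel: total latency ≈ the slowest shard, not the sum,
    # but every shard still pays the cost of the scan
    with ThreadPoolExecutor(max_workers=len(shard_connections)) as pool:
        return sum(pool.map(count_on, shard_connections))
```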
5.3 Difficulty of JOINs
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE u.country = 'KR';
What if users and orders live on different shards? You need a distributed JOIN.
Strategies:
- Co-location: keep users and orders with the same user_id on the same shard
- Denormalization: copy user info into orders
- Distributed query engine: Vitess, Citus handle it automatically
- CQRS: use a separate DW (Snowflake) for queries
5.4 Distributed Transactions
def transfer(from_user, to_user, amount):
    # The two users may live on different shards
    db.update(f"... WHERE user_id={from_user}", -amount)  # debit the sender
    db.update(f"... WHERE user_id={to_user}", +amount)    # credit the receiver
Problem: first update succeeds + second fails → money disappears.
Solutions:
1. 2PC (Two-Phase Commit):
- Prepare → Commit
- All shards must agree to commit
- Slow + single point of failure → not recommended
2. Saga Pattern:
- Compensating transactions
- Eventually consistent
- De facto standard in practice (sketched after this list)
3. Keep in the same shard:
- When possible, put sender/receiver in the same shard
- Not always feasible
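A minimal Saga-style sketch of the transfer above (option 2 in the list): each step is a local, single-shard transaction, and a compensating action undoes step 1 if step 2 fails. debit/credit are hypothetical helpers; a real saga would also persist its state and retry.

```python
def debit(user_id: int, amount: int) -> None:
    ...  # hypothetical: local transaction on user_id's shard

def credit(user_id: int, amount: int) -> None:
    ...  # hypothetical: local transaction on user_id's shard

def transfer_saga(from_user: int, to_user: int, amount: int) -> None:
    debit(from_user, amount)              # step 1: sender's shard
    try:
        credit(to_user, amount)           # step 2: receiver's shard
    except Exception:
        credit(from_user, amount)         # compensating transaction: undo step 1
        raise
    # Between step 1 and step 2 the system is only eventually consistent
```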
6. Resharding — the Scariest Operation
6.1 Why is Resharding Hard?
- Data movement: moving tens of TB
- Zero downtime: no user impact
- Rollback: recover from problems
- Consistency: handle writes/reads mid-migration
6.2 Resharding Scenarios
1. Adding shards: 4 → 8 shards
2. Splitting a shard: dividing a huge shard in two
3. Merging shards: combining small shards
4. Changing the shard key: the hardest of all
6.3 Resharding Patterns
1. Bulk Copy + Switchover:
- Copy data to new shards
- Short maintenance window
- Switch traffic
- Drop old shards
Downside: downtime.
2. CDC + Dual Write:
- Sync old → new shards via CDC
- Application writes to both old + new (dual write)
- Verify data parity
- Switch reads to new
- Retire old shards
Zero downtime, complex.
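A sketch of the dual-write phase of pattern 2, with illustrative names and deliberately simplified error handling: the old shards stay the source of truth, the new layout is mirrored, and any mirror failure is left for the CDC/verification job to reconcile.

```python
import logging

def dual_write(old_shard, new_shard, sql: str, params: tuple) -> None:
    # Old shard layout remains the source of truth during migration
    old_shard.execute(sql, params)
    try:
        # Mirror the write to the new shard layout
        new_shard.execute(sql, params)
    except Exception:
        # Do not fail the user request; the CDC stream / verification job backfills
        logging.exception("dual-write to new shard failed; will be reconciled")
```

Once verification shows the two layouts agree, reads are switched to the new shards and the dual write is removed.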
3. Online Migration:
- Vitess's VReplication
- Automated zero-downtime migration
- Progress monitoring
6.4 Case: Slack's Sharding Evolution
- 2017: single MySQL → replicas
- 2018: first sharding (workspace_id)
- 2020: introduced Vitess
- 2022: handled whale workspaces
- 2024: thousands of shards, petabytes
Lesson: sharding is never one and done. It is a continuous evolution.
7. Vitess — a Distributed Layer on Top of MySQL
7.1 What is Vitess?
A MySQL sharding solution built by YouTube. CNCF graduated.
[Client]
↓
[VTGate] (query router)
↓
[VTTablet] [VTTablet] [VTTablet]
↓ ↓ ↓
[MySQL] [MySQL] [MySQL]
Pros:
- MySQL compatible — existing code works as is
- Automatic sharding — key-based routing
- Schema migrations — safe DDL
- VReplication — zero-downtime resharding
- Battle-tested at scale — YouTube, Slack, GitHub, Square
7.2 Companies Using It
- YouTube (original creator; petabytes of data)
- Slack (large scale)
- GitHub (migrated MySQL → Vitess)
- Square
- JD.com (China's largest e-commerce)
7.3 VSchema Example
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}
→ VTGate automatically shards based on user_id.
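Because VTGate speaks the MySQL wire protocol, the application connects with an ordinary MySQL driver and queries the sharded table as if it were a single database. Host, port, credentials, and the keyspace name below are placeholders, not part of the VSchema above:

```python
import pymysql  # any MySQL driver works; VTGate speaks the MySQL protocol

conn = pymysql.connect(host="vtgate.internal", port=3306,
                       user="app", password="app-password",  # placeholders
                       database="commerce")                   # keyspace name (placeholder)
with conn.cursor() as cur:
    # VTGate hashes user_id through the vindex defined above and routes to one shard
    cur.execute("SELECT name FROM users WHERE user_id = %s", (12345,))
    print(cur.fetchone())
```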
8. Other Distributed DB Options
8.1 Citus (PostgreSQL)
A PostgreSQL extension. Supports distributed queries and distributed transactions.
SELECT create_distributed_table('events', 'user_id');
Pros: PostgreSQL compatible, keep SQL as is. Cons: inherits PostgreSQL limitations (less mature than Vitess).
8.2 CockroachDB
Designed from scratch as a distributed SQL DB.
Pros:
- PostgreSQL compatible
- Automatic sharding (range-based)
- Global distribution support
- Distributed ACID transactions
Cons:
- Some PG features unsupported
- More expensive to operate
- Learning curve
Used by: DoorDash, Bose, Comcast.
8.3 YugabyteDB
Similar to CockroachDB. Both PostgreSQL compatible and Cassandra compatible.
8.4 TiDB
MySQL compatible + automatic distribution. Popular in China.
8.5 Comparison Table
| | Vitess | Citus | CockroachDB | YugabyteDB | TiDB |
|---|---|---|---|---|---|
| Base | MySQL | PostgreSQL | Custom | Custom | Custom |
| Compatibility | MySQL | PostgreSQL | PostgreSQL | PG + CQL | MySQL |
| Distributed Transactions | Limited | Yes | Yes | Yes | Yes |
| Automatic Sharding | Manual key | Manual key | Automatic | Automatic | Automatic |
| Operational Difficulty | Moderate | Low | Moderate | Moderate | Moderate |
| Validation | YouTube, Slack | Many | DoorDash | Many | Chinese giants |
9. In Practice — Designing E-commerce Sharding
9.1 Scenario
- 100 million users
- 10 million daily orders
- Single PostgreSQL hitting its limits
9.2 Analysis
Tables:
- users (100M rows)
- orders (1B rows)
- products (1M rows)
- reviews (5B rows)
9.3 Sharding Decisions
1. users:
- Shard key: user_id (hash)
- 16 shards
- Even distribution
2. orders:
- Shard key: user_id (hash, same shard as users)
- A user's row and their orders live on the same shard → JOIN is fast (routing sketch after this list)
3. products:
- Small (1M rows)
- No sharding — single DB + read replicas
- Or replicate across all shards
4. reviews:
- Shard key: product_id (hash)
- Fits product-based analytics
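A small routing sketch for decisions 1–2: users and orders route on the same hash of user_id, so a user's row and their orders are co-located and the JOIN stays on one shard. The connection pools are hypothetical.

```python
import hashlib

NUM_USER_SHARDS = 16

def user_shard(user_id: int) -> int:
    # users and orders both route on user_id, so they are co-located
    return int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % NUM_USER_SHARDS

def get_user_with_orders(pools, user_id: int):
    # pools: list of 16 connections/pools, indexed by shard id (hypothetical)
    conn = pools[user_shard(user_id)]
    cur = conn.cursor()
    cur.execute(
        "SELECT u.*, o.* FROM users u JOIN orders o ON o.user_id = u.user_id "
        "WHERE u.user_id = %s", (user_id,))
    return cur.fetchall()
```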
9.4 Query Patterns
-- User orders (same shard)
SELECT * FROM orders WHERE user_id = 12345;
-- Fast
-- User info + orders (same shard, JOIN OK)
SELECT u.*, o.*
FROM users u JOIN orders o ON o.user_id = u.user_id
WHERE u.user_id = 12345;
-- Fast
-- Product reviews (different shard)
SELECT * FROM reviews WHERE product_id = 67890;
-- Fast
-- All reviews written by a user (cross-shard)
SELECT * FROM reviews WHERE user_id = 12345;
-- Slow — scans every reviews shard
-- Solution: denormalize. Store review summaries on the users shard too.
9.5 Infrastructure
[Application]
↓
[ProxySQL / Vitess]
↓
[16 user shards]
[16 order shards (co-located)]
[1 products DB + 5 read replicas]
[8 reviews shards]
9.6 Monitoring
- Query count per shard
- Disk usage per shard
- Cross-shard query ratio
- Hot spot detection
10. Sharding Pitfalls and Lessons
10.1 Sharding Too Early
Wrong decision: sharding at 1000 QPS of traffic.
Reality:
- Single PostgreSQL = 50,000 QPS+
- Read replica = 200,000 QPS+
- Proper indexes + caching = even more
→ Try every other optimization first before sharding.
10.2 Wrong Shard Key
GitHub's first sharding attempt failure case:
- Sharded by repo_id → popular repos became hot spots
- Linus's linux repository overwhelmed a single shard
- Resharding required (months of work)
Lesson: the shard key is extremely hard to change. Be careful up front.
10.3 Avoiding Distributed Transactions
Distributed transactions really are hard. Whenever possible, keep transactions within a single shard.
While designing: shape the data model so that every write transaction fits in a single shard.
10.4 Operational Complexity
A sharded system means:
- Complex backups (16 shards simultaneously)
- Complex monitoring
- Harder debugging (which shard?)
- Harder migrations
→ Automation is essential. Have enough operations staff.
10.5 Is NewSQL the Answer?
NewSQL systems like CockroachDB and YugabyteDB offer automatic sharding + distributed transactions, removing much of the complexity of manual sharding.
When to go NewSQL:
- New projects
- Strong consistency required
- Limited operations staff
When to go manual sharding:
- Existing MySQL/PostgreSQL
- Extreme performance needed
- Concerns about NewSQL maturity
Quiz
1. Why can sharding not be replaced by read replicas?
Answer: read replicas only distribute reads. Writes still go through a single Primary. A system where writes are the bottleneck (e.g. 1 billion INSERTs per day) cannot be fixed with read replicas. Only sharding can spread writes across multiple machines. That said, sharding brings enormous complexity, so first check whether read replicas + caching + index optimization can solve the problem.
2. What are the pros and cons of hash sharding?
Answer: Pros: even distribution (with a good hash). Safe even with monotonically increasing keys. Cons: no range queries (WHERE id BETWEEN ...). Adding a shard redistributes nearly all data (because % num_shards changes). Solution: Consistent Hashing — only a fraction (1/N) of the data moves on add/remove. Used by Cassandra, DynamoDB, and Riak. Virtual nodes give more even distribution.
3. How do you solve the hot spot problem?
Answer: (1) Change the shard key — best, but requires resharding (hard), (2) Composite key — hash(user_id + timestamp_bucket) to spread across time/users, (3) Sub-sharding — split hot shards finer ("whale shard"), (4) Caching — move hot data to Redis, (5) Read replicas — distribute read load. The most common hot spots: popular users, popular products, monotonically increasing keys. Always review hot spot scenarios when choosing a shard key.
4. Why did Vitess become the standard for MySQL distribution?
Answer: (1) Born at YouTube — proven at massive scale, (2) MySQL compatible — existing code/tools keep working, (3) Automatic sharding — key-based routing, (4) VReplication — zero-downtime migrations and resharding, (5) CNCF graduated — cloud-native standard, (6) Proven users — Slack, GitHub, Square. Downside: distributed transactions are relatively weak (the Saga pattern is recommended). PostgreSQL users use Citus for a similar role.
5. When should you use NewSQL (CockroachDB) vs. manual sharding?
Answer: Pick NewSQL (CockroachDB, YugabyteDB) when: new project, strong consistency required (distributed ACID), limited operations staff, PostgreSQL compatibility is acceptable. Pick manual sharding (Vitess + MySQL) when: existing MySQL system, extreme performance needed, worried about NewSQL maturity, want more control. Reality: new systems increasingly go NewSQL; existing huge systems stick with Vitess. Both are valid — decide based on trade-offs.
References
- Designing Data-Intensive Applications — Martin Kleppmann (sharding chapter)
- Vitess — official YouTube project
- Citus Data
- CockroachDB Docs
- YugabyteDB
- TiDB
- Slack's Data Stores — Vitess migration
- GitHub MySQL → Vitess
- Stripe's Sharding Approach
- Discord's Trillion Messages — uses Cassandra
- Pinterest Sharding