Split View: DB 샤딩 & 파티셔닝 완전 가이드 2025: 수평 확장, Range/Hash/Directory, 리샤딩, Vitess

DB 샤딩 & 파티셔닝 완전 가이드 2025: 수평 확장, Range/Hash/Directory, 리샤딩, Vitess

TL;DR

샤딩 = 수평 확장의 답: 단일 DB로 안 되는 규모에서 필수
샤드 키 선택이 운명: 잘못 선택하면 hot spot, 재샤딩 지옥
3대 전략: Range, Hash, Directory (Lookup) — 각각 장단점
분산 트랜잭션이 어려움: 2PC는 느림, Saga가 현실적
Vitess가 표준: YouTube/Slack/GitHub이 사용. MySQL 위의 분산 레이어

1. 샤딩이란?

1.1 단일 DB의 한계

[Single DB]
  ├─ CPU: 1 머신만큼만
  ├─ RAM: 256GB? 512GB?
  ├─ 디스크: 30TB
  └─ 네트워크: 25 Gbps

→ 결국 한계에 도달

수직 확장 (Vertical Scaling):

더 큰 머신
한계 있음 (가장 큰 EC2 = 24TB RAM, $20+/시간)
단일 장애점

수평 확장 (Horizontal Scaling):

여러 머신
무한 확장
복잡도 증가

1.2 Read Replica로 해결?

[Primary] (쓰기)
   ↓ replication
[Replica 1] [Replica 2] [Replica 3] (읽기)

도움: 읽기 트래픽 분산.

한계: 쓰기는 여전히 Primary 1대. 쓰기가 병목이면 도움 안 됨.

→ 샤딩이 필요합니다.

1.3 샤딩의 정의

샤딩 = 데이터를 여러 독립적 DB(샤드)에 분산.

[users 테이블]
  ↓ shard by user_id
[Shard 1: user_id 1-1M]
[Shard 2: user_id 1M-2M]
[Shard 3: user_id 2M-3M]
...

각 샤드는 완전히 독립적인 DB. 자체 CPU, RAM, 디스크.

1.4 샤딩의 효과

	단일 DB	100 샤드
데이터	1 TB	100 TB
쓰기 TPS	10,000	1,000,000
비용	$10K/월	$50K/월
복잡도	낮음	매우 높음

100배 성능 + 5배 비용 = ROI 좋음. 단, 복잡도가 큰 문제.

2. 샤드 키 선택 — 가장 중요한 결정

2.1 좋은 샤드 키의 조건

균등 분포 — 데이터가 모든 샤드에 균등 분산
불변 — 한 번 정해지면 거의 안 변함
쿼리 패턴 일치 — 자주 함께 조회되는 데이터는 같은 샤드
카디널리티 높음 — 충분히 많은 고유 값
단조 증가 X — auto-increment ID는 hot spot 위험

2.2 흔한 샤드 키 후보

user_id:

✅ 사용자 데이터를 한 샤드에 모음 (관련 데이터 같이)
❌ 큰 사용자가 hot spot
❌ user 간 JOIN 어려움

created_at:

❌ 단조 증가 → 모든 새 데이터가 한 샤드에 (hot spot)
시계열 데이터에는 OK

hash(id):

✅ 균등 분포
❌ 범위 쿼리 불가능
❌ 관련 데이터가 다른 샤드

country:

✅ Geographic 효율
❌ 매우 불균등 (US가 거대, 모나코는 작음)

organization_id:

✅ B2B SaaS에 이상적
❌ 큰 조직이 hot spot

2.3 Slack의 사례

Slack의 샤드 키 = workspace_id.

이유:

같은 workspace의 모든 데이터를 한 샤드에 → 빠른 조회
workspace 간 JOIN 거의 없음
workspace 단위로 자연스러운 격리

단점: 거대 enterprise workspace가 단일 샤드 압도 → "Whale shard" 문제.

해결: 거대 workspace는 자체 전용 샤드.

3. 샤딩 전략

3.1 Range Sharding

데이터를 키의 범위로 분할.

Shard 1: id 1 ~ 1,000,000
Shard 2: id 1,000,001 ~ 2,000,000
Shard 3: id 2,000,001 ~ 3,000,000

장점:

범위 쿼리 효율 (WHERE id BETWEEN 500000 AND 600000)
단순한 라우팅
새 샤드 추가 쉬움

단점:

Hot spot 위험: 단조 증가 키면 새 데이터가 모두 마지막 샤드에
불균등 분포 가능

사용 사례: 시계열 데이터, BigTable, HBase

3.2 Hash Sharding

샤드 키의 hash로 분할.

shard_id = hash(user_id) % num_shards

장점:

균등 분포 — hash가 잘 만들어졌으면
단조 증가 키도 안전

단점:

범위 쿼리 불가능 (WHERE id BETWEEN ...)
새 샤드 추가 시 거의 모든 데이터 재분배 (% num_shards가 변하니까)

해결: Consistent Hashing — 샤드 추가/제거 시 일부 데이터만 이동.

3.3 Consistent Hashing

[원형 해시 공간]
        Shard A
       /
  Shard D
       \
        Shard B
       /
  Shard C

각 샤드와 데이터를 같은 해시 공간에 매핑. 데이터는 시계방향으로 가장 가까운 샤드에.

샤드 추가 시: 해당 부분만 재분배 (전체의 1/N).

Virtual Nodes: 각 샤드를 여러 가상 노드로 표현 → 더 균등.

사용: Cassandra, DynamoDB, Riak, Memcached.

3.4 Directory-Based Sharding

Lookup 테이블로 어떤 데이터가 어떤 샤드에 있는지 명시적 매핑.

[Lookup]
user_1 → Shard A
user_2 → Shard B
user_3 → Shard A
...

장점:

유연: 임의의 매핑 가능
균등하지 않은 분포도 처리
마이그레이션 쉬움

단점:

Lookup 자체가 병목 (캐싱 필수)
추가 인프라 (Redis, Consul)
단일 장애점

사용: Vitess (이런 방식의 변형)

3.5 Geographic Sharding

위치별 샤드:

US 사용자 → US 샤드
EU 사용자 → EU 샤드
Asia 사용자 → Asia 샤드

장점:

사용자에 가까이 → 낮은 latency
규제 준수 (GDPR — EU 데이터를 EU에)
자연스러운 분할

단점:

사용자 이동 시 어려움
글로벌 쿼리 어려움
매우 불균등 (지역별 사용자 수)

3.6 비교표

전략	균등성	범위 쿼리	추가	쿼리 패턴
Range	낮음	✅	쉬움	범위
Hash	높음	❌	어려움	정확
Consistent Hash	높음	❌	쉬움	정확
Directory	높음	부분	쉬움	자유
Geographic	낮음	부분	보통	지역

4. Hot Spot 문제

4.1 Hot Spot이란?

특정 샤드에 트래픽/데이터가 집중되는 현상.

원인:

잘못된 샤드 키 (인기 사용자, 인기 상품)
단조 증가 키 (created_at)
불균등 분포 (몇몇 거대 조직)

4.2 영향

[Shard 1: 10% 트래픽]
[Shard 2: 5% 트래픽]
[Shard 3: 80% 트래픽] ← Hot spot! 죽음
[Shard 4: 5% 트래픽]

전체 시스템이 가장 약한 샤드에 의해 결정됩니다.

4.3 탐지

-- 샤드별 데이터 양
SELECT shard_id, COUNT(*) as row_count
FROM users
GROUP BY shard_id;

-- 샤드별 쿼리 수 (애플리케이션 메트릭)

경고 신호:

한 샤드가 평균보다 5배+ 큼
한 샤드의 CPU/IO가 100%

4.4 해결책

1. 샤드 키 변경

가장 좋지만 가장 어려움 (재샤딩 필요)

2. Composite Key

shard_key = hash(user_id + timestamp_bucket)

시간 기반 + 사용자 기반 결합.

3. 하위 샤딩 (Sub-sharding)

Hot 샤드를 더 잘게 분할
"Whale shard" 처리

4. 캐싱

핫 데이터를 Redis로
DB 부하 감소

5. 읽기 복제본

읽기 부하 분산
쓰기는 여전히 hot

4.5 예시: Twitter의 Celebrity Problem

문제: 한 셀럽 사용자(예: 일론 머스크)가 트윗 → 수억 followers의 timeline 업데이트.

해결:

Fan-out on write (일반 사용자): 트윗 시 followers의 timeline에 미리 작성
Fan-in on read (셀럽): 트윗 시 timeline 안 만듦, 읽을 때 가져옴
하이브리드: 활성 followers는 fan-out, 비활성은 fan-in

5. 분산 쿼리의 어려움

5.1 단일 샤드 쿼리

SELECT * FROM users WHERE user_id = 12345;

→ user_id로 샤드 결정 → 해당 샤드만 쿼리. 매우 빠름.

5.2 Cross-Shard 쿼리

SELECT COUNT(*) FROM users WHERE country = 'KR';

→ 모든 샤드에 쿼리 → 결과 합산. 느림 + 부하.

5.3 JOIN의 어려움

SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE u.country = 'KR';

users와 orders가 다른 샤드에 있으면? 분산 JOIN 필요.

전략:

Co-location: 같은 user_id의 users와 orders를 같은 샤드에
Denormalization: orders에 user 정보 복사
분산 쿼리 엔진: Vitess, Citus가 자동 처리
CQRS: 쿼리용 별도 DW (Snowflake) 사용

5.4 분산 트랜잭션

def transfer(from_user, to_user, amount):
    # 두 사용자가 다른 샤드에 있을 수 있음
    db.update(f"... WHERE user_id={from_user}", -amount)
    db.update(f"... WHERE user_id={to_user}", +amount)

문제: 첫 update 성공 + 두 번째 실패 → 돈 사라짐.

해결:

1. 2PC (Two-Phase Commit):

Prepare → Commit
모든 샤드가 동의해야 commit
느림 + 단일 장애점 → 비추천

2. Saga 패턴:

보상 트랜잭션
Eventually consistent
실전 표준

3. 같은 샤드에 보장:

가능하면 전송자/수신자를 같은 샤드에
항상 가능하지는 않음

6. 리샤딩 — 가장 무서운 작업

6.1 왜 리샤딩이 어려운가?

데이터 이동: 수십 TB 이동
무중단: 사용자 영향 X
롤백: 문제 시 복구
일관성: 이동 중 쓰기/읽기 처리

6.2 리샤딩 시나리오

1. 샤드 추가: 4 샤드 → 8 샤드 2. 샤드 분할: 거대 샤드를 둘로 3. 샤드 병합: 작은 샤드 합치기 4. 샤드 키 변경: 가장 어려움

6.3 리샤딩 패턴

1. Bulk Copy + Switchover:

새 샤드에 데이터 복사
짧은 maintenance window
트래픽 전환
옛 샤드 삭제

단점: 다운타임.

2. CDC + Dual Write:

CDC로 옛 → 새 샤드 동기화
애플리케이션이 옛 + 새 둘 다 쓰기 (dual write)
데이터 일치 확인
읽기를 새로 전환
옛 샤드 폐기

무중단, 복잡.

3. Online Migration:

Vitess의 VReplication
자동화된 무중단 마이그레이션
진행률 모니터링

6.4 사례: Slack의 샤딩 진화

2017: 단일 MySQL → 복제본
2018: 첫 샤딩 (workspace_id)
2020: Vitess 도입
2022: Whale workspace 처리
2024: 수천 샤드, 페타바이트

교훈: 샤딩은 한 번에 끝나지 않음. 지속적 진화.

7. Vitess — MySQL 위의 분산 레이어

7.1 Vitess란?

YouTube가 만든 MySQL 샤딩 솔루션. CNCF 졸업.

[Client]
   ↓
[VTGate] (쿼리 라우터)
   ↓
[VTTablet] [VTTablet] [VTTablet]
   ↓        ↓          ↓
[MySQL]  [MySQL]    [MySQL]

장점:

MySQL 호환 — 기존 코드 그대로
자동 샤딩 — 키 기반 라우팅
스키마 마이그레이션 — 안전한 DDL
VReplication — 무중단 리샤딩
검증된 규모 — YouTube, Slack, GitHub, Square

7.2 사용 회사

YouTube (창시자, 페타바이트)
Slack (대규모)
GitHub (MySQL → Vitess 전환)
Square
Pinterest
JD.com (중국 최대 e-commerce)

7.3 VSchema 예시

{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}

→ VTGate가 자동으로 user_id 기반 샤딩.

8. 다른 분산 DB 옵션

8.1 Citus (PostgreSQL)

PostgreSQL 확장. 분산 쿼리, 분산 트랜잭션 지원.

SELECT create_distributed_table('events', 'user_id');

장점: PostgreSQL 호환, SQL 그대로. 단점: PostgreSQL 한계 (Vitess보다 성숙도 낮음).

8.2 CockroachDB

처음부터 분산 SQL DB로 설계.

장점:

PostgreSQL 호환
자동 샤딩 (range-based)
글로벌 분산 지원
분산 ACID 트랜잭션

단점:

일부 PG 기능 미지원
더 비싼 운영
학습 곡선

사용: DoorDash, Bose, Comcast.

8.3 YugabyteDB

CockroachDB와 비슷. PostgreSQL 호환 + Cassandra 호환 동시.

8.4 TiDB

MySQL 호환 + 자동 분산. 중국에서 인기.

8.5 비교표

	Vitess	Citus	CockroachDB	YugabyteDB	TiDB
기반	MySQL	PostgreSQL	자체	자체	자체
호환성	MySQL	PostgreSQL	PostgreSQL	PG + CQL	MySQL
분산 트랜잭션	제한	✅	✅	✅	✅
자동 샤딩	수동 키	수동 키	자동	자동	자동
운영 난이도	보통	낮음	보통	보통	보통
검증	YouTube, Slack	다수	DoorDash	다수	중국 큰 회사

9. 실전 — e-commerce 샤딩 설계

9.1 시나리오

1억 사용자
일일 1000만 주문
단일 PostgreSQL 한계 도달

9.2 분석

테이블:

users (1억 행)
orders (10억 행)
products (100만 행)
reviews (50억 행)

9.3 샤딩 결정

1. users:

샤드 키: user_id (hash)
16개 샤드
균등 분포

2. orders:

샤드 키: user_id (hash, users와 같은 샤드)
같은 사용자의 user + orders가 같은 샤드 → JOIN 빠름

3. products:

작음 (100만 행)
샤딩 X — 단일 DB + 읽기 복제본
또는 모든 샤드에 복제

4. reviews:

샤드 키: product_id (hash)
product 기준 분석에 적합

9.4 쿼리 패턴

-- 사용자 주문 (같은 샤드)
SELECT * FROM orders WHERE user_id = 12345;
-- ✅ 빠름

-- 사용자 정보 + 주문 (같은 샤드, JOIN OK)
SELECT u.*, o.*
FROM users u JOIN orders o ON o.user_id = u.user_id
WHERE u.user_id = 12345;
-- ✅ 빠름

-- 제품 리뷰 (다른 샤드)
SELECT * FROM reviews WHERE product_id = 67890;
-- ✅ 빠름

-- 사용자가 작성한 모든 리뷰 (cross-shard)
SELECT * FROM reviews WHERE user_id = 12345;
-- ⚠️ 느림 — 모든 reviews 샤드 검색

-- 해결: 비정규화. users 샤드에도 review 요약 저장.

9.5 인프라

[Application]
    ↓
[ProxySQL / Vitess]
    ↓
[16 user shards]
[16 order shards (co-located)]
[1 products DB + 5 read replicas]
[8 reviews shards]

9.6 모니터링

샤드별 쿼리 수
샤드별 디스크 사용량
Cross-shard 쿼리 비율
Hot spot 탐지

10. 샤딩의 함정과 교훈

10.1 너무 일찍 샤딩

잘못된 결정: 트래픽이 1000 QPS인데 샤딩.

현실:

단일 PostgreSQL = 50,000 QPS+
Read replica = 200,000 QPS+
적절한 인덱스 + 캐싱 = 더 많이

→ 샤딩 전에 다른 모든 최적화를 먼저 시도하세요.

10.2 잘못된 샤드 키

GitHub의 첫 샤딩 시도가 실패한 사례:

repo_id로 샤딩 → 인기 repo가 hot spot
Linus의 linux 저장소가 단일 샤드 압도
재샤딩 필요 (수개월의 작업)

교훈: 샤드 키는 변경하기 매우 어렵다. 처음에 신중히.

10.3 분산 트랜잭션 회피

분산 트랜잭션은 정말 어려움. 가능한 한 같은 샤드 안에서만 트랜잭션 보장.

설계 시: 쓰기 트랜잭션이 항상 단일 샤드 안에 들어가도록 데이터 모델 설계.

10.4 운영 복잡도

샤딩한 시스템은:

백업 복잡 (16개 샤드 동시)
모니터링 복잡
디버깅 어려움 (어느 샤드?)
마이그레이션 어려움

→ 자동화 필수. 운영 인력 충분히.

10.5 NewSQL이 답?

CockroachDB, YugabyteDB 같은 NewSQL은 자동 샤딩 + 분산 트랜잭션. 수동 샤딩의 복잡도 없음.

언제 NewSQL?:

새 프로젝트
강한 일관성 필요
운영 인력 부족

언제 수동 샤딩?:

기존 MySQL/PostgreSQL
극단적 성능 필요
NewSQL 성숙도 우려

퀴즈

1. 샤딩이 read replica로 해결할 수 없는 이유는?

답: Read replica는 읽기만 분산합니다. 쓰기는 여전히 Primary 1대가 처리. 쓰기가 병목인 시스템 (예: 일일 10억 INSERT)은 read replica로 해결 불가능. 샤딩만이 쓰기를 여러 머신에 분산할 수 있습니다. 단, 샤딩은 복잡도가 매우 높으므로 read replica + 캐싱 + 인덱스 최적화로 해결 가능한지 먼저 검토해야 합니다.

2. Hash sharding의 장단점은?

답: 장점: 균등 분포 (hash가 잘 만들어졌으면). 단조 증가 키도 안전. 단점: 범위 쿼리 불가능 (WHERE id BETWEEN ...). 샤드 추가 시 거의 모든 데이터 재분배 (% num_shards가 변하므로). 해결책: Consistent Hashing — 샤드 추가/제거 시 일부 데이터(1/N)만 이동. Cassandra, DynamoDB, Riak가 사용. Virtual nodes로 더 균등한 분포.

3. Hot Spot 문제를 어떻게 해결하나요?

답: (1) 샤드 키 변경 — 가장 좋지만 재샤딩 필요 (어려움), (2) Composite key — hash(user_id + timestamp_bucket)로 시간/사용자 분산, (3) 하위 샤딩 — Hot 샤드를 더 잘게 분할 ("whale shard"), (4) 캐싱 — 핫 데이터를 Redis로, (5) 읽기 복제본 — 읽기 부하 분산. 가장 흔한 hot spot: 인기 사용자, 인기 상품, 단조 증가 키. 샤드 키 선택 시 반드시 hot spot 시나리오 검토.

4. Vitess가 MySQL 분산의 표준이 된 이유는?

답: (1) YouTube에서 시작 — 거대한 규모에서 검증됨, (2) MySQL 호환 — 기존 코드/도구 그대로 사용, (3) 자동 샤딩 — 키 기반 라우팅, (4) VReplication — 무중단 마이그레이션과 리샤딩, (5) CNCF 졸업 — 클라우드 네이티브 표준, (6) 검증된 사용자 — Slack, GitHub, Square. 단점은 분산 트랜잭션이 약한 편 (Saga 패턴 권장). PostgreSQL 사용자는 Citus가 유사한 역할.

5. 언제 NewSQL (CockroachDB) vs 수동 샤딩?

답: NewSQL (CockroachDB, YugabyteDB) 선택: 새 프로젝트, 강한 일관성 필요 (분산 ACID), 운영 인력 부족, PostgreSQL 호환 가능. 수동 샤딩 (Vitess + MySQL) 선택: 기존 MySQL 시스템, 극단적 성능 필요, NewSQL 성숙도 우려, 더 큰 통제력 원함. 현실: 새 시스템은 점점 NewSQL로, 기존 거대 시스템은 Vitess 유지. 둘 다 valid한 선택이며 트레이드오프에 따라 결정.

참고 자료

Designing Data-Intensive Applications — Martin Kleppmann (샤딩 챕터)
Vitess — YouTube 공식
Citus Data
CockroachDB Docs
YugabyteDB
TiDB
Slack's Data Stores — Vitess 마이그레이션
GitHub MySQL → Vitess
Stripe's Sharding Approach
Discord's Trillion Messages — Cassandra 사용
Pinterest Sharding

DB Sharding & Partitioning Complete Guide 2025: Horizontal Scaling, Range/Hash/Directory, Resharding, Vitess

TL;DR

Sharding = the answer to horizontal scaling: essential at a scale where a single DB cannot cope
Shard key selection is destiny: wrong choice leads to hot spots and resharding hell
Three main strategies: Range, Hash, Directory (Lookup) — each has trade-offs
Distributed transactions are hard: 2PC is slow, Saga is practical
Vitess is the standard: used by YouTube/Slack/GitHub. A distributed layer on top of MySQL

1. What is Sharding?

1.1 Limits of a Single DB

[Single DB]
  ├─ CPU: up to one machine's worth
  ├─ RAM: 256GB? 512GB?
  ├─ Disk: 30TB
  └─ Network: 25 Gbps

→ Eventually hits the limit

Vertical Scaling:

Bigger machine
Has limits (largest EC2 = 24TB RAM, $20+/hour)
Single point of failure

Horizontal Scaling:

Multiple machines
Infinite scaling
Increased complexity

1.2 Can Read Replicas Solve It?

[Primary] (writes)
   ↓ replication
[Replica 1] [Replica 2] [Replica 3] (reads)

Helps: distributes read traffic.

Limitation: writes still go through a single Primary. If writes are the bottleneck, this does not help.

→ Sharding is required.

1.3 Definition of Sharding

Sharding = distributing data across multiple independent DBs (shards).

[users table]
  ↓ shard by user_id
[Shard 1: user_id 1-1M]
[Shard 2: user_id 1M-2M]
[Shard 3: user_id 2M-3M]
...

Each shard is a completely independent DB with its own CPU, RAM, and disk.

1.4 Effects of Sharding

	Single DB	100 Shards
Data	1 TB	100 TB
Write TPS	10,000	1,000,000
Cost	$10K/month	$50K/month
Complexity	Low	Very High

100x performance + 5x cost = good ROI. However, complexity is a serious problem.

2. Shard Key Selection — the Most Important Decision

2.1 Conditions for a Good Shard Key

Even distribution — data spread evenly across all shards
Immutable — rarely changes once set
Matches query patterns — data queried together lives on the same shard
High cardinality — enough unique values
Not monotonically increasing — auto-increment IDs risk hot spots

2.2 Common Shard Key Candidates

user_id:

Groups a user's data on one shard (related data together)
Large users become hot spots
Cross-user JOINs are hard

created_at:

Monotonically increasing → all new data goes to one shard (hot spot)
OK for time-series data

hash(id):

Even distribution
No range queries possible
Related data lands on different shards

country:

Geographic efficiency
Very uneven (US is huge, Monaco is tiny)

organization_id:

Ideal for B2B SaaS
Large organizations become hot spots

2.3 Slack's Case

Slack's shard key = workspace_id.

Why:

All data from the same workspace on one shard → fast lookups
Almost no cross-workspace JOINs
Natural isolation per workspace

Downside: a massive enterprise workspace overwhelms a single shard → the "Whale shard" problem.

Solution: huge workspaces get their own dedicated shard.

3. Sharding Strategies

3.1 Range Sharding

Partition data by key range.

Shard 1: id 1 ~ 1,000,000
Shard 2: id 1,000,001 ~ 2,000,000
Shard 3: id 2,000,001 ~ 3,000,000

Pros:

Efficient range queries (WHERE id BETWEEN 500000 AND 600000)
Simple routing
Easy to add new shards

Cons:

Hot spot risk: with monotonically increasing keys, new data lands entirely on the last shard
Possible uneven distribution

Use cases: time-series data, BigTable, HBase

3.2 Hash Sharding

Partition by the hash of the shard key.

shard_id = hash(user_id) % num_shards

Pros:

Even distribution — assuming a good hash
Safe even for monotonic keys

Cons:

No range queries (WHERE id BETWEEN ...)
Adding a shard forces nearly all data to be redistributed (because % num_shards changes)

Solution: Consistent Hashing — only some data moves when shards are added or removed.

3.3 Consistent Hashing

[Circular hash space]
        Shard A
       /
  Shard D
       \
        Shard B
       /
  Shard C

Each shard and data item is mapped to the same hash space. Data goes to the nearest shard clockwise.

When adding a shard: only the affected section is redistributed (1/N of the total).

Virtual Nodes: represent each shard with multiple virtual nodes → more even distribution.

Used by: Cassandra, DynamoDB, Riak, Memcached.

3.4 Directory-Based Sharding

Use a lookup table to explicitly map which data lives on which shard.

[Lookup]
user_1 → Shard A
user_2 → Shard B
user_3 → Shard A
...

Pros:

Flexible: arbitrary mappings possible
Handles uneven distributions
Easier migrations

Cons:

The lookup itself becomes a bottleneck (caching is mandatory)
Extra infrastructure (Redis, Consul)
Single point of failure

Used by: Vitess (a variation of this approach)

3.5 Geographic Sharding

Shard by location:

US users → US shard
EU users → EU shard
Asia users → Asia shard

Pros:

Close to users → lower latency
Compliance (GDPR — EU data stays in EU)
Natural partitioning

Cons:

Hard when users move
Global queries are hard
Very uneven (different user counts per region)

3.6 Comparison Table

Strategy	Evenness	Range Queries	Adding Shards	Query Pattern
Range	Low	Yes	Easy	Range
Hash	High	No	Hard	Exact
Consistent Hash	High	No	Easy	Exact
Directory	High	Partial	Easy	Flexible
Geographic	Low	Partial	Moderate	Regional

4. The Hot Spot Problem

4.1 What is a Hot Spot?

Traffic or data concentrating on a specific shard.

Causes:

Wrong shard key (popular users, popular products)
Monotonically increasing keys (created_at)
Uneven distribution (a few giant organizations)

4.2 Impact

[Shard 1: 10% traffic]
[Shard 2: 5% traffic]
[Shard 3: 80% traffic] ← Hot spot! Dying
[Shard 4: 5% traffic]

The entire system is bottlenecked by the weakest shard.

4.3 Detection

-- Data volume per shard
SELECT shard_id, COUNT(*) as row_count
FROM users
GROUP BY shard_id;

-- Query count per shard (application metrics)

Warning signs:

One shard is 5x+ larger than the average
One shard's CPU/IO is at 100%

4.4 Solutions

1. Change the shard key

Best but hardest (requires resharding)

2. Composite Key

shard_key = hash(user_id + timestamp_bucket)

Combines time-based + user-based factors.

3. Sub-sharding

Split hot shards into finer pieces
Handle "Whale shards"

4. Caching

Move hot data to Redis
Reduce DB load

5. Read Replicas

Distribute read load
Writes remain hot

4.5 Example: Twitter's Celebrity Problem

Problem: a celebrity user (say, Elon Musk) tweets → must update timelines of hundreds of millions of followers.

Solution:

Fan-out on write (regular users): pre-populate follower timelines on tweet
Fan-in on read (celebrities): do not build timelines on tweet, fetch on read
Hybrid: fan-out for active followers, fan-in for inactive ones

5. Difficulty of Distributed Queries

5.1 Single-Shard Queries

SELECT * FROM users WHERE user_id = 12345;

→ determine shard by user_id → query only that shard. Very fast.

5.2 Cross-Shard Queries

SELECT COUNT(*) FROM users WHERE country = 'KR';

→ query all shards → aggregate results. Slow + heavy load.

5.3 Difficulty of JOINs

SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE u.country = 'KR';

What if users and orders live on different shards? You need a distributed JOIN.

Strategies:

Co-location: keep users and orders with the same user_id on the same shard
Denormalization: copy user info into orders
Distributed query engine: Vitess, Citus handle it automatically
CQRS: use a separate DW (Snowflake) for queries

5.4 Distributed Transactions

def transfer(from_user, to_user, amount):
    # The two users may live on different shards
    db.update(f"... WHERE user_id={from_user}", -amount)
    db.update(f"... WHERE user_id={to_user}", +amount)

Problem: first update succeeds + second fails → money disappears.

Solutions:

1. 2PC (Two-Phase Commit):

Prepare → Commit
All shards must agree to commit
Slow + single point of failure → not recommended

2. Saga Pattern:

Compensating transactions
Eventually consistent
De facto standard in practice

3. Keep in the same shard:

When possible, put sender/receiver in the same shard
Not always feasible

6. Resharding — the Scariest Operation

6.1 Why is Resharding Hard?

Data movement: moving tens of TB
Zero downtime: no user impact
Rollback: recover from problems
Consistency: handle writes/reads mid-migration

6.2 Resharding Scenarios

1. Adding shards: 4 → 8 shards 2. Splitting a shard: dividing a huge shard in two 3. Merging shards: combining small shards 4. Changing the shard key: the hardest of all

6.3 Resharding Patterns

1. Bulk Copy + Switchover:

Copy data to new shards
Short maintenance window
Switch traffic
Drop old shards

Downside: downtime.

2. CDC + Dual Write:

Sync old → new shards via CDC
Application writes to both old + new (dual write)
Verify data parity
Switch reads to new
Retire old shards

Zero downtime, complex.

3. Online Migration:

Vitess's VReplication
Automated zero-downtime migration
Progress monitoring

6.4 Case: Slack's Sharding Evolution

2017: single MySQL → replicas
2018: first sharding (workspace_id)
2020: introduced Vitess
2022: handled whale workspaces
2024: thousands of shards, petabytes

Lesson: sharding is never one and done. It is a continuous evolution.

7. Vitess — a Distributed Layer on Top of MySQL

7.1 What is Vitess?

A MySQL sharding solution built by YouTube. CNCF graduated.

[Client]
   ↓
[VTGate] (query router)
   ↓
[VTTablet] [VTTablet] [VTTablet]
   ↓        ↓          ↓
[MySQL]  [MySQL]    [MySQL]

Pros:

MySQL compatible — existing code works as is
Automatic sharding — key-based routing
Schema migrations — safe DDL
VReplication — zero-downtime resharding
Battle-tested at scale — YouTube, Slack, GitHub, Square

7.2 Companies Using It

YouTube (creator, petabytes)
Slack (large scale)
GitHub (migrated MySQL → Vitess)
Square
Pinterest
JD.com (China's largest e-commerce)

7.3 VSchema Example

{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}

→ VTGate automatically shards based on user_id.

8. Other Distributed DB Options

8.1 Citus (PostgreSQL)

A PostgreSQL extension. Supports distributed queries and distributed transactions.

SELECT create_distributed_table('events', 'user_id');

Pros: PostgreSQL compatible, keep SQL as is. Cons: inherits PostgreSQL limitations (less mature than Vitess).

8.2 CockroachDB

Designed from scratch as a distributed SQL DB.

Pros:

PostgreSQL compatible
Automatic sharding (range-based)
Global distribution support
Distributed ACID transactions

Cons:

Some PG features unsupported
More expensive to operate
Learning curve

Used by: DoorDash, Bose, Comcast.

8.3 YugabyteDB

Similar to CockroachDB. Both PostgreSQL compatible and Cassandra compatible.

8.4 TiDB

MySQL compatible + automatic distribution. Popular in China.

8.5 Comparison Table

	Vitess	Citus	CockroachDB	YugabyteDB	TiDB
Base	MySQL	PostgreSQL	Custom	Custom	Custom
Compatibility	MySQL	PostgreSQL	PostgreSQL	PG + CQL	MySQL
Distributed Transactions	Limited	Yes	Yes	Yes	Yes
Automatic Sharding	Manual key	Manual key	Automatic	Automatic	Automatic
Operational Difficulty	Moderate	Low	Moderate	Moderate	Moderate
Validation	YouTube, Slack	Many	DoorDash	Many	Chinese giants

9. In Practice — Designing E-commerce Sharding

9.1 Scenario

100 million users
10 million daily orders
Single PostgreSQL hitting its limits

9.2 Analysis

Tables:

users (100M rows)
orders (1B rows)
products (1M rows)
reviews (5B rows)

9.3 Sharding Decisions

1. users:

Shard key: user_id (hash)
16 shards
Even distribution

2. orders:

Shard key: user_id (hash, same shard as users)
Same user's user + orders on the same shard → JOIN is fast

3. products:

Small (1M rows)
No sharding — single DB + read replicas
Or replicate across all shards

4. reviews:

Shard key: product_id (hash)
Fits product-based analytics

9.4 Query Patterns

-- User orders (same shard)
SELECT * FROM orders WHERE user_id = 12345;
-- Fast

-- User info + orders (same shard, JOIN OK)
SELECT u.*, o.*
FROM users u JOIN orders o ON o.user_id = u.user_id
WHERE u.user_id = 12345;
-- Fast

-- Product reviews (different shard)
SELECT * FROM reviews WHERE product_id = 67890;
-- Fast

-- All reviews written by a user (cross-shard)
SELECT * FROM reviews WHERE user_id = 12345;
-- Slow — scans every reviews shard

-- Solution: denormalize. Store review summaries on the users shard too.

9.5 Infrastructure

[Application]
    ↓
[ProxySQL / Vitess]
    ↓
[16 user shards]
[16 order shards (co-located)]
[1 products DB + 5 read replicas]
[8 reviews shards]

9.6 Monitoring

Query count per shard
Disk usage per shard
Cross-shard query ratio
Hot spot detection

10. Sharding Pitfalls and Lessons

10.1 Sharding Too Early

Wrong decision: sharding at 1000 QPS of traffic.

Reality:

Single PostgreSQL = 50,000 QPS+
Read replica = 200,000 QPS+
Proper indexes + caching = even more

→ Try every other optimization first before sharding.

10.2 Wrong Shard Key

GitHub's first sharding attempt failure case:

Sharded by repo_id → popular repos became hot spots
Linus's linux repository overwhelmed a single shard
Resharding required (months of work)

Lesson: the shard key is extremely hard to change. Be careful up front.

10.3 Avoiding Distributed Transactions

Distributed transactions really are hard. Whenever possible, keep transactions within a single shard.

While designing: shape the data model so that every write transaction fits in a single shard.

10.4 Operational Complexity

A sharded system means:

Complex backups (16 shards simultaneously)
Complex monitoring
Harder debugging (which shard?)
Harder migrations

→ Automation is essential. Have enough operations staff.

10.5 Is NewSQL the Answer?

NewSQL systems like CockroachDB and YugabyteDB offer automatic sharding + distributed transactions. None of the complexity of manual sharding.

When to go NewSQL:

New projects
Strong consistency required
Limited operations staff

When to go manual sharding:

Existing MySQL/PostgreSQL
Extreme performance needed
Concerns about NewSQL maturity

Quiz

1. Why can sharding not be replaced by read replicas?

Answer: read replicas only distribute reads. Writes still go through a single Primary. A system where writes are the bottleneck (e.g. 1 billion INSERTs per day) cannot be fixed with read replicas. Only sharding can spread writes across multiple machines. That said, sharding brings enormous complexity, so first check whether read replicas + caching + index optimization can solve the problem.

2. What are the pros and cons of hash sharding?

Answer: Pros: even distribution (with a good hash). Safe even with monotonically increasing keys. Cons: no range queries (WHERE id BETWEEN ...). Adding a shard redistributes nearly all data (because % num_shards changes). Solution: Consistent Hashing — only a fraction (1/N) of the data moves on add/remove. Used by Cassandra, DynamoDB, and Riak. Virtual nodes give more even distribution.

3. How do you solve the hot spot problem?

Answer: (1) Change the shard key — best, but requires resharding (hard), (2) Composite key — hash(user_id + timestamp_bucket) to spread across time/users, (3) Sub-sharding — split hot shards finer ("whale shard"), (4) Caching — move hot data to Redis, (5) Read replicas — distribute read load. The most common hot spots: popular users, popular products, monotonically increasing keys. Always review hot spot scenarios when choosing a shard key.

4. Why did Vitess become the standard for MySQL distribution?

Answer: (1) Born at YouTube — proven at massive scale, (2) MySQL compatible — existing code/tools keep working, (3) Automatic sharding — key-based routing, (4) VReplication — zero-downtime migrations and resharding, (5) CNCF graduated — cloud-native standard, (6) Proven users — Slack, GitHub, Square. Downside: distributed transactions are relatively weak (the Saga pattern is recommended). PostgreSQL users use Citus for a similar role.

5. When should you use NewSQL (CockroachDB) vs. manual sharding?

Answer: Pick NewSQL (CockroachDB, YugabyteDB) when: new project, strong consistency required (distributed ACID), limited operations staff, PostgreSQL compatibility is acceptable. Pick manual sharding (Vitess + MySQL) when: existing MySQL system, extreme performance needed, worried about NewSQL maturity, want more control. Reality: new systems increasingly go NewSQL; existing huge systems stick with Vitess. Both are valid — decide based on trade-offs.

References

Designing Data-Intensive Applications — Martin Kleppmann (sharding chapter)
Vitess — official YouTube project
Citus Data
CockroachDB Docs
YugabyteDB
TiDB
Slack's Data Stores — Vitess migration
GitHub MySQL → Vitess
Stripe's Sharding Approach
Discord's Trillion Messages — uses Cassandra
Pinterest Sharding