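System Design Interview 2025: Top 30 FAANG Questions and the New AI/ML Design Round

1. Why System Design Matters
2. The 45-Minute System Design Framework
3. Ten Essential Concepts You Must Know
4. Top 15 Most Frequently Asked Problems in 2025
- Problem-Specific Tips
5. New in 2025: AI/ML System Design
6. Back-of-the-Envelope Estimation Cheat Sheet
7. Top 5 Study Resources
8. Five Things Interviewers Actually Evaluate
Practice Quiz
References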

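1. Why System Design Matters

System design interviews are the key gatekeeping round in senior engineering hiring. The main 2025 figures make their importance clear.

The Current Landscape

70%+ of senior candidates face at least one system design round
60% of questions relate to cloud-native architecture (AWS, GCP, Azure)
80% of problems focus on 20% of core concepts (Pareto principle)
At FAANG (Meta, Apple, Amazon, Netflix, Google), system design occupies 2-3 rounds

Expectations by Level

Level	Expected Depth	Key Evaluation Points
L3-L4 (Junior)	Basic component understanding	API design, single-service design
L5 (Senior)	Full distributed system design	Trade-off analysis, scalability
L6+ (Staff)	Large-scale system architecture	Cross-org impact, technical strategy

New Question Types in 2025

In 2025, new categories have appeared alongside the classic system design problems.

AI/ML System Design: recommendation systems, LLM serving, RAG pipelines
Real-time Systems: 18 of the 30 problems require real-time processing
Cost Optimization: designing with cloud cost in mind is now a required evaluation criterion
Security Design: Zero Trust and data privacy are baseline requirements

Classic questions like "Design Twitter" used to dominate. Interviewers now increasingly ask compound questions such as "Design a real-time recommendation system that keeps model serving latency under 50ms while handling 1 billion inference requests per day."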

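2. The 45-Minute System Design Framework

The most common mistake is jumping straight into the design. Within a 45-minute limit, a systematic framework is essential.

2-1. Clarify Requirements (5 minutes)

Ask the interviewer questions to narrow the scope. Skipping this step is one of the most common reasons candidates fail.

Functional Requirements

Example questions:
- "What are the 3 core actions a user can perform?"
- "What is the read-to-write ratio?"
- "How long must we retain data?"

Non-Functional Requirements

Things to confirm:
- Availability: 99.9% vs 99.99% (annual downtime of about 8.76 hours vs 52.6 minutes)
- Latency: specific targets such as p99 < 200ms
- Consistency: strong consistency vs eventual consistency
- Scale: DAU, concurrent users, data volume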
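
The downtime figures above come from multiplying the allowed unavailability by the length of a year. The sketch below assumes a 365-day year and uses a hypothetical helper name, `annual_downtime_minutes`:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def annual_downtime_minutes(availability: float) -> float:
    """Maximum minutes of downtime per year allowed by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


for target in (0.999, 0.9999):
    minutes = annual_downtime_minutes(target)
    print(f"{target:.2%}: {minutes / 60:.2f} h ({minutes:.1f} min) per year")
```

For 99.9% this gives about 525.6 minutes (8.76 hours) per year. For 99.99% it gives about 52.6 minutes, so each extra nine cuts the downtime budget by 10x.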

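2-2. Back-of-the-Envelope Estimation (5 minutes)

This step puts numbers on the scale. Interviewers care more about the estimation process and reasoning than about exact figures.

Example: Estimating scale for a chat system

- DAU: 50 million
- Average messages per day: 40 per user
- Total messages: 50M x 40 = 2 billion/day
- Average QPS: 2 billion / 86,400 = ~23,000 QPS
- Peak QPS: 3x average = ~70,000 QPS
- Message size: 100 bytes average
- Daily storage: 2B x 100B = 200GB/day
- Annual storage: 200GB x 365 = 73TB/year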
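
Here is the same estimate as a few lines of Python, which makes it easy to rerun with different assumptions. The inputs are the example's own numbers. Decimal units (1 GB = 10^9 bytes) are an assumption:

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the chat-system example above
dau = 50_000_000        # daily active users
msgs_per_user = 40      # messages sent per user per day
msg_size_bytes = 100    # average message size
peak_factor = 3         # peak traffic as a multiple of the average

messages_per_day = dau * msgs_per_user                       # 2 billion
avg_qps = messages_per_day / SECONDS_PER_DAY                 # ~23,148
peak_qps = avg_qps * peak_factor                             # ~69,444
daily_storage_gb = messages_per_day * msg_size_bytes / 1e9   # 200 GB (decimal units)
yearly_storage_tb = daily_storage_gb * 365 / 1e3             # 73 TB

print(f"avg QPS ~{avg_qps:,.0f}, peak QPS ~{peak_qps:,.0f}")
print(f"storage ~{daily_storage_gb:.0f} GB/day, ~{yearly_storage_tb:.0f} TB/year")
```

Rounding 23,148 to ~23,000 and 69,444 to ~70,000 is the kind of simplification interviewers expect.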

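2-3. High-Level Design (15 minutes)

Express the core components and data flow as a diagram.

Typical high-level architecture:

Client --> Load Balancer --> API Gateway
                               |
                    +----------+----------+
                    |          |          |
                Service A  Service B  Service C
                    |          |          |
                 DB (SQL)  Cache     Message Queue
                              |          |
                           DB (NoSQL)  Worker

Must-includes at this stage:

Client-server protocol (HTTP, WebSocket, gRPC)
Data store selection with rationale
Cache layer placement
Parts that need asynchronous processing

2-4. Deep Dive (15 minutes)

Design in detail the components the interviewer shows interest in.

Example deep-dive areas:

1. Database Schema Design
   - Table structure, indexes, partitioning strategy

2. API Design
   - Endpoints, request/response format, error handling

3. Core Algorithms
   - News Feed: Fan-out on write vs Fan-out on read
   - Search: Inverted index structure
   - Recommendations: Collaborative filtering vs content-based filtering

2-5. Scalability, Fault Tolerance, Monitoring (5 minutes)

The final step demonstrates how robust your system is.

Checklist:
- Identify and eliminate Single Points of Failure (SPOF)
- Data replication strategy (Leader-Follower, Multi-Leader)
- Failure detection and automatic recovery (Health Checks, Circuit Breaker)
- Monitoring metrics (Latency, Error Rate, Throughput)
- Alerting (PagerDuty, Slack integration)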

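3. Ten Essential Concepts You Must Know

These core concepts cover 80% of system design questions.

3-1. Horizontal vs Vertical Scaling

Vertical Scaling (Scale Up): increasing the power of a single server

Pros: Simple to implement, easy to keep data consistent
Cons: Hard hardware limits, single point of failure

Horizontal Scaling (Scale Out): adding more servers

Pros: Theoretically unlimited scaling, fault tolerance
Cons: Distributed system complexity, data consistency challenges

Vertical scaling example:
  1 server (4 CPU, 16GB RAM) → 1 server (32 CPU, 128GB RAM)

Horizontal scaling example:
  1 server → 10 servers (each 4 CPU, 16GB RAM) + Load Balancer

In practice, most designs default to horizontal scaling. Some components, such as databases, often scale vertically first.

3-2. Load Balancers (L4 vs L7)

Load balancers distribute traffic across multiple servers.

Aspect	L4 (Transport Layer)	L7 (Application Layer)
Operates on	IP, Port	URL, Headers, Cookies
Speed	Faster	Relatively slower
Flexibility	Low	High (path-based routing)
TLS/SSL termination	Usually no (TLS passthrough); some L4 LBs offer it	Yes
Use case	Internal service communication	API Gateway, Web servers

L7 load balancer routing example:

/api/users/*  --> User Service Cluster
/api/orders/* --> Order Service Cluster
/api/search/* --> Search Service Cluster
/static/*     --> CDN Origin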
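
An L7 balancer picks a backend by inspecting the request itself. The sketch below shows path-prefix routing for the table above. It assumes longest-prefix-wins semantics and a hypothetical "Default Backend" fallback. Real proxies have their own configuration and matching rules:

```python
# Hypothetical routing table mirroring the example above (prefix -> upstream)
ROUTES = {
    "/api/users/": "User Service Cluster",
    "/api/orders/": "Order Service Cluster",
    "/api/search/": "Search Service Cluster",
    "/static/": "CDN Origin",
}


def route(path: str, default: str = "Default Backend") -> str:
    """Return the upstream for the longest matching path prefix."""
    best = None
    for prefix in ROUTES:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return ROUTES[best] if best is not None else default


print(route("/api/orders/42"))
print(route("/static/app.js"))
print(route("/healthz"))
```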

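3-3. Caching Strategies

Caching is central to system performance. Here are the three main patterns.

Cache-Aside (Lazy Loading)

Read request flow:
1. Look up the data in the cache
2. Cache hit → return the data
3. Cache miss → query the DB → store in cache → return

Pros: Only caches data that is actually requested; falls back to the DB if the cache fails
Cons: Every cache miss (including the first request for a key) is slow; cache and DB can become inconsistent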
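
Here is a minimal cache-aside sketch. A plain dict stands in for the database and an in-process dict for the cache; a real deployment would use something like Redis. The TTL and the invalidate-on-write policy in `update` are assumptions added for illustration, not part of the pattern description above:

```python
import time


class CacheAside:
    """Cache-aside (lazy loading) over a dict-backed 'database' with a TTL."""

    def __init__(self, db: dict, ttl_seconds: float = 60.0):
        self.db = db          # stand-in for the real database
        self.cache = {}       # key -> (value, expires_at)
        self.ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > time.monotonic():
            self.hits += 1            # cache hit: return immediately
            return entry[0]
        self.misses += 1              # miss (or expired): read from the DB
        value = self.db.get(key)
        if value is not None:         # populate the cache for the next reader
            self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def update(self, key, value):
        # Common cache-aside write path: write the DB, then invalidate the key
        self.db[key] = value
        self.cache.pop(key, None)


store = CacheAside({"user:1": "Alice"})
store.get("user:1")              # miss -> loads from DB
store.get("user:1")              # hit
store.update("user:1", "Alicia")
print(store.get("user:1"), store.hits, store.misses)
```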

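Write-Through

Write request flow:
1. Write data to the cache
2. The cache synchronously writes to the DB
3. Return completion response

Pros: Cache and DB always consistent
Cons: Higher write latency; caches data that may never be read

Write-Behind (Write-Back)

Write request flow:
1. Write data to the cache
2. Return completion immediately
3. Asynchronously batch-write to the DB

Pros: Maximum write performance
Cons: Risk of data loss if the cache fails before flushing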
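
The two write paths differ in when the DB write happens. In this sketch, dicts stand in for both the cache and the DB. `flush` runs synchronously for simplicity, and the batch size of 3 is arbitrary:

```python
from collections import OrderedDict


class WriteThroughCache:
    """Every write goes to the cache and synchronously to the DB."""

    def __init__(self, db: dict):
        self.db = db
        self.cache = {}

    def put(self, key, value):
        self.cache[key] = value
        self.db[key] = value          # synchronous: caller waits for the DB write


class WriteBehindCache:
    """Writes are acknowledged after hitting the cache; DB writes are batched later."""

    def __init__(self, db: dict, batch_size: int = 3):
        self.db = db
        self.cache = {}
        self.dirty = OrderedDict()    # keys not yet persisted, in write order
        self.batch_size = batch_size

    def put(self, key, value):
        self.cache[key] = value
        self.dirty[key] = value       # repeated writes to one key coalesce
        if len(self.dirty) >= self.batch_size:
            self.flush()

    def flush(self):
        # A real system runs this asynchronously; anything still in
        # self.dirty is lost if the cache node dies before the flush.
        self.db.update(self.dirty)
        self.dirty.clear()


wt_db, wb_db = {}, {}
wt, wb = WriteThroughCache(wt_db), WriteBehindCache(wb_db)
for i in range(2):
    wt.put(f"k{i}", i)
    wb.put(f"k{i}", i)
print(wt_db, wb_db)   # write-through DB is current; write-behind DB is still empty
```

The write-through DB is current after every write. The write-behind DB only catches up when a batch fills (or, in a real system, when a timer fires). That gap is the data-loss window listed above.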

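3-4. Databases (SQL vs NoSQL, Sharding, Replication)

SQL vs NoSQL Selection Criteria

Criteria	SQL (Relational)	NoSQL (Non-Relational)
Data structure	Structured, complex relations	Unstructured, flexible schema
Consistency	ACID guaranteed	Eventual consistency (mostly)
Scalability	Vertical scaling first	Horizontal scaling friendly
Use case	Payments, inventory	Logs, sessions, social feeds
Products	PostgreSQL, MySQL	MongoDB, Cassandra, DynamoDB

Sharding Strategies

1. Range-based Sharding
   - user_id 1 to 1,000,000 → Shard 1
   - user_id 1,000,001 to 2,000,000 → Shard 2
   - Pros: Efficient range queries
   - Cons: Hotspot risk

2. Hash-based Sharding
   - hash(user_id) % N → Shard number
   - Pros: Even distribution
   - Cons: Inefficient range queries, complex rebalancing (changing N remaps most keys)

3. Directory-based Sharding
   - Managed via a lookup table
   - Pros: Flexible mapping
   - Cons: Lookup table becomes a single point of failure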
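
This sketch covers the first two routing schemes, plus a quick check of why modulo hashing makes rebalancing painful. Shard indices are 0-based here, so the text's "Shard 1" is index 0. The range bounds are illustrative, and MD5 serves only as a stable, well-spread hash:

```python
import bisect
import hashlib

# Range-based: inclusive upper bound of each shard's user_id range (illustrative)
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]


def range_shard(user_id: int) -> int:
    """0-based shard index whose range contains user_id."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if idx == len(RANGE_UPPER_BOUNDS):
        raise ValueError("user_id outside configured ranges")
    return idx


def stable_hash(key: str) -> int:
    # Python's built-in hash() of str is randomized per process, so use a digest
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def hash_shard(user_id: int, num_shards: int) -> int:
    return stable_hash(str(user_id)) % num_shards


# Rebalancing cost of modulo hashing: grow from 4 to 5 shards
ids = range(100_000)
moved = sum(hash_shard(i, 4) != hash_shard(i, 5) for i in ids)
print(f"range_shard(1_500_000) = {range_shard(1_500_000)}")
print(f"{moved / len(ids):.0%} of keys move when going from 4 to 5 shards")
```

With evenly spread hashes, a key stays put only when `h % 4 == h % 5`. That holds for 4 of every 20 residues, so roughly 80% of keys move. Consistent hashing (3-8) exists to avoid this.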

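3-5. Message Queues (Kafka, RabbitMQ)

The core component of asynchronous communication.

Feature	Apache Kafka	RabbitMQ
Model	Log-based streaming	Message broker
Throughput	Millions per second	Tens of thousands per second
Message retention	Retained for configured period	Deleted after consumption
Ordering guarantee	Within partition	Within queue
Use case	Event streaming, logs	Task queues, RPC

Kafka architecture example:

Producer --> Topic (Partition 0) --> Consumer Group A
                                 --> Consumer Group B
         --> Topic (Partition 1) --> Consumer Group A
                                 --> Consumer Group B
         --> Topic (Partition 2) --> Consumer Group A
                                 --> Consumer Group B

- Partition: unit of parallelism
- Consumer Group: independent consumption
- Offset: tracks consumption position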
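
The sketch below is a pure-Python model of the three ideas above, not Kafka client code. It makes several simplifying assumptions:
- CRC32 stands in for Kafka's key hash.
- Each partition is a list whose indices play the role of offsets.
- Consumers get a simple round-robin assignment; Kafka ships several assignor strategies.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def partition_for(key: str) -> int:
    """Map a message key to a partition; equal keys always land in the same partition."""
    # Kafka's default partitioner hashes keys with murmur2; crc32 is only a stand-in.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


def assign(num_partitions: int, consumers: list) -> dict:
    """Round-robin partitions over the members of one consumer group.
    Within a group, each partition is owned by exactly one consumer."""
    owned = defaultdict(list)
    for p in range(num_partitions):
        owned[consumers[p % len(consumers)]].append(p)
    return dict(owned)


# Each partition is an append-only log; an event's offset is its list index.
log = defaultdict(list)
for user, event in [("u1", "login"), ("u2", "click"), ("u1", "buy"), ("u1", "logout")]:
    log[partition_for(user)].append((user, event))

# Per-key ordering holds because all of u1's events share one partition.
u1_events = [e for u, e in log[partition_for("u1")] if u == "u1"]
print(u1_events)

# Two consumer groups read independently, each tracking its own offsets.
print(assign(NUM_PARTITIONS, ["A-1", "A-2"]))  # group A: two consumers share 3 partitions
print(assign(NUM_PARTITIONS, ["B-1"]))         # group B: one consumer reads all 3
```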

3-6. CDN and Edge Computing

A CDN (Content Delivery Network) caches content on edge servers worldwide.

How a CDN works:

User (Korea) --> Korea edge server (cache hit) --> immediate response
                                   (cache miss) --> origin server (US) --> cache --> response

Pull CDN: Fetches from the origin on the first request and caches it (CloudFront, Cloudflare)
Push CDN: Distributes content ahead of time (suited for large static files)

Edge computing goes beyond CDN by running computation at the edge.

Cloudflare Workers, AWS Lambda@Edge, Vercel Edge Functions
A/B testing, authentication, and personalization performed at the edge
Reduces origin server load and minimizes latency

3-7. API Design (REST vs gRPC vs GraphQL)

Feature	REST	gRPC	GraphQL
Protocol	HTTP (typically HTTP/1.1)	HTTP/2	HTTP
Data format	JSON	Protocol Buffers	JSON
Type safety	Low	High (schema required)	Medium (schema required)
Streaming	Limited	Bidirectional support	Subscription support
Use case	Public APIs	Internal microservices	Client-driven queries

REST example:
GET /api/v1/users/123
GET /api/v1/users/123/orders?page=1&limit=10

gRPC example:
service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListOrders(ListOrdersRequest) returns (stream Order);
}

GraphQL example:
query {
  user(id: "123") {
    name
    orders(first: 10) {
      id
      total
    }
  }
}

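3-8. Consistent Hashing

An algorithm that redistributes only a minimal amount of data when nodes are added to or removed from a distributed system.

The problem with basic hashing:
  hash(key) % N = server number
  When N changes, almost all keys get remapped

Consistent Hashing:
  - Hash space arranged in a ring (0 to 2^32-1)
  - Both servers and keys are placed on the ring
  - Each key is assigned to the nearest server clockwise
  - Adding/removing a server only affects the adjacent segment

Virtual Nodes:
  - Each physical server maps to multiple virtual nodes
  - Distributes load more evenly
  - Number of virtual nodes can be adjusted to server capacity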
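
Here is a minimal consistent-hash ring with virtual nodes. The class and method names are hypothetical, and the sketch assumes:
- MD5-derived 32-bit ring positions (any stable, well-spread hash works)
- 100 virtual nodes per server
- a `weight` multiplier for higher-capacity servers

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # 32-bit position on the ring (0 .. 2^32 - 1), stable across processes
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")


class ConsistentHashRing:
    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._positions = []   # sorted ring positions
        self._owner = {}       # position -> physical node

    def add_node(self, node: str, weight: int = 1):
        # More virtual nodes for higher-capacity servers
        for i in range(self.vnodes * weight):
            pos = _hash(f"{node}#{i}")
            if pos in self._owner:
                continue       # skip rare position collisions
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node: str):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        # First virtual node clockwise from the key, wrapping around the ring
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._positions)
        return self._owner[self._positions[idx]]


ring = ConsistentHashRing()
for name in ("cache-a", "cache-b", "cache-c", "cache-d"):
    ring.add_node(name)
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.get_node(k) for k in keys}
ring.add_node("cache-e")
moved = sum(before[k] != ring.get_node(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved after adding a 5th node")
```

Only about a fifth of the keys move when a fifth node joins, and all of them move to the new node. With `hash(key) % N`, most keys would have moved.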

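3-9. Rate Limiting (Token Bucket, Sliding Window)

A technique that protects the system from excessive traffic.

Token Bucket Algorithm

How it works:
1. Tokens are added to a fixed-size bucket at a constant rate
2. Each request consumes 1 token
3. If no tokens remain, the request is rejected (HTTP 429)

Parameters:
- Bucket size: 100 (maximum burst)
- Refill rate: 10/second (average throughput)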
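
Here is a minimal single-process token bucket using the parameters above (capacity 100, refill 10 tokens/second). Tokens are refilled lazily on each call instead of by a background timer, which is a common implementation choice. The injectable `clock` argument exists only to make the demo deterministic:

```python
import time


class TokenBucket:
    """Token bucket: capacity bounds bursts, refill_rate bounds the average rate."""

    def __init__(self, capacity: float = 100, refill_rate: float = 10, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.clock = clock
        self.tokens = capacity           # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1) -> bool:
        now = self.clock()
        # Lazily add the tokens accrued since the last call, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                     # the caller would answer HTTP 429


# Deterministic demo with a fake clock
fake_now = [0.0]
bucket = TokenBucket(capacity=100, refill_rate=10, clock=lambda: fake_now[0])
burst = sum(bucket.allow() for _ in range(150))       # 100 allowed, 50 rejected
fake_now[0] = 1.0                                      # one second later: 10 new tokens
after_refill = sum(bucket.allow() for _ in range(20))  # 10 allowed
print(burst, after_refill)
```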

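Sliding Window Log Algorithm

How it works:
1. Record each request timestamp in a sorted log
2. Remove entries older than the current window (e.g., 1 minute)
3. Reject if the number of requests within the window exceeds the limit

Pros: Precise handling of window boundaries
Cons: High memory usage (stores every timestamp)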
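
Here is a minimal sliding window log. It assumes timestamps arrive in non-decreasing order, so a deque stays sorted. This variant records only accepted requests. Some descriptions also log rejected ones, which keeps clients that hammer the endpoint blocked longer:

```python
from collections import deque


class SlidingWindowLog:
    """Allow at most `limit` requests in any trailing `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()   # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps outside the window (now - window, now]
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False


limiter = SlidingWindowLog(limit=3, window=60.0)
results = [limiter.allow(t) for t in (0, 10, 20, 30, 61)]
print(results)
```

The request at t=30 is rejected because three requests already fall in the trailing 60 seconds. By t=61 the t=0 entry has expired, so that request is accepted.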

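Rate Limiting in Distributed Environments

Approach 1: Centralized (Redis-based)
  - All servers share a counter in Redis
  - Accurate, but Redis becomes a bottleneck/SPOF

Approach 2: Local + Synchronization
  - Each server counts locally
  - Periodically syncs with a central store
  - Fast, but allows slight over-limit

3-10. CAP Theorem and Real-World Trade-offs

CAP Theorem: A distributed system can guarantee at most 2 of these 3 properties at the same time.

C (Consistency): All nodes see the same data
A (Availability): Every request gets a response
P (Partition Tolerance): The system keeps working despite network partitions

Because network partitions inevitably happen in practice, the real choice is between CP and AP.

System	Choice	Reason
Payment system	CP	Financial accuracy is paramount
Social media feed	AP	A brief inconsistency is less damaging than downtime
Inventory management	CP	Prevent overselling
DNS	AP	Availability is critical
Chat messages	AP	Eventual consistency is sufficient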

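4. Top 15 Most Frequently Asked Problems in 2025

The 15 problems that come up most often in real interviews.

Rank	Problem	Difficulty	Companies	Key Concepts
1	URL Shortener	Easy	All	Hashing, DB selection, read optimization
2	Rate Limiter	Easy	All	Token Bucket, distributed counting
3	News Feed System	Medium	Meta, Twitter	Fan-out, cache layers, ranking
4	Chat System	Medium	WhatsApp, Slack	WebSocket, message queues, state management
5	Video Streaming Platform	Medium	Netflix, YouTube	CDN, adaptive bitrate, transcoding
6	Search Autocomplete	Medium	Google, Amazon	Trie, Elasticsearch, prefix matching
7	Location-Based Service	Medium	Uber, DoorDash	Geohash, QuadTree, proximity search
8	Notification System	Medium	All	Push/Pull, message queues, priority
9	Distributed Cache	Hard	Amazon, Google	Consistent Hashing, replication, failure detection
10	Distributed Message Queue	Hard	Kafka team, LinkedIn	Partitioning, ISR, consumer groups
11	Payment System	Hard	Stripe, Toss, PayPal	Idempotency, Saga pattern, double-entry
12	Real-time Gaming Backend	Hard	Riot Games, Epic	State synchronization, UDP, lag compensation
13	Recommendation System	Hard	Netflix, Spotify	ML pipeline, Feature Store
14	Distributed File System	Hard	Google, Microsoft	GFS, HDFS, chunk servers
15	Ad Click Aggregation	Hard	Google, Meta	Stream processing, exactly-once semantics

Problem-Specific Tips

URL Shortener - The most basic problem, but one where you can demonstrate depth.

Key design points:
1. Short URL generation: Base62 encoding of a unique ID vs hash truncation (MD5/SHA-256)
2. Collision handling: DB unique constraint + retry (hash approach) vs counter-based IDs (collision-free)
3. Redirection: 301 (permanent, cached by browsers, less server load) vs 302 (temporary, every click reaches the service, better for analytics)
4. Caching: keep frequently accessed URLs in Redis (80/20 rule)
5. Analytics: track clicks, region, device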
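
Here is counter-based Base62 encoding, sketched in Python. It assumes a unique, increasing integer ID is already available, for example from a DB sequence or an ID service. It uses a digits-lowercase-uppercase alphabet order, which is a convention, not a standard:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars


def base62_encode(n: int) -> str:
    """Encode a non-negative integer ID (e.g. from a counter) as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def base62_decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n


print(base62_encode(125))              # 125 = 2*62 + 1 -> "21"
print(len(base62_encode(62**7 - 1)))   # largest 7-character code
```

Because each ID is unique, the counter approach never collides. The hash approach needs the unique-constraint-plus-retry loop instead. Seven characters cover 62^7 ≈ 3.5 trillion IDs.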

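News Feed System - The fan-out strategy is the core decision.

Fan-out on Write (Push model):
- When a post is created, immediately write it to every follower's feed
- Pros: Fast reads
- Cons: High write cost for users with many followers (celebrities)

Fan-out on Read (Pull model):
- When a feed is viewed, aggregate posts from followed users in real time
- Pros: Low write cost
- Cons: Slow reads

Hybrid (the practical answer):
- Regular users: Push model
- Celebrities (1M+ followers): Pull model
- Merge both to build the final feed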
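
Here is the hybrid approach as an in-memory sketch. The `HybridFeed` class is hypothetical, and its collections stand in for a feed cache and a post store. A post counts as pushed or pulled based on the author's follower count at publish and read time. The set union in `feed` is a simple guard against duplicates when an author crosses the threshold:

```python
from collections import defaultdict


class HybridFeed:
    """Push for regular authors, pull for celebrities, merged at read time."""

    def __init__(self, celebrity_threshold: int = 1_000_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)   # author -> users following them
        self.following = defaultdict(set)   # user -> authors they follow
        self.inbox = defaultdict(list)      # user -> precomputed feed (push side)
        self.posts = defaultdict(list)      # author -> own posts (pull side)
        self.clock = 0                      # increasing post sequence number

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= self.celebrity_threshold

    def publish(self, author: str, text: str) -> None:
        self.clock += 1
        post = (self.clock, author, text)
        self.posts[author].append(post)
        if not self.is_celebrity(author):
            for user in self.followers[author]:   # fan-out on write
                self.inbox[user].append(post)

    def feed(self, user: str, limit: int = 10) -> list:
        # Fan-out on read: pull celebrity posts now, then merge with the inbox
        pulled = [p for a in self.following[user] if self.is_celebrity(a)
                  for p in self.posts[a]]
        merged = sorted(set(self.inbox[user]) | set(pulled), reverse=True)
        return [f"{author}: {text}" for _, author, text in merged[:limit]]


feeds = HybridFeed(celebrity_threshold=2)   # tiny threshold just for the demo
feeds.follow("alice", "bob")                # bob has 1 follower -> push
feeds.follow("alice", "star")
feeds.follow("carol", "star")               # star has 2 followers -> pull
feeds.publish("bob", "hello")
feeds.publish("star", "new album")
print(feeds.feed("alice"))
```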

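Chat System - Real-time communication is the core challenge.

Architecture:
1. Connection management: WebSocket servers + connection state store
2. Message delivery: direct for 1:1, fan-out for groups
3. Offline handling: store in a message queue, deliver on reconnection
4. Read receipts: separate service
5. Message storage: choose a DB optimized for chronological ordering
   - HBase/Cassandra (wide-column stores)

5. New in 2025: AI/ML System Design

The biggest shift in 2025 system design interviews is the surge of AI/ML-related problems. Google, Meta, and Amazon, as well as startups, now evaluate AI system design skills.

5-1. Recommendation System Architecture

The recommendation system is the most classic AI system design problem.

End-to-end recommendation pipeline:

Data collection --> Feature Store --> Model training --> Model registry
      |                                     |                   |
User behavior logs                    Offline batch         A/B testing
Clicks, purchases, views          Daily/weekly refresh   Champion/Challenger
      |                                     |                   |
      v                                     v                   v
Real-time features --> Candidate generation --> Ranking --> Re-ranking --> Display
                         (Retrieval)            (Scoring)   (Re-ranking)

Key components:
1. Feature Store: manages offline (batch) and online (real-time) features
2. Candidate Generation: selects about 1,000 candidates with ANN (approximate nearest neighbor) search
3. Ranking Model: scores the candidates with a more precise model (deep learning)
4. Re-ranking: applies business rules, diversity, and freshness

5-2. LLM Serving System

Large language model (LLM) serving is one of the hottest topics of 2025.

LLM serving architecture:

Request --> API Gateway --> Request Router --> GPU Cluster
                                 |                  |
                           Token limiter    Model instances (vLLM)
                           Queue manager            |
                                 |           KV Cache management
                        Response streaming <--------|
                                 |
                           Cost tracker

Key optimization techniques:
1. KV Cache: caches the Key/Value tensors of previous tokens to avoid recomputing them
2. Continuous Batching: adds and removes requests from the running batch dynamically, at each decoding step
3. PagedAttention: manages KV cache memory in fixed-size pages (vLLM)
4. Quantization: shrinks the model by going from FP16 to INT8/INT4
5. Speculative Decoding: a small model drafts tokens and the large model verifies them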
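
To see why the KV cache helps, here is a toy single-head, single-layer decoder in pure Python. Random projection matrices stand in for trained weights, and random vectors stand in for token embeddings; both are hypothetical. Because attention is causal, earlier tokens' keys and values never change. Caching them therefore gives identical outputs with far fewer projections:

```python
import math
import random

D = 4  # toy model dimension
random.seed(0)
# Hypothetical fixed projection matrices (D x D) standing in for trained weights
WQ, WK, WV = ([[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)] for _ in range(3))


def matvec(m, x):
    return [sum(r * xi for r, xi in zip(row, x)) for row in m]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]   # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(D)]


def decode_without_cache(xs):
    """Recompute K and V for the whole prefix at every step."""
    outs, projections = [], 0
    for t in range(1, len(xs) + 1):
        keys = [matvec(WK, x) for x in xs[:t]]
        values = [matvec(WV, x) for x in xs[:t]]
        projections += 2 * t
        outs.append(attend(matvec(WQ, xs[t - 1]), keys, values))
    return outs, projections


def decode_with_cache(xs):
    """Project only the newest token and append its K and V to the cache."""
    k_cache, v_cache, outs, projections = [], [], [], 0
    for x in xs:
        k_cache.append(matvec(WK, x))
        v_cache.append(matvec(WV, x))
        projections += 2
        outs.append(attend(matvec(WQ, x), k_cache, v_cache))
    return outs, projections


tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(8)]  # toy embeddings
slow, n_slow = decode_without_cache(tokens)
fast, n_fast = decode_with_cache(tokens)
print(n_slow, n_fast)
```

Both loops produce the same outputs. Without the cache, K/V work grows quadratically with sequence length: 72 projections for 8 tokens versus 16. The cost moves into GPU memory, which is what PagedAttention manages.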

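Autoscaling Strategy

GPU autoscaling considerations:
- GPU instance startup time: 3-10 minutes (far slower than CPU instances)
- Predictive scaling: learn hourly traffic patterns in advance
- Queue-depth based: trigger scaling on the number of pending requests
- Cost optimization: use spot instances with on-demand fallback

5-3. RAG (Retrieval-Augmented Generation) Pipeline

RAG is a core pattern for reducing LLM hallucinations and grounding answers in up-to-date information.

RAG pipeline architecture:

Document ingestion --> Chunking --> Embedding generation --> Vector DB storage
                                                                   |
User question --> Query embedding --> Similarity search --> Retrieve top-K documents
                                                                   |
                             Prompt construction <-----------------+
                                      |
                             LLM generation --> Answer + source citations

Key design decisions:
1. Chunking strategy: fixed-size vs semantic (paragraph/section)
2. Embedding model: OpenAI text-embedding-ada-002, Cohere Embed, open-source models
3. Vector DB: Pinecone (managed), Milvus (self-hosted), pgvector
4. Retrieval: pure vector search vs hybrid (vector + keyword)
5. Re-ranking: precise re-ordering of retrieved results with a cross-encoder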
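
Here is a toy end-to-end retrieval step, sketched under heavy simplifications:
- A bag-of-words term-count vector stands in for a real embedding model.
- An in-memory list stands in for the vector DB.
- Hybrid search and cross-encoder re-ranking are omitted.

All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(doc: str, max_words: int = 20) -> list:
    """Fixed-size chunking by word count (the simpler of the two strategies above)."""
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def retrieve(question: str, index: list, k: int = 2) -> list:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, contexts: list) -> str:
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Answer using only the sources below and cite them.\n{sources}\nQuestion: {question}"


docs = [
    "Kafka retains messages for a configured period and orders them within a partition.",
    "RabbitMQ deletes messages after they are acknowledged by a consumer.",
    "A CDN caches static content on edge servers close to users.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]   # in-memory "vector DB"
question = "How long does Kafka keep messages?"
top = retrieve(question, index, k=1)
print(build_prompt(question, top))
```

Swapping `embed` for a real embedding model and `index` for a vector DB query leaves the shape of the pipeline unchanged.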

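5-4. Real-time ML Pipeline

Real-time ML pipeline:

Event source --> Kafka --> Stream Processor (Flink) --> Feature computation
                                                               |
                                                         Feature Store
                                                               |
API request --> Model Server --> Inference result --> Business logic
                     |
             Model Registry (MLflow)

Use cases:
- Fraud detection: payment event → real-time features → fraud score
- Real-time recommendation: user behavior → real-time embedding update → refreshed recommendations
- Dynamic pricing: supply/demand signals → pricing model → price adjustment

5-5. Designing AI Guardrails

AI system safety is a must-know topic in 2025 interviews.

AI guardrail architecture:

User input --> Input filter --> LLM --> Output filter --> Response to user
                    |                          |
          Content classifier          Hallucination detector
          Prompt injection detection  Fact checker
          PII detection/masking       Toxicity filter
                    |                          |
             Block or warn             Revise or block

Design points:
1. Defense in depth: filters at the input, processing, and output stages
2. Asynchronous auditing: analyze full conversation logs asynchronously
3. Adaptive rules: rapidly update rules for new attack patterns
4. Human in the loop: route uncertain cases to human review
5. Latency budget: minimize the latency that guardrails add (target: under 50ms)

6. Back-of-the-Envelope Estimation Cheat Sheet

Key numbers you can use right away in an interview.

6-1. Latency Reference Table

Operation                              Time
-------------------------------------  ----------
L1 cache reference                     1 ns
L2 cache reference                     4 ns
Main memory (RAM) reference            100 ns
SSD random read                        100 us (microseconds)
HDD seek                               10 ms
Round trip within the same datacenter  0.5 ms
Intercontinental round trip            150 ms

Memory tips:
- L1 is about 100x faster than RAM
- SSD is about 1,000x slower than RAM
- HDD is about 100x slower than SSD
- The network is not always slower than local I/O. A same-datacenter round trip (0.5 ms) is slower than an SSD read but about 20x faster than an HDD seek, while an intercontinental round trip (150 ms) dwarfs all of them.

6-2. Capacity Formulas

QPS (Queries Per Second):
  QPS = DAU * average requests per user per day / 86,400
  Peak QPS = average QPS * 2-5 (depending on the service)

Storage:
  Daily data = new records per day * average record size
  Annual data = daily data * 365
  Total storage = annual data * retention period (years) * replication factor (usually 3)

Bandwidth:
  Inbound = write QPS * average request size
  Outbound = read QPS * average response size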
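
Here are the formulas above wrapped in one function. Two things are assumptions chosen to make the arithmetic easy to check: the `write_ratio` split between reads and writes, and every input value in the example call:

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    avg_qps: float
    peak_qps: float
    total_storage_bytes: float
    inbound_bytes_per_sec: float
    outbound_bytes_per_sec: float


def estimate(dau, requests_per_user, write_ratio, request_bytes, response_bytes,
             new_records_per_day, record_bytes, retention_years,
             replication_factor=3, peak_factor=3):
    """Back-of-the-envelope numbers from the QPS, storage, and bandwidth formulas."""
    avg_qps = dau * requests_per_user / SECONDS_PER_DAY
    write_qps = avg_qps * write_ratio
    read_qps = avg_qps - write_qps
    daily_bytes = new_records_per_day * record_bytes
    total_storage = daily_bytes * 365 * retention_years * replication_factor
    return Estimate(
        avg_qps=avg_qps,
        peak_qps=avg_qps * peak_factor,
        total_storage_bytes=total_storage,
        inbound_bytes_per_sec=write_qps * request_bytes,
        outbound_bytes_per_sec=read_qps * response_bytes,
    )


e = estimate(dau=8_640_000, requests_per_user=10, write_ratio=0.1,
             request_bytes=1_000, response_bytes=2_000,
             new_records_per_day=1_000_000, record_bytes=1_000, retention_years=5)
print(f"avg {e.avg_qps:,.0f} QPS, peak {e.peak_qps:,.0f} QPS")
print(f"storage {e.total_storage_bytes / 1e12:.2f} TB, "
      f"in {e.inbound_bytes_per_sec / 1e3:.0f} KB/s, out {e.outbound_bytes_per_sec / 1e6:.1f} MB/s")
```

Here 8.64M DAU x 10 requests gives exactly 1,000 QPS on average. Storing 1 GB/day for 5 years with 3 replicas needs about 5.5 TB.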

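6-3. Building a Sense of Scale

Approximate scale of major services. These are rough, order-of-magnitude public figures, and several user counts are monthly rather than daily actives:

Service        Users (approx.)    QPS (avg)    Storage
-------        ---------------    ---------    --------
Twitter        500M               300K         several TB/day
Instagram      2B                 1M           tens of TB/day
YouTube        2B+                500K+        500 hours uploaded/minute
WhatsApp       2B+                millions     100B messages/day
Google Search  8.5B queries/day   100K+        hundreds of PB of index

Powers of 2 (used often):
- 2^10 = 1,024 (about a thousand, 1 KB)
- 2^20 = 1,048,576 (about a million, 1 MB)
- 2^30 = 1,073,741,824 (about a billion, 1 GB)
- 2^40 = about a trillion (1 TB)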

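7. Top 5 Study Resources

The best resources for preparing for system design interviews.

7-1. DDIA (Designing Data-Intensive Applications)

By Martin Kleppmann. Often called the bible of system design.

Audience: Engineers who want a deep understanding of distributed systems principles
Strengths: Systematic coverage from theoretical foundations to real-world trade-offs
Key topics: Data models, storage engines, replication, partitioning, transactions, consistency, batch/stream processing
Study tip: Read it cover to cover, then review the core chapters before interviews (especially Chapters 5-9)

7-2. System Design Interview (Alex Xu) Vol.1 + Vol.2

The most directly useful practical guide for interviews.

Audience: Engineers preparing for system design interviews for the first time
Strengths: Step-by-step solutions per problem, plenty of diagrams
Vol.1: About a dozen problems, including URL shortener, news feed, chat, and search autocomplete
Vol.2: 13 problems, including proximity service, real-time gaming leaderboard, and payment system

7-3. ByteByteGo (YouTube + Newsletter)

A visual learning platform run by Alex Xu.

Audience: Engineers who prefer visual learning
Strengths: Excellent architecture diagrams
Content: YouTube videos, weekly newsletter, online course
Suggested use: Review concepts through videos during your commute

7-4. Grokking the System Design Interview

An interactive course on the Educative platform.

Audience: Engineers who want a structured curriculum
Strengths: Step-by-step learning path, quizzes included
Structure: Core concepts + 15 design problems + glossary

7-5. Codemia (120+ Practice Problems)

A system design practice platform launched in 2024.

Audience: Engineers who want to practice a wide variety of problems
Strengths: 120+ problems, categorized by difficulty
Features: Community solution comparison, built-in timer
Suggested use: Practice 3-4 problems per week with a 45-minute timer

12-Week Study Roadmap

Weekly plan:

Weeks 1-2: Fundamentals (core DDIA chapters)
  - Principles of scalability, availability, consistency
  - Basics of databases, caches, message queues

Weeks 3-4: Framework practice (Alex Xu Vol.1)
  - Internalize the 45-minute framework
  - Solve 5 easy problems

Weeks 5-8: Practice problems (Alex Xu Vol.2 + Codemia)
  - 10 medium problems
  - 3-4 problems per week, timed

Weeks 9-10: Advanced topics + AI/ML systems
  - Deeper distributed systems
  - LLM serving, recommendation systems

Weeks 11-12: Mock interviews
  - 3-5 mock interviews with peers
  - Focused study on weak areas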

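8. Five Things Interviewers Actually Evaluate

8-1. Structuring Ambiguity

Interviewers deliberately ask vague questions. If you are asked to "Design Google Drive" and immediately start drawing the design, you lose points. The key skill is asking questions to narrow the scope.

Good clarifying questions:
- "Should we focus on file upload or download?"
- "What is the user scale? Around 100 million users?"
- "Is real-time collaborative editing in scope?"
- "Do we need to support both mobile and web?"

8-2. Trade-off Analysis

The ability to answer "Why did you choose B over A?" clearly.

Trade-off analysis example:

Question: "Would you choose Cassandra or MySQL as the message store?"

Good answer structure:
1. Confirm requirements: "Chat messages are write-heavy,
   most reads are chronological, and availability matters more than strong consistency."

2. Compare options:
   - MySQL: ACID guarantees and joins, but horizontal scaling is hard
     and sharding adds significant complexity
   - Cassandra: easy horizontal scaling, write-optimized,
     well suited to time-ordered data, but no joins

3. Conclusion: "Given our requirements, write throughput and horizontal
   scalability are the priorities, so I'll choose Cassandra. Data with
   complex relationships, such as user profiles, goes in a separate MySQL database."

8-3. Handling Scale-Up Scenarios

How to answer "What would you do if users grew 10x?"

Scaling scenario framework:

Current scale (1M DAU):
- Single DB, 2 read replicas, 1 cache server

10x growth (10M DAU):
- Introduce DB sharding (by user ID)
- Expand the cache cluster (Redis Cluster)
- Add a CDN (static assets)
- Separate reads and writes

100x growth (100M DAU):
- Multi-region deployment
- Split into microservices
- Move to an event-driven architecture
- Dedicated search engine (Elasticsearch)

8-4. Judgment Grounded in Real Experience

Interviewers value judgment drawn from real experience more than textbook answers.

"On a previous project we ran into cache invalidation problems..."
"When we used Redis we hit memory limits, and what I learned was..."
"During a real outage, the circuit breaker helped by..."

8-5. Communication (Diagrams + Explanation)

A system design interview is a conversation. Rather than lecturing alone for 30 minutes, you build the design together with the interviewer.

Effective communication techniques:

1. Use the whiteboard
   - Always explain with a diagram
   - Show data flow with arrows
   - Label components clearly

2. Set checkpoints
   - "If this looks good so far, I'll move on."
   - "Should I go deeper into this part?"

3. Think out loud
   - Don't think in silence; share your reasoning aloud
   - "There are two options: A does..., B does..."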

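Practice Quiz

Check what you have learned.

Q1. What is the most important thing to do in the first 5 minutes of a system design interview?

A: Clarify the requirements

Ask the interviewer about functional requirements (the 3 core features) and non-functional requirements (scale, latency, availability) to narrow the design scope. Jumping straight into the design is the most common mistake.

Q2. Under the CAP theorem, when a network partition occurs, should a payment system choose CP or AP?

A: CP (Consistency + Partition tolerance)

In a payment system, financial accuracy is paramount. Being temporarily unavailable (giving up availability) is better than processing an incorrect amount (giving up consistency). Retry mechanisms and idempotency compensate for the reduced availability.

Q3. Which fan-out strategy suits posts from a celebrity with 1 million followers?

A: Fan-out on Read (Pull model) or a hybrid

Pushing a celebrity's post to all 1 million followers makes writes too expensive. The practical answer is a hybrid. Use the Pull model (aggregate at feed read time) for celebrity posts and the Push model for regular users.

Q4. What does the KV Cache do in LLM serving, and why does it matter?

A: It caches the Key/Value tensors of previous tokens to avoid redundant computation.

In autoregressive generation, each new token attends to all previous tokens. Without a cache, the Key/Value projections of every previous token would be recomputed at every step. The KV Cache stores the already computed Keys/Values, so each step only computes them for the new token. This can speed up generation by several times to tens of times, depending on sequence length. As a result, managing the cache in GPU memory becomes the central challenge, which is what techniques such as PagedAttention address.

Q5. Estimate the peak QPS of a chat app with 50 million DAU using back-of-the-envelope math.

A: About 70,000 QPS

Calculation:

DAU: 50 million
Average messages: 40 per user per day
Total messages: 50M x 40 = 2 billion/day
Average QPS: 2 billion / 86,400 = ~23,000 QPS
Peak QPS: ~3x average = ~70,000 QPS

The peak multiplier varies by service. Chat apps concentrate traffic in the evening, so 2-5x is a typical range.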

References

Books

Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017)
Alex Xu, System Design Interview: An Insider's Guide Volume 1 (2020)
Alex Xu, System Design Interview: An Insider's Guide Volume 2 (2022)
Zhiyong Tan, Acing the System Design Interview (Manning, 2024)
Gaurav Sen, System Design Simplified (2023)

Online Courses and Platforms

Educative - Grokking the System Design Interview
ByteByteGo - System Design Course (bytebytego.com)
Codemia - 120+ System Design Practice Problems (codemia.dev)
Exponent - System Design Interview Course (tryexponent.com)
Donnemartin - System Design Primer (GitHub open source)

Tech Blogs and Papers

Google SRE Book - Site Reliability Engineering (sre.google/sre-book)
Amazon Builders' Library (aws.amazon.com/builders-library)
Meta Engineering Blog (engineering.fb.com)
Netflix Tech Blog (netflixtechblog.com)
Uber Engineering Blog (eng.uber.com)
Leslie Lamport, The Part-Time Parliament (Paxos paper, 1998)
DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007)
Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006)
Kafka official documentation (kafka.apache.org)
vLLM project (vllm.ai) - includes the PagedAttention paper

System Design Interview 2025: Top 30 FAANG Questions and the New AI/ML Design Round

1. Why System Design Matters
2. The 45-Minute System Design Framework
3. Ten Essential Concepts You Must Know
4. Top 15 Most Frequently Asked Problems in 2025
- Problem-Specific Tips
5. New in 2025: AI/ML System Design
6. Back-of-the-Envelope Estimation Cheat Sheet
7. Top 5 Study Resources
8. Five Things Interviewers Actually Evaluate
Practice Quiz
References

1. Why System Design Matters

System design interviews are the key gatekeeping round for senior engineering hires. The 2025 numbers make the importance crystal clear.

The Current Landscape

70%+ of senior candidates face at least one system design round
60% of questions are related to cloud-native architecture (AWS, GCP, Azure)
80% of problems focus on 20% of core concepts (Pareto principle)
At FAANG (Meta, Apple, Amazon, Netflix, Google), system design occupies 2-3 rounds

Expectations by Level

Level	Expected Depth	Key Evaluation Points
L3-L4 (Junior)	Basic component understanding	API design, single-service design
L5 (Senior)	Full distributed system design	Trade-off analysis, scalability
L6+ (Staff)	Large-scale system architecture	Cross-org impact, technical strategy

New Question Types in 2025

2025 has introduced entirely new categories alongside classic system design problems.

AI/ML System Design: Recommendation systems, LLM serving, RAG pipelines
Real-time Systems: 18 out of 30 problems require real-time processing
Cost Optimization: Cloud cost considerations are now a mandatory evaluation criterion
Security Design: Zero Trust and data privacy are baseline requirements

Where the classic question was "Design Twitter," interviewers now ask questions like "Design a real-time recommendation system that maintains model serving latency under 50ms while handling 1 billion inference requests per day."

2. The 45-Minute System Design Framework

The most common mistake is jumping straight into the design. A systematic framework is essential to navigate the 45-minute time constraint.

2-1. Clarify Requirements (5 minutes)

Ask the interviewer questions to narrow the scope. Skipping this step is one of the most common reasons candidates fail.

Functional Requirements

Example questions:
- "What are the 3 core actions a user can perform?"
- "What is the read-to-write ratio?"
- "How long must we retain data?"

Non-Functional Requirements

Things to confirm:
- Availability: 99.9% vs 99.99% (annual downtime 8.7 hours vs 52 minutes)
- Latency: specific targets like p99 < 200ms
- Consistency: strong consistency vs eventual consistency
- Scale: DAU, concurrent users, data volume

2-2. Back-of-the-Envelope Estimation (5 minutes)

This step quantifies the scale. Interviewers care more about the reasoning process than exact numbers.

Example: Estimating scale for a chat system

- DAU: 50 million
- Average messages per day: 40 per user
- Total messages: 50M x 40 = 2 billion/day
- QPS: 2 billion / 86,400 = ~23,000 QPS
- Peak QPS: 3x average = ~70,000 QPS
- Message size: 100 bytes average
- Daily storage: 2B x 100B = 200GB/day
- Annual storage: 200GB x 365 = 73TB/year

2-3. High-Level Design (15 minutes)

Express the core components and data flow as a diagram.

Typical high-level architecture:

Client --> Load Balancer --> API Gateway
                               |
                    +----------+----------+
                    |          |          |
                Service A  Service B  Service C
                    |          |          |
                 DB (SQL)  Cache     Message Queue
                              |          |
                           DB (NoSQL)  Worker

Must-includes at this stage:

Client-server protocol (HTTP, WebSocket, gRPC)
Data store selection with rationale
Cache layer placement
Identification of async processing needs

2-4. Deep Dive (15 minutes)

Dive deep into the components the interviewer shows interest in.

Deep dive areas:

1. Database Schema Design
   - Table structure, indexes, partitioning strategy

2. API Design
   - Endpoints, request/response format, error handling

3. Core Algorithms
   - News Feed: Fan-out on write vs Fan-out on read
   - Search: Inverted index structure
   - Recommendations: Collaborative filtering vs content-based filtering

2-5. Scalability, Fault Tolerance, Monitoring (5 minutes)

The final step demonstrates your system's robustness.

Checklist:
- Identify and eliminate Single Points of Failure (SPOF)
- Data replication strategy (Leader-Follower, Multi-Leader)
- Failure detection and automatic recovery (Health Checks, Circuit Breaker)
- Monitoring metrics (Latency, Error Rate, Throughput)
- Alerting system (PagerDuty, Slack integration)

3. Ten Essential Concepts You Must Know

These core concepts cover 80% of all system design questions.

3-1. Horizontal vs Vertical Scaling

Vertical Scaling (Scale Up): Increasing the power of a single server

Pros: Simple to implement, easy to maintain data consistency
Cons: Hardware limits exist, single point of failure

Horizontal Scaling (Scale Out): Adding more servers

Pros: Theoretically unlimited scaling, fault tolerance
Cons: Distributed system complexity, data consistency challenges

Vertical scaling:
  1 server (4 CPU, 16GB RAM) --> 1 server (32 CPU, 128GB RAM)

Horizontal scaling:
  1 server --> 10 servers (each 4 CPU, 16GB RAM) + Load Balancer

In practice, most designs default to horizontal scaling, though some components like databases may start with vertical scaling.

3-2. Load Balancers (L4 vs L7)

Load balancers distribute traffic across multiple servers.

Aspect	L4 (Transport Layer)	L7 (Application Layer)
Operates on	IP, Port	URL, Headers, Cookies
Speed	Faster	Relatively slower
Flexibility	Low	High (path-based routing)
TLS/SSL Termination	Usually no (TLS passthrough); some L4 LBs offer it	Yes
Use case	Internal service communication	API Gateway, Web servers

L7 load balancer routing example:

/api/users/*  --> User Service Cluster
/api/orders/* --> Order Service Cluster
/api/search/* --> Search Service Cluster
/static/*     --> CDN Origin

3-3. Caching Strategies

Caching is the cornerstone of system performance. Here are three key patterns.

Cache-Aside (Lazy Loading)

Read request flow:
1. Look up data in cache
2. Cache hit --> return data
3. Cache miss --> query DB --> store in cache --> return

Pros: Only caches needed data, DB fallback on cache failure
Cons: Every cache miss (including the first request for a key) is slow; cache-DB inconsistency possible

Write-Through

Write request flow:
1. Write data to cache
2. Cache synchronously writes to DB
3. Return completion response

Pros: Cache and DB always consistent
Cons: Increased write latency, caches unused data

Write-Behind (Write-Back)

Write request flow:
1. Write data to cache
2. Return completion immediately
3. Asynchronously batch-write to DB

Pros: Maximum write performance
Cons: Risk of data loss on cache failure

3-4. Databases (SQL vs NoSQL, Sharding, Replication)

SQL vs NoSQL Selection Criteria

Criteria	SQL (Relational)	NoSQL (Non-Relational)
Data structure	Structured, complex relations	Unstructured, flexible schema
Consistency	ACID guaranteed	Eventual consistency (mostly)
Scalability	Vertical scaling first	Horizontal scaling friendly
Use case	Payments, inventory	Logs, sessions, social feeds
Products	PostgreSQL, MySQL	MongoDB, Cassandra, DynamoDB

Sharding Strategies

1. Range-based Sharding
   - user_id 1 to 1,000,000 --> Shard 1
   - user_id 1,000,001 to 2,000,000 --> Shard 2
   - Pros: Efficient range queries
   - Cons: Hotspot risk

2. Hash-based Sharding
   - hash(user_id) % N --> Shard number
   - Pros: Even distribution
   - Cons: Inefficient range queries, complex rebalancing

3. Directory-based Sharding
   - Managed via lookup table
   - Pros: Flexible mapping
   - Cons: Lookup table becomes SPOF

3-5. Message Queues (Kafka, RabbitMQ)

The backbone of asynchronous communication.

Feature	Apache Kafka	RabbitMQ
Model	Log-based streaming	Message broker
Throughput	Millions per second	Tens of thousands per second
Message retention	Retained for configured period	Deleted after consumption
Ordering guarantee	Within partition	Within queue
Use case	Event streaming, logs	Task queues, RPC

Kafka architecture:

Producer --> Topic (Partition 0) --> Consumer Group A
                                 --> Consumer Group B
         --> Topic (Partition 1) --> Consumer Group A
                                 --> Consumer Group B
         --> Topic (Partition 2) --> Consumer Group A
                                 --> Consumer Group B

- Partition: unit of parallelism
- Consumer Group: independent consumption
- Offset: tracks consumption position

3-6. CDN and Edge Computing

A CDN (Content Delivery Network) caches content on edge servers worldwide.

CDN operation:

User (Korea) --> Korea edge server (cache hit) --> immediate response
                                    (cache miss) --> origin (US) --> cache --> response

Pull CDN: Fetches from origin on first request (CloudFront, Cloudflare)
Push CDN: Pre-distributes content (suited for large static files)

Edge Computing goes beyond CDN by running computation at the edge.

Cloudflare Workers, AWS Lambda@Edge, Vercel Edge Functions
A/B testing, authentication, personalization at the edge
Reduces origin server load + minimizes latency

3-7. API Design (REST vs gRPC vs GraphQL)

Feature	REST	gRPC	GraphQL
Protocol	HTTP (typically HTTP/1.1)	HTTP/2	HTTP
Data format	JSON	Protocol Buffers	JSON
Type safety	Low	High (schema required)	Medium (schema required)
Streaming	Limited	Bidirectional support	Subscription support
Use case	Public APIs	Internal microservices	Client-driven queries

REST example:
GET /api/v1/users/123
GET /api/v1/users/123/orders?page=1&limit=10

gRPC example:
service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListOrders(ListOrdersRequest) returns (stream Order);
}

GraphQL example:
query {
  user(id: "123") {
    name
    orders(first: 10) {
      id
      total
    }
  }
}

3-8. Consistent Hashing

An algorithm that minimizes data redistribution when nodes are added or removed in a distributed system.

The problem with basic hashing:
  hash(key) % N = server number
  When N changes, almost all keys get remapped

Consistent Hashing:
  - Hash space arranged in a ring (0 to 2^32-1)
  - Both servers and keys are placed on the ring
  - Keys are assigned to the nearest server clockwise
  - Adding/removing a server only affects adjacent segments

Virtual Nodes:
  - Each physical server maps to multiple virtual nodes
  - Distributes load more evenly
  - Number of virtual nodes can be adjusted by server capacity

3-9. Rate Limiting (Token Bucket, Sliding Window)

Protects the system from excessive traffic.

Token Bucket Algorithm

How it works:
1. Tokens are added to a fixed-size bucket at a constant rate
2. Each request consumes 1 token
3. If no tokens remain, request is rejected (HTTP 429)

Parameters:
- Bucket size: 100 (maximum burst)
- Refill rate: 10/second (average throughput)

Sliding Window Log Algorithm

How it works:
1. Record each request timestamp in a sorted log
2. Remove entries outside the current window (e.g., 1 minute)
3. Reject if request count within window exceeds the limit

Pros: Precise window boundary handling
Cons: High memory usage (stores all timestamps)

Rate Limiting in Distributed Environments

Approach 1: Centralized (Redis-based)
  - All servers share a Redis counter
  - Accurate but Redis becomes bottleneck/SPOF

Approach 2: Local + Synchronization
  - Each server counts locally
  - Periodically syncs with central store
  - Fast but allows slight over-limit

3-10. CAP Theorem and Real-World Trade-offs

CAP Theorem: A distributed system can guarantee at most 2 of these 3 properties simultaneously.

C (Consistency): All nodes see the same data
A (Availability): Every request gets a response
P (Partition Tolerance): System works despite network partitions

Since network partitions are inevitable in reality, the practical choice is CP vs AP.

System	Choice	Reason
Payment system	CP	Financial accuracy is paramount
Social media feed	AP	Brief inconsistency is better than downtime
Inventory management	CP	Prevent overselling
DNS	AP	Availability is critical
Chat messages	AP	Eventual consistency is sufficient

4. Top 15 Most Frequently Asked Problems in 2025

The 15 problems that appear most often in real interviews.

Rank	Problem	Difficulty	Companies	Key Concepts
1	URL Shortener	Easy	All	Hashing, DB selection, read optimization
2	Rate Limiter	Easy	All	Token Bucket, distributed counting
3	News Feed System	Medium	Meta, Twitter	Fan-out, cache layers, ranking
4	Chat System	Medium	WhatsApp, Slack	WebSocket, message queues, state management
5	Video Streaming Platform	Medium	Netflix, YouTube	CDN, adaptive bitrate, transcoding
6	Search Autocomplete	Medium	Google, Amazon	Trie, Elasticsearch, prefix matching
7	Location-Based Service	Medium	Uber, DoorDash	Geohash, QuadTree, proximity search
8	Notification System	Medium	All	Push/Pull, message queues, priority
9	Distributed Cache	Hard	Amazon, Google	Consistent Hashing, replication, failure detection
10	Distributed Message Queue	Hard	Kafka team, LinkedIn	Partitioning, ISR, consumer groups
11	Payment System	Hard	Stripe, Toss, PayPal	Idempotency, Saga pattern, double-entry
12	Real-time Gaming Backend	Hard	Riot Games, Epic	State synchronization, UDP, lag compensation
13	Recommendation System	Hard	Netflix, Spotify	ML pipeline, Feature Store
14	Distributed File System	Hard	Google, Microsoft	GFS, HDFS, chunk servers
15	Ad Click Aggregation	Hard	Google, Meta	Stream processing, exactly-once semantics

Problem-Specific Tips

URL Shortener - The most fundamental problem, but one where you can demonstrate depth.

Key design points:
1. Short URL generation: Base62 encoding vs hash (MD5/SHA256 truncation)
2. Collision handling: DB unique constraint + retry vs counter-based
3. Redirection: 301 (permanent) vs 302 (temporary) with rationale
4. Caching: Frequently accessed URLs in Redis (80/20 rule)
5. Analytics: Click count, region, device tracking
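The counter-based option in point 1 can be sketched like this. The sketch assumes each long URL has already received a unique non-negative integer ID, for example from a DB sequence or ID generator. The alphabet order (0-9, a-z, A-Z) is an arbitrary choice.

```python
import string

# Alphabet order is a design choice; this sketch uses 0-9, a-z, A-Z.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a Base62 short code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Inverse of encode_base62: map a short code back to its numeric ID."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


print(encode_base62(125))  # 21
print(62 ** 7)             # 3521614606208 distinct 7-character codes
```

Each ID maps to exactly one code, so this approach cannot produce collisions. The trade-off is that sequential IDs make short codes guessable. The hash-truncation option avoids that but needs the unique-constraint-and-retry handling from point 2.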

News Feed System - The fan-out strategy is the core decision.

Fan-out on Write (Push model):
- When a post is created, immediately write to all followers' feeds
- Pros: Fast reads
- Cons: High write cost for users with many followers (celebrities)

Fan-out on Read (Pull model):
- When a feed is viewed, aggregate in real-time from followed users
- Pros: Low write cost
- Cons: Slow reads

Hybrid (Production answer):
- Regular users: Push model
- Celebrities (1M+ followers): Pull model
- Merge both for the final feed
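The hybrid strategy can be sketched with in-memory dictionaries standing in for the follower graph, post store, and feed cache. Everything here is hypothetical: the `FeedService` class, its method names, and the configurable `threshold`, whose default mirrors the 1M-follower cutoff above.

```python
import heapq
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000  # followers at or above this skip fan-out on write


class FeedService:
    """Toy hybrid fan-out: push for regular authors, pull for celebrities."""

    def __init__(self, threshold: int = CELEBRITY_THRESHOLD) -> None:
        self.threshold = threshold
        self.followers: dict[str, set[str]] = defaultdict(set)  # author -> users
        self.following: dict[str, set[str]] = defaultdict(set)  # user -> authors
        self.posts: dict[str, list[tuple[int, str]]] = defaultdict(list)  # author -> [(ts, id)]
        self.feeds: dict[str, list[tuple[int, str]]] = defaultdict(list)  # user -> pushed entries

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= self.threshold

    def publish(self, author: str, ts: int, post_id: str) -> None:
        self.posts[author].append((ts, post_id))
        if not self.is_celebrity(author):
            # Fan-out on write: copy the post reference into every follower's feed.
            for user in self.followers[author]:
                self.feeds[user].append((ts, post_id))

    def get_feed(self, user: str, limit: int = 10) -> list[str]:
        # Pushed entries from regular authors (a set also drops duplicates).
        candidates = set(self.feeds[user])
        # Fan-out on read: pull recent posts from followed celebrities.
        for author in self.following[user]:
            if self.is_celebrity(author):
                candidates.update(self.posts[author][-limit:])
        # Merge both sources, newest first.
        return [post_id for _, post_id in heapq.nlargest(limit, candidates)]


svc = FeedService(threshold=3)
for fan in ("a", "b", "c"):
    svc.follow(fan, "celeb")
svc.follow("a", "friend")
svc.publish("friend", ts=1, post_id="f1")
svc.publish("celeb", ts=2, post_id="c1")
print(svc.get_feed("a"))  # ['c1', 'f1']
```

The celebrity's post is written once and read by each follower on demand. The regular author's post is copied once per follower at write time.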

Chat System - Real-time communication is the core challenge.

Architecture:
1. Connection management: WebSocket servers + connection state store
2. Message delivery: Direct for 1:1, fan-out for groups
3. Offline handling: Store in message queue, deliver on reconnection
4. Read receipts: Separate service
5. Message storage: DB optimized for chronological ordering
   - HBase/Cassandra (wide-column stores)
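Points 1 to 3 can be sketched as a small router. A dictionary of connection callbacks stands in for the connection state store, and a per-user deque stands in for the message queue that holds offline messages. The `ChatRouter` class and its methods are hypothetical. A real system would also persist messages durably and wait for client acknowledgements, which this sketch omits.

```python
from collections import defaultdict, deque
from typing import Callable

Message = tuple[str, str, str]  # (sender, recipient, text)


class ChatRouter:
    """Toy router: deliver directly if the recipient is online, else queue."""

    def __init__(self) -> None:
        # user -> send function of the WebSocket connection currently holding that user
        self.connections: dict[str, Callable[[Message], None]] = {}
        # user -> messages waiting for delivery (stand-in for a durable queue)
        self.pending: dict[str, deque[Message]] = defaultdict(deque)

    def connect(self, user: str, send: Callable[[Message], None]) -> None:
        self.connections[user] = send
        # Flush anything that arrived while the user was offline, in order.
        queue = self.pending[user]
        while queue:
            send(queue.popleft())

    def disconnect(self, user: str) -> None:
        self.connections.pop(user, None)

    def send(self, sender: str, recipients: list[str], text: str) -> None:
        # 1:1 chat has a single recipient; a group message fans out per member.
        for recipient in recipients:
            msg = (sender, recipient, text)
            conn = self.connections.get(recipient)
            if conn is not None:
                conn(msg)
            else:
                self.pending[recipient].append(msg)


inbox: list[Message] = []
router = ChatRouter()
router.send("alice", ["bob"], "hi")            # bob offline: queued
router.connect("bob", inbox.append)            # delivered on reconnect
router.send("alice", ["bob"], "still there?")  # bob online: delivered directly
print([text for _, _, text in inbox])          # ['hi', 'still there?']
```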

5. New in 2025: AI/ML System Design

The biggest shift in 2025 system design interviews is the surge of AI/ML-related problems. Google, Meta, Amazon, and many startups now evaluate candidates' AI system design skills.

5-1. Recommendation System Architecture

Recommendation systems are the most classic AI system design problem.

Full recommendation pipeline:

Data Collection --> Feature Store --> Model Training --> Model Registry
     |                                    |                    |
User behavior logs               Offline batch            A/B Testing
Clicks, purchases, views        Daily/weekly update    Champion/Challenger
     |                                    |                    |
     v                                    v                    v
Real-time feature   -->  Candidate    -->  Ranking  -->  Re-ranking  -->  Display
computation              Generation       (Scoring)     (Re-ranking)
                        (Retrieval)

Key components:
1. Feature Store: Offline (batch) + Online (real-time) feature management
2. Candidate Generation: ANN (Approximate Nearest Neighbors) to select ~1000 candidates
3. Ranking Model: Score candidates with deep learning model
4. Re-ranking: Apply business rules, diversity, freshness
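The candidate generation, ranking, and re-ranking stages can be sketched as three small functions. Several parts are stand-ins. Exact cosine search over a dictionary replaces an ANN index. A hand-weighted blend of similarity and popularity replaces the deep learning ranking model; the 0.8/0.2 weights are hypothetical. A per-category cap represents the diversity rule.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_candidates(user_vec: list[float], item_vecs: dict[str, list[float]], k: int) -> list[str]:
    # Exact nearest neighbours; a production system would query an ANN index here.
    ranked = sorted(item_vecs, key=lambda item: cosine(user_vec, item_vecs[item]), reverse=True)
    return ranked[:k]


def rank(candidates: list[str], user_vec: list[float],
         item_vecs: dict[str, list[float]], popularity: dict[str, float]) -> list[str]:
    # Hypothetical scoring: similarity blended with a popularity prior.
    def score(item: str) -> float:
        return 0.8 * cosine(user_vec, item_vecs[item]) + 0.2 * popularity[item]
    return sorted(candidates, key=score, reverse=True)


def rerank(ranked: list[str], category: dict[str, str], max_per_category: int, n: int) -> list[str]:
    # Business rule for diversity: cap how many items one category contributes.
    out: list[str] = []
    counts: dict[str, int] = {}
    for item in ranked:
        c = category[item]
        if counts.get(c, 0) < max_per_category:
            out.append(item)
            counts[c] = counts.get(c, 0) + 1
        if len(out) == n:
            break
    return out


user = [1.0, 0.0]
items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.7, 0.7]}
popularity = {"a": 0.1, "b": 0.5, "c": 1.0, "d": 0.2}
category = {"a": "news", "b": "news", "c": "sport", "d": "sport"}
candidates = generate_candidates(user, items, k=3)           # ['a', 'b', 'd']
ranked = rank(candidates, user, items, popularity)          # ['b', 'a', 'd']
print(rerank(ranked, category, max_per_category=1, n=2))    # ['b', 'd']
```

Each stage narrows the set for the next, more expensive stage. Re-ranking then trades a little relevance ("a") for diversity ("d").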

5-2. LLM Serving System

Large Language Model (LLM) serving is the hottest topic in 2025.

LLM Serving Architecture:

Request --> API Gateway --> Request Router --> GPU Cluster
                               |                   |
                          Token Limiter     Model Instances (vLLM)
                          Queue Manager           |
                               |            KV Cache Management
                          Response Streaming <--- |
                               |
                          Cost Tracker

Key optimization techniques:
1. KV Cache: Cache previous tokens' Key/Value to avoid redundant computation
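2. Continuous Batching: Add new requests to the running batch, and release finished ones, at every decoding step instead of waiting for the whole batch to complete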
3. PagedAttention: Manage memory in page units (vLLM)
4. Quantization: FP16 --> INT8/INT4 to reduce model size
5. Speculative Decoding: Draft with small model, verify with large model
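Technique 1 (KV Cache) can be illustrated with a toy, single-head attention in pure Python. The 2x2 projection matrices `WQ`, `WK`, and `WV` are made-up values. The sketch decodes the same token sequence with and without a cache and counts Key/Value projections. Real serving engines such as vLLM do this per layer and per head on GPU tensors, and PagedAttention concerns how that cache memory is allocated, which is not shown here.

```python
import math

WQ = [[1.0, 0.5], [0.0, 1.0]]  # hypothetical projection weights
WK = [[0.5, 0.0], [1.0, 1.0]]
WV = [[1.0, 0.0], [0.3, 0.7]]


def matvec(m: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def attend(q: list[float], keys: list[list[float]], values: list[list[float]]) -> list[float]:
    # Scaled dot-product attention of one query over all cached positions.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(len(values[0]))]


def decode_without_cache(xs: list[list[float]]) -> tuple[list[list[float]], int]:
    outputs, kv_projections = [], 0
    for t in range(len(xs)):
        keys = [matvec(WK, x) for x in xs[: t + 1]]    # recomputed every step
        values = [matvec(WV, x) for x in xs[: t + 1]]
        kv_projections += 2 * (t + 1)
        outputs.append(attend(matvec(WQ, xs[t]), keys, values))
    return outputs, kv_projections


def decode_with_cache(xs: list[list[float]]) -> tuple[list[list[float]], int]:
    outputs, kv_projections = [], 0
    k_cache: list[list[float]] = []
    v_cache: list[list[float]] = []
    for x in xs:
        k_cache.append(matvec(WK, x))  # only the newest token is projected
        v_cache.append(matvec(WV, x))
        kv_projections += 2
        outputs.append(attend(matvec(WQ, x), k_cache, v_cache))
    return outputs, kv_projections


xs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
out_a, n_a = decode_without_cache(xs)
out_b, n_b = decode_with_cache(xs)
print(out_a == out_b, n_a, n_b)  # True 20 8
```

The outputs are identical. Without a cache, projection work grows quadratically with sequence length; with a cache, it grows linearly. The price is cache memory that grows with every generated token.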

Auto-scaling Strategy

GPU auto-scaling considerations:
- GPU instance startup time: 3-10 minutes (much slower than CPU)
- Predictive scaling: Pre-learn hourly traffic patterns
- Queue-depth based: Trigger scaling on pending request count
- Cost optimization: Spot instances + on-demand fallback
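The queue-depth and predictive bullets can be combined into one scaling rule, and the spot/on-demand bullet into a capacity split. All function names, parameters, and thresholds below are hypothetical. The sketch shows the arithmetic only, not any cloud provider's autoscaler API.

```python
import math


def desired_gpu_replicas(queue_depth: int, target_queue_per_replica: int,
                         predicted_floor: int, min_replicas: int, max_replicas: int) -> int:
    """Queue-depth scaling combined with a forecast-based floor (hypothetical rule)."""
    # Reactive: enough replicas that each has at most target_queue_per_replica waiting.
    reactive = math.ceil(queue_depth / target_queue_per_replica)
    # Predictive: GPU nodes take minutes to start, so never drop below
    # the capacity the hourly traffic forecast says will be needed soon.
    desired = max(reactive, predicted_floor)
    return max(min_replicas, min(desired, max_replicas))


def split_capacity(desired: int, on_demand_base: int, spot_available: int) -> tuple[int, int]:
    """Return (on_demand, spot): fixed on-demand base, spot for the rest, on-demand fallback."""
    on_demand = min(desired, on_demand_base)
    spot = min(desired - on_demand, spot_available)
    # Fallback: whatever spot capacity cannot cover is bought on demand.
    on_demand += desired - on_demand - spot
    return on_demand, spot


replicas = desired_gpu_replicas(queue_depth=450, target_queue_per_replica=32,
                                predicted_floor=6, min_replicas=2, max_replicas=40)
print(replicas)                                                      # 15
print(split_capacity(replicas, on_demand_base=4, spot_available=8))  # (7, 8)
```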

5-3. RAG (Retrieval-Augmented Generation) Pipeline

RAG is a key pattern for reducing LLM hallucinations and providing up-to-date information.

RAG Pipeline Architecture:

Document ingestion --> Chunking --> Embedding generation --> Vector DB storage
                                                                |
User query --> Query embedding --> Similarity search --> Top-K document retrieval
                                                              |
                                  Prompt construction <-------+
                                        |
                                  LLM generation --> Response + source citations

Key design decisions:
1. Chunking strategy: Fixed-size vs semantic (paragraph/section)
2. Embedding model: OpenAI Ada-002, Cohere Embed, open-source models
3. Vector DB: Pinecone (managed), Milvus (self-hosted), pgvector
4. Search approach: Pure vector search vs hybrid (vector + keyword)
5. Re-ranking: Cross-encoder for precise reordering of search results
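The chunking, retrieval, and prompt-construction steps can be sketched end to end without external services. The stand-ins are deliberate: a bag-of-words term-count vector replaces a real embedding model, and a Python list replaces the vector DB. The function names and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunking by word count; neighbouring chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop before a chunk that would only repeat the previous chunk's overlap.
    return [" ".join(words[i : i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text: str) -> Counter:
    # Stand-in for an embedding model: lowercase term counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    # Numbered sources let the LLM cite them in its answer.
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the sources below and cite them by number.\n{sources}\n\nQuestion: {query}"


docs = [
    "Redis is an in-memory data store often used as a cache.",
    "Cassandra is a wide-column store optimized for writes.",
    "Kafka is a distributed log used for event streaming.",
]
index = [(c, embed(c)) for d in docs for c in chunk_words(d, size=50, overlap=10)]
top = retrieve("Which store is used as a cache?", index, k=1)
print(build_prompt("Which store is used as a cache?", top))
```

Swapping `embed` for a real model and the list for a vector DB changes the quality and scale, not the shape of the pipeline. A hybrid search (point 4) would add a keyword score to the sort key.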

5-4. Real-time ML Pipeline

Real-time ML Pipeline:

Event source --> Kafka --> Stream Processor (Flink) --> Feature computation
                                                            |
                                                       Feature Store
                                                            |
API request --> Model Server --> Inference result --> Business logic
                   |
              Model Registry (MLflow)

Use cases:
- Fraud detection: Payment event --> Real-time features --> Fraud score
- Real-time recommendations: User behavior --> Real-time embedding update --> Refreshed recs
- Dynamic pricing: Supply/demand signals --> Pricing model --> Price adjustment
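The fraud detection use case can be illustrated by the kind of per-user windowed features a stream processor such as Flink would maintain. The class name, the integer amounts, and especially the hand-written `fraud_score` rule are hypothetical. The rule stands in for a call to the model server, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class PaymentVelocityFeatures:
    """Per-user sliding-window features over recent payment events."""

    def __init__(self, window_seconds: int) -> None:
        self.window = window_seconds
        self.events: dict[str, deque[tuple[int, int]]] = defaultdict(deque)  # (ts, amount)
        self.totals: dict[str, int] = defaultdict(int)

    def update(self, user: str, ts: int, amount: int) -> dict[str, int]:
        q = self.events[user]
        q.append((ts, amount))
        self.totals[user] += amount
        # Evict events that fell out of the window (assumes in-order arrival).
        while q[0][0] <= ts - self.window:
            _, old_amount = q.popleft()
            self.totals[user] -= old_amount
        return {"txn_count": len(q), "txn_sum": self.totals[user]}


def fraud_score(features: dict[str, int]) -> float:
    # Hypothetical stand-in for the model server: a hand-written rule in [0, 1].
    return min(1.0, 0.2 * features["txn_count"] + features["txn_sum"] / 10_000)


feats = PaymentVelocityFeatures(window_seconds=60)
feats.update("u1", ts=0, amount=100)
feats.update("u1", ts=30, amount=200)
latest = feats.update("u1", ts=61, amount=50)
print(latest)  # {'txn_count': 2, 'txn_sum': 250}
```

Keeping a running total makes each update cost proportional to the number of evicted events, not the window size. That matters when features must be ready within the inference latency budget.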

5-5. AI Safety Design

AI system safety is a mandatory topic in 2025 interviews.

AI Safety Architecture:

User input --> Input filter --> LLM --> Output filter --> User response
                 |                         |
            Content classifier       Hallucination detector
            Prompt injection detection   Fact checker
            PII detection/masking        Toxicity filter
                 |                         |
            Block or warn            Modify or block

Design points:
1. Multi-layer defense: Filters at input/processing/output stages
2. Async audit: Analyze full conversation logs asynchronously
3. Adaptive rules: Fast rule updates for new attack patterns
4. Human-in-the-loop: Human review for uncertain automated cases
5. Latency budget: Minimize added latency from safety layers (target: under 50ms)
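The input-filter stage can be sketched as a cheap regex pass. The patterns and the injection phrase list are illustrative assumptions only: they catch obvious email addresses and phone-like digit runs, and they will miss many real cases and flag some false positives. Production systems pair rules like these with trained classifiers. Rules of this kind are easy to update quickly (point 3) and add very little latency (point 5).

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d -]{7,}\d")
INJECTION_MARKERS = ("ignore previous instructions", "ignore all previous instructions")


def filter_input(text: str) -> tuple[str, list[str]]:
    """Mask PII and flag suspicious input; returns (masked_text, flags)."""
    flags: list[str] = []
    if any(marker in text.lower() for marker in INJECTION_MARKERS):
        flags.append("possible_prompt_injection")
    masked, n_email = EMAIL.subn("[EMAIL]", text)
    masked, n_phone = PHONE.subn("[PHONE]", masked)
    if n_email or n_phone:
        flags.append("pii_masked")
    return masked, flags


print(filter_input("Contact me at jane.doe@example.com or +1 555-123-4567"))
# ('Contact me at [EMAIL] or [PHONE]', ['pii_masked'])
```

The caller decides what each flag means, for example blocking on `possible_prompt_injection` or only logging `pii_masked` for the async audit.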

6. Back-of-the-Envelope Estimation Cheat Sheet

Essential numbers you can use directly in interviews.

6-1. Latency Reference Table

Operation                      Time
---------------------------   ----------
L1 cache reference             1 ns
L2 cache reference             4 ns
Main memory (RAM) reference    100 ns
SSD random read                100 us (microseconds)
HDD seek                       10 ms
Same-datacenter network RTT    0.5 ms
Cross-continent network RTT    150 ms

Memory tips:
- L1 is 100x faster than RAM
- SSD is 1000x slower than RAM
- HDD is 100x slower than SSD
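- A same-datacenter round trip (0.5 ms) costs about 5 SSD reads but is still 20x faster than an HDD seek; a cross-continent round trip (150 ms) dwarfs everything else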

6-2. Capacity Calculation Formulas

QPS (Queries Per Second):
  QPS = DAU * average requests per user / 86,400
  Peak QPS = average QPS * 2~5 (depends on service)

Storage:
  Daily data = daily new records * average record size
  Annual data = daily data * 365
  Total storage = annual data * retention period (years) * replication factor (usually 3)

Bandwidth:
  Inbound = write QPS * average request size
  Outbound = read QPS * average response size
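The QPS, storage, and inbound bandwidth formulas above can be wrapped in a small calculator and checked against the chat example from section 2. The function and field names are hypothetical. The sketch assumes one stored record per write, a request size equal to the record size, and decimal units (1 GB = 10^9 bytes).

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    avg_qps: float
    peak_qps: float
    daily_storage_gb: float
    annual_storage_tb: float
    total_storage_tb: float
    inbound_mb_per_s: float


def estimate(dau: int, writes_per_user: float, record_bytes: int, peak_factor: float = 3.0,
             retention_years: float = 1.0, replication: int = 3) -> Estimate:
    """Apply the QPS, storage and inbound-bandwidth formulas (decimal units)."""
    daily_writes = dau * writes_per_user
    avg_qps = daily_writes / SECONDS_PER_DAY
    daily_bytes = daily_writes * record_bytes  # daily new records * record size
    annual_bytes = daily_bytes * 365
    return Estimate(
        avg_qps=avg_qps,
        peak_qps=avg_qps * peak_factor,
        daily_storage_gb=daily_bytes / 1e9,
        annual_storage_tb=annual_bytes / 1e12,
        total_storage_tb=annual_bytes * retention_years * replication / 1e12,
        inbound_mb_per_s=avg_qps * record_bytes / 1e6,  # write QPS * request size
    )


e = estimate(dau=50_000_000, writes_per_user=40, record_bytes=100)
print(round(e.avg_qps), round(e.peak_qps))                          # 23148 69444
print(e.daily_storage_gb, e.annual_storage_tb, e.total_storage_tb)  # 200.0 73.0 219.0
```

The numbers match the hand estimate in section 2 (about 23,000 average QPS, about 70,000 peak, 200 GB/day, 73 TB/year). They also show that 3x replication turns 73 TB of logical data into about 219 TB of raw storage.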

6-3. Building Scale Intuition

Approximate scale of major services (user counts mix daily and monthly active users; treat them as orders of magnitude):

Service        Users / volume      Avg QPS     Storage
-------------  ------------------  ----------  ----------------------
Twitter        500M                300K        Several TB/day
Instagram      2B                  1M          Tens of TB/day
YouTube        2B+                 500K+       500 hours uploaded/min
WhatsApp       2B+                 Millions    10B messages/day
Google Search  8.5B queries/day    ~100K       Hundreds of PB indexed

Powers of 2 (frequently used):
- 2^10 = 1,024 (~1 thousand)
- 2^20 = 1,048,576 (~1 million)
- 2^30 = 1,073,741,824 (~1 billion)
- 2^40 = ~1 trillion (1TB)

7. Top 5 Study Resources

The best materials for preparing for system design interviews.

7-1. DDIA (Designing Data-Intensive Applications)

By Martin Kleppmann. Widely regarded as the bible of system design.

Audience: Engineers who want deep understanding of distributed system principles
Strengths: Systematic coverage from theory to practical trade-offs
Key topics: Data models, storage engines, replication, partitioning, transactions, consistency, batch/stream processing
Study tip: Read cover-to-cover, then revisit key chapters before interviews (especially chapters 5-9)

7-2. System Design Interview (Alex Xu) Vol.1 + Vol.2

The most interview-relevant practical guide.

Audience: Engineers preparing for system design interviews for the first time
Strengths: Step-by-step solutions per problem, rich diagrams
Vol.1: URL shortener, news feed, chat, search autocomplete, and 9 more (13 total)
Vol.2: Location services, gaming leaderboard, payment system, and 10 more (13 total)

7-3. ByteByteGo

A visual learning platform run by Alex Xu.

Audience: Engineers who prefer visual learning
Strengths: Outstanding architecture diagram quality
Content: YouTube videos, weekly newsletter, online course
Recommended usage: Review concepts via videos during commute

7-4. Grokking the System Design Interview

An interactive learning course on the Educative platform.

Audience: Engineers seeking a structured curriculum
Strengths: Step-by-step learning path with quizzes
Structure: Core concepts + 15 design problems + glossary

7-5. Codemia (120+ Practice Problems)

A system design practice platform launched in 2024.

Audience: Engineers wanting exposure to a wide range of problems
Strengths: 120+ problems, difficulty-tiered
Features: Community solution comparison, timer feature
Recommended usage: Practice 3-4 problems per week with a 45-minute timer

12-Week Study Roadmap

Weekly study plan:

Weeks 1-2: Foundational Concepts (DDIA key chapters)
  - Scalability, availability, consistency principles
  - Database, cache, message queue basics

Weeks 3-4: Framework Practice (Alex Xu Vol.1)
  - Internalize the 45-minute framework
  - Solve 5 easy problems

Weeks 5-8: Problem-Solving Practice (Alex Xu Vol.2 + Codemia)
  - 10 medium-difficulty problems
  - 3-4 problems per week, using a timer

Weeks 9-10: Advanced Topics + AI/ML Systems
  - Deep dive into distributed systems
  - LLM serving, recommendation systems

Weeks 11-12: Mock Interviews
  - 3-5 mock interviews with peers
  - Focused review of weak areas

8. Five Things Interviewers Actually Evaluate

8-1. Ability to Structure Ambiguity

Interviewers intentionally ask vague questions. When asked "Design Google Drive," jumping straight into drawing the design is a red flag. What matters is the ability to ask questions and narrow the scope.

Good clarification questions:
- "Should I focus on file upload or download?"
- "What is the user scale? Are we talking 100 million users?"
- "Is real-time collaborative editing in scope?"
- "Do we need to support both mobile and web?"

8-2. Trade-off Analysis

The ability to clearly answer "Why B instead of A?"

Trade-off analysis example:

Question: "Would you choose Cassandra or MySQL for message storage?"

Good answer structure:
1. Confirm requirements: "Chat messages are write-heavy,
   mostly queried chronologically, and availability matters
   more than strong consistency."

2. Compare options:
   - MySQL: ACID guarantees, supports joins, but horizontal
     scaling is difficult and sharding is complex
   - Cassandra: Easy horizontal scaling, write-optimized,
     great for time-series data, but no joins

3. Conclusion: "Given our requirements prioritize write
   performance and horizontal scalability, I would choose
   Cassandra. For data with complex relationships like user
   profiles, I would use a separate MySQL instance."

8-3. Scaling Scenario Response

Handling the question "What if users grow 10x?"

Scaling scenario framework:

Current scale (1M DAU):
- Single DB, 2 read replicas, 1 cache server

10x growth (10M DAU):
- Introduce DB sharding (user ID-based)
- Expand cache cluster (Redis Cluster)
- Add CDN (static assets)
- Read/write separation

100x growth (100M DAU):
- Multi-region deployment
- Microservice decomposition
- Event-driven architecture migration
- Dedicated search engine (Elasticsearch)
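The "DB sharding (user ID-based)" step raises the question of how user IDs map to shards and what happens when shards are added during the next growth phase. The sketch below compares naive modulo routing with a consistent hash ring. The shard names, the 100 virtual nodes per shard, and the use of MD5 are assumptions; MD5 is used only for its even distribution, not for security.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    # Python's built-in hash() is randomized per process, so use a fixed digest.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def modulo_shard(user_id: str, num_shards: int) -> int:
    return stable_hash(user_id) % num_shards


class HashRing:
    """Consistent hashing with virtual nodes."""

    def __init__(self, shards: list[str], vnodes: int = 100) -> None:
        self.ring: list[tuple[int, str]] = sorted(
            (stable_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    def shard_for(self, user_id: str) -> str:
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.points, stable_hash(user_id)) % len(self.ring)
        return self.ring[idx][1]


users = [f"user{i}" for i in range(10_000)]
moved_mod = sum(modulo_shard(u, 4) != modulo_shard(u, 5) for u in users)
old = HashRing(["db0", "db1", "db2", "db3"])
new = HashRing(["db0", "db1", "db2", "db3", "db4"])
moved_ring = sum(old.shard_for(u) != new.shard_for(u) for u in users)
print(moved_mod / len(users), moved_ring / len(users))  # expect roughly 0.8 and 0.2
```

Going from 4 to 5 shards with modulo routing relocates about 80% of users. The ring relocates only the keys the new shard takes over, about one fifth, and all of them move to `db4`. That is why consistent hashing is the usual answer when resharding is expected.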

8-4. Experience-Based Judgment

Interviewers value real-world experience over textbook answers.

"In a previous project, I encountered a cache invalidation issue..."
"When using Redis, we hit memory limits, and what I learned was..."
"How a circuit breaker helped during an actual outage..."

8-5. Communication (Diagrams + Explanation)

A system design interview is a conversation. You should not monologue for 30 minutes; instead, build the design together with the interviewer.

Effective communication techniques:

1. Use the whiteboard
   - Always explain with diagrams
   - Show data flow with arrows
   - Label components clearly

2. Set checkpoints
   - "If this looks good so far, I will move on"
   - "Should I go deeper into this component?"

3. Share your thought process
   - Think out loud instead of going silent
   - "I see two options: A does... while B does..."

Practice Quiz

Test your understanding of the key concepts.

Q1. What is the single most important thing to do in the first 5 minutes of a system design interview?

A: Clarify the requirements.

Ask the interviewer about functional requirements (top 3 features) and non-functional requirements (scale, latency, availability) to narrow the design scope. Jumping straight into design is the most common mistake.

Q2. Under the CAP theorem, when a network partition occurs, should a payment system choose CP or AP?

A: CP (Consistency + Partition Tolerance)

In a payment system, financial accuracy is the top priority. Temporary service unavailability (sacrificing availability) is better than processing incorrect amounts (sacrificing consistency). Retry mechanisms and idempotency are used to mitigate availability degradation.

Q3. What fan-out strategy is appropriate for a celebrity with 1 million followers posting on a news feed?

A: Fan-out on Read (Pull model) or Hybrid

Pushing a celebrity's post to all 1 million followers incurs excessive write costs. The production answer is a hybrid approach: use the Pull model (real-time aggregation on feed view) for celebrities and the Push model for regular users, then merge both to generate the final feed.

Q4. What is the role of KV Cache in LLM serving, and why is it important?

A: It caches previously computed Key/Value tensors to avoid redundant computation.

In autoregressive generation, each new token attends over all previous tokens. Without a cache, every step would recompute the Key/Value projections for the whole prefix, so per-step work grows with sequence length. With KV Cache, each step projects only the newest token and reuses the stored pairs. The price is GPU memory that grows with sequence length and batch size, which is why memory management becomes the key challenge (hence techniques like PagedAttention).

Q5. Estimate the peak QPS for a chat app with 50 million DAU using back-of-the-envelope calculation.

A: Approximately 70,000 QPS

Calculation:

DAU: 50 million
Average messages: 40 per user per day
Total messages: 50M x 40 = 2 billion per day
Average QPS: 2 billion / 86,400 = approximately 23,000 QPS
Peak QPS: approximately 3x average = approximately 70,000 QPS

The peak multiplier varies by service; for chat apps that see evening usage spikes, 2-5x is typical.

References

Books

Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017)
Alex Xu, System Design Interview: An Insider's Guide Volume 1 (2020)
Alex Xu, System Design Interview: An Insider's Guide Volume 2 (2022)
Zhiyong Tan, Acing the System Design Interview (Manning, 2024)
Gaurav Sen, System Design Simplified (2023)

Online Courses and Platforms

Educative - Grokking the System Design Interview
ByteByteGo - System Design Course (bytebytego.com)
Codemia - 120+ System Design Practice Problems (codemia.dev)
Exponent - System Design Interview Course (tryexponent.com)
Donne Martin - System Design Primer (GitHub, open source)

Engineering Blogs and Papers

Google SRE Book - Site Reliability Engineering (sre.google/sre-book)
Amazon Builders Library (aws.amazon.com/builders-library)
Meta Engineering Blog (engineering.fb.com)
Netflix Tech Blog (netflixtechblog.com)
Uber Engineering Blog (eng.uber.com)
Leslie Lamport, The Part-Time Parliament (Paxos paper, 1998)
DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007)
Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006)
Apache Kafka Official Documentation (kafka.apache.org)
vLLM Project (vllm.ai) - including PagedAttention paper