System Design Interview 2025: Top 30 FAANG Questions and the New AI/ML Design Round


1. Why System Design Matters

System design interviews are the key gatekeeping round for senior engineering hires. The 2025 numbers make the importance crystal clear.

The Current Landscape

  • 70%+ of senior candidates face at least one system design round
  • 60% of questions are related to cloud-native architecture (AWS, GCP, Azure)
  • 80% of problems focus on 20% of core concepts (Pareto principle)
  • At FAANG (Meta, Apple, Amazon, Netflix, Google), system design occupies 2-3 rounds

Expectations by Level

Level            Expected Depth                    Key Evaluation Points
L3-L4 (Junior)   Basic component understanding     API design, single-service design
L5 (Senior)      Full distributed system design    Trade-off analysis, scalability
L6+ (Staff)      Large-scale system architecture   Cross-org impact, technical strategy

New Question Types in 2025

2025 has introduced entirely new categories alongside classic system design problems.

  • AI/ML System Design: Recommendation systems, LLM serving, RAG pipelines
  • Real-time Systems: 18 out of 30 problems require real-time processing
  • Cost Optimization: Cloud cost considerations are now a mandatory evaluation criterion
  • Security Design: Zero Trust and data privacy are baseline requirements

Where the classic question was "Design Twitter," interviewers now ask questions like "Design a real-time recommendation system that maintains model serving latency under 50ms while handling 1 billion inference requests per day."


2. The 45-Minute System Design Framework

The most common mistake is jumping straight into the design. A systematic framework is essential to navigate the 45-minute time constraint.

2-1. Clarify Requirements (5 minutes)

Ask the interviewer questions to narrow the scope. Candidates who skip this step almost always fail.

Functional Requirements

Example questions:
- "What are the 3 core actions a user can perform?"
- "What is the read-to-write ratio?"
- "How long must we retain data?"

Non-Functional Requirements

Things to confirm:
- Availability: 99.9% vs 99.99% (annual downtime 8.7 hours vs 52 minutes)
- Latency: specific targets like p99 < 200ms
- Consistency: strong consistency vs eventual consistency
- Scale: DAU, concurrent users, data volume

2-2. Back-of-the-Envelope Estimation (5 minutes)

This step quantifies the scale. Interviewers care more about the reasoning process than exact numbers.

Example: Estimating scale for a chat system

- DAU: 50 million
- Average messages per day: 40 per user
- Total messages: 50M x 40 = 2 billion/day
- QPS: 2 billion / 86,400 = ~23,000 QPS
- Peak QPS: 3x average = ~70,000 QPS
- Message size: 100 bytes average
- Daily storage: 2B x 100B = 200GB/day
- Annual storage: 200GB x 365 = 73TB/year
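The arithmetic above can be checked with a few lines of Python; the constants mirror the assumptions listed (50M DAU, 40 messages/user, 100-byte messages, a 3x peak factor):

```python
# Back-of-the-envelope numbers for the chat system above.
DAU = 50_000_000             # daily active users
MSGS_PER_USER = 40           # average messages per user per day
MSG_SIZE_BYTES = 100         # average message size
PEAK_MULTIPLIER = 3          # assumed peak-to-average ratio

total_msgs = DAU * MSGS_PER_USER                       # 2 billion/day
avg_qps = total_msgs / 86_400                          # ~23,000 QPS
peak_qps = avg_qps * PEAK_MULTIPLIER                   # ~70,000 QPS
daily_storage_gb = total_msgs * MSG_SIZE_BYTES / 1e9   # 200 GB/day
annual_storage_tb = daily_storage_gb * 365 / 1e3       # 73 TB/year

print(f"avg {avg_qps:,.0f} QPS, peak {peak_qps:,.0f} QPS")
print(f"{daily_storage_gb:.0f} GB/day, {annual_storage_tb:.0f} TB/year")
```

In the interview, round aggressively (86,400 seconds/day becomes "roughly 100K") and state each assumption out loud.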

2-3. High-Level Design (15 minutes)

Express the core components and data flow as a diagram.

Typical high-level architecture:

Client --> Load Balancer --> API Gateway
                               |
                    +----------+----------+
                    |          |          |
                Service A  Service B  Service C
                    |          |          |
                 DB (SQL)  Cache     Message Queue
                              |          |
                           DB (NoSQL)  Worker

Must-includes at this stage:

  • Client-server protocol (HTTP, WebSocket, gRPC)
  • Data store selection with rationale
  • Cache layer placement
  • Identification of async processing needs

2-4. Deep Dive (15 minutes)

Dive deep into the components the interviewer shows interest in.

Deep dive areas:

1. Database Schema Design
   - Table structure, indexes, partitioning strategy

2. API Design
   - Endpoints, request/response format, error handling

3. Core Algorithms
   - News Feed: Fan-out on write vs Fan-out on read
   - Search: Inverted index structure
   - Recommendations: Collaborative filtering vs content-based filtering

2-5. Scalability, Fault Tolerance, Monitoring (5 minutes)

The final step demonstrates your system's robustness.

Checklist:
- Identify and eliminate Single Points of Failure (SPOF)
- Data replication strategy (Leader-Follower, Multi-Leader)
- Failure detection and automatic recovery (Health Checks, Circuit Breaker)
- Monitoring metrics (Latency, Error Rate, Throughput)
- Alerting system (PagerDuty, Slack integration)

3. Ten Essential Concepts You Must Know

These core concepts cover 80% of all system design questions.

3-1. Horizontal vs Vertical Scaling

Vertical Scaling (Scale Up): Increasing the power of a single server

  • Pros: Simple to implement, easy to maintain data consistency
  • Cons: Hardware limits exist, single point of failure

Horizontal Scaling (Scale Out): Adding more servers

  • Pros: Theoretically unlimited scaling, fault tolerance
  • Cons: Distributed system complexity, data consistency challenges

Vertical scaling:
  1 server (4 CPU, 16GB RAM) --> 1 server (32 CPU, 128GB RAM)

Horizontal scaling:
  1 server --> 10 servers (each 4 CPU, 16GB RAM) + Load Balancer

In practice, most designs default to horizontal scaling, though some components like databases may start with vertical scaling.

3-2. Load Balancers (L4 vs L7)

Load balancers distribute traffic across multiple servers.

Aspect            L4 (Transport Layer)             L7 (Application Layer)
Operates on       IP, Port                         URL, Headers, Cookies
Speed             Faster                           Relatively slower
Flexibility       Low                              High (path-based routing)
SSL Termination   No                               Yes
Use case          Internal service communication   API Gateway, Web servers

L7 load balancer routing example:

/api/users/*  --> User Service Cluster
/api/orders/* --> Order Service Cluster
/api/search/* --> Search Service Cluster
/static/*     --> CDN Origin

3-3. Caching Strategies

Caching is the cornerstone of system performance. Here are three key patterns.

Cache-Aside (Lazy Loading)

Read request flow:
1. Look up data in cache
2. Cache hit --> return data
3. Cache miss --> query DB --> store in cache --> return

Pros: Only caches needed data, DB fallback on cache failure
Cons: First request always slow, cache-DB inconsistency possible

Write-Through

Write request flow:
1. Write data to cache
2. Cache synchronously writes to DB
3. Return completion response

Pros: Cache and DB always consistent
Cons: Increased write latency, caches unused data

Write-Behind (Write-Back)

Write request flow:
1. Write data to cache
2. Return completion immediately
3. Asynchronously batch-write to DB

Pros: Maximum write performance
Cons: Risk of data loss on cache failure

3-4. Databases (SQL vs NoSQL, Sharding, Replication)

SQL vs NoSQL Selection Criteria

Criteria         SQL (Relational)                NoSQL (Non-Relational)
Data structure   Structured, complex relations   Unstructured, flexible schema
Consistency      ACID guaranteed                 Eventual consistency (mostly)
Scalability      Vertical scaling first          Horizontal scaling friendly
Use case         Payments, inventory             Logs, sessions, social feeds
Products         PostgreSQL, MySQL               MongoDB, Cassandra, DynamoDB

Sharding Strategies

1. Range-based Sharding
   - user_id 1~1M --> Shard 1
   - user_id 1M~2M --> Shard 2
   - Pros: Efficient range queries
   - Cons: Hotspot risk

2. Hash-based Sharding
   - hash(user_id) % N --> Shard number
   - Pros: Even distribution
   - Cons: Inefficient range queries, complex rebalancing

3. Directory-based Sharding
   - Managed via lookup table
   - Pros: Flexible mapping
   - Cons: Lookup table becomes SPOF
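A minimal sketch of the first two strategies (the shard count and range size are illustrative, not prescriptive):

```python
import hashlib

NUM_SHARDS = 4

def range_shard(user_id, shard_size=1_000_000):
    # Range-based: contiguous ID blocks map to consecutive shards
    # (shard numbers start at 1, matching the example above).
    return (user_id - 1) // shard_size + 1

def hash_shard(user_id):
    # Hash-based: hash(user_id) % N spreads keys evenly, but a range
    # query over user_ids now has to touch every shard.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(range_shard(999_999), range_shard(1_000_001))   # 1 2
```

Note the rebalancing cost hiding in `% NUM_SHARDS`: changing the shard count remaps almost every key, which is the problem consistent hashing (section 3-8) solves.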

3-5. Message Queues (Kafka, RabbitMQ)

The backbone of asynchronous communication.

Feature              Apache Kafka                     RabbitMQ
Model                Log-based streaming              Message broker
Throughput           Millions per second              Tens of thousands per second
Message retention    Retained for configured period   Deleted after consumption
Ordering guarantee   Within partition                 Within queue
Use case             Event streaming, logs            Task queues, RPC

Kafka architecture:

Producer --> Topic
               |-- Partition 0 --> Consumer Group A, Consumer Group B
               |-- Partition 1 --> Consumer Group A, Consumer Group B
               |-- Partition 2 --> Consumer Group A, Consumer Group B

- Partition: unit of parallelism
- Consumer Group: independent consumption
- Offset: tracks consumption position

3-6. CDN and Edge Computing

A CDN (Content Delivery Network) caches content on edge servers worldwide.

CDN operation:

User (Korea) --> Korea edge server --(cache hit)---> immediate response
                                   --(cache miss)--> origin (US) --> cache --> response

Pull CDN: Fetches from origin on first request (CloudFront, Cloudflare)
Push CDN: Pre-distributes content (suited for large static files)

Edge Computing goes beyond CDN by running computation at the edge.

  • Cloudflare Workers, AWS Lambda@Edge, Vercel Edge Functions
  • A/B testing, authentication, personalization at the edge
  • Reduces origin server load + minimizes latency

3-7. API Design (REST vs gRPC vs GraphQL)

Feature       REST          gRPC                     GraphQL
Protocol      HTTP/1.1      HTTP/2                   HTTP
Data format   JSON          Protocol Buffers         JSON
Type safety   Low           High (schema required)   Medium (schema required)
Streaming     Limited       Bidirectional support    Subscription support
Use case      Public APIs   Internal microservices   Client-driven queries

REST example:
GET /api/v1/users/123
GET /api/v1/users/123/orders?page=1&limit=10

gRPC example:
service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListOrders(ListOrdersRequest) returns (stream Order);
}

GraphQL example:
query {
  user(id: "123") {
    name
    orders(first: 10) {
      id
      total
    }
  }
}

3-8. Consistent Hashing

An algorithm that minimizes data redistribution when nodes are added or removed in a distributed system.

The problem with basic hashing:
  hash(key) % N = server number
  When N changes, almost all keys get remapped

Consistent Hashing:
  - Hash space arranged in a ring (0 to 2^32-1)
  - Both servers and keys are placed on the ring
  - Keys are assigned to the nearest server clockwise
  - Adding/removing a server only affects adjacent segments

Virtual Nodes:
  - Each physical server maps to multiple virtual nodes
  - Distributes load more evenly
  - Number of virtual nodes can be adjusted by server capacity
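The ring with virtual nodes fits in a few dozen lines of Python; the MD5-based ring hash and the vnode count here are illustrative choices, not a prescribed design:

```python
import bisect
import hashlib

def _h(key):
    # Position on the ring: a 32-bit value derived from MD5 (illustrative).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self.ring = []      # sorted virtual-node positions
        self.owner = {}     # position -> physical server

    def add(self, server):
        for i in range(self.vnodes):
            pos = _h(f"{server}#{i}")   # each server gets many virtual nodes
            bisect.insort(self.ring, pos)
            self.owner[pos] = server

    def remove(self, server):
        self.ring = [p for p in self.ring if self.owner[p] != server]
        self.owner = {p: s for p, s in self.owner.items() if s != server}

    def lookup(self, key):
        # First virtual node clockwise from the key's position.
        pos = self.ring[bisect.bisect(self.ring, _h(key)) % len(self.ring)]
        return self.owner[pos]

ring = ConsistentHashRing()
for s in ("server-a", "server-b", "server-c"):
    ring.add(s)

before = {f"key{i}": ring.lookup(f"key{i}") for i in range(1000)}
ring.remove("server-b")
after = {k: ring.lookup(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
print(f"{moved} of 1000 keys moved")   # only keys that lived on server-b
```

Removing one of three servers remaps roughly a third of the keys (exactly the ones that lived on it); with naive `hash(key) % N`, nearly all 1000 would move.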

3-9. Rate Limiting (Token Bucket, Sliding Window)

Protects the system from excessive traffic.

Token Bucket Algorithm

How it works:
1. Tokens are added to a fixed-size bucket at a constant rate
2. Each request consumes 1 token
3. If no tokens remain, request is rejected (HTTP 429)

Parameters:
- Bucket size: 100 (maximum burst)
- Refill rate: 10/second (average throughput)
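The steps above translate to a small Python sketch; the capacity and refill rate are tunable parameters, shrunk here so the burst limit is easy to see:

```python
import time

class TokenBucket:
    def __init__(self, capacity=100, rate=10):
        self.capacity = capacity          # maximum burst size
        self.rate = rate                  # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1              # each request consumes one token
            return True
        return False                      # caller would return HTTP 429

bucket = TokenBucket(capacity=5, rate=1)
results = [bucket.allow() for _ in range(6)]
print(results)   # burst of 5 allowed, 6th rejected
```

Because refill is continuous, a client that stays under the average rate never notices the limiter, while a burst larger than the bucket is clipped immediately.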

Sliding Window Log Algorithm

How it works:
1. Record each request timestamp in a sorted log
2. Remove entries outside the current window (e.g., 1 minute)
3. Reject if request count within window exceeds the limit

Pros: Precise window boundary handling
Cons: High memory usage (stores all timestamps)
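The same idea as a sketch; passing timestamps in explicitly makes the sliding behavior visible (the limit of 3 requests per 60-second window is illustrative):

```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit=3, window=60.0):
        self.limit = limit        # max requests per window
        self.window = window      # window length in seconds
        self.log = deque()        # timestamps of accepted requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # 1-2. drop timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        # 3. reject if the window is already full
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True

limiter = SlidingWindowLog(limit=3, window=60.0)
print([limiter.allow(now=t) for t in (0, 1, 2, 3)])   # 4th rejected
print(limiter.allow(now=61.5))                        # window slid: allowed
```

The memory cost noted above is visible in the deque: it holds one entry per accepted request, which is why counter-based variants (fixed window, sliding window counter) are preferred at high QPS.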

Rate Limiting in Distributed Environments

Approach 1: Centralized (Redis-based)
  - All servers share a Redis counter
  - Accurate but Redis becomes bottleneck/SPOF

Approach 2: Local + Synchronization
  - Each server counts locally
  - Periodically syncs with central store
  - Fast but allows slight over-limit

3-10. CAP Theorem and Real-World Trade-offs

CAP Theorem: A distributed system can guarantee at most 2 of these 3 properties simultaneously.

  • C (Consistency): All nodes see the same data
  • A (Availability): Every request gets a response
  • P (Partition Tolerance): System works despite network partitions

Since network partitions are inevitable in reality, the practical choice is CP vs AP.

System                 Choice   Reason
Payment system         CP       Financial accuracy is paramount
Social media feed      AP       Brief inconsistency is better than downtime
Inventory management   CP       Prevent overselling
DNS                    AP       Availability is critical
Chat messages          AP       Eventual consistency is sufficient

4. Top 15 Most Frequently Asked Problems in 2025

The 15 problems that appear most often in real interviews.

Rank  Problem                    Difficulty  Companies              Key Concepts
1     URL Shortener              Easy        All                    Hashing, DB selection, read optimization
2     Rate Limiter               Easy        All                    Token Bucket, distributed counting
3     News Feed System           Medium      Meta, Twitter          Fan-out, cache layers, ranking
4     Chat System                Medium      WhatsApp, Slack        WebSocket, message queues, state management
5     Video Streaming Platform   Medium      Netflix, YouTube       CDN, adaptive bitrate, transcoding
6     Search Autocomplete        Medium      Google, Amazon         Trie, ElasticSearch, prefix matching
7     Location-Based Service     Medium      Uber, DoorDash         Geohash, QuadTree, proximity search
8     Notification System        Medium      All                    Push/Pull, message queues, priority
9     Distributed Cache          Hard        Amazon, Google         Consistent Hashing, replication, failure detection
10    Distributed Message Queue  Hard        Kafka team, LinkedIn   Partitioning, ISR, consumer groups
11    Payment System             Hard        Stripe, Toss, PayPal   Idempotency, Saga pattern, double-entry
12    Real-time Gaming Backend   Hard        Riot Games, Epic       State synchronization, UDP, lag compensation
13    Recommendation System      Hard        Netflix, Spotify       ML pipeline, Feature Store
14    Distributed File System    Hard        Google, Microsoft      GFS, HDFS, chunk servers
15    Ad Click Aggregation       Hard        Google, Meta           Stream processing, exactly-once semantics

Problem-Specific Tips

URL Shortener - The most fundamental problem, but one where you can demonstrate depth.

Key design points:
1. Short URL generation: Base62 encoding vs hash (MD5/SHA256 truncation)
2. Collision handling: DB unique constraint + retry vs counter-based
3. Redirection: 301 (permanent) vs 302 (temporary) with rationale
4. Caching: Frequently accessed URLs in Redis (80/20 rule)
5. Analytics: Click count, region, device tracking
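Design point 1 can be sketched in a few lines. This is the counter-based approach: Base62-encode an auto-increment ID (the 0-9a-z A-Z alphabet ordering is one common convention):

```python
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode(n):
    # Base62-encode an auto-increment ID into a short code.
    if n == 0:
        return BASE62[0]
    code = []
    while n:
        n, r = divmod(n, 62)
        code.append(BASE62[r])
    return "".join(reversed(code))

def decode(code):
    # Inverse mapping: short code back to the numeric ID.
    n = 0
    for ch in code:
        n = n * 62 + BASE62.index(ch)
    return n

print(encode(125))   # "21"
```

Seven Base62 characters cover 62^7 ≈ 3.5 trillion IDs, so codes stay short for a very long time; the trade-off versus hashing is that sequential IDs are guessable, which is worth raising in the interview.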

News Feed System - The fan-out strategy is the core decision.

Fan-out on Write (Push model):
- When a post is created, immediately write to all followers' feeds
- Pros: Fast reads
- Cons: High write cost for users with many followers (celebrities)

Fan-out on Read (Pull model):
- When a feed is viewed, aggregate in real-time from followed users
- Pros: Low write cost
- Cons: Slow reads

Hybrid (Production answer):
- Regular users: Push model
- Celebrities (1M+ followers): Pull model
- Merge both for the final feed
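A toy sketch of the hybrid merge at read time; every data structure below is a hypothetical stand-in for the feed store, post store, and social graph (higher post ID means newer):

```python
import heapq

push_feed = {                       # regular users: fanned out on write
    "bob": [(103, "carol: lunch pics"), (101, "dave: hello")],
}
celebrity_posts = {                 # celebrities: pulled on read
    "taylor": [(104, "taylor: new album"), (99, "taylor: tour dates")],
}
follows_celebrities = {"bob": ["taylor"]}

def read_feed(user, k=3):
    # Merge the precomputed push feed with on-read pulls from
    # followed celebrities, newest first.
    sources = [push_feed.get(user, [])]
    for celeb in follows_celebrities.get(user, []):
        sources.append(celebrity_posts.get(celeb, []))
    merged = heapq.merge(*[sorted(s, reverse=True) for s in sources],
                         reverse=True)
    return [post for _, post in list(merged)[:k]]

print(read_feed("bob"))
```

The expensive fan-out for the celebrity never happens; their followers pay a small extra cost at read time instead, which is the hybrid trade-off described above.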

Chat System - Real-time communication is the core challenge.

Architecture:
1. Connection management: WebSocket servers + connection state store
2. Message delivery: Direct for 1:1, fan-out for groups
3. Offline handling: Store in message queue, deliver on reconnection
4. Read receipts: Separate service
5. Message storage: DB optimized for chronological ordering
   - HBase/Cassandra (wide-column stores)

5. New in 2025: AI/ML System Design

The biggest shift in 2025 system design interviews is the surge of AI/ML-related problems. Google, Meta, and Amazon -- as well as startups -- now evaluate AI system design capabilities.

5-1. Recommendation System Architecture

Recommendation systems are the most classic AI system design problem.

Full recommendation pipeline:

Data Collection --> Feature Store --> Model Training --> Model Registry
     |                                    |                    |
User behavior logs               Offline batch            A/B Testing
Clicks, purchases, views        Daily/weekly update    Champion/Challenger
     |                                    |                    |
     v                                    v                    v
Real-time feature  -->  Candidate    -->  Ranking   -->  Re-ranking       -->  Display
computation             Generation        (Scoring)      (business rules)
                        (Retrieval)

Key components:
1. Feature Store: Offline (batch) + Online (real-time) feature management
2. Candidate Generation: ANN (Approximate Nearest Neighbors) to select ~1000 candidates
3. Ranking Model: Score candidates with deep learning model
4. Re-ranking: Apply business rules, diversity, freshness

5-2. LLM Serving System

Large Language Model (LLM) serving is the hottest topic in 2025.

LLM Serving Architecture:

Request --> API Gateway --> Request Router --> GPU Cluster
                               |                   |
                          Token Limiter     Model Instances (vLLM)
                          Queue Manager           |
                               |            KV Cache Management
                          Response Streaming <--- |
                               |
                          Cost Tracker

Key optimization techniques:
1. KV Cache: Cache previous tokens' Key/Value to avoid redundant computation
2. Continuous Batching: Dynamically batch requests
3. PagedAttention: Manage memory in page units (vLLM)
4. Quantization: FP16 --> INT8/INT4 to reduce model size
5. Speculative Decoding: Draft with small model, verify with large model

Auto-scaling Strategy

GPU auto-scaling considerations:
- GPU instance startup time: 3-10 minutes (much slower than CPU)
- Predictive scaling: Pre-learn hourly traffic patterns
- Queue-depth based: Trigger scaling on pending request count
- Cost optimization: Spot instances + on-demand fallback

5-3. RAG (Retrieval-Augmented Generation) Pipeline

RAG is a key pattern for reducing LLM hallucinations and providing up-to-date information.

RAG Pipeline Architecture:

Document ingestion --> Chunking --> Embedding generation --> Vector DB storage
                                                                |
User query --> Query embedding --> Similarity search --> Top-K document retrieval
                                                              |
                                  Prompt construction <-------+
                                        |
                                  LLM generation --> Response + source citations

Key design decisions:
1. Chunking strategy: Fixed-size vs semantic (paragraph/section)
2. Embedding model: OpenAI Ada-002, Cohere Embed, open-source models
3. Vector DB: Pinecone (managed), Milvus (self-hosted), pgvector
4. Search approach: Pure vector search vs hybrid (vector + keyword)
5. Re-ranking: Cross-encoder for precise reordering of search results
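The retrieval step can be sketched with toy 3-dimensional vectors; a real pipeline would substitute an actual embedding model and a vector DB with an ANN index for this brute-force scan:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical chunk embeddings (a real system stores these in a vector DB).
chunks = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return address": [0.7, 0.3, 0.2],
}

def top_k(query_vec, k=2):
    # Similarity search: rank every chunk by cosine similarity.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]),
                    reverse=True)
    return ranked[:k]

query = [0.85, 0.15, 0.05]   # pretend embedding of "how do I return an item?"
context = top_k(query)
prompt = (f"Answer using only these sources: {context}\n"
          f"Q: how do I return an item?")
print(context)
```

The retrieved chunk names would be expanded to their full text and placed in the prompt, which is what grounds the LLM's answer and enables source citations.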

5-4. Real-time ML Pipeline

Real-time ML Pipeline:

Event source --> Kafka --> Stream Processor (Flink) --> Feature computation
                                                            |
                                                       Feature Store
                                                            |
API request --> Model Server --> Inference result --> Business logic
                   |
              Model Registry (MLflow)

Use cases:
- Fraud detection: Payment event --> Real-time features --> Fraud score
- Real-time recommendations: User behavior --> Real-time embedding update --> Refreshed recs
- Dynamic pricing: Supply/demand signals --> Pricing model --> Price adjustment

5-5. AI Safety Design

AI system safety is a mandatory topic in 2025 interviews.

AI Safety Architecture:

User input --> Input filter --> LLM --> Output filter --> User response
                 |                         |
            Content classifier       Hallucination detector
            Prompt injection detection   Fact checker
            PII detection/masking        Toxicity filter
                 |                         |
            Block or warn            Modify or block

Design points:
1. Multi-layer defense: Filters at input/processing/output stages
2. Async audit: Analyze full conversation logs asynchronously
3. Adaptive rules: Fast rule updates for new attack patterns
4. Human-in-the-loop: Human review for uncertain automated cases
5. Latency budget: Minimize added latency from safety layers (target: under 50ms)

6. Back-of-the-Envelope Estimation Cheat Sheet

Essential numbers you can use directly in interviews.

6-1. Latency Reference Table

Operation                      Time
---------------------------   ----------
L1 cache reference             1 ns
L2 cache reference             4 ns
Main memory (RAM) reference    100 ns
SSD random read                100 us (microseconds)
HDD seek                       10 ms
Same-datacenter network RTT    0.5 ms
Cross-continent network RTT    150 ms

Memory tips:
- L1 is 100x faster than RAM
- SSD is 1000x slower than RAM
- HDD is 100x slower than SSD
- Network is always slower than local I/O

6-2. Capacity Calculation Formulas

QPS (Queries Per Second):
  QPS = DAU * average requests per user / 86,400
  Peak QPS = average QPS * 2~5 (depends on service)

Storage:
  Daily data = daily new records * average record size
  Annual data = daily data * 365
  Total storage = annual data * retention period (years) * replication factor (usually 3)

Bandwidth:
  Inbound = write QPS * average request size
  Outbound = read QPS * average response size
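The formulas translate directly into Python; the DAU, request, and record numbers below are placeholders, not figures from any particular service:

```python
def qps(dau, req_per_user, peak_multiplier=3):
    # QPS = DAU * average requests per user / 86,400 seconds per day.
    avg = dau * req_per_user / 86_400
    return avg, avg * peak_multiplier

def storage_tb(records_per_day, record_bytes, years=5, replication=3):
    # Total storage = daily data * 365 * retention * replication factor.
    daily_bytes = records_per_day * record_bytes
    return daily_bytes * 365 * years * replication / 1e12

avg, peak = qps(dau=10_000_000, req_per_user=20)
tb = storage_tb(records_per_day=200_000_000, record_bytes=500)
print(f"~{avg:,.0f} QPS avg, ~{peak:,.0f} peak, ~{tb:,.0f} TB over 5 years")
```

Replication factor 3 is the usual default; note how it and the retention period dominate the storage answer, which is a good observation to make out loud.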

6-3. Building Scale Intuition

Approximate scale of major services:

Service       DAU          Avg QPS     Storage
------       --------     ---------   --------
Twitter      500M         300K        Several TB/day
Instagram    2B           1M          Tens of TB/day
YouTube      2B+          500K+       500 hours uploaded/min
WhatsApp     2B+          Millions    10B messages/day
Google Search 8.5B queries/day  100K+  Hundreds of PB indexed

Powers of 2 (frequently used):
- 2^10 = 1,024 (~1 thousand)
- 2^20 = 1,048,576 (~1 million)
- 2^30 = 1,073,741,824 (~1 billion)
- 2^40 = 1,099,511,627,776 (~1 trillion; 1TB in bytes)

7. Top 5 Study Resources

The best materials for preparing for system design interviews.

7-1. DDIA (Designing Data-Intensive Applications)

By Martin Kleppmann. Widely regarded as the bible of system design.

  • Audience: Engineers who want deep understanding of distributed system principles
  • Strengths: Systematic coverage from theory to practical trade-offs
  • Key topics: Data models, storage engines, replication, partitioning, transactions, consistency, batch/stream processing
  • Study tip: Read cover-to-cover, then revisit key chapters before interviews (especially chapters 5-9)

7-2. System Design Interview (Alex Xu) Vol.1 + Vol.2

The most interview-relevant practical guide.

  • Audience: Engineers preparing for system design interviews for the first time
  • Strengths: Step-by-step solutions per problem, rich diagrams
  • Vol.1: URL shortener, news feed, chat, search autocomplete, and 9 more (13 total)
  • Vol.2: Location services, gaming leaderboard, payment system, and 10 more (13 total)

7-3. ByteByteGo (YouTube + Newsletter)

A visual learning platform run by Alex Xu.

  • Audience: Engineers who prefer visual learning
  • Strengths: Outstanding architecture diagram quality
  • Content: YouTube videos, weekly newsletter, online course
  • Recommended usage: Review concepts via videos during commute

7-4. Grokking the System Design Interview

An interactive learning course on the Educative platform.

  • Audience: Engineers seeking a structured curriculum
  • Strengths: Step-by-step learning path with quizzes
  • Structure: Core concepts + 15 design problems + glossary

7-5. Codemia (120+ Practice Problems)

A system design practice platform launched in 2024.

  • Audience: Engineers wanting exposure to a wide range of problems
  • Strengths: 120+ problems, difficulty-tiered
  • Features: Community solution comparison, timer feature
  • Recommended usage: Practice 3-4 problems per week with a 45-minute timer

12-Week Study Roadmap

Weekly study plan:

Weeks 1-2: Foundational Concepts (DDIA key chapters)
  - Scalability, availability, consistency principles
  - Database, cache, message queue basics

Weeks 3-4: Framework Practice (Alex Xu Vol.1)
  - Internalize the 45-minute framework
  - Solve 5 easy problems

Weeks 5-8: Problem-Solving Practice (Alex Xu Vol.2 + Codemia)
  - 10 medium-difficulty problems
  - 3-4 problems per week, using a timer

Weeks 9-10: Advanced Topics + AI/ML Systems
  - Deep dive into distributed systems
  - LLM serving, recommendation systems

Weeks 11-12: Mock Interviews
  - 3-5 mock interviews with peers
  - Focused review of weak areas

8. Five Things Interviewers Actually Evaluate

8-1. Ability to Structure Ambiguity

Interviewers intentionally ask vague questions. When asked "Design Google Drive," jumping straight into coding is a red flag. The ability to ask questions and narrow the scope is what matters.

Good clarification questions:
- "Should I focus on file upload or download?"
- "What is the user scale? Are we talking 100 million users?"
- "Is real-time collaborative editing in scope?"
- "Do we need to support both mobile and web?"

8-2. Trade-off Analysis

The ability to clearly answer "Why B instead of A?"

Trade-off analysis example:

Question: "Would you choose Cassandra or MySQL for message storage?"

Good answer structure:
1. Confirm requirements: "Chat messages are write-heavy,
   mostly queried chronologically, and availability matters
   more than strong consistency."

2. Compare options:
   - MySQL: ACID guarantees, supports joins, but horizontal
     scaling is difficult and sharding is complex
   - Cassandra: Easy horizontal scaling, write-optimized,
     great for time-series data, but no joins

3. Conclusion: "Given our requirements prioritize write
   performance and horizontal scalability, I would choose
   Cassandra. For data with complex relationships like user
   profiles, I would use a separate MySQL instance."

8-3. Scaling Scenario Response

Handling the question "What if users grow 10x?"

Scaling scenario framework:

Current scale (1M DAU):
- Single DB, 2 read replicas, 1 cache server

10x growth (10M DAU):
- Introduce DB sharding (user ID-based)
- Expand cache cluster (Redis Cluster)
- Add CDN (static assets)
- Read/write separation

100x growth (100M DAU):
- Multi-region deployment
- Microservice decomposition
- Event-driven architecture migration
- Dedicated search engine (Elasticsearch)

8-4. Experience-Based Judgment

Interviewers value real-world experience over textbook answers.

  • "In a previous project, I encountered a cache invalidation issue..."
  • "When using Redis, we hit memory limits, and what I learned was..."
  • "How a circuit breaker helped during an actual outage..."

8-5. Communication (Diagrams + Explanation)

A system design interview is a conversation. You should not monologue for 30 minutes; instead, build the design together with the interviewer.

Effective communication techniques:

1. Use the whiteboard
   - Always explain with diagrams
   - Show data flow with arrows
   - Label components clearly

2. Set checkpoints
   - "If this looks good so far, I will move on"
   - "Should I go deeper into this component?"

3. Share your thought process
   - Think out loud instead of going silent
   - "I see two options: A does... while B does..."

Practice Quiz

Test your understanding of the key concepts.

Q1. What is the single most important thing to do in the first 5 minutes of a system design interview?

A: Clarify the requirements.

Ask the interviewer about functional requirements (top 3 features) and non-functional requirements (scale, latency, availability) to narrow the design scope. Jumping straight into design is the most common mistake.

Q2. Under the CAP theorem, when a network partition occurs, should a payment system choose CP or AP?

A: CP (Consistency + Partition Tolerance)

In a payment system, financial accuracy is the top priority. Temporary service unavailability (sacrificing availability) is better than processing incorrect amounts (sacrificing consistency). Retry mechanisms and idempotency are used to mitigate availability degradation.

Q3. What fan-out strategy is appropriate for a celebrity with 1 million followers posting on a news feed?

A: Fan-out on Read (Pull model) or Hybrid

Pushing a celebrity's post to all 1 million followers incurs excessive write costs. The production answer is a hybrid approach: use the Pull model (real-time aggregation on feed view) for celebrities and the Push model for regular users, then merge both to generate the final feed.

Q4. What is the role of KV Cache in LLM serving, and why is it important?

A: It caches previously computed Key/Value tensors to avoid redundant computation.

In autoregressive generation, each new token requires computing attention over all previous tokens. KV Cache stores already-computed Key/Value pairs so that generating a new token does not require re-computing previous tokens. This speeds up generation by several times to orders of magnitude, and GPU memory management becomes the key challenge (hence techniques like PagedAttention).

Q5. Estimate the peak QPS for a chat app with 50 million DAU using back-of-the-envelope calculation.

A: Approximately 70,000 QPS

Calculation:

  • DAU: 50 million
  • Average messages: 40 per user per day
  • Total messages: 50M x 40 = 2 billion per day
  • Average QPS: 2 billion / 86,400 = approximately 23,000 QPS
  • Peak QPS: approximately 3x average = approximately 70,000 QPS

The peak multiplier varies by service; for chat apps that see evening usage spikes, 2-5x is typical.


References

Books

  1. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017)
  2. Alex Xu, System Design Interview: An Insider's Guide Volume 1 (2020)
  3. Alex Xu, System Design Interview: An Insider's Guide Volume 2 (2022)
  4. Maheshwari, Acing the System Design Interview (Manning, 2024)
  5. Gaurav Sen, System Design Simplified (2023)

Online Courses and Platforms

  1. Educative - Grokking the System Design Interview
  2. ByteByteGo - System Design Course (bytebytego.com)
  3. Codemia - 120+ System Design Practice Problems (codemia.dev)
  4. Exponent - System Design Interview Course (tryexponent.com)
  5. donnemartin/system-design-primer (GitHub, open source)

Engineering Blogs and Papers

  1. Google SRE Book - Site Reliability Engineering (sre.google/sre-book)
  2. Amazon Builders Library (aws.amazon.com/builders-library)
  3. Meta Engineering Blog (engineering.fb.com)
  4. Netflix Tech Blog (netflixtechblog.com)
  5. Uber Engineering Blog (eng.uber.com)
  6. Leslie Lamport, The Part-Time Parliament (Paxos paper, 1998)
  7. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007)
  8. Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006)
  9. Apache Kafka Official Documentation (kafka.apache.org)
  10. vLLM Project (vllm.ai) - including PagedAttention paper