System Design Interview 2025: Top 30 FAANG Questions and the New AI/ML Design Round
- Authors

- Youngju Kim
- @fjvbn20031
- 1. Why System Design Matters
- 2. The 45-Minute System Design Framework
- 3. Ten Essential Concepts You Must Know
- 3-1. Horizontal vs Vertical Scaling
- 3-2. Load Balancers (L4 vs L7)
- 3-3. Caching Strategies
- 3-4. Databases (SQL vs NoSQL, Sharding, Replication)
- 3-5. Message Queues (Kafka, RabbitMQ)
- 3-6. CDN and Edge Computing
- 3-7. API Design (REST vs gRPC vs GraphQL)
- 3-8. Consistent Hashing
- 3-9. Rate Limiting (Token Bucket, Sliding Window)
- 3-10. CAP Theorem and Real-World Trade-offs
- 4. Top 15 Most Frequently Asked Problems in 2025
- 5. New in 2025: AI/ML System Design
- 6. Back-of-the-Envelope Estimation Cheat Sheet
- 7. Top 5 Study Resources
- 8. Five Things Interviewers Actually Evaluate
- Practice Quiz
- References
1. Why System Design Matters
System design interviews are the key gatekeeping round for senior engineering hires. The 2025 numbers make the importance crystal clear.
The Current Landscape
- 70%+ of senior candidates face at least one system design round
- 60% of questions are related to cloud-native architecture (AWS, GCP, Azure)
- 80% of problems focus on 20% of core concepts (Pareto principle)
- At FAANG (Meta, Apple, Amazon, Netflix, Google), system design occupies 2-3 rounds
Expectations by Level
| Level | Expected Depth | Key Evaluation Points |
|---|---|---|
| L3-L4 (Junior) | Basic component understanding | API design, single-service design |
| L5 (Senior) | Full distributed system design | Trade-off analysis, scalability |
| L6+ (Staff) | Large-scale system architecture | Cross-org impact, technical strategy |
New Question Types in 2025
2025 has introduced entirely new categories alongside classic system design problems.
- AI/ML System Design: Recommendation systems, LLM serving, RAG pipelines
- Real-time Systems: 18 out of 30 problems require real-time processing
- Cost Optimization: Cloud cost considerations are now a mandatory evaluation criterion
- Security Design: Zero Trust and data privacy are baseline requirements
Where the classic question was "Design Twitter," interviewers now ask questions like "Design a real-time recommendation system that maintains model serving latency under 50ms while handling 1 billion inference requests per day."
2. The 45-Minute System Design Framework
The most common mistake is jumping straight into the design. A systematic framework is essential to navigate the 45-minute time constraint.
2-1. Clarify Requirements (5 minutes)
Ask the interviewer questions to narrow the scope. Candidates who skip this step almost always fail.
Functional Requirements
Example questions:
- "What are the 3 core actions a user can perform?"
- "What is the read-to-write ratio?"
- "How long must we retain data?"
Non-Functional Requirements
Things to confirm:
- Availability: 99.9% vs 99.99% (annual downtime 8.7 hours vs 52 minutes)
- Latency: specific targets like p99 < 200ms
- Consistency: strong consistency vs eventual consistency
- Scale: DAU, concurrent users, data volume
2-2. Back-of-the-Envelope Estimation (5 minutes)
This step quantifies the scale. Interviewers care more about the reasoning process than exact numbers.
Example: Estimating scale for a chat system
- DAU: 50 million
- Average messages per day: 40 per user
- Total messages: 50M x 40 = 2 billion/day
- QPS: 2 billion / 86,400 = ~23,000 QPS
- Peak QPS: 3x average = ~70,000 QPS
- Message size: 100 bytes average
- Daily storage: 2B x 100B = 200GB/day
- Annual storage: 200GB x 365 = 73TB/year
2-3. High-Level Design (15 minutes)
Express the core components and data flow as a diagram.
Typical high-level architecture:
Client --> Load Balancer --> API Gateway
                                 |
                +----------------+----------------+
                |                |                |
            Service A        Service B        Service C
                |                |                |
            DB (SQL)           Cache       Message Queue
                                 |                |
                            DB (NoSQL)         Worker
Must-includes at this stage:
- Client-server protocol (HTTP, WebSocket, gRPC)
- Data store selection with rationale
- Cache layer placement
- Identification of async processing needs
2-4. Deep Dive (15 minutes)
Dive deep into the components the interviewer shows interest in.
Deep dive areas:
1. Database Schema Design
- Table structure, indexes, partitioning strategy
2. API Design
- Endpoints, request/response format, error handling
3. Core Algorithms
- News Feed: Fan-out on write vs Fan-out on read
- Search: Inverted index structure
- Recommendations: Collaborative filtering vs content-based filtering
2-5. Scalability, Fault Tolerance, Monitoring (5 minutes)
The final step demonstrates your system's robustness.
Checklist:
- Identify and eliminate Single Points of Failure (SPOF)
- Data replication strategy (Leader-Follower, Multi-Leader)
- Failure detection and automatic recovery (Health Checks, Circuit Breaker)
- Monitoring metrics (Latency, Error Rate, Throughput)
- Alerting system (PagerDuty, Slack integration)
3. Ten Essential Concepts You Must Know
These core concepts cover 80% of all system design questions.
3-1. Horizontal vs Vertical Scaling
Vertical Scaling (Scale Up): Increasing the power of a single server
- Pros: Simple to implement, easy to maintain data consistency
- Cons: Hardware limits exist, single point of failure
Horizontal Scaling (Scale Out): Adding more servers
- Pros: Theoretically unlimited scaling, fault tolerance
- Cons: Distributed system complexity, data consistency challenges
Vertical scaling:
1 server (4 CPU, 16GB RAM) --> 1 server (32 CPU, 128GB RAM)
Horizontal scaling:
1 server --> 10 servers (each 4 CPU, 16GB RAM) + Load Balancer
In practice, most designs default to horizontal scaling, though some components like databases may start with vertical scaling.
3-2. Load Balancers (L4 vs L7)
Load balancers distribute traffic across multiple servers.
| Aspect | L4 (Transport Layer) | L7 (Application Layer) |
|---|---|---|
| Operates on | IP, Port | URL, Headers, Cookies |
| Speed | Faster | Relatively slower |
| Flexibility | Low | High (path-based routing) |
| SSL Termination | No | Yes |
| Use case | Internal service communication | API Gateway, Web servers |
L7 load balancer routing example:
/api/users/* --> User Service Cluster
/api/orders/* --> Order Service Cluster
/api/search/* --> Search Service Cluster
/static/* --> CDN Origin
3-3. Caching Strategies
Caching is the cornerstone of system performance. Here are three key patterns.
Cache-Aside (Lazy Loading)
Read request flow:
1. Look up data in cache
2. Cache hit --> return data
3. Cache miss --> query DB --> store in cache --> return
Pros: Only caches needed data, DB fallback on cache failure
Cons: First request always slow, cache-DB inconsistency possible
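The cache-aside read path above can be sketched in a few lines of Python. The `cache` and `database` dicts are hypothetical in-memory stand-ins for Redis and a real database, and the TTL is an arbitrary example value.

```python
import time

# Hypothetical stand-ins for a real cache (e.g. Redis) and a real database.
cache = {}                                   # {key: (value, expires_at)}
database = {"user:123": {"name": "Alice"}}   # pretend this is the DB
TTL_SECONDS = 60

def get_user(key):
    """Cache-aside read: check the cache first, fall back to the DB on a miss."""
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.time() < expires_at:
            return value                     # cache hit
        del cache[key]                       # expired entry: treat as a miss
    value = database.get(key)                # cache miss: query the DB
    if value is not None:
        cache[key] = (value, time.time() + TTL_SECONDS)  # populate the cache
    return value

print(get_user("user:123"))   # first call misses and fills the cache
print(get_user("user:123"))   # second call is served from the cache
```

Note how the database stays the source of truth: if the cache is lost, reads still work, which is the "DB fallback" property listed above.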
Write-Through
Write request flow:
1. Write data to cache
2. Cache synchronously writes to DB
3. Return completion response
Pros: Cache and DB always consistent
Cons: Increased write latency, caches unused data
Write-Behind (Write-Back)
Write request flow:
1. Write data to cache
2. Return completion immediately
3. Asynchronously batch-write to DB
Pros: Maximum write performance
Cons: Risk of data loss on cache failure
3-4. Databases (SQL vs NoSQL, Sharding, Replication)
SQL vs NoSQL Selection Criteria
| Criteria | SQL (Relational) | NoSQL (Non-Relational) |
|---|---|---|
| Data structure | Structured, complex relations | Unstructured, flexible schema |
| Consistency | ACID guaranteed | Eventual consistency (mostly) |
| Scalability | Vertical scaling first | Horizontal scaling friendly |
| Use case | Payments, inventory | Logs, sessions, social feeds |
| Products | PostgreSQL, MySQL | MongoDB, Cassandra, DynamoDB |
Sharding Strategies
1. Range-based Sharding
- user_id 1~1M --> Shard 1
- user_id 1M~2M --> Shard 2
- Pros: Efficient range queries
- Cons: Hotspot risk
2. Hash-based Sharding
- hash(user_id) % N --> Shard number
- Pros: Even distribution
- Cons: Inefficient range queries, complex rebalancing
3. Directory-based Sharding
- Managed via lookup table
- Pros: Flexible mapping
- Cons: Lookup table becomes SPOF
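A quick sketch of why hash-based sharding makes rebalancing painful: when the shard count N changes, most keys map to a different shard, which is exactly the problem consistent hashing (section 3-8) solves. The multiplicative hash below is an arbitrary stand-in for a real hash function such as MurmurHash.

```python
def toy_hash(user_id: int) -> int:
    # Arbitrary stand-in for a real hash such as MurmurHash or MD5.
    return user_id * 2654435761 % (2 ** 32)

def shard_for(user_id: int, num_shards: int) -> int:
    """Hash-based sharding: shard number = hash(key) % N."""
    return toy_hash(user_id) % num_shards

# Count how many of 10,000 keys land on a different shard when the
# cluster grows from 4 to 5 shards.
moved = sum(1 for k in range(10_000) if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / 10_000:.0%} of keys remapped")   # roughly 80% of keys move
```

Adding a single shard forces a near-total data migration, which is why modulo-based sharding is rarely the final answer in an interview.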
3-5. Message Queues (Kafka, RabbitMQ)
The backbone of asynchronous communication.
| Feature | Apache Kafka | RabbitMQ |
|---|---|---|
| Model | Log-based streaming | Message broker |
| Throughput | Millions per second | Tens of thousands per second |
| Message retention | Retained for configured period | Deleted after consumption |
| Ordering guarantee | Within partition | Within queue |
| Use case | Event streaming, logs | Task queues, RPC |
Kafka architecture:
Producer --> Topic (Partition 0) --+--> Consumer Group A
                                   +--> Consumer Group B
             Topic (Partition 1) --+--> Consumer Group A
                                   +--> Consumer Group B
             Topic (Partition 2) --+--> Consumer Group A
                                   +--> Consumer Group B
- Partition: unit of parallelism
- Consumer Group: independent consumption
- Offset: tracks consumption position
3-6. CDN and Edge Computing
A CDN (Content Delivery Network) caches content on edge servers worldwide.
CDN operation:
User (Korea) --> Korea edge server --(cache hit)--> immediate response
                                   --(cache miss)--> origin (US) --> cache --> response
Pull CDN: Fetches from origin on first request (CloudFront, Cloudflare)
Push CDN: Pre-distributes content (suited for large static files)
Edge Computing goes beyond CDN by running computation at the edge.
- Cloudflare Workers, AWS Lambda@Edge, Vercel Edge Functions
- A/B testing, authentication, personalization at the edge
- Reduces origin server load + minimizes latency
3-7. API Design (REST vs gRPC vs GraphQL)
| Feature | REST | gRPC | GraphQL |
|---|---|---|---|
| Protocol | HTTP/1.1 | HTTP/2 | HTTP |
| Data format | JSON | Protocol Buffers | JSON |
| Type safety | Low | High (schema required) | Medium (schema required) |
| Streaming | Limited | Bidirectional support | Subscription support |
| Use case | Public APIs | Internal microservices | Client-driven queries |
REST example:
GET /api/v1/users/123
GET /api/v1/users/123/orders?page=1&limit=10
gRPC example:
service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListOrders(ListOrdersRequest) returns (stream Order);
}
GraphQL example:
query {
  user(id: "123") {
    name
    orders(first: 10) {
      id
      total
    }
  }
}
3-8. Consistent Hashing
An algorithm that minimizes data redistribution when nodes are added or removed in a distributed system.
The problem with basic hashing:
hash(key) % N = server number
When N changes, almost all keys get remapped
Consistent Hashing:
- Hash space arranged in a ring (0 to 2^32-1)
- Both servers and keys are placed on the ring
- Keys are assigned to the nearest server clockwise
- Adding/removing a server only affects adjacent segments
Virtual Nodes:
- Each physical server maps to multiple virtual nodes
- Distributes load more evenly
- Number of virtual nodes can be adjusted by server capacity
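The ring with virtual nodes can be sketched with a sorted list and binary search. This is a minimal illustration under the assumptions above, not a production implementation (no replication or failure handling); MD5 is used only as a convenient, stable hash.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes, as described above."""

    def __init__(self, replicas: int = 100):
        self.replicas = replicas   # virtual nodes per physical server
        self.ring = []             # sorted list of (hash, server) tuples

    def _hash(self, key: str) -> int:
        # Map any string onto the 0 .. 2^32-1 ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def add_server(self, server: str):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{server}#vnode{i}"), server))

    def remove_server(self, server: str):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get_server(self, key: str) -> str:
        # First virtual node clockwise from the key's position (wraps around).
        h = self._hash(key)
        idx = bisect.bisect_left(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(replicas=100)
for s in ("server-1", "server-2", "server-3"):
    ring.add_server(s)

before = {f"user:{i}": ring.get_server(f"user:{i}") for i in range(1000)}
ring.remove_server("server-3")
moved = sum(1 for k, s in before.items()
            if s != "server-3" and ring.get_server(k) != s)
print(moved)   # 0: only keys that lived on server-3 are reassigned
```

Removing a server reassigns only the keys in its ring segments, which is the whole point of the technique.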
3-9. Rate Limiting (Token Bucket, Sliding Window)
Protects the system from excessive traffic.
Token Bucket Algorithm
How it works:
1. Tokens are added to a fixed-size bucket at a constant rate
2. Each request consumes 1 token
3. If no tokens remain, request is rejected (HTTP 429)
Parameters:
- Bucket size: 100 (maximum burst)
- Refill rate: 10/second (average throughput)
Sliding Window Log Algorithm
How it works:
1. Record each request timestamp in a sorted log
2. Remove entries outside the current window (e.g., 1 minute)
3. Reject if request count within window exceeds the limit
Pros: Precise window boundary handling
Cons: High memory usage (stores all timestamps)
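The sliding-window log can be sketched with a deque of timestamps; the explicit `now` parameter is just a convenience for deterministic illustration.

```python
from collections import deque
import time

class SlidingWindowLog:
    """Sliding-window log: one timestamp per accepted request,
    entries older than the window are evicted on each check."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()          # timestamps, oldest first

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()      # evict requests outside the window
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

limiter = SlidingWindowLog(limit=3, window_seconds=60)
print([limiter.allow(now=t) for t in (0, 1, 2, 3)])   # [True, True, True, False]
print(limiter.allow(now=62))   # True: the earlier entries have aged out
```

The memory cost is visible here: every accepted request stays in `log` for a full window, which is why this variant is precise but expensive at scale.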
Rate Limiting in Distributed Environments
Approach 1: Centralized (Redis-based)
- All servers share a Redis counter
- Accurate but Redis becomes bottleneck/SPOF
Approach 2: Local + Synchronization
- Each server counts locally
- Periodically syncs with central store
- Fast but allows slight over-limit
3-10. CAP Theorem and Real-World Trade-offs
CAP Theorem: A distributed system can guarantee at most 2 of these 3 properties simultaneously.
- C (Consistency): All nodes see the same data
- A (Availability): Every request gets a response
- P (Partition Tolerance): System works despite network partitions
Since network partitions are inevitable in reality, the practical choice is CP vs AP.
| System | Choice | Reason |
|---|---|---|
| Payment system | CP | Financial accuracy is paramount |
| Social media feed | AP | Brief inconsistency is better than downtime |
| Inventory management | CP | Prevent overselling |
| DNS | AP | Availability is critical |
| Chat messages | AP | Eventual consistency is sufficient |
4. Top 15 Most Frequently Asked Problems in 2025
The 15 problems that appear most often in real interviews.
| Rank | Problem | Difficulty | Companies | Key Concepts |
|---|---|---|---|---|
| 1 | URL Shortener | Easy | All | Hashing, DB selection, read optimization |
| 2 | Rate Limiter | Easy | All | Token Bucket, distributed counting |
| 3 | News Feed System | Medium | Meta, Twitter | Fan-out, cache layers, ranking |
| 4 | Chat System | Medium | WhatsApp, Slack | WebSocket, message queues, state management |
| 5 | Video Streaming Platform | Medium | Netflix, YouTube | CDN, adaptive bitrate, transcoding |
| 6 | Search Autocomplete | Medium | Google, Amazon | Trie, ElasticSearch, prefix matching |
| 7 | Location-Based Service | Medium | Uber, DoorDash | Geohash, QuadTree, proximity search |
| 8 | Notification System | Medium | All | Push/Pull, message queues, priority |
| 9 | Distributed Cache | Hard | Amazon, Google | Consistent Hashing, replication, failure detection |
| 10 | Distributed Message Queue | Hard | Kafka team, LinkedIn | Partitioning, ISR, consumer groups |
| 11 | Payment System | Hard | Stripe, Toss, PayPal | Idempotency, Saga pattern, double-entry |
| 12 | Real-time Gaming Backend | Hard | Riot Games, Epic | State synchronization, UDP, lag compensation |
| 13 | Recommendation System | Hard | Netflix, Spotify | ML pipeline, Feature Store |
| 14 | Distributed File System | Hard | Google, Microsoft | GFS, HDFS, chunk servers |
| 15 | Ad Click Aggregation | Hard | Google, Meta | Stream processing, exactly-once semantics |
Problem-Specific Tips
URL Shortener - The most fundamental problem, but one where you can demonstrate depth.
Key design points:
1. Short URL generation: Base62 encoding vs hash (MD5/SHA256 truncation)
2. Collision handling: DB unique constraint + retry vs counter-based
3. Redirection: 301 (permanent) vs 302 (temporary) with rationale
4. Caching: Frequently accessed URLs in Redis (80/20 rule)
5. Analytics: Click count, region, device tracking
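The counter-based Base62 approach from point 1 can be sketched as follows. The alphabet ordering (digits, then lowercase, then uppercase) is one common convention, not a standard.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    """Encode a numeric ID (e.g. from an auto-increment counter) as Base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_base62(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

# A 7-character Base62 code covers 62**7 (~3.5 trillion) distinct URLs.
print(encode_base62(11157))   # "2TX"
```

Because the counter is monotonically increasing, collisions are impossible by construction, at the cost of needing a (potentially distributed) ID generator.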
News Feed System - The fan-out strategy is the core decision.
Fan-out on Write (Push model):
- When a post is created, immediately write to all followers' feeds
- Pros: Fast reads
- Cons: High write cost for users with many followers (celebrities)
Fan-out on Read (Pull model):
- When a feed is viewed, aggregate in real-time from followed users
- Pros: Low write cost
- Cons: Slow reads
Hybrid (Production answer):
- Regular users: Push model
- Celebrities (1M+ followers): Pull model
- Merge both for the final feed
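A toy sketch of the hybrid merge step, with in-memory dicts standing in for the feed cache and post store (all names and data below are hypothetical):

```python
import heapq

# Each post is (timestamp, author, text); feeds are kept newest-first.
precomputed_feed = {     # push model: written at post time for regular users
    "alice": [(104, "bob", "lunch pic"), (101, "carol", "new job!")],
}
celebrity_posts = {      # pull model: fetched at read time for 1M+ follower accounts
    "star": [(105, "star", "tour dates"), (100, "star", "hello fans")],
}
follows_celebrities = {"alice": ["star"]}

def get_feed(user: str, limit: int = 10):
    """Hybrid fan-out: merge the precomputed (push) feed with
    celebrity posts fetched on read (pull), newest first."""
    pulled = []
    for celeb in follows_celebrities.get(user, []):
        pulled.extend(celebrity_posts.get(celeb, []))
    merged = heapq.merge(precomputed_feed.get(user, []),
                         sorted(pulled, reverse=True),
                         key=lambda post: post[0], reverse=True)
    return list(merged)[:limit]

print([post[2] for post in get_feed("alice")])
# ['tour dates', 'lunch pic', 'new job!', 'hello fans']
```

The expensive part in production is not this merge but keeping the push-side feeds fresh; the merge itself stays cheap because both inputs are already sorted.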
Chat System - Real-time communication is the core challenge.
Architecture:
1. Connection management: WebSocket servers + connection state store
2. Message delivery: Direct for 1:1, fan-out for groups
3. Offline handling: Store in message queue, deliver on reconnection
4. Read receipts: Separate service
5. Message storage: DB optimized for chronological ordering
- HBase/Cassandra (wide-column stores)
5. New in 2025: AI/ML System Design
The biggest shift in 2025 system design interviews is the surge of AI/ML-related problems. Google, Meta, and Amazon -- as well as startups -- now evaluate AI system design capabilities.
5-1. Recommendation System Architecture
Recommendation systems are the most classic AI system design problem.
Full recommendation pipeline:
Data Collection --> Feature Store --> Model Training --> Model Registry
      |                                     |                  |
User behavior logs                   Offline batch         A/B Testing
(clicks, purchases, views)        (daily/weekly update)  (Champion/Challenger)
      |                                     |                  |
      v                                     v                  v
Real-time feature --> Candidate Generation --> Ranking --> Re-ranking --> Display
   computation            (Retrieval)         (Scoring)
Key components:
1. Feature Store: Offline (batch) + Online (real-time) feature management
2. Candidate Generation: ANN (Approximate Nearest Neighbors) to select ~1000 candidates
3. Ranking Model: Score candidates with deep learning model
4. Re-ranking: Apply business rules, diversity, freshness
5-2. LLM Serving System
Large Language Model (LLM) serving is the hottest topic in 2025.
LLM Serving Architecture:
Request --> API Gateway --> Request Router --> GPU Cluster
                |                                   |
          Token Limiter                   Model Instances (vLLM)
          Queue Manager                             |
                |                          KV Cache Management
Response Streaming <--------------------------------+
                |
           Cost Tracker
Key optimization techniques:
1. KV Cache: Cache previous tokens' Key/Value to avoid redundant computation
2. Continuous Batching: Dynamically batch requests
3. PagedAttention: Manage memory in page units (vLLM)
4. Quantization: FP16 --> INT8/INT4 to reduce model size
5. Speculative Decoding: Draft with small model, verify with large model
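To build intuition for why the KV cache matters, here is a toy cost model, not real attention math: without a cache, each generation step recomputes keys/values for the entire prefix; with a cache, each step only processes the new token against stored keys/values.

```python
def generation_ops(num_tokens: int, use_kv_cache: bool) -> int:
    """Toy count of attention computations for autoregressively
    generating num_tokens tokens (1 op per key/value pair touched)."""
    ops = 0
    for step in range(1, num_tokens + 1):
        if use_kv_cache:
            ops += step          # new token attends to cached K/V: O(n) per step
        else:
            ops += step * step   # recompute K/V for the whole prefix: O(n^2) per step
    return ops

print(generation_ops(1000, use_kv_cache=False))   # 333833500
print(generation_ops(1000, use_kv_cache=True))    # 500500 -- hundreds of times fewer
```

The trade-off is GPU memory: the cache grows linearly with sequence length and batch size, which is exactly what PagedAttention (point 3) is designed to manage.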
Auto-scaling Strategy
GPU auto-scaling considerations:
- GPU instance startup time: 3-10 minutes (much slower than CPU)
- Predictive scaling: Pre-learn hourly traffic patterns
- Queue-depth based: Trigger scaling on pending request count
- Cost optimization: Spot instances + on-demand fallback
5-3. RAG (Retrieval-Augmented Generation) Pipeline
RAG is a key pattern for reducing LLM hallucinations and providing up-to-date information.
RAG Pipeline Architecture:
Document ingestion --> Chunking --> Embedding generation --> Vector DB storage
                                                                   |
User query --> Query embedding --> Similarity search --> Top-K document retrieval
                                                                   |
                                          Prompt construction <----+
                                                  |
                                   LLM generation --> Response + source citations
Key design decisions:
1. Chunking strategy: Fixed-size vs semantic (paragraph/section)
2. Embedding model: OpenAI Ada-002, Cohere Embed, open-source models
3. Vector DB: Pinecone (managed), Milvus (self-hosted), pgvector
4. Search approach: Pure vector search vs hybrid (vector + keyword)
5. Re-ranking: Cross-encoder for precise reordering of search results
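The similarity-search step (decision 4, pure vector search) can be illustrated with plain cosine similarity over toy 3-dimensional vectors; a real system would use a vector DB and 768+-dimensional embeddings from one of the models listed above. All document names and vectors here are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Rank document chunks by cosine similarity to the query embedding."""
    scored = sorted(doc_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "returns-faq": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], docs))   # ['refund-policy', 'returns-faq']
```

At scale, the exhaustive scan above is replaced with an approximate index (HNSW, IVF) inside the vector DB, trading a little recall for large speedups.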
5-4. Real-time ML Pipeline
Real-time ML Pipeline:
Event source --> Kafka --> Stream Processor (Flink) --> Feature computation
                                                               |
                                                         Feature Store
                                                               |
API request --> Model Server --> Inference result --> Business logic
                     |
          Model Registry (MLflow)
Use cases:
- Fraud detection: Payment event --> Real-time features --> Fraud score
- Real-time recommendations: User behavior --> Real-time embedding update --> Refreshed recs
- Dynamic pricing: Supply/demand signals --> Pricing model --> Price adjustment
5-5. AI Safety Design
AI system safety is a mandatory topic in 2025 interviews.
AI Safety Architecture:
User input --> Input filter --> LLM --> Output filter --> User response
                    |                         |
         Content classifier         Hallucination detector
         Prompt injection detection       Fact checker
         PII detection/masking          Toxicity filter
                    |                         |
              Block or warn             Modify or block
Design points:
1. Multi-layer defense: Filters at input/processing/output stages
2. Async audit: Analyze full conversation logs asynchronously
3. Adaptive rules: Fast rule updates for new attack patterns
4. Human-in-the-loop: Human review for uncertain automated cases
5. Latency budget: Minimize added latency from safety layers (target: under 50ms)
6. Back-of-the-Envelope Estimation Cheat Sheet
Essential numbers you can use directly in interviews.
6-1. Latency Reference Table
| Operation | Time |
|---|---|
| L1 cache reference | 1 ns |
| L2 cache reference | 4 ns |
| Main memory (RAM) reference | 100 ns |
| SSD random read | 100 us (microseconds) |
| HDD seek | 10 ms |
| Same-datacenter network RTT | 0.5 ms |
| Cross-continent network RTT | 150 ms |
Memory tips:
- L1 is 100x faster than RAM
- SSD is 1000x slower than RAM
- HDD is 100x slower than SSD
- Network is always slower than local I/O
6-2. Capacity Calculation Formulas
QPS (Queries Per Second):
QPS = DAU * average requests per user / 86,400
Peak QPS = average QPS * 2-5 (depends on the service's traffic pattern)
Storage:
Daily data = daily new records * average record size
Annual data = daily data * 365
Total storage = annual data * retention period (years) * replication factor (usually 3)
Bandwidth:
Inbound = write QPS * average request size
Outbound = read QPS * average response size
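The formulas above, collected into one helper. The default peak multiplier and replication factor follow the text; the example numbers are the chat-system estimate from section 2-2.

```python
def capacity_estimate(dau, requests_per_user, record_bytes,
                      peak_multiplier=3, retention_years=1, replication=3):
    """Apply the back-of-the-envelope formulas above."""
    avg_qps = dau * requests_per_user / 86_400   # seconds per day
    daily_bytes = dau * requests_per_user * record_bytes
    return {
        "avg_qps": round(avg_qps),
        "peak_qps": round(avg_qps * peak_multiplier),
        "daily_gb": daily_bytes / 1e9,
        "total_storage_tb": daily_bytes * 365 * retention_years * replication / 1e12,
    }

# Chat system from section 2-2: 50M DAU, 40 messages/user/day, 100-byte messages.
print(capacity_estimate(dau=50_000_000, requests_per_user=40, record_bytes=100))
# {'avg_qps': 23148, 'peak_qps': 69444, 'daily_gb': 200.0, 'total_storage_tb': 219.0}
```

Note that the 219 TB figure is the 73 TB/year from section 2-2 multiplied by the usual replication factor of 3; interviewers generally expect you to state that multiplier explicitly.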
6-3. Building Scale Intuition
Approximate scale of major services:
| Service | DAU | Avg QPS | Storage |
|---|---|---|---|
| Twitter | 500M | 300K | Several TB/day |
| Instagram | 2B | 1M | Tens of TB/day |
| YouTube | 2B+ | 500K+ | 500 hours uploaded/min |
| WhatsApp | 2B+ | Millions | 10B messages/day |
| Google Search | 8.5B queries/day | 100K+ | Hundreds of PB indexed |
Powers of 2 (frequently used):
- 2^10 = 1,024 (~1 thousand)
- 2^20 = 1,048,576 (~1 million)
- 2^30 = 1,073,741,824 (~1 billion)
- 2^40 = 1,099,511,627,776 (~1 trillion; 1 TB when counting bytes)
7. Top 5 Study Resources
The best materials for preparing for system design interviews.
7-1. DDIA (Designing Data-Intensive Applications)
By Martin Kleppmann. Widely regarded as the bible of system design.
- Audience: Engineers who want deep understanding of distributed system principles
- Strengths: Systematic coverage from theory to practical trade-offs
- Key topics: Data models, storage engines, replication, partitioning, transactions, consistency, batch/stream processing
- Study tip: Read cover-to-cover, then revisit key chapters before interviews (especially chapters 5-9)
7-2. System Design Interview (Alex Xu) Vol.1 + Vol.2
The most interview-relevant practical guide.
- Audience: Engineers preparing for system design interviews for the first time
- Strengths: Step-by-step solutions per problem, rich diagrams
- Vol.1: URL shortener, news feed, chat, search autocomplete, and 9 more (13 total)
- Vol.2: Location services, gaming leaderboard, payment system, and 10 more (13 total)
7-3. ByteByteGo (YouTube + Newsletter)
A visual learning platform run by Alex Xu.
- Audience: Engineers who prefer visual learning
- Strengths: Outstanding architecture diagram quality
- Content: YouTube videos, weekly newsletter, online course
- Recommended usage: Review concepts via videos during commute
7-4. Grokking the System Design Interview
An interactive learning course on the Educative platform.
- Audience: Engineers seeking a structured curriculum
- Strengths: Step-by-step learning path with quizzes
- Structure: Core concepts + 15 design problems + glossary
7-5. Codemia (120+ Practice Problems)
A system design practice platform launched in 2024.
- Audience: Engineers wanting exposure to a wide range of problems
- Strengths: 120+ problems, difficulty-tiered
- Features: Community solution comparison, timer feature
- Recommended usage: Practice 3-4 problems per week with a 45-minute timer
12-Week Study Roadmap
Weekly study plan:
Weeks 1-2: Foundational Concepts (DDIA key chapters)
- Scalability, availability, consistency principles
- Database, cache, message queue basics
Weeks 3-4: Framework Practice (Alex Xu Vol.1)
- Internalize the 45-minute framework
- Solve 5 easy problems
Weeks 5-8: Problem-Solving Practice (Alex Xu Vol.2 + Codemia)
- 10 medium-difficulty problems
- 3-4 problems per week, using a timer
Weeks 9-10: Advanced Topics + AI/ML Systems
- Deep dive into distributed systems
- LLM serving, recommendation systems
Weeks 11-12: Mock Interviews
- 3-5 mock interviews with peers
- Focused review of weak areas
8. Five Things Interviewers Actually Evaluate
8-1. Ability to Structure Ambiguity
Interviewers intentionally ask vague questions. When asked "Design Google Drive," jumping straight into coding is a red flag. The ability to ask questions and narrow the scope is what matters.
Good clarification questions:
- "Should I focus on file upload or download?"
- "What is the user scale? Are we talking 100 million users?"
- "Is real-time collaborative editing in scope?"
- "Do we need to support both mobile and web?"
8-2. Trade-off Analysis
The ability to clearly answer "Why B instead of A?"
Trade-off analysis example:
Question: "Would you choose Cassandra or MySQL for message storage?"
Good answer structure:
1. Confirm requirements: "Chat messages are write-heavy,
mostly queried chronologically, and availability matters
more than strong consistency."
2. Compare options:
- MySQL: ACID guarantees, supports joins, but horizontal
scaling is difficult and sharding is complex
- Cassandra: Easy horizontal scaling, write-optimized,
great for time-series data, but no joins
3. Conclusion: "Given our requirements prioritize write
performance and horizontal scalability, I would choose
Cassandra. For data with complex relationships like user
profiles, I would use a separate MySQL instance."
8-3. Scaling Scenario Response
Handling the question "What if users grow 10x?"
Scaling scenario framework:
Current scale (1M DAU):
- Single DB, 2 read replicas, 1 cache server
10x growth (10M DAU):
- Introduce DB sharding (user ID-based)
- Expand cache cluster (Redis Cluster)
- Add CDN (static assets)
- Read/write separation
100x growth (100M DAU):
- Multi-region deployment
- Microservice decomposition
- Event-driven architecture migration
- Dedicated search engine (Elasticsearch)
8-4. Experience-Based Judgment
Interviewers value real-world experience over textbook answers.
- "In a previous project, I encountered a cache invalidation issue..."
- "When using Redis, we hit memory limits, and what I learned was..."
- "How a circuit breaker helped during an actual outage..."
8-5. Communication (Diagrams + Explanation)
A system design interview is a conversation. You should not monologue for 30 minutes; instead, build the design together with the interviewer.
Effective communication techniques:
1. Use the whiteboard
- Always explain with diagrams
- Show data flow with arrows
- Label components clearly
2. Set checkpoints
- "If this looks good so far, I will move on"
- "Should I go deeper into this component?"
3. Share your thought process
- Think out loud instead of going silent
- "I see two options: A does... while B does..."
Practice Quiz
Test your understanding of the key concepts.
Q1. What is the single most important thing to do in the first 5 minutes of a system design interview?
A: Clarify the requirements.
Ask the interviewer about functional requirements (top 3 features) and non-functional requirements (scale, latency, availability) to narrow the design scope. Jumping straight into design is the most common mistake.
Q2. Under the CAP theorem, when a network partition occurs, should a payment system choose CP or AP?
A: CP (Consistency + Partition Tolerance)
In a payment system, financial accuracy is the top priority. Temporary service unavailability (sacrificing availability) is better than processing incorrect amounts (sacrificing consistency). Retry mechanisms and idempotency are used to mitigate availability degradation.
Q3. What fan-out strategy is appropriate for a celebrity with 1 million followers posting on a news feed?
A: Fan-out on Read (Pull model) or Hybrid
Pushing a celebrity's post to all 1 million followers incurs excessive write costs. The production answer is a hybrid approach: use the Pull model (real-time aggregation on feed view) for celebrities and the Push model for regular users, then merge both to generate the final feed.
Q4. What is the role of KV Cache in LLM serving, and why is it important?
A: It caches previously computed Key/Value tensors to avoid redundant computation.
In autoregressive generation, each new token requires computing attention over all previous tokens. KV Cache stores already-computed Key/Value pairs so that generating a new token does not require re-computing previous tokens. This speeds up generation by several times to orders of magnitude, and GPU memory management becomes the key challenge (hence techniques like PagedAttention).
Q5. Estimate the peak QPS for a chat app with 50 million DAU using back-of-the-envelope calculation.
A: Approximately 70,000 QPS
Calculation:
- DAU: 50 million
- Average messages: 40 per user per day
- Total messages: 50M x 40 = 2 billion per day
- Average QPS: 2 billion / 86,400 = approximately 23,000 QPS
- Peak QPS: approximately 3x average = approximately 70,000 QPS
The peak multiplier varies by service; for chat apps that see evening usage spikes, 2-5x is typical.
References
Books
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017)
- Alex Xu, System Design Interview: An Insider's Guide Volume 1 (2020)
- Alex Xu, System Design Interview: An Insider's Guide Volume 2 (2022)
- Zhiyong Tan, Acing the System Design Interview (Manning, 2024)
- Gaurav Sen, System Design Simplified (2023)
Online Courses and Platforms
- Educative - Grokking the System Design Interview
- ByteByteGo - System Design Course (bytebytego.com)
- Codemia - 120+ System Design Practice Problems (codemia.dev)
- Exponent - System Design Interview Course (tryexponent.com)
- Donne Martin - The System Design Primer (open-source, github.com/donnemartin)
Engineering Blogs and Papers
- Google SRE Book - Site Reliability Engineering (sre.google/sre-book)
- Amazon Builders Library (aws.amazon.com/builders-library)
- Meta Engineering Blog (engineering.fb.com)
- Netflix Tech Blog (netflixtechblog.com)
- Uber Engineering Blog (eng.uber.com)
- Leslie Lamport, The Part-Time Parliament (Paxos paper, 1998)
- DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007)
- Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006)
- Apache Kafka Official Documentation (kafka.apache.org)
- vLLM Project (vllm.ai) - including PagedAttention paper