System Design Fundamentals — Scalability, Availability, Caching, and Message Queues

Table of Contents

  1. What Is System Design
  2. Scalability
  3. Load Balancing
  4. Caching Strategies
  5. CDN
  6. Databases
  7. CAP Theorem
  8. Message Queues
  9. API Design
  10. Monitoring and Logging
  11. Practical Example: URL Shortener
  12. Practical Example: Chat System

What Is System Design

System design is the process of defining the architecture of large-scale software systems. It goes beyond writing code, dealing with how to build scalable and reliable systems that serve millions of users.

Why It Matters

  • In practice: Real services cannot run on a single server. You need to handle traffic growth, failures, and data management.
  • In interviews: Most major tech companies, including FAANG, include system design rounds.
  • Career growth: Architectural thinking is essential for advancing to senior engineering roles.

Interview Approach

There is no single correct answer in a system design interview. What matters is your thought process.

  1. Clarify requirements: Identify functional and non-functional requirements
  2. Estimate scale: Calculate user count, QPS, and storage needs
  3. High-level design: Place the core components
  4. Detailed design: Identify and address bottlenecks
  5. Discuss tradeoffs: Every decision has pros and cons

Scalability

Scalability is a system's ability to handle increasing load.

Vertical Scaling (Scale Up)

Adding more powerful CPUs, more RAM, or larger disks to a single server.

Pros:

  • Simple to implement
  • Easy to maintain data consistency
  • No network calls, so lower latency

Cons:

  • Hardware has physical limits
  • Creates a single point of failure
  • Cost grows disproportionately: high-end hardware costs far more per unit of capacity

Horizontal Scaling (Scale Out)

Adding more servers to distribute the load.

Pros:

  • Theoretically unlimited scaling
  • Achieves fault tolerance
  • Cost-effective

Cons:

  • Increases distributed system complexity
  • Harder to maintain data consistency
  • Requires load balancing

Stateless Design

The key to horizontal scaling is stateless design. When servers hold no state, any server can handle any request.

Stateful server:
  User A -> Server 1 (session stored)
  User A -> Server 2 (no session -> error!)

Stateless server:
  User A -> Server 1 (looks up session in external store)
  User A -> Server 2 (looks up session in external store -> works)

State such as sessions and caches is stored in external systems like Redis or Memcached.
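The idea above can be sketched in a few lines. This is a minimal illustration, not a production setup: the dict `SESSION_STORE` stands in for an external store such as Redis, and the names (`create_session`, `handle_request`, `server-1`) are hypothetical.

```python
import uuid

# Stand-in for an external session store such as Redis; in production these
# dict operations would be Redis GET/SET calls.
SESSION_STORE = {}

def create_session(user_id):
    """Any server can create a session; the state lives outside the server."""
    session_id = str(uuid.uuid4())
    SESSION_STORE[session_id] = {"user_id": user_id}
    return session_id

def handle_request(server_name, session_id):
    """Any server can serve the request, because it only reads shared state."""
    session = SESSION_STORE.get(session_id)
    if session is None:
        return f"{server_name}: 401 no session"
    return f"{server_name}: 200 hello {session['user_id']}"

sid = create_session("user-123")
print(handle_request("server-1", sid))  # server 1 finds the session
print(handle_request("server-2", sid))  # so does server 2 -- no sticky routing needed
```

Because neither server keeps the session in its own memory, the load balancer is free to send consecutive requests from the same user to different servers.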


Load Balancing

A load balancer distributes incoming traffic across multiple servers.

L4 vs L7 Load Balancing

L4 (Transport Layer):

  • Operates at the TCP/UDP level
  • Routes based on IP and port information only
  • Fast and efficient
  • Example: AWS NLB

L7 (Application Layer):

  • Operates at the HTTP/HTTPS level
  • Can route based on URL paths, headers, cookies, etc.
  • More granular control
  • Example: AWS ALB, Nginx

Load Balancing Algorithms

Round Robin: Distributes requests sequentially to each server. Simple but does not account for differences in server capacity.

Weighted Round Robin: Assigns weights to servers, sending more requests to higher-capacity machines.

IP Hash: Hashes the client IP to always route to the same server. Useful when session persistence is needed, but requires redistribution when servers are added or removed.

Least Connections: Sends requests to the server with the fewest active connections. Effective when request processing times vary.

Round Robin example:

  Request 1 -> Server A
  Request 2 -> Server B
  Request 3 -> Server C
  Request 4 -> Server A  (cycles back)
  Request 5 -> Server B  (cycles back)
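Round Robin and Weighted Round Robin reduce to simple iteration, which the following sketch illustrates (the function names and weights are invented for this example):

```python
import itertools

def round_robin(servers):
    # Endlessly cycle through the servers in order
    return itertools.cycle(servers)

def weighted_round_robin(weights):
    # weights: {"A": 3, "B": 1} -> A, A, A, B repeated; higher-capacity
    # servers appear more often in the cycle
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

rr = round_robin(["A", "B", "C"])
print([next(rr) for _ in range(5)])  # ['A', 'B', 'C', 'A', 'B']
```

Real load balancers use smoother weighted schedules and track live connection counts, but the distribution logic is the same in spirit.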

Health Checks

The load balancer periodically checks server health. Servers that fail to respond or return errors are removed from the rotation.

Load Balancer:
  /health -> Server A: 200 OK (healthy)
  /health -> Server B: 503 Error (unhealthy -> removed)
  /health -> Server C: 200 OK (healthy)

  Traffic goes to Server A and C only

Caching Strategies

Caching stores frequently accessed data in a fast storage layer to reduce response times.

Cache Aside (Lazy Loading)

The most widely used pattern.

  1. The application checks the cache first
  2. On a cache hit, it returns the cached data
  3. On a cache miss, it queries the database, stores the result in the cache, and returns it

Read flow:
  Client -> Check cache
    Hit -> Return from cache
    Miss -> Query DB -> Store in cache -> Return

Pros: Only caches data that is actually requested. Falls back to the database if the cache fails. Cons: First request always incurs a cache miss. Stale data is possible.

Write Through

Writes data to both the cache and the database simultaneously.

Write flow:
  Client -> Update cache -> Update DB -> Response

Pros: Cache and database stay consistent. Cons: Increased write latency. Caches data that may never be read.

Write Behind (Write Back)

Writes to the cache first and asynchronously persists to the database later.

Write flow:
  Client -> Update cache -> Response (immediate)
  Background: Cache -> DB sync (delayed)

Pros: Very fast writes. Reduces database load. Cons: Risk of data loss if the cache fails.

TTL and Cache Invalidation

TTL (Time To Live): Sets an expiration time on cache entries. Too short reduces cache efficiency; too long causes stale data.

Invalidation strategies:

  • Time-based: Automatic deletion when TTL expires
  • Event-based: Immediate invalidation when data changes
  • Version-based: Include a version in the cache key so new versions automatically create fresh entries

TTL examples:
  SET user:123 "Kim" EX 3600        (expires in 1 hour)
  SET product:456 "Laptop" EX 86400 (expires in 24 hours)

Cache Stampede (Thundering Herd)

When a popular cache entry expires, a flood of requests simultaneously hits the database.

Solutions:

  • Locking: Only one request queries the database; others wait
  • Probabilistic early refresh: Randomly refresh the cache before TTL expires
  • Staggered TTL: Set slightly different TTLs for different cache entries
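The first two mitigations can be combined in a small sketch. This is illustrative only: the lock is a process-local `threading.Lock` standing in for a distributed lock, and the early-refresh factor `beta` is an invented parameter:

```python
import random
import threading
import time

cache = {}                 # key -> (value, expires_at)
lock = threading.Lock()

def get_with_lock(key, loader, ttl=60, beta=0.1):
    now = time.time()
    entry = cache.get(key)
    # Probabilistic early refresh: occasionally treat the entry as expired
    # slightly before its real TTL, spreading refreshes over time
    if entry and now < entry[1] - random.random() * beta * ttl:
        return entry[0]
    if lock.acquire(blocking=False):
        try:
            value = loader()                 # only one caller hits the DB
            cache[key] = (value, now + ttl)
            return value
        finally:
            lock.release()
    # Losers of the lock race serve the stale value instead of stampeding
    # the DB (simplified: a real system would also handle the cold-cache case)
    return entry[0] if entry else loader()
```

The key property is that when a hot key expires, roughly one request pays the database cost while the rest are absorbed by the stale entry or the early refresh.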

CDN

A CDN (Content Delivery Network) serves content from edge servers distributed around the world, delivering data from the location closest to the user.

How It Works

User (Seoul) -> Seoul edge server (cache hit) -> Instant response
User (Seoul) -> Seoul edge server (cache miss) -> Origin server -> Cache at edge -> Response

Static Asset Delivery

CDNs are primarily used for serving static content:

  • Images, video, and audio
  • CSS and JavaScript files
  • Font files
  • HTML pages (static sites)

Optimizing Cache Hit Ratio

A higher cache hit ratio means better CDN efficiency.

Optimization techniques:

  • Set appropriate Cache-Control headers
  • Use file versioning (e.g., style.v2.css or content hashing)
  • Use long TTLs for infrequently changing content
  • Normalize query strings

Pull CDN vs Push CDN

Pull CDN: Fetches content from the origin on the first request and caches it. Suitable for high-traffic sites.

Push CDN: Content is pre-uploaded to the CDN. Suitable when content rarely changes.


Databases

SQL vs NoSQL

SQL (Relational Databases):

  • Structured data with strict schemas
  • ACID transaction guarantees
  • Complex queries with JOINs
  • Examples: PostgreSQL, MySQL, Oracle

NoSQL (Non-Relational Databases):

  • Flexible schemas
  • Easy horizontal scaling
  • High read/write throughput
  • Examples: MongoDB, Cassandra, DynamoDB, Redis

Selection criteria:

| Criterion      | SQL                    | NoSQL                                  |
| Data structure | Structured             | Flexible                               |
| Scaling        | Primarily vertical     | Primarily horizontal                   |
| Consistency    | Strong consistency     | Eventual consistency (typically)       |
| Transactions   | ACID                   | BASE                                   |
| Queries        | Complex JOINs possible | Optimized for simple key-value lookups |

Sharding

Distributing data across multiple database instances.

Hash-based sharding: Determines the shard by hashing the key. Achieves even distribution but requires expensive redistribution when shards are added or removed.

user_id % 4 = 0 -> Shard A
user_id % 4 = 1 -> Shard B
user_id % 4 = 2 -> Shard C
user_id % 4 = 3 -> Shard D

Range-based sharding: Assigns shards based on key ranges. Good for range queries but can create hotspots.

Consistent Hashing: Uses a hash ring to minimize key redistribution when nodes are added or removed. Widely used in distributed caches and databases.
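A minimal hash ring with virtual nodes looks like the sketch below. This is illustrative, not a production implementation: the class name, vnode count, and use of MD5 are arbitrary choices for the example.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; keys map to the nearest clockwise vnode."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical node gets many positions on the ring for even spread
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))   # first vnode clockwise
        return self.ring[idx % len(self.ring)][1]
```

Removing a node only reassigns the keys that mapped to that node's vnodes; keys on the other nodes keep their placement, which is exactly the property that `user_id % N` sharding lacks.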

Replication

Copying data to multiple nodes to improve availability and read performance.

Primary-Replica (Master-Slave):

  • Writes go to the primary; reads go to replicas
  • On primary failure, a replica is promoted

Multi-Primary (Master-Master):

  • Multiple nodes accept writes
  • Requires conflict resolution strategies

Primary-Replica architecture:

  Writes -> [Primary]
               |
        +------+------+
        |              |
    [Replica 1]    [Replica 2]
        |              |
    Read requests   Read requests

Partitioning

Logically dividing data.

Horizontal partitioning: Splits by rows. Similar to sharding. Vertical partitioning: Splits by columns. Separates frequently accessed columns from rarely accessed ones.


CAP Theorem

A distributed system cannot simultaneously satisfy all three of the following properties.

The Three Properties

Consistency: Every node has the same data. Reads always return the most recent write.

Availability: Every request receives a response. The system responds even if some nodes are down.

Partition Tolerance: The system continues to operate despite network partitions.

Real-World Choices

Network partitions are unavoidable in distributed systems, so the practical choice is between CP and AP.

CP systems (Consistency + Partition Tolerance):

  • May reject some requests during a partition to maintain consistency
  • Examples: HBase, MongoDB (default config), Zookeeper

AP systems (Availability + Partition Tolerance):

  • May return stale data during a partition to maintain availability
  • Examples: Cassandra, DynamoDB, CouchDB

CP system (during partition):
  Node A: latest data = "v2"
  Node B: stale data = "v1" -> Request rejected (consistency first)

AP system (during partition):
  Node A: latest data = "v2"
  Node B: stale data = "v1" -> Returns "v1" (availability first)

PACELC

An extension of CAP that also considers the tradeoff between Latency (L) and Consistency (C) when there is no partition (Else).

  • PA/EL: Availability during partition, low latency otherwise (e.g., DynamoDB)
  • PC/EC: Consistency during partition, consistency otherwise (e.g., HBase)
  • PA/EC: Availability during partition, consistency otherwise (e.g., MongoDB)

Message Queues

A message queue is middleware that enables asynchronous communication. It decouples producers from consumers, reducing coupling and improving scalability.

Why They Are Needed

Problems with synchronous communication:

  • When Service A calls Service B directly, a failure in B affects A as well
  • Traffic spikes can overload downstream services
  • Long-running tasks delay the entire response

Synchronous (problem):
  Order API -> Payment -> Inventory -> Notification -> Response
  (Total latency = sum of each service's latency)

Asynchronous (solution):
  Order API -> Publish event to queue -> Respond immediately
  Payment Service    <- Consumes from queue
  Inventory Service  <- Consumes from queue
  Notification Service <- Consumes from queue
  (Each service processes independently)

Apache Kafka

A distributed event streaming platform.

Key features:

  • High throughput (millions of messages per second)
  • Message durability (persisted to disk)
  • Parallel processing via consumer groups
  • Topic-based partitioning

Kafka architecture:
  Producer -> Topic (Partition 0, 1, 2) -> Consumer Group
                                            Consumer A (P0)
                                            Consumer B (P1)
                                            Consumer C (P2)

Use cases: Log aggregation, event sourcing, real-time stream processing, microservice communication

RabbitMQ

A traditional message broker using the AMQP protocol.

Key features:

  • Multiple routing patterns (Direct, Topic, Fanout, Headers)
  • Message acknowledgment mechanism
  • Priority queues
  • Flexible routing

Use cases: Task queues, RPC patterns, scenarios requiring complex routing

Amazon SQS

A fully managed message queue service from AWS.

Key features:

  • Serverless and fully managed
  • Auto-scaling
  • Standard queues (at-least-once delivery) and FIFO queues (exactly-once processing)
  • Dead Letter Queue support

Kafka vs RabbitMQ vs SQS Comparison

| Feature                | Kafka                         | RabbitMQ                  | SQS                  |
| Throughput             | Very high                     | Medium                    | High                 |
| Message ordering       | Guaranteed within a partition | Can be guaranteed         | FIFO queues only     |
| Message retention      | Configurable retention period | Deleted after consumption | Up to 14 days        |
| Operational complexity | High                          | Medium                    | Low (managed)        |
| Best for               | Event streaming               | Task queues               | Serverless workloads |

Event-Driven Architecture

Event-driven architecture built on message queues is a core pattern in microservice environments.

Event Sourcing: Records state changes as events. Current state is reconstructed by replaying events.

CQRS: Separates commands (writes) and queries (reads), optimizing each independently.

Event Sourcing example:
  Event 1: Order Created (OrderCreated)
  Event 2: Payment Completed (PaymentCompleted)
  Event 3: Shipping Started (ShippingStarted)
  Event 4: Delivery Completed (DeliveryCompleted)

  Current state = result of applying Events 1 + 2 + 3 + 4
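"Applying" the events is just a fold over the history. The sketch below illustrates this with an invented order-status model (the event names follow the example above; the transition table is hypothetical):

```python
from functools import reduce

events = [
    {"type": "OrderCreated", "order_id": "order-1"},
    {"type": "PaymentCompleted"},
    {"type": "ShippingStarted"},
    {"type": "DeliveryCompleted"},
]

def apply(state, event):
    """Each event moves the order to its next status."""
    transitions = {
        "OrderCreated": "created",
        "PaymentCompleted": "paid",
        "ShippingStarted": "shipping",
        "DeliveryCompleted": "delivered",
    }
    return {**state, "status": transitions[event["type"]]}

# Current state = fold (reduce) over the full event history
state = reduce(apply, events, {"status": None})
print(state["status"])  # delivered
```

Because the events are immutable, replaying any prefix of the history reconstructs the state at that point in time, which is what makes event sourcing useful for auditing and debugging.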

API Design

REST vs GraphQL vs gRPC

REST:

  • Based on HTTP methods (GET, POST, PUT, DELETE)
  • Resource-oriented design
  • Most widely adopted
  • Over-fetching and under-fetching problems

GET /api/users/123
POST /api/orders
PUT /api/users/123
DELETE /api/orders/456

GraphQL:

  • Clients request exactly the data they need
  • Single endpoint
  • Schema-based type system
  • Solves over-fetching and under-fetching

query {
  user(id: "123") {
    name
    email
    orders {
      id
      total
    }
  }
}

gRPC:

  • Protocol Buffers serialization
  • Uses HTTP/2 (bidirectional streaming)
  • High performance
  • Ideal for inter-service communication

service UserService {
  rpc GetUser (GetUserRequest) returns (UserResponse);
  rpc ListUsers (ListUsersRequest) returns (stream UserResponse);
}

API Comparison

| Feature     | REST                    | GraphQL       | gRPC                    |
| Protocol    | HTTP/1.1                | HTTP/1.1      | HTTP/2                  |
| Data format | JSON                    | JSON          | Protocol Buffers        |
| Type system | None (OpenAPI for docs) | Built-in      | Built-in                |
| Streaming   | Limited                 | Subscriptions | Bidirectional streaming |
| Best for    | Public APIs             | Frontend BFF  | Microservices           |

API Versioning

Strategies for maintaining backward compatibility when APIs change.

  • URL path versioning: /api/v1/users, /api/v2/users
  • Header versioning: Accept: application/vnd.api.v2+json
  • Query parameter: /api/users?version=2

Rate Limiting

Limits request frequency to prevent API abuse and protect the system.

Algorithms:

  • Token Bucket: Tokens are added at a steady rate; each request consumes a token. Allows burst traffic.
  • Sliding Window: Counts requests within a moving time window. Fewer boundary issues.
  • Fixed Window: Resets the counter at fixed intervals. Simple to implement.

Token Bucket:
  Bucket capacity: 10
  Refill rate: 1/second

  Time 0: 10 tokens (5 requests -> 5 remaining)
  Time 5: 10 tokens (5 refilled, capped at 10)
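The token bucket above fits in a small class. This is a single-process sketch (the class name and parameters are chosen for the example; a distributed rate limiter would keep the bucket state in something like Redis):

```python
import time

class TokenBucket:
    """Token bucket: capacity caps the burst size, refill_rate sets the
    steady-state throughput in requests per second."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate       # tokens added per second
        self.tokens = float(capacity)        # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1                 # each request consumes one token
            return True
        return False
```

A bucket of capacity 10 with a refill rate of 1/second allows a burst of 10 requests at once, then settles into one request per second, matching the trace above.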

Monitoring and Logging

Metrics

Quantify the health of the system.

Key metrics:

  • Latency: p50, p95, p99 percentile response times
  • Throughput: QPS (Queries Per Second)
  • Error rate: Ratio of 4xx and 5xx responses
  • Resource utilization: CPU, memory, disk, network

Tools: Prometheus, Grafana, Datadog, CloudWatch

Alerting

Notifies on-call engineers when metrics exceed thresholds.

Good alerting criteria:

  • Alerts must be actionable (the recipient can take action)
  • Minimize false positives
  • Differentiate severity levels (Critical, Warning, Info)

Distributed Tracing

Tracks the complete path of a request across microservices.

User request (trace-id: abc123)
  -> API Gateway (span-1, 5ms)
    -> Auth Service (span-2, 10ms)
    -> Order Service (span-3, 50ms)
      -> Payment Service (span-4, 200ms)
      -> Inventory Service (span-5, 30ms)
    -> Notification Service (span-6, 15ms)
  Total duration: 310ms

Tools: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry

Logging Strategy

Structured logs: Recording logs in JSON format makes them easier to search and analyze.

{
  "timestamp": "2026-04-12T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123",
  "message": "Payment failed",
  "user_id": "user-789",
  "order_id": "order-456",
  "error_code": "INSUFFICIENT_FUNDS"
}

Use the ELK stack (Elasticsearch + Logstash + Kibana) or a Loki + Grafana combination to collect, store, and visualize logs.
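With Python's standard `logging` module, structured JSON output is a matter of swapping in a custom formatter. This is a minimal sketch: the `fields` attribute convention and the hard-coded service name are assumptions for the example, not a standard API.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line (structured logging)."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "order-service",   # usually injected per service
            "message": record.getMessage(),
        }
        # Merge per-call context passed via logging's `extra` mechanism
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment failed",
             extra={"fields": {"trace_id": "abc123",
                               "error_code": "INSUFFICIENT_FUNDS"}})
```

Each record comes out as a single JSON line, so a collector like Logstash or Loki can index fields such as `trace_id` without any parsing rules.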


Practical Example: URL Shortener

A frequently asked system design interview question.

Requirements

Functional requirements:

  • Convert long URLs to short URLs
  • Redirect short URLs to the original URL
  • Support custom short URLs
  • URL expiration

Non-functional requirements:

  • High availability
  • Low latency (redirects must be fast)
  • Short URLs must not be predictable

Scale Estimation

  • Writes: 100 million per day, approximately 1,160 QPS
  • Reads: 10x writes = 1 billion per day, approximately 11,600 QPS
  • Storage: About 180 billion records over 5 years, 500 bytes per record = approximately 90 TB

High-Level Design

Client -> Load Balancer -> API Server -> DB
                              |
                           Cache (Redis)

Short Key Generation

Method 1: Hash and truncate

  • Generate a hash with MD5 or SHA-256 and use the first 7 characters
  • Base62 encoding (a-z, A-Z, 0-9) = 62 to the 7th power = approximately 3.5 trillion combinations

Method 2: Counter-based

  • Encode an auto-incrementing ID in Base62
  • In distributed environments, use Zookeeper or Snowflake IDs
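Base62 encoding of a counter is a short loop. The alphabet order below is one common convention; any fixed ordering of the 62 characters works as long as encode and decode agree:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    """Encode a non-negative counter ID as a Base62 short key."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)       # peel off one base-62 digit at a time
        out.append(ALPHABET[rem])
    return "".join(reversed(out))
```

For example, counter value 62 encodes to "10", and seven Base62 characters cover 62^7 (about 3.5 trillion) distinct IDs, matching the estimate above.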

Method 3: Pre-generation

  • Pre-generate keys and store them in a database
  • A Key Generation Service (KGS) manages the key pool

Detailed Design

Redirect flow:

1. Client requests short.url/abc123
2. Load balancer forwards to API server
3. API server checks Redis cache
4. Cache hit -> 301 Redirect response
5. Cache miss -> Query DB -> Store in cache -> 301 Redirect

301 vs 302 redirect:

  • 301 (Permanent): Browser caches the redirect. Reduces server load. Harder to collect analytics.
  • 302 (Temporary): Every request hits the server. Enables click analytics. Higher server load.

Practical Example: Chat System

Requirements

Functional requirements:

  • One-to-one and group chat
  • Send and receive messages
  • Online/offline status indicators
  • Read receipts
  • File and image sharing
  • Message search

Non-functional requirements:

  • Real-time message delivery (low latency)
  • Message ordering guarantees
  • Message durability (no data loss)
  • High availability

Communication Protocols

HTTP Polling: The client periodically asks the server for new messages. Simple but inefficient.

Long Polling: The server holds the connection until a new message arrives. More efficient than polling but requires timeout management.

WebSocket: Enables real-time bidirectional communication. After an initial HTTP handshake, a persistent connection is maintained. The best fit for real-time chat.

HTTP Polling:
  Client -> Server: New messages? (every 1 second)
  Server -> Client: None
  Client -> Server: New messages? (every 1 second)
  Server -> Client: Yes! (delivers message)

WebSocket:
  Client <-> Server: Persistent connection
  Server -> Client: New message pushed immediately

High-Level Design

Sender -> WebSocket Server -> Message Queue -> WebSocket Server -> Receiver
               |                                     |
           Message DB                        Presence Service
               |
         Message Search (Elasticsearch)

Message Store Selection

Characteristics of chat messages:

  • Sequential writes (append only)
  • Most reads are for recent messages
  • Random reads occur only during search
  • Very large data volumes

Suitable databases:

  • Cassandra: High write throughput, well-suited for time-based partitioning
  • HBase: Optimized for large-scale sequential writes

Message ID Design

Message IDs must be sortable by time to guarantee ordering.

Snowflake ID approach:

| Timestamp (41 bits) | Datacenter (5 bits) | Machine (5 bits) | Sequence (12 bits) |

  • Sortable by time
  • No collisions in distributed environments
  • Generates up to 4,096 IDs per millisecond per machine (12-bit sequence)
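Packing those fields is bit arithmetic, as in this simplified sketch (the epoch constant is Twitter's published custom epoch, used here as an example; clock-rollback handling is omitted):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit time-sortable ID: 41-bit timestamp | 5-bit datacenter |
    5-bit machine | 12-bit sequence."""

    EPOCH = 1288834974657  # custom epoch in ms (Twitter's, as an example)

    def __init__(self, datacenter_id, machine_id):
        self.datacenter_id = datacenter_id & 0x1F   # 5 bits
        self.machine_id = machine_id & 0x1F         # 5 bits
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            ms = int(time.time() * 1000)
            if ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 4,096 per ms
                if self.sequence == 0:
                    # Sequence exhausted in this millisecond: wait for the next
                    while ms <= self.last_ms:
                        ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = ms
            return ((ms - self.EPOCH) << 22) | (self.datacenter_id << 17) \
                   | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts messages by creation time, which is the property the chat system needs.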

Presence Management

Heartbeat approach: Clients send a periodic heartbeat (e.g., every 5 seconds). If no heartbeat is received for a set period, the user is marked offline.

User A: Sends heartbeat (every 5 seconds)
  Server: Records last heartbeat time
  No heartbeat for 30 seconds -> Status changed to "offline"
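Server-side, presence tracking reduces to recording the last heartbeat per user. In this sketch a dict stands in for a shared store such as Redis, and the 30-second timeout matches the trace above (the `now` parameter exists only to make the example deterministic):

```python
import time

HEARTBEAT_TIMEOUT = 30   # seconds without a heartbeat -> considered offline

last_heartbeat = {}      # user_id -> last heartbeat timestamp (stand-in for Redis)

def heartbeat(user_id, now=None):
    """Called every few seconds by the client's persistent connection."""
    last_heartbeat[user_id] = now if now is not None else time.time()

def is_online(user_id, now=None):
    """A user is online if a heartbeat arrived within the timeout window."""
    now = now if now is not None else time.time()
    seen = last_heartbeat.get(user_id)
    return seen is not None and now - seen <= HEARTBEAT_TIMEOUT
```

In a real system, `is_online` flipping to false would also trigger a status-change event so that the user's contacts see the update, rather than being polled.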

Group Chat

In group chat, messages must be delivered to all group members.

Small groups (tens of members):

  • Copy the message directly into each member's message queue

Large groups (thousands of members):

  • Fan-out becomes inefficient
  • A pull-based approach where recipients fetch messages is more suitable

Conclusion

The essence of system design is understanding tradeoffs. No system is perfect, and the goal is always to make the best choices for the given requirements.

Key principles:

  • Start simple and add complexity only when needed
  • Consider scalability from the start, but avoid over-engineering
  • Failures will happen. Build fault tolerance into the design
  • A system cannot be operated without monitoring
  • Data is the most valuable asset. Always implement backups and replication

The concepts covered here form the foundation of system design. Apply them in real projects, study diverse system design case studies, and continue building your skills.