System Design Fundamentals — Scalability, Availability, Caching, and Message Queues

Table of Contents

  1. What Is System Design
  2. Scalability
  3. Load Balancing
  4. Caching Strategies
  5. CDN
  6. Databases
  7. CAP Theorem
  8. Message Queues
  9. API Design
  10. Monitoring and Logging
  11. Practical Example: URL Shortener
  12. Practical Example: Chat System

What Is System Design

System design is the process of defining the architecture of large-scale software systems. It goes beyond writing code, dealing with how to build scalable and reliable systems that serve millions of users.

Why It Matters

  • In practice: Real services cannot run on a single server. You need to handle traffic growth, failures, and data management.
  • In interviews: Most major tech companies, including FAANG, include system design rounds.
  • Career growth: Architectural thinking is essential for advancing to senior engineering roles.

Interview Approach

There is no single correct answer in a system design interview. What matters is your thought process.

  1. Clarify requirements: Identify functional and non-functional requirements
  2. Estimate scale: Calculate user count, QPS, and storage needs
  3. High-level design: Place the core components
  4. Detailed design: Identify and address bottlenecks
  5. Discuss tradeoffs: Every decision has pros and cons

Scalability

Scalability is a system's ability to handle increasing load.

Vertical Scaling (Scale Up)

Adding more powerful CPUs, more RAM, or larger disks to a single server.

Pros:

  • Simple to implement
  • Easy to maintain data consistency
  • No network calls, so lower latency

Cons:

  • Hardware has physical limits
  • Creates a single point of failure
  • Cost grows disproportionately: high-end hardware costs far more per unit of capacity

Horizontal Scaling (Scale Out)

Adding more servers to distribute the load.

Pros:

  • Theoretically unlimited scaling
  • Achieves fault tolerance
  • Cost-effective

Cons:

  • Increases distributed system complexity
  • Harder to maintain data consistency
  • Requires load balancing

Stateless Design

The key to horizontal scaling is stateless design. When servers hold no state, any server can handle any request.

Stateful server:
  User A -> Server 1 (session stored)
  User A -> Server 2 (no session -> error!)

Stateless server:
  User A -> Server 1 (looks up session in external store)
  User A -> Server 2 (looks up session in external store -> works)

State such as sessions and caches is stored in external systems like Redis or Memcached.
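The idea above can be sketched in a few lines. This is a minimal illustration, not a production setup: the dict `SESSION_STORE` stands in for an external store such as Redis, and the names (`create_session`, `handle_request`, `server-1`) are hypothetical.

```python
import uuid

# Stand-in for an external session store such as Redis; in production these
# dict operations would be Redis GET/SET calls.
SESSION_STORE = {}

def create_session(user_id):
    """Any server can create a session; the state lives outside the server."""
    session_id = str(uuid.uuid4())
    SESSION_STORE[session_id] = {"user_id": user_id}
    return session_id

def handle_request(server_name, session_id):
    """Any server can serve the request, because it only reads shared state."""
    session = SESSION_STORE.get(session_id)
    if session is None:
        return f"{server_name}: 401 no session"
    return f"{server_name}: 200 hello {session['user_id']}"

sid = create_session("user-123")
print(handle_request("server-1", sid))  # server 1 finds the session
print(handle_request("server-2", sid))  # so does server 2 -- no sticky routing needed
```

Because neither server keeps the session in its own memory, the load balancer is free to send consecutive requests from the same user to different servers.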


Load Balancing

A load balancer distributes incoming traffic across multiple servers.

L4 vs L7 Load Balancing

L4 (Transport Layer):

  • Operates at the TCP/UDP level
  • Routes based on IP and port information only
  • Fast and efficient
  • Example: AWS NLB

L7 (Application Layer):

  • Operates at the HTTP/HTTPS level
  • Can route based on URL paths, headers, cookies, etc.
  • More granular control
  • Example: AWS ALB, Nginx

Load Balancing Algorithms

Round Robin: Distributes requests sequentially to each server. Simple but does not account for differences in server capacity.

Weighted Round Robin: Assigns weights to servers, sending more requests to higher-capacity machines.

IP Hash: Hashes the client IP to always route to the same server. Useful when session persistence is needed, but requires redistribution when servers are added or removed.

Least Connections: Sends requests to the server with the fewest active connections. Effective when request processing times vary.

Round Robin example:

  Request 1 -> Server A
  Request 2 -> Server B
  Request 3 -> Server C
  Request 4 -> Server A  (cycles back)
  Request 5 -> Server B  (cycles back)
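Round Robin and Weighted Round Robin reduce to simple iteration, which the following sketch illustrates (the function names and weights are invented for this example):

```python
import itertools

def round_robin(servers):
    # Endlessly cycle through the servers in order
    return itertools.cycle(servers)

def weighted_round_robin(weights):
    # weights: {"A": 3, "B": 1} -> A, A, A, B repeated; higher-capacity
    # servers appear more often in the cycle
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

rr = round_robin(["A", "B", "C"])
print([next(rr) for _ in range(5)])  # ['A', 'B', 'C', 'A', 'B']
```

Real load balancers use smoother weighted schedules and track live connection counts, but the distribution logic is the same in spirit.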

Health Checks

The load balancer periodically checks server health. Servers that fail to respond or return errors are removed from the rotation.

Load Balancer:
  /health -> Server A: 200 OK (healthy)
  /health -> Server B: 503 Error (unhealthy -> removed)
  /health -> Server C: 200 OK (healthy)

  Traffic goes to Server A and C only

Caching Strategies

Caching stores frequently accessed data in a fast storage layer to reduce response times.

Cache Aside (Lazy Loading)

The most widely used pattern.

  1. The application checks the cache first
  2. On a cache hit, it returns the cached data
  3. On a cache miss, it queries the database, stores the result in the cache, and returns it

Read flow:
  Client -> Check cache
    Hit -> Return from cache
    Miss -> Query DB -> Store in cache -> Return

Pros: Only caches data that is actually requested. Falls back to the database if the cache fails. Cons: First request always incurs a cache miss. Stale data is possible.

Write Through

Writes data to both the cache and the database simultaneously.

Write flow:
  Client -> Update cache -> Update DB -> Response

Pros: Cache and database stay consistent. Cons: Increased write latency. Caches data that may never be read.

Write Behind (Write Back)

Writes to the cache first and asynchronously persists to the database later.

Write flow:
  Client -> Update cache -> Response (immediate)
  Background: Cache -> DB sync (delayed)

Pros: Very fast writes. Reduces database load. Cons: Risk of data loss if the cache fails.

TTL and Cache Invalidation

TTL (Time To Live): Sets an expiration time on cache entries. Too short reduces cache efficiency; too long causes stale data.

Invalidation strategies:

  • Time-based: Automatic deletion when TTL expires
  • Event-based: Immediate invalidation when data changes
  • Version-based: Include a version in the cache key so new versions automatically create fresh entries

TTL examples:
  SET user:123 "Kim" EX 3600        (expires in 1 hour)
  SET product:456 "Laptop" EX 86400 (expires in 24 hours)

Cache Stampede (Thundering Herd)

When a popular cache entry expires, a flood of requests simultaneously hits the database.

Solutions:

  • Locking: Only one request queries the database; others wait
  • Probabilistic early refresh: Randomly refresh the cache before TTL expires
  • Staggered TTL: Set slightly different TTLs for different cache entries
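The first two mitigations can be combined in a small sketch. This is illustrative only: the lock is a process-local `threading.Lock` standing in for a distributed lock, and the early-refresh factor `beta` is an invented parameter:

```python
import random
import threading
import time

cache = {}                 # key -> (value, expires_at)
lock = threading.Lock()

def get_with_lock(key, loader, ttl=60, beta=0.1):
    now = time.time()
    entry = cache.get(key)
    # Probabilistic early refresh: occasionally treat the entry as expired
    # slightly before its real TTL, spreading refreshes over time
    if entry and now < entry[1] - random.random() * beta * ttl:
        return entry[0]
    if lock.acquire(blocking=False):
        try:
            value = loader()                 # only one caller hits the DB
            cache[key] = (value, now + ttl)
            return value
        finally:
            lock.release()
    # Losers of the lock race serve the stale value instead of stampeding
    # the DB (simplified: a real system would also handle the cold-cache case)
    return entry[0] if entry else loader()
```

The key property is that when a hot key expires, roughly one request pays the database cost while the rest are absorbed by the stale entry or the early refresh.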

CDN

A CDN (Content Delivery Network) serves content from edge servers distributed around the world, delivering data from the location closest to the user.

How It Works

User (Seoul) -> Seoul edge server (cache hit) -> Instant response
User (Seoul) -> Seoul edge server (cache miss) -> Origin server -> Cache at edge -> Response

Static Asset Delivery

CDNs are primarily used for serving static content:

  • Images, video, and audio
  • CSS and JavaScript files
  • Font files
  • HTML pages (static sites)

Optimizing Cache Hit Ratio

A higher cache hit ratio means better CDN efficiency.

Optimization techniques:

  • Set appropriate Cache-Control headers
  • Use file versioning (e.g., style.v2.css or content hashing)
  • Use long TTLs for infrequently changing content
  • Normalize query strings

Pull CDN vs Push CDN

Pull CDN: Fetches content from the origin on the first request and caches it. Suitable for high-traffic sites.

Push CDN: Content is pre-uploaded to the CDN. Suitable when content rarely changes.


Databases

SQL vs NoSQL

SQL (Relational Databases):

  • Structured data with strict schemas
  • ACID transaction guarantees
  • Complex queries with JOINs
  • Examples: PostgreSQL, MySQL, Oracle

NoSQL (Non-Relational Databases):

  • Flexible schemas
  • Easy horizontal scaling
  • High read/write throughput
  • Examples: MongoDB, Cassandra, DynamoDB, Redis

Selection criteria:

| Criterion      | SQL                    | NoSQL                                  |
| Data structure | Structured             | Flexible                               |
| Scaling        | Primarily vertical     | Primarily horizontal                   |
| Consistency    | Strong consistency     | Eventual consistency (typically)       |
| Transactions   | ACID                   | BASE                                   |
| Queries        | Complex JOINs possible | Optimized for simple key-value lookups |

Sharding

Distributing data across multiple database instances.

Hash-based sharding: Determines the shard by hashing the key. Achieves even distribution but requires expensive redistribution when shards are added or removed.

user_id % 4 = 0 -> Shard A
user_id % 4 = 1 -> Shard B
user_id % 4 = 2 -> Shard C
user_id % 4 = 3 -> Shard D

Range-based sharding: Assigns shards based on key ranges. Good for range queries but can create hotspots.

Consistent Hashing: Uses a hash ring to minimize key redistribution when nodes are added or removed. Widely used in distributed caches and databases.
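A minimal hash ring with virtual nodes looks like the sketch below. This is illustrative, not a production implementation: the class name, vnode count, and use of MD5 are arbitrary choices for the example.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; keys map to the nearest clockwise vnode."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical node gets many positions on the ring for even spread
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ""))   # first vnode clockwise
        return self.ring[idx % len(self.ring)][1]
```

Removing a node only reassigns the keys that mapped to that node's vnodes; keys on the other nodes keep their placement, which is exactly the property that `user_id % N` sharding lacks.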

Replication

Copying data to multiple nodes to improve availability and read performance.

Primary-Replica (Master-Slave):

  • Writes go to the primary; reads go to replicas
  • On primary failure, a replica is promoted

Multi-Primary (Master-Master):

  • Multiple nodes accept writes
  • Requires conflict resolution strategies

Primary-Replica architecture:

  Writes -> [Primary]
               |
        +------+------+
        |              |
    [Replica 1]    [Replica 2]
        |              |
    Read requests   Read requests

Partitioning

Logically dividing data.

Horizontal partitioning: Splits by rows. Similar to sharding. Vertical partitioning: Splits by columns. Separates frequently accessed columns from rarely accessed ones.


CAP Theorem

A distributed system cannot simultaneously satisfy all three of the following properties.

The Three Properties

Consistency: Every node has the same data. Reads always return the most recent write.

Availability: Every request receives a response. The system responds even if some nodes are down.

Partition Tolerance: The system continues to operate despite network partitions.

Real-World Choices

Network partitions are unavoidable in distributed systems, so the practical choice is between CP and AP.

CP systems (Consistency + Partition Tolerance):

  • May reject some requests during a partition to maintain consistency
  • Examples: HBase, MongoDB (default config), Zookeeper

AP systems (Availability + Partition Tolerance):

  • May return stale data during a partition to maintain availability
  • Examples: Cassandra, DynamoDB, CouchDB

CP system (during partition):
  Node A: latest data = "v2"
  Node B: stale data = "v1" -> Request rejected (consistency first)

AP system (during partition):
  Node A: latest data = "v2"
  Node B: stale data = "v1" -> Returns "v1" (availability first)

PACELC

An extension of CAP that also considers the tradeoff between Latency (L) and Consistency (C) when there is no partition (Else).

  • PA/EL: Availability during partition, low latency otherwise (e.g., DynamoDB)
  • PC/EC: Consistency during partition, consistency otherwise (e.g., HBase)
  • PA/EC: Availability during partition, consistency otherwise (e.g., MongoDB)

Message Queues

A message queue is middleware that enables asynchronous communication. It decouples producers from consumers, reducing coupling and improving scalability.

Why They Are Needed

Problems with synchronous communication:

  • When Service A calls Service B directly, a failure in B affects A as well
  • Traffic spikes can overload downstream services
  • Long-running tasks delay the entire response

Synchronous (problem):
  Order API -> Payment -> Inventory -> Notification -> Response
  (Total latency = sum of each service's latency)

Asynchronous (solution):
  Order API -> Publish event to queue -> Respond immediately
  Payment Service    <- Consumes from queue
  Inventory Service  <- Consumes from queue
  Notification Service <- Consumes from queue
  (Each service processes independently)

Apache Kafka

A distributed event streaming platform.

Key features:

  • High throughput (millions of messages per second)
  • Message durability (persisted to disk)
  • Parallel processing via consumer groups
  • Topic-based partitioning

Kafka architecture:
  Producer -> Topic (Partition 0, 1, 2) -> Consumer Group
                                            Consumer A (P0)
                                            Consumer B (P1)
                                            Consumer C (P2)

Use cases: Log aggregation, event sourcing, real-time stream processing, microservice communication

RabbitMQ

A traditional message broker using the AMQP protocol.

Key features:

  • Multiple routing patterns (Direct, Topic, Fanout, Headers)
  • Message acknowledgment mechanism
  • Priority queues
  • Flexible routing

Use cases: Task queues, RPC patterns, scenarios requiring complex routing

Amazon SQS

A fully managed message queue service from AWS.

Key features:

  • Serverless and fully managed
  • Auto-scaling
  • Standard queues (at-least-once delivery) and FIFO queues (exactly-once processing)
  • Dead Letter Queue support

Kafka vs RabbitMQ vs SQS Comparison

| Feature                | Kafka                         | RabbitMQ                  | SQS                  |
| Throughput             | Very high                     | Medium                    | High                 |
| Message ordering       | Guaranteed within a partition | Can be guaranteed         | FIFO queues only     |
| Message retention      | Configurable retention period | Deleted after consumption | Up to 14 days        |
| Operational complexity | High                          | Medium                    | Low (managed)        |
| Best for               | Event streaming               | Task queues               | Serverless workloads |

Event-Driven Architecture

Event-driven architecture built on message queues is a core pattern in microservice environments.

Event Sourcing: Records state changes as events. Current state is reconstructed by replaying events.

CQRS: Separates commands (writes) and queries (reads), optimizing each independently.

Event Sourcing example:
  Event 1: Order Created (OrderCreated)
  Event 2: Payment Completed (PaymentCompleted)
  Event 3: Shipping Started (ShippingStarted)
  Event 4: Delivery Completed (DeliveryCompleted)

  Current state = result of applying Events 1 + 2 + 3 + 4
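"Applying" the events is just a fold over the history. The sketch below illustrates this with an invented order-status model (the event names follow the example above; the transition table is hypothetical):

```python
from functools import reduce

events = [
    {"type": "OrderCreated", "order_id": "order-1"},
    {"type": "PaymentCompleted"},
    {"type": "ShippingStarted"},
    {"type": "DeliveryCompleted"},
]

def apply(state, event):
    """Each event moves the order to its next status."""
    transitions = {
        "OrderCreated": "created",
        "PaymentCompleted": "paid",
        "ShippingStarted": "shipping",
        "DeliveryCompleted": "delivered",
    }
    return {**state, "status": transitions[event["type"]]}

# Current state = fold (reduce) over the full event history
state = reduce(apply, events, {"status": None})
print(state["status"])  # delivered
```

Because the events are immutable, replaying any prefix of the history reconstructs the state at that point in time, which is what makes event sourcing useful for auditing and debugging.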

API Design

REST vs GraphQL vs gRPC

REST:

  • Based on HTTP methods (GET, POST, PUT, DELETE)
  • Resource-oriented design
  • Most widely adopted
  • Over-fetching and under-fetching problems

GET /api/users/123
POST /api/orders
PUT /api/users/123
DELETE /api/orders/456

GraphQL:

  • Clients request exactly the data they need
  • Single endpoint
  • Schema-based type system
  • Solves over-fetching and under-fetching

query {
  user(id: "123") {
    name
    email
    orders {
      id
      total
    }
  }
}

gRPC:

  • Protocol Buffers serialization
  • Uses HTTP/2 (bidirectional streaming)
  • High performance
  • Ideal for inter-service communication

service UserService {
  rpc GetUser (GetUserRequest) returns (UserResponse);
  rpc ListUsers (ListUsersRequest) returns (stream UserResponse);
}

API Comparison

| Feature     | REST                    | GraphQL       | gRPC                    |
| Protocol    | HTTP/1.1                | HTTP/1.1      | HTTP/2                  |
| Data format | JSON                    | JSON          | Protocol Buffers        |
| Type system | None (OpenAPI for docs) | Built-in      | Built-in                |
| Streaming   | Limited                 | Subscriptions | Bidirectional streaming |
| Best for    | Public APIs             | Frontend BFF  | Microservices           |

API Versioning

Strategies for maintaining backward compatibility when APIs change.

  • URL path versioning: /api/v1/users, /api/v2/users
  • Header versioning: Accept: application/vnd.api.v2+json
  • Query parameter: /api/users?version=2

Rate Limiting

Limits request frequency to prevent API abuse and protect the system.

Algorithms:

  • Token Bucket: Tokens are added at a steady rate; each request consumes a token. Allows burst traffic.
  • Sliding Window: Counts requests within a moving time window. Fewer boundary issues.
  • Fixed Window: Resets the counter at fixed intervals. Simple to implement.

Token Bucket:
  Bucket capacity: 10
  Refill rate: 1/second

  Time 0: 10 tokens (5 requests -> 5 remaining)
  Time 5: 10 tokens (5 refilled, capped at 10)
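The token bucket above fits in a small class. This is a single-process sketch (the class name and parameters are chosen for the example; a distributed rate limiter would keep the bucket state in something like Redis):

```python
import time

class TokenBucket:
    """Token bucket: capacity caps the burst size, refill_rate sets the
    steady-state throughput in requests per second."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate       # tokens added per second
        self.tokens = float(capacity)        # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1                 # each request consumes one token
            return True
        return False
```

A bucket of capacity 10 with a refill rate of 1/second allows a burst of 10 requests at once, then settles into one request per second, matching the trace above.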

Monitoring and Logging

Metrics

Quantify the health of the system.

Key metrics:

  • Latency: p50, p95, p99 percentile response times
  • Throughput: QPS (Queries Per Second)
  • Error rate: Ratio of 4xx and 5xx responses
  • Resource utilization: CPU, memory, disk, network

Tools: Prometheus, Grafana, Datadog, CloudWatch

Alerting

Notifies on-call engineers when metrics exceed thresholds.

Good alerting criteria:

  • Alerts must be actionable (the recipient can take action)
  • Minimize false positives
  • Differentiate severity levels (Critical, Warning, Info)

Distributed Tracing

Tracks the complete path of a request across microservices.

User request (trace-id: abc123)
  -> API Gateway (span-1, 5ms)
    -> Auth Service (span-2, 10ms)
    -> Order Service (span-3, 50ms)
      -> Payment Service (span-4, 200ms)
      -> Inventory Service (span-5, 30ms)
    -> Notification Service (span-6, 15ms)
  Total duration: 310ms

Tools: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry

Logging Strategy

Structured logs: Recording logs in JSON format makes them easier to search and analyze.

{
  "timestamp": "2026-04-12T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123",
  "message": "Payment failed",
  "user_id": "user-789",
  "order_id": "order-456",
  "error_code": "INSUFFICIENT_FUNDS"
}

Use the ELK stack (Elasticsearch + Logstash + Kibana) or a Loki + Grafana combination to collect, store, and visualize logs.
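With Python's standard `logging` module, structured JSON output is a matter of swapping in a custom formatter. This is a minimal sketch: the `fields` attribute convention and the hard-coded service name are assumptions for the example, not a standard API.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line (structured logging)."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "order-service",   # usually injected per service
            "message": record.getMessage(),
        }
        # Merge per-call context passed via logging's `extra` mechanism
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment failed",
             extra={"fields": {"trace_id": "abc123",
                               "error_code": "INSUFFICIENT_FUNDS"}})
```

Each record comes out as a single JSON line, so a collector like Logstash or Loki can index fields such as `trace_id` without any parsing rules.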


Practical Example: URL Shortener

A frequently asked system design interview question.

Requirements

Functional requirements:

  • Convert long URLs to short URLs
  • Redirect short URLs to the original URL
  • Support custom short URLs
  • URL expiration

Non-functional requirements:

  • High availability
  • Low latency (redirects must be fast)
  • Short URLs must not be predictable

Scale Estimation

  • Writes: 100 million per day, approximately 1,160 QPS
  • Reads: 10x writes = 1 billion per day, approximately 11,600 QPS
  • Storage: About 180 billion records over 5 years, 500 bytes per record = approximately 90 TB

High-Level Design

Client -> Load Balancer -> API Server -> DB
                              |
                           Cache (Redis)

Short Key Generation

Method 1: Hash and truncate

  • Generate a hash with MD5 or SHA-256 and use the first 7 characters
  • Base62 encoding (a-z, A-Z, 0-9) = 62 to the 7th power = approximately 3.5 trillion combinations

Method 2: Counter-based

  • Encode an auto-incrementing ID in Base62
  • In distributed environments, use Zookeeper or Snowflake IDs
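Base62 encoding of a counter is a short loop. The alphabet order below is one common convention; any fixed ordering of the 62 characters works as long as encode and decode agree:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    """Encode a non-negative counter ID as a Base62 short key."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)       # peel off one base-62 digit at a time
        out.append(ALPHABET[rem])
    return "".join(reversed(out))
```

For example, counter value 62 encodes to "10", and seven Base62 characters cover 62^7 (about 3.5 trillion) distinct IDs, matching the estimate above.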

Method 3: Pre-generation

  • Pre-generate keys and store them in a database
  • A Key Generation Service (KGS) manages the key pool

Detailed Design

Redirect flow:

1. Client requests short.url/abc123
2. Load balancer forwards to API server
3. API server checks Redis cache
4. Cache hit -> 301 Redirect response
5. Cache miss -> Query DB -> Store in cache -> 301 Redirect

301 vs 302 redirect:

  • 301 (Permanent): Browser caches the redirect. Reduces server load. Harder to collect analytics.
  • 302 (Temporary): Every request hits the server. Enables click analytics. Higher server load.

Practical Example: Chat System

Requirements

Functional requirements:

  • One-to-one and group chat
  • Send and receive messages
  • Online/offline status indicators
  • Read receipts
  • File and image sharing
  • Message search

Non-functional requirements:

  • Real-time message delivery (low latency)
  • Message ordering guarantees
  • Message durability (no data loss)
  • High availability

Communication Protocols

HTTP Polling: The client periodically asks the server for new messages. Simple but inefficient.

Long Polling: The server holds the connection until a new message arrives. More efficient than polling but requires timeout management.

WebSocket: Enables real-time bidirectional communication. After an initial HTTP handshake, a persistent connection is maintained. The best fit for real-time chat.

HTTP Polling:
  Client -> Server: New messages? (every 1 second)
  Server -> Client: None
  Client -> Server: New messages? (every 1 second)
  Server -> Client: Yes! (delivers message)

WebSocket:
  Client <-> Server: Persistent connection
  Server -> Client: New message pushed immediately

High-Level Design

Sender -> WebSocket Server -> Message Queue -> WebSocket Server -> Receiver
               |                                     |
           Message DB                        Presence Service
               |
         Message Search (Elasticsearch)

Message Store Selection

Characteristics of chat messages:

  • Sequential writes (append only)
  • Most reads are for recent messages
  • Random reads occur only during search
  • Very large data volumes

Suitable databases:

  • Cassandra: High write throughput, well-suited for time-based partitioning
  • HBase: Optimized for large-scale sequential writes

Message ID Design

Message IDs must be sortable by time to guarantee ordering.

Snowflake ID approach:

| Timestamp (41 bits) | Datacenter (5 bits) | Machine (5 bits) | Sequence (12 bits) |

  • Sortable by time
  • No collisions in distributed environments
  • Generates up to 4,096 IDs per millisecond per machine (12-bit sequence)
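Packing those fields is bit arithmetic, as in this simplified sketch (the epoch constant is Twitter's published custom epoch, used here as an example; clock-rollback handling is omitted):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit time-sortable ID: 41-bit timestamp | 5-bit datacenter |
    5-bit machine | 12-bit sequence."""

    EPOCH = 1288834974657  # custom epoch in ms (Twitter's, as an example)

    def __init__(self, datacenter_id, machine_id):
        self.datacenter_id = datacenter_id & 0x1F   # 5 bits
        self.machine_id = machine_id & 0x1F         # 5 bits
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            ms = int(time.time() * 1000)
            if ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 4,096 per ms
                if self.sequence == 0:
                    # Sequence exhausted in this millisecond: wait for the next
                    while ms <= self.last_ms:
                        ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = ms
            return ((ms - self.EPOCH) << 22) | (self.datacenter_id << 17) \
                   | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts messages by creation time, which is the property the chat system needs.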

Presence Management

Heartbeat approach: Clients send a periodic heartbeat (e.g., every 5 seconds). If no heartbeat is received for a set period, the user is marked offline.

User A: Sends heartbeat (every 5 seconds)
  Server: Records last heartbeat time
  No heartbeat for 30 seconds -> Status changed to "offline"
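Server-side, presence tracking reduces to recording the last heartbeat per user. In this sketch a dict stands in for a shared store such as Redis, and the 30-second timeout matches the trace above (the `now` parameter exists only to make the example deterministic):

```python
import time

HEARTBEAT_TIMEOUT = 30   # seconds without a heartbeat -> considered offline

last_heartbeat = {}      # user_id -> last heartbeat timestamp (stand-in for Redis)

def heartbeat(user_id, now=None):
    """Called every few seconds by the client's persistent connection."""
    last_heartbeat[user_id] = now if now is not None else time.time()

def is_online(user_id, now=None):
    """A user is online if a heartbeat arrived within the timeout window."""
    now = now if now is not None else time.time()
    seen = last_heartbeat.get(user_id)
    return seen is not None and now - seen <= HEARTBEAT_TIMEOUT
```

In a real system, `is_online` flipping to false would also trigger a status-change event so that the user's contacts see the update, rather than being polled.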

Group Chat

In group chat, messages must be delivered to all group members.

Small groups (tens of members):

  • Copy the message directly into each member's message queue

Large groups (thousands of members):

  • Fan-out becomes inefficient
  • A pull-based approach where recipients fetch messages is more suitable

Conclusion

The essence of system design is understanding tradeoffs. No system is perfect, and the goal is always to make the best choices for the given requirements.

Key principles:

  • Start simple and add complexity only when needed
  • Consider scalability from the start, but avoid over-engineering
  • Failures will happen. Build fault tolerance into the design
  • A system cannot be operated without monitoring
  • Data is the most valuable asset. Always implement backups and replication

The concepts covered here form the foundation of system design. Apply them in real projects, study diverse system design case studies, and continue building your skills.