Prologue — "22 theory posts done. How does reality run?"
Season 2 covered principles and patterns. Season 3 is about real-world cases.
- Why does Netflix run 7,000+ microservices?
- Why is Stripe still a Ruby monolith?
- How does Cloudflare handle 32 trillion requests per day?
- How does Shopify absorb $10B in Black Friday sales?
Everything below comes from public engineering blogs, conference talks, and postmortems. No insider info.
Season 3 Ep 1 — architecture case studies.
Part 1 — Netflix — An Orchestra of 7,000 Microservices
Scale (2024 public data)
- MAU: 270M
- Concurrent streams: tens of millions
- Traffic share: ~15% of the internet
- Microservices: 7,000+ (estimated)
- AWS EC2 instances: millions
- Cassandra clusters: thousands
- Kafka messages: trillions per day
Architecture stack
Edge: Zuul (API Gateway) → migrating to Spring Cloud Gateway
Service discovery / load balancing: in-house (Eureka, Ribbon) → moving to gRPC
DB: Cassandra (primary), DynamoDB, EVCache (Memcached)
Caching: EVCache (in-house), Moneta
Streaming: Kafka + Flink
Compute: AWS EC2 + Titus (in-house container platform)
Chaos Engineering: Chaos Monkey (invented here)
Chaos Engineering — Netflix's invention
Chaos Monkey: randomly kills production servers (sketch below)
Chaos Gorilla: takes down an AZ
Chaos Kong: takes down a region
Philosophy: "Failure is inevitable. Rehearse it to build resilience." Result: no major outage in 10+ years (survived the 2023 AWS us-east-1 incident).
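To make the idea concrete, here is a minimal TypeScript sketch of the Chaos Monkey concept. The terminateInstance() call and the hard-coded instance list are hypothetical stand-ins, not Netflix's real tooling:

// chaos-monkey-sketch.ts: kill one random instance per run so teams build for failure
type Instance = { id: string; group: string };

const instances: Instance[] = [
  { id: "i-0a1", group: "api" },
  { id: "i-0b2", group: "api" },
  { id: "i-0c3", group: "playback" },
];

// Stand-in for a real cloud call (e.g. an EC2 terminate request).
async function terminateInstance(id: string): Promise<void> {
  console.log(`terminating ${id} -- every service must tolerate this at any moment`);
}

async function unleashMonkey(group: string): Promise<void> {
  const targets = instances.filter(i => i.group === group);
  if (targets.length === 0) return;
  const victim = targets[Math.floor(Math.random() * targets.length)];
  await terminateInstance(victim.id);
}

unleashMonkey("api");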
Encoding strategy
One movie → 120+ variants (resolution x bitrate x audio x HDR x subtitles)
Per-title encoding: analyze optimal bitrate per film
Per-shot encoding: different bitrate per scene → 20% bandwidth saved (sketch below)
AV1 codec: activated in 2024, 20-30% more efficient than H.264
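A toy sketch of the per-shot idea in TypeScript; the complexity scores and bitrate numbers are invented purely for illustration:

// per-shot-encoding-sketch.ts: pick a bitrate per shot from its visual complexity
type Shot = { start: number; end: number; complexity: number }; // complexity in [0, 1]

// Simpler shots get a lower bitrate at the same perceived quality; numbers are made up.
function bitrateForShot(shot: Shot): number {
  const maxKbps = 6000; // budget reserved for the most complex shots
  const minKbps = 1500; // floor so even static scenes stay clean
  return Math.round(minKbps + (maxKbps - minKbps) * shot.complexity);
}

const shots: Shot[] = [
  { start: 0, end: 12, complexity: 0.2 },  // static dialogue scene
  { start: 12, end: 20, complexity: 0.9 }, // fast action scene
];

console.log(shots.map(s => ({ ...s, kbps: bitrateForShot(s) })));
// the dialogue shot gets ~2400 kbps, the action shot ~5550 kbps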
Open Connect — in-house CDN
Why: commercial CDNs couldn't match scale/cost
How: Netflix servers placed inside ISPs worldwide
Effect: minimal backbone load, reduced latency
Scale: 18,000+ servers in 175+ countries
Deployment — Spinnaker
Built in 2014 at Netflix, open-sourced. Pre-GitOps era CD. 2025 reality: Spinnaker fading, ArgoCD dominant. Netflix is modernizing internally.
Lessons
- Scale defines architecture: 270M users demand microservices.
- Expect failure: Chaos Engineering treats outages as "when", not "if".
- Build aggressively: CDN, container platform, all in-house.
- Data-driven detail: per-shot encoding is micro-optimization at macro-scale.
- But 7,000 services is not your target — it is Netflix's necessity.
Part 2 — Stripe — The Monolith Wins
Scale (2024 public)
- Payment volume: $1T+ per year (2023)
- Customers: millions (Uber, Amazon, Google, OpenAI, ...)
- Ruby monolith: tens of millions of lines
- Employees: 8,000+
"A company this big, still a monolith?"
Stripe proudly says: "We are a monolith."
Reasons:
1. Payment domain needs strong consistency (ACID)
2. Microservices = distributed transactions = complexity
3. Developer productivity: one repo, one deploy
4. Strong tests → refactor freely
Actual stack
Sorbet (Ruby type checker) — built by Stripe
Rails-based monolith
Thousands of endpoints
Postgres (sharded) + MongoDB (some legacy)
Kafka for events
Spark for analytics
Sorbet adds static typing to Ruby. Open-sourced in 2019. Same role TypeScript plays for JS.
API versioning
Date-based: Stripe-Version: 2024-12-18
New version whenever there is a breaking change
Every version kept forever (v1 from 2011 still works)
Internals (sketched below):
- Request → version translator → current internal schema
- Current internal → version translator → Response
- 10+ years of versions vs customer trust
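A simplified TypeScript sketch of the translator chain. The version dates, field renames, and the toCurrentSchema() helper are hypothetical, not Stripe's real changelog; responses go back through the same chain in reverse:

// api-versioning-sketch.ts: upgrade a request pinned to an old version into the current schema
type ChargeRequest = Record<string, unknown>;

// Each entry upgrades a request from that dated version to the next one.
const upgraders: Record<string, (req: ChargeRequest) => ChargeRequest> = {
  "2022-11-15": ({ amount, ...rest }) => ({ ...rest, amount_total: amount }),
  "2023-10-16": req => ({ ...req, currency: String(req.currency ?? "").toLowerCase() }),
};

const versions = ["2022-11-15", "2023-10-16", "2024-12-18"]; // oldest → current

function toCurrentSchema(req: ChargeRequest, pinnedVersion: string): ChargeRequest {
  // Apply every upgrader from the caller's pinned version up to the current one.
  return versions
    .slice(versions.indexOf(pinnedVersion))
    .reduce((r, v) => (upgraders[v] ? upgraders[v](r) : r), req);
}

console.log(toCurrentSchema({ amount: 1000, currency: "USD" }, "2022-11-15"));
// → { currency: "usd", amount_total: 1000 }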
Idempotency
Every POST supports an Idempotency-Key header. Redis and the DB retain the key and its result for 24 hours. Concurrent requests with the same key are locked to a single execution. Stripe effectively set the industry standard for idempotency. A minimal sketch follows.
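This sketch uses an in-memory Map standing in for Redis + DB, and TTL handling is omitted:

// idempotency-sketch.ts: one execution per Idempotency-Key, replayed response on retry
type Entry = { status: "in_progress" } | { status: "done"; response: string };
const store = new Map<string, Entry>(); // stands in for Redis + DB

async function handleCharge(idempotencyKey: string, doCharge: () => Promise<string>): Promise<string> {
  const existing = store.get(idempotencyKey);
  if (existing) {
    if (existing.status === "done") return existing.response; // replay the saved result
    throw new Error("409: a request with this key is already in flight");
  }
  store.set(idempotencyKey, { status: "in_progress" });    // lock the key
  const response = await doCharge();                        // execute exactly once
  store.set(idempotencyKey, { status: "done", response });  // kept ~24h in the real system
  return response;
}

// The retry never calls its doCharge; it gets the first response back instead.
handleCharge("key_abc", async () => "charge_created")
  .then(() => handleCharge("key_abc", async () => { throw new Error("double charge"); }))
  .then(console.log); // "charge_created"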
Data infrastructure
Pervasive Caching: each service owns its cache
Event sourcing: payment history is immutable
Double-entry bookkeeping: accounting style (balances = sum of events; sketch below)
Strong consistency: Spanner-like in-house implementation
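A small sketch of "balances = sum of events"; the account names and event shape are illustrative, not Stripe's actual ledger schema:

// ledger-sketch.ts: balances derived by summing immutable double-entry events
type Entry = { account: string; amountCents: number };  // positive = credit, negative = debit
type LedgerEvent = { id: string; entries: Entry[] };    // entries within one event sum to zero

const events: LedgerEvent[] = [
  { id: "evt_1", entries: [{ account: "customer", amountCents: -5000 }, { account: "merchant", amountCents: 5000 }] },
  { id: "evt_2", entries: [{ account: "merchant", amountCents: -150 }, { account: "fees", amountCents: 150 }] },
];

// Never mutate a balance in place: fold over the append-only event history instead.
function balance(account: string): number {
  return events
    .flatMap(e => e.entries)
    .filter(en => en.account === account)
    .reduce((sum, en) => sum + en.amountCents, 0);
}

console.log(balance("merchant")); // 4850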
Lessons
- A monolith can process $1T. Microservices aren't mandatory.
- Build your own tools: Sorbet, Skycfg (Starlark config).
- APIs are contracts: design for 10-year maintenance.
- Idempotency is table stakes.
- Developer experience is the product.
Part 3 — Cloudflare — Ruler of the Edge
Scale (2024)
- Network: 300+ cities, 120+ countries
- Traffic: 32 trillion requests/day
- DNS: 30+ quadrillion queries/day
- DDoS defense: ~209B cyber threats blocked per day (2024)
- Workers: 7M+ deploys daily
Core product stack
Edge: in-house hardware + software stack
Network: Anycast (1.1.1.1 — one IP, 300 PoPs)
DDoS: L3/L4/L7 layered defense
Workers: V8 isolate-based serverless
Workers AI: edge inference (LLM)
R2: S3-compatible object storage (zero egress fees is the differentiator)
D1: SQLite-based distributed DB
Durable Objects: stateful serverless
Hyperdrive: DB connection pooler at edge
Tunnel: Zero Trust private network
Workers — why V8 isolate?
Lambda: container → 100ms+ cold start
Workers: V8 isolate → 5ms cold start
Why:
- V8 isolates are tiny (few MB memory)
- Thousands of isolates in one process
- Each isolate is sandboxed from the others (the same V8 engine browsers run)
Constraints: only a subset of Node APIs (Web standards first). No filesystem. 50ms CPU limit (30s on paid). A minimal handler is sketched below.
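A minimal Workers handler in the module syntax; the route and origin URL are placeholders:

// worker.ts: a minimal Cloudflare Workers handler (module syntax)
// Runs inside a V8 isolate: Web-standard APIs only (fetch, Request, Response), no filesystem.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname === "/hello") {
      return new Response(JSON.stringify({ msg: "hi from the edge" }), {
        headers: { "content-type": "application/json" },
      });
    }
    // Everything else is proxied to an origin; the isolate itself starts in milliseconds.
    return fetch("https://origin.example.com" + url.pathname);
  },
};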
Network stack highlights
Quicksilver: in-house KV store (ultra-fast config distribution)
Unimog: L4 load balancer
BGP: global announcements, Argo Smart Routing
June 2022 outage postmortem (famous)
What: 19 of Cloudflare's busiest data centers down, a large share of global traffic affected
Cause: a BGP configuration error during a planned network architecture rollout
Duration: ~75 minutes
Lesson: stage risky network changes and roll them out to less critical sites first
Value: the postmortem was public — the whole industry learned.
Lessons
- Edge is the future: distribute globally, not centrally.
- Own the stack end-to-end: from network hardware to runtime.
- Transparent postmortems build trust.
- Free tier is strategic: developer lock-in.
- Differentiation from other CDNs: developer platform (Workers) is the core.
Part 4 — Shopify — King of Black Friday
Scale (2023-2024)
- Merchants: 5M+
- Black Friday/Cyber Monday 2023: $9.3B in merchant sales (4 days)
- Peak RPS: 700K+ requests/sec
- Orders: tens of thousands per minute
- GMV: $300B+ annually
Architecture stack
Rails monolith: "Shopify Core"
Ruby: primary language (heavy YJIT usage)
MySQL: sharded, Vitess-like in-house implementation
Kafka: event streaming
Go, Elixir, Rust: specialized services
Kubernetes (partial), GCP + in-house data centers
Pods — Shopify's sharding
Pod: isolated unit grouping merchants
Each Pod: MySQL shard + Redis + other resources
One Pod fails → other Pods unaffected (blast radius limited)
Dozens of Pods worldwide
Effect: a single customer saturating a DB does not affect others (routing sketch below).
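A routing sketch of the Pod idea; the pod list and the merchant-to-pod table are hypothetical, not Shopify's real topology:

// pod-routing-sketch.ts: route every request for a merchant to that merchant's pod
type Pod = { id: number; mysqlDsn: string; redisUrl: string };

const pods: Pod[] = [
  { id: 1, mysqlDsn: "mysql://pod1/shopify", redisUrl: "redis://pod1" },
  { id: 2, mysqlDsn: "mysql://pod2/shopify", redisUrl: "redis://pod2" },
];

// In practice this mapping is persisted so merchants can be rebalanced between pods.
const merchantToPod = new Map<string, number>([
  ["shop_a", 1],
  ["shop_b", 2],
]);

function podFor(merchantId: string): Pod {
  const pod = pods.find(p => p.id === merchantToPod.get(merchantId));
  if (!pod) throw new Error(`no pod assigned for ${merchantId}`);
  return pod; // every read and write for this merchant stays inside this pod
}

console.log(podFor("shop_b").mysqlDsn); // traffic from shop_b never touches pod 1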
Black Friday preparation
1. Expected traffic: 4-10x normal
2. Rehearsals start 6 months out
3. Game days: deliberate failure simulation
4. Capacity planning: efficiency gains, not just scale-out
5. Feature freeze: major deploys stop 2 weeks prior
6. On-call fully staffed and rehearsed
YJIT — Ruby JIT turning point
2022: YJIT rewritten in Rust
2023+: adopted by Shopify Core
Effect: 15% server CPU reduction → lower server cost
Shopify leads YJIT development (Maxime Chevalier-Boisvert)
Storefront — new frontend stack
Hydrogen: Remix-based SSR framework by Shopify
Oxygen: Shopify's deployment platform
Before: Liquid template engine (in-house)
After: React + TypeScript
Lessons
- Shard by Pod: customer isolation is scaling's friend.
- Black Friday performance is everyday capability multiplied: practice daily.
- Monolith + peripheral micro: Rails + Go/Elixir for specialized parts.
- Invest in language/runtime (YJIT): compounding returns.
- Game day culture: rehearse failure.
Part 5 — Discord — Chat at Tens of Millions Concurrent
Scale
- Registered users: 300M+
- DAU: 150M+
- Messages: 40B+/day
- Voice minutes: tens of billions/month
Interesting language choices
Elixir (Erlang VM): chat service (massive concurrency)
Rust: performance-critical (voice servers, etc.)
Python: ML, some backend
Go, TypeScript: miscellaneous
ScyllaDB replaces Cassandra (2023)
Problem: 177 Cassandra nodes, maintenance hell
Solution: migrate to ScyllaDB (Cassandra rewritten in C++)
Result: 177 → 72 nodes, latency cut 50%, $$$ saved
Lesson: language/runtime swaps can halve infrastructure. Worth the engineering investment.
Lessons
- BEAM (Erlang) power: WhatsApp, Discord prove it.
- Mix languages: match domain to language.
- DB replacement is possible: abstract the interface early.
- Adopt Rust gradually: start from hot paths.
Part 6 — GitHub — Largest Code Host
Scale
- Repos: 500M+
- Developers: 150M+
- Git data: several PB
- Actions: millions of jobs per minute
Architecture evolution
2008: Rails monolith (founded)
~2016: MySQL + Redis + memcached
2020+: Spokes (Git storage), Kafka, Kubernetes gradual adoption
2022-: Azure migration under Microsoft
Notable engineering
Monolith-first: still runs a large Rails app
Spokes: distributed Git storage (in-house)
Codespaces: Kubernetes orchestration for VS Code Server
Copilot: OpenAI collaboration
Lessons
- Rails still works — if done right.
- Git storage is hard: filesystem problem at its core.
- Dogfood the developer platform: Codespaces is GitHub's own dev environment.
Part 7 — Figma — The Real-Time Collaboration Revolution
Technical challenges
Dozens editing the same canvas simultaneously
Latency must feel under 50ms
Offline edits that merge back
Undo/Redo without confusion
CRDT-based co-editing
Conflict-free Replicated Data Types
Each edit is commutative (order-independent; see the merge sketch after this list)
Server is a mediator, final state converges
Figma's data model:
- Node tree (DOM-like)
- Each edit is an operation type
- WebSocket for real-time sync
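A last-writer-wins property map is the simplest structure in this family; the sketch below shows commutative merging and is an illustration, not Figma's actual implementation:

// lww-merge-sketch.ts: a last-writer-wins property map that always converges
type Op = { nodeId: string; prop: string; value: unknown; clock: number; clientId: string };
type PropState = { value: unknown; clock: number; clientId: string };
type State = Map<string, PropState>;

// Applying ops is commutative: whatever order a replica sees them in, the same op wins.
function apply(state: State, op: Op): State {
  const key = `${op.nodeId}/${op.prop}`;
  const current = state.get(key);
  const opWins =
    !current ||
    op.clock > current.clock ||
    (op.clock === current.clock && op.clientId > current.clientId); // deterministic tie-break
  if (opWins) state.set(key, { value: op.value, clock: op.clock, clientId: op.clientId });
  return state;
}

const a: Op = { nodeId: "rect1", prop: "fill", value: "#f00", clock: 7, clientId: "alice" };
const b: Op = { nodeId: "rect1", prop: "fill", value: "#00f", clock: 9, clientId: "bob" };

const replica1 = [a, b].reduce(apply, new Map<string, PropState>());
const replica2 = [b, a].reduce(apply, new Map<string, PropState>());
console.log(replica1.get("rect1/fill"), replica2.get("rect1/fill")); // both end at bob's #00f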
Rust and WASM
In-browser editor core compiled to WebAssembly (originally C++)
Multiplayer sync server rewritten in Rust
Near-native performance in the browser
Lessons
- Real-time collab needs a convergence strategy: Google Docs uses OT, Figma a CRDT-inspired model.
- Web is eating native: thanks to WASM.
- Product obsession beats scale: Figma is a product company first.
Part 8 — Notion — Block-Based Document Data Model
Data model
Everything is a block
Page = block of blocks
Text, heading, list, code, ... all blocks
Block data structure:
{
  id: "block_123",
  type: "paragraph",
  content: [...],
  parent: "block_456",
  children: ["block_789", ...],
  properties: {...}
}
Key point: it is a recursive tree. Every user's data worldwide shares this same structure (traversal sketch below).
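A small traversal sketch over that shape, with content simplified to a string and storage details omitted:

// block-render-sketch.ts: walk a Notion-style block tree recursively
type Block = {
  id: string;
  type: string;
  content: string;        // simplified; the real model holds rich content
  parent: string | null;
  children: string[];
};

const blocks = new Map<string, Block>([
  ["block_456", { id: "block_456", type: "page", content: "My page", parent: null, children: ["block_123"] }],
  ["block_123", { id: "block_123", type: "paragraph", content: "Hello", parent: "block_456", children: ["block_789"] }],
  ["block_789", { id: "block_789", type: "code", content: "print('hi')", parent: "block_123", children: [] }],
]);

// A page is just a block whose children are blocks, so rendering is plain recursion.
function render(id: string, depth = 0): string {
  const b = blocks.get(id);
  if (!b) return "";
  const line = `${"  ".repeat(depth)}[${b.type}] ${b.content}`;
  return [line, ...b.children.map(c => render(c, depth + 1))].join("\n");
}

console.log(render("block_456"));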
Postgres + caching
Primary: Postgres (all blocks)
Cache: Redis, Memcached
Search: Elasticsearch
AI: OpenAI embeddings
2024 growth: AI features drove Postgres load up sharply; the sharding scheme keeps evolving.
Lessons
- Data model is everything: a well-designed schema scales.
- Blocks = flexibility: block-centric beats page-centric.
- Postgres is powerful: no NoSQL needed to reach Notion's scale.
Part 9 — Spotify — Team Structure and Architecture
Spotify Model (controversial)
Squad: team (product area)
Tribe: group of Squads
Chapter: same role (e.g., backend engineers)
Guild: interest-based community
2024 reality: Spotify itself admitted "we don't actually follow this model." Lesson: org structure varies by company. Do not copy the template.
Backstage — internal developer portal
2020 open-sourced: internal tool released to the world
Features: service catalog, docs, TechDocs, templates
CNCF: incubating project
2025 reality: the standard tool for Platform Engineering.
Event-driven
Kafka-centric architecture
Play event → broadcast to dozens of services (publish sketch below)
Recommendations are built on event aggregation
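A sketch of publishing such an event, using the kafkajs client as an illustration; the topic name and event shape are assumptions:

// play-event-sketch.ts: publish a play event that any downstream service can consume
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "player-service", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function publishPlayEvent(userId: string, trackId: string): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: "track.played",
    messages: [
      {
        key: userId, // same user always lands on the same partition, preserving order
        value: JSON.stringify({ userId, trackId, playedAt: Date.now() }),
      },
    ],
  });
  await producer.disconnect();
}

// Recommendations, royalties, and analytics each consume "track.played" on their own;
// adding a new consumer never requires touching the producer.
publishPlayEvent("user_42", "track_99").catch(console.error);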
Lessons
- The "Spotify Model" myth: organizational structures don't generalize.
- Internal platform → OSS: the Backstage move.
- Event streaming = flexibility: new services plug in easily.
Part 10 — Airbnb — Monolith → Services → Monolith?
Journey
2008-2017: Rails monolith
2017-2022: microservices migration (complexity explosion)
2022+: partial consolidation ("macroservices")
Lesson: publicly admitted — "we split too far into microservices, we're merging some back."
Inventions
- Airflow: data pipeline scheduler (2015, now standard)
- Lottie: vector animation (frontend)
- Knowledge Graph: search/recommendations
Lessons
- Over-splitting is dangerous: MSA is not free.
- Retrospective culture: admit mistakes, then fix them.
- Side tools build the brand: Airflow is part of Airbnb's image.
Part 11 — 10 Common Patterns
Patterns extracted from these cases:
- Monoliths are fine: Stripe, Shopify, GitHub — large scale with monoliths.
- Shard via Pods: Shopify Pod, Netflix region.
- Event-driven: Kafka at the center (Spotify, Shopify, Netflix).
- Own CDN/infrastructure: Netflix Open Connect, Cloudflare's full stack.
- Chaos Engineering: Netflix origin, industry-wide.
- Mix languages: Rust (perf), Elixir/Erlang (concurrency), Go (infra).
- GitHub flow: PR review + CI + auto deploy.
- Open source → standard: Airflow, Kubernetes, Backstage, Sorbet.
- Public postmortems: Cloudflare, GitLab transparency.
- Build your own tools: at scale, always.
Part 12 — What to Copy vs Not Copy
Copy
- Strong tests + CI/CD
- Postmortem culture
- Idempotency (Stripe style)
- Feature Flags
- Observability (logs/metrics/traces)
- DORA metrics
Don't copy
- 7,000 microservices (Netflix) — not your scale
- Build your own CDN (Netflix) — pointless unless hosting bill is huge
- Spotify Model verbatim — your context differs
- Rewrite everything in Rust — without a real bottleneck there is no point
- Kubernetes everywhere — overkill for small startups
Part 13 — Small Company Cases
Plausible Analytics (10 people):
- Elixir monolith
- ClickHouse for analytics
- A few production servers
- Open source, bootstrapped
- $2M+ ARR
Gumroad (few dozen):
- Rails monolith
- Postgres, Redis
- Heroku → AWS
- Simple. That's the point.
Bytebase (few dozen):
- Go + React
- SQLite (embedded) + Postgres option
- CNCF-oriented
Lesson: the smaller the team, the stronger simplicity is as a weapon. A monolith + one DB can still reach $100M+.
Part 14 — 12 Checklist Items
- Compare your company against the case studies
- Adopt a slice of Netflix Chaos Engineering (start simple)
- Implement Stripe's Idempotency Key
- Read Cloudflare's public postmortems
- Apply Shopify's Pod isolation concept
- Use Rust on hot paths (Discord style)
- Figma's CRDT (when real-time collab is needed)
- Learn Notion's data model
- Spotify's Backstage (Platform)
- Consider Airbnb's "macroservices"
- Choose tools to match company size
- Adopt a postmortem template for your team
Part 15 — 10 Antipatterns
- Mimicking Netflix (big-company mode) → not your scale
- 100 services for 100 engineers → unmanageable
- Building your own DB/language → waste below 10,000-person scale
- Trend-chasing → stability sacrificed
- Secret postmortems → repeat the same mistakes
- Chaos Engineering without prep → just outages
- Mandating Kubernetes → simplicity lost
- Microservices + strong consistency → distributed transaction hell
- Rust everywhere → hiring problem
- Blindly copying Spotify Model → ignores context
Closing — "On the Shoulders of Giants"
Netflix, Stripe, and Cloudflare aren't collections of geniuses. What they have is:
- Years of repeated mistakes and learning
- Transparent postmortems (abundant public material)
- Principles discovered at enormous scale
If your company is small, simplicity is your weapon. Plenty of large companies now regret starting with microservices while they were still small.
Next — Season 3 Ep 2 — dissecting famous postmortems: Cloudflare 2022, Fastly 2021, AWS 2017, Heroku DNS issues, Knight Capital $440M in 8 minutes.
Learning from failure is the fastest growth.
Next — "Dissecting Famous Postmortems: Cloudflare, Fastly, AWS, Knight Capital Failures"
Season 3 Ep 2 covers:
- Cloudflare 2022-06 (BGP), 2019-07 (Regex)
- Fastly 2021-06 (one customer config → half the internet down)
- AWS S3 2017-02 (one typo → us-east-1 down)
- Knight Capital 2012 ($440M lost in 8 minutes)
- GitLab 2017 (DB wipe)
- Common patterns
Failure teaches faster. See you in the next post.