Prologue — "22 theory posts done. How does reality run?"
Season 2 covered principles and patterns. Season 3 is about real-world cases.
- Why does Netflix run 7,000+ microservices?
- Why is Stripe still a Ruby monolith?
- How does Cloudflare handle 32 trillion requests per day?
- How does Shopify absorb $10B in Black Friday sales?
Everything below comes from public engineering blogs, conference talks, and postmortems. No insider info.
Season 3 Ep 1 — architecture case studies.
Part 1 — Netflix — An Orchestra of 7,000 Microservices
Scale (2024 public data)
- MAU: 270M
- Concurrent streams: tens of millions
- Traffic share: ~15% of the internet
- Microservices: 7,000+ (estimated)
- AWS EC2 instances: millions
- Cassandra clusters: thousands
- Kafka messages: trillions per day
Architecture stack
Edge: Zuul (API Gateway) → migrating to Spring Cloud Gateway
Service discovery / load balancing: in-house (Eureka, Ribbon) → moving to gRPC
DB: Cassandra (primary), DynamoDB, EVCache (Memcached)
Caching: EVCache (in-house), Moneta
Streaming: Kafka + Flink
Compute: AWS EC2 + Titus (in-house container platform)
Chaos Engineering: Chaos Monkey (invented here)
Chaos Engineering — Netflix's invention
Chaos Monkey: randomly kills production servers (sketch below)
Chaos Gorilla: takes down an AZ
Chaos Kong: takes down a region
Philosophy: "Failure is inevitable. Rehearse it to build resilience." Result: no major outage in 10+ years (survived the 2023 AWS us-east-1 incident).
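To make the idea concrete, here is a minimal TypeScript sketch of the Chaos Monkey concept. The terminateInstance() call and the hard-coded instance list are hypothetical stand-ins, not Netflix's real tooling:

// chaos-monkey-sketch.ts: kill one random instance per run so teams build for failure
type Instance = { id: string; group: string };

const instances: Instance[] = [
  { id: "i-0a1", group: "api" },
  { id: "i-0b2", group: "api" },
  { id: "i-0c3", group: "playback" },
];

// Stand-in for a real cloud call (e.g. an EC2 terminate request).
async function terminateInstance(id: string): Promise<void> {
  console.log(`terminating ${id} -- every service must tolerate this at any moment`);
}

async function unleashMonkey(group: string): Promise<void> {
  const targets = instances.filter(i => i.group === group);
  if (targets.length === 0) return;
  const victim = targets[Math.floor(Math.random() * targets.length)];
  await terminateInstance(victim.id);
}

unleashMonkey("api");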
Encoding strategy
One movie → 120+ variants (resolution x bitrate x audio x HDR x subtitles)
Per-title encoding: analyze optimal bitrate per film
Per-shot encoding: different bitrate per scene → 20% bandwidth saved (sketch below)
AV1 codec: activated in 2024, 20-30% more efficient than H.264
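A toy sketch of the per-shot idea in TypeScript; the complexity scores and bitrate numbers are invented purely for illustration:

// per-shot-encoding-sketch.ts: pick a bitrate per shot from its visual complexity
type Shot = { start: number; end: number; complexity: number }; // complexity in [0, 1]

// Simpler shots get a lower bitrate at the same perceived quality; numbers are made up.
function bitrateForShot(shot: Shot): number {
  const maxKbps = 6000; // budget reserved for the most complex shots
  const minKbps = 1500; // floor so even static scenes stay clean
  return Math.round(minKbps + (maxKbps - minKbps) * shot.complexity);
}

const shots: Shot[] = [
  { start: 0, end: 12, complexity: 0.2 },  // static dialogue scene
  { start: 12, end: 20, complexity: 0.9 }, // fast action scene
];

console.log(shots.map(s => ({ ...s, kbps: bitrateForShot(s) })));
// the dialogue shot gets ~2400 kbps, the action shot ~5550 kbps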
Open Connect — in-house CDN
Why: commercial CDNs couldn't match scale/cost
How: Netflix servers placed inside ISPs worldwide
Effect: minimal backbone load, reduced latency
Scale: 18,000+ servers in 175+ countries
Deployment — Spinnaker
Built in 2014 at Netflix, open-sourced. Pre-GitOps era CD. 2025 reality: Spinnaker fading, ArgoCD dominant. Netflix is modernizing internally.
Lessons
- Scale defines architecture: 270M users demand microservices.
- Expect failure: Chaos Engineering treats outages as "when", not "if".
- Build aggressively: CDN, container platform, all in-house.
- Data-driven detail: per-shot encoding is micro-optimization at macro-scale.
- But 7,000 services is not your target — it is Netflix's necessity.
Part 2 — Stripe — The Monolith Wins
Scale (2024 public)
- Payment volume: $1T+ per year (2023)
- Customers: millions (Uber, Amazon, Google, OpenAI, ...)
- Ruby monolith: tens of millions of lines
- Employees: 8,000+
"A company this big, still a monolith?"
Stripe proudly says: "We are a monolith."
Reasons:
1. Payment domain needs strong consistency (ACID)
2. Microservices = distributed transactions = complexity
3. Developer productivity: one repo, one deploy
4. Strong tests → refactor freely
Actual stack
Sorbet (Ruby type checker) — built by Stripe
Rails-based monolith
Thousands of endpoints
Postgres (sharded) + MongoDB (some legacy)
Kafka for events
Spark for analytics
Sorbet adds static typing to Ruby. Open-sourced in 2019. Same role TypeScript plays for JS.
API versioning
Date-based: Stripe-Version: 2024-12-18
New version whenever there is a breaking change
Every version kept forever (v1 from 2011 still works)
Internals (sketched below):
- Request → version translator → current internal schema
- Current internal → version translator → Response
- 10+ years of versions vs customer trust
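A simplified TypeScript sketch of the translator chain. The version dates, field renames, and the toCurrentSchema() helper are hypothetical, not Stripe's real changelog; responses go back through the same chain in reverse:

// api-versioning-sketch.ts: upgrade a request pinned to an old version into the current schema
type ChargeRequest = Record<string, unknown>;

// Each entry upgrades a request from that dated version to the next one.
const upgraders: Record<string, (req: ChargeRequest) => ChargeRequest> = {
  "2022-11-15": ({ amount, ...rest }) => ({ ...rest, amount_total: amount }),
  "2023-10-16": req => ({ ...req, currency: String(req.currency ?? "").toLowerCase() }),
};

const versions = ["2022-11-15", "2023-10-16", "2024-12-18"]; // oldest → current

function toCurrentSchema(req: ChargeRequest, pinnedVersion: string): ChargeRequest {
  // Apply every upgrader from the caller's pinned version up to the current one.
  return versions
    .slice(versions.indexOf(pinnedVersion))
    .reduce((r, v) => (upgraders[v] ? upgraders[v](r) : r), req);
}

console.log(toCurrentSchema({ amount: 1000, currency: "USD" }, "2022-11-15"));
// → { currency: "usd", amount_total: 1000 }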
Idempotency
Every POST supports an Idempotency-Key header. Redis and the DB retain the key and its result for 24 hours. Concurrent requests with the same key are locked to a single execution. Stripe effectively set the industry standard for idempotency. A minimal sketch follows.
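This sketch uses an in-memory Map standing in for Redis + DB, and TTL handling is omitted:

// idempotency-sketch.ts: one execution per Idempotency-Key, replayed response on retry
type Entry = { status: "in_progress" } | { status: "done"; response: string };
const store = new Map<string, Entry>(); // stands in for Redis + DB

async function handleCharge(idempotencyKey: string, doCharge: () => Promise<string>): Promise<string> {
  const existing = store.get(idempotencyKey);
  if (existing) {
    if (existing.status === "done") return existing.response; // replay the saved result
    throw new Error("409: a request with this key is already in flight");
  }
  store.set(idempotencyKey, { status: "in_progress" });    // lock the key
  const response = await doCharge();                        // execute exactly once
  store.set(idempotencyKey, { status: "done", response });  // kept ~24h in the real system
  return response;
}

// The retry never calls its doCharge; it gets the first response back instead.
handleCharge("key_abc", async () => "charge_created")
  .then(() => handleCharge("key_abc", async () => { throw new Error("double charge"); }))
  .then(console.log); // "charge_created"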
Data infrastructure
Pervasive Caching: each service owns its cache
Event sourcing: payment history is immutable
Double-entry bookkeeping: accounting style (balances = sum of events; sketch below)
Strong consistency: Spanner-like in-house implementation
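A small sketch of "balances = sum of events"; the account names and event shape are illustrative, not Stripe's actual ledger schema:

// ledger-sketch.ts: balances derived by summing immutable double-entry events
type Entry = { account: string; amountCents: number };  // positive = credit, negative = debit
type LedgerEvent = { id: string; entries: Entry[] };    // entries within one event sum to zero

const events: LedgerEvent[] = [
  { id: "evt_1", entries: [{ account: "customer", amountCents: -5000 }, { account: "merchant", amountCents: 5000 }] },
  { id: "evt_2", entries: [{ account: "merchant", amountCents: -150 }, { account: "fees", amountCents: 150 }] },
];

// Never mutate a balance in place: fold over the append-only event history instead.
function balance(account: string): number {
  return events
    .flatMap(e => e.entries)
    .filter(en => en.account === account)
    .reduce((sum, en) => sum + en.amountCents, 0);
}

console.log(balance("merchant")); // 4850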
Lessons
- A monolith can process $1T. Microservices aren't mandatory.
- Build your own tools: Sorbet, Skycfg (Starlark config).
- APIs are contracts: design for 10-year maintenance.
- Idempotency is table stakes.
- Developer experience is the product.
Part 3 — Cloudflare — Ruler of the Edge
Scale (2024)
- Network: 300+ cities, 120+ countries
- Traffic: 32 trillion requests/day
- DNS: 30+ quadrillion queries/day
- DDoS defense: ~209B cyber threats blocked per day (2024)
- Workers: 7M+ deploys daily
Core product stack
Edge: in-house hardware + software stack
Network: Anycast (1.1.1.1 — one IP, 300 PoPs)
DDoS: L3/L4/L7 layered defense
Workers: V8 isolate-based serverless
Workers AI: edge inference (LLM)
R2: S3-compatible object storage (zero egress fees is the differentiator)
D1: SQLite-based distributed DB
Durable Objects: stateful serverless
Hyperdrive: DB connection pooler at edge
Tunnel: Zero Trust private network
Workers — why V8 isolate?
Lambda: container → 100ms+ cold start
Workers: V8 isolate → 5ms cold start
Why:
- V8 isolates are tiny (few MB memory)
- Thousands of isolates in one process
- Each isolate is sandboxed from the others (the same V8 engine browsers run)
Constraints: only a subset of Node APIs (Web standards first). No filesystem. 50ms CPU limit (30s on paid). A minimal handler is sketched below.
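A minimal Workers handler in the module syntax; the route and origin URL are placeholders:

// worker.ts: a minimal Cloudflare Workers handler (module syntax)
// Runs inside a V8 isolate: Web-standard APIs only (fetch, Request, Response), no filesystem.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname === "/hello") {
      return new Response(JSON.stringify({ msg: "hi from the edge" }), {
        headers: { "content-type": "application/json" },
      });
    }
    // Everything else is proxied to an origin; the isolate itself starts in milliseconds.
    return fetch("https://origin.example.com" + url.pathname);
  },
};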
Network stack highlights
Quicksilver: in-house KV store (ultra-fast config distribution)
Unimog: L4 load balancer
BGP: global announcements, Argo Smart Routing
June 2022 outage postmortem (famous)
What: 19 of Cloudflare's busiest data centers down, a large share of global traffic affected
Cause: a BGP configuration error during a planned network architecture rollout
Duration: ~75 minutes
Lesson: stage risky network changes and roll them out to less critical sites first
Value: the postmortem was public — the whole industry learned.
Lessons
- Edge is the future: distribute globally, not centrally.
- Own the stack end-to-end: from network hardware to runtime.
- Transparent postmortems build trust.
- Free tier is strategic: developer lock-in.
- Differentiation from other CDNs: developer platform (Workers) is the core.
Part 4 — Shopify — King of Black Friday
Scale (2023-2024)
- Merchants: 5M+
- Black Friday/Cyber Monday 2023: $9.3B in merchant sales (4 days)
- Peak RPS: 700K+ requests/sec
- Orders: tens of thousands per minute
- GMV: $300B+ annually
Architecture stack
Rails monolith: "Shopify Core"
Ruby: primary language (heavy YJIT usage)
MySQL: sharded, Vitess-like in-house implementation
Kafka: event streaming
Go, Elixir, Rust: specialized services
Kubernetes (partial), GCP + in-house data centers
Pods — Shopify's sharding
Pod: isolated unit grouping merchants
Each Pod: MySQL shard + Redis + other resources
One Pod fails → other Pods unaffected (blast radius limited)
Dozens of Pods worldwide
Effect: a single customer saturating a DB does not affect others (routing sketch below).
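A routing sketch of the Pod idea; the pod list and the merchant-to-pod table are hypothetical, not Shopify's real topology:

// pod-routing-sketch.ts: route every request for a merchant to that merchant's pod
type Pod = { id: number; mysqlDsn: string; redisUrl: string };

const pods: Pod[] = [
  { id: 1, mysqlDsn: "mysql://pod1/shopify", redisUrl: "redis://pod1" },
  { id: 2, mysqlDsn: "mysql://pod2/shopify", redisUrl: "redis://pod2" },
];

// In practice this mapping is persisted so merchants can be rebalanced between pods.
const merchantToPod = new Map<string, number>([
  ["shop_a", 1],
  ["shop_b", 2],
]);

function podFor(merchantId: string): Pod {
  const pod = pods.find(p => p.id === merchantToPod.get(merchantId));
  if (!pod) throw new Error(`no pod assigned for ${merchantId}`);
  return pod; // every read and write for this merchant stays inside this pod
}

console.log(podFor("shop_b").mysqlDsn); // traffic from shop_b never touches pod 1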
Black Friday preparation
1. Expected traffic: 4-10x normal
2. Rehearsals start 6 months out
3. Game days: deliberate failure simulation
4. Capacity planning: efficiency gains, not just scale-out
5. Feature freeze: major deploys stop 2 weeks prior
6. On-call fully staffed and rehearsed
YJIT — Ruby JIT turning point
2022: YJIT rewritten in Rust
2023+: adopted by Shopify Core
Effect: 15% server CPU reduction → lower server cost
Shopify leads YJIT development (Maxime Chevalier-Boisvert)
Storefront — new frontend stack
Hydrogen: Remix-based SSR framework by Shopify
Oxygen: Shopify's deployment platform
Before: Liquid template engine (in-house)
After: React + TypeScript
Lessons
- Shard by Pod: customer isolation is scaling's friend.
- Black Friday performance is everyday capability multiplied: practice daily.
- Monolith + peripheral micro: Rails + Go/Elixir for specialized parts.
- Invest in language/runtime (YJIT): compounding returns.
- Game day culture: rehearse failure.
Part 5 — Discord — Chat at Tens of Millions Concurrent
Scale
- Registered users: 300M+
- DAU: 150M+
- Messages: 40B+/day
- Voice minutes: tens of billions/month
Interesting language choices
Elixir (Erlang VM): chat service (massive concurrency)
Rust: performance-critical (voice servers, etc.)
Python: ML, some backend
Go, TypeScript: miscellaneous
ScyllaDB replaces Cassandra (2023)
Problem: 177 Cassandra nodes, maintenance hell
Solution: migrate to ScyllaDB (Cassandra rewritten in C++)
Result: 177 → 72 nodes, latency cut 50%, $$$ saved
Lesson: language/runtime swaps can halve infrastructure. Worth the engineering investment.
Lessons
- BEAM (Erlang) power: WhatsApp, Discord prove it.
- Mix languages: match domain to language.
- DB replacement is possible: abstract the interface early.
- Adopt Rust gradually: start from hot paths.
Part 6 — GitHub — Largest Code Host
Scale
- Repos: 500M+
- Developers: 150M+
- Git data: several PB
- Actions: millions of jobs per minute
Architecture evolution
2008: Rails monolith (founded)
~2016: MySQL + Redis + memcached
2020+: Spokes (Git storage), Kafka, Kubernetes gradual adoption
2022-: Azure migration under Microsoft
Notable engineering
Monolith-first: still runs a large Rails app
Spokes: distributed Git storage (in-house)
Codespaces: Kubernetes orchestration for VS Code Server
Copilot: OpenAI collaboration
Lessons
- Rails still works — if done right.
- Git storage is hard: filesystem problem at its core.
- Dogfood the developer platform: Codespaces is GitHub's own dev environment.
Part 7 — Figma — The Real-Time Collaboration Revolution
Technical challenges
Dozens editing the same canvas simultaneously
Latency must feel under 50ms
Offline edits that merge back
Undo/Redo without confusion
CRDT-based co-editing
Conflict-free Replicated Data Types
Each edit is commutative (order-independent; see the merge sketch after this list)
Server is a mediator, final state converges
Figma's data model:
- Node tree (DOM-like)
- Each edit is an operation type
- WebSocket for real-time sync
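A last-writer-wins property map is the simplest structure in this family; the sketch below shows commutative merging and is an illustration, not Figma's actual implementation:

// lww-merge-sketch.ts: a last-writer-wins property map that always converges
type Op = { nodeId: string; prop: string; value: unknown; clock: number; clientId: string };
type PropState = { value: unknown; clock: number; clientId: string };
type State = Map<string, PropState>;

// Applying ops is commutative: whatever order a replica sees them in, the same op wins.
function apply(state: State, op: Op): State {
  const key = `${op.nodeId}/${op.prop}`;
  const current = state.get(key);
  const opWins =
    !current ||
    op.clock > current.clock ||
    (op.clock === current.clock && op.clientId > current.clientId); // deterministic tie-break
  if (opWins) state.set(key, { value: op.value, clock: op.clock, clientId: op.clientId });
  return state;
}

const a: Op = { nodeId: "rect1", prop: "fill", value: "#f00", clock: 7, clientId: "alice" };
const b: Op = { nodeId: "rect1", prop: "fill", value: "#00f", clock: 9, clientId: "bob" };

const replica1 = [a, b].reduce(apply, new Map<string, PropState>());
const replica2 = [b, a].reduce(apply, new Map<string, PropState>());
console.log(replica1.get("rect1/fill"), replica2.get("rect1/fill")); // both end at bob's #00f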
Rust and WASM
In-browser editor core compiled to WebAssembly (originally C++)
Multiplayer sync server rewritten in Rust
Near-native performance in the browser
Lessons
- Real-time collab needs a convergence strategy: Google Docs uses OT, Figma a CRDT-inspired model.
- Web is eating native: thanks to WASM.
- Product obsession beats scale: Figma is a product company first.
Part 8 — Notion — Block-Based Document Data Model
Data model
Everything is a block
Page = block of blocks
Text, heading, list, code, ... all blocks
Block data structure:
{
  id: "block_123",
  type: "paragraph",
  content: [...],
  parent: "block_456",
  children: ["block_789", ...],
  properties: {...}
}
Key point: it is a recursive tree. Every user's data worldwide shares this same structure (traversal sketch below).
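A small traversal sketch over that shape, with content simplified to a string and storage details omitted:

// block-render-sketch.ts: walk a Notion-style block tree recursively
type Block = {
  id: string;
  type: string;
  content: string;        // simplified; the real model holds rich content
  parent: string | null;
  children: string[];
};

const blocks = new Map<string, Block>([
  ["block_456", { id: "block_456", type: "page", content: "My page", parent: null, children: ["block_123"] }],
  ["block_123", { id: "block_123", type: "paragraph", content: "Hello", parent: "block_456", children: ["block_789"] }],
  ["block_789", { id: "block_789", type: "code", content: "print('hi')", parent: "block_123", children: [] }],
]);

// A page is just a block whose children are blocks, so rendering is plain recursion.
function render(id: string, depth = 0): string {
  const b = blocks.get(id);
  if (!b) return "";
  const line = `${"  ".repeat(depth)}[${b.type}] ${b.content}`;
  return [line, ...b.children.map(c => render(c, depth + 1))].join("\n");
}

console.log(render("block_456"));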
Postgres + caching
Primary: Postgres (all blocks)
Cache: Redis, Memcached
Search: Elasticsearch
AI: OpenAI embeddings
2024 growth: AI features drove Postgres load up sharply; the sharding scheme keeps evolving.
Lessons
- Data model is everything: a well-designed schema scales.
- Blocks = flexibility: block-centric beats page-centric.
- Postgres is powerful: no NoSQL needed to reach Notion's scale.
Part 9 — Spotify — Team Structure and Architecture
Spotify Model (controversial)
Squad: team (product area)
Tribe: group of Squads
Chapter: same role (e.g., backend engineers)
Guild: interest-based community
2024 reality: Spotify itself admitted "we don't actually follow this model." Lesson: org structure varies by company. Do not copy the template.
Backstage — internal developer portal
2020 open-sourced: internal tool released to the world
Features: service catalog, docs, TechDocs, templates
CNCF: incubating project
2025 reality: the standard tool for Platform Engineering.
Event-driven
Kafka-centric architecture
Play event → broadcast to dozens of services (publish sketch below)
Recommendations are built on event aggregation
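A sketch of publishing such an event, using the kafkajs client as an illustration; the topic name and event shape are assumptions:

// play-event-sketch.ts: publish a play event that any downstream service can consume
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "player-service", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function publishPlayEvent(userId: string, trackId: string): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: "track.played",
    messages: [
      {
        key: userId, // same user always lands on the same partition, preserving order
        value: JSON.stringify({ userId, trackId, playedAt: Date.now() }),
      },
    ],
  });
  await producer.disconnect();
}

// Recommendations, royalties, and analytics each consume "track.played" on their own;
// adding a new consumer never requires touching the producer.
publishPlayEvent("user_42", "track_99").catch(console.error);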
Lessons
- The "Spotify Model" myth: organizational structures don't generalize.
- Internal platform → OSS: the Backstage move.
- Event streaming = flexibility: new services plug in easily.
Part 10 — Airbnb — Monolith → Services → Monolith?
Journey
2008-2017: Rails monolith
2017-2022: microservices migration (complexity explosion)
2022+: partial consolidation ("macroservices")
Lesson: publicly admitted — "we split too far into microservices, we're merging some back."
Inventions
- Airflow: data pipeline scheduler (2015, now standard)
- Lottie: vector animation (frontend)
- Knowledge Graph: search/recommendations
Lessons
- Over-splitting is dangerous: MSA is not free.
- Retrospective culture: admit mistakes, then fix them.
- Side tools build the brand: Airflow is part of Airbnb's image.
Part 11 — 10 Common Patterns
Patterns extracted from these cases:
- Monoliths are fine: Stripe, Shopify, GitHub — large scale with monoliths.
- Shard via Pods: Shopify Pod, Netflix region.
- Event-driven: Kafka at the center (Spotify, Shopify, Netflix).
- Own CDN/infrastructure: Netflix Open Connect, Cloudflare's full stack.
- Chaos Engineering: Netflix origin, industry-wide.
- Mix languages: Rust (perf), Elixir/Erlang (concurrency), Go (infra).
- GitHub flow: PR review + CI + auto deploy.
- Open source → standard: Airflow, Kubernetes, Backstage, Sorbet.
- Public postmortems: Cloudflare, GitLab transparency.
- Build your own tools: at scale, always.
Part 12 — What to Copy vs Not Copy
Copy
- Strong tests + CI/CD
- Postmortem culture
- Idempotency (Stripe style)
- Feature Flags
- Observability (logs/metrics/traces)
- DORA metrics
Don't copy
- 7,000 microservices (Netflix) — not your scale
- Build your own CDN (Netflix) — pointless unless hosting bill is huge
- Spotify Model verbatim — your context differs
- Rewrite everything in Rust — without a real bottleneck there is no point
- Kubernetes everywhere — overkill for small startups
Part 13 — Small Company Cases
Plausible Analytics (10 people):
- Elixir monolith
- ClickHouse for analytics
- A few production servers
- Open source, bootstrapped
- $2M+ ARR
Gumroad (few dozen):
- Rails monolith
- Postgres, Redis
- Heroku → AWS
- Simple. That's the point.
Bytebase (few dozen):
- Go + React
- SQLite (embedded) + Postgres option
- CNCF-oriented
Lesson: the smaller the team, the stronger simplicity is as a weapon. A monolith + one DB can still reach $100M+.
Part 14 — 12 Checklist Items
- Compare your company against the case studies
- Adopt a slice of Netflix Chaos Engineering (start simple)
- Implement Stripe's Idempotency Key
- Read Cloudflare's public postmortems
- Apply Shopify's Pod isolation concept
- Use Rust on hot paths (Discord style)
- Figma's CRDT (when real-time collab is needed)
- Learn Notion's data model
- Spotify's Backstage (Platform)
- Consider Airbnb's "macroservices"
- Choose tools to match company size
- Adopt a postmortem template for your team
Part 15 — 10 Antipatterns
- Mimicking Netflix (big-company mode) → not your scale
- 100 services for 100 engineers → unmanageable
- Building your own DB/language → waste below 10,000-person scale
- Trend-chasing → stability sacrificed
- Secret postmortems → repeat the same mistakes
- Chaos Engineering without prep → just outages
- Mandating Kubernetes → simplicity lost
- Microservices + strong consistency → distributed transaction hell
- Rust everywhere → hiring problem
- Blindly copying Spotify Model → ignores context
Closing — "On the Shoulders of Giants"
Netflix, Stripe, and Cloudflare aren't collections of geniuses. What they have is:
- Years of repeated mistakes and learning
- Transparent postmortems (abundant public material)
- Principles discovered at enormous scale
If your company is small, simplicity is your weapon. Plenty of large companies now regret starting with microservices while they were still small.
Next — Season 3 Ep 2 — dissecting famous postmortems: Cloudflare 2022, Fastly 2021, AWS 2017, Heroku DNS issues, Knight Capital $440M in 8 minutes.
Learning from failure is the fastest growth.
Next — "Dissecting Famous Postmortems: Cloudflare, Fastly, AWS, Knight Capital Failures"
Season 3 Ep 2 covers:
- Cloudflare 2022-06 (BGP), 2019-07 (Regex)
- Fastly 2021-06 (one customer config → half the internet down)
- AWS S3 2017-02 (one typo → us-east-1 down)
- Knight Capital 2012 ($440M lost in 8 minutes)
- GitLab 2017 (DB wipe)
- Common patterns
Failure teaches faster. See you in the next post.