Architecture Case Studies — Netflix, Stripe, Cloudflare, Shopify in 2025

Prologue — "22 theory posts done. How does reality run?"

Season 2 covered principles and patterns. Season 3 is about real-world cases.

Why does Netflix run 7,000+ microservices?
Why is Stripe still a Ruby monolith?
How does Cloudflare handle 32 trillion requests per day?
How does Shopify absorb roughly $10B in Black Friday weekend sales?

Everything below comes from public engineering blogs, conference talks, and postmortems. No insider info.

Season 3 Ep 1 — architecture case studies.

Part 1 — Netflix — An Orchestra of 7,000 Microservices

Scale (2024 public data)

- MAU: 270M
- Concurrent streams: tens of millions
- Traffic share: ~15% of the internet
- Microservices: 7,000+ (estimated)
- AWS EC2 instances: millions
- Cassandra clusters: thousands
- Kafka messages: trillions per day

Architecture stack

Edge: Zuul (API Gateway) → migrating to Spring Cloud Gateway
Service Mesh: in-house (Ribbon, Eureka) → moving to gRPC
DB: Cassandra (primary), DynamoDB, EVCache (Memcached)
Caching: EVCache (in-house), Moneta
Streaming: Kafka + Flink
Compute: AWS EC2 + Titus (in-house container platform)
Chaos Engineering: Chaos Monkey (invented here)

Chaos Engineering — Netflix's invention

Chaos Monkey: randomly kills production servers
Chaos Gorilla: takes down an AZ
Chaos Kong: takes down a region

Philosophy: "Failure is inevitable. Rehearse it to build resilience." Result: few platform-wide outages in 10+ years, and the service stayed up through the 2023 AWS us-east-1 incident.
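
The core loop of a Chaos Monkey style tool is small enough to sketch. The version below is an illustration under stated assumptions, not Netflix's implementation: the `Instance` type, the `opted_out` flag, and the `terminate` placeholder are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    group: str
    opted_out: bool = False  # hypothetical flag: owners can exclude a service


def pick_victims(instances, rng, per_group=1):
    """Pick up to `per_group` random, non-opted-out instances from each group."""
    by_group = {}
    for inst in instances:
        if not inst.opted_out:
            by_group.setdefault(inst.group, []).append(inst)
    victims = []
    for group in sorted(by_group):
        candidates = by_group[group]
        victims.extend(rng.sample(candidates, min(per_group, len(candidates))))
    return victims


def terminate(instance):
    # Placeholder: a real tool would call the cloud provider's API here.
    print(f"terminating {instance.instance_id} in {instance.group}")


if __name__ == "__main__":
    fleet = [
        Instance("i-001", "api"),
        Instance("i-002", "api"),
        Instance("i-003", "billing", opted_out=True),
        Instance("i-004", "search"),
    ]
    for victim in pick_victims(fleet, random.Random(42)):
        terminate(victim)
```

Passing the random generator in explicitly keeps a run reproducible, which matters when a termination has to be correlated with an incident afterwards.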

Encoding strategy

One movie → 120+ variants
(resolution x bitrate x audio x HDR x subtitles)

Per-title encoding: analyze optimal bitrate per film
Per-shot encoding: different bitrate per scene → 20% bandwidth saved
AV1 codec: streaming since 2020 (Android) and 2021 (TVs), with rollout still expanding; 20-30% more efficient than H.264
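
The "120+ variants" figure is a product of independent axes, which a few lines make concrete. Every axis value below is an illustrative placeholder chosen so the product comes to 120; none of it is Netflix's real encoding ladder.

```python
from itertools import product

# All axis values are illustrative placeholders, not Netflix's real ladder.
RESOLUTIONS = [480, 720, 1080, 1440, 2160]
BITRATE_TIERS = ["low", "high"]
AUDIO = ["stereo", "5.1", "atmos"]
DYNAMIC_RANGE = ["SDR", "HDR"]
SUBTITLE_MODES = ["none", "burned-in"]


def enumerate_variants():
    """Return every combination of the axes above as a tuple."""
    return list(product(RESOLUTIONS, BITRATE_TIERS, AUDIO, DYNAMIC_RANGE, SUBTITLE_MODES))


def bandwidth_after_saving(gigabytes, saving_ratio):
    """Volume still sent after an encoding improvement saves `saving_ratio` of it."""
    return gigabytes * (1.0 - saving_ratio)


if __name__ == "__main__":
    print(len(enumerate_variants()))            # 5 * 2 * 3 * 2 * 2 combinations
    print(bandwidth_after_saving(100.0, 0.20))  # effect of a 20% per-shot saving
```

Because the count is multiplicative, adding one more value to any axis grows the total by a whole multiple, which is why per-title and per-shot savings compound across the catalog.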

Open Connect — in-house CDN

Why: commercial CDNs couldn't match scale/cost
How: Netflix servers placed inside ISPs worldwide
Effect: minimal backbone load, reduced latency
Scale: 18,000+ servers in 175+ countries

Deployment — Spinnaker

Built at Netflix in 2014 and open-sourced in 2015. A continuous-delivery tool from the pre-GitOps era. 2025 reality: Spinnaker is fading while Argo CD dominates. Netflix is modernizing internally as well.

Lessons

Scale defines architecture: 270M users demand microservices.
Expect failure: Chaos Engineering treats outages as "when", not "if".
Build aggressively: CDN, container platform, all in-house.
Data-driven detail: per-shot encoding is micro-optimization at macro-scale.
But 7,000 services is not your target — it is Netflix's necessity.

Part 2 — Stripe — The Monolith Wins

Scale (2024 public)

- Payment volume: $1T+ per year (2023)
- Customers: millions (Uber, Amazon, Google, OpenAI, ...)
- Ruby monolith: tens of millions of lines
- Employees: 8,000+

"A company this big, still a monolith?"

Stripe proudly says: "We are a monolith."

Reasons:

1. Payment domain needs strong consistency (ACID)
2. Microservices = distributed transactions = complexity
3. Developer productivity: one repo, one deploy
4. Strong tests → refactor freely

Actual stack

Sorbet (Ruby type checker) — built by Stripe
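Ruby monolith (by Stripe's public accounts, built on in-house frameworks rather than Rails)
Thousands of endpoints
Sharded MongoDB-based document store (DocDB) as the main online database, by Stripe's public accounts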
Kafka for events
Spark for analytics

Sorbet adds static typing to Ruby. Open-sourced in 2019. Same role TypeScript plays for JS.

API versioning

Date-based: Stripe-Version: 2024-12-18
New version whenever there is a breaking change
Every version kept forever (v1 from 2011 still works)

Internals:
- Request → version translator → current internal schema
- Current internal → version translator → Response
- Trade-off: the cost of maintaining 10+ years of versions, paid in exchange for customer trust
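
The translator idea can be sketched as a chain of downgrade steps applied between the current internal schema and the version a caller pinned. This is a minimal sketch, not Stripe's code: the field names, the second version date, and the function names are hypothetical; only the `Stripe-Version` date format comes from the text above.

```python
from datetime import date


def _amount_to_int(resp):
    # Hypothetical change: newer schema nests money, older one was flat.
    out = dict(resp)
    money = out.pop("amount")
    out["amount"] = money["value"]
    out["currency"] = money["currency"]
    return out


def _merge_name(resp):
    # Hypothetical change: newer schema split `name` into two fields.
    out = dict(resp)
    out["name"] = f"{out.pop('first_name')} {out.pop('last_name')}"
    return out


# Newest first. Each step converts a response from the schema introduced
# on that date back to the schema that preceded it.
VERSION_CHANGES = [
    (date(2024, 12, 18), _amount_to_int),
    (date(2023, 8, 16), _merge_name),
]


def render_for_version(current_response, pinned_version):
    """Walk back from the current schema to the one the caller pinned."""
    response = current_response
    for since, downgrade in VERSION_CHANGES:
        if pinned_version < since:
            response = downgrade(response)
    return response


if __name__ == "__main__":
    current = {
        "first_name": "Ada",
        "last_name": "Lovelace",
        "amount": {"value": 500, "currency": "usd"},
    }
    pinned = date.fromisoformat("2022-01-01")  # parsed from the Stripe-Version header
    print(render_for_version(current, pinned))
```

The appeal of this shape is that application code only ever sees the current schema; each breaking change adds one small function instead of a branch in every handler.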

Idempotency

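Every POST endpoint accepts an Idempotency-Key header. The key and its result are retained (Redis + DB) for 24 hours. Concurrent requests carrying the same key are serialized with a lock so the operation executes only once. Stripe effectively set the industry standard for idempotency.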
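
The mechanism fits in a small in-memory sketch. It assumes a single process and one coarse lock; `IdempotencyStore` and its parameters are hypothetical names, and a production system would lock per key and persist results in a shared store.

```python
import threading
import time


class IdempotencyStore:
    """In-memory stand-in for a shared key/result store with a TTL."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._results = {}  # key -> (stored_at, result)

    def run(self, key, operation):
        """Run `operation` at most once per key within the TTL window."""
        with self._lock:  # one coarse lock keeps the sketch simple
            now = self._clock()
            cached = self._results.get(key)
            if cached is not None and now - cached[0] < self._ttl:
                return cached[1]  # replay the stored result
            result = operation()
            self._results[key] = (now, result)
            return result


if __name__ == "__main__":
    store = IdempotencyStore()
    charges = []

    def charge():
        charges.append("charged")
        return {"status": "succeeded"}

    store.run("key-123", charge)
    store.run("key-123", charge)  # a client retry with the same key
    print(len(charges))
```

The retry returns the stored result instead of charging twice, which is what lets clients retry safely after a timeout.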

Data infrastructure

Pervasive Caching: each service owns its cache
Event sourcing: payment history is immutable
Double-entry bookkeeping: accounting style (balances = sum of events)
Strong consistency: Spanner-like in-house implementation
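
"Balances = sum of events" can be shown with an append-only double-entry ledger. This is a generic accounting sketch, not Stripe's ledger: the account names and the sign convention (entries in a transaction must sum to zero) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    amount: int  # minor units (cents), signed; a transaction must sum to zero


class Ledger:
    def __init__(self):
        self._transactions = []  # append-only: nothing is edited or deleted

    def post(self, entries):
        """Record a transaction, rejecting it if its entries do not balance."""
        entries = tuple(entries)
        if sum(e.amount for e in entries) != 0:
            raise ValueError("unbalanced transaction")
        self._transactions.append(entries)

    def balance(self, account):
        """A balance is derived by summing every entry for the account."""
        return sum(
            e.amount for txn in self._transactions for e in txn if e.account == account
        )


if __name__ == "__main__":
    ledger = Ledger()
    # A 10.00 charge with a 0.50 fee.
    ledger.post([Entry("customer", -1000), Entry("merchant", 950), Entry("fees", 50)])
    # A refund is a new reversing transaction, never an edit of the old one.
    ledger.post([Entry("customer", 1000), Entry("merchant", -950), Entry("fees", -50)])
    print(ledger.balance("merchant"))
```

Since history is never rewritten, any balance can be re-derived and audited from the event log alone.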

Lessons

A monolith can process $1T. Microservices aren't mandatory.
Build your own tools: Sorbet, Skycfg (Starlark config).
APIs are contracts: design for 10-year maintenance.
Idempotency is table stakes.
Developer experience is the product.

Part 3 — Cloudflare — Ruler of the Edge

Scale (2024)

- Network: 300+ cities, 120+ countries
- Traffic: 32 trillion requests/day
- DNS: 30+ quadrillion queries/day
- DDoS defense: mitigated a record 5.6 Tbps attack (late 2024)
- Workers: 7M+ deploys daily

Core product stack

Edge: in-house hardware + software stack
Network: Anycast (1.1.1.1 — one IP, 300 PoPs)
DDoS: L3/L4/L7 layered defense
Workers: V8 isolate-based serverless
Workers AI: edge inference (LLM)
R2: S3-compatible object storage (zero egress fees is the differentiator)
D1: SQLite-based distributed DB
Durable Objects: stateful serverless
Hyperdrive: DB connection pooler at edge
Tunnel: Zero Trust private network

Workers — why V8 isolate?

Lambda: container → 100ms+ cold start
Workers: V8 isolate → 5ms cold start

Why:
- V8 isolates are tiny (few MB memory)
- Thousands of isolates in one process
- Each isolate has its own heap and globals while sharing one V8 runtime

Constraints: only a subset of Node APIs (Web standards first). No filesystem. CPU time is capped at 10 ms per request on the free plan (30 s by default on paid).

Network stack highlights

Quicksilver: in-house KV store (ultra-fast config distribution)
Unimog: L4 load balancer
BGP: global announcements, Argo Smart Routing

June 2022 outage postmortem (famous)

What: Cloudflare dashboard down worldwide
Cause: BGP route propagation error during a switch upgrade
Duration: 2 hours
Lesson: "Control plane must be distributed" — Argo control plane redesigned

Value: the postmortem was public — the whole industry learned.

Lessons

Edge is the future: distribute globally, not centrally.
Own the stack end-to-end: from network hardware to runtime.
Transparent postmortems build trust.
Free tier is strategic: developer lock-in.
Differentiation from other CDNs: developer platform (Workers) is the core.

Part 4 — Shopify — King of Black Friday

Scale (2023-2024)

- Merchants: 5M+
- Black Friday 2023: $9.3B in merchant sales over the 4-day Black Friday-Cyber Monday weekend
- Peak RPS: 700K+ requests/sec
- Orders: tens of thousands per minute
- GMV: $300B+ annually

Architecture stack

Rails monolith: "Shopify Core"
Ruby: primary language (heavy YJIT usage)
MySQL: sharded, Vitess-like in-house implementation
Kafka: event streaming
Go, Elixir, Rust: specialized services
Kubernetes (partial), GCP + in-house data centers

Pods — Shopify's sharding

Pod: isolated unit grouping merchants
Each Pod: MySQL shard + Redis + other resources
One Pod fails → other Pods unaffected (blast radius limited)
Dozens of Pods worldwide

Effect: a single customer saturating a DB does not affect others.
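
The blast-radius property follows directly from how requests are routed. Below is a minimal sketch of shop-to-pod routing; the `PodRouter` class, its methods, and the lookup-table approach are assumptions for illustration, not Shopify's implementation.

```python
class PodUnavailable(Exception):
    """Raised when a shop's pod is marked unhealthy."""


class PodRouter:
    def __init__(self, shop_to_pod):
        self._shop_to_pod = dict(shop_to_pod)  # explicit table, not a hash
        self._down = set()

    def mark_down(self, pod_id):
        self._down.add(pod_id)

    def route(self, shop_id):
        """Return the pod that owns `shop_id`, or raise if that pod is down."""
        pod_id = self._shop_to_pod[shop_id]
        if pod_id in self._down:
            raise PodUnavailable(f"pod {pod_id} is down")
        return pod_id

    def move_shop(self, shop_id, new_pod_id):
        """Rebalance by reassigning a shop (copying its data is out of scope)."""
        self._shop_to_pod[shop_id] = new_pod_id


if __name__ == "__main__":
    router = PodRouter({"shop-a": 1, "shop-b": 1, "shop-c": 2})
    router.mark_down(1)
    print(router.route("shop-c"))  # shops on pod 2 keep working
```

An explicit table rather than a hash of the shop id is one plausible design choice: it lets a single heavy merchant be moved to a quieter pod without reshuffling everyone else.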

Black Friday preparation

1. Expected traffic: 4-10x normal
2. Rehearsals start 6 months out
3. Game days: deliberate failure simulation
4. Capacity planning: efficiency gains, not just scale-out
5. Feature freeze: major deploys stop 2 weeks prior
6. On-call fully staffed and rehearsed

YJIT — Ruby JIT turning point

2022: YJIT rewritten in Rust
2023+: adopted by Shopify Core
Effect: 15% server CPU reduction → lower server cost
Shopify leads YJIT development (Maxime Chevalier-Boisvert)

Storefront — new frontend stack

Hydrogen: Remix-based SSR framework by Shopify
Oxygen: Shopify's deployment platform
Before: Liquid template engine (in-house)
After: React + TypeScript

Lessons

Shard by Pod: customer isolation is scaling's friend.
Black Friday = your ordinary skill multiplied. Practice daily.
Monolith + peripheral micro: Rails + Go/Elixir for specialized parts.
Invest in language/runtime (YJIT): compounding returns.
Game day culture: rehearse failure.

Part 5 — Discord — Chat at Tens of Millions Concurrent

Scale

- Registered users: 300M+
- MAU: 150M+
- Messages: 40B+/day
- Voice minutes: tens of billions/month

Interesting language choices

Elixir (Erlang VM): chat service (massive concurrency)
Rust: performance-critical (voice servers, etc.)
Python: ML, some backend
Go, TypeScript: miscellaneous

ScyllaDB replaces Cassandra (2023)

Problem: 177 Cassandra nodes, maintenance hell
Solution: migrate to ScyllaDB (a Cassandra-compatible database written in C++)
Result: 177 → 72 nodes, latency cut 50%, $$$ saved

Lesson: language/runtime swaps can halve infrastructure. Worth the engineering investment.

Lessons

BEAM (Erlang) power: WhatsApp, Discord prove it.
Mix languages: match domain to language.
DB replacement is possible: abstract the interface early.
Adopt Rust gradually: start from hot paths.
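
"Abstract the interface early" can look like the sketch below: callers depend on a small protocol, so a backend can be swapped or shadow-written during a migration. The names (`MessageStore`, `DualWriteStore`) and the dual-write approach are assumptions for illustration, not a description of Discord's migration tooling.

```python
from typing import Protocol


class MessageStore(Protocol):
    def insert(self, channel_id: int, message_id: int, body: str) -> None: ...
    def recent(self, channel_id: int, limit: int) -> list: ...


class InMemoryStore:
    """Stand-in backend; a real one would wrap a database driver."""

    def __init__(self):
        self._rows = {}  # channel_id -> {message_id: body}

    def insert(self, channel_id, message_id, body):
        self._rows.setdefault(channel_id, {})[message_id] = body

    def recent(self, channel_id, limit):
        rows = self._rows.get(channel_id, {})
        newest_first = sorted(rows, reverse=True)[:limit]
        return [rows[message_id] for message_id in newest_first]


class DualWriteStore:
    """Writes to both backends, reads from the primary only."""

    def __init__(self, primary, shadow):
        self._primary = primary
        self._shadow = shadow

    def insert(self, channel_id, message_id, body):
        self._primary.insert(channel_id, message_id, body)
        self._shadow.insert(channel_id, message_id, body)

    def recent(self, channel_id, limit):
        return self._primary.recent(channel_id, limit)


if __name__ == "__main__":
    old_db, new_db = InMemoryStore(), InMemoryStore()
    store = DualWriteStore(primary=old_db, shadow=new_db)
    store.insert(1, 100, "hello")
    print(new_db.recent(1, 10))  # the shadow backend already has the row
```

Once the shadow backend is verified, swapping the two constructor arguments moves reads over without touching any caller.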

Part 6 — GitHub — Largest Code Host

Scale

- Repos: 500M+
- Developers: 150M+
- Git data: several PB
- Actions: millions of jobs per minute

Architecture evolution

2008: Rails monolith (founded)
~2016: MySQL + Redis + memcached
2020+: Spokes (Git storage), Kafka, Kubernetes gradual adoption
2022-: Azure migration under Microsoft

Notable engineering

Monolith-first: still runs a large Rails app
Spokes: distributed Git storage (in-house)
Codespaces: Kubernetes orchestration for VS Code Server
Copilot: OpenAI collaboration

Lessons

Rails still works — if done right.
Git storage is hard: filesystem problem at its core.
Dogfood your developer platform: Codespaces is the environment GitHub itself develops in.

Part 7 — Figma — The Real-Time Collaboration Revolution

Technical challenges

Dozens editing the same canvas simultaneously
Latency must feel under 50ms
Offline edits that merge back
Undo/Redo without confusion

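CRDT-inspired co-editing

Conflict-free Replicated Data Types (CRDTs) are the inspiration; by Figma's own account it does not run a pure CRDT
Edits to different properties commute (order-independent); conflicting writes to the same property resolve last-writer-wins
Server is the central authority, and the final state converges on every client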

Figma's data model:
- Node tree (DOM-like)
- Each edit is an operation type
- WebSocket for real-time sync
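
Convergence regardless of arrival order is easiest to see with a last-writer-wins map over (node, property) pairs. This is a generic textbook-style sketch, not Figma's implementation: the `(counter, client_id)` stamp and all names are assumptions.

```python
from itertools import permutations


class LWWMap:
    """Last-writer-wins map keyed by (node_id, property)."""

    def __init__(self):
        self._cells = {}  # (node_id, prop) -> (stamp, value)

    def apply(self, node_id, prop, value, stamp):
        """`stamp` is a (counter, client_id) tuple; the larger stamp wins."""
        key = (node_id, prop)
        current = self._cells.get(key)
        if current is None or stamp > current[0]:
            self._cells[key] = (stamp, value)

    def snapshot(self):
        return {key: value for key, (_, value) in self._cells.items()}


def replay(edits):
    """Apply edits in the given order and return the resulting document."""
    doc = LWWMap()
    for node_id, prop, value, stamp in edits:
        doc.apply(node_id, prop, value, stamp)
    return doc.snapshot()


EDITS = [
    ("rect1", "fill", "red", (1, "alice")),
    ("rect1", "fill", "blue", (2, "bob")),
    ("rect1", "x", 10, (1, "bob")),
]

if __name__ == "__main__":
    states = {tuple(sorted(replay(order).items())) for order in permutations(EDITS)}
    print(len(states))  # 1: every ordering converges to the same state
```

Because the winner depends only on the stamps and not on arrival order, two clients that have seen the same set of edits always hold the same document.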

Rust and WASM

Multiplayer server rewritten in Rust
Editor core written in C++ and compiled to WASM for the browser
Near-native performance

Lessons

Real-time collab needs a convergence strategy: Google Docs uses operational transformation (OT), Figma a CRDT-inspired model.
Web is eating native: thanks to WASM.
Product obsession beats scale: Figma is a product company first.

Part 8 — Notion — Block-Based Document Data Model

Data model

Everything is a block
Page = block of blocks
Text, heading, list, code, ... all blocks

Block data structure:
{
  id: "block_123",
  type: "paragraph",
  content: [...],
  parent: "block_456",
  children: ["block_789", ...],
  properties: {...}
}

Characteristic: a recursive tree. Every user's data worldwide shares this one structure.
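
The block structure maps naturally onto a small dataclass plus a recursive walk. This sketch mirrors the fields shown above, simplifies `content` to a string, and omits `properties`; the helper names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    id: str
    type: str
    content: str = ""
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)  # child block ids, in order


def add_block(store, block):
    """Insert a block and register it with its parent, if any."""
    store[block.id] = block
    if block.parent is not None:
        store[block.parent].children.append(block.id)


def render(store, block_id, depth=0):
    """Return the subtree rooted at `block_id` as indented text lines."""
    block = store[block_id]
    lines = ["  " * depth + f"[{block.type}] {block.content}"]
    for child_id in block.children:
        lines.extend(render(store, child_id, depth + 1))
    return lines


if __name__ == "__main__":
    store = {}
    add_block(store, Block("b1", "page", "Roadmap"))
    add_block(store, Block("b2", "heading", "Q1", parent="b1"))
    add_block(store, Block("b3", "bulleted_list", "Ship search", parent="b2"))
    print("\n".join(render(store, "b1")))
```

Storing children as ids rather than nested objects is what lets every block live as one row in a single table while still forming a tree.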

Postgres + caching

Primary: Postgres (all blocks)
Cache: Redis, Memcached
Search: Elasticsearch
AI: OpenAI embeddings

2024 growth: expanding AI features sharply increased Postgres load, and the sharding scheme kept evolving.

Lessons

Data model is everything: a well-designed schema scales.
Blocks = flexibility: block-centric beats page-centric.
Postgres is powerful: no NoSQL needed to reach Notion's scale.

Part 9 — Spotify — Team Structure and Architecture

Spotify Model (controversial)

Squad: team (product area)
Tribe: group of Squads
Chapter: same role (e.g., backend engineers)
Guild: interest-based community

2024 reality: Spotify itself admitted "we don't actually follow this model." Lesson: org structure varies by company. Do not copy the template.

Backstage — internal developer portal

2020 open-sourced: internal tool released to the world
Features: service catalog, docs, TechDocs, templates
CNCF: incubating project

2025 reality: the standard tool for Platform Engineering.

Event-driven

Kafka-centric architecture
Play event → broadcast to dozens of services
Recommendations are built on event aggregation
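
The fan-out described above can be sketched with an in-process publish/subscribe bus. It stands in for a broker such as Kafka only conceptually (no partitions, offsets, or persistence), and the topic name, event fields, and consumer roles are hypothetical.

```python
from collections import Counter


class EventBus:
    """In-process stand-in for a topic-based broker."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, topic, handler):
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        """Deliver `event` to every handler subscribed to `topic`."""
        for handler in self._handlers.get(topic, []):
            handler(event)


if __name__ == "__main__":
    play_counts = Counter()  # read by a hypothetical recommendation service
    royalty_log = []         # read by a hypothetical royalty service

    bus = EventBus()
    bus.subscribe("play", lambda e: play_counts.update([e["track_id"]]))
    bus.subscribe("play", lambda e: royalty_log.append(e["track_id"]))

    bus.publish("play", {"track_id": "t1", "user_id": "u1"})
    bus.publish("play", {"track_id": "t1", "user_id": "u2"})
    print(play_counts.most_common(1))
```

The producer never learns who consumes the event, so adding another service is one more `subscribe` call rather than a change to the player.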

Lessons

The "Spotify Model" myth: organizational structures don't generalize.
Internal platform → OSS: the Backstage move.
Event streaming = flexibility: new services plug in easily.

Part 10 — Airbnb — Monolith → Services → Monolith?

Journey

2008-2017: Rails monolith
2017-2022: microservices migration (complexity explosion)
2022+: partial consolidation ("macroservices")

Lesson: publicly admitted — "we split too far into microservices, we're merging some back."

Inventions

Airflow: data pipeline scheduler (2015, now standard)
Lottie: vector animation (frontend)
Knowledge Graph: search/recommendations

Lessons

Over-splitting is dangerous: MSA is not free.
Retrospective culture: admit mistakes, then fix them.
Side tools build the brand: Airflow is part of Airbnb's image.

Part 11 — 10 Common Patterns

Patterns extracted from these cases:

Monoliths are fine: Stripe, Shopify, GitHub — large scale with monoliths.
Shard via Pods: Shopify Pod, Netflix region.
Event-driven: Kafka at the center (Spotify, Shopify, Netflix).
Own CDN/infrastructure: Netflix Open Connect, Cloudflare's full stack.
Chaos Engineering: Netflix origin, industry-wide.
Mix languages: Rust (perf), Elixir/Erlang (concurrency), Go (infra).
GitHub flow: PR review + CI + auto deploy.
Open source → standard: Airflow, Kubernetes, Backstage, Sorbet.
Public postmortems: Cloudflare, GitLab transparency.
Build your own tools: at scale, always.

Part 12 — What to Copy vs Not Copy

Copy

Strong tests + CI/CD
Postmortem culture
Idempotency (Stripe style)
Feature Flags
Observability (logs/metrics/traces)
DORA metrics

Don't copy

7,000 microservices (Netflix): not your scale
Build your own CDN (Netflix): wasteful unless your hosting bill is already huge
Spotify Model verbatim: your organizational context differs
Rewrite everything in Rust: pointless without a measured performance bottleneck
Kubernetes everywhere: overkill for small startups

Part 13 — Small Company Cases

Plausible Analytics (10 people):

Elixir monolith
ClickHouse for analytics
A few production servers
Open source, bootstrapped
$2M+ ARR

Gumroad (few dozen):

Rails monolith
Postgres, Redis
Heroku → AWS
Simple. That's the point.

Bytebase (few dozen):

Go + React
SQLite (embedded) + Postgres option
CNCF-oriented

Lesson: the smaller the team, the stronger simplicity is as a weapon. A monolith + one DB can still reach $100M+.

Part 14 — 12 Checklist Items

Part 15 — 10 Antipatterns

Mimicking Netflix (big-company mode) → not your scale
100 services for 100 engineers → unmanageable
Building your own DB/language → waste below 10,000-person scale
Trend-chasing → stability sacrificed
Secret postmortems → repeat the same mistakes
Chaos Engineering without prep → just outages
Mandating Kubernetes → simplicity lost
Microservices + strong consistency → distributed transaction hell
Rust everywhere → hiring problem
Blindly copying Spotify Model → ignores context

Closing — "On the Shoulders of Giants"

Netflix, Stripe, and Cloudflare aren't simply collections of geniuses. What they have is:

Years of repeated mistakes and learning
Transparent postmortems (abundant public material)
Principles discovered at enormous scale

If your company is small, simplicity is your weapon. Large companies often look back and ask, "Why did we start with microservices when we were small?"

Next — Season 3 Ep 2 — dissecting famous postmortems: Cloudflare 2022, Fastly 2021, AWS 2017, Heroku DNS issues, Knight Capital $440M in 8 minutes.

Learning from failure is the fastest growth.

Next — "Dissecting Famous Postmortems: Cloudflare, Fastly, AWS, Knight Capital Failures"

Season 3 Ep 2 covers:

Cloudflare 2022-06 (BGP), 2019-07 (Regex)
Fastly 2021-06 (one customer config → half the internet down)
AWS S3 2017-02 (one typo → us-east-1 down)
Knight Capital 2012 ($440M lost in 8 minutes)
GitLab 2017 (DB wipe)
Common patterns

Failure teaches faster. See you in the next post.