How Is Large-Scale Software Developed, Managed, and Maintained? — Including SW Update Systems

Introduction

The Chrome browser, Android OS, and Netflix app that we use daily are massive software systems, each comprising tens of millions of lines of code (and, at the repository level, sometimes billions). They are not created by a single genius but by thousands of developers building them in parallel.

How exactly is such massive software developed, deployed, updated, and maintained long-term?

This post covers the core principles of large-scale software development comprehensively, from architecture to update system implementation.


1. Defining Large-Scale Software

When Does It Become "Large-Scale"?

Generally, software meeting one or more of these criteria is classified as large-scale:

Criterion             Example
Code size             Millions to billions of lines
Development staff     Hundreds to thousands of engineers
Number of services    Hundreds to thousands of microservices
User base             Hundreds of millions+ DAU
Deployment frequency  Dozens to hundreds of deployments per day

Real-World Examples

  • Google: A single monorepo with approximately 2 billion+ lines of code. Tens of thousands of commits daily by 25,000+ developers.
  • Meta (Facebook): Tens of thousands of engineers collaborate in a single repo. Uses Buck, their custom build system.
  • Netflix: Operates with approximately 1,000+ microservices, performing thousands of deployments per day. Pioneer of Chaos Engineering.

2. Monorepo vs Multi-repo

The first decision in managing large-scale software is the code repository strategy.

Monorepo

All project code managed in a single repository.

company-repo/
  frontend/
  backend/
  mobile/
  infra/
  shared-libs/
  tools/

Advantages: Easy code sharing, atomic commits possible, easier refactoring, consistent tooling

Disadvantages: Repository size becomes massive, complex build systems needed, access control challenges

Multi-repo

Each project, service, or library managed in independent repositories.

Advantages: Independent development cycles per team, small repos with fast cloning, clear service boundaries

Disadvantages: Cumbersome code sharing, dependency hell (the diamond dependency problem), difficult cross-project refactoring

Enterprise Choices

Company     Strategy            Tools
Google      Monorepo            Piper (custom VCS) + Bazel
Meta        Monorepo            Mercurial + Buck
Netflix     Multi-repo          Gradle + Nebula
Microsoft   Monorepo (partial)  VFS for Git (GVFS)
Amazon      Multi-repo          Brazil (custom build system)

The key is not that one strategy is absolutely better, but choosing the strategy that fits your organization's scale and culture.


3. Microservices Architecture

From Monolith to Microservices

Early startups typically start with a single application (monolith). As scale grows, problems emerge: build times exceeding 30 minutes, a single bug crashing the entire system, frequent code conflicts between teams, inability to scale specific functions independently.

Microservices Architecture (MSA) addresses these problems.

Core MSA Components

API Gateway - Single entry point for all client requests. Handles authentication, routing, rate limiting, and logging.

# API Gateway routing example
routes:
  - path: /api/users
    service: user-service
    methods: [GET, POST]
  - path: /api/orders
    service: order-service
    methods: [GET, POST, PUT]
  - path: /api/payments
    service: payment-service
    methods: [POST]

Service Mesh - Infrastructure layer managing service-to-service communication. Uses tools like Istio and Linkerd.

# Istio VirtualService example
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
    - user-service
  http:
    - route:
        - destination:
            host: user-service
            subset: v2
          weight: 90
        - destination:
            host: user-service
            subset: v1
          weight: 10

Service Discovery - Mechanism enabling hundreds of services to find each other. Consul, Eureka, Kubernetes DNS.
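The lookup mechanism can be sketched as a TTL-based in-memory registry. This is a toy sketch with illustrative class and method names; real systems such as Consul add health checks, consensus, and a DNS interface.

```python
import time

class ServiceRegistry:
    """Minimal in-memory service registry (illustrative, not production)."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.instances = {}  # service name -> {address: last heartbeat time}

    def register(self, service, address):
        # Each instance re-registers itself periodically (a heartbeat).
        self.instances.setdefault(service, {})[address] = time.time()

    def lookup(self, service):
        # Return only instances whose heartbeat is still fresh;
        # crashed instances silently age out after the TTL.
        now = time.time()
        live = self.instances.get(service, {})
        return [addr for addr, seen in live.items() if now - seen <= self.ttl]

registry = ServiceRegistry(ttl_seconds=30)
registry.register("user-service", "10.0.0.5:8080")
registry.register("user-service", "10.0.0.6:8080")
print(registry.lookup("user-service"))
```

Callers resolve a logical name to live addresses at request time instead of hardcoding hosts, which is what lets hundreds of services find each other as instances come and go.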

Circuit Breaker - Pattern preventing cascade failures when a specific service fails.

# Circuit breaker (runnable sketch; real implementations add thread
# safety and limits on HALF_OPEN probe requests)
import time

class CircuitOpenError(Exception):
    """Raised while the breaker is open and calls are rejected."""

class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = threshold  # consecutive failures before opening
        self.timeout = timeout      # seconds to stay open before probing
        self.opened_at = None
        self.state = "CLOSED"       # CLOSED, OPEN, HALF_OPEN

    def timeout_expired(self):
        return time.time() - self.opened_at >= self.timeout

    def call(self, func):
        if self.state == "OPEN":
            if self.timeout_expired():
                self.state = "HALF_OPEN"  # let one probe request through
            else:
                raise CircuitOpenError("Service unavailable")

        try:
            result = func()
        except Exception:
            self.on_failure()
            raise
        self.on_success()
        return result

    def on_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.threshold or self.state == "HALF_OPEN":
            self.state = "OPEN"
            self.opened_at = time.time()

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

4. CI/CD Pipeline

Continuous Integration (CI)

When a developer pushes code, build, test, and static analysis run automatically.

# GitHub Actions CI Pipeline example
name: CI Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install dependencies
        run: npm ci
      - name: Lint
        run: npm run lint
      - name: Unit Tests
        run: npm run test:unit
      - name: Integration Tests
        run: npm run test:integration
      - name: Build
        run: npm run build

Continuous Deployment (CD) - Deployment Strategies

Canary Deployment - Deploy the new version to a subset (e.g., 1-5%) of users first, then gradually increase if no issues.

Traffic distribution:
  v1 (current): 95%  ████████████████████░
  v2 (new):      5%  █░░░░░░░░░░░░░░░░░░░

After 30 min (no issues):
  v1 (current): 50%  ██████████░░░░░░░░░░
  v2 (new):     50%  ██████████░░░░░░░░░░

After 1 hour (no issues):
  v1 (current):  0%  ░░░░░░░░░░░░░░░░░░░░
  v2 (new):    100%  ████████████████████

Blue-Green Deployment - Maintain two identical environments (Blue, Green) and switch traffic at once.

Rolling Update - Replace instances one by one sequentially. Kubernetes default deployment strategy.

# Kubernetes Rolling Update configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    spec:
      containers:
        - name: my-app
          image: my-app:v2
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

5. Feature Flags

Separate Deployment from Release

Feature flags enable deploying code while keeping functionality inactive.

// Feature flag usage example
const featureFlags = {
  newCheckoutFlow: false,
  darkMode: true,
  aiRecommendation: false,
};

function renderCheckout() {
  if (featureFlags.newCheckoutFlow) {
    return renderNewCheckout();
  }
  return renderLegacyCheckout();
}

Feature Flag Applications

Use Case          Description
Gradual Release   Roll out to 1% -> 10% -> 50% -> 100% of users
A/B Testing       Compare two UI variants
Kill Switch       Instantly disable features during outages
Beta Testing      Expose only to specific user groups
Regional Release  Launch in Korea first, then expand globally

Note: Accumulated feature flags become technical debt. Clean up flags after releases are complete.
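A gradual release needs each user to land in the same cohort on every request, or users would flicker between old and new behavior. One common approach is deterministic hash bucketing; this is a minimal sketch, and the function name and hashing scheme are illustrative.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically assign a user to a 0-99 bucket for a feature.

    The same user always gets the same bucket for a given feature, so
    raising `percent` from 1 to 10 to 50 only ever adds users; nobody
    who already has the feature loses it mid-rollout.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# At 0% nobody is included; at 100% everybody is.
assert not in_rollout("user-42", "newCheckoutFlow", 0)
assert in_rollout("user-42", "newCheckoutFlow", 100)
```

Hashing on feature name plus user ID also decorrelates cohorts across features, so the same 1% of users is not the guinea pig for every experiment.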


6. SW Update System Implementation

This section is the core of this post. Delivering updates safely and smoothly to users is as important as developing the software itself.

6-1. Desktop App Updates

Electron auto-updater - Most Electron-based apps (VS Code, Slack Desktop, Discord) use electron-updater.

// Electron auto-updater implementation
import { autoUpdater } from 'electron-updater';
import log from 'electron-log';

autoUpdater.setFeedURL({
  provider: 'github',
  owner: 'my-org',
  repo: 'my-app',
});

autoUpdater.on('update-available', (info) => {
  log.info('New version found:', info.version);
  showUpdateNotification(info);
});

autoUpdater.on('download-progress', (progress) => {
  updateProgressBar(progress.percent);
});

autoUpdater.on('update-downloaded', (info) => {
  showRestartDialog(info.version);
});

// Check for updates every 4 hours
setInterval(() => {
  autoUpdater.checkForUpdates();
}, 4 * 60 * 60 * 1000);

Delta Updates - Download only changed portions instead of the entire file.

Full update:  v1.0 (200MB) -> v1.1 (200MB) = 200MB download
Delta update: v1.0 (200MB) -> v1.1 (200MB) = 15MB download (diff only)

Chrome and VS Code use this approach. Binary diffing algorithms like Courgette (Chrome) are key.
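Whichever mechanism delivers the update, the downloaded artifact should be verified before it is applied. Below is a minimal SHA-256 checksum sketch; the file name and digest source are illustrative, and real updaters additionally verify a cryptographic signature on the release manifest.

```python
import hashlib
from pathlib import Path

def verify_update(file_path: str, expected_sha256: str) -> bool:
    """Compare a downloaded file's SHA-256 against the published digest.

    The expected digest would normally come from a signed release
    manifest fetched over a separate, authenticated channel.
    """
    h = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Hash in 1 MiB chunks so large installers don't load into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()

# Hypothetical usage: write a dummy "update" payload and verify it.
Path("update.bin").write_bytes(b"new version payload")
digest = hashlib.sha256(b"new version payload").hexdigest()
assert verify_update("update.bin", digest)
assert not verify_update("update.bin", "0" * 64)
```

A failed check should abort the update and keep the current version installed, which is also the hook point for the rollback mechanisms discussed in the checklist later.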

6-2. Mobile App Updates

App Store-Based Updates follow the standard submission, review, approval, and distribution flow.

Staged Rollout - Google Play's staged rollout feature allows deploying to only a subset of users first.

Day 1: 1% rollout -> Monitor crash reports
Day 2: 5% rollout -> Check user feedback
Day 3: 20% rollout
Day 5: 50% rollout
Day 7: 100% rollout

OTA (Over-The-Air) Updates - Push JavaScript bundles directly bypassing the app store. Available in React Native via EAS Update.

// React Native OTA Update (EAS Update example)
import * as Updates from 'expo-updates';
import { Alert } from 'react-native';

async function checkForOTAUpdate() {
  try {
    const update = await Updates.checkForUpdateAsync();
    if (update.isAvailable) {
      await Updates.fetchUpdateAsync();
      Alert.alert(
        'Update Complete',
        'A new version is ready. Apply now?',
        [
          { text: 'Later', style: 'cancel' },
          { text: 'Apply Now', onPress: () => Updates.reloadAsync() },
        ]
      );
    }
  } catch (error) {
    console.error('OTA update failed:', error);
  }
}

Note: Apple guidelines prohibit using OTA to change the main functionality or purpose of an app.

6-3. Server Updates (Deployment)

Server software updates follow the CI/CD deployment strategies described earlier.

# ArgoCD GitOps-based server update
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests
    path: services/my-service
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Hot Swap vs Cold Update

Method        Description                              Example
Hot Swap      Replace modules without an app restart   Java class reloading, Erlang hot code loading
Warm Restart  Restart the process only (no OS reboot)  Node.js graceful restart
Cold Update   Full app shutdown, then reinstall        Desktop apps, OS updates

Erlang/OTP is famous for its hot code loading that can replace code in a running system without interruption. Used in systems requiring 99.999% uptime like telephone exchanges.
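Python offers a much more limited analogue of hot code loading via importlib.reload, which re-executes a module's source inside the running process. This is a toy sketch; unlike Erlang, existing objects are not migrated to the new code, and the module name here is invented for the demo.

```python
import importlib
import sys
from pathlib import Path

# Write a toy module to disk, import it, then "hot swap" its code
# without restarting the process.
Path("greeter_demo.py").write_text("def greet():\n    return 'hello from v1'\n")
sys.path.insert(0, ".")

import greeter_demo
assert greeter_demo.greet() == "hello from v1"

# Simulate deploying new code for the same module. The new source has a
# different size, so the stale bytecode cache is not reused on reload.
Path("greeter_demo.py").write_text("def greet():\n    return 'hi from v2'\n")
importlib.reload(greeter_demo)  # re-executes the module in place
assert greeter_demo.greet() == "hi from v2"
```

This is why hot swapping in most languages is confined to development workflows, while Erlang's runtime, which versions loaded code and migrates process state, can do it in production.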


7. Technical Debt Management

Types of Technical Debt

Martin Fowler's technical debt quadrant classifies debt along two axes, deliberate vs. inadvertent and prudent vs. reckless:

Deliberate + Prudent: "Let's ship the simple implementation now and refactor next quarter"
Deliberate + Reckless: "Skip the design, just build it fast"
Inadvertent + Prudent: "Now we know how we should have done it"
Inadvertent + Reckless: "What's a layered architecture?"

Refactoring Strategies

Boy Scout Rule: Leave the code cleaner than you found it.

Golden Path First: Refactor the most-used code paths (Happy Path) first.

Strangler Fig Pattern: Don't replace legacy systems all at once; gradually redirect traffic to the new system.

Phase 1: Legacy(100%) --- New(0%)
Phase 2: Legacy(70%)  --- New(30%)
Phase 3: Legacy(30%)  --- New(70%)
Phase 4: Legacy(0%)   --- New(100%)  <- Remove legacy
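The phased migration above can be sketched as a routing layer that sends migrated path prefixes to the new system and everything else to the legacy one. This is a minimal illustration; the handler functions stand in for real HTTP backends.

```python
def make_router(migrated_prefixes, legacy_handler, new_handler):
    """Route requests to the new system only for migrated path prefixes.

    Migration proceeds by adding prefixes to `migrated_prefixes` over
    time, until the legacy handler receives no traffic and can be removed.
    """
    def route(path: str) -> str:
        if any(path.startswith(p) for p in migrated_prefixes):
            return new_handler(path)
        return legacy_handler(path)
    return route

# Hypothetical handlers standing in for real backends.
def legacy(path):
    return f"legacy:{path}"

def modern(path):
    return f"new:{path}"

migrated = {"/api/users"}            # Phase 2: only users is migrated
route = make_router(migrated, legacy, modern)
assert route("/api/users/42") == "new:/api/users/42"
assert route("/api/orders/7") == "legacy:/api/orders/7"

migrated.add("/api/orders")          # Phase 3: orders migrates too
assert route("/api/orders/7") == "new:/api/orders/7"
```

Because the router, not the clients, decides where traffic goes, each phase is a small reversible configuration change rather than a big-bang cutover.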

8. Incident Response

Postmortem

An analysis document written after an incident. The core principle is a blameless culture: focus on systemic causes rather than on individuals.

# Postmortem Template

## Incident Summary
- Occurred: 2026-04-10 14:23 KST
- Recovered: 2026-04-10 15:07 KST
- Impact: ~30% of all users (payment service)
- Severity: P1 (Critical)

## Timeline
- 14:23 - Payment success rate drop alert triggered
- 14:25 - On-call engineer response started
- 14:32 - Root cause identified (DB connection pool exhaustion)
- 14:45 - Emergency patch deployment started
- 15:07 - Full service recovery confirmed

## Root Cause
A newly deployed query was missing an index, increasing response time
from 50ms to 5s, causing connection pool exhaustion.

## Action Items
- [Done] Added missing index
- [In Progress] Build automated slow query detection
- [Planned] Automate query performance testing before deployment

SRE (Site Reliability Engineering)

A concept created by Google: applying software engineering methodology to operations.

Core concepts: SLI, SLO, SLA, Error Budget.

SLO: 99.95% availability (monthly)

This month: 99.97%
Error budget remaining: 0.02%
Status: New feature deployment OK

If 99.93%?
Error budget: Exhausted
Status: Focus on stability only
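The arithmetic behind the example above is simple enough to write down directly; this sketch just makes the rule explicit.

```python
def error_budget(slo: float, measured: float) -> float:
    """Remaining error budget, in percentage points of availability.

    `slo` and `measured` are availabilities such as 99.95 (percent).
    A negative result means the budget is exhausted and the team should
    prioritize stability over new feature deployments.
    """
    allowed_unavailability = 100.0 - slo       # e.g. 0.05 for a 99.95% SLO
    actual_unavailability = 100.0 - measured
    return allowed_unavailability - actual_unavailability

# The numbers from the example above:
assert round(error_budget(99.95, 99.97), 4) == 0.02   # budget remains
assert round(error_budget(99.95, 99.93), 4) == -0.02  # exhausted
```

The budget turns reliability into a spendable resource: as long as it is positive, teams may take deployment risk; once it is gone, releases pause.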

9. Documentation and Knowledge Management

ADR (Architecture Decision Record)

A lightweight framework for documenting architecture decisions.

RFC (Request for Comments)

A process for gathering team-wide feedback before major technical changes.

Major company practices:

  • Google: Very strong Design Docs culture
  • Meta: Shares and discusses RFCs on Workplace
  • Amazon: Famous for 6-pager documents and narrative-based meetings

10. The Power of Open Source

InnerSource

Applying open-source development methodology inside a company.

Traditional corporate development:
  Team A -> Request Library X -> Team B -> Backlog wait (2-3 months)

InnerSource:
  Team A -> Submit PR to Library X -> Team B review -> Merge (1-2 weeks)

Practical Checklist

Architecture
  [ ] Are service boundaries clearly defined?
  [ ] Is there an API versioning strategy?
  [ ] Are circuit breakers / retry policies implemented?

CI/CD
  [ ] Is the path from code commit to production automated?
  [ ] Is rollback possible within 5 minutes?
  [ ] Is there a canary/blue-green deployment strategy?

Update System
  [ ] Is auto-update implemented?
  [ ] Is update file integrity verified (checksum, signature)?
  [ ] Is there a rollback mechanism for update failures?

Monitoring
  [ ] Are SLI/SLOs defined?
  [ ] Is there an alert escalation policy?
  [ ] Is distributed tracing implemented?

Technical Debt
  [ ] Is technical debt measured regularly?
  [ ] Is engineering time allocated for refactoring?
  [ ] Is there a dependency update policy?

Documentation
  [ ] Are ADRs being written?
  [ ] Are API docs auto-generated?
  [ ] Is onboarding documentation up to date?

Conclusion

Developing and maintaining large-scale software is not just about "coding well." Architecture design, build systems, deployment strategies, update mechanisms, incident response, documentation, and team culture must all work together as a unified system.

Key lessons:

  1. Build small, deploy often - Small, continuous changes are safer than big releases
  2. Automate everything you can - CI/CD, testing, updates, monitoring
  3. Design for failure - Circuit breakers, rollback, canary deployments are essential
  4. Update safely - Checksum verification, signature checking, rollback mechanisms
  5. Documentation is investment - ADRs, RFCs, postmortems are for your future self
  6. Technical debt accrues interest - Regular repayment prevents development velocity collapse

Good software is not built in one go. It is an ongoing cycle of continuous improvement, safe deployment, and learning from failure.


References

  • Google Engineering Practices: Code review guide and large-scale software development cases
  • Software Engineering at Google (O'Reilly): Google's software engineering philosophy
  • The Site Reliability Workbook (O'Reilly): SRE practical guide
  • Martin Fowler - Microservices: Fundamental concepts of microservice architecture
  • Electron auto-updater official documentation: Desktop app auto-update implementation guide
  • The Twelve-Factor App: 12 principles for modern web app development
