How Is Large-Scale Software Developed, Managed, and Maintained? — Including SW Update Systems

Author: Youngju Kim (@fjvbn20031)
Introduction
The Chrome browser, Android OS, and Netflix app that we use daily are massive software systems comprising tens of millions of lines of code. They are not created by a single genius but are built by thousands of developers collaborating in parallel.
How exactly is such massive software developed, deployed, updated, and maintained long-term?
This post covers the core principles of large-scale software development comprehensively, from architecture to update system implementation.
1. Defining Large-Scale Software
When Does It Become "Large-Scale"?
Generally, software meeting one or more of these criteria is classified as large-scale:
| Criterion | Example |
|---|---|
| Code size | Millions to billions of lines |
| Development staff | Hundreds to thousands |
| Number of services | Hundreds to thousands of microservices |
| User base | Hundreds of millions+ DAU |
| Deployment frequency | Dozens to hundreds per day |
Real-World Examples
- Google: A single monorepo with approximately 2 billion+ lines of code. Tens of thousands of commits daily by 25,000+ developers.
- Meta (Facebook): Tens of thousands of engineers collaborate in a single repo. Uses Buck, their custom build system.
- Netflix: Operates with approximately 1,000+ microservices, performing thousands of deployments per day. Pioneer of Chaos Engineering.
2. Monorepo vs Multi-repo
The first decision in managing large-scale software is the code repository strategy.
Monorepo
All project code managed in a single repository.
```
company-repo/
  frontend/
  backend/
  mobile/
  infra/
  shared-libs/
  tools/
```
Advantages: Easy code sharing, atomic commits possible, easier refactoring, consistent tooling
Disadvantages: Repository size becomes massive, complex build systems needed, access control challenges
Multi-repo
Each project, service, or library managed in independent repositories.
Advantages: Independent development cycles per team, small repos with fast cloning, clear service boundaries
Disadvantages: Cumbersome code sharing, dependency hell (Diamond Dependency), difficult cross-project refactoring
Enterprise Choices
| Company | Strategy | Tools |
|---|---|---|
| Google | Monorepo | Piper (custom VCS) + Bazel |
| Meta | Monorepo | Mercurial + Buck |
| Netflix | Multi-repo | Gradle + Nebula |
| Microsoft | Monorepo (partial) | VFS for Git (GVFS) |
| Amazon | Multi-repo | Brazil (custom build system) |
The key is not that one strategy is absolutely better, but choosing the strategy that fits your organization's scale and culture.
3. Microservices Architecture
From Monolith to Microservices
Early startups typically start with a single application (monolith). As scale grows, problems emerge: build times exceeding 30 minutes, a single bug crashing the entire system, frequent code conflicts between teams, inability to scale specific functions independently.
Microservices Architecture (MSA) addresses these problems.
Core MSA Components
API Gateway - Single entry point for all client requests. Handles authentication, routing, rate limiting, and logging.
```yaml
# API Gateway routing example
routes:
  - path: /api/users
    service: user-service
    methods: [GET, POST]
  - path: /api/orders
    service: order-service
    methods: [GET, POST, PUT]
  - path: /api/payments
    service: payment-service
    methods: [POST]
```
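The routing table above can be sketched as a tiny dispatcher in Python. This is an illustration of the matching logic only, not a production gateway; the service names mirror the example config:

```python
# Minimal sketch of gateway routing: match path prefix + HTTP method
# to a backend service name. Routes mirror the example config above.
ROUTES = [
    {"path": "/api/users", "service": "user-service", "methods": {"GET", "POST"}},
    {"path": "/api/orders", "service": "order-service", "methods": {"GET", "POST", "PUT"}},
    {"path": "/api/payments", "service": "payment-service", "methods": {"POST"}},
]

def route(path: str, method: str) -> str:
    """Return the target service for a request, or raise if no route matches."""
    for r in ROUTES:
        if path.startswith(r["path"]) and method in r["methods"]:
            return r["service"]
    raise LookupError(f"no route for {method} {path}")
```

A real gateway would also terminate TLS, authenticate the caller, and apply rate limits before forwarding.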
Service Mesh - Infrastructure layer managing service-to-service communication. Uses tools like Istio and Linkerd.
```yaml
# Istio VirtualService example
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
    - user-service
  http:
    - route:
        - destination:
            host: user-service
            subset: v2
          weight: 90
        - destination:
            host: user-service
            subset: v1
          weight: 10
```
Service Discovery - Mechanism enabling hundreds of services to find each other. Consul, Eureka, Kubernetes DNS.
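The registration-and-lookup pattern behind tools like Consul and Eureka can be sketched in a few lines. This is a toy in-memory version with TTL-based expiry; real systems add health checks, replication, and consistency guarantees:

```python
import time

class ServiceRegistry:
    """Toy in-memory service registry: instances register with a
    heartbeat timestamp and are dropped from lookups once their TTL
    expires. A greatly simplified version of the Consul/Eureka pattern."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._instances = {}  # service name -> {address: last_heartbeat}

    def register(self, service, address, now=None):
        """Record (or refresh) a heartbeat for one service instance."""
        now = time.time() if now is None else now
        self._instances.setdefault(service, {})[address] = now

    def lookup(self, service, now=None):
        """Return live instance addresses, evicting expired ones."""
        now = time.time() if now is None else now
        live = {addr: ts for addr, ts in self._instances.get(service, {}).items()
                if now - ts < self.ttl}
        self._instances[service] = live
        return sorted(live)
```

Clients would call `lookup()` (or rely on DNS, as in Kubernetes) each time they need an address, so instances can come and go freely.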
Circuit Breaker - Pattern preventing cascade failures when a specific service fails.
```python
# Circuit Breaker implementation (runnable sketch)
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""

class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = threshold  # consecutive failures before opening
        self.timeout = timeout      # seconds to stay OPEN before probing
        self.opened_at = None
        self.state = "CLOSED"       # CLOSED, OPEN, HALF_OPEN

    def call(self, func):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.timeout:
                self.state = "HALF_OPEN"  # let one probe request through
            else:
                raise CircuitOpenError("Service unavailable")
        try:
            result = func()
        except Exception:
            self.on_failure()
            raise
        self.on_success()
        return result

    def on_failure(self):
        self.failure_count += 1
        if self.state == "HALF_OPEN" or self.failure_count >= self.threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"
```
4. CI/CD Pipeline
Continuous Integration (CI)
When a developer pushes code, build, test, and static analysis run automatically.
```yaml
# GitHub Actions CI Pipeline example
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install dependencies
        run: npm ci
      - name: Lint
        run: npm run lint
      - name: Unit Tests
        run: npm run test:unit
      - name: Integration Tests
        run: npm run test:integration
      - name: Build
        run: npm run build
```
Continuous Deployment (CD) - Deployment Strategies
Canary Deployment - Deploy the new version to a subset (e.g., 1-5%) of users first, then gradually increase if no issues.
```
Traffic distribution:
  v1 (current):  95%  ███████████████████░
  v2 (new):       5%  █░░░░░░░░░░░░░░░░░░░

After 30 min (no issues):
  v1 (current):  50%  ██████████░░░░░░░░░░
  v2 (new):      50%  ██████████░░░░░░░░░░

After 1 hour (no issues):
  v1 (current):   0%  ░░░░░░░░░░░░░░░░░░░░
  v2 (new):     100%  ████████████████████
```
Blue-Green Deployment - Maintain two identical environments (Blue, Green) and switch traffic at once.
Rolling Update - Replace instances one by one sequentially. Kubernetes default deployment strategy.
```yaml
# Kubernetes Rolling Update configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during the update
      maxUnavailable: 1    # at most 1 pod down at any time
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2
          readinessProbe:   # only route traffic to pods that pass this
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```
5. Feature Flags
Separate Deployment from Release
Feature flags enable deploying code while keeping functionality inactive.
```javascript
// Feature flag usage example
const featureFlags = {
  newCheckoutFlow: false,
  darkMode: true,
  aiRecommendation: false,
};

function renderCheckout() {
  if (featureFlags.newCheckoutFlow) {
    return renderNewCheckout();
  }
  return renderLegacyCheckout();
}
```
Feature Flag Applications
| Use Case | Description |
|---|---|
| Gradual Release | Roll out 1% -> 10% -> 50% -> 100% |
| A/B Testing | Compare two UI variants |
| Kill Switch | Instantly disable features during outages |
| Beta Testing | Expose only to specific user groups |
| Regional Release | Korea first, then global expansion |
Note: Accumulated feature flags become technical debt. Clean up flags after releases are complete.
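The gradual-release use case is typically implemented as deterministic bucketing: hash each user into a fixed bucket so that raising the percentage only ever adds users and never flips anyone back. A minimal sketch (illustrative; feature-flag platforms provide this out of the box):

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage rollout: hash (flag, user) into a
    bucket 0-99 and compare against the rollout percentage. The same
    user always lands in the same bucket, so growing the percentage
    from 1% -> 10% -> 50% -> 100% is monotonic."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Hashing on the flag name as well as the user ID keeps different flags' rollouts statistically independent of each other.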
6. SW Update System Implementation
This section is the core of this post. Delivering updates safely and smoothly to users is as important as developing the software itself.
6-1. Desktop App Updates
Electron auto-updater - Most Electron-based apps (VS Code, Slack Desktop, Discord) use electron-updater.
```javascript
// Electron auto-updater implementation
import { autoUpdater } from 'electron-updater';
import log from 'electron-log'; // commonly paired with electron-updater

autoUpdater.setFeedURL({
  provider: 'github',
  owner: 'my-org',
  repo: 'my-app',
});

// showUpdateNotification, updateProgressBar, and showRestartDialog
// are app-defined UI helpers
autoUpdater.on('update-available', (info) => {
  log.info('New version found:', info.version);
  showUpdateNotification(info);
});

autoUpdater.on('download-progress', (progress) => {
  updateProgressBar(progress.percent);
});

autoUpdater.on('update-downloaded', (info) => {
  showRestartDialog(info.version);
});

// Check for updates every 4 hours
setInterval(() => {
  autoUpdater.checkForUpdates();
}, 4 * 60 * 60 * 1000);
```
Delta Updates - Download only changed portions instead of the entire file.
```
Full update:  v1.0 (200MB) -> v1.1 (200MB) = 200MB download
Delta update: v1.0 (200MB) -> v1.1 (200MB) =  15MB download (diff only)
```
Chrome and VS Code use this approach. Binary diffing algorithms like Courgette (Chrome) are key.
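The delta idea can be sketched with a naive fixed-size block diff. This is purely illustrative; bsdiff and Courgette use far more sophisticated binary diffing that exploits executable structure:

```python
def make_delta(old: bytes, new: bytes, block: int = 4):
    """Naive block-level delta: record (offset, new_bytes) for every
    fixed-size block that changed. Real delta updaters are far
    smarter, but the shape is the same: ship only the differences."""
    delta = []
    for off in range(0, max(len(old), len(new)), block):
        o, n = old[off:off + block], new[off:off + block]
        if o != n:
            delta.append((off, n))
    return delta

def apply_delta(old: bytes, delta, new_len: int) -> bytes:
    """Rebuild the new file from the old file plus the delta."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for off, chunk in delta:
        buf[off:off + len(chunk)] = chunk
    return bytes(buf)
```

In a real updater, the delta itself would also be compressed and integrity-checked before being applied.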
6-2. Mobile App Updates
App Store-Based Updates follow the standard submission, review, approval, and distribution flow.
Staged Rollout - Google Play's staged rollout feature allows deploying to only a subset of users first.
```
Day 1:   1% rollout -> monitor crash reports
Day 2:   5% rollout -> check user feedback
Day 3:  20% rollout
Day 5:  50% rollout
Day 7: 100% rollout
```
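The advance-or-hold decision in such a schedule can be sketched as a simple gate; the stage ladder and the 1% crash-rate threshold here are illustrative choices, not Google Play defaults:

```python
def next_rollout_stage(current_percent: int, crash_rate: float,
                       max_crash_rate: float = 0.01) -> int:
    """Advance a staged rollout (1 -> 5 -> 20 -> 50 -> 100) only while
    the observed crash rate stays under the threshold; otherwise hold
    at the current stage so the release can be paused or rolled back."""
    stages = [1, 5, 20, 50, 100]
    if crash_rate >= max_crash_rate:
        return current_percent  # hold: do not expose more users
    for s in stages:
        if s > current_percent:
            return s
    return 100
```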
OTA (Over-The-Air) Updates - Push JavaScript bundles directly bypassing the app store. Available in React Native via EAS Update.
```javascript
// React Native OTA Update (EAS Update example)
import { Alert } from 'react-native';
import * as Updates from 'expo-updates';

async function checkForOTAUpdate() {
  try {
    const update = await Updates.checkForUpdateAsync();
    if (update.isAvailable) {
      await Updates.fetchUpdateAsync();
      Alert.alert(
        'Update Ready',
        'A new version has been downloaded. Apply now?',
        [
          { text: 'Later', style: 'cancel' },
          { text: 'Apply Now', onPress: () => Updates.reloadAsync() },
        ]
      );
    }
  } catch (error) {
    console.error('OTA update failed:', error);
  }
}
```
Note: Apple guidelines prohibit using OTA to change the main functionality or purpose of an app.
6-3. Server Updates (Deployment)
Server software updates follow the CI/CD deployment strategies described earlier.
```yaml
# ArgoCD GitOps-based server update
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/k8s-manifests
    path: services/my-service
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
Hot Swap vs Cold Update
| Method | Description | Example |
|---|---|---|
| Hot Swap | Replace modules without app restart | Java class reloading, Erlang hot code loading |
| Warm Restart | Restart process only (no OS reboot) | Node.js graceful restart |
| Cold Update | Full app shutdown then reinstall | Desktop apps, OS updates |
Erlang/OTP is famous for its hot code loading that can replace code in a running system without interruption. Used in systems requiring 99.999% uptime like telephone exchanges.
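Python offers a rough analog of hot swapping via `importlib.reload`, sketched below. This is a toy illustration only: Erlang's VM-level hot loading also coordinates in-flight calls between old and new code, which this does not:

```python
import importlib
import pathlib
import sys
import tempfile

# Toy analog of hot code loading: rewrite a module on disk, then swap
# the new code into the running process with importlib.reload.
sys.dont_write_bytecode = True  # always recompile from source
tmp = tempfile.mkdtemp()
sys.path.insert(0, tmp)
mod_path = pathlib.Path(tmp) / "greeter.py"

mod_path.write_text("def greet():\n    return 'v1'\n")
import greeter
print(greeter.greet())  # v1

mod_path.write_text("def greet():\n    return 'version-2'\n")
importlib.invalidate_caches()
importlib.reload(greeter)  # replace the code without restarting
print(greeter.greet())  # version-2
```

Existing references to objects created by the old code keep their old behavior, which is one reason hot reloading in most languages is used for development, not for five-nines production systems.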
7. Technical Debt Management
Types of Technical Debt
Deliberate + Prudent: "Let's go with a simple implementation now and refactor next quarter"
Deliberate + Reckless: "Skip the design, just build it fast"
Inadvertent + Prudent: "Looking back after implementation, a better design existed"
Inadvertent + Reckless: "What's a layered architecture?"
Refactoring Strategies
Boy Scout Rule: Leave the code cleaner than you found it.
Golden Path First: Refactor the most-used code paths (Happy Path) first.
Strangler Fig Pattern: Don't replace legacy systems all at once; gradually redirect traffic to the new system.
```
Phase 1: Legacy (100%) --- New (0%)
Phase 2: Legacy (70%)  --- New (30%)
Phase 3: Legacy (30%)  --- New (70%)
Phase 4: Legacy (0%)   --- New (100%)  <- remove legacy
```
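The routing side of the pattern can be sketched as a path-based switch in front of both systems; the path list and backend names below are hypothetical:

```python
# Strangler fig routing sketch: only endpoints that have already been
# migrated go to the new system; everything else still hits the
# legacy monolith. Paths and backend names are illustrative.
MIGRATED_PATHS = {"/api/orders", "/api/users"}

def pick_backend(path: str) -> str:
    """Return which system should serve this request path."""
    for prefix in MIGRATED_PATHS:
        if path.startswith(prefix):
            return "new-system"
    return "legacy-system"
```

Migration then proceeds by adding one prefix at a time to the set, until the legacy entry can be deleted entirely.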
8. Incident Response
Postmortem
An analysis document written after an incident. Blameless Culture is the core principle.
```markdown
# Postmortem Template

## Incident Summary
- Occurred: 2026-04-10 14:23 KST
- Recovered: 2026-04-10 15:07 KST
- Impact: ~30% of all users (payment service)
- Severity: P1 (Critical)

## Timeline
- 14:23 - Payment success rate drop alert triggered
- 14:25 - On-call engineer response started
- 14:32 - Root cause identified (DB connection pool exhaustion)
- 14:45 - Emergency patch deployment started
- 15:07 - Full service recovery confirmed

## Root Cause
A newly deployed query was missing an index, increasing response time
from 50ms to 5s, causing connection pool exhaustion.

## Action Items
- [Done] Added missing index
- [In Progress] Build automated slow query detection
- [Planned] Automate query performance testing before deployment
```
SRE (Site Reliability Engineering)
A concept created by Google: applying software engineering methodology to operations.
Core concepts: SLI, SLO, SLA, Error Budget.
```
SLO: 99.95% availability (monthly)
This month: 99.97%
Error budget remaining: 0.02%p
Status: OK to deploy new features

If this month were 99.93%:
Error budget: exhausted
Status: freeze features, focus on stability
```
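The error-budget arithmetic above can be written down directly; percentages are expressed in points, and the rounding exists only to sidestep floating-point noise:

```python
def error_budget_remaining(slo: float, observed: float) -> float:
    """Remaining error budget in percentage points. The SLO allows
    (100 - slo) points of unavailability per window; subtract what
    has actually been consumed. E.g. SLO 99.95 with observed 99.97
    leaves 0.02 points of budget."""
    allowed = 100.0 - slo        # total budget for the window
    consumed = 100.0 - observed  # unavailability actually observed
    return round(allowed - consumed, 4)

def can_ship_features(slo: float, observed: float) -> bool:
    """SRE policy sketch: ship new features only while budget remains."""
    return error_budget_remaining(slo, observed) > 0
```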
9. Documentation and Knowledge Management
ADR (Architecture Decision Record)
A lightweight framework for documenting architecture decisions.
RFC (Request for Comments)
A process for gathering team-wide feedback before major technical changes.
Major company practices:
- Google: Very strong Design Docs culture
- Meta: Shares and discusses RFCs on Workplace
- Amazon: Famous for 6-pager documents and narrative-based meetings
10. The Power of Open Source
InnerSource
Applying open-source development methodology inside a company.
```
Traditional corporate development:
  Team A -> request change to Library X -> Team B backlog -> wait 2-3 months

InnerSource:
  Team A -> submit PR to Library X -> Team B reviews -> merged in 1-2 weeks
```
Practical Checklist
Architecture
- [ ] Are service boundaries clearly defined?
- [ ] Is there an API versioning strategy?
- [ ] Are circuit breakers / retry policies implemented?

CI/CD
- [ ] Is the path from code commit to production automated?
- [ ] Is rollback possible within 5 minutes?
- [ ] Is there a canary/blue-green deployment strategy?

Update System
- [ ] Is auto-update implemented?
- [ ] Is update file integrity verified (checksum, signature)?
- [ ] Is there a rollback mechanism for update failures?

Monitoring
- [ ] Are SLI/SLOs defined?
- [ ] Is there an alert escalation policy?
- [ ] Is distributed tracing implemented?

Technical Debt
- [ ] Is technical debt measured regularly?
- [ ] Is engineering time allocated for refactoring?
- [ ] Is there a dependency update policy?

Documentation
- [ ] Are ADRs being written?
- [ ] Are API docs auto-generated?
- [ ] Is onboarding documentation up to date?
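The integrity item in the Update System list can be sketched as a SHA-256 comparison against the checksum published with the release; signature verification with a code-signing key is the stronger complement, not shown here:

```python
import hashlib

def verify_checksum(payload: bytes, expected_sha256: str) -> bool:
    """Verify update-file integrity before installing: recompute the
    SHA-256 of the downloaded bytes and compare it against the
    checksum published alongside the release."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256
```

An updater that fails this check should discard the download and retry (or roll back), never install the file.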
Conclusion
Developing and maintaining large-scale software is not just about "coding well." Architecture design, build systems, deployment strategies, update mechanisms, incident response, documentation, and team culture must all work together as a unified system.
Key lessons:
- Build small, deploy often - Small, continuous changes are safer than big releases
- Automate everything you can - CI/CD, testing, updates, monitoring
- Design for failure - Circuit breakers, rollback, canary deployments are essential
- Update safely - Checksum verification, signature checking, rollback mechanisms
- Documentation is investment - ADRs, RFCs, postmortems are for your future self
- Technical debt accrues interest - Regular repayment prevents development velocity collapse
Good software is not built in one go. It is an ongoing cycle of continuous improvement, safe deployment, and learning from failure.
References
- Google Engineering Practices: Code review guide and large-scale software development cases
- Software Engineering at Google (O'Reilly): Google's software engineering philosophy
- The Site Reliability Workbook (O'Reilly): SRE practical guide
- Martin Fowler - Microservices: Fundamental concepts of microservice architecture
- Electron auto-updater official documentation: Desktop app auto-update implementation guide
- The Twelve-Factor App: 12 principles for modern web app development