- Introduction
- Cell-based architecture core concepts
- Bulkhead pattern and fault isolation
- Cell routing strategy
- Kubernetes-based cell implementation
- AWS-based cell implementation
- Cell Deployment Strategy (Canary per Cell)
- Real examples
- Data partitioning
- Monitoring and Observability
- Troubleshooting
- Precautions during operation
- Failure cases and recovery
- Checklist
- References

Introduction
When operating a large-scale distributed system, you face an uncomfortable truth: no matter how sophisticated your failure-response tooling is, a failure in a single component shared by the whole system takes down the entire service. Representative examples include Facebook's (now Meta's) roughly six-hour global outage in 2021, Microsoft Azure's widespread outage caused by a WAN configuration error in 2023, and Cloudflare's API gateway outage in 2024. What these incidents have in common is a Blast Radius that covered the entire system.
Cell-Based Architecture is a structural answer to this problem: the system is divided into independent cells, isolated so that a failure in one cell does not affect the others. AWS applies it to its own infrastructure and documents it officially in the Well-Architected Framework, and it has been proven in production by companies such as Slack, DoorDash, and Salesforce.
This article covers cell-based architecture comprehensively at the operational level: its core principles, the relationship with the Bulkhead pattern, cell routing strategies (Consistent Hashing, partition keys), Kubernetes- and AWS-based implementations, cell-level canary deployment, data partitioning, monitoring and observability, and real operational examples and troubleshooting.
Cell-based architecture core concepts
What is a Cell?
A cell is a self-contained deployment unit that can independently perform the entire function of a service. Each cell has its own computing resources, data storage, message queue, and cache, and does not share state with other cells. Even if one cell completely fails, the remaining cells operate normally.
The key properties of cell-based architecture are:
- Isolation: Computing, storage, and network resources are not shared between cells.
- Independent Deployment: Each cell can be deployed, upgraded, and rolled back independently.
- Horizontal Scaling: Expand overall system capacity by adding cells.
- Fault Isolation: Failure of one cell does not propagate to other cells.
- Capacity Capping: Each cell has a fixed maximum capacity.
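Capacity capping in particular can be made concrete with a small placement sketch. This is a minimal illustration, not a real control plane: the `Cell` descriptor and `place_tenant` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """Illustrative cell descriptor; field names are assumptions."""
    cell_id: str
    max_capacity: int  # fixed maximum number of tenants this cell accepts
    assigned: int = 0  # tenants currently placed in this cell

    def has_room(self) -> bool:
        return self.assigned < self.max_capacity

def place_tenant(cells: list[Cell]) -> Cell:
    """Place a new tenant in the least-loaded cell that still has room.

    Raising when every cell is full is the signal to provision a new
    cell -- capacity capping means a cell never grows past its limit.
    """
    candidates = [c for c in cells if c.has_room()]
    if not candidates:
        raise RuntimeError("all cells at capacity: provision a new cell")
    target = min(candidates, key=lambda c: c.assigned / c.max_capacity)
    target.assigned += 1
    return target
```

Because each cell's capacity is fixed, growth is absorbed by adding cells, never by letting one cell swell into a larger blast radius.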
Differences from traditional architecture
```
Traditional architecture (shared infrastructure):
┌─────────────────────────────────────────────┐
│               Load Balancer                 │
├─────────────────────────────────────────────┤
│ App Server 1 | App Server 2 | App Server 3  │ <-- shared compute
├─────────────────────────────────────────────┤
│              Shared Database                │ <-- single point of failure
├─────────────────────────────────────────────┤
│               Shared Cache                  │ <-- single point of failure
└─────────────────────────────────────────────┘
On failure: all users affected (Blast Radius = 100%)

Cell-based architecture (isolated infrastructure):
┌──────────────┐
│ Cell Router  │ <-- the only shared component (kept minimal)
└──────┬───────┘
       │
  ┌────┴────┬────────┬────────┐
  ▼         ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Cell 1│ │Cell 2│ │Cell 3│ │Cell 4│
│ App  │ │ App  │ │ App  │ │ App  │
│ DB   │ │ DB   │ │ DB   │ │ DB   │
│Cache │ │Cache │ │Cache │ │Cache │
└──────┘ └──────┘ └──────┘ └──────┘
On failure: only that cell's users affected (Blast Radius = 25%)
```
Comparison of architectural patterns
| Comparison Item | Cell-Based | Multi-Region | Active-Active | Traditional Monolith |
|---|---|---|---|---|
| Fault isolation level | Per cell (5-10% of users) | Per region (30-50% of users) | Per region | None (100% of users) |
| Implementation complexity | High | Medium-High | High | Low |
| Infrastructure cost | Medium-High (per-cell overhead) | High (regional replication) | Very high | Low |
| Latency | Low (intra-cell communication) | Variable (inter-region latency) | Variable | Low |
| Deployment flexibility | Per-cell canary possible | Per region | Per region | All-at-once only |
| Data consistency | Strong consistency within a cell | Eventual consistency | Conflict resolution required | Strong consistency |
| Scaling method | Add cells | Add regions | Add regions | Vertical scaling |
| Operational complexity | High (N cells to manage) | Medium | High | Low |
Bulkhead pattern and fault isolation
Cell-based architecture is essentially an infrastructure-level application of the Bulkhead pattern. The term comes from a ship's bulkheads: even if one compartment floods, the bulkheads keep water out of the other compartments and the ship stays afloat.
Bulkhead application level
The Bulkhead pattern can be applied at several levels, with cell-based architecture being the highest level application.
- Thread Pool Level: Hystrix/Resilience4j’s Thread Pool Bulkhead (smallest unit)
- Process level: Separation of processes by function
- Service Level: Isolation between microservices
- Infrastructure level: Isolation into separate VMs, clusters, VPCs
- Cell level: Bundles the entire stack (compute+DB+cache+queue) into one isolation unit (largest unit)
Blast Radius Calculation
The failure explosion radius can be calculated by the following formula:
Fraction of users affected by a single-cell failure = 1 / N  (N = number of cells)
Examples:
- 4 cells: at most 25% of users affected on failure
- 10 cells: at most 10% of users affected on failure
- 20 cells: at most 5% of users affected on failure
However, increasing the number of cells indefinitely makes operational complexity and cost grow quickly. A cell sized for 5-10% of total users usually strikes a good balance between isolation level and cost.
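The arithmetic above can be captured in a small helper (an illustrative sketch; the function name is not from any library):

```python
def blast_radius(num_cells: int, failed_cells: int = 1) -> float:
    """Worst-case fraction of users affected when `failed_cells` cells
    fail, assuming users are evenly distributed across `num_cells` cells."""
    if num_cells <= 0:
        raise ValueError("num_cells must be positive")
    return failed_cells / num_cells

for n in (4, 10, 20):
    print(f"{n} cells -> {blast_radius(n):.0%} worst-case impact")
```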
Cell routing strategy
The most important design decision in a cell-based architecture is which users are routed to which cells. Routing must be deterministic: the same user must always land on the same cell.
Consistent Hashing based routing
Consistent Hashing ensures that only a minimal number of keys are relocated when cells are added or removed.
```python
# cell_router.py - Consistent Hashing based cell router
import hashlib
from bisect import bisect_right
from typing import Optional


class CellRouter:
    """Consistent Hashing based cell router.

    Uses virtual nodes to spread each cell evenly
    around the hash ring.
    """

    def __init__(self, virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self.ring: list[int] = []
        self.ring_to_cell: dict[int, str] = {}
        self.cells: dict[str, dict] = {}

    def _hash(self, key: str) -> int:
        """Map a key to an integer using SHA-256."""
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest[:16], 16)

    def add_cell(self, cell_id: str, metadata: Optional[dict] = None) -> None:
        """Add a cell to the hash ring.

        Args:
            cell_id: Cell identifier (e.g. "cell-us-east-001")
            metadata: Cell metadata (endpoint, capacity, etc.)
        """
        self.cells[cell_id] = metadata or {}
        for i in range(self.virtual_nodes):
            virtual_key = f"{cell_id}:vn{i}"
            hash_value = self._hash(virtual_key)
            self.ring.append(hash_value)
            self.ring_to_cell[hash_value] = cell_id
        self.ring.sort()

    def remove_cell(self, cell_id: str) -> None:
        """Remove a cell from the hash ring.

        Traffic for the removed cell is automatically
        redistributed to adjacent cells.
        """
        self.ring = [
            h for h in self.ring
            if self.ring_to_cell.get(h) != cell_id
        ]
        self.ring_to_cell = {
            h: c for h, c in self.ring_to_cell.items()
            if c != cell_id
        }
        del self.cells[cell_id]

    def route(self, partition_key: str) -> str:
        """Determine the target cell for a partition key.

        Args:
            partition_key: Routing key (e.g. tenant_id, user_id, org_id)

        Returns:
            Target cell ID
        """
        if not self.ring:
            raise ValueError("no cells on the hash ring")
        hash_value = self._hash(partition_key)
        idx = bisect_right(self.ring, hash_value)
        if idx == len(self.ring):
            idx = 0
        return self.ring_to_cell[self.ring[idx]]

    def get_cell_distribution(self) -> dict[str, int]:
        """Return how many virtual nodes each cell holds on the ring."""
        distribution: dict[str, int] = {}
        for cell_id in self.ring_to_cell.values():
            distribution[cell_id] = distribution.get(cell_id, 0) + 1
        return distribution


# Usage example
if __name__ == "__main__":
    router = CellRouter(virtual_nodes=150)

    # Register cells
    router.add_cell("cell-001", {"region": "us-east-1", "capacity": 10000})
    router.add_cell("cell-002", {"region": "us-east-1", "capacity": 10000})
    router.add_cell("cell-003", {"region": "us-west-2", "capacity": 10000})
    router.add_cell("cell-004", {"region": "eu-west-1", "capacity": 10000})

    # Route users
    user_ids = ["user-12345", "user-67890", "org-acme", "org-globex"]
    for uid in user_ids:
        target_cell = router.route(uid)
        print(f"{uid} -> {target_cell}")

    # Check the cell distribution
    dist = router.get_cell_distribution()
    total = sum(dist.values())
    for cell_id, count in sorted(dist.items()):
        pct = (count / total) * 100
        print(f"{cell_id}: {pct:.1f}%")
```
Partition Key selection criteria
The partition key used for routing depends on the business domain.
| Domain | Partition Key | Advantages | Caveats |
|---|---|---|---|
| SaaS multi-tenant | tenant_id / org_id | Complete isolation between tenants | Hotspot risk with large tenants |
| Social media | user_id | Data locality per user | Imbalance from influencer accounts |
| E-commerce | region + user_id | Geographic latency optimization | Cross-region order processing is complex |
| Messaging | workspace_id | Communication locality within a workspace | Variance in workspace size |
| Payments | merchant_id | Merchant-level regulatory compliance is straightforward | Large merchants need dedicated cells |
For large tenants (noisy neighbors), apply a Dedicated Cell strategy that assigns them their own cells.
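A dedicated-cell lookup can be layered in front of hash-based routing as a sketch like the following. The `DEDICATED` table and `HashRouter` stand-in (simple modulo hashing instead of a full Consistent Hashing ring) are illustrative assumptions:

```python
import hashlib

# Hypothetical dedicated-cell table: large tenants pinned to their own cells.
DEDICATED = {
    "org-megacorp": "cell-dedicated-001",
}

class HashRouter:
    """Minimal stand-in for a hash-based router (modulo, not a real ring)."""
    def __init__(self, cells: list[str]):
        self.cells = sorted(cells)

    def route(self, key: str) -> str:
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.cells[h % len(self.cells)]

def route_with_override(router, partition_key: str) -> str:
    """Check the dedicated-cell table first; otherwise fall back to
    ordinary hash routing for regular tenants."""
    return DEDICATED.get(partition_key) or router.route(partition_key)
```

The point of the override is that a noisy-neighbor tenant never shares the hash ring with regular tenants, so its load cannot spill over into their cells.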
Kubernetes-based cell implementation
The most common ways to implement cells in a Kubernetes environment are namespace-based or cluster-based isolation. Choose a separate cluster per cell when you need a high level of fault isolation, and namespace isolation when cost efficiency matters more.
Cell Deployment Manifest
```yaml
# cell-deployment.yaml
# Each cell is deployed into its own namespace
apiVersion: v1
kind: Namespace
metadata:
  name: cell-001
  labels:
    cell-id: 'cell-001'
    region: 'us-east-1'
    tier: 'standard'
---
# Per-cell resource limits (prevents noisy neighbors)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cell-001-quota
  namespace: cell-001
spec:
  hard:
    requests.cpu: '16'
    requests.memory: '32Gi'
    limits.cpu: '32'
    limits.memory: '64Gi'
    pods: '100'
    services: '20'
    persistentvolumeclaims: '10'
---
# Cell application deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-app
  namespace: cell-001
  labels:
    app: cell-app
    cell-id: 'cell-001'
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cell-app
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-app
        cell-id: 'cell-001'
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: cell-app
              cell-id: 'cell-001'
      containers:
        - name: app
          image: myregistry/cell-app:v2.4.1
          ports:
            - containerPort: 8080
          env:
            - name: CELL_ID
              value: 'cell-001'
            - name: DB_HOST
              value: 'cell-001-db.cell-001.svc.cluster.local'
            - name: CACHE_HOST
              value: 'cell-001-redis.cell-001.svc.cluster.local'
            - name: QUEUE_URL
              value: 'https://sqs.us-east-1.amazonaws.com/123456789/cell-001-queue'
          resources:
            requests:
              cpu: '500m'
              memory: '1Gi'
            limits:
              cpu: '1000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# Per-cell database (StatefulSet)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cell-001-db
  namespace: cell-001
spec:
  serviceName: cell-001-db
  replicas: 3
  selector:
    matchLabels:
      app: cell-db
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-db
        cell-id: 'cell-001'
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: 'cell_001'
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ['ReadWriteOnce']
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi
---
# Per-cell Redis cache
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-001-redis
  namespace: cell-001
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cell-redis
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-redis
        cell-id: 'cell-001'
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          args: ['--maxmemory', '2gb', '--maxmemory-policy', 'allkeys-lru']
---
# NetworkPolicy that blocks traffic between cells
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cell-isolation
  namespace: cell-001
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow traffic only from within the same cell
    - from:
        - namespaceSelector:
            matchLabels:
              cell-id: 'cell-001'
    # Allow traffic from the Cell Router
    - from:
        - namespaceSelector:
            matchLabels:
              role: 'cell-router'
  egress:
    # Traffic within the same cell
    - to:
        - namespaceSelector:
            matchLabels:
              cell-id: 'cell-001'
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
    # Allow access to external AWS services (SQS, S3, etc.)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
      ports:
        - port: 443
          protocol: TCP
```
The key piece of this manifest is the NetworkPolicy: by explicitly blocking network traffic between cells, fault propagation is cut off at the network level.
AWS-based cell implementation
In an AWS environment, stronger isolation can be implemented with VPCs, subnets, and security groups. Separating each cell into its own VPC or even its own AWS account gives complete isolation, including IAM boundaries.
Terraform-based cell infrastructure
```hcl
# modules/cell/main.tf
# Reusable cell infrastructure module

variable "cell_id" {
  description = "Unique cell identifier"
  type        = string
}

variable "cell_cidr" {
  description = "Cell VPC CIDR block"
  type        = string
}

variable "environment" {
  description = "Deployment environment"
  type        = string
  default     = "production"
}

variable "max_capacity" {
  description = "Maximum number of users per cell"
  type        = number
  default     = 10000
}

# Dedicated VPC for the cell
resource "aws_vpc" "cell" {
  cidr_block           = var.cell_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name        = "cell-${var.cell_id}-vpc"
    CellId      = var.cell_id
    Environment = var.environment
  }
}

# Private subnets per availability zone
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.cell.id
  cidr_block        = cidrsubnet(var.cell_cidr, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name   = "cell-${var.cell_id}-private-${count.index}"
    CellId = var.cell_id
  }
}

# Dedicated ECS cluster for the cell
resource "aws_ecs_cluster" "cell" {
  name = "cell-${var.cell_id}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    CellId = var.cell_id
  }
}

# Dedicated RDS for the cell (Multi-AZ)
resource "aws_db_instance" "cell" {
  identifier            = "cell-${var.cell_id}-db"
  engine                = "postgres"
  engine_version        = "16.4"
  instance_class        = "db.r6g.xlarge"
  allocated_storage     = 100
  max_allocated_storage = 500
  storage_encrypted     = true
  storage_type          = "gp3"
  multi_az              = true

  db_subnet_group_name   = aws_db_subnet_group.cell.name
  vpc_security_group_ids = [aws_security_group.cell_db.id]

  db_name  = "cell_${replace(var.cell_id, "-", "_")}"
  username = "cell_admin"
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  backup_retention_period = 14
  deletion_protection     = true

  tags = {
    CellId = var.cell_id
  }
}

# Dedicated ElastiCache Redis for the cell
resource "aws_elasticache_replication_group" "cell" {
  replication_group_id = "cell-${var.cell_id}-redis"
  description          = "Redis cluster for cell ${var.cell_id}"
  node_type            = "cache.r6g.large"
  num_cache_clusters   = 2
  engine_version       = "7.1"
  port                 = 6379

  subnet_group_name  = aws_elasticache_subnet_group.cell.name
  security_group_ids = [aws_security_group.cell_cache.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  automatic_failover_enabled = true

  tags = {
    CellId = var.cell_id
  }
}

# Dedicated SQS queue for the cell
resource "aws_sqs_queue" "cell" {
  name                       = "cell-${var.cell_id}-events"
  visibility_timeout_seconds = 60
  message_retention_seconds  = 1209600 # 14 days
  receive_wait_time_seconds  = 20

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.cell_dlq.arn
    maxReceiveCount     = 3
  })

  tags = {
    CellId = var.cell_id
  }
}

# Cell auto scaling
resource "aws_appautoscaling_target" "cell_ecs" {
  max_capacity       = var.max_capacity / 100 # assuming 100 users per instance
  min_capacity       = 3
  resource_id        = "service/${aws_ecs_cluster.cell.name}/${aws_ecs_service.cell.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cell_cpu" {
  name               = "cell-${var.cell_id}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.cell_ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.cell_ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.cell_ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

# Outputs
output "cell_vpc_id" {
  value = aws_vpc.cell.id
}

output "cell_db_endpoint" {
  value = aws_db_instance.cell.endpoint
}

output "cell_redis_endpoint" {
  value = aws_elasticache_replication_group.cell.primary_endpoint_address
}

output "cell_sqs_url" {
  value = aws_sqs_queue.cell.url
}
```
Cell provisioning
```hcl
# environments/production/main.tf
# Production cell provisioning

module "cell_001" {
  source       = "../../modules/cell"
  cell_id      = "001"
  cell_cidr    = "10.1.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

module "cell_002" {
  source       = "../../modules/cell"
  cell_id      = "002"
  cell_cidr    = "10.2.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

module "cell_003" {
  source       = "../../modules/cell"
  cell_id      = "003"
  cell_cidr    = "10.3.0.0/16"
  environment  = "production"
  max_capacity = 10000
}
```
Cell Deployment Strategy (Canary per Cell)
One of the most powerful benefits of cell-based architecture is cell-level canary deployment: deploy the new version to one cell first, then gradually roll it out to the remaining cells if no problems appear. Even if a failure occurs, only that cell's users are affected.
Deployment Pipeline
```yaml
# .github/workflows/cell-canary-deploy.yaml
# Cell-level canary deployment pipeline
name: Cell Canary Deployment

on:
  push:
    branches: [main]

env:
  IMAGE_TAG: ${{ github.sha }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push container image
        run: |
          docker build -t myregistry/cell-app:${{ env.IMAGE_TAG }} .
          docker push myregistry/cell-app:${{ env.IMAGE_TAG }}

  # Phase 1: deploy to the canary cell
  deploy-canary:
    needs: build
    runs-on: ubuntu-latest
    environment: canary
    steps:
      - name: Deploy to canary cell (cell-001)
        run: |
          kubectl set image deployment/cell-app \
            app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
            -n cell-001
          kubectl rollout status deployment/cell-app \
            -n cell-001 --timeout=300s
      - name: Run smoke tests against canary cell
        run: |
          ./scripts/smoke-test.sh cell-001
      - name: Monitor canary metrics (10 minutes)
        run: |
          ./scripts/canary-monitor.sh cell-001 600

  # Phase 2: first batch (30% of cells)
  deploy-batch-1:
    needs: deploy-canary
    runs-on: ubuntu-latest
    environment: production-batch-1
    strategy:
      max-parallel: 2
    steps:
      - name: Deploy to batch 1 cells
        run: |
          for CELL in cell-002 cell-003 cell-004; do
            kubectl set image deployment/cell-app \
              app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
              -n $CELL
            kubectl rollout status deployment/cell-app \
              -n $CELL --timeout=300s
          done

  # Phase 3: all remaining cells
  deploy-remaining:
    needs: deploy-batch-1
    runs-on: ubuntu-latest
    environment: production-all
    steps:
      - name: Deploy to all remaining cells
        run: |
          REMAINING_CELLS=$(kubectl get ns -l cell-id \
            --no-headers -o custom-columns=":metadata.name" | \
            grep -v -E "cell-001|cell-002|cell-003|cell-004")
          for CELL in $REMAINING_CELLS; do
            kubectl set image deployment/cell-app \
              app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
              -n $CELL
            kubectl rollout status deployment/cell-app \
              -n $CELL --timeout=300s
          done
```
Real examples
Slack’s cell-based architecture
Slack began transitioning to a workspace-based cell architecture in 2022. The partition key is workspace_id: all channels, messages, and files within one workspace live in the same cell. Before the transition, the February 2022 service outage affected all users; after it, a failure in an individual cell affects only that cell's users (roughly 5-8% of the total).
Slack's key design decisions include:
- Vitess-based MySQL sharding configured per cell
- The cell router is designed as a thin layer containing only minimal logic
- Dedicated Cells assigned to large enterprise customers (e.g. IBM, Amazon)
- When inter-cell communication is required (e.g. cross-workspace search), an asynchronous message bus is used
DoorDash’s Cell Architecture
DoorDash introduced a region-based cell architecture in 2023. The partition key is geographic_region, partitioning the United States into several geographic cells. Traffic surges in particular areas (such as specific cities around the Super Bowl) are thereby isolated from the rest of the country.
- DynamoDB used as the per-cell data store
- Apache Kafka clusters also separated per cell
- Feature rollout controlled via cell-level feature flags
- Automation in place to drain a failed cell's traffic to adjacent cells
Salesforce’s Pod Architecture
Salesforce is one of the pioneers of cell-based architecture. In Salesforce, cells are called Pods, and each Pod contains a complete Salesforce instance. Dozens of Pods are distributed around the world, and customers (tenants) are pinned to specific Pods.
- Per-tenant routing via instance URLs (e.g. na1.salesforce.com, eu5.salesforce.com)
- A self-developed Org Migration tool for data migration between Pods
- Pod-level maintenance windows with independently operated release cycles
Data partitioning
The hardest problem in a cell-based architecture is data partitioning: cross-cell query requirements must be met while preserving cell independence.
Partitioning strategy
1. Cell-Local Data
This is data that is accessed only inside the cell, and exists only in the database of that cell. For example, the user's messages, order history, and session information.
2. Global Reference Data
This is read-only data that is equally required in all cells. This includes exchange rate information, product catalogs, country codes, etc. This data is asynchronously replicated from the global data store to each cell.
3. Cross-Cell Aggregate Data
This is a case where data from all cells must be aggregated, such as overall system statistics and global dashboards. Metrics aggregated from each cell are exported to a central analytics platform (e.g. Snowflake, BigQuery) for processing.
Data Migration
Data migration is required when rebalancing cells. The core principles are:
- Dual Write: Write to both source and target cells during the migration period.
- Gradual transition: Switch reads to the target cell first, then switch writes.
- Rollback possibility: Keep the source cell's data for at least 48 hours.
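The dual-write-plus-verification flow above can be sketched as follows. This is a minimal illustration: the dict-backed stores and the `DualWriter`/`checksum` names are stand-ins for real per-cell databases, not any particular library.

```python
import hashlib
import json

class DualWriter:
    """During migration, mirror every write to source and target cells.

    `source` and `target` stand in for per-cell data stores (plain dicts
    here). The source cell remains authoritative until cutover.
    """
    def __init__(self, source: dict, target: dict):
        self.source = source
        self.target = target

    def write(self, key: str, value) -> None:
        # Write the source first; in a real system a failed target write
        # would be recorded for backfill rather than silently dropped.
        self.source[key] = value
        self.target[key] = value

def checksum(store: dict) -> str:
    """Order-independent checksum used to verify that source and target
    agree before reads are switched over to the target cell."""
    payload = json.dumps(sorted(store.items()), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Comparing `checksum(source)` with `checksum(target)` is the gate for the gradual transition: reads move to the target only once the checksums match.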
Monitoring and Observability
In a cell-based architecture, both cell-level metrics and global aggregate metrics are required.
Cell health check script
```bash
#!/bin/bash
# cell-health-check.sh
# Health check script that inspects the status of every cell
set -euo pipefail

CELL_ROUTER_URL="${CELL_ROUTER_URL:-http://cell-router.internal:8080}"
ALERT_WEBHOOK="${ALERT_WEBHOOK:-}"
HEALTH_THRESHOLD=3  # consecutive-failure threshold

declare -A FAILURE_COUNT

check_cell_health() {
    local cell_id="$1"
    local cell_endpoint="$2"

    # Application health check
    local http_code
    http_code=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 5 --max-time 10 \
        "${cell_endpoint}/health/ready" 2>/dev/null || echo "000")

    if [[ "$http_code" != "200" ]]; then
        FAILURE_COUNT[$cell_id]=$(( ${FAILURE_COUNT[$cell_id]:-0} + 1 ))
        echo "[WARN] ${cell_id}: health check failed (HTTP ${http_code}, consecutive: ${FAILURE_COUNT[$cell_id]})"
        if [[ ${FAILURE_COUNT[$cell_id]} -ge $HEALTH_THRESHOLD ]]; then
            echo "[CRITICAL] ${cell_id}: ${HEALTH_THRESHOLD} consecutive failures - triggering alert"
            send_alert "$cell_id" "$http_code"
        fi
        return 1
    fi

    # Reset the counter on success
    FAILURE_COUNT[$cell_id]=0

    # Collect cell metrics
    local metrics
    metrics=$(curl -s --max-time 5 "${cell_endpoint}/metrics/cell" 2>/dev/null || echo "{}")
    local active_connections error_rate p99_latency
    active_connections=$(echo "$metrics" | jq -r '.active_connections // "N/A"')
    error_rate=$(echo "$metrics" | jq -r '.error_rate_percent // "N/A"')
    p99_latency=$(echo "$metrics" | jq -r '.p99_latency_ms // "N/A"')
    echo "[OK] ${cell_id}: connections=${active_connections}, errors=${error_rate}%, p99=${p99_latency}ms"
    return 0
}

send_alert() {
    local cell_id="$1"
    local http_code="$2"
    local timestamp
    timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    if [[ -n "$ALERT_WEBHOOK" ]]; then
        curl -s -X POST "$ALERT_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d "{
                \"severity\": \"critical\",
                \"cell_id\": \"${cell_id}\",
                \"message\": \"Cell ${cell_id} health check failed (HTTP ${http_code})\",
                \"timestamp\": \"${timestamp}\",
                \"action\": \"investigate_and_consider_drain\"
            }"
    fi
}

# Fetch the cell list and check each cell
echo "=== Cell Health Check: $(date -u +"%Y-%m-%d %H:%M:%S UTC") ==="
CELLS=$(curl -s "${CELL_ROUTER_URL}/cells" | jq -r '.[] | "\(.cell_id)|\(.endpoint)"')

TOTAL=0
HEALTHY=0
UNHEALTHY=0

while IFS='|' read -r cell_id endpoint; do
    if check_cell_health "$cell_id" "$endpoint"; then
        HEALTHY=$((HEALTHY + 1))
    else
        UNHEALTHY=$((UNHEALTHY + 1))
    fi
    TOTAL=$((TOTAL + 1))
done <<< "$CELLS"

echo ""
echo "=== Summary: Total=${TOTAL}, Healthy=${HEALTHY}, Unhealthy=${UNHEALTHY} ==="

# Send a global alert if 50% or more of the cells are unhealthy
if [[ $TOTAL -gt 0 ]] && [[ $((UNHEALTHY * 100 / TOTAL)) -ge 50 ]]; then
    echo "[GLOBAL CRITICAL] More than 50% cells unhealthy - possible global issue"
fi
```
Key monitoring metrics
The metrics that must be monitored in a cell-based architecture are:
Cell level metrics:
- Request throughput (RPS) and error rate per cell
- P50/P95/P99 latency per cell
- DB connection pool utilization per cell
- CPU/memory utilization per cell
- Queue depth per cell
Global level metrics:
- Request throughput and latency of the cell router
- Traffic imbalance ratio between cells
- Number of unhealthy cells
- Global error rate (aggregated across all cells)
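The traffic imbalance ratio listed above can be computed from per-cell RPS with a small helper (an illustrative sketch; the function name and alerting threshold are assumptions):

```python
def traffic_imbalance(rps_by_cell: dict[str, float]) -> float:
    """Ratio of the busiest cell's RPS to the mean across all cells.

    1.0 means perfectly even traffic; values well above 1.0 suggest a
    hotspot (e.g. a large tenant in a regular cell) worth alerting on.
    """
    if not rps_by_cell:
        raise ValueError("no cells reported")
    mean = sum(rps_by_cell.values()) / len(rps_by_cell)
    return max(rps_by_cell.values()) / mean if mean > 0 else 0.0
```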
Routing rule settings
```yaml
# cell-routing-rules.yaml
# Cell routing rule definitions
apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-routing-config
  namespace: cell-router
data:
  routing-rules.yaml: |
    version: "2.0"
    default_strategy: "consistent-hashing"
    partition_key: "X-Tenant-Id"

    cells:
      - id: "cell-001"
        endpoint: "https://cell-001.internal.example.com"
        region: "us-east-1"
        status: "active"
        capacity: 10000
        weight: 100
      - id: "cell-002"
        endpoint: "https://cell-002.internal.example.com"
        region: "us-east-1"
        status: "active"
        capacity: 10000
        weight: 100
      - id: "cell-003"
        endpoint: "https://cell-003.internal.example.com"
        region: "us-west-2"
        status: "active"
        capacity: 10000
        weight: 100
      - id: "cell-004"
        endpoint: "https://cell-004.internal.example.com"
        region: "eu-west-1"
        status: "draining"
        capacity: 10000
        weight: 0
        drain_target: "cell-003"

    # Dedicated cells (large tenants)
    dedicated_cells:
      - tenant_id: "tenant-enterprise-001"
        cell_id: "cell-dedicated-001"
        reason: "SLA requirement - 99.99% availability"
      - tenant_id: "tenant-enterprise-002"
        cell_id: "cell-dedicated-002"
        reason: "Data residency - EU GDPR compliance"

    # Traffic failover rules
    failover_rules:
      health_check_interval_seconds: 10
      consecutive_failures_threshold: 3
      failover_strategy: "nearest-healthy-cell"
      auto_failback: true
      failback_delay_minutes: 15

    # Cell capacity limits
    capacity_management:
      overflow_strategy: "reject-with-retry-after"
      overflow_http_status: 503
      retry_after_seconds: 30
      capacity_warning_threshold_percent: 80
      capacity_critical_threshold_percent: 95
```
Troubleshooting
Problem 1: Traffic imbalance between cells
Symptom: One cell's CPU utilization exceeds 90% while the others stay around 30%.
Cause: A large tenant was placed in a regular cell, overloading it.
Solution: Identify the large tenant and migrate it to a dedicated cell: map the tenant to the dedicated cell in the routing table, then switch traffic over after data migration.
Problem 2: Failure of the cell router itself
Symptom: The cell router stops responding and the entire system is paralyzed.
Cause: The cell router is a SPOF (Single Point of Failure).
Solution: Run multiple cell router instances behind DNS-based load balancing to make the router itself highly available. Keep the router logic minimal to reduce its failure surface, and cache cell-assignment information in clients so that they can reach their existing cells directly even during a router outage.
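The client-side caching idea can be sketched as follows (a minimal illustration; the class name, TTL default, and `lookup` callable are assumptions, not a real SDK):

```python
import time

class CachedCellClient:
    """Client-side cache of cell assignments.

    During a router outage, the client keeps using the last known
    assignment (possibly stale) so already-routed users can still
    reach their cell directly.
    """
    def __init__(self, lookup, ttl_seconds: float = 300.0):
        self._lookup = lookup  # callable: partition_key -> cell_id
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[str, float]] = {}

    def resolve(self, key: str) -> str:
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[1] < self._ttl:
            return hit[0]  # fresh cache entry
        try:
            cell = self._lookup(key)
            self._cache[key] = (cell, time.monotonic())
            return cell
        except Exception:
            if hit:
                return hit[0]  # router down: serve the stale entry
            raise
```

Because cell assignments change rarely (only on rebalancing), serving a stale assignment during a router outage is usually safe and keeps the router off the request hot path.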
Problem 3: Cross-cell data queries
Symptom: Loading overall user statistics on the admin dashboard takes tens of seconds.
Cause: Queries are sent to every cell sequentially and the results are merged.
Solution: Periodically export aggregated data from each cell to a central analytical (OLAP) database and serve admin dashboards from that central store. If real-time results are required, configure streaming aggregation with Change Data Capture (CDC).
Problem 4: Data inconsistency during migration between cells
Symptom: Some users' data is missing or duplicated after cell relocation.
Cause: The source and destination cells fell out of sync during the dual-write period.
Solution: Verify data checksums before and after migration. With event sourcing, the target cell's state can be reconstructed exactly through event replay. Keep the source cell's data for at least 48 hours to allow migration rollback.
Precautions during operation
Keep the cell router a thin layer
The cell router is the only shared component in the system; adding complex business logic to it nullifies the benefit of cell isolation. A cell router should perform exactly three functions: extract the partition key, compute the hash, and map it to a cell.
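As a sketch, those three steps fit in a handful of lines. This is illustrative only: the `RING` table is a stand-in for real routing config, and simple modulo mapping stands in for a full Consistent Hashing ring.

```python
import hashlib

# Illustrative cell table; in production this comes from the routing config.
RING = ["cell-001", "cell-002", "cell-003", "cell-004"]

def handle(headers: dict[str, str]) -> str:
    """The entire router: extract the partition key, hash it, map to a cell.
    Anything beyond these three steps belongs inside the cells themselves."""
    key = headers.get("X-Tenant-Id")                       # 1. extract
    if key is None:
        raise ValueError("missing X-Tenant-Id header")
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)  # 2. hash
    return RING[h % len(RING)]                             # 3. map
```

If a change to the router needs more than this, that is usually a sign the logic belongs in the cells, not the router.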
Finding the right cell size
If cells are too large, the blast radius of a failure grows; if they are too small, operational complexity and cost grow. Per AWS's recommendations, a size that accommodates 5-10% of total users is appropriate, which means 10-20 cells.
Data residency in global deployment
To comply with data-protection regulations such as the GDPR and national privacy laws, user data for a given region must be processed only in cells located in that region. Cell routing must therefore consider regional constraints alongside the partition key.
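One way to sketch residency-aware routing: hash only within the cells permitted for the user's region. The region table and cell names below are illustrative assumptions.

```python
import hashlib

# Illustrative residency table: which cells may hold data for each region.
CELLS_BY_REGION = {
    "eu": ["cell-eu-001", "cell-eu-002"],
    "us": ["cell-us-001", "cell-us-002", "cell-us-003"],
}

def route_with_residency(partition_key: str, user_region: str) -> str:
    """Restrict the hash space to the region's permitted cells, so that
    e.g. an EU user's data can never land in a non-EU cell."""
    cells = CELLS_BY_REGION.get(user_region)
    if not cells:
        raise ValueError(f"no cells registered for region {user_region!r}")
    h = int(hashlib.sha256(partition_key.encode()).hexdigest(), 16)
    return cells[h % len(cells)]
```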
Testing Strategy
Cell-level chaos engineering is essential in a cell-based architecture. Use AWS Fault Injection Service (FIS) or Chaos Mesh to simulate failures in individual cells and regularly verify that the other cells are unaffected. Run cell-level fault-injection drills at least once a quarter.
Failure cases and recovery
Case 1: Total failure due to cell router configuration error
At one company, a bad update to the cell routing table routed all users to a single cell. That cell exceeded its capacity and began returning 503 errors while the other cells sat idle.
Lesson: Canary rollout must be applied even to routing configuration changes, and routing table changes must pass an automated test pipeline that verifies cell traffic distribution.
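The verification step this lesson calls for can be sketched as a pre-rollout check: replay a sample of keys through the candidate routing table and fail the pipeline if any cell would absorb an outsized share. The function names and the modulo-hash stand-in are illustrative.

```python
import hashlib

def simulate_distribution(cells: list[str], sample_keys: list[str]) -> dict[str, float]:
    """Replay sample keys through a candidate routing table and return
    the fraction of traffic each cell would receive."""
    counts = {c: 0 for c in cells}
    for key in sample_keys:
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        counts[cells[h % len(cells)]] += 1
    total = len(sample_keys)
    return {c: n / total for c, n in counts.items()}

def assert_balanced(dist: dict[str, float], max_share: float = 0.5) -> None:
    """Gate a routing-table rollout: fail if any single cell would take
    more than `max_share` of traffic (the signature of the misconfiguration
    described above)."""
    worst = max(dist.values())
    if worst > max_share:
        raise AssertionError(f"one cell would receive {worst:.0%} of traffic")
```

Running this against production key samples in CI would have caught the all-users-to-one-cell configuration before it shipped.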
Case 2: Hidden dependencies between cells
One team believed they had achieved complete isolation between cells, but an external API (a payment gateway) shared by all cells failed, affecting every cell simultaneously.
Lesson: Identify shared dependencies outside the cells; isolate them per cell where possible, or at least apply a Circuit Breaker and Fallback. During architecture reviews, build a dependency map to surface hidden shared points.
Case 3: Data migration failure
During a cell relocation, dual writes were interrupted by a network failure, leaving the data in the source and target cells inconsistent. A rollback was attempted, but some data had already been cleaned up from the source cell.
Lesson: Migrations must be designed to be idempotent, and source data must be deleted only after a sufficient stabilization period (at least 48 hours) following migration completion.
Checklist
Items to review before adopting a cell-based architecture:
Design Stage:
- Partition key selected (tenant, user, region, etc.)
- Cell size determined (5-10% of users recommended)
- Cell router designed (thin-layer principle)
- Cross-cell data access patterns defined
- Global reference data replication strategy established
- Data residency requirements checked
Implementation Phase:
- Independent infrastructure provisioned per cell (VPC, DB, cache, queue)
- Cell-to-cell isolation implemented with NetworkPolicy or security groups
- Consistent Hashing based router implemented
- Cell-level canary deployment pipeline built
- Health check endpoint implemented for each cell
- Cell router redundancy in place
Operation Phase:
- Cell-level monitoring dashboard built
- Cell-level alerting rules configured
- Automated traffic drain on cell failure
- Cell-to-cell data migration tooling prepared
- Quarterly cell fault-injection drills scheduled
- Cell capacity monitoring and auto-scaling configured
- Global metric aggregation pipeline established
- Verification of routing configuration changes automated
References
- AWS Well-Architected - Reducing Scope of Impact with Cell-Based Architecture
- InfoQ - Cell-Based Architecture for Distributed Systems
- System Design Newsletter - Cell-Based Architecture
- AWS Solutions Library - Guidance for Cell-Based Architecture on AWS
- InfoQ Minibook - Cell-Based Architecture 2024
- Slack Engineering - Building Reliable Distributed Systems
- DoorDash Engineering Blog