Cell-Based Architecture Design and Operation: A Strategy to Minimize the Blast Radius of Failures

Introduction

When you operate a large-scale distributed system, you run into an uncomfortable truth: no matter how sophisticated your failure handling is, a failure in a single component shared by the whole system interrupts the whole service. Well-known examples include the roughly six-hour global outage at Facebook (now Meta) in 2021, a widespread Microsoft Azure outage caused by a WAN configuration error in 2023, and Cloudflare's API Gateway failure in 2024. What these incidents have in common is that the **blast radius** was the entire system.

Cell-based architecture is a structural answer to this problem. The pattern divides the system into independent cells and isolates them so that a failure in one cell does not affect the others. AWS applies it to its own infrastructure and documents it in its Well-Architected guidance, and companies such as Slack, DoorDash, and Salesforce have proven it in production.

This article covers cell-based architecture from an operational perspective: its core principles, its relationship to the Bulkhead pattern, cell routing strategies (consistent hashing, partition keys), implementation on Kubernetes and AWS, cell-level canary deployment, data partitioning, monitoring and observability, real-world examples, and troubleshooting.

Cell-based architecture core concepts

What is a Cell?

A cell is a self-contained deployment unit that can independently perform the entire function of a service. Each cell has its own computing resources, data storage, message queue, and cache, and does not share state with other cells. Even if one cell completely fails, the remaining cells operate normally.

The key properties of cell-based architecture are:

- **Isolation**: Computing, storage, and network resources are not shared between cells.

- **Independent Deployment**: Each cell can be deployed, upgraded, and rolled back independently.

- **Horizontal Scaling**: Expand overall system capacity by adding cells.

- **Fault Isolation**: Failure of one cell does not propagate to other cells.

- **Capacity Capping**: Each cell has a fixed maximum capacity.

Differences from traditional architecture

Traditional architecture (shared infrastructure):

┌─────────────────────────────────────────────┐
│                Load Balancer                │
├─────────────────────────────────────────────┤
│ App Server 1 | App Server 2 | App Server 3  │  <-- shared compute
├─────────────────────────────────────────────┤
│               Shared Database               │  <-- single point of failure
├─────────────────────────────────────────────┤
│                Shared Cache                 │  <-- single point of failure
└─────────────────────────────────────────────┘

On failure: all users are affected (blast radius = 100%)

Cell-based architecture (isolated infrastructure):

         ┌──────────────┐
         │ Cell Router  │  <-- the only shared component (kept minimal)
         └──────┬───────┘
   ┌────────┬───┴────┬────────┐
   ▼        ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Cell 1│ │Cell 2│ │Cell 3│ │Cell 4│
│ App  │ │ App  │ │ App  │ │ App  │
│ DB   │ │ DB   │ │ DB   │ │ DB   │
│Cache │ │Cache │ │Cache │ │Cache │
└──────┘ └──────┘ └──────┘ └──────┘

On failure: only users of the failed cell are affected (blast radius = 25%)

Comparison of architectural patterns

| Comparison Item | Cell-Based | Multi-Region | Active-Active | Traditional Monolith |
| ------------------------- | -------------------------------- | ------------------------------- | ---------------------------- | -------------------- |
| Fault isolation level | Per cell (5-10% of users) | Per region (30-50% of users) | Per region | None (100% of users) |
| Implementation complexity | High | Medium-High | High | Low |
| Infrastructure cost | Medium-High (per-cell overhead) | High (regional replication) | Very high | Low |
| Latency | Low (intra-cell communication) | Variable (inter-region latency) | Variable | Low |
| Deployment flexibility | Canary per cell possible | Per region | Per region | Whole-system deployment |
| Data consistency | Strong consistency within a cell | Eventual consistency | Conflict resolution required | Strong consistency |
| Scaling method | Add cells | Add regions | Add regions | Vertical scaling |
| Operational complexity | High (N cells to manage) | Medium | High | Low |

Bulkhead pattern and fault isolation

Cell-based architecture is essentially an infrastructure-level application of the Bulkhead pattern. Bulkhead is a concept derived from a ship's bulkhead. Even if flooding occurs in one compartment, the bulkhead blocks the inflow of water into other compartments, preventing the entire ship from sinking.

Bulkhead application level

The Bulkhead pattern can be applied at several levels, with cell-based architecture being the highest level application.

1. **Thread Pool Level**: Hystrix/Resilience4j’s Thread Pool Bulkhead (smallest unit)

2. **Process level**: Separation of processes by function

3. **Service Level**: Isolation between microservices

4. **Infrastructure level**: Isolation into separate VMs, clusters, VPCs

5. **Cell level**: Bundles the entire stack (compute+DB+cache+queue) into one isolation unit (largest unit)

Blast Radius Calculation

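The blast radius of a failure can be estimated with the following formula:

Share of users affected by a single-cell failure = 1 / N (N = number of cells, assuming users are spread evenly across cells)

Examples:

- 4 cells: a failure affects at most 25% of users

- 10 cells: a failure affects at most 10% of users

- 20 cells: a failure affects at most 5% of users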

However, increasing the number of cells without limit drives up operational complexity and cost sharply. In practice, sizing each cell for **5-10% of users** is a good balance between cost and isolation.
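The arithmetic above can be sketched in a few lines. This is a minimal illustration that assumes users are spread evenly across cells; the function names `blast_radius` and `cells_needed` are made up for this example.

```python
def blast_radius(num_cells: int, failed_cells: int = 1) -> float:
    """Fraction of users affected when `failed_cells` of `num_cells` cells fail.

    Assumes every cell serves the same share of users.
    """
    if num_cells <= 0:
        raise ValueError("num_cells must be positive")
    if not 0 <= failed_cells <= num_cells:
        raise ValueError("failed_cells must be between 0 and num_cells")
    return failed_cells / num_cells


def cells_needed(max_percent_affected: int) -> int:
    """Smallest cell count that keeps a single-cell failure at or below the target.

    The target is an integer percentage so the ceiling division stays exact.
    """
    if not 1 <= max_percent_affected <= 100:
        raise ValueError("max_percent_affected must be between 1 and 100")
    return (100 + max_percent_affected - 1) // max_percent_affected


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(f"{n} cells -> {blast_radius(n):.0%} of users per failed cell")
    print(f"target 5% -> at least {cells_needed(5)} cells")
```

A target that does not divide 100 evenly rounds up, so a target of 8% calls for 13 cells rather than 12.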

Cell routing strategy

The most important design decision in a cell-based architecture is which users to route to which cells. Routing must be deterministic, and the same user must always be directed to the same cell.

Consistent Hashing based routing

Consistent Hashing ensures that only the minimum number of keys are relocated when cells are added or removed.

# cell_router.py - consistent-hashing-based cell router

import hashlib
from bisect import bisect_right
from typing import Optional


class CellRouter:
    """Consistent-hashing-based cell router.

    Uses virtual nodes to spread each cell evenly
    around the hash ring.
    """

    def __init__(self, virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self.ring: list[int] = []
        self.ring_to_cell: dict[int, str] = {}
        self.cells: dict[str, dict] = {}

    def _hash(self, key: str) -> int:
        """Convert a key to an integer using the first 64 bits of its SHA-256 hash."""
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest[:16], 16)

    def add_cell(self, cell_id: str, metadata: Optional[dict] = None) -> None:
        """Add a cell to the hash ring.

        Args:
            cell_id: cell identifier (e.g. "cell-us-east-001")
            metadata: cell metadata (endpoint, capacity, etc.)
        """
        self.cells[cell_id] = metadata or {}
        for i in range(self.virtual_nodes):
            virtual_key = f"{cell_id}:vn{i}"
            hash_value = self._hash(virtual_key)
            self.ring.append(hash_value)
            self.ring_to_cell[hash_value] = cell_id
        self.ring.sort()

    def remove_cell(self, cell_id: str) -> None:
        """Remove a cell from the hash ring.

        Keys owned by the removed cell fall through to the next
        virtual node on the ring, i.e. to neighboring cells.
        """
        self.ring = [
            h for h in self.ring
            if self.ring_to_cell.get(h) != cell_id
        ]
        self.ring_to_cell = {
            h: c for h, c in self.ring_to_cell.items()
            if c != cell_id
        }
        del self.cells[cell_id]

    def route(self, partition_key: str) -> str:
        """Determine the target cell for a partition key.

        Args:
            partition_key: routing key (e.g. tenant_id, user_id, org_id)

        Returns:
            ID of the target cell
        """
        if not self.ring:
            raise ValueError("the hash ring contains no cells")
        hash_value = self._hash(partition_key)
        idx = bisect_right(self.ring, hash_value)
        if idx == len(self.ring):
            idx = 0  # wrap around to the start of the ring
        return self.ring_to_cell[self.ring[idx]]

    def get_cell_distribution(self) -> dict[str, int]:
        """Return the number of virtual nodes each cell owns on the hash ring."""
        distribution: dict[str, int] = {}
        for cell_id in self.ring_to_cell.values():
            distribution[cell_id] = distribution.get(cell_id, 0) + 1
        return distribution


# Usage example
if __name__ == "__main__":
    router = CellRouter(virtual_nodes=150)

    # Register cells
    router.add_cell("cell-001", {"region": "us-east-1", "capacity": 10000})
    router.add_cell("cell-002", {"region": "us-east-1", "capacity": 10000})
    router.add_cell("cell-003", {"region": "us-west-2", "capacity": 10000})
    router.add_cell("cell-004", {"region": "eu-west-1", "capacity": 10000})

    # Route users
    user_ids = ["user-12345", "user-67890", "org-acme", "org-globex"]
    for uid in user_ids:
        target_cell = router.route(uid)
        print(f"{uid} -> {target_cell}")

    # Check each cell's share of virtual nodes
    dist = router.get_cell_distribution()
    total = sum(dist.values())
    for cell_id, count in sorted(dist.items()):
        pct = (count / total) * 100
        print(f"{cell_id}: {pct:.1f}%")

Partition Key selection criteria

The partition key to be used for routing is determined according to the business domain.

| Domain | Partition key | Advantages | Precautions |
| ----------------- | ------------------ | ----------------------------------------------- | ------------------------------------------- |
| SaaS multi-tenant | tenant_id / org_id | Complete isolation between tenants | Large tenants can create hotspots |
| Social media | user_id | Data locality for each user | Imbalance caused by influencer accounts |
| E-commerce | region + user_id | Geographic latency optimization | Cross-region order processing gets complex |
| Messaging | workspace_id | Communication locality within a workspace | Large variation in workspace size |
| Payment | merchant_id | Easier regulatory compliance per merchant | Large merchants need a separate cell |

Large tenants that would otherwise act as noisy neighbors should additionally get the **Dedicated Cell** strategy: they are assigned cells of their own instead of sharing a cell with other tenants.
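A minimal sketch of the dedicated-cell override, assuming a hypothetical `DEDICATED_CELLS` table that is consulted before shared placement. Plain modulo hashing stands in for the hash ring here only to keep the example short, and all tenant and cell names are illustrative.

```python
import hashlib

# Hypothetical override table: large tenants pinned to their own cells.
DEDICATED_CELLS = {
    "tenant-enterprise-001": "cell-dedicated-001",
}

# Shared cells that serve every other tenant.
SHARED_CELLS = ["cell-001", "cell-002", "cell-003"]


def resolve_cell(
    tenant_id: str,
    dedicated: dict[str, str] = DEDICATED_CELLS,
    shared: list[str] = SHARED_CELLS,
) -> str:
    """Return the dedicated cell if one is configured, else a shared cell."""
    if tenant_id in dedicated:
        return dedicated[tenant_id]
    # Deterministic placement: the same tenant always maps to the same cell.
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return shared[int.from_bytes(digest[:8], "big") % len(shared)]


if __name__ == "__main__":
    for tenant in ("tenant-enterprise-001", "tenant-small-42"):
        print(f"{tenant} -> {resolve_cell(tenant)}")
```

Checking the override first keeps the decision deterministic, so moving a tenant to a dedicated cell is a single table change once its data has been migrated.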

Kubernetes-based cell implementation

The most common ways to implement cells in a Kubernetes environment are **namespace-based isolation** or **cluster-based isolation**. If you need a high level of fault isolation, choose separate clusters per cell, and if cost efficiency is important, choose namespace isolation.

Cell Deployment Manifest

# cell-deployment.yaml
# Each cell is deployed into its own namespace
apiVersion: v1
kind: Namespace
metadata:
  name: cell-001
  labels:
    cell-id: 'cell-001'
    region: 'us-east-1'
    tier: 'standard'
---
# Per-cell resource limits (guards against noisy neighbors)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cell-001-quota
  namespace: cell-001
spec:
  hard:
    requests.cpu: '16'
    requests.memory: '32Gi'
    limits.cpu: '32'
    limits.memory: '64Gi'
    pods: '100'
    services: '20'
    persistentvolumeclaims: '10'
---
# Cell application deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-app
  namespace: cell-001
  labels:
    app: cell-app
    cell-id: 'cell-001'
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cell-app
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-app
        cell-id: 'cell-001'
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: cell-app
              cell-id: 'cell-001'
      containers:
        - name: app
          image: myregistry/cell-app:v2.4.1
          ports:
            - containerPort: 8080
          env:
            - name: CELL_ID
              value: 'cell-001'
            - name: DB_HOST
              value: 'cell-001-db.cell-001.svc.cluster.local'
            - name: CACHE_HOST
              value: 'cell-001-redis.cell-001.svc.cluster.local'
            - name: QUEUE_URL
              value: 'https://sqs.us-east-1.amazonaws.com/123456789/cell-001-queue'
          resources:
            requests:
              cpu: '500m'
              memory: '1Gi'
            limits:
              cpu: '1000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# Cell-dedicated database (StatefulSet)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cell-001-db
  namespace: cell-001
spec:
  serviceName: cell-001-db
  replicas: 3
  selector:
    matchLabels:
      app: cell-db
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-db
        cell-id: 'cell-001'
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: 'cell_001'
            # The postgres image does not start without a superuser password.
            # The Secret name below is an assumption; create it separately.
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cell-001-db-credentials
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ['ReadWriteOnce']
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi
---
# Cell-dedicated Redis cache
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-001-redis
  namespace: cell-001
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cell-redis
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-redis
        cell-id: 'cell-001'
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          args: ['--maxmemory', '2gb', '--maxmemory-policy', 'allkeys-lru']
---
# NetworkPolicy that blocks traffic between cells
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cell-isolation
  namespace: cell-001
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow only traffic from inside the same cell
    - from:
        - namespaceSelector:
            matchLabels:
              cell-id: 'cell-001'
    # Allow traffic coming from the cell router
    - from:
        - namespaceSelector:
            matchLabels:
              role: 'cell-router'
  egress:
    # Traffic to the inside of the same cell
    - to:
        - namespaceSelector:
            matchLabels:
              cell-id: 'cell-001'
    # Allow DNS (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow access to external AWS services (SQS, S3, etc.)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
      ports:
        - port: 443
          protocol: TCP

The key in this manifest is **NetworkPolicy**. By explicitly blocking network traffic between cells, fault propagation is isolated at the network level.

AWS-based cell implementation

In an AWS environment, stronger isolation can be implemented by utilizing VPCs, subnets, and security groups. By separating each cell into a separate VPC or AWS account, complete isolation, including IAM boundaries, is possible.

Terraform-based cell infrastructure

# modules/cell/main.tf
# Reusable cell infrastructure module
#
# The DB/cache subnet groups, security groups, ECS service, dead-letter queue,
# and Secrets Manager data source referenced below are assumed to be defined
# elsewhere in this module and are omitted here.

variable "cell_id" {
  description = "Unique cell identifier"
  type        = string
}

variable "cell_cidr" {
  description = "CIDR block of the cell VPC"
  type        = string
}

variable "environment" {
  description = "Deployment environment"
  type        = string
  default     = "production"
}

variable "max_capacity" {
  description = "Maximum number of users per cell"
  type        = number
  default     = 10000
}

# Availability zones used by the private subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

# Cell-dedicated VPC
resource "aws_vpc" "cell" {
  cidr_block           = var.cell_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name        = "cell-${var.cell_id}-vpc"
    CellId      = var.cell_id
    Environment = var.environment
  }
}

# One private subnet per availability zone
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.cell.id
  cidr_block        = cidrsubnet(var.cell_cidr, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name   = "cell-${var.cell_id}-private-${count.index}"
    CellId = var.cell_id
  }
}

# Cell-dedicated ECS cluster
resource "aws_ecs_cluster" "cell" {
  name = "cell-${var.cell_id}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    CellId = var.cell_id
  }
}

# Cell-dedicated RDS (Multi-AZ)
resource "aws_db_instance" "cell" {
  identifier              = "cell-${var.cell_id}-db"
  engine                  = "postgres"
  engine_version          = "16.4"
  instance_class          = "db.r6g.xlarge"
  allocated_storage       = 100
  max_allocated_storage   = 500
  storage_encrypted       = true
  storage_type            = "gp3"
  multi_az                = true
  db_subnet_group_name    = aws_db_subnet_group.cell.name
  vpc_security_group_ids  = [aws_security_group.cell_db.id]
  db_name                 = "cell_${replace(var.cell_id, "-", "_")}"
  username                = "cell_admin"
  password                = data.aws_secretsmanager_secret_version.db_password.secret_string
  backup_retention_period = 14
  deletion_protection     = true

  tags = {
    CellId = var.cell_id
  }
}

# Cell-dedicated ElastiCache Redis
resource "aws_elasticache_replication_group" "cell" {
  replication_group_id       = "cell-${var.cell_id}-redis"
  description                = "Redis cluster for cell ${var.cell_id}"
  node_type                  = "cache.r6g.large"
  num_cache_clusters         = 2
  engine_version             = "7.1"
  port                       = 6379
  subnet_group_name          = aws_elasticache_subnet_group.cell.name
  security_group_ids         = [aws_security_group.cell_cache.id]
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  automatic_failover_enabled = true

  tags = {
    CellId = var.cell_id
  }
}

# Cell-dedicated SQS queue
resource "aws_sqs_queue" "cell" {
  name                       = "cell-${var.cell_id}-events"
  visibility_timeout_seconds = 60
  message_retention_seconds  = 1209600 # 14 days
  receive_wait_time_seconds  = 20

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.cell_dlq.arn
    maxReceiveCount     = 3
  })

  tags = {
    CellId = var.cell_id
  }
}

# Cell auto scaling
resource "aws_appautoscaling_target" "cell_ecs" {
  # Assumes 100 users per task; ceil() keeps the value an integer
  max_capacity       = ceil(var.max_capacity / 100)
  min_capacity       = 3
  resource_id        = "service/${aws_ecs_cluster.cell.name}/${aws_ecs_service.cell.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cell_cpu" {
  name               = "cell-${var.cell_id}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.cell_ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.cell_ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.cell_ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

# Outputs
output "cell_vpc_id" {
  value = aws_vpc.cell.id
}

output "cell_db_endpoint" {
  value = aws_db_instance.cell.endpoint
}

output "cell_redis_endpoint" {
  value = aws_elasticache_replication_group.cell.primary_endpoint_address
}

output "cell_sqs_url" {
  value = aws_sqs_queue.cell.url
}

Cell provisioning

# environments/production/main.tf
# Provisioning of the actual cells

module "cell_001" {
  source       = "../../modules/cell"
  cell_id      = "001"
  cell_cidr    = "10.1.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

module "cell_002" {
  source       = "../../modules/cell"
  cell_id      = "002"
  cell_cidr    = "10.2.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

module "cell_003" {
  source       = "../../modules/cell"
  cell_id      = "003"
  cell_cidr    = "10.3.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

Cell Deployment Strategy (Canary per Cell)

One of the most powerful benefits of cell-based architecture is cell-level canary deployment. Deploy the new version to one cell first, and gradually expand to other cells if there are no problems. Even if a failure occurs, only the users in that cell are affected.

Deployment Pipeline

# .github/workflows/cell-canary-deploy.yaml
# Cell-level canary deployment pipeline
name: Cell Canary Deployment

on:
  push:
    branches: [main]

env:
  IMAGE_TAG: ${{ github.sha }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push container image
        run: |
          docker build -t myregistry/cell-app:${{ env.IMAGE_TAG }} .
          docker push myregistry/cell-app:${{ env.IMAGE_TAG }}

  # Phase 1: deploy to the canary cell
  deploy-canary:
    needs: build
    runs-on: ubuntu-latest
    environment: canary
    steps:
      - name: Deploy to canary cell (cell-001)
        run: |
          kubectl set image deployment/cell-app \
            app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
            -n cell-001
          kubectl rollout status deployment/cell-app \
            -n cell-001 --timeout=300s
      - name: Run smoke tests against canary cell
        run: |
          ./scripts/smoke-test.sh cell-001
      - name: Monitor canary metrics (10 minutes)
        run: |
          ./scripts/canary-monitor.sh cell-001 600

  # Phase 2: first batch (30% of cells)
  deploy-batch-1:
    needs: deploy-canary
    runs-on: ubuntu-latest
    environment: production-batch-1
    strategy:
      # max-parallel only takes effect with a matrix: one job per cell
      max-parallel: 2
      matrix:
        cell: [cell-002, cell-003, cell-004]
    steps:
      - name: Deploy to batch 1 cell
        run: |
          kubectl set image deployment/cell-app \
            app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
            -n ${{ matrix.cell }}
          kubectl rollout status deployment/cell-app \
            -n ${{ matrix.cell }} --timeout=300s

  # Phase 3: all remaining cells
  deploy-remaining:
    needs: deploy-batch-1
    runs-on: ubuntu-latest
    environment: production-all
    steps:
      - name: Deploy to all remaining cells
        run: |
          # -x matches whole names so cell-0010 is not excluded by cell-001;
          # "|| true" keeps the step alive when no cells remain
          REMAINING_CELLS=$(kubectl get ns -l cell-id \
            --no-headers -o custom-columns=":metadata.name" | \
            grep -v -x -E "cell-001|cell-002|cell-003|cell-004" || true)
          for CELL in $REMAINING_CELLS; do
            kubectl set image deployment/cell-app \
              app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
              -n "$CELL"
            kubectl rollout status deployment/cell-app \
              -n "$CELL" --timeout=300s
          done

Real examples

Slack’s cell-based architecture

Slack began a large-scale move to a workspace-based cell architecture in 2022. The partition key is **workspace_id**, and all channels, messages, and files in a workspace are stored in the same cell. Before the transition, a service-wide outage in February 2022 affected all users; after it, a failure in an individual cell affects only the users in that cell (approximately 5-8% of the total).

Slack's key design decisions include:

- **Vitess**-based MySQL sharding configured on a cell basis

- Cell router is designed as a **Thin Layer** containing only minimal logic

- **Dedicated Cell** assigned to large enterprise customers (e.g. IBM, Amazon)

- If inter-cell communication is required (cross-workspace search, etc.), use **Asynchronous Message Bus**

DoorDash’s Cell Architecture

DoorDash introduced a region-based cell architecture in 2023. The partition key is **geographic_region**, which partitions the United States into several geographic cells. Through this, traffic surges in certain areas (such as specific cities during Super Bowl season) are isolated so that they do not affect other areas.

- Use **DynamoDB** as cell-level data storage

- Apache Kafka clusters are also separated by cell

- Control feature rollout via cell-level **Feature Flag**

- Establish automation to **drain** traffic from that cell to an adjacent cell in the event of a failure

Salesforce’s Pod Architecture

Salesforce is one of the pioneers of cell-based architecture. In Salesforce, cells are called **Pods**, and each Pod contains a complete Salesforce instance. Dozens of Pods are distributed around the world, and customers (tenants) are pinned to specific Pods.

- Route requests to cells through a per-tenant **instance URL** (e.g. na1.salesforce.com, eu5.salesforce.com)

- Self-developed **Org Migration** tool for moving data between Pods

- Operate independent release cycles using Pod-level maintenance windows

Data partitioning

The hardest problem in cell-based architecture is **data partitioning**: cross-cell query requirements must be met without giving up cell independence.

Partitioning strategy

**1. Cell-Local Data**

This is data that is accessed only inside the cell, and exists only in the database of that cell. For example, the user's messages, order history, and session information.

**2. Global Reference Data**

This is read-only data that is equally required in all cells. This includes exchange rate information, product catalogs, country codes, etc. This data is **asynchronously replicated** from the global data store to each cell.

**3. Cross-Cell Aggregate Data**

This is a case where data from all cells must be aggregated, such as overall system statistics and global dashboards. Metrics aggregated from each cell are exported to a central analytics platform (e.g. Snowflake, BigQuery) for processing.

Data Migration

Data migration is required when rebalancing cells. The core principles are:

- **Dual Write**: Write to both source and target cells during the migration period.

- **Gradual transition**: Switch read to target cell first, then switch write.

- **Rollback possibility**: Keep original cell data for at least 48 hours

Monitoring and Observability

In a cell-based architecture, both **cell-level metrics** and **global aggregate metrics** are required.

Cell health check script

#!/bin/bash
# cell-health-check.sh
# Health check script that inspects the state of every cell

set -euo pipefail

CELL_ROUTER_URL="${CELL_ROUTER_URL:-http://cell-router.internal:8080}"
ALERT_WEBHOOK="${ALERT_WEBHOOK:-}"
HEALTH_THRESHOLD=3  # threshold for consecutive failures

# Note: these counters live only inside this process. To track consecutive
# failures across runs, persist them or call the check from a long-lived loop.
declare -A FAILURE_COUNT

check_cell_health() {
  local cell_id="$1"
  local cell_endpoint="$2"

  # Application health check.
  # curl already prints "000" when the request fails, so only swallow the exit code.
  local http_code
  http_code=$(curl -s -o /dev/null -w "%{http_code}" \
    --connect-timeout 5 --max-time 10 \
    "${cell_endpoint}/health/ready" 2>/dev/null || true)

  if [[ "$http_code" != "200" ]]; then
    FAILURE_COUNT[$cell_id]=$(( ${FAILURE_COUNT[$cell_id]:-0} + 1 ))
    echo "[WARN] ${cell_id}: health check failed (HTTP ${http_code}, consecutive: ${FAILURE_COUNT[$cell_id]})"

    if [[ ${FAILURE_COUNT[$cell_id]} -ge $HEALTH_THRESHOLD ]]; then
      echo "[CRITICAL] ${cell_id}: ${HEALTH_THRESHOLD} consecutive failures - triggering alert"
      send_alert "$cell_id" "$http_code"
    fi
    return 1
  fi

  # Reset the counter on success
  FAILURE_COUNT[$cell_id]=0

  # Collect cell metrics
  local metrics
  metrics=$(curl -s --max-time 5 "${cell_endpoint}/metrics/cell" 2>/dev/null || echo "{}")

  local active_connections error_rate p99_latency
  active_connections=$(echo "$metrics" | jq -r '.active_connections // "N/A"')
  error_rate=$(echo "$metrics" | jq -r '.error_rate_percent // "N/A"')
  p99_latency=$(echo "$metrics" | jq -r '.p99_latency_ms // "N/A"')

  echo "[OK] ${cell_id}: connections=${active_connections}, errors=${error_rate}%, p99=${p99_latency}ms"
  return 0
}

send_alert() {
  local cell_id="$1"
  local http_code="$2"
  local timestamp
  timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

  if [[ -n "$ALERT_WEBHOOK" ]]; then
    curl -s -X POST "$ALERT_WEBHOOK" \
      -H "Content-Type: application/json" \
      -d "{
        \"severity\": \"critical\",
        \"cell_id\": \"${cell_id}\",
        \"message\": \"Cell ${cell_id} health check failed (HTTP ${http_code})\",
        \"timestamp\": \"${timestamp}\",
        \"action\": \"investigate_and_consider_drain\"
      }"
  fi
}

# Fetch the cell list and check each cell
echo "=== Cell Health Check: $(date -u +"%Y-%m-%d %H:%M:%S UTC") ==="

CELLS=$(curl -s "${CELL_ROUTER_URL}/cells" | jq -r '.[] | "\(.cell_id)|\(.endpoint)"')

TOTAL=0
HEALTHY=0
UNHEALTHY=0

while IFS='|' read -r cell_id endpoint; do
  # Skip the empty line produced when the router returns no cells
  [[ -z "$cell_id" ]] && continue
  if check_cell_health "$cell_id" "$endpoint"; then
    HEALTHY=$((HEALTHY + 1))
  else
    UNHEALTHY=$((UNHEALTHY + 1))
  fi
  TOTAL=$((TOTAL + 1))
done <<< "$CELLS"

echo ""
echo "=== Summary: Total=${TOTAL}, Healthy=${HEALTHY}, Unhealthy=${UNHEALTHY} ==="

# Raise a global alert when 50% or more of the cells are unhealthy
if [[ $TOTAL -gt 0 ]] && [[ $((UNHEALTHY * 100 / TOTAL)) -ge 50 ]]; then
  echo "[GLOBAL CRITICAL] At least 50% of cells unhealthy - possible global issue"
fi

Key monitoring metrics

The metrics that must be monitored in a cell-based architecture are:

**Cell level metrics:**

- Request throughput (RPS) and error rate per cell

- P50/P95/P99 latency per cell

- DB connection pool usage rate by cell

- CPU/memory utilization by cell

- Queue Depth per cell

**Global level metrics:**

- Request throughput and latency of cell routers

- Traffic imbalance ratio between cells

- Number of unhealthy cells

- Global error rate (sum of all cells)
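One of the global metrics above, the traffic imbalance ratio, can be computed as sketched below. The article does not fix a definition, so this example assumes the ratio is the busiest cell's RPS divided by the mean RPS across cells; the function names and the 1.5 threshold are illustrative.

```python
def imbalance_ratio(rps_by_cell: dict[str, float]) -> float:
    """Busiest cell's RPS divided by the mean RPS (1.0 means perfectly even)."""
    if not rps_by_cell:
        raise ValueError("rps_by_cell must not be empty")
    mean = sum(rps_by_cell.values()) / len(rps_by_cell)
    if mean == 0:
        return 1.0  # no traffic anywhere counts as balanced
    return max(rps_by_cell.values()) / mean


def overloaded_cells(rps_by_cell: dict[str, float], threshold: float = 1.5) -> list[str]:
    """Cells whose RPS exceeds `threshold` times the mean RPS."""
    if not rps_by_cell:
        return []
    mean = sum(rps_by_cell.values()) / len(rps_by_cell)
    return sorted(cell for cell, rps in rps_by_cell.items() if rps > threshold * mean)


if __name__ == "__main__":
    sample = {"cell-001": 300.0, "cell-002": 100.0, "cell-003": 100.0, "cell-004": 100.0}
    print(f"imbalance ratio: {imbalance_ratio(sample):.2f}")
    print(f"overloaded: {overloaded_cells(sample)}")
```

A ratio that stays well above 1.0 usually points at a hot partition key, which is the situation the dedicated-cell strategy is meant to address.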

Routing rule settings

# cell-routing-rules.yaml
# Cell routing rule definitions
apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-routing-config
  namespace: cell-router
data:
  routing-rules.yaml: |
    version: "2.0"
    default_strategy: "consistent-hashing"
    partition_key: "X-Tenant-Id"

    cells:
      - id: "cell-001"
        endpoint: "https://cell-001.internal.example.com"
        region: "us-east-1"
        status: "active"
        capacity: 10000
        weight: 100
      - id: "cell-002"
        endpoint: "https://cell-002.internal.example.com"
        region: "us-east-1"
        status: "active"
        capacity: 10000
        weight: 100
      - id: "cell-003"
        endpoint: "https://cell-003.internal.example.com"
        region: "us-west-2"
        status: "active"
        capacity: 10000
        weight: 100
      - id: "cell-004"
        endpoint: "https://cell-004.internal.example.com"
        region: "eu-west-1"
        status: "draining"
        capacity: 10000
        weight: 0
        drain_target: "cell-003"

    # Dedicated cells (large tenants)
    dedicated_cells:
      - tenant_id: "tenant-enterprise-001"
        cell_id: "cell-dedicated-001"
        reason: "SLA requirement - 99.99% availability"
      - tenant_id: "tenant-enterprise-002"
        cell_id: "cell-dedicated-002"
        reason: "Data residency - EU GDPR compliance"

    # Traffic failover rules
    failover_rules:
      health_check_interval_seconds: 10
      consecutive_failures_threshold: 3
      failover_strategy: "nearest-healthy-cell"
      auto_failback: true
      failback_delay_minutes: 15

    # Cell capacity limits
    capacity_management:
      overflow_strategy: "reject-with-retry-after"
      overflow_http_status: 503
      retry_after_seconds: 30
      capacity_warning_threshold_percent: 80
      capacity_critical_threshold_percent: 95

Troubleshooting

Problem 1: Cell-to-cell traffic imbalance

**Symptom**: CPU utilization of a specific cell exceeds 90%, but other cells remain at 30%.

**Cause**: A large tenant is placed in a regular cell, causing excessive load on that cell.

**Solution**: Identify large tenants and migrate them to dedicated cells. In the routing table, the corresponding tenant is mapped to a dedicated cell, and traffic is switched after data migration.

Problem 2: Failure of the cell router itself

**Symptom**: The cell router becomes unresponsive and the entire system is paralyzed.

**Cause**: The cell router becomes a SPOF (Single Point of Failure).

**Solution**: Run multiple redundant cell router instances and keep the router tier available through DNS-based load balancing. Keep cell router logic to a minimum to reduce the likelihood of failure. Caching cell assignments in the client also lets clients keep reaching their existing cells directly during a router outage.
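The client-side caching idea can be sketched as follows. This is a minimal illustration, assuming the router lookup is exposed as a callable that raises on failure; `CachedCellLookup`, its TTL default, and the injected clock are hypothetical names chosen for this example.

```python
import time
from typing import Callable


class CachedCellLookup:
    """Caches partition-key-to-cell assignments and serves stale entries on router failure."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: dict[str, tuple[str, float]] = {}

    def get(self, partition_key: str) -> str:
        now = self._clock()
        entry = self._cache.get(partition_key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh cache hit, router not contacted
        try:
            cell = self._lookup(partition_key)
        except Exception:
            if entry is not None:
                return entry[0]  # router is down: fall back to the stale assignment
            raise  # nothing cached, so the failure has to surface
        self._cache[partition_key] = (cell, now)
        return cell


if __name__ == "__main__":
    client = CachedCellLookup(lambda key: "cell-002")
    print(client.get("tenant-42"))
```

Serving a stale assignment is reasonable here because routing is deterministic and assignments change rarely; a tenant that was migrated during the outage is the case that needs separate handling.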

Problem 3: Cross-cell data query

**Symptom**: When viewing overall user statistics on the administrator dashboard, response time slows down to tens of seconds.

**Cause**: The dashboard queries every cell sequentially and aggregates the results on the fly.

**Solution**: Periodically export aggregated data from each cell to a central analytical (OLAP) database, and have the administrator dashboard read from that central store. If real-time data is required, set up streaming aggregation with Change Data Capture (CDC).

Problem 4: Data inconsistency during migration between cells

**Symptom**: Some users' data is missing or duplicated after cell relocation.

**Cause**: The source and target cells fall out of sync during the dual-write period.

**Solution**: Verify data checksums before and after migration. With an event-sourced design, the target cell's state can be rebuilt exactly by replaying events. Keep the source cell's data for at least 48 hours so the migration can be rolled back.
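Checksum verification can be sketched as below. The example assumes rows are available as JSON-serializable dictionaries with an `id` field; a real migration would more likely compute checksums inside the database, and the function names here are illustrative.

```python
import hashlib
import json
from collections import Counter


def table_checksum(rows: list[dict]) -> str:
    """Order-independent SHA-256 checksum over a set of rows."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode())
        digest.update(b"\n")
    return digest.hexdigest()


def find_discrepancies(
    source_rows: list[dict], target_rows: list[dict], key: str = "id"
) -> tuple[list, list]:
    """Return (ids missing from the target, ids duplicated in the target)."""
    source_ids = {row[key] for row in source_rows}
    target_counts = Counter(row[key] for row in target_rows)
    missing = sorted(source_ids - set(target_counts))
    duplicated = sorted(k for k, n in target_counts.items() if n > 1)
    return missing, duplicated


if __name__ == "__main__":
    source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
    target = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}, {"id": 1, "v": "a"}]
    print("checksums match:", table_checksum(source) == table_checksum(target))
    print("missing, duplicated:", find_discrepancies(source, target))
```

The checksum answers whether the two cells agree at all, while the ID comparison narrows a mismatch down to the missing or duplicated rows described in the symptom.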

Precautions during operation

Cell routers must be a thin layer

The cell router is the only shared component in the system. Adding complex business logic here nullifies the benefits of cell-based isolation. A cell router must perform only three functions: extracting partition keys, calculating hashes, and mapping cells.

Golden ratio of cell sizes

If a cell is too large, the scope of impact in the event of a failure expands, while if the cell is too small, operational complexity and costs increase. According to AWS's recommendations, a **size that accommodates 5-10% of total users** is appropriate. This means 10-20 cells.

Data residency in global deployment

To comply with data regulations such as GDPR and the Personal Data Protection Act, user data in a specific region must be processed only in cells in that region. When routing cells, regional constraints must be considered along with the partition key.

Testing Strategy

**Cell-level chaos engineering** is essential in a cell-based architecture. Use AWS Fault Injection Service (FIS) or Chaos Mesh to simulate failures in individual cells and regularly verify that other cells are unaffected. Running cell-level fault injection drills at least once a quarter is recommended.

Failure cases and recovery

Case 1: Total failure due to cell router configuration error

In one company, incorrect settings were distributed when updating the cell routing table, resulting in an incident where all users were routed to one cell. The cell exceeded its capacity and started returning a 503 error, leaving other cells idle.

**Lesson**: Canary deployment must be applied even when changing routing settings. Changes to routing tables must go through an automated test pipeline that includes verification of cell traffic distribution.
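A traffic-distribution check of this kind can run in the pipeline before a routing table is published. The following is a minimal sketch, assuming the routing table is a list of cell IDs indexed by hash slot; the table layout, function names, and the 35% threshold are assumptions for illustration.

```python
import hashlib
from collections import Counter


def slot_for(key: str, slot_count: int) -> int:
    """Hash a partition key to a slot index in the routing table."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % slot_count


def validate_routing_table(routing_table, expected_cells, sample_keys, max_share):
    """Simulate routing for sample keys and report distribution problems."""
    counts = Counter(
        routing_table[slot_for(key, len(routing_table))] for key in sample_keys
    )
    problems = []
    for cell in sorted(expected_cells):
        share = counts.get(cell, 0) / len(sample_keys)
        if share == 0:
            problems.append(f"{cell} receives no traffic")
        elif share > max_share:
            problems.append(f"{cell} receives {share:.0%} of traffic")
    for cell in sorted(set(counts) - set(expected_cells)):
        problems.append(f"{cell} is not a known cell")
    return problems


cells = ["cell-1", "cell-2", "cell-3", "cell-4"]
sample_keys = [f"user-{i}" for i in range(10_000)]

good_table = cells * 16        # 64 slots spread evenly over 4 cells
bad_table = ["cell-1"] * 64    # the misconfiguration from Case 1

good_problems = validate_routing_table(good_table, cells, sample_keys, max_share=0.35)
bad_problems = validate_routing_table(bad_table, cells, sample_keys, max_share=0.35)
```

A table like `bad_table` is rejected before it reaches production because one cell takes all traffic and the other three take none.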

Case 2: Hidden dependencies between cells

The team believed the cells were completely isolated, but when an external API shared by all cells (the payment gateway) failed, every cell was affected at the same time.

**Lesson**: Identify shared dependencies outside of the cell, isolate them by cell if possible, or at least apply Circuit Breaker and Fallback. When reviewing architecture, you should create a **Dependency Map** to find hidden shared points.
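As a rough illustration of Circuit Breaker with Fallback around a shared dependency, here is a minimal single-threaded sketch. The class, thresholds, and fallback values are hypothetical; a production system would more likely use an established resilience library and add thread safety.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback instead."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: skip the dependency entirely
            # Timeout elapsed (half-open): fall through and allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and resets the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

Each cell would hold its own breaker instance, so a cell that sees the payment gateway failing stops calling it and, for example, queues the payment for later instead of letting requests pile up.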

Case 3: Data migration failure

During a cell relocation, dual writes were interrupted by a network failure, leaving the data in the source cell and the target cell inconsistent. A rollback was attempted, but some data had already been cleaned up from the source cell.

**Lesson**: Migration must be designed to be **idempotent**, and the original data must be deleted only after a sufficient stabilization period (at least 48 hours) has passed since the migration completed.
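Both points can be sketched in a few lines. The example below assumes each migration write carries a unique operation ID and a per-record version number; these fields, the function names, and the in-memory stores are hypothetical stand-ins for whatever the real migration tool uses.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=48)  # stabilization period from the lesson above


def apply_batch(target, applied_ops, batch):
    """Apply (op_id, record_id, version, payload) writes so retries are harmless."""
    for op_id, record_id, version, payload in batch:
        if op_id in applied_ops:
            continue  # already applied: replaying the batch changes nothing
        current = target.get(record_id)
        # Never overwrite a newer version with an older one
        if current is None or current["version"] < version:
            target[record_id] = {"version": version, "payload": payload}
        applied_ops.add(op_id)


def safe_to_delete_source(completed_at: datetime, now: datetime) -> bool:
    """Allow source cleanup only after the retention period has fully elapsed."""
    return now - completed_at >= RETENTION


batch = [
    ("op-1", "u1", 1, {"plan": "free"}),
    ("op-2", "u1", 2, {"plan": "pro"}),
]
target, applied_ops = {}, set()
apply_batch(target, applied_ops, batch)
apply_batch(target, applied_ops, batch)  # retry after a network failure
```

Because replays and out-of-order delivery converge to the same state, an interrupted migration can simply be re-run from the start.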

Checklist

The following items should be reviewed before introducing a cell-based architecture.

**Design Stage:**

- Select the partition key (tenant, user, region, etc.)

- Determine the cell size (recommended: 5-10% of users per cell)

- Design the cell router (thin-layer principle)

- Define cross-cell data access patterns

- Establish a replication strategy for global reference data

- Check data residency requirements

**Implementation Phase:**

- Provision independent infrastructure for each cell (VPC, DB, cache, queue)

- Isolate cells from one another with NetworkPolicy or security groups

- Implement a Consistent Hashing based router

- Build a cell-level canary deployment pipeline

- Implement a health check endpoint for each cell

- Make the cell router redundant

**Operation Phase:**

- Build a cell-level monitoring dashboard

- Set cell-level alerting rules

- Automate traffic draining when a cell fails

- Prepare cell-to-cell data migration tools

- Plan quarterly cell fault injection drills

- Set up cell capacity monitoring and auto-scaling

- Establish a global metric aggregation pipeline

- Automate verification of routing configuration changes

References

- [AWS Well-Architected - Reducing Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/)

- [InfoQ - Cell-Based Architecture for Distributed Systems](https://www.infoq.com/articles/cell-based-architecture-distributed-systems/)

- [System Design Newsletter - Cell-Based Architecture](https://newsletter.systemdesign.one/p/cell-based-architecture)

- [AWS Solutions Library - Guidance for Cell-Based Architecture on AWS](https://github.com/aws-solutions-library-samples/guidance-for-cell-based-architecture-on-aws)

- [InfoQ Minibook - Cell-Based Architecture 2024](https://www.infoq.com/minibooks/cell-based-architecture-2024/)

- [Slack Engineering - Building Reliable Distributed Systems](https://slack.engineering/)

- [DoorDash Engineering Blog](https://doordash.engineering/)

