Cell-Based Architecture Design and Operation: Strategy to Minimize Failure Explosion Radius


Introduction

When operating a large-scale distributed system, you face an uncomfortable truth: no matter how sophisticated your failure response is, a failure in a single component shared by the entire system takes the entire service down. Representative examples include Facebook's (now Meta) six-hour global outage in 2021, a widespread Microsoft Azure outage caused by a WAN configuration error in 2023, and Cloudflare's API Gateway failure in 2024. What these incidents have in common is a Blast Radius that covered the entire system.

Cell-Based Architecture is a structural solution to this problem. It is a pattern that divides the system into independent cells and isolates them so that a failure in one cell does not affect the others. AWS has applied it to its own infrastructure and documented it in the Well-Architected Framework, and it has been verified in production by companies such as Slack, DoorDash, and Salesforce.

This article covers cell-based architecture comprehensively at the operational level: core principles, the relationship with the Bulkhead pattern, cell routing strategies (Consistent Hashing, partition keys), Kubernetes- and AWS-based implementations, cell-level canary deployment, data partitioning, monitoring and observability, and real-world operational examples and troubleshooting.

Cell-based architecture core concepts

What is a Cell?

A cell is a self-contained deployment unit that can independently perform the entire function of a service. Each cell has its own computing resources, data storage, message queue, and cache, and does not share state with other cells. Even if one cell completely fails, the remaining cells operate normally.

The key properties of cell-based architecture are:

  • Isolation: Computing, storage, and network resources are not shared between cells.
  • Independent Deployment: Each cell can be deployed, upgraded, and rolled back independently.
  • Horizontal Scaling: Expand overall system capacity by adding cells.
  • Fault Isolation: Failure of one cell does not propagate to other cells.
  • Capacity Capping: Each cell has a fixed maximum capacity.

Differences from traditional architecture

Traditional architecture (shared infrastructure):
┌─────────────────────────────────────────────┐
│                Load Balancer                │
├─────────────────────────────────────────────┤
│ App Server 1 | App Server 2 | App Server 3  │  <-- shared compute
├─────────────────────────────────────────────┤
│               Shared Database               │  <-- single point of failure
├─────────────────────────────────────────────┤
│                Shared Cache                 │  <-- single point of failure
└─────────────────────────────────────────────┘
  On failure: all users affected (Blast Radius = 100%)

Cell-based architecture (isolated infrastructure):
        ┌──────────────┐
        │ Cell Router  │  <-- the only shared component (kept minimal)
        └──────┬───────┘
   ┌────────┬──┴─────┬────────┐
   ▼        ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Cell 1│ │Cell 2│ │Cell 3│ │Cell 4│
│ App  │ │ App  │ │ App  │ │ App  │
│ DB   │ │ DB   │ │ DB   │ │ DB   │
│Cache │ │Cache │ │Cache │ │Cache │
└──────┘ └──────┘ └──────┘ └──────┘
  On failure: only that cell's users affected (Blast Radius = 25%)

Comparison of architectural patterns

| Comparison Item | Cell-Based | Multi-Region | Active-Active | Traditional Monolith |
|---|---|---|---|---|
| Fault isolation level | Per cell (5-10% of users) | Regional (30-50% of users) | Regional | None (100% of users) |
| Implementation complexity | High | Medium-high | High | Low |
| Infrastructure cost | Medium-high (per-cell overhead) | High (regional replication) | Very high | Low |
| Latency | Low (intra-cell communication) | Variable (inter-region delay) | Variable | Low |
| Deployment flexibility | Per-cell canary possible | Regional | Regional | All-at-once |
| Data consistency | Strong consistency within a cell | Eventual consistency | Conflict resolution required | Strong consistency |
| Scaling method | Add cells | Add regions | Add regions | Vertical scaling |
| Operational complexity | High (N cells to manage) | Medium | High | Low |

Bulkhead pattern and fault isolation

Cell-based architecture is essentially an infrastructure-level application of the Bulkhead pattern. Bulkhead is a concept derived from a ship's bulkhead. Even if flooding occurs in one compartment, the bulkhead blocks the inflow of water into other compartments, preventing the entire ship from sinking.

Bulkhead application level

The Bulkhead pattern can be applied at several levels, with cell-based architecture being the highest level application.

  1. Thread Pool Level: Hystrix/Resilience4j’s Thread Pool Bulkhead (smallest unit)
  2. Process level: Separation of processes by function
  3. Service Level: Isolation between microservices
  4. Infrastructure level: Isolation into separate VMs, clusters, VPCs
  5. Cell level: Bundles the entire stack (compute+DB+cache+queue) into one isolation unit (largest unit)

Blast Radius Calculation

The failure explosion radius can be calculated by the following formula:

Fraction of users affected by a single-cell failure = 1 / N   (N = number of cells)

Examples:
- N = 4:  at most 25% of users affected by a failure
- N = 10: at most 10% of users affected
- N = 20: at most 5% of users affected

However, increasing the number of cells indefinitely drives up operational complexity and cost. In practice, sizing each cell for 5-10% of users strikes a good balance between cost and isolation level.
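The formula above can be sketched in a few lines of Python; the helper names are illustrative, not from any library.

```python
# Hypothetical helpers illustrating the blast-radius formula: the impact of a
# single-cell failure for N equally sized cells, and the minimum cell count
# needed to stay under a target blast radius.
import math

def blast_radius_percent(num_cells: int) -> float:
    """Percent of users affected when a single cell fails."""
    if num_cells <= 0:
        raise ValueError("num_cells must be positive")
    return 100.0 / num_cells

def cells_for_target_radius(target_percent: float) -> int:
    """Minimum number of cells so a single-cell failure stays at or under target_percent."""
    return math.ceil(100.0 / target_percent)

print(blast_radius_percent(4))     # 25.0
print(cells_for_target_radius(5))  # 20
```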

Cell routing strategy

The most important design decision in a cell-based architecture is which users to route to which cells. Routing must be deterministic, and the same user must always be directed to the same cell.

Consistent Hashing based routing

Consistent Hashing ensures that only the minimum number of keys are relocated when cells are added or removed.

# cell_router.py - consistent-hashing-based cell router

import hashlib
from bisect import bisect_right
from typing import Optional


class CellRouter:
    """Consistent Hashing 기반 셀 라우터.

    가상 노드(virtual node)를 사용하여 해시 링 위에
    셀을 균등하게 분산 배치한다.
    """

    def __init__(self, virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self.ring: list[int] = []
        self.ring_to_cell: dict[int, str] = {}
        self.cells: dict[str, dict] = {}

    def _hash(self, key: str) -> int:
        """SHA-256 해시로 키를 정수 값으로 변환한다."""
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest[:16], 16)

    def add_cell(self, cell_id: str, metadata: Optional[dict] = None) -> None:
        """셀을 해시 링에 추가한다.

        Args:
            cell_id: 셀 식별자 (예: "cell-us-east-001")
            metadata: 셀 메타데이터 (endpoint, capacity 등)
        """
        self.cells[cell_id] = metadata or {}
        for i in range(self.virtual_nodes):
            virtual_key = f"{cell_id}:vn{i}"
            hash_value = self._hash(virtual_key)
            self.ring.append(hash_value)
            self.ring_to_cell[hash_value] = cell_id
        self.ring.sort()

    def remove_cell(self, cell_id: str) -> None:
        """셀을 해시 링에서 제거한다.

        제거된 셀의 트래픽은 인접 셀로 자동 재배치된다.
        """
        self.ring = [
            h for h in self.ring
            if self.ring_to_cell.get(h) != cell_id
        ]
        self.ring_to_cell = {
            h: c for h, c in self.ring_to_cell.items()
            if c != cell_id
        }
        del self.cells[cell_id]

    def route(self, partition_key: str) -> str:
        """파티션 키로 대상 셀을 결정한다.

        Args:
            partition_key: 라우팅 키 (예: tenant_id, user_id, org_id)

        Returns:
            대상 셀 ID
        """
        if not self.ring:
            raise ValueError("no cells in the hash ring")

        hash_value = self._hash(partition_key)
        idx = bisect_right(self.ring, hash_value)

        if idx == len(self.ring):
            idx = 0

        return self.ring_to_cell[self.ring[idx]]

    def get_cell_distribution(self) -> dict[str, int]:
        """각 셀이 해시 링에서 차지하는 비율을 반환한다."""
        distribution: dict[str, int] = {}
        for cell_id in self.ring_to_cell.values():
            distribution[cell_id] = distribution.get(cell_id, 0) + 1
        return distribution


# Usage example
if __name__ == "__main__":
    router = CellRouter(virtual_nodes=150)

    # Register cells
    router.add_cell("cell-001", {"region": "us-east-1", "capacity": 10000})
    router.add_cell("cell-002", {"region": "us-east-1", "capacity": 10000})
    router.add_cell("cell-003", {"region": "us-west-2", "capacity": 10000})
    router.add_cell("cell-004", {"region": "eu-west-1", "capacity": 10000})

    # Route users
    user_ids = ["user-12345", "user-67890", "org-acme", "org-globex"]
    for uid in user_ids:
        target_cell = router.route(uid)
        print(f"{uid} -> {target_cell}")

    # Check the cell distribution
    dist = router.get_cell_distribution()
    total = sum(dist.values())
    for cell_id, count in sorted(dist.items()):
        pct = (count / total) * 100
        print(f"{cell_id}: {pct:.1f}%")

Partition Key selection criteria

The partition key used for routing depends on the business domain.

| Domain | Partition Key | Advantages | Caveats |
|---|---|---|---|
| SaaS multi-tenant | tenant_id / org_id | Complete isolation between tenants | Large tenants can become hotspots |
| Social media | user_id | Data locality per user | Influencer account imbalance |
| E-commerce | region + user_id | Optimizes for geographic latency | Cross-region order processing is complicated |
| Messaging | workspace_id | Communication locality within a workspace | Large variance in workspace size |
| Payments | merchant_id | Easier merchant-level regulatory compliance | Large merchants need dedicated cells |

For large tenants that would otherwise become noisy neighbors, also apply the Dedicated Cell strategy, which allocates them their own cells.
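A minimal sketch of how a dedicated-cell override might sit in front of the consistent-hash routing shown earlier; the DedicatedCellRouter class and its pin() method are hypothetical names for illustration, not part of any library.

```python
# Dedicated-cell override in front of consistent hashing: pinned tenants go to
# their dedicated cell, everyone else falls through to the shared hash ring.
import hashlib
from bisect import bisect_right

class DedicatedCellRouter:
    def __init__(self, shared_cells: list[str], virtual_nodes: int = 150):
        self.dedicated: dict[str, str] = {}   # tenant_id -> dedicated cell
        self.ring: list[tuple[int, str]] = []
        for cell in shared_cells:
            for i in range(virtual_nodes):
                h = int(hashlib.sha256(f"{cell}:vn{i}".encode()).hexdigest()[:16], 16)
                self.ring.append((h, cell))
        self.ring.sort()

    def pin(self, tenant_id: str, cell_id: str) -> None:
        """Assign a tenant to a dedicated cell, bypassing the hash ring."""
        self.dedicated[tenant_id] = cell_id

    def route(self, tenant_id: str) -> str:
        if tenant_id in self.dedicated:       # dedicated mapping wins
            return self.dedicated[tenant_id]
        h = int(hashlib.sha256(tenant_id.encode()).hexdigest()[:16], 16)
        idx = bisect_right(self.ring, (h, ""))
        if idx == len(self.ring):             # wrap around the ring
            idx = 0
        return self.ring[idx][1]
```

The override table must be checked before the hash lookup, so that migrating a tenant to a dedicated cell is a pure routing-table change.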

Kubernetes-based cell implementation

The most common ways to implement cells in a Kubernetes environment are namespace-based isolation and cluster-based isolation. Choose a separate cluster per cell if you need a high level of fault isolation; choose namespace isolation if cost efficiency matters more.

Cell Deployment Manifest

# cell-deployment.yaml
# Each cell is deployed into its own namespace

apiVersion: v1
kind: Namespace
metadata:
  name: cell-001
  labels:
    cell-id: 'cell-001'
    region: 'us-east-1'
    tier: 'standard'
---
# Per-cell resource limits (prevents noisy neighbors)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cell-001-quota
  namespace: cell-001
spec:
  hard:
    requests.cpu: '16'
    requests.memory: '32Gi'
    limits.cpu: '32'
    limits.memory: '64Gi'
    pods: '100'
    services: '20'
    persistentvolumeclaims: '10'
---
# Cell application deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-app
  namespace: cell-001
  labels:
    app: cell-app
    cell-id: 'cell-001'
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cell-app
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-app
        cell-id: 'cell-001'
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: cell-app
              cell-id: 'cell-001'
      containers:
        - name: app
          image: myregistry/cell-app:v2.4.1
          ports:
            - containerPort: 8080
          env:
            - name: CELL_ID
              value: 'cell-001'
            - name: DB_HOST
              value: 'cell-001-db.cell-001.svc.cluster.local'
            - name: CACHE_HOST
              value: 'cell-001-redis.cell-001.svc.cluster.local'
            - name: QUEUE_URL
              value: 'https://sqs.us-east-1.amazonaws.com/123456789/cell-001-queue'
          resources:
            requests:
              cpu: '500m'
              memory: '1Gi'
            limits:
              cpu: '1000m'
              memory: '2Gi'
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# Cell-dedicated database (StatefulSet)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cell-001-db
  namespace: cell-001
spec:
  serviceName: cell-001-db
  replicas: 3
  selector:
    matchLabels:
      app: cell-db
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-db
        cell-id: 'cell-001'
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: 'cell_001'
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ['ReadWriteOnce']
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi
---
# Cell-dedicated Redis cache
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-001-redis
  namespace: cell-001
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cell-redis
      cell-id: 'cell-001'
  template:
    metadata:
      labels:
        app: cell-redis
        cell-id: 'cell-001'
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          args: ['--maxmemory', '2gb', '--maxmemory-policy', 'allkeys-lru']
---
# NetworkPolicy that blocks inter-cell traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cell-isolation
  namespace: cell-001
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow traffic only from within the same cell
    - from:
        - namespaceSelector:
            matchLabels:
              cell-id: 'cell-001'
    # Allow traffic from the Cell Router
    - from:
        - namespaceSelector:
            matchLabels:
              role: 'cell-router'
  egress:
    # Traffic to pods within the same cell
    - to:
        - namespaceSelector:
            matchLabels:
              cell-id: 'cell-001'
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
    # Allow access to external AWS services (SQS, S3, etc.)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
      ports:
        - port: 443
          protocol: TCP

The key element of this manifest is the NetworkPolicy. By explicitly blocking network traffic between cells, fault propagation is contained at the network level.

AWS-based cell implementation

In an AWS environment, stronger isolation can be implemented by utilizing VPCs, subnets, and security groups. By separating each cell into a separate VPC or AWS account, complete isolation, including IAM boundaries, is possible.

Terraform-based cell infrastructure

# modules/cell/main.tf
# Reusable cell infrastructure module

variable "cell_id" {
  description = "셀 고유 식별자"
  type        = string
}

variable "cell_cidr" {
  description = "셀 VPC CIDR 블록"
  type        = string
}

variable "environment" {
  description = "배포 환경"
  type        = string
  default     = "production"
}

variable "max_capacity" {
  description = "셀 최대 사용자 수"
  type        = number
  default     = 10000
}

# Cell-dedicated VPC
resource "aws_vpc" "cell" {
  cidr_block           = var.cell_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name        = "cell-${var.cell_id}-vpc"
    CellId      = var.cell_id
    Environment = var.environment
  }
}

# Private subnets per availability zone
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.cell.id
  cidr_block        = cidrsubnet(var.cell_cidr, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name   = "cell-${var.cell_id}-private-${count.index}"
    CellId = var.cell_id
  }
}

# Cell-dedicated ECS cluster
resource "aws_ecs_cluster" "cell" {
  name = "cell-${var.cell_id}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    CellId = var.cell_id
  }
}

# Cell-dedicated RDS (Multi-AZ)
resource "aws_db_instance" "cell" {
  identifier     = "cell-${var.cell_id}-db"
  engine         = "postgres"
  engine_version = "16.4"
  instance_class = "db.r6g.xlarge"

  allocated_storage     = 100
  max_allocated_storage = 500
  storage_encrypted     = true
  storage_type          = "gp3"

  multi_az            = true
  db_subnet_group_name = aws_db_subnet_group.cell.name
  vpc_security_group_ids = [aws_security_group.cell_db.id]

  db_name  = "cell_${replace(var.cell_id, "-", "_")}"
  username = "cell_admin"
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  backup_retention_period = 14
  deletion_protection     = true

  tags = {
    CellId = var.cell_id
  }
}

# Cell-dedicated ElastiCache Redis
resource "aws_elasticache_replication_group" "cell" {
  replication_group_id = "cell-${var.cell_id}-redis"
  description          = "Redis cluster for cell ${var.cell_id}"

  node_type            = "cache.r6g.large"
  num_cache_clusters   = 2
  engine_version       = "7.1"
  port                 = 6379
  subnet_group_name    = aws_elasticache_subnet_group.cell.name
  security_group_ids   = [aws_security_group.cell_cache.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  automatic_failover_enabled = true

  tags = {
    CellId = var.cell_id
  }
}

# Cell-dedicated SQS queue
resource "aws_sqs_queue" "cell" {
  name = "cell-${var.cell_id}-events"

  visibility_timeout_seconds = 60
  message_retention_seconds  = 1209600  # 14 days
  receive_wait_time_seconds  = 20

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.cell_dlq.arn
    maxReceiveCount     = 3
  })

  tags = {
    CellId = var.cell_id
  }
}

# Cell Auto Scaling
resource "aws_appautoscaling_target" "cell_ecs" {
  max_capacity       = var.max_capacity / 100  # assuming 100 users per instance
  min_capacity       = 3
  resource_id        = "service/${aws_ecs_cluster.cell.name}/${aws_ecs_service.cell.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cell_cpu" {
  name               = "cell-${var.cell_id}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.cell_ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.cell_ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.cell_ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

# Outputs
output "cell_vpc_id" {
  value = aws_vpc.cell.id
}

output "cell_db_endpoint" {
  value = aws_db_instance.cell.endpoint
}

output "cell_redis_endpoint" {
  value = aws_elasticache_replication_group.cell.primary_endpoint_address
}

output "cell_sqs_url" {
  value = aws_sqs_queue.cell.url
}

Cell provisioning

# environments/production/main.tf
# Actual cell provisioning

module "cell_001" {
  source       = "../../modules/cell"
  cell_id      = "001"
  cell_cidr    = "10.1.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

module "cell_002" {
  source       = "../../modules/cell"
  cell_id      = "002"
  cell_cidr    = "10.2.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

module "cell_003" {
  source       = "../../modules/cell"
  cell_id      = "003"
  cell_cidr    = "10.3.0.0/16"
  environment  = "production"
  max_capacity = 10000
}

Cell Deployment Strategy (Canary per Cell)

One of the most powerful benefits of cell-based architecture is cell-level canary deployment: deploy the new version to one cell first, then gradually roll it out to the remaining cells if no problems appear. Even if the release fails, only the users in that cell are affected.

Deployment Pipeline

# .github/workflows/cell-canary-deploy.yaml
# Cell-level canary deployment pipeline

name: Cell Canary Deployment

on:
  push:
    branches: [main]

env:
  IMAGE_TAG: ${{ github.sha }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push container image
        run: |
          docker build -t myregistry/cell-app:${{ env.IMAGE_TAG }} .
          docker push myregistry/cell-app:${{ env.IMAGE_TAG }}

  # Phase 1: deploy to the canary cell
  deploy-canary:
    needs: build
    runs-on: ubuntu-latest
    environment: canary
    steps:
      - name: Deploy to canary cell (cell-001)
        run: |
          kubectl set image deployment/cell-app \
            app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
            -n cell-001
          kubectl rollout status deployment/cell-app \
            -n cell-001 --timeout=300s

      - name: Run smoke tests against canary cell
        run: |
          ./scripts/smoke-test.sh cell-001

      - name: Monitor canary metrics (10 minutes)
        run: |
          ./scripts/canary-monitor.sh cell-001 600

  # Phase 2: first batch (~30% of cells)
  deploy-batch-1:
    needs: deploy-canary
    runs-on: ubuntu-latest
    environment: production-batch-1
    strategy:
      max-parallel: 2
    steps:
      - name: Deploy to batch 1 cells
        run: |
          for CELL in cell-002 cell-003 cell-004; do
            kubectl set image deployment/cell-app \
              app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
              -n $CELL
            kubectl rollout status deployment/cell-app \
              -n $CELL --timeout=300s
          done

  # Phase 3: all remaining cells
  deploy-remaining:
    needs: deploy-batch-1
    runs-on: ubuntu-latest
    environment: production-all
    steps:
      - name: Deploy to all remaining cells
        run: |
          REMAINING_CELLS=$(kubectl get ns -l cell-id \
            --no-headers -o custom-columns=":metadata.name" | \
            grep -v -E "cell-001|cell-002|cell-003|cell-004")
          for CELL in $REMAINING_CELLS; do
            kubectl set image deployment/cell-app \
              app=myregistry/cell-app:${{ env.IMAGE_TAG }} \
              -n $CELL
            kubectl rollout status deployment/cell-app \
              -n $CELL --timeout=300s
          done
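The monitoring step referenced in the pipeline above is elided, but the core check a canary monitor typically performs can be sketched as follows; the function name, thresholds, and the error-rate floor are assumptions for illustration, not the actual script.

```python
# Sketch of a canary health decision: compare the canary cell's error rate
# (percent) against the baseline cells', with both a relative and an absolute
# bound. Thresholds here are illustrative defaults.

def canary_passes(canary_error_rate: float,
                  baseline_error_rate: float,
                  max_ratio: float = 2.0,
                  absolute_ceiling: float = 1.0) -> bool:
    """Pass only if the canary's error rate is below an absolute ceiling
    and not far above the baseline cells' rate."""
    if canary_error_rate > absolute_ceiling:
        return False                                # hard ceiling violated
    baseline = max(baseline_error_rate, 0.05)       # floor avoids divide-by-zero
    return canary_error_rate / baseline <= max_ratio
```

The same comparison would gate each rollout phase: fail the pipeline (and roll back the canary cell) as soon as the check returns False.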

Real examples

Slack’s cell-based architecture

Slack began transitioning to a workspace-based cell architecture at scale in 2022. The partition key is workspace_id, and all channels, messages, and files within a workspace live in the same cell. Before the transition, a February 2022 outage affected all users; afterwards, a failure in an individual cell affects only that cell's users (roughly 5-8% of the total).

Slack's key design decisions include:

  • Vitess-based MySQL sharding configured on a cell basis
  • Cell router is designed as a Thin Layer containing only minimal logic
  • Dedicated Cell assigned to large enterprise customers (e.g. IBM, Amazon)
  • If inter-cell communication is required (cross-workspace search, etc.), use Asynchronous Message Bus

DoorDash’s Cell Architecture

DoorDash introduced a region-based cell architecture in 2023. The partition key is geographic_region, which partitions the United States into several geographic cells. Through this, traffic surges in certain areas (such as specific cities during Super Bowl season) are isolated so that they do not affect other areas.

  • Use DynamoDB as cell-level data storage
  • Apache Kafka clusters are also separated by cell
  • Control feature rollout via cell-level Feature Flag
  • Establish automation to drain traffic from that cell to an adjacent cell in the event of a failure

Salesforce’s Pod Architecture

Salesforce is one of the pioneers of cell-based architecture. In Salesforce, cells are called Pods, and each Pod contains a complete Salesforce instance. Dozens of Pods are distributed around the world, and customers (tenants) are pinned to specific Pods.

  • Per-tenant routing via instance URLs (e.g. na1.salesforce.com, eu5.salesforce.com)
  • Self-developed Org Migration tooling for moving data between Pods
  • Maintenance windows and release cycles operated independently per Pod

Data partitioning

The hardest problem in cell-based architecture is data partitioning: cross-cell query requirements must be met while preserving cell independence.

Partitioning strategy

1. Cell-Local Data

This is data accessed only within a cell and stored only in that cell's database: for example, a user's messages, order history, and session information.

2. Global Reference Data

This is read-only data that is equally required in all cells. This includes exchange rate information, product catalogs, country codes, etc. This data is asynchronously replicated from the global data store to each cell.

3. Cross-Cell Aggregate Data

This covers cases where data from all cells must be aggregated, such as system-wide statistics and global dashboards. Metrics aggregated in each cell are exported to a central analytics platform (e.g. Snowflake, BigQuery) for processing.

Data Migration

Data migration is required when rebalancing cells. The core principles are:

  • Dual Write: Write to both source and target cells during the migration period.
  • Gradual transition: Switch read to target cell first, then switch write.
  • Rollback possibility: Keep the original cell's data for at least 48 hours.
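The dual-write and gradual-transition principles above can be sketched as follows, assuming hypothetical per-cell store clients; a production migrator would also need retries, idempotency keys, and checksum verification.

```python
# Dual-write migration sketch: write to both cells, read from the source until
# the read path is flipped, then (eventually) retire the source writes.
from typing import Optional

class CellClient:
    """Stand-in for a per-cell data store client (hypothetical)."""
    def __init__(self) -> None:
        self.store: dict[str, str] = {}
    def put(self, key: str, value: str) -> None:
        self.store[key] = value
    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)

class DualWriteMigrator:
    def __init__(self, source: CellClient, target: CellClient):
        self.source, self.target = source, target
        self.reads_from_target = False

    def write(self, key: str, value: str) -> None:
        self.source.put(key, value)   # source stays authoritative
        self.target.put(key, value)   # shadow write to the new cell

    def read(self, key: str) -> Optional[str]:
        cell = self.target if self.reads_from_target else self.source
        return cell.get(key)

    def flip_reads(self) -> None:
        """Gradual-transition step: switch reads first, keep dual writes."""
        self.reads_from_target = True
```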

Monitoring and Observability

In a cell-based architecture, both cell-level metrics and global aggregate metrics are required.

Cell health check script

#!/bin/bash
# cell-health-check.sh
# Health check script that inspects the status of every cell

set -euo pipefail

CELL_ROUTER_URL="${CELL_ROUTER_URL:-http://cell-router.internal:8080}"
ALERT_WEBHOOK="${ALERT_WEBHOOK:-}"
HEALTH_THRESHOLD=3  # consecutive-failure threshold

declare -A FAILURE_COUNT

check_cell_health() {
    local cell_id="$1"
    local cell_endpoint="$2"

    # Application health check
    local http_code
    http_code=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 5 --max-time 10 \
        "${cell_endpoint}/health/ready" 2>/dev/null || echo "000")

    if [[ "$http_code" != "200" ]]; then
        FAILURE_COUNT[$cell_id]=$(( ${FAILURE_COUNT[$cell_id]:-0} + 1 ))
        echo "[WARN] ${cell_id}: health check failed (HTTP ${http_code}, consecutive: ${FAILURE_COUNT[$cell_id]})"

        if [[ ${FAILURE_COUNT[$cell_id]} -ge $HEALTH_THRESHOLD ]]; then
            echo "[CRITICAL] ${cell_id}: ${HEALTH_THRESHOLD} consecutive failures - triggering alert"
            send_alert "$cell_id" "$http_code"
        fi
        return 1
    fi

    # Reset the counter on success
    FAILURE_COUNT[$cell_id]=0

    # Collect cell metrics
    local metrics
    metrics=$(curl -s --max-time 5 "${cell_endpoint}/metrics/cell" 2>/dev/null || echo "{}")
    local active_connections error_rate p99_latency
    active_connections=$(echo "$metrics" | jq -r '.active_connections // "N/A"')
    error_rate=$(echo "$metrics" | jq -r '.error_rate_percent // "N/A"')
    p99_latency=$(echo "$metrics" | jq -r '.p99_latency_ms // "N/A"')

    echo "[OK] ${cell_id}: connections=${active_connections}, errors=${error_rate}%, p99=${p99_latency}ms"
    return 0
}

send_alert() {
    local cell_id="$1"
    local http_code="$2"
    local timestamp
    timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    if [[ -n "$ALERT_WEBHOOK" ]]; then
        curl -s -X POST "$ALERT_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d "{
                \"severity\": \"critical\",
                \"cell_id\": \"${cell_id}\",
                \"message\": \"Cell ${cell_id} health check failed (HTTP ${http_code})\",
                \"timestamp\": \"${timestamp}\",
                \"action\": \"investigate_and_consider_drain\"
            }"
    fi
}

# Fetch the cell list and check each cell
echo "=== Cell Health Check: $(date -u +"%Y-%m-%d %H:%M:%S UTC") ==="

CELLS=$(curl -s "${CELL_ROUTER_URL}/cells" | jq -r '.[] | "\(.cell_id)|\(.endpoint)"')
TOTAL=0
HEALTHY=0
UNHEALTHY=0

while IFS='|' read -r cell_id endpoint; do
    if check_cell_health "$cell_id" "$endpoint"; then
        HEALTHY=$((HEALTHY + 1))
    else
        UNHEALTHY=$((UNHEALTHY + 1))
    fi
    TOTAL=$((TOTAL + 1))
done <<< "$CELLS"

echo ""
echo "=== Summary: Total=${TOTAL}, Healthy=${HEALTHY}, Unhealthy=${UNHEALTHY} ==="

# Send a global alert if 50% or more of cells are unhealthy
if [[ $TOTAL -gt 0 ]] && [[ $((UNHEALTHY * 100 / TOTAL)) -ge 50 ]]; then
    echo "[GLOBAL CRITICAL] More than 50% cells unhealthy - possible global issue"
fi

Key monitoring metrics

The metrics that must be monitored in a cell-based architecture are:

Cell level metrics:

  • Request throughput (RPS) and error rate per cell
  • P50/P95/P99 latency per cell
  • DB connection pool utilization per cell
  • CPU/memory utilization per cell
  • Queue depth per cell

Global level metrics:

  • Request throughput and latency of the cell router
  • Traffic imbalance ratio between cells
  • Number of unhealthy cells
  • Global error rate (aggregated across all cells)
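The traffic imbalance ratio listed above can be computed as peak cell load over mean cell load; a value near 1.0 means an even distribution. This small helper is an illustrative sketch, not a standard metric definition.

```python
# Inter-cell traffic imbalance: max cell RPS divided by mean cell RPS.
# 1.0 = perfectly even; values well above 1.0 suggest a hotspot cell.

def imbalance_ratio(rps_by_cell: dict[str, float]) -> float:
    if not rps_by_cell:
        raise ValueError("no cells")
    values = list(rps_by_cell.values())
    mean = sum(values) / len(values)
    if mean == 0:
        return 1.0                 # no traffic anywhere counts as balanced
    return max(values) / mean
```

An alerting rule might, for example, page when the ratio stays above 1.5 for several minutes, since that often indicates a large tenant landing in a regular cell.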

Routing rule settings

# cell-routing-rules.yaml
# Cell routing rule definitions

apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-routing-config
  namespace: cell-router
data:
  routing-rules.yaml: |
    version: "2.0"
    default_strategy: "consistent-hashing"
    partition_key: "X-Tenant-Id"

    cells:
      - id: "cell-001"
        endpoint: "https://cell-001.internal.example.com"
        region: "us-east-1"
        status: "active"
        capacity: 10000
        weight: 100

      - id: "cell-002"
        endpoint: "https://cell-002.internal.example.com"
        region: "us-east-1"
        status: "active"
        capacity: 10000
        weight: 100

      - id: "cell-003"
        endpoint: "https://cell-003.internal.example.com"
        region: "us-west-2"
        status: "active"
        capacity: 10000
        weight: 100

      - id: "cell-004"
        endpoint: "https://cell-004.internal.example.com"
        region: "eu-west-1"
        status: "draining"
        capacity: 10000
        weight: 0
        drain_target: "cell-003"

    # Dedicated cells (large tenants)
    dedicated_cells:
      - tenant_id: "tenant-enterprise-001"
        cell_id: "cell-dedicated-001"
        reason: "SLA 요구사항 - 99.99% 가용성"

      - tenant_id: "tenant-enterprise-002"
        cell_id: "cell-dedicated-002"
        reason: "데이터 레지던시 - EU GDPR 준수"

    # Traffic failover rules
    failover_rules:
      health_check_interval_seconds: 10
      consecutive_failures_threshold: 3
      failover_strategy: "nearest-healthy-cell"
      auto_failback: true
      failback_delay_minutes: 15

    # Cell capacity limits
    capacity_management:
      overflow_strategy: "reject-with-retry-after"
      overflow_http_status: 503
      retry_after_seconds: 30
      capacity_warning_threshold_percent: 80
      capacity_critical_threshold_percent: 95

Troubleshooting

Problem 1: Cell-to-cell traffic imbalance

Symptom: CPU utilization of a specific cell exceeds 90%, but other cells remain at 30%.

Cause: A large tenant is placed in a regular cell, causing excessive load on that cell.

Solution: Identify the large tenants and migrate them to dedicated cells: map each tenant to its dedicated cell in the routing table, then switch traffic over once data migration completes.

Problem 2: Failure of the cell router itself

Symptom: The cell router becomes unresponsive and the entire system is paralyzed.

Cause: The cell router becomes a SPOF (Single Point of Failure).

Solution: Run multiple cell router replicas and ensure the router's own availability through DNS-based load balancing. Keep the router's logic minimal to reduce its failure probability. Cache cell assignments on the client so that clients can still reach their existing cells directly even if the router fails.
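The client-side caching mitigation above might look like the following sketch; the router lookup callable, the TTL value, and the class name are assumptions for illustration.

```python
# Client-side cache of cell assignments: serve fresh entries from cache, and
# fall back to a stale entry when the cell router is unreachable.
import time

class CachedCellLookup:
    def __init__(self, router_lookup, ttl_seconds: float = 300.0):
        self._lookup = router_lookup          # callable: tenant_id -> cell_id
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[str, float]] = {}

    def resolve(self, tenant_id: str) -> str:
        now = time.monotonic()
        cached = self._cache.get(tenant_id)
        if cached and now - cached[1] < self._ttl:
            return cached[0]                  # fresh cache hit
        try:
            cell = self._lookup(tenant_id)    # ask the router
        except Exception:
            if cached:                        # router down: serve stale entry
                return cached[0]
            raise
        self._cache[tenant_id] = (cell, now)
        return cell
```

Serving stale entries is safe precisely because routing is deterministic: a tenant's cell only changes during an explicit migration, so a stale mapping almost always still points at the right cell.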

Problem 3: Cross-cell data query

Symptom: When viewing overall user statistics on the administrator dashboard, response time slows down to tens of seconds.

Cause: Queries are sent to every cell sequentially and the results are merged.

Solution: Periodically export aggregated data from each cell to a central analytical (OLAP) database and serve the administrator dashboard from that central store. If near-real-time freshness is required, build a streaming aggregation pipeline with Change Data Capture (CDC).
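Even before a central OLAP store exists, the sequential scatter-gather can be made tolerable by fanning out to all cells in parallel and accepting partial results. A sketch, where `query_cell` is a placeholder for whatever per-cell count query the dashboard issues:

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate_user_counts(cells, query_cell, timeout=2.0):
    """Fan the count query out to all cells in parallel instead of sequentially.

    query_cell: callable cell_id -> int. Cells that fail or time out are
    skipped and reported, so one slow cell cannot stall the whole dashboard.
    """
    total, failed = 0, []
    with ThreadPoolExecutor(max_workers=len(cells)) as pool:
        futures = {pool.submit(query_cell, cell): cell for cell in cells}
        for fut, cell in futures.items():
            try:
                total += fut.result(timeout=timeout)
            except Exception:
                failed.append(cell)
    return total, failed
```

Note that this only caps latency at the slowest healthy cell; the central aggregation described above remains the better long-term answer because it removes cross-cell fan-out from the request path entirely.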

Problem 4: Data inconsistency during cell-to-cell data migration

Symptom: Some users' data is missing or duplicated after cell relocation.

Cause: A synchronization mismatch occurs between the source and destination cells during the double write period.

Solution: Verify data checksums before and after migration. If the system is event-sourced, the target cell's state can be rebuilt exactly through event replay. Preserve the source cell's data for at least 48 hours so the migration can be rolled back.
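One way to implement the checksum verification, sketched with the standard library. The order-independent XOR-of-digests scheme is an assumption for illustration; it lets the source and target be compared without requiring identical row ordering:

```python
import hashlib

def table_checksum(rows):
    """Order-independent checksum over (row_id, payload) pairs.

    XORs a 64-bit prefix of each row's SHA-256 digest, so the result is the
    same regardless of row order but changes if any row is missing,
    duplicated an odd number of times, or altered.
    """
    acc = 0
    for row_id, payload in rows:
        digest = hashlib.sha256(f"{row_id}:{payload}".encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return acc

def verify_migration(source_rows, target_rows):
    """Cheap post-migration consistency gate before source data is deleted."""
    return table_checksum(source_rows) == table_checksum(target_rows)
```

A caveat of XOR folding is that a row duplicated an even number of times cancels out, so pairing the checksum with a simple row-count comparison is prudent.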

Precautions during operation

Cell routers must be a thin layer

The cell router is the only shared component in the system. Adding complex business logic to it nullifies the benefits of cell-level isolation. A cell router should perform only three functions: extract the partition key, compute its hash, and map it to a cell.
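Those three functions fit in a few dozen lines. A minimal consistent-hashing sketch (virtual-node ring, MD5 used only as a fast non-cryptographic mixer); class and parameter names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRouter:
    """Thin cell router: partition key -> hash -> position on a hash ring.

    Each cell contributes `vnodes` points to smooth the distribution;
    adding one cell to N remaps only roughly 1/(N+1) of the keys.
    """

    def __init__(self, cells, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{cell}#{v}"), cell)
            for cell in cells
            for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, partition_key: str) -> str:
        # First ring point clockwise of the key's hash (wrapping around).
        idx = bisect.bisect(self._points, self._hash(partition_key)) % len(self._ring)
        return self._ring[idx][1]
```

Keeping the router this small is what makes the "thin layer" rule enforceable: anything beyond key extraction, hashing, and mapping belongs inside the cells.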

Choosing the right cell size

If a cell is too large, the blast radius of a failure grows; if it is too small, operational complexity and cost increase. AWS's guidance suggests sizing each cell for roughly 5-10% of total users, which works out to 10-20 cells.

Data residency in global deployment

To comply with data regulations such as the GDPR and the Personal Information Protection Act, a user's data in a given region must be processed only by cells in that region. Cell routing must therefore take regional constraints into account along with the partition key.
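One common approach is to filter the candidate cell set by region first and only then apply hash-based assignment. A sketch under assumed names (`CELLS_BY_REGION` and the cell IDs are illustrative):

```python
import hashlib

# Illustrative region -> cells mapping; a real deployment would load this
# from the routing configuration.
CELLS_BY_REGION = {
    "eu": ["cell-eu-001", "cell-eu-002"],
    "us": ["cell-us-001", "cell-us-002", "cell-us-003"],
}

def route_with_residency(partition_key: str, user_region: str) -> str:
    """Restrict the candidate cell set to the user's region before hashing,
    so a user's data can never be assigned to a cell outside that region."""
    candidates = CELLS_BY_REGION[user_region]  # KeyError -> unsupported region
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return candidates[int(digest, 16) % len(candidates)]
```

Filtering before hashing keeps the residency guarantee structural: even a routing bug elsewhere cannot place an EU user's data in a non-EU cell, because non-EU cells are never in the candidate list.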

Testing Strategy

Cell-level chaos engineering is essential in a cell-based architecture. Use AWS Fault Injection Service (FIS) or Chaos Mesh to simulate failures in individual cells and regularly verify that the other cells are unaffected. Run cell-level fault-injection drills at least once a quarter.

Failure cases and recovery

Case 1: Total failure due to cell router configuration error

At one company, a misconfigured cell routing table was deployed during an update, routing all users to a single cell. That cell exceeded its capacity and began returning 503 errors while the other cells sat idle.

Lesson: Apply canary deployment even to routing-configuration changes. Routing-table changes must go through an automated test pipeline that verifies the resulting cell traffic distribution.

Case 2: Hidden dependencies between cells

We thought we had achieved complete isolation between cells, but when an external API (a payment gateway) shared by every cell failed, all cells were affected simultaneously.

Lesson: Identify shared dependencies outside of the cell, isolate them by cell if possible, or at least apply Circuit Breaker and Fallback. When reviewing architecture, you should create a Dependency Map to find hidden shared points.

Case 3: Data migration failure

During a cell relocation, double writing was interrupted by a network failure, leaving the source and target cells with mismatched data. A rollback was attempted, but some data had already been cleaned up from the source cell.

Lesson: Migrations must be designed to be idempotent, and source-cell data must be deleted only after a sufficient stabilization period (at least 48 hours) following migration completion.

Checklist

The following items should be reviewed before adopting a cell-based architecture.

Design Stage:

  • Partition key selection completed (tenant, user, region, etc.)
  • Determine cell size (recommendation: 5-10% of users per cell)
  • Cell router design (Thin Layer principle)
  • Definition of cross-cell data access pattern
  • Establishment of global reference data replication strategy
  • Check data residency requirements

Implementation Phase:

  • Independent infrastructure provisioning for each cell (VPC, DB, Cache, Queue)
  • Implement cell-to-cell isolation with NetworkPolicy or security groups
  • Consistent Hashing based router implementation
  • Building a cell-level canary deployment pipeline
  • Implementation of health check endpoint for each cell
  • Cell router redundancy

Operation Phase:

  • Building a cell-level monitoring dashboard
  • Setting cell-level notification rules
  • Automated traffic drain in case of cell failure
  • Preparation of cell-to-cell data migration tools
  • Quarterly cell failure injection training plan
  • Cell capacity monitoring and auto-expansion settings
  • Establishment of global metric aggregation pipeline
  • Automate verification of routing settings changes

References