Cilium ClusterMesh: Multi-Cluster Networking Internal Implementation

Overview

Cilium ClusterMesh is a multi-cluster solution that connects multiple Kubernetes clusters into a unified network. It provides cross-cluster service discovery, load balancing, and network policy enforcement while maintaining the independence of each cluster.

1. ClusterMesh Architecture

1.1 Core Components

Cluster A                              Cluster B
+---------------------------+          +---------------------------+
| Cilium Agent (per node)   |          | Cilium Agent (per node)   |
|   - Local endpoint mgmt   |          |   - Local endpoint mgmt   |
|   - Remote cluster watch  |          |   - Remote cluster watch  |
+---------------------------+          +---------------------------+
         |                                       |
         v                                       v
+---------------------------+          +---------------------------+
| clustermesh-apiserver     |          | clustermesh-apiserver     |
|   - Embedded etcd         |   <--->  |   - Embedded etcd         |
|   - Externally accessible |          |   - Externally accessible |
+---------------------------+          +---------------------------+
         |                                       |
         v                                       v
+---------------------------+          +---------------------------+
| Internal etcd (k8s state) |          | Internal etcd (k8s state) |
+---------------------------+          +---------------------------+

1.2 clustermesh-apiserver

The clustermesh-apiserver is a component that runs in each cluster, exposing the cluster's state to other clusters:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: clustermesh-apiserver
  namespace: kube-system
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: apiserver
          image: quay.io/cilium/clustermesh-apiserver:v1.16.0
          ports:
            - containerPort: 2379
              name: etcd
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.11
          args:
            - --data-dir=/var/run/etcd
            - --listen-client-urls=https://0.0.0.0:2379

1.3 Data Synchronization Flow

State change in Cluster A (e.g., new Pod created)
    |
    v
Cilium Agent (Cluster A) -> Update CiliumEndpoint CRD
    |
    v
clustermesh-apiserver (Cluster A) -> Store state in embedded etcd
    |
    v
Cilium Agent (Cluster B) -> Watch Cluster A's etcd
    |
    v
Update Cluster B's ipcache and service maps

2. Cross-Cluster Service Discovery

2.1 Global Services

A global service is one deployed with the same name and namespace in multiple clusters and annotated as global:

apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
  annotations:
    io.cilium/global-service: 'true'
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080

2.2 Global Service Operation

Global service "api-service" (production namespace)

Cluster A backends: Pod-A1 (10.244.1.5), Pod-A2 (10.244.1.6)
Cluster B backends: Pod-B1 (10.245.1.5), Pod-B2 (10.245.1.6)

Cluster A BPF service map:
  api-service:80 -> [Pod-A1, Pod-A2, Pod-B1, Pod-B2]

Cluster B BPF service map:
  api-service:80 -> [Pod-A1, Pod-A2, Pod-B1, Pod-B2]

Load balancing across the same backend pool from all clusters

2.3 Service Affinity

apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
  annotations:
    io.cilium/global-service: 'true'
    io.cilium/service-affinity: 'local'
spec:
  selector:
    app: api
  ports:
    - port: 80

Service affinity options:

Value     Behavior
-------   -------------------------------------------------
default   Evenly distribute across all cluster backends
local     Prefer local cluster, fall back to remote if none
remote    Prefer remote cluster, fall back to local if none
none      No affinity (same as default)

3. Cross-Cluster Network Policies

3.1 Identity Synchronization

In ClusterMesh, Identities from each cluster are synchronized:

Identity synchronization flow:

Cluster A: Pod created (app=frontend)
  -> Identity assigned: 48291 (Cluster A scope)
  -> CiliumIdentity CRD created
  -> Shared via clustermesh-apiserver

Cluster B: Remote Identity received
  -> Add remote Identity to local ipcache
  -> Remote Identity can be referenced in policy maps

3.2 Cross-Cluster Policy Example

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-cross-cluster
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
            io.cilium.k8s.policy.cluster: cluster-a

3.3 Cluster Identification

Cluster identification label:
  io.cilium.k8s.policy.cluster: <cluster-name>

Each cluster's Identity includes the cluster name:
  k8s:app=frontend
  k8s:io.kubernetes.pod.namespace=production
  k8s:io.cilium.k8s.policy.cluster=cluster-a

This allows policies to select workloads from specific clusters

4. ClusterMesh Connection Setup

4.1 Prerequisites

1. Unique cluster ID (1-255)
   - Set different cluster-id for each cluster

2. Unique cluster name

3. Non-overlapping Pod CIDRs
   - Cluster A: 10.244.0.0/16
   - Cluster B: 10.245.0.0/16

4. Network connectivity
   - Pod networks must be mutually reachable
   - Tunnel or direct routing required

4.2 Setup Steps

# Step 1: Enable ClusterMesh on each cluster
cilium clustermesh enable --service-type LoadBalancer

# Step 2: Connect clusters
cilium clustermesh connect --destination-context ctx-cluster-b

# Step 3: Verify status
cilium clustermesh status

# Example output:
# Cluster Connections:
#   cluster-b:
#     connected: true
#     endpoints: 24
#     identities: 42
#     services: 8

5. KVStoreMesh: Scalability Enhancement

5.1 KVStoreMesh Architecture

In large ClusterMesh environments, having every Agent connect directly to every remote cluster's etcd multiplies connections and load. KVStoreMesh solves this:

Without KVStoreMesh:
  All Agents in Cluster A -> Cluster B etcd (direct)
  N nodes x M clusters = N*M connections

With KVStoreMesh:
  Cluster A KVStoreMesh -> Cluster B etcd (single connection)
  All Agents in Cluster A -> Local KVStoreMesh cache
  Dramatically fewer connections

5.2 KVStoreMesh Operation

Data flow from remote Cluster B:

Cluster B: clustermesh-apiserver (etcd)
    |
    v (single connection)
Cluster A: KVStoreMesh
    |
    v (replicate data to local cache)
Cluster A: Local etcd or CRD
    |
    v
Cluster A: Cilium Agent (each node)
    - Read from local data source
    - No direct remote etcd connection needed

6. External Workloads

6.1 Overview

The external workloads feature lets you install the Cilium Agent on VMs or bare-metal servers and join them to the Kubernetes cluster's network:

Kubernetes Cluster
+---------------------------+
| Pod A (10.244.1.5)        |
| Pod B (10.244.1.6)        |
| Cilium Agent (per node)   |
+---------------------------+
         |
         v (ClusterMesh connection)
+----------------------------+
| External VM (192.168.1.100)|
| Cilium Agent installed     |
| - Same policies applied    |
| - Same Identity assigned   |
| - Service access available |
+----------------------------+

6.2 External Workload Setup

# Step 1: Enable external workload support on cluster
cilium clustermesh vm create my-vm --ipv4-alloc-cidr 10.192.1.0/24

# Step 2: Generate install script for VM
cilium clustermesh vm install install-external-workload.sh

# Step 3: Execute script on VM
# The script:
# - Installs Cilium Agent
# - Configures cluster connection
# - Sets up certificates
# - Starts Agent

6.3 Identity for External Workloads

The same Identity mechanism used for Kubernetes Pods applies to external VMs:

VM labels:
  app: legacy-app
  env: production

Identity assigned: 59102

Policy enforcement:
  - Control communication from cluster Pods to VM
  - Control communication from VM to cluster Pods
  - Same policy model based on Identity

7. Failure Handling and High Availability

7.1 Behavior During Cluster Failure

Scenario: Cluster B goes completely down

Cluster A behavior:
1. Detect clustermesh-apiserver connection loss
2. Remove Cluster B backends from service maps
3. Route new connections to Cluster A backends only
4. Existing connections cleaned up after timeout

With service affinity "local":
  - Cluster B failure has no impact on Cluster A
  - Only Cluster A backends were being used

7.2 Network Partition Response

During network partition:
1. Remote cluster connection timeout
2. Mark remote backends as "unreachable"
3. Prefer local backends
4. Automatic reconnection and state sync on network recovery

8. Monitoring and Troubleshooting

8.1 Status Commands

# ClusterMesh connection status
cilium clustermesh status

# Endpoints synced from remote clusters
cilium endpoint list --selector "reserved:remote-node"

# Global services
cilium service list

# Remote Identities
cilium identity list | grep "cluster-b"

8.2 Debugging Commands

# ClusterMesh related logs
cilium-agent --debug

# Remote cluster connection state
cilium status --verbose | grep -A 10 "ClusterMesh"

# Cross-cluster service backends
cilium bpf lb list | grep "global"

# Remote ipcache entries
cilium bpf ipcache list | grep "cluster-b"

8.3 Common Issues

Issue: Clusters cannot connect
Checks:
1. Verify cluster-id is unique
2. Verify Pod CIDRs do not overlap
3. Verify clustermesh-apiserver is externally accessible
4. Verify TLS certificates are correct

Issue: Global service missing remote backends
Checks:
1. Same service name/namespace on both clusters
2. io.cilium/global-service annotation present
3. Check backend list with cilium service list

Issue: Cross-cluster policy not applied
Checks:
1. Verify Identities are syncing correctly
2. Verify cluster name label used correctly in policy
3. Check remote Identities with cilium identity list

Summary

Cilium ClusterMesh provides multi-cluster networking through these core principles:

  • Distributed Architecture: Each cluster operates independently with no central control plane
  • Identity Synchronization: Shared security Identities across clusters for consistent policy enforcement
  • Global Services: Service discovery and load balancing across multiple clusters
  • KVStoreMesh: Scalability optimization for large environments
  • External Workloads: Integration of VMs/bare-metal servers into the Kubernetes network
  • Failure Isolation: Failures are isolated between clusters, maintaining overall system stability