Split View: etcd 아키텍처 내부 분석: Raft 합의 알고리즘

etcd 아키텍처 내부 분석: Raft 합의 알고리즘

etcd는 분산 시스템에서 핵심적인 역할을 하는 강력한 일관성(strongly consistent)을 보장하는 분산 키-값 저장소입니다. Kubernetes의 모든 클러스터 상태가 etcd에 저장되며, 그 신뢰성은 Raft 합의 알고리즘에 기반합니다.

1. Raft 합의 알고리즘 개요

Raft는 이해하기 쉽도록 설계된 합의 알고리즘으로, Paxos와 동등한 안전성을 제공하면서도 구현이 훨씬 단순합니다.

1.1 핵심 개념

Raft의 핵심은 세 가지 하위 문제로 분해됩니다:

리더 선출(Leader Election): 클러스터에서 하나의 리더를 선출
로그 복제(Log Replication): 리더가 로그 엔트리를 팔로워에게 복제
안전성(Safety): 커밋된 엔트리가 손실되지 않음을 보장

1.2 노드 상태

Raft 클러스터의 각 노드는 세 가지 상태 중 하나입니다:

Leader: 클라이언트 요청을 처리하고 로그를 복제
Follower: 리더의 요청에 응답하는 수동적 역할
Candidate: 리더 선출에 참여하는 중간 상태

1.3 Term(임기)

Raft는 시간을 Term이라는 논리적 단위로 나눕니다. 각 Term은 선출로 시작되며, 선출에 성공하면 해당 Term 동안 하나의 리더가 존재합니다. Term은 단조 증가하며, 오래된 Term의 메시지는 무시됩니다.

2. 리더 선출(Leader Election) 상세

2.1 선출 과정

Follower가 일정 시간(election timeout) 동안 리더로부터 heartbeat를 받지 못하면 Candidate로 전환
Candidate는 현재 Term을 증가시키고 자신에게 투표한 뒤 다른 노드에 RequestVote RPC를 전송
과반수(majority) 투표를 받으면 리더로 선출
리더가 되면 즉시 빈 heartbeat를 전송하여 권한을 확립

2.2 선출 타임아웃

선출 타임아웃은 랜덤화되어 있어 동시에 여러 노드가 Candidate가 되는 것을 방지합니다. etcd에서는 기본적으로 1000~1500ms 범위입니다.

2.3 Pre-Vote 메커니즘

etcd는 네트워크 파티션 복구 후 불필요한 리더 교체를 방지하기 위해 Pre-Vote를 구현합니다. Candidate가 되기 전에 먼저 Pre-Vote를 요청하여 과반수가 동의해야만 실제 선출을 시작합니다.

// etcd의 raft 상태 전환 (간략화)
// Follower -> PreCandidate -> Candidate -> Leader

func (r *raft) tickElection() {
    r.electionElapsed++
    if r.electionElapsed >= r.randomizedElectionTimeout {
        r.electionElapsed = 0
        r.Step(pb.Message{Type: pb.MsgHup})
    }
}

2.4 Learner 노드

etcd 3.4부터 Learner 노드가 도입되었습니다. Learner는 투표권이 없는 비투표 멤버로, 로그 복제만 받습니다. 새 멤버가 기존 로그를 따라잡을 때까지 Learner로 유지하면 클러스터 가용성에 영향을 주지 않습니다.

3. 로그 복제(Log Replication)

3.1 로그 엔트리 구조

각 로그 엔트리는 다음을 포함합니다:

Index: 로그 내 위치(단조 증가)
Term: 해당 엔트리가 생성된 Term
Data: 상태 머신에 적용할 명령(예: Put key=value)

3.2 복제 과정

클라이언트가 리더에 요청 전송
리더가 로그 엔트리를 자신의 로그에 추가
리더가 AppendEntries RPC로 팔로워에게 복제
과반수 팔로워가 확인(acknowledge)하면 커밋
커밋된 엔트리를 상태 머신에 적용(apply)
클라이언트에 응답 반환

3.3 로그 일관성 보장

Raft는 두 가지 속성으로 로그 일관성을 보장합니다:

Log Matching: 같은 index와 term을 가진 두 엔트리는 동일한 명령을 저장
Leader Completeness: 커밋된 엔트리는 이후 모든 리더의 로그에 존재

Leader Log:  [1:1][2:1][3:2][4:2][5:3]
                                  ^-- committed up to here

Follower A:  [1:1][2:1][3:2][4:2][5:3]  (up to date)
Follower B:  [1:1][2:1][3:2]            (lagging, will catch up)

4. etcd 서버 아키텍처

4.1 EtcdServer 구조

EtcdServer는 etcd의 핵심 구조체로, 다음 컴포넌트를 조율합니다:

RaftNode: Raft 합의 로직 처리
Backend: BoltDB 기반 저장소
KV Store: MVCC 키-값 저장소
Lessor: Lease 관리
Watchable Store: Watch 메커니즘

4.2 요청 처리 흐름

클라이언트의 Put 요청이 처리되는 과정:

gRPC 서버가 요청 수신
인증 및 권한 확인
요청을 Raft에 제안(propose)
Raft 합의 후 커밋
Apply 루프에서 커밋된 엔트리를 상태 머신에 적용
백엔드(BoltDB)에 영속화
클라이언트에 응답 반환

4.3 Apply Loop

Apply 루프는 etcd의 핵심 루프로, Raft에서 커밋된 엔트리를 순서대로 상태 머신에 적용합니다:

// Apply 루프의 간략화된 구조
func (s *EtcdServer) applyAll(ep *etcdProgress, apply *apply) {
    s.applySnapshot(ep, apply)
    s.applyEntries(ep, apply)
    // 적용이 완료된 후 인덱스 업데이트
    s.setAppliedIndex(apply.snapshot.Metadata.Index)
}

4.4 Linearizable Read

etcd는 기본적으로 Linearizable Read를 지원합니다. 읽기 요청 시 리더가 여전히 현재 리더인지 확인하기 위해 과반수에게 heartbeat를 전송합니다. 이로써 stale read를 방지합니다.

Serializable Read 옵션을 사용하면 리더 확인 없이 로컬에서 읽어 성능은 좋지만 최신 데이터가 아닐 수 있습니다.

5. 키-값 저장소: BoltDB와 MVCC

5.1 BoltDB(bbolt) 백엔드

etcd는 BoltDB(bbolt 포크)를 백엔드 저장소로 사용합니다. BoltDB는 B+ 트리 기반의 임베디드 키-값 데이터베이스로:

읽기 최적화된 B+ 트리 구조
페이지 단위 저장(4KB 기본)
ACID 트랜잭션 지원
단일 writer, 다중 reader 모델

5.2 MVCC(Multi-Version Concurrency Control)

etcd는 MVCC를 사용하여 키의 모든 버전을 유지합니다:

Revision: 전역 단조 증가 카운터. 모든 트랜잭션마다 증가
Key Index: 키 이름에서 revision 목록으로의 매핑
Generation: 키의 생성부터 삭제까지의 생명주기

Key: "/app/config"
  Generation 1: [create:rev5] [modify:rev8] [modify:rev12] [tombstone:rev15]
  Generation 2: [create:rev20] [modify:rev25]
                                              ^-- current

5.3 BoltDB 내부 버킷 구조

etcd는 BoltDB에 두 개의 주요 버킷을 유지합니다:

key 버킷: revision을 키로, 실제 키-값 데이터를 값으로 저장
meta 버킷: 메타데이터(consistent index, scheduled compact revision 등)

6. Watch 메커니즘

6.1 Watchable Store

etcd의 Watch는 키 또는 키 범위의 변경 사항을 실시간으로 스트리밍합니다:

synced watchers: 현재 revision까지 따라잡은 watcher. 새 이벤트를 즉시 수신
unsynced watchers: 아직 따라잡지 못한 watcher. 히스토리에서 이벤트를 재생

6.2 이벤트 생성

MVCC 저장소에 쓰기 작업이 발생하면:

트랜잭션이 커밋될 때 이벤트 생성
watchable store가 이벤트를 해당 watcher들에게 전달
gRPC 스트림을 통해 클라이언트에 전송

6.3 Watch 복구

클라이언트가 연결이 끊긴 후 재연결할 때, 마지막으로 수신한 revision부터 Watch를 재개할 수 있습니다. 만약 해당 revision이 이미 compaction되었다면 에러가 반환되며, 클라이언트는 전체 데이터를 다시 로드해야 합니다.

7. 네트워크 및 gRPC 통신

7.1 Peer 간 통신

etcd 클러스터 멤버 간 통신은 두 가지 채널을 사용합니다:

Stream: 장기 연결로 Raft 메시지(heartbeat, append entries 등)를 전송
Pipeline: 대용량 데이터(스냅샷 등) 전송

7.2 클라이언트 통신

클라이언트는 gRPC를 통해 etcd와 통신합니다. 주요 서비스:

KV: Put, Range, DeleteRange, Txn
Watch: Watch
Lease: LeaseGrant, LeaseRevoke, LeaseKeepAlive
Cluster: MemberAdd, MemberRemove, MemberList
Maintenance: Alarm, Status, Defragment, Snapshot

8. 정리

etcd의 아키텍처는 Raft 합의 알고리즘을 기반으로 강력한 일관성을 보장하면서, BoltDB의 효율적인 저장과 MVCC의 다중 버전 관리를 통해 높은 성능을 달성합니다. 다음 글에서는 BoltDB와 MVCC 스토리지 엔진의 내부 구조를 더 깊이 살펴보겠습니다.

etcd Architecture Internals: Raft Consensus Algorithm

etcd is a strongly consistent distributed key-value store that plays a critical role in distributed systems. All Kubernetes cluster state is stored in etcd, and its reliability is built on the Raft consensus algorithm.

1. Raft Consensus Algorithm Overview

Raft is a consensus algorithm designed for understandability, providing safety equivalent to Paxos while being significantly simpler to implement.

1.1 Core Concepts

Raft decomposes consensus into three sub-problems:

Leader Election: Electing a single leader in the cluster
Log Replication: Leader replicates log entries to followers
Safety: Ensures committed entries are never lost

1.2 Node States

Each node in a Raft cluster is in one of three states:

Leader: Handles client requests and replicates logs
Follower: Passive role, responds to leader requests
Candidate: Intermediate state participating in leader election

1.3 Terms

Raft divides time into logical units called Terms. Each Term begins with an election, and if successful, one leader exists for that Term. Terms are monotonically increasing, and messages from older Terms are ignored.

2. Leader Election Details

2.1 Election Process

A Follower transitions to Candidate after not receiving heartbeats for an election timeout period
The Candidate increments the current Term, votes for itself, and sends RequestVote RPCs to other nodes
If it receives majority votes, it becomes leader
Upon becoming leader, it immediately sends empty heartbeats to establish authority

2.2 Election Timeout

Election timeouts are randomized to prevent multiple nodes from becoming Candidates simultaneously. In etcd, the default range is 1000-1500ms.

2.3 Pre-Vote Mechanism

etcd implements Pre-Vote to prevent unnecessary leader changes after network partition recovery. Before becoming a Candidate, a node first requests Pre-Votes, and only begins the actual election if a majority agrees.

// Simplified raft state transition in etcd
// Follower -> PreCandidate -> Candidate -> Leader

func (r *raft) tickElection() {
    r.electionElapsed++
    if r.electionElapsed >= r.randomizedElectionTimeout {
        r.electionElapsed = 0
        r.Step(pb.Message{Type: pb.MsgHup})
    }
}

2.4 Learner Nodes

Since etcd 3.4, Learner nodes are supported. Learners are non-voting members that only receive log replication. Keeping new members as Learners until they catch up with existing logs avoids impacting cluster availability.

3. Log Replication

3.1 Log Entry Structure

Each log entry contains:

Index: Position in the log (monotonically increasing)
Term: The Term when the entry was created
Data: Command to apply to the state machine (e.g., Put key=value)

3.2 Replication Process

Client sends request to leader
Leader appends log entry to its log
Leader replicates to followers via AppendEntries RPC
Committed when majority of followers acknowledge
Committed entry is applied to the state machine
Response returned to client

3.3 Log Consistency Guarantees

Raft ensures log consistency through two properties:

Log Matching: Two entries with the same index and term store the same command
Leader Completeness: Committed entries exist in all subsequent leaders' logs

Leader Log:  [1:1][2:1][3:2][4:2][5:3]
                                  ^-- committed up to here

Follower A:  [1:1][2:1][3:2][4:2][5:3]  (up to date)
Follower B:  [1:1][2:1][3:2]            (lagging, will catch up)

4. etcd Server Architecture

4.1 EtcdServer Structure

EtcdServer is the core struct of etcd, coordinating the following components:

RaftNode: Handles Raft consensus logic
Backend: BoltDB-based storage
KV Store: MVCC key-value store
Lessor: Lease management
Watchable Store: Watch mechanism

4.2 Request Processing Flow

How a client Put request is processed:

gRPC server receives the request
Authentication and authorization check
Request proposed to Raft
Committed after Raft consensus
Apply loop applies committed entry to state machine
Persisted to backend (BoltDB)
Response returned to client

4.3 Apply Loop

The apply loop is etcd's core loop, applying Raft-committed entries to the state machine in order:

// Simplified apply loop structure
func (s *EtcdServer) applyAll(ep *etcdProgress, apply *apply) {
    s.applySnapshot(ep, apply)
    s.applyEntries(ep, apply)
    // Update index after application completes
    s.setAppliedIndex(apply.snapshot.Metadata.Index)
}

4.4 Linearizable Read

etcd supports Linearizable Read by default. For read requests, the leader sends heartbeats to a majority to confirm it is still the current leader, preventing stale reads.

Using the Serializable Read option reads locally without leader confirmation, offering better performance but potentially returning non-current data.

5. Key-Value Store: BoltDB and MVCC

5.1 BoltDB (bbolt) Backend

etcd uses BoltDB (bbolt fork) as its backend store. BoltDB is a B+ tree-based embedded key-value database featuring:

Read-optimized B+ tree structure
Page-based storage (4KB default)
ACID transaction support
Single writer, multiple reader model

5.2 MVCC (Multi-Version Concurrency Control)

etcd uses MVCC to maintain all versions of keys:

Revision: Global monotonically increasing counter, incremented per transaction
Key Index: Mapping from key name to revision list
Generation: Lifecycle of a key from creation to deletion

Key: "/app/config"
  Generation 1: [create:rev5] [modify:rev8] [modify:rev12] [tombstone:rev15]
  Generation 2: [create:rev20] [modify:rev25]
                                              ^-- current

5.3 BoltDB Internal Bucket Structure

etcd maintains two main buckets in BoltDB:

key bucket: Stores actual key-value data with revision as key
meta bucket: Stores metadata (consistent index, scheduled compact revision, etc.)

6. Watch Mechanism

6.1 Watchable Store

etcd's Watch streams changes to keys or key ranges in real-time:

synced watchers: Caught up to current revision, receiving new events immediately
unsynced watchers: Not yet caught up, replaying events from history

6.2 Event Generation

When a write operation occurs in the MVCC store:

Events are generated when the transaction commits
The watchable store delivers events to relevant watchers
Events are sent to clients via gRPC streams

6.3 Watch Recovery

When a client reconnects after disconnection, it can resume the Watch from the last received revision. If that revision has already been compacted, an error is returned and the client must reload all data.

7. Network and gRPC Communication

7.1 Peer Communication

Communication between etcd cluster members uses two channels:

Stream: Long-lived connections for Raft messages (heartbeats, append entries, etc.)
Pipeline: Large data transfers (snapshots, etc.)

7.2 Client Communication

Clients communicate with etcd via gRPC. Key services:

KV: Put, Range, DeleteRange, Txn
Watch: Watch
Lease: LeaseGrant, LeaseRevoke, LeaseKeepAlive
Cluster: MemberAdd, MemberRemove, MemberList
Maintenance: Alarm, Status, Defragment, Snapshot

8. Summary

etcd's architecture ensures strong consistency through the Raft consensus algorithm while achieving high performance through BoltDB's efficient storage and MVCC's multi-version management. In the next post, we will dive deeper into the BoltDB and MVCC storage engine internals.