Split View: Kafka와 Event-Driven Architecture 완전 해부 — Partition/ISR, Exactly-Once, Outbox, Schema Registry, Flink까지

Kafka와 Event-Driven Architecture 완전 해부 — Partition/ISR, Exactly-Once, Outbox, Schema Registry, Flink까지

들어가며 — "우리는 Kafka를 쓴다"의 공허함

2015년 이후 모든 컨퍼런스에서 들리는 문장:

"저희는 이벤트 드리븐으로 Kafka를 씁니다."

그런데 파고 물어보면:

"파티션 수는 왜 그렇게?" → "기본값으로..."
"Exactly-Once 보장되나요?" → "아마도..."
"Consumer가 메시지 잃으면?" → "...?"

Kafka는 쓰기 쉽고 운영하기 어려운 대표적 시스템이다. 이 글에서는:

Kafka 내부 구조 — 파티션, Leader, ISR, Log Segment의 물리적 진실
Producer/Consumer의 내부 동작
Exactly-Once Semantics (EOS) — 전설의 진실
Transactional Outbox — DB와 이벤트의 일관성
Event Sourcing vs CDC vs Event Notification 구분
Schema Registry와 Avro/Protobuf/JSON Schema
Kafka Streams vs Flink vs ksqlDB 비교
Dead Letter Queue와 재처리
Pulsar, Redpanda 같은 현대 대안
100ms SLA 달성 실전 기법

이전 글 Zero-Downtime DB Migration에서 CDC를 언급했다. Kafka는 그 CDC의 버스이자 이벤트 아키텍처의 척추다.

1. Kafka의 탄생 — 2010년 LinkedIn

왜 만들었나

2010년 LinkedIn에서 500+개의 ETL 파이프라인이 점대점으로 얽혀 있었다.

프로필 서비스 → 검색 인덱서
프로필 서비스 → 데이터 웨어하우스
활동 로그 → 추천 시스템
활동 로그 → Hadoop

엔지니어 수만큼 파이프라인. 복잡도 폭발.

Jay Kreps의 제안

"모든 이벤트를 한 개의 통합 로그에 넣자. 소비자는 각자 읽자." 이 단순한 아이디어가 Kafka다.

왜 "Kafka"인가

Jay Kreps가 Franz Kafka(작가) 팬. "쓰기에 특화된 시스템" → "Kafka는 글을 많이 썼으니". 의외로 진지한 기술 제품의 이름 유래가 허무하다.

2011년 오픈소스, 2014년 Confluent 설립

Jay Kreps, Neha Narkhede, Jun Rao가 Confluent 창업. Kafka는 Apache로 기증. 지금은 Fortune 100의 80% 사용.

2. Kafka의 구성 요소 — 3분 요약

Topic

"이벤트의 종류". 예: user-clicks, order-created.

Partition

Topic을 여러 조각으로 나눈 것. 각 파티션은 순서 있는 로그. 파티션이 Kafka의 병렬성과 순서 보장의 단위.

Topic: orders
├── Partition 0: [evt1] [evt2] [evt3] ...
├── Partition 1: [evt4] [evt5] ...
├── Partition 2: [evt6] [evt7] [evt8] ...

Broker

Kafka 서버. 여러 개가 모여 Cluster.

Producer / Consumer

이벤트를 쓰고 읽는 클라이언트.

Consumer Group

여러 Consumer가 같은 Group ID로 묶이면, 각 파티션이 Group 내 한 Consumer에게만 할당. 병렬 처리.

Topic: orders (3 partitions)
Group: order-processor
├── Consumer A → Partition 0
├── Consumer B → Partition 1
└── Consumer C → Partition 2

Consumer가 Partition 수보다 많으면 나머지는 놀고, 적으면 한 Consumer가 여러 Partition 담당.

3. Partition 내부 — Log Segment의 물리적 진실

Append-only Log

각 Partition은 디스크의 파일. 오로지 끝에만 추가.

/var/lib/kafka/orders-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000000054321099.log
├── 00000000000054321099.index
├── 00000000000054321099.timeindex
└── leader-epoch-checkpoint

Segment 분리

1GB(기본) 또는 1주(기본)마다 새 Segment. 이유:

삭제/압축이 파일 단위로 용이
특정 시점의 데이터만 로드 가능

Index 파일

.index — offset → 파일 위치 매핑 (sparse)
.timeindex — timestamp → offset 매핑

Consumer가 "offset 1,234,567부터"를 요청하면 index로 O(log N) 탐색.

Zero-Copy로 빠른 이유

Kafka의 속도 비결: Linux sendfile() 시스템 콜. 디스크 → NIC로 커널 공간에서 직접. 유저스페이스 복사 없음.

전통 방식:
  디스크 → 커널 페이지 캐시 → 유저 버퍼 → 커널 소켓 버퍼 → NIC
  (4번 복사)

Zero-copy:
  디스크 → 커널 페이지 캐시 → NIC
  (2번 복사, CPU 개입 최소)

덕분에 단일 브로커가 수 GB/s 처리 가능.

Page Cache — Kafka의 "메모리"

Kafka는 자체 캐시를 최소로. 리눅스 page cache에 의존. 이론상:

Producer가 쓴 데이터는 page cache에 남음
Consumer가 바로 읽으면 디스크 I/O 없이 메모리에서

RAM이 많은 서버일수록 유리. "Kafka는 디스크 기반이지만 실제로는 메모리에서 읽는다."

4. Replication — Leader와 ISR의 정치학

Replication 기본

각 Partition은 N개 복제본. 하나는 Leader, 나머지는 Follower.

Producer/Consumer는 Leader에만 접근
Follower는 Leader에서 데이터를 pull

ISR (In-Sync Replicas)

Leader와 "거의 동기화된" Follower 목록. 정의:

최근 replica.lag.time.max.ms (기본 30초) 이내에 Leader를 따라잡음

Producer의 acks 설정이 ISR과 연관:

acks=0 — Producer는 응답 기다리지 않음. 최고 성능, 유실 가능
acks=1 — Leader만 쓰면 OK. Leader 죽으면 유실 가능
acks=all (=-1) — 모든 ISR이 쓸 때까지 대기. 최고 안전성

acks=all + min.insync.replicas=2 (기본 1) → 최소 2개 ISR 쓸 때까지 대기. Leader 죽어도 유실 없음.

Leader Election

Leader가 죽으면 Controller Broker가 ISR 중 하나를 새 Leader로 승격.

Unclean Leader Election — ISR이 비어있다면?

unclean.leader.election.enable=false (기본) → 가용성 손실 대신 데이터 보호
true → 비ISR에서 Leader 선출, 데이터 유실 가능성

금융권은 반드시 false.

KRaft — ZooKeeper 제거

2022년부터 Kafka는 KRaft 모드에서 ZooKeeper 불필요. Raft 기반 자체 메타데이터 관리. 2025년 4.0부터 ZooKeeper 완전 제거.

5. Producer 내부 — 배칭과 압축의 예술

Producer 아키텍처

[앱 코드]
    │ send(record)
    ▼
[Serializer] → [Partitioner] → [Accumulator (배칭)]
                                    │ batch.size 도달 or linger.ms 만료
                                    ▼
                             [Sender Thread]
                                    │ 각 Broker에 batch 전송
                                    ▼
                             [Broker Leader]

핵심 튜닝 포인트

batch.size — 배치 최대 크기 (기본 16KB). 크면 throughput 증가, 지연 증가
linger.ms — 배치를 채울 때까지 대기 시간 (기본 0). 5~20ms 추천
compression.type — none/gzip/snappy/lz4/zstd. lz4 또는 zstd 추천
acks — 앞서 설명
max.in.flight.requests.per.connection — 동시 요청 수. Idempotence 쓸 땐 5 이하

Partitioner 로직

Producer가 메시지를 어느 파티션에 보내나?

메시지에 partition 명시 → 그걸로
key 있음 → hash(key) % num_partitions (같은 키는 같은 파티션)
둘 다 없음 → 스티키 파티셔너 (배치 효율)

Idempotent Producer

enable.idempotence=true — Producer 재시도로 같은 메시지가 두 번 쓰이는 것을 방지. Producer 쪽 PID + sequence number로 중복 검출.

6. Consumer 내부 — Offset의 정치학

Offset이란

각 파티션 내에서 메시지의 순번. Consumer가 "어디까지 읽었는지" 기록.

Offset 저장 위치 변천사

2010~2014: ZooKeeper에 저장 (부하 폭증)
2014~: __consumer_offsets 내부 토픽에 저장 (Kafka 자체 관리)

Commit 전략

Auto-commit (기본)

enable.auto.commit=true, auto.commit.interval.ms=5000
자동으로 offset 저장
위험: 메시지 처리 중 Consumer 죽으면 재처리 (at-least-once)

Manual commit

for message in consumer:
    process(message)
    consumer.commit()  # 명시적으로

처리 후 커밋 → 더 안전
커밋 전 죽으면 다음 Consumer가 재처리

At-least-once vs At-most-once vs Exactly-once

At-least-once — 메시지가 최소 1번 전달. 중복 가능. (기본)
At-most-once — 최대 1번. 유실 가능.
Exactly-once — 정확히 1번. 복잡.

대부분 시스템은 At-least-once + 소비자 멱등성으로 사실상 exactly-once.

Rebalance — Consumer Group의 춤

Consumer가 추가/제거되면 파티션 재할당. 모든 Consumer가 멈춤 (stop-the-world).

Rebalance 중 수 초간 처리 중단
Static Membership (2.3+)로 완화
Cooperative Rebalancing (2.4+)로 점진적 재할당

7. Exactly-Once Semantics — 전설의 진실

2017년 발표의 충격

Confluent가 Kafka 0.11에서 EOS 발표. 커뮤니티 반응: "진짜?" 수십 년간 분산 시스템이 풀지 못한 난제.

두 가지 요소

1. Idempotent Producer (앞서 설명)

한 Producer가 retry로 중복 쓰지 않음
PID + sequence로 브로커가 중복 검출

2. Transactions

여러 파티션에 걸친 원자적 쓰기
Consumer의 offset commit도 트랜잭션에 포함

Producer 트랜잭션 예

producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("topic-a", key, value));
    producer.send(new ProducerRecord<>("topic-b", key, value));
    producer.sendOffsetsToTransaction(offsets, groupId);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

"두 토픽에 쓰기 + consumer offset 커밋"이 원자적. 하나라도 실패하면 전부 롤백.

Read Committed

Consumer가 isolation.level=read_committed면 커밋된 트랜잭션만 읽음. Abort된 메시지는 스킵.

한계

EOS는 Kafka 내부에서만 보장. DB까지 연결되면?

Kafka → DB: Kafka는 커밋했는데 DB는 실패 → 불일치
해법: Transactional Outbox 패턴 (다음 섹션)

성능 비용

EOS는 10~30% throughput 감소. 모든 워크로드에 적용은 낭비. 금융, 중복 치명적 사례에만.

8. Transactional Outbox — DB와 이벤트의 일관성

문제

def create_order(order):
    db.insert(order)              # 1
    kafka.send("orders", order)   # 2

1 성공, 2 실패 → DB엔 주문, Kafka엔 이벤트 없음.

순진한 해법: 순서 바꾸기

kafka.send("orders", order)   # 1
db.insert(order)              # 2

1 성공, 2 실패 → Kafka 이벤트 있는데 DB엔 없음. 더 나쁨.

Outbox 패턴

같은 DB 트랜잭션에 메인 write + 이벤트 write를 같이.

BEGIN;
INSERT INTO orders (...) VALUES (...);
INSERT INTO outbox (event_type, payload, created_at)
  VALUES ('order.created', ..., NOW());
COMMIT;

그 후 별도 Relay 프로세스가 outbox 테이블을 읽어 Kafka에 발행:

[App] ──► [DB: orders + outbox] ──► [Relay/Debezium] ──► [Kafka]

Debezium — 자동화된 Outbox

Debezium이 outbox 테이블의 binlog/WAL을 읽어 Kafka에 publish. 설치 후 테이블만 정의하면 끝.

장점

DB 트랜잭션으로 일관성
Kafka 장애 시에도 데이터는 DB에 남음
재처리 가능

함정

Outbox 테이블이 커짐 → 정기 cleanup 필요
Relay의 exactly-once 도 고려 (Debezium은 at-least-once)
소비자는 항상 멱등하게

9. Event Sourcing vs CDC vs Event Notification

셋의 차이 구분

혼용하면 설계가 엉망.

Event Notification

"무슨 일이 일어났다"는 단순 통지. 페이로드 최소.

{ "type": "order.created", "id": 42 }

소비자는 상세가 필요하면 API로 질의.

장점: 커플링 낮음, 페이로드 작음 단점: 소비자가 follow-up 호출 필요

Event-Carried State Transfer

통지에 전체 상태를 실음.

{
  "type": "order.created",
  "id": 42,
  "items": [...],
  "total": 99.99,
  "customer": {...}
}

장점: 소비자가 추가 호출 불필요 단점: 페이로드 큼, 스키마 변경 파급

Event Sourcing

"상태"가 아닌 **"상태를 만든 모든 이벤트"**를 저장. 현재 상태는 이벤트 replay로 재구성.

OrderCreated → ItemAdded → ItemAdded → DiscountApplied → OrderPaid

장점: 완전한 감사 로그, 과거 어느 시점도 재구성 단점: 복잡, 쿼리 어려움 (CQRS 필수), 스키마 진화 어려움

CDC (Change Data Capture)

DB 변경 자체를 이벤트화. 소스는 DB.

{
  "op": "u",
  "before": { "status": "pending" },
  "after": { "status": "paid" }
}

장점: 앱 코드 수정 불필요, Single Source of Truth 단점: DB 스키마가 이벤트 계약됨

현실 — 혼합

대규모 시스템은 셋을 목적별로 섞어 쓴다.

핵심 비즈니스: Outbox로 Event-Carried State
검색 인덱싱/분석: CDC
단순 알림: Event Notification

10. Schema Registry — 계약의 수호자

문제

Producer가 스키마 바꿨는데 Consumer가 따라오지 못함 → 파싱 실패. 수 주 뒤 발견.

Schema Registry

스키마의 중앙 저장소 + 호환성 검증. Confluent가 처음 제시.

흐름

Producer가 메시지 쓰기 전, 스키마 등록
Registry가 기존 스키마와 호환성 체크
Breaking change면 거부
메시지에는 스키마 ID만 포함, Consumer가 ID로 스키마 조회

호환성 모드

BACKWARD — 새 스키마로 옛 데이터 읽기 가능 (컬럼 추가 OK, 삭제 X)
FORWARD — 옛 스키마로 새 데이터 읽기 가능
FULL — 양방향
NONE — 검증 없음 (위험)

Producer 먼저 배포, Consumer는 천천히 업그레이드할 거면 BACKWARD.

Avro vs Protobuf vs JSON Schema

항목	Avro	Protobuf	JSON Schema
스키마 진화	뛰어남	좋음	제한적
페이로드 크기	작음	가장 작음	큼
가독성	낮음	낮음	높음
생태계	Hadoop/Kafka	gRPC	웹
기본 Kafka	Avro 선호	인기 급상승	레거시

2025년 추세: Protobuf. gRPC 생태계와 공유, 크기/속도.

11. Stream Processing — Kafka Streams vs Flink vs ksqlDB

Kafka Streams (라이브러리)

Java/Scala 앱에 embedded. 별도 클러스터 불필요.

stream.filter(...)
      .map(...)
      .groupByKey()
      .count()
      .to("output-topic");

장점: 간단, 인프라 없음, Kafka에 강한 결합 단점: JVM only, 대규모 상태 처리 약함

Apache Flink (별도 클러스터)

Stream 처리의 패왕. SQL, Java, Python, Scala.

장점:

True streaming (batch도 지원)
Exactly-once 체크포인팅
대규모 상태 처리 (수 TB)
Event time 처리 뛰어남

단점:

별도 클러스터 관리
학습 곡선 가파름

ksqlDB

Kafka 위의 SQL 엔진. Stream/Table을 SQL로 조작.

CREATE STREAM orders_large AS
  SELECT * FROM orders WHERE amount > 1000;

장점: SQL 친숙, 빠른 프로토타이핑 단점: 복잡한 로직 한계, 성능이 Flink만 못함

선택 가이드

JVM + 간단 로직 + Kafka 중심 → Kafka Streams
복잡한 상태, 대규모, 다언어 → Flink
SQL로 빠르게 → ksqlDB

Flink의 채택이 최근 2~3년 급증. AWS는 Managed Flink, Confluent는 Flink 통합 제공.

12. Dead Letter Queue와 재처리

실패하는 메시지의 운명

처리 중 예외 발생:

def process(message):
    order = json.loads(message)  # 여기서 파싱 실패!
    db.save(order)

선택:

Retry 영원히 — 독이 든 메시지(poison pill) 하나가 전체 중단
Skip하고 로그 — 데이터 유실
DLQ로 보내기 — 별도 토픽에 저장, 별도 처리

DLQ 패턴

[topic: orders] ──► [Consumer] ──► 실패 시 ──► [topic: orders-dlq]
                        │
                        ▼
                    [DLQ Processor] — 수동/자동 재처리

재처리 전략

Alert만 — 엔지니어가 수동 확인
자동 재처리 — TTL 붙여서 재시도 (n번까지)
수정 후 replay — 소비자 코드 수정 후 DLQ 전체 재처리
Kill Switch — 특정 조건 메시지만 skip

Spring Cloud Stream / Kafka Streams DSL

표준 라이브러리들이 DLQ 내장. DeadLetterPublishingRecoverer 등.

13. Pulsar, Redpanda — Kafka의 대안들

Apache Pulsar (Yahoo, 2016)

특징:

분리 아키텍처 — 브로커(무상태)와 스토리지(Apache BookKeeper) 분리
Multi-tenancy 네이티브
Geo-replication 기본
Kafka 프로토콜 호환

장점: 확장 독립적, 컴퓨트/스토리지 분리 단점: 운영 복잡도 (BookKeeper + ZooKeeper + Broker 3계층)

Redpanda (2020)

특징:

C++ 재구현, JVM 없음
Kafka 프로토콜 호환 (드롭인 교체)
ZooKeeper 없이 Raft
단일 바이너리

장점:

같은 하드웨어에서 10배 성능
낮은 지연 (p99 밀리초)
운영 단순

단점: 생태계는 여전히 Kafka가 큼, 모든 툴이 호환되진 않음

WarpStream (2023)

S3 기반. 로컬 디스크 없음. 모든 데이터 S3에.

저장 비용 극적 절감
지연은 200ms 정도 (S3 특성)
스트리밍 분석 워크로드에 적합

선택

표준/생태계 → Kafka
단순 운영 + 고성능 → Redpanda
Multi-tenant 대규모 → Pulsar
저장 비용 최소 → WarpStream

14. 100ms SLA 달성 실전 기법

초저지연 목표:

Producer to Broker: 5ms 미만
Broker: 디스크 반영 10ms 미만
Broker to Consumer: 5ms 미만
처리: 50ms 미만
총: 100ms 미만

달성 기법

Producer linger.ms=0 — 배칭 대신 즉시 전송 (throughput 감소 트레이드오프)
SSD + 풍부한 page cache — 디스크 I/O 최소
min.insync.replicas=2 — 과도한 내구성 X (Leader + 1 follower)
압축 = lz4 — 가장 빠른 코덱
Consumer fetch.min.bytes=1 — 즉시 가져옴
핫 파티션 모니터링 — 한 파티션 포화 방지
Direct Memory I/O — JVM 튜닝 (Kafka Streams 앱)
Consumer Poll loop 최적화 — 처리 중 poll 차단 방지

관측성

필수 메트릭:

Producer record-send-rate, batch-size-avg
Broker RequestHandlerAvgIdlePercent (낮으면 포화)
Consumer records-lag-max (가장 중요)
네트워크 I/O

15. 실전 함정과 안티패턴

함정 1: 파티션 수를 나중에 늘림

Partition 수는 한번 정하면 잘 못 바꿈. 늘리면 키의 해시 매핑이 바뀌어 순서 보장 깨짐. 처음부터 넉넉히 (예상치 3배).

함정 2: Key 없이 순서 기대

Key 없으면 라운드 로빈 → 순서 보장 없음. 순서가 중요하면 반드시 key.

함정 3: Large Messages (>1MB)

Kafka는 기본 1MB 메시지 제한. 대용량은 S3에 올리고 참조만 Kafka로.

함정 4: Consumer가 느림 → Rebalance 폭증

max.poll.interval.ms(기본 5분) 초과 시 Consumer kick. 처리 배치가 길면 늘리거나 프로세스를 잘게.

함정 5: Transactional이 모든 걸 해결

트랜잭션은 10~30% 성능 대가. Event Notification 정도는 필요 없음.

함정 6: 토픽 수 폭증

1만 토픽 이상은 Kafka 메타데이터 압박. 적정 수로 통합 + 헤더/타입 필드로 구분.

함정 7: retention 미설정

기본 7일. 분석 목적이면 짧고, 이벤트 소스면 길게. compacted topic으로 최신만 유지 가능.

함정 8: Schema Registry 없이 운영

Breaking change 탐지 못함. 운영 6개월만 해도 후회.

16. 실전 체크리스트 12가지

Partition 수는 넉넉히 — 후회 없이
Key 전략을 명확히 — 순서/분산 트레이드오프
acks=all + min.insync.replicas=2 — 내구성 표준
Idempotent Producer ON
Schema Registry 도입 — 계약 수호
Transactional Outbox — DB 일관성
DLQ 토픽 준비 — poison pill 대비
Consumer lag 모니터링 — 지연의 핵심 지표
압축 lz4 또는 zstd
KRaft 모드로 — ZooKeeper 제거
Rebalance 최소화 — Static Membership, Cooperative
Stream 처리 도구 선택 — 요구사항에 맞게

다음 글 예고 — Redis 내부와 분산 캐시 전략

Kafka가 이벤트의 버스라면, Redis는 실시간 데이터의 임시 저장소이자 연산 엔진이다. 다음 글에서는:

Redis의 싱글 스레드 아이디어 — 왜 빠른가
다양한 자료구조 — String, List, Set, Hash, Sorted Set, Stream, HyperLogLog, Bitmap
Redis Cluster — Hash Slot과 Redirection
Redis Sentinel — 고가용성
Persistence — RDB vs AOF
Cache 패턴 — Cache-Aside, Write-Through, Write-Behind
Thundering Herd 문제와 해법
분산 락 — Redlock 논쟁
Valkey — Redis 포크 탄생 배경 (2024)
Dragonfly / KeyDB — 멀티스레드 대안
2024년 Redis 라이선스 변경의 충격

Redis를 캐시로만 쓰는 팀이 많은데, 실제로는 훨씬 풍부한 도구다. 다음 글에서 그 전모를 보자.

"Redis는 자료구조 서버(Data Structure Server)다. 캐시는 사용처의 하나일 뿐이다." — Salvatore Sanfilippo (antirez)

Kafka and Event-Driven Architecture Deep Dive — Partition/ISR, Exactly-Once, Outbox, Schema Registry, Flink

Introduction — The Emptiness of "We Use Kafka"

Common conference refrain: "We use Kafka for event-driven architecture." Drill deeper:

"Why that partition count?" → "Defaults..."
"Exactly-Once guaranteed?" → "Maybe..."
"Consumer drops a message?" → "...?"

Kafka is easy to write, hard to operate. This post covers internal structure (partitions, Leader, ISR, Log Segment), Producer/Consumer internals, EOS truth, Transactional Outbox, Event Sourcing vs CDC vs Event Notification, Schema Registry, Kafka Streams vs Flink vs ksqlDB, DLQ, modern alternatives (Pulsar, Redpanda), and 100ms SLA techniques.

1. Kafka's Birth — LinkedIn 2010

In 2010 LinkedIn had 500+ ETL pipelines point-to-point entangled. Profile service to search indexer, profile to warehouse, activity log to recommendations, activity log to Hadoop. Complexity exploded.

Jay Kreps' proposal: put every event into one unified log; consumers read at their own pace. That simple idea is Kafka. Named after Franz Kafka because "Kafka wrote a lot" — a write-optimized system. Open-sourced 2011, Confluent founded 2014 (Kreps, Narkhede, Rao). Now used by 80% of Fortune 100.

2. Kafka Components — 3-Minute Summary

Topic: category of events (e.g. user-clicks, order-created).

Partition: topic sliced into ordered logs. The unit of parallelism and ordering.

Topic: orders
├── Partition 0: [evt1] [evt2] [evt3] ...
├── Partition 1: [evt4] [evt5] ...
├── Partition 2: [evt6] [evt7] [evt8] ...

Broker: Kafka server; multiple form a Cluster. Producer/Consumer: write/read clients. Consumer Group: consumers sharing a Group ID; each partition assigned to exactly one consumer in the group.

Topic: orders (3 partitions)
Group: order-processor
├── Consumer A → Partition 0
├── Consumer B → Partition 1
└── Consumer C → Partition 2

More consumers than partitions: idle. Fewer: one consumer handles several partitions.

3. Partition Internals — Log Segment Physical Truth

Append-only log: each partition is a disk file, append only at the end.

/var/lib/kafka/orders-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000000054321099.log
├── 00000000000054321099.index
├── 00000000000054321099.timeindex
└── leader-epoch-checkpoint

Segment rolling: new segment every 1GB or 1 week (default). Enables file-level delete/compaction and loading only relevant time ranges.

Index files: .index maps offset to file position (sparse); .timeindex maps timestamp to offset. Offset lookup is O(log N).

Zero-copy speed: Linux sendfile() sends disk to NIC inside kernel. No userspace copy.

Traditional:
  disk → kernel page cache → user buffer → kernel socket buffer → NIC
  (4 copies)

Zero-copy:
  disk → kernel page cache → NIC
  (2 copies, minimal CPU)

A single broker handles multi-GB/s. Page cache: Kafka uses minimal internal cache, relies on Linux page cache. Producer writes sit in page cache; consumers read from memory. More RAM equals more benefit.

4. Replication — Leader and ISR Politics

Each partition has N replicas; one Leader, rest Followers. Producer/Consumer talk only to Leader; Followers pull.

ISR (In-Sync Replicas): followers within replica.lag.time.max.ms (default 30s). Producer acks:

acks=0 — no wait, max throughput, loss possible
acks=1 — wait for Leader only
acks=all — wait for all ISR

acks=all + min.insync.replicas=2 gives durability even if Leader dies.

Leader Election: Controller Broker promotes an ISR member. Unclean Leader Election: if ISR empty, unclean.leader.election.enable=false (default) protects data over availability. Finance: always false.

KRaft: since 2022 Kafka uses Raft metadata, no ZooKeeper. Kafka 4.0 (2025) fully removes ZooKeeper.

5. Producer Internals — Batching and Compression

[App]
   │ send(record)
   ▼
[Serializer] → [Partitioner] → [Accumulator (batching)]
                                   │ batch.size reached or linger.ms expired
                                   ▼
                            [Sender Thread]
                                   │ batch per broker
                                   ▼
                            [Broker Leader]

Key tuning:

batch.size — max batch (default 16KB); larger = more throughput, more latency
linger.ms — wait to fill batch (default 0); 5-20ms recommended
compression.type — lz4 or zstd recommended
max.in.flight.requests.per.connection — must be <=5 with idempotence

Partitioner: explicit partition, else hash(key) % num_partitions (same key same partition), else sticky partitioner.

Idempotent Producer: enable.idempotence=true prevents duplicate writes on retry via PID + sequence number.

6. Consumer Internals — Offset Politics

Offset: sequence number within a partition. Consumer records "how far it has read".

Offset storage history: 2010-2014 ZooKeeper (overloaded); 2014+ __consumer_offsets internal topic.

Auto-commit (default): enable.auto.commit=true, auto.commit.interval.ms=5000. Risk: consumer dies mid-processing, reprocesses.

Manual commit:

for message in consumer:
    process(message)
    consumer.commit()

Delivery semantics: At-least-once (default, duplicates possible), At-most-once (loss possible), Exactly-once (complex). Most systems use at-least-once plus consumer idempotence.

Rebalance: when consumers join/leave, partitions reassigned; stop-the-world for seconds. Mitigations: Static Membership (2.3+), Cooperative Rebalancing (2.4+).

7. Exactly-Once Semantics — Truth of the Legend

2017 Confluent announced EOS in Kafka 0.11. Two pillars:

1. Idempotent Producer — no duplicate writes from retries (PID + sequence).

2. Transactions — atomic writes across partitions + consumer offset commit in one transaction.

producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("topic-a", key, value));
    producer.send(new ProducerRecord<>("topic-b", key, value));
    producer.sendOffsetsToTransaction(offsets, groupId);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

Read Committed: isolation.level=read_committed skips aborted messages.

Limit: EOS only inside Kafka. Kafka to DB sync needs Transactional Outbox. Cost: 10-30% throughput drop; reserve for critical paths only.

8. Transactional Outbox — DB and Event Consistency

Problem:

def create_order(order):
    db.insert(order)              # 1
    kafka.send("orders", order)   # 2

If 1 succeeds and 2 fails, DB has order but no event. Swapping order is worse.

Outbox pattern: write main record and event in same DB transaction.

BEGIN;
INSERT INTO orders (...) VALUES (...);
INSERT INTO outbox (event_type, payload, created_at)
  VALUES ('order.created', ..., NOW());
COMMIT;

Separate relay process publishes outbox to Kafka.

[App] ──► [DB: orders + outbox] ──► [Relay/Debezium] ──► [Kafka]

Debezium: reads DB binlog/WAL and publishes to Kafka automatically. Consistent via DB transaction; data survives Kafka outage; replay possible. Pitfalls: outbox grows (needs cleanup), relay is at-least-once (consumers must be idempotent).

9. Event Sourcing vs CDC vs Event Notification

Event Notification: minimal payload, "something happened". Consumer queries API for detail.

{ "type": "order.created", "id": 42 }

Low coupling but requires follow-up calls.

Event-Carried State Transfer: full state in event. No extra calls but larger payload, schema changes propagate.

{
  "type": "order.created",
  "id": 42,
  "items": [],
  "total": 99.99,
  "customer": {}
}

Event Sourcing: store events not state; current state reconstructed by replay.

OrderCreated → ItemAdded → ItemAdded → DiscountApplied → OrderPaid

Full audit, time travel; requires CQRS, hard queries, schema evolution pain.

CDC: DB changes as events.

{
  "op": "u",
  "before": { "status": "pending" },
  "after": { "status": "paid" }
}

No app code change but DB schema becomes the contract. Real systems mix all three: Outbox for core business, CDC for indexing/analytics, Event Notification for alerts.

10. Schema Registry — Contract Guardian

Problem: producer changes schema, consumer breaks weeks later. Schema Registry is central store + compatibility check (Confluent).

Flow: producer registers schema, registry checks compatibility with existing versions, rejects breaking changes; message carries schema ID only.

Compatibility modes:

BACKWARD — new schema reads old data (add ok, remove no)
FORWARD — old schema reads new data
FULL — bidirectional
NONE — no checks

Producer-first deployment: BACKWARD.

Item	Avro	Protobuf	JSON Schema
Evolution	Excellent	Good	Limited
Payload size	Small	Smallest	Large
Readability	Low	Low	High
Ecosystem	Hadoop/Kafka	gRPC	Web
Kafka default	Preferred	Rising	Legacy

2025 trend: Protobuf (gRPC ecosystem, size/speed).

11. Stream Processing — Kafka Streams vs Flink vs ksqlDB

Kafka Streams (embedded library): no separate cluster.

stream.filter(...)
      .map(...)
      .groupByKey()
      .count()
      .to("output-topic");

Simple, JVM-only, weak at large state.

Apache Flink (separate cluster): stream-processing king. True streaming, exactly-once checkpointing, large state (TB scale), excellent event-time handling. Steeper learning curve.

ksqlDB: SQL over Kafka.

CREATE STREAM orders_large AS
  SELECT * FROM orders WHERE amount > 1000;

Fast prototyping; weaker than Flink for complex logic.

Guide: JVM + simple + Kafka-centric = Streams; complex/large/polyglot = Flink; quick SQL = ksqlDB. Flink adoption surged the last 2-3 years (AWS Managed Flink, Confluent Flink).

12. Dead Letter Queue and Reprocessing

Processing exception options:

Retry forever — poison pill halts everything
Skip and log — data loss
DLQ — park and handle separately

[topic: orders] ──► [Consumer] ──► on failure ──► [topic: orders-dlq]
                        │
                        ▼
                    [DLQ Processor] — manual/auto replay

Strategies: alert only, auto retry with TTL, fix-then-replay, kill switch by condition. Spring Cloud Stream and Kafka Streams DSL have built-in DLQ (DeadLetterPublishingRecoverer).

13. Pulsar, Redpanda — Kafka Alternatives

Apache Pulsar (Yahoo 2016): separated broker (stateless) and storage (BookKeeper). Multi-tenant native, geo-replication default, Kafka protocol compatible. Independent scaling; operational complexity (3 layers).

Redpanda (2020): C++ rewrite, no JVM. Kafka protocol compatible drop-in. Raft without ZooKeeper, single binary. 10x performance on same hardware, p99 milliseconds, simple ops. Ecosystem smaller than Kafka.

WarpStream (2023): S3-based, no local disk. Huge storage cost savings; latency around 200ms (S3). Streaming analytics fit.

Selection: standard/ecosystem = Kafka; simple ops + performance = Redpanda; multi-tenant large = Pulsar; lowest storage cost = WarpStream.

14. 100ms SLA Techniques

Targets: Producer to Broker under 5ms; Broker disk flush under 10ms; Broker to Consumer under 5ms; processing under 50ms; total under 100ms.

Techniques:

linger.ms=0 — send immediately (tradeoff throughput)
SSD + large page cache
min.insync.replicas=2 — not overly durable
compression.type=lz4 — fastest codec
fetch.min.bytes=1 — immediate fetch
Hot partition monitoring
Direct memory I/O JVM tuning
Poll loop optimization (no blocking during processing)

Key metrics: Producer record-send-rate, batch-size-avg; Broker RequestHandlerAvgIdlePercent; Consumer records-lag-max (most important).

15. Pitfalls and Anti-Patterns

Growing partitions later — breaks key hash mapping and order. Overprovision 3x up front.
Ordering without key — round-robin loses order.
Large messages (>1MB) — default limit; store in S3, reference in Kafka.
Slow consumer causes rebalance storm — tune max.poll.interval.ms (default 5 min).
Transactions as silver bullet — 10-30% overhead; skip for notifications.
Topic explosion — 10k+ topics strain metadata; consolidate with header/type fields.
No retention config — default 7 days; use compacted topics for latest-only.
No Schema Registry — breaking changes slip through; regret within 6 months.

16. Checklist of 12

Over-provision partitions
Clear key strategy
acks=all + min.insync.replicas=2
Idempotent Producer on
Schema Registry in place
Transactional Outbox
DLQ topic ready
Consumer lag monitoring
lz4 or zstd compression
KRaft mode
Minimize rebalance (Static, Cooperative)
Right stream processor per need

Next Post Preview — Redis Internals and Distributed Cache Strategy

If Kafka is the event bus, Redis is the realtime data store and compute engine. Next post covers single-thread idea, data structures (String, List, Set, Hash, Sorted Set, Stream, HyperLogLog, Bitmap), Redis Cluster with Hash Slot and Redirection, Sentinel HA, RDB vs AOF, Cache-Aside/Write-Through/Write-Behind, Thundering Herd mitigations, Redlock debate, Valkey fork (2024), Dragonfly/KeyDB multithread alternatives, and the 2024 license upheaval.

"Redis is a data structure server. Cache is just one use case." — Salvatore Sanfilippo (antirez)