Kafka and Event-Driven Architecture Deep Dive — Partition/ISR, Exactly-Once, Outbox, Schema Registry, Flink
Author: Youngju Kim (@fjvbn20031)
Introduction — The Emptiness of "We Use Kafka"
Common conference refrain: "We use Kafka for event-driven architecture." Drill deeper:
- "Why that partition count?" → "Defaults..."
- "Exactly-Once guaranteed?" → "Maybe..."
- "Consumer drops a message?" → "...?"
Kafka is easy to write, hard to operate. This post covers internal structure (partitions, Leader, ISR, Log Segment), Producer/Consumer internals, EOS truth, Transactional Outbox, Event Sourcing vs CDC vs Event Notification, Schema Registry, Kafka Streams vs Flink vs ksqlDB, DLQ, modern alternatives (Pulsar, Redpanda), and 100ms SLA techniques.
1. Kafka's Birth — LinkedIn 2010
In 2010 LinkedIn had 500+ ETL pipelines point-to-point entangled. Profile service to search indexer, profile to warehouse, activity log to recommendations, activity log to Hadoop. Complexity exploded.
Jay Kreps' proposal: put every event into one unified log; consumers read at their own pace. That simple idea is Kafka. Named after Franz Kafka because "Kafka wrote a lot" — a write-optimized system. Open-sourced 2011, Confluent founded 2014 (Kreps, Narkhede, Rao). Now used by 80% of Fortune 100.
2. Kafka Components — 3-Minute Summary
Topic: category of events (e.g. user-clicks, order-created).
Partition: topic sliced into ordered logs. The unit of parallelism and ordering.
Topic: orders
├── Partition 0: [evt1] [evt2] [evt3] ...
├── Partition 1: [evt4] [evt5] ...
├── Partition 2: [evt6] [evt7] [evt8] ...
Broker: Kafka server; multiple form a Cluster. Producer/Consumer: write/read clients. Consumer Group: consumers sharing a Group ID; each partition assigned to exactly one consumer in the group.
Topic: orders (3 partitions)
Group: order-processor
├── Consumer A → Partition 0
├── Consumer B → Partition 1
└── Consumer C → Partition 2
More consumers than partitions: idle. Fewer: one consumer handles several partitions.
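A minimal sketch of one consumer joining that group with the Java client (broker address and names are placeholders, not prescriptions):

```java
// Every consumer created with the same group.id shares the partitions of "orders".
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("orders"));   // the group coordinator assigns partitions
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        records.forEach(r ->
            System.out.printf("p%d@%d: %s%n", r.partition(), r.offset(), r.value()));
    }
}
```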
3. Partition Internals — Log Segment Physical Truth
Append-only log: each partition is a disk file, append only at the end.
/var/lib/kafka/orders-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000000054321099.log
├── 00000000000054321099.index
├── 00000000000054321099.timeindex
└── leader-epoch-checkpoint
Segment rolling: new segment every 1GB or 1 week (default). Enables file-level delete/compaction and loading only relevant time ranges.
Index files: .index maps offset to file position (sparse); .timeindex maps timestamp to offset. Offset lookup is O(log N).
Zero-copy speed: Linux sendfile() sends disk to NIC inside kernel. No userspace copy.
Traditional:
disk → kernel page cache → user buffer → kernel socket buffer → NIC
(4 copies)
Zero-copy:
disk → kernel page cache → NIC
(2 copies, minimal CPU)
A single broker can push multiple GB/s. Page cache: Kafka keeps little internal caching and relies on the Linux page cache; producer writes land in page cache and consumers usually read straight from memory. More RAM, more benefit.
4. Replication — Leader and ISR Politics
Each partition has N replicas; one Leader, rest Followers. Producer/Consumer talk only to Leader; Followers pull.
ISR (In-Sync Replicas): followers within replica.lag.time.max.ms (default 30s). Producer acks:
- acks=0 — no wait, max throughput, loss possible
- acks=1 — wait for Leader only
- acks=all — wait for all ISR
acks=all + min.insync.replicas=2 gives durability even if Leader dies.
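A hedged producer-side sketch of that combination (min.insync.replicas lives on the topic/broker, not the producer; broker address and names are placeholders):

```java
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.ACKS_CONFIG, "all");   // wait for every in-sync replica
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Topic side (set at creation or via kafka-configs): min.insync.replicas=2
```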
Leader Election: Controller Broker promotes an ISR member. Unclean Leader Election: if ISR empty, unclean.leader.election.enable=false (default) protects data over availability. Finance: always false.
KRaft: since 2022 Kafka uses Raft metadata, no ZooKeeper. Kafka 4.0 (2025) fully removes ZooKeeper.
5. Producer Internals — Batching and Compression
[App]
│ send(record)
▼
[Serializer] → [Partitioner] → [Accumulator (batching)]
│ batch.size reached or linger.ms expired
▼
[Sender Thread]
│ batch per broker
▼
[Broker Leader]
Key tuning:
- batch.size — max batch (default 16KB); larger = more throughput, more latency
- linger.ms — wait to fill batch (default 0); 5-20ms recommended
- compression.type — lz4 or zstd recommended
- max.in.flight.requests.per.connection — must be <=5 with idempotence
Partitioner: explicit partition, else hash(key) % num_partitions (same key same partition), else sticky partitioner.
Idempotent Producer: enable.idempotence=true prevents duplicate writes on retry via PID + sequence number.
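A throughput-oriented tuning sketch with the Java producer; the values are illustrative, not prescriptive:

```java
Properties props = new Properties();
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);              // 64KB batches (default 16KB)
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);                       // wait up to 10ms to fill a batch
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");             // cheap CPU, good ratio
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);            // PID + sequence number dedup
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);   // must stay <=5 with idempotence
```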
6. Consumer Internals — Offset Politics
Offset: sequence number within a partition. Consumer records "how far it has read".
Offset storage history: 2010-2014 ZooKeeper (overloaded); 2014+ __consumer_offsets internal topic.
Auto-commit (default): enable.auto.commit=true, auto.commit.interval.ms=5000. Risk: offsets can be committed before processing finishes (loss), or the consumer dies mid-batch before the next commit (reprocessing).
Manual commit:
for message in consumer:
    process(message)       # handle the record first
    consumer.commit()      # commit the offset only after processing succeeds
Delivery semantics: At-least-once (default, duplicates possible), At-most-once (loss possible), Exactly-once (complex). Most systems use at-least-once plus consumer idempotence.
Rebalance: when consumers join/leave, partitions reassigned; stop-the-world for seconds. Mitigations: Static Membership (2.3+), Cooperative Rebalancing (2.4+).
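A config sketch for both mitigations (the instance ID is a placeholder; each consumer instance needs a unique group.instance.id):

```java
Properties props = new Properties();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "consumer-a");      // Static Membership (2.3+)
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          CooperativeStickyAssignor.class.getName());                  // Cooperative Rebalancing (2.4+)
```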
7. Exactly-Once Semantics — Truth of the Legend
2017 Confluent announced EOS in Kafka 0.11. Two pillars:
1. Idempotent Producer — no duplicate writes from retries (PID + sequence).
2. Transactions — atomic writes across partitions + consumer offset commit in one transaction.
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("topic-a", key, value));
    producer.send(new ProducerRecord<>("topic-b", key, value));
    // commit the consumed offsets as part of the same transaction
    producer.sendOffsetsToTransaction(offsets, groupId);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();            // fatal: another producer with the same transactional.id took over
} catch (KafkaException e) {
    producer.abortTransaction(); // retriable: abort and reprocess the batch
}
Read Committed: isolation.level=read_committed skips aborted messages.
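On the consumer side, a one-line sketch:

```java
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");   // skip records from aborted transactions
```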
Limit: EOS only inside Kafka. Kafka to DB sync needs Transactional Outbox. Cost: 10-30% throughput drop; reserve for critical paths only.
8. Transactional Outbox — DB and Event Consistency
Problem:
def create_order(order):
    db.insert(order)             # 1. write the order row
    kafka.send("orders", order)  # 2. publish the event (can fail independently of step 1)
If 1 succeeds and 2 fails, the DB has the order but no event was published. Swapping the two steps is worse: an event goes out for an order that was never persisted.
Outbox pattern: write main record and event in same DB transaction.
BEGIN;
INSERT INTO orders (...) VALUES (...);
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('order.created', ..., NOW());
COMMIT;
Separate relay process publishes outbox to Kafka.
[App] ──► [DB: orders + outbox] ──► [Relay/Debezium] ──► [Kafka]
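A minimal polling-relay sketch, assuming a hypothetical outbox table with id / payload / published columns (table, column, and topic names are illustrative; dataSource and producer are assumed to exist, and the enclosing method declares throws Exception). It stays at-least-once: a crash between send and update replays the row.

```java
// Poll unpublished rows, publish each one, then mark it as sent.
try (Connection conn = dataSource.getConnection();
     PreparedStatement select = conn.prepareStatement(
         "SELECT id, payload FROM outbox WHERE published = false ORDER BY id LIMIT 100")) {
    ResultSet rs = select.executeQuery();
    while (rs.next()) {
        long id = rs.getLong("id");
        // block until the broker acknowledges, so a row is marked sent only after a successful publish
        producer.send(new ProducerRecord<>("orders", String.valueOf(id), rs.getString("payload"))).get();
        try (PreparedStatement update = conn.prepareStatement(
                 "UPDATE outbox SET published = true WHERE id = ?")) {
            update.setLong(1, id);
            update.executeUpdate();
        }
    }
}
```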
Debezium: reads DB binlog/WAL and publishes to Kafka automatically. Consistent via DB transaction; data survives Kafka outage; replay possible. Pitfalls: outbox grows (needs cleanup), relay is at-least-once (consumers must be idempotent).
9. Event Sourcing vs CDC vs Event Notification
Event Notification: minimal payload, "something happened". Consumer queries API for detail.
{ "type": "order.created", "id": 42 }
Low coupling but requires follow-up calls.
Event-Carried State Transfer: full state in event. No extra calls but larger payload, schema changes propagate.
{
"type": "order.created",
"id": 42,
"items": [],
"total": 99.99,
"customer": {}
}
Event Sourcing: store events not state; current state reconstructed by replay.
OrderCreated → ItemAdded → ItemAdded → DiscountApplied → OrderPaid
Full audit, time travel; requires CQRS, hard queries, schema evolution pain.
CDC: DB changes as events.
{
"op": "u",
"before": { "status": "pending" },
"after": { "status": "paid" }
}
No app code change but DB schema becomes the contract. Real systems mix all three: Outbox for core business, CDC for indexing/analytics, Event Notification for alerts.
10. Schema Registry — Contract Guardian
Problem: the producer changes a schema and a consumer breaks weeks later. Schema Registry (Confluent) is a central schema store plus a compatibility gate.
Flow: producer registers schema, registry checks compatibility with existing versions, rejects breaking changes; message carries schema ID only.
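A hedged producer config sketch with Confluent's Avro serializer (assumes the kafka-avro-serializer dependency; the registry URL is a placeholder):

```java
Properties props = new Properties();
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://schema-registry:8081");
// The serializer registers/looks up the schema and embeds only the schema ID in each message.
```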
Compatibility modes:
- BACKWARD — consumers on the new schema can still read old data (delete fields, add fields with defaults)
- FORWARD — consumers on the old schema can read new data (add fields, delete optional fields)
- FULL — both directions
- NONE — no checks
Upgrade order: with BACKWARD, upgrade consumers first; with FORWARD, producers can deploy first.
| Item | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Evolution | Excellent | Good | Limited |
| Payload size | Small | Smallest | Large |
| Readability | Low | Low | High |
| Ecosystem | Hadoop/Kafka | gRPC | Web |
| Status in Kafka ecosystem | Preferred | Rising | Legacy |
2025 trend: Protobuf (gRPC ecosystem, size/speed).
11. Stream Processing — Kafka Streams vs Flink vs ksqlDB
Kafka Streams (embedded library): no separate cluster.
stream.filter(...)
      .map(...)
      .groupByKey()
      .count()
      .toStream()            // count() returns a KTable; convert back to a stream before writing
      .to("output-topic");
Simple, JVM-only, weak at large state.
Apache Flink (separate cluster): stream-processing king. True streaming, exactly-once checkpointing, large state (TB scale), excellent event-time handling. Steeper learning curve.
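A minimal Flink DataStream sketch, assuming the flink-connector-kafka dependency; topic, group, and broker names are placeholders and the filter is placeholder logic:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10_000);   // exactly-once checkpoints every 10s

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics("orders")
        .setGroupId("flink-orders")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source")
   .filter(value -> value.contains("order.created"))
   .print();

env.execute("orders-filter");
```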
ksqlDB: SQL over Kafka.
CREATE STREAM orders_large AS
SELECT * FROM orders WHERE amount > 1000;
Fast prototyping; weaker than Flink for complex logic.
Guide: JVM + simple + Kafka-centric = Streams; complex/large/polyglot = Flink; quick SQL = ksqlDB. Flink adoption surged the last 2-3 years (AWS Managed Flink, Confluent Flink).
12. Dead Letter Queue and Reprocessing
Processing exception options:
- Retry forever — poison pill halts everything
- Skip and log — data loss
- DLQ — park and handle separately
[topic: orders] ──► [Consumer] ──► on failure ──► [topic: orders-dlq]
│
▼
[DLQ Processor] — manual/auto replay
Strategies: alert only, auto retry with TTL, fix-then-replay, kill switch by condition. Spring Kafka (DeadLetterPublishingRecoverer) and Spring Cloud Stream have built-in DLQ support.
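A hedged Spring Kafka sketch of that recoverer (kafkaTemplate and listenerContainerFactory are assumed to be existing beans; failed records land on <topic>.DLT by default):

```java
// Retry a failing record 3 times, 1s apart, then publish it to the dead-letter topic.
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
DefaultErrorHandler errorHandler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
listenerContainerFactory.setCommonErrorHandler(errorHandler);
```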
13. Pulsar, Redpanda — Kafka Alternatives
Apache Pulsar (Yahoo 2016): separated broker (stateless) and storage (BookKeeper). Multi-tenant native, geo-replication default, Kafka protocol compatible. Independent scaling; operational complexity (3 layers).
Redpanda (2020): C++ rewrite, no JVM. Kafka protocol compatible drop-in. Raft without ZooKeeper, single binary. Claims up to 10x performance on the same hardware, p99 latency in the low milliseconds, simple ops. Ecosystem smaller than Kafka's.
WarpStream (2023): S3-based, no local disk. Huge storage cost savings; latency around 200ms because of S3. A fit for streaming analytics where sub-second latency is not required.
Selection: standard/ecosystem = Kafka; simple ops + performance = Redpanda; multi-tenant large = Pulsar; lowest storage cost = WarpStream.
14. 100ms SLA Techniques
Targets: Producer to Broker under 5ms; Broker disk flush under 10ms; Broker to Consumer under 5ms; processing under 50ms; total under 100ms.
Techniques (a config sketch follows this list):
- linger.ms=0 — send immediately (trades away batching throughput)
- SSD + large page cache
- min.insync.replicas=2 — durable enough without waiting on every replica
- compression.type=lz4 — fastest codec
- fetch.min.bytes=1 — immediate fetch
- Hot partition monitoring
- Direct memory I/O JVM tuning
- Poll loop optimization (no blocking during processing)
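A latency-first config sketch pulling these knobs together (values are illustrative; fetch.max.wait.ms is an extra knob not in the list above):

```java
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 0);              // send immediately
producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // fast codec
producerProps.put(ProducerConfig.ACKS_CONFIG, "all");               // keep durability; min.insync.replicas=2 on the topic

Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);        // don't wait to fill a fetch
consumerProps.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 10);     // cap broker-side wait
```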
Key metrics: Producer record-send-rate, batch-size-avg; Broker RequestHandlerAvgIdlePercent; Consumer records-lag-max (most important).
15. Pitfalls and Anti-Patterns
- Growing partitions later — breaks key hash mapping and order. Overprovision 3x up front.
- Ordering without key — round-robin loses order.
- Large messages (>1MB) — over the default limit; store the blob in S3 and put a reference in Kafka.
- Slow consumer causes rebalance storm — tune max.poll.interval.ms (default 5 min).
- Transactions as silver bullet — 10-30% overhead; skip for notifications.
- Topic explosion — 10k+ topics strain metadata; consolidate with header/type fields.
- No retention config — default 7 days; use compacted topics for latest-only.
- No Schema Registry — breaking changes slip through; regret within 6 months.
16. Checklist of 12
- Over-provision partitions
- Clear key strategy
- acks=all + min.insync.replicas=2
- Idempotent Producer on
- Schema Registry in place
- Transactional Outbox
- DLQ topic ready
- Consumer lag monitoring
- lz4 or zstd compression
- KRaft mode
- Minimize rebalance (Static, Cooperative)
- Right stream processor per need
Next Post Preview — Redis Internals and Distributed Cache Strategy
If Kafka is the event bus, Redis is the realtime data store and compute engine. Next post covers single-thread idea, data structures (String, List, Set, Hash, Sorted Set, Stream, HyperLogLog, Bitmap), Redis Cluster with Hash Slot and Redirection, Sentinel HA, RDB vs AOF, Cache-Aside/Write-Through/Write-Behind, Thundering Herd mitigations, Redlock debate, Valkey fork (2024), Dragonfly/KeyDB multithread alternatives, and the 2024 license upheaval.
"Redis is a data structure server. Cache is just one use case." — Salvatore Sanfilippo (antirez)