Kafka and Event-Driven Architecture Deep Dive — Partition/ISR, Exactly-Once, Outbox, Schema Registry, Flink
Author: Youngju Kim (@fjvbn20031)
Introduction — The Emptiness of "We Use Kafka"
Common conference refrain: "We use Kafka for event-driven architecture." Drill deeper:
- "Why that partition count?" → "Defaults..."
- "Exactly-Once guaranteed?" → "Maybe..."
- "Consumer drops a message?" → "...?"
Kafka is easy to write, hard to operate. This post covers internal structure (partitions, Leader, ISR, Log Segment), Producer/Consumer internals, EOS truth, Transactional Outbox, Event Sourcing vs CDC vs Event Notification, Schema Registry, Kafka Streams vs Flink vs ksqlDB, DLQ, modern alternatives (Pulsar, Redpanda), and 100ms SLA techniques.
1. Kafka's Birth — LinkedIn 2010
In 2010 LinkedIn had 500+ ETL pipelines point-to-point entangled. Profile service to search indexer, profile to warehouse, activity log to recommendations, activity log to Hadoop. Complexity exploded.
Jay Kreps' proposal: put every event into one unified log; consumers read at their own pace. That simple idea is Kafka. Named after Franz Kafka because "Kafka wrote a lot" — a write-optimized system. Open-sourced 2011, Confluent founded 2014 (Kreps, Narkhede, Rao). Now used by 80% of Fortune 100.
2. Kafka Components — 3-Minute Summary
Topic: category of events (e.g. user-clicks, order-created).
Partition: topic sliced into ordered logs. The unit of parallelism and ordering.
Topic: orders
├── Partition 0: [evt1] [evt2] [evt3] ...
├── Partition 1: [evt4] [evt5] ...
├── Partition 2: [evt6] [evt7] [evt8] ...
Broker: Kafka server; multiple form a Cluster. Producer/Consumer: write/read clients. Consumer Group: consumers sharing a Group ID; each partition assigned to exactly one consumer in the group.
Topic: orders (3 partitions)
Group: order-processor
├── Consumer A → Partition 0
├── Consumer B → Partition 1
└── Consumer C → Partition 2
More consumers than partitions: idle. Fewer: one consumer handles several partitions.
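A minimal sketch of one consumer joining that group with the Java client (broker address and names are placeholders, not prescriptions):

```java
// Every consumer created with the same group.id shares the partitions of "orders".
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("orders"));   // the group coordinator assigns partitions
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        records.forEach(r ->
            System.out.printf("p%d@%d: %s%n", r.partition(), r.offset(), r.value()));
    }
}
```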
3. Partition Internals — Log Segment Physical Truth
Append-only log: each partition is a disk file, append only at the end.
/var/lib/kafka/orders-0/
├── 00000000000000000000.log
├── 00000000000000000000.index
├── 00000000000000000000.timeindex
├── 00000000000054321099.log
├── 00000000000054321099.index
├── 00000000000054321099.timeindex
└── leader-epoch-checkpoint
Segment rolling: new segment every 1GB or 1 week (default). Enables file-level delete/compaction and loading only relevant time ranges.
Index files: .index maps offset to file position (sparse); .timeindex maps timestamp to offset. Offset lookup is O(log N).
Zero-copy speed: Linux sendfile() sends disk to NIC inside kernel. No userspace copy.
Traditional:
disk → kernel page cache → user buffer → kernel socket buffer → NIC
(4 copies)
Zero-copy:
disk → kernel page cache → NIC
(2 copies, minimal CPU)
A single broker can push multiple GB/s. Page cache: Kafka keeps little internal caching and relies on the Linux page cache; producer writes land in page cache and consumers usually read straight from memory. More RAM, more benefit.
4. Replication — Leader and ISR Politics
Each partition has N replicas; one Leader, rest Followers. Producer/Consumer talk only to Leader; Followers pull.
ISR (In-Sync Replicas): followers within replica.lag.time.max.ms (default 30s). Producer acks:
- acks=0 — no wait, max throughput, loss possible
- acks=1 — wait for Leader only
- acks=all — wait for all ISR
acks=all + min.insync.replicas=2 gives durability even if Leader dies.
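A hedged producer-side sketch of that combination (min.insync.replicas lives on the topic/broker, not the producer; broker address and names are placeholders):

```java
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.ACKS_CONFIG, "all");   // wait for every in-sync replica
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Topic side (set at creation or via kafka-configs): min.insync.replicas=2
```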
Leader Election: Controller Broker promotes an ISR member. Unclean Leader Election: if ISR empty, unclean.leader.election.enable=false (default) protects data over availability. Finance: always false.
KRaft: since 2022 Kafka uses Raft metadata, no ZooKeeper. Kafka 4.0 (2025) fully removes ZooKeeper.
5. Producer Internals — Batching and Compression
[App]
│ send(record)
▼
[Serializer] → [Partitioner] → [Accumulator (batching)]
│ batch.size reached or linger.ms expired
▼
[Sender Thread]
│ batch per broker
▼
[Broker Leader]
Key tuning:
- batch.size — max batch (default 16KB); larger = more throughput, more latency
- linger.ms — wait to fill batch (default 0); 5-20ms recommended
- compression.type — lz4 or zstd recommended
- max.in.flight.requests.per.connection — must be <=5 with idempotence
Partitioner: explicit partition, else hash(key) % num_partitions (same key same partition), else sticky partitioner.
Idempotent Producer: enable.idempotence=true prevents duplicate writes on retry via PID + sequence number.
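A throughput-oriented tuning sketch with the Java producer; the values are illustrative, not prescriptive:

```java
Properties props = new Properties();
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);              // 64KB batches (default 16KB)
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);                       // wait up to 10ms to fill a batch
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");             // cheap CPU, good ratio
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);            // PID + sequence number dedup
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);   // must stay <=5 with idempotence
```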
6. Consumer Internals — Offset Politics
Offset: sequence number within a partition. Consumer records "how far it has read".
Offset storage history: 2010-2014 ZooKeeper (overloaded); 2014+ __consumer_offsets internal topic.
Auto-commit (default): enable.auto.commit=true, auto.commit.interval.ms=5000. Risk: offsets can be committed before processing finishes (loss), or the consumer dies mid-batch before the next commit (reprocessing).
Manual commit:
for message in consumer:
    process(message)       # handle the record first
    consumer.commit()      # commit the offset only after processing succeeds
Delivery semantics: At-least-once (default, duplicates possible), At-most-once (loss possible), Exactly-once (complex). Most systems use at-least-once plus consumer idempotence.
Rebalance: when consumers join/leave, partitions reassigned; stop-the-world for seconds. Mitigations: Static Membership (2.3+), Cooperative Rebalancing (2.4+).
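A config sketch for both mitigations (the instance ID is a placeholder; each consumer instance needs a unique group.instance.id):

```java
Properties props = new Properties();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "consumer-a");      // Static Membership (2.3+)
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          CooperativeStickyAssignor.class.getName());                  // Cooperative Rebalancing (2.4+)
```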
7. Exactly-Once Semantics — Truth of the Legend
2017 Confluent announced EOS in Kafka 0.11. Two pillars:
1. Idempotent Producer — no duplicate writes from retries (PID + sequence).
2. Transactions — atomic writes across partitions + consumer offset commit in one transaction.
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("topic-a", key, value));
    producer.send(new ProducerRecord<>("topic-b", key, value));
    // commit the consumed offsets as part of the same transaction
    producer.sendOffsetsToTransaction(offsets, groupId);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();            // fatal: another producer with the same transactional.id took over
} catch (KafkaException e) {
    producer.abortTransaction(); // retriable: abort and reprocess the batch
}
Read Committed: isolation.level=read_committed skips aborted messages.
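On the consumer side, a one-line sketch:

```java
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");   // skip records from aborted transactions
```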
Limit: EOS only inside Kafka. Kafka to DB sync needs Transactional Outbox. Cost: 10-30% throughput drop; reserve for critical paths only.
8. Transactional Outbox — DB and Event Consistency
Problem:
def create_order(order):
    db.insert(order)             # 1. write the order row
    kafka.send("orders", order)  # 2. publish the event (can fail independently of step 1)
If 1 succeeds and 2 fails, the DB has the order but no event was published. Swapping the two steps is worse: an event goes out for an order that was never persisted.
Outbox pattern: write main record and event in same DB transaction.
BEGIN;
INSERT INTO orders (...) VALUES (...);
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('order.created', ..., NOW());
COMMIT;
Separate relay process publishes outbox to Kafka.
[App] ──► [DB: orders + outbox] ──► [Relay/Debezium] ──► [Kafka]
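A minimal polling-relay sketch, assuming a hypothetical outbox table with id / payload / published columns (table, column, and topic names are illustrative; dataSource and producer are assumed to exist, and the enclosing method declares throws Exception). It stays at-least-once: a crash between send and update replays the row.

```java
// Poll unpublished rows, publish each one, then mark it as sent.
try (Connection conn = dataSource.getConnection();
     PreparedStatement select = conn.prepareStatement(
         "SELECT id, payload FROM outbox WHERE published = false ORDER BY id LIMIT 100")) {
    ResultSet rs = select.executeQuery();
    while (rs.next()) {
        long id = rs.getLong("id");
        // block until the broker acknowledges, so a row is marked sent only after a successful publish
        producer.send(new ProducerRecord<>("orders", String.valueOf(id), rs.getString("payload"))).get();
        try (PreparedStatement update = conn.prepareStatement(
                 "UPDATE outbox SET published = true WHERE id = ?")) {
            update.setLong(1, id);
            update.executeUpdate();
        }
    }
}
```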
Debezium: reads DB binlog/WAL and publishes to Kafka automatically. Consistent via DB transaction; data survives Kafka outage; replay possible. Pitfalls: outbox grows (needs cleanup), relay is at-least-once (consumers must be idempotent).
9. Event Sourcing vs CDC vs Event Notification
Event Notification: minimal payload, "something happened". Consumer queries API for detail.
{ "type": "order.created", "id": 42 }
Low coupling but requires follow-up calls.
Event-Carried State Transfer: full state in event. No extra calls but larger payload, schema changes propagate.
{
"type": "order.created",
"id": 42,
"items": [],
"total": 99.99,
"customer": {}
}
Event Sourcing: store events not state; current state reconstructed by replay.
OrderCreated → ItemAdded → ItemAdded → DiscountApplied → OrderPaid
Full audit, time travel; requires CQRS, hard queries, schema evolution pain.
CDC: DB changes as events.
{
"op": "u",
"before": { "status": "pending" },
"after": { "status": "paid" }
}
No app code change but DB schema becomes the contract. Real systems mix all three: Outbox for core business, CDC for indexing/analytics, Event Notification for alerts.
10. Schema Registry — Contract Guardian
Problem: the producer changes a schema and a consumer breaks weeks later. Schema Registry (Confluent) is a central schema store plus a compatibility gate.
Flow: producer registers schema, registry checks compatibility with existing versions, rejects breaking changes; message carries schema ID only.
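A hedged producer config sketch with Confluent's Avro serializer (assumes the kafka-avro-serializer dependency; the registry URL is a placeholder):

```java
Properties props = new Properties();
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://schema-registry:8081");
// The serializer registers/looks up the schema and embeds only the schema ID in each message.
```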
Compatibility modes:
- BACKWARD — consumers on the new schema can still read old data (delete fields, add fields with defaults)
- FORWARD — consumers on the old schema can read new data (add fields, delete optional fields)
- FULL — both directions
- NONE — no checks
Upgrade order: with BACKWARD, upgrade consumers first; with FORWARD, producers can deploy first.
| Item | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Evolution | Excellent | Good | Limited |
| Payload size | Small | Smallest | Large |
| Readability | Low | Low | High |
| Ecosystem | Hadoop/Kafka | gRPC | Web |
| Status in Kafka ecosystem | Preferred | Rising | Legacy |
2025 trend: Protobuf (gRPC ecosystem, size/speed).
11. Stream Processing — Kafka Streams vs Flink vs ksqlDB
Kafka Streams (embedded library): no separate cluster.
stream.filter(...)
      .map(...)
      .groupByKey()
      .count()
      .toStream()            // count() returns a KTable; convert back to a stream before writing
      .to("output-topic");
Simple, JVM-only, weak at large state.
Apache Flink (separate cluster): stream-processing king. True streaming, exactly-once checkpointing, large state (TB scale), excellent event-time handling. Steeper learning curve.
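A minimal Flink DataStream sketch, assuming the flink-connector-kafka dependency; topic, group, and broker names are placeholders and the filter is placeholder logic:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10_000);   // exactly-once checkpoints every 10s

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics("orders")
        .setGroupId("flink-orders")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source")
   .filter(value -> value.contains("order.created"))
   .print();

env.execute("orders-filter");
```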
ksqlDB: SQL over Kafka.
CREATE STREAM orders_large AS
SELECT * FROM orders WHERE amount > 1000;
Fast prototyping; weaker than Flink for complex logic.
Guide: JVM + simple + Kafka-centric = Streams; complex/large/polyglot = Flink; quick SQL = ksqlDB. Flink adoption surged the last 2-3 years (AWS Managed Flink, Confluent Flink).
12. Dead Letter Queue and Reprocessing
Processing exception options:
- Retry forever — poison pill halts everything
- Skip and log — data loss
- DLQ — park and handle separately
[topic: orders] ──► [Consumer] ──► on failure ──► [topic: orders-dlq]
│
▼
[DLQ Processor] — manual/auto replay
Strategies: alert only, auto retry with TTL, fix-then-replay, kill switch by condition. Spring Kafka (DeadLetterPublishingRecoverer) and Spring Cloud Stream have built-in DLQ support.
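A hedged Spring Kafka sketch of that recoverer (kafkaTemplate and listenerContainerFactory are assumed to be existing beans; failed records land on <topic>.DLT by default):

```java
// Retry a failing record 3 times, 1s apart, then publish it to the dead-letter topic.
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
DefaultErrorHandler errorHandler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
listenerContainerFactory.setCommonErrorHandler(errorHandler);
```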
13. Pulsar, Redpanda — Kafka Alternatives
Apache Pulsar (Yahoo 2016): separated broker (stateless) and storage (BookKeeper). Multi-tenant native, geo-replication default, Kafka protocol compatible. Independent scaling; operational complexity (3 layers).
Redpanda (2020): C++ rewrite, no JVM. Kafka protocol compatible drop-in. Raft without ZooKeeper, single binary. Claims up to 10x performance on the same hardware, p99 latency in the low milliseconds, simple ops. Ecosystem smaller than Kafka's.
WarpStream (2023): S3-based, no local disk. Huge storage cost savings; latency around 200ms because of S3. A fit for streaming analytics where sub-second latency is not required.
Selection: standard/ecosystem = Kafka; simple ops + performance = Redpanda; multi-tenant large = Pulsar; lowest storage cost = WarpStream.
14. 100ms SLA Techniques
Targets: Producer to Broker under 5ms; Broker disk flush under 10ms; Broker to Consumer under 5ms; processing under 50ms; total under 100ms.
Techniques (a config sketch follows this list):
- linger.ms=0 — send immediately (trades away batching throughput)
- SSD + large page cache
- min.insync.replicas=2 — durable enough without waiting on every replica
- compression.type=lz4 — fastest codec
- fetch.min.bytes=1 — immediate fetch
- Hot partition monitoring
- Direct memory I/O JVM tuning
- Poll loop optimization (no blocking during processing)
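A latency-first config sketch pulling these knobs together (values are illustrative; fetch.max.wait.ms is an extra knob not in the list above):

```java
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 0);              // send immediately
producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // fast codec
producerProps.put(ProducerConfig.ACKS_CONFIG, "all");               // keep durability; min.insync.replicas=2 on the topic

Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);        // don't wait to fill a fetch
consumerProps.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 10);     // cap broker-side wait
```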
Key metrics: Producer record-send-rate, batch-size-avg; Broker RequestHandlerAvgIdlePercent; Consumer records-lag-max (most important).
15. Pitfalls and Anti-Patterns
- Growing partitions later — breaks key hash mapping and order. Overprovision 3x up front.
- Ordering without key — round-robin loses order.
- Large messages (>1MB) — over the default limit; store the blob in S3 and put a reference in Kafka.
- Slow consumer causes rebalance storm — tune max.poll.interval.ms (default 5 min).
- Transactions as silver bullet — 10-30% overhead; skip for notifications.
- Topic explosion — 10k+ topics strain metadata; consolidate with header/type fields.
- No retention config — default 7 days; use compacted topics for latest-only.
- No Schema Registry — breaking changes slip through; regret within 6 months.
16. Checklist of 12
- Over-provision partitions
- Clear key strategy
- acks=all + min.insync.replicas=2
- Idempotent Producer on
- Schema Registry in place
- Transactional Outbox
- DLQ topic ready
- Consumer lag monitoring
- lz4 or zstd compression
- KRaft mode
- Minimize rebalance (Static, Cooperative)
- Right stream processor per need
Next Post Preview — Redis Internals and Distributed Cache Strategy
If Kafka is the event bus, Redis is the realtime data store and compute engine. Next post covers single-thread idea, data structures (String, List, Set, Hash, Sorted Set, Stream, HyperLogLog, Bitmap), Redis Cluster with Hash Slot and Redirection, Sentinel HA, RDB vs AOF, Cache-Aside/Write-Through/Write-Behind, Thundering Herd mitigations, Redlock debate, Valkey fork (2024), Dragonfly/KeyDB multithread alternatives, and the 2024 license upheaval.
"Redis is a data structure server. Cache is just one use case." — Salvatore Sanfilippo (antirez)