Toss Bank Data Engineer (Kafka & Streaming) Study Guide: Tech Stack, Interview Prep, and 6-Month Roadmap

Introduction: Why Toss Bank's Real-Time Data Team Matters

Toss Bank is not just another Korean fintech company. It is a mobile-first bank that processes millions of transactions daily, each of which needs to be captured, transformed, and delivered in near real-time. The Real-Time Data team sits at the center of this architecture. They operate the Kafka infrastructure that every other team depends on, build the streaming pipelines that feed fraud detection models, and maintain the CDC (Change Data Capture) systems that keep dozens of microservices in sync.

When Toss Bank posts a Data Engineer (Kafka and Streaming) position, they are looking for someone who can operate Kafka at a scale that most companies never reach. This is not a "set up a three-node cluster and call it a day" role. This is a role where you will manage hundreds of brokers, handle cross-datacenter replication for financial data that cannot be lost, and build stream processing pipelines that must produce exactly-once results in a regulated banking environment.

This guide breaks down the job description line by line, maps each requirement to specific technologies and study resources, and gives you a realistic 6-month plan to prepare. Whether you are a backend engineer looking to specialize in data infrastructure, or an existing data engineer who wants to level up your Kafka expertise, this guide will show you exactly what to study and in what order.


1. JD Deep Analysis: What Toss Bank Actually Wants

Let us dissect the job description section by section. Understanding what each bullet point really means will help you prioritize your study time.

1.1 Core Responsibilities

"Operate and optimize the Kafka Broker cluster that handles the entire company's event streaming"

This is the headline responsibility. You are not a consumer of Kafka — you are the person who keeps it running. That means:

  • Capacity planning for brokers, partitions, and topics
  • Performance tuning (OS-level, JVM, and Kafka configuration)
  • Monitoring with tools like Kafka Manager, Burrow, or custom Prometheus exporters
  • Rolling upgrades with zero downtime
  • Incident response when a broker goes down at 3 AM

"Develop and maintain the Spring Boot-based Kafka Client SDK used company-wide"

Toss Bank builds internal libraries that standardize how every team produces and consumes Kafka messages. This means you need to be comfortable with:

  • Spring Boot auto-configuration and starter modules
  • Kafka Producer and Consumer APIs at a low level
  • Schema management with Avro or Protobuf and a Schema Registry
  • Error handling patterns: dead letter queues, retry topics, circuit breakers
  • Backward compatibility when evolving the SDK

"Design and implement Active-Active replication across data centers"

Financial regulations require that Toss Bank can survive an entire data center going offline. Active-Active Kafka replication means:

  • MirrorMaker 2 or Confluent Replicator for cross-cluster replication
  • Offset translation between clusters
  • Conflict resolution strategies for dual-write scenarios
  • Failover and failback procedures that meet RTO/RPO requirements

"Build CDC pipelines using Debezium to capture database changes"

CDC is how microservices stay in sync without tight coupling. Debezium captures row-level changes from databases (MySQL, PostgreSQL) and publishes them to Kafka topics. You need to understand:

  • Debezium connectors for different databases
  • Kafka Connect architecture and its distributed mode
  • Schema evolution when database schemas change
  • Handling initial snapshots for large tables
  • Exactly-once delivery semantics end to end

"Develop stream processing applications using Flink"

Flink is the team's choice for stateful stream processing. This goes beyond simple transformations:

  • Windowing (tumbling, sliding, session windows)
  • State management with RocksDB backends
  • Checkpointing and savepoints for fault tolerance
  • Event time vs processing time semantics
  • Complex event processing (CEP) for fraud detection patterns

"Manage ClickHouse for real-time analytics dashboards"

ClickHouse is an OLAP database optimized for fast analytical queries on large datasets. The team uses it for:

  • Real-time dashboards showing transaction volumes, latency percentiles, and error rates
  • Ad-hoc queries on streaming data that has been materialized
  • Data retention and tiered storage strategies

1.2 Required Qualifications

"3+ years of experience operating Kafka in production"

This is not negotiable. They want someone who has debugged ISR (In-Sync Replica) shrink issues, dealt with unbalanced partition leaders, and understands why unclean.leader.election.enable defaults to false. Book knowledge alone will not cut it.

"Proficiency in Java or Kotlin with Spring Boot"

The internal SDK is built on Spring Boot. You need to write production-quality Java or Kotlin code, not just scripts. Understanding Spring's dependency injection, AOP for logging and metrics, and testing with MockBean are all expected.

"Understanding of distributed systems fundamentals"

This means you can discuss CAP theorem trade-offs, leader election algorithms, consensus protocols, and why exactly-once semantics are hard in distributed systems. Expect whiteboard questions on these topics.

"Experience with Linux system administration"

Kafka runs on Linux. You need to be comfortable with:

  • File system tuning (XFS, page cache, disk I/O scheduling)
  • Network tuning (TCP buffer sizes, socket settings)
  • JVM garbage collection tuning (G1GC, ZGC)
  • Monitoring with tools like top, iostat, sar, and perf

1.3 Preferred Qualifications

"Experience with Flink or Spark Structured Streaming"

Having hands-on experience with at least one stream processing framework puts you ahead. Flink is preferred, but Spark Structured Streaming knowledge transfers well.

"Familiarity with Kubernetes-based deployments"

Toss runs much of its infrastructure on Kubernetes. Understanding StatefulSets, persistent volumes, and the challenges of running stateful systems like Kafka on K8s is valuable.

"Contributions to open-source projects"

Toss Bank actively contributes to open source. Having a track record of contributions — even small ones — demonstrates that you understand collaborative software development at scale.


2. Tech Stack Deep Dive

2.1 Apache Kafka: The Foundation

Kafka is the central nervous system of Toss Bank's data infrastructure. Here is what you need to know beyond the basics.

Broker Internals

A Kafka broker stores messages in segments on disk. Understanding the log structure is fundamental:

topic-partition/
  00000000000000000000.log    # First segment
  00000000000000000000.index  # Offset index
  00000000000000000000.timeindex  # Time-based index
  00000000000000065536.log    # Second segment (after roll)
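
The file names encode the base offset of each segment's first record, zero-padded to 20 digits; a minimal sketch of the scheme:

```java
public class SegmentNaming {
    // Kafka names each segment file after the base offset of its first
    // record, zero-padded to 20 digits (e.g. 00000000000000065536.log).
    static String segmentFileName(long baseOffset, String suffix) {
        return String.format("%020d", baseOffset) + suffix;
    }

    public static void main(String[] args) {
        System.out.println(segmentFileName(0L, ".log"));     // first segment
        System.out.println(segmentFileName(65536L, ".log")); // segment after roll
    }
}
```

Knowing this makes it easy to map an offset in a consumer error back to the segment file on disk that contains it.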

Key configuration parameters you must understand:

# Replication
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Performance
num.io.threads=8
num.network.threads=3
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400

# Log management
log.segment.bytes=1073741824
log.retention.hours=168
log.cleanup.policy=delete

Producer Tuning

The producer configuration determines throughput, latency, and durability guarantees:

Properties props = new Properties();
props.put("acks", "all");                    // Wait for all ISR to acknowledge
props.put("retries", Integer.MAX_VALUE);     // Retry indefinitely
props.put("max.in.flight.requests.per.connection", 5);
props.put("enable.idempotence", true);       // Exactly-once per partition
props.put("batch.size", 16384);              // Batch size in bytes
props.put("linger.ms", 5);                   // Wait up to 5ms to fill batch
props.put("compression.type", "lz4");        // Compress batches

Consumer Group Rebalancing

Consumer group rebalancing is one of the most common operational headaches. Understanding the protocols is critical:

  • Eager rebalancing: All consumers stop, reassign, resume. Simple but causes stop-the-world pauses.
  • Cooperative rebalancing (Incremental): Only affected partitions are revoked and reassigned. Introduced in Kafka 2.4+.
  • Static group membership: Consumers with group.instance.id avoid rebalancing on transient failures.

// Cooperative rebalancing configuration
props.put("partition.assignment.strategy",
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
props.put("group.instance.id", "consumer-host-1");
props.put("session.timeout.ms", 30000);
props.put("heartbeat.interval.ms", 10000);

2.2 KRaft Mode: Kafka Without ZooKeeper

Kafka has been migrating away from ZooKeeper dependency since KIP-500. KRaft (Kafka Raft) mode replaces ZooKeeper with an internal Raft-based consensus protocol.

Why this matters for Toss Bank:

  • Reduced operational complexity (no separate ZooKeeper ensemble to manage)
  • Faster controller failover (seconds instead of tens of seconds)
  • Better scalability for metadata (millions of partitions)
  • Simplified deployment on Kubernetes

Key KRaft concepts:

# KRaft controller configuration
process.roles=controller
node.id=1
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
controller.listener.names=CONTROLLER

You should be able to discuss the migration path from ZooKeeper to KRaft mode, including the dual-write phase and the cutover process.
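
To complement the controller example above, a broker node pointing at that quorum can be configured as follows (node IDs, host names, and listener values are illustrative):

```
# KRaft broker configuration (illustrative values)
process.roles=broker
node.id=4
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://broker4:9092
```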

2.3 Spring Boot Kafka Client SDK

Building an internal SDK is a software engineering challenge as much as a Kafka challenge. Here is what a production-quality SDK looks like:

@Configuration
@EnableKafka
public class KafkaProducerConfig {

    @Bean
    public ProducerFactory<String, GenericRecord> producerFactory(
            KafkaProperties properties,
            // Property key is illustrative; wire in your registry URL
            @Value("${kafka.schema-registry.url}") String schemaRegistryUrl) {
        Map<String, Object> config = properties.buildProducerProperties();
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            KafkaAvroSerializer.class);
        config.put("schema.registry.url", schemaRegistryUrl);
        return new DefaultKafkaProducerFactory<>(config);
    }

    @Bean
    public KafkaTemplate<String, GenericRecord> kafkaTemplate(
            ProducerFactory<String, GenericRecord> factory) {
        KafkaTemplate<String, GenericRecord> template =
            new KafkaTemplate<>(factory);
        template.setObservationEnabled(true); // Micrometer tracing
        return template;
    }
}

Dead Letter Queue Pattern

When a consumer cannot process a message after retries, it should be routed to a dead letter topic:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String>
        kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
        new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory());
    factory.setCommonErrorHandler(new DefaultErrorHandler(
        new DeadLetterPublishingRecoverer(kafkaTemplate),
        new FixedBackOff(1000L, 3)  // 3 retries, 1 second apart
    ));
    return factory;
}

2.4 Active-Active Kafka Replication

Active-Active replication is one of the hardest problems in distributed Kafka deployments. Here are the key patterns:

MirrorMaker 2 Architecture

MirrorMaker 2 (MM2) is built on Kafka Connect and provides:

  • Topic replication with configurable topic name remapping
  • Consumer group offset synchronization
  • Automatic topic configuration sync (partitions, configs)
  • Heartbeat topics for monitoring replication lag

# MirrorMaker 2 configuration
clusters = dc1, dc2
dc1.bootstrap.servers = dc1-kafka1:9092,dc1-kafka2:9092
dc2.bootstrap.servers = dc2-kafka1:9092,dc2-kafka2:9092

# Bidirectional replication
dc1->dc2.enabled = true
dc2->dc1.enabled = true

# Topic filtering
dc1->dc2.topics = transactions.*, user-events.*
dc2->dc1.topics = transactions.*, user-events.*

# The default policy prefixes remote topics with the source cluster alias
# (e.g. dc1.transactions), which prevents replication loops
replication.policy.class = org.apache.kafka.connect.mirror.DefaultReplicationPolicy

Conflict Resolution

In Active-Active setups, the same topic can receive writes in both data centers. Strategies include:

  • Last-writer-wins with timestamps: Simple but can lose data
  • Application-level conflict resolution: The consumer merges conflicting records
  • Region-based partitioning: Each DC writes to different partitions, avoiding conflicts entirely
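
The last-writer-wins strategy can be sketched as a deterministic merge over (value, timestamp, origin DC) tuples. The class and tie-break rule below are an illustrative sketch, not Toss Bank's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Last-writer-wins merge for records replicated from two data centers.
// Each record carries an event timestamp; ties are broken by a fixed
// DC ordering so that both sides converge on the same value.
public class LwwMerge {
    record VersionedValue(String value, long timestampMs, String dc) {}

    static final Map<String, VersionedValue> store = new HashMap<>();

    static void apply(String key, VersionedValue incoming) {
        store.merge(key, incoming, (current, next) -> {
            if (next.timestampMs() != current.timestampMs()) {
                return next.timestampMs() > current.timestampMs() ? next : current;
            }
            // Deterministic tie-break on equal timestamps
            return next.dc().compareTo(current.dc()) < 0 ? next : current;
        });
    }

    public static void main(String[] args) {
        apply("acct-1", new VersionedValue("limit=100", 1000L, "dc1"));
        apply("acct-1", new VersionedValue("limit=200", 2000L, "dc2"));
        apply("acct-1", new VersionedValue("limit=150", 1500L, "dc1")); // stale, ignored
        System.out.println(store.get("acct-1").value()); // prints limit=200
    }
}
```

The deterministic tie-break matters: without it, the two data centers could converge on different values for records that carry equal timestamps.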

2.5 CDC with Debezium

Debezium captures database changes by reading the database's transaction log (binlog for MySQL, WAL for PostgreSQL).

{
  "name": "postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "tossbank",
    "topic.prefix": "cdc",
    "table.include.list": "public.accounts,public.transactions",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_pub",
    "snapshot.mode": "initial",
    "transforms": "route",
    "transforms.route.type": "io.debezium.transforms.ByLogicalTableRouter",
    "transforms.route.topic.regex": "(.*)\\.(.*)",
    "transforms.route.topic.replacement": "cdc.$2"
  }
}

Critical Debezium Concepts

  • Snapshot modes: initial (full snapshot then stream), schema_only (structure only, stream from current position), never (stream only)
  • Outbox pattern: Instead of CDC on business tables, applications write events to an outbox table, and Debezium captures those events. This decouples the event schema from the database schema.
  • Schema evolution: When a database column is added or modified, Debezium reflects the change in the Kafka message schema. Schema Registry compatibility modes (BACKWARD, FORWARD, FULL) determine whether consumers can handle the change.

2.6 Flink: Stateful Stream Processing

Flink is the stream processing engine of choice for stateful computations on Kafka streams.

Flink Application Structure

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Enable exactly-once checkpointing
env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointTimeout(120000);

// Kafka source
KafkaSource<Transaction> source = KafkaSource.<Transaction>builder()
    .setBootstrapServers("kafka:9092")
    .setTopics("transactions")
    .setGroupId("flink-processor")
    .setStartingOffsets(OffsetsInitializer.committedOffsets())
    .setDeserializer(new TransactionDeserializer())
    .build();

DataStream<Transaction> transactions = env.fromSource(
    source,
    WatermarkStrategy.<Transaction>forBoundedOutOfOrderness(
        Duration.ofSeconds(5))
        .withTimestampAssigner((tx, ts) -> tx.getTimestamp()),
    "Kafka Source"
);

// Windowed aggregation: transaction count per account per minute
transactions
    .keyBy(Transaction::getAccountId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new TransactionCountAggregate())
    .addSink(new ClickHouseSink());

env.execute("Transaction Aggregation");

State Management

Flink maintains state in state backends. For production workloads:

  • HashMapStateBackend: Fast, stores state in JVM heap. Good for small state.
  • EmbeddedRocksDBStateBackend: Stores state on local disk via RocksDB. Required for large state that exceeds memory.

env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
env.getCheckpointConfig().setCheckpointStorage(
    "s3://flink-checkpoints/job-1");

2.7 ClickHouse: Real-Time Analytics

ClickHouse is a columnar OLAP database that excels at aggregation queries on large datasets.

Table Design for Streaming Data

CREATE TABLE transactions_realtime (
    transaction_id UUID,
    account_id UInt64,
    amount Decimal64(2),
    currency LowCardinality(String),
    transaction_type LowCardinality(String),
    status LowCardinality(String),
    created_at DateTime64(3, 'Asia/Seoul'),
    region LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(created_at)
ORDER BY (account_id, created_at)
TTL created_at + INTERVAL 90 DAY;

Kafka Integration

ClickHouse can consume directly from Kafka using the Kafka engine:

CREATE TABLE transactions_kafka (
    transaction_id UUID,
    account_id UInt64,
    amount Decimal64(2),
    created_at DateTime64(3)
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka1:9092,kafka2:9092',
    kafka_topic_list = 'transactions',
    kafka_group_name = 'clickhouse-consumer',
    kafka_format = 'JSONEachRow';

CREATE MATERIALIZED VIEW transactions_mv TO transactions_realtime AS
SELECT * FROM transactions_kafka;

3. Interview Preparation: 30 Questions

3.1 Kafka Fundamentals (Questions 1-10)

Q1. Explain how Kafka achieves high throughput despite writing to disk.

Kafka uses sequential I/O, the OS page cache, and zero-copy transfers (sendfile system call). Sequential writes to disk can achieve throughputs exceeding 600 MB/s on modern SSDs, rivaling network throughput. The OS page cache means frequently accessed data is served from memory without Kafka needing to manage its own cache.

Q2. What happens when a Kafka broker in the ISR set falls behind?

When a replica falls behind the leader by more than replica.lag.time.max.ms (default 30 seconds), it is removed from the ISR set. If min.insync.replicas is set to 2 and only 1 replica remains in sync, producers with acks=all will receive NotEnoughReplicasException. The lagging replica will rejoin the ISR once it catches up.

Q3. Compare eager and cooperative consumer rebalancing protocols.

Eager rebalancing revokes all partitions from all consumers, then reassigns everything. This causes a stop-the-world pause where no consumer processes messages. Cooperative rebalancing only revokes the partitions that need to move, allowing other consumers to continue processing. Cooperative rebalancing is always preferred in production.

Q4. How does Kafka's idempotent producer work?

When enable.idempotence=true, each producer instance gets a unique Producer ID (PID). Every message batch includes a sequence number. The broker deduplicates messages by tracking the last sequence number per PID and partition. If a retry sends the same batch, the broker recognizes the duplicate and acknowledges without writing again.
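
The broker-side bookkeeping can be sketched in a few lines. This is a deliberately simplified model (the real broker also enforces that sequences arrive strictly in order and raises OutOfOrderSequenceException otherwise):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of idempotent-producer deduplication: the broker
// remembers the last sequence number seen per (producerId, partition)
// and acknowledges, without writing, any batch it has already appended.
public class SequenceDedup {
    private final Map<String, Integer> lastSeq = new HashMap<>();

    /** Returns true if the batch is appended, false if it is a duplicate. */
    public boolean append(long producerId, int partition, int sequence) {
        String key = producerId + "-" + partition;
        Integer last = lastSeq.get(key);
        if (last != null && sequence <= last) {
            return false; // retry of an already-written batch
        }
        lastSeq.put(key, sequence);
        return true;
    }
}
```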

Q5. What is the difference between log.retention.hours and log.retention.bytes?

log.retention.hours deletes segments older than the specified time. log.retention.bytes deletes the oldest segments when the total partition size exceeds the limit. When both are set, whichever threshold is reached first triggers deletion. For compacted topics, log.cleanup.policy=compact replaces deletion with key-based compaction.

Q6. Explain Kafka transactions and exactly-once semantics.

Kafka transactions allow a producer to atomically write to multiple partitions. The producer calls beginTransaction(), sends messages, and calls commitTransaction(). A transaction coordinator on the broker tracks transaction state. Consumers with isolation.level=read_committed only see committed messages. Combined with idempotent producers, this achieves exactly-once semantics within Kafka.

Q7. How would you monitor Kafka cluster health?

Key metrics to track: UnderReplicatedPartitions (should be 0), ActiveControllerCount (exactly 1), OfflinePartitionsCount (should be 0), RequestHandlerAvgIdlePercent (above 0.3), NetworkProcessorAvgIdlePercent (above 0.3), BytesInPerSec/BytesOutPerSec per broker, consumer group lag per partition. Tools: JMX exporters to Prometheus, Grafana dashboards, Burrow for consumer lag, custom alerting on critical metrics.

Q8. What is the purpose of unclean.leader.election.enable and why should it be false for financial data?

When set to true, a replica that is not in the ISR can become the leader if all ISR replicas are down. This prevents unavailability but risks data loss because the new leader may be missing recently committed messages. For financial data where data loss is unacceptable, this must be false, accepting temporary unavailability over data inconsistency.

Q9. How does log compaction work and when would you use it?

Log compaction retains only the latest value for each key. A background thread (the log cleaner) scans segments and removes older records with the same key. Use compaction for topics that represent the current state of an entity (user profiles, account balances) rather than an event stream. Configuration: log.cleanup.policy=compact, min.cleanable.dirty.ratio, log.cleaner.min.compaction.lag.ms.

Q10. Explain the impact of partition count on performance and ordering.

More partitions increase parallelism (more consumers can read concurrently) but also increase: broker memory usage (each partition has buffers), end-to-end latency (more replication work), leader election time, and the number of file handles. Ordering is guaranteed only within a single partition. For ordered event processing, use a consistent partition key (e.g., account ID) and configure enough partitions for throughput without over-partitioning.
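
The ordering guarantee rests on a deterministic key-to-partition mapping. Kafka's default partitioner applies murmur2 to the serialized key; the sketch below substitutes String.hashCode purely to illustrate the invariant:

```java
// Illustrative partitioner. Kafka's default uses murmur2 on the key
// bytes; String.hashCode stands in here to show the property that
// matters for ordering: the same key always lands on the same partition.
public class KeyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("account-42", 12);
        int p2 = partitionFor("account-42", 12);
        System.out.println(p1 == p2); // prints true
    }
}
```

Note the corollary: increasing the partition count changes the mapping for existing keys, which is why re-partitioning a topic breaks key-based ordering across the change.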

3.2 Spring Boot and SDK Design (Questions 11-15)

Q11. How would you design a Kafka Client SDK that handles schema evolution?

Use Apache Avro with Confluent Schema Registry. The SDK registers schemas on first use and caches them. Configure BACKWARD compatibility so new consumers can read old data. The SDK wraps KafkaAvroSerializer and KafkaAvroDeserializer, handling serialization transparently. When a schema evolves, the registry validates compatibility, and consumers use schema resolution to read old formats.

Q12. What retry patterns would you build into the SDK?

Implement exponential backoff with jitter for transient failures. Use Spring Retry's RetryTemplate or the built-in Kafka consumer retry mechanism. For messages that fail after maximum retries, route to a dead letter topic with the original message, error details, and retry count as headers. Provide a dead letter processor that allows teams to inspect and replay failed messages.
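
A sketch of the delay computation using full jitter (the delay is drawn uniformly from zero up to the capped exponential bound); method names and defaults are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff with full jitter: cap(base * 2^attempt), then a
// uniformly random delay in [0, cap]. Jitter prevents retry storms where
// many consumers hammer a recovering broker at the same instant.
public class Backoff {
    static long delayMs(int attempt, long baseMs, long maxMs) {
        long bound = Math.min(maxMs, baseMs * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(bound + 1);
    }
}
```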

Q13. How do you handle graceful shutdown in a Spring Boot Kafka consumer?

Register a shutdown hook that calls consumer.wakeup() to interrupt the poll loop. The WakeupException breaks the consumer out of poll(), and the finally block calls consumer.close() which commits offsets and leaves the consumer group cleanly. In Spring Boot, the ConcurrentKafkaListenerContainerFactory handles this when you set containerProperties.setShutdownTimeout().

Q14. How would you add distributed tracing to Kafka messages?

Inject trace context (trace ID, span ID) into Kafka message headers on the producer side. On the consumer side, extract the context from headers and create a new span linked to the producer span. With Spring Cloud Sleuth or Micrometer Tracing, this is mostly automatic when using KafkaTemplate and @KafkaListener with observation enabled. For Flink consumers, you need custom header extraction.
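
The inject/extract round trip can be illustrated with plain maps standing in for Kafka headers, using the W3C traceparent format (version-traceId-spanId-flags); a real SDK would delegate this to Micrometer Tracing's propagator rather than hand-rolling it:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of trace-context propagation through message headers.
public class TracePropagation {
    // Producer side: write the W3C traceparent header
    static void inject(String traceId, String spanId, Map<String, String> headers) {
        headers.put("traceparent", "00-" + traceId + "-" + spanId + "-01");
    }

    // Consumer side: recover the trace ID to link the consumer span
    static String extractTraceId(Map<String, String> headers) {
        String tp = headers.get("traceparent");
        return tp == null ? null : tp.split("-")[1];
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", headers);
        System.out.println(extractTraceId(headers)); // prints the injected trace ID
    }
}
```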

Q15. Explain how you would version the SDK without breaking existing consumers.

Follow semantic versioning. Use Spring Boot's auto-configuration to provide sensible defaults that existing consumers inherit. New features use opt-in configuration properties. When breaking changes are unavoidable, provide a migration guide and a compatibility module that maps old configuration keys to new ones. Run integration tests against both the current and previous SDK versions.

3.3 Distributed Systems (Questions 16-20)

Q16. Explain the CAP theorem and where Kafka fits.

CAP states that a distributed system can provide at most two of three guarantees: Consistency, Availability, Partition tolerance. Kafka prioritizes consistency and partition tolerance (CP) when configured with acks=all, min.insync.replicas=2, and unclean.leader.election.enable=false. It sacrifices availability — if not enough replicas are in sync, writes are rejected. With relaxed settings, Kafka can lean toward AP.

Q17. How does Raft consensus work in KRaft mode?

In KRaft mode, controller nodes elect a leader using the Raft protocol. The leader appends metadata changes to a replicated log. Followers replicate the log and apply changes. A leader is elected when it receives votes from a majority of the controller quorum. If the leader fails, a follower with the most up-to-date log initiates a new election. This is faster than the ZooKeeper-based controller because there is no external dependency.

Q18. What is the split-brain problem and how does Kafka prevent it?

Split-brain occurs when two nodes both believe they are the leader. In ZooKeeper mode, Kafka uses ephemeral nodes — when a broker loses its ZooKeeper session, its leadership is revoked. In KRaft mode, the Raft protocol's term-based voting prevents two leaders from existing simultaneously. The controller.quorum.voters configuration ensures a majority quorum is always required.

Q19. Explain backpressure in stream processing and how Flink handles it.

Backpressure occurs when a downstream operator cannot keep up with the upstream rate. Flink uses a credit-based flow control mechanism: each downstream task communicates how many network buffers it can accept (credits). When credits run out, the upstream task stops sending, and this pressure propagates back to the source. Monitoring backpressure in the Flink UI is critical for identifying bottleneck operators.

Q20. How would you design a system that guarantees exactly-once message processing across Kafka and a database?

Use the transactional outbox pattern. Instead of writing to both Kafka and the database, write only to the database (business data plus an outbox table) in a single transaction. Debezium captures the outbox table changes and publishes them to Kafka. This ensures that the message is published if and only if the transaction commits. On the consumer side, use idempotent writes (upserts with unique keys) to handle potential duplicates.

3.4 CDC and Data Pipeline (Questions 21-25)

Q21. Compare CDC approaches: log-based vs query-based vs trigger-based.

Log-based CDC (Debezium) reads the database transaction log directly. It captures all changes with low overhead and minimal impact on the source database. Query-based CDC polls the database for changes using timestamps or version columns. It misses deletes and has higher latency. Trigger-based CDC uses database triggers to capture changes. It is flexible but adds write overhead and complexity. Log-based is preferred for production systems.

Q22. How do you handle schema changes in a CDC pipeline?

When a column is added with a default value, Debezium registers the new message schema with Schema Registry. Under BACKWARD compatibility, consumers that have upgraded to the new schema can still read messages written with the old one (the missing field takes its default). Dropping an optional column is the FORWARD-compatible direction: records written with the new schema remain readable by consumers still on the old one. The key is to choose the Schema Registry compatibility mode deliberately and to test every schema change in a staging environment before production.

Q23. What happens when a Debezium connector falls behind the database WAL?

If the connector falls too far behind, the database may no longer have the log it needs: in PostgreSQL this happens when the replication slot is dropped or max_slot_wal_keep_size is exceeded, and the connector fails with a "requested WAL segment has already been removed"-style error. Recovery usually means re-snapshotting the affected tables (for example by temporarily changing the snapshot mode, or by triggering an incremental snapshot). Prevention: monitor connector and replication-slot lag, and ensure WAL retention is sufficient (wal_keep_size, or wal_keep_segments before PostgreSQL 13).

Q24. Explain the outbox pattern and its advantages over dual writes.

In a dual write, the application writes to both the database and Kafka. If one write fails, the system is inconsistent. The outbox pattern avoids this: the application writes business data and an outbox event in a single database transaction. Debezium captures the outbox event and publishes it to Kafka. Since both writes are in the same transaction, they are atomic. The outbox event schema can be different from the database schema, providing a clean API boundary.

Q25. How would you ensure CDC data quality in a financial system?

Implement a reconciliation pipeline that periodically compares source database state with the state derived from CDC events. Use checksums or row counts per table per time window. Alert on discrepancies. Additionally, add end-to-end watermarking: inject synthetic events at the source and verify they arrive at the sink within SLA. Track CDC lag as a key operational metric.
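
The per-window comparison reduces to bucketing event timestamps and diffing the buckets; a minimal sketch, with illustrative names:

```java
import java.util.Map;
import java.util.TreeMap;

// Reconciliation sketch: bucket record counts per minute on both the
// source database and the CDC-derived sink, then compare the buckets
// to find windows that need investigation.
public class Reconciler {
    static Map<Long, Long> countsPerMinute(long[] eventTimestampsMs) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : eventTimestampsMs) {
            counts.merge(ts / 60_000, 1L, Long::sum); // minute bucket -> count
        }
        return counts;
    }

    static boolean matches(Map<Long, Long> source, Map<Long, Long> sink) {
        return source.equals(sink);
    }
}
```

In production the same idea would run as a scheduled job over per-table checksums rather than raw timestamps, alerting whenever a window disagrees.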

3.5 Operations and Production (Questions 26-30)

Q26. How would you perform a rolling upgrade of a Kafka cluster?

Upgrade one broker at a time. Before stopping a broker, ensure all partitions it leads have up-to-date replicas. Set controlled.shutdown.enable=true so the broker transfers leadership before shutting down. After the upgrade, verify the broker rejoins the cluster and catches up. Monitor UnderReplicatedPartitions throughout the process. Upgrade controllers last in KRaft mode.

Q27. A consumer group has high lag on specific partitions. How do you diagnose?

Check if the affected partitions are on the same broker (possible broker issue). Check consumer processing time per record (slow processing). Check for data skew in the partition key (hot partitions). Check GC logs for long pauses. Check if the consumer is hitting rate limits on downstream systems. Use kafka-consumer-groups.sh --describe to see per-partition lag and consumer assignments.

Q28. How would you handle a datacenter failover for Kafka?

Ensure MirrorMaker 2 is replicating all critical topics with minimal lag. During failover: stop producers in the failed DC, verify replication is caught up in the surviving DC, redirect DNS or load balancers, start consumers in the surviving DC from translated offsets. After failover: monitor for data gaps, run reconciliation, and plan the failback when the original DC recovers.

Q29. What is your approach to Kafka topic naming conventions?

Use a hierarchical naming scheme that encodes the domain, event type, and version:

domain.subdomain.event-type.version

For example: payments.transactions.created.v1, cdc.accounts.changes. Avoid special characters. Use a topic governance tool or naming validation in the SDK to enforce conventions. Document the naming scheme and make it part of the onboarding process for new teams.
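
SDK-side enforcement can be as simple as a regex check at topic-creation time; the pattern below is one illustrative encoding of the domain.subdomain.event-type.version scheme, not a canonical rule:

```java
import java.util.regex.Pattern;

// Naming-convention validator the SDK could apply before creating a
// topic: lowercase dot-separated segments ending in a version suffix.
public class TopicNameValidator {
    private static final Pattern VALID = Pattern.compile(
        "^[a-z0-9]+\\.[a-z0-9-]+\\.[a-z0-9-]+\\.v[0-9]+$");

    static boolean isValid(String topic) {
        return VALID.matcher(topic).matches();
    }
}
```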

Q30. How do you capacity plan for a Kafka cluster?

Start with the required throughput (MB/s) and retention period. Calculate storage: throughput multiplied by retention multiplied by replication factor. Add overhead for compaction and segment rolls. For brokers: each broker can handle roughly 100-200 MB/s depending on hardware. Plan for N+1 (one broker can fail without overloading others). For partitions: target 10-20 MB/s per partition. Monitor actual usage and adjust quarterly. Factor in growth projections from business stakeholders.
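
The storage arithmetic is easy to get wrong under pressure, so it helps to have the formula as code; the figures in the example are illustrative:

```java
// Back-of-the-envelope sizing from the formula above:
// storage = throughput * retention * replication factor, plus headroom.
public class CapacityPlan {
    static double storageTb(double throughputMbPerSec, int retentionDays,
                            int replicationFactor, double overhead) {
        double mb = throughputMbPerSec * retentionDays * 86_400
                  * replicationFactor * (1 + overhead);
        return mb / 1_048_576; // MB -> TB
    }

    public static void main(String[] args) {
        // e.g. 200 MB/s, 7 days retention, RF 3, 20% overhead -> ~415 TB
        System.out.printf("%.1f TB%n", storageTb(200, 7, 3, 0.2));
    }
}
```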


4. Six-Month Study Roadmap

Month 1: Kafka Fundamentals

Week 1-2: Core Concepts

  • Read "Kafka: The Definitive Guide" (2nd Edition) chapters 1-6
  • Set up a 3-broker Kafka cluster locally using Docker Compose
  • Practice producing and consuming messages with the CLI tools
  • Study the log structure by examining segment files directly

Week 3-4: Producer and Consumer Deep Dive

  • Implement a Java producer with different acks settings and measure throughput
  • Implement a consumer group with manual offset commits
  • Experiment with rebalancing: add and remove consumers, observe partition reassignment
  • Read Kafka source code for KafkaProducer.send() and KafkaConsumer.poll()

Month 2: Spring Boot and SDK Development

Week 1-2: Spring Kafka

  • Build a Spring Boot application with spring-kafka
  • Implement producer interceptors for logging and metrics
  • Build a consumer with error handling and dead letter queue
  • Write integration tests using EmbeddedKafka
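For the spring-kafka exercises above, a starting application.yml might look like the following. The property keys are standard Spring Boot ones; the values (servers, group id) are placeholders for the local lab setup:

```yaml
spring:
  kafka:
    bootstrap-servers: localhost:9092
    producer:
      acks: all                  # wait for all in-sync replicas before acking
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
    consumer:
      group-id: demo-consumer
      enable-auto-commit: false  # commit offsets manually in the listener
      auto-offset-reset: earliest
    listener:
      ack-mode: MANUAL           # pair with Acknowledgment.acknowledge() in code
```

Setting enable-auto-commit to false plus ack-mode MANUAL is what makes the "manual offset commits" exercise meaningful: nothing is committed until your listener explicitly acknowledges.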

Week 3-4: SDK Design

  • Design and build a reusable Spring Boot Kafka starter module
  • Add auto-configuration for common serialization formats (JSON, Avro)
  • Implement schema registry integration
  • Package as a Maven/Gradle library with documentation

Month 3: Distributed Systems and Replication

Week 1-2: Theory

  • Read "Designing Data-Intensive Applications" chapters 5, 8, 9
  • Study the Raft consensus paper
  • Understand KRaft mode architecture
  • Practice distributed systems design questions

Week 3-4: Active-Active Replication

  • Set up MirrorMaker 2 between two Kafka clusters locally
  • Test topic replication, offset synchronization, and failover
  • Implement a consumer that handles cluster switching
  • Document the failover procedure step by step
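For the MirrorMaker 2 lab above, a minimal bidirectional properties file could look like this. Cluster aliases, ports, and the catch-all topic pattern are lab assumptions, not production advice:

```properties
# connect-mirror-maker.properties: bidirectional replication, two local clusters
clusters = dc1, dc2
dc1.bootstrap.servers = localhost:9092
dc2.bootstrap.servers = localhost:9192

# replicate in both directions
dc1->dc2.enabled = true
dc2->dc1.enabled = true
dc1->dc2.topics = .*
dc2->dc1.topics = .*

# sync consumer group offsets so consumers can fail over to translated offsets
sync.group.offsets.enabled = true
emit.checkpoints.enabled = true
replication.factor = 3
```

Start it with the Kafka distribution's bin/connect-mirror-maker.sh pointed at this file, then confirm that dc1's topics appear on dc2 under the dc1. prefix (and vice versa).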

Month 4: CDC with Debezium

Week 1-2: Debezium Fundamentals

  • Set up Debezium with PostgreSQL and Kafka Connect
  • Capture inserts, updates, and deletes on sample tables
  • Study the Debezium event format and schema
  • Implement the outbox pattern with a sample application
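For the outbox exercise, a Debezium connector registration posted to Kafka Connect's REST API might look like this sketch. It assumes Debezium 2.x (older versions use database.server.name instead of topic.prefix); the database coordinates and table name are placeholders, and the EventRouter SMT by default expects outbox columns named aggregatetype, aggregateid, type, and payload:

```json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "appdb",
    "topic.prefix": "app",
    "plugin.name": "pgoutput",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
  }
}
```

The EventRouter transform routes each outbox row to a topic derived from its aggregate type, which is what turns a plain table capture into clean domain events.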

Week 3-4: Production CDC

  • Test schema evolution scenarios (add column, rename column, drop column)
  • Implement a reconciliation check between source and sink
  • Study Debezium's exactly-once delivery configuration
  • Handle initial snapshots for large tables

Month 5: Stream Processing with Flink

Week 1-2: Flink Basics

  • Complete the official Flink training exercises
  • Build a streaming application that reads from Kafka and writes aggregations
  • Study windowing: tumbling, sliding, session windows
  • Implement event time processing with watermarks

Week 3-4: Advanced Flink

  • Implement stateful processing with managed keyed state
  • Configure checkpointing with RocksDB state backend
  • Build a complex event processing (CEP) pattern for fraud detection
  • Deploy a Flink job on a local YARN or Kubernetes cluster

Month 6: ClickHouse and Integration

Week 1-2: ClickHouse

  • Set up ClickHouse and load sample data
  • Design table schemas with appropriate engines (MergeTree, ReplacingMergeTree)
  • Build a Kafka-to-ClickHouse pipeline using the Kafka engine and materialized views
  • Write analytical queries and optimize with EXPLAIN
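The Kafka-engine-plus-materialized-view pattern from the exercise above looks roughly like this in SQL. Table and column names are illustrative; the SETTINGS keys are ClickHouse's Kafka engine settings:

```sql
-- 1) Kafka engine table: a streaming view over a topic, not real storage
CREATE TABLE transactions_queue (
    txn_id String,
    amount Decimal(18, 2),
    created_at DateTime
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list = 'payments.transactions.created.v1',
         kafka_group_name = 'clickhouse-ingest',
         kafka_format = 'JSONEachRow';

-- 2) MergeTree table: the actual storage, ordered for analytical queries
CREATE TABLE transactions (
    txn_id String,
    amount Decimal(18, 2),
    created_at DateTime
) ENGINE = MergeTree
ORDER BY (created_at, txn_id);

-- 3) Materialized view: continuously moves rows from the queue into storage
CREATE MATERIALIZED VIEW transactions_mv TO transactions AS
SELECT txn_id, amount, created_at FROM transactions_queue;
```

The materialized view is the piece people forget: without it, the Kafka engine table consumes nothing, because reads only happen when something selects from it.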

Week 3-4: End-to-End Project and Interview Prep

  • Build a complete pipeline: PostgreSQL to Debezium to Kafka to Flink to ClickHouse
  • Create a monitoring dashboard with Grafana
  • Prepare a 5-minute architecture presentation of your project
  • Practice all 30 interview questions with a partner
  • Review and refine your resume

5. Resume Strategy

What Toss Bank Wants to See

Your resume should answer one question: can this person operate and evolve our Kafka infrastructure? Here is how to frame your experience:

Lead with Impact Metrics

  • "Operated a 50-broker Kafka cluster processing 2M messages per second with 99.99% uptime"
  • "Reduced consumer lag from 10 minutes to under 30 seconds by migrating to cooperative rebalancing"
  • "Built a CDC pipeline with Debezium that replaced batch ETL, reducing data freshness from 6 hours to under 5 seconds"

Demonstrate Operational Maturity

  • Mention on-call experience, incident response, and post-mortem culture
  • Highlight zero-downtime upgrades and migration projects
  • Show that you understand the boring but critical parts: monitoring, alerting, capacity planning

Show Breadth Beyond Kafka

  • Spring Boot SDK development shows software engineering skills
  • Flink or Spark experience shows data processing capabilities
  • ClickHouse or similar OLAP experience shows analytics awareness
  • Kubernetes deployment experience shows operational versatility

Resume Format Tips

  • Keep it to 2 pages maximum
  • Use the XYZ format: "Accomplished X by implementing Y, resulting in Z"
  • List technologies in order of relevance to the JD
  • Include a "Technical Skills" section that mirrors the JD's tech stack
  • Link to a GitHub profile with relevant projects or contributions

6. Portfolio Project Ideas

Project 1: Real-Time Financial Transaction Pipeline

Build an end-to-end streaming pipeline that simulates financial transaction processing:

  • Data Source: A Spring Boot application that generates synthetic transaction events
  • Kafka: Multi-topic architecture with transactions, alerts, and audit topics
  • Flink: Stream processing for real-time aggregation and anomaly detection
  • ClickHouse: Real-time analytics dashboard
  • Monitoring: Prometheus + Grafana for Kafka and Flink metrics

This project directly maps to Toss Bank's core architecture.

Project 2: Multi-Datacenter Kafka Replication Lab

Demonstrate your understanding of Active-Active replication:

  • Set up two Kafka clusters in Docker (simulating two data centers)
  • Configure MirrorMaker 2 for bidirectional replication
  • Implement a producer that writes to both clusters
  • Build a consumer that fails over between clusters
  • Document the failover and failback procedures
  • Measure and report replication lag under various load conditions

Project 3: CDC-Powered Event Sourcing System

Build a microservice system that uses CDC for event-driven communication:

  • A Spring Boot service with a PostgreSQL database
  • Debezium capturing changes from the outbox table
  • Kafka as the event bus
  • A downstream service that builds a materialized view from events
  • A reconciliation tool that verifies consistency between source and materialized view

Project 4: Kafka Operations Toolkit

Build tools that a Kafka operator would actually use:

  • A partition rebalancing analyzer that suggests optimal partition assignments
  • A consumer lag monitoring dashboard with alerting
  • A topic configuration auditor that checks for deviations from standards
  • A schema compatibility checker that validates Avro schemas before registration

7. Quiz: Test Your Knowledge

Q1: You set acks=all and min.insync.replicas=2 on a 3-broker cluster. One broker goes down. What happens to produce requests?

Produce requests continue to succeed. With 3 replicas and 1 broker down, 2 replicas remain in the ISR. Since min.insync.replicas=2 is satisfied, the producer receives acknowledgment. If a second broker goes down, the ISR would have only 1 replica, which is less than min.insync.replicas=2, and produce requests would fail with NotEnoughReplicasException.
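The availability rule in Q1 reduces to a single inequality. A toy model, not Kafka code, assuming every surviving replica is still in the ISR:

```java
public class IsrAvailability {
    /**
     * With acks=all, a produce succeeds only while the in-sync replica set
     * is at least min.insync.replicas. Toy model: ISR = replicas still up.
     */
    public static boolean produceSucceeds(int replicationFactor, int brokersDown,
                                          int minInsyncReplicas) {
        int isrSize = replicationFactor - brokersDown;
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(produceSucceeds(3, 1, 2)); // true: 2 replicas in ISR
        System.out.println(produceSucceeds(3, 2, 2)); // false: NotEnoughReplicas
    }
}
```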

Q2: A Debezium connector for PostgreSQL stops receiving changes. The connector status shows "RUNNING" but no new messages appear in Kafka. What are the most likely causes?

Possible causes: (1) The replication slot has been dropped or is inactive — check pg_replication_slots. (2) WAL level is not set to logical — verify wal_level configuration. (3) The connector is stuck on a snapshot — check connector task status. (4) Network partition between Debezium and the database. (5) The table was excluded by a filter change. (6) PostgreSQL publication does not include the target tables.
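For causes (1) and (2), the checks are one-liners in psql (the slot name to look for is whatever the connector was configured with):

```sql
-- Is logical decoding enabled? Must return 'logical' for Debezium.
SHOW wal_level;

-- Does the connector's replication slot still exist, and is it active?
SELECT slot_name, plugin, active, restart_lsn
FROM pg_replication_slots;
```

An existing slot with active = false while the connector reports RUNNING is a strong hint that the task is wedged and needs a restart.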

Q3: Your Flink job's checkpoint duration increases from 5 seconds to 5 minutes over a week. What is happening and how do you fix it?

Checkpoint duration increases when state size grows. Possible causes: (1) A keyed state that accumulates without TTL — add state TTL configuration. (2) State backend running out of local disk — increase disk or switch to incremental checkpoints. (3) RocksDB compaction falling behind — tune RocksDB settings. (4) Checkpoint storage (S3/HDFS) throughput bottleneck. Fix: enable incremental checkpointing, configure state TTL, monitor state size per operator in the Flink UI.

Q4: In an Active-Active Kafka setup with MirrorMaker 2, how do you prevent infinite replication loops?

MirrorMaker 2 avoids replication loops primarily through its replication policy. The default DefaultReplicationPolicy renames replicated topics by prefixing them with the source cluster alias (payments.transactions from dc1 becomes dc1.payments.transactions on dc2), so each MM2 flow can recognize topics that already originated in its target cluster and skip them. If you use IdentityReplicationPolicy to keep topic names identical across clusters, you give up this built-in loop detection, so you must guarantee that each topic is actively produced in only one cluster at a time.

Q5: You need to migrate a topic from delete-based retention to compaction without downtime. Describe the steps.

Steps: (1) Create a new topic with cleanup.policy=compact and the same partition count. (2) Set up a Kafka Streams or consumer-producer bridge that reads the old topic and writes to the new topic with the same keys. (3) Wait until the bridge has caught up to the end of the old topic. (4) Switch producers to the new topic. (5) Switch consumers to the new topic; since offsets differ between the two topics, the simplest safe approach is to let each consumer drain the old topic fully, then start on the new topic from the earliest offset, relying on compaction and key-based idempotency to make the replay harmless. (6) Verify data integrity. (7) Delete the old topic after a grace period.


8. References and Resources

Books

  1. Neha Narkhede, Gwen Shapira, Todd Palino — Kafka: The Definitive Guide, 2nd Edition, O'Reilly, 2021
  2. Martin Kleppmann — Designing Data-Intensive Applications, O'Reilly, 2017
  3. Fabian Hueske, Vasiliki Kalavri — Stream Processing with Apache Flink, O'Reilly, 2019
  4. Alexander Dean, Valentin Crettaz — Event Streams in Action, Manning, 2019

Official Documentation

  1. Apache Kafka Documentation — https://kafka.apache.org/documentation/
  2. Confluent Platform Documentation — https://docs.confluent.io/
  3. Debezium Documentation — https://debezium.io/documentation/
  4. Apache Flink Documentation — https://flink.apache.org/docs/
  5. ClickHouse Documentation — https://clickhouse.com/docs/

Courses and Tutorials

  1. Confluent Developer — Free Kafka courses — https://developer.confluent.io/
  2. Stephane Maarek — Kafka courses on Udemy
  3. Apache Flink Training — https://nightlies.apache.org/flink/flink-docs-stable/docs/learn-flink/overview/
  4. ClickHouse Academy — https://clickhouse.com/learn

Community and Talks

  1. Kafka Summit recordings — https://www.kafka-summit.org/
  2. Flink Forward conference recordings — https://flink-forward.org/
  3. Martin Kleppmann's blog — https://martin.kleppmann.com/
  4. Toss Tech Blog — https://toss.tech/
  5. Confluent Blog — https://www.confluent.io/blog/

Conclusion

The Toss Bank Data Engineer (Kafka and Streaming) role is one of the most technically demanding data engineering positions in the Korean fintech ecosystem. It requires a rare combination of deep Kafka operational expertise, software engineering skill for SDK development, distributed systems knowledge for Active-Active replication, and data pipeline experience with CDC and stream processing.

The six-month roadmap in this guide is aggressive but realistic. Start with Kafka fundamentals, build the SDK skills early, then layer on the advanced topics. The portfolio projects will give you concrete artifacts to discuss in interviews, and the 30 practice questions cover the most likely topics you will face.

Remember: Toss Bank values operational maturity as much as technical skill. Being able to discuss how you would handle a broker failure at 3 AM, how you would plan capacity for next quarter, or how you would safely migrate a critical topic — these conversations matter as much as your ability to write Flink code.

Start studying today. The pipeline waits for no one.