Skip to content

필사 모드: Stream Processing 2026 Deep Dive - Kafka Streams, Flink SQL, RisingWave, Materialize, Arroyo, ksqlDB, Bytewax, Decodable

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Prologue — The Weight of "We Need This in Real Time"

A fintech meeting room in 2026.

PM: "Can we build a real-time fraud detection dashboard?"

Engineer: "How real-time is real-time?"

PM: "Um... immediate?"

Engineer: "One second? 100 milliseconds? Or five minutes?"

Everything about stream processing in 2026 is in that short exchange. On one side there's the vague expectation that "if data comes out of Kafka, that's real time." On the other side there's the sigh of engineers who deal with **windows, watermarks, exactly-once, backpressure, and state backups** every day. Between them sit Flink, Kafka Streams, RisingWave, Materialize, Arroyo, ksqlDB, and Bytewax.

In 2026, stream processing splits into **three paradigms**: programming frameworks (Flink, Kafka Streams, Bytewax), SQL streaming databases (RisingWave, Materialize, ksqlDB), and managed SaaS (Decodable, Confluent Cloud Flink, Aiven Flink, Quix). Overlaid on top is a second axis — **state management**: embedded RocksDB versus disaggregated state on S3.

This post is a single map of that whole terrain — from Flink 2.0's Disaggregated State Backend, RisingWave 2.x's Postgres wire, Materialize's Differential Dataflow, Arroyo's Rust SQL engine, ksqlDB's Confluent winding-down, Bytewax's Python leap, all the way to what Korean and Japanese big tech actually run.

1. The 2026 Map — Three Paradigms: Framework, Stream DB, Managed

Terminology first. The phrase "stream processing" hides three different things.

| Paradigm | Meaning | Examples | User |

| --- | --- | --- | --- |

| Stream Processing Framework | Code-defined topology (DataStream, KStream) | Flink, Kafka Streams, Bytewax, Beam | Data engineer |

| Streaming Database | SQL view definitions computed incrementally | RisingWave, Materialize, ksqlDB | Analyst, backend |

| Managed Streaming SaaS | Hosted versions of the above | Decodable, Confluent Cloud Flink, Quix | Small teams |

By 2026 the lines have blurred. Flink added Materialized Tables, RisingWave accepts Rust UDFs, and Kafka Streams is being absorbed as the de facto successor to ksqlDB.

Still, **paradigm is the starting point**. A decision tree:

1. **"Just transform and aggregate between Kafka topics"** -> Kafka Streams (Java/Scala) or ksqlDB (SQL).

2. **"Complex windows/joins/CEP, non-Kafka sources, petabytes"** -> Flink (SQL or DataStream API).

3. **"I want the app to query Postgres, and a Materialized View that updates itself"** -> RisingWave or Materialize.

4. **"Python data scientists need to use it directly"** -> Bytewax or Quix.

5. **"I don't want to operate anything, give me a SaaS"** -> Decodable, Confluent Cloud Flink, Aiven Flink.

6. **"Rust, monotonic workload, single binary"** -> Arroyo.

If Flink is a hammer, every problem becomes a nail. The veteran's line in 2026: "If you do 10K events/sec and only need simple aggregation, ksqlDB or Materialize is enough. You will regret operating a Flink cluster."

2. Streaming vs Batch vs Micro-Batch — The Latency Spectrum

Start with the paradigm itself.

| Model | Processing unit | Latency | Examples |

| --- | --- | --- | --- |

| Batch | Full dataset at once | minutes-hours | Spark batch, Hive, dbt |

| Micro-batch | Small batches on a schedule | seconds-minutes | Spark Structured Streaming, Trino on Iceberg |

| True streaming | One event at a time | ms-seconds | Flink, Kafka Streams, RisingWave |

| Continuous query | Result view kept incrementally | seconds-minutes | Materialize, RisingWave, Snowflake Dynamic Tables |

Spark Structured Streaming's micro-batch fires a small batch job each tick. P99 is typically 1-10 seconds. Flink processes record-at-a-time (or a mini-batch mode) and can hit P99 under 100 ms. So HFT, game matchmaking, and fraud detection lean on Flink/Kafka Streams; analytical dashboards lean on Materialize/RisingWave; daily reports stay with Spark/Trino.

The real question is: **does sub-second latency move the business**? Most BI dashboards survive a 5-minute lag, where micro-batch is enough. True streaming pays off only where one second of delay actually costs revenue.

3. Apache Flink 2.0 — Disaggregated State and Materialized Tables

Apache Flink started from the Stratosphere project in 2014, driven by data Artisans (now Ververica, acquired by Alibaba). The release of **Flink 2.0** in March 2025 was the turning point. This section covers Flink's SQL, Table API, DataStream API, state, watermarks, and exactly-once together.

Key changes in 2.0:

- **Disaggregated State Backend (ForSt)**: state lives on S3/GCS, with a local cache for hot data. Tiered storage came to state.

- **Async State API**: non-blocking state access cuts P99.

- **Materialized Tables (FLIP-435)**: define a view in SQL, background incremental refresh. Officially absorbs RisingWave/Materialize territory.

- **Adaptive Batch Scheduler default**: stream and batch on the same engine.

- **Java 17 baseline**, Scala 2.13 migration done, Python 3.12 support.

Flink's architecture is four layers.

| Layer | Responsibility | Example |

| --- | --- | --- |

| Flink SQL | ANSI SQL, catalog, Hive compat | CREATE TABLE orders ... |

| Table API | Relational API (Java/Scala/Python) | table.groupBy("user").select(...) |

| DataStream API | Low-level transforms | dataStream.map().keyBy().window() |

| Stateful Functions / ProcessFunction | Lowest level, direct state | RichFlatMapFunction |

Reading Kafka and joining with Flink SQL:

CREATE TABLE orders (

order_id BIGINT,

user_id BIGINT,

amount DECIMAL(10,2),

event_time TIMESTAMP_LTZ(3),

WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND

) WITH (

'connector' = 'kafka',

'topic' = 'orders',

'properties.bootstrap.servers' = 'kafka:9092',

'scan.startup.mode' = 'earliest-offset',

'format' = 'avro-confluent',

'avro-confluent.schema-registry.url' = 'http://schema-registry:8081'

);

CREATE TABLE users (

user_id BIGINT,

segment STRING,

PRIMARY KEY (user_id) NOT ENFORCED

) WITH (

'connector' = 'jdbc',

'url' = 'jdbc:postgresql://pg:5432/app',

'table-name' = 'users'

);

SELECT

segment,

TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,

SUM(amount) AS revenue

FROM orders o

JOIN users u ON o.user_id = u.user_id

GROUP BY segment, TUMBLE(event_time, INTERVAL '5' MINUTE);

This single query covers the core: Kafka source, JDBC lookup, watermark, tumbling window, group-by aggregate.

4. Flink DataStream API and Stateful Functions

Some workloads SQL cannot express: complex state machines, custom partitioning, dynamic routing, CEP. The DataStream API steps in.

Java example — per-user session window with alerts:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.enableCheckpointing(30_000); // 30s

DataStream<Event> events = env.fromSource(

KafkaSource.<Event>builder()

.setBootstrapServers("kafka:9092")

.setTopics("clicks")

.setStartingOffsets(OffsetsInitializer.earliest())

.setValueOnlyDeserializer(new EventDeserializer())

.build(),

WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))

.withTimestampAssigner((e, ts) -> e.eventTimeMs()),

"kafka-source"

);

events

.keyBy(Event::userId)

.window(EventTimeSessionWindows.withGap(Time.minutes(10)))

.process(new ProcessWindowFunction<Event, Alert, Long, TimeWindow>() {

@Override

public void process(Long userId, Context ctx, Iterable<Event> elems, Collector<Alert> out) {

int count = 0;

for (Event e : elems) count++;

if (count > 100) out.collect(new Alert(userId, count, "abusive"));

}

})

.sinkTo(KafkaSink.<Alert>builder()

.setBootstrapServers("kafka:9092")

.setRecordSerializer(...)

.setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)

.build());

env.execute("session-alert");

The shape is `keyBy -> window -> process`. keyBy routes same-key events to the same task; window groups by time/session; process emits per window.

State is accessed via RichFunction:

public class FraudDetector extends KeyedProcessFunction<Long, Tx, Alert> {

private transient ValueState<Double> lastAmount;

private transient ListState<Tx> recentTxs;

@Override

public void open(Configuration cfg) {

lastAmount = getRuntimeContext().getState(

new ValueStateDescriptor<>("last", Double.class));

recentTxs = getRuntimeContext().getListState(

new ListStateDescriptor<>("recent", Tx.class));

}

@Override

public void processElement(Tx tx, Context ctx, Collector<Alert> out) throws Exception {

Double prev = lastAmount.value();

if (prev != null && tx.amount() > prev * 10) {

out.collect(new Alert(tx.userId(), "amount-spike"));

}

lastAmount.update(tx.amount());

recentTxs.add(tx);

}

}

Flink offers ValueState, ListState, MapState, ReducingState, AggregatingState — all keyed scope, shared only among events with the same key.

5. Flink State, Checkpoints, and Exactly-Once

Flink's superpower is **exactly-once semantics**. With Kafka source plus Kafka sink, failures do not duplicate or drop results. How?

The core is a **Chandy-Lamport** distributed snapshot.

1. JobManager periodically injects checkpoint barriers into sources.

2. Barriers flow downstream with the data.

3. Each operator snapshots state when it receives a barrier.

4. Sinks commit per-sink semantics (Kafka transactional producer commit).

5. After all operators ack, the checkpoint completes.

On failure:

1. Restore all operator state from the last successful checkpoint.

2. Sources rewind to the offsets at that checkpoint.

3. Sinks are transactional, so only committed transactions become visible externally.

Configuration example:

env.enableCheckpointing(60_000);

env.getCheckpointConfig().setCheckpointStorage("s3://flink-state/checkpoints");

env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

env.getCheckpointConfig().setCheckpointTimeout(300_000);

env.getCheckpointConfig().enableUnalignedCheckpoints();

env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

env.setStateBackend(new EmbeddedRocksDBStateBackend());

Flink 2.0's ForSt backs RocksDB with S3 — petabyte-scale state stops being tied to broker disks.

6. Flink Windowing — Tumbling, Sliding, Session, Custom

Windows are the central abstraction in stream processing: cutting an unbounded stream into finite chunks.

| Window | Meaning | Example |

| --- | --- | --- |

| Tumbling | Non-overlapping fixed | 5-minute revenue buckets |

| Sliding | Overlapping fixed | 5-minute moving avg every 30s |

| Session | Inactivity gap | 30-minute idle ends session |

| Global | Whole stream | Total count (risk: unbounded state) |

Tumbling in Flink SQL:

SELECT

TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS w_start,

COUNT(*) AS cnt

FROM clicks

GROUP BY TUMBLE(event_time, INTERVAL '5' MINUTE);

Sliding (HOP):

SELECT

HOP_START(event_time, INTERVAL '30' SECOND, INTERVAL '5' MINUTE) AS w_start,

AVG(latency) AS avg_latency

FROM requests

GROUP BY HOP(event_time, INTERVAL '30' SECOND, INTERVAL '5' MINUTE);

Session:

SELECT

user_id,

SESSION_START(event_time, INTERVAL '30' MINUTE) AS s_start,

COUNT(*) AS events_in_session

FROM user_events

GROUP BY user_id, SESSION(event_time, INTERVAL '30' MINUTE);

Choosing a window:

- **Fixed periodic reports**: Tumbling.

- **Moving averages, hot metrics**: Sliding (small slide).

- **User behavior analysis**: Session.

- **State machines spanning irregular intervals**: ProcessFunction + timers.

7. Event Time vs Processing Time — The Heart of Watermarks

The hardest concept in streaming. **Event time** is when the event happened; **processing time** is when the system saw it. They differ.

- A mobile app went offline; events arrive 10 minutes late.

- Network delay drops part of a batch behind.

- The system paused, then resumed.

A **watermark** is a signal: "no more events earlier than this timestamp will arrive." Watermarks trigger window close, result emission, and late-data separation.

Flink watermark strategy:

WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))

.withTimestampAssigner((event, ts) -> event.eventTimeMs())

.withIdleness(Duration.ofMinutes(1));

`forBoundedOutOfOrderness(10s)` means: "If the largest event timestamp I've seen is T, I will process events with timestamp before T-10s. Anything later is late data."

Handling late data:

SingleOutputStreamOperator<Result> main = stream

.keyBy(...)

.window(TumblingEventTimeWindows.of(Time.minutes(5)))

.allowedLateness(Time.minutes(2))

.sideOutputLateData(lateTag)

.process(...);

DataStream<Event> late = main.getSideOutput(lateTag);

Watermarks trade **correctness for latency**. Short watermark = fast results but more late data. Long watermark = correct but slow.

8. CEP — Complex Event Processing

CEP is **event pattern matching**: "5 failed logins in a row -> lock account", "payment -> refund -> payment in 5 minutes -> suspect fraud."

Flink CEP example:

Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("first")

.where(SimpleCondition.of(e -> !e.success()))

.next("repeat")

.where(SimpleCondition.of(e -> !e.success()))

.times(4)

.within(Time.minutes(5));

PatternStream<LoginEvent> ps = CEP.pattern(events.keyBy(LoginEvent::userId), pattern);

DataStream<Alert> alerts = ps.select(match -> {

LoginEvent last = match.get("repeat").get(3);

return new Alert(last.userId(), "5-fail-in-5min");

});

The major CEP options are Flink CEP, Esper (long history), and Drools Fusion (rule-engine flavor). They power fraud detection, IDS, trade rules, IoT alerts.

CEP's hard problem is **state**. A 5-minute pattern means 5 minutes of partial matches in memory. State explodes under load. The fixes: partition by key, state TTL, simpler patterns.

9. Kafka Streams 4.0 — Stream Processing as a Library

Kafka Streams arrived in Kafka 0.10 (2016) as a Confluent-led **library**. If Flink is "submit a job to a distributed cluster," Kafka Streams is "import a library into your Spring Boot app."

Kafka Streams 4.0 (2025) highlights:

- Integration with **Share Groups (KIP-932)** — consume the same topic like a queue.

- **Async Processing API** — non-blocking I/O friendly.

- **State Store on RocksDB 7.x** — better compression, faster recovery.

- **Java 17 baseline**, experimental GraalVM Native Image.

- **Kafka Streams DSL stabilized** — richer KTable.join, KStream-KTable join variants.

Two abstractions:

- **KStream**: an unbounded record stream (insert-only).

- **KTable**: changelog-backed table holding latest value per key.

Word count (Java):

Properties props = new Properties();

props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wc-app");

props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());

props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> source = builder.stream("text-input");

KTable<String, Long> counts = source

.flatMapValues(v -> Arrays.asList(v.toLowerCase().split("\\W+")))

.groupBy((k, v) -> v)

.count();

counts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));

KafkaStreams streams = new KafkaStreams(builder.build(), props);

streams.start();

The beauty: **it's just a library**. Spin up N copies of the same JAR and they split partitions among themselves (via consumer groups and internal rebalance). No cluster manager. Embeds in Spring Boot, Quarkus, Micronaut.

10. Kafka Streams State and Interactive Queries

KTables and aggregations are stored in **state stores** (RocksDB by default, in-memory optional). The state is **queryable externally** through Interactive Queries.

Example:

ReadOnlyKeyValueStore<String, Long> store = streams.store(

StoreQueryParameters.fromNameAndType("word-counts", QueryableStoreTypes.keyValueStore())

);

Long count = store.get("kafka");

The pattern: **Streams app = microservice**. Read Kafka, aggregate into local RocksDB, expose results via REST. No separate OLAP database for real-time views.

The cost is distributed-state complexity. You need to know which instance holds which key. Kafka Streams exposes `metadataForKey()` and you write gRPC/REST routing yourself.

GlobalKTable replicates small lookup tables to every instance (every partition consumed by every instance).

Kafka Streams **only processes data already in Kafka**. Joining external JDBC/REST is limited (you must source through Kafka Connect first). This is the biggest gap versus Flink.

11. ksqlDB 0.30 — The Rise and Wind-Down of SQL-on-Kafka

ksqlDB was Confluent's 2017 "SQL database on top of Kafka," using Kafka Streams internally. Submit SQL and a background streaming job runs.

CREATE STREAM orders (

order_id BIGINT KEY,

user_id BIGINT,

amount DECIMAL(10,2)

) WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='AVRO');

CREATE TABLE user_revenue AS

SELECT user_id, SUM(amount) AS total

FROM orders

GROUP BY user_id

EMIT CHANGES;

SELECT * FROM user_revenue WHERE user_id = 42 EMIT CHANGES;

`EMIT CHANGES` means "continuously push result deltas" — the essence of streaming SQL.

ksqlDB 0.30 (2025) is effectively the last major release. In early 2025 Confluent declared the direction: **"ksqlDB -> Flink SQL"**. SQL streaming in Confluent Cloud unifies on Flink; ksqlDB enters maintenance mode. New projects should pick Flink SQL.

ksqlDB is still in production at many places, so the migration path matters.

12. RisingWave 2.x — A Streaming DB on the Postgres Wire

RisingWave was started in 2021 by Singularity Data (now RisingWave Labs) as a **Rust-based streaming database**. One-line summary of the 2.0 release in 2025: "**Looks like Postgres on the outside, runs streaming SQL inside.**"

Key features:

- **100% Postgres wire protocol**: psql, JDBC, ORMs all work.

- **Materialized views as first-class citizens**: `CREATE MATERIALIZED VIEW` is an incremental streaming job.

- **Rust engine, thread-per-core**: low overhead, stable P99.

- **State offload to S3**: disaggregated from day one.

- **Cluster, Cloud, and standalone deployment modes**.

- **Built-in CDC sources**: Postgres binlog, MySQL.

Basic usage:

CREATE SOURCE orders_source (

order_id BIGINT,

user_id BIGINT,

amount NUMERIC,

event_time TIMESTAMPTZ

) WITH (

connector = 'kafka',

topic = 'orders',

properties.bootstrap.server = 'kafka:9092',

scan.startup.mode = 'earliest'

) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW hourly_revenue AS

SELECT

date_trunc('hour', event_time) AS hour,

SUM(amount) AS revenue,

COUNT(*) AS orders

FROM orders_source

GROUP BY 1;

SELECT * FROM hourly_revenue ORDER BY hour DESC LIMIT 24;

That last SELECT is the magic. RisingWave maintains hourly_revenue **incrementally** in the background, and the query returns the latest snapshot instantly. Dashboards can query it directly like an OLAP DB.

RisingWave 2.x new features:

- **Iceberg Sink and Source**: direct lakehouse integration.

- **Python / JavaScript / Rust UDFs**.

- **Subscription**: Postgres LISTEN/NOTIFY-style change subscriptions.

- **Snowflake / BigQuery / Redshift sinks**.

13. Materialize — Differential Dataflow, Productized

Materialize was founded in 2019 by Frank McSherry (Microsoft Research, author of Naiad/Differential Dataflow) and Arjun Narayan. One-line summary: "**True incremental view maintenance.**"

Key differences:

- **Differential Dataflow engine**: real incremental computation researched since the 1990s. Only deltas propagate.

- **Postgres wire**: psql/JDBC like RisingWave.

- **99% SQL coverage**: Postgres-compatible dialect including recursive CTEs.

- **Strict serializability**: every query sees a consistent snapshot.

- **Cloud launched 2024**: self-hosting still possible, but SaaS first.

Basic example:

CREATE SOURCE orders FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'

FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'

ENVELOPE NONE;

CREATE MATERIALIZED VIEW revenue_by_segment AS

SELECT

u.segment,

SUM(o.amount) AS revenue

FROM orders o

JOIN users u ON o.user_id = u.user_id

GROUP BY u.segment;

SELECT * FROM revenue_by_segment;

Materialize's edge is **arbitrary SQL kept incremental**. Most streaming SQL only supports group-by aggregates and a few windowed patterns incrementally. Materialize handles joins, subqueries, and recursive CTEs incrementally.

The cost is **memory**. Differential Dataflow is mostly in-memory (disk spill being added). For very large state, RisingWave is stronger; for complex SQL, Materialize wins.

14. Arroyo — Flink SQL Reimplemented in Rust

Arroyo started in 2023 by Micah Wylde (ex-Lyft Streaming) as a **Rust-based streaming engine**. One-line summary: "**Flink SQL compatible, single binary, fast.**"

Highlights:

- **Rust engine**: no JVM, single binary.

- **Arroyo SQL = a Flink SQL-compatible subset**.

- **Web UI first**: job creation and monitoring feel like SaaS.

- **S3 state**: tiered, no RocksDB.

- **Kubernetes operator** available.

Single-node CLI:

docker run -p 8000:8000 ghcr.io/arroyosystems/arroyo:latest

SQL in the Web UI:

CREATE TABLE events (

user_id BIGINT,

event_type TEXT,

amount FLOAT,

ts TIMESTAMP

) WITH (

connector = 'kafka',

bootstrap_servers = 'kafka:9092',

topic = 'events',

type = 'source',

format = 'json',

'event_time_field' = 'ts',

'watermark_field' = 'ts'

);

CREATE TABLE sinks WITH (connector = 'kafka', topic = 'alerts', type = 'sink', format = 'json');

INSERT INTO sinks

SELECT user_id, SUM(amount) AS total

FROM events

WHERE event_type = 'purchase'

GROUP BY user_id, TUMBLE(INTERVAL '5' MINUTE);

By 2026, Arroyo is gaining production adoption among teams that **dislike Flink operations but don't want Confluent lock-in**. Some teams at Lyft, Reddit, and Stripe run it.

Its weakness is SQL coverage. Arroyo SQL is a subset of Flink SQL, missing some functions. It's catching up fast but not parity yet.

15. Bytewax — Stream Processing in Python

Bytewax was founded in 2022 by Bytewax (the company) as a **Python streaming framework**. One-line summary: "**Looks like Flink, written in Python.**"

Highlights:

- **Pure Python API**: data scientists can write it themselves.

- **Rust core + PyO3**: Rust performance, Python user code.

- **PyData integration**: NumPy, Pandas, scikit-learn directly.

- **Connectors**: Kafka, Redpanda, Kinesis, S3, Iceberg, ClickHouse.

Word count:

from bytewax.dataflow import Dataflow

from bytewax.connectors.kafka import KafkaSource, KafkaSink

from bytewax import operators as op

from bytewax.connectors.stdio import StdOutSink

flow = Dataflow("word-count")

stream = op.input("kafka-in", flow, KafkaSource(

brokers=["kafka:9092"], topics=["text"]

))

words = op.flat_map("split", stream, lambda msg: msg.value.decode().split())

keyed = op.key_on("key", words, lambda w: w)

counts = op.stateful_map("count",

keyed,

lambda: 0,

lambda count, word: (count + 1, (word, count + 1))

)

op.output("stdout", counts, StdOutSink())

Bytewax shines for ML inference. PyTorch, scikit-learn, and Hugging Face models drop straight into the stream:

from transformers import pipeline

clf = pipeline("text-classification", model="distilbert-base-uncased")

def classify(text):

return clf(text)[0]

stream = op.input(...)

labeled = op.map("classify", stream, classify)

op.output("sink", labeled, ...)

Doing this in Flink means Python UDFs and inefficient model loading. Bytewax stays inside the same Python process — fast and natural.

The weakness is **operational maturity**. Don't expect Flink-scale clusters or decade-proven exactly-once. Bytewax fits the 100-1000 events/sec range and data-science pipelines.

16. Apache Beam — The Promise and Limits of Unified APIs

Apache Beam came out in 2016 when Google open-sourced the Cloud Dataflow model. One line: "**Stream and batch under one API.**"

The core ideas:

- **PTransform / PCollection / Pipeline** abstractions.

- **Pluggable runners**: Flink Runner, Spark Runner, Dataflow Runner, Samza Runner.

- **Java / Python / Go SDKs**.

from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:

(p

| 'Read' >> beam.io.ReadFromKafka(...)

| 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))

| 'Count' >> beam.combiners.Count.PerKey()

| 'Write' >> beam.io.WriteToBigQuery(...))

The same code can in theory run on Flink Runner or Dataflow Runner.

In practice, runners support different features, debugging is harder, and there's an extra layer of abstraction over native APIs. By 2026 Beam is mostly used by **teams on Google Cloud Dataflow**. Few seriously chase the multi-cloud promise anymore.

17. Decodable, Confluent Cloud Flink, Aiven Flink — Managed Flink Compared

Operating Flink yourself is hard. Managed offerings have matured.

| Provider | Engine | Pricing | Notes |

| --- | --- | --- | --- |

| Confluent Cloud Flink | Flink 1.20+ | Compute Unit + data | Best Schema Registry / Kafka integration |

| Aiven for Flink | Apache Flink | Instance hours + storage | Multi-cloud, Karapace |

| Decodable | Flink + house pieces | Throughput-based | "Pipeline in 5 minutes", rich connectors |

| Ververica Platform Cloud | Flink + Ververica Enterprise | Instance | Original Flink company, strong on-prem |

| Striim | House engine (Flink-derived) | License | CDC + streaming bundle |

| Amazon MSF | Flink | Time + throughput | AWS Managed Service for Apache Flink |

2026 recommendations:

- **Full Kafka stack**: Confluent Cloud Flink.

- **Cheap, multi-cloud OSS**: Aiven Flink.

- **Quick start, GUI first**: Decodable.

- **AWS lock-in OK**: MSF.

- **On-prem / enterprise support**: Ververica Platform.

Decodable's pitch is YAML-defined pipelines that run immediately. Small teams can ship a CDC pipeline in a day:

sources:

- name: orders-kafka

type: kafka

bootstrap_servers: kafka:9092

topic: orders

pipelines:

- name: enrich-orders

sql: |

SELECT o.*, u.segment

FROM orders o

JOIN users u ON o.user_id = u.user_id

sinks:

- name: enriched-snowflake

type: snowflake

table: enriched_orders

18. Quix, Pulsar Functions, Memphis, Redpanda Connect — Lighter Options

When Flink is overkill:

**Quix Streams** — Python library, lighter than Bytewax, Kafka-centric.

from quixstreams import Application

app = Application(broker_address="kafka:9092")

events = app.dataframe(topic=app.topic("events"))

events["amount_x2"] = events["amount"] * 2

events.filter(lambda row: row["amount"] > 100).to_topic(app.topic("alerts"))

app.run()

**Apache Pulsar Functions** — built into Pulsar, light per-message transforms.

def process(input, context):

return input.upper()

Deploy with one `pulsar-admin functions create` call.

**Memphis.dev / Superstream** — Kafka-compatible OSS queue with some streaming features.

**Redpanda Connect** (formerly Benthos) — powerful YAML ETL, 200+ connectors.

input:

kafka:

addresses: [kafka:9092]

topics: [events]

pipeline:

processors:

- mapping: |

root = this

root.processed_at = now()

root.amount_usd = this.amount * 0.93

output:

kafka:

addresses: [kafka:9092]

topic: events-processed

Redpanda Connect is closer to "stream ETL" than "stream processor". Windowing and joins are weak, but transform, routing, and enrichment shine.

19. Estuary Flow — CDC Plus Streaming, Fused

Estuary Flow is a SaaS since 2023 handling CDC streaming. No Kafka required — direct source-to-sink streaming.

Features:

- **Gazette** (OSS log broker, not Kafka compatible) as the backbone.

- **Captures**: Postgres binlog, MongoDB oplog, MySQL, MSSQL CDC.

- **Materializations**: BigQuery, Snowflake, ClickHouse, Postgres sinks.

- **Derivations**: SQL or TypeScript transformations.

captures:

acme/source-postgres:

endpoint:

connector:

image: ghcr.io/estuary/source-postgres:dev

config:

address: pg:5432

database: app

user: estuary

password: secret

bindings:

- resource: { table: orders, namespace: public }

target: acme/orders

Estuary's identity: "**streaming without Kafka**." CDC -> transform -> warehouse in a single SaaS. Popular with teams building ClickHouse + ELT backends quickly.

20. Backpressure — Why Streaming Is Hard, Part 1

**Backpressure** is "what happens when consumers fall behind." Eighty percent of stream processing is getting this right.

Scenario: Kafka receives 100K events/sec. Flink transforms. Elasticsearch sinks. ES gets slow and only takes 50K/sec. Now what?

- The whole Flink operator chain stalls.

- The Kafka source eventually stops fetching -> consumer lag grows.

- Memory and disk fill up -> OOM risk.

Coping patterns:

1. **Natural backpressure (pull-based)**: Flink and Kafka Streams are pull-based and propagate backpressure automatically.

2. **Buffering**: Kafka itself is a huge buffer. Consumers can catch up without data loss.

3. **Async I/O**: Flink AsyncFunction parallelizes external calls.

4. **Drop / sample**: if you truly cannot keep up, drop events (fine for metrics, not for payments).

5. **Scale out**: more partitions, more consumers.

Flink's Web UI has a **backpressure tab** showing LOW/HIGH per operator. Persistent HIGH means that operator is the bottleneck.

21. Schema Evolution and the Schema Registry

In streaming, **schema compatibility** is operational core. Avro / Protobuf with a Schema Registry is the standard pattern.

curl -X POST http://schema-registry:8081/subjects/orders-value/versions \

-H "Content-Type: application/vnd.schemaregistry.v1+json" \

-d '{

"schema": "{ \"type\": \"record\", \"name\": \"Order\", \"fields\": [{\"name\":\"order_id\",\"type\":\"long\"}] }"

}'

curl -X PUT http://schema-registry:8081/config/orders-value \

-H "Content-Type: application/vnd.schemaregistry.v1+json" \

-d '{"compatibility": "BACKWARD"}'

Compatibility modes:

- **BACKWARD**: new schema can read old data (fields can be added, only removed if default exists).

- **FORWARD**: old schema can read new data (fields can be removed, additions ignored).

- **FULL**: both directions.

- **NONE**: no compatibility check (dangerous).

Recommendation: **BACKWARD** by default. Producers write the new schema, consumers upgrade gradually.

22. Korean Big Tech Stream Processing Adoption

From public engineering blogs and conference talks:

**Coupang** — Kafka, Flink, ksqlDB. Real-time pricing, anomaly detection, search-click tracking. Petabyte state via disaggregated backends. Flink-on-Kubernetes shared at Coupang Engineering Blog in 2024.

**NAVER Search** — Custom stream processing on Kafka Streams plus Flink SQL. Real-time search traffic stats, A/B-test metrics. In-house Schema Registry.

**Kakao Pay / Kakao Bank** — Flink with Kafka. Real-time transaction analytics, fraud detection, settlement streams. Flink CEP fraud patterns shared at if(kakao) 2025.

**NCsoft** — Kafka Streams (Java) plus Flink. Real-time matchmaking, log analysis, abuse detection. Game traffic demands sub-50ms P99.

**Baemin (Woowa Brothers)** — Kafka, Flink, Elasticsearch. Real-time delivery location, order state machines, anomaly patterns. Flink-on-K8s operations shared at WoowaCon 2024.

**Kakao Mobility** — Kafka Streams plus custom processing. Real-time matching, location tracking, dynamic pricing.

**Toss** — Kafka, Flink, Druid. Real-time payment monitoring, BI dashboards. A Materialize evaluation case shared at SLASH 2025 (no migration yet).

**NAVER Z (Zepeto)** — Kafka and Flink. Virtual space event processing, real-time behavior analysis.

**SK Telecom** — Kafka with Spark Structured Streaming, gradually migrating to Flink. Call quality and base station metrics.

23. Japanese Big Tech Stream Processing Adoption

**Mercari** — Kafka and Flink (some GCP Dataflow). Real-time fraud detection (heavy CEP), recommendations, search indexing. Flink CEP fraud patterns shared in 2025 on Mercari Engineering Blog.

**LINE Yahoo** — In-house Kafka, Flink, ksqlDB. Real-time push notification routing, trend detection, ad bidding. Flink-on-K8s case study at LY Tech Conf 2024.

**Rakuten** — Kafka, Spark Streaming moving to Flink, RisingWave under evaluation. Real-time pricing, recommendations. Global multi-region.

**SmartNews** — Kafka and Flink. Real-time news trending, user-behavior tracking. CEP for ad bidding.

**CyberAgent** — Kafka and Flink. Ad bidding, video watch analytics, real-time AbemaTV viewer metrics.

**ZOZO** — Kafka and Flink. Product search indexing, recommendations, behavior.

**DeNA** — Kafka, Spark Streaming, Flink. Game abuse detection, payment monitoring, marketing analytics.

**LINE Pay** — Flink with Kafka. Transaction monitoring, fraud detection.

**Mixi** — Kafka and Flink, especially in mobile games. Real-time event counters, matchmaking.

**NTT Communications** — Apache Beam on Dataflow. IoT telemetry.

24. Operations — Clusters, Monitoring, Debugging

Streaming job operations differ from OLTP/OLAP. They run 24/7, state accumulates, and a single mistake forces dreadful backfills.

Basic guidance:

1. **Separate clusters and separate jobs**: one OOM should not topple others.

2. **Checkpoint frequency**: 30s-1m is typical. Shorter = load, longer = slow recovery.

3. **State TTL is mandatory**: otherwise keyed state grows forever.

4. **Metrics**: Prometheus + Grafana for Flink JobManager/TaskManager metrics. Lag, checkpoint duration, backpressure, state size.

5. **Alerts**: consumer lag, checkpoint failures, restart counts, late events.

6. **Backfill strategy**: stateless jobs can replay from earliest Kafka offset; stateful uses savepoints.

7. **Blue/green deploys**: launch a new version under a different consumer group, compare results, then cut over.

Flink-on-Kubernetes is now the standard. The Apache Flink Kubernetes Operator is stable, Helm-installable.

helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.10.0/

helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator

Define jobs via FlinkDeployment CRD:

apiVersion: flink.apache.org/v1beta1

kind: FlinkDeployment

metadata:

name: my-streaming-job

spec:

image: flink:2.0

flinkVersion: v2_0

taskManager:

replicas: 4

resource: { memory: "4096m", cpu: 2 }

job:

jarURI: s3://my-jobs/my-job.jar

parallelism: 8

upgradeMode: savepoint

25. Cost Model — The Real Price of Stream Processing

Stream processing is expensive because it is **always on**. Batch finishes; streaming never does.

Cost drivers:

- **Compute**: TaskManager instances. Usually the largest line item.

- **State storage**: RocksDB on local SSD or S3 (ForSt).

- **Kafka inter-AZ transfer**: see Kafka pricing.

- **Checkpoint S3 PUT/GET**.

- **Monitoring and logs** (Datadog, NewRelic, etc.).

Reduction patterns:

1. **State TTL to cap RAM**.

2. **Disaggregated state (ForSt, RisingWave) to save SSD**.

3. **Autoscaling**: Flink Adaptive Scheduler resizes TM count with load.

4. **Spot / preemptible**: stateless operators are fine on spot.

5. **Confluent KIP-848 / Share Groups**: more efficient consumers.

6. **WarpStream + Flink**: state in ForSt, data on WarpStream -> only S3 costs.

Confluent Cloud Flink bills by Compute Pool and scales to zero when idle. Great for small teams, but hot jobs typically cost more than OSS Flink + EKS.

26. Closing — 2026 Stream Processing Recommendations

A long article. One-line takeaways:

- **Kafka-centric, Java team, just starting streaming**: **Kafka Streams + ksqlDB** (small) or **Flink SQL** (medium/large).

- **Complex CEP, large state, multi-source**: **Apache Flink 2.0** (DataStream + SQL).

- **App queries Postgres, need Materialized Views**: **RisingWave 2.x** (scale) or **Materialize** (complex SQL).

- **Python ML integration, small scale**: **Bytewax** or **Quix Streams**.

- **No-ops, GUI**: **Decodable** or **Confluent Cloud Flink**.

- **Simple ETL/transform (rare windows)**: **Redpanda Connect (Benthos)** or **Pulsar Functions**.

- **CDC plus streaming, unified**: **Estuary Flow** or **Debezium + Flink**.

- **Rust monolithic single binary**: **Arroyo**.

The biggest 2026 trend is the rise of **streaming databases**. RisingWave and Materialize challenge the assumption that "stream processing is a data-engineer thing." A backend engineer writes `CREATE MATERIALIZED VIEW` and gets a real-time view. Flink is still powerful, but the new question is: **does this even need Flink?**

The second trend is that **state is moving to cloud storage**. ForSt, RisingWave's S3-backed state, Arroyo's design — all in the same direction. Brokers become more stateless, data lives on S3. The disaggregation pattern that started in Kafka has reached the stream processor.

End of article. Stream processing in 2026 no longer has a single right answer. **More choices means more responsibility for choosing well.** Pick the tool that fits your workload.

References

- [Apache Flink 2.0 Release Notes](https://nightlies.apache.org/flink/flink-docs-release-2.0/)

- [Flink FLIP-435: Materialized Tables](https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Dynamic+Table+for+Simplifying+Data+Pipelines)

- [Flink ForSt (Disaggregated State)](https://github.com/ververica/ForSt)

- [Apache Kafka Streams Documentation](https://kafka.apache.org/40/documentation/streams/)

- [KIP-932: Queues for Kafka](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka)

- [ksqlDB Documentation](https://docs.ksqldb.io/)

- [RisingWave Documentation](https://docs.risingwave.com/)

- [Materialize Documentation](https://materialize.com/docs/)

- [Differential Dataflow](https://github.com/TimelyDataflow/differential-dataflow)

- [Arroyo Streams](https://www.arroyo.dev/)

- [Bytewax Documentation](https://bytewax.io/docs)

- [Apache Beam Documentation](https://beam.apache.org/documentation/)

- [Apache Pulsar Functions](https://pulsar.apache.org/docs/functions-overview/)

- [Quix Streams](https://quix.io/docs/quix-streams/)

- [Decodable Documentation](https://docs.decodable.co/)

- [Confluent Cloud Flink](https://docs.confluent.io/cloud/current/flink/overview.html)

- [Aiven for Apache Flink](https://aiven.io/flink)

- [Amazon MSF (Managed Service for Apache Flink)](https://docs.aws.amazon.com/managed-flink/)

- [Estuary Flow](https://docs.estuary.dev/)

- [Redpanda Connect (Benthos)](https://docs.redpanda.com/redpanda-connect/)

- [Memphis.dev / Superstream](https://memphis.dev/)

- [Confluent Schema Registry](https://docs.confluent.io/platform/current/schema-registry/index.html)

- [Streaming Systems book by Tyler Akidau](https://www.oreilly.com/library/view/streaming-systems/9781491983867/)

- [Mercari Engineering Blog](https://engineering.mercari.com/en/blog/)

- [Coupang Engineering Blog](https://medium.com/coupang-engineering)

- [LINE Yahoo Engineering](https://techblog.lycorp.co.jp/en/)

현재 단락 (1/610)

A fintech meeting room in 2026.

작성 글자: 0원문 글자: 33,618작성 단락: 0/610