Modern Hadoop & Big Data Ecosystem 2026 Deep Dive - Hadoop 3.4 · Spark 4 · Hive 4 · Kafka 4 · Flink 2 · Iceberg · Trino

Prologue — In 2026, Hadoop did not die. It was relocated

"Hadoop is dead" has been an annual headline since the early 2020s. The actual landscape in 2026 is much more nuanced than the slogan.

  • Almost nobody is building a new HDFS·YARN·MapReduce cluster from scratch. That part is true.
  • Yet Spark, Hive, Iceberg, Kafka, Flink — every tool Hadoop spawned is stronger than ever. The stage they run on simply moved from HDFS to S3/GCS/ADLS and from YARN to Kubernetes.
  • Cloudera Data Platform 7 is alive and well, still huge in on-premises finance, public sector and telecom. MapR was absorbed into HPE and reborn as HPE Ezmeral.
  • Lakehouse architecture is the real heir to Hadoop's "one place for all your data" promise.

This post unfolds the entire landscape in one piece. Hadoop 3.4 and its successor components, the real 2026 specs of Spark 4 · Hive 4 · Kafka 4 · Flink 2, the table format war between Iceberg · Delta · Hudi, the OLAP scene of Trino · Presto C++ · DuckDB · ClickHouse · StarRocks · Druid · Pinot, and the actual adoption patterns at Korean and Japanese companies.

One-line takeaway: "Hadoop is no longer infrastructure — it is vocabulary." Concrete components like HDFS/YARN/MR are fading, but the concepts of distributed processing, table formats, streaming and catalogs survive and are being reassembled on top of the cloud.


1. The Big Picture — 20 years of the Hadoop ecosystem

Doug Cutting started Hadoop in 2006. As of 2026, the tools that grew on top of it compress into this table.

Era | Keyword | Representative tools
2006~2012 | HDFS + MapReduce | Hadoop 1.x, Hive 0.x, Pig
2013~2016 | YARN + in-memory | Hadoop 2, Spark, Tez, Impala
2017~2020 | Cloud migration | EMR, Dataproc, HDInsight
2021~2024 | Lakehouse | Iceberg, Delta, Hudi, Trino, dbt
2025~2026 | Modular · stream-first | Kafka 4, Flink 2, Polaris, Lakekeeper

People often define "Hadoop" narrowly as HDFS + YARN + MapReduce, but in working conversations "the Hadoop ecosystem" has become vocabulary for every distributed data tool that grew up next to it. This article follows that broader sense.

Three shifts dominate every 2026 design decision:

  1. Storage split from HDFS into object storage (S3/GCS/ADLS) plus Ozone.
  2. Resource management moved quickly from YARN to Kubernetes.
  3. Table abstraction evolved from the Hive Metastore into open table formats like Iceberg, Delta and Hudi.

These three axes drive nearly every architectural decision in 2026.


2. Hadoop 3.4 and 3.5 — The story is Erasure Coding and Ozone

Apache Hadoop's current stable line is 3.4.x, with 3.5 running as beta in early 2026. Even if you are not building new clusters, this line still matters for any team running existing on-prem environments.

Two changes define the 3.x line.

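First, Erasure Coding (EC). HDFS traditionally relied on 3x replication for durability, so storing 1 PB of data needed 3 PB of disk. EC splits data into k data blocks plus m parity blocks and can rebuild any m lost blocks from the rest. With RS-10-4 (10 data + 4 parity) the same 1 PB needs about 1.4 PB of disk, which is 40% overhead instead of 200% and roughly 53% less raw storage than 3x replication, while tolerating four lost blocks per stripe instead of two lost replicas. The trade-off is more CPU and network for encode, decode and reconstruction, and EC is unsuitable for small files. So most clusters apply EC to cold tiers first.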
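
The arithmetic behind that comparison is easy to check. The sketch below is illustrative only: the function names are made up for this post, and it ignores real-world effects such as partially filled stripes. RS-3-2, RS-6-3 and RS-10-4 are the Reed-Solomon policies that ship with HDFS.

```python
def raw_bytes_replicated(logical: float, replicas: int = 3) -> float:
    """Raw capacity needed when every block is stored `replicas` times."""
    return logical * replicas


def raw_bytes_erasure_coded(logical: float, data_units: int, parity_units: int) -> float:
    """Raw capacity needed for a Reed-Solomon (data_units, parity_units) layout."""
    return logical * (data_units + parity_units) / data_units


def saving_vs_replication(data_units: int, parity_units: int, replicas: int = 3) -> float:
    """Fraction of raw capacity saved by EC relative to N-way replication."""
    ec = raw_bytes_erasure_coded(1.0, data_units, parity_units)
    return 1.0 - ec / raw_bytes_replicated(1.0, replicas)


if __name__ == "__main__":
    for name, k, m in [("RS-3-2", 3, 2), ("RS-6-3", 6, 3), ("RS-10-4", 10, 4)]:
        raw = raw_bytes_erasure_coded(1.0, k, m)
        print(f"{name}: {raw:.2f} PB raw per 1 PB logical, "
              f"{saving_vs_replication(k, m):.0%} less than 3x replication")
```

Wider stripes (larger k) lower the overhead but make reconstruction touch more nodes, which is one reason cold data is the usual first target.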

Second, Apache Ozone. Originally a Hadoop subproject, Ozone is object storage that exposes an S3 API while preserving HDFS-class throughput and strong consistency. The classic NameNode single-point bottleneck was solved with a distributed Ozone Manager plus Storage Container Manager design. It is becoming the practical HDFS successor in on-prem environments that need to handle billions of files.

Other notable 3.4 changes:

  • YARN GPU/FPGA resource model stabilized. ML workloads now sit naturally on YARN.
  • HDFS Router-based Federation matured. Multiple NameNodes hide behind a router for horizontal scale.
  • Java 11 runtime supported, with Java 17 support targeted at the 3.5 line.
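
To make the Router-based Federation idea concrete, here is a minimal sketch of how a router could pick a NameNode cluster for a path by longest-prefix match against a mount table. This is a toy model, not the HDFS implementation: the mount points, nameservice IDs and the `resolve_nameservice` helper are all hypothetical.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Hypothetical mount table: path prefix -> nameservice (NameNode cluster) ID.
MOUNT_TABLE = {
    "/": "ns-default",
    "/logs": "ns-logs",
    "/warehouse": "ns-warehouse",
    "/warehouse/finance": "ns-finance",
}


def resolve_nameservice(path: str, mount_table: dict[str, str]) -> str:
    """Return the nameservice for `path` using longest-prefix matching."""
    target = PurePosixPath(path)
    best: PurePosixPath | None = None
    for mount in mount_table:
        candidate = PurePosixPath(mount)
        # Match on whole path components, so "/logsx" does not match "/logs".
        if candidate == target or candidate in target.parents:
            if best is None or len(candidate.parts) > len(best.parts):
                best = candidate
    if best is None:
        raise LookupError(f"no mount point covers {path}")
    return mount_table[str(best)]


if __name__ == "__main__":
    for p in ["/warehouse/finance/2026/q1", "/warehouse/sales", "/tmp/scratch"]:
        print(p, "is served by", resolve_nameservice(p, MOUNT_TABLE))
```

Clients keep using one namespace while each subtree is served by a different NameNode, which is where the horizontal scale comes from.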

Brand-new Hadoop installs are rare, but the flow of existing telco · finance · public · research clusters migrating to 3.4 is very real.


3. "Is Hadoop dead?" — Sorting fact from slogan

The answer depends on what we call Hadoop.

Narrow Hadoop (HDFS + YARN + MapReduce core):

  • New adoption: effectively flat. New SaaS and startups start on cloud object storage from day one.
  • Existing operations: still enormous. Thousands of CDP 7 clusters keep running worldwide, and in Korea telecoms, banks, public agencies and game companies still rely on them.
  • New feature development: slowing. The energy went into companion components like Ozone and Erasure Coding rather than the core.

Wide Hadoop ecosystem (Spark/Hive/Kafka/Flink/Iceberg/...):

  • Exploding. Spark downloads keep setting records. Iceberg is converging on standardization. Kafka has finished the KRaft transition.
  • And these components no longer need to run on HDFS. They run on S3/GCS/ADLS, on Kubernetes, on EMR/Dataproc/HDInsight/Databricks/Snowflake.

Corporate situation:

  • Cloudera and Hortonworks completed their merger in 2019, and the combined product line became CDP 7. KKR and CD&R took the company private in 2021 for around 5.3 billion dollars. Revenue is still large.
  • MapR was acquired by HPE in 2019 and rebranded into HPE Ezmeral Data Fabric by 2024.
  • Pivotal's Hadoop line is gone. Only the separated Greenplum line remains.
  • DataStax · Confluent · Databricks — the "post-Hadoop" companies — now carry the larger market influence.

Conclusion: new HDFS/YARN/MR adoption has slowed, but distributed data tools speaking Hadoop's vocabulary are more active than ever. The gap between slogan and reality is exactly where this article starts.


4. HDFS + Ozone — The bridge to object storage

HDFS is block-based, strongly consistent and POSIX-like. S3 is object-based, originally eventually consistent (made strongly consistent in 2020) and HTTP-based. The two models look distant, but a key mid-2020s development was that HDFS users now have a smooth path to S3-compatible APIs.

Three representative tools live in this space:

  • Apache Ozone — Hadoop subproject. S3-compatible object storage plus HDFS-like throughput.
  • MinIO — Self-hosted S3-compatible object storage. Kubernetes-native.
  • JuiceFS — A metadata layer that puts a POSIX-compatible filesystem on top of object storage.

All three sit on the same trend: looks like a filesystem on the outside, object storage on the inside. Compute engines like Spark, Hive and Trino abstract over URI schemes such as hdfs://, s3a://, ofs:// and jfs://, so the same code keeps running.
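
As a rough illustration of that scheme-based abstraction, the sketch below parses a table location and reports which backend family it points at. It only uses the standard library; the `BACKENDS` mapping and `describe_location` helper are assumptions made for this example, whereas real engines resolve schemes through their filesystem plugins.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Scheme -> storage backend named in the text. The labels are descriptive only.
BACKENDS = {
    "hdfs": "HDFS",
    "s3a": "S3-compatible object storage (Hadoop S3A connector)",
    "ofs": "Apache Ozone",
    "jfs": "JuiceFS",
}


def describe_location(uri: str) -> dict[str, str]:
    """Split a table location into backend, authority and path."""
    parts = urlsplit(uri)
    if parts.scheme not in BACKENDS:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "backend": BACKENDS[parts.scheme],
        "authority": parts.netloc,  # NameNode address, bucket, volume, ...
        "path": parts.path,
    }


if __name__ == "__main__":
    for location in ["hdfs://nn1:8020/warehouse/orders",
                     "s3a://lake/warehouse/orders"]:
        print(describe_location(location))
```

The job logic only ever sees the path; moving a table from HDFS to object storage becomes a change of location string plus connector configuration.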

In Korea this movement is represented by Kakao i Cloud Object Storage, Naver Cloud Object Storage and NHN Cloud Object Storage. In Japan the big telcos (NTT Docomo, KDDI, SoftBank) all run their own object stores.


5. Apache Spark 3.5 and 4.0 — ANSI mode, Spark Connect, English SDK

Spark is still the de facto standard for distributed data processing in 2026. The 3.5.x line remains widely deployed, while 4.0, previewed in June 2024 and released as GA in May 2025, is gaining real adoption through 2026.

Five core changes in 4.0.

1. ANSI mode on by default. Spark SQL was historically forgiving compared to PostgreSQL or Snowflake: bad casts silently returned NULL and integer overflows silently wrapped around. From 4.0 ANSI mode (spark.sql.ansi.enabled) is the default, so unless you explicitly turn it off, such cases raise errors under strict SQL-standard checks. This is the most common migration breakage.

2. Spark Connect stabilized. A gRPC-based architecture that splits the client and server. Notebooks no longer need a giant Spark driver JVM; a thin client connects to a remote cluster. Databricks Connect and JetBrains Big Data Tools both ride on this.

3. PySpark Python UDFs stabilized and Pandas API on Spark. The work that began as Koalas merged into Pandas API on Spark and is compatible with Pandas 2.x in 4.0.

4. Streaming improvements. Structured Streaming's stateful processing API was cleaned up. A unified transformWithState API treats RocksDB, in-memory and other state backends with one abstraction.

5. English SDK preview. "Write PySpark in natural language." It is experimental, but LLM integration is clearly a 4.x direction.
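
The ANSI-mode change in item 1 is the one most likely to bite during migration, so here is a toy model of the behavioral difference. It is plain Python, not Spark code: `toy_cast_to_int` and `toy_add_int` are hypothetical helpers that mimic how a lenient engine and a strict engine treat the same bad input. In real Spark the switch is the `spark.sql.ansi.enabled` configuration.

```python
from __future__ import annotations

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def toy_cast_to_int(value: str, ansi: bool) -> int | None:
    """CAST(value AS INT): lenient mode yields None (NULL), strict mode raises."""
    try:
        return int(value.strip())
    except ValueError:
        if ansi:
            raise ValueError(f"invalid input for type INT: {value!r}")
        return None


def toy_add_int(a: int, b: int, ansi: bool) -> int:
    """32-bit addition: lenient mode wraps around, strict mode raises."""
    result = a + b
    if INT_MIN <= result <= INT_MAX:
        return result
    if ansi:
        raise OverflowError("integer overflow")
    # Two's-complement wraparound, as in unchecked JVM int arithmetic.
    return (result - INT_MIN) % 2**32 + INT_MIN


if __name__ == "__main__":
    print(toy_cast_to_int("abc", ansi=False))      # lenient: NULL-like None
    print(toy_add_int(INT_MAX, 1, ansi=False))     # lenient: wraps to INT_MIN
    try:
        toy_cast_to_int("abc", ansi=True)
    except ValueError as exc:
        print("strict mode error:", exc)
```

A pipeline that quietly relied on the lenient results will start failing after the upgrade, so dirty-input columns are worth auditing first.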

Databricks' Photon engine is a separate story. It is a Spark API-compatible vectorized executor written in C++, available only in Databricks Runtime — a different track from open-source Spark.


6. Apache Hive 4.0 — The SQL standard that never died, now embracing Iceberg

Hive was once "SQL for Hadoop." It looked like it would fade behind Spark SQL and Presto in the late 2010s, but Hive 4.0 (GA in spring 2024) sent a clear revival signal.

Key changes in Hive 4.0:

  • Native Iceberg integration. A single CREATE TABLE ... STORED BY ICEBERG produces an Iceberg table, and SQL-level migration between Hive and Iceberg tables is supported.
  • LLAP (Live Long And Process) stabilized. Long-running daemons cut query startup latency dramatically, making short interactive queries practical.
  • Tez as the default execution engine. The old MapReduce engine is deprecated; Tez DAG is the standard.
  • ACID transactions v2. Merge, Update and Delete are stable and compaction is automatic.

Hive is still alive where it makes sense: massive batch jobs, very-large-metadata warehouses, organizations with large existing Hive UDF investments. In 2026 many telcos, banks and game companies in Korea, Japan and China still run Hive as a core SQL engine.


7. Apache Kafka 4.0 — The end of the Zookeeper era

Kafka is one of the strongest single components in the 2026 big-data ecosystem. Kafka 4.0 (GA in March 2025) cemented that position.

Key changes:

  • KRaft only, ZooKeeper removed. Kafka long required ZooKeeper as a dependency. KRaft (Kafka Raft metadata mode) was declared production-ready in Kafka 3.3 (2022), ZooKeeper mode was deprecated in 3.5, and 4.0 removes it entirely. Operations are dramatically simpler.
  • Tiered storage stabilized. Hot data sits on local disk, cold data automatically tiers to S3-style object storage. Storage cost for long-retention topics drops sharply.
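  • Queue semantics introduced. Topics were traditionally publish-subscribe, with parallelism capped by partition count inside each consumer group. KIP-932 adds the "Share Group" concept, shipped as an early-access feature in 4.0, which brings real queue patterns to Kafka. Some RabbitMQ-style workloads may migrate.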
  • Kafka Streams and ksqlDB moved with it. Confluent has been consolidating ksqlDB's open-source line toward Confluent Cloud.
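
The difference between a classic consumer group and a share group is easiest to see with a toy model. The sketch below is not Kafka client code; both functions are hypothetical and use simple round-robin, whereas a real broker tracks per-record acquisition and acknowledgement for share groups.

```python
from __future__ import annotations


def consumer_group_assignment(partitions: list[str], consumers: list[str]) -> dict[str, list[str]]:
    """Classic group: each partition belongs to exactly one consumer."""
    assignment: dict[str, list[str]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


def share_group_delivery(records: list[str], consumers: list[str]) -> dict[str, list[str]]:
    """Share group (toy): individual records are handed out across consumers."""
    delivery: dict[str, list[str]] = {c: [] for c in consumers}
    for i, record in enumerate(records):
        delivery[consumers[i % len(consumers)]].append(record)
    return delivery


if __name__ == "__main__":
    consumers = ["c1", "c2", "c3", "c4"]
    # Two partitions, four consumers: two consumers sit idle.
    print(consumer_group_assignment(["p0", "p1"], consumers))
    # Same four consumers in a share group: every consumer gets work.
    print(share_group_delivery([f"r{i}" for i in range(8)], consumers))
```

With a classic group, scaling consumers past the partition count buys nothing; queue-style delivery removes that ceiling at the cost of per-partition ordering guarantees.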

Kafka also has stronger rivals.

  • Apache Pulsar 4.0 — Separation of compute and storage via BookKeeper. Strong on multi-tenancy and geo-replication.
  • Redpanda 24.x — A Kafka API-compatible broker written from scratch in C++. No ZooKeeper and no JVM, shipped as a single binary.
  • WarpStream — A "Kafka on S3" variant, acquired by Confluent.

If you are picking a 2026 queue/stream backbone, "Kafka + KRaft + tiered storage" or "Redpanda single binary" are the two most common choices.


8. Apache Flink 2.0

Flink looked like it was losing ground to Spark Streaming at one point, but its "stream-first" philosophy won the spaces that need real low-latency, exactly-once processing. Flink 2.0 (GA in March 2025) pushed that lead further.

Highlights of Flink 2.0:

  • Disaggregated state storage. Traditionally Flink kept RocksDB state on local disk attached to task managers. In Kubernetes and cloud environments, restoring state on node replacement is expensive. 2.0 officially supports placing state on object storage (such as S3) with cache and local SSDs in front.
  • ForSt — A RocksDB-based state engine built specifically for Flink. The LSM tree is tuned for object storage friendliness.
  • Materialized Tables. Declare in SQL that "this result should be kept fresh in real time" and Flink maintains it as a background streaming job. dbt-style declarative modeling on top of Flink streaming.
  • Flink CDC matured into a separate subproject — an alternative to Debezium.
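
The disaggregated-state idea from the first bullet can be sketched as a durable remote store with a small local cache in front. This is a toy model under stated assumptions, not Flink or ForSt code: the `TieredState` class is hypothetical, a dict stands in for object storage, and writes go straight through to the remote store.

```python
from __future__ import annotations

from collections import OrderedDict


class TieredState:
    """Keyed state whose source of truth is remote, with an LRU cache locally."""

    def __init__(self, remote: dict[str, int], cache_size: int) -> None:
        self.remote = remote            # stands in for S3-style object storage
        self.cache: OrderedDict[str, int] = OrderedDict()
        self.cache_size = cache_size
        self.remote_reads = 0

    def _cache_put(self, key: str, value: int) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used entry

    def get(self, key: str) -> int | None:
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        self.remote_reads += 1
        value = self.remote.get(key)
        if value is not None:
            self._cache_put(key, value)
        return value

    def put(self, key: str, value: int) -> None:
        self.remote[key] = value        # write-through keeps remote authoritative
        self._cache_put(key, value)


if __name__ == "__main__":
    object_store: dict[str, int] = {}
    worker = TieredState(object_store, cache_size=2)
    for user, clicks in [("u1", 3), ("u2", 5), ("u3", 7)]:
        worker.put(user, clicks)
    # A replacement worker starts with an empty cache but loses no state.
    replacement = TieredState(object_store, cache_size=2)
    print(replacement.get("u1"), "after", replacement.remote_reads, "remote read")
```

Because the local disk is only a cache, replacing a node means warming a cache rather than copying the whole state back.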

The streaming SQL landscape in 2026 organizes as follows.

Tool | Model | Strength
Flink SQL | SQL over DataStream | Exactly-once, rich state
ksqlDB | SQL over Kafka Streams | Tight Kafka integration
Materialize | SQL views over external systems | OLTP-friendly, Differential Dataflow
RisingWave | New streaming DB | PostgreSQL-compatible
Arroyo | New Rust streaming SQL | Cloud-native

9. YARN vs Kubernetes — A generational shift in resource managers

After 2013 YARN became the de facto resource manager for distributed data workloads. Across the 2020s Kubernetes quickly took that throne.

Aspect | YARN | Kubernetes
Target workloads | Mostly big data | General (web/ML/data)
Containers | Custom (LXC/cgroups) | OCI
Resource model | CPU/memory/GPU | Rich and extensible
Multi-tenancy | Queue-based | Namespace-based
Ecosystem | Hadoop-centric | All of CNCF
Learning curve | Familiar to big-data teams | Familiar to platform teams

Standard patterns for big data on K8s:

  • Spark on Kubernetes — Officially supported. The Spark Operator (started at Google, now maintained under Kubeflow) and plain spark-submit against Spark's native Kubernetes scheduler backend are the two paths.
  • Flink Kubernetes Operator — Official since Flink 1.15+.
  • Kafka via Strimzi Operator.
  • YuniKorn — Apache YuniKorn brings YARN-like queue, share and policy scheduling to Kubernetes.

YARN still has a place — at sites running CDP, at organizations with one massive cluster dedicated to big-data, in domains with very strong security and isolation needs. Yet the first choice for any new platform is almost always Kubernetes.


10. The table format war — Iceberg, Delta, Hudi

Adding ACID transactions, schema evolution and time travel to a data lake is the promise of an open table format. In 2026 the landscape compresses to three candidates.
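
What all three formats share is the idea that a table is a chain of immutable snapshots, each listing the data files that make up the table at that point. The toy below shows how that single idea yields atomic commits, optimistic concurrency and time travel. It is a sketch, not any format's real metadata layout: `ToyTable` and its methods are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple[str, ...]


class ToyTable:
    """A table is a list of immutable snapshots; the last one is current."""

    def __init__(self) -> None:
        self._snapshots = [Snapshot(0, ())]

    @property
    def current(self) -> Snapshot:
        return self._snapshots[-1]

    def commit(self, base_id: int, added: tuple[str, ...] = (),
               removed: tuple[str, ...] = ()) -> Snapshot:
        # Optimistic concurrency: the commit only succeeds if nobody else
        # committed since the writer read snapshot `base_id`.
        if base_id != self.current.snapshot_id:
            raise RuntimeError("commit conflict: table changed, retry on new base")
        kept = tuple(f for f in self.current.files if f not in removed)
        snapshot = Snapshot(base_id + 1, kept + added)
        self._snapshots.append(snapshot)  # the atomic pointer swap
        return snapshot

    def files_as_of(self, snapshot_id: int) -> tuple[str, ...]:
        """Time travel: read the file list of an older snapshot."""
        return self._snapshots[snapshot_id].files


if __name__ == "__main__":
    table = ToyTable()
    table.commit(0, added=("a.parquet", "b.parquet"))
    table.commit(1, added=("ab.parquet",), removed=("a.parquet", "b.parquet"))
    print("current:", table.current.files)
    print("as of snapshot 1:", table.files_as_of(1))
```

Readers never see a half-written table because they only ever follow a committed snapshot, and old snapshots stay readable until they are expired.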

Apache Iceberg 1.7 and 1.8

  • Standardizing fastest. AWS, Snowflake, Databricks, Google and Cloudera all support it.
  • The REST catalog specification is now standardized, making engine independence real.
  • Iceberg-Rust — A Rust implementation has matured. Light engines like DuckDB, Polars and Trino C++ read Iceberg directly.
  • Catalog options abound — Apache Polaris (donated by Snowflake), Lakekeeper (Rust-based), Apache Gravitino (data catalog).

Delta Lake 4.0

  • Format from Databricks, a Linux Foundation project since 2019. Databricks open-sourced the remaining proprietary features with Delta Lake 2.0 in 2022.
  • Delta Universal Format (UniForm) lets a Delta write be readable as Iceberg and Hudi.
  • Delta Sharing for safely sharing tables.
  • Strengths: Databricks integration and the most mature ACID implementation.

Apache Hudi 1.0

  • Started at Uber. Specialized in upserts and CDC.
  • Two modes: Merge-on-Read and Copy-on-Write.
  • Strong catalog integration and indexing.

Selection guide:

  • Engine independence matters most → Iceberg.
  • Databricks-centric → Delta.
  • Frequent upserts and real-time CDC are core → Hudi.

The three formats are converging. As of 2026 it is common to use mirror/conversion tools so the same data can be read as both Iceberg and Delta.


11. The catalog era — Polaris, Lakekeeper, Gravitino, Unity

Once table formats became standard, the next battleground is catalogs. A catalog manages "which tables exist where, what schema do they have and who can access them."

  • Apache Polaris — An Iceberg REST catalog implementation donated by Snowflake in 2024. Cloud-neutral by design.
  • Lakekeeper — An Iceberg REST catalog written in Rust. Lightweight and Kubernetes-friendly.
  • Apache Gravitino — A meta-catalog from Datastrato that fronts Iceberg, Delta, Hudi and Hive with one interface.
  • Unity Catalog — Databricks' catalog. Partially open-sourced in 2024 to manage data and AI assets together.
  • Nessie — A Git-like data catalog from Dremio, with branches, tags and merges for data versioning.
  • AWS Glue Data Catalog, Google Dataplex — Catalogs from the cloud providers.

Three core questions when picking one:

  1. Does it follow the Iceberg REST spec?
  2. Are RBAC/ABAC permissions rich enough?
  3. Does it span data assets and AI assets (models, features, notebooks) as well?

In 2026 most teams pick between Polaris, Lakekeeper and Unity based on infrastructure fit.


12. Trino, Presto, DuckDB — Three classes of query engines

SQL engines that run on the lake split into three classes in 2026.

Trino 460+ — The former PrestoSQL fork, commercialized by Starburst. Petabyte-scale federated SQL.

  • Fault-tolerant execution (FTE) — partial restart for large queries when nodes fail, with exchange data on external storage.
  • Joins Pinot, Druid, Iceberg, Delta and Hudi in one SQL.
  • Velocity is very high; it passed version 460 in late 2024.

Presto / PrestoDB

  • The original Presto, transferred to the Linux Foundation in 2019. Meta is the main contributor.
  • Since 2024 Presto C++ (on Velox) is the main line. Meta itself runs data on Presto C++.

DuckDB 1.x

  • An embedded analytic DB. A single binary. Reads Pandas, Polars, R, CSV, Parquet and Iceberg through SQL.
  • A new category — runs inside notebooks, inside Lambdas, at the edge.
  • The flagship of the paradigm shift: "a query engine no longer needs to be a huge cluster."

MotherDuck — A company wrapping DuckDB with cloud hosting. Local plus cloud hybrid.

Velox — A C++ vectorized execution engine from Meta. It aims to be the common backend across Presto C++, Spark Gluten and Verdict.

Picking depends on data scale and execution model.

  • Huge federated SQL → Trino.
  • Big-tech internal standard → Presto C++.
  • Single node · embedded · notebook → DuckDB.

13. The OLAP landscape — Druid, Pinot, ClickHouse, StarRocks

The 2026 landscape for real-time analytical (OLAP) databases.

Apache Druid — Time-series and event analytics. Started at Metamarkets in 2011. Real-time ingest plus pre-aggregation.

Apache Pinot — Started at LinkedIn. Real-time ingest plus user-facing analytics. Battle-tested at Uber, LinkedIn and Stripe scales.

ClickHouse 24.x — Columnar OLAP DB from Yandex (now ClickHouse Inc.). Single-node performance is dominant. ClickHouse Cloud launched in 2022 and grew fast.

StarRocks — Forked from Apache Doris in 2020 and now a Linux Foundation project, with CelerData as its commercial backer. MPP OLAP that queries Iceberg, Hudi and Delta directly.

Apache Doris — MPP OLAP that started at Baidu. Strong in China and East Asia.

What unites them is the simultaneous pursuit of three properties:

  1. Sub-second response.
  2. Scans of billions of rows.
  3. Direct user-facing dashboards (not just BI — the app backend itself).

Selection guide:

  • Single cluster · complex queries → ClickHouse.
  • Real-time ingest · user-facing → Pinot, Druid.
  • Querying directly on Iceberg/lake → StarRocks, Trino.
  • One notebook · embedded → DuckDB.

14. dbt 1.9 and dbt Mesh — The transformation-layer standard

dbt (data build tool) turned the simple idea "model data with SQL" into the standard transformation layer of the modern data stack.

Key items in dbt 1.9:

  • dbt Mesh — In large organizations, split dbt projects by domain and expose each other's models through contracts.
  • Group/Access — Group models and explicitly declare which are externally visible.
  • Semantic Layer became a flagship dbt Cloud product. Metric definitions live in one place.
  • dbt Fusion (announced 2025) — A Rust rewrite of the Python-based compiler. Execution speed improved significantly.

Rivals of dbt:

  • SQLMesh — From Tobiko Data. Redesigns around dbt's limits (Jinja, execution model).
  • Coalesce — A GUI-based transformation tool.
  • The 2024 licence split between dbt Cloud and dbt Core made waves, but the core remains open.

15. Orchestration — Airflow 3.0, Dagster, Prefect, Argo

The landscape of orchestrators that schedule and manage data pipelines.

Apache Airflow 3.0 (GA in April 2025)

  • Biggest change: task isolation stabilized. Tasks running in separate containers is now the standard.
  • The Task SDK is separated, making worker dependencies much lighter.
  • Key operations features such as multiple DAG versions are cleaned up.
  • Still the largest community.

Dagster 1.9

  • Software-Defined Assets — DAG nodes are defined by "which data they are responsible for producing." A different mental model from Airflow.
  • Asset catalog, data quality and backfills are first-class.
  • Increasingly viewed as a first-class data-platform tool.
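
The asset-centric mental model can be illustrated without Dagster itself. In the sketch below, each asset declares which upstream assets it is built from, and the plan for materializing one asset falls out of the dependency graph. The asset names and helper functions are hypothetical; only the standard-library `graphlib` module is used.

```python
from __future__ import annotations

from graphlib import TopologicalSorter

# Hypothetical assets: name -> set of upstream assets it is derived from.
ASSET_DEPS: dict[str, set[str]] = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "customer_ltv": {"clean_orders", "raw_customers"},
    "daily_revenue": {"clean_orders"},
}


def upstream_of(asset: str, deps: dict[str, set[str]]) -> set[str]:
    """The asset itself plus everything it transitively depends on."""
    needed: set[str] = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        if current in needed:
            continue
        needed.add(current)
        stack.extend(deps[current])
    return needed


def materialization_plan(asset: str, deps: dict[str, set[str]]) -> list[str]:
    """Order in which to build `asset` and its upstream assets."""
    needed = upstream_of(asset, deps)
    subgraph = {name: deps[name] & needed for name in needed}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(materialization_plan("customer_ltv", ASSET_DEPS))
```

Asking "what must run so that customer_ltv is fresh" is a query over data, whereas a task-centric scheduler asks "which task runs after which".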

Prefect 3.x

  • Very "Pythonic" ergonomics. A single decorator turns a function into a task.
  • Strong on dynamic workflows.

Argo Workflows

  • Kubernetes-native. Excellent at container workflows such as CI/CD.
  • Also widely used for data pipelines.

Mage — Notebook plus orchestration. No-code / low-code.

Kestra — YAML/JS-based orchestrator with a Java backend.

Selection guide:

  • Large organization · existing assets → Airflow 3.
  • Asset-centric · data platform → Dagster.
  • Lightweight dynamic workflows → Prefect.
  • Container · K8s focus → Argo.

16. Change Data Capture (CDC)

The 2026 view of Change Data Capture (CDC): syncing operational databases into the lake or warehouse in near-real time.

Debezium 3

  • Led by Red Hat. Reads logs from PostgreSQL, MySQL, MongoDB, Oracle, SQL Server, Db2 and emits them to Kafka.
  • 3.x improved operations, memory usage and schema evolution significantly.

Flink CDC

  • Brings Debezium's core into Flink itself. DB → Flink → Iceberg without a separate Kafka.

Estuary Flow — A managed CDC SaaS.

Airbyte — Open-source ELT. Supports CDC but is stronger on batch connectors.

Fivetran — The most mature commercial CDC/ELT.

Sequin — PostgreSQL-specialist CDC. Outputs to webhooks or Kafka.

Typical architecture:

[PostgreSQL] → [Debezium] → [Kafka] → [Flink/Spark] → [Iceberg] → [Trino/Spark SQL]
                    or
[PostgreSQL] → [Flink CDC] → [Iceberg]

Changes in the operational DB landing in the lake within 1~30 seconds is the 2026 default.
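
On the consuming side, every variant of this architecture boils down to replaying change events onto a keyed table. The sketch below applies Debezium-style events, whose envelope carries an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) plus `before` and `after` row images. The sample rows and the assumption that `id` is the primary key are made up for illustration.

```python
from __future__ import annotations

Row = dict[str, object]


def apply_change(table: dict[object, Row], event: dict) -> None:
    """Apply one change event to an in-memory table keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]
        table[row["id"]] = row          # upsert: insert or overwrite
    elif op == "d":
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


def replay(events: list[dict]) -> dict[object, Row]:
    table: dict[object, Row] = {}
    for event in events:
        apply_change(table, event)
    return table


# Hypothetical events for an `orders` table.
EVENTS = [
    {"op": "r", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

if __name__ == "__main__":
    print(replay(EVENTS))
```

Treating creates and updates as upserts keeps the apply step idempotent, which matters because most CDC pipelines deliver at least once.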


17. Lakehouse architecture — The return of the single source of truth

Lakehouse means combining the cheap raw storage of a data lake with the ACID, schema and performance of a warehouse. The standard 2026 architecture compresses as follows.

Bronze · Silver · Gold medallion architecture:

Layer | Shape | Tools
Bronze (raw) | CDC/events as-is | Kafka, Iceberg, S3
Silver (cleansed) | Standardized and deduped | Spark, Flink, dbt
Gold (serving) | Aggregated, domain models | dbt, Spark, Trino

Storage is object storage plus Iceberg/Delta/Hudi tables. Compute is Spark, Flink, Trino, DuckDB and similar — they share the same tables. A catalog (Polaris/Unity/Gravitino) manages where tables live and who may see them.
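
The three medallion layers can be shown end to end on a handful of records. This is a deliberately tiny sketch in plain Python; in practice each step would be a Spark, Flink or dbt job over Iceberg tables, and the event fields used here are invented for the example.

```python
from __future__ import annotations

from collections import defaultdict
from decimal import Decimal

# Bronze: events exactly as they arrived, including a duplicate delivery.
BRONZE = [
    {"event_id": "e1", "customer": " Alice ", "amount": "10.50"},
    {"event_id": "e2", "customer": "bob", "amount": "4.00"},
    {"event_id": "e1", "customer": " Alice ", "amount": "10.50"},
    {"event_id": "e3", "customer": "ALICE", "amount": "2.25"},
]


def to_silver(bronze: list[dict]) -> list[dict]:
    """Deduplicate by event_id and standardize names and amounts."""
    seen: set[str] = set()
    silver = []
    for event in bronze:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        silver.append({
            "event_id": event["event_id"],
            "customer": event["customer"].strip().lower(),
            "amount_cents": int(Decimal(event["amount"]) * 100),
        })
    return silver


def to_gold(silver: list[dict]) -> dict[str, int]:
    """Aggregate to a serving model: total spend per customer, in cents."""
    totals: dict[str, int] = defaultdict(int)
    for row in silver:
        totals[row["customer"]] += row["amount_cents"]
    return dict(totals)


if __name__ == "__main__":
    print(to_gold(to_silver(BRONZE)))
```

Keeping bronze untouched means silver and gold can always be rebuilt when a cleansing rule changes.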

The promise of the architecture is simple: one copy of the data, many engines, unified permissions. The vision Hadoop promised but never fully delivered actually works in 2026 on top of cloud object storage plus open table formats.


18. Evolution of the modern data stack — 2020 → 2026

In the early 2020s "Modern Data Stack (MDS)" became a buzzword. Its core was the combo Fivetran (ingest) + Snowflake (warehouse) + dbt (transform) + Looker (BI).

In 2026 that definition has loosened.

  • Ingest → Stream CDC (Flink CDC, Debezium) joined Fivetran/Airbyte.
  • Storage → Not Snowflake alone, but Snowflake + Databricks + self-hosted lakehouse coexist.
  • Transform → SQLMesh, dbt Fusion and Coalesce stand alongside dbt.
  • BI → Metabase, Superset, Hex and Mode exploded next to Looker.
  • AI/ML integration → ML that used to live on a separate track now sits on the same data. Feature stores, vector DBs and LLMs join as additional layers.

In short, the vocabulary moved from "modern data stack" to "modern data and AI stack." Data engineers and ML engineers sharing the same catalog and same tables is the new standard.


19. Big-data adoption in Korean companies

Three patterns are clear in Korea as of 2026.

Telecom · finance · public sector: CDP/Hortonworks lingers

  • KT, SK Telecom, LG U+, KB, Shinhan, Hana and NH still run huge CDP/HDP clusters. Tens of thousands of Spark and Hive jobs run on them daily in 2026.
  • At the same time they are building new Iceberg + Trino lakehouses on their own clouds — KT Cloud, NCloud, NHN Cloud.

Games and internet: Modern data stack

  • Naver — Internal big-data platform centered on Spark. Aggressively adopting Iceberg and Trino since 2024.
  • Kakao — Its own ML platform and data platform. Spark, Flink and Druid in heavy use.
  • Coupang — Used Hadoop broadly in the past, but from the mid-2020s shifted to AWS-based Iceberg/Spark/Trino, deprecating Hadoop dependence.
  • LINE+ — Runs Snowflake and an internal lakehouse in parallel. Some domains migrated to Snowflake in 2024~2025.
  • NCsoft · Nexon · Netmarble — Game logs, payments and in-game events are huge, so Kafka + Flink + Iceberg/Druid is core.

Startups: Managed cloud first

  • BigQuery, Snowflake and Databricks are first choices. dbt is essentially standard.
  • A few self-host ClickHouse or Trino for cost reasons.

Korea-specific traits:

  • Government and public data regulations make on-prem and sovereign-cloud share higher than the global average.
  • That extends the lifespan of on-prem big-data platforms like CDP.
  • Games, e-commerce and fintech move quickly to cloud lakehouses.

20. Big-data adoption in Japanese companies

Japan looks similar to Korea but with a slightly different flavor.

Yahoo! Japan / LINE merger into LY

  • Yahoo Japan and LINE merged into LY in 2023. Data platform integration is ongoing. Both sides carry massive Hadoop/Spark assets.
  • The huge event streams of search, advertising and messaging have run on Kafka + Spark Streaming + Hadoop for a long time.

Cookpad — Recipe platform. A relatively modern stack centered on Redshift and BigQuery. An early dbt adopter.

Mercari — A large ML platform built on cloud-native pieces such as BigQuery, Vertex AI and the Feast feature store.

Rakuten — Has huge Hadoop-based infrastructure but is migrating to the cloud since the mid-2020s. Experimenting with its own data mesh model.

DeNA · CyberAgent · GREE — Mobile-game and ad data, mostly on BigQuery and Snowflake.

Japan-specific traits:

  • Telecom (NTT Docomo, KDDI, SoftBank) own clouds carry a large share.
  • Finance still leans heavily on on-prem + Cloudera.
  • Advertising and games moved fastest to cloud and the modern data stack.

Both Korea and Japan share the same reality: very large on-prem big-data assets coexist with new lakehouses layered on top — a dual-track world.


21. Five real-world architecture patterns

Five patterns that show up most often in 2026.

Pattern 1 · Classic batch lakehouse (mid scale)

[Operational DBs] → [Airbyte/Fivetran] → [Iceberg on S3] → [dbt + Spark] → [Trino] → [Metabase]

Pattern 2 · CDC streaming lakehouse (event-centric)

[PostgreSQL/MySQL] → [Debezium → Kafka] → [Flink → Iceberg] → [Trino/Spark]
[ClickHouse/Pinot] (real-time dashboard)

Pattern 3 · Game / ad event analytics

[Client] → [Kafka] → [Flink] → [Druid/Pinot] → [User-facing dashboards]
                     └→ [Iceberg] → [Spark batch analytics]

Pattern 4 · ML platform

[Data lake (Iceberg)] → [Spark/Polars feature build] → [Feast feature store]
            ↓                       ↓
[Training (Spark MLlib, Ray)]   [Online serving]
[Model registry (MLflow)]

Pattern 5 · Embedded analytics (small team · notebook)

[Parquet on S3] → [DuckDB/Polars] → [Streamlit/Notebook] → [Sharing]

In 2026 it is common for five different architectures to coexist inside one company on the same underlying data.


22. Operational traps — Things you will regret skipping

1. Small file problem — Hundreds of millions of small objects in object storage cause runaway metadata and listing costs. Run compaction on a regular schedule, for example Iceberg's rewrite_data_files procedure or Delta's OPTIMIZE command.

2. Agreeing on schema evolution — Iceberg and Delta support schema evolution, but "which changes are allowed" is a policy question. You need contracts that block breaking changes such as column renames.

3. Cost visibility — S3 request charges, Snowflake credits and BigQuery slots — the real cost of cloud big data is compute and data movement. Enforce cost labels from day one.

4. CDC backfill — The hardest part of turning on a new CDC pipeline is backfilling historical data. Validate Debezium snapshot and incremental-snapshot modes ahead of time.

5. Unified catalog and permissions — When multiple engines see the same tables, permissions must live in a single catalog. Once engine-specific permissions drift apart, governance collapses.

6. Stream vs batch semantics — "Exactly-once" means different things in different systems. Understand Flink, Kafka and Spark Streaming semantics precisely.

7. Data in non-production environments — Copying production data straight into dev can violate GDPR or Korea's PIPA. Plan masking and synthetic data tools from the start.
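
For the small-file trap in item 1, the core of any compaction job is a planning step that groups small files into rewrite batches close to a target size. The sketch below uses a simple first-fit-decreasing bin packing. It is an illustration of the idea, not the planner of any table format: `plan_compaction`, the file names and the 512 MB target are assumptions.

```python
from __future__ import annotations


def plan_compaction(file_sizes_mb: dict[str, int], target_mb: int = 512) -> list[list[str]]:
    """Group files smaller than target_mb into batches of at most target_mb."""
    small = sorted(
        ((name, size) for name, size in file_sizes_mb.items() if size < target_mb),
        key=lambda item: item[1],
        reverse=True,
    )
    bins: list[tuple[int, list[str]]] = []  # (used MB, file names)
    for name, size in small:
        for i, (used, names) in enumerate(bins):
            if used + size <= target_mb:    # first bin with enough room
                names.append(name)
                bins[i] = (used + size, names)
                break
        else:
            bins.append((size, [name]))
    # A batch with a single file would be rewritten for no gain, so skip it.
    return [names for _, names in bins if len(names) > 1]


if __name__ == "__main__":
    files = {f"part-{i:04d}.parquet": 1 for i in range(1000)}
    batches = plan_compaction(files)
    print(f"{len(files)} files would be rewritten into {len(batches)} files")
```

Files already at or above the target are left alone, so the job touches only the part of the table that actually hurts listing and planning time.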


23. Learning roadmap — Where to start

For someone becoming a data engineer in 2026, this order is recommended.

Step 0 · SQL and Python deeply. The common language of every tool.

Step 1 · Spark and dbt. Learn how distributed processing actually runs with Spark, and learn the vocabulary of "model with SQL" with dbt.

Step 2 · Kafka and Flink. Learn streaming semantics — exactly-once, watermarks, checkpoints.

Step 3 · Table formats. Touch Iceberg directly, then compare with Delta and Hudi.

Step 4 · Orchestration. Go deep on either Airflow or Dagster.

Step 5 · Operations. Cost, observability, data quality and governance. This stage really separates seniors from mids.

Step 6 · Pair with ML. Feature stores, vector DBs and LLM pipelines.

Following this order you can cover the entire surface of modern data platforms within 2~3 years.


24. Five years out — Where big data is heading

Five compact lines for the next five years.

  1. Table format unification. Iceberg solidifies as the de facto standard, and compatibility with Delta/Hudi becomes natural.
  2. Catalog standardization. The Iceberg REST catalog spec freezes, and the same interface works regardless of who hosts it.
  3. Compute disaggregation. From single huge clusters to "spin up only what you need" serverless compute. DuckDB, MotherDuck, Snowflake, Databricks SQL and Athena are all walking the same path.
  4. AI integration. Data catalogs and AI asset catalogs merge. Models, features and notebooks all live under one governance umbrella.
  5. Natural-language interfaces. Asking an LLM and having a data engineer verify the result will be the standard workflow instead of writing SQL by hand. dbt, Looker and Snowflake are all moving in that direction.

Hadoop's vocabulary survives 20 years on. The tools above it keep evolving. Sitting at the heart of that flow is still a fascinating place to be in 2026.


Epilogue — Death slogans vs the actual landscape

"X is dead" slogans are almost always too strong. In 2020 it was "Hadoop is dead," in 2024 it was "Spark is dead, DuckDB does everything," and in 2026 it is "data engineers will be replaced by LLMs." The reality is more subtle and more interesting.

Hadoop is not dead. The vocabulary it built remains the skeleton of every data system in 2026. The stage on which those words run simply changed. If you are standing in the middle of that change, this is one of the most exciting moments to be a big-data engineer.

