Hadoop Ecosystem & Data Engineering 2026 Deep Dive - Hadoop, Spark, Flink, Trino, Iceberg, Delta Lake, Hudi, Airflow, dbt

Intro — Hadoop is dead, and Hadoop is everywhere

In May 2026, "Is Hadoop dead?" is a catchy headline but the wrong question on the ground. The honest answer is "HDFS has shrunk, MapReduce is effectively gone, the Cloudera/Hortonworks-era full-stack distribution is over. But YARN is alive, the Hive Metastore has changed clothes into Iceberg/Delta, and the lakehouse pattern of 'cheap object storage + metadata catalog + distributed processing engine' is the inheritance Hadoop left behind."

This piece is not a single-tool brag. It walks through every layer a data engineer actually touches in May 2026 — storage, table formats, processing engines, query engines, orchestration, transformation, warehouses, analytics — in one pass. Cloud managed services (EMR, Dataproc, HDInsight, Synapse), fully managed warehouses (Snowflake, Databricks, BigQuery, Redshift), and the on-prem realities of Korea and Japan are all covered honestly.

The 2026 data engineering stack — broken into 8 layers

The big picture first. The standard 2026 data engineering stack splits into these 8 layers.

  1. Storage: HDFS, S3, GCS, ABFS, MinIO
  2. Table format: Iceberg, Delta Lake, Hudi
  3. Catalog: Hive Metastore, Unity Catalog, Polaris, Nessie, AWS Glue
  4. Batch processing: Spark, Trino-on-Spark, Dremio
  5. Streaming: Flink, Spark Structured Streaming, Kafka Streams, Materialize
  6. Query/Federation: Trino, Presto, Dremio, StarRocks
  7. Transformation: dbt, SQLMesh, Coalesce
  8. Orchestration: Airflow 3.0, Dagster, Prefect, Mage, Kestra

On top of this, the analytics/OLAP layer adds ClickHouse, DuckDB, Pinot, Druid, and the fully managed warehouse layer adds Snowflake, Databricks SQL, BigQuery, Redshift. The Cloudera era — when every layer lived inside one vendor distribution — is over. Today you assemble OSS components yourself.

Why Hadoop isn't dead — YARN and Hive Metastore are stubborn survivors

The "Hadoop is dead" claim usually comes from two sources. First, the post-2018 Cloudera + Hortonworks merger and the slide in new-license revenue. Second, the way object storage (S3/GCS) has crushed HDFS, breaking the "HDFS = Hadoop" equation. Both are true, but the "Hadoop ecosystem" is much broader than HDFS.

The Hadoop components that are alive in 2026 include:

  • YARN: Still the default resource manager for on-prem Spark/Flink clusters. Gradually losing ground to Kubernetes, but banks and telecoms run lots of YARN.
  • Hive Metastore (HMS): Iceberg, Delta, Trino, Spark, and Flink all use HMS as a catalog. "Hive is dead, but HMS is alive" is the accurate phrasing.
  • Hive LLAP: Repositioned with Hive 4.0 as ACID tables with an Iceberg backend.
  • Ozone: The HDFS successor object store. Aiming at on-prem S3 compatibility.
  • Tez: Hive's execution engine. MapReduce is effectively retired but Tez survives.

MapReduce, in contrast, has essentially vanished from new workloads. Everyone moved to Spark. The MapReduce remnants are legacy Sqoop, some Oozie batch jobs, and code waiting to be migrated.

HDFS vs object storage — who won

As of May 2026, over 90% of new big-data workloads land on object storage: AWS S3, Azure ADLS Gen2, GCP GCS, or on-prem MinIO/Cloudian. HDFS survives in only three cases.

  1. Financial data sovereignty: Environments where data absolutely cannot leave the building. On-prem HDFS clusters paired with Spark/Trino.
  2. Legacy ETL assets: Thousands of Hive tables and Oozie jobs hard-coded to HDFS paths, where migration is a multi-year project.
  3. Ultra-low-latency shuffle: Some graph/join workloads where object-storage PUT/GET latency is a bottleneck. HDFS or an Alluxio-style cache layer wins here.

The reason S3 won is simple. Storage cost is roughly a third, operational overhead is near zero, and compute and storage are decoupled — so the same data is read concurrently by Spark, Trino, Snowflake, and Athena. That is the core premise of the lakehouse.

The lakehouse era — warehouses and lakes converge

The "data lake vs data warehouse" dichotomy is mostly meaningless in 2026. Everyone converged on the lakehouse pattern. The core idea:

  • Storage: object storage (S3/GCS/ABFS) — cheap, infinite, compute-decoupled
  • File format: Parquet/ORC — columnar, compressed, prunable
  • Table format: Iceberg/Delta/Hudi — ACID, schema evolution, time travel, partitioning
  • Catalog: Hive Metastore / Glue / Unity Catalog / Polaris / Nessie
  • Compute engines: Spark, Trino, Flink, Dremio, StarRocks — many engines on the same data

Warehouses (Snowflake, Redshift, BigQuery) used to lock data into proprietary formats, but they have started opening up to Iceberg. As of May 2026, both Snowflake and Databricks treat Iceberg tables as first-class citizens. What this signals is the end of the vendor lock-in era.

The table format big three — Iceberg vs Delta Lake vs Hudi

The heart of the lakehouse is the table format. As of May 2026, three formats dominate.

| Item | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Origin | Netflix → ASF | Databricks → LF | Uber → ASF |
| ACID | Yes (Serializable) | Yes (Serializable) | Yes |
| Schema evolution | Strong (column-ID based) | Yes | Yes |
| Partition evolution | Yes | Limited | Limited |
| MERGE/UPDATE | Yes | Yes | Yes (CoW/MoR) |
| Time travel | Yes | Yes | Yes |
| Catalogs | REST, Hive, Glue, Polaris, Nessie | Unity Catalog, HMS | Hive, Glue |
| Notable adopters | Snowflake, Confluent, Apple, Netflix | Databricks ecosystem | Uber, Robinhood |

The 2026 trend is the gradual standardization of Iceberg. Snowflake is pushing Iceberg external tables plus Polaris Catalog hard, Databricks made Unity Catalog Iceberg-compatible, and Confluent's Tableflow materializes Kafka topics straight into Iceberg tables. Delta remains the strongest format inside the Databricks ecosystem, but the position of "the general-purpose OSS standard" went to Iceberg.

Hudi is still strong for CDC (Change Data Capture) and incremental write patterns, but new adoption is declining.

Iceberg DDL in practice — REST catalog as the standard

Real-world Iceberg usage produces DDL like the following (Spark SQL).

CREATE TABLE iceberg_catalog.analytics.events (
  event_id BIGINT,
  user_id BIGINT,
  event_type STRING,
  occurred_at TIMESTAMP,
  payload MAP<STRING, STRING>
)
USING iceberg
PARTITIONED BY (days(occurred_at), bucket(16, user_id))
TBLPROPERTIES (
  'format-version'='2',
  'write.parquet.compression-codec'='zstd',
  'write.delete.mode'='merge-on-read',
  'write.update.mode'='merge-on-read'
);

ALTER TABLE iceberg_catalog.analytics.events
ADD COLUMN device_type STRING AFTER event_type;

MERGE INTO iceberg_catalog.analytics.events t
USING staging.events_delta s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

SELECT count(*) FROM iceberg_catalog.analytics.events
  FOR SYSTEM_TIME AS OF '2026-05-01 00:00:00';

Iceberg's strength is the decoupled catalog. The same table is written from Spark, read from Trino, streamed into from Flink, and queried from Snowflake as an external table. As the REST catalog spec solidifies, vendor lock weakens.
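
To make the time-travel query above concrete, here is a toy Python model of snapshot selection. It assumes a table is an append-only log of snapshots and that an "as of" read resolves to the latest snapshot committed at or before the requested timestamp. `Snapshot` and `snapshot_as_of` are hypothetical names for illustration, not Iceberg's API, and the file names are made up.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: datetime
    data_files: tuple[str, ...]  # files visible in this snapshot


def snapshot_as_of(log: list[Snapshot], ts: datetime) -> Snapshot:
    """Return the latest snapshot committed at or before ts.

    Assumes log is sorted by committed_at, as an append-only commit log is.
    """
    times = [s.committed_at for s in log]
    idx = bisect_right(times, ts)
    if idx == 0:
        raise ValueError(f"no snapshot at or before {ts}")
    return log[idx - 1]


# Hypothetical commit history: each commit adds one data file.
log = [
    Snapshot(1, datetime(2026, 4, 29), ("a.parquet",)),
    Snapshot(2, datetime(2026, 4, 30, 12), ("a.parquet", "b.parquet")),
    Snapshot(3, datetime(2026, 5, 2), ("a.parquet", "b.parquet", "c.parquet")),
]

as_of = snapshot_as_of(log, datetime(2026, 5, 1))
print(as_of.snapshot_id, as_of.data_files)
```

Because old snapshots keep pointing at immutable files, a reader pinned to snapshot 2 never sees `c.parquet`, no matter which engine wrote it.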

Delta Lake and the Databricks ecosystem

Delta Lake is OSS, but in practice Databricks holds tight control over the format. As of May 2026, the Delta Lake 4.x line headlines:

  • Liquid Clustering: Multi-dimensional clustering instead of partitions. The successor to ZORDER.
  • Predictive Optimization: Statistics-driven automatic OPTIMIZE/VACUUM.
  • Delta Sharing: A data sharing protocol that lets a different organization read the same data.
  • UniForm: A compatibility layer so Iceberg clients can read Delta tables as-is.

Typical Delta merges look like this.

MERGE INTO delta.`s3://datalake/customers` AS t
USING (SELECT * FROM staging_customers) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
WHEN NOT MATCHED BY SOURCE AND t.is_active = true THEN
  UPDATE SET is_active = false;

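-- OPTIMIZE ... WHERE accepts only partition-column predicates,
-- so this statement assumes updated_at is a partition column.
OPTIMIZE delta.`s3://datalake/customers`
  WHERE updated_at >= current_date() - INTERVAL 7 DAYS;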

VACUUM delta.`s3://datalake/customers` RETAIN 168 HOURS;
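
The three clauses of the MERGE above are easy to misread, so here is a minimal Python sketch of the same row-level logic on in-memory dictionaries keyed by `customer_id`. It is a model of the semantics only, not of how Delta executes a merge, and the sample rows are invented for the example.

```python
def merge(target: dict[int, dict], source: dict[int, dict]) -> dict[int, dict]:
    """Apply the three MERGE clauses to rows keyed by customer_id."""
    result: dict[int, dict] = {}
    for key, t_row in target.items():
        s_row = source.get(key)
        if s_row is not None:
            # WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
            newer = s_row["updated_at"] > t_row["updated_at"]
            result[key] = dict(s_row) if newer else dict(t_row)
        elif t_row["is_active"]:
            # WHEN NOT MATCHED BY SOURCE AND t.is_active = true THEN deactivate
            result[key] = {**t_row, "is_active": False}
        else:
            result[key] = dict(t_row)
    for key, s_row in source.items():
        if key not in target:
            # WHEN NOT MATCHED THEN INSERT *
            result[key] = dict(s_row)
    return result


# ISO date strings compare correctly as plain strings.
target = {
    1: {"name": "Ann", "updated_at": "2026-04-01", "is_active": True},
    2: {"name": "Bob", "updated_at": "2026-04-01", "is_active": True},
    3: {"name": "Cho", "updated_at": "2026-03-01", "is_active": False},
}
source = {
    1: {"name": "Ann Lee", "updated_at": "2026-05-01", "is_active": True},
    4: {"name": "Dee", "updated_at": "2026-05-01", "is_active": True},
}

merged = merge(target, source)
for key in sorted(merged):
    print(key, merged[key])
```

Customer 1 is updated, customer 2 is soft-deleted because it vanished from the source, customer 3 is left alone, and customer 4 is inserted.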

Delta is weaker outside Databricks because the operational tooling (OPTIMIZE, Z-Order, Predictive Optimization) is automated only inside Databricks. You can run OSS Delta, but the operational gap with the Databricks-managed experience is wide.

Apache Hudi — the CDC and incremental champion

Hudi was built at Uber, so it is optimized for stream-to-batch CDC. Hudi 1.0 went GA, and the two table types remain key.

  • Copy-on-Write (CoW): Files rewritten on every update. Fast reads, expensive writes.
  • Merge-on-Read (MoR): Changes accumulate in delta logs and are merged at compaction time. Fast writes, expensive reads.
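
The trade-off between the two table types can be sketched with a toy model. The classes below are hypothetical illustrations, not Hudi code: one "file" is a Python dict, and `rows_rewritten` stands in for write cost.

```python
class CopyOnWriteTable:
    """Every upsert rewrites the base file; reads return it directly."""

    def __init__(self, rows: dict[int, str]):
        self.base = dict(rows)
        self.rows_rewritten = 0  # crude proxy for write amplification

    def upsert(self, key: int, value: str) -> None:
        new_base = dict(self.base)  # rewrite the whole file
        new_base[key] = value
        self.rows_rewritten += len(new_base)
        self.base = new_base

    def read(self) -> dict[int, str]:
        return dict(self.base)  # no merge work at read time


class MergeOnReadTable:
    """Upserts append to a delta log; reads merge base and log."""

    def __init__(self, rows: dict[int, str]):
        self.base = dict(rows)
        self.log: list[tuple[int, str]] = []

    def upsert(self, key: int, value: str) -> None:
        self.log.append((key, value))  # cheap append, no rewrite

    def read(self) -> dict[int, str]:
        merged = dict(self.base)
        for key, value in self.log:  # merge cost is paid by the reader
            merged[key] = value
        return merged

    def compact(self) -> None:
        """Fold the log into the base file so later reads are cheap again."""
        self.base = self.read()
        self.log.clear()


initial = {1: "a", 2: "b", 3: "c"}
cow, mor = CopyOnWriteTable(initial), MergeOnReadTable(initial)
for table in (cow, mor):
    table.upsert(2, "B")
    table.upsert(4, "d")

print(cow.read() == mor.read(), cow.rows_rewritten, len(mor.log))
```

Both tables return the same rows; they differ only in who pays, the writer (CoW) or the reader until the next compaction (MoR).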

Hudi also exposes a TIMELINE metadata structure that arranges every commit chronologically and supports incremental queries that read "only the last hour of changes." That makes it powerful inside CDC pipelines.
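
As a rough illustration of an incremental query over a timeline, the sketch below keeps a list of commits ordered by a sortable instant string and returns only the records changed after a given instant. `Commit` and `incremental_read` are hypothetical names, and the instant strings are invented; this is a model of the idea, not Hudi's reader.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    instant: str             # sortable commit time, e.g. "20260501093000"
    changed: dict[int, str]  # records written by this commit


def incremental_read(timeline: list[Commit], since: str) -> dict[int, str]:
    """Return the latest value of every record changed after `since`."""
    changes: dict[int, str] = {}
    for commit in sorted(timeline, key=lambda c: c.instant):
        if commit.instant > since:
            changes.update(commit.changed)  # later commits win
    return changes


timeline = [
    Commit("20260501080000", {1: "created", 2: "created"}),
    Commit("20260501090000", {2: "updated"}),
    Commit("20260501093000", {3: "created", 2: "updated again"}),
]

# A consumer that last synced at 08:30 only sees what changed since then.
print(incremental_read(timeline, since="20260501083000"))
```

A downstream job stores the last instant it processed and passes it as `since` on the next run, so it never rescans unchanged data.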

Adoption is in decline in 2026, but Hudi remains active inside AWS EMR, Onehouse (including its Open Engines offering), and similar services.

Spark 4.0 — still the center of data engineering

"Is Spark dead?" comes up often, and the 2026 answer is clearly no. Spark 4.0 went GA in May 2025 and holds its position as the central processing engine for data engineering. Highlights of 4.0:

  • Spark Connect: Client-server separation. A thin client drives a remote Spark cluster. Easier IDE / notebook / CI integration.
  • ANSI mode by default: Stronger SQL standards compliance.
  • Variant type: First-class semi-structured (JSON) data handling.
  • String Collation: ICU-based multilingual ordering.
  • Python UDF performance: Arrow-based serialization plus aggressive use of Python 3.13.

Photon is Databricks's proprietary vectorized engine, and the OSS equivalent — Apache Gluten plus Velox/ClickHouse backends — is catching up fast. EMR Serverless, Dataproc Serverless, and Synapse Spark all add their own acceleration engines.

Spark Structured Streaming is micro-batch compared to Flink, but as of 2026 Continuous Processing has stabilized enough to reclaim some of Flink's territory.

Spark Structured Streaming example — Kafka to Iceberg

The Spark Structured Streaming plus Iceberg combo is the most common real-time ingestion pattern in May 2026.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, current_timestamp
from pyspark.sql.types import StructType, StringType, LongType

spark = (SparkSession.builder
    .appName("kafka-to-iceberg")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://catalog.internal:8181")
    .getOrCreate())

schema = StructType().add("event_id", LongType()).add("user_id", LongType()).add("event_type", StringType())

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "user_events")
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("d"))
    .select("d.*", current_timestamp().alias("ingested_at")))

(stream.writeStream
    .format("iceberg")
    .option("checkpointLocation", "s3://lakehouse/_checkpoints/user_events")
    .outputMode("append")
    .toTable("lakehouse.analytics.user_events"))

The Iceberg catalog is REST, so the separate Hive Metastore dependency goes away, and checkpoints live in S3. In 2024 this pattern was unusual; in 2026 it is nearly the standard.

Apache Flink and true event-time streaming

Flink remains the first choice when you need true event-time streaming, exactly-once semantics, and complex windowing. As of May 2026, Flink 1.20 / 2.0 leads the trend.

  • Flink SQL: Writing streaming jobs in SQL is now the standard. Catalog integration (Hive, Iceberg, Paimon) is mature.
  • Apache Paimon: A lakehouse table format from the Flink camp. Competing with and complementing Iceberg.
  • Flink CDC: A Debezium alternative. Flink handles PostgreSQL/MySQL CDC directly.
  • Flink on Kubernetes: The K8s Operator is the standard deployment.
  • Confluent's Flink SaaS: Flink is integrated into Confluent Cloud, providing a fully managed Kafka + Flink combination.

Flink beats Spark Structured Streaming on three counts: low latency, accurate event-time windowing, and backpressure handling. Spark is friendlier to ETL and makes it easier to reuse batch code, but sub-100ms latency is hard to reach with it.

Flink SQL expressed for the same Kafka → Iceberg pattern looks like this.

CREATE TABLE kafka_events (
  event_id BIGINT,
  user_id BIGINT,
  event_type STRING,
  occurred_at TIMESTAMP_LTZ(3),
  WATERMARK FOR occurred_at AS occurred_at - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

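-- Partition on a derived date column: partitioning on the raw timestamp
-- would create one partition per distinct value.
-- 'catalog-name' is required by the Iceberg Flink connector; the value is illustrative.
CREATE TABLE iceberg_events (
  event_id BIGINT,
  user_id BIGINT,
  event_type STRING,
  occurred_at TIMESTAMP_LTZ(3),
  event_date DATE
) PARTITIONED BY (event_date) WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'lakehouse',
  'catalog-type' = 'rest',
  'uri' = 'https://catalog.internal:8181',
  'warehouse' = 's3://lakehouse/'
);

INSERT INTO iceberg_events
SELECT event_id, user_id, event_type, occurred_at,
       CAST(occurred_at AS DATE) AS event_date
FROM kafka_events;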
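
The WATERMARK clause in the Kafka table DDL above is what makes event-time windowing work. The sketch below is a simplified Python model of that mechanism, in whole seconds: the watermark trails the largest event time seen by 5 seconds, a tumbling window fires once the watermark passes its end, and events for already-fired windows are dropped as late. The 10-second window size and the sample events are assumptions for the example, and Flink's real firing rules differ in millisecond-level details.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window size in seconds (assumed for this example)
DELAY = 5    # mirrors the 5-second watermark delay in the DDL


def run(events: list[tuple[int, str]]):
    """Process (event_time_seconds, key) pairs in arrival order.

    Returns (fired, dropped): fired maps window start to event count,
    dropped lists events that arrived after their window had closed.
    """
    open_windows: dict[int, int] = defaultdict(int)
    fired: dict[int, int] = {}
    dropped: list[tuple[int, str]] = []
    watermark = float("-inf")
    for ts, key in events:
        start = ts - ts % WINDOW
        if start + WINDOW <= watermark:
            dropped.append((ts, key))  # window already closed: late event
        else:
            open_windows[start] += 1
        watermark = max(watermark, ts - DELAY)
        for w in sorted(open_windows):
            if w + WINDOW <= watermark:
                fired[w] = open_windows.pop(w)  # emit the finished window
    return fired, dropped


# Event at t=8 arrives out of order but before the watermark passes 10,
# so it still counts. Event at t=3 arrives after window [0, 10) fired.
events = [(1, "a"), (4, "a"), (12, "a"), (8, "a"), (16, "a"), (3, "a"), (27, "a")]
fired, dropped = run(events)
print(fired, dropped)
```

A larger delay tolerates more disorder at the cost of later results, which is the knob the `INTERVAL '5' SECOND` expression controls.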

Trino and Presto — the federation SQL standard

Trino (formerly PrestoSQL) became the standard by focusing on one thing: "SQL federation across many sources." As of May 2026 Trino ships with dozens of connectors, and these patterns are common.

  • Data lake SQL: ad-hoc queries on Iceberg/Delta/Hudi
  • Federation: joins spanning Postgres, MySQL, Iceberg, Kafka
  • BI backend: Tableau/Superset attaches to the lakehouse via Trino

Presto is the original from Meta; Trino is the fork. As of 2026 OSS activity overwhelmingly favors Trino, while Meta continues evolving Presto for its own infrastructure. Starburst provides the commercial Trino distribution.

A typical federated query looks like the following.

WITH iceberg_orders AS (
  SELECT order_id, user_id, total, ordered_at
  FROM iceberg.sales.orders
  WHERE ordered_at >= date '2026-05-01'
),
pg_users AS (
  SELECT id AS user_id, email, country
  FROM postgres.public.users
),
kafka_clicks AS (
  SELECT user_id, count(*) AS clicks
  FROM kafka.events.clickstream
  -- _timestamp is the Kafka connector's built-in message-timestamp column
  WHERE _timestamp >= timestamp '2026-05-01 00:00:00'
  GROUP BY user_id
)
SELECT u.country, sum(o.total) AS revenue, sum(c.clicks) AS clicks
FROM iceberg_orders o
JOIN pg_users u USING (user_id)
LEFT JOIN kafka_clicks c USING (user_id)
GROUP BY u.country
ORDER BY revenue DESC;

Trino joins Iceberg, PostgreSQL, and Kafka inside a single query. Ad-hoc analysis without a separate ETL step is a huge value to BI and analyst teams.
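
The catalog.schema.table addressing that makes this possible can be tried locally with nothing but the Python standard library. The sketch below uses SQLite's ATTACH as a stand-in for Trino catalogs: two separate in-memory databases play the roles of the lake and the Postgres source, and one SQL statement joins across them. This is an analogy only; SQLite is not distributed and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each ATTACH creates an independent in-memory database, standing in for a catalog.
conn.execute("ATTACH DATABASE ':memory:' AS lake")
conn.execute("ATTACH DATABASE ':memory:' AS pg")

conn.execute("CREATE TABLE lake.orders (order_id INTEGER, user_id INTEGER, total REAL)")
conn.execute("CREATE TABLE pg.users (user_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?, ?)",
    [(1, 10, 30.0), (2, 10, 20.0), (3, 11, 5.0)],
)
conn.executemany("INSERT INTO pg.users VALUES (?, ?)", [(10, "KR"), (11, "JP")])

# One query, two "sources": no export or copy step in between.
rows = conn.execute(
    """
    SELECT u.country, SUM(o.total) AS revenue
    FROM lake.orders AS o
    JOIN pg.users AS u USING (user_id)
    GROUP BY u.country
    ORDER BY revenue DESC
    """
).fetchall()
print(rows)
```

In Trino the same shape of query additionally pushes filters down to each source and runs the join on a worker cluster.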

dbt 1.9 + Iceberg — the modern ELT stack joins forces

dbt has become the standard for SQL-based transformation. As of May 2026 dbt Core 1.9 / dbt Cloud bring these changes.

  • Microbatch incremental strategy: Time-based incremental builds.
  • dbt Mesh: Cross-project dependency management.
  • dbt Semantic Layer + MetricFlow: Centralized metric definitions.
  • First-class Iceberg support: Snowflake/Databricks/Spark/Trino adapters all recognize Iceberg tables as materialization targets.

A typical dbt model looks like this.

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='occurred_at',
    batch_size='day',
    begin='2026-05-01',
    file_format='iceberg',
    partition_by=['days(occurred_at)']
) }}

-- begin is required by the microbatch strategy; the date above is a placeholder.
-- No is_incremental() filter is needed: dbt filters each batch on event_time itself,
-- provided the upstream source also declares an event_time.

SELECT
  event_id,
  user_id,
  event_type,
  occurred_at,
  date_trunc('day', occurred_at) AS event_date
FROM {{ source('raw', 'events') }}
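
To show what a day-sized microbatch means in practice, here is a small Python sketch of the idea: split the time range from a start date up to now into half-open one-day windows and select the rows whose event time falls inside each window. `day_batches` and `rows_for_batch` are hypothetical helpers written for this illustration, not dbt internals.

```python
from datetime import datetime, timedelta


def day_batches(begin: datetime, now: datetime) -> list[tuple[datetime, datetime]]:
    """Split [begin, now) into half-open one-day windows aligned to midnight."""
    start = begin.replace(hour=0, minute=0, second=0, microsecond=0)
    batches = []
    while start < now:
        end = start + timedelta(days=1)
        batches.append((start, end))
        start = end
    return batches


def rows_for_batch(rows: list[dict], start: datetime, end: datetime) -> list[dict]:
    """Rows whose event time falls in [start, end)."""
    return [r for r in rows if start <= r["occurred_at"] < end]


rows = [
    {"event_id": 1, "occurred_at": datetime(2026, 5, 1, 9)},
    {"event_id": 2, "occurred_at": datetime(2026, 5, 2, 0)},
    {"event_id": 3, "occurred_at": datetime(2026, 5, 2, 23, 59)},
]

batches = day_batches(datetime(2026, 5, 1), now=datetime(2026, 5, 3, 6))
for start, end in batches:
    ids = [r["event_id"] for r in rows_for_batch(rows, start, end)]
    print(start.date(), ids)
```

Because each batch is independent and bounded, a failed day can be retried or backfilled on its own without touching the others.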

SQLMesh is rising as a dbt alternative with virtual environments, stronger dimensional modeling, and semantic versions. Coalesce is a UI-based SQL transformation tool friendly to non-technical users.

Orchestration — Airflow 3.0 stood back up

Airflow long suffered the criticism of being heavy and outdated, but Airflow 3.0, GA in April 2025, shifted the field.

  • Task SDK: Job definitions are completely separated from Airflow infrastructure. Workers run remotely.
  • Multi-version DAG: Multiple versions of the same DAG coexist.
  • DAG Versioning: Airflow itself manages code change history.
  • Asset-centric scheduling: Dataset (asset) based triggers — clearly in response to Dagster.

Where the alternative orchestrators sit:

| Tool | Strength | Adoption pattern |
| --- | --- | --- |
| Airflow 3.0 | Generality, ecosystem, connectors | Everywhere |
| Dagster | Asset-centric, type-safe, data visibility | New data platforms |
| Prefect 3 | Dynamic DAGs, Python-friendly | Data/ML workflows |
| Mage AI | Notebook plus pipeline | Analysts, startups |
| Kestra | YAML-based, multi-language | Enterprise |
| Argo Workflows | K8s-native, container per step | Infra/platform teams |

A typical Airflow DAG looks like this.

from airflow.sdk import dag, task
from datetime import datetime, timedelta

@dag(
    schedule="@daily",
    start_date=datetime(2026, 5, 1),
    catchup=False,
    tags=["lakehouse", "iceberg"],
)
def daily_iceberg_compaction():

    @task
    def list_partitions(target_date: str) -> list[str]:
        return [f"event_date={target_date}"]

    @task
    def compact(partition: str) -> str:
        import subprocess
        subprocess.run([
            "spark-submit", "/opt/jobs/compact.py",
            "--table", "iceberg.analytics.events",
            "--partition", partition,
        ], check=True)
        return partition

    partitions = list_partitions("{{ ds }}")
    compact.expand(partition=partitions)

daily_iceberg_compaction()

Thanks to Airflow 3.0's Task SDK, task code depends on the SDK interface rather than on scheduler and metadata-database internals, and dynamic task mapping (expand) parallelizes per-partition compactions.
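
The fan-out behavior of dynamic task mapping can be approximated in plain Python. The sketch below is a conceptual stand-in, not Airflow code: an upstream function returns a list whose length is only known at run time, and a hypothetical `expand` helper runs one task instance per element in parallel. The hourly partition names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def list_partitions(target_date: str) -> list[str]:
    """Upstream task: the number of partitions is only known at run time."""
    return [f"event_date={target_date}/hour={h:02d}" for h in range(3)]


def compact(partition: str) -> str:
    """Mapped task: one instance per partition (real work omitted)."""
    return f"compacted {partition}"


def expand(task, values: list[str], max_workers: int = 4) -> list[str]:
    """Run one task instance per value in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, values))


results = expand(compact, list_partitions("2026-05-01"))
for line in results:
    print(line)
```

In Airflow each mapped instance is additionally scheduled, retried, and logged as its own task, which is what a thread pool alone does not give you.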

Dagster — the asset-centric orchestrator

Dagster's philosophy is "DAG of assets," not "DAG of tasks." Datasets are first-class citizens, and tasks are functions that produce assets.
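
A minimal way to picture a DAG of assets is a mapping from each asset to the assets it depends on. The sketch below uses Python's standard `graphlib` to derive a valid materialization order and to compute which upstream assets a given target needs. The asset names are invented, and this is an illustration of the concept rather than Dagster's API.

```python
from graphlib import TopologicalSorter

# asset name -> set of upstream assets it is computed from (hypothetical names)
ASSETS: dict[str, set[str]] = {
    "raw_events": set(),
    "raw_users": set(),
    "stg_events": {"raw_events"},
    "daily_active_users": {"stg_events", "raw_users"},
}


def materialization_order(assets: dict[str, set[str]]) -> list[str]:
    """An order in which every asset comes after all of its upstreams."""
    return list(TopologicalSorter(assets).static_order())


def upstream_closure(assets: dict[str, set[str]], target: str) -> set[str]:
    """All assets needed to materialize target, including target itself."""
    needed: set[str] = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(assets[name])
    return needed


order = materialization_order(ASSETS)
print(order)
print(sorted(upstream_closure(ASSETS, "stg_events")))
```

Declaring the data and its dependencies first, and deriving the execution order from them, is the inversion that separates asset-centric from task-centric orchestration.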

Where Dagster stands in 2026:

  • Strong in new data platforms. Data catalog, lineage, and dbt integration feel natural from day one.
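  • dbt integration is the smoothest. The dagster-dbt package turns every dbt model into an asset: through the @dbt_assets decorator in current releases, and through a single load_assets_from_dbt_project call in older ones.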
  • The software-defined assets pattern aligns well with the data mesh movement.

Dagster Cloud (the enterprise SaaS) and the OSS core are clearly separated, which makes the business model stable.

Prefect is strong on dynamic workflows, and after Prefect 3.0 in 2024 the perf/UX improvements have been large. Mage is notebook- and data-native, friendly to analysts. Kestra uses a YAML DSL on the JVM, fitting enterprise environments.

ClickHouse, DuckDB, Pinot, Druid — analytics convergence

In the OLAP layer, the 2026 trend is clear. The "one DB does all analytics" era is over, and each tool has settled into what it does best.

  • ClickHouse: The canonical large-scale columnar OLAP. Log/event analytics, observability. ClickHouse 24.x added Vector Search plus Iceberg/Delta external tables.
  • DuckDB: Local/embedded OLAP. "Snowflake in your notebook." Explosive growth since 1.0 GA. MotherDuck is the cloud version.
  • Apache Pinot: Real-time user-facing analytics. LinkedIn Feed, Uber pricing, Superset interactive dashboards.
  • Apache Druid: Time-series plus real-time OLAP. Similar territory to Pinot with more mature ops tooling.
  • StarRocks/Doris: MPP OLAP. Strong in China; growing Iceberg compatibility.

DuckDB grew fastest in 2026 for three reasons: it queries Parquet and Iceberg data on S3 straight from a local machine, it works equally well from Python, the CLI, and plain SQL, and it ships as a single binary. It is increasingly replacing Snowflake on the analyst's laptop.

Snowflake, Databricks, BigQuery, Redshift — the managed big four

The fully managed warehouse/lakehouse landscape is dominated by four:

  • Snowflake: SQL-first, ease of use, simple pricing. In 2026 it leans further into lakehouse territory with Iceberg external tables, Polaris Catalog, and Snowpark Container Services. AI/LLM compute (Cortex) is also strengthened.
  • Databricks: The parent of Spark/Delta. Unity Catalog plus Photon plus Mosaic AI. Strongest ML/LLM/engineering integration.
  • Google BigQuery: Serverless, petabyte scale. BigQuery Studio adds notebook integration. Iceberg external tables also supported.
  • AWS Redshift: Spectrum gained stronger Iceberg support. Redshift Serverless is becoming the default.

The traditional "Snowflake vs Databricks" duopoly blurred in 2026 because both companies are converging on the same direction. Snowflake absorbed lakehouse capabilities through Iceberg plus containers, and Databricks entered the warehouse market with Databricks SQL.

BigQuery is the default on Google Cloud, and Redshift is the choice on AWS where cost optimization matters most.

EMR, Dataproc, HDInsight — managed Hadoop today

All three major clouds run managed Hadoop services.

  • AWS EMR: The most mature. Three form factors — EMR on EC2, EMR Serverless, and EMR on EKS. Covers almost every OSS component: Spark, Hive, Presto, Trino, Flink, HBase.
  • GCP Dataproc: Spark and Hadoop-centric. Dataproc Serverless grows fast.
  • Azure HDInsight: Very few new workloads in 2026. The flow is absorption into Synapse and Fabric.
  • Azure Synapse / Microsoft Fabric: Now the Azure data-platform mainstream. OneLake is effectively unified storage on top of Iceberg/Delta.

EMR Serverless is widely adopted in 2026 for the "run a Spark job briefly and disappear" pattern. Running batch ETL on EMR Serverless instead of Lambda is now common.

The legacy of Cloudera + Hortonworks — CDP today

Cloudera and Hortonworks, the commercial center of Hadoop, merged in 2018 and unified into a single product, CDP (Cloudera Data Platform). Where CDP stands in 2026:

  • Almost no new adoption. Cloud managed (EMR/Dataproc) or fully managed (Databricks/Snowflake) took the new workloads.
  • Revenue centers on on-prem renewals in finance and telecom, large environments driven by data sovereignty and compliance.
  • CDP Public Cloud: Attempted cloud compatibility, but differentiation against EMR / Dataproc / Databricks remains weak.

OSS operational tools like Apache Bigtop, Apache Ambari, and Apache Ranger survived. Ranger in particular is used as the standard for data governance and access control even in the lakehouse era.

Korean big data operations — Naver, Kakao, KT, Coupang, Woowa Bros

Korean big tech runs OSS Hadoop ecosystems deeply.

  • Naver: Runs petabyte-scale Hadoop/Spark on its own clusters. Search logs, ads, even CLOVA LLM training data. Gradually moving to in-house K8s plus Spark platform in 2026. Naver Search Tech blog has shared the Iceberg adoption story.
  • Kakao: KEMI (Kakao Enterprise Machine Intelligence) data platform is standard. Moved from Hadoop + Spark + Hive to Iceberg + Trino. Multi-year history shared on the Kakao Tech blog.
  • KT: On-prem Hadoop clusters for telecom BSS/OSS data. Also used in KT GiGA Genie voice analytics. Uses CDP licenses.
  • Coupang: AWS EMR + Iceberg + Trino + Airflow. A Korean flagship of the cloud-native data stack.
  • Woowa Brothers (Baemin): Spark + Airflow + Snowflake. Mixes Kafka + Flink + Spark Streaming for real-time and batch analysis of delivery data.
  • NHN, LG CNS, Samsung SDS: Large SIs and platform vendors also run their own data platforms. Migration from Hadoop bases into lakehouse is in progress.

In the hiring market, positions listing "Iceberg, dbt, Airflow, Snowflake/Databricks" have grown more visible than "Hadoop, Spark, Hive experience." In 2024 Iceberg was rare in Korean data-engineer postings; in May 2026 it is common across large IT and finance firms.

Japanese big data operations — LINE, ZOZO, NTT, Rakuten, Mercari

Japan is historically more conservative and on-prem-heavy than Korea, but the pace of change in 2026 has accelerated.

  • LINE Yahoo: Has operated Hadoop clusters for years. Search, ads, message analysis. LINE Engineering blog has many Spark/Hive operational write-ups. Iceberg + Trino adoption started in 2025.
  • ZOZO: A Japanese flagship for Spark on Kubernetes. The ZOZO TECH BLOG describes EKS + Spark + Iceberg operations, including CDC and multi-tenant patterns.
  • NTT DATA: Enterprise/finance/public data-platform SI. CDP license customer. Publishes many Japanese big-data benchmarks and white papers.
  • Rakuten: Runs its in-house data platform RIDE on Hadoop + Spark + Hive. AI integration is in progress.
  • Mercari: GCP BigQuery + dbt + Looker is the spine. A Japanese flagship of the cloud-native pattern.
  • CyberAgent: An ad data platform running Spark/Flink/Kafka. Data pipeline cases are shared from AI Lab.
  • DeNA, GREE: Game and entertainment data platforms. Athena, Glue, EMR are used.

A distinguishing feature of Japan is the higher on-prem share than Korea. Finance, telecom, and manufacturing approach cloud migration cautiously due to data-sovereignty concerns. The lakehouse pattern itself, however, is being adopted rapidly.

Kafka and the streaming backbone — the artery of data

At the very center of the data engineering stack sits Kafka. As of May 2026:

  • Apache Kafka 4.0: KRaft mode is standard. ZooKeeper dependency fully removed.
  • Confluent Cloud: Fully managed SaaS. Tableflow materializes Kafka topics into Iceberg tables.
  • AWS MSK: AWS-native. The Serverless option is popular.
  • Redpanda: A Kafka-compatible broker rewritten in C++. Low latency plus simple ops.
  • WarpStream: Kafka-compatible with an S3 backend. Cost-optimized.

The streaming SQL layer splits between Flink SQL and ksqlDB, and the 2026 trend is that Flink is the standard. ksqlDB is strong inside Confluent, but OSS activity strongly favors Flink.

CDC (Change Data Capture) is Debezium's domain, with Flink CDC catching up quickly. Airbyte and Fivetran-style ELT SaaS also support CDC, but at high traffic Debezium + Kafka is the common path.
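
What a CDC consumer does with the change stream can be sketched in a few lines. The example below assumes Debezium-style change events with an `op` code (`c` create, `r` snapshot read, `u` update, `d` delete) and `before`/`after` row images, and replays them onto an in-memory table keyed by `id`. The event payloads are simplified and invented; real Debezium envelopes carry more metadata.

```python
def apply_change(table: dict[int, dict], event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by id."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: upsert the new image
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":            # delete: remove by the old image's key
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op {op!r}")


events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "a.new@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]

replica: dict[int, dict] = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Replaying events in order is what keeps the replica consistent, which is why CDC pipelines usually partition the Kafka topic by primary key.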

Airbyte, Fivetran, Stitch — ELT automation

The ingestion side of ELT has standardized on SaaS tools.

  • Fivetran: The most connectors (500+). Pricey but stable, with low management overhead.
  • Airbyte: OSS and cloud both. About 350 connectors. Self-hostable.
  • Stitch: Under Talend. Cost-effective for simple SQL DB ingestion.
  • Meltano: Singer-based OSS. Loved by code-first users.
  • Hevo Data: No-code plus integrated transformation. Popular in APAC.
  • Estuary Flow: Real-time CDC plus SQL. A new challenger.

Airbyte's 2024 Connector Builder + Low-Code CDK lowered the connector-authoring barrier, and 2026 brings experiments in LLM-driven automatic connector generation.

The Modern Data Stack combination of dbt + Airbyte/Fivetran + Snowflake/BigQuery + Looker/Mode/Hex peaked in 2020-2022; in 2026 it is more common to see dbt + Iceberg + Trino/Databricks blended with the lakehouse trend.

Data catalogs and governance — Unity, Polaris, Nessie, OpenMetadata

As data assets explode, catalogs have become first-class infrastructure.

  • Unity Catalog: Originally Databricks; converted to OSS in 2024. Expanded into a multi-format catalog supporting Iceberg and Delta.
  • Apache Polaris: An Iceberg REST catalog created by Snowflake. Donated to the ASF.
  • Project Nessie: Sponsored by Dremio. Git-like branching and merging in a catalog.
  • OpenMetadata: An OSS combining metadata, lineage, and governance.
  • DataHub: OSS from LinkedIn. Asset search, lineage, ownership.
  • Atlan, Alation, Collibra: Enterprise catalog SaaS.

The 2026 headline is the "Iceberg REST catalog standard". With Snowflake Polaris, Databricks Unity Catalog, Nessie, and AWS Glue all conforming, multiple engines hitting the same data through one catalog finally comes true. The multi-engine promise of the lakehouse era is realized.
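
Part of why one catalog can serve many engines is that the REST protocol addresses tables with plain HTTP paths. The sketch below builds the table-lookup path as I understand the Iceberg REST spec: routes live under `/v1`, an optional catalog prefix follows, and multi-level namespaces are joined with the unit-separator character (URL-encoded as `%1F`). Treat the exact route shape as an assumption to verify against the spec; `table_path` is a hypothetical helper, and no network call is made.

```python
from urllib.parse import quote


def table_path(namespace: list[str], table: str, prefix: str = "") -> str:
    """Build the path used to load a table's metadata from a REST catalog."""
    # Multi-level namespaces are joined with 0x1F, then percent-encoded.
    ns = quote("\x1f".join(namespace), safe="")
    base = f"/v1/{prefix}" if prefix else "/v1"
    return f"{base}/namespaces/{ns}/tables/{quote(table, safe='')}"


print(table_path(["analytics"], "events"))
print(table_path(["prod", "analytics"], "events", prefix="lakehouse"))
```

Any engine that can issue a GET against such a path on the catalog host and read the returned metadata location can open the table, regardless of which engine created it.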

Data quality and observability — Monte Carlo, Great Expectations, Soda

Tools that guarantee pipeline quality have become their own category.

  • Monte Carlo: First-generation data observability SaaS. Introduced the "data downtime" and SLA concepts.
  • Great Expectations: OSS data validation. Python-friendly.
  • Soda: YAML-based data quality rules. OSS plus Cloud.
  • Bigeye: ML-based anomaly detection.
  • Elementary: dbt-native observability OSS.
  • Datafold: Data diff tool. Automated dbt PR review.

Inside dbt you run dbt test for NULL/uniqueness checks, layer Great Expectations / Soda on top for business rules, and use Monte Carlo / Elementary for operational SLAs — that combination is the standard pattern in 2026.
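
The first layer of that stack, NULL and uniqueness checks, boils down to queries that return the failing rows. A minimal Python sketch of the same idea follows; the function names mirror the dbt test names, but the implementation is an illustration on in-memory rows, not dbt code.

```python
from collections import Counter


def not_null(rows: list[dict], column: str) -> list[dict]:
    """Return the rows that fail the check; an empty list means the test passes."""
    return [r for r in rows if r.get(column) is None]


def unique(rows: list[dict], column: str) -> list:
    """Return the values that appear more than once, ignoring NULLs."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


rows = [
    {"event_id": 1, "user_id": 10},
    {"event_id": 2, "user_id": None},
    {"event_id": 2, "user_id": 11},
]

print(not_null(rows, "user_id"))
print(unique(rows, "event_id"))
```

Returning the offending rows rather than a boolean is what lets the heavier tools in the list attach context, thresholds, and alerting on top.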

Data mesh and data contracts — where org meets infrastructure

The data mesh concept that emerged between 2020 and 2022 became more practical in 2026. The simple definition:

  • Treat data as a product — domain teams own data products.
  • Data products run on self-service infrastructure — a data platform team provides it.
  • Federated governance — central standards with distributed operations.

To realize this, the data contract pattern emerged: an agreement between a data producer and its consumers that specifies schema, SLA, and semantics. Tools include dbt contracts, Open Data Product Standard, Data Contract CLI, and Soda Contracts.
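
A contract becomes useful once it is machine-checkable. The sketch below encodes a hypothetical contract as a Python dict with field types, required fields, and a freshness SLA, then checks a record and a load timestamp against it. The structure is invented for illustration and does not follow any of the tool formats listed above.

```python
from datetime import datetime, timedelta

# Hypothetical contract for an events data product.
CONTRACT = {
    "fields": {"event_id": int, "user_id": int, "event_type": str},
    "required": {"event_id", "user_id"},
    "max_staleness": timedelta(hours=1),  # freshness SLA
}


def schema_violations(record: dict, contract: dict) -> list[str]:
    """List every way the record breaks the contract's schema."""
    errors = []
    for name, expected in contract["fields"].items():
        value = record.get(name)
        if value is None:
            if name in contract["required"]:
                errors.append(f"{name}: missing required field")
        elif not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


def is_fresh(last_loaded_at: datetime, now: datetime, contract: dict) -> bool:
    """True if the latest load is within the contract's freshness SLA."""
    return now - last_loaded_at <= contract["max_staleness"]


good = {"event_id": 1, "user_id": 10, "event_type": "click"}
bad = {"event_id": "1", "event_type": "click"}

print(schema_violations(good, CONTRACT))
print(schema_violations(bad, CONTRACT))
print(is_fresh(datetime(2026, 5, 1, 9, 0), datetime(2026, 5, 1, 9, 30), CONTRACT))
```

Running checks like these in the producer's CI, before a change ships, is what turns the contract from documentation into an enforced interface.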

Data mesh fits organizations that (1) have independent domain teams, (2) face high demand for data products, and (3) are large enough that the central data team has become a bottleneck. In small orgs, a single centralized data team is more efficient.

On-prem big data reality — banks, telecom, government, manufacturing

Despite the cloud era, on-prem big data remains a large market in 2026. Key environments:

  1. Banking and securities: Data sovereignty, compliance (GDPR/PIPA), audit trails. Hadoop + Hive + Spark + Ranger.
  2. Telecom: Petabyte-scale BSS/OSS data, in-network analysis. CDP/HDInsight plus their own OpenStack/K8s.
  3. Government and public: Isolated data centers with air gaps. In-house lakehouses. Iceberg + Spark + Trino on K8s.
  4. Manufacturing: OT (operational technology) plus IT integration. Time-series databases (InfluxDB, TimescaleDB) plus Spark.
  5. Healthcare and insurance: Patient data protection. In-house HDFS or Ceph plus Spark.

The recent shift in on-prem big data is gradual migration to Kubernetes. K8s instead of YARN, MinIO/Ceph instead of HDFS, ArgoCD plus Spark Operator plus Flink Operator instead of Cloudera Manager. In 2026 "you can run a lakehouse architecture on-prem" feels normal.

A Spark Operator manifest looks like this.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: nightly-iceberg-compaction
  namespace: data
spec:
  type: Scala
  mode: cluster
  image: registry.internal/spark:4.0.0
  mainApplicationFile: s3a://jobs/compaction.jar
  sparkVersion: "4.0.0"
  driver:
    cores: 2
    memory: 4g
    serviceAccount: spark
  executor:
    cores: 4
    instances: 8
    memory: 8g
  deps:
    jars:
      # The runtime jar must match the Spark line (4.0) and Scala version (2.13) of the image.
      - "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-4.0_2.13/1.10.0/iceberg-spark-runtime-4.0_2.13-1.10.0.jar"
  hadoopConf:
    "fs.s3a.endpoint": "https://minio.internal:9000"
    "fs.s3a.path.style.access": "true"

Building and submitting:

# Install the Spark Operator
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace \
  --set sparkJobNamespaces='{data}'

# Quick-start the Iceberg REST catalog
docker run -d --name iceberg-rest \
  -p 8181:8181 \
  -e CATALOG_WAREHOUSE=s3://lakehouse/ \
  -e CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO \
  apache/iceberg-rest-fixture:latest

# Submit the job
kubectl apply -f nightly-iceberg-compaction.yaml
kubectl get sparkapplications -n data

Data engineering hiring market — May 2026 snapshot

Keywords in the data engineering hiring market shifted fast.

  • 2020: Hadoop, Hive, MapReduce, Sqoop, Oozie
  • 2023: Spark, Airflow, Snowflake/BigQuery, dbt, Kafka
  • 2026: Iceberg/Delta, Trino, Flink, dbt + SQLMesh, Airflow 3.0/Dagster, Spark 4.0, lakehouse architecture

The tool frequency in LinkedIn searches for "data engineer" (rough observation, May 2026):

| Tool | Frequency | Note |
| --- | --- | --- |
| Spark | Very high | Effectively required |
| Airflow | Very high | Effectively required |
| dbt | Very high | Listed in most new postings |
| Snowflake/BigQuery/Databricks | Very high | One or two required |
| Iceberg/Delta | High | Exploded since 2024 |
| Trino/Presto | Medium | Required in specific domains |
| Flink | Medium | Streaming roles |
| Kafka | Very high | Effectively required |
| Kubernetes | Medium-high | Platform roles |
| Hadoop/Hive | Low (legacy) | Maintenance roles only |

The "Hadoop operations" requirement on new postings has fallen sharply. Instead, "experience operating Spark/Trino on top of S3/Iceberg" has become the standard.

Learning path — recommended entry route for 2026 data engineers

If you are starting in data engineering today, this is the recommended May 2026 order.

  1. Master SQL deeply, on PostgreSQL or MySQL.
  2. Python + pandas/Polars/DuckDB for local data processing. DuckDB also doubles as lakehouse learning material.
  3. Apache Spark: Start with PySpark. Not via RDDs but via DataFrame/SQL.
  4. Apache Airflow + dbt: Orchestration plus transformation. The core of the modern data stack.
  5. Apache Kafka plus Flink or Spark Streaming for streaming basics.
  6. One cloud: AWS (S3 + EMR + Athena + Glue) or GCP (GCS + Dataproc + BigQuery).
  7. Table format: Iceberg deeply. Delta you absorb naturally inside Databricks environments.
  8. Trino: Federated query plus ad-hoc analysis.
  9. Kubernetes basics: Up to the Spark/Flink Operator level.
  10. Data modeling/Kimball: Dimensional modeling remains essential.

Books: Designing Data-Intensive Applications (Martin Kleppmann), Fundamentals of Data Engineering (Joe Reis & Matt Housley), The Data Warehouse Toolkit (Kimball).

Videos and blogs: Databricks Data + AI Summit, Snowflake Summit, Airflow Summit, Iceberg / Trino Summit keynotes, plus Korean data-engineering conferences (DevFest, PyCon Korea Data Track).
