
Is Hadoop Dead? — The Evolution of the Big Data Stack, From Hadoop to Lakehouse (Spark, Iceberg, Delta, and Where Things Actually Stand in 2026)


Prologue — Why "Is Hadoop Dead?" Is the Wrong Question

A junior data engineer often asks: "Is Hadoop dead?"

Senior engineers tend to give one of two answers. The first is the consultant's answer — "No, plenty of clusters are still running." The second is the startup data lead's answer — "Yes, dead. If you're starting fresh, do not start with Hadoop."

Both are right. And if you can't hold both answers in your head at the same time, you will design a 2026 data platform badly. This post traces the entire evolution — how Hadoop got demoted from the default, what took its place, and where Hadoop still lives.

A better way to phrase the question: "If I'm building a new analytics stack in 2026, is there a reason to choose Hadoop?" The answer is almost always no. But "should I shut down a Hadoop cluster that is currently doing real work?" Also almost always no. This post is about every detail that sits between those two answers.


1. The Timeline — Twenty Years of Big Data Evolution on One Page

The big picture first. The big data stack went through four major generational shifts over twenty years.

   2006              2012-14           2017-20          2022-26
   ─────             ───────           ───────          ───────

   HDFS              HDFS              S3 / GCS         S3 / GCS / ADLS
    +                 +                  +                +
   MapReduce  ──▶   Spark      ──▶   Spark/Trino  ──▶  Spark/Trino/DuckDB
    +                 +                  +                +
   Hive          Hive metastore     Hive metastore   REST Catalog
   (file = table)  (file = table)   (file = table)   (Iceberg/Delta/Hudi)
    +                 +                  +                +
   YARN              YARN              K8s / EMR        K8s / Serverless

   "Hadoop"      "Hadoop + Spark"   "Spark on object   "Lakehouse"
                                     storage"

What divides the generations is clear.

  1. 2006 to 2012 — Classic Hadoop: HDFS plus MapReduce plus Hive, with YARN arriving in Hadoop 2.x. One cluster, with storage, compute, and metastore all bundled together.
  2. 2012 to 2017 — Spark replaces MapReduce: HDFS unchanged, but MR is gone. Hive on Tez or Spark, Impala and Presto bring interactive SQL.
  3. 2017 to 2022 — Storage and compute separate: S3, GCS, and ADLS take HDFS's place. Spark and Trino run on object storage. EMR, Dataproc, and Databricks emerge.
  4. 2022 to 2026 — Lakehouse: The Hive metastore era ends. Apache Iceberg, Delta Lake, and Apache Hudi shift the paradigm from "the file is the table" to "the metadata is the table." Snowflake natively supports Iceberg, Databricks acquires Tabular, and the REST Catalog becomes the standard.

The rest of this post unpacks what replaced what, and why at each transition.


2. Classic Hadoop — What Was the Core, and Where Did It Break

First, pin down what "Hadoop" actually was. The early 2010s "Hadoop stack" was essentially three components.

  • HDFS — A distributed file system. It split data into 64 MB or 128 MB blocks and stored them across multiple nodes' disks, typically with 3x replication.
  • MapReduce — A distributed compute framework. The map phase emits key-value pairs, a shuffle redistributes them by key, and the reduce phase aggregates. Disk-based.
  • YARN — A resource manager. Tracks each node's CPU and memory, schedules job containers.

On top of these sat Hive (SQL interface), HBase (OLTP-ish KV store), and Sqoop plus Flume (ingestion). That bundle was "the Hadoop ecosystem."

The clever insight of this model was data locality — move the compute to the data. In an era of slow networks, "run the task on the node that holds the disk block" was genuinely revolutionary.

Over time, the cracks accumulated.

  1. MapReduce is slow. Disk-based shuffle writes every intermediate result to disk. Iterative workloads (ML, multi-stage ETL) became hideously inefficient.
  2. HDFS is operationally heavy. The NameNode is a single point of failure, it holds metadata in memory so there's a hard limit on file count, the small file problem is chronic, and scaling disk capacity requires adding entire nodes.
  3. Compute and storage are coupled. If you need more disk you also pay for CPU; if you need more CPU you also pay for disk. The least cloud-friendly model imaginable.
  4. The metadata layer (Hive metastore) is weak. Partition-level only — no schema evolution, no transactions, no time travel.

In short, Hadoop was optimized for "rent a data center hall, run batch jobs all day" workloads. As the cloud, interactive queries, ML, and streaming arrived in succession, those assumptions broke.


3. First Transition — Spark Replaces MapReduce

Spark started at UC Berkeley AMPLab in 2010 and graduated to Apache TLP in 2014. Spark's core pitch was simple — "works like MapReduce, ten to a hundred times faster."

Why Spark Was Fast — RDDs and In-Memory Shuffle

Spark models data as an RDD (Resilient Distributed Dataset) — a collection of distributed partitions. A chain of transformations (map, filter, join) becomes a DAG. The key points:

  • No intermediate results to disk — cache in memory when possible.
  • Optimize over the operation graph — within a stage, pipeline without a shuffle.
  • Recover with lineage — even though the data lives in memory, RDD lineage lets lost partitions be recomputed on failure.
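
As a concrete illustration of the caching and lineage points above, here is a minimal PySpark sketch (the path and filters are hypothetical): repeated passes over a cached DataFrame skip recomputation, and the recorded lineage still lets Spark rebuild lost partitions after a failure.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# The first action materializes the data in executor memory; later passes reuse it.
logs = spark.read.text("s3://logs/2026/05/14/").cache()

errors = logs.filter("value LIKE '%ERROR%'").count()   # pass 1: reads from S3, fills the cache
warns  = logs.filter("value LIKE '%WARN%'").count()    # pass 2: served from cached partitions

# If an executor dies, Spark recomputes only the lost partitions from lineage (read + filter).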

Same word count, MapReduce versus Spark.

MapReduce (Java) — long code, writes to disk, slow.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  // main(): create Job, set in/out paths, waitForCompletion …
}

Spark (Python) — the same thing in a handful of lines.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
df = spark.read.text("s3://logs/2026/05/14/")
counts = df.selectExpr("explode(split(value, ' ')) as word").groupBy("word").count()
counts.write.mode("overwrite").parquet("s3://output/wc/")

Code length isn't everything. Operationally Spark won too.

  • SQL as a first-class citizen — Spark SQL absorbed almost all workloads.
  • MLlib, Streaming, GraphX integrated — one engine for batch, ML, streaming, and graph.
  • Good APIs — Scala, Python, R, Java, SQL all supported.

By roughly 2018, new MapReduce jobs had essentially stopped being written. Hadoop clusters were still around, but almost everything running inside them was Spark.

The crucial observation at this stage: HDFS was unchanged. Spark on YARN, Spark on HDFS — Spark took MR's seat, but the Hadoop infrastructure underneath was the same.


4. Second Transition — Object Storage Replaces HDFS

The next domino was storage. From the mid-2010s, AWS S3, Google Cloud Storage, and Azure Data Lake Storage (collectively "object storage") became the default for data lakes.

Why S3 Beat HDFS

Object storage offered five advantages.

  1. Compute and storage separation — Scale CPU and disk independently. Fits the cloud pricing model.
  2. Dramatically cheaper — S3 Standard runs around 0.023 USD per GB per month, Glacier even less. Versus HDFS 3x replication, the price gap is an order of magnitude.
  3. No operations — NameNode, DataNode, disk replacement, rebalance — all handled by AWS.
  4. Eleven nines of durability — S3's 99.999999999% promise beats HDFS 3x replication.
  5. Practically unlimited scale — No real cap on file count. No NameNode memory issues.

The downsides existed too.

  • Higher latency — Each object GET takes tens of milliseconds. HDFS is sub-millisecond.
  • List operations are expensive — S3 LIST is paginated, 1000 keys per page.
  • Eventual consistency (S3 switched to strong read-after-write consistency in December 2020).
  • No rename — S3 "rename" is copy plus delete. Directory-level rename effectively does not exist.

The "no rename" problem broke the standard Hive and Spark output pattern (write to _temporary, rename to final). Writing safely to S3 required a committer (EMRFS S3-Optimized Committer, Magic Committer, etc.). That friction became one of the motivations for the next domino — open table formats.

HDFS's New Role — "Barely Used"

By the mid-2020s, a new cloud-native data platform almost never starts with HDFS. EMR, Databricks, Snowflake, BigQuery — all default to object storage. HDFS survives in two cases.

  1. On-premise clusters — Conservative finance, telecom, and government that cannot or will not move to the cloud.
  2. HDFS-compatible distributed storage — Ozone, MinIO, JuiceFS, and similar next-gen distributed storage that expose S3-compatible interfaces.

The second category is less "HDFS's successor" and more "an on-prem clone of S3." In other words, the object storage model won, and even HDFS's successors now survive by imitating it.


5. Third Transition — Open Table Formats Replace the Hive Metastore

This is the most recent, and most important, transition.

The Hive Limit — "The File Is the Table"

The Hive metastore era assumed something simple — "a directory on S3 or HDFS is a table." Parquet files inside a partition directory like s3://warehouse/orders/dt=2026-05-14/ collectively constitute the table. The metastore records "this table's partition columns, where the directories live, what the schema is."

The problems with this model piled up.

  • No transactions — A failed INSERT INTO ... PARTITION leaves half-written files behind. Idempotency is hard.
  • Weak schema evolution — Adding a column works. Renaming, changing types, evolving nested structures — basically impossible.
  • No time travel — Cannot serve "show me the state as of one hour ago."
  • Small file problem — Streaming ingest produces thousands of tiny files per partition; query performance dies.
  • The metastore itself is a bottleneck — A SHOW PARTITIONS on a giant table can take minutes.

The Three Open Table Formats — Iceberg, Delta, Hudi

To address these limits, three projects emerged within a few years of each other.

  • Apache Iceberg (2018, Netflix) — Metadata files explicitly list all data files. Snapshot-level ACID. Catalog-neutral.
  • Delta Lake (2019, Databricks) — A _delta_log/ directory with JSON/Parquet transaction logs does the same job. Deeply integrated with Databricks.
  • Apache Hudi (2017, Uber) — Strong on streaming and CDC. Copy-on-Write and Merge-on-Read modes.

The shared insight is unmistakable. "Keep metadata alongside the files, and let that metadata implement ACID, time travel, and schema evolution."

The same table, expressed as Iceberg DDL.

CREATE TABLE catalog.db.orders (
  order_id     BIGINT,
  user_id      BIGINT,
  amount_cents BIGINT,
  status       STRING,
  created_at   TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at), bucket(16, user_id))
TBLPROPERTIES (
  'format-version'             = '2',
  'write.target-file-size-bytes' = '536870912',
  'write.parquet.compression-codec' = 'zstd'
);

-- Time travel
SELECT count(*) FROM catalog.db.orders FOR TIMESTAMP AS OF '2026-05-13 00:00:00';

-- Schema evolution — add, rename, promote types
ALTER TABLE catalog.db.orders ADD COLUMN refund_amount_cents BIGINT;
ALTER TABLE catalog.db.orders RENAME COLUMN status TO order_status;

Two things to call out.

  1. PARTITIONED BY (days(created_at), bucket(16, user_id)) — In the Hive era you had to add a fake dt column yourself. Iceberg's hidden partitioning lets the metadata handle the time/hash transform. The user only writes WHERE created_at >= '2026-05-01' and partition pruning still kicks in.
  2. format-version 2 — Iceberg v2 supports row-level deletes. v3 (2025 to 2026) adds deletion vectors and the variant type. Databricks promoted v3 to Public Preview in 2026.
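
To make "the metadata is the table" tangible, Iceberg exposes metadata tables that any engine can query like ordinary tables. A minimal sketch, assuming an existing Spark session ("spark") with the catalog.db.orders table configured as above:

# Every commit is a snapshot; time travel picks one of these.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM catalog.db.orders.snapshots
""").show()

# The current snapshot explicitly lists its data files: no directory listing involved.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM catalog.db.orders.files
""").show()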

Two Events in 2024 — The End of the War

The "Iceberg vs Delta vs Hudi" debate of 2020 to 2023 was effectively settled by two events in 2024.

  1. Snowflake natively supported Iceberg and open-sourced the Polaris Catalog (2024 Summit, graduated to Apache TLP in 2025).
  2. Databricks acquired Tabular for over USD 1 billion — Tabular is the company founded by Iceberg's creators (Ryan Blue, Dan Weeks). In other words, Delta's home base bought Iceberg's home base.

After that, Databricks shipped Delta UniForm so that Delta tables can be read as Iceberg, and in April 2026 Snowflake GA'd write support for Iceberg tables managed by an external Unity Catalog (Azure first). The REST Catalog became the lingua franca.

To summarize: Iceberg has become the de facto standard, and Delta and Hudi pursue interoperability with it.


6. Fourth Transition — The Age of Query Engines (Trino, DuckDB, ClickHouse)

In the Hive on Tez era, interactive SQL hit clear limits. Presto and Trino took that seat.

  • Presto — Started at Facebook in 2012. In 2018-2019 the project split: the original creators left Facebook and relaunched their fork as PrestoSQL, which rebranded to Trino in 2020; Starburst and users like Netflix lined up behind the Trino side.
  • Trino — A massively parallel processing (MPP) distributed SQL engine. Native connectors for Iceberg, Delta, Hudi, and Hive.
  • Starburst — A commercial distribution of Trino. Adds an acceleration layer like Warp Speed.

As of 2026, Trino is in production at Comcast, Goldman Sachs, LinkedIn, Lyft, Netflix, Pinterest, and Salesforce, and is effectively the standard engine for SQL over open data lakes.

A new wave runs alongside.

  • DuckDB — Embedded OLAP. Dominant for single-node Parquet or Iceberg analytics. "DuckDB on a laptop handles up to 1 TB" is said seriously now.
  • ClickHouse — Columnar OLAP database. Strong for real-time analytics. Now ties into the lakehouse via external Iceberg tables.
  • StarRocks and Doris — New-generation MPP analytics databases. Iceberg-native.

The point: query engines are now multi-polar. Batch goes to Spark, interactive analytics to Trino, single-node to DuckDB, real-time to ClickHouse — all sitting on top of the same Iceberg tables, each handling its niche workload.
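
To make the single-node slot concrete, a minimal DuckDB sketch follows (the table path is hypothetical; the httpfs and iceberg extensions ship with DuckDB but must be installed and loaded):

import duckdb

con = duckdb.connect()
# Extensions for reading from S3 and for Iceberg table support
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("INSTALL iceberg; LOAD iceberg;")

# Scan an Iceberg table's current snapshot straight from object storage
con.sql("""
    SELECT order_status, count(*) AS cnt
    FROM iceberg_scan('s3://warehouse/db/orders')
    GROUP BY order_status
    ORDER BY cnt DESC
""").show()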


7. "What Replaced What" — One Matrix

The whole evolution at a glance.

   Layer                 Classic Hadoop (2010)   Hadoop+Spark (2015)   Object Storage (2020)       Lakehouse (2026)
   ─────                 ─────────────────────   ───────────────────   ─────────────────────       ────────────────
   Distributed storage   HDFS                    HDFS                  S3 / GCS / ADLS             S3 / GCS / ADLS
   Batch compute         MapReduce               Spark                 Spark on EMR / Databricks   Spark / Trino / Flink
   Interactive SQL       Hive                    Hive on Tez, Impala   Presto / Trino              Trino / DuckDB / ClickHouse
   Table format          Text, Sequence, ORC     Parquet on Hive       Parquet on Hive             Iceberg / Delta / Hudi
   Metadata              Hive Metastore          Hive Metastore        Hive Metastore / Glue       REST Catalog (Polaris, Unity, Nessie)
   Resource manager      YARN                    YARN                  YARN / K8s / EMR            K8s / serverless
   Streaming             Storm                   Spark Streaming       Spark Structured / Flink    Flink / Kafka / Iceberg streaming
   Operating model       On-prem cluster         On-prem plus cloud    Managed cloud               Multi-engine, multi-catalog

What this matrix says is unambiguous. Nearly every layer that used to be Hadoop has been swapped out. HDFS to object storage, MR to Spark, Hive metastore to REST Catalog and Iceberg, YARN to Kubernetes.

What remains is the conceptual influence. The core Hadoop idea — "park data on disks, distribute compute, abstract via metadata" — is still the foundation of every distributed data system. But the concrete components sitting on that foundation have been almost entirely replaced.


8. Where Hadoop Still Lives

"OK, so where does Hadoop actually live?" In 2026, in these four areas it is still active.

8.1 Huge Legacy Clusters

Big enterprises, telecom carriers, and banks with petabyte-scale HDFS clusters that have run for over a decade cannot turn them off overnight. Cloudera and Hortonworks (now merged) licenses, operations team know-how, migration cost and risk — all push toward "keep it running." These clusters are usually under incremental migration — new data lands in S3 plus Iceberg, old data stays on HDFS and moves over time.

8.2 Hive Metastore — Surviving as a Compatibility Layer

An interesting fact: the Hive Metastore itself has not died. As Iceberg abstracted the catalog, the Hive Metastore became one implementation of an Iceberg catalog. That is:

   Then: Hive table → Hive Metastore → directory (Parquet)
   Now:  Iceberg table → Hive Metastore catalog → Iceberg metadata → Parquet

It is common for organizations to keep the Hive Metastore in place and register and manage Iceberg tables through it. By becoming an "Iceberg catalog compatibility layer," the Hive Metastore enabled incremental migration. Expose the same metastore via a REST Catalog and Trino, Spark, Snowflake, and Flink all see the same tables.
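A minimal sketch of that setup in Spark, assuming a metastore reachable at thrift://metastore:9083 (the catalog name is illustrative; the properties follow Iceberg's Spark catalog configuration):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hms-as-iceberg-catalog")
    # Iceberg SQL extensions (MERGE, CALL procedures, etc.)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register the existing Hive Metastore as an Iceberg catalog named "legacy"
    .config("spark.sql.catalog.legacy", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.legacy.type", "hive")
    .config("spark.sql.catalog.legacy.uri", "thrift://metastore:9083")
    .getOrCreate()
)

# Iceberg tables registered in the old metastore are now addressable as legacy.db.table
spark.sql("SELECT count(*) FROM legacy.db.orders").show()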

8.3 On-Premise Conservatism — Finance, Telecom, Government

Some organizations cannot move to public object storage due to data sovereignty, regulation, or operations policy. Parts of Korean and Japanese finance, the three major telecom carriers, and government systems are typical. In these places, Apache Ozone (the HDFS successor project) or S3-compatible on-prem storage like MinIO and JuiceFS take root. YARN is slowly migrating toward Kubernetes.

8.4 YARN — Slowly Moving to Kubernetes

YARN was once Hadoop's pride, but in 2026 almost all new deployments use Kubernetes. EMR on EKS, Databricks (which has always had its own resource manager), Spark on K8s — all bypass YARN. Existing YARN clusters that are stable enough to leave little motivation to migrate keep running as is. Hadoop 3.5 (2026) still actively maintains YARN.


9. Why Open Table Formats Won — In One Paragraph

The most important insight in this post. The reason open table formats (especially Iceberg) became the standard is simple.

The "truth of the table" moved from the directory to the metadata, which let any engine reading the table see the same result.

In the Hive era, the same table read from Spark, Presto, or Hive could yield subtly different results. Nothing prevented dropping a file directly into a partition directory, and there were no transactions. Iceberg cut that with the model "snapshot equals metadata file equals the truth at that point." One table, many engines, same result — that is the core promise of the lakehouse era.

The reason Snowflake and Databricks each surrendered part of their proprietary format and joined Iceberg is the same. Customers stopped compromising on "I don't want to be locked into one vendor." In 2026, an analytics stack is format-neutral and catalog-neutral by default.


10. Designing a New Analytics Stack in 2026 — The Default Stack

If you're starting today, the following is a safe default.

   ┌──────────────────────────────────────────────────────────┐
   │  Storage:    S3 / GCS / ADLS (or S3-compatible on-prem)  │
   │  Format:     Apache Iceberg (or Delta + UniForm)         │
   │  Catalog:    Polaris / Unity / Glue / Nessie             │
   │  Batch:      Spark (Databricks, EMR, Glue) or Flink      │
   │  Interactive:Trino (or Starburst Galaxy)                 │
   │  Single-node:DuckDB (developer / BI workbench)           │
   │  Streaming:  Kafka + Flink + Iceberg streaming           │
   │  Orchestrator: Airflow / Dagster / Prefect               │
   │  Transform:  dbt (or SQLMesh)                            │
   │  Observability: OpenLineage + Marquez / DataHub          │
   └──────────────────────────────────────────────────────────┘

The virtue of this stack is low lock-in. No component is tied to a single vendor. Iceberg is a standard. Trino, Spark, and Flink are OSS. Polaris, Nessie, and Unity Catalog all follow the REST Catalog spec.
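
The catalog line is what keeps the engines interchangeable. A sketch of pointing Spark at a REST catalog (the endpoint, credential, and warehouse name are placeholders; property names follow Iceberg's REST catalog options):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rest-catalog")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api/catalog")  # Polaris / Unity / Nessie endpoint
    .config("spark.sql.catalog.lake.credential", "<client-id>:<client-secret>")       # placeholder credential
    .config("spark.sql.catalog.lake.warehouse", "analytics")                          # catalog-defined warehouse name
    .getOrCreate()
)

# Trino, Flink, or Snowflake pointed at the same REST endpoint sees the same tables.
spark.sql("SHOW TABLES IN lake.db").show()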

What to Avoid

If starting fresh, avoid the following.

  1. A new HDFS deployment — Cloud or on-prem, it doesn't matter. Go S3-compatible (Ozone, MinIO).
  2. A new MapReduce job — Spark or Trino can already do it.
  3. Exposing the Hive Metastore as your catalog directly — Use it only behind Iceberg. Abstract it with a REST Catalog if you can.
  4. Creating new Hive tables — Use Iceberg. Conversion costs only grow over time.
  5. Vendor-proprietary table formats — Don't permanently store data in Snowflake's internal format or BigQuery native format. Keep it readable by external engines.
  6. Single-engine ETL — If you write all ETL in Databricks notebooks, Snowpark, or BigQuery SQL only, migration is hell. Put a portable layer like Spark SQL or dbt in between.

What You Can Keep As-Is

Conversely, you do not need to rip out the following.

  1. A well-running YARN cluster — Keep it if it's stable. Migrate to Kubernetes per workload.
  2. Hive Metastore — Repurpose as an Iceberg catalog and you're done.
  3. Spark jobs — If they already write to Iceberg or Delta, almost nothing to do.
  4. Old data on HDFS — Leave cold data alone; route new data to S3.

11. Migration Patterns — Classic Hadoop to Lakehouse

There are three common patterns for moving from legacy to a modern stack.

11.1 Dual Write

New ingest jobs write to both HDFS and S3 plus Iceberg. After a comparison period to validate, the old path is shut off. Safe but doubles resources.
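
A minimal sketch of the pattern, assuming the Iceberg catalog from the earlier sketches and a legacy HDFS path (both destinations are illustrative):

from pyspark.sql import SparkSession

# Assumes a session configured with an Iceberg catalog named "catalog"
spark = SparkSession.builder.appName("dual-write").getOrCreate()

# One ingest batch, two destinations: the new Iceberg table and the legacy HDFS path.
df = spark.read.json("s3://landing/orders/2026/05/14/")

# New path: append into the Iceberg table (ACID, schema-checked)
df.writeTo("catalog.db.orders").append()

# Old path: keep writing the Hive-style partition directory until validation ends
df.write.mode("append").parquet("hdfs://namenode/warehouse/orders/dt=2026-05-14/")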

11.2 In-Place Migration

Iceberg's add_files procedure registers existing HDFS Parquet directories as an Iceberg table without copying. Files stay put, only metadata is added. Fast, but the HDFS coupling is unchanged.

-- Register the files of an existing Hive table into an (already created) Iceberg table, no file copy
CALL iceberg_catalog.system.add_files(
  table => 'db.orders',
  source_table => 'hive.db.orders'
);

11.3 Catalog Unification, Then Incremental Migration

Layer an Iceberg catalog adapter on top of the Hive Metastore. Create new tables only as Iceberg. Leave old Hive tables alone, but expose both through the same catalog. Over time the old tables retire naturally. The most recommended approach.


Epilogue — Why "Is It Dead" Is the Wrong Question

Hadoop is not dead. But it has been demoted from being the default.

When Linux first arrived, people asked "is Unix dead?" The accurate answer was "System V Unix is barely used anywhere, the BSD family and Linux took that territory, but the ideas of Unix live everywhere." Hadoop is the same. HDFS and MapReduce are virtually invisible in new workloads, but Hadoop's core ideas — "distribute data on disks, move compute to the data, abstract via metadata" — survive in a better implementation: object storage plus Iceberg plus Trino.

A 2026 data engineer's job is not to ask "is Hadoop dead." It is to ask three questions.

  1. Where should the truth of our data live? — On object storage, in an open table format, behind a standard catalog.
  2. Which engine will own which workload? — Spark for batch, Trino for interactive, DuckDB for single-node, Flink for streaming, ClickHouse for real-time.
  3. How will we avoid lock-in to any one vendor? — A three-layer abstraction of Iceberg plus REST Catalog plus a portable SQL layer (dbt).

Checklist

When designing a new analytics platform, verify the following.

  • Is storage object storage? (One of S3, GCS, ADLS, Ozone, MinIO)
  • Is the table format Iceberg (or Delta UniForm)? Not just Hive text or ORC?
  • Does the catalog follow the REST spec? (Polaris, Unity, Nessie, Glue, etc.)
  • Is compute on Kubernetes or a managed service? Not strictly bound to YARN?
  • Is interactive query handled by Trino (or Starburst)?
  • Does streaming ingest write directly to Iceberg or Delta? (Kafka Connect, Flink Iceberg sink)
  • Is there a dbt or equivalent SQL transformation layer? Portable SQL rather than engine-specific ETL?
  • Is the catalog permission model unified under Lake Formation, Unity, or Polaris RBAC?
  • Is lineage captured via OpenLineage? Visible from one place?
  • Is there a migration path? Is there a documented procedure to move legacy Hive and HDFS assets gradually?

Common Anti-Patterns

Avoid the following in any new 2026 project.

  1. "Start with HDFS now, move later" — Later never comes. Start with S3.
  2. "We have a Hadoop operations team, so we'll go Hadoop" — Ops team familiarity should not be the top criterion for new architecture.
  3. "Hive Metastore is familiar, so we'll keep using it" — Keep it, but expose through an Iceberg catalog adapter.
  4. "Put all data inside Snowflake or Databricks" — That's the start of vendor lock-in. Keep an externally readable format (Iceberg).
  5. "One engine for all workloads" — Batch, interactive, streaming, and single-node are each owned by different engines.
  6. "File equals table" — Do not write files directly into a directory. Always go through the table interface.
  7. "One catalog is enough" — Plan for multi-catalog and federation from the start.
  8. "Pick the format later" — Format is decided at ingest. Migration is expensive.

Next Post

The companion piece — "The Iceberg Catalog Wars — Polaris vs Unity vs Nessie vs Glue, and How to Federate Multi-Catalog Environments" — covers the REST Catalog spec in detail, the differences between implementations, and how to design permissions, lineage, and failover across a multi-catalog environment.

