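Is Hadoop Dead? The Evolution of the Big Data Stack, from Hadoop to the Lakehouse (Spark, Iceberg, Delta, and the Reality of 2026)

Prologue: Why "Is Hadoop Dead?" Is the Wrong Question

New data engineers often ask: "Is Hadoop dead?"

Senior engineers give one of two answers. One is the senior consultant's answer: "No, plenty of clusters are still in production." The other is the startup data lead's answer: "Yes, it's dead. If you are starting fresh, never start with Hadoop."

Both are right. And if you cannot hold both answers at once, you will design a data platform badly in 2026. This post walks through the whole evolution: how Hadoop stepped down from being the default, what replaced it, and where Hadoop is still alive.

Let's restate the question more precisely. "If you build a new analytics stack in 2026, is there a reason to choose Hadoop?" The answer is almost always no. But "is it right to switch off a Hadoop cluster that is already running?" The answer is also almost always no. This post is about every detail that lies between those two.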

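Prologue: Why "Is Hadoop Dead?" Is the Wrong Question
1. Timeline: The Evolution of Big Data on One Page
2. Classic Hadoop: What Was the Core, and What Were the Limits
3. First Transition: Spark Replaces MapReduce
- Why Spark Was Fast: RDDs and In-Memory Shuffle
4. Second Transition: Object Storage Replaces HDFS
- Why S3 Beat HDFS
- The New Role of HDFS: "Barely Used"
5. Third Transition: Open Table Formats Replace the Hive Metastore
6. Fourth Transition: The Age of Query Engines (Trino, DuckDB, ClickHouse)
7. "What Replaced What": A One-Glance Matrix
8. Where Hadoop Is Still Alive
9. Why Open Table Formats Won, in One Paragraph
10. Building a New Analytics Stack in 2026: Recommended Combination
- What to Avoid
- What You Can Keep As-Is
11. Migration Patterns: From Classic Hadoop to the Lakehouse
Epilogue: Why "Is It Dead?" Is the Wrong Question
References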

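1. Timeline: The Evolution of Big Data on One Page

The big picture first. Starting in 2006, the big data stack went through four major generational shifts over roughly twenty years.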

   2006              2012-14           2017-20          2022-26
   ─────             ───────           ───────          ───────

   HDFS              HDFS              S3 / GCS         S3 / GCS / ADLS
    +                 +                  +                +
   MapReduce  ──▶   Spark      ──▶   Spark/Trino  ──▶  Spark/Trino/DuckDB
    +                 +                  +                +
   Hive          Hive metastore     Hive metastore   REST Catalog
   (file = table)  (file = table)   (file = table)   (Iceberg/Delta/Hudi)
    +                 +                  +                +
   YARN              YARN              K8s / EMR        K8s / Serverless

   "Hadoop"      "Hadoop + Spark"   "Spark on object   "Lakehouse"
                                     storage"

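What separates the generations is clear.

2006–2012, the classic Hadoop era: HDFS + MapReduce + Hive, with YARN arriving alongside Hadoop 2.x toward the end of the period. Data, compute, and the metastore were all tied to one cluster.
2012–2017, Spark displaces MR: HDFS stayed, and Spark moved into the seat MapReduce had held. Hive on Tez/Spark, Impala, and Presto brought interactive queries.
2017–2022, storage and compute separate: S3, GCS, and ADLS took the place of HDFS. Spark and Trino began to run on top of object storage. Managed services such as EMR, Dataproc, and Databricks became the usual way to run them.
2022–2026, the Lakehouse era: the Hive metastore era ended and the stack moved to Apache Iceberg, Delta Lake, and Apache Hudi, that is, from "the file is the table" to "the metadata is the table." Snowflake added native Iceberg support, Databricks acquired Tabular, and the REST Catalog became the standard.

The rest of this post explains what replaced what at each transition, and why.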

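2. Classic Hadoop: What Was the Core, and What Were the Limits

First, let's pin down exactly what "Hadoop" was. The "Hadoop stack" of the early 2010s effectively consisted of three components.

HDFS: a distributed file system. It splits data into 64 MB or 128 MB blocks and stores them replicated (usually three copies) on the disks of several nodes.
MapReduce: a distributed compute framework. The map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group. Disk-based.
YARN: a resource manager. It manages the CPU and memory of the nodes and allocates job containers.

On top of these, Hive added a SQL interface, HBase an OLTP-ish KV store, and Sqoop and Flume ingestion tools. Together that was the "Hadoop ecosystem."

The genius of this model was data locality: move the computation to the data. When networks were slow, the idea of "process a disk block on the node that disk is attached to" was innovative enough.

But the limits piled up over time.

MapReduce is slow. Disk-based shuffle writes all intermediate results to disk. It was terribly inefficient for iterative work (machine learning, multi-stage ETL).
HDFS is heavy to operate. The NameNode was a single point of failure until HA arrived in Hadoop 2.x, it keeps metadata in memory so the number of files is limited, the small file problem is chronic, and growing disk capacity means adding whole nodes.
Compute and storage are coupled. If disk runs short you add nodes even when CPU is idle, and if CPU runs short you add nodes even when disk is idle. It is the model least suited to the cloud era.
The metadata layer (Hive metastore) is weak. It manages tables only at partition granularity and could not meet modern analytics needs such as schema evolution, transactions, and time travel.

In short, Hadoop was optimized for workloads of the form "rent a data center hall and run batch jobs all day." As the cloud, interactive queries, machine learning, and streaming arrived one after another, those assumptions collapsed.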

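3. First Transition: Spark Replaces MapReduce

Spark began at UC Berkeley's AMPLab in 2009, was open-sourced in 2010, and spread in earnest after becoming an Apache top-level project in 2014. Its core value proposition was simple: "it runs like MapReduce but 10 to 100 times faster."

Why Spark Was Fast: RDDs and In-Memory Shuffle

Spark models data with an abstraction called the RDD (Resilient Distributed Dataset). An RDD is a collection of distributed partitions, and a chain of transformations (map/filter/join) is expressed as a DAG. The key points:

Intermediate results stay in memory where possible instead of being written to HDFS between steps (shuffle data still goes through local disk).
It optimizes over the operation graph: within a stage, operations are pipelined without a shuffle.
It recomputes from lineage on failure: even for in-memory data, knowing the RDD's lineage allows partial recomputation.

Let's compare the same word count in Hadoop MR and in Spark.

MapReduce (Java): long code, writes to disk, slow.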

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  // main(): create the Job, set input/output paths, waitForCompletion …
}

Spark (Python): the same job in a handful of lines.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
df = spark.read.text("s3://logs/2026/05/14/")
counts = df.selectExpr("explode(split(value, ' ')) as word").groupBy("word").count()
counts.write.mode("overwrite").parquet("s3://output/wc/")

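Code length does not tell the whole story. But Spark won on the operational side as well.

SQL is a first-class citizen: Spark SQL absorbed almost every workload.
MLlib, Streaming, and GraphX are integrated: the same engine handles batch, ML, streaming, and graph.
The APIs are good: Scala, Python, R, Java, and SQL are all covered.

By around 2018, hardly any new project wrote new MapReduce code. Hadoop clusters remained, but almost all the jobs running in them were Spark.

The important point at this stage is that HDFS stayed put. Spark on YARN, Spark on HDFS: Spark took MapReduce's seat but kept using the Hadoop infrastructure (HDFS and YARN).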

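4. Second Transition: Object Storage Replaces HDFS

The next domino was storage. From the mid-2010s, AWS S3, Google Cloud Storage, and Azure Data Lake Storage (ADLS), referred to below as object storage, started to become the default for data lakes.

Why S3 Beat HDFS

Object storage has five advantages.

Compute and storage separation: CPU and disk can be scaled independently. This fits the cloud pricing model.
Very cheap, relatively speaking: S3 Standard is around 0.023 USD per GB per month, and S3 Glacier is cheaper still. Against HDFS with 3x replication, where every usable GB consumes three GB of provisioned disk, the gap can reach an order of magnitude.
No operations: NameNode, DataNodes, disk replacement, and rebalancing are all handled by the cloud provider.
Eleven nines of durability: S3 is designed for 99.999999999% durability, which a self-operated HDFS cluster with 3x replication is hard-pressed to match.
Nearly unlimited scale: there is effectively no limit on the number of files, and nothing like the NameNode memory problem.

There were downsides too.

High latency: a GET per object takes tens of milliseconds. HDFS serving a nearby block is much faster.
LIST is expensive: S3 LIST is paginated in units of 1000 keys.
Eventual consistency, historically: S3 switched to strong read-after-write consistency in December 2020.
No rename: a "rename" on S3 is copy plus delete. Directory-level rename is effectively impossible.

Of these, the lack of rename wrecked the existing Hive and Spark output pattern (write to _temporary, rename to final). So writing safely on S3 required a committer (EMRFS S3-Optimized Committer, Magic Committer, and so on). This friction became one of the motivations for the next domino, open table formats.

The New Role of HDFS: "Barely Used"

By the mid-2020s, a data platform starting fresh in the cloud almost never uses HDFS. EMR, Databricks, Snowflake, and BigQuery all default to object storage. HDFS survives in only two cases.

On-premises clusters: conservative finance, telecom, and government organizations that cannot or will not go to the cloud.
HDFS-compatible distributed storage: new-generation distributed storage such as Ozone, MinIO, and JuiceFS that offers an S3-compatible interface.

The latter is really closer to "an on-premises clone of S3" than to "the nominal heir of HDFS." In other words, the object storage model won, and HDFS was pushed into the position of imitating it.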

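5. Third Transition: Open Table Formats Replace the Hive Metastore

This is the most recent and the most important transition.

The Limit of Hive: "The File Is the Table"

The core assumption of the Hive metastore era was simple: "a directory on S3 or HDFS is the table." The Parquet files inside partition directories such as s3://warehouse/orders/dt=2026-05-14/ make up one table. The metastore records "what this table's partition columns are, where the directories are, and what the schema is."

The problems with this model accumulated.

No transactions: if INSERT INTO ... PARTITION fails midway, half-written files are left behind. Idempotency is hard to guarantee.
Weak schema evolution: adding a column works, but renaming a column, changing a type, or changing a nested structure is effectively not possible.
No time travel: a request like "I want to see the state as of one hour ago" cannot be met.
Small file problem: when streaming ingestion piles up thousands of small files per partition, queries become painfully slow.
The metastore itself is a bottleneck: on a table with a huge number of partitions, SHOW PARTITIONS takes minutes.

The Big Three Open Table Formats: Iceberg, Delta, Hudi

To solve these limits, three projects appeared at almost the same time.

Apache Iceberg (2018, Netflix): metadata files explicitly list every data file. Snapshot-level ACID. Catalog-neutral.
Delta Lake (2019, Databricks): does the same job with a transaction log (JSON and Parquet) in the _delta_log/ directory. Deeply integrated with Databricks.
Apache Hudi (2017, Uber): strong at streaming and CDC. Two modes, Copy-on-Write and Merge-on-Read.

What they share is clear: "keep the metadata together with the files, and use that metadata to implement ACID, time travel, and schema evolution."

Let's create the same table with Iceberg DDL.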

CREATE TABLE catalog.db.orders (
  order_id     BIGINT,
  user_id      BIGINT,
  amount_cents BIGINT,
  status       STRING,
  created_at   TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at), bucket(16, user_id))
TBLPROPERTIES (
  'format-version'             = '2',
  'write.target-file-size-bytes' = '536870912',
  'write.parquet.compression-codec' = 'zstd'
);

-- Time travel
SELECT count(*) FROM catalog.db.orders FOR TIMESTAMP AS OF '2026-05-13 00:00:00';

-- Schema evolution: adding columns, renaming, and type promotion are all possible
ALTER TABLE catalog.db.orders ADD COLUMN refund_amount_cents BIGINT;
ALTER TABLE catalog.db.orders RENAME COLUMN status TO order_status;

Two things here are worth pointing out.

PARTITIONED BY (days(created_at), bucket(16, user_id)): in the Hive days you had to add a fake column such as dt yourself. With hidden partitioning, Iceberg lets the metadata handle the time and hash transforms. Partition pruning works even if the user writes only WHERE created_at >= '2026-05-01'.
format-version 2: Iceberg v2 supports row-level deletes. v3 (2025–2026) added deletion vectors and the variant type, and Databricks moved v3 to Public Preview in 2026.

Two Events in 2024: The End of the War

The "Iceberg vs Delta vs Hudi" debate of 2020–2023 was effectively decided by two events in 2024.

Snowflake shipped native Iceberg support and open-sourced the Polaris Catalog (2024 Summit, followed by graduation to an Apache TLP in 2025).
Databricks acquired Tabular, reportedly for more than about 1 billion USD. Tabular is the company of Iceberg's founders (Ryan Blue, Dan Weeks). In effect, the home of Delta bought the home of Iceberg.

Databricks had already made Delta readable as Iceberg through Delta UniForm (announced in 2023), and in April 2026 the interoperability that lets Snowflake write to Iceberg tables managed by an external Unity Catalog went GA (on Azure). The REST Catalog became the de facto lingua franca.

Summary: Iceberg has become the de facto standard, and Delta and Hudi pursue interoperability with Iceberg.

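6. Fourth Transition: The Age of Query Engines (Trino, DuckDB, ClickHouse)

In the Hive on Tez era, interactive SQL had clear limits. Presto/Trino took that seat.

Presto: started at Facebook in 2012. In 2018 the original creators left Facebook and continued the project as the PrestoSQL fork, which was rebranded as Trino in December 2020.
Trino: an MPP (Massively Parallel Processing) distributed SQL engine. Native connectors for Iceberg, Delta, Hudi, and Hive.
Starburst: a commercial distribution of Trino. Adds acceleration layers such as Warp Speed.

As of 2026, Trino runs in production at Comcast, Goldman Sachs, LinkedIn, Lyft, Netflix, Pinterest, Salesforce, and others, and is effectively the standard engine for running SQL on open data lakes.

A new current runs beside it.

DuckDB: embedded OLAP. Hard to beat for single-node Parquet and Iceberg analytics. "Up to 1 TB on a laptop, DuckDB is enough" is now said seriously.
ClickHouse: a columnar OLAP database. Strong at real-time analytics. It also combines with the lakehouse through support for external Iceberg tables.
StarRocks and Doris: the new generation of MPP analytics databases. Iceberg-native.

Key point: query engines are now multipolar. Batch goes to Spark, interactive analytics to Trino, single-node work to DuckDB, and real-time to ClickHouse. Different engines handle the workloads in their own territory on top of the same Iceberg tables.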

7. "What Replaced What": A One-Glance Matrix

Let's sum up the whole evolution in one table.

Layer	Classic Hadoop (2010)	Hadoop+Spark (2015)	Object storage (2020)	Lakehouse (2026)
Distributed storage	HDFS	HDFS	S3 / GCS / ADLS	S3 / GCS / ADLS
Batch compute	MapReduce	Spark	Spark on EMR/Databricks	Spark / Trino / Flink
Interactive SQL	Hive	Hive on Tez, Impala	Presto / Trino	Trino / DuckDB / ClickHouse
Table format	Text, Sequence, ORC	Parquet on Hive	Parquet on Hive	Iceberg / Delta / Hudi
Metadata	Hive Metastore	Hive Metastore	Hive Metastore / Glue	REST Catalog (Polaris, Unity, Nessie)
Resource management	YARN	YARN	YARN / K8s / EMR	K8s / serverless
Streaming	Storm	Spark Streaming	Spark Structured / Flink	Flink / Kafka / Iceberg streaming
Operating model	On-premises cluster	On-premises + cloud	Managed cloud	Multi-engine / multi-catalog

What this table says is plain. Almost every layer that Hadoop used to occupy has been replaced by something else: HDFS by object storage, MR by Spark, the Hive metastore by the REST Catalog and Iceberg, YARN by Kubernetes.

What remains is the conceptual influence. Hadoop's basic idea, "keep data on disks and distribute the computation," is still the foundation of every distributed data system. But nearly all the concrete components built on that foundation have been swapped out.

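8. Where Hadoop Is Still Alive

"So where is Hadoop really still alive?" As of 2026 it is still in active service in the following four areas.

8.1 Huge legacy clusters

Large enterprises, telecoms, and financial institutions that own petabyte-scale HDFS clusters operated for more than ten years cannot switch them off in one go. Licenses from Cloudera and Hortonworks (now merged), the operations team's know-how, the cost and risk of moving: everything weighs toward "just keep it running." These clusters are usually in gradual migration. New data goes to S3 and Iceberg, while old data stays on HDFS and is moved over time.

8.2 Hive Metastore: surviving as a compatibility layer

An interesting fact: the Hive Metastore itself did not die. As Iceberg abstracted the catalog, the Hive Metastore became one implementation of an Iceberg catalog. That is:

   Before: Hive table → Hive Metastore → directory (Parquet)
   Now:    Iceberg table → Hive Metastore catalog → Iceberg metadata → Parquet

It is common for a company to keep its Hive Metastore and register and manage Iceberg tables inside it. Because the Hive Metastore became an "Iceberg catalog compatibility layer," gradual migration became possible. If the same metastore is exposed as a REST Catalog, Trino, Spark, Snowflake, and Flink all see the same tables.

8.3 Conservative on-premises territory: finance, telecom, government

Some organizations still cannot move to cloud object storage because of data sovereignty, regulation, or operating policy. Parts of the financial sector in Korea and Japan, the three Korean telecom carriers, and government systems are typical. In such places Apache Ozone (the follow-on project to HDFS) and S3-compatible on-premises storage such as MinIO and JuiceFS are taking hold. YARN is gradually moving to Kubernetes.

8.4 YARN: gradually moving to K8s

YARN was once the pride of Hadoop, but as of 2026 almost all new deployments are on Kubernetes. EMR on EKS, Databricks (which has had its own resource management from the start), and Spark on K8s all bypass YARN. Where an existing YARN cluster is stable and there is little motivation to move, however, it keeps running as is. Hadoop 3.5 (2026) also continues to maintain YARN actively.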

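9. Why Open Table Formats Won, in One Paragraph

This is the most important insight of this post. The reason open table formats (Iceberg in particular) ended up as the standard is simple.

Because "the truth of the table" moved from the directory to the metadata, it became possible to guarantee the same result no matter who reads the table or how.

In the Hive era, reading the same table with Spark, Presto, and Hive could give subtly different results. Nothing prevented dropping files directly into a partition directory, and there were no transactions. Iceberg cut that problem off with the model "snapshot = metadata file = the truth at that point in time." One table, many engines, the same result: that is the core promise of the Lakehouse era.

The same reason explains why Snowflake and Databricks gave up part of their closed formats and joined Iceberg. Customers stopped conceding, in negotiations, on "we do not want to be locked in to a single vendor." For the 2026 analytics stack, format neutrality and catalog neutrality are baseline assumptions.

10. Building a New Analytics Stack in 2026: Recommended Combination

If you start fresh today, the following combination is a safe default.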

   ┌────────────────────────────────────────────────────────────┐
   │  Storage:       S3 / GCS / ADLS (or on-prem S3-compatible) │
   │  Format:        Apache Iceberg (or Delta + UniForm)        │
   │  Catalog:       Polaris / Unity / Glue / Nessie            │
   │  Batch:         Spark (Databricks, EMR, Glue) or Flink     │
   │  Interactive:   Trino (or Starburst Galaxy)                │
   │  Single-node:   DuckDB (developer and BI workbench)        │
   │  Streaming:     Kafka + Flink + Iceberg streaming          │
   │  Orchestrator:  Airflow / Dagster / Prefect                │
   │  Transform:     dbt (or SQLMesh)                           │
   │  Observability: OpenLineage + Marquez/DataHub              │
   └────────────────────────────────────────────────────────────┘

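The virtue of this combination is low lock-in. No component is tied to a single vendor. Iceberg is a standard, Trino, Spark, and Flink are open source, and Polaris, Nessie, and Unity Catalog all follow the REST Catalog spec.

What to Avoid

If you are starting fresh, avoid the following.

New HDFS deployments: whether in the cloud or on-premises. Go S3-compatible (Ozone, MinIO).
New MapReduce jobs: Spark and Trino already cover them.
Exposing the Hive metastore as-is as the catalog: use it only in combination with Iceberg. If possible, abstract it behind a REST Catalog.
Creating new Hive tables: create them as Iceberg. The conversion cost only grows.
Vendor-specific table formats: do not keep data permanently in Snowflake's internal format or BigQuery's native format. Keep a form that can be read from outside.
ETL tied to a single engine: if the ETL exists only as Databricks notebooks, Snowpark, or BigQuery SQL, migration is hell. Keep a portable layer such as Spark SQL or dbt.

What You Can Keep As-Is

Conversely, there is no need to replace the following if you already have them.

A YARN cluster that runs well: if it is stable, leave it. Migrate to Kubernetes workload by workload.
Hive Metastore: just reuse it as an Iceberg catalog.
Spark jobs: if they already write Iceberg or Delta, there is almost nothing more to do.
Old data on HDFS: leave cold data where it is and start landing new data on S3.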

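11. Migration Patterns: From Classic Hadoop to the Lakehouse

There are three common patterns for moving from the legacy stack to the modern one.

11.1 Dual write

The new ingestion job writes to HDFS and to S3/Iceberg at the same time. After a period of comparing and validating the same data on both sides, the old path is switched off. Safe, but it costs twice the resources.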
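The compare-and-validate step of a dual write can be sketched as an order-independent fingerprint computed over both copies. This is a minimal sketch, assuming rows fit the shape of plain dictionaries; `table_fingerprint` and `paths_match` are hypothetical helper names, not part of Spark or Iceberg, and a real check would run as a distributed aggregation rather than in one process.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def table_fingerprint(rows: Iterable[Dict[str, object]]) -> Tuple[int, int]:
    """Order-independent fingerprint: (row count, sum of per-row hashes mod 2**64)."""
    count, acc = 0, 0
    for row in rows:
        # Sort the columns so that column order does not change the hash.
        canonical = repr(sorted(row.items())).encode("utf-8")
        digest = hashlib.sha256(canonical).digest()
        acc = (acc + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, acc


def paths_match(old_rows: Iterable[Dict[str, object]],
                new_rows: Iterable[Dict[str, object]]) -> bool:
    """True when the old path and the new path hold the same multiset of rows."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)


if __name__ == "__main__":
    hdfs_copy = [{"order_id": 1, "amount_cents": 500}, {"order_id": 2, "amount_cents": 700}]
    iceberg_copy = [{"amount_cents": 700, "order_id": 2}, {"order_id": 1, "amount_cents": 500}]
    print("copies match:", paths_match(hdfs_copy, iceberg_copy))
```

Summing the row hashes rather than XOR-ing them keeps duplicated rows from cancelling each other out, which matters when the failure being hunted is a double write.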

11.2 In-place migration

Iceberg's add_files procedure registers an existing HDFS Parquet directory as an Iceberg table "without copying." The files stay where they are and only metadata is added. It is fast, but the coupling to HDFS remains.

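-- Register an existing Hive external table's files in an Iceberg table (no file copy)
-- In Spark the procedure is addressed as <catalog>.system.add_files
CALL iceberg_catalog.system.add_files(
  table => 'iceberg_catalog.db.orders',
  source_table => 'hive.db.orders'
);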

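11.3 Catalog unification, then gradual migration

Put an Iceberg catalog adapter on top of the Hive Metastore. Create only new tables as Iceberg, leave the old Hive tables as they are, and expose both through the same catalog. Over time the old tables retire naturally. This is the most recommended approach.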

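Epilogue: Why "Is It Dead?" Is the Wrong Question

Hadoop is not dead. But it has stepped down from the position of "the default."

When Linux first appeared, people asked "is Unix dead?" The accurate answer was "Unix System V is hardly used, and the BSD family and Linux took over its territory, but the ideas of Unix live everywhere." Hadoop is the same. HDFS and MapReduce are rarely seen in new workloads, but Hadoop's core ideas, "store data in a distributed way, move the computation to the data, and abstract through metadata," survive in a better implementation: object storage + Iceberg + Trino.

What a data engineer should do in 2026 is not ask "is Hadoop dead." It is to ask the following three questions.

Where should the truth of our data live? In object storage, in an open table format, in a standard catalog.
Which engine takes which workload? Spark for batch, Trino for interactive, DuckDB for single-node, Flink for streaming, ClickHouse for real-time.
How do we avoid being tied to one vendor? Through a triple abstraction: Iceberg + REST Catalog + a portable SQL layer (dbt).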

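Checklist

When designing a new analytics platform, check the following.

Common Anti-Patterns

The following are patterns to avoid in a new project in 2026.

"Start on HDFS for now and move later": later never comes. Use S3 from the start.
"We have a Hadoop operations team, so we go with Hadoop": the operations team's know-how must not become the first priority in an architecture decision for a new system.
"We are used to the Hive Metastore, so we keep using it as is": keep it, but expose it through an Iceberg catalog adapter.
"Keep all data inside Snowflake/Databricks": the start of vendor lock-in. Keep a format that can be read from outside (Iceberg).
"One engine for every workload": batch, interactive, streaming, and single-node work are each done well by different engines.
"File = table": do not write files directly into directories. Always go through the table interface.
"One catalog is enough": consider multi-catalog and federation from the start.
"We will decide the format later": the format is decided the first time you ingest. Migration is expensive.

Next Post

The next post is a companion piece in this series: "The Iceberg Catalog Wars: Polaris vs Unity vs Nessie vs Glue, and How to Build Multi-Catalog Federation." It covers the details of the REST Catalog spec, the differences between implementations, and how to design permissions, lineage, and failover in a multi-catalog environment.

References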

Is Hadoop Dead? — The Evolution of the Big Data Stack, From Hadoop to Lakehouse (Spark, Iceberg, Delta, and Where Things Actually Stand in 2026)

Prologue — Why "Is Hadoop Dead?" Is the Wrong Question

A junior data engineer often asks: "Is Hadoop dead?"

Senior engineers tend to give one of two answers. The first is the consultant's answer — "No, plenty of clusters are still running." The second is the startup data lead's answer — "Yes, dead. If you're starting fresh, do not start with Hadoop."

Both are right. And if you can't hold both answers in your head at the same time, you will design a 2026 data platform badly. This post traces the entire evolution — how Hadoop got demoted from the default, what took its place, and where Hadoop still lives.

A better way to phrase the question. "If I'm building a new analytics stack in 2026, is there a reason to choose Hadoop?" The answer is almost always no. But "should I shut down a Hadoop cluster that is currently doing real work?" The answer is almost always no. This post is about every detail that sits between those two answers.

Prologue — Why "Is Hadoop Dead?" Is the Wrong Question
1. The Timeline — Twenty Years of Big Data Evolution on One Page
2. Classic Hadoop — What Was the Core, and Where Did It Break
3. First Transition — Spark Replaces MapReduce
- Why Spark Was Fast — RDDs and In-Memory Shuffle
4. Second Transition — Object Storage Replaces HDFS
- Why S3 Beat HDFS
- HDFS's New Role — "Barely Used"
5. Third Transition — Open Table Formats Replace the Hive Metastore
6. Fourth Transition — The Age of Query Engines (Trino, DuckDB, ClickHouse)
7. "What Replaced What" — One Matrix
8. Where Hadoop Still Lives
9. Why Open Table Formats Won — In One Paragraph
10. Designing a New Analytics Stack in 2026 — The Default Stack
- What to Avoid
- What You Can Keep As-Is
11. Migration Patterns — Classic Hadoop to Lakehouse
Epilogue — Why "Is It Dead" Is the Wrong Question
References

1. The Timeline — Twenty Years of Big Data Evolution on One Page

The big picture first. The big data stack went through four major generational shifts over twenty years.

   2006              2012-14           2017-20          2022-26
   ─────             ───────           ───────          ───────

   HDFS              HDFS              S3 / GCS         S3 / GCS / ADLS
    +                 +                  +                +
   MapReduce  ──▶   Spark      ──▶   Spark/Trino  ──▶  Spark/Trino/DuckDB
    +                 +                  +                +
   Hive          Hive metastore     Hive metastore   REST Catalog
   (file = table)  (file = table)   (file = table)   (Iceberg/Delta/Hudi)
    +                 +                  +                +
   YARN              YARN              K8s / EMR        K8s / Serverless

   "Hadoop"      "Hadoop + Spark"   "Spark on object   "Lakehouse"
                                     storage"

What divides the generations is clear.

2006 to 2012, Classic Hadoop: HDFS plus MapReduce plus Hive, with YARN arriving alongside Hadoop 2.x toward the end of the period. One cluster, with storage, compute, and metastore all bundled together.
2012 to 2017, Spark replaces MapReduce: HDFS stays, and Spark takes over the seat MapReduce held. Hive on Tez or Spark, Impala, and Presto bring interactive SQL.
2017 to 2022, storage and compute separate: S3, GCS, and ADLS take HDFS's place. Spark and Trino run on object storage. Managed services such as EMR, Dataproc, and Databricks become the usual way to run them.
2022 to 2026, Lakehouse: The Hive metastore era ends. Apache Iceberg, Delta Lake, and Apache Hudi shift the paradigm from "the file is the table" to "the metadata is the table." Snowflake natively supports Iceberg, Databricks acquires Tabular, and the REST Catalog becomes the standard.

The rest of this post unpacks what replaced what, and why at each transition.

2. Classic Hadoop — What Was the Core, and Where Did It Break

First, pin down what "Hadoop" actually was. The early 2010s "Hadoop stack" was essentially three components.

HDFS: A distributed file system. It split data into 64 MB or 128 MB blocks and stored them across multiple nodes' disks, typically with 3x replication.
MapReduce: A distributed compute framework. The map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group. Disk-based.
YARN: A resource manager. It tracks each node's CPU and memory and schedules job containers.

On top of these sat Hive (SQL interface), HBase (OLTP-ish KV store), and Sqoop plus Flume (ingestion). That bundle was "the Hadoop ecosystem."

The clever insight of this model was data locality — move the compute to the data. In an era of slow networks, "run the task on the node that holds the disk block" was genuinely revolutionary.

Over time, the cracks accumulated.

MapReduce is slow. Disk-based shuffle writes every intermediate result to disk, and each chained job reads its input back from HDFS. Iterative workloads (ML, multi-stage ETL) became hideously inefficient.
HDFS is operationally heavy. The NameNode was a single point of failure until HA arrived in Hadoop 2.x, it holds metadata in memory so file count is bounded by heap size, the small file problem is chronic, and scaling disk capacity requires adding entire nodes.
Compute and storage are coupled. If you need more disk you also pay for CPU; if you need more CPU you also pay for disk. The least cloud-friendly model imaginable.
The metadata layer (Hive metastore) is weak. It tracks tables at partition granularity only, with no real schema evolution, transactions, or time travel.
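The file-count limit can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); that constant, the 1 PB dataset, and the function names are illustrative assumptions, not figures from this post.

```python
import math

# Rule-of-thumb assumption (not a figure from this article): about 150 bytes
# of NameNode heap per namespace object (a file, a directory, or a block).
BYTES_PER_OBJECT = 150


def files_needed(total_tb: float, avg_file_mb: float) -> int:
    """Number of files needed to hold total_tb of data (1 TB = 1,000,000 MB)."""
    return math.ceil(total_tb * 1_000_000 / avg_file_mb)


def namenode_heap_gb(num_files: int, avg_file_mb: float, block_mb: int = 128) -> float:
    """Rough NameNode heap (GB) needed to track num_files files and their blocks."""
    blocks_per_file = max(1, math.ceil(avg_file_mb / block_mb))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT / 1e9


if __name__ == "__main__":
    # The same 1 PB stored as 1 MB, 128 MB, or 1 GB files.
    for avg_mb in (1, 128, 1024):
        n = files_needed(1000, avg_mb)
        print(f"{avg_mb:>5} MB files: {n:>13,} files, ~{namenode_heap_gb(n, avg_mb):,.1f} GB heap")
```

Under these assumptions, 1 PB written as 1 MB files needs on the order of 300 GB of NameNode heap, while the same data in 128 MB files needs a little over 2 GB. That gap is the small file problem in one number.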

In short, Hadoop was optimized for "rent a data center hall, run batch jobs all day" workloads. As the cloud, interactive queries, ML, and streaming arrived in succession, those assumptions broke.

3. First Transition — Spark Replaces MapReduce

Spark began at UC Berkeley's AMPLab in 2009, was open-sourced in 2010, and became an Apache top-level project in 2014. Spark's core pitch was simple: "works like MapReduce, ten to a hundred times faster."

Why Spark Was Fast — RDDs and In-Memory Shuffle

Spark models data as an RDD (Resilient Distributed Dataset) — a collection of distributed partitions. A chain of transformations (map, filter, join) becomes a DAG. The key points:

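Intermediate results stay in memory where possible: data passed between steps is cached instead of being written to HDFS (shuffle data still goes through local disk).
Optimize over the operation graph: within a stage, operations are pipelined without a shuffle.
Recover with lineage: even for data held only in memory, RDD lineage allows a lost partition to be recomputed on failure.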
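The three points above can be sketched in plain Python. This is a toy model, not Spark's implementation: `MiniRDD` and its methods are hypothetical names, everything runs in one process, and there is no shuffle. It only illustrates lazy pipelining, in-memory caching, and recomputing a lost partition from lineage.

```python
from typing import Callable, Dict, Iterator, List, Optional


class MiniRDD:
    """Toy stand-in for an RDD: partitions plus the recipe (lineage) to rebuild them."""

    def __init__(self, source: Optional[List[list]] = None,
                 parent: Optional["MiniRDD"] = None,
                 transform: Optional[Callable[[Iterator], Iterator]] = None) -> None:
        self.source = source          # only set on the root dataset
        self.parent = parent          # lineage: where this dataset came from
        self.transform = transform    # lineage: how to derive it from the parent
        self.persisted = False
        self.cache: Dict[int, list] = {}

    def map(self, fn: Callable) -> "MiniRDD":
        return MiniRDD(parent=self, transform=lambda it: (fn(x) for x in it))

    def filter(self, pred: Callable) -> "MiniRDD":
        return MiniRDD(parent=self, transform=lambda it: (x for x in it if pred(x)))

    def persist(self) -> "MiniRDD":
        self.persisted = True
        return self

    def num_partitions(self) -> int:
        if self.source is not None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, i: int) -> Iterator:
        """Produce partition i, from cache if present, otherwise by replaying lineage."""
        if i in self.cache:
            return iter(self.cache[i])
        if self.source is not None:
            rows = iter(self.source[i])
        else:
            rows = self.transform(self.parent.compute(i))  # pipelined, nothing materialized
        if self.persisted:
            self.cache[i] = list(rows)
            return iter(self.cache[i])
        return rows

    def collect(self) -> list:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]

    def evict(self, i: int) -> None:
        """Simulate losing a cached partition, for example when an executor dies."""
        self.cache.pop(i, None)


if __name__ == "__main__":
    squares = MiniRDD(source=[[1, 2, 3], [4, 5, 6]]).map(lambda x: x * x).persist()
    evens = squares.filter(lambda x: x % 2 == 0)
    print(evens.collect())   # first action computes and caches the squares
    squares.evict(1)         # lose one cached partition
    print(evens.collect())   # only partition 1 is recomputed from lineage
```

Nothing runs until `collect()` is called, and after `evict(1)` only the second partition's squares are recomputed; the first partition is served from the cache.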

Same word count, MapReduce versus Spark.

MapReduce (Java) — long code, writes to disk, slow.

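import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {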
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  // main(): create Job, set in/out paths, waitForCompletion …
}

Spark (Python): the same job in a handful of lines.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
df = spark.read.text("s3://logs/2026/05/14/")
counts = df.selectExpr("explode(split(value, ' ')) as word").groupBy("word").count()
counts.write.mode("overwrite").parquet("s3://output/wc/")

Code length isn't everything. Operationally Spark won too.

SQL as a first-class citizen — Spark SQL absorbed almost all workloads.
MLlib, Streaming, GraphX integrated — one engine for batch, ML, streaming, and graph.
Good APIs — Scala, Python, R, Java, SQL all supported.

By roughly 2018, writing a new MapReduce job had essentially vanished. Hadoop clusters were still around, but almost everything running inside them was Spark.

The crucial observation at this stage: HDFS was unchanged. Spark on YARN, Spark on HDFS — Spark took MR's seat, but the Hadoop infrastructure underneath was the same.

4. Second Transition — Object Storage Replaces HDFS

The next domino was storage. From the mid-2010s, AWS S3, Google Cloud Storage, and Azure Data Lake Storage (collectively "object storage") became the default for data lakes.

Why S3 Beat HDFS

Object storage offered five advantages.

Compute and storage separation: Scale CPU and disk independently. Fits the cloud pricing model.
Cheap, relatively speaking: S3 Standard runs around 0.023 USD per GB per month, Glacier even less. Against HDFS with 3x replication, where every usable GB consumes three GB of provisioned disk, the gap can reach an order of magnitude.
No operations: NameNode, DataNodes, disk replacement, and rebalancing are all handled by the cloud provider.
Eleven nines of durability: S3 is designed for 99.999999999% durability, which a self-operated HDFS cluster with 3x replication is hard-pressed to match.
Practically unlimited scale: No real cap on file count. No NameNode memory issues.
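The pricing point can be worked through with simple arithmetic. The sketch below uses the 0.023 USD figure quoted above as a flat rate; the 500 TB dataset, the 25% free-space headroom, and the function names are hypothetical, and real S3 pricing is tiered and adds request and egress charges.

```python
# Price quoted in the text for S3 Standard; real pricing is tiered and varies by region.
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def s3_monthly_usd(usable_tb: float) -> float:
    """Flat-rate monthly storage cost, ignoring request, egress, and tiering effects."""
    return usable_tb * 1000 * S3_STANDARD_USD_PER_GB_MONTH  # 1 TB = 1000 GB


def hdfs_raw_tb(usable_tb: float, replication: int = 3, headroom: float = 0.25) -> float:
    """Raw disk to provision for usable_tb of data on HDFS.

    headroom is a hypothetical fraction of disk kept free for growth and rebalancing.
    """
    return usable_tb * replication / (1 - headroom)


if __name__ == "__main__":
    usable = 500  # TB of actual data (hypothetical)
    print(f"S3 Standard: ~{s3_monthly_usd(usable):,.0f} USD per month")
    print(f"HDFS: {hdfs_raw_tb(usable):,.0f} TB of raw disk to buy and operate")
```

For 500 TB of data this gives about 11,500 USD per month on S3 Standard, against 2,000 TB of raw disk that an HDFS cluster would have to provision, power, and staff before any compute is counted.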

The downsides existed too.

Higher latency: Each object GET takes tens of milliseconds. HDFS serving a nearby block is much faster.
List operations are expensive: S3 LIST is paginated, 1000 keys per page.
Eventual consistency, historically: S3 switched to strong read-after-write consistency in December 2020.
No rename: S3 "rename" is copy plus delete. Directory-level rename effectively does not exist.

The "no rename" problem broke the standard Hive and Spark output pattern (write to _temporary, rename to final). Writing safely to S3 required a committer (EMRFS S3-Optimized Committer, Magic Committer, etc.). That friction became one of the motivations for the next domino — open table formats.
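Why a missing rename hurts can be shown with a toy in-memory store that counts requests. `ToyObjectStore` is a hypothetical stand-in, not an S3 client; it models only the two properties named above, 1000-key LIST pages and rename as copy plus delete, and adds a `fail_after` switch to simulate a writer crashing partway through.

```python
import math
from typing import Dict, List, Optional


class ToyObjectStore:
    """In-memory stand-in for an S3-like store that counts the requests it serves."""

    PAGE_SIZE = 1000  # keys returned per LIST page, as described in the text

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.requests = {"PUT": 0, "LIST": 0, "COPY": 0, "DELETE": 0}

    def put(self, key: str, body: bytes = b"") -> None:
        self.requests["PUT"] += 1
        self.objects[key] = body

    def list_prefix(self, prefix: str) -> List[str]:
        keys = sorted(k for k in self.objects if k.startswith(prefix))
        self.requests["LIST"] += max(1, math.ceil(len(keys) / self.PAGE_SIZE))
        return keys

    def rename_prefix(self, src: str, dst: str, fail_after: Optional[int] = None) -> None:
        """'Rename' a directory: one COPY and one DELETE per object, with no atomicity."""
        for done, key in enumerate(self.list_prefix(src)):
            if fail_after is not None and done >= fail_after:
                raise RuntimeError("writer crashed mid-rename")
            self.objects[dst + key[len(src):]] = self.objects[key]
            self.requests["COPY"] += 1
            del self.objects[key]
            self.requests["DELETE"] += 1


if __name__ == "__main__":
    store = ToyObjectStore()
    for i in range(2500):
        store.put(f"out/_temporary/part-{i:05d}.parquet")
    try:
        store.rename_prefix("out/_temporary/", "out/final/", fail_after=1000)
    except RuntimeError as err:
        print("commit failed:", err)
    print("visible in final/:", len(store.list_prefix("out/final/")))
    print("requests so far:", store.requests)
```

On HDFS the same commit is a single atomic metadata operation. Here it costs thousands of requests, and a crash leaves readers looking at a partially published directory, which is the gap the S3 committers had to close.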

HDFS's New Role — "Barely Used"

By the mid-2020s, a new cloud-native data platform almost never starts with HDFS. EMR, Databricks, Snowflake, BigQuery — all default to object storage. HDFS survives in two cases.

On-premise clusters — Conservative finance, telecom, and government that cannot or will not move to the cloud.
HDFS-compatible distributed storage — Ozone, MinIO, JuiceFS, and similar next-gen distributed storage that expose S3-compatible interfaces.

The second category is less "HDFS's successor" and more "an on-prem clone of S3." In other words the object storage model won, and HDFS got pushed into the seat of imitating it.

5. Third Transition — Open Table Formats Replace the Hive Metastore

This is the most recent, and most important, transition.

The Hive Limit — "The File Is the Table"

The Hive metastore era assumed something simple — "a directory on S3 or HDFS is a table." Parquet files inside a partition directory like s3://warehouse/orders/dt=2026-05-14/ collectively constitute the table. The metastore records "this table's partition columns, where the directories live, what the schema is."

The problems with this model piled up.

No transactions — A failed INSERT INTO ... PARTITION leaves half-written files behind. Idempotency is hard.
Weak schema evolution — Adding a column works. Renaming, changing types, evolving nested structures — basically impossible.
No time travel — Cannot serve "show me the state as of one hour ago."
Small file problem — Streaming ingest produces thousands of tiny files per partition; query performance dies.
The metastore itself is a bottleneck — A SHOW PARTITIONS on a giant table can take minutes.

The Three Open Table Formats — Iceberg, Delta, Hudi

To address these limits, three projects emerged within a few years of each other.

Apache Iceberg (2018, Netflix) — Metadata files explicitly list all data files. Snapshot-level ACID. Catalog-neutral.
Delta Lake (2019, Databricks) — A _delta_log/ directory with JSON/Parquet transaction logs does the same job. Deeply integrated with Databricks.
Apache Hudi (2017, Uber) — Strong on streaming and CDC. Copy-on-Write and Merge-on-Read modes.

The shared insight is unmistakable. "Keep metadata alongside the files, and let that metadata implement ACID, time travel, and schema evolution."
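The idea that metadata, not the directory listing, defines the table can be sketched with a toy snapshot log. `ToyTable` and `Snapshot` are hypothetical names and this is not the Iceberg, Delta, or Hudi metadata layout; it only shows why an atomic pointer swap gives all-or-nothing commits and time travel.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int            # logical commit time (hypothetical unit)
    files: Tuple[str, ...]       # every data file visible in this snapshot


class ToyTable:
    """Table state lives in snapshots; `current` is the pointer a catalog would swap."""

    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = []
        self.current: Optional[int] = None

    def commit_append(self, new_files: Iterable[str], committed_at: int,
                      expected_current: Optional[int]) -> int:
        """Optimistic commit: publish a new snapshot only if no one else committed first."""
        if expected_current != self.current:
            raise RuntimeError("concurrent commit detected, retry from the latest snapshot")
        snapshot = Snapshot(len(self.snapshots), committed_at,
                            self.files() + tuple(new_files))
        self.snapshots.append(snapshot)
        self.current = snapshot.snapshot_id   # the single atomic step
        return snapshot.snapshot_id

    def files(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Files visible at a snapshot (default: the current one)."""
        sid = self.current if snapshot_id is None else snapshot_id
        return () if sid is None else self.snapshots[sid].files

    def files_as_of(self, timestamp: int) -> Tuple[str, ...]:
        """Time travel: the latest snapshot committed at or before `timestamp`."""
        visible = [s for s in self.snapshots if s.committed_at <= timestamp]
        return visible[-1].files if visible else ()


if __name__ == "__main__":
    table = ToyTable()
    s0 = table.commit_append(["a.parquet"], committed_at=100, expected_current=None)
    table.commit_append(["b.parquet"], committed_at=200, expected_current=s0)
    print("now:      ", table.files())
    print("as of 150:", table.files_as_of(150))
```

A writer that crashes before the pointer swap leaves orphan files on storage, but no reader ever sees them, because readers resolve files only through a committed snapshot.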

The same table, expressed as Iceberg DDL.

CREATE TABLE catalog.db.orders (
  order_id     BIGINT,
  user_id      BIGINT,
  amount_cents BIGINT,
  status       STRING,
  created_at   TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at), bucket(16, user_id))
TBLPROPERTIES (
  'format-version'             = '2',
  'write.target-file-size-bytes' = '536870912',
  'write.parquet.compression-codec' = 'zstd'
);

-- Time travel
SELECT count(*) FROM catalog.db.orders FOR TIMESTAMP AS OF '2026-05-13 00:00:00';

-- Schema evolution — add, rename, promote types
ALTER TABLE catalog.db.orders ADD COLUMN refund_amount_cents BIGINT;
ALTER TABLE catalog.db.orders RENAME COLUMN status TO order_status;

Two things to call out.

PARTITIONED BY (days(created_at), bucket(16, user_id)) — In the Hive era you had to add a fake dt column yourself. Iceberg's hidden partitioning lets the metadata handle the time/hash transform. The user only writes WHERE created_at >= '2026-05-01' and partition pruning still kicks in.
format-version 2 — Iceberg v2 supports row-level deletes. v3 (2025 to 2026) adds deletion vectors and the variant type. Databricks promoted v3 to Public Preview in 2026.
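Hidden partitioning can be sketched as two pure functions from column values to partition values, plus a pruning step that works on the derived values. This is a simplified illustration: the helper names are hypothetical, and the bucket function uses `zlib.crc32` only to stay within the standard library, whereas the Iceberg spec defines bucketing with a 32-bit Murmur3 hash, so real bucket numbers will differ.

```python
import zlib
from datetime import date, datetime
from typing import List, Tuple

EPOCH = date(1970, 1, 1)


def days_transform(ts: datetime) -> int:
    """Day-granularity transform: whole days since 1970-01-01."""
    return (ts.date() - EPOCH).days


def bucket_transform(value: int, num_buckets: int) -> int:
    """Hash bucket. crc32 is a stand-in; Iceberg's spec uses a 32-bit Murmur3 hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def partition_of(created_at: datetime, user_id: int) -> Tuple[int, int]:
    """Partition tuple for days(created_at), bucket(16, user_id)."""
    return days_transform(created_at), bucket_transform(user_id, 16)


def prune(partitions: List[Tuple[int, int]], created_at_from: datetime) -> List[Tuple[int, int]]:
    """Keep partitions that can contain rows with created_at >= created_at_from."""
    min_day = days_transform(created_at_from)
    return [p for p in partitions if p[0] >= min_day]


if __name__ == "__main__":
    parts = [
        partition_of(datetime(2026, 4, 30, 23, 59), user_id=7),
        partition_of(datetime(2026, 5, 1, 0, 0), user_id=7),
        partition_of(datetime(2026, 5, 14, 9, 30), user_id=8),
    ]
    # The query filters on created_at only; the engine derives the day values itself.
    print(prune(parts, datetime(2026, 5, 1)))
```

The query never mentions a dt column. Because the transform is recorded in table metadata, the planner can map the predicate on created_at to day values and skip the April partition on its own.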

Two Events in 2024 — The End of the War

The "Iceberg vs Delta vs Hudi" debate of 2020 to 2023 was effectively settled by two events in 2024.

Snowflake natively supported Iceberg and open-sourced the Polaris Catalog (2024 Summit, graduated to Apache TLP in 2025).
Databricks acquired Tabular, reportedly for over USD 1 billion. Tabular is the company founded by Iceberg's creators (Ryan Blue, Dan Weeks). In other words, Delta's home base bought Iceberg's home base.

Databricks had already shipped Delta UniForm (announced in 2023) so that Delta tables can be read as Iceberg, and in April 2026 Snowflake GA'd write support for Iceberg tables managed by an external Unity Catalog (Azure first). The REST Catalog became the lingua franca.

To summarize: Iceberg has become the de facto standard, and Delta and Hudi pursue interoperability with it.

6. Fourth Transition — The Age of Query Engines (Trino, DuckDB, ClickHouse)

In the Hive on Tez era, interactive SQL hit clear limits. Presto and Trino took that seat.

Presto: Started at Facebook in 2012. In 2018 the original creators left Facebook and continued the project as the PrestoSQL fork, which was rebranded as Trino in December 2020.
Trino: A massively parallel processing (MPP) distributed SQL engine. Native connectors for Iceberg, Delta, Hudi, and Hive.
Starburst: A commercial distribution of Trino. Adds acceleration layers such as Warp Speed.

As of 2026, Trino is in production at Comcast, Goldman Sachs, LinkedIn, Lyft, Netflix, Pinterest, and Salesforce, and is effectively the standard engine for SQL over open data lakes.

A new wave runs alongside.

DuckDB — Embedded OLAP. Dominant for single-node Parquet or Iceberg analytics. "DuckDB on a laptop handles up to 1 TB" is said seriously now.
ClickHouse — Columnar OLAP database. Strong for real-time analytics. Now ties into the lakehouse via external Iceberg tables.
StarRocks and Doris — New-generation MPP analytics databases. Iceberg-native.

The point: query engines are now multi-polar. Batch goes to Spark, interactive analytics to Trino, single-node to DuckDB, real-time to ClickHouse — all sitting on top of the same Iceberg tables, each handling its niche workload.

7. "What Replaced What" — One Matrix

The whole evolution at a glance.

Layer	Classic Hadoop (2010)	Hadoop+Spark (2015)	Object Storage (2020)	Lakehouse (2026)
Distributed storage	HDFS	HDFS	S3 / GCS / ADLS	S3 / GCS / ADLS
Batch compute	MapReduce	Spark	Spark on EMR / Databricks	Spark / Trino / Flink
Interactive SQL	Hive	Hive on Tez, Impala	Presto / Trino	Trino / DuckDB / ClickHouse
Table format	Text, SequenceFile, ORC	Parquet on Hive	Parquet on Hive	Iceberg / Delta / Hudi
Metadata	Hive Metastore	Hive Metastore	Hive Metastore / Glue	REST Catalog (Polaris, Unity, Nessie)
Resource manager	YARN	YARN	YARN / K8s / EMR	K8s / serverless
Streaming	Storm	Spark Streaming	Spark Structured / Flink	Flink / Kafka / Iceberg streaming
Operating model	On-prem cluster	On-prem plus cloud	Managed cloud	Multi-engine, multi-catalog

What this matrix says is unambiguous: nearly every layer that used to be Hadoop has been swapped out. HDFS gave way to object storage, MapReduce to Spark, the Hive Metastore to the REST Catalog plus Iceberg, and YARN to Kubernetes.

What remains is the conceptual influence. The core Hadoop idea — "park data on disks, distribute compute, abstract via metadata" — is still the foundation of every distributed data system. But the concrete components sitting on that foundation have been almost entirely replaced.

8. Where Hadoop Still Lives

"OK, so where does Hadoop actually live?" In 2026 it is still active in four areas.

8.1 Huge Legacy Clusters

Big enterprises, telecom carriers, and banks with petabyte-scale HDFS clusters that have run for over a decade cannot turn them off overnight. Cloudera and Hortonworks (now merged) licenses, operations team know-how, migration cost and risk — all push toward "keep it running." These clusters are usually under incremental migration — new data lands in S3 plus Iceberg, old data stays on HDFS and moves over time.

8.2 Hive Metastore — Surviving as a Compatibility Layer

An interesting fact: the Hive Metastore itself has not died. As Iceberg abstracted the catalog, the Hive Metastore became one implementation of an Iceberg catalog. That is:

   Then: Hive table → Hive Metastore → directory (Parquet)
   Now:  Iceberg table → Hive Metastore catalog → Iceberg metadata → Parquet

It is common for organizations to keep the Hive Metastore in place and register and manage Iceberg tables through it. By becoming an "Iceberg catalog compatibility layer," the Hive Metastore enabled incremental migration. Expose the same metastore via a REST Catalog and Trino, Spark, Snowflake, and Flink all see the same tables.
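
As a rough illustration of how small the switch is on the engine side, the sketch below builds the Spark properties that register an Iceberg catalog backed either by a Hive Metastore or by a REST Catalog. The property keys follow the Iceberg Spark configuration as I understand it; the helper name, the catalog name "lake", and both endpoints are hypothetical placeholders.

```python
def iceberg_catalog_conf(name: str, kind: str, uri: str) -> dict[str, str]:
    """Spark properties that register an Iceberg catalog called `name`.

    kind="hive" points Iceberg at a Hive Metastore (thrift:// URI);
    kind="rest" points it at a REST Catalog endpoint (http(s):// URI).
    """
    if kind not in ("hive", "rest"):
        raise ValueError(f"unsupported catalog type: {kind}")
    prefix = f"spark.sql.catalog.{name}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": kind,
        f"{prefix}.uri": uri,
    }

# Hypothetical endpoints, for illustration only.
hms = iceberg_catalog_conf("lake", "hive", "thrift://metastore.example.internal:9083")
rest = iceberg_catalog_conf("lake", "rest", "https://catalog.example.internal/api/catalog")

# Render as spark-submit flags; only the type and uri differ between the two.
for key, value in hms.items():
    print(f"--conf {key}={value}")
```

Because jobs refer to tables as lake.db.table in both cases, moving from the metastore-backed catalog to a REST Catalog later is a configuration change rather than a rewrite of the jobs.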

8.3 On-Premise Conservatism — Finance, Telecom, Government

Some organizations cannot move to public object storage due to data sovereignty, regulation, or operations policy. Parts of Korean and Japanese finance, the three major telecom carriers, and government systems are typical. In these places, Apache Ozone (the HDFS successor project) or S3-compatible on-prem storage such as MinIO and JuiceFS takes root, and workloads on YARN are slowly moving to Kubernetes.

8.4 YARN — Slowly Moving to Kubernetes

YARN was once Hadoop's pride, but in 2026 almost all new deployments use Kubernetes. EMR on EKS, Databricks (which always had its own resource manager), and Spark on K8s all bypass YARN. Existing YARN clusters keep running as is when they are stable enough that there is little motivation to migrate. YARN is still actively maintained as of Hadoop 3.5 (2026).

9. Why Open Table Formats Won — In One Paragraph

The most important insight in this post. The reason open table formats (especially Iceberg) became the standard is simple.

The "truth of the table" moved from the directory to the metadata, which let any engine reading the table see the same result.

In the Hive era, the same table read from Spark, Presto, or Hive could yield subtly different results. Nothing prevented dropping a file directly into a partition directory, and there were no transactions. Iceberg cut that with the model "snapshot equals metadata file equals the truth at that point." One table, many engines, same result — that is the core promise of the lakehouse era.
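
The difference between "the directory is the table" and "the snapshot is the table" can be shown with a toy model. The sketch below is not Iceberg's implementation; the class and file names are invented for illustration, and a real table tracks manifests, schemas, and statistics rather than a bare set of file names.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # data files that make up the table at this point

@dataclass
class ToyTable:
    snapshots: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, new_files) -> Snapshot:
        """A commit publishes a new immutable snapshot; older ones stay readable."""
        snap = Snapshot(self.current().snapshot_id + 1,
                        self.current().files | frozenset(new_files))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id=None) -> frozenset:
        """Readers plan from snapshot metadata, never from a directory listing."""
        if snapshot_id is None:
            return self.current().files
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id).files

directory = set()   # what a Hive-style reader would list
table = ToyTable()

directory |= {"a.parquet", "b.parquet"}
s1 = table.commit_append({"a.parquet", "b.parquet"})

directory.add("stray.parquet")  # a file dropped into the directory with no commit

spark_view = table.scan()   # two "engines" reading the same table
trino_view = table.scan()
print(sorted(spark_view), sorted(directory))
```

A directory-listing reader would pick up stray.parquet, while both snapshot readers agree on exactly the committed files. Passing an older snapshot_id to scan is the same mechanism that makes time travel possible.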

Snowflake and Databricks each gave up part of their proprietary format and joined Iceberg for the same reason. Customers stopped treating "I don't want to be locked into one vendor" as a negotiable point. In 2026, an analytics stack is format-neutral and catalog-neutral by default.

10. Designing a New Analytics Stack in 2026 — The Default Stack

If you're starting today, the following is a safe default.

   ┌──────────────────────────────────────────────────────────┐
   │  Storage:    S3 / GCS / ADLS (or S3-compatible on-prem)  │
   │  Format:     Apache Iceberg (or Delta + UniForm)         │
   │  Catalog:    Polaris / Unity / Glue / Nessie             │
   │  Batch:      Spark (Databricks, EMR, Glue) or Flink      │
   │  Interactive:Trino (or Starburst Galaxy)                 │
   │  Single-node:DuckDB (developer / BI workbench)           │
   │  Streaming:  Kafka + Flink + Iceberg streaming           │
   │  Orchestrator: Airflow / Dagster / Prefect               │
   │  Transform:  dbt (or SQLMesh)                            │
   │  Observability: OpenLineage + Marquez / DataHub          │
   └──────────────────────────────────────────────────────────┘

The virtue of this stack is low lock-in. No component is tied to a single vendor. Iceberg is a standard. Trino, Spark, and Flink are OSS. Polaris, Nessie, and Unity Catalog all follow the REST Catalog spec.

What to Avoid

If starting fresh, avoid the following.

A new HDFS deployment — Cloud or on-prem, it doesn't matter. Go S3-compatible (Ozone, MinIO).
A new MapReduce job — Spark or Trino can already do it.
Exposing the Hive Metastore as your catalog directly — Use it only behind Iceberg. Abstract it with a REST Catalog if you can.
Creating new Hive tables — Use Iceberg. Conversion costs only grow over time.
Vendor-proprietary table formats — Don't permanently store data in Snowflake's internal format or BigQuery native format. Keep it readable by external engines.
Single-engine ETL — If you write all ETL in Databricks notebooks, Snowpark, or BigQuery SQL only, migration is hell. Put a portable layer like Spark SQL or dbt in between.

What You Can Keep As-Is

Conversely, you do not need to rip out the following.

A well-running YARN cluster — Keep it if it's stable. Migrate to Kubernetes per workload.
Hive Metastore — Repurpose as an Iceberg catalog and you're done.
Spark jobs — If they already write to Iceberg or Delta, almost nothing to do.
Old data on HDFS — Leave cold data alone; route new data to S3.

11. Migration Patterns — Classic Hadoop to Lakehouse

There are three common patterns for moving from legacy to a modern stack.

11.1 Dual Write

New ingest jobs write to both HDFS and S3 plus Iceberg. After a comparison period to validate, the old path is shut off. Safe but doubles resources.
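
One way to run the comparison period is to fingerprint both copies and compare the results. The sketch below is a minimal, assumption-laden version: rows are plain tuples already read from each path, the function names are hypothetical, and a production check would normalize types (timestamps, decimals, nulls) before hashing.

```python
import hashlib

def table_fingerprint(rows):
    """Return (row_count, digest). The digest is a sum of per-row hashes,
    so it does not depend on row order and still counts duplicate rows."""
    count = 0
    acc = 0
    for row in rows:
        h = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        acc = (acc + int.from_bytes(h[:8], "big")) % (1 << 64)
        count += 1
    return count, acc

def paths_agree(old_rows, new_rows) -> bool:
    """True when the legacy path and the new path hold the same rows."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)

# Toy data standing in for one partition read from each path.
hdfs_rows = [(1, "alice", 10.0), (2, "bob", 20.5)]
iceberg_rows = [(2, "bob", 20.5), (1, "alice", 10.0)]   # same rows, other order
drifted_rows = [(1, "alice", 10.0), (2, "bob", 20.0)]   # one value differs

print(paths_agree(hdfs_rows, iceberg_rows))
print(paths_agree(hdfs_rows, drifted_rows))
```

Running this per partition keeps a mismatch easy to localize, and once the fingerprints have agreed for the whole comparison window the old path can be retired.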

11.2 In-Place Migration

Iceberg's add_files procedure registers the Parquet files of an existing HDFS directory or Hive table in an Iceberg table without copying them. The target Iceberg table must already exist. Files stay put and only metadata is added. Fast, but the HDFS coupling is unchanged.

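-- Register the files of an existing Hive external table in an Iceberg table (no file copy).
-- The target table must already exist with a compatible schema.
-- Procedures are called through the Iceberg catalog: <catalog>.system.<procedure>.
CALL iceberg_catalog.system.add_files(
  table => 'db.orders',
  source_table => 'hive.db.orders'
);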

11.3 Catalog Unification, Then Incremental Migration

Layer an Iceberg catalog adapter on top of the Hive Metastore. Create new tables only as Iceberg. Leave old Hive tables alone, but expose both through the same catalog. Over time the old tables retire naturally. The most recommended approach.

Epilogue — Why "Is It Dead" Is the Wrong Question

Hadoop is not dead. But it has been demoted from being the default.

When Linux first arrived, people asked "is Unix dead?" The accurate answer was "System V Unix is barely used anywhere, the BSD family and Linux took that territory, but the ideas of Unix live everywhere." Hadoop is the same. HDFS and MapReduce are virtually invisible in new workloads, but Hadoop's core ideas — "distribute data on disks, move compute to the data, abstract via metadata" — survive in a better implementation: object storage plus Iceberg plus Trino.

A 2026 data engineer's job is not to ask "is Hadoop dead." It is to ask three questions.

Where should the truth of our data live? — On object storage, in an open table format, behind a standard catalog.
Which engine will own which workload? — Spark for batch, Trino for interactive, DuckDB for single-node, Flink for streaming, ClickHouse for real-time.
How will we avoid lock-in to any one vendor? — A three-layer abstraction of Iceberg plus REST Catalog plus a portable SQL layer (dbt).

Checklist

When designing a new analytics platform, verify the following.

Common Anti-Patterns

Avoid the following in any new 2026 project.

"Start with HDFS now, move later" — Later never comes. Start with S3.
"We have a Hadoop operations team, so we'll go Hadoop" — Ops team familiarity should not be the top criterion for new architecture.
"Hive Metastore is familiar, so we'll keep using it" — Keep it, but expose through an Iceberg catalog adapter.
"Put all data inside Snowflake or Databricks" — That's the start of vendor lock-in. Keep an externally readable format (Iceberg).
"One engine for all workloads" — Batch, interactive, streaming, and single-node are each owned by different engines.
"File equals table" — Do not write files directly into a directory. Always go through the table interface.
"One catalog is enough" — Plan for multi-catalog and federation from the start.
"Pick the format later" — Format is decided at ingest. Migration is expensive.

The companion piece — "The Iceberg Catalog Wars — Polaris vs Unity vs Nessie vs Glue, and How to Federate Multi-Catalog Environments" — covers the REST Catalog spec in detail, the differences between implementations, and how to design permissions, lineage, and failover across a multi-catalog environment.