Data Lakehouse & Modern Data Engineering 2026 — Iceberg / Delta / Hudi / Paimon / Tabular (Databricks acquisition) / Trino / Spark 4 / Flink 2 / DataFusion Deep Dive
By Youngju Kim (@fjvbn20031)
Prologue — The word "data warehouse" has aged in place
In the early 2010s, "where do you put your data" was simple. Structured data went into a data warehouse (Teradata, Oracle Exadata, Vertica). Logs and semi-structured data went into Hadoop HDFS. ETL tools (Informatica, Talend) bridged them. Analysts used SQL; data engineers used MapReduce and Spark.
In 2026 the picture is completely different.
- The lakehouse has become the standard. Parquet files sit on S3, GCS, or ADLS, and table formats like Apache Iceberg layer transactions, schemas, and time travel on top. The flexibility of the "lake" and the reliability of the "warehouse" under one roof.
- Apache Iceberg has emerged as the de facto winner of the 2024-25 table format war. Netflix built it; Apple, LinkedIn, Stripe, and Airbnb joined; AWS, Snowflake, Cloudera, and Dremio support it as a first-class citizen.
- Databricks acquired Tabular in June 2024 for over $1B. They brought Iceberg co-creator Ryan Blue inside and accelerated their UniForm strategy of treating Delta Lake and Iceberg as the same metadata.
- Apache Hudi (from Uber) survives through Onehouse, its commercial arm. Niche but still the strongest at CDC, indexing, and incremental processing.
- Apache Paimon (from the Flink team) arrived with a new category called "streaming lakehouse." Built on LSM trees, it handles real-time updates and analytics in the same table.
- Query engines have diversified. Trino (formerly PrestoSQL) is the standard for distributed OLAP; DuckDB is "SQLite for analytics" and handles almost any single-node analytics problem; ClickHouse Cloud owns real-time OLAP; Apache DataFusion (Rust) aims to be the next-generation embedded SQL engine.
- dbt has become the standard for SQL transformation, and "analytics engineer" emerged as a real job title.
This essay maps that landscape — the three table format heavyweights (Iceberg, Delta, Hudi) and the challenger Paimon, the engines (Spark 4, Flink 2, Trino, DuckDB), the cloud data platforms (Databricks, Snowflake, BigQuery, ClickHouse), and the actual stacks at Korean and Japanese companies — all in one go.
1 · The 2026 Data Lakehouse Map — Three Axes
Start with a picture. The 2026 data lakehouse can be understood as three orthogonal axes.
                 [ Catalog / Governance ]
             Unity Catalog · Polaris · BigLake
             Glue · Nessie · Snowflake Horizon
                             |
                             |
[ Compute / Query Engines ] -+- [ Table Formats / Storage ]
Spark 4 · Flink 2 · Trino    |  Iceberg · Delta · Hudi · Paimon
Presto · DuckDB              |  Parquet · ORC · Avro
ClickHouse · DataFusion      |  S3 · GCS · ADLS
                             |
                             |
              [ Transform / Orchestration ]
                 dbt · SQLMesh · Coalesce
                Airflow · Dagster · Prefect
- Table format: layers transactions, schemas, and metadata on top of Parquet files. Iceberg is the standard; Delta is the Databricks camp; Hudi is niche; Paimon targets streaming.
- Engine: reads, writes, and transforms that data. Spark for ETL and batch; Flink for streaming; Trino for distributed OLAP; DuckDB for single-node OLAP.
- Catalog: manages tables, schemas, and permissions. The hottest area in 2025. Unity Catalog (open-sourced by Databricks), Polaris (Snowflake's reference Iceberg REST implementation), BigLake (GCP), and Nessie are the main players.
That these three axes can be decoupled is the heart of the lakehouse. Iceberg for the table format, Spark/Trino/DuckDB picked per workload for the engine, Unity for the catalog — that kind of mix-and-match is now natural.
Traditional warehouses (Snowflake, BigQuery) had all three axes locked inside one company. The lakehouse unlocks them. That's why Snowflake also built Polaris, and BigQuery extended BigLake to read external Iceberg tables. Unlocking itself has become the 2026 default.
2 · Apache Iceberg — Winner of the Table Format War
2.1 Why a "table format" is needed at all
Parquet files are great. Columnar, compressed, with statistics pushdown. But that alone is not enough.
- A "table" with 10,000 Parquet files — how do you add a column? How is the schema change recorded?
- Two jobs write to the same table at once — what about ACID?
- Want to see yesterday's state — time travel?
- Need to update 1M out of 10B rows — is there a way other than overwriting whole files?
A table format is the metadata layer that solves this. JSON and Avro files on top of Parquet hold information like "the current snapshot of this table is X, the schema is Y, the next transaction is Z."
Iceberg, Delta, and Hudi are different ways of solving the same problem.
2.2 Iceberg's data model
Iceberg has a three-tier metadata hierarchy.
[ Catalog ] ─ catalog (Glue, Nessie, Polaris, REST)
|
v
[ Metadata file v0.json, v1.json ... ]
|
v
[ Snapshot ] ─── state of the table at a point in time
|
v
[ Manifest list ]
|
v
[ Manifest ] ─ which data files exist where, with statistics
|
v
[ Data files (Parquet/ORC/Avro) ]
Core ideas:
- Snapshot-based. Every write creates a new snapshot. Old snapshots live until deleted — time travel and rollback come for free.
- Metadata points at metadata. The catalog only knows where the current metadata file is. Inside that is the snapshot list, inside that the manifest list, inside that the manifests, inside that the data files.
- Partitions are hidden. The user only writes the event_time column, and Iceberg manages partitions like day(event_time) internally. Partition evolution is possible.
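The snapshot idea and hidden partitioning can be sketched in a few lines of standard-library Python. This is a toy in-memory model, not Iceberg's file layout and not the PyIceberg API: `ToyTable`, `Snapshot`, and `day_partition` are hypothetical names, and a plain tuple of file names stands in for the manifest list and manifests. The one detail taken from the Iceberg spec is that the day transform yields days since 1970-01-01.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]  # stands in for manifest list -> manifests -> files


@dataclass
class ToyTable:
    """Hypothetical model; real Iceberg keeps this state in JSON and Avro files."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_id: Optional[int] = None  # the only thing the catalog points at

    def append(self, new_files):
        # Every write creates a new immutable snapshot; old ones are retained.
        base = self.files_at(self.current_id) if self.current_id is not None else ()
        snap = Snapshot(len(self.snapshots) + 1, base + tuple(new_files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def files_at(self, snapshot_id):
        # Time travel: look up the file list of any retained snapshot.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)

    def rollback(self, snapshot_id):
        # Rollback only moves the pointer; no data file is rewritten.
        self.files_at(snapshot_id)  # raises if the snapshot is unknown
        self.current_id = snapshot_id


def day_partition(event_time: datetime) -> int:
    # Hidden partitioning: the partition value is derived from the column value.
    return (event_time.date() - date(1970, 1, 1)).days


table = ToyTable()
first = table.append(["a.parquet"])
second = table.append(["b.parquet"])
```

After the second append, `table.files_at(first)` still returns only `a.parquet`, which is the time-travel property in miniature. The writer never supplies a partition value; `day_partition` derives it from `event_time`.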
2.3 Why Iceberg won
As recently as 2023, "Iceberg vs Delta vs Hudi" was genuinely a contest. By late 2025, the industry center of gravity had clearly shifted to Iceberg. The reason can be summarized in one sentence.
Iceberg is a standard, Delta is a product.
- Vendor-neutral. Netflix built it and donated it to Apache. Databricks, Snowflake, AWS, GCP, Cloudera, Dremio, and Tabular all support it as first-class.
- REST Catalog standard. The Iceberg REST Catalog spec stabilized in 2024, making "any vendor's catalog can speak the same API" real.
- Cloud vendor lock-in dissolved. Tables stored in Snowflake can be read directly from Databricks, Trino, or Spark. "Change the engine without moving the data" became real.
- The Tabular acquisition. The fact that Databricks bought it for over $1B was the deciding blow. "The Delta company buys the Iceberg founder" was a huge signal.
Snowflake announcing Polaris Catalog in June 2024 and contributing it to the Apache incubator later that year, and AWS S3 Tables (December 2024) supporting Iceberg as first-class, are all part of the same current.
2.4 Try Iceberg yourself — PyIceberg
The smallest possible example with PyIceberg (the Python-native Iceberg client).
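from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import LongType, NestedField, StringType, TimestampType

# 1. Load a catalog (Glue, REST, SQL, Hive all supported)
catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

# 2. Define the schema (optional fields are marked explicitly)
schema = Schema(
    NestedField(1, "event_id", LongType(), required=True),
    NestedField(2, "user_id", StringType(), required=False),
    NestedField(3, "event_time", TimestampType(), required=False),
    NestedField(4, "event_type", StringType(), required=False),
)

# 3. Create the table, partitioned by day(event_time).
#    Assumes the "analytics" namespace already exists in the catalog.
partition_spec = PartitionSpec(
    PartitionField(source_id=3, field_id=1000, transform=DayTransform(), name="event_day")
)
table = catalog.create_table(
    identifier="analytics.events",
    schema=schema,
    partition_spec=partition_spec,
)

# 4. Append data (as an Arrow Table). The timestamps are illustrative values.
#    event_id is declared non-nullable so it matches the required Iceberg field.
arrow_schema = pa.schema([
    pa.field("event_id", pa.int64(), nullable=False),
    pa.field("user_id", pa.string()),
    pa.field("event_time", pa.timestamp("us")),
    pa.field("event_type", pa.string()),
])
data = pa.table(
    {
        "event_id": [1, 2, 3],
        "user_id": ["u1", "u2", "u3"],
        "event_time": [
            datetime(2026, 5, 1, 9, 0),
            datetime(2026, 5, 1, 9, 5),
            datetime(2026, 5, 2, 10, 0),
        ],
        "event_type": ["click", "view", "purchase"],
    },
    schema=arrow_schema,
)
table.append(data)

# 5. Read (scan)
result = table.scan(
    row_filter="event_type == 'purchase'",
    selected_fields=("event_id", "user_id"),
).to_pandas()

# 6. Time travel: read the previous snapshot once the table has more than one
history = table.history()
if len(history) >= 2:
    old_snapshot = history[-2].snapshot_id
    old_data = table.scan(snapshot_id=old_snapshot).to_pandas()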
That's it. Without any distributed cluster, in pure Python, you create an Iceberg table, read it, and query historical snapshots. This makes it concrete that Iceberg is a "spec," not an "engine."
3 · Delta Lake (Databricks) — Still a Strong Contender
3.1 What makes Delta different
Delta Lake was open-sourced by Databricks in 2019. The essence is the same as Iceberg: a transaction log layered over Parquet files to provide ACID, time travel, and schema management. The difference is the metadata structure and the ecosystem.
[ Delta Table ]
|
v
_delta_log/
00000000000000000000.json ← transaction log (JSON per line)
00000000000000000001.json
...
00000000000000000010.checkpoint.parquet
_last_checkpoint
|
v
data files (Parquet)
- The transaction log is JSON, one file per commit. Every 10 commits the log is compacted into a Parquet checkpoint. Whereas Iceberg is a tree of metadata→snapshot→manifest, Delta is a flat log sequence.
- Deeply tied to the Databricks ecosystem. Liquid Clustering, Predictive I/O, Photon — the most mature optimizations live inside Databricks.
- Delta Sharing. An open protocol to share data with other organizations without copying. Iceberg does not yet have a peer standard.
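The "flat log sequence" idea can be illustrated with a small replay loop. This is a simplified sketch, not the Delta Lake reader: it only understands `add` and `remove` actions (real commits also carry actions such as `metaData`, `protocol`, and `commitInfo`), and the file names and the `replay` helper are made up for illustration.

```python
import json


def replay(log_lines_by_version, up_to_version):
    """Return the set of live data files after applying commits 0..up_to_version."""
    live = set()
    for version in sorted(log_lines_by_version):
        if version > up_to_version:
            break
        for line in log_lines_by_version[version]:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical commits: each version is a list of JSON lines, one action per line.
log = {
    0: ['{"add": {"path": "part-000.parquet"}}'],
    1: ['{"add": {"path": "part-001.parquet"}}'],
    2: [
        '{"remove": {"path": "part-000.parquet"}}',
        '{"add": {"path": "part-002.parquet"}}',
    ],
}

current_files = replay(log, up_to_version=2)
files_at_v1 = replay(log, up_to_version=1)  # time travel = replay a prefix of the log
```

In this model a checkpoint is just the saved result of `replay` at some version, so a reader can start from there instead of from version 0.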
3.2 UniForm — Delta imitates Iceberg
Databricks's Delta UniForm announcement in 2023 was the moment the market direction became clear.
Take the same Parquet files and write Delta logs and Iceberg metadata at the same time. From the outside, "the table is also an Iceberg table."
This makes it possible to read with Iceberg from Snowflake, Trino, and BigQuery, while writing with Delta inside Databricks. Databricks itself opened the path to making "the Delta camp" meaningless.
After the Tabular acquisition in June 2024, that strategy deepened. By 2025 the direction "do not treat Delta and Iceberg as separate formats — treat them as the same metadata" was set in stone.
3.3 Why Delta still survives
- Delta is faster inside Databricks. Many optimizations like Liquid Clustering, Predictive I/O, and Photon are Delta-only.
- Tightest integration with Spark Structured Streaming. Most natural with Delta.
- Delta Sharing has settled as a data-sharing standard.
In short: "In a world where Iceberg is the standard, Delta has settled into being Databricks's internal optimized format." That's the 2026 picture.
4 · Apache Hudi (Uber) — The Niche Charm
4.1 Hudi's identity
Hudi was built by Uber in 2016 and donated to Apache in 2019. Its identity is "upsert-first."
From day one it solved "how do we apply frequent updates to a data lake?" Uber's need to reflect CDC (Change Data Capture) events at minute-level latency was the starting point.
Hudi's two storage types
[ Copy-on-Write (CoW) ]
Write: rewrite the entire affected Parquet file
Read: fast (just Parquet)
Best for: read-heavy, infrequent writes
[ Merge-on-Read (MoR) ]
Write: stack deltas into separate log files (Avro)
Read: merge Parquet plus logs (real-time view)
Best for: frequent updates and deletes (CDC)
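The trade-off between the two storage types can be shown with plain dictionaries. This is a conceptual sketch only: a dict stands in for a Parquet base file, a list of dicts stands in for the Avro log files, and the function names are hypothetical rather than part of Hudi's API.

```python
def cow_upsert(base_file, updates):
    """Copy-on-write: produce a fully rewritten copy of the base file."""
    rewritten = dict(base_file)
    rewritten.update(updates)
    return rewritten


def mor_upsert(log_files, updates):
    """Merge-on-read: append a small delta log and leave the base file untouched."""
    return log_files + [dict(updates)]


def mor_read(base_file, log_files):
    """Real-time view: merge base and logs at read time; later logs win."""
    merged = dict(base_file)
    for delta in log_files:
        merged.update(delta)
    return merged


base = {"u1": "click", "u2": "view"}  # record key -> latest value

# CoW pays at write time: the whole file is rewritten, reads stay trivial.
cow_base = cow_upsert(base, {"u2": "purchase"})

# MoR pays at read time: writes are cheap appends, reads must merge.
logs = mor_upsert([], {"u2": "purchase"})
logs = mor_upsert(logs, {"u3": "click"})
mor_view = mor_read(base, logs)
```

Compaction in this picture amounts to storing `mor_read(base, logs)` as the new base file and clearing the logs.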
4.2 Hudi's strengths — indexing and incremental queries
There are two things Iceberg and Delta still cannot match.
- Record-level indexes. Bloom filter, HBase, bucket index — given a user_id, you find which file it lives in instantly. The fundamental reason upserts are fast.
- Incremental queries. "Give me only the rows that changed after this commit" is a first-class operation. Slot it directly downstream of any CDC pipeline.
These are exactly the areas Iceberg is trying to catch up on. Iceberg v2 already added position and equality deletes for merge-on-read, and v3 adds deletion vectors and row lineage, so the gap is likely to narrow in 2026. For now Hudi holds the position of "the niche heavyweight."
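Both ideas, a record-level index and incremental queries, fit in a short toy model. The class below is a hypothetical illustration and not Hudi's implementation: the index is a plain dict from record key to file group, and commit instants are small integers.

```python
class ToyUpsertTable:
    def __init__(self):
        self.record_index = {}  # record key -> file group holding that key
        self.commits = []       # (instant, {key: value}) in commit order

    def upsert(self, instant, rows, new_file_group):
        """Route each row to a file group and record the commit."""
        routed = {}
        for key, value in rows.items():
            # Known keys go back to their existing file group via the index;
            # unseen keys are assigned to new_file_group.
            target = self.record_index.setdefault(key, new_file_group)
            routed.setdefault(target, {})[key] = value
        self.commits.append((instant, dict(rows)))
        return routed

    def incremental(self, after_instant):
        """Rows changed strictly after the given commit, latest value per key."""
        changed = {}
        for instant, rows in self.commits:
            if instant > after_instant:
                changed.update(rows)
        return changed


table = ToyUpsertTable()
table.upsert(1, {"u1": 10, "u2": 20}, new_file_group="fg-0")
routed = table.upsert(2, {"u2": 25, "u3": 30}, new_file_group="fg-1")
changes = table.incremental(after_instant=1)
```

The second upsert touches only the file group that already holds `u2`, without scanning any data. A downstream job that last processed commit 1 reads only `changes`.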
4.3 Onehouse — Hudi's commercial path
Onehouse was founded by Hudi co-creator Vinoth Chandar in 2021. It sells a managed service under the "Universal Data Lakehouse" banner, covering Hudi, Iceberg, and Delta. Interestingly, the Hudi company does not sell only Hudi.
Two tools Onehouse released in 2024-25 were significant.
- Apache XTable (formerly OneTable). Converts metadata between Hudi, Iceberg, and Delta. Exposes the same Parquet as three formats' metadata at once. The open version of UniForm.
- Open Engines. A layer that applies Hudi's indexing and incremental processing to other formats.
5 · Apache Paimon (Flink Team) — The New Challenger
5.1 Why yet another table format
Iceberg, Delta, and Hudi all started from "batch analytics." Streaming was layered on top.
Paimon starts from "streaming." It began inside the Apache Flink team in 2022 and graduated to a Top-Level Project in 2024. The core difference is the LSM tree (Log-Structured Merge Tree).
Paimon = "a table format built on an LSM tree"
- The structure LevelDB and RocksDB use
- Memory → L0 → L1 → L2 ... incremental merge
- Fast writes + efficient reads + natural compaction
- Couples naturally with Flink's checkpointing
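For readers who have not met LSM trees before, here is a minimal sketch of the write path described above. It is a teaching toy under simplifying assumptions (a single level of sorted runs, linear search inside a run, no deletes) and has nothing to do with Paimon's actual on-disk layout; `ToyLSM` and its methods are hypothetical names.

```python
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}  # in-memory buffer for recent writes
        self.runs = []      # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory, which is why they are fast.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable becomes a new sorted run (think L0 file).
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Check memory first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one; newer values overwrite older ones.
        merged = {}
        for run in reversed(self.runs):
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


store = ToyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)  # memtable full -> flushed as the first run
store.put("a", 3)
store.put("c", 4)  # flushed as a second, newer run
runs_before = len(store.runs)
value_before = store.get("a")
store.compact()
```

Reads return the newest value for `a` both before and after compaction; compaction only reduces how many runs a read has to consult.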
5.2 What Paimon solves
- Real-time materialized views. Put Flink-computed aggregates into a Paimon table; read them from Trino or Spark immediately.
- Streaming CDC. Apply CDC events from Debezium to a Paimon table at minute granularity. More natural than Iceberg.
- Lookup join. Use a Paimon table as a lookup table inside a Flink job. Like Kafka Streams's GlobalKTable.
As of 2025, Paimon is most heavily used inside Chinese giants like Alibaba and ByteDance, and is spreading to the West. If Iceberg is the standard for batch, Paimon aims to be the standard for "streaming lakehouse."
5.3 Iceberg vs Paimon — competition or complement?
Looking at the trend, it's closer to a complement.
- Iceberg: huge analytic tables, snapshot-based, time travel.
- Paimon: streaming ingestion, frequent updates, materialized views.
In 2026 it'll be normal to keep both formats inside the same data platform. Flink writes to Paimon; once cubes and aggregates form, they harden into Iceberg.
6 · The Tabular Acquisition (Databricks, June 2024, $1B+) — Meaning
6.1 What happened
In June 2024, Databricks acquired Tabular for over $1B. Founded by Iceberg co-creators Ryan Blue, Daniel Weeks, and Jason Reid, Tabular sold a managed Iceberg-based data platform.
Rumors that Snowflake also tried to buy Tabular were strong at the time. Databricks ultimately won, and the price was extremely high relative to Tabular's ARR. What Databricks paid for was not revenue but people and the standard.
6.2 What changed
- Databricks now has direct influence over the Iceberg standard. Many Iceberg PMC members and committers are now Databricks employees.
- Delta UniForm accelerated. Delta is no longer "imitating" Iceberg — the same company evolves both formats together.
- Polaris (Snowflake) vs Unity Catalog (Databricks) — the catalog war began in earnest. Snowflake announcing Polaris in June 2024 was the same wave. Databricks open-sourced Unity Catalog in June 2024.
The event was the signal that the table format war was over. Iceberg becomes the standard; the real competition starts at the catalog, engine, and service layers.
6.3 Ryan Blue's message
At the first conference after the acquisition (Iceberg Summit 2024), Ryan Blue said something that stuck.
"Table formats need to be standards. Once the standard is set, the real competition starts above it."
That's the whole thing. Data engineering in 2026 is no longer "which format do you choose?" but "which engines, catalogs, and UX do you choose on top of Iceberg?"
7 · Onehouse — The Commercialization of the Hudi Camp
Onehouse is the David side of David and Goliath. Where Databricks and Snowflake move tens of billions, Onehouse is a small company just past Series B. But it has a clear position.
- Apache XTable — Hudi/Iceberg/Delta metadata compatibility layer. Open source.
- Onehouse Cloud — managed lakehouse. Hudi at the core but exposed simultaneously as Iceberg and Delta.
- Open Engines — apply indexing, materialized views, and CDC optimization across formats.
The Onehouse hypothesis is simple.
"The era of one format per company is over. Inside one company, Iceberg, Delta, and Hudi will all coexist. Whoever stitches them seamlessly wins."
As of 2026 that hypothesis keeps coming true. In real enterprises it's now common to see "Databricks uses Delta, Snowflake uses Iceberg, in-house analytics uses Hudi" all inside one company.
8 · dbt — The Standard for Transformation
8.1 What dbt changed
dbt (data build tool) was started by Fishtown Analytics (now dbt Labs) in 2016. The core idea is simple.
"Write your models as SQL. We'll solve the dependencies. Tests are SQL too. Docs come out of SQL."
Previously, analysts hand-wrote SQL and embedded it in BI tools. Transformation logic was scattered — where it lived, what depended on what, how it was tested — all in different places.
What dbt organized:
models/
  staging/
    stg_orders.sql          ← raw → cleaned
    stg_customers.sql
  marts/
    core/
      dim_customers.sql
      fct_orders.sql        ← references stg_orders, dim_customers
tests/
  not_null_dim_customers_id.sql
dbt_project.yml             ← project config
- SQL is the model. Each file is one SELECT, and dbt turns it into CREATE TABLE AS or CREATE VIEW.
- References via jinja macros. Things like ref('stg_orders'). dbt builds the dependency graph automatically.
- Tests are SQL. Assertions like "this column must be not null" or "must be unique" expressed as SQL queries.
- Docs are YAML. Add descriptions per column or model; dbt docs visualize them.
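To make the "dbt builds the dependency graph" step concrete, the sketch below extracts `ref(...)` calls with a regular expression and orders the models topologically. It is an illustration of the idea and not how dbt is implemented: dbt renders real Jinja, while this toy only recognizes the literal `{{ ref('name') }}` pattern, and the `analytics` schema name and helper functions are assumptions.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical model files, keyed by model name.
models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "dim_customers": "SELECT * FROM {{ ref('stg_customers') }}",
    "fct_orders": (
        "SELECT o.* FROM {{ ref('stg_orders') }} o "
        "JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.id"
    ),
}


def dependencies(sql):
    """Names of the models a SELECT refers to."""
    return set(REF_PATTERN.findall(sql))


def build_order(model_sql):
    """Order models so that every model comes after the models it references."""
    graph = {name: dependencies(sql) for name, sql in model_sql.items()}
    return list(TopologicalSorter(graph).static_order())


def compile_model(name, sql, schema="analytics"):
    """Replace each ref() with a qualified name and wrap the SELECT in a view."""
    resolved = REF_PATTERN.sub(lambda m: f"{schema}.{m.group(1)}", sql)
    return f"CREATE VIEW {schema}.{name} AS {resolved}"


order = build_order(models)
compiled = compile_model("dim_customers", models["dim_customers"])
```

Because the graph is derived from the SQL itself, adding a new `ref` to a model changes the build order with no separate configuration.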
8.2 The Analytics Engineer
The biggest change dbt brought is the rise of the "Analytics Engineer" role. Someone who owns transformation, modeling, testing, and documentation using SQL as their tool. In between data engineer and analyst.
By 2026 this role is the majority of many data teams. Data engineers (with Python and Spark) own ingestion and infrastructure; analytics engineers model with dbt; data analysts build BI on top.
8.3 dbt's competitors
After dbt became the standard, challengers arrived.
- SQLMesh — dbt alternative from Tobiko Data. Differentiates with virtual data environments (test changes without copying schemas) and column-level lineage (column-level dependencies). Takes CI/CD more seriously.
- Coalesce — UI-based dbt alternative. Compose transformations through a GUI rather than writing code directly.
- dbt Mesh / dbt Cloud — dbt Labs's enterprise answer. Cross-project dependencies, model governance.
As of 2026, dbt is dominant, but SQLMesh is catching up fast. Both share the core value: SQL is code, so it needs versioning, testing, and CI/CD.
9 · Trino (formerly PrestoSQL) / Presto — Distributed OLAP Engine
9.1 The Presto fork
Presto was built by Facebook in 2012 as a distributed SQL engine. In 2019 the core developers left and forked it as PrestoSQL, which was renamed Trino in December 2020. The side that stayed with Presto Foundation (Linux Foundation) became PrestoDB, essentially Meta-internal.
As of 2026:
- Trino — de facto industry standard. Commercialized by Starburst. Over 50 connectors including Iceberg, Delta, Hive, MySQL, PostgreSQL, MongoDB, Kafka.
- Presto (PrestoDB) — alive only inside Meta. External adoption has nearly stopped.
Seven years on from the fork, Trino has become Presto's true successor.
9.2 Trino's position
Trino is a "federated query engine." It does not store data itself. It JOINs an Iceberg table on S3, an operational DB in MySQL, and a Kafka topic with a single SQL statement.
-- Iceberg table ⋈ MySQL table ⋈ PostgreSQL table
SELECT
i.user_id,
m.user_name,
p.last_login,
SUM(i.amount) AS total_spent
FROM iceberg.analytics.purchases i
JOIN mysql.app.users m ON i.user_id = m.id
JOIN postgres.crm.profiles p ON i.user_id = p.user_id
WHERE i.event_date >= DATE '2026-05-01'
GROUP BY 1, 2, 3
That this is one query is Trino's identity. Trino has settled as the OLAP standard for the data lakehouse.
9.3 Who uses Trino
- Netflix: the company that created Iceberg. Iceberg plus Trino is their data platform.
- LinkedIn, Shopify, Lyft: top-tier adopters.
- Starburst: the commercial company founded by Trino creators. Sells Starburst Galaxy (managed) and Starburst Enterprise.
- AWS Athena: started out on Presto; engine version 3 is based on Trino.
9.4 Trino vs DuckDB — single-node challenge
Interesting nuance: on a single node DuckDB often beats Trino. When data is small or medium (tens to hundreds of GB), DuckDB is enough and often faster. Trino is narrowing into "queries that genuinely need distribution."
10 · Spark 4 (May 2025) + Apache Flink 2: Processing Engines
10.1 What changed in Spark 4
Apache Spark 4.0 reached general availability in May 2025, after preview releases in 2024. Three key changes.
- Spark Connect stabilized — client and server separated, talking over gRPC. Python, Scala, Go, and Rust clients all become possible. Drive heavy clusters from light clients in notebooks or CI.
- Pandas API on Spark matured. import pyspark.pandas as ps runs Pandas code almost unchanged in distributed form.
- ANSI SQL is the default. From 4.0, ANSI mode is the default. Implicit casts and NULL handling align with the standard.
Spark is no longer "after Hadoop." It's "the standard ETL and batch engine, writing to and reading from any of Iceberg, Delta, Hudi."
10.2 What changed in Flink 2
Apache Flink 2.0 released in early 2025. The core is cloud-native state management.
- Disaggregated state — a mode that separates state from local RocksDB and stores it in S3 or GCS. Jobs with large state (joins, session windows) are no longer trapped inside a single node's memory.
- Hybrid Source — bridges batch (past) and streaming (current) from the same source without seams.
- PyFlink matured. Python users are now first-class.
Flink 2 plus Paimon is becoming the 2026 standard for "streaming lakehouse."
10.3 Spark vs Flink — what to use when
| Axis | Spark | Flink |
|---|---|---|
| Batch | standard | possible |
| Streaming | Structured Streaming (micro-batch) | true streaming (event-by-event) |
| Latency | seconds | milliseconds |
| State management | weak | first-class |
| ML / DataFrame | very strong | weak |
| Talent pool | very large | medium |
ETL, batch, and ML go to Spark; true streaming goes to Flink. That divide holds in 2026. The nuance is that Spark's Structured Streaming has gotten good enough that "seconds of latency is fine, Spark is enough" applies more often.
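The latency row of the table follows from simple arithmetic, sketched below. This is a deliberately simplified model and not a benchmark of either engine: it assumes a micro-batch engine emits results only at fixed trigger boundaries and ignores its processing time, while an event-at-a-time engine pays a constant per-event processing delay. All numbers are hypothetical.

```python
def micro_batch_latencies(arrivals_ms, trigger_interval_ms):
    """Latency per event when results are only emitted at trigger boundaries."""
    latencies = []
    for t in arrivals_ms:
        # Ceiling division: the first trigger at or after the arrival time.
        next_trigger = -(-t // trigger_interval_ms) * trigger_interval_ms
        latencies.append(next_trigger - t)
    return latencies


def per_event_latencies(arrivals_ms, processing_ms):
    """Latency per event when each event is processed as soon as it arrives."""
    return [processing_ms for _ in arrivals_ms]


arrivals = [100, 900, 1500]  # hypothetical arrival times in milliseconds
batch = micro_batch_latencies(arrivals, trigger_interval_ms=1000)
stream = per_event_latencies(arrivals, processing_ms=5)
```

With a one-second trigger an event waits anywhere from zero to a full interval, so average latency sits around half the interval. That is acceptable for dashboards and too slow for fraud checks, which is roughly where the Spark and Flink dividing line falls.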
11 · Databricks / Snowflake / BigQuery / ClickHouse — Cloud Data Platforms
11.1 Databricks — the lakehouse pioneer
- Unity Catalog: open-sourced June 2024. Data, AI, notebooks, features, and models in one catalog.
- Delta + Iceberg ambidextrous. UniForm exposes the same data as both formats.
- Mosaic AI: after the MosaicML acquisition, model training and serving on the same platform.
- Photon: vectorized native query engine. ANSI-compliant Spark SQL on top, with Photon executing it underneath.
The Databricks message in 2026: "Data plus AI equals one platform." Not just ETL and BI but model training in one place.
11.2 Snowflake — from warehouse to lakehouse
Started as a traditional warehouse, but in 2024-25 moved fast toward the lakehouse.
- Polaris Catalog: announced June 2024 and contributed to the Apache incubator later that year. A reference implementation of the Iceberg REST standard.
- Iceberg Tables: Iceberg tables as first-class inside Snowflake. External engines (Spark, Trino) can read the same data.
- Snowpark: run Python, Java, or Scala inside Snowflake. UDFs, UDTFs, stored procedures.
- Cortex: LLM and ML features callable as SQL.
The Snowflake message: "Even when Iceberg becomes the standard, we are a first-class citizen of that standard."
11.3 BigQuery — Google's answer
- BigLake — read and write external Iceberg, Delta, and Hudi tables from BigQuery.
- BigQuery Studio — notebooks, SQL, and Python in one workbench.
- Gemini in BigQuery — natural language to SQL and code completion.
Inside GCP, BigQuery remains the smoothest choice. Accepting Iceberg through BigLake was the big 2025 shift.
11.4 ClickHouse Cloud — the real-time OLAP heavyweight
ClickHouse is not an orthodox OLTP/HTAP system but a columnar DB specialized for real-time analytics. It was spun out of Yandex in 2021 as ClickHouse Inc.; Cloud grew significantly in 2024-25.
- Millisecond response. Aggregations across billions of rows within a second.
- Materialized views as first-class. Aggregates built at ingestion.
- Used for: product analytics (Plausible, PostHog), observability (logs, metrics), ad analytics.
ClickHouse is less "part of the data lakehouse" and more "a separate engine for real-time OLAP." It connects to Iceberg through external tables.
12 · AWS Athena + Glue — Cloud Managed
12.1 Athena — serverless SQL over S3
AWS Athena is a serverless query engine based on Presto/Trino. Query Parquet, ORC, Iceberg, and Delta files directly from S3 with SQL. No clusters to spin up or manage.
-- Iceberg table query (Athena v3 engine)
SELECT
event_date,
COUNT(*) AS events,
COUNT(DISTINCT user_id) AS dau
FROM iceberg_catalog.analytics.events
WHERE event_date BETWEEN DATE '2026-05-01' AND DATE '2026-05-15'
GROUP BY 1
ORDER BY 1
- Billing is per data scanned ($5/TB).
- First-class Iceberg support since 2022.
- Athena for Spark (2022) lets you run Spark jobs serverless too.
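Per-scan billing makes query cost easy to estimate up front. The helper below is a back-of-the-envelope sketch: the $5/TB rate comes from the text above, while treating a terabyte as 10^12 bytes and applying a 10 MB minimum per query are assumptions to check against current AWS pricing.

```python
PRICE_PER_TB_USD = 5.0            # rate quoted in the text above
BYTES_PER_TB = 10**12             # assumption: decimal terabytes
MIN_BYTES_PER_QUERY = 10 * 10**6  # assumption: 10 MB minimum charge per query


def query_cost_usd(bytes_scanned):
    """Estimated cost of one query given the bytes it scans."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = query_cost_usd(2 * 10**12)   # scanning a 2 TB table end to end
pruned_scan = query_cost_usd(50 * 10**9)  # same table, pruned to 50 GB
```

The gap between the two numbers is why partition pruning and column selection matter so much here: the bill tracks bytes read, not rows returned.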
12.2 Glue — catalog plus ETL
- Glue Data Catalog — AWS's standard metastore. Hive Metastore compatible. Athena, Redshift Spectrum, EMR, and SageMaker all read from it.
- Glue ETL — Spark-based serverless ETL.
- Glue Studio — visual ETL builder.
The Glue Data Catalog launched an adapter that mimics Iceberg REST in 2024. Inside AWS, "Iceberg plus Athena plus Glue Catalog plus S3" became the most natural combination.
12.3 S3 Tables (Dec 2024)
AWS's S3 Tables, announced December 2024, was a big shift. S3 itself manages Iceberg tables as first-class. Auto-compaction, snapshot expiration, and metadata management — all handled by S3 directly.
S3 (object storage)
|
v
S3 Tables (first-class Iceberg) ← auto compaction, expiration, stats
|
v
Athena · EMR · Glue · Redshift · Trino · Spark
The implication is large. A cloud provider started handling Iceberg directly. The final nail in "Iceberg is the standard."
13 · DuckDB — "SQLite for Analytics"
13.1 DuckDB's identity
DuckDB started in 2019 at CWI in the Netherlands as an in-process OLAP database. The slogan is "SQLite for analytics." No server to run — import as a library and run SQL inside your own process.
import duckdb

# Query Parquet directly. No cluster, no server.
# Reading s3:// paths relies on DuckDB's httpfs extension and configured credentials.
result = duckdb.sql("""
    SELECT
        user_id,
        COUNT(*) AS events,
        SUM(amount) AS total
    FROM 's3://my-bucket/events/*.parquet'
    WHERE event_date >= '2026-05-01'
    GROUP BY user_id
    ORDER BY total DESC
    LIMIT 100
""").df()  # → Pandas DataFrame
That's the whole thing. No PostgreSQL client, no Spark cluster, no Snowflake account.
13.2 Why DuckDB exploded
DuckDB's GitHub stars roughly 4xed in 2024-25. The reasons are simple.
- Cloud memory and disk got too big. 128GB or 256GB RAM instances cost a few dollars per hour. Tens of GB of analytics fit on a single node and are fast.
- First-class integration with Apache Arrow and Parquet. Zero-copy with Pandas and Polars.
- In-process. Runs in Lambdas, notebooks, and CI alike.
- MotherDuck — commercial company founded 2022. Bridges local DuckDB and the cloud with "hybrid execution."
13.3 The slot DuckDB occupies
- Notebook analytics. Often faster than Pandas and more natural in SQL.
- CI and tests. Verifies faster than Spark or BigQuery.
- Embedded analytics. Embedded inside desktop apps, CLIs, and BI tools. Notebook and BI products such as Hex increasingly use DuckDB internally.
- Edge analytics. A WASM build runs in the browser.
Spark and Trino started getting asked "do you really need distribution?" When data is under 1TB, DuckDB is often enough.
14 · Apache Arrow / Parquet / ORC / DataFusion — Low-Level Standards
14.1 Apache Arrow — in-memory columnar standard
If Parquet is the standard for storing data on disk, Apache Arrow is the standard for handling it in memory.
- Columnar in-memory format. The same data shared zero-copy across Python, Java, C++, Rust, and Go.
- Arrow Flight — gRPC-based data transport standard. Receiving Snowflake or BigQuery results as Arrow has become normal.
- Arrow DataFusion — covered below.
Since Pandas 2.0 (2023) Arrow became a backend; Polars was Arrow-native from day one. "DataFrames live on Arrow" has become the standard.
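What "columnar" and "zero-copy" mean can be shown without installing Arrow at all. The sketch below uses Python's standard `array` and `memoryview` as a stand-in; it illustrates the layout idea only and is not Arrow's actual memory format or API.

```python
from array import array

# Row-oriented input, as an application might produce it.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 1, "amount": 0.5},
]

# Columnar layout: one contiguous typed buffer per column.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),  # 64-bit integers
    "amount": array("d", (r["amount"] for r in rows)),    # 64-bit floats
}

# An aggregate over one column touches only that column's buffer.
total_amount = sum(columns["amount"])

# Zero-copy: a memoryview shares the underlying buffer instead of copying it,
# so a change made through the array is visible through the view.
view = memoryview(columns["amount"])
columns["amount"][0] = 10.0
first_through_view = view[0]
```

Arrow standardizes this kind of buffer layout across languages, which is what lets a Python process hand a column to Rust or Java code without serializing it.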
14.2 Parquet vs ORC — the disk formats
- Apache Parquet — from Twitter and Cloudera. De facto standard. Iceberg, Delta, Hudi, and Paimon all use Parquet by default.
- Apache ORC — from Hortonworks and the Hive era. Was strong inside Hive and Presto and still alive, but new projects almost all use Parquet.
The difference is small. Compression ratio and scan performance trade places by scenario. But the ecosystem momentum is overwhelmingly on Parquet. In 2026 there is essentially no reason to start a new project with anything other than Parquet.
14.3 Apache DataFusion — Rust query engine
DataFusion is a SQL query engine written in Rust. Andy Grove started it in 2017 and donated it to the Apache Arrow project in 2019; it became a top-level Apache project in 2024.
- Arrow-native. In-memory representation is Arrow.
- Embedded. Import as a library and run SQL inside your own process.
- Fast. Vectorized execution. SIMD friendly.
DataFusion's position is interesting. It doesn't compete directly with DuckDB — instead, it is adopted as the internal engine of other systems.
- InfluxDB 3.0 — time-series DB. Internal SQL engine replaced with DataFusion.
- Comet — accelerates parts of Spark using DataFusion.
- LanceDB, GreptimeDB, Sail — new databases built on DataFusion at the core.
- Ballista — distributed execution on top of DataFusion.
"When building a Rust DBMS, start your SQL engine from DataFusion" has become the 2026 default.
15 · Korea / Japan — Toss, Kakao KaaP, Mercari, ZOZO
15.1 Toss data — a single data platform
Toss reorganized its data platform in 2025 around Apache Iceberg, dbt, Trino, and Airflow. From its SLASH 25 conference talks:
- Iceberg holds essentially all analytic data. S3, Glue Catalog, Iceberg.
- dbt for transformations. An analytics-engineer role is in production.
- Trino and Athena for queries. Users only see SQL.
- Airflow for orchestration. Manages both dbt jobs and Spark jobs.
What Toss emphasized in particular is "data self-service." Data engineers own only ingestion and infrastructure; analysis and modeling are owned by analytics engineers and analysts themselves.
15.2 Kakao KaaP (Kakao as a Platform)
Kakao runs an internal data platform called KaaP. The core is multi-tenant analytics infrastructure.
- S3 plus HDFS hybrid. Datacenter HDFS plus cloud S3.
- Shared Spark and Trino clusters. Per-team workspaces.
- Airflow-based workflow standardization.
- In-house metastore — integrated internal governance and permissions.
In 2025-26 KaaP is rapidly increasing Iceberg adoption. Migrating from Hive tables to Iceberg is a major effort.
15.3 Mercari — Japan's data platform
Mercari is Japan's leading C2C marketplace. The data platform centers on BigQuery, dbt, and Looker.
- BigQuery-centric. Runs on GCP.
- dbt for transformations. Introduced the analytics-engineer role early.
- Looker for BI.
- Dataform and BigLake — increasingly taking in Iceberg and external tables.
A 2024 post on the Mercari engineering blog about "taking in external Iceberg through BigLake" was striking. Among Japanese companies, Mercari moved fastest toward the lakehouse direction.
15.4 ZOZO — Japanese fashion commerce
ZOZO operates ZOZOTOWN. The data platform centers on BigQuery and Dataform.
- BigQuery-centric.
- Dataform — Google-acquired dbt alternative. Internal standard.
- Data Mesh — distributed domain operations. Catalog and governance are core concerns.
ZOZO is among the Japanese companies that tried data mesh early. A 2024 ZOZO Tech Blog post emphasizing "distributed domain responsibility" is often cited.
16 · Who Should Pick What — SMB / Enterprise / Streaming / Analytics
By scale and need.
16.1 SMB and startups (under $5,000 per month)
| Area | Recommendation |
|---|---|
| Storage | S3 plus Parquet |
| Table format | Start with plain Parquet; move to Iceberg when it grows |
| Engine | DuckDB plus (optional) Athena |
| Transformation | dbt-core |
| Orchestration | dbt schedules plus GitHub Actions |
| BI | Metabase, Lightdash |
The core: don't spin up a cluster. DuckDB is enough up to 100GB. Athena just takes SQL. Snowflake and Databricks are expensive until the data is genuinely large and the analyst headcount passes 5.
16.2 Mid-size ($5,000 to $50,000 per month)
| Area | Recommendation |
|---|---|
| Storage | S3 / GCS / ADLS |
| Table format | Iceberg |
| Engine | Trino (Starburst Galaxy), managed Spark |
| Transformation | dbt Cloud or SQLMesh |
| Catalog | Glue, Polaris, Unity Catalog |
| BI | Looker, Hex, Mode |
Iceberg starts to matter here. This is the zone where you have 2-3 data engineers and 10-plus analysts, and where data infrastructure is a meaningful cost line.
16.3 Enterprise
| Area | Recommendation |
|---|---|
| Storage | Multi-cloud (S3 plus GCS) |
| Table format | Iceberg (Delta UniForm acceptable in parallel) |
| Engine | Databricks (company-wide) plus Snowflake (specific orgs) plus Trino (self-service) |
| Transformation | dbt plus an analytics-engineer org |
| Catalog | Unity Catalog or Polaris (set the company standard) |
| Governance | Atlan, Collibra, OpenMetadata |
The hardest thing at enterprise scale is governance. The catalog has to become a standard before 100-plus teams can see the same data.
16.4 Streaming-focused
| Area | Recommendation |
|---|---|
| Ingestion | Kafka |
| Processing | Flink 2 |
| Table format | Paimon (or Hudi) |
| Downstream analytics | Iceberg plus Trino |
Streaming materialized views go to Paimon; huge analytic tables go to Iceberg. Run both formats inside the same platform.
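One way to see why the two formats split the work: a streaming materialized view is dominated by keyed updates, while a large analytic table is mostly appended to and scanned. The toy sketch below shows only the keyed-merge access pattern, where newer writes for a key replace older ones. It is an illustration, not Paimon's or Hudi's actual implementation; `merge_runs` and the `None` delete marker are hypothetical names and conventions chosen for this example.

```python
from typing import Optional

# A "run" is a batch of keyed writes: key -> row, or None as a delete marker.
# Both the shape of a run and the delete marker are assumptions for this sketch.
Run = dict[str, Optional[dict]]


def merge_runs(runs: list[Run]) -> dict[str, dict]:
    """Merge runs ordered oldest to newest; the newest write per key wins."""
    merged: dict[str, dict] = {}
    for run in runs:  # oldest first, so later runs overwrite earlier ones
        for key, row in run.items():
            if row is None:
                merged.pop(key, None)  # a delete marker removes the key
            else:
                merged[key] = row
    return merged


runs = [
    {"u1": {"clicks": 1}, "u2": {"clicks": 5}},  # initial state
    {"u1": {"clicks": 2}},                       # update to u1
    {"u2": None},                                # delete of u2
]
print(merge_runs(runs))
```

An append-only analytic table never needs this merge step on read, which is part of why it can stay in a format tuned for large scans.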
16.5 Real-time OLAP (product analytics, observability)
| Area | Recommendation |
|---|---|
| Engine | ClickHouse Cloud |
| Ingestion | Kafka into ClickHouse directly |
| Data | Real-time plus daily aggregates |
This is less "part of the data lakehouse" and more a separate engine, used where millisecond response is required.
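The tiers in this section can be condensed into a small lookup. The sketch below is one hedged reading of the tables, not a sizing tool: the function name `recommend_stack` and the dictionary keys are hypothetical, and treating anything above $50,000 per month as "enterprise" is an assumption, since no figure is given for that tier.

```python
def recommend_stack(monthly_spend_usd: float,
                    streaming_first: bool = False,
                    realtime_olap: bool = False) -> dict[str, str]:
    """Map a rough workload profile to the stack suggested in section 16."""
    # Workload-specific picks (16.4, 16.5) take priority over spend tiers.
    if realtime_olap:
        return {"engine": "ClickHouse Cloud",
                "ingestion": "Kafka into ClickHouse directly"}
    if streaming_first:
        return {"ingestion": "Kafka",
                "processing": "Flink 2",
                "table_format": "Paimon (or Hudi)",
                "downstream": "Iceberg plus Trino"}
    if monthly_spend_usd < 5_000:  # 16.1 SMB and startups
        return {"table_format": "plain Parquet, Iceberg when it grows",
                "engine": "DuckDB plus (optional) Athena",
                "transformation": "dbt-core"}
    if monthly_spend_usd <= 50_000:  # 16.2 mid-size
        return {"table_format": "Iceberg",
                "engine": "Trino (Starburst Galaxy), managed Spark",
                "transformation": "dbt Cloud or SQLMesh"}
    # Assumption: everything above the mid-size ceiling counts as enterprise.
    return {"table_format": "Iceberg (Delta UniForm acceptable in parallel)",
            "engine": "Databricks plus Snowflake plus Trino",
            "transformation": "dbt plus an analytics-engineer org"}


print(recommend_stack(1_200))
print(recommend_stack(20_000, streaming_first=True))
```

Spend is a crude proxy. Team size and governance needs, which the prose for each tier stresses, would be reasonable extra inputs.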
17 · Closing — What Begins After the Standard
In the late 2010s there was a "data lake vs data warehouse" debate. In the early 2020s came the "Iceberg vs Delta vs Hudi" war. As of 2026, both are over.
- The lakehouse is the standard. Object storage plus Parquet plus a table format.
- Iceberg is the de facto table format standard. Delta is Databricks's internal optimized format; Hudi is niche; Paimon is the streaming complement.
- Engines are decoupled. Spark, Flink, Trino, DuckDB, and ClickHouse — picked per workload.
- The catalog is the next battlefield. Unity Catalog, Polaris, and BigLake compete.
- dbt is the standard for transformation. SQL is code; the analytics-engineer role has settled in.
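To make "object storage plus Parquet plus a table format" concrete, here is a toy model of what the table-format layer adds: data files are immutable, and each commit records a snapshot listing the files that make up the table at that point. This is a deliberately simplified sketch and not Iceberg's actual metadata layout. `ToyTable`, its methods, and the file-naming scheme are hypothetical, and an in-memory dict stands in for Parquet objects in a bucket.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """Toy table format: immutable data files plus a log of snapshots."""
    # File name -> rows. Stand-in for Parquet objects in object storage.
    files: dict[str, list[dict]] = field(default_factory=dict)
    # Each snapshot is the tuple of file names visible at that commit.
    snapshots: list[tuple[str, ...]] = field(default_factory=list)

    def append(self, rows: list[dict]) -> int:
        """Write one new immutable file, then commit a snapshot including it."""
        name = f"data-{len(self.files):05d}.parquet"
        self.files[name] = list(rows)
        previous = self.snapshots[-1] if self.snapshots else ()
        self.snapshots.append(previous + (name,))
        return len(self.snapshots) - 1  # snapshot id

    def scan(self, snapshot_id: Optional[int] = None) -> list[dict]:
        """Read the latest snapshot, or an older one for time travel."""
        if not self.snapshots:
            return []
        chosen = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id]
        return [row for name in chosen for row in self.files[name]]


table = ToyTable()
first = table.append([{"id": 1}])
table.append([{"id": 2}])
print(len(table.scan()))       # 2 rows in the latest snapshot
print(len(table.scan(first)))  # 1 row as of the first commit
```

Because old files are never rewritten, reading an earlier snapshot is just a matter of following an older file list, which is the basic idea behind time travel. Any engine that understands the snapshot log can read the same table, which is why a shared format lets engines be picked per workload.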
Quoting Ryan Blue again after the Tabular acquisition:
"Once the standard is set, the real competition starts above it."
That's the one-line summary of data engineering in 2026. That Iceberg is the standard is no longer up for debate. The real work begins above it — catalogs, governance, UX, AI integration, real-time sync, multi-cloud.
The data platform you build today will still hold the same Iceberg metadata five years from now. That's what "standard" means.
References
- Apache Iceberg — https://iceberg.apache.org
- Apache Iceberg REST Catalog Spec — https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml
- Delta Lake — https://delta.io
- Delta UniForm — https://docs.databricks.com/aws/en/delta/uniform
- Apache Hudi — https://hudi.apache.org
- Apache Paimon — https://paimon.apache.org
- Tabular acquisition (Databricks blog, Jun 2024) — https://www.databricks.com/blog/databricks-tabular
- Unity Catalog OSS — https://www.unitycatalog.io
- Apache Polaris — https://polaris.apache.org
- AWS S3 Tables (re:Invent 2024) — https://aws.amazon.com/s3/features/tables/
- dbt Labs — https://www.getdbt.com
- SQLMesh — https://sqlmesh.com
- Trino — https://trino.io
- Starburst — https://www.starburst.io
- Presto — https://prestodb.io
- Apache Spark — https://spark.apache.org
- Spark Connect — https://spark.apache.org/docs/latest/spark-connect-overview.html
- Apache Flink — https://flink.apache.org
- Databricks — https://www.databricks.com
- Snowflake — https://www.snowflake.com
- BigQuery / BigLake — https://cloud.google.com/biglake
- ClickHouse Cloud — https://clickhouse.com/cloud
- AWS Athena — https://aws.amazon.com/athena/
- AWS Glue — https://aws.amazon.com/glue/
- DuckDB — https://duckdb.org
- MotherDuck — https://motherduck.com
- Apache Arrow — https://arrow.apache.org
- Apache Parquet — https://parquet.apache.org
- Apache ORC — https://orc.apache.org
- Apache DataFusion — https://datafusion.apache.org
- Onehouse — https://www.onehouse.ai
- Apache XTable — https://xtable.apache.org
- Project Nessie — https://projectnessie.org
- Toss SLASH 25 — https://toss.tech/slash-25
- Mercari Engineering — https://engineering.mercari.com
- ZOZO TECH BLOG — https://techblog.zozo.com