Skip to content
Published on

Modern Cloud Data Warehouses 2026 Deep Dive - Snowflake, Databricks SQL, BigQuery, Redshift, Firebolt, Azure Synapse, MotherDuck

Authors

"Lakehouse is not a product, it is an architecture. By 2026, every serious data platform either is one or sells against one." — Matei Zaharia, Databricks CTO, Data + AI Summit 2025 keynote

The data warehouse traces back to Teradata and IBM DB2 Parallel Edition in the 1980s, evolved through 2000s MPP appliances like Vertica, Greenplum, and Netezza, then jumped into the cloud era in 2014 when Snowflake introduced "compute fully separated from storage as SaaS." Databricks defined the term "Lakehouse" in 2020 and blurred the line between data lakes and warehouses, and in 2024 Apache Iceberg became the de facto open table format — opening a new era where "any engine can read the same data."

As of May 2026, the modern DW market is a multipolar landscape with 20+ engines coexisting — Snowflake, Databricks SQL, Google BigQuery, AWS Redshift, Azure Synapse, Firebolt, MotherDuck, ClickHouse Cloud, StarRocks Cloud, Apache Druid/Pinot, Doris, DuckDB, CrateDB. This article surveys each engine's architecture, pricing, AI integrations, lock-in posture, and real adoption from Coupang, Naver, Mercari, LINE Yahoo, CyberAgent, and Rakuten.

1. MPP DW Evolution — From Teradata to Lakehouse

The data warehouse has gone through four major paradigm shifts across roughly 50 years.

EraRepresentative ProductsCharacteristics
1st gen (1980-2000)Teradata, IBM DB2 PE, Oracle ExadataShared-nothing MPP appliances, custom hardware
2nd gen (2005-2015)Vertica, Greenplum, Netezza, ParAccelColumnar MPP, software-centric
3rd gen (2014-2020)Snowflake, BigQuery, RedshiftCloud-native, storage/compute decoupling
4th gen (2020-)Databricks Lakehouse, Iceberg + Trino/Spark, Apache PolarisOpen table formats + free choice of engine

First-gen systems ran on mainframe-class hardware from DEC and Tandem and cost tens of millions of dollars to handle 100TB. Walmart's 24TB Teradata DW in 1992 was the largest in the world at the time.

Second-gen brought columnar storage and compression to x86 commodity hardware, dropping prices by an order of magnitude. But it remained on-premises by default, with storage and compute coupled in the same box and scaling difficult.

The third-gen revolution was decoupling. Snowflake stored data on S3 and let users spin up compute (virtual warehouses) by the minute. BigQuery went further with a true serverless "pay per query" model. The first/second-gen pattern of "one cluster per query" disappeared entirely.

The fourth gen is openness. With Apache Iceberg, Delta Lake, and Apache Hudi becoming standards, the same data can be accessed by Snowflake, Trino, Spark, or DuckDB. The "single source of truth" is no longer locked into a specific vendor's proprietary format.

2. Snowflake — Market Leader, Synonymous with Storage/Compute Decoupling

Snowflake (snowflake.com) was founded in 2012 by ex-Oracle engineers Benoit Dageville and Thierry Cruanes, and listed on NYSE in 2020 in the largest SaaS IPO at the time. As of May 2026 its market cap is around $60B, with quarterly revenue near $1B, making it the entrenched leader in the cloud DW market.

Snowflake's core architecture is three layers:

  • Storage: Columnar data stored as micro-partitions (50-500MB average) on S3/Azure Blob/GCS. Proprietary format (FDN, columnar), but since 2025 Iceberg external tables are also a first-class citizen
  • Compute (Virtual Warehouse): Virtual compute clusters. Sizes XS/S/M/L/XL/2XL/3XL/4XL/5XL/6XL, billed per second
  • Cloud Services: Metadata, query optimizer, transactions, security — the invisible control plane
-- Snowflake virtual warehouse creation and usage
CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 60         -- auto-suspend after 60s idle
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4     -- multi-cluster concurrency scaling
  SCALING_POLICY = 'STANDARD';

USE WAREHOUSE etl_wh;

-- Query that leverages micro-partition pruning
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE order_date BETWEEN '2026-04-01' AND '2026-04-30'
  AND region = 'APAC'
GROUP BY customer_id;

Key announcements from Snowflake Summit June 2025:

  • Apache Iceberg native tables: Snowflake writes Iceberg metadata directly so external engines (Trino, Spark) can read the same data. The decisive move to address lock-in concerns
  • Dynamic Tables: Declarative ETL — define with CREATE DYNAMIC TABLE and Snowflake auto-refreshes incrementally
  • Snowpark Container Services: Run containers inside Snowflake so Python/R/Java workloads sit next to the data
  • Snowflake Cortex: Built-in LLM API (SNOWFLAKE.CORTEX.COMPLETE, SUMMARIZE, TRANSLATE) — direct access to Mistral, Llama, Reka, Anthropic Claude
  • Snowflake Polaris (open-source catalog, donated to Apache in 2024) — REST API standard Iceberg catalog

Pricing is credit-based, with a credit typically costing $2-$4 (Standard to Business Critical). An XS warehouse is 1 credit/hr, and each size step doubles the rate. A 4XL is 128 credits/hr, which at Business Critical is around $512/hr. Rule #1 is to minimize idle cost via AUTO_SUSPEND.

3. Databricks SQL + Lakehouse — Delta Lake, Photon, Unity Catalog

Databricks (databricks.com) was founded in 2013 by Apache Spark creators Matei Zaharia and Ion Stoica out of UC Berkeley AMPLab. As of May 2026, its valuation is around $62B, putting it shoulder to shoulder with Snowflake.

Databricks started as a Spark + notebooks company but entered the SQL workload market directly against Snowflake when it announced the Lakehouse vision in 2020. Core components:

  • Delta Lake: Table format on the data lake with ACID transactions, schema evolution, and time travel. Open-sourced in 2019, under the Linux Foundation
  • Photon Engine: Vectorized query engine rewritten in C++. Spark API compatible, 3-12x faster on average
  • Databricks SQL (DB SQL): SQL warehouse for BI workloads. Photon + auto-scaling
  • Unity Catalog: Unified governance catalog. Tables, views, notebooks, ML models, and LLMs are all permissioned in one place. Open-sourced in 2024
  • Mosaic AI: ML/AI workload platform. Vector Search, Model Serving, AI Functions
  • DBRX: 132B MoE LLM released by Databricks in March 2024. Apache 2.0 license
-- Databricks SQL example — Delta Lake table + AI Function
SELECT
  c.customer_id,
  c.name,
  AI_SUMMARIZE(
    ARRAY_AGG(r.review_text)
  ) AS review_summary,
  AVG(r.rating) AS avg_rating
FROM customers c
JOIN reviews r ON r.customer_id = c.customer_id
WHERE r.created_at > current_date() - INTERVAL 90 DAY
GROUP BY c.customer_id, c.name;

The Databricks differentiator is one platform that handles SQL, Spark, ML, and LLM. While Snowflake is SQL-centric, Databricks lets data scientists and ML engineers train and serve models directly on the same data.

Pricing is DBU-based, with DBU cost varying by workload type and compute class. SQL workloads usually run $0.22-$0.55/DBU, on top of which you pay separately for cloud infrastructure (EC2/GCE/Azure VM). Compared to Snowflake's all-in pricing, the split billing looks cheaper at first but often lands at similar real-world totals.

4. Google BigQuery — The Original Serverless DW

BigQuery (cloud.google.com/bigquery) launched in 2010 based on Google's internal Dremel paper. The biggest differentiator is true serverless — users do not configure clusters, nodes, or storage sizes; they throw SQL at it and Google auto-provisions thousands of nodes.

BigQuery core components:

  • Dremel: Distributed query engine over columnar storage. Tree-based execution model
  • Colossus: Successor to GFS — BigQuery's backend distributed file system
  • Jupiter: Petabit datacenter network — the key to bandwidth between compute and storage
  • BigQuery ML (BQML): Train and predict with ML models via SQL — CREATE MODEL ... OPTIONS(model_type='linear_reg')
  • BigQuery Studio: GA in 2024. Integrated IDE for notebooks, dbt, Looker Studio
  • BigQuery Omni: Query AWS/Azure data with the BigQuery engine (multi-cloud)
  • BigQuery GIS: Geographic data types and functions (ST_GEOGFROMTEXT etc.)
-- BigQuery example — ML + GIS combined
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT
  age,
  tenure_months,
  monthly_spend,
  ST_DISTANCE(home_location, store_location) AS distance_m,
  churned
FROM `mydataset.customer_features`
WHERE _PARTITIONTIME BETWEEN '2026-01-01' AND '2026-04-30';

-- Prediction + Gemini text generation combined
SELECT
  customer_id,
  predicted_churned_probs[OFFSET(1)].prob AS churn_prob,
  ML.GENERATE_TEXT(
    MODEL `mydataset.gemini_model`,
    'Suggest 2 retention offers for this customer'
  ) AS retention_suggestion
FROM ML.PREDICT(MODEL mydataset.churn_model, TABLE customer_segment_apac);

Pricing comes in two flavors:

  • On-demand: $5 per TB scanned (us region). First 1TB free per month. Less predictable but favorable for small workloads
  • Slot-based (Editions): Standard/Enterprise/Enterprise Plus editions at $0.04-$0.10 per slot-hour. Favorable for large, stable workloads

BigQuery is classified as the most locked-in DW. Data sits in BigQuery's proprietary capacitor format and is hard for other engines to read directly. Since 2024 BigLake has supported Iceberg external tables to ease lock-in concerns, but true multi-engine compatibility still lags Snowflake.

5. AWS Redshift Serverless — RA3, Aqua, and the New Serverless Model

AWS Redshift launched in 2012 based on ParAccel as AWS's first DW service. Once the cloud DW market share leader, it was overtaken by Snowflake and stagnated, then regained momentum with the 2022 launch of Redshift Serverless.

Core components:

  • RA3 nodes: New-generation node type with separated compute and storage. Managed storage (S3-based), per-node SSD cache
  • Aqua (Advanced Query Accelerator): FPGA-based hardware acceleration. Performs decompression and filtering off-node
  • Concurrency Scaling: Temporary compute when concurrent queries spike (with free hours included)
  • Redshift Spectrum: Query S3 external tables (Parquet, ORC, Avro)
  • Redshift Serverless: GA in 2022. Billed per RPU (Redshift Processing Unit), $0.36 per RPU-hour
  • Zero-ETL: Since 2023, auto-replication from Aurora/RDS/DynamoDB to Redshift (change data capture)
  • Redshift + Bedrock: GA in 2024. Call Bedrock LLMs directly from SQL
-- Redshift Serverless workgroup + external table + Bedrock
CREATE EXTERNAL TABLE spectrum.orders_raw (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(18, 2),
  order_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/orders/'
TABLE PROPERTIES ('skip.header.line.count'='1');

-- Call a Bedrock LLM from SQL
SELECT
  o.order_id,
  o.customer_id,
  AWS.BEDROCK.INVOKE_MODEL(
    'anthropic.claude-sonnet-20240229-v1:0',
    'Summarize this order: ' || o.notes
  ) AS summary
FROM spectrum.orders_raw o
WHERE o.order_date >= current_date - 7;

Redshift Serverless lowered the entry barrier with BigQuery-style usage-based pricing, but it has a "minimum base RPU" that doesn't go all the way to zero. For very light workloads, BigQuery still has the edge.

6. Azure Synapse Analytics — Dedicated/Serverless SQL + Spark + Kusto

Microsoft Azure Synapse Analytics (azure.microsoft.com/services/synapse-analytics) launched in 2019 as a rebrand and expansion of Azure SQL Data Warehouse (the former SQL DW) into a unified analytics platform.

It bundles three engine types in one workspace:

  • Dedicated SQL Pool (formerly Azure SQL DW): MPP SQL, provisioned per DWU (Data Warehouse Unit)
  • Serverless SQL Pool: Query Parquet/CSV/JSON in ADLS Gen2 via T-SQL. $5 per TB scanned
  • Spark Pool: Synapse Spark with .NET for Spark support
  • Data Explorer Pool (Kusto): KQL (Kusto Query Language) for log and time-series analysis
  • Synapse Link: Analyze Cosmos DB, Dataverse, SQL DB data without ETL (HTAP pattern)

Announced at Microsoft Ignite in November 2024, Microsoft Fabric is the next-generation SaaS analytics platform that absorbs and extends Synapse, bundling OneLake (Iceberg/Delta-based unified storage), Power BI, Data Factory, and Synapse Real-Time Intelligence into a single SKU. As of 2026 Synapse is still usable, but new adoption clearly flows to Fabric.

Synapse's strength is Microsoft ecosystem integration — Active Directory, Azure DevOps, Power BI, and Office 365 all integrate seamlessly. Its weakness is the complex UX of running multiple engines in a single workspace and unpredictable costs.

7. Firebolt — The Self-Proclaimed "Snowflake-killer"

Firebolt (firebolt.io) was founded in 2019 in Israel by the prior founder of Sisense, taking another swing at next-gen OLAP. It pitches "10x faster and 10x cheaper than Snowflake," and raised a $127M Series C in 2021 and a $100M Series D in 2023.

The key differentiator is aggressive use of indexes, compute, and cache. While Snowflake relies on micro-partition pruning, Firebolt adds:

  • Sparse Index: Sparse per-column indexes for precise row location
  • Aggregate Indexes: Pre-aggregated indexes, lighter than materialized views
  • Result + Sub-result Cache: Cache not just query results but intermediate results too
  • F3 Engine: Proprietary columnar format (F3), smaller than Parquet
  • SSD Direct Cache: Cache F3 data on the compute node's NVMe SSD
-- Firebolt example — Aggregate Index + dimension table
CREATE AGGREGATING INDEX orders_daily_agg ON orders (
  order_date,
  region,
  SUM(amount),
  COUNT(*)
);

-- Query that auto-leverages the index
SELECT order_date, region, SUM(amount), COUNT(*)
FROM orders
WHERE order_date BETWEEN '2026-01-01' AND '2026-04-30'
  AND region IN ('APAC', 'EMEA')
GROUP BY order_date, region;

Firebolt's weakness is a narrower ecosystem. While Snowflake is a first-class citizen in hundreds of tools (Tableau, Looker, dbt, Fivetran, etc.), Firebolt's integration coverage is still limited. That said, certain dashboarding-heavy workloads (thousands of users running queries via Looker/Tableau daily) report a strong price-performance story.

8. MotherDuck — DuckDB-as-a-Service

MotherDuck (motherduck.com) launched in 2023 as a cloud DW built on DuckDB, co-founded by DuckDB's founder Hannes Mühleisen. Its vision is "make the single-node DW attractive again."

The key idea is hybrid execution.

  • Small data and interactive queries run on DuckDB on the user's laptop
  • Large data and complex queries run in the cloud (MotherDuck servers)
  • The same SQL and SDK automatically route between modes
# MotherDuck Python example — local + cloud joined
import duckdb

# md: prefix connects to MotherDuck cloud
conn = duckdb.connect('md:my_db?motherduck_token=...')

# Join local Parquet against cloud tables
df = conn.execute("""
    SELECT
        l.customer_id,
        l.local_event_count,
        c.lifetime_value
    FROM read_parquet('local_events.parquet') l
    JOIN my_db.customers c ON c.customer_id = l.customer_id
    WHERE c.region = 'JP'
""").df()

Since 2025 MotherDuck supports Iceberg external tables, direct Snowflake connections, and its own notebook UI. Pricing is usage-based and very inexpensive (serious use starts at $15-$25/month), which is attractive for small teams. It is not designed for petabyte-scale workloads — the sweet spot is "up to 100TB."

9. DuckDB 1.x — The Standard for Embedded Analytics

DuckDB (duckdb.org) was built in 2018 at the Dutch CWI research institute by Hannes Mühleisen and Mark Raasveldt as an embedded OLAP database. Pitched as "SQLite for analytics," 1.0 went GA in June 2024 and 1.3 stable shipped in November 2025.

DuckDB is overwhelming in these scenarios:

  • Analyzing 10GB Parquet from a notebook: 10-100x faster than pandas
  • Data validation in CI/CD: PR-level data tests with dbt + DuckDB
  • ML preprocessing: Windowed aggregation in front of a PyTorch DataLoader
  • Backend for MotherDuck / Definite: The local engine for cloud DWs
# DuckDB joining S3 Parquet + Postgres + local CSV in one query
import duckdb

con = duckdb.connect()

con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
    CREATE SECRET s3 (
        TYPE S3,
        KEY_ID '...',
        SECRET '...',
        REGION 'ap-northeast-1'
    );
""")

result = con.execute("""
    SELECT
        s.customer_id,
        s.order_count,
        p.tier,
        l.city
    FROM read_parquet('s3://my-bucket/orders/*.parquet') s
    JOIN postgres_scan(
        'host=db.example.com user=admin password=...',
        'public', 'customers'
    ) p ON p.id = s.customer_id
    JOIN read_csv('regions.csv') l ON l.code = p.region_code
""").df()

DuckDB's popularity has exploded — over 25,000 GitHub stars by 2026, with active adapter ecosystems including dbt-duckdb, ibis-duckdb, and datafusion-duckdb. Many small orgs no longer buy a separate DW and run on S3 + Parquet + DuckDB + dbt-duckdb.

10. ClickHouse Cloud — Columnar Store SaaS

ClickHouse (clickhouse.com) started in 2009 inside Russia's Yandex as an open-source columnar DBMS, was open-sourced in 2016, and grew explosively after that. It relocated to the US and incorporated ClickHouse Inc in 2021, ClickHouse Cloud went GA in 2022, and it raised a $350M Series C in late 2024.

ClickHouse's specialty is real-time analytics over billions of rows. The MergeTree engine combines per-column compression, partitioning, and a primary key (sparse index) to deliver 1/10 the cost and 1/10 the response time of a typical DW. Cloudflare Analytics, Uber, Shopify, and eBay run real-time analytics on ClickHouse.

-- ClickHouse table definition — time-series event analytics
CREATE TABLE events (
    event_time DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    properties JSON
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time, user_id)
TTL event_time + INTERVAL 90 DAY;

-- Aggregate billions of rows in under a second
SELECT
    event_type,
    toStartOfHour(event_time) AS hour,
    count() AS events,
    uniq(user_id) AS uniq_users
FROM events
WHERE event_time >= now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC;

ClickHouse Cloud offers stateless compute + S3 storage separation, auto-scaling, and managed replicated MergeTree. Pricing is per-minute compute time + storage GB, with the dev tier starting from $0.31/hr.

ClickHouse's weakness is difficulty with transactions and UPDATEs. UPDATE/DELETE only happen as async mutations and don't have the same semantics as row-level transactions in a typical DW. CDC-based DW ingestion requires extra design.

11. StarRocks / Apache Doris — Next-Generation MPP Open Source

StarRocks (starrocks.io) is a next-gen MPP OLAP engine forked from Apache Doris (formerly Palo) at China's Baidu, commercialized in 2021 by US-based CelerData. Apache Doris (doris.apache.org) continues independently inside the Apache Foundation.

Both engines combine MPP + vectorization + materialized views + partial lakehouse capabilities, with strengths different from ClickHouse:

  • StarRocks: Shared-data architecture (compute/storage separation), first-class Iceberg/Hudi/Delta external catalogs
  • Doris: MySQL-compatible protocol + ANSI SQL, materialized view + rollup automation

The StarRocks advantage over ClickHouse is multi-table joins. ClickHouse is optimized for huge single fact-table aggregations, while StarRocks shows better performance on star-schema fact-dimension joins.

CelerData Cloud is the managed SaaS version of StarRocks, running on AWS/GCP. In Korea, parts of NHN and Kakao reportedly self-host StarRocks.

12. Apache Druid + Imply — Real-Time OLAP

Apache Druid (druid.apache.org) was created in 2011 at Metamarkets (now Imply) as a real-time OLAP datastore. Its key trait is petabyte-scale aggregation at millisecond-to-second response times.

Druid architecture:

  • Real-time Indexer: Streaming ingest from Kafka/Kinesis, queryable immediately
  • Historical Node: Stores older data on S3/HDFS
  • Broker Node: Query routing and result merging
  • Coordinator: Manages segment distribution
{
  "type": "kafka",
  "ioConfig": {
    "topic": "user-events",
    "consumerProperties": {
      "bootstrap.servers": "kafka:9092"
    },
    "taskCount": 4,
    "replicas": 1,
    "useEarliestOffset": false
  },
  "dataSchema": {
    "dataSource": "events",
    "timestampSpec": { "column": "event_time", "format": "iso" },
    "dimensionsSpec": {
      "dimensions": ["user_id", "event_type", "country"]
    },
    "metricsSpec": [
      { "type": "count", "name": "events" },
      { "type": "longSum", "name": "amount_sum", "fieldName": "amount" }
    ],
    "granularitySpec": {
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  }
}

Imply (imply.io) is the commercial managed service for Druid, offered as Polaris (SaaS), Hybrid (own cloud), or Enterprise (self-hosted). Twitch, Lyft, Wikimedia, and Confluent run real-time dashboards on Druid.

13. Apache Pinot + StarTree — User-Facing Analytics

Apache Pinot (pinot.apache.org) was created in 2014 at LinkedIn, graduated Apache incubation in 2018, and reached TLP (Top-Level Project) status in 2021. Core developers from Apache Druid joined to found StarTree in 2019.

Pinot specializes in user-facing analytics — guaranteeing sub-100ms responses for dashboards, game leaderboards, recommendation feeds, etc. that end users see directly. LinkedIn Feed's "Who viewed your profile" and Uber Eats' real-time order board are flagship cases.

Core tech:

  • Star-tree Index: Pre-aggregation index. Pre-computes SUM, COUNT to shorten query time
  • Real-time + Offline Hybrid: Kafka ingestion + S3 batch in one table
  • Multi-stage Query Engine: Complex JOIN support since 2023

StarTree Cloud is the managed SaaS for Pinot, handling multi-cluster management, auto-scaling, and segment lifecycle.

14. Vertica, Greenplum, Yellowbrick — Legacy and Hybrid DWs

  • Vertica (vertica.com): Founded by Michael Stonebraker in 2005, acquired by HP in 2011, Micro Focus in 2017, then OpenText in 2023. Still used in some finance and telco verticals. One of the archetypes of columnar MPP
  • Greenplum (greenplum.org): Postgres-based MPP, acquired by EMC in 2010, sold to Pivotal in 2020 (then VMware Tanzu Data, then Broadcom in 2023). Still maintained open source in 2026, but new adoption is rare
  • Yellowbrick (yellowbrick.com): Founded in 2014, appliance + cloud hybrid DW. Notable for Kubernetes-based container deployment
  • Netezza: Acquired by IBM, survives as Cloud Pak for Data Netezza Performance Server

None of these are recommended for greenfield adoption today, but the migration consulting market for them is active. Both Snowflake and Databricks run "migrate from Vertica/Teradata to us" incentive programs.

15. Modern Open Table Formats — Iceberg vs Delta vs Hudi

The biggest 2026 shift is standardization on open table formats. Three contenders:

FormatOriginStrengthsWeaknesses
Apache IcebergNetflix, donated to Apache in 2018Broad engine support, schema evolution, time travel, hidden partitioningUPDATE/MERGE performance weaker than Delta (improving)
Delta LakeDatabricks, open-sourced 2019Deep integration with Databricks engines (Photon), ACID, excellent MERGELess external-engine support than Iceberg (delta-rs improving this)
Apache HudiUber, open-sourced 2016Strongest upsert/CDC, diverse indexesConceptually complex (Copy-on-Write vs Merge-on-Read), steep learning curve

Since 2024 the market has standardized rapidly on Iceberg. AWS, Google, Snowflake, Salesforce, Tabular, and Confluent all support Iceberg first-class. Databricks pivoted in 2024 by acquiring Tabular (the company founded by Iceberg's creators) for $2B, supporting both Delta and Iceberg.

-- Same Iceberg table across multiple engines (same data)
-- 1) Snowflake
CREATE OR REPLACE EXTERNAL VOLUME my_vol ...;
CREATE ICEBERG TABLE orders
  EXTERNAL_VOLUME = my_vol
  CATALOG = 'snowflake_polaris'
  BASE_LOCATION = 'orders/';

-- 2) Spark
spark.sql("""
  CREATE TABLE polaris.production.orders (
    order_id BIGINT, amount DECIMAL(18,2), order_date DATE
  )
  USING iceberg
  PARTITIONED BY (days(order_date))
""")

-- 3) Trino
CREATE TABLE polaris.production.orders (
  order_id BIGINT, amount DECIMAL(18,2), order_date DATE
)
WITH (
  partitioning = ARRAY['day(order_date)'],
  format = 'PARQUET'
);

-- 4) DuckDB
INSTALL iceberg; LOAD iceberg;
SELECT * FROM iceberg_scan('s3://my-bucket/orders/');

16. Open Catalogs — Polaris, Lakekeeper, Unity, Gravitino

With table formats standardized, the next battlefront is metadata catalogs. A catalog manages "which table is where and who has access."

  • Apache Polaris (polaris.apache.org): Open-sourced and donated to Apache by Snowflake in June 2024. REST API standard Iceberg catalog. Supported by Snowflake, Trino, Spark, Flink, Dremio
  • Lakekeeper (lakekeeper.io): A German-origin Rust-based Iceberg REST catalog. Lightweight and fast, friendly to self-hosting
  • Apache Gravitino (gravitino.apache.org): Multi-metadata catalog from China's Datastrato. Unifies Iceberg + Hive + relational DBs
  • Unity Catalog (unitycatalog.io): Open-sourced by Databricks in June 2024. Unified management of tables, views, notebooks, ML models, even LLM tokens
  • AWS Glue Data Catalog: AWS's managed catalog. Default backend for Athena, EMR, Redshift Spectrum
  • Nessie (projectnessie.org): Git-style catalog led by Dremio. Versioned data via branch/merge/tag

The 2026 market is narrowing to a two-camp race between Polaris and Unity Catalog. Polaris stays faithful to the Iceberg standard, while Unity emphasizes Iceberg + Delta both, plus broader asset management (notebooks, models).

17. Lock-in Matrix — Who Locks You In Most

Lock-in posture per engine:

EngineData Lock-inCompute Lock-inMigration DifficultyPrimary Reason
BigQueryVery HighVery HighVery HardCapacitor format, GCP-only
Redshift (Dense)HighHighHardProprietary columnar, AWS-only
Redshift Spectrum + RA3MediumMediumModerateS3 external tables available
SnowflakeMediumMediumModerateFDN format but Iceberg external supported
Snowflake + IcebergLowLowEasyOfficial external Iceberg support
Databricks + DeltaLowLowEasyDelta is open, Unity Catalog is open
ClickHouse CloudLowMediumModerateSelf-host OSS available
StarRocks / DorisVery LowLowEasyOSS, external catalogs
Trino + IcebergVery LowVery LowVery EasyBoth OSS, data stays put

Key insight: If data lives in an open format like Iceberg/Delta, compute lock-in is near zero. To truly avoid lock-in, regardless of DW choice, follow the principle of "data in external open format."

18. Pricing Models Compared — Credit, Slot, DBU, RPU

Each engine uses different pricing units.

EngineUnitPrice (May 2026)Notes
SnowflakeCredit$2-$4 per creditXS = 1 credit/hr, doubles per size
BigQuery on-demandTB scanned$5 per TBus region, first 1TB free/month
BigQuery EditionsSlot-hour$0.04-$0.10 per slot-hourStandard/Enterprise/Enterprise Plus
Databricks DB SQLDBU$0.22-$0.55 per DBU+ cloud compute extra
Redshift ServerlessRPU-hour$0.36 per RPU-hourMinimum base RPU
Redshift RA3Per node-hour$3.26-$13.04 per node-hourra3.xlplus to ra3.16xlarge
Azure Synapse DedicatedDWU$1.20 per 100 DWU/hrGen2
Synapse ServerlessTB scanned$5 per TBSame as BigQuery
FireboltF-credit$1-$3 per F-creditVaries by engine size
ClickHouse CloudCompute-min + storageFrom $0.31/hrdev tier
MotherDuckUsage-basedFrom $15-$25/monthpersonal/team/scale

The pricing trap: Surface prices like "TB at $5" cannot tell you the real cost. The same workload can be billed 10x differently depending on clustering/partitioning/cache-hit rate/idle time. The only honest answer is running a PoC with the same data and same queries.

19. Cost Optimization — Materialized Views, Clustering, Auto-Suspend

Seven patterns that cut DW cost 30-70%.

1) Auto-suspend / Auto-resume: Snowflake standard is to auto-suspend after 1 minute idle. Cutting off the overnight idle alone often reduces cost by 60%.

2) Materialized Views: Pre-compute and store common aggregates. Snowflake auto-refreshes, BigQuery requires explicit refresh policies.

3) Clustering Keys / Partitioning: Snowflake's cluster key, BigQuery's partition + cluster column. When the query can skip unnecessary partitions via stats, scan cost drops to 1/10.

-- Snowflake clustering key
ALTER TABLE orders CLUSTER BY (order_date, region);

-- BigQuery partitioning + clustering
CREATE TABLE orders_p (
  order_id INT64, customer_id INT64, region STRING, amount NUMERIC, order_date DATE
)
PARTITION BY order_date
CLUSTER BY region, customer_id;

4) Result Caching: Snowflake caches the same query result for 24 hours (if the table is unchanged). BigQuery also caches for 24 hours. Cache hits cost zero.

5) Column Pruning + Predicate Pushdown: Never SELECT *. Specify only the columns you need.

6) Right-Sized Warehouses: In Snowflake, running an XL for 24 hours vs running a 4XL for 1 hour both cost 64 credits, but the 4XL is 16x faster. Fast and short almost always wins.

7) Reserved Capacity / Slot Commitment: BigQuery slot commitments and Redshift Reserved Instances offer 40-65% discounts with 1-3 year commits. Favorable for stable workloads.

20. AI Integration — Cortex, AI Functions, Gemini, Bedrock

The 2025-2026 DW trend is calling LLMs directly from SQL.

Snowflake Cortex (SNOWFLAKE.CORTEX.*):

SELECT
  customer_id,
  review_text,
  SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment_score,
  SNOWFLAKE.CORTEX.SUMMARIZE(review_text) AS summary,
  SNOWFLAKE.CORTEX.COMPLETE(
    'claude-3-5-sonnet',
    'Suggest a follow-up email: ' || review_text
  ) AS followup
FROM product_reviews
WHERE review_date > current_date - 30;

Cortex exposes Mistral, Llama, Reka, and Claude as built-in functions. It is billed in credits, and data never leaves Snowflake.

Databricks AI Functions:

SELECT
  customer_id,
  AI_SUMMARIZE(notes) AS summary,
  AI_CLASSIFY(notes, ARRAY('billing', 'support', 'sales')) AS category,
  AI_EXTRACT(notes, ARRAY('date', 'amount', 'product')) AS extracted
FROM customer_notes;

BigQuery + Gemini (ML.GENERATE_TEXT):

SELECT
  product_id,
  description,
  ml_generate_text_result['predictions'][0]['content'] AS marketing_copy
FROM ML.GENERATE_TEXT(
  MODEL `mydataset.gemini_pro_model`,
  TABLE `mydataset.products`,
  STRUCT(
    0.4 AS temperature,
    200 AS max_output_tokens,
    'Write a marketing tagline for this product: ' || description AS prompt
  )
);

Redshift + Bedrock:

SELECT
  feedback_id,
  feedback_text,
  AWS.BEDROCK.INVOKE_MODEL(
    'anthropic.claude-sonnet-20240229-v1:0',
    JSON_OBJECT('prompt' VALUE 'Analyze: ' || feedback_text)
  ) AS analysis
FROM customer_feedback
WHERE submitted_at >= current_date - 7;

Calling LLMs directly inside the DW removes data motion and simplifies security/compliance, but token costs accumulate fast and must be monitored. Calling Claude Sonnet once over 1M rows at an average 1,000 tokens per row would cost 1M rows × $0.003/1K input + output, racking up thousands of dollars in a single query.

21. Reverse ETL, Data Observability, Catalog Governance

Adopting a DW pulls in a whole ecosystem around it.

  • Reverse ETL: Push data back from the DW into SaaS like Salesforce, HubSpot, Iterable. The two leaders are Hightouch (hightouch.com) and Census (getcensus.com), both with first-class support for Snowflake/BigQuery/Databricks
  • Data observability: Monitor data quality, freshness, and lineage. Leading players are Monte Carlo (montecarlodata.com), Anomalo (anomalo.com), Bigeye (bigeye.com), Datafold (datafold.com), Metaplane (metaplane.com)
  • Catalog and governance: Search, document, and permission data assets. Atlan (atlan.com), Castor (castordoc.com), Data.world (data.world), Alation (alation.com), Collibra (collibra.com), Apache Atlas (atlas.apache.org)

Data observability tools consume DW metadata (query log, table stats) and auto-detect anomalies. Example: "A table that ingested 100M rows a day now only ingested 5M today — alert." Monte Carlo defined this category and was valued at $2.4B in 2024.

22. Korean Adoption — Coupang, Naver, NCsoft, Kakao

Coupang: According to the 2025 Coupang Engineering Blog, Coupang runs its data platform on a Spark + Snowflake + in-house data catalog combo. Core domains like product, order, and payment are loaded into Snowflake, while log and event analytics use a self-hosted ClickHouse cluster. Marketing ML training happens on Databricks Lakehouse, with Reverse ETL via Hightouch back to ad platforms. Daily ingest is petabyte-scale; the BigQuery to Snowflake migration completed in 2023.

Naver Cloud: Naver runs its own DW and Trino-based query engine on its own cloud (NCP). Search log and ad click analytics use a proprietary columnar store, while business analytics for shopping and pay uses Snowflake on AWS Tokyo, as shared at DEVIEW 2024. From 2025 Naver has been building an internal BI tool that fuses HyperCLOVA X with the DW.

NCsoft: Gaming data is extreme on both volume and time-sensitivity. NCsoft processes real-time logs and events from Lineage W and Throne and Liberty on Apache Druid + Imply, while long-term storage and analytics go into Snowflake. As shared at NDC 2024, a single title can emit tens of billions of events daily, and Druid's sub-second responses are core to live operations (anti-abuse, revenue monitoring).

Kakao: Kakao has historically handled analytics for KakaoTalk, Daum, Pay, and others on a self-hosted Hadoop + Hive + Trino + in-house catalog. Since 2024 it has been migrating to Apache Iceberg + Polaris, with some workloads moved to Databricks on AWS Seoul, per the if(kakao) 2024 talk.

Woowa Brothers (Baemin): Orders and rider data are loaded into BigQuery, with ad ML and recommendations on Vertex AI + BigQuery ML. From 2025, some real-time analytics moved to ClickHouse Cloud, per Woowacon 2024.

23. Japanese Adoption — Mercari, LINE Yahoo, CyberAgent, Rakuten

Mercari: Mercari is a deep GCP shop with BigQuery as the core DW. Per the 2024 Mercari Engineering Blog, BigQuery scans petabytes daily, and the analytics workflow is standardized on dbt + BigQuery Studio + Looker Studio. Since 2025 they have built internal natural-language query (NLQ) tools using Gemini + BigQuery ML integration, while customers receive more sophisticated personalized recommendations.

LINE Yahoo Japan: Before the merger, LINE ran a Hadoop-based proprietary DW while Yahoo Japan used Teradata + an in-house analytics platform. Since 2024, the merged entity adopted BigQuery + Iceberg + Trino as its unified data platform, per LINE DEVELOPER DAY 2024. Petabyte-scale Hive assets were migrated to Iceberg, and BigLake external tables make them queryable from BigQuery too.

CyberAgent: AbemaTV (now ABEMA) and the ad business consolidate on Snowflake on AWS Tokyo. Per CyberAgent Developers Conference 2025, ad bidding logs (tens of billions per day) are in BigQuery, business analytics in Snowflake, and real-time attribution in ClickHouse — a multi-engine strategy. dbt + Airflow + Datadog observe every pipeline.

Rakuten: Rakuten was an early Treasure Data (treasuredata.com) shop and is gradually migrating workloads to Snowflake. Treasure Data is a Hive-based managed CDP, and the move to Snowflake was driven by ML workload integration and predictable cost. Per Rakuten Tech Conference 2025, data from Rakuten Shopping, Securities, and Mobile is all governed under a single Snowflake account.

DeNA: Uses BigQuery for gaming and healthcare data, Vertex AI for ML workloads, and integrates partner data (including Pokémon GO operator Niantic) into BigQuery. From 2024 DeNA started adopting Apache Iceberg as a lock-in dispersion strategy, per DeNA TechCon 2024.

24. Decision Checklist — How to Choose

DW selection is almost always a function of "team + data shape + budget + lock-in tolerance."

SQL-centric team, analytics-driven to Snowflake. Smoothest SQL UX, largest ecosystem.

Team with ML engineers and data scientists, large Spark/Python workloads to Databricks Lakehouse. SQL + Spark + ML on one platform.

GCP-centric infra, pre-PMF startup running frequent small workloads to BigQuery on-demand. Starts at zero, free BQML for ML.

AWS-centric infra, existing Redshift operations to Redshift Serverless. Zero-ETL and Bedrock integration are strong.

Microsoft 365 / Power BI-centric enterprise to Microsoft Fabric (or Synapse). Strong identity and access integration.

Real-time, user-facing analytics is core to Apache Pinot (StarTree) or Apache Druid (Imply).

Billions of log rows, fast response is core to ClickHouse Cloud.

Under 100TB, 1-10 person team, cost minimization is rule 1 to MotherDuck (or DuckDB + S3 + dbt).

Lock-in avoidance is rule 1 to Trino + Apache Iceberg + Polaris (self-host or Starburst Galaxy).

Most companies end up on a multi-engine strategy — Snowflake for general analytics, ClickHouse for real-time, BigQuery for ads, with the same data in Iceberg. The 2026 best-practice pattern is "data in Iceberg, engine per workload."

25. References

  • Snowflake documentation — https://docs.snowflake.com/
  • Snowflake Cortex AI — https://docs.snowflake.com/en/guides-overview-ai-features
  • Apache Polaris — https://polaris.apache.org/
  • Databricks documentation — https://docs.databricks.com/
  • Databricks Lakehouse Architecture — https://www.databricks.com/glossary/data-lakehouse
  • Delta Lake — https://delta.io/
  • Unity Catalog (OSS) — https://www.unitycatalog.io/
  • Apache Spark — https://spark.apache.org/
  • DBRX model — https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
  • Google BigQuery documentation — https://cloud.google.com/bigquery/docs
  • BigQuery ML — https://cloud.google.com/bigquery/docs/bqml-introduction
  • BigQuery Omni — https://cloud.google.com/bigquery/docs/omni-introduction
  • AWS Redshift documentation — https://docs.aws.amazon.com/redshift/
  • Redshift Serverless — https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-whatis.html
  • Azure Synapse Analytics — https://learn.microsoft.com/en-us/azure/synapse-analytics/
  • Microsoft Fabric — https://learn.microsoft.com/en-us/fabric/
  • Firebolt documentation — https://docs.firebolt.io/
  • MotherDuck — https://motherduck.com/docs/
  • DuckDB — https://duckdb.org/docs/
  • ClickHouse — https://clickhouse.com/docs
  • StarRocks — https://docs.starrocks.io/
  • Apache Doris — https://doris.apache.org/docs/
  • Apache Druid — https://druid.apache.org/docs/latest/
  • Apache Pinot — https://docs.pinot.apache.org/
  • StarTree — https://startree.ai/
  • Apache Iceberg — https://iceberg.apache.org/
  • Apache Hudi — https://hudi.apache.org/docs/overview
  • Lakekeeper — https://lakekeeper.io/
  • Apache Gravitino — https://gravitino.apache.org/
  • Project Nessie — https://projectnessie.org/
  • Trino — https://trino.io/docs/current/
  • Starburst Galaxy — https://www.starburst.io/platform/starburst-galaxy/
  • dbt documentation — https://docs.getdbt.com/
  • Hightouch — https://hightouch.com/docs
  • Census — https://docs.getcensus.com/
  • Monte Carlo Data — https://www.montecarlodata.com/
  • Atlan — https://atlan.com/
  • Coupang Engineering Blog — https://medium.com/coupang-engineering
  • Naver DEVIEW archive — https://deview.kr/
  • LINE DEVELOPER DAY — https://linedevday.linecorp.com/
  • Mercari Engineering Blog — https://engineering.mercari.com/en/
  • CyberAgent Developers Blog — https://developers.cyberagent.co.jp/blog/
  • Rakuten Tech Conference — https://tech.rakuten.com/