Data Engineering Complete Guide 2025: Flink vs Spark, dbt, Iceberg, Airflow — Modern Data Stack Deep Dive

Author: Youngju Kim (@fjvbn20031)
- Introduction
- 1. The 2025 Data Engineering Landscape
- 2. Stream Processing: Flink vs Spark vs Beam
- 3. Orchestration: Airflow 3.0 vs Dagster vs Prefect vs Mage
- 4. dbt: The SQL Renaissance
- 5. Lakehouse: Iceberg vs Delta Lake vs Hudi
- 6. Real-Time Analytics: ClickHouse vs Druid vs Pinot vs StarRocks
- 7. Data Mesh and Data Contracts
- 8. Production Architecture: Real-Time Recommendation System
- 9. Certifications and Learning Roadmap
- 10. Three Portfolio Projects
- Quizzes
- References
Introduction
In 2025, data engineering has become one of the most sought-after roles in the software industry. With the explosion of AI/ML pipeline demand, the role has expanded from simple ETL development to data platform architecture. According to LinkedIn's 2025 Jobs Report, data engineer demand grew 35% year-over-year, with a notable salary premium for engineers experienced in stream processing and lakehouse technologies.
This article comprehensively covers every core component of the Modern Data Stack. From the Flink vs Spark stream processing comparison, to dbt's SQL revolution, the Iceberg lakehouse wars, ClickHouse real-time analytics, and Airflow 3.0 orchestration — we cover everything about data engineering in 2025, complete with production architectures.
1. The 2025 Data Engineering Landscape
1.1 Why Data Engineers Are the Hottest Role
Three reasons why data engineers are in such high demand in 2025:
First, AI/ML pipeline demand is exploding. Every company is trying to adopt AI post-ChatGPT, but there's an absolute shortage of talent who can build high-quality data pipelines for model training. According to Gartner estimates, 85% of ML projects fail due to data quality issues.
Second, real-time processing requirements are surging. Batch processing alone is no longer sufficient. Use cases requiring millisecond-level decisions — fraud detection, real-time recommendations, dynamic pricing — are growing explosively.
Third, regulatory and data governance needs. With GDPR, the AI Act, and other data regulations tightening, the value of engineers who can professionally handle data lineage and quality management has increased.
1.2 Evolution of the Modern Data Stack
The modern data stack has evolved into the following layered structure:
[Data Sources] → [Ingestion/CDC] → [Stream Processing] → [Storage/Lakehouse] → [Transform] → [Analytics/Serving]

| Layer | Representative tools |
|---|---|
| Data Sources | DB, API, IoT, SaaS |
| Ingestion/CDC | Airbyte, Debezium, Fivetran |
| Stream Processing | Flink, Spark Streaming |
| Storage/Lakehouse | Iceberg/Delta on S3/GCS/ADLS |
| Transform | dbt, Spark SQL |
| Analytics/Serving | ClickHouse, Pinot, Druid, StarRocks |
2020 vs 2025 key changes:
| Area | 2020 | 2025 |
|---|---|---|
| Storage | Data Lake (raw) | Lakehouse (ACID) |
| Processing | Batch-first | Stream-first |
| Transformation | Stored Procedures | dbt + SQL |
| Orchestration | Airflow 1.x | Airflow 3.0 / Dagster |
| Format | Parquet/ORC | Iceberg/Delta/Hudi |
| Catalog | Hive Metastore | Unity Catalog / Polaris |
| Quality | Manual validation | Great Expectations / Soda |
1.3 Technology Stack Selection Framework
Key criteria to consider when choosing a data stack:
Evaluation Criteria Checklist:
1. Latency requirements (batch vs near-real-time vs real-time)
2. Data volume (GB/day vs TB/day vs PB/day)
3. Team size and capabilities (SQL-centric vs code-centric)
4. Vendor lock-in tolerance
5. Budget (open source vs managed services)
6. Regulatory requirements (data governance, lineage)
2. Stream Processing: Flink vs Spark vs Beam
2.1 Why Stream Processing Matters
Traditional batch processing collects data and processes it all at once. But modern businesses demand instant insights. Fraud detection must be instantaneous, and recommendation systems need to respond to user behavior in real-time.
Core concepts of stream processing:
[Event Stream] ──→ [Window] ──→ [Aggregate/Transform] ──→ [Sink]

Window types:
- Tumbling window: fixed-size, non-overlapping intervals
- Sliding window: fixed-size intervals that overlap by a slide step
- Session window: bounded by gaps of inactivity per key
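Conceptually, a window assigner maps each event timestamp to the set of windows that contain it. A minimal pure-Python sketch of tumbling and sliding assignment (illustrative helper functions, not the Flink API):

```python
def tumbling_windows(ts: int, size: int) -> list[int]:
    """A tumbling window assigns each event to exactly one fixed bucket."""
    return [ts - (ts % size)]

def sliding_windows(ts: int, size: int, slide: int) -> list[int]:
    """A sliding window assigns each event to every window covering it."""
    starts = []
    start = ts - (ts % slide)      # latest window that starts at/before ts
    while start > ts - size:       # walk back while the window still covers ts
        starts.append(start)
        start -= slide
    return sorted(starts)

# An event at t=125s with a 60s tumbling window falls in [120, 180).
# With a 60s window sliding every 20s it belongs to three windows.
```

A session window, by contrast, has no fixed start: it grows as long as events keep arriving within the gap timeout.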
2.2 Apache Flink: True Event-by-Event Processing
Flink is a stream processing engine with event-by-event processing as its default. Batch is treated as a special case of streaming (bounded stream).
Key strengths:
- Sub-second latency: Processing within milliseconds of event occurrence
- Checkpoint-based state management: Exactly-once guarantees
- Event-time processing: Accurate time-based processing via watermarks
- Savepoints: State preservation during application upgrades
// Flink - Real-time fraud detection example
DataStream<Transaction> transactions = env
    .addSource(new FlinkKafkaConsumer<>("transactions", schema, props));

DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId)
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
    .aggregate(new FraudScoreAggregator())
    .filter(score -> score.getValue() > THRESHOLD);

alerts.addSink(new AlertSink());
Flink SQL — Streams as tables:
-- Flink SQL: Real-time revenue aggregation
CREATE TABLE orders (
    order_id STRING,
    amount DECIMAL(10,2),
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'format' = 'json'
);

SELECT
    TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY TUMBLE(order_time, INTERVAL '1' MINUTE);
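The WATERMARK clause above tolerates five seconds of out-of-order arrival. The mechanics can be sketched in plain Python (a simplification of what Flink tracks internally; the class name is ours):

```python
class BoundedOutOfOrdernessWatermark:
    """Tracks the max event time seen; the watermark trails it by a delay."""

    def __init__(self, max_delay: float):
        self.max_delay = max_delay
        self.max_event_time = float("-inf")

    def advance(self, event_time: float) -> float:
        """Observe an event; return the new watermark."""
        self.max_event_time = max(self.max_event_time, event_time)
        return self.max_event_time - self.max_delay

    def is_late(self, event_time: float) -> bool:
        """An event older than the current watermark is late."""
        return event_time < self.max_event_time - self.max_delay

wm = BoundedOutOfOrdernessWatermark(max_delay=5.0)
wm.advance(100.0)            # watermark is now 95.0
assert not wm.is_late(96.0)  # within the 5s tolerance
assert wm.is_late(94.0)      # behind the watermark: dropped or side-output
```

Once the watermark passes a window's end time, the window fires and emits its aggregate.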
2.3 Apache Spark Structured Streaming: Batch+Stream Hybrid
Spark processes streams using the micro-batching approach. It supports streaming while maintaining the strengths of batch processing.
Key strengths:
- Unified API: Same DataFrame API for both batch and streaming
- Vast ecosystem: Seamless integration with MLlib, GraphX, SparkSQL
- Mature community: Largest user base and documentation
- Photon engine: Databricks' C++ native engine with up to 12x performance boost
# Spark Structured Streaming - Real-time order analytics
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, sum, count

spark = SparkSession.builder.appName("OrderAnalytics").getOrCreate()

orders = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "orders") \
    .load()
# (In practice, parse the Kafka `value` column from JSON into typed
#  columns such as order_time and amount before aggregating.)

order_stats = orders \
    .withWatermark("order_time", "5 minutes") \
    .groupBy(window("order_time", "1 minute")) \
    .agg(
        count("*").alias("order_count"),
        sum("amount").alias("total_revenue")
    )

order_stats.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoint/orders") \
    .toTable("catalog.db.order_stats")
2.4 Apache Beam: Unified Abstraction Layer
Beam is an abstraction layer that separates pipeline definition from execution engine. Write once and run on various runners like Flink, Spark, or Dataflow.
# Apache Beam - Multi-runner pipeline
import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'ReadKafka' >> beam.io.ReadFromKafka(
           consumer_config=kafka_config,
           topics=['orders'])
     | 'ParseJSON' >> beam.Map(parse_order)
     | 'WindowInto' >> beam.WindowInto(
           beam.window.FixedWindows(60))  # 1-minute windows
     | 'CountPerWindow' >> beam.combiners.Count.Globally()
     | 'WriteToGCS' >> beam.io.WriteToText('gs://bucket/output'))
2.5 Flink vs Spark vs Beam Comprehensive Comparison
| Feature | Apache Flink | Spark Structured Streaming | Apache Beam |
|---|---|---|---|
| Processing Model | True event-by-event | Micro-batching | Unified API (runner-dependent) |
| Latency | Milliseconds to sub-second | Seconds (100ms+) | Runner-dependent |
| State Management | Built-in checkpoints, RocksDB | Limited (watermark) | Runner-dependent |
| Accuracy | Exactly-once by default | Exactly-once capable | Runner-dependent |
| SQL Support | Flink SQL (powerful) | Spark SQL (best) | Beam SQL (limited) |
| Batch Processing | Good (bounded stream) | Best (native) | Good |
| ML Integration | Limited | MLlib / Spark ML | TFX integration |
| Learning Curve | Steep | Moderate | Steep |
| Best Use Cases | Real-time CEP, fraud detection | Batch+stream hybrid | Multi-engine portability |
| Major Users | Alibaba, Uber, Netflix | All Databricks customers | Google Cloud customers |
Selection guide:
- Millisecond latency is mandatory → Flink
- Both batch and streaming are important → Spark
- Multi-cloud portability → Beam
- GCP-centric → Beam + Dataflow
- Already using Databricks → Spark
2.6 Benchmarks: Real-World Performance Comparison
Nexocode 2024 Benchmark results (1 million events/second processing):
| Engine | Processing latency (p99) | Throughput (events/s) |
|---|---|---|
| Flink | 23 ms | 5.2M |
| Spark | 450 ms | 3.8M |
| Beam on Flink | 25 ms | 4.8M |
| Beam on Spark | 460 ms | n/a |
3. Orchestration: Airflow 3.0 vs Dagster vs Prefect vs Mage
3.1 What Is Orchestration?
Data pipeline orchestration manages the order, dependencies, scheduling, retries, and monitoring of tasks. In the modern data stack, the orchestrator serves as the "glue" connecting all components.
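At its core, every orchestrator does the same two things: run tasks in dependency order and retry failures. A toy stdlib sketch of that idea (not the API of any real scheduler):

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks: dict, deps: dict, retries: int = 2) -> dict:
    """tasks: name -> callable; deps: name -> set of upstream task names.
    Runs tasks in topological order, retrying each up to `retries` times."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                results[name] = tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # exhausted retries: fail the pipeline
    return results

# extract -> transform -> load, declared as a dependency graph
order = run_pipeline(
    tasks={"extract": lambda: 10, "transform": lambda: 20, "load": lambda: 30},
    deps={"transform": {"extract"}, "load": {"transform"}, "extract": set()},
)
```

Real orchestrators add scheduling, distributed executors, and observability on top of this loop, but the dependency-graph core is the same.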
3.2 Apache Airflow 3.0: The Incumbent Evolves
Airflow has been the de facto standard for data orchestration for over a decade. Airflow 3.0 in 2025 brought major innovations.
Airflow 3.0 core upgrades:
1. React-based Modern UI
- Real-time log streaming
- Improved DAG visualization
- Dark mode support
2. Event-Driven Scheduling
- Dataset-aware scheduling (auto-trigger on data arrival)
- Enhanced External Triggers
3. TaskFlow API 2.0
- Decorator-based Python-native DAG authoring
- Improved dynamic task mapping
4. Edge Labels & DAG Versioning
- DAG change history tracking
- A/B testing DAG support
5. Enhanced Security
- Improved RBAC
- Expanded Secret Backend integrations
# Airflow 3.0 - TaskFlow API Example
from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def data_pipeline():

    @task()
    def extract():
        """Extract raw data from S3"""
        import boto3
        # Data extraction logic
        return {"records": 15000, "source": "s3://datalake/raw/"}

    @task()
    def transform(data: dict):
        """Transform data via dbt"""
        import subprocess
        subprocess.run(["dbt", "run", "--select", "staging+"], check=True)
        return {"transformed": data["records"], "models": 12}

    @task()
    def load(data: dict):
        """Load results into ClickHouse"""
        # ClickHouse loading logic
        return {"loaded": data["transformed"]}

    @task()
    def notify(result: dict):
        """Send Slack notification"""
        # Slack notification logic
        pass

    raw = extract()
    transformed = transform(raw)
    loaded = load(transformed)
    notify(loaded)

data_pipeline()
3.3 Dagster: The Asset-Centric Paradigm
Dagster is designed around the concept of "Software-Defined Assets." It defines pipelines centered on assets rather than tasks.
Key differentiators:
- Asset Graph: Explicitly manages dependencies between data assets
- Native dbt integration: Automatically converts dbt projects into Dagster assets
- Type system: Strong type validation via I/O managers
- Observability: Automatic tracking of asset materialization history
# Dagster - Software-Defined Assets Example
from dagster import asset, AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

@asset(
    description="Extract raw order data from S3",
    group_name="raw",
    compute_kind="python",
)
def raw_orders(context: AssetExecutionContext):
    """Extract order data from S3"""
    import pandas as pd
    df = pd.read_parquet("s3://datalake/raw/orders/")
    context.log.info(f"Extracted {len(df)} orders")
    return df

@asset(
    description="Validate order data quality",
    group_name="validated",
    compute_kind="great_expectations",
)
def validated_orders(raw_orders):
    """Validate data quality with Great Expectations"""
    # Quality validation logic
    assert raw_orders["amount"].min() >= 0, "Negative amounts found"
    return raw_orders

# Automatic dbt asset integration
@dbt_assets(manifest=dbt_manifest_path)
def dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
3.4 Prefect: The Simplest Orchestrator
Prefect aims for minimal boilerplate and best-in-class failure handling.
# Prefect - Concise Pipeline
from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash

@task(retries=3, retry_delay_seconds=60,
      cache_key_fn=task_input_hash,
      cache_expiration=timedelta(hours=1))
def extract_data(source: str) -> dict:
    """Extraction with automatic retries and caching"""
    # Data extraction
    return {"data": [...], "count": 15000}

@task(log_prints=True)
def transform_data(raw: dict) -> dict:
    """Transformation logic"""
    print(f"Processing {raw['count']} records")
    return {"transformed": raw["count"]}

@flow(name="daily-etl", log_prints=True)
def daily_pipeline():
    raw = extract_data("s3://datalake/raw/")
    result = transform_data(raw)
    print(f"Pipeline complete: {result}")

# Execute
daily_pipeline()
3.5 Mage AI: All-in-One Platform
Mage aims to be an integrated development environment (IDE) for data engineering. You can visually build pipelines in a notebook-style UI.
3.6 Orchestrator Comprehensive Comparison
| Feature | Airflow 3.0 | Dagster | Prefect | Mage |
|---|---|---|---|---|
| Paradigm | DAG/task-centric | Asset-centric | Flow/task | Notebook+pipeline |
| Learning Curve | Medium-high | Medium | Low | Low |
| dbt Integration | Via Provider | Native (best) | Basic | Basic |
| UI | React (major improvement) | Dagit (excellent) | Cloud UI | Notebook style |
| Scaling | KubernetesExecutor | K8s / ECS | Kubernetes | Kubernetes |
| Community | Largest (10+ years) | Growing fast | Medium | Growing |
| Provider Ecosystem | 1000+ | 100+ | 200+ | 50+ |
| Failure Handling | Medium | Good | Best | Good |
| Pricing (Cloud) | MWAA cost | Dagster+ | Prefect Cloud | Mage Pro |
| Recommended For | Large/complex pipelines | dbt-centric projects | Quick start, simplicity | Exploratory work |
Final recommendations:
- Enterprise, complex pipelines → Airflow 3.0
- dbt-centric, analytics engineering → Dagster
- Quick start, simplicity first → Prefect
- Data science + engineering → Mage
4. dbt: The SQL Renaissance
4.1 How dbt Changed the Data Transformation Paradigm
dbt (data build tool) handles the "T (Transform)" in ELT. It enables building data transformation pipelines using only SQL while applying software engineering best practices (version control, testing, documentation) to data transformations.
dbt's growth:
dbt Milestones:
- 2016: Fishtown Analytics founded
- 2020: dbt Cloud launched
- 2022: Rebranded to dbt Labs, Series D $222M
- 2023: MetricFlow open-sourced
- 2024: Semantic Layer GA in dbt Cloud
- 2025: Crossed $100M+ ARR, established de facto standard status
4.2 dbt Core vs dbt Cloud
| Feature | dbt Core (OSS) | dbt Cloud |
|---|---|---|
| Price | Free | Team from $100/mo, Enterprise custom |
| Execution | CLI (local/CI) | Managed execution environment |
| Scheduling | External (Airflow, etc.) | Built-in scheduler |
| IDE | VS Code + extensions | Cloud IDE (browser) |
| Semantic Layer | MetricFlow CLI | Managed API |
| Docs site | Self-hosted | Auto-generated/hosted |
| CI/CD | Self-configured | Built-in Slim CI |
4.3 Production dbt Project Structure
dbt_project/
├── dbt_project.yml
├── packages.yml
├── profiles.yml
├── models/
│ ├── staging/ # Source data cleansing
│ │ ├── stg_orders.sql
│ │ ├── stg_customers.sql
│ │ └── _staging.yml # Tests & docs
│ ├── intermediate/ # Business logic composition
│ │ ├── int_order_items.sql
│ │ └── _intermediate.yml
│ ├── marts/ # Final analytics tables
│ │ ├── finance/
│ │ │ ├── fct_revenue.sql
│ │ │ └── dim_customers.sql
│ │ └── marketing/
│ │ └── fct_campaigns.sql
│ └── metrics/ # MetricFlow metrics
│ └── revenue.yml
├── tests/ # Custom tests
│ └── assert_positive_revenue.sql
├── macros/ # Reusable SQL macros
│ └── cents_to_dollars.sql
├── seeds/ # Static data (CSV)
│ └── country_codes.csv
└── snapshots/ # SCD Type 2 snapshots
└── snap_customers.sql
Core model examples:
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        user_id AS customer_id,
        status AS order_status,
        amount_cents / 100.0 AS amount_dollars,
        created_at AS ordered_at
    FROM source
    WHERE status != 'cancelled'
)

SELECT * FROM renamed

-- models/marts/finance/fct_revenue.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),

final AS (
    SELECT
        o.order_id,
        o.customer_id,
        c.customer_segment,
        o.amount_dollars,
        o.ordered_at,
        DATE_TRUNC('month', o.ordered_at) AS order_month
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.order_status = 'completed'
)

SELECT * FROM final
# models/staging/_staging.yml
version: 2

models:
  - name: stg_orders
    description: 'Cleansed order data'
    columns:
      - name: order_id
        description: 'Unique order identifier'
        tests:
          - unique
          - not_null
      - name: amount_dollars
        description: 'Order amount in dollars'
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
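Under the hood, dbt parses each model's {{ ref('...') }} calls to build a dependency graph and runs models in topological order. A rough stdlib sketch of that idea (hypothetical helper, nothing like dbt's actual parser):

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} calls in raw model SQL
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

def build_dag(models: dict) -> list:
    """models: model name -> raw SQL. Returns a valid execution order."""
    deps = {
        name: set(REF_PATTERN.findall(sql)) & models.keys()
        for name, sql in models.items()
    }
    return list(TopologicalSorter(deps).static_order())

models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "dim_customers": "SELECT * FROM raw.customers",
    "fct_revenue": "SELECT * FROM {{ ref('stg_orders') }} o "
                   "JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id",
}
order = build_dag(models)  # fct_revenue always comes after its two inputs
```

This is why ref() is mandatory in dbt projects: hard-coding table names would make the dependency graph invisible to the scheduler.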
4.4 MetricFlow and the Semantic Layer
MetricFlow is a metric definition framework open-sourced by dbt Labs. Define business metrics in one place, and get consistent results across various BI tools.
# models/metrics/revenue.yml
semantic_models:
  - name: orders
    defaults:
      agg_time_dimension: ordered_at
    model: ref('fct_revenue')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    measures:
      - name: revenue
        agg: sum
        expr: amount_dollars
      - name: order_count
        agg: count
        expr: order_id
    dimensions:
      - name: ordered_at
        type: time
      - name: customer_segment
        type: categorical

metrics:
  - name: monthly_revenue
    type: simple
    type_params:
      measure: revenue
  - name: revenue_per_customer
    type: derived
    type_params:
      expr: monthly_revenue / active_customers
      metrics:
        - name: monthly_revenue
        - name: active_customers
4.5 dbt Best Practices
Top 10 dbt Project Best Practices:
1. Enforce layer structure: staging -> intermediate -> marts
2. Complete all renaming and type casting in staging
3. Apply unique + not_null tests on all models
4. Monitor data freshness with source freshness checks
5. Optimize large tables with incremental models
6. Leverage dbt-utils, dbt-expectations via packages.yml
7. SQL linting with pre-commit hooks (sqlfluff)
8. Use slim CI (state:modified+) in CI pipelines
9. Separate profiles by environment (dev / staging / prod)
10. Centralize metric management with Semantic Layer
5. Lakehouse: Iceberg vs Delta Lake vs Hudi
5.1 Why Lakehouse Is Needed
Traditionally, data lakes and data warehouses were separate systems. Data lakes stored all formats cheaply but lacked ACID transactions, while warehouses provided fast analytical queries but were expensive.
The lakehouse combines both:
Lakehouse = Low-cost storage of data lakes + ACID/performance of warehouses
Traditional architecture:
[Sources] -> [Data Lake (S3)] -> [ETL] -> [Data Warehouse (Redshift)]
(high cost, duplicate storage)

Lakehouse architecture:
[Sources] -> [Object Storage (S3)] + [Table Format (Iceberg)]
(low cost; ACID, schema evolution, time travel)
└──> [Analytics Engine (Spark/Trino/Flink)] queries the tables directly
5.2 Apache Iceberg: The Engine-Independent King
Iceberg is an open table format developed at Netflix. It's not tied to any specific engine, allowing Spark, Flink, Trino, Presto, Hive, and more to access the same tables.
Key features:
- Schema Evolution: Add/drop/rename columns without data rewriting
- Partition Evolution: Change partition schema while preserving existing data
- Time Travel: Query snapshots at specific points in time
- Hidden Partitioning: Automatic optimization without users worrying about partitions
- Multi-engine: Same table format regardless of engine
-- Iceberg table creation and usage
CREATE TABLE catalog.db.orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    status STRING,
    ordered_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(ordered_at));

-- Time travel: query the table as of a past point in time
SELECT * FROM catalog.db.orders
TIMESTAMP AS OF '2025-03-21 00:00:00';

-- Snapshot-based incremental reads
-- (VERSION AS OF takes a snapshot ID from the table's snapshots metadata)
SELECT * FROM catalog.db.orders
WHERE ordered_at > (
    SELECT max(ordered_at)
    FROM catalog.db.orders
    VERSION AS OF 123  -- placeholder snapshot ID
);
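Time travel works because Iceberg never mutates data files in place: each commit writes a new immutable snapshot of metadata pointing at the current file set. A toy model of that idea (illustrative only, not the Iceberg spec or the PyIceberg API):

```python
class ToyIcebergTable:
    """Each commit appends an immutable snapshot; reads pin a snapshot."""

    def __init__(self):
        self.snapshots = []  # list of (snapshot_id, frozen file list)

    def commit(self, files: list) -> int:
        """Record a new snapshot of the table's current data files."""
        snapshot_id = len(self.snapshots)
        self.snapshots.append((snapshot_id, tuple(files)))
        return snapshot_id

    def read(self, as_of=None) -> tuple:
        """Read the latest snapshot, or a pinned one (time travel)."""
        snapshot_id = as_of if as_of is not None else len(self.snapshots) - 1
        return self.snapshots[snapshot_id][1]

    def incremental(self, since: int) -> set:
        """Files added after a snapshot -- the basis of incremental reads."""
        return set(self.read()) - set(self.read(as_of=since))

t = ToyIcebergTable()
s0 = t.commit(["a.parquet"])
s1 = t.commit(["a.parquet", "b.parquet"])
```

Because old snapshots stay readable until they are expired, readers never block writers and queries can pin any historical version.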
Tabular Acquisition (2024): Databricks acquired Tabular, the commercial company behind Iceberg. This caused major shifts in Iceberg's commercial ecosystem, with Snowflake and other vendors strengthening their independent Iceberg support.
5.3 Delta Lake: The Heart of the Databricks Ecosystem
Delta Lake is an open-source table format developed by Databricks. Its deep integration with Spark is its key strength.
Key features:
- UniForm: Automatic Iceberg and Hudi-compatible metadata generation
- Liquid Clustering: Automatic data layout optimization replacing Z-Order
- Change Data Feed: Capture CDC events directly from Delta tables
- Deletion Vectors: Efficient deletes/updates without file rewriting
# Delta Lake - UniForm for Iceberg compatibility
from delta.tables import DeltaTable

# Create Delta table (UniForm enabled)
spark.sql("""
    CREATE TABLE catalog.db.events (
        event_id BIGINT,
        event_type STRING,
        payload STRING,
        event_time TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

# Delta table MERGE (Upsert)
delta_table = DeltaTable.forName(spark, "catalog.db.events")
delta_table.alias("target").merge(
    new_events.alias("source"),
    "target.event_id = source.event_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
5.4 Apache Hudi: The Real-Time CDC Champion
Hudi (Hadoop Upserts Deletes and Incrementals) was developed at Uber, optimized for record-level indexing and real-time CDC processing.
Key features:
- Record-level index: Extremely fast individual record upserts
- Copy-on-Write vs Merge-on-Read: Optimal strategy selection per workload
- Incremental Query: Efficiently read only changed records
- Concurrency Control: Optimistic concurrency control
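The record-level index is what makes Hudi upserts cheap: each incoming record is routed by its key, and a precombine field decides which version of a record wins. A simplified Python sketch of those semantics (conceptual only, not Hudi's implementation):

```python
def upsert(table: dict, records: list, key: str, precombine: str) -> dict:
    """Merge records into `table` keyed by `key`; on key collision, keep
    the record with the larger `precombine` value (e.g. update timestamp)."""
    for rec in records:
        existing = table.get(rec[key])
        if existing is None or rec[precombine] >= existing[precombine]:
            table[rec[key]] = rec
    return table

table = {}
upsert(table, [{"id": 1, "amount": 10, "ts": 1}], key="id", precombine="ts")
upsert(table, [{"id": 1, "amount": 99, "ts": 2},   # newer -> replaces
               {"id": 1, "amount": 0,  "ts": 0}],  # stale -> ignored
       key="id", precombine="ts")
```

Copy-on-Write applies this merge eagerly by rewriting files at write time; Merge-on-Read logs the deltas and merges them at query time.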
5.5 Iceberg vs Delta Lake vs Hudi Comprehensive Comparison
| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Developer | Netflix (now Apache) | Databricks | Uber (now Apache) |
| Best For | Multi-engine, petabyte | Databricks ecosystem | Real-time CDC/Upsert |
| Engine Support | Spark, Flink, Trino, Presto | Spark (optimal), others limited | Spark, Flink |
| Schema Evolution | Best (full evolution) | Good | Good |
| Partition Evolution | Best (hidden partitioning) | Liquid Clustering | Limited |
| CDC Support | Incremental reads only | Change Data Feed | Native (best) |
| Time Travel | Snapshot-based | Version-based | Commit timeline |
| Concurrency | Optimistic (branch support) | Optimistic | Optimistic |
| Compatibility | Standard (open) | UniForm (Iceberg compat) | Limited |
| Catalog | Polaris, REST, Hive | Unity Catalog | Internal timeline |
| Community Size | Growing fast (largest) | Databricks-led | Medium |
Selection guide:
- Multi-engine, vendor-independent → Iceberg
- Already using Databricks → Delta Lake (with UniForm for Iceberg compat)
- Real-time CDC, frequent upserts → Hudi
- New project, maximum flexibility → Iceberg
6. Real-Time Analytics: ClickHouse vs Druid vs Pinot vs StarRocks
6.1 Why You Need an OLAP Engine
Once you've stored data in your lakehouse, you need an OLAP engine that can execute analytical queries in milliseconds. PostgreSQL or MySQL simply cannot handle real-time aggregation queries across billions of rows.
6.2 ClickHouse: The Fastest Open-Source OLAP
ClickHouse is a columnar OLAP database developed by Yandex. It can scan billions of rows per second even on a single node.
-- ClickHouse - Real-time dashboard queries
CREATE TABLE events (
    event_id UInt64,
    user_id UInt32,
    event_type LowCardinality(String),
    properties Map(String, String),
    event_time DateTime64(3)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, user_id, event_time);

-- Materialized View for real-time aggregation
CREATE MATERIALIZED VIEW hourly_events_mv
ENGINE = SummingMergeTree()
ORDER BY (event_type, hour)
AS SELECT
    event_type,
    toStartOfHour(event_time) AS hour,
    count() AS event_count,
    uniqHLL12(user_id) AS unique_users
FROM events
GROUP BY event_type, hour;

-- Millisecond responses across billions of rows
SELECT
    event_type,
    sum(event_count) AS total_events,
    sum(unique_users) AS total_users  -- approximate: per-hour HLL counts don't sum exactly
FROM hourly_events_mv
WHERE hour >= now() - INTERVAL 24 HOUR
GROUP BY event_type
ORDER BY total_events DESC
LIMIT 20;
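The materialized view above trades storage for speed: dashboard queries scan small hourly rollups instead of billions of raw events. The effect in pure Python (conceptual; ClickHouse maintains these buckets incrementally at insert time):

```python
from collections import Counter

def rollup_hourly(events: list) -> Counter:
    """events: (event_type, epoch_seconds) pairs. Pre-aggregate counts per
    (event_type, hour) bucket -- the shape of the SummingMergeTree MV."""
    buckets = Counter()
    for event_type, ts in events:
        buckets[(event_type, ts - ts % 3600)] += 1
    return buckets

def dashboard_query(buckets: Counter, since: int) -> dict:
    """Answer 'events per type since <ts>' from rollups, not raw rows."""
    totals = {}
    for (event_type, hour), n in buckets.items():
        if hour >= since:
            totals[event_type] = totals.get(event_type, 0) + n
    return totals

buckets = rollup_hourly([("click", 100), ("click", 200), ("buy", 4000)])
```

With hourly buckets, a 24-hour dashboard query touches at most 24 rows per event type, regardless of raw event volume.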
6.3 Apache Druid: Real-Time Ingestion + Analytics
Druid is an OLAP database designed to handle real-time data ingestion and analytics simultaneously. It can ingest data directly from Kafka while concurrently processing queries.
6.4 Apache Pinot: LinkedIn's Real-Time Analytics
Pinot is an OLAP database developed at LinkedIn, optimized for user-facing analytics. It powers LinkedIn's "Who viewed your profile," Uber's UberEats restaurant dashboards, and more.
6.5 StarRocks: The Next-Generation OLAP Challenger
StarRocks is a next-generation OLAP engine that speaks a MySQL-compatible protocol. Its distinguishing lakehouse analytics capability allows direct querying of Iceberg, Hudi, and Delta Lake tables without loading the data first.
6.6 OLAP Engine Comprehensive Comparison
| Feature | ClickHouse | Apache Druid | Apache Pinot | StarRocks |
|---|---|---|---|---|
| Developer | Yandex → ClickHouse Inc | Metamarkets → Imply | LinkedIn → StarTree | CelerData |
| Architecture | Single binary | MSQ (complex) | Helix-based | MPP (simple) |
| Real-time Ingestion | Kafka engine | Native (best) | Native (good) | Routine Load |
| Query Performance | Best (single node) | Good | Good | Best (MPP) |
| SQL Compatibility | ANSI SQL | Druid SQL | PQL + SQL | MySQL compatible |
| Join Performance | Limited | Limited | Limited | Good (MPP) |
| Lakehouse Queries | Limited | Limited | Limited | Native (best) |
| Operational Complexity | Low | High | Medium | Low |
| Learning Curve | Low | High | Medium | Low (MySQL) |
| Scaling | Horizontal | Horizontal | Horizontal | Horizontal |
Benchmarks (ClickBench 2024):
TPC-H 10GB query performance (seconds, lower is better):

| Engine | Total runtime |
|---|---|
| StarRocks 3.x | 2.1 s (about 2.2x faster than ClickHouse) |
| ClickHouse | 4.6 s |
| Pinot | 16.3 s |
| Druid | 18.7 s |
Selection guide:
- Best single-node performance, simple operations → ClickHouse
- Kafka-native real-time ingestion → Druid
- User-facing analytics at scale → Pinot
- Direct lakehouse queries, MySQL compatible → StarRocks
7. Data Mesh and Data Contracts
7.1 Data Mesh: Decentralized Data Architecture
Data Mesh is a decentralized data architecture paradigm proposed by Zhamak Dehghani. Instead of a central data team managing all data, domain teams own and manage their own Data Products.
The 4 Principles of Data Mesh:
1. Domain Ownership
-> Each domain team owns their data
2. Data as a Product
-> Apply SLA, documentation, quality guarantees to data
3. Self-Serve Platform
-> Infrastructure team provides data platform tools
4. Federated Governance
-> Balance between enterprise standards + domain autonomy
Gartner 2025 Forecast: 70% of large enterprises are piloting or planning Data Mesh. However, fully successful Data Mesh transformations remain few.
7.2 Data Contracts: The Promise of Data APIs
A Data Contract is a formal agreement between data producers and consumers. Like API specs, it specifies data schema, quality, and SLA.
# data-contract.yml example
dataContractSpecification: 0.9.3
id: orders-contract
info:
  title: 'Orders Data Contract'
  version: 2.1.0
  owner: order-domain-team
  contact:
    name: 'Orders Team'
    email: orders@company.com
servers:
  production:
    type: iceberg
    catalog: polaris
    database: orders_db
    table: orders
models:
  orders:
    description: 'Completed order data'
    fields:
      order_id:
        type: bigint
        required: true
        unique: true
        description: 'Unique order identifier'
      customer_id:
        type: bigint
        required: true
      amount:
        type: decimal
        required: true
        minimum: 0
        description: 'Order amount (USD)'
      ordered_at:
        type: timestamp
        required: true
quality:
  type: SodaCL
  specification:
    checks for orders:
      - row_count > 0
      - missing_count(order_id) = 0
      - duplicate_count(order_id) = 0
      - avg(amount) between 10 and 500
sla:
  freshness: 1 hour
  availability: 99.9%
  latency: p99 < 500ms
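Contracts are only useful if they are enforced. A minimal validator sketch that checks records against field rules like those above (hypothetical helper, not a real contract-testing library):

```python
def validate(records: list, fields: dict) -> list:
    """Return a list of violations of a data-contract field spec."""
    errors = []
    seen = {name: set() for name, spec in fields.items() if spec.get("unique")}
    for i, rec in enumerate(records):
        for name, spec in fields.items():
            value = rec.get(name)
            if spec.get("required") and value is None:
                errors.append(f"row {i}: {name} is required")
                continue
            if "minimum" in spec and value is not None and value < spec["minimum"]:
                errors.append(f"row {i}: {name} below minimum")
            if spec.get("unique") and value is not None:
                if value in seen[name]:
                    errors.append(f"row {i}: duplicate {name}")
                seen[name].add(value)
    return errors

fields = {
    "order_id": {"required": True, "unique": True},
    "amount": {"required": True, "minimum": 0},
}
errors = validate(
    [{"order_id": 1, "amount": 10.0},
     {"order_id": 1, "amount": -5.0}],  # duplicate id, negative amount
    fields,
)
```

In practice this check runs in CI on the producer side, so a breaking change fails the producer's build instead of silently breaking downstream consumers.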
7.3 Data Mesh + Data Fabric Hybrid
The realistic trend in 2025 is a hybrid approach combining Data Mesh and Data Fabric.
Data Mesh + Fabric Hybrid:
┌──────────────────────────┐
│ Data Fabric Layer │
│ (Unified Governance/Catalog)│
└──────────┬───────────────┘
┌──────────────────┼──────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Order Domain│ │Customer Domn│ │Product Domn │
│ Data Product│ │ Data Product│ │ Data Product│
│ (Iceberg) │ │ (Iceberg) │ │ (Iceberg) │
└─────────────┘ └─────────────┘ └─────────────┘
Tooling ecosystem:
- Catalog: OpenMetadata, DataHub, Amundsen
- Quality: Great Expectations, Soda, Monte Carlo
- Contracts: Datakin, Schemata, Bitol
- Lineage: Marquez, OpenLineage
8. Production Architecture: Real-Time Recommendation System
8.1 Requirements
Assume we're building a real-time personalized recommendation system for a large e-commerce platform.
Requirements:
- Daily active users: 5 million
- Events per second: 50,000
- Recommendation latency: p99 < 100ms
- Data sources: Click streams, purchases, searches, inventory
- Feature store: Combined real-time + batch features
- A/B testing: Performance comparison across model versions
8.2 Architecture Design
Real-Time Recommendation System Architecture:
[Click/Purchase/Search Events]
|
v
┌─────────┐ ┌──────────────┐
│ Kafka │────>│ Flink (CEP) │──> Real-time feature computation
│ Cluster │ │ Window Agg. │ (recent 5-min click categories)
└────┬────┘ └──────┬───────┘
│ │
v v
┌─────────┐ ┌──────────────┐
│ Iceberg │ │ Redis │──> Real-time feature serving
│ (raw) │ │ Feature Store│ (p99 < 5ms)
└────┬────┘ └──────────────┘
│ ^
v │
┌─────────┐ ┌──────────────┐
│ dbt │────>│ ClickHouse │──> Batch feature computation
│(transform)│ │ (agg/query) │ (30-day purchase patterns)
└────┬────┘ └──────────────┘
│
v
┌─────────────────┐ ┌──────────────┐
│ ML Model Serving │<───│ Feature Store │
│ (TorchServe/ │ │ (Redis+DuckDB)│
│ Triton) │ └──────────────┘
└────────┬────────┘
│
v
┌─────────────────┐
│ API Gateway │──> Serve recommendations to users
│ (p99 < 100ms) │
└─────────────────┘
8.3 Core Implementation Code
Flink real-time feature computation:
// Real-time user behavior feature computation
DataStream<UserFeature> realtimeFeatures = clickEvents
    .keyBy(ClickEvent::getUserId)
    .window(SlidingEventTimeWindows.of(
        Time.minutes(5), Time.minutes(1)))
    .aggregate(new UserBehaviorAggregator());

// Store to Redis feature store
realtimeFeatures.addSink(
    new RedisSink<>(redisConfig, new FeatureRedisMapper()));
dbt batch feature model:
-- models/features/user_purchase_patterns.sql
-- 30-day purchase pattern features (batch)
WITH purchase_stats AS (
    SELECT
        customer_id,
        COUNT(*) AS purchase_count_30d,
        SUM(amount_dollars) AS total_spend_30d,
        AVG(amount_dollars) AS avg_order_value_30d,
        COUNT(DISTINCT category) AS unique_categories_30d,
        MAX(ordered_at) AS last_purchase_at
    FROM {{ ref('fct_revenue') }}
    WHERE ordered_at >= CURRENT_DATE - INTERVAL '30' DAY
    GROUP BY customer_id
)

SELECT
    customer_id,
    purchase_count_30d,
    total_spend_30d,
    avg_order_value_30d,
    unique_categories_30d,
    DATEDIFF('day', last_purchase_at, CURRENT_DATE) AS days_since_last_purchase
FROM purchase_stats
Airflow DAG:
# Recommendation system daily pipeline
from airflow.decorators import dag, task

@dag(schedule="0 2 * * *", catchup=False)
def recommendation_pipeline():
    @task()
    def run_dbt_features() -> bool:
        """Generate batch features with dbt"""
        import subprocess
        result = subprocess.run(
            ["dbt", "run", "--select", "features+"],
            capture_output=True, text=True,
        )
        return result.returncode == 0

    @task()
    def export_to_feature_store(dbt_success: bool):
        """Export batch features to the feature store"""
        if not dbt_success:
            raise Exception("dbt run failed")
        # ClickHouse -> Redis feature store sync
        pass

    @task()
    def retrain_model():
        """Retrain the ML model"""
        # Daily model retraining logic
        pass

    @task()
    def validate_model(retrain_result):
        """Validate model performance"""
        # A/B test metric comparison
        pass

    dbt_result = run_dbt_features()
    export_to_feature_store(dbt_result)
    retrain = retrain_model()
    validate_model(retrain)

recommendation_pipeline()
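The `export_to_feature_store` task is left as a stub above. One way to fill it in is a batched copy: page feature rows out of the warehouse and write each customer's features as one Redis hash. The sketch below keeps the I/O behind plain callables (`fetch_rows` and `write_hash` are hypothetical stand-ins for a ClickHouse query cursor and a Redis pipeline), so only the batching logic is shown:

```python
from itertools import islice

def sync_features(fetch_rows, write_hash, batch_size=1000):
    """Copy batch features into a key-value feature store.

    fetch_rows: iterable of (customer_id, feature_dict) pairs
    write_hash: callable(key, mapping) that upserts one hash
    Returns the number of rows synced.
    """
    rows = iter(fetch_rows)
    synced = 0
    while True:
        batch = list(islice(rows, batch_size))  # page through the result set
        if not batch:
            break
        for customer_id, features in batch:
            # Keyed as "features:{customer_id}" so the serving path
            # can do a single hash lookup per request
            write_hash(f"features:{customer_id}", features)
        synced += len(batch)
    return synced
```

Keeping the clients behind callables also makes the task unit-testable with a dict-backed fake, without a live ClickHouse or Redis.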
8.4 Cost Optimization
Monthly Estimated Costs (AWS, 5M daily users):
| Component | Instance Type | Monthly Cost |
|-----------------|------------------------|---------------|
| Kafka (MSK) | kafka.m5.2xlarge x3 | ~$2,100 |
| Flink (EMR) | m5.2xlarge x4 | ~$2,800 |
| Iceberg (S3) | 50TB storage | ~$1,150 |
| ClickHouse | m5.4xlarge x3 | ~$4,200 |
| Redis | r6g.2xlarge x2 | ~$1,800 |
| Airflow (MWAA) | mw1.medium | ~$350 |
Total estimated cost: ~$12,400/month
Cost optimization strategies:
1. Spot instances for Flink workers -> 60% savings
2. Iceberg compaction optimization -> 30% storage savings
3. ClickHouse TTL policies -> Auto-delete old data
4. Reserved Instances -> 40% savings for long-term workloads
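The totals above are simple arithmetic; a few lines of code make the table and two of the savings levers explicit (figures copied from the table, savings rates from the list, all rough estimates):

```python
# Monthly cost estimates from the table above (USD)
costs = {
    "kafka_msk": 2100,
    "flink_emr": 2800,
    "iceberg_s3": 1150,
    "clickhouse": 4200,
    "redis": 1800,
    "airflow_mwaa": 350,
}

total = sum(costs.values())  # ~$12,400/month baseline

# Two of the optimization levers listed above:
spot_savings = costs["flink_emr"] * 0.60         # spot instances for Flink workers
compaction_savings = costs["iceberg_s3"] * 0.30  # Iceberg compaction on storage

optimized = total - spot_savings - compaction_savings
print(f"baseline ${total:,}/mo, optimized ~${optimized:,.0f}/mo")
```

Applying just those two levers cuts the bill by roughly $2,000/month before touching Reserved Instances or TTL policies.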
9. Certifications and Learning Roadmap
9.1 Recommended 2025 Certifications
| Certification | Provider | Difficulty | Core Content | Recommended For |
|---|---|---|---|---|
| DP-700 | Microsoft | Medium | Fabric Data Engineer | Azure ecosystem |
| Data Engineer Associate | Databricks | Medium | Spark, Delta Lake, Unity | Databricks users |
| Professional DE | Google Cloud | High | BigQuery, Dataflow, GCS | GCP ecosystem |
| Data Engineer | AWS (DEA-C01) | Medium | Glue, Redshift, EMR | AWS ecosystem |
| dbt Analytics Engineering | dbt Labs | Low-Med | dbt Core, modeling | Analytics engineers |
| CCA Spark and Hadoop | Cloudera | Medium | Spark, HDFS, Hive | On-premises |
9.2 12-Month Learning Roadmap
12-Month Data Engineer Learning Path:
[Months 1-3: Foundations]
├── Advanced SQL (window functions, CTEs, optimization)
├── Python data processing (pandas, polars)
├── Linux / Docker basics
└── Git / CI/CD basics
[Months 4-6: Core Tools]
├── Apache Spark (DataFrame API, SparkSQL)
├── dbt (project structure, tests, documentation)
├── Airflow (DAG authoring, operators)
└── Kafka basics (producers, consumers, topics)
[Months 7-9: Advanced]
├── Apache Flink (stream processing, state management)
├── Iceberg / Delta Lake (lakehouse)
├── ClickHouse (OLAP analytics)
└── Great Expectations (data quality)
[Months 10-12: Production & Certification]
├── Complete 3 portfolio projects
├── Earn 1-2 certifications
├── Learn Data Mesh / governance
└── Start a technical blog
9.3 Recommended Learning Resources
Free resources:
- DataTalksClub DE Zoomcamp: Free online bootcamp (16 weeks)
- Databricks Academy: Free Spark, Delta Lake courses
- dbt Learn: Official dbt tutorials
- ClickHouse University: Official free courses
Paid resources:
- Zach Wilson's Data Engineering Bootcamp: Project-focused
- O'Reilly "Fundamentals of Data Engineering": The bible
- Udemy "Apache Flink Master Class": Advanced Flink
10. Three Portfolio Projects
Project 1: Real-Time Taxi Fare Prediction Pipeline
Tech Stack: Kafka + Flink + Iceberg + dbt + ClickHouse
Architecture:
NYC Taxi Data (API) -> Kafka -> Flink (real-time features) -> Iceberg
                                                                 |
                                                       dbt (batch transform)
                                                                 |
                                                      ClickHouse (dashboard)
Implementation highlights:
1. Real-time regional demand forecasting features with Flink
2. Iceberg partition evolution (daily -> hourly)
3. dbt incremental models for daily aggregation
4. ClickHouse Materialized Views for real-time dashboards
5. Airflow DAG orchestrating the entire pipeline
Project 2: E-Commerce CDC Pipeline
Tech Stack: Debezium + Kafka + Spark + Delta Lake + dbt
Architecture:
PostgreSQL (OLTP) -> Debezium (CDC) -> Kafka -> Spark Streaming -> Delta Lake
                                                                       |
                                                           dbt (layered transform)
                                                                       |
                                                           Superset (BI dashboard)
Implementation highlights:
1. Real-time PostgreSQL change capture with Debezium
2. CDC event processing with Spark Structured Streaming
3. SCD Type 2 implementation with Delta Lake MERGE
4. dbt staging/intermediate/marts structure
5. Superset real-time revenue/inventory dashboard
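Highlight 3 (SCD Type 2 via Delta Lake MERGE) boils down to: when a tracked attribute changes, close out the current row and append a new version. A pure-Python miniature of that merge logic, with dicts standing in for Delta tables and an `email` column as the illustrative tracked attribute:

```python
from datetime import date

def scd2_merge(dim_rows, change, today):
    """Apply one CDC change to an SCD Type 2 dimension.

    dim_rows: list of dicts with id, email, valid_from, valid_to, is_current.
    change:   dict with id and the new email value.
    Mirrors a Delta MERGE: update-close the current row, insert the new version.
    """
    for row in dim_rows:
        if row["id"] == change["id"] and row["is_current"]:
            if row["email"] == change["email"]:
                return dim_rows             # attribute unchanged: no-op
            row["valid_to"] = today         # close out the old version
            row["is_current"] = False
    # Insert the new current version (also covers brand-new ids)
    dim_rows.append({
        "id": change["id"], "email": change["email"],
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return dim_rows
```

In the real project the same two branches become the WHEN MATCHED and WHEN NOT MATCHED clauses of a Delta MERGE statement.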
Project 3: Data Contract-Based Data Mesh
Tech Stack: OpenMetadata + Soda + dbt + Iceberg + Dagster
Architecture:
Domain A (Orders)    -> Data Contract (YAML) -> Iceberg
Domain B (Customers) -> Data Contract (YAML) -> Iceberg
Domain C (Products)  -> Data Contract (YAML) -> Iceberg
                                                   |
                                 OpenMetadata (catalog/lineage)
                                 Soda (quality validation)
                                 Dagster (orchestration)
Implementation highlights:
1. Define Data Contract YAMLs for 3 domains
2. Automated data quality validation with Soda
3. Lineage and catalog management with OpenMetadata
4. Pipeline building with Dagster Software-Defined Assets
5. Enterprise metric layer (MetricFlow)
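Highlights 1 and 2 pair a declared contract with automated checks. As a minimal sketch of what Soda-style validation against a contract does, a dict stands in for the parsed YAML, with hypothetical field names for the Orders domain:

```python
# A data contract for the Orders domain, as it might look once parsed from YAML
orders_contract = {
    "dataset": "orders",
    "fields": {
        "order_id": {"type": int, "required": True},
        "customer_id": {"type": int, "required": True},
        "amount": {"type": float, "required": True},
        "coupon_code": {"type": str, "required": False},
    },
}

def validate(rows, contract):
    """Check rows against the contract; return a list of violation strings."""
    violations = []
    for i, row in enumerate(rows):
        for name, spec in contract["fields"].items():
            if name not in row or row[name] is None:
                if spec["required"]:
                    violations.append(f"row {i}: missing required field {name}")
                continue
            if not isinstance(row[name], spec["type"]):
                violations.append(f"row {i}: {name} is not {spec['type'].__name__}")
    return violations
```

A real setup would run checks like these in CI and in the pipeline itself, failing the publish step when the producing domain breaks its contract.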
Quizzes
Quiz 1: Stream Processing Engine Selection
Q: You need to build a credit card fraud detection system. Decisions must be made within 50ms of a transaction, and you need to analyze transaction patterns from the last 10 minutes. Which stream processing engine is most suitable?
A: Apache Flink
Why Flink is optimal:
- Event-by-event processing: Flink performs true per-event processing, so each transaction can be handled immediately.
- Millisecond latency: Unlike Spark's micro-batching (typically 100ms+ per batch), Flink processes each event as it arrives, keeping end-to-end latency in the low milliseconds and leaving headroom within the 50ms budget.
- Checkpoint-based state management: Reliably maintains the 10-minute sliding window state in RocksDB while guaranteeing exactly-once semantics.
- CEP (Complex Event Processing): The Flink CEP library allows declarative definition of complex fraud patterns.
Spark's micro-batching nature makes it difficult to meet the 50ms constraint, and Beam's performance varies by runner, making Flink the best choice for this scenario.
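The 10-minute pattern window in this answer is precisely what Flink's keyed state would hold. As a toy single-process analogue (plain Python, no checkpointing or exactly-once guarantees, illustrative only):

```python
from collections import defaultdict, deque

WINDOW_MS = 10 * 60 * 1000  # 10-minute lookback, as in the scenario

class FraudFeatures:
    """Per-card sliding-window stats, updated per event like Flink keyed state."""

    def __init__(self):
        self.windows = defaultdict(deque)  # card_id -> deque of (ts_ms, amount)

    def on_transaction(self, card_id, ts_ms, amount):
        window = self.windows[card_id]
        window.append((ts_ms, amount))
        # Evict everything older than the 10-minute lookback
        while window and window[0][0] < ts_ms - WINDOW_MS:
            window.popleft()
        amounts = [a for _, a in window]
        # Features a fraud model would score within the 50ms budget
        return {
            "txn_count_10m": len(amounts),
            "total_amount_10m": sum(amounts),
            "max_amount_10m": max(amounts),
        }
```

Flink does the same eviction and aggregation per event, but with the state in RocksDB, fault-tolerant via checkpoints, and distributed across the keyed partitions.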
Quiz 2: Lakehouse Format Selection
Q: Your company uses Spark, Trino, and Flink, processing hundreds of GBs of CDC data daily. Schemas change frequently. Which table format should you choose?
A: Apache Iceberg
Why Iceberg is optimal:
- Multi-engine support: Spark, Trino, and Flink all support Iceberg as first-class citizens. Delta Lake has limited non-Spark engine support, and Hudi has weak Trino support.
- Schema evolution: Iceberg supports full schema evolution. Adding, dropping, renaming, and type-changing columns requires no data rewriting.
- Partition evolution: Hidden partitioning ensures partition schema changes have no impact on existing data or queries.
If extremely frequent record-level upserts are the core requirement, Hudi could be considered, but if multi-engine support and schema evolution are priorities, Iceberg is the best choice.
Quiz 3: dbt Project Structure
Q: What is the difference between staging and marts models in a dbt project, and why is this layered structure necessary?
A:
Staging models:
- Cleanse source data in a 1:1 mapping
- Only basic transformations: column renaming, type casting, filtering
- Contain no business logic
- One staging model per source
Marts models:
- Final analytics tables that answer business questions
- Join multiple staging/intermediate models and apply business logic
- Divided into fact and dimension tables
- Grouped by business domain (finance, marketing, etc.)
Why the layered structure is necessary:
- DRY principle: When sources change, only staging needs modification, propagating to all downstream models
- Testability: Independent testing at each layer
- Debugging: Quickly identify which layer a problem originated from
- Team collaboration: Analysts manage marts only; engineers manage staging
Quiz 4: Orchestrator Selection
Q: A 5-person analytics engineering team wants to run dbt-centric data pipelines. They have no prior Airflow experience and need to set up a production environment as quickly as possible. Which orchestrator would you recommend?
A: Dagster
Why Dagster is optimal:
- Native dbt integration: Dagster automatically converts dbt projects into Software-Defined Assets. Dependencies between dbt models are automatically reflected in the Dagster asset graph.
- Low learning curve (for dbt users): Dagster's asset-centric paradigm feels natural to dbt users. It's more intuitive than Airflow's DAG/Operator concepts.
- Quick production setup: Dagster Cloud lets you start immediately without infrastructure management.
- Observability: Automatically tracks asset materialization history, data lineage, and metadata.
Airflow's initial setup and operations are complex for inexperienced teams, and Prefect's dbt integration isn't as deep as Dagster's.
Quiz 5: OLAP Engine Selection
Q: You've already built an Iceberg lakehouse and want to run analytical queries directly on Iceberg tables without separate data loading. Your team is familiar with MySQL. Which OLAP engine is most suitable?
A: StarRocks
Why StarRocks is optimal:
- Native lakehouse queries: StarRocks can query Iceberg, Delta Lake, and Hudi tables directly. No separate data ingestion needed, reducing data duplication and sync costs.
- MySQL-compatible protocol: MySQL clients and drivers work as-is, so MySQL-familiar teams can adapt immediately.
- MPP architecture: Superior performance for large join queries compared to ClickHouse.
- Simple operations: Simpler architecture than Druid, much easier to operate.
ClickHouse has limited direct Iceberg query support, and Druid/Pinot have high operational complexity and are not MySQL-compatible.
References
Official Documentation
- Apache Flink Documentation — Official Flink docs
- Apache Spark Documentation — Official Spark docs
- dbt Documentation — Official dbt docs
- Apache Iceberg Documentation — Official Iceberg docs
- ClickHouse Documentation — Official ClickHouse docs
- Apache Airflow Documentation — Official Airflow docs
- Dagster Documentation — Official Dagster docs
Architecture and Comparisons
- Nexocode: Flink vs Spark vs Beam Benchmark 2024 — Stream processing engine benchmarks
- Databricks: Delta Lake UniForm — Official Delta UniForm docs
- ClickBench: OLAP Database Benchmarks — OLAP performance comparisons
- Data Mesh Architecture by Zhamak Dehghani — Data Mesh principles and architecture
Learning Resources
- DataTalksClub Data Engineering Zoomcamp — Free bootcamp
- O'Reilly: Fundamentals of Data Engineering — Data engineering bible
- dbt Learn — Official dbt learning courses
- Databricks Academy — Free Spark/Delta courses
- Apache Iceberg: The Definitive Guide (O'Reilly) — Advanced Iceberg learning
Trends and Analysis
- Gartner: Modern Data Stack Trends 2025 — Data engineering trends
- a16z: The Modern Data Stack (2024 Update) — VC perspective on the data stack