Data Engineering Complete Guide — Lakehouse, Streaming, dbt, Orchestration, Data Mesh (Season 2 Ep 8, 2025)


Intro — Expectations for Data Engineers in 2025

Ten years ago a data engineer wrote ETL scripts and operated Hadoop clusters. In 2025 the expectations are broader:

  • Lakehouse design: pick and operate Iceberg, Delta, or Hudi
  • Streaming + Batch unification: Lambda is out, Kappa is in
  • Modern Data Stack: dbt + Airflow/Dagster + Fivetran/Airbyte + BigQuery/Snowflake
  • Data Mesh: centralized vs distributed org model
  • Data Contract: schema-as-contract (Protobuf-like)
  • AI/ML integration: Feature Store, Vector DB, MLOps touchpoints
  • Cost management: cloud data cost control (FinOps)

This post captures the mental frame of a 2025 data engineer.


Part 1 — Lakehouse: Data Lake + Warehouse Unified

1.1 Historical Context

Era   | Architecture                      | Characteristic
2000s | Data Warehouse (Teradata, Oracle) | Structured data only, expensive
2010s | Data Lake (Hadoop, S3)            | Any shape, but no ACID / poor query performance
2020s | Lakehouse (Iceberg, Delta, Hudi)  | Best of both

Definition: object storage (S3/GCS/ADLS) + table format (Iceberg etc.) + query engine (Spark, Trino, DuckDB) delivering warehouse-grade ACID and performance.

1.2 Three Table Formats (2024-2025)

Item              | Iceberg         | Delta Lake         | Hudi
Origin            | Netflix         | Databricks         | Uber
Governance        | Apache          | Linux Foundation   | Apache
Engine neutrality | Best            | Databricks-leaning | Neutral
Real-time upsert  | Good            | Good               | Best
Time travel       | Yes             | Yes                | Yes
Schema evolution  | Strong          | Strong             | Medium
2025 share        | Rapidly growing | Leader (slipping)  | Niche

Key 2024 event: Databricks acquired Tabular (Iceberg founders), pointing toward Iceberg/Delta convergence.

2025 pick: Iceberg for new projects. Delta is fine on Databricks.

1.3 Iceberg Structure

metadata.json (table metadata)
  -> Manifest List (Avro)
       |- Manifest 1 (Avro)
       |    \- Data Files (Parquet)
       |- Manifest 2 (Avro)
       |    \- Data Files (Parquet)
       \- ...

  • Snapshot: a point-in-time table state (enables time travel)
  • Partition Evolution: the partition spec can change without rewriting old data
  • Hidden Partitioning: users write WHERE year=2025; the engine maps it to internal partitions
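The idea behind hidden partitioning can be sketched in a few lines. This is an illustrative toy, not Iceberg's implementation: a `year()`-style transform maps a source column to a partition value recorded per data file, so the engine can prune files without the user ever naming the partition column. The file names and dict layout are invented for the example.

```python
from datetime import datetime

def year_transform(ts: datetime) -> int:
    """Iceberg-style year() partition transform (toy version)."""
    return ts.year

# Each data file records the partition value its rows fall into.
data_files = [
    {"path": "f1.parquet", "partition_year": 2024},
    {"path": "f2.parquet", "partition_year": 2025},
    {"path": "f3.parquet", "partition_year": 2025},
]

def prune(files, predicate_year: int):
    """Engine-side pruning: WHERE year(ts) = 2025 touches only matching files."""
    return [f["path"] for f in files if f["partition_year"] == predicate_year]

print(prune(data_files, 2025))  # ['f2.parquet', 'f3.parquet']
```

The user's query never mentions `partition_year`; the transform is the table's private concern, which is exactly what lets the partition spec evolve later.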

1.4 Lakehouse Query Engines 2025

Engine         | Strength
Spark          | General-purpose king, Delta-native
Trino (Presto) | Interactive SQL, multi-source
DuckDB         | Local/embedded, blazing fast
Snowflake      | Best operational UX
BigQuery       | Serverless, GCP-integrated
ClickHouse     | Real-time analytics
Databricks SQL | Delta-optimized

Part 2 — Streaming: The Kappa Victory

2.1 Lambda vs Kappa

Lambda (2010s): Batch Layer (accurate) + Speed Layer (fast) + Serving. Problem: same logic implemented twice — maintenance hell.

Kappa (2014, Jay Kreps): streaming only; reprocessing means replaying the log through the same streaming code. The de facto standard in 2025.

2.2 Streaming Engines 2025

Engine                     | Trait
Apache Flink               | Most mature & feature-rich, stateful
Kafka Streams              | Kafka-native Java library
Spark Structured Streaming | Unified with batch API
Materialize                | PostgreSQL-compatible SQL
RisingWave                 | Materialize alternative, Rust
Arroyo                     | New, Rust-based

2.3 Ten Streaming Concepts

  1. Event Time vs Processing Time
  2. Watermark: "all events up to time T have arrived" signal
  3. Windowing: Tumbling, Sliding, Session
  4. Stateful Processing
  5. Exactly-Once
  6. Backpressure
  7. Checkpointing: recovery points
  8. Join: Stream-Stream, Stream-Table
  9. CDC (Change Data Capture)
  10. Deduplication
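Concepts 1-3 fit in one toy sketch. This is not a Flink API, just plain Python: events carry an event time, a tumbling window groups them by that time, and a watermark (here a simple "max event time minus 5 seconds of allowed lateness" heuristic, an assumption for the example) decides when a window may be closed.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window size in seconds

def window_start(event_time: int) -> int:
    """Map an event time to the start of its tumbling window."""
    return event_time - (event_time % WINDOW)

events = [  # (event_time, value) — note t=9 arrives late, after t=12
    (3, 1), (7, 1), (12, 1), (9, 1), (15, 1),
]

windows = defaultdict(int)  # window start -> aggregated value
watermark = 0
closed = []

for event_time, value in events:
    windows[window_start(event_time)] += value
    # Watermark heuristic: max event time seen minus 5s allowed lateness.
    watermark = max(watermark, event_time - 5)
    # A window may be emitted once the watermark passes its end.
    for start in sorted(windows):
        if start + WINDOW <= watermark and start not in closed:
            closed.append(start)

print(dict(windows))  # {0: 3, 10: 2}
```

The late event at t=9 still lands in window [0, 10) because the watermark had not yet passed 10; with a smaller lateness allowance it would have been dropped or routed to a side output.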

2.4 CDC — The Modern Integration Primitive

PostgreSQL/MySQL -> Debezium -> Kafka -> Flink/Spark -> Iceberg

Eliminates batch ETL, near-real-time sync. Tools: Debezium (OSS), Fivetran, Airbyte.
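The downstream half of that pipeline boils down to applying change events to a target table. A minimal sketch, using Debezium's envelope field names (`op` with `c`/`u`/`d`/`r`, `before`, `after`) but an in-memory dict instead of Kafka and Iceberg:

```python
table = {}  # target state keyed by primary key

changes = [  # Debezium-style change events (payload simplified)
    {"op": "c", "after": {"id": 1, "email": "a@x.com"}},   # insert
    {"op": "u", "after": {"id": 1, "email": "a@y.com"}},   # update
    {"op": "c", "after": {"id": 2, "email": "b@x.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@x.com"}},  # delete
]

for event in changes:
    if event["op"] in ("c", "u", "r"):  # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row          # upsert on the primary key
    elif event["op"] == "d":
        table.pop(event["before"]["id"], None)

print(table)  # {1: {'id': 1, 'email': 'a@y.com'}}
```

A real Flink/Spark job does the same merge, just with ordering guarantees per key and transactional writes into the table format.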

2.5 Example Real-time Platform 2025

[Kafka] -> [Flink SQL] -> [Iceberg (bronze/silver/gold)] -> [Trino/DuckDB]
                   |
                   +-> [Materialize] (live dashboards)
                   |
                   +-> [Redis] (low-latency serving)

Part 3 — Medallion Architecture: Bronze/Silver/Gold

3.1 Three Tiers

  1. Bronze (Raw): original data, absorbs schema drift
  2. Silver (Cleansed): validated, deduped, unified schema
  3. Gold (Business): domain aggregates, BI/ML-ready

3.2 Responsibilities

Tier   | Owner                            | Frequency
Bronze | Data engineer                    | Real-time to hourly
Silver | Data engineer                    | Hourly to daily
Gold   | Analytics engineer + domain team | Daily to weekly

3.3 Benefits

  • Reprocessable (Bronze preserved)
  • Clear contract (Silver = canonical schema)
  • Business logic isolated (Gold)
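A toy end-to-end flow makes the tier responsibilities concrete. The sample records are invented; Bronze keeps everything verbatim (duplicates and bad values included), Silver validates and dedupes, Gold aggregates for business use:

```python
bronze = [  # raw ingests, kept as-is for reprocessability
    {"order_id": 1, "amount": "120", "country": "DE"},
    {"order_id": 1, "amount": "120", "country": "DE"},  # duplicate
    {"order_id": 2, "amount": "abc", "country": "FR"},  # invalid amount
    {"order_id": 3, "amount": "80", "country": "DE"},
]

def to_silver(rows):
    """Validate types and dedupe on the business key."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen or not r["amount"].isdigit():
            continue
        seen.add(r["order_id"])
        out.append({**r, "amount": int(r["amount"])})
    return out

def to_gold(rows):
    """Business aggregate: revenue per country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'DE': 200}
```

The invalid FR row is dropped at Silver, not at Bronze; because Bronze is preserved, a fixed `to_silver` can always be re-run over history.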

Part 4 — dbt: The Analytics Engineering Standard

4.1 What dbt Is

"SQL modeling as software engineering" — version control, tests, docs, dependency graph.

4.2 Core Layout

models/
  staging/
    stg_orders.sql
    stg_customers.sql
  marts/
    orders.sql
    revenue_daily.sql
tests/
seeds/
macros/

4.3 Model Example

-- models/marts/orders.sql
{{ config(materialized='table', partition_by={'field': 'order_date', 'data_type': 'date'}) }}

with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
)
select
    o.order_id,
    o.order_date,
    c.country,
    o.amount
from orders o
join customers c using (customer_id)
where o.status = 'completed'

{{ ref('stg_orders') }} — dbt auto-builds the dependency graph.
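Conceptually, each `ref()` call registers an edge in a DAG, and dbt builds models in topological order. A hand-rolled sketch of that idea (dbt itself extracts the edges by parsing the Jinja; here they are listed by hand to match the model above):

```python
# model -> list of upstream models it ref()s
models = {
    "stg_orders": [],
    "stg_customers": [],
    "orders": ["stg_orders", "stg_customers"],
}

def build_order(deps):
    """Depth-first topological sort: parents always build before children."""
    done, order = set(), []
    def visit(node):
        if node in done:
            return
        for parent in deps[node]:
            visit(parent)
        done.add(node)
        order.append(node)
    for node in deps:
        visit(node)
    return order

print(build_order(models))  # ['stg_orders', 'stg_customers', 'orders']
```

This is why renaming a staging model breaks loudly at compile time rather than silently at query time: the graph, not a comment, records the dependency.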

4.4 Tests and Documentation

# schema.yml
version: 2
models:
  - name: orders
    description: "Completed orders (Gold layer)"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Run the tests and build the docs:

dbt test
dbt docs generate && dbt docs serve

4.5 2025 dbt Ecosystem

  • dbt Core: OSS CLI
  • dbt Cloud: SaaS (IDE + scheduler)
  • dbt-osmosis: metadata propagation
  • Elementary: data quality + observability
  • SQLMesh: dbt alternative, stronger dependency handling
  • Dagster + dbt: tighter orchestration

Part 5 — Orchestration: Airflow vs Dagster vs Prefect vs Temporal

5.1 2025 Comparison

Tool     | Strength                       | Weakness          | Fits
Airflow  | Mature, huge ecosystem         | Dated UX          | Large, mature teams
Dagster  | Data-aware, typed              | Learning curve    | Data-centric teams
Prefect  | Pythonic, much improved in 2.x | Smaller community | Mid-size teams
Temporal | Workflow reliability           | Not data-specific | Long-running workflows
Kestra   | Declarative YAML               | New               | Simple pipelines

5.2 Dagster: "Data-aware Orchestration"

Airflow manages tasks. Dagster manages Assets (data outputs).

from dagster import asset

@asset
def raw_orders():
    # Placeholder extraction helper — swap in your own Postgres reader.
    return fetch_from_postgres("orders")

@asset
def cleaned_orders(raw_orders):
    # The upstream asset is injected by parameter name; a DataFrame is assumed.
    return raw_orders.dropna()

@asset
def revenue_by_day(cleaned_orders):
    return cleaned_orders.groupby("date").sum()

Dependencies, schemas, types all explicit. Execution graph = Asset graph.

5.3 Airflow 2.x Improvements

  • TaskFlow API (Pythonic)
  • Dynamic Task Mapping
  • Dataset Triggering (inspired by Dagster)
  • Airflow 3.0 (2025): data-centric overhaul

Part 6 — Data Mesh: Organizational Data Architecture

6.1 Background

Limits of centralized data teams: weak domain knowledge, bottleneck, priority conflicts.

6.2 Four Principles (Zhamak Dehghani, 2019)

  1. Domain Ownership: domains own their data
  2. Data as a Product: dataset = product (quality, docs, SLA)
  3. Self-serve Platform: central team provides platform
  4. Federated Governance: central rules + domain execution

6.3 Field Reality

Works when: 500+ engineers, clear domain boundaries, mature platform-engineering org.

Fails when: under 100 engineers (over-engineering), weak domain teams, no central platform.

2025 reality: from ideology to pragmatic adjustment — Hub-and-Spoke or partial Mesh is common.


Part 7 — Data Contract: The Team Interface

7.1 Why

Problem: an upstream team renames user.full_name to user.name; 100 downstream pipelines break, and nobody can trace why.

7.2 Definition

Schema + semantics + SLA as a formal promise. Often Protobuf or JSON Schema.

# user_contract.yml
name: user_events
version: 1.2.0
owner: user-platform-team
schema:
  event_id: string
  user_id: string
  event_type: enum [signup, login, logout, purchase]
  timestamp: timestamp
  properties: map
sla:
  freshness: under 5 minutes
  availability: 99.9%
  breaking_change_notice: 30 days
consumers:
  - analytics-team
  - ml-team
  - finance-team
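Enforcing the schema half of such a contract at the producer boundary can be sketched in plain Python. A real setup would compile the contract to Protobuf or JSON Schema; the `CONTRACT` dict below hand-mirrors a slice of user_contract.yml, and the sample events are invented:

```python
CONTRACT = {
    "event_id": str,
    "user_id": str,
    "event_type": {"signup", "login", "logout", "purchase"},  # enum
}

def validate(event: dict) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors = []
    for field, rule in CONTRACT.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif isinstance(rule, set) and event[field] not in rule:
            errors.append(f"invalid enum value: {field}={event[field]}")
        elif isinstance(rule, type) and not isinstance(event[field], rule):
            errors.append(f"wrong type: {field}")
    return errors

ok = {"event_id": "e1", "user_id": "u1", "event_type": "signup"}
bad = {"event_id": "e2", "event_type": "deleted"}
print(validate(ok))   # []
print(validate(bad))  # ['missing field: user_id', 'invalid enum value: event_type=deleted']
```

Running this in the producer's CI is what turns a silent downstream breakage into a blocked deploy.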

7.3 Tools (2024-2025)

  • Protobuf + Buf: engineering-team centric
  • Great Expectations: data quality validation
  • dbt Contracts (1.5+): model-level contracts
  • DataHub: metadata catalog
  • Apache Atlas: Hadoop ecosystem

7.4 Contract-first Pipeline

Producer Team                         Consumer Team
  |                                        |
  |- Commits schema change to Buf -> Review
  |                                        |
  |- CI: backward compat check             |
  |                                        |
  |- Deploy producer                       |
  |                                        |
  \- Emit events ----- Kafka --------> Update consumer

Part 8 — Data Quality: The Trust Foundation

8.1 Six Dimensions

  1. Accuracy
  2. Completeness
  3. Consistency (across systems)
  4. Timeliness
  5. Uniqueness
  6. Validity (format and range)

8.2 Five Pillars of Data Observability (Monte Carlo)

  1. Freshness: last update time
  2. Volume: expected row count range
  3. Schema: column/type changes
  4. Distribution: value distribution shifts
  5. Lineage: where from, where to
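Two of the five pillars (freshness and volume) reduce to checks a scheduler can run on every refresh. A minimal sketch with invented thresholds — commercial tools mostly add ML-learned bounds and alert routing on top of this shape:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated: datetime, max_age: timedelta) -> bool:
    """Pillar 1: was the table updated recently enough?"""
    return datetime.now(timezone.utc) - last_updated <= max_age

def check_volume(row_count: int, expected_min: int, expected_max: int) -> bool:
    """Pillar 2: is today's row count inside the expected band?"""
    return expected_min <= row_count <= expected_max

fresh = check_freshness(datetime.now(timezone.utc) - timedelta(minutes=3),
                        max_age=timedelta(minutes=5))
volume_ok = check_volume(row_count=10_500, expected_min=9_000, expected_max=12_000)
print(fresh, volume_ok)  # True True
```

Schema, distribution, and lineage need metadata the warehouse already holds (information_schema, query history), which is why observability vendors build on those system tables.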

8.3 2025 Tools

  • Great Expectations: OSS, powerful
  • Soda: SQL-based, simple
  • Monte Carlo: commercial, ML anomaly detection
  • Bigeye: auto-SLA learning
  • Elementary: dbt-integrated

Part 9 — Modern Data Stack 2025

9.1 Canonical Setup

[Sources]
  |- OLTP DBs (Postgres/MySQL)
  |- SaaS APIs (Salesforce, Stripe)
  \- Event Streams (Kafka, Kinesis)
[Ingestion]
  |- Fivetran / Airbyte
  \- Debezium (CDC)
[Warehouse/Lakehouse]
  |- Snowflake / BigQuery / Databricks
  \- Iceberg on S3 + Trino
[Transformation]
  \- dbt (or SQLMesh)
[Orchestration]
  \- Dagster / Airflow
[Serving]
  |- BI: Looker, Metabase, Hex, Preset
  |- ML: Feature Store (Feast)
  \- Reverse ETL: Census, Hightouch
[Observability]
  |- Monte Carlo / Elementary
  \- DataHub
9.2 2025 Trends

  • Semantic Layer: dbt Semantic Layer, Cube, MetricFlow
  • Reverse ETL: Warehouse to CRM/ad platforms
  • Zero-copy Cloning: Snowflake/Databricks
  • Open Table Format convergence: around Iceberg
  • AI-assisted data engineering: copilots drafting dbt models

Part 10 — Six-Month Data Engineer Roadmap

  • Month 1: SQL + warehouse (advanced PostgreSQL, dbt basics)
  • Month 2: Orchestration + pipelines (Airflow or Dagster)
  • Month 3: Lakehouse (Iceberg or Delta, Trino/DuckDB, Parquet tuning)
  • Month 4: Streaming (Kafka basics, Flink or Spark Streaming, CDC)
  • Month 5: Quality + observability (Great Expectations/Soda, DataHub, Data Contract)
  • Month 6: Operations + ML (FinOps, Feature Store, Semantic Layer)

Part 11 — 12-Item Checklist

  1. Differences between the three Lakehouse table formats
  2. Lambda vs Kappa
  3. Event Time vs Processing Time + Watermark
  4. Medallion tier responsibilities
  5. How ref() in dbt builds the graph
  6. CDC pipeline topology
  7. Dagster Asset vs Airflow Task
  8. Data Mesh 4 principles
  9. Data Contract definition and need
  10. Six quality dimensions
  11. Five observability pillars
  12. Semantic Layer purpose

Part 12 — Ten Anti-patterns

  1. Skipping Bronze: reprocessing impossible
  2. Clinging to Lambda: duplicate logic
  3. Raw SQL without dbt: no deps/tests/docs
  4. Sharing data without a Contract: untraceable breakage
  5. Parquet without partition design: slow queries
  6. High-frequency UPSERTs without Merge-on-Read or compaction: small-file explosion
  7. Business logic inside Airflow DAGs: push to dbt/Flink
  8. Observability later: track quality from day one
  9. No idempotency: retries corrupt data
  10. No cost monitoring: warehouse bills explode
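Anti-pattern 9's fix is worth one sketch: an idempotent merge keyed on a natural key, so re-running a failed batch cannot duplicate rows. The in-memory target is a stand-in; on a lakehouse table the same property comes from MERGE INTO on the same key.

```python
target = {}  # stand-in for the target table, keyed by the natural key

def merge_batch(batch):
    """Idempotent write: the same key always lands on the same row."""
    for row in batch:
        target[row["order_id"]] = row  # upsert, never append

batch = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 50}]
merge_batch(batch)
merge_batch(batch)  # retry after a failure: same state, no duplicates

print(len(target), target[1]["amount"])  # 2 100
```

Contrast with a naive `INSERT`-style append, where the retry would have left four rows and doubled revenue.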

Closing — Data Engineering Is "Contract Engineering"

If software engineering is the engineering of functions and interfaces, data engineering is the engineering of schemas and contracts.

The essence in 2025:

  • deliver trustworthy data in the right shape at the right time
  • be the foundation that decisions, ML, and products can rely on

Tools churn yearly. Lakehouse, Streaming, dbt, Dagster, Data Contract did not exist in 2020; 2030 will bring more. What endures:

  • data historicity (preserve Bronze)
  • schema evolution (Contract)
  • idempotent processing (safe retries)
  • observability (detect problems)

Keep these and the tools swap in and out as needed.


Next — "Observability Complete Guide: Metric, Log, Trace, OpenTelemetry, eBPF, SLO"

Season 2 Ep 9 covers the nervous system of modern systems, Observability: three axes plus Profile (Pyroscope), the real value of OpenTelemetry, eBPF at the kernel level, SLO/SLI/Error Budget design, Grafana vs Elastic vs Datadog, and cost control (stopping log explosions).

"If you cannot observe it, you cannot operate it." Continued next time.