Data Engineering Complete Guide — Lakehouse, Streaming, dbt, Orchestration, Data Mesh (Season 2 Ep 8, 2025)


Intro — Expectations for Data Engineers in 2025

Ten years ago a data engineer wrote ETL scripts and operated Hadoop clusters. In 2025 the expectations are broader:

  • Lakehouse design: pick and operate Iceberg, Delta, or Hudi
  • Streaming + Batch unification: Lambda is out, Kappa is in
  • Modern Data Stack: dbt + Airflow/Dagster + Fivetran/Airbyte + BigQuery/Snowflake
  • Data Mesh: centralized vs distributed org model
  • Data Contract: schema-as-contract (Protobuf-like)
  • AI/ML integration: Feature Store, Vector DB, MLOps touchpoints
  • Cost management: cloud data cost control (FinOps)

This post captures the mental frame of a 2025 data engineer.


Part 1 — Lakehouse: Data Lake + Warehouse Unified

1.1 Historical Context

Era   | Architecture                      | Characteristic
2000s | Data Warehouse (Teradata, Oracle) | Structured data only, expensive
2010s | Data Lake (Hadoop, S3)            | Any shape, but no ACID / poor query performance
2020s | Lakehouse (Iceberg, Delta, Hudi)  | Best of both

Definition: object storage (S3/GCS/ADLS) + table format (Iceberg etc.) + query engine (Spark, Trino, DuckDB) delivering warehouse-grade ACID and performance.

1.2 Three Table Formats (2024-2025)

Item              | Iceberg         | Delta Lake         | Hudi
Origin            | Netflix         | Databricks         | Uber
Governance        | Apache          | Linux Foundation   | Apache
Engine neutrality | Best            | Databricks-leaning | Neutral
Real-time upsert  | Good            | Good               | Best
Time travel       | Yes             | Yes                | Yes
Schema evolution  | Strong          | Strong             | Medium
2025 share        | Rapidly growing | Leader (slipping)  | Niche

Key 2024 event: Databricks acquired Tabular (Iceberg founders), pointing toward Iceberg/Delta convergence.

2025 pick: Iceberg for new projects. Delta is fine on Databricks.

1.3 Iceberg Structure

metadata.json (table metadata)
  -> Manifest List (Avro)
       |- Manifest 1 (Avro)
       |    \- Data Files (Parquet)
       |- Manifest 2 (Avro)
       |    \- Data Files (Parquet)
       \- ...

  • Snapshot: a point-in-time table state (enables time travel)
  • Partition Evolution: the partition spec can change without rewriting old data
  • Hidden Partitioning: users write WHERE year=2025; the engine maps it to internal partitions
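The idea behind hidden partitioning can be sketched in a few lines. This is an illustrative toy, not Iceberg's implementation: a `year()`-style transform maps a source column to a partition value recorded per data file, so the engine can prune files without the user ever naming the partition column. The file names and dict layout are invented for the example.

```python
from datetime import datetime

def year_transform(ts: datetime) -> int:
    """Iceberg-style year() partition transform (toy version)."""
    return ts.year

# Each data file records the partition value its rows fall into.
data_files = [
    {"path": "f1.parquet", "partition_year": 2024},
    {"path": "f2.parquet", "partition_year": 2025},
    {"path": "f3.parquet", "partition_year": 2025},
]

def prune(files, predicate_year: int):
    """Engine-side pruning: WHERE year(ts) = 2025 touches only matching files."""
    return [f["path"] for f in files if f["partition_year"] == predicate_year]

print(prune(data_files, 2025))  # ['f2.parquet', 'f3.parquet']
```

The user's query never mentions `partition_year`; the transform is the table's private concern, which is exactly what lets the partition spec evolve later.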

1.4 Lakehouse Query Engines 2025

Engine         | Strength
Spark          | General-purpose king, Delta-native
Trino (Presto) | Interactive SQL, multi-source
DuckDB         | Local/embedded, blazing fast
Snowflake      | Best operational UX
BigQuery       | Serverless, GCP-integrated
ClickHouse     | Real-time analytics
Databricks SQL | Delta-optimized

Part 2 — Streaming: The Kappa Victory

2.1 Lambda vs Kappa

Lambda (2010s): Batch Layer (accurate) + Speed Layer (fast) + Serving. Problem: same logic implemented twice — maintenance hell.

Kappa (2014, Jay Kreps): streaming only; reprocessing means replaying the log through the same streaming code. The de facto standard in 2025.

2.2 Streaming Engines 2025

Engine                     | Trait
Apache Flink               | Most mature & feature-rich, stateful
Kafka Streams              | Kafka-native Java library
Spark Structured Streaming | Unified with batch API
Materialize                | PostgreSQL-compatible SQL
RisingWave                 | Materialize alternative, Rust
Arroyo                     | New, Rust-based

2.3 Ten Streaming Concepts

  1. Event Time vs Processing Time
  2. Watermark: "all events up to time T have arrived" signal
  3. Windowing: Tumbling, Sliding, Session
  4. Stateful Processing
  5. Exactly-Once
  6. Backpressure
  7. Checkpointing: recovery points
  8. Join: Stream-Stream, Stream-Table
  9. CDC (Change Data Capture)
  10. Deduplication
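Concepts 1-3 fit in one toy sketch. This is not a Flink API, just plain Python: events carry an event time, a tumbling window groups them by that time, and a watermark (here a simple "max event time minus 5 seconds of allowed lateness" heuristic, an assumption for the example) decides when a window may be closed.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window size in seconds

def window_start(event_time: int) -> int:
    """Map an event time to the start of its tumbling window."""
    return event_time - (event_time % WINDOW)

events = [  # (event_time, value) — note t=9 arrives late, after t=12
    (3, 1), (7, 1), (12, 1), (9, 1), (15, 1),
]

windows = defaultdict(int)  # window start -> aggregated value
watermark = 0
closed = []

for event_time, value in events:
    windows[window_start(event_time)] += value
    # Watermark heuristic: max event time seen minus 5s allowed lateness.
    watermark = max(watermark, event_time - 5)
    # A window may be emitted once the watermark passes its end.
    for start in sorted(windows):
        if start + WINDOW <= watermark and start not in closed:
            closed.append(start)

print(dict(windows))  # {0: 3, 10: 2}
```

The late event at t=9 still lands in window [0, 10) because the watermark had not yet passed 10; with a smaller lateness allowance it would have been dropped or routed to a side output.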

2.4 CDC — The Modern Integration Primitive

PostgreSQL/MySQL -> Debezium -> Kafka -> Flink/Spark -> Iceberg

Eliminates batch ETL, near-real-time sync. Tools: Debezium (OSS), Fivetran, Airbyte.
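The downstream half of that pipeline boils down to applying change events to a target table. A minimal sketch, using Debezium's envelope field names (`op` with `c`/`u`/`d`/`r`, `before`, `after`) but an in-memory dict instead of Kafka and Iceberg:

```python
table = {}  # target state keyed by primary key

changes = [  # Debezium-style change events (payload simplified)
    {"op": "c", "after": {"id": 1, "email": "a@x.com"}},   # insert
    {"op": "u", "after": {"id": 1, "email": "a@y.com"}},   # update
    {"op": "c", "after": {"id": 2, "email": "b@x.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@x.com"}},  # delete
]

for event in changes:
    if event["op"] in ("c", "u", "r"):  # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row          # upsert on the primary key
    elif event["op"] == "d":
        table.pop(event["before"]["id"], None)

print(table)  # {1: {'id': 1, 'email': 'a@y.com'}}
```

A real Flink/Spark job does the same merge, just with ordering guarantees per key and transactional writes into the table format.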

2.5 Example Real-time Platform 2025

[Kafka] -> [Flink SQL] -> [Iceberg (bronze/silver/gold)] -> [Trino/DuckDB]
                   |
                   +-> [Materialize] (live dashboards)
                   |
                   +-> [Redis] (low-latency serving)

Part 3 — Medallion Architecture: Bronze/Silver/Gold

3.1 Three Tiers

  1. Bronze (Raw): original data, absorbs schema drift
  2. Silver (Cleansed): validated, deduped, unified schema
  3. Gold (Business): domain aggregates, BI/ML-ready

3.2 Responsibilities

Tier   | Owner                            | Frequency
Bronze | Data engineer                    | Real-time to hourly
Silver | Data engineer                    | Hourly to daily
Gold   | Analytics engineer + domain team | Daily to weekly

3.3 Benefits

  • Reprocessable (Bronze preserved)
  • Clear contract (Silver = canonical schema)
  • Business logic isolated (Gold)
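A toy end-to-end flow makes the tier responsibilities concrete. The sample records are invented; Bronze keeps everything verbatim (duplicates and bad values included), Silver validates and dedupes, Gold aggregates for business use:

```python
bronze = [  # raw ingests, kept as-is for reprocessability
    {"order_id": 1, "amount": "120", "country": "DE"},
    {"order_id": 1, "amount": "120", "country": "DE"},  # duplicate
    {"order_id": 2, "amount": "abc", "country": "FR"},  # invalid amount
    {"order_id": 3, "amount": "80", "country": "DE"},
]

def to_silver(rows):
    """Validate types and dedupe on the business key."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen or not r["amount"].isdigit():
            continue
        seen.add(r["order_id"])
        out.append({**r, "amount": int(r["amount"])})
    return out

def to_gold(rows):
    """Business aggregate: revenue per country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'DE': 200}
```

The invalid FR row is dropped at Silver, not at Bronze; because Bronze is preserved, a fixed `to_silver` can always be re-run over history.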

Part 4 — dbt: The Analytics Engineering Standard

4.1 What dbt Is

"SQL modeling as software engineering" — version control, tests, docs, dependency graph.

4.2 Core Layout

models/
  staging/
    stg_orders.sql
    stg_customers.sql
  marts/
    orders.sql
    revenue_daily.sql
tests/
seeds/
macros/

4.3 Model Example

-- models/marts/orders.sql
{{ config(materialized='table', partition_by={'field': 'order_date', 'data_type': 'date'}) }}

with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
)
select
    o.order_id,
    o.order_date,
    c.country,
    o.amount
from orders o
join customers c using (customer_id)
where o.status = 'completed'

{{ ref('stg_orders') }} — dbt auto-builds the dependency graph.
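Conceptually, each `ref()` call registers an edge in a DAG, and dbt builds models in topological order. A hand-rolled sketch of that idea (dbt itself extracts the edges by parsing the Jinja; here they are listed by hand to match the model above):

```python
# model -> list of upstream models it ref()s
models = {
    "stg_orders": [],
    "stg_customers": [],
    "orders": ["stg_orders", "stg_customers"],
}

def build_order(deps):
    """Depth-first topological sort: parents always build before children."""
    done, order = set(), []
    def visit(node):
        if node in done:
            return
        for parent in deps[node]:
            visit(parent)
        done.add(node)
        order.append(node)
    for node in deps:
        visit(node)
    return order

print(build_order(models))  # ['stg_orders', 'stg_customers', 'orders']
```

This is why renaming a staging model breaks loudly at compile time rather than silently at query time: the graph, not a comment, records the dependency.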

4.4 Tests and Documentation

# schema.yml
version: 2
models:
  - name: orders
    description: "Completed orders (Gold layer)"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Run the tests and build the docs:

dbt test
dbt docs generate && dbt docs serve

4.5 2025 dbt Ecosystem

  • dbt Core: OSS CLI
  • dbt Cloud: SaaS (IDE + scheduler)
  • dbt-osmosis: metadata propagation
  • Elementary: data quality + observability
  • SQLMesh: dbt alternative, stronger dependency handling
  • Dagster + dbt: tighter orchestration

Part 5 — Orchestration: Airflow vs Dagster vs Prefect vs Temporal

5.1 2025 Comparison

Tool     | Strength                       | Weakness          | Fits
Airflow  | Mature, huge ecosystem         | Dated UX          | Large, mature teams
Dagster  | Data-aware, typed              | Learning curve    | Data-centric teams
Prefect  | Pythonic, much improved in 2.x | Smaller community | Mid-size teams
Temporal | Workflow reliability           | Not data-specific | Long-running workflows
Kestra   | Declarative YAML               | New               | Simple pipelines

5.2 Dagster: "Data-aware Orchestration"

Airflow manages tasks. Dagster manages Assets (data outputs).

from dagster import asset

@asset
def raw_orders():
    # Placeholder extraction helper — swap in your own Postgres reader.
    return fetch_from_postgres("orders")

@asset
def cleaned_orders(raw_orders):
    # The upstream asset is injected by parameter name; a DataFrame is assumed.
    return raw_orders.dropna()

@asset
def revenue_by_day(cleaned_orders):
    return cleaned_orders.groupby("date").sum()

Dependencies, schemas, types all explicit. Execution graph = Asset graph.

5.3 Airflow 2.x Improvements

  • TaskFlow API (Pythonic)
  • Dynamic Task Mapping
  • Dataset Triggering (inspired by Dagster)
  • Airflow 3.0 (2025): data-centric overhaul

Part 6 — Data Mesh: Organizational Data Architecture

6.1 Background

Limits of centralized data teams: weak domain knowledge, bottleneck, priority conflicts.

6.2 Four Principles (Zhamak Dehghani, 2019)

  1. Domain Ownership: domains own their data
  2. Data as a Product: dataset = product (quality, docs, SLA)
  3. Self-serve Platform: central team provides platform
  4. Federated Governance: central rules + domain execution

6.3 Field Reality

Works when: 500+ engineers, clear domain boundaries, mature platform-engineering org.

Fails when: under 100 engineers (over-engineering), weak domain teams, no central platform.

2025 reality: from ideology to pragmatic adjustment — Hub-and-Spoke or partial Mesh is common.


Part 7 — Data Contract: The Team Interface

7.1 Why

Problem: an upstream team renames user.full_name to user.name; 100 downstream pipelines break, and nobody can trace why.

7.2 Definition

Schema + semantics + SLA as a formal promise. Often Protobuf or JSON Schema.

# user_contract.yml
name: user_events
version: 1.2.0
owner: user-platform-team
schema:
  event_id: string
  user_id: string
  event_type: enum [signup, login, logout, purchase]
  timestamp: timestamp
  properties: map
sla:
  freshness: under 5 minutes
  availability: 99.9%
  breaking_change_notice: 30 days
consumers:
  - analytics-team
  - ml-team
  - finance-team
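Enforcing the schema half of such a contract at the producer boundary can be sketched in plain Python. A real setup would compile the contract to Protobuf or JSON Schema; the `CONTRACT` dict below hand-mirrors a slice of user_contract.yml, and the sample events are invented:

```python
CONTRACT = {
    "event_id": str,
    "user_id": str,
    "event_type": {"signup", "login", "logout", "purchase"},  # enum
}

def validate(event: dict) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors = []
    for field, rule in CONTRACT.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif isinstance(rule, set) and event[field] not in rule:
            errors.append(f"invalid enum value: {field}={event[field]}")
        elif isinstance(rule, type) and not isinstance(event[field], rule):
            errors.append(f"wrong type: {field}")
    return errors

ok = {"event_id": "e1", "user_id": "u1", "event_type": "signup"}
bad = {"event_id": "e2", "event_type": "deleted"}
print(validate(ok))   # []
print(validate(bad))  # ['missing field: user_id', 'invalid enum value: event_type=deleted']
```

Running this in the producer's CI is what turns a silent downstream breakage into a blocked deploy.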

7.3 Tools (2024-2025)

  • Protobuf + Buf: engineering-team centric
  • Great Expectations: data quality validation
  • dbt Contracts (1.5+): model-level contracts
  • DataHub: metadata catalog
  • Apache Atlas: Hadoop ecosystem

7.4 Contract-first Pipeline

Producer Team                         Consumer Team
  |                                        |
  |- Commits schema change to Buf -> Review
  |                                        |
  |- CI: backward compat check             |
  |                                        |
  |- Deploy producer                       |
  |                                        |
  \- Emit events ----- Kafka --------> Update consumer

Part 8 — Data Quality: The Trust Foundation

8.1 Six Dimensions

  1. Accuracy
  2. Completeness
  3. Consistency (across systems)
  4. Timeliness
  5. Uniqueness
  6. Validity (format and range)

8.2 Five Pillars of Data Observability (Monte Carlo)

  1. Freshness: last update time
  2. Volume: expected row count range
  3. Schema: column/type changes
  4. Distribution: value distribution shifts
  5. Lineage: where from, where to
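Two of the five pillars (freshness and volume) reduce to checks a scheduler can run on every refresh. A minimal sketch with invented thresholds — commercial tools mostly add ML-learned bounds and alert routing on top of this shape:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated: datetime, max_age: timedelta) -> bool:
    """Pillar 1: was the table updated recently enough?"""
    return datetime.now(timezone.utc) - last_updated <= max_age

def check_volume(row_count: int, expected_min: int, expected_max: int) -> bool:
    """Pillar 2: is today's row count inside the expected band?"""
    return expected_min <= row_count <= expected_max

fresh = check_freshness(datetime.now(timezone.utc) - timedelta(minutes=3),
                        max_age=timedelta(minutes=5))
volume_ok = check_volume(row_count=10_500, expected_min=9_000, expected_max=12_000)
print(fresh, volume_ok)  # True True
```

Schema, distribution, and lineage need metadata the warehouse already holds (information_schema, query history), which is why observability vendors build on those system tables.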

8.3 2025 Tools

  • Great Expectations: OSS, powerful
  • Soda: SQL-based, simple
  • Monte Carlo: commercial, ML anomaly detection
  • Bigeye: auto-SLA learning
  • Elementary: dbt-integrated

Part 9 — Modern Data Stack 2025

9.1 Canonical Setup

[Sources]
  |- OLTP DBs (Postgres/MySQL)
  |- SaaS APIs (Salesforce, Stripe)
  \- Event Streams (Kafka, Kinesis)
[Ingestion]
  |- Fivetran / Airbyte
  \- Debezium (CDC)
[Warehouse/Lakehouse]
  |- Snowflake / BigQuery / Databricks
  \- Iceberg on S3 + Trino
[Transformation]
  \- dbt (or SQLMesh)
[Orchestration]
  \- Dagster / Airflow
[Serving]
  |- BI: Looker, Metabase, Hex, Preset
  |- ML: Feature Store (Feast)
  \- Reverse ETL: Census, Hightouch
[Observability]
  |- Monte Carlo / Elementary
  \- DataHub
9.2 2025 Trends

  • Semantic Layer: dbt Semantic Layer, Cube, MetricFlow
  • Reverse ETL: Warehouse to CRM/ad platforms
  • Zero-copy Cloning: Snowflake/Databricks
  • Open Table Format convergence: around Iceberg
  • AI-assisted data engineering: copilots drafting dbt models

Part 10 — Six-Month Data Engineer Roadmap

  • Month 1: SQL + warehouse (advanced PostgreSQL, dbt basics)
  • Month 2: Orchestration + pipelines (Airflow or Dagster)
  • Month 3: Lakehouse (Iceberg or Delta, Trino/DuckDB, Parquet tuning)
  • Month 4: Streaming (Kafka basics, Flink or Spark Streaming, CDC)
  • Month 5: Quality + observability (Great Expectations/Soda, DataHub, Data Contract)
  • Month 6: Operations + ML (FinOps, Feature Store, Semantic Layer)

Part 11 — 12-Item Checklist

  1. Differences between the three Lakehouse table formats
  2. Lambda vs Kappa
  3. Event Time vs Processing Time + Watermark
  4. Medallion tier responsibilities
  5. How ref() in dbt builds the graph
  6. CDC pipeline topology
  7. Dagster Asset vs Airflow Task
  8. Data Mesh 4 principles
  9. Data Contract definition and need
  10. Six quality dimensions
  11. Five observability pillars
  12. Semantic Layer purpose

Part 12 — Ten Anti-patterns

  1. Skipping Bronze: reprocessing impossible
  2. Clinging to Lambda: duplicate logic
  3. Raw SQL without dbt: no deps/tests/docs
  4. Sharing data without a Contract: untraceable breakage
  5. Parquet without partition design: slow queries
  6. High-frequency UPSERTs without Merge-on-Read or compaction: small-file explosion
  7. Business logic inside Airflow DAGs: push to dbt/Flink
  8. Observability later: track quality from day one
  9. No idempotency: retries corrupt data
  10. No cost monitoring: warehouse bills explode
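Anti-pattern 9's fix is worth one sketch: an idempotent merge keyed on a natural key, so re-running a failed batch cannot duplicate rows. The in-memory target is a stand-in; on a lakehouse table the same property comes from MERGE INTO on the same key.

```python
target = {}  # stand-in for the target table, keyed by the natural key

def merge_batch(batch):
    """Idempotent write: the same key always lands on the same row."""
    for row in batch:
        target[row["order_id"]] = row  # upsert, never append

batch = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 50}]
merge_batch(batch)
merge_batch(batch)  # retry after a failure: same state, no duplicates

print(len(target), target[1]["amount"])  # 2 100
```

Contrast with a naive `INSERT`-style append, where the retry would have left four rows and doubled revenue.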

Closing — Data Engineering Is "Contract Engineering"

If software engineering is the engineering of functions and interfaces, data engineering is the engineering of schemas and contracts.

The essence in 2025:

  • deliver trustworthy data in the right shape at the right time
  • be the foundation that decisions, ML, and products can rely on

Tools churn yearly. Lakehouse, Streaming, dbt, Dagster, Data Contract did not exist in 2020; 2030 will bring more. What endures:

  • data historicity (preserve Bronze)
  • schema evolution (Contract)
  • idempotent processing (safe retries)
  • observability (detect problems)

Keep these and the tools swap in and out as needed.


Next — "Observability Complete Guide: Metric, Log, Trace, OpenTelemetry, eBPF, SLO"

Season 2 Ep 9 covers the nervous system of modern systems, Observability: three axes plus Profile (Pyroscope), the real value of OpenTelemetry, eBPF at the kernel level, SLO/SLI/Error Budget design, Grafana vs Elastic vs Datadog, and cost control (stopping log explosions).

"If you cannot observe it, you cannot operate it." Continued next time.