Data Engineering Complete Guide — Lakehouse, Streaming, dbt, Orchestration, Data Mesh (Season 2 Ep 8, 2025)
Author: Youngju Kim (@fjvbn20031)
Intro — Expectations for Data Engineers in 2025
Ten years ago a data engineer wrote ETL scripts and operated Hadoop clusters. In 2025 the expectations look like this:
- Lakehouse design: pick and operate Iceberg, Delta, or Hudi
- Streaming + Batch unification: Lambda is out, Kappa is in
- Modern Data Stack: dbt + Airflow/Dagster + Fivetran/Airbyte + BigQuery/Snowflake
- Data Mesh: centralized vs distributed org model
- Data Contract: schema-as-contract (Protobuf-like)
- AI/ML integration: Feature Store, Vector DB, MLOps touchpoints
- Cost management: cloud data cost control (FinOps)
This post captures the mental frame of a 2025 data engineer.
Part 1 — Lakehouse: Data Lake + Warehouse Unified
1.1 Historical Context
| Era | Architecture | Limitation |
|---|---|---|
| 2000s | Data Warehouse (Teradata, Oracle) | Structured only, expensive |
| 2010s | Data Lake (Hadoop, S3) | Any shape, but no ACID / poor query perf |
| 2020s | Lakehouse (Iceberg, Delta, Hudi) | Best of both |
Definition: object storage (S3/GCS/ADLS) + table format (Iceberg etc.) + query engine (Spark, Trino, DuckDB) delivering warehouse-grade ACID and performance.
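To get a feel for how light the query side can be, here is a minimal sketch that reads an Iceberg table straight from object storage with DuckDB's `iceberg` extension. The bucket path and table location are placeholders, and it assumes the `httpfs` and `iceberg` extensions are available in your DuckDB build and that S3 credentials come from the environment:

```python
import duckdb

con = duckdb.connect()

# httpfs gives S3 access, iceberg adds the table format (assumes both extensions are installable)
for ext in ("httpfs", "iceberg"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")

# The table location is a placeholder; credentials are taken from the environment in this sketch
con.sql("""
    SELECT country, sum(amount) AS revenue
    FROM iceberg_scan('s3://my-bucket/warehouse/gold/orders')
    GROUP BY country
    ORDER BY revenue DESC
""").show()
```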
1.2 Three Table Formats (2024-2025)
| Item | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Origin | Netflix | Databricks | Uber |
| Governance | Apache | Linux Foundation | Apache |
| Engine neutrality | Best | Databricks-leaning | Neutral |
| Real-time upsert | Good | Good | Best |
| Time travel | Yes | Yes | Yes |
| Schema evolution | Strong | Strong | Medium |
| 2025 share | Rapidly growing | Leader (slipping) | Niche |
Key 2024 event: Databricks acquired Tabular (Iceberg founders), pointing toward Iceberg/Delta convergence.
2025 pick: Iceberg for new projects. Delta is fine on Databricks.
1.3 Iceberg Structure
Metadata (JSON)
  -> Manifest List (Avro)
       |- Manifest 1 (Avro)
       |    \- Data Files (Parquet)
       |- Manifest 2
       |    \- Data Files
       \- ...
- Snapshot: a point-in-time state of the table (enables time travel)
- Partition Evolution: the partition spec can change without rewriting existing data
- Hidden Partitioning: users simply write `WHERE year = 2025`; the engine maps it to the internal partitions (see the sketch below)
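A minimal PySpark sketch of what snapshot-based time travel and hidden partitioning look like from the query side. The catalog name `lake`, the warehouse path, the table names, and the snapshot id are assumptions, and the Iceberg Spark runtime package has to be on the classpath:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime is available and configures a catalog named "lake"
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")                          # file-based catalog for the sketch
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")   # placeholder path
    .getOrCreate()
)

# Hidden partitioning: the query filters on a plain column; Iceberg prunes partitions internally
spark.sql("""
    SELECT count(*) FROM lake.sales.orders
    WHERE order_date >= DATE '2025-01-01'
""").show()

# Time travel: read the table as of an earlier snapshot or timestamp (snapshot id is a placeholder)
spark.sql("SELECT * FROM lake.sales.orders VERSION AS OF 4032984759814582000").show()
spark.sql("SELECT * FROM lake.sales.orders TIMESTAMP AS OF '2025-01-01 00:00:00'").show()
```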
1.4 Lakehouse Query Engines 2025
| Engine | Strength |
|---|---|
| Spark | General-purpose king, Delta-native |
| Trino (Presto) | Interactive SQL, multi-source |
| DuckDB | Local/embedded, blazing fast |
| Snowflake | Best operational UX |
| BigQuery | Serverless, GCP-integrated |
| ClickHouse | Real-time analytics |
| Databricks SQL | Delta-optimized |
Part 2 — Streaming: The Kappa Victory
2.1 Lambda vs Kappa
Lambda (2010s): Batch Layer (accurate) + Speed Layer (fast) + Serving. Problem: same logic implemented twice — maintenance hell.
Kappa (2014, Jay Kreps): Streaming only. Re-process by restarting the stream. Standard in 2025.
2.2 Streaming Engines 2025
| Engine | Trait |
|---|---|
| Apache Flink | Most mature & feature-rich, stateful |
| Kafka Streams | Kafka-native Java library |
| Spark Structured Streaming | Unified with batch API |
| Materialize | PostgreSQL-compatible SQL |
| RisingWave | Materialize alternative, Rust |
| Arroyo | New, Rust-based |
2.3 Ten Streaming Concepts
- Event Time vs Processing Time
- Watermark: a signal that "all events up to time T have arrived" (see the sketch after this list)
- Windowing: Tumbling, Sliding, Session
- Stateful Processing
- Exactly-Once
- Backpressure
- Checkpointing: recovery points
- Join: Stream-Stream, Stream-Table
- CDC (Change Data Capture)
- Deduplication
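To make Event Time, Watermark, and tumbling windows concrete, here is a minimal sketch in Spark Structured Streaming. The Kafka topic, broker address, and event schema are assumptions, and the `spark-sql-kafka` connector package must be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("clicks-per-window").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("event_time", TimestampType()))

# Kafka topic and broker are placeholders; requires the spark-sql-kafka connector package
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "clicks")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Event time + watermark: events more than 10 minutes late are dropped from the aggregation
clicks_per_window = (events
                     .withWatermark("event_time", "10 minutes")
                     .groupBy(window(col("event_time"), "5 minutes"),  # tumbling window
                              col("user_id"))
                     .count())

(clicks_per_window.writeStream
 .outputMode("update")
 .format("console")
 .start()
 .awaitTermination())
```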
2.4 CDC — The Modern Integration Primitive
PostgreSQL/MySQL -> Debezium -> Kafka -> Flink/Spark -> Iceberg
Eliminates batch ETL, near-real-time sync. Tools: Debezium (OSS), Fivetran, Airbyte.
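What a consumer actually sees on the Kafka side is the Debezium change envelope. A minimal sketch that reads it with the `kafka-python` package, assuming the default envelope (no unwrap SMT); the topic name and broker address are placeholders:

```python
import json

from kafka import KafkaConsumer  # assumes the kafka-python package is installed

# Topic and broker are placeholders; Debezium names topics <server>.<schema>.<table>
consumer = KafkaConsumer(
    "pg.public.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Default Debezium envelope: payload carries before/after row images and an op code
for message in consumer:
    payload = message.value.get("payload", {})
    op = payload.get("op")  # "c" insert, "u" update, "d" delete, "r" snapshot read
    if op == "d":
        print("delete", payload.get("before"))
    else:
        print("upsert", payload.get("after"))
```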
2.5 Example Real-time Platform 2025
[Kafka] -> [Flink SQL] -> [Iceberg (bronze/silver/gold)] -> [Trino/DuckDB]
   |
   +-> [Materialize] (live dashboards)
   |
   +-> [Redis] (low-latency serving)
Part 3 — Medallion Architecture: Bronze/Silver/Gold
3.1 Three Tiers
- Bronze (Raw): original data, absorbs schema drift
- Silver (Cleansed): validated, deduped, unified schema (see the sketch after this list)
- Gold (Business): domain aggregates, BI/ML-ready
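A minimal Bronze-to-Silver sketch in PySpark. The table names (`lake.bronze.orders_raw`, `lake.silver.orders`), the `ingested_at` column added at ingestion, and the column list are all assumptions made for the example:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Table names are placeholders for an existing lakehouse catalog
bronze = spark.read.table("lake.bronze.orders_raw")

latest_first = Window.partitionBy("order_id").orderBy(F.col("ingested_at").desc())

silver = (
    bronze
    # validity: drop rows missing required keys
    .filter(F.col("order_id").isNotNull() & F.col("amount").isNotNull())
    # uniqueness: keep only the most recently ingested record per order_id
    .withColumn("rn", F.row_number().over(latest_first))
    .filter(F.col("rn") == 1)
    .drop("rn")
    # unified schema: canonical names and types for downstream consumers
    .select(
        F.col("order_id").cast("string"),
        F.col("order_ts").cast("timestamp").alias("order_time"),
        F.col("amount").cast("decimal(18,2)"),
        F.col("status").cast("string"),
    )
)

silver.writeTo("lake.silver.orders").createOrReplace()
```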
3.2 Responsibilities
| Tier | Owner | Frequency |
|---|---|---|
| Bronze | Data engineer | Real-time to hourly |
| Silver | Data engineer | Hourly to daily |
| Gold | Analytics engineer + domain team | Daily to weekly |
3.3 Benefits
- Reprocessable (Bronze preserved)
- Clear contract (Silver = canonical schema)
- Business logic isolated (Gold)
Part 4 — dbt: The Analytics Engineering Standard
4.1 What dbt Is
"SQL modeling as software engineering" — version control, tests, docs, dependency graph.
4.2 Core Layout
models/
  staging/
    stg_orders.sql
    stg_customers.sql
  marts/
    orders.sql
    revenue_daily.sql
tests/
seeds/
macros/
4.3 Model Example
-- models/marts/orders.sql
{{ config(
    materialized='table',
    partition_by={'field': 'order_date', 'data_type': 'date'}
) }}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
)

select
    o.order_id,
    o.order_date,
    c.country,
    o.amount
from orders o
join customers c using (customer_id)
where o.status = 'completed'

`{{ ref('stg_orders') }}` — dbt builds the dependency graph automatically from these references.
4.4 Tests and Documentation
# schema.yml
models:
  - name: orders
    description: "Completed orders (Gold layer)"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"
dbt test
dbt docs generate && dbt docs serve
4.5 2025 dbt Ecosystem
- dbt Core: OSS CLI
- dbt Cloud: SaaS (IDE + scheduler)
- dbt-osmosis: metadata propagation
- Elementary: data quality + observability
- SQLMesh: dbt alternative, stronger dependency handling
- Dagster + dbt: tighter orchestration
Part 5 — Orchestration: Airflow vs Dagster vs Prefect vs Temporal
5.1 2025 Comparison
| Tool | Strength | Weakness | Fits |
|---|---|---|---|
| Airflow | Mature, huge ecosystem | Dated UX | Large, mature teams |
| Dagster | Data-aware, typed | Learning curve | Data-centric teams |
| Prefect | Pythonic, much improved in 2.x | Smaller community | Mid-size teams |
| Temporal | Workflow reliability | Not data-specific | Long-running workflows |
| Kestra | Declarative YAML | New | Simple pipelines |
5.2 Dagster: "Data-aware Orchestration"
Airflow manages tasks. Dagster manages Assets (data outputs).
from dagster import asset

@asset
def raw_orders():
    return fetch_from_postgres("orders")

@asset
def cleaned_orders(raw_orders):
    return raw_orders.dropna()

@asset
def revenue_by_day(cleaned_orders):
    return cleaned_orders.groupby("date").sum()
Dependencies, schemas, types all explicit. Execution graph = Asset graph.
5.3 Airflow 2.x Improvements
- TaskFlow API (Pythonic)
- Dynamic Task Mapping
- Dataset Triggering (inspired by Dagster; see the sketch after this list)
- Airflow 3.0 (2025): data-centric overhaul
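As a taste of TaskFlow plus Dataset Triggering, here is a minimal sketch with two DAGs, where the second runs whenever the first updates a Dataset. The DAG names, the dataset URI, and the task bodies are placeholders, and it assumes Airflow 2.4 or later:

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# The dataset URI is a placeholder; it only has to match between producer and consumer
raw_orders = Dataset("s3://lake/bronze/orders/")

@dag(start_date=datetime(2025, 1, 1), schedule="@hourly", catchup=False)
def ingest_orders():
    @task(outlets=[raw_orders])        # marks the dataset as updated when this task succeeds
    def land_raw_orders():
        ...                            # pull from the source and write to the bronze layer

    land_raw_orders()

@dag(start_date=datetime(2025, 1, 1), schedule=[raw_orders], catchup=False)  # dataset-triggered
def build_silver_orders():
    @task
    def clean_orders():
        ...                            # dedupe / validate bronze into silver

    clean_orders()

ingest_orders()
build_silver_orders()
```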
Part 6 — Data Mesh: Organizational Data Architecture
6.1 Background
Limits of centralized data teams: weak domain knowledge, bottleneck, priority conflicts.
6.2 Four Principles (Zhamak Dehghani, 2019)
- Domain Ownership: domains own their data
- Data as a Product: dataset = product (quality, docs, SLA)
- Self-serve Platform: central team provides platform
- Federated Governance: central rules + domain execution
6.3 Field Reality
Works when: 500+ engineers, clear domain boundaries, mature platform-engineering org.
Fails when: under 100 engineers (over-engineering), weak domain teams, no central platform.
2025 reality: from ideology to pragmatic adjustment — Hub-and-Spoke or partial Mesh is common.
Part 7 — Data Contract: The Team Interface
7.1 Why
Problem: the frontend team renames `user.full_name` to `user.name`, 100 downstream pipelines break, and nobody can trace why.
7.2 Definition
Schema + semantics + SLA as a formal promise. Often Protobuf or JSON Schema.
# user_contract.yml
name: user_events
version: 1.2.0
owner: user-platform-team
schema:
  event_id: string
  user_id: string
  event_type: enum [signup, login, logout, purchase]
  timestamp: timestamp
  properties: map
sla:
  freshness: under 5 minutes
  availability: 99.9%
  breaking_change_notice: 30 days
consumers:
  - analytics-team
  - ml-team
  - finance-team
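One way to turn such a contract into an executable check is a JSON Schema validation step in the producer's CI or inside the pipeline. A minimal sketch using the `jsonschema` package, with a hand-written schema mirroring the contract above; the schema object and the sample event are illustrative only:

```python
from jsonschema import validate, ValidationError  # assumes the jsonschema package

# A hand-written JSON Schema equivalent of the contract above (illustrative only)
USER_EVENT_SCHEMA = {
    "type": "object",
    "required": ["event_id", "user_id", "event_type", "timestamp"],
    "properties": {
        "event_id": {"type": "string"},
        "user_id": {"type": "string"},
        "event_type": {"enum": ["signup", "login", "logout", "purchase"]},
        "timestamp": {"type": "string", "format": "date-time"},
        "properties": {"type": "object"},
    },
    "additionalProperties": False,
}

def check_event(event: dict) -> bool:
    """Return True if the event satisfies the contract, False otherwise."""
    try:
        validate(instance=event, schema=USER_EVENT_SCHEMA)
        return True
    except ValidationError as err:
        print(f"contract violation: {err.message}")
        return False

check_event({"event_id": "e1", "user_id": "u1",
             "event_type": "signup", "timestamp": "2025-03-01T12:00:00Z"})
```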
7.3 Tools (2024-2025)
- Protobuf + Buf: engineering-team centric
- Great Expectations: data quality validation
- dbt Contracts (1.5+): model-level contracts
- DataHub: metadata catalog
- Apache Atlas: Hadoop ecosystem
7.4 Contract-first Pipeline
Producer Team                                 Consumer Team
      |                                             |
      |- Commits schema change to Buf  --------->  Review
      |                                             |
      |- CI: backward compat check                  |
      |                                             |
      |- Deploy producer                            |
      |                                             |
      \- Emit events -------- Kafka ------------->  Update consumer
Part 8 — Data Quality: The Trust Foundation
8.1 Six Dimensions
- Accuracy
- Completeness
- Consistency (across systems)
- Timeliness
- Uniqueness
- Validity (format and range)
8.2 Five Pillars of Data Observability (Monte Carlo)
- Freshness: last update time (see the sketch after this list)
- Volume: expected row count range
- Schema: column/type changes
- Distribution: value distribution shifts
- Lineage: where from, where to
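The first two pillars are easy to check by hand before reaching for a full observability tool. A minimal sketch of freshness and volume checks using DuckDB; the database path, table name, thresholds, and the naive-UTC timestamp assumption are all placeholders for the example:

```python
from datetime import datetime, timedelta

import duckdb

con = duckdb.connect("analytics.duckdb")  # database path and table names are placeholders

# Freshness: the newest row should be recent (timestamps assumed to be naive UTC here)
max_ts = con.execute("SELECT max(order_time) FROM silver_orders").fetchone()[0]
assert max_ts is not None and datetime.utcnow() - max_ts < timedelta(hours=2), \
    "freshness violated: silver_orders not updated in the last 2 hours"

# Volume: today's row count should fall inside an expected range
rows_today = con.execute(
    "SELECT count(*) FROM silver_orders WHERE order_time >= current_date"
).fetchone()[0]
assert 10_000 <= rows_today <= 1_000_000, f"volume anomaly: {rows_today} rows today"

print("freshness and volume checks passed")
```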
8.3 2025 Tools
- Great Expectations: OSS, powerful
- Soda: SQL-based, simple
- Monte Carlo: commercial, ML anomaly detection
- Bigeye: auto-SLA learning
- Elementary: dbt-integrated
Part 9 — Modern Data Stack 2025
9.1 Canonical Setup
[Sources]
  |- OLTP DBs (Postgres/MySQL)
  |- SaaS APIs (Salesforce, Stripe)
  \- Event Streams (Kafka, Kinesis)
[Ingestion]
  |- Fivetran / Airbyte
  \- Debezium (CDC)
[Warehouse/Lakehouse]
  |- Snowflake / BigQuery / Databricks
  \- Iceberg on S3 + Trino
[Transformation]
  \- dbt (or SQLMesh)
[Orchestration]
  \- Dagster / Airflow
[Serving]
  |- BI: Looker, Metabase, Hex, Preset
  |- ML: Feature Store (Feast)
  \- Reverse ETL: Census, Hightouch
[Observability]
  |- Monte Carlo / Elementary
  \- DataHub
9.2 New Trends
- Semantic Layer: dbt Semantic Layer, Cube, MetricFlow
- Reverse ETL: Warehouse to CRM/ad platforms
- Zero-copy Cloning: Snowflake/Databricks
- Open Table Format convergence: around Iceberg
- AI-assisted data engineering: copilots drafting dbt models
Part 10 — Six-Month Data Engineer Roadmap
- Month 1: SQL + warehouse (advanced PostgreSQL, dbt basics)
- Month 2: Orchestration + pipelines (Airflow or Dagster)
- Month 3: Lakehouse (Iceberg or Delta, Trino/DuckDB, Parquet tuning)
- Month 4: Streaming (Kafka basics, Flink or Spark Streaming, CDC)
- Month 5: Quality + observability (Great Expectations/Soda, DataHub, Data Contract)
- Month 6: Operations + ML (FinOps, Feature Store, Semantic Layer)
Part 11 — 12-Item Checklist
- Differences between the three Lakehouse table formats
- Lambda vs Kappa
- Event Time vs Processing Time + Watermark
- Medallion tier responsibilities
- How `ref()` in dbt builds the dependency graph
- CDC pipeline topology
- Dagster Asset vs Airflow Task
- Data Mesh 4 principles
- Data Contract definition and need
- Six quality dimensions
- Five observability pillars
- Semantic Layer purpose
Part 12 — Ten Anti-patterns
- Skipping Bronze: reprocessing impossible
- Clinging to Lambda: duplicate logic
- Raw SQL without dbt: no deps/tests/docs
- Sharing data without a Contract: untraceable breakage
- Parquet without partition design: slow queries
- Hourly UPSERT without merge-on-read (MoR) or compaction: small files pile up and reads slow down
- Business logic inside Airflow DAGs: push to dbt/Flink
- Observability later: track quality from day one
- No idempotency: retries corrupt data (see the sketch after this list)
- No cost monitoring: warehouse bills explode
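To make the idempotency point concrete, here is a minimal delete-then-insert sketch (DuckDB is used only for brevity; the database path, table names, and the run date are placeholders). Running it twice for the same date leaves the table in the same state instead of duplicating rows:

```python
import duckdb

RUN_DATE = "2025-03-01"  # normally injected by the orchestrator; a placeholder here

con = duckdb.connect("analytics.duckdb")  # database path and table names are placeholders

# Delete-then-insert for the run's partition inside one transaction:
# re-running the same RUN_DATE yields the same table state instead of duplicated rows.
con.execute("BEGIN TRANSACTION")
con.execute("DELETE FROM revenue_daily WHERE order_date = ?", [RUN_DATE])
con.execute(
    """
    INSERT INTO revenue_daily
    SELECT order_date, country, sum(amount) AS revenue
    FROM silver_orders
    WHERE order_date = ?
    GROUP BY order_date, country
    """,
    [RUN_DATE],
)
con.execute("COMMIT")
```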
Closing — Data Engineering Is "Contract Engineering"
If software engineering is the engineering of functions and interfaces, data engineering is the engineering of schemas and contracts.
The essence in 2025:
- deliver trustworthy data in the right shape at the right time
- be the foundation that decisions, ML, and products can rely on
Tools churn yearly. Lakehouse table formats, streaming SQL, dbt, Dagster, and Data Contracts were niche or nonexistent in 2020; 2030 will bring more. What endures:
- data historicity (preserve Bronze)
- schema evolution (Contract)
- idempotent processing (safe retries)
- observability (detect problems)
Keep these and the tools swap in and out as needed.
Next — "Observability Complete Guide: Metric, Log, Trace, OpenTelemetry, eBPF, SLO"
Season 2 Ep 9 covers the nervous system of modern systems, Observability: three axes plus Profile (Pyroscope), the real value of OpenTelemetry, eBPF at the kernel level, SLO/SLI/Error Budget design, Grafana vs Elastic vs Datadog, and cost control (stopping log explosions).
"If you cannot observe it, you cannot operate it." Continued next time.