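Data Engineering Complete Guide - Lakehouse, Streaming, dbt, Orchestration, Data Mesh (Season 2 Ep 8, 2025)

Intro - What Is Expected of a Data Engineer in 2025

Ten years ago, a data engineer's job was "writing ETL scripts + operating Hadoop." The expectations in 2025:

Lakehouse design: choosing and operating Iceberg, Delta, or Hudi
Streaming + Batch unification: Lambda gave way to Kappa, and the two are now genuinely converging
Modern Data Stack: dbt + Airflow/Dagster + Fivetran/Airbyte + BigQuery/Snowflake
Data Mesh: centralized vs distributed organizational models
Data Contract: schema contracts in the style of Protobuf
AI/ML integration: touchpoints with MLOps, Feature Store, Vector DB
Cost management: controlling cloud data costs (FinOps)

This post packs the mental frame of a 2025 data engineer into a single article.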

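Part 1 - Lakehouse: Data Lake + Data Warehouse Unified

1.1 Historical Context

Era	Architecture	Limitation
2000s	Data Warehouse (Teradata, Oracle)	Structured data only, expensive
2010s	Data Lake (Hadoop, S3)	Structured and unstructured data, but weak query performance and no transactions
2020s	Lakehouse (Iceberg, Delta, Hudi)	Combines the strengths of both

Lakehouse definition: object storage (S3, GCS, ADLS) + a table format (Iceberg etc.) + a query engine (Spark, Trino, DuckDB), combined to deliver warehouse-grade ACID and performance.

1.2 Comparing the Three Table Formats (2024-2025)

Item	Iceberg	Delta Lake	Hudi
Origin	Netflix	Databricks	Uber
Governance	Apache Software Foundation	Linux Foundation	Apache Software Foundation
Engine neutrality	Best	Strongest on Databricks	Neutral
Real-time upsert	Good	Good	Best
Time travel	✅	✅	✅
Schema Evolution	Strong	Strong	Medium
2025 share	Growing fast	Leader (declining)	Niche

Decisive event of 2024: Databricks acquired Tabular (the company founded by Iceberg's creators), pointing toward Iceberg and Delta converging.

2025 recommendation: Iceberg for new projects. Delta is also fine in a Databricks environment.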

1.3 Iceberg Basic Structure

Metadata (JSON):
  ↓
Manifest List (Avro):
  ├─ Manifest 1 (Avro)
  │   └─ Data Files (Parquet)
  ├─ Manifest 2
  │   └─ Data Files
  └─ ...

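Snapshot: the table state at a specific point in time (enables time travel)
Partition Evolution: the partition spec can change without rewriting existing data files
Hidden Partitioning: users filter on the source column (for example a timestamp) and Iceberg applies the partition transform itself, so no separate partition column is needed

1.4 Lakehouse Query Engines 2025

Engine	Strength
Spark	Strongest general-purpose engine, Delta-native
Trino (Presto)	Interactive SQL, multi-source
DuckDB	Local and embedded, very fast
Snowflake	Best managed operations
BigQuery	Serverless, GCP integration
ClickHouse	Real-time analytics
Databricks SQL	Optimized for Delta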

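Part 2 - Streaming: The Victory of the Kappa Architecture

2.1 Lambda vs Kappa

Lambda (2010s):

Batch Layer (accurate) + Speed Layer (fast) + Serving Layer
Problem: the same logic is implemented twice, which makes maintenance painful

Kappa (2014, Jay Kreps):

Streaming only. When reprocessing is needed, replay the stream from the retained log
The standard in 2025

2.2 Streaming Engines 2025

Engine	Trait
Apache Flink	Richest features and maturity, stateful
Kafka Streams	Kafka-native, Java library
Spark Structured Streaming	Unified with the batch API
Materialize	PostgreSQL-compatible SQL
RisingWave	Materialize alternative, Rust
Arroyo	Newer, Rust-based

2.3 Ten Core Streaming Concepts

Event Time vs Processing Time: when the event occurred vs when the system processed it
Watermark: a signal that "all events up to this time have arrived"
Windowing: Tumbling, Sliding, Session
Stateful Processing: processing while maintaining state
Exactly-Once: a guarantee that each event is processed exactly once
Backpressure: slowing the upstream when the downstream is slow
Checkpointing: recovery points for failures
Join: Stream-Stream, Stream-Table
CDC (Change Data Capture): database changes turned into a stream
Deduplication: removing duplicates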

2.4 CDC - The Core of Modern Data Integration

PostgreSQL/MySQL → Debezium → Kafka → Flink/Spark → Iceberg

Benefits: removes batch ETL and gives near-real-time synchronization. Tools: Debezium (open source), Fivetran, Airbyte.

2.5 Example Real-time Data Platform 2025

[Kafka] → [Flink SQL] → [Iceberg (bronze/silver/gold)] → [Trino/DuckDB]
                   ↓
              [Materialize] (real-time dashboards)
                   ↓
              [Redis] (low-latency serving)

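Part 3 - Medallion Architecture: Bronze/Silver/Gold

3.1 Three-Tier Structure

Bronze (Raw): the original data as-is. Absorbs schema changes.
Silver (Cleansed): cleaned, validated, deduplicated, with a unified schema.
Gold (Business): aggregates per business domain, in a form BI/ML can use directly.

3.2 Responsibilities of Each Tier

Tier	Owner	Frequency
Bronze	Data engineer	Real-time to hourly
Silver	Data engineer	Hourly to daily
Gold	Analytics engineer + domain team	Daily to weekly

3.3 Benefits

Reprocessable (Bronze is preserved)
Clear contract (Silver = standard schema)
Business logic isolated (Gold)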

Part 4 - dbt: The Standard for Analytics Engineering

4.1 What dbt Is

"Data modeling in SQL, done like software engineering": version control, tests, documentation, and a dependency graph.

4.2 Core Components

models/
  staging/
    stg_orders.sql        -- roughly Bronze -> Silver
    stg_customers.sql
  marts/
    orders.sql            -- Gold, business logic
    revenue_daily.sql
tests/
  assert_total_revenue.sql
seeds/
  country_codes.csv
macros/
  generate_schema_name.sql

4.3 Model Example

-- models/marts/orders.sql
{{ config(materialized='table', partition_by={'field': 'order_date', 'data_type': 'date'}) }}

with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
)
select
    o.order_id,
    o.order_date,
    c.country,
    o.amount
from orders o
join customers c using (customer_id)
where o.status = 'completed'

{{ ref('stg_orders') }}: dbt builds the dependency graph automatically from these references.

4.4 Tests and Documentation

# schema.yml
models:
  - name: orders
    description: "Completed orders (Gold layer)"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"

dbt test  # run all tests
dbt docs generate && dbt docs serve  # build and serve the documentation site

4.5 dbt Ecosystem 2025

dbt Core: open-source CLI
dbt Cloud: commercial SaaS (IDE, scheduling)
dbt-osmosis: automatic metadata propagation
Elementary: data quality and observability
SQLMesh: dbt alternative with stronger dependency management
Dagster + dbt integration: stronger orchestration

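Part 5 - Orchestration: Airflow vs Dagster vs Prefect vs Temporal

5.1 2025 Comparison

Tool	Strength	Weakness	Fits
Airflow	Mature, largest ecosystem	Dated UX	Large, mature teams
Dagster	Data-aware, type-safe	Learning curve	Data-centric teams
Prefect	Pythonic, greatly improved in 2.x	Small community	Small to mid-size teams
Temporal	Best-in-class workflow reliability	Not data-specific	Long-running workflows
Kestra	Declarative, YAML	New	Simple pipelines

5.2 Dagster's Philosophy: "Data-aware Orchestration"

Airflow manages "tasks". Dagster manages "Assets" (the data outputs).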

from dagster import asset

@asset
def raw_orders():
    return fetch_from_postgres("orders")

@asset
def cleaned_orders(raw_orders):
    return raw_orders.dropna()

@asset
def revenue_by_day(cleaned_orders):
    return cleaned_orders.groupby("date").sum()

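Dependencies are explicit in the function signatures, and schemas and types can be declared alongside them. Execution graph = Asset graph.

5.3 Improvements in Airflow 2.x

TaskFlow API (Pythonic)
Dynamic Task Mapping
Dataset Triggering (inspired by Dagster)
Airflow 3.0 (2025): a data-centric overhaul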

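Part 6 - Data Mesh: Data Architecture for the Organization

6.1 Background

Limits of a centralized data team:

Lack of domain knowledge
Bottleneck (one team handles every request)
Conflicting priorities

6.2 The Four Principles of Data Mesh (Zhamak Dehghani, 2019)

Domain Ownership: data is owned by the domain team (the marketing team owns marketing data)
Data as a Product: a dataset is a product (quality, documentation, SLA)
Self-serve Platform: the central team provides a "platform" that domain teams use autonomously
Federated Governance: central rules + domain-level execution

6.3 Practical Considerations for Data Mesh

When it works:

Organizations with 500+ engineers
Clear boundaries between domains
A mature platform engineering organization

When it fails:

Startups under 100 people (over-engineering)
Domain teams lacking the skills
"Laissez-faire" with no central platform

2025 reality: a shift from idealism to pragmatic adjustment. "Hub-and-Spoke" or a partial Mesh is common.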

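Part 7 - Data Contract: The Interface Between Teams

7.1 Why It Is Needed

Problem: the frontend team renames user.full_name to user.name, 100 data pipelines break, and nobody can trace the cause.

7.2 Defining a Data Contract

A formal promise covering schema + semantics + SLA. Typically expressed in a form such as Protobuf or JSON Schema.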

# user_contract.yml
name: user_events
version: 1.2.0
owner: user-platform-team
schema:
  event_id: string (uuid)
  user_id: string
  event_type: enum [signup, login, logout, purchase]
  timestamp: timestamp (UTC)
  properties: map<string, any>
sla:
  freshness: <5 minutes
  availability: 99.9%
  breaking_change_notice: 30 days
consumers:
  - analytics-team
  - ml-team
  - finance-team

7.3 Tools (2024-2025)

Protobuf + Buf: centered on engineering teams
Great Expectations: data quality validation
dbt Contracts (1.5+): contracts between models
DataHub: metadata catalog
Apache Atlas: Hadoop ecosystem

7.4 Contract-first Pipeline

Producer Team                         Consumer Team
  │                                        │
  ├─ Commits schema change to Buf ──→ Review
  │                                        │
  ├─ CI: backward compat check             │
  │                                        │
  ├─ Deploy producer                       │
  │                                        │
  └─ Update event emission ─── Kafka ──→ Update consumer

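Part 8 - Data Quality: The Foundation of Trust

8.1 Six Dimensions of Data Quality

Accuracy: reflects reality
Completeness: no missing values
Consistency: identical across systems
Timeliness: up to date
Uniqueness: no duplicates
Validity: format and range

8.2 Data Observability 5 Pillars (Monte Carlo)

Freshness: when was the last update?
Volume: is the row count within the expected range?
Schema: did columns or types change?
Distribution: did the value distribution change?
Lineage: where did it come from and where does it go?

8.3 Tools 2025

Great Expectations: open source, powerful
Soda: SQL-based, simple
Monte Carlo: commercial, ML-based anomaly detection
Bigeye: automatic SLA learning
Elementary: dbt integration

Part 9 - Modern Data Stack 2025

9.1 Standard Setup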

[Sources]
  ├─ OLTP DBs (Postgres/MySQL)
  ├─ SaaS APIs (Salesforce, Stripe)
  └─ Event Streams (Kafka, Kinesis)
       ↓
[Ingestion]
  ├─ Fivetran / Airbyte (SaaS → Warehouse)
  └─ Debezium (CDC)
       ↓
[Warehouse/Lakehouse]
  ├─ Snowflake / BigQuery / Databricks
  └─ Iceberg on S3 + Trino
       ↓
[Transformation]
  └─ dbt (or SQLMesh)
       ↓
[Orchestration]
  └─ Dagster / Airflow
       ↓
[Serving]
  ├─ BI: Looker, Metabase, Hex, Preset
  ├─ ML: Feature Store (Feast)
  └─ Reverse ETL: Census, Hightouch (to SaaS)
       ↓
[Observability]
  ├─ Monte Carlo / Elementary (data)
  └─ DataHub (catalog)

9.2 New Trends in 2025

Semantic Layer: dbt Semantic Layer, Cube, MetricFlow (unified metrics)
Reverse ETL: Warehouse → CRM and ad platforms
Zero-copy Cloning: cloning in Snowflake and Databricks without copying data
Open Table Format convergence: consolidating around Iceberg
AI-assisted Data Engineering: copilots drafting dbt models

Part 10 - Six-Month Data Engineer Roadmap

Month 1: SQL + warehouse

Advanced PostgreSQL (window functions, CTEs, recursive queries)
A basic dbt project

Month 2: Orchestration + pipelines

Airflow or Dagster
Scheduling, dependencies, and retry strategies

Month 3: Lakehouse

Hands-on Iceberg or Delta
Querying with Trino or DuckDB
Parquet tuning

Month 4: Streaming

Kafka basics
Flink or Spark Streaming
CDC (Debezium)

Month 5: Quality and observability

Great Expectations or Soda
DataHub catalog
Trying out a Data Contract

Month 6: Operations + ML connection

Cost monitoring (FinOps)
Feature Store (Feast)
Semantic Layer

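Part 11 - 12-Item Data Engineer Checklist

I know the differences between the three Lakehouse table formats
I know the difference between the Lambda and Kappa architectures
I can explain Event Time vs Processing Time + Watermark
I know the responsibilities of each Medallion tier (Bronze/Silver/Gold)
I know how dbt's ref() produces the dependency graph
I know how a CDC pipeline is put together
I know the difference between a Dagster Asset and an Airflow Task
I can state the four principles of Data Mesh
I know what a Data Contract is and why it is needed
I know the six dimensions of data quality
I know the Data Observability 5 Pillars
I know the purpose of a Semantic Layer

Part 12 - Ten Data Engineering Anti-patterns

Going straight to Silver without Bronze: reprocessing becomes impossible. Preserving the original is essential
Insisting on the Lambda architecture: duplicated logic. Consider moving to Kappa
SQL sprawl without dbt: no dependencies, tests, or documentation
Sharing data between teams without a Data Contract: breakage that cannot be traced
Parquet without partition design: slow queries. Use Hidden Partitioning
Hourly batch UPSERTs: if Merge on Read is not available, compaction is required
Business logic in Airflow DAGs: do not keep state in the orchestrator ❌. Move the logic to dbt or Flink
Observability as an afterthought: track quality metrics from the start
No retry strategy for pipeline failures: idempotent design is essential
No cost monitoring: runaway cloud warehouse costs are common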

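Closing - Data Engineering Is "the Engineering of Contracts"

If software engineering is "the engineering of functions and interfaces," data engineering is "the engineering of schemas and contracts."

The essence in 2025 is unchanged:

Deliver trustworthy data at the time it is needed, in the shape it is needed
Be the foundation that the organization's decisions, ML, and products can depend on

But the tools change every year. Lakehouse, Streaming, dbt, Dagster, Data Contract: ten years ago these were absent or still niche. By 2030 there will be others.

What matters is the principles:

Historicity of data (preserve Bronze)
Schema evolution (Contract)
Idempotent processing (safe retries)
Observability (detecting problems)

As long as these hold, tools can be swapped as needed.

Next Post Preview - "Observability Complete Guide: Metric, Log, Trace, OpenTelemetry, eBPF, SLO"

Season 2 Ep 9 covers the nervous system of modern systems, Observability. The next post covers:

The three axes of Metric, Log, and Trace, plus Profile as a fourth (Pyroscope)
The real value of OpenTelemetry
eBPF and kernel-level observation
Practical design of SLO, SLI, and Error Budget
Grafana Stack vs Elastic Stack vs Datadog
Cost control (preventing log explosions)

"If you cannot observe it, you cannot operate it." To be continued in the next post.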

Data Engineering Complete Guide — Lakehouse, Streaming, dbt, Orchestration, Data Mesh (Season 2 Ep 8, 2025)

Intro — Expectations for Data Engineers in 2025

Ten years ago a data engineer wrote ETL scripts and operated Hadoop. 2025 expectations:

Lakehouse design: pick and operate Iceberg, Delta, or Hudi
Streaming + Batch unification: Lambda gave way to Kappa, and batch and streaming are now genuinely converging
Modern Data Stack: dbt + Airflow/Dagster + Fivetran/Airbyte + BigQuery/Snowflake
Data Mesh: centralized vs distributed org model
Data Contract: schema-as-contract (Protobuf-like)
AI/ML integration: Feature Store, Vector DB, MLOps touchpoints
Cost management: cloud data cost control (FinOps)

This post captures the mental frame of a 2025 data engineer.

Part 1 — Lakehouse: Data Lake + Warehouse Unified

1.1 Historical Context

Era	Architecture	Limitation
2000s	Data Warehouse (Teradata, Oracle)	Structured only, expensive
2010s	Data Lake (Hadoop, S3)	Any shape, but no ACID / poor query perf
2020s	Lakehouse (Iceberg, Delta, Hudi)	Best of both

Definition: object storage (S3/GCS/ADLS) + table format (Iceberg etc.) + query engine (Spark, Trino, DuckDB) delivering warehouse-grade ACID and performance.

1.2 Three Table Formats (2024-2025)

Item	Iceberg	Delta Lake	Hudi
Origin	Netflix	Databricks	Uber
Governance	Apache	Linux Foundation	Apache
Engine neutrality	Best	Databricks-leaning	Neutral
Real-time upsert	Good	Good	Best
Time travel	Yes	Yes	Yes
Schema evolution	Strong	Strong	Medium
2025 share	Rapidly growing	Leader (slipping)	Niche

Key 2024 event: Databricks acquired Tabular (Iceberg founders), pointing toward Iceberg/Delta convergence.

2025 pick: Iceberg for new projects. Delta is fine on Databricks.

1.3 Iceberg Structure

Metadata (JSON):
  -> Manifest List (Avro):
     |- Manifest 1 (Avro)
     |  \- Data Files (Parquet)
     |- Manifest 2
     |  \- Data Files
     \- ...

Snapshot: point-in-time state (time travel)
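Partition Evolution: the partition spec can change without rewriting existing data files
Hidden Partitioning: users filter on the source column (for example a timestamp) and Iceberg applies the partition transform itself, so no separate partition column is needed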
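The metadata tree and snapshot-based time travel can be pictured with a small in-memory model. This is only a sketch: the class names (`Manifest`, `Snapshot`, `TableMetadata`) and methods are invented for illustration and are not the Iceberg API, and it assumes snapshots are committed in timestamp order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Manifest:
    """Lists data files (Parquet files in a real table)."""
    data_files: tuple


@dataclass(frozen=True)
class Snapshot:
    """Point-in-time table state; `manifests` plays the role of the manifest list."""
    snapshot_id: int
    timestamp_ms: int
    manifests: tuple


@dataclass
class TableMetadata:
    """Toy stand-in for the metadata JSON: an append-only list of snapshots."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit_append(self, timestamp_ms: int, new_files: List[str]) -> Snapshot:
        # A new snapshot reuses the previous manifests and adds one more.
        previous = self.snapshots[-1].manifests if self.snapshots else ()
        snapshot = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            timestamp_ms=timestamp_ms,
            manifests=previous + (Manifest(tuple(new_files)),),
        )
        self.snapshots.append(snapshot)
        return snapshot

    def snapshot_as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        # Time travel: the latest snapshot committed at or before the timestamp.
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return candidates[-1] if candidates else None

    def files_as_of(self, timestamp_ms: int) -> List[str]:
        snapshot = self.snapshot_as_of(timestamp_ms)
        if snapshot is None:
            return []
        return [f for m in snapshot.manifests for f in m.data_files]


table = TableMetadata()
table.commit_append(1_000, ["a.parquet"])
table.commit_append(2_000, ["b.parquet", "c.parquet"])
```

Reading "as of" time 1500 only sees `a.parquet`, because old snapshots are never mutated; a new commit only adds a snapshot that points at old and new manifests.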

1.4 Lakehouse Query Engines 2025

Engine	Strength
Spark	General-purpose king, Delta-native
Trino (Presto)	Interactive SQL, multi-source
DuckDB	Local/embedded, blazing fast
Snowflake	Best operational UX
BigQuery	Serverless, GCP-integrated
ClickHouse	Real-time analytics
Databricks SQL	Delta-optimized

Part 2 — Streaming: The Kappa Victory

2.1 Lambda vs Kappa

Lambda (2010s): Batch Layer (accurate) + Speed Layer (fast) + Serving Layer. Problem: the same logic is implemented twice, once per layer, which makes maintenance painful.

Kappa (2014, Jay Kreps): streaming only. To reprocess, replay the retained event log through the streaming job from the beginning. Standard in 2025.
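The Kappa idea of "reprocessing is just a replay" can be sketched in a few lines. The list `LOG` stands in for a retained Kafka topic, and `run_job`, `logic_v1`, and `logic_v2` are hypothetical names for illustration only.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, object]

# The retained log is the single source of truth (a Kafka topic in practice).
LOG: List[Event] = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]


def run_job(log: Iterable[Event],
            logic: Callable[[Dict[str, int], Event], None]) -> Dict[str, int]:
    """Build a fresh output view by reading the log from the first offset."""
    view: Dict[str, int] = {}
    for event in log:
        logic(view, event)
    return view


def logic_v1(view: Dict[str, int], event: Event) -> None:
    # Version 1: total amount per user.
    view[event["user"]] = view.get(event["user"], 0) + event["amount"]


def logic_v2(view: Dict[str, int], event: Event) -> None:
    # Version 2: same aggregation, but amounts below 6 are ignored.
    if event["amount"] >= 6:
        logic_v1(view, event)


view_v1 = run_job(LOG, logic_v1)
view_v2 = run_job(LOG, logic_v2)  # reprocessing = replaying the same log
```

There is one code path: the new logic runs over the same log and produces a new view, so there is no separate batch implementation to keep in sync.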

2.2 Streaming Engines 2025

Engine	Trait
Apache Flink	Most mature & feature-rich, stateful
Kafka Streams	Kafka-native Java library
Spark Structured Streaming	Unified with batch API
Materialize	PostgreSQL-compatible SQL
RisingWave	Materialize alternative, Rust
Arroyo	New, Rust-based

2.3 Ten Streaming Concepts

Event Time vs Processing Time
Watermark: "all events up to time T have arrived" signal
Windowing: Tumbling, Sliding, Session
Stateful Processing
Exactly-Once
Backpressure
Checkpointing: recovery points
Join: Stream-Stream, Stream-Table
CDC (Change Data Capture)
Deduplication
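Several of these concepts (event time, watermark, tumbling windows, late data) interact, which a toy implementation can make concrete. This is a minimal sketch, not how Flink or Spark implement it; the window size, the allowed lateness, and the function name `tumbling_counts` are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SIZE = 10       # seconds per tumbling window
ALLOWED_LATENESS = 5   # the watermark trails the max event time by this much


def tumbling_counts(events: List[Tuple[int, str]]):
    """events are (event_time_seconds, key) pairs in arrival order."""
    open_windows: Dict[int, int] = defaultdict(int)  # window start -> count
    emitted: List[Tuple[int, int]] = []
    dropped: List[Tuple[int, str]] = []
    watermark = float("-inf")

    for event_time, key in events:
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            # The window was already finalized, so the event is too late.
            dropped.append((event_time, key))
            continue
        open_windows[window_start] += 1
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        # Emit every window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                emitted.append((start, open_windows.pop(start)))
    return emitted, dict(open_windows), dropped


# Event times arrive out of order: 8 arrives after 12, and 3 arrives after 16.
arrivals = [(1, "a"), (4, "b"), (12, "a"), (8, "c"), (16, "a"), (3, "d"), (27, "b")]
emitted, still_open, dropped = tumbling_counts(arrivals)
```

The event at time 8 arrives late but is still counted because the watermark has not passed its window; the event at time 3 arrives after the window [0, 10) was emitted and is dropped. Results are grouped by when events happened, not by when they arrived.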

2.4 CDC — The Modern Integration Primitive

PostgreSQL/MySQL -> Debezium -> Kafka -> Flink/Spark -> Iceberg

Eliminates batch ETL, near-real-time sync. Tools: Debezium (OSS), Fivetran, Airbyte.
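What the downstream job does with a change stream can be sketched as follows. The event shape loosely follows the Debezium envelope (`op` codes `c`/`u`/`d`/`r` with `before` and `after` row images) but is heavily simplified: real events also carry source metadata and schema information, and `apply_change` is a hypothetical helper.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def apply_change(table: Dict[int, Row], event: dict, key: str = "id") -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    before: Optional[Row] = event.get("before")
    after: Optional[Row] = event.get("after")
    if op in ("c", "r", "u"):    # create, snapshot read, update -> upsert
        table[after[key]] = after
    elif op == "d":              # delete -> remove the row if present
        table.pop(before[key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


changes: List[dict] = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica: Dict[int, Row] = {}
for change in changes:
    apply_change(replica, change)

# Replaying the whole stream again leaves the replica unchanged.
replayed = dict(replica)
for change in changes:
    apply_change(replayed, change)
```

Because every operation is keyed by primary key, replaying the stream converges to the same state, which is what makes at-least-once delivery tolerable downstream.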

2.5 Example Real-time Platform 2025

[Kafka] -> [Flink SQL] -> [Iceberg (bronze/silver/gold)] -> [Trino/DuckDB]
                   |
                   +-> [Materialize] (live dashboards)
                   |
                   +-> [Redis] (low-latency serving)

Part 3 — Medallion Architecture: Bronze/Silver/Gold

3.1 Three Tiers

Bronze (Raw): original data, absorbs schema drift
Silver (Cleansed): validated, deduped, unified schema
Gold (Business): domain aggregates, BI/ML-ready

3.2 Responsibilities

Tier	Owner	Frequency
Bronze	Data engineer	Real-time to hourly
Silver	Data engineer	Hourly to daily
Gold	Analytics engineer + domain team	Daily to weekly

3.3 Benefits

Reprocessable (Bronze preserved)
Clear contract (Silver = canonical schema)
Business logic isolated (Gold)
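The three tiers can be illustrated with plain Python over a handful of records. This is a toy sketch: the record fields and the functions `to_silver` and `to_gold` are assumptions for illustration, and a real pipeline would run these steps as table writes rather than in-memory lists.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: raw records kept exactly as received (duplicates, bad values, strings).
bronze: List[dict] = [
    {"order_id": "1", "country": "kr", "amount": "10.5"},
    {"order_id": "1", "country": "kr", "amount": "10.5"},  # duplicate delivery
    {"order_id": "2", "country": "US", "amount": "oops"},  # invalid amount
    {"order_id": "3", "country": "us", "amount": "4.5"},
]


def to_silver(raw: List[dict]) -> List[dict]:
    """Validate, cast to a canonical schema, and deduplicate by order_id."""
    seen = set()
    clean = []
    for record in raw:
        try:
            row = {
                "order_id": int(record["order_id"]),
                "country": record["country"].upper(),
                "amount": float(record["amount"]),
            }
        except (KeyError, ValueError):
            continue  # rejected rows stay recoverable in Bronze
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            clean.append(row)
    return clean


def to_gold(silver_rows: List[dict]) -> Dict[str, float]:
    """Business aggregate: revenue per country."""
    revenue: Dict[str, float] = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
```

If the validation rule later changes (say, the invalid amount turns out to be parseable), `to_silver` can be rerun over the untouched Bronze records, which is the reprocessing benefit listed above.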

Part 4 — dbt: The Analytics Engineering Standard

4.1 What dbt Is

"SQL modeling as software engineering" — version control, tests, docs, dependency graph.

4.2 Core Layout

models/
  staging/
    stg_orders.sql
    stg_customers.sql
  marts/
    orders.sql
    revenue_daily.sql
tests/
seeds/
macros/

4.3 Model Example

-- models/marts/orders.sql
{{ config(materialized='table', partition_by={'field': 'order_date', 'data_type': 'date'}) }}

with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
)
select
    o.order_id,
    o.order_date,
    c.country,
    o.amount
from orders o
join customers c using (customer_id)
where o.status = 'completed'

{{ ref('stg_orders') }} — dbt auto-builds the dependency graph.
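How `ref()` calls turn into an execution order can be approximated with a regular expression and a topological sort. This is a simplified sketch: dbt actually renders the Jinja templates rather than pattern-matching them, and the model SQL strings and the `build_graph` helper here are made up for illustration.

```python
import re
from graphlib import TopologicalSorter
from typing import Dict, Set

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

models: Dict[str, str] = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders": (
        "select * from {{ ref('stg_orders') }} "
        "join {{ ref('stg_customers') }} using (customer_id)"
    ),
    "revenue_daily": (
        "select order_date, sum(amount) from {{ ref('orders') }} group by 1"
    ),
}


def build_graph(sql_by_model: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each model to the set of models it references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in sql_by_model.items()}


graph = build_graph(models)
# static_order() yields every model after all of its upstream models.
run_order = list(TopologicalSorter(graph).static_order())
```

The staging models come first, then `orders`, then `revenue_daily`; nobody had to declare that order by hand, it falls out of the references.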

4.4 Tests and Documentation

# schema.yml
models:
  - name: orders
    description: "Completed orders (Gold layer)"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"

dbt test
dbt docs generate && dbt docs serve
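Conceptually, each of these tests is a query that selects the rows violating the rule, and the test passes when the query returns nothing. The sketch below mimics that idea with SQLite; the queries are hand-written approximations rather than the SQL dbt generates, and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table orders (order_id integer, amount real);
    insert into orders values (1, 10.0), (2, 5.0), (2, 7.5), (null, 3.0), (4, -1.0);
    """
)

# Each query returns the rows that FAIL the rule; zero rows means the test passes.
FAILING_ROW_QUERIES = {
    "unique_order_id": (
        "select order_id from orders where order_id is not null "
        "group by order_id having count(*) > 1"
    ),
    "not_null_order_id": "select * from orders where order_id is null",
    "amount_non_negative": "select * from orders where not (amount >= 0)",
}


def run_tests(connection: sqlite3.Connection) -> dict:
    """Return the number of failing rows per test."""
    return {
        name: len(connection.execute(sql).fetchall())
        for name, sql in FAILING_ROW_QUERIES.items()
    }


results = run_tests(conn)
failed = sorted(name for name, failures in results.items() if failures > 0)
```

With this sample data every rule has exactly one violation (a duplicated id, a null id, and a negative amount), so all three tests would fail.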

4.5 2025 dbt Ecosystem

dbt Core: OSS CLI
dbt Cloud: SaaS (IDE + scheduler)
dbt-osmosis: metadata propagation
Elementary: data quality + observability
SQLMesh: dbt alternative, stronger dependency handling
Dagster + dbt: tighter orchestration

Part 5 — Orchestration: Airflow vs Dagster vs Prefect vs Temporal

5.1 2025 Comparison

Tool	Strength	Weakness	Fits
Airflow	Mature, huge ecosystem	Dated UX	Large, mature teams
Dagster	Data-aware, typed	Learning curve	Data-centric teams
Prefect	Pythonic, much improved in 2.x	Smaller community	Mid-size teams
Temporal	Workflow reliability	Not data-specific	Long-running workflows
Kestra	Declarative YAML	New	Simple pipelines

5.2 Dagster: "Data-aware Orchestration"

Airflow manages tasks. Dagster manages Assets (data outputs).

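import pandas as pd
from dagster import asset

def fetch_from_postgres(table: str) -> pd.DataFrame:
    # Hypothetical helper: replace with a real query (e.g. pandas.read_sql).
    raise NotImplementedError

@asset
def raw_orders() -> pd.DataFrame:
    # Raw rows from the source table.
    return fetch_from_postgres("orders")

@asset
def cleaned_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # The parameter name declares the dependency on raw_orders.
    return raw_orders.dropna()

@asset
def revenue_by_day(cleaned_orders: pd.DataFrame) -> pd.DataFrame:
    # Sum only the numeric columns per day.
    return cleaned_orders.groupby("date").sum(numeric_only=True)

Dependencies and types are explicit in the function signatures. Execution graph = Asset graph.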
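The "parameter name = upstream asset" convention can be imitated in a few lines to show why the execution graph and the asset graph coincide. The `asset` decorator below is a toy written for this sketch, not `dagster.asset`, and `materialize_all` is a hypothetical helper.

```python
import inspect
from graphlib import TopologicalSorter
from typing import Callable, Dict

ASSETS: Dict[str, Callable] = {}


def asset(fn: Callable) -> Callable:
    """Toy decorator (not dagster.asset): register the function under its name."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [
        {"date": "2025-01-01", "amount": 10},
        {"date": "2025-01-01", "amount": None},
        {"date": "2025-01-02", "amount": 5},
    ]


@asset
def cleaned_orders(raw_orders):
    return [row for row in raw_orders if row["amount"] is not None]


@asset
def revenue_by_day(cleaned_orders):
    totals: Dict[str, int] = {}
    for row in cleaned_orders:
        totals[row["date"]] = totals.get(row["date"], 0) + row["amount"]
    return totals


def materialize_all() -> Dict[str, object]:
    # Dependencies are read from each function's parameter names.
    deps = {name: set(inspect.signature(fn).parameters) for name, fn in ASSETS.items()}
    results: Dict[str, object] = {}
    for name in TopologicalSorter(deps).static_order():
        results[name] = ASSETS[name](**{dep: results[dep] for dep in deps[name]})
    return results


results = materialize_all()
```

Nothing here schedules "tasks": the runner only knows which data outputs exist and what each one needs, and derives the order from that.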

5.3 Airflow 2.x Improvements

TaskFlow API (Pythonic)
Dynamic Task Mapping
Dataset Triggering (inspired by Dagster)
Airflow 3.0 (2025): data-centric overhaul

Part 6 — Data Mesh: Organizational Data Architecture

6.1 Background

Limits of centralized data teams: weak domain knowledge, bottleneck, priority conflicts.

6.2 Four Principles (Zhamak Dehghani, 2019)

Domain Ownership: domains own their data
Data as a Product: dataset = product (quality, docs, SLA)
Self-serve Platform: central team provides platform
Federated Governance: central rules + domain execution

6.3 Field Reality

Works when: 500+ engineers, clear domain boundaries, mature platform-engineering org.

Fails when: under 100 engineers (over-engineering), weak domain teams, no central platform.

2025 reality: from ideology to pragmatic adjustment — Hub-and-Spoke or partial Mesh is common.

Part 7 — Data Contract: The Team Interface

7.1 Why

Problem: the frontend team renames user.full_name to user.name, 100 data pipelines break, and nobody can trace the cause.

7.2 Definition

Schema + semantics + SLA as a formal promise. Often Protobuf or JSON Schema.

# user_contract.yml
name: user_events
version: 1.2.0
owner: user-platform-team
schema:
  event_id: string (uuid)
  user_id: string
  event_type: enum [signup, login, logout, purchase]
  timestamp: timestamp (UTC)
  properties: map<string, any>
sla:
  freshness: under 5 minutes
  availability: 99.9%
  breaking_change_notice: 30 days
consumers:
  - analytics-team
  - ml-team
  - finance-team
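A contract only helps if something enforces it. The sketch below checks a single event against the schema part of the contract above; it is hand-rolled for illustration (a real setup would more likely use JSON Schema or Protobuf tooling). It assumes timestamps arrive as ISO 8601 strings with an explicit offset, and `validate_event` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import List
from uuid import UUID

EVENT_TYPES = ("signup", "login", "logout", "purchase")
REQUIRED_FIELDS = ("event_id", "user_id", "event_type", "timestamp", "properties")


def validate_event(event: dict) -> List[str]:
    """Return a list of contract violations; an empty list means the event conforms."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if errors:
        return errors
    try:
        UUID(str(event["event_id"]))
    except ValueError:
        errors.append("event_id is not a UUID")
    if not isinstance(event["user_id"], str):
        errors.append("user_id must be a string")
    if event["event_type"] not in EVENT_TYPES:
        errors.append(f"event_type not allowed: {event['event_type']}")
    try:
        offset = datetime.fromisoformat(event["timestamp"]).utcoffset()
        if offset != timedelta(0):
            errors.append("timestamp must be UTC")
    except (TypeError, ValueError):
        errors.append("timestamp is not an ISO 8601 string")
    if not isinstance(event["properties"], dict):
        errors.append("properties must be a map")
    return errors


good = {
    "event_id": "123e4567-e89b-12d3-a456-426614174000",
    "user_id": "u-42",
    "event_type": "login",
    "timestamp": "2025-01-01T00:00:00+00:00",
    "properties": {"device": "ios"},
}
bad = dict(good, event_type="refund", timestamp="2025-01-01T09:00:00+09:00")
```

Running the check at the producer (before the event reaches Kafka) turns a silent downstream breakage into an immediate, attributable failure.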

7.3 Tools (2024-2025)

Protobuf + Buf: engineering-team centric
Great Expectations: data quality validation
dbt Contracts (1.5+): model-level contracts
DataHub: metadata catalog
Apache Atlas: Hadoop ecosystem

7.4 Contract-first Pipeline

Producer Team                         Consumer Team
  |                                        |
  |- Commits schema change to Buf -> Review
  |                                        |
  |- CI: backward compat check             |
  |                                        |
  |- Deploy producer                       |
  |                                        |
  \- Emit events ----- Kafka --------> Update consumer
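The "CI: backward compat check" step can be pictured as a diff between two schema versions. The rules below (removed field or changed type = breaking, added field = fine) are a deliberately small assumption; the breaking-change rules in tools such as Buf are considerably richer, and `breaking_changes` is a hypothetical function.

```python
from typing import Dict, List

Schema = Dict[str, str]  # field name -> type name


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List the changes that would break consumers reading with the old schema."""
    problems = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new[name]}")
    return problems


v1 = {"user_id": "string", "full_name": "string"}
v2_rename = {"user_id": "string", "name": "string"}  # the rename from 7.1
v2_additive = {"user_id": "string", "full_name": "string", "locale": "string"}
v2_retyped = {"user_id": "int64", "full_name": "string"}

rename_problems = breaking_changes(v1, v2_rename)
additive_problems = breaking_changes(v1, v2_additive)
retyped_problems = breaking_changes(v1, v2_retyped)
```

A CI job would fail the producer's pull request whenever the returned list is non-empty, which is where the rename from section 7.1 gets caught before deployment.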

Part 8 — Data Quality: The Trust Foundation

8.1 Six Dimensions

Accuracy
Completeness
Consistency (across systems)
Timeliness
Uniqueness
Validity (format and range)

8.2 Five Pillars of Data Observability (Monte Carlo)

Freshness: last update time
Volume: expected row count range
Schema: column/type changes
Distribution: value distribution shifts
Lineage: where from, where to
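Two of the pillars, Freshness and Volume, reduce to simple checks that can be sketched directly. The thresholds (a 5-minute freshness limit, a z-score of 3) and the function names are assumptions for this example; commercial tools learn such thresholds rather than hard-coding them.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev
from typing import List


def is_fresh(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the table was updated recently enough."""
    return now - last_updated <= max_age


def volume_is_normal(history: List[int], today: int, z_threshold: float = 3.0) -> bool:
    """Volume: today's row count is within z_threshold standard deviations of history."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today == mu
    return abs(today - mu) / sigma <= z_threshold


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=3)
stale = now - timedelta(minutes=30)

daily_rows = [1000, 1020, 980, 1010, 990]

fresh_ok = is_fresh(recent, now, timedelta(minutes=5))
fresh_bad = is_fresh(stale, now, timedelta(minutes=5))
volume_ok = volume_is_normal(daily_rows, 1005)
volume_bad = volume_is_normal(daily_rows, 400)
```

A drop from roughly 1000 rows to 400 is far outside the historical spread and is flagged, while 1005 passes; the same pattern extends to schema and distribution checks.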

8.3 2025 Tools

Great Expectations: OSS, powerful
Soda: SQL-based, simple
Monte Carlo: commercial, ML anomaly detection
Bigeye: auto-SLA learning
Elementary: dbt-integrated

Part 9 — Modern Data Stack 2025

9.1 Canonical Setup

[Sources]
  |- OLTP DBs (Postgres/MySQL)
  |- SaaS APIs (Salesforce, Stripe)
  \- Event Streams (Kafka, Kinesis)
[Ingestion]
  |- Fivetran / Airbyte
  \- Debezium (CDC)
[Warehouse/Lakehouse]
  |- Snowflake / BigQuery / Databricks
  \- Iceberg on S3 + Trino
[Transformation]
  \- dbt (or SQLMesh)
[Orchestration]
  \- Dagster / Airflow
[Serving]
  |- BI: Looker, Metabase, Hex, Preset
  |- ML: Feature Store (Feast)
  \- Reverse ETL: Census, Hightouch
[Observability]
  |- Monte Carlo / Elementary
  \- DataHub

9.2 New Trends

Semantic Layer: dbt Semantic Layer, Cube, MetricFlow
Reverse ETL: Warehouse to CRM/ad platforms
Zero-copy Cloning: Snowflake/Databricks
Open Table Format convergence: around Iceberg
AI-assisted data engineering: copilots drafting dbt models

Part 10 — Six-Month Data Engineer Roadmap

Month 1: SQL + warehouse (advanced PostgreSQL, dbt basics)
Month 2: Orchestration + pipelines (Airflow or Dagster)
Month 3: Lakehouse (Iceberg or Delta, Trino/DuckDB, Parquet tuning)
Month 4: Streaming (Kafka basics, Flink or Spark Streaming, CDC)
Month 5: Quality + observability (Great Expectations/Soda, DataHub, Data Contract)
Month 6: Operations + ML (FinOps, Feature Store, Semantic Layer)

Part 11 — 12-Item Checklist

Differences between the three Lakehouse table formats
Lambda vs Kappa
Event Time vs Processing Time + Watermark
Medallion tier responsibilities
How ref() in dbt builds the graph
CDC pipeline topology
Dagster Asset vs Airflow Task
Data Mesh 4 principles
Data Contract definition and need
Six quality dimensions
Five observability pillars
Semantic Layer purpose

Part 12 — Ten Anti-patterns

Skipping Bronze: reprocessing impossible
Clinging to Lambda: duplicate logic
Raw SQL without dbt: no deps/tests/docs
Sharing data without a Contract: untraceable breakage
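Parquet without partition design: slow queries; use Hidden Partitioning
Hourly batch UPSERT: without Merge on Read, regular compaction is required
Business logic inside Airflow DAGs: keep state out of the orchestrator and push logic to dbt/Flink
Observability later: track quality from day one
No retry strategy: design for idempotency so retries do not corrupt data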
No cost monitoring: warehouse bills explode
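The idempotency point is easiest to see by running the same batch twice, as a retry would. The two sink functions below are hypothetical stand-ins for an append-only insert and a keyed merge/upsert.

```python
from typing import Dict, List


def append_batch(sink: List[dict], batch: List[dict]) -> None:
    """Non-idempotent: a retry appends the same rows again."""
    sink.extend(batch)


def upsert_batch(sink: Dict[int, dict], batch: List[dict], key: str = "order_id") -> None:
    """Idempotent: rows are keyed, so a retry overwrites instead of duplicating."""
    for row in batch:
        sink[row[key]] = row


batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 5}]

appended: List[dict] = []
upserted: Dict[int, dict] = {}

# Simulate a failure after the write succeeded, followed by a blind retry.
for _attempt in range(2):
    append_batch(appended, batch)
    upsert_batch(upserted, batch)

appended_total = sum(row["amount"] for row in appended)
upserted_total = sum(row["amount"] for row in upserted.values())
```

The appended sink double-counts revenue after the retry, while the keyed sink ends in the same state no matter how many times the batch is applied.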

Closing — Data Engineering Is "Contract Engineering"

If software engineering is the engineering of functions and interfaces, data engineering is the engineering of schemas and contracts.

The essence in 2025:

deliver trustworthy data in the right shape at the right time
be the foundation that decisions, ML, and products can rely on

Tools churn yearly. Lakehouse, Streaming, dbt, Dagster, Data Contract were absent or still niche ten years ago; 2030 will bring more. What endures:

data historicity (preserve Bronze)
schema evolution (Contract)
idempotent processing (safe retries)
observability (detect problems)

Keep these and the tools swap in and out as needed.

Next — "Observability Complete Guide: Metric, Log, Trace, OpenTelemetry, eBPF, SLO"

Season 2 Ep 9 covers the nervous system of modern systems, Observability: the three axes of Metric, Log, and Trace plus Profile as a fourth (Pyroscope), the real value of OpenTelemetry, eBPF at the kernel level, SLO/SLI/Error Budget design, Grafana Stack vs Elastic Stack vs Datadog, and cost control (stopping log explosions).

"If you cannot observe it, you cannot operate it." Continued next time.