Data Engineering Fundamentals — ETL, Data Warehouses, Streaming, and Data Lakes

Table of Contents

  1. What Is Data Engineering
  2. Data Architecture
  3. ETL vs ELT
  4. Batch Processing
  5. Stream Processing
  6. Workflow Orchestration
  7. Data Warehouses
  8. Data Quality
  9. Data Governance
  10. End-to-End Pipeline Example

1. What Is Data Engineering

Data engineering is the discipline of collecting, transforming, and storing raw data so that it becomes accessible and usable for analysis. Before data scientists can build models or analysts can derive insights, the data must first be clean, reliable, and available. The data engineer is responsible for making that happen.

Role Comparison

| Role | Primary Responsibilities | Core Skills |
| --- | --- | --- |
| Data Engineer | Pipeline construction, infrastructure management, data transformation | Python, SQL, Spark, Kafka, Airflow |
| Data Scientist | Modeling, prediction, experiment design | Python, R, TensorFlow, Statistics |
| Data Analyst | Reporting, dashboards, business insights | SQL, Tableau, Excel, BI tools |

A data engineer is the person who builds trustworthy data infrastructure. Even the most sophisticated model is worthless if the data pipeline feeding it is unreliable.

Core Competencies of a Data Engineer

  • SQL proficiency: Complex joins, window functions, CTEs, and optimization
  • Programming: Python is the de facto standard; Scala and Java are also used in the Spark ecosystem
  • Distributed systems: Partitioning, replication, the CAP theorem, and related concepts
  • Cloud services: Fluency with data-related services on AWS, GCP, or Azure
  • Data modeling: Normalization, denormalization, and dimensional modeling
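
As a small illustration of the SQL items above, the following sketch uses Python's built-in sqlite3 module to compute a running revenue total with a CTE and a window function (the table and values are invented for the example):

```python
# Illustrative only: a CTE plus a window function, two of the SQL
# features listed above, against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_date TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('2026-01-01', 100), ('2026-01-02', 250), ('2026-01-03', 175);
""")

# CTE aggregates per day; the window function accumulates a running total
rows = conn.execute("""
    WITH daily AS (
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
    )
    SELECT order_date,
           revenue,
           SUM(revenue) OVER (ORDER BY order_date) AS running_total
    FROM daily
""").fetchall()

for order_date, revenue, running_total in rows:
    print(order_date, revenue, running_total)
```

Note that window functions require SQLite 3.25 or newer, which ships with all recent Python builds.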

2. Data Architecture

Modern data architectures fall into four main patterns.

2.1 Data Lake

A data lake is a large-scale repository that stores structured, semi-structured, and unstructured data in its raw form. Schema is applied at read time (Schema-on-Read).

Raw Data Store (Data Lake)
├── Structured data (CSV, Parquet, ORC)
├── Semi-structured data (JSON, XML, Avro)
└── Unstructured data (images, logs, text)

Pros: Cheap storage, flexible schema, accepts all data formats
Cons: Hard to manage; can devolve into a "data swamp" without governance

Key technologies: AWS S3, Azure Data Lake Storage, Google Cloud Storage

2.2 Data Warehouse

A data warehouse is a structured data store optimized for analytics. Schema is applied at write time (Schema-on-Write).

Analytics Store (Data Warehouse)
├── Fact tables (sales, orders, clicks)
├── Dimension tables (users, products, dates)
└── Aggregate tables (daily/monthly summaries)

Pros: Fast query performance, consistent schema, ACID transactions
Cons: Limited handling of unstructured data; schema changes are costly

Key technologies: Snowflake, BigQuery, Amazon Redshift

2.3 Lakehouse

A lakehouse combines the flexibility of a data lake with the management features of a data warehouse.

Key capabilities:

  • ACID transaction support
  • Schema enforcement and evolution
  • SQL analytics on top of a data lake
  • Unified streaming and batch processing

Key technologies: Delta Lake, Apache Iceberg, Apache Hudi

2.4 Medallion Architecture

The medallion architecture organizes data into three progressive layers.

Bronze (Raw Data)
Ingested as-is from source systems
Minimal transformation applied

Silver (Cleansed Data)
Deduplication, type casting, validation complete
Ready for business logic but not yet aggregated

Gold (Business Data)
Aggregation, joining, business rules applied
Consumed directly by dashboards and ML models

This pattern was popularized by Databricks; its advantage is that data quality can be enforced and verified at each stage.
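
The Bronze → Silver → Gold flow can be sketched in plain Python (the field names and records are invented for illustration; in practice these steps would run on Spark or in SQL):

```python
# Toy medallion sketch: Bronze holds raw records (note the duplicate and
# the invalid row), Silver deduplicates, casts types, and validates,
# and Gold aggregates for consumption.
from datetime import date

bronze = [
    {"order_id": "A1", "amount": "100", "order_date": "2026-01-01"},
    {"order_id": "A1", "amount": "100", "order_date": "2026-01-01"},  # duplicate
    {"order_id": "A2", "amount": "-5",  "order_date": "2026-01-02"},  # invalid
    {"order_id": "A3", "amount": "250", "order_date": "2026-01-02"},
]

def to_silver(records):
    seen, silver = set(), []
    for r in records:
        if r["order_id"] in seen:        # deduplication
            continue
        seen.add(r["order_id"])
        amount = int(r["amount"])        # type casting
        if amount < 0:                   # validation
            continue
        silver.append({
            "order_id": r["order_id"],
            "amount": amount,
            "order_date": date.fromisoformat(r["order_date"]),
        })
    return silver

silver = to_silver(bronze)

# Gold: aggregate the cleansed rows by date for dashboards/models
gold = {}
for r in silver:
    gold[r["order_date"]] = gold.get(r["order_date"], 0) + r["amount"]
```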


3. ETL vs ELT

3.1 ETL (Extract, Transform, Load)

ETL is the traditional approach to data integration.

Source → [Extract] → [Transform] → [Load] → Warehouse
  1. Extract: Pull data from source systems
  2. Transform: Cleanse, transform, and aggregate in a staging area
  3. Load: Write the transformed data to the target system
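
The three stages can be sketched with only the standard library (the CSV content, threshold, and table name are invented for the example):

```python
# Minimal ETL sketch: extract from a CSV source, transform in Python,
# load into a SQLite target standing in for the warehouse.
import csv
import io
import sqlite3

# Extract: read from the source (an in-memory CSV standing in for a file)
source = io.StringIO("order_id,amount\nA1,100\nA2,250\nA3,175\n")
rows = list(csv.DictReader(source))

# Transform: cast types and keep only orders above a threshold
transformed = [(r["order_id"], int(r["amount"]))
               for r in rows if int(r["amount"]) >= 150]

# Load: write the transformed rows into the target system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed)

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```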

3.2 ELT (Extract, Load, Transform)

ELT is the modern approach that leverages the compute power of cloud warehouses.

Source → [Extract] → [Load] → Warehouse → [Transform]
  1. Extract: Pull data from sources
  2. Load: Write raw data directly into the warehouse
  3. Transform: Use SQL inside the warehouse to transform

3.3 When to Use Which

| Criterion | ETL | ELT |
| --- | --- | --- |
| Data volume | Small to medium | Large |
| Transformation complexity | Complex business logic | Expressible in SQL |
| Infrastructure | On-premises | Cloud |
| Data security | Sensitive data needs pre-masking | Warehouse-level access control suffices |
| Key tools | Informatica, Talend | dbt, Fivetran, Airbyte |

3.4 Key Tools

dbt (data build tool): A SQL-based transformation tool that handles the T in ELT.

-- dbt model example: daily revenue aggregation
-- models/marts/daily_revenue.sql

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

payments AS (
    SELECT * FROM {{ ref('stg_payments') }}
)

SELECT
    o.order_date,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(p.amount) AS total_revenue,
    AVG(p.amount) AS avg_order_value
FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE p.status = 'completed'
GROUP BY o.order_date

Airbyte: An open-source data integration platform with over 300 connectors.

Fivetran: A managed data integration service known for easy setup and reliability.


4. Batch Processing

Batch processing is the approach of processing large volumes of accumulated data in a single run. It suits the majority of analytics workloads, where real-time latency is not required.

4.1 Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing.

Key features:

  • In-memory processing: Up to 100x faster than MapReduce
  • Unified API: Batch, streaming, ML, and graph processing in one framework
  • Multi-language support: Python (PySpark), Scala, Java, R, SQL
# PySpark example: daily revenue aggregation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, count

spark = SparkSession.builder \
    .appName("DailyRevenue") \
    .getOrCreate()

# Read data
orders = spark.read.parquet("s3://data-lake/orders/")
payments = spark.read.parquet("s3://data-lake/payments/")

# Transform
daily_revenue = (
    orders
    .join(payments, "order_id")
    .filter(col("status") == "completed")
    .groupBy("order_date")
    .agg(
        count("order_id").alias("total_orders"),
        spark_sum("amount").alias("total_revenue")
    )
    .orderBy("order_date")
)

# Save results
daily_revenue.write \
    .mode("overwrite") \
    .parquet("s3://data-warehouse/daily_revenue/")

4.2 Spark SQL and DataFrames

Spark SQL lets analysts who are comfortable with SQL process large-scale data.

# Register DataFrames and use SQL
orders.createOrReplaceTempView("orders")
payments.createOrReplaceTempView("payments")

result = spark.sql("""
    SELECT
        o.order_date,
        COUNT(DISTINCT o.order_id) AS total_orders,
        SUM(p.amount) AS total_revenue
    FROM orders o
    JOIN payments p ON o.order_id = p.order_id
    WHERE p.status = 'completed'
    GROUP BY o.order_date
    ORDER BY o.order_date
""")

4.3 MapReduce vs Spark

| Criterion | MapReduce | Spark |
| --- | --- | --- |
| Speed | Disk-based, slow | In-memory, fast |
| Programming model | Map and Reduce phases only | Rich set of transformations |
| Real-time processing | Not supported | Structured Streaming |
| Learning curve | Steep | Relatively gentle |
| Ecosystem | Hadoop ecosystem | Standalone + Hadoop compatible |

5. Stream Processing

Stream processing is the approach of processing data in real time as it is generated.

5.1 Apache Kafka

Kafka is a distributed event streaming platform. It serves as the backbone for real-time data pipelines and streaming applications.

Core concepts:

  • Topic: A category to which messages are published
  • Producer: An entity that publishes messages to a topic
  • Consumer: An entity that subscribes to messages from a topic
  • Broker: A server that stores and delivers messages
  • Partition: A subdivision of a topic for parallel processing
  • Consumer Group: Multiple consumers sharing the load of a topic
# Kafka Producer example (Python)
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Publish an order event
event = {
    "order_id": "ORD-12345",
    "user_id": "USR-678",
    "amount": 45000,
    "timestamp": "2026-04-13T10:30:00Z"
}

producer.send('order-events', value=event)
producer.flush()
# Kafka Consumer example (Python)
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'order-events',
    bootstrap_servers=['localhost:9092'],
    group_id='order-processing-group',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest'
)

for message in consumer:
    order = message.value
    print(f"Processing order: {order['order_id']}, amount: {order['amount']}")

5.2 CDC (Change Data Capture)

CDC is a technique that captures database changes in real time and propagates them to other systems.

Operational DB → [CDC] → Kafka → [Stream Processing] ─┬→ Warehouse
                                                      ├→ Search engine
                                                      └→ Cache

Key tool: Debezium. It captures change events from MySQL, PostgreSQL, MongoDB, and more, then streams them to Kafka.

{
  "before": null,
  "after": {
    "id": 1001,
    "name": "John Doe",
    "email": "john@example.com"
  },
  "source": {
    "connector": "postgresql",
    "db": "users_db",
    "table": "users"
  },
  "op": "c",
  "ts_ms": 1681364400000
}

The JSON above shows an example CDC event captured by Debezium. The "op": "c" field indicates an INSERT operation.
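
A downstream consumer applies such events to keep a replica in sync. The sketch below is illustrative, not Debezium's own API: the handler mirrors change events into a dict standing in for a cache or search index, using Debezium's documented op codes (c = create, u = update, d = delete, r = snapshot read):

```python
# Sketch of applying Debezium-style change events to a downstream store.
import json

store = {}  # downstream replica, keyed by primary key

def apply_change(event: dict) -> None:
    op = event["op"]
    if op in ("c", "u", "r"):
        # creates, updates, and snapshot reads upsert the "after" image
        row = event["after"]
        store[row["id"]] = row
    elif op == "d":
        # deletes remove the row identified by the "before" image
        store.pop(event["before"]["id"], None)

raw = ('{"before": null,'
       ' "after": {"id": 1001, "name": "John Doe", "email": "john@example.com"},'
       ' "op": "c", "ts_ms": 1681364400000}')
apply_change(json.loads(raw))
```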

5.3 Apache Flink

Flink is a stateful stream processing engine. It guarantees exactly-once processing semantics and excels at event-time windowed operations.

// Flink stream processing example (Java)
DataStream<OrderEvent> orders = env
    .addSource(new FlinkKafkaConsumer<>(
        "order-events",
        new OrderEventSchema(),
        kafkaProps
    ));

// 5-minute tumbling window aggregation
DataStream<WindowedRevenue> revenue = orders
    .keyBy(OrderEvent::getCategory)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new RevenueAggregator());

revenue.addSink(new JdbcSink<>(...));

5.4 Batch vs Streaming Comparison

| Criterion | Batch Processing | Stream Processing |
| --- | --- | --- |
| Latency | Minutes to hours | Milliseconds to seconds |
| Data completeness | Complete dataset | Arrives incrementally |
| Complexity | Relatively simple | Complex (state management, ordering) |
| Cost | Relatively cheap | Always running, higher cost |
| Best for | Daily reports, ML training | Real-time dashboards, anomaly detection |

6. Workflow Orchestration

6.1 Apache Airflow

Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. Complex data pipelines are defined as DAGs (Directed Acyclic Graphs).

Core concepts:

  • DAG: A directed acyclic graph defining task execution order and dependencies
  • Operator: The unit that performs actual work (BashOperator, PythonOperator, etc.)
  • Task: An individual work instance within a DAG
  • Sensor: A special Operator that waits until a condition is met
  • XCom: A mechanism for passing data between Tasks
# Airflow DAG example
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': ['team@example.com'],
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='daily_revenue_pipeline',
    default_args=default_args,
    description='Daily revenue data processing pipeline',
    schedule_interval='0 2 * * *',  # Every day at 2 AM
    start_date=datetime(2026, 1, 1),
    catchup=False,
    tags=['revenue', 'daily'],
) as dag:

    # 1. Verify source data exists
    check_source = S3KeySensor(
        task_id='check_source_data',
        bucket_name='raw-data-bucket',
        bucket_key='orders/{{ ds }}/*.parquet',
        wildcard_match=True,  # interpret the key as a Unix wildcard pattern
        timeout=3600,
        poke_interval=300,
    )

    # 2. Extract data
    extract = PythonOperator(
        task_id='extract_orders',
        python_callable=extract_orders_from_source,
    )

    # 3. Transform data
    transform = PythonOperator(
        task_id='transform_orders',
        python_callable=transform_and_aggregate,
    )

    # 4. Load to warehouse
    load = PythonOperator(
        task_id='load_to_warehouse',
        python_callable=load_to_snowflake,
    )

    # 5. Data quality validation
    validate = PythonOperator(
        task_id='validate_data_quality',
        python_callable=run_quality_checks,
    )

    # Define dependencies
    check_source >> extract >> transform >> load >> validate

6.2 DAG Design Best Practices

  1. Idempotency: Running a task multiple times should produce the same result
  2. Atomicity: Each Task should perform a single logical unit of work
  3. Retry strategy: Configure retries to handle transient failures
  4. Monitoring: Set SLAs, failure alerts, and execution-time tracking
  5. Testing: DAG structure tests and Task-level unit tests are essential
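
The idempotency rule above can be made concrete with a small sketch (the dict-based "warehouse" and function name are invented for illustration): a load for a given execution date replaces that date's partition rather than appending, so a retry leaves the same final state.

```python
# Idempotent load sketch: overwriting the target partition makes reruns safe.
warehouse = {}  # partition key (execution date) -> rows

def load_partition(ds: str, rows: list) -> None:
    # Overwrite instead of append, so a retried task never duplicates data
    warehouse[ds] = list(rows)

load_partition("2026-01-01", [{"order_id": "A1", "amount": 100}])
load_partition("2026-01-01", [{"order_id": "A1", "amount": 100}])  # retry
```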

6.3 Other Orchestration Tools

| Tool | Characteristics | Best For |
| --- | --- | --- |
| Apache Airflow | Python-based, rich ecosystem | Complex batch pipelines |
| Prefect | Modern API, dynamic workflows | Flexible workflow needs |
| Dagster | Data-asset-centric, strong typing | Data quality focused |
| Mage | Notebook-style interface | Rapid prototyping |
| AWS Step Functions | Serverless, AWS-native | AWS-centric architectures |

7. Data Warehouses

7.1 Major Cloud Warehouses

Snowflake

  • Separated compute and storage architecture
  • Multi-cloud support (AWS, Azure, GCP)
  • Unique features like Time Travel and Zero-Copy Clone
  • Strong concurrency handling

Google BigQuery

  • Serverless architecture (no infrastructure management)
  • Pay-per-query pricing model
  • Create ML models with SQL (BigQuery ML)
  • Optimized for large-scale analytics

Amazon Redshift

  • Deep integration with the AWS ecosystem
  • Serverless option via Redshift Serverless
  • Query S3 data directly with Redshift Spectrum
  • Compatible with existing PostgreSQL tools

7.2 Dimensional Modeling

The most widely used modeling techniques in data warehouses are the star schema and the snowflake schema.

Star Schema

A central fact table surrounded by dimension tables connected directly to it.

-- Fact table
CREATE TABLE fact_sales (
    sale_id         BIGINT PRIMARY KEY,
    date_key        INT REFERENCES dim_date(date_key),
    product_key     INT REFERENCES dim_product(product_key),
    customer_key    INT REFERENCES dim_customer(customer_key),
    store_key       INT REFERENCES dim_store(store_key),
    quantity        INT,
    unit_price      DECIMAL(10,2),
    total_amount    DECIMAL(12,2),
    discount_amount DECIMAL(10,2)
);

-- Dimension tables
CREATE TABLE dim_product (
    product_key     INT PRIMARY KEY,
    product_id      VARCHAR(50),
    product_name    VARCHAR(200),
    category        VARCHAR(100),
    subcategory     VARCHAR(100),
    brand           VARCHAR(100),
    unit_cost       DECIMAL(10,2)
);

CREATE TABLE dim_date (
    date_key        INT PRIMARY KEY,
    full_date       DATE,
    year            INT,
    quarter         INT,
    month           INT,
    week            INT,
    day_of_week     VARCHAR(20),
    is_holiday      BOOLEAN
);

Snowflake Schema

A variation where dimension tables are further normalized into sub-tables. It saves storage compared to a star schema but increases query complexity due to additional joins.

7.3 SCD (Slowly Changing Dimension)

Methods for tracking historical changes in dimension data.

| Type | Description | Example |
| --- | --- | --- |
| SCD Type 1 | Overwrite with new value | Phone number change: old number deleted |
| SCD Type 2 | Add new row preserving history | Address change: new row with validity period |
| SCD Type 3 | Separate columns for current and previous | current_address, previous_address columns |
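
SCD Type 2, the most common variant, can be sketched in plain Python (the column names and records are invented; in a warehouse this is typically a MERGE statement): an address change closes the current row and appends a new one with its own validity period.

```python
# Toy SCD Type 2 sketch: history is preserved by versioned rows.
from datetime import date

dim_customer = [
    {"customer_id": "C1", "address": "Seoul",
     "valid_from": date(2025, 1, 1), "valid_to": None, "is_current": True},
]

def scd2_update(rows, customer_id, new_address, change_date):
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date   # close the old version
            row["is_current"] = False
    rows.append({"customer_id": customer_id, "address": new_address,
                 "valid_from": change_date, "valid_to": None,
                 "is_current": True})

scd2_update(dim_customer, "C1", "Busan", date(2026, 4, 1))
```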

8. Data Quality

Data quality is the critical factor that determines pipeline reliability. "Garbage In, Garbage Out" is a long-standing adage in the data world.

8.1 Six Dimensions of Data Quality

  1. Accuracy: Does the data correctly reflect real-world values?
  2. Completeness: Is there any missing data?
  3. Consistency: Is the data free of contradictions across systems?
  4. Timeliness: Is the data available when needed?
  5. Uniqueness: Is the data free of duplicates?
  6. Validity: Does the data conform to defined rules and formats?

8.2 Great Expectations

Great Expectations is an open-source library for data validation, documentation, and profiling.

import great_expectations as gx
import pandas as pd

# Create data context
context = gx.get_context()

# Connect data source
datasource = context.sources.add_pandas("my_datasource")
data_asset = datasource.add_dataframe_asset(name="orders")

# Define expectations (`df` can be any pandas DataFrame of orders)
df = pd.read_csv("orders.csv")  # example input
batch = data_asset.build_batch_request(dataframe=df)
validator = context.get_validator(batch_request=batch)

# Data quality rules
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between(
    "amount", min_value=0, max_value=1000000
)
validator.expect_column_values_to_be_in_set(
    "status", ["pending", "completed", "cancelled", "refunded"]
)
validator.expect_column_values_to_match_regex(
    "email", r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
)

# Run validation
results = validator.validate()
print(f"Success: {results.success}")

8.3 Data Observability

Data observability is the practice of continuously monitoring data systems to detect issues early.

Key metrics:

  • Freshness: How up-to-date is the data?
  • Volume: Changes in data volume relative to expectations
  • Schema: Detecting schema changes
  • Distribution: Spotting anomalies in data distributions
  • Lineage: Tracing data origin and flow

Key tools: Monte Carlo, Atlan, Soda, Elementary
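
As a minimal example of the freshness metric above, a check can compare the newest record's timestamp to an agreed threshold and alert when the lag is exceeded (the function name and threshold are invented for illustration):

```python
# Freshness check sketch: is the newest data within the allowed lag?
from datetime import datetime, timedelta, timezone

def is_fresh(latest_ts: datetime, max_lag: timedelta, now: datetime) -> bool:
    """Return True if the newest record is no older than max_lag."""
    return (now - latest_ts) <= max_lag

now = datetime(2026, 4, 13, 12, 0, tzinfo=timezone.utc)
ok = is_fresh(datetime(2026, 4, 13, 11, 30, tzinfo=timezone.utc),
              timedelta(hours=1), now)      # 30 minutes old -> fresh
stale = is_fresh(datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc),
                 timedelta(hours=1), now)   # over a day old -> stale
```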


9. Data Governance

9.1 Metadata Management

Metadata is "data about data" and falls into three categories:

  • Technical metadata: Schema, data types, partition info, storage locations
  • Business metadata: Data definitions, owners, SLAs, business rules
  • Operational metadata: Processing times, row counts, error logs, access history

9.2 Data Catalog

A data catalog is a tool that enables discovery, exploration, and understanding of all data assets within an organization.

Key capabilities:

  • Automatic metadata collection and indexing
  • Data lineage visualization
  • Data dictionary management
  • Tagging and classification systems
  • Collaboration features (comments, ratings, wikis)

Key tools: Apache Atlas, DataHub, Atlan, Alation

9.3 Access Control

Core principles of data access control:

  1. Principle of Least Privilege: Grant only the minimum permissions needed
  2. Role-Based Access Control (RBAC): Manage permissions by role
-- Snowflake RBAC example
CREATE ROLE data_analyst;
CREATE ROLE data_engineer;
CREATE ROLE data_admin;

-- Analyst role: read-only access
GRANT USAGE ON DATABASE analytics_db TO ROLE data_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics_db.gold TO ROLE data_analyst;

-- Engineer role: read/write access
GRANT ALL PRIVILEGES ON DATABASE analytics_db TO ROLE data_engineer;
GRANT ALL PRIVILEGES ON ALL SCHEMAS IN DATABASE analytics_db TO ROLE data_engineer;

-- Assign roles to users
GRANT ROLE data_analyst TO USER analyst_kim;
GRANT ROLE data_engineer TO USER engineer_park;
  3. Row-Level Security (RLS): Restrict which rows a user can access
  4. Column-Level Security (CLS): Restrict access to sensitive columns (including masking)
  5. Audit Logging: Record all data access history

10. End-to-End Pipeline Example

10.1 Overall Architecture

Here is an example of an end-to-end data pipeline in a production environment.

[Source Systems]
  ├── Operational DB (PostgreSQL) ── Debezium CDC ──┐
  ├── Web Events (Clickstream) ───── SDK ───────────┤
  ├── External APIs ──────────────── Airbyte ───────┤
  └── Files (CSV/Excel) ──────────── Airflow ───────┤
[Message Broker]                                    │
  └── Apache Kafka <────────────────────────────────┘
       ├── Real-time path: Flink → Real-time dashboards
       └── Batch path:     Spark → S3 (Data Lake)
[Transformation Layer]
  └── dbt (Silver/Gold)
[Warehouse]
  └── Snowflake
       ├── Gold layer → Looker/Tableau dashboards
       └── ML Feature Store → Model training

10.2 Orchestrating with Airflow

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=10),
    'execution_timeout': timedelta(hours=2),
}

with DAG(
    dag_id='e2e_data_pipeline',
    default_args=default_args,
    schedule_interval='0 3 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    # Stage 1: Process raw data with Spark
    spark_process = SparkSubmitOperator(
        task_id='spark_raw_processing',
        application='s3://scripts/process_raw_data.py',
        conn_id='spark_default',
        conf={
            'spark.executor.memory': '4g',
            'spark.executor.cores': '2',
        },
    )

    # Stage 2: Transform with dbt
    dbt_transform = DbtCloudRunJobOperator(
        task_id='dbt_transform',
        job_id=12345,
        check_interval=30,
        timeout=3600,
    )

    # Stage 3: Data quality validation
    quality_check = PythonOperator(
        task_id='data_quality_check',
        python_callable=run_great_expectations_suite,
    )

    # Stage 4: Send completion notification
    notify = SlackWebhookOperator(
        task_id='slack_notification',
        slack_webhook_conn_id='slack_webhook',
        message='Daily pipeline completed successfully.',
    )

    spark_process >> dbt_transform >> quality_check >> notify

10.3 Kafka + Spark Real-Time Processing

# Spark Structured Streaming + Kafka
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, sum as spark_sum
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder \
    .appName("RealTimeRevenue") \
    .getOrCreate()

# Define Kafka schema
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("category", StringType()),
    StructField("timestamp", TimestampType()),
])

# Read streaming data from Kafka
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "order-events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse JSON
orders = (
    raw_stream
    .select(from_json(
        col("value").cast("string"),
        order_schema
    ).alias("data"))
    .select("data.*")
)

# 5-minute windowed aggregation
windowed_revenue = (
    orders
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window("timestamp", "5 minutes"),
        "category"
    )
    .agg(spark_sum("amount").alias("revenue"))
)

# Output to console (in production, write to a DB or dashboard)
query = (
    windowed_revenue.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()

Summary

Data engineering is a field that encompasses a wide range of technologies and concepts. Here is a recap of the topics covered in this post.

| Area | Core Concepts | Key Tools |
| --- | --- | --- |
| Data Storage | Lake, Warehouse, Lakehouse | S3, Snowflake, Delta Lake |
| Data Integration | ETL/ELT | dbt, Airbyte, Fivetran |
| Batch Processing | Distributed computing | Apache Spark |
| Stream Processing | Event streaming, CDC | Apache Kafka, Flink, Debezium |
| Orchestration | Workflow management | Apache Airflow, Dagster |
| Data Quality | Validation, observability | Great Expectations, Monte Carlo |
| Governance | Metadata, access control | DataHub, Apache Atlas |

You do not need to learn every technology at once. Starting with SQL and Python and building a single pipeline from scratch is the most effective way to learn. Begin with a small project and expand incrementally from there.

Quiz: Data Engineering Fundamentals

Q1. What is the biggest difference between a data lake and a data warehouse?

A: A data lake uses a Schema-on-Read approach, storing raw data and applying schema at read time. A data warehouse uses Schema-on-Write, enforcing a predefined schema when data is stored.

Q2. Where does transformation happen in ETL vs ELT?

A: In ETL, transformation occurs in a staging area (a separate server) before loading into the warehouse. In ELT, raw data is loaded into the warehouse first, then transformed inside the warehouse using its compute power.

Q3. What is the role of a Consumer Group in Kafka?

A: A Consumer Group allows multiple consumers to share the processing load of a topic. Consumers within the same group each handle different partitions, enabling parallel processing and horizontal scaling.

Q4. What are the three layers of the Medallion Architecture?

A: Bronze (raw data stored as-is), Silver (cleansed and validated data), and Gold (aggregated data with business logic applied, ready for analytics).

Q5. What is the role of a DAG in Airflow?

A: A DAG (Directed Acyclic Graph) defines the execution order and dependencies of tasks. It declaratively specifies which tasks must run after which other tasks.
