✍️ Transcription Mode: DuckDB Internals Deep Dive — Embedded OLAP, Vectorized Execution, Morsel-Driven Parallelism, and Storage Format (2025)
TL;DR
- DuckDB was started in 2018 at CWI (Centrum Wiskunde & Informatica, Netherlands) by Mark Raasveldt and Hannes Mühleisen. The goal: "SQLite for analytics."
- Design philosophy: single-file format, single process, no external dependencies. A full OLAP engine with just pip install duckdb.
- Storage: columnar. Each block uses compression (RLE, bit-packing, dictionary, FSST for strings).
- Execution model: Vector-based — 1024 rows processed as a chunk. SIMD-friendly.
- Parallelism: Morsel-Driven Parallelism — the idea from the HyPer paper. Each morsel (~100K rows) processed independently.
- Query optimization: cost-based join ordering, filter pushdown, projection pruning, general rule-based optimizations.
- Python integration: access Pandas DataFrames / Polars / Arrow zero-copy. duckdb.sql("SELECT * FROM my_df") just works.
- Parquet/Arrow/CSV/JSON are native: scanned without copying.
- WASM: runs in the browser too. Used by Observable notebooks, Malloy, Evidence.
- MotherDuck: cloud service built on DuckDB. "Hybrid execution" mixes local and cloud.
1. The Problem DuckDB Solved
1.1 The Two Extremes of Data Science
Around 2020, the options for a Python data scientist were:
Pandas (local, single machine):
- Convenient for small data (< 10 GB).
- Out-of-memory and performance problems with large data.
- Mostly single-threaded.
Spark / Dask (distributed):
- Needed for big data (TB+).
- Excessive overhead for small data.
- Requires operating a JVM cluster.
- Seconds just to start.
The middle ground (10 GB to 1 TB) was a blind spot.
1.2 Mark Raasveldt and Hannes Mühleisen
At CWI in 2018 the two envisioned "SQLite for OLAP." They wanted:
- Embedded (inside the process).
- Fast for analytical queries.
- Single-file format.
- Support for various file formats (Parquet, CSV).
- Python as the primary consumer, with a C/C++ API as well.
First public release in 2019. Rapid growth 2020–2022. 1.0 release in 2024 — production-ready.
1.3 A Choice Vindicated by Time
With Pandas and Spark around, do we need "yet another DB"? The 2025 answer: yes.
- Pandas 2.0+: can use DuckDB or Arrow as a backend.
- Polars: similar philosophy, in Rust.
- Malloy, Evidence: BI built on DuckDB.
- Fivetran, dlt: ETL with DuckDB embedded.
- Mode, Hex: interactive analytics on DuckDB.
- MotherDuck: DuckDB as serverless cloud.
DuckDB is becoming the new standard for data science.
2. Design Philosophy
2.1 Embedded
DuckDB is a library: import duckdb in Python, link a library in C++, use a crate in Rust.
import duckdb
con = duckdb.connect('my.duckdb') # or ':memory:'
con.execute("CREATE TABLE users (id INT, name VARCHAR)")
con.execute("INSERT INTO users VALUES (1, 'Alice')")
result = con.execute("SELECT * FROM users").fetchall()
No server process. Same model as SQLite.
2.2 Single File
my_data.duckdb ← one file holding all tables and indexes
Simple to back up, deploy, and share. Scales from a few GB to hundreds of GB.
2.3 No External Dependencies
Builds from a single C++ amalgamation. No companion libraries. Small binary (~30 MB).
2.4 "No Locks, No Compromise"
Concurrency support, kept simple:
- Single-writer: one writer at a time.
- Multiple readers: reads run concurrently with the writer.
- MVCC-based.
No complex distributed consensus — a simple implementation that is still sufficient for analytical workloads.
2.5 Compatibility
- ANSI SQL compliant (PostgreSQL-compatible).
- Reads Parquet, Arrow, CSV, JSON directly.
- Bindings for Python, R, Java, Node.js, Go, Rust, WASM.
3. Storage — Columnar
3.1 Basic Structure
Each table is composed of multiple row groups. Each row group holds multiple column chunks.
Table "events":
Row Group 1 (122,880 rows):
Column "timestamp": [bytes...]
Column "user_id": [bytes...]
Column "event_type": [bytes...]
Column "value": [bytes...]
Row Group 2:
...
Each column chunk is compressed independently.
3.2 Compression
DuckDB's compression codecs:
Uncompressed: raw bytes.
RLE (Run-Length Encoding): repeated values. AAABBC → 3A 2B 1C.
Bit-packing: minimum number of bits for the integer range. 0–7 uses 3 bits each.
Dictionary: dictionary of unique values plus indexes. Effective for string columns.
FSST (Fast Static Symbol Table): string compression. A paper from the DuckDB team.
Chimp / ALP: float compression. For time-series data.
Each chunk automatically picks the most efficient compression.
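You can inspect which codec was actually picked per column segment. A minimal sketch assuming a recent DuckDB Python build; pragma_storage_info is in the official docs, though its output columns can vary by version, and the database file name here is hypothetical:
import duckdb

con = duckdb.connect('demo.duckdb')  # compression applies to on-disk storage
con.execute("CREATE OR REPLACE TABLE t AS SELECT range % 4 AS i FROM range(1000000)")
con.execute("CHECKPOINT")            # flush the WAL so segments are compressed into blocks
print(con.execute("""
    SELECT column_name, compression, count(*) AS segments
    FROM pragma_storage_info('t')
    GROUP BY 1, 2
""").fetchall())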
3.3 Row Group Size
The default row group is 122,880 rows. That number is 2^17 - 2^13 — chosen to enable specific optimizations.
- Too small: metadata overhead.
- Too large: filter pushdown becomes less effective.
- Around 120K strikes a balance.
3.4 Block Size
Internally uses 256 KB blocks. The unit of disk I/O.
3.5 Checkpoint
After writes, periodic checkpoints flush the in-memory WAL to the actual file. Similar to SQLite.
3.6 Stable File Format
Starting with the 2024 1.0 release, the Storage Format is stable. Earlier versions changed the format often, but now it is forward/backward compatible.
4. Vector-Based Execution
4.1 Volcano vs Vectorized
Traditional DB (PostgreSQL):
for each row:
    apply_filter(row)
    transform(row)
    aggregate(row)
Volcano model — one row at a time. Function call overhead accumulates. Branch mispredictions.
Columnar + Vectorized (DuckDB, ClickHouse):
for each batch of 1024 rows:
    filter_batch()
    transform_batch()
    aggregate_batch()
Vector model — 1024 rows at once. SIMD-friendly, cache-efficient.
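The contrast is easy to feel even in plain Python with NumPy. This toy sketch is only an analogy for the two execution styles, not DuckDB code:
import numpy as np

values = np.random.rand(1_000_000)

# Volcano-style: one call per row; interpreter and call overhead dominate
total_rows = sum(v for v in values if v > 0.5)

# Vectorized-style: one operation over the whole batch, SIMD-friendly under the hood
total_batch = values[values > 0.5].sum()
assert abs(total_rows - total_batch) < 1e-6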
4.2 DuckDB's Vector Size
1024 rows per chunk is the default. Why:
- Fits in L1/L2 cache: one column of 1024 × 8 bytes = 8 KB. L1 = 32 KB, L2 = 256 KB. Multiple columns can be processed simultaneously while still sitting in cache.
- SIMD granularity: AVX-512 handles 8 i64 values at once; 1024 / 8 = 128 iterations. Efficient.
- Branch prediction: the inner loop body is easy to predict.
4.3 Vector Structure
struct Vector {
Type type; // INT, VARCHAR, DOUBLE, etc.
VectorType vtype; // FLAT, CONSTANT, DICTIONARY, SEQUENCE
ValidityMask nulls; // per-row null flags
void* data; // actual values
idx_t count; // current row count
};
Vector types:
- FLAT: contiguous array.
- CONSTANT: all rows share one value. Stores only one value (compressed).
- DICTIONARY: unique values plus indexes.
- SEQUENCE: arithmetic progression (e.g. 0, 1, 2, ..., 1023).
When the Vector type is known at compile time, specialized paths run → higher performance.
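A toy Python model of those vector types (illustrative only; these hypothetical classes mirror the struct above, not DuckDB's real C++):
class FlatVector:        # FLAT: contiguous array, data[i] is row i's value
    def __init__(self, values):
        self.data = values
    def get(self, i):
        return self.data[i]

class ConstantVector:    # CONSTANT: one stored value shared by every row
    def __init__(self, value):
        self.value = value
    def get(self, i):
        return self.value

class DictionaryVector:  # DICTIONARY: unique values plus per-row indexes
    def __init__(self, dictionary, indexes):
        self.dictionary, self.indexes = dictionary, indexes
    def get(self, i):
        return self.dictionary[self.indexes[i]]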
4.4 Chunks
Several vectors together form a DataChunk:
struct DataChunk {
vector<Vector> data;
idx_t count; // current row count (up to 1024)
};
"A wide table of 1024 rows." The unit of query execution.
4.5 Operator Pipeline
SELECT user_id, SUM(value)
FROM events
WHERE event_type = 'click'
GROUP BY user_id
Operator tree:
HashAggregate (GROUP BY user_id, SUM(value))
↓
Filter (event_type = 'click')
↓
TableScan (events)
Each operator takes a DataChunk as input, returns a DataChunk as output:
class Operator {
virtual unique_ptr<DataChunk> GetChunk() = 0;
};
Pipeline-based execution. Pull model (an operator asks its upstream operator "give me the next chunk").
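A minimal pull-model sketch in Python (hypothetical Scan/Filter classes, not DuckDB's operator API): each operator asks its child for the next chunk until the source runs dry.
class Scan:
    def __init__(self, chunks):
        self.chunks = iter(chunks)
    def get_chunk(self):
        return next(self.chunks, None)   # None signals exhaustion

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def get_chunk(self):
        chunk = self.child.get_chunk()
        if chunk is None:
            return None
        return [row for row in chunk if self.predicate(row)]

# Two "chunks" of (event_type, value) rows
source = Scan([[('click', 3), ('view', 1)], [('click', 7)]])
pipeline = Filter(source, lambda row: row[0] == 'click')
while (chunk := pipeline.get_chunk()) is not None:
    print(chunk)   # [('click', 3)] then [('click', 7)]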
5. Morsel-Driven Parallelism
5.1 Problems with Traditional Parallelism
Traditional approach: push data between workers with an exchange operator.
Filter → Exchange → Aggregate
(heavy)
Problems:
- Shuffle cost (data copies).
- Load imbalance: some workers are slow.
- Complex implementation.
5.2 The HyPer Morsel Idea
Viktor Leis et al.'s 2014 "Morsel-Driven Parallelism" paper (HyPer DB).
The core:
- Split the input into morsels — about 100K rows each.
- A worker pool exists (one per thread).
- Each worker pulls a morsel from the queue and processes it.
- Work stealing gives you automatic load balance.
Operators do not know there are multiple workers. Each worker processes its own morsel and pushes an output morsel.
5.3 DuckDB's Implementation
DuckDB uses pipeline-based parallelism:
Pipeline 1: TableScan → Filter → PartialHashAggregate
Pipeline 2: FinalHashAggregate → Output
Each pipeline is itself parallelized across workers; when one pipeline finishes, the pipeline that depends on it starts.
Inside a pipeline, parallelism is at the morsel level:
Input morsels: [M1] [M2] [M3] [M4] [M5] ...
Worker 1: pull M1 → process → output
Worker 2: pull M2 → process → output
Worker 3: pull M3 → process → output
Worker 4: pull M4 → process → output
...
A thread-safe queue distributes morsels. Almost no locks.
5.4 Parallel Hash Aggregate
Parallelizing aggregation:
Phase 1: each worker builds a partial hash table (from its local morsel)
Phase 2: merge partial hash tables
Phase 3: output from the final hash table
Phase 1 is completely independent → linear scaling. Phase 2 is a partition-based merge — each partition is assigned to its own worker.
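The whole flow fits in a short sketch. This is illustrative Python using threads and assumes nothing about DuckDB's actual C++ scheduler; it shows morsels pulled from a shared queue, thread-local partial tables, and a final merge:
import queue
import threading
from collections import Counter

morsels = queue.Queue()
for morsel in ([('a', 1), ('b', 2)], [('a', 3)], [('b', 4), ('a', 5)]):
    morsels.put(morsel)

partials = []

def worker():
    local = Counter()                      # Phase 1: thread-local partial hash table
    while True:
        try:
            morsel = morsels.get_nowait()  # pull at your own pace -> load balance
        except queue.Empty:
            break
        for key, value in morsel:
            local[key] += value
    partials.append(local)                 # list.append is atomic under the GIL

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = sum(partials, Counter())           # Phase 2: merge partials (all counts positive here)
print(dict(final))                         # Phase 3: output -> {'a': 9, 'b': 6}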
5.5 The Result
DuckDB scales almost linearly with core count. 4 cores → 4x, 16 cores → 16x (close to this on realistic workloads).
6. Query Optimizer
6.1 Rule-Based Optimizer
Baseline optimizations:
- Filter pushdown: push filters down toward the scan so they apply early.
- Projection pruning: do not read unnecessary columns.
- Constant folding: evaluate constant expressions like 1 + 2 at plan time.
- CSE (Common Subexpression Elimination).
- Predicate simplification: x AND FALSE → FALSE.
- Deliminator: correlated-subquery removal.
6.2 Cost-Based Join Ordering
Join order has a huge impact on performance:
A (1M) JOIN B (100) JOIN C (10)
- A JOIN B first: 1M × 100 = 100M intermediate.
- B JOIN C first: 100 × 10 = 1K intermediate, then 1K JOIN A.
The second is far faster.
DuckDB uses dynamic programming for join ordering. Cost estimation uses table cardinalities and predicate selectivities.
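The arithmetic behind the example above, spelled out. A worst-case sketch that treats each join as a cross product just to show the orders of magnitude involved:
card = {'A': 1_000_000, 'B': 100, 'C': 10}

# Order 1: (A JOIN B) first -> huge intermediate in the worst case
inter_a_b = card['A'] * card['B']   # 100,000,000 rows

# Order 2: (B JOIN C) first -> tiny intermediate, then join with A
inter_b_c = card['B'] * card['C']   # 1,000 rows
print(inter_a_b // inter_b_c)       # the first order is 100,000x larger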
6.3 Statistics
Baseline statistics:
- Row count.
- Distinct count (approximate, via HyperLogLog).
- Min/max per column.
- Null fraction.
Maintained automatically as data is written; no ANALYZE required.
6.4 Zonemaps / Pruning
Each row group stores min/max. For example:
SELECT * FROM events WHERE timestamp > '2024-01-01'
Check each row group's max timestamp → if it does not match, skip. Never read the actual data.
Similar to predicate pushdown in formats like Parquet.
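A toy version of that skipping logic; the dictionary layout is invented for illustration:
row_groups = [
    {'min_ts': '2023-01-01', 'max_ts': '2023-12-31'},   # entirely before the cutoff
    {'min_ts': '2023-11-01', 'max_ts': '2024-03-31'},   # straddles it
    {'min_ts': '2024-04-01', 'max_ts': '2024-12-31'},   # entirely after it
]
cutoff = '2024-01-01'   # ISO dates compare correctly as strings

for rg in row_groups:
    if rg['max_ts'] <= cutoff:
        print('skip: no row can match; not a single byte is read')
    else:
        print('scan: some rows may match')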
6.5 Late Materialization
Taking advantage of the column store:
SELECT name, email FROM users WHERE age > 30
- Early materialization: read all columns, then filter → wasteful.
- Late materialization: read only age first, filter, store the matching row indexes, then fetch name and email only for those indexes → efficient.
DuckDB uses late materialization by default.
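Late materialization in miniature, with NumPy arrays standing in for stored columns (a sketch of the idea, not DuckDB internals):
import numpy as np

# Hypothetical column arrays for the users table above
age = np.array([25, 31, 42, 19, 55])
name = np.array(['ann', 'bob', 'cat', 'dan', 'eve'])
email = np.array(['a@x.io', 'b@x.io', 'c@x.io', 'd@x.io', 'e@x.io'])

# Filter on the narrow column first, keeping only row indexes...
idx = np.flatnonzero(age > 30)
# ...then touch the wide columns just for the surviving rows
print(list(zip(name[idx], email[idx])))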
7. Python Integration
7.1 Zero-Copy DataFrames
DuckDB's killer feature:
import duckdb
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob'],
'age': [30, 25]
})
# Query the DataFrame directly in SQL
result = duckdb.sql("SELECT * FROM df WHERE age > 28").df()
# ^^ Python variable used as a table
print(result)
How? DuckDB inspects the Python interpreter's locals to find the DataFrame and accesses it zero-copy. It reads Pandas's internal Arrow or NumPy buffer directly.
7.2 Polars and Arrow Too
import polars as pl
df = pl.DataFrame({'a': [1,2,3]})
duckdb.sql("SELECT * FROM df").pl() # Polars result
import pyarrow as pa
tbl = pa.table({'a': [1,2,3]})
duckdb.sql("SELECT * FROM tbl").arrow()
Interoperable with every frame library.
7.3 Relational API
Use the Python API instead of SQL:
rel = (
    duckdb.sql("SELECT * FROM df")
    .filter("age > 28")
    .project("name, age * 2 AS double_age")
    .aggregate("SUM(double_age)")
)
Lazy evaluation — nothing runs until .df(), .pl(), or .fetchall() is called. The optimizer sees the whole plan and optimizes it.
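Lazy also means you can inspect the plan before running anything. The explain() method on a relation is part of the documented Python API; a quick check:
print(rel.explain())   # show the query plan without executing the query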
7.4 Jupyter Notebooks
# cell 1
import duckdb
con = duckdb.connect(':memory:')
# cell 2
%load_ext sql # ipython-sql extension
%sql duckdb:///my.duckdb
# cell 3
%%sql
SELECT * FROM my_table LIMIT 10
Write SQL directly in Jupyter. The analytics workflow is seamless.
7.5 Pandas 2.0+ Integration
Pandas 2.0 supports an Arrow backend:
df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')
This df is in Arrow memory layout. DuckDB accesses it zero-copy.
"Pandas → DuckDB → result → Pandas" pipelines may become the new standard.
8. Parquet / Arrow Integration
8.1 Reading Parquet Directly
SELECT * FROM 'data.parquet' WHERE year = 2024
Files act like tables. DuckDB:
- Parses the Parquet footer.
- Inspects row-group metadata.
- Predicate pushdown — reads only matching row groups.
- Projection pushdown — reads only needed columns.
- Parallel scan.
8.2 Many Files at Once
SELECT * FROM 'logs/*.parquet'
SELECT * FROM read_parquet(['f1.parquet', 'f2.parquet'])
Hive partitioning is supported:
data/year=2024/month=01/day=01/part.parquet
WHERE year = 2024 → automatically scans only the matching directory.
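The write side works too: DuckDB's documented COPY ... PARTITION_BY produces exactly this directory layout. A sketch; the input file name is hypothetical and must contain year/month columns:
import duckdb

duckdb.execute("""
    COPY (SELECT * FROM 'events.parquet')
    TO 'data' (FORMAT PARQUET, PARTITION_BY (year, month))
""")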
8.3 S3 / HTTP
SELECT * FROM 's3://bucket/data.parquet'
SELECT * FROM 'https://example.com/data.csv'
With the httpfs extension, remote files work directly. Range requests download only the bytes you need.
8.4 CSV / JSON
SELECT * FROM 'data.csv' -- automatic schema detection
SELECT * FROM read_json('data.json', auto_detect=true)
A smart CSV parser — handles Unicode BOM, various delimiters, and quoted fields.
JSON supports nesting:
SELECT item->>'name' AS name, (item->>'price')::FLOAT AS price
FROM 'items.json'
8.5 Arrow Results
result = con.execute("SELECT * FROM t").fetch_arrow_table()
Returns an Arrow table zero-copy. DuckDB's internal vectors are already Arrow-compatible.
9. WASM and the Browser
9.1 DuckDB-WASM
DuckDB compiles to WebAssembly. An analytical DB directly in the browser:
<script type="module">
import * as duckdb from '@duckdb/duckdb-wasm';
const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);
const worker = new Worker(bundle.mainWorker);
const logger = new duckdb.ConsoleLogger();
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
const conn = await db.connect();
const result = await conn.query(`
SELECT * FROM 'https://example.com/data.parquet' LIMIT 10
`);
</script>
Client-side analytics without a server. Tools like Observable, Malloy, and Evidence take advantage of this.
9.2 Use Cases
- Observable notebooks: analyze JSON/CSV/Parquet in the browser with DuckDB.
- Evidence: static dashboards. Aggregate with DuckDB at build time and ship HTML.
- Static BI: publish data as Parquet, run SQL in the browser.
"Server-less BI" is now possible.
9.3 Performance
20–50% slower than native, but still practical. Analyze a 100 MB Parquet file in seconds.
10. Python Workflow Examples
10.1 The Replacement Scenario
Before (Pandas):
import pandas as pd
df = pd.read_parquet('large.parquet')
grouped = df[df['event_type'] == 'click'].groupby('user_id')['value'].sum()
top10 = grouped.nlargest(10)
- Loads the entire file into memory.
- Single-threaded.
- Slow on large data.
After (DuckDB):
import duckdb
result = duckdb.sql("""
SELECT user_id, SUM(value) AS total
FROM 'large.parquet'
WHERE event_type = 'click'
GROUP BY user_id
ORDER BY total DESC
LIMIT 10
""").df()
- Streams the file (no full load).
- Parallel execution.
- 10x+ faster.
- Same Pandas DataFrame result.
10.2 ETL Pipelines
con = duckdb.connect('warehouse.duckdb')
# Extract: from an API/file
con.execute("""
CREATE OR REPLACE TABLE raw_events AS
SELECT * FROM read_parquet('s3://bucket/events/*.parquet')
""")
# Transform
con.execute("""
CREATE OR REPLACE TABLE clean_events AS
SELECT
user_id,
event_type,
CAST(timestamp AS TIMESTAMP) AS ts,
value
FROM raw_events
WHERE value > 0
""")
# Load
con.execute("COPY clean_events TO 'output.parquet' (FORMAT PARQUET)")
ETL of a few GB to tens of GB runs locally. No Spark required.
10.3 Jupyter Analysis
%load_ext sql
%sql duckdb:///analysis.duckdb
%%sql
SELECT
date_trunc('month', purchase_date) AS month,
category,
SUM(amount) AS revenue
FROM sales
GROUP BY 1, 2
ORDER BY 1, 2
The notebook becomes SQL-first. SQL is often more expressive than Pandas — especially for window functions, CTEs, and joins.
10.4 dbt Integration
The dbt-duckdb adapter:
# profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/my.duckdb
All dbt features (models, tests, snapshots) on local DuckDB. Development/testing is fast. Production stays on Snowflake or wherever.
11. Relationship with Polars
11.1 The Two Projects
Polars: Ritchie Vink's Rust-based DataFrame library. DuckDB: a C++-based SQL engine.
They solve the same problem (fast local OLAP) through different interfaces.
11.2 Polars's Engine
In fact, Polars's early query engine was co-designed with the DuckDB team. Expressions, lazy evaluation, and optimization patterns look similar.
Since 2024 Polars has been transitioning to its own engine (the new streaming engine). But many of the ideas originated in DuckDB.
11.3 Comparison
| Aspect | DuckDB | Polars |
|---|---|---|
| Language | C++ | Rust |
| Interface | SQL | DataFrame API |
| File format | Native .duckdb | Parquet or native |
| Python integration | Zero-copy | Native |
| Analytical features | Rich (SQL standard) | Polars API |
| Ecosystem | dbt, BI tools | ML pipelines |
Both are excellent. Prefer DuckDB if SQL fits you; prefer Polars if a DataFrame API fits you.
11.4 Using Them Together
import polars as pl
import duckdb
# Polars DataFrame
df = pl.read_parquet('data.parquet')
# SQL query via DuckDB
result = duckdb.sql("SELECT name, SUM(value) FROM df GROUP BY name").pl()
# ^^
# returns a Polars result
Interop is excellent — they share Arrow memory.
12. Extensions
12.1 The Extension System
DuckDB has an extensible architecture:
INSTALL httpfs;
LOAD httpfs;
-- Now you can access S3
SELECT * FROM 's3://bucket/file.parquet';
Extensions are dynamic libraries, loaded at runtime.
12.2 Major Extensions
- httpfs: HTTP/S3 access.
- postgres_scanner: query Postgres tables directly.
- mysql_scanner: same for MySQL.
- sqlite_scanner: access SQLite tables.
- iceberg: Apache Iceberg tables.
- delta: Delta Lake tables.
- fts: full-text search.
- json: advanced JSON functions.
- spatial: GIS functions.
- vss (Vector Similarity Search): embedding search (HNSW).
- icu: internationalization.
Dozens more. A rich ecosystem.
12.3 Third-Party
Community extensions can also be loaded:
INSTALL 'my_extension' FROM 'https://example.com/';
LOAD 'my_extension';
Signing is supported. Safe loading.
12.4 Iceberg / Delta
Modern data-lake formats are supported:
SELECT * FROM iceberg_scan('s3://bucket/iceberg-table');
Read tables written by other engines (Spark, Trino) with DuckDB. Engine interoperability.
13. MotherDuck
13.1 The Company
In 2022, MotherDuck was founded in partnership with DuckDB Labs: a managed service built on DuckDB.
13.2 Hybrid Execution
The core idea: a hybrid of local and cloud.
Local (laptop):
- user interaction
- small data caches
- recent queries
Cloud (MotherDuck):
- large dataset storage
- shared data
- heavy computation
A query can join "local data + cloud data." The optimizer decides where to execute.
13.3 Use Case
Small teams:
- Put shared datasets on MotherDuck.
- Everyone analyzes on their laptop with DuckDB.
- The same query works either way.
13.4 Free Tier
MotherDuck offers a free tier. Very attractive pricing for small teams.
A low-cost alternative to Snowflake and BigQuery that is earning a place.
14. Performance Benchmarks
14.1 TPC-H
The standard OLAP benchmark; the DuckDB team uses it heavily during development.
Single-machine performance comparison (1 TB scale, approximate):
DuckDB: 1x (baseline)
PostgreSQL: 20-50x slower
SQLite: 100x+ slower
ClickHouse: 0.5-0.8x (slightly faster)
DuckDB is optimized for single-machine performance. Distribution is handled via MotherDuck or partitioning.
14.2 Compared with Pandas
A simple GroupBy example (10 GB Parquet):
Pandas: 45s (OOM risk)
Polars: 12s
DuckDB: 8s
DuckDB is typically the fastest. For small data (under 100 MB) Pandas is competitive because its overhead is smaller.
14.3 Compared with Spark
Input: 10 GB CSV
Spark (local mode): 120s (JVM startup + plan + execute)
Spark (cluster): 40s (with cluster cost)
DuckDB: 8s
For small data DuckDB is dramatically faster. Spark's startup overhead is too much for small workloads.
At the 1 TB level a Spark cluster may win, but DuckDB remains impressive.
15. Tuning
15.1 Memory
SET memory_limit='8GB';
Default: 80% of the system. On small machines set an explicit cap.
When memory runs out on large data, DuckDB spills to disk using temp files.
15.2 Threads
SET threads=8;
Default: the number of CPU cores. Set explicitly if you need control.
15.3 Temp Directory
SET temp_directory='/fast/ssd/tmp';
For disk spills. Fast SSD improves performance.
15.4 Profiling
EXPLAIN ANALYZE SELECT * FROM t WHERE x > 10;
Shows actual execution time, row counts, and per-operator cost.
PRAGMA enable_profiling;
A detailed profile for every query.
15.5 Tuning Tips
- SELECT only the columns you need (projection pushdown).
- WHERE clauses as early as possible (predicate pushdown).
- ORDER BY + LIMIT on big sorts: if you only need the top k rows, only a partial sort runs.
- Prefer partitioning over indexes: row-group skipping is usually enough.
- COPY for bulk load: 100x faster than INSERT.
16. Learning Resources
Official:
- https://duckdb.org/docs/ — official docs.
- https://duckdb.org/community/ — blog, Discord.
Talks:
- Presentations by Hannes Mühleisen and Mark Raasveldt (CIDR, SIGMOD).
- The "DuckDB under the hood" series.
Papers:
- Papers by Mark Raasveldt and Hannes Mühleisen.
- Papers from CWI's Database Architecture group.
Examples:
- https://duckdb.org/docs/guides/overview.
- DuckDB Python tutorial.
- Malloy + DuckDB tutorial.
Video:
- CMU Database Group lectures on vectorized execution.
17. Summary — One Page
┌─────────────────────────────────────────────────────┐
│ DuckDB Cheat Sheet │
├─────────────────────────────────────────────────────┤
│ Philosophy: │
│ Embedded OLAP database │
│ "SQLite for analytics" │
│ single file, single process │
│ No server, no lock │
│ │
│ Storage: │
│ columnar │
│ Row groups (122,880 rows) │
│ Compression: RLE, bitpack, dict, FSST │
│ Checkpoint + WAL │
│ │
│ Execution: │
│ Vector-based (1024 rows per chunk) │
│ SIMD friendly │
│ Late materialization │
│ Pipeline operators │
│ │
│ Parallelism: │
│ Morsel-Driven Parallelism │
│ ~100K rows per morsel │
│ Work stealing │
│ near-linear scaling │
│ │
│ Optimizer: │
│ Rule-based (pushdown, pruning, CSE) │
│ Cost-based join ordering │
│ HyperLogLog distinct count │
│ Row group skipping (min/max) │
│ │
│ Python: │
│ pip install duckdb │
│ Zero-copy DataFrames (Pandas, Polars, Arrow) │
│ Relational API (lazy) │
│ Jupyter ipython-sql integration │
│ │
│ File integration: │
│ Parquet (predicate pushdown) │
│ Arrow (zero-copy) │
│ CSV (smart parser) │
│ JSON (nested support) │
│ S3 / HTTP (httpfs extension) │
│ │
│ Browser: │
│ DuckDB-WASM │
│ Observable, Malloy, Evidence │
│ │
│ Extensions: │
│ httpfs, postgres_scanner, iceberg, delta │
│ fts, spatial, vss (vector search) │
│ │
│ MotherDuck: │
│ Managed service │
│ Hybrid execution (local + cloud) │
│ Free tier │
│ │
│ vs competitors: │
│ Pandas: 10x faster, big data OK │
│ Polars: similar philosophy, DataFrame API │
│ SQLite: much faster for OLAP │
│ ClickHouse: DuckDB is embedded │
│ Spark: DuckDB wins single-machine │
│ │
│ Tuning: │
│ memory_limit, threads │
│ EXPLAIN ANALYZE │
│ Projection + predicate pushdown │
│ COPY for bulk load │
└─────────────────────────────────────────────────────┘
18. Quiz
Q1. Why is DuckDB called "SQLite for analytics"?
A. Same deployment/usage model, different workload optimization. SQLite provides an embedded DB for OLTP (transactions, small reads/writes, many concurrent users); DuckDB applies the same model to OLAP (big scans, aggregations, complex queries). Common ground: (1) library form — no server, just pip install, (2) single-file format — easy to back up and move, (3) no external dependencies — small binary, (4) ACID + MVCC. Differences: SQLite is row-based storage (B-tree); DuckDB is columnar storage plus vectorized execution. DuckDB's genius is applying the same "embedded DB" philosophy to a different workload. It is the answer the 2020s data scientist wanted to "SQL analytics without a small server."
Q2. How does Morsel-Driven Parallelism differ from traditional exchange-based parallelism?
A. Minimal data movement and work stealing. Traditional parallel DBs (Spark, PostgreSQL parallel query) shuffle data between workers via an exchange operator — high network cost, and load imbalance remains after the shuffle (some workers are slow). Morsel-Driven: split the input into small morsels (~100K rows) and let the worker pool pull from a queue. Each worker pulls at its own pace → fast workers process more morsels → automatic load balance. No shuffle, and each morsel is independent. The 2014 Leis et al. HyPer paper proposed this and DuckDB/Umbra/Vectorwise adopted it. Benefits: (1) simple implementation (lock-free queue), (2) linear scaling, (3) long-tail elimination. Optimal for exploiting dozens of cores on a single machine.
Q3. Why is DuckDB's vector 1024 rows?
A. L1/L2 cache optimization plus SIMD efficiency. Too small (e.g. 16 rows) and function-call/branch-prediction overhead dominates → you recreate the Volcano model's problems. Too large (e.g. 100K rows) and one column exceeds L1/L2 → spill. 1024 rows × 8 bytes = 8 KB per column — multiple columns fit in L1 (32 KB) simultaneously. On AVX-512 each iteration handles 8 i64, so 1024 / 8 = 128 iterations — easy to predict. The specific row-group size of 122,880 = 2^17 - 2^13 is also designed as the optimum combination of compression and vector size. Not magic numbers — grounded in hardware characteristics. ClickHouse uses 65,536; DuckDB uses 1024 — different trade-offs. The evolution "row-based Volcano → column-at-a-time → vector-at-a-time" lands on this current sweet spot.
Q4. How does DuckDB achieve zero-copy access to Python DataFrames?
A. By leveraging the Arrow memory representation. Pandas 2.0+, Polars, and PyArrow all use the Arrow columnar memory layout as an internal or external representation. DuckDB's internal vectors are Arrow-compatible too. When duckdb.sql("SELECT * FROM df") is called, DuckDB: (1) searches the Python interpreter's locals for a df variable (replacement scan), (2) detects the DataFrame type, (3) gets pointers into the internal Arrow buffers, (4) points its vectors at that same memory. No data copy → even huge DataFrames have no latency. Pandas 1.x (NumPy-based) may require some copies, but the Arrow backend in 2.0+ is truly zero-copy. The reverse direction is identical — .df(), .pl(), .arrow() results are zero-copy too. The upshot: Python and SQL integrate transparently — that is the technical basis for the acceleration that replaces Pandas.
Q5. Why is DuckDB's row-group size (122,880) special?
A. The balance point of compression, vector size, and row-group skipping. Smaller (e.g. 10K) and row-group metadata overhead grows while compression becomes less effective. Larger (e.g. 1M) and the skip unit for predicate pushdown becomes too coarse, forcing you to read more data than needed for selective queries. 122,880 = 120 × 1024 — 120 vectors per row group. At that size (1) compression dictionaries are large enough, (2) min/max statistics enable meaningful skipping, (3) metadata overhead is under 1% of total size. Parquet's default row group (1M rows) is optimal for bulk analytics; DuckDB's 122,880 also suits point-ish queries. For the same reason, DuckDB processes Parquet in 120K chunks. A fine example of how a design-detail decision affects real query performance.
Q6. What new use cases did DuckDB-WASM enable?
A. Server-less BI and static dashboards. Traditional BI requires server + DB + auth infrastructure — too much for small teams and individuals. DuckDB-WASM: (1) host Parquet/CSV as static files (GitHub Pages, S3, CloudFront), (2) the browser's DuckDB-WASM uses range requests to download only the needed bytes, (3) SQL runs in the browser, (4) visualize results with JS. Evidence.dev follows this pattern: Markdown + SQL statically generated to HTML/JSON at build time, and visitors query in the browser. Observable notebooks: interactive analysis, everything client-side. Mosaic (UW IDL): interactive visualization of millions of rows. "A small service + static files + DuckDB-WASM" is a new architecture for simple analytics apps. Near-zero infra cost, infinite scalability (CDN), excellent privacy (data need not leave the browser). The opposite pole from Snowflake.
Q7. Are DuckDB and Polars competitors or complements?
A. Both — complementary tools that solve the same problem from different angles. In common: (1) local/embedded OLAP, (2) columnar + vectorized, (3) Arrow-memory compatible, (4) the goal of replacing Pandas. Differences: DuckDB is a SQL engine (C++); Polars is a DataFrame API (Rust). Follow your preference — DuckDB if SQL fits, Polars if method chaining fits. An interesting history: Polars's early query-engine ideas were shared with the DuckDB team (expressions, lazy eval, optimization). Polars is moving to its own streaming engine in 2024, but the roots are the same. A practical pattern: use them together. Manipulate DataFrames with Polars → duckdb.sql() for SQL queries → get Polars DataFrames back. Zero-copy through the same Arrow memory. "One or the other" is the wrong framing — "both" is the answer. A joint front replacing Pandas (sequential, single-threaded).
If this post was helpful, check out these related posts:
- "ClickHouse Internals Deep Dive" — server-based columnar OLAP.
- "SQLite Internals Deep Dive" — the textbook on embedded OLTP.
- "Apache Arrow & Columnar Memory" — the shared memory format.
- "Columnar Storage (Parquet/ORC/Arrow/Dremel)" — a deep dive on file formats.