- Published on
DuckDB in Practice — The Era of Embedded Analytics That Ends on Your Laptop
- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction — The Duck That Became a Hacker News Regular
- What Is DuckDB — The SQLite of OLAP
- Architecture — Why It Is Fast
- Getting Started Without Installing Anything
- Querying Files Directly — The Killer Feature
- A Collection of Hands-On Query Examples
- pandas and Polars — Symbiosis, Not Competition
- The Extension Ecosystem
- Its Place in Data Engineering — Lightweight ETL and dbt
- MotherDuck — The Hybrid Execution Compromise
- DuckLake and the Lakehouse Current
- Everyday Recipes — Worth Committing to Muscle Memory
- Limits — What Not to Use It For
- Selection Guide — Versus Postgres, SQLite, ClickHouse
- Performance Tips
- Closing
- References
Introduction — The Duck That Became a Hacker News Regular
Hacker News has a set of recurring front-page favorites: odes to SQLite, "Postgres is enough" posts, and DuckDB war stories. Every time a title like "I aggregated hundreds of millions of rows on my laptop in seconds" or "I finished the whole analytics pipeline without a data warehouse" goes up, the comment section converges on the same conclusion: our data is smaller than we think, and DuckDB is faster than we think.
In the context of 2026, another reason has been added. With AI coding agents now mainstream, workflows where "the agent explores the data directly" have become common — and DuckDB, which gives you a SQL analytics environment from a single file with no server installation and no auth setup, has become the poster child of agent-friendly tooling. The era's mood of cloud cost-cutting pressure and big tech fatigue also amplifies the appeal of "analytics that ends on my laptop."
This post covers, from a practitioner perspective, the structural reasons DuckDB is loved, hands-on query patterns, its place in data engineering, and its limits.
What Is DuckDB — The SQLite of OLAP
Defined in one sentence, DuckDB is an embedded analytical database. Just as SQLite embeds as a library inside an application to handle transactions, DuckDB embeds inside your process to handle analytical queries. Lining up the comparison axes:
- SQLite: embedded, row-oriented, optimal for OLTP (many small reads and writes)
- DuckDB: embedded, column-oriented, optimal for OLAP (few heavy aggregations)
- Postgres and MySQL: client-server, row-oriented, OLTP
- ClickHouse and BigQuery: client-server (or distributed), column-oriented, OLAP
In other words, DuckDB is the project that filled the empty quadrant precisely: a "serverless ClickHouse" and an "analytical SQLite." It originated in the database research group of the CWI institute in the Netherlands, reached version 1.0 in 2024, and its stability and ecosystem matured rapidly afterwards.
Architecture — Why It Is Fast
DuckDB derives its speed from three design decisions.
+--------------------------------------------------------------+
| user process (Python, CLI, Node, JVM, ...) |
| |
| +------------------------------------------------------+ |
| | DuckDB (embedded as a library, no server) | |
| | | |
| | SQL parser / optimizer | |
| | | | |
| | v | |
| | vectorized execution engine | |
| | - processes vectors (about 2048 values) at a time, | |
| | not one row at a time | |
| | - CPU cache friendly, uses SIMD | |
| | - automatic parallelism across all cores | |
| | | | |
| | v | |
| | column-oriented storage | |
| | - single-file DB (ACID, MVCC) | |
| | - per-column compression (dictionary, RLE, | |
| | bit packing) | |
| | - direct scans of Parquet/CSV/JSON (no import) | |
| +------------------------------------------------------+ |
| |
| data larger than memory spills to disk (out-of-core) |
+--------------------------------------------------------------+
First, columnar storage. Analytical queries usually read only two or three of dozens of columns. Storing by column means reading only the needed columns from disk, and values of the same type sitting together compress better.
Second, vectorized execution. Instead of the traditional Volcano model that processes one row at a time, operators consume vectors of roughly 2048 values. Function call overhead drops by orders of magnitude, and CPU caches and SIMD instructions are used efficiently.
Third, the single file and out-of-core processing. The whole database is one file, and data larger than memory is processed by automatically spilling intermediate results to disk. This is the decisive difference from pandas, where "does not fit in memory" means game over.
Getting Started Without Installing Anything
The barrier to entry for DuckDB is effectively zero.
# Install the CLI (macOS)
brew install duckdb
# Run immediately in in-memory mode
duckdb
# Open a file DB (created if missing)
duckdb analytics.duckdb
# Python: one line, pip install duckdb, and you are done
import duckdb
# query directly without a connection (in-memory)
duckdb.sql("SELECT 42 AS answer").show()
# use a file DB
con = duckdb.connect("analytics.duckdb")
con.sql("CREATE TABLE t AS SELECT * FROM range(1000000)")
No server daemon, no port configuration, no user accounts. This frictionlessness creates the habit of "let me just open it with DuckDB," and that habit became the foundation of its popularity.
Querying Files Directly — The Killer Feature
DuckDB queries files like tables without "importing" the data first.
-- query a CSV directly (schema inferred automatically)
SELECT count(*) FROM 'events-2026-06.csv';
-- many Parquet files at once via glob patterns
SELECT user_id, sum(amount) AS total
FROM 'data/sales/*.parquet'
GROUP BY user_id
ORDER BY total DESC
LIMIT 10;
-- JSON Lines as well
SELECT payload.event_type, count(*)
FROM read_json_auto('logs/*.jsonl')
GROUP BY 1;
-- converting between file formats is one line of SQL
COPY (SELECT * FROM 'raw.csv') TO 'clean.parquet' (FORMAT parquet);
Load the httpfs extension and remote files work with the same syntax.
INSTALL httpfs;
LOAD httpfs;
-- query an HTTP URL directly
SELECT * FROM 'https://example.com/open-data/stats.parquet' LIMIT 5;
-- aggregate Parquet in an S3 bucket right from your laptop
SET s3_region = 'ap-northeast-2';
SELECT date_trunc('day', ts) AS d, count(*)
FROM 's3://my-bucket/events/year=2026/month=06/*.parquet'
GROUP BY d ORDER BY d;
Because it exploits Parquet column statistics and row-group metadata to read only what is needed, even files of tens of gigabytes answer in seconds when filters prune well. This feature sits at the center of the trend of "query the data lake in place instead of loading it into a warehouse."
A Collection of Hands-On Query Examples
Three patterns you will use constantly in analytical SQL, in DuckDB syntax.
Window Functions — Moving Averages and Ranks
-- 7-day moving average of daily revenue and day-over-day change
WITH daily AS (
SELECT date_trunc('day', ts) AS d, sum(amount) AS revenue
FROM 'sales/*.parquet'
GROUP BY d
)
SELECT
d,
revenue,
avg(revenue) OVER (
ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS ma7,
revenue - lag(revenue) OVER (ORDER BY d) AS diff_prev_day,
rank() OVER (ORDER BY revenue DESC) AS best_day_rank
FROM daily
ORDER BY d;
PIVOT — Rows into Columns
DuckDB supports PIVOT as a dedicated clause. You do not have to enumerate column values by hand.
-- monthly revenue by category in wide format
PIVOT (
SELECT category, strftime(ts, '%Y-%m') AS month, amount
FROM 'sales/*.parquet'
)
ON month
USING sum(amount)
GROUP BY category;
-- UNPIVOT works in the opposite direction
UNPIVOT monthly_wide
ON COLUMNS(* EXCLUDE (category))
INTO NAME month VALUE amount;
ASOF JOIN — Matching the Nearest Past Point in Time Series
It shines when joining time series whose timestamps do not align exactly, such as quotes and trades.
-- for each trade, match "the latest exchange rate before the trade time"
SELECT t.trade_id, t.ts, t.amount_usd, r.rate AS usdkrw_at_trade
FROM trades t
ASOF JOIN fx_rates r
ON t.ts >= r.ts;
Implementing this with regular joins requires correlated subqueries or window-function acrobatics, but ASOF JOIN expresses the intent directly in syntax while keeping the execution plan efficient.
pandas and Polars — Symbiosis, Not Competition
DuckDB complements rather than replaces dataframe libraries. The key is zero-copy interop based on Apache Arrow.
import duckdb
import pandas as pd
df = pd.read_parquet("events.parquet")
# query a pandas DataFrame with SQL, no copy (the variable name is recognized)
result = duckdb.sql("""
SELECT user_id, count(*) AS cnt
FROM df
WHERE event_type = 'purchase'
GROUP BY user_id
HAVING cnt >= 3
""").df() # back to pandas
# the same round trip works with Polars
import polars as pl
pdf = duckdb.sql("SELECT * FROM 'big.parquet'").pl() # via Arrow, minimal copying
The typical division of labor in practice: DuckDB performs the heavy scans, joins, and aggregations directly on disk, and you pull into a dataframe only for the final polishing right before visualization. Eight out of ten "pandas out of memory" problems disappear with this split.
The Extension Ecosystem
DuckDB keeps the core small and splits features into extensions. Two commands, INSTALL and LOAD, are all you need.
-- spatial analytics: GIS functions and geometry types
INSTALL spatial;
LOAD spatial;
SELECT ST_Distance(
ST_Point(126.9780, 37.5665), -- Seoul
ST_Point(139.6917, 35.6895) -- Tokyo
) AS deg_distance;
-- full-text search: BM25-based index
INSTALL fts;
LOAD fts;
PRAGMA create_fts_index('docs', 'doc_id', 'body');
SELECT doc_id, fts_main_docs.match_bm25(doc_id, 'duckdb embedded') AS score
FROM docs ORDER BY score DESC LIMIT 10;
Beyond these, json (close to built-in), httpfs, iceberg, delta, the postgres scanner, and the sqlite scanner are in frequent use. The postgres scanner lets DuckDB read your production DB tables directly, making it a convenient alternative for "build an analytics replica without loading the production DB."
Its Place in Data Engineering — Lightweight ETL and dbt
DuckDB has gone beyond "the analyst's toy" to become a legitimate pipeline component. Two representative patterns:
First, as a lightweight ETL engine. For batches under a few dozen gigabytes per day, finishing the transformation with DuckDB in a single container is overwhelmingly simpler than a Spark cluster — in both cost and operational difficulty.
# the skeleton of a daily batch ETL that fits in one container
import duckdb
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("""
COPY (
SELECT
cast(ts AS DATE) AS event_date,
user_id,
event_type,
amount
FROM 's3://raw-bucket/events/2026-06-11/*.jsonl'
WHERE event_type IN ('purchase', 'refund')
)
TO 's3://curated-bucket/events/dt=2026-06-11/data.parquet'
(FORMAT parquet, COMPRESSION zstd)
""")
Second, dbt-duckdb. You keep dbt model management, tests, and documentation while swapping the execution engine for DuckDB. You get SQL transformation pipeline best practices without warehouse costs, so adoption has grown fast among startups and side projects. It is also excellent as a test environment for running entire dbt model suites in CI.
MotherDuck — The Hybrid Execution Compromise
The commercial answer to "what if my laptop is no longer enough" is MotherDuck. A cloud service working with the team behind DuckDB, it offers hybrid execution where local DuckDB and cloud DuckDB cooperate within one query. Join a local file against a cloud table, and the optimizer splits which parts execute where.
-- joining a local CSV with a MotherDuck cloud table (conceptual example)
ATTACH 'md:my_database';
SELECT l.campaign, m.user_segment, count(*)
FROM 'local_campaign.csv' l
JOIN my_database.users m USING (user_id)
GROUP BY 1, 2;
This creates a growth path of "start embedded, move to the cloud only as much as needed." It is also the point where vendor dependence reappears, so many teams keep core pipelines on standard DuckDB and Parquet and reserve MotherDuck for sharing and scaling.
DuckLake and the Lakehouse Current
In 2025 the DuckDB team unveiled DuckLake, an open lakehouse format. Where Iceberg-style formats manage metadata across many JSON and Avro files, DuckLake chose the simplification of storing catalog metadata in a SQL database while keeping the data in Parquet — the idea that "metadata is relational data too." It is still early and ecosystem compatibility (especially with existing Iceberg tooling) is the thing to watch, but the direction of a lakehouse that starts from a single node is consistent with the DuckDB philosophy. Existing Iceberg and Delta tables are readable through their respective extensions, so for now the appropriate distance feels like "read the standard formats, experiment with DuckLake."
Everyday Recipes — Worth Committing to Muscle Memory
DuckDB has many convenience constructs that cut down on analyst busywork. Here are the ones you end up using all the time.
-- 1. SUMMARIZE: table profiling in one shot
-- automatic summary of min, max, approx unique counts, null ratio per column
SUMMARIZE SELECT * FROM 'events.parquet';
-- 2. EXCLUDE / REPLACE: lifesavers with wide tables
SELECT * EXCLUDE (raw_payload, internal_flag)
FROM wide_table;
SELECT * REPLACE (round(amount, 2) AS amount)
FROM sales;
-- 3. canonical dedup: keep only the latest row per group
SELECT * FROM events
QUALIFY row_number() OVER (
PARTITION BY user_id ORDER BY ts DESC
) = 1;
-- 4. reproducible sampling
SELECT * FROM big_table USING SAMPLE 1 PERCENT (bernoulli, 42);
-- 5. list aggregation and structs: handling denormalized data
SELECT user_id, list(distinct category) AS categories
FROM purchases GROUP BY user_id;
SUMMARIZE in particular lets you grasp the quality of freshly received data within a minute, and QUALIFY lets you filter on window function results without a subquery. Master just these two and your exploration speed changes visibly.
One more thing worth emphasizing: forgiving CSV handling. Delimiter inference, encoding issues, and options to skip broken rows absorb a great deal of real-world data messiness.
-- read while discarding broken rows, then inspect what was discarded
SELECT * FROM read_csv('messy.csv', ignore_errors = true);
SELECT * FROM read_csv('messy.csv', store_rejects = true);
SELECT * FROM reject_errors; -- table of rejected rows and reasons
Limits — What Not to Use It For
The limits of DuckDB are the flip side of its strengths.
First, it is weak at concurrent writes. Within a single process it works well multi-threaded, but multiple processes writing to the same DB file simultaneously is not supported (read-only multi-process access is possible). It is unsuitable as the backend DB of a service where many users write concurrently.
Second, there is no distributed processing. The cores and disks of a single machine are the ceiling. Regular batches beyond several terabytes and BI serving with hundreds of concurrent queries belong to ClickHouse or a cloud warehouse. The key point, however, is that the ceiling is higher than people think — a single machine today can have dozens of cores and hundreds of gigabytes of memory.
Third, it fits batch and interactive analytics better than long-running resident services. High availability, replication, and failover — features of operational databases — were never the goal.
Selection Guide — Versus Postgres, SQLite, ClickHouse
| Criterion | SQLite | DuckDB | Postgres | ClickHouse |
|---|---|---|---|---|
| Deployment | embedded | embedded | server | server or distributed |
| Storage layout | row-oriented | column-oriented | row-oriented | column-oriented |
| Best workload | in-app OLTP | single-node OLAP | general OLTP | large-scale OLAP serving |
| Concurrent writes | single writer | single-process writer | strong | strong |
| Horizontal scaling | none | none | limited | core strength |
| Direct file querying | limited | core strength | needs extensions | supported |
| Operational burden | near zero | near zero | moderate | high |
| Comfortable scale | on-device data | gigabytes to hundreds of gigabytes | whole services | terabytes and up, high concurrency |
Roughly summarized: application state goes to SQLite or Postgres, analytics for one person or one pipeline goes to DuckDB, and real-time analytics serving hammered by a whole organization goes to ClickHouse. And DuckDB doubles as the glue that can read the data of all three others.
Performance Tips
Finally, the tuning points with the biggest practical payoff.
-- 1. manage memory and threads explicitly
SET memory_limit = '8GB';
SET threads = 8;
-- 2. make checking execution plans a habit
EXPLAIN ANALYZE
SELECT category, sum(amount) FROM 'sales/*.parquet' GROUP BY 1;
-- 3. convert CSV to Parquet once and every later query gets faster
COPY (SELECT * FROM 'raw/*.csv') TO 'data.parquet'
(FORMAT parquet, COMPRESSION zstd);
-- 4. pruning via partitioned directory layouts
-- recognizes hive partitioning such as year=2026/month=06
SELECT count(*) FROM read_parquet('events/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2026 AND month = 6;
Three more things worth remembering. First, avoid star selects and name only the columns you need to get the full benefit of column pruning. Second, large sorts and joins that exceed memory spill to disk automatically — pointing the temp directory at a fast SSD makes a tangible difference. Third, if you query the same data repeatedly, materializing it into the native DuckDB format with CREATE TABLE AS beats re-scanning Parquet.
Closing
The success of DuckDB is not a victory of flashy new technology but of precise positioning. Most analytical data fits on one machine, most analysts do not want to operate servers, and most pipelines deal in Parquet files. DuckDB drove orthodox database technology — columnar storage and vectorized execution — squarely into these three realities.
Adoption is painless too. Start by opening today's CSV in the duckdb shell and firing off a GROUP BY. That single experience will lead to the healthy question: "does this data really need to go into a warehouse?" The era of analytics that ends on your laptop has already arrived.
References
- DuckDB documentation: https://duckdb.org/docs/
- Why DuckDB (official): https://duckdb.org/why_duckdb
- DuckDB 1.0 release announcement: https://duckdb.org/2024/06/03/announcing-duckdb-100.html
- DuckLake announcement: https://duckdb.org/2025/05/27/ducklake.html
- DuckDB GitHub repository: https://github.com/duckdb/duckdb
- dbt-duckdb adapter: https://github.com/duckdb/dbt-duckdb
- MotherDuck official site: https://motherduck.com/
- DuckDB Python API docs: https://duckdb.org/docs/api/python/overview
- httpfs extension docs: https://duckdb.org/docs/extensions/httpfs/overview
- ASOF JOIN documentation: https://duckdb.org/docs/sql/query_syntax/from
- Hacker News DuckDB discussion: https://news.ycombinator.com/item?id=24531085
- GeekNews main page: https://news.hada.io/