DuckDB in Practice — The Era of Embedded Analytics That Ends on Your Laptop

Introduction — The Duck That Became a Hacker News Regular
What Is DuckDB — The SQLite of OLAP
Architecture — Why It Is Fast
Getting Started Without Installing Anything
Querying Files Directly — The Killer Feature
A Collection of Hands-On Query Examples
pandas and Polars — Symbiosis, Not Competition
The Extension Ecosystem
Its Place in Data Engineering — Lightweight ETL and dbt
MotherDuck — The Hybrid Execution Compromise
DuckLake and the Lakehouse Current
Everyday Recipes — Worth Committing to Muscle Memory
Limits — What Not to Use It For
Selection Guide — Versus Postgres, SQLite, ClickHouse
Performance Tips
Closing
References

Introduction — The Duck That Became a Hacker News Regular

Hacker News has a set of recurring front-page favorites: odes to SQLite, "Postgres is enough" posts, and DuckDB war stories. Every time a title like "I aggregated hundreds of millions of rows on my laptop in seconds" or "I finished the whole analytics pipeline without a data warehouse" goes up, the comment section converges on the same conclusion: our data is smaller than we think, and DuckDB is faster than we think.

In the context of 2026, another reason has been added. With AI coding agents now mainstream, workflows where "the agent explores the data directly" have become common — and DuckDB, which gives you a SQL analytics environment from a single file with no server installation and no auth setup, has become the poster child of agent-friendly tooling. The era's mood of cloud cost-cutting pressure and big tech fatigue also amplifies the appeal of "analytics that ends on my laptop."

This post covers, from a practitioner perspective, the structural reasons DuckDB is loved, hands-on query patterns, its place in data engineering, and its limits.

What Is DuckDB — The SQLite of OLAP

Defined in one sentence, DuckDB is an embedded analytical database. Just as SQLite embeds as a library inside an application to handle transactions, DuckDB embeds inside your process to handle analytical queries. Lining up the comparison axes:

SQLite: embedded, row-oriented, optimal for OLTP (many small reads and writes)
DuckDB: embedded, column-oriented, optimal for OLAP (few heavy aggregations)
Postgres and MySQL: client-server, row-oriented, OLTP
ClickHouse and BigQuery: client-server (or distributed), column-oriented, OLAP

In other words, DuckDB is the project that filled the empty quadrant precisely: a "serverless ClickHouse" and an "analytical SQLite." It originated in the database research group of the CWI institute in the Netherlands, reached version 1.0 in 2024, and its stability and ecosystem matured rapidly afterwards.

Architecture — Why It Is Fast

DuckDB derives its speed from three design decisions.

+--------------------------------------------------------------+
|  user process (Python, CLI, Node, JVM, ...)                   |
|                                                              |
|   +------------------------------------------------------+   |
|   |  DuckDB (embedded as a library, no server)           |   |
|   |                                                      |   |
|   |  SQL parser / optimizer                              |   |
|   |        |                                             |   |
|   |        v                                             |   |
|   |  vectorized execution engine                         |   |
|   |   - processes vectors (about 2048 values) at a time, |   |
|   |     not one row at a time                            |   |
|   |   - CPU cache friendly, uses SIMD                    |   |
|   |   - automatic parallelism across all cores           |   |
|   |        |                                             |   |
|   |        v                                             |   |
|   |  column-oriented storage                             |   |
|   |   - single-file DB (ACID, MVCC)                      |   |
|   |   - per-column compression (dictionary, RLE,         |   |
|   |     bit packing)                                     |   |
|   |   - direct scans of Parquet/CSV/JSON (no import)     |   |
|   +------------------------------------------------------+   |
|                                                              |
|  data larger than memory spills to disk (out-of-core)        |
+--------------------------------------------------------------+

First, columnar storage. Analytical queries usually read only two or three of dozens of columns. Storing by column means reading only the needed columns from disk, and values of the same type sitting together compress better.

Second, vectorized execution. Instead of the traditional Volcano model that processes one row at a time, operators consume vectors of roughly 2048 values. Function call overhead drops by orders of magnitude, and CPU caches and SIMD instructions are used efficiently.

Third, the single file and out-of-core processing. The whole database is one file, and data larger than memory is processed by automatically spilling intermediate results to disk. This is the decisive difference from pandas, where "does not fit in memory" means game over.

Getting Started Without Installing Anything

The barrier to entry for DuckDB is effectively zero.

# Install the CLI (macOS)
brew install duckdb

# Run immediately in in-memory mode
duckdb

# Open a file DB (created if missing)
duckdb analytics.duckdb

# Python: one line, pip install duckdb, and you are done
import duckdb

# query directly without a connection (in-memory)
duckdb.sql("SELECT 42 AS answer").show()

# use a file DB
con = duckdb.connect("analytics.duckdb")
con.sql("CREATE TABLE t AS SELECT * FROM range(1000000)")

No server daemon, no port configuration, no user accounts. This frictionlessness creates the habit of "let me just open it with DuckDB," and that habit became the foundation of its popularity.

Querying Files Directly — The Killer Feature

DuckDB queries files like tables without "importing" the data first.

-- query a CSV directly (schema inferred automatically)
SELECT count(*) FROM 'events-2026-06.csv';

-- many Parquet files at once via glob patterns
SELECT user_id, sum(amount) AS total
FROM 'data/sales/*.parquet'
GROUP BY user_id
ORDER BY total DESC
LIMIT 10;

-- JSON Lines as well
SELECT payload.event_type, count(*)
FROM read_json_auto('logs/*.jsonl')
GROUP BY 1;

-- converting between file formats is one line of SQL
COPY (SELECT * FROM 'raw.csv') TO 'clean.parquet' (FORMAT parquet);

Load the httpfs extension and remote files work with the same syntax.

INSTALL httpfs;
LOAD httpfs;

-- query an HTTP URL directly
SELECT * FROM 'https://example.com/open-data/stats.parquet' LIMIT 5;

-- aggregate Parquet in an S3 bucket right from your laptop
SET s3_region = 'ap-northeast-2';
SELECT date_trunc('day', ts) AS d, count(*)
FROM 's3://my-bucket/events/year=2026/month=06/*.parquet'
GROUP BY d ORDER BY d;

Because it exploits Parquet column statistics and row-group metadata to read only what is needed, even files of tens of gigabytes answer in seconds when filters prune well. This feature sits at the center of the trend of "query the data lake in place instead of loading it into a warehouse."

A Collection of Hands-On Query Examples

Three patterns you will use constantly in analytical SQL, in DuckDB syntax.

Window Functions — Moving Averages and Ranks

-- 7-day moving average of daily revenue and day-over-day change
WITH daily AS (
  SELECT date_trunc('day', ts) AS d, sum(amount) AS revenue
  FROM 'sales/*.parquet'
  GROUP BY d
)
SELECT
  d,
  revenue,
  avg(revenue) OVER (
    ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS ma7,
  revenue - lag(revenue) OVER (ORDER BY d) AS diff_prev_day,
  rank() OVER (ORDER BY revenue DESC) AS best_day_rank
FROM daily
ORDER BY d;

PIVOT — Rows into Columns

DuckDB supports PIVOT as a dedicated clause. You do not have to enumerate column values by hand.

-- monthly revenue by category in wide format
PIVOT (
  SELECT category, strftime(ts, '%Y-%m') AS month, amount
  FROM 'sales/*.parquet'
)
ON month
USING sum(amount)
GROUP BY category;

-- UNPIVOT works in the opposite direction
UNPIVOT monthly_wide
ON COLUMNS(* EXCLUDE (category))
INTO NAME month VALUE amount;

ASOF JOIN — Matching the Nearest Past Point in Time Series

It shines when joining time series whose timestamps do not align exactly, such as quotes and trades.

-- for each trade, match "the latest exchange rate before the trade time"
SELECT t.trade_id, t.ts, t.amount_usd, r.rate AS usdkrw_at_trade
FROM trades t
ASOF JOIN fx_rates r
  ON t.ts >= r.ts;

Implementing this with regular joins requires correlated subqueries or window-function acrobatics, but ASOF JOIN expresses the intent directly in syntax while keeping the execution plan efficient.

pandas and Polars — Symbiosis, Not Competition

DuckDB complements rather than replaces dataframe libraries. The key is zero-copy interop based on Apache Arrow.

import duckdb
import pandas as pd

df = pd.read_parquet("events.parquet")

# query a pandas DataFrame with SQL, no copy (the variable name is recognized)
result = duckdb.sql("""
    SELECT user_id, count(*) AS cnt
    FROM df
    WHERE event_type = 'purchase'
    GROUP BY user_id
    HAVING cnt >= 3
""").df()  # back to pandas

# the same round trip works with Polars
import polars as pl
pdf = duckdb.sql("SELECT * FROM 'big.parquet'").pl()  # via Arrow, minimal copying

The typical division of labor in practice: DuckDB performs the heavy scans, joins, and aggregations directly on disk, and you pull into a dataframe only for the final polishing right before visualization. Eight out of ten "pandas out of memory" problems disappear with this split.

The Extension Ecosystem

DuckDB keeps the core small and splits features into extensions. Two commands, INSTALL and LOAD, are all you need.

-- spatial analytics: GIS functions and geometry types
INSTALL spatial;
LOAD spatial;
SELECT ST_Distance(
  ST_Point(126.9780, 37.5665),  -- Seoul
  ST_Point(139.6917, 35.6895)   -- Tokyo
) AS deg_distance;

-- full-text search: BM25-based index
INSTALL fts;
LOAD fts;
PRAGMA create_fts_index('docs', 'doc_id', 'body');
SELECT doc_id, fts_main_docs.match_bm25(doc_id, 'duckdb embedded') AS score
FROM docs ORDER BY score DESC LIMIT 10;

Beyond these, json (close to built-in), httpfs, iceberg, delta, the postgres scanner, and the sqlite scanner are in frequent use. The postgres scanner lets DuckDB read your production DB tables directly, making it a convenient alternative for "build an analytics replica without loading the production DB."

Its Place in Data Engineering — Lightweight ETL and dbt

DuckDB has gone beyond "the analyst's toy" to become a legitimate pipeline component. Two representative patterns:

First, as a lightweight ETL engine. For batches under a few dozen gigabytes per day, finishing the transformation with DuckDB in a single container is overwhelmingly simpler than a Spark cluster — in both cost and operational difficulty.

# the skeleton of a daily batch ETL that fits in one container
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("""
    COPY (
        SELECT
            cast(ts AS DATE) AS event_date,
            user_id,
            event_type,
            amount
        FROM 's3://raw-bucket/events/2026-06-11/*.jsonl'
        WHERE event_type IN ('purchase', 'refund')
    )
    TO 's3://curated-bucket/events/dt=2026-06-11/data.parquet'
    (FORMAT parquet, COMPRESSION zstd)
""")

Second, dbt-duckdb. You keep dbt model management, tests, and documentation while swapping the execution engine for DuckDB. You get SQL transformation pipeline best practices without warehouse costs, so adoption has grown fast among startups and side projects. It is also excellent as a test environment for running entire dbt model suites in CI.

MotherDuck — The Hybrid Execution Compromise

The commercial answer to "what if my laptop is no longer enough" is MotherDuck. A cloud service working with the team behind DuckDB, it offers hybrid execution where local DuckDB and cloud DuckDB cooperate within one query. Join a local file against a cloud table, and the optimizer splits which parts execute where.

-- joining a local CSV with a MotherDuck cloud table (conceptual example)
ATTACH 'md:my_database';
SELECT l.campaign, m.user_segment, count(*)
FROM 'local_campaign.csv' l
JOIN my_database.users m USING (user_id)
GROUP BY 1, 2;

This creates a growth path of "start embedded, move to the cloud only as much as needed." It is also the point where vendor dependence reappears, so many teams keep core pipelines on standard DuckDB and Parquet and reserve MotherDuck for sharing and scaling.

DuckLake and the Lakehouse Current

In 2025 the DuckDB team unveiled DuckLake, an open lakehouse format. Where Iceberg-style formats manage metadata across many JSON and Avro files, DuckLake chose the simplification of storing catalog metadata in a SQL database while keeping the data in Parquet — the idea that "metadata is relational data too." It is still early and ecosystem compatibility (especially with existing Iceberg tooling) is the thing to watch, but the direction of a lakehouse that starts from a single node is consistent with the DuckDB philosophy. Existing Iceberg and Delta tables are readable through their respective extensions, so for now the appropriate distance feels like "read the standard formats, experiment with DuckLake."

Everyday Recipes — Worth Committing to Muscle Memory

DuckDB has many convenience constructs that cut down on analyst busywork. Here are the ones you end up using all the time.

-- 1. SUMMARIZE: table profiling in one shot
-- automatic summary of min, max, approx unique counts, null ratio per column
SUMMARIZE SELECT * FROM 'events.parquet';

-- 2. EXCLUDE / REPLACE: lifesavers with wide tables
SELECT * EXCLUDE (raw_payload, internal_flag)
FROM wide_table;

SELECT * REPLACE (round(amount, 2) AS amount)
FROM sales;

-- 3. canonical dedup: keep only the latest row per group
SELECT * FROM events
QUALIFY row_number() OVER (
  PARTITION BY user_id ORDER BY ts DESC
) = 1;

-- 4. reproducible sampling
SELECT * FROM big_table USING SAMPLE 1 PERCENT (bernoulli, 42);

-- 5. list aggregation and structs: handling denormalized data
SELECT user_id, list(distinct category) AS categories
FROM purchases GROUP BY user_id;

SUMMARIZE in particular lets you grasp the quality of freshly received data within a minute, and QUALIFY lets you filter on window function results without a subquery. Master just these two and your exploration speed changes visibly.

One more thing worth emphasizing: forgiving CSV handling. Delimiter inference, encoding issues, and options to skip broken rows absorb a great deal of real-world data messiness.

-- read while discarding broken rows, then inspect what was discarded
SELECT * FROM read_csv('messy.csv', ignore_errors = true);
SELECT * FROM read_csv('messy.csv', store_rejects = true);
SELECT * FROM reject_errors;  -- table of rejected rows and reasons

Limits — What Not to Use It For

The limits of DuckDB are the flip side of its strengths.

First, it is weak at concurrent writes. Within a single process it works well multi-threaded, but multiple processes writing to the same DB file simultaneously is not supported (read-only multi-process access is possible). It is unsuitable as the backend DB of a service where many users write concurrently.

Second, there is no distributed processing. The cores and disks of a single machine are the ceiling. Regular batches beyond several terabytes and BI serving with hundreds of concurrent queries belong to ClickHouse or a cloud warehouse. The key point, however, is that the ceiling is higher than people think — a single machine today can have dozens of cores and hundreds of gigabytes of memory.

Third, it fits batch and interactive analytics better than long-running resident services. High availability, replication, and failover — features of operational databases — were never the goal.

Selection Guide — Versus Postgres, SQLite, ClickHouse

Criterion	SQLite	DuckDB	Postgres	ClickHouse
Deployment	embedded	embedded	server	server or distributed
Storage layout	row-oriented	column-oriented	row-oriented	column-oriented
Best workload	in-app OLTP	single-node OLAP	general OLTP	large-scale OLAP serving
Concurrent writes	single writer	single-process writer	strong	strong
Horizontal scaling	none	none	limited	core strength
Direct file querying	limited	core strength	needs extensions	supported
Operational burden	near zero	near zero	moderate	high
Comfortable scale	on-device data	gigabytes to hundreds of gigabytes	whole services	terabytes and up, high concurrency

Roughly summarized: application state goes to SQLite or Postgres, analytics for one person or one pipeline goes to DuckDB, and real-time analytics serving hammered by a whole organization goes to ClickHouse. And DuckDB doubles as the glue that can read the data of all three others.

Performance Tips

Finally, the tuning points with the biggest practical payoff.

-- 1. manage memory and threads explicitly
SET memory_limit = '8GB';
SET threads = 8;

-- 2. make checking execution plans a habit
EXPLAIN ANALYZE
SELECT category, sum(amount) FROM 'sales/*.parquet' GROUP BY 1;

-- 3. convert CSV to Parquet once and every later query gets faster
COPY (SELECT * FROM 'raw/*.csv') TO 'data.parquet'
  (FORMAT parquet, COMPRESSION zstd);

-- 4. pruning via partitioned directory layouts
-- recognizes hive partitioning such as year=2026/month=06
SELECT count(*) FROM read_parquet('events/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2026 AND month = 6;

Three more things worth remembering. First, avoid star selects and name only the columns you need to get the full benefit of column pruning. Second, large sorts and joins that exceed memory spill to disk automatically — pointing the temp directory at a fast SSD makes a tangible difference. Third, if you query the same data repeatedly, materializing it into the native DuckDB format with CREATE TABLE AS beats re-scanning Parquet.

Closing

The success of DuckDB is not a victory of flashy new technology but of precise positioning. Most analytical data fits on one machine, most analysts do not want to operate servers, and most pipelines deal in Parquet files. DuckDB drove orthodox database technology — columnar storage and vectorized execution — squarely into these three realities.

Adoption is painless too. Start by opening today's CSV in the duckdb shell and firing off a GROUP BY. That single experience will lead to the healthy question: "does this data really need to go into a warehouse?" The era of analytics that ends on your laptop has already arrived.

References

DuckDB documentation: https://duckdb.org/docs/
Why DuckDB (official): https://duckdb.org/why_duckdb
DuckDB 1.0 release announcement: https://duckdb.org/2024/06/03/announcing-duckdb-100.html
DuckLake announcement: https://duckdb.org/2025/05/27/ducklake.html
DuckDB GitHub repository: https://github.com/duckdb/duckdb
dbt-duckdb adapter: https://github.com/duckdb/dbt-duckdb
MotherDuck official site: https://motherduck.com/
DuckDB Python API docs: https://duckdb.org/docs/api/python/overview
httpfs extension docs: https://duckdb.org/docs/extensions/httpfs/overview
ASOF JOIN documentation: https://duckdb.org/docs/sql/query_syntax/from
Hacker News DuckDB discussion: https://news.ycombinator.com/item?id=24531085
GeekNews main page: https://news.hada.io/