Prologue — What "The End of the Pandas Era" Actually Means
In 2025 and 2026, the most-quoted sentence in the dataframe community was this:
"I switched to Polars and the same code got 10x faster."
And right next to it, always:
"But to push it into sklearn I ended up calling
to_pandas()anyway."
Those two sentences summarize the 2026 dataframe world precisely. Polars is taking almost all of the new workloads. And yet Pandas is not going away. Fifteen years of ecosystem inertia — sklearn input, statsmodels, plotting libraries, hundreds of thousands of Stack Overflow answers, every classroom — make Pandas the safe default.
This post tries to locate the balance point. The Polars 1.x stability promise, the Rust plus Arrow architecture, what the query optimizer actually does, how far Pandas 2.2 plus PyArrow caught up and where it stopped, how DuckDB — the other Arrow-native engine — beats Polars on some workloads and loses on others, where Dask and Ibis sit, the real cost of migrating. And at the very end — when you should still stay on Pandas.
1. The Foundation Is Arrow — The Real Hero of the DataFrame Revolution
The most common omission in Polars vs Pandas comparisons: both libraries sit on top of Apache Arrow. The revolution is not Polars. It is Arrow.
Why Arrow Matters
Arrow is a columnar in-memory format. NumPy is columnar too, but the differences are huge:
- Language-neutral memory layout — Python, Rust, C++, Java, Go all share the same memory zero-copy.
- Column-major — values of the same column live contiguously in memory. Ideal for SIMD.
- Built-in type system — strings, lists, structs, dictionaries (categories), date, time, decimal all standardized.
- Separate null bitmap — no more abusing NaN as null the way NumPy and Pandas 1.x had to.
- Chunked arrays — split big columns into chunks, the substrate for lazy and streaming execution.
The one-line takeaway:
On top of Arrow, Polars, Pandas, DuckDB and Spark see the same memory. Conversion cost collapses toward zero.
df.to_arrow() then DuckDB reads it directly. DuckDB writes Arrow then Polars reads it directly. Zero-copy. This is why the dataframe world of the 2020s is converging.
Arrow vs NumPy Backend in Practice
| Aspect | NumPy backend (Pandas 1.x) | Arrow backend (Polars / Pandas 2.2+) |
|---|---|---|
| Strings | object dtype, list of Python objects | string dtype, contiguous memory, 5–10x faster |
| Null | forced NaN, int column silently promoted to float | proper null bitmap, dtype preserved |
| Categorical | category dtype, custom impl | standard dictionary type |
| Timestamps | locked to datetime64[ns] | rich timestamp[us, tz] and friends |
| Zero-copy sharing | hard | immediate with DuckDB, Polars, Spark |
| SIMD utilization | limited | natural fit for columnar |
That table is most of the reason Polars beats Pandas. Polars started here; Pandas is catching up via an opt-in backend in 2.x.
2. Polars 1.x — A Rust Core and a Stability Promise
Polars is a Rust dataframe library started by Ritchie Vink in 2020. 1.0 shipped in August 2024, and since then minors have been moving fast while the project promises public API stability. As of 2026 we are deep in the 1.x line (see the release notes at Polars GitHub Releases); breaking changes arrive as opt-in new APIs, and old APIs go through a deprecation cycle.
A Rust Core With Python Bindings
- Core is Rust — the polars crate. Memory safety, SIMD, no-GIL parallelism.
- Python is a binding — py-polars, wrapping the Rust core via PyO3.
- R and Node.js bindings sit on the same core.
What this means: Python's GIL does not slow Polars down. Group-by genuinely uses every core. NumPy parallelizes via BLAS, but the dataframe operations themselves (group-by, join) are stuck behind the GIL in Pandas.
The Polars Data Model
import polars as pl
df = pl.DataFrame({
"user_id": [1, 2, 3, 4],
"country": ["KR", "US", "KR", "JP"],
"amount": [120.0, 95.0, 200.0, 75.0],
})
print(df.schema)
# Schema({'user_id': Int64, 'country': String, 'amount': Float64})
The schema is an ordered map from column name to Arrow dtype. Every dtype is explicit; there is very little inference magic.
Eager vs Lazy
Polars has two modes:
- Eager — pl.DataFrame. Executes immediately, just like Pandas.
- Lazy — pl.LazyFrame. Builds a query plan and only executes at .collect(), after optimization.
# eager
df = pl.read_parquet("sales.parquet")
big_kr = df.filter(pl.col("country") == "KR").select(["user_id", "amount"])
# lazy — build the plan, optimize and execute on .collect()
plan = (
pl.scan_parquet("sales.parquet") # I/O is deferred
.filter(pl.col("country") == "KR")
.select(["user_id", "amount"])
)
big_kr = plan.collect()
See the difference? scan_parquet does not read the file yet. The next chapter shows why that matters so much.
3. Lazy Evaluation and the Query Optimizer — Why Polars Feels Magical
Half of Polars' performance comes from the optimizer that rewrites the lazy plan. What does it do?
3.1 Predicate Pushdown — Push the Filter Into I/O
plan = (
pl.scan_parquet("sales.parquet") # 100 million rows
.filter(pl.col("country") == "KR") # 1 million KR rows
.select(["user_id", "amount"])
)
The naive execution: read all 100M rows, then filter in memory. A hundred times too expensive.
What the optimizer does: push the filter into the scan. It reads Parquet row-group statistics (min and max) and skips row groups where country != "KR". Disk I/O drops by 100x.
3.2 Projection Pushdown — Read Only the Columns You Need
Even though select(["user_id", "amount"]) is at the end, Parquet is columnar, so only those two columns are read off disk. Pandas reads everything at pd.read_parquet() time and selecting later is too late. (Pandas does accept a columns= parameter, but it has no plan-level inference.)
3.3 Common Subexpression Elimination
plan = (
df.lazy()
.with_columns([
(pl.col("price") * pl.col("qty")).alias("revenue"),
(pl.col("price") * pl.col("qty") * 0.1).alias("commission"),
])
)
price * qty appears twice. The optimizer computes it once and reuses it.
3.4 Slice Pushdown, Type Coercion, Dead-Column Removal
If .head(100) sits at the bottom of a plan, only enough rows are pulled up to satisfy those 100. Columns unused after a join are dropped before the join. Implicit casts get reduced.
3.5 The Streaming Engine
collect(engine="streaming") — the spelling that in recent 1.x releases replaced the deprecated collect(streaming=True) — processes datasets that do not fit in memory in chunks, and in 2026 the new streaming engine is approaching stable status (see the Polars blog for the streaming write-ups). Out-of-core inside a single library.
Visualize the Plan
plan.explain() # human-readable plan
plan.show_graph() # graphviz; compare optimized vs naive
This is Polars' brain. Pandas has nothing like it. Pandas takes orders and executes them verbatim.
4. The Polars Expression API — One Series, One Expression
The biggest mental model shift in Polars is the expression.
import polars as pl
df.select([
pl.col("amount").sum().alias("total"),
pl.col("amount").mean().over("country").alias("country_avg"),
(pl.col("amount") - pl.col("amount").mean()).alias("diff_from_mean"),
])
Each expression is a DAG that transforms columns. Where Pandas column manipulation is imperative, Polars expressions are declarative — closer to Spark DataFrame or SQL.
Mental model differences vs Pandas:
| Operation | Pandas | Polars |
|---|---|---|
| Create a new column | df["x"] = df["a"] + df["b"] (assign) | df.with_columns((pl.col("a") + pl.col("b")).alias("x")) (expression) |
| Group-by aggregation | df.groupby("k")["x"].sum() (chained) | df.group_by("k").agg(pl.col("x").sum()) (expression list) |
| Window | df.groupby("k")["x"].transform("mean") | pl.col("x").mean().over("k") |
| Multi-column transform | for loop or apply | one expression list, all at once |
It feels awkward for a few days, then it clicks. And the key point is that an expression is a node in the optimizer's plan. Because expressions are declarative, Polars can optimize them.
5. Pandas 2.2 — How Far Did It Catch Up With the PyArrow Backend?
Pandas introduced the PyArrow backend in 2.0 (2023) and stabilized it in 2.2 (2024). As of 2026, Pandas 2.2.x is the de-facto standard while 3.0 has been in a long beta.
Turning On the Arrow Backend
import pandas as pd
# Option A — explicit Arrow dtypes
df = pd.read_parquet("sales.parquet", dtype_backend="pyarrow")
# Option B — set a Series dtype to Arrow
s = pd.Series([1, 2, None], dtype="int64[pyarrow]")
What you gain:
- String columns become genuinely fast and use much less memory.
- An int column with nulls no longer silently promotes to float (int64[pyarrow] is nullable).
- Date and time stay as Arrow timestamps.
- Zero-copy interop with DuckDB, Polars, Spark.
What Pandas 2.2 Still Cannot Do
- No lazy evaluation. Everything is eager. Predicate pushdown only via read_parquet(filters=).
- No query optimizer. Pandas runs operations in the order you wrote them.
- Single-threaded inside the GIL by default. You opt into numba, numexpr, or pyarrow.compute case by case.
- No expression API. Composing multi-column expressions inside a group-by stays awkward.
- Huge API surface. Pandas is a sprawling dictionary; a lot of implicit magic happens.
To summarize: the data structures moved to Arrow, but the execution engine is still 1990s imperative. That gap is why Polars wins even on the same Arrow foundation.
6. Polars vs Pandas — The Full Comparison Matrix
The same task, side by side in both libraries.
6.1 Read Parquet, Filter, Group-By
Pandas:
import pandas as pd
df = pd.read_parquet("sales.parquet", dtype_backend="pyarrow")
kr = df[df["country"] == "KR"]
result = (
kr.groupby("user_id", as_index=False)["amount"]
.agg(["sum", "mean", "count"])
)
Polars (lazy):
import polars as pl
result = (
pl.scan_parquet("sales.parquet")
.filter(pl.col("country") == "KR")
.group_by("user_id")
.agg([
pl.col("amount").sum().alias("total"),
pl.col("amount").mean().alias("avg"),
pl.col("amount").count().alias("n"),
])
.collect()
)
The line counts look similar, but the plan-level work is not. Polars pushes the filter into the scan and only reads three columns off disk.
6.2 Join
Pandas:
merged = pd.merge(orders, users, on="user_id", how="left")
Polars:
merged = orders.lazy().join(users.lazy(), on="user_id", how="left").collect()
A Polars join is parallel hash-join by default. The gap widens fast as the data grows.
6.3 Window Functions
Pandas:
df["avg_country"] = df.groupby("country")["amount"].transform("mean")
df["rank_in_country"] = df.groupby("country")["amount"].rank(method="dense")
Polars:
df = df.with_columns([
pl.col("amount").mean().over("country").alias("avg_country"),
pl.col("amount").rank(method="dense").over("country").alias("rank_in_country"),
])
If you group multiple windows into a single plan, Polars computes each partition only once.
6.4 The Comparison Matrix
| Aspect | Pandas 2.2 + Arrow | Polars 1.x |
|---|---|---|
| Memory backend | NumPy default, Arrow optional | Arrow only |
| String performance | good with Arrow on | fast from day one |
| Null handling | OK with Arrow dtypes | standard null bitmap |
| Multi-core | partial via numba and numexpr | parallel by default, GIL-free |
| Lazy plan | none | LazyFrame |
| Query optimizer | none | predicate, projection, CSE and more |
| Streaming | none | streaming engine (out-of-core) |
| Expression API | none (chaining only) | first-class |
| Time series | mature and powerful | improving; some gaps remain |
| Plotting integration | seaborn and matplotlib directly | often needs to_pandas() |
| sklearn input | direct | needs to_pandas() or to_numpy() |
| Learning material | overwhelming | growing but smaller |
On raw performance Polars wins decisively. Add ecosystem integration, learning material and legacy and Pandas still has the safety edge.
7. TPC-H Benchmarks — The Real Numbers
The Polars team publishes its own runs of all 22 TPC-H queries (see the Polars benchmark page; concrete numbers depend on SF, hardware and version, so this table is shape-only and meant to anchor decisions, not replace the originals).
| Query | Pandas 2.x | Polars 1.x (lazy) | DuckDB | Dask |
|---|---|---|---|---|
| Q1 (simple group-by) | baseline | roughly 1/10 the time | roughly 1/15 | roughly 1/3 |
| Q3 (join + filter) | baseline | roughly 1/8 | roughly 1/10 | roughly 1/2 |
| Q9 (multi-join, complex) | baseline | roughly 1/7 | roughly 1/8 | similar or slightly faster |
| Q21 (correlated subquery) | often OOMs | runs fine | very fast | can OOM |
| Larger-than-memory dataset | impossible | streaming engine handles it | handles it | distributed handles it |
Key observations:
- Polars and DuckDB are in the same league on a single node — query optimizer plus Arrow plus multi-core. Pandas trails far behind.
- If you need distribution, you reach for Dask, Spark or Ray. Polars optimizes for single-node (or one large cloud VM).
- For SQL-style analytics like TPC-H, DuckDB tops the chart most often. For a dataframe API, Polars wins.
Benchmarks live and die on "which workload". The table is a starting point, not the verdict.
8. Polars vs DuckDB — Two Faces of the Same Engine
This is the most interesting comparison of 2026. Both are Arrow-native, both are vectorized columnar engines, both massacre large data on a single node. The interface is what differs.
- DuckDB — an embeddable SQL engine. duckdb.sql("SELECT ... FROM parquet_scan('sales.parquet') WHERE ..."). Heavy analytical SQL and OLAP. Windowing, complex joins and subqueries are unmatched.
- Polars — a dataframe API. Declarative dataframe with expressions.
# Same task, two interfaces
# DuckDB
import duckdb
result = duckdb.sql("""
SELECT user_id, SUM(amount) AS total
FROM 'sales.parquet'
WHERE country = 'KR'
GROUP BY user_id
""").to_df()
# Polars
import polars as pl
result = (
pl.scan_parquet("sales.parquet")
.filter(pl.col("country") == "KR")
.group_by("user_id")
.agg(pl.col("amount").sum().alias("total"))
.collect()
)
Where DuckDB Wins
- Complex multi-table joins, deep CTEs, correlated subqueries — SQL reads more naturally.
- Analysts already write SQL — zero cognitive cost.
- BI tooling, Jupyter magic, dbt integration — nearly every BI tool can read DuckDB.
Where Polars Wins
- Long pipelines of column-level transforms — expressions are cleaner than nested CTEs.
- The codebase is already in Pandas — the migration surface is smaller.
- ML feature engineering, where you want to hold the dataframe imperatively in Python.
- User-defined Python functions sneak in (map_elements, pipe).
The Pattern of Using Both
# Receive in Polars, hand heavy SQL off to DuckDB, take the result back as Polars
import duckdb, polars as pl
lf = pl.scan_parquet("sales.parquet").filter(pl.col("amount") > 100)
out = duckdb.sql("""
    SELECT country,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY amount) AS p95
    FROM lf
    GROUP BY country
""").pl()
On Arrow, the two exchange data zero-copy. Drop the "pick exactly one" obsession.
9. Dask and Ibis — The Supporting Cast
Dask — When You Need Out-Of-Core or Distribution
Dask is a distributed dataframe with a Pandas-like API. It launches Pandas DataFrames per partition and binds them into a lazy graph, then executes across a cluster.
import dask.dataframe as dd
df = dd.read_parquet("s3://bucket/sales-*.parquet")
result = df[df.country == "KR"].groupby("user_id").amount.sum().compute()
Dask keeps its identity: "Pandas, but distributed". By 2026, releases since the 2024.x line ship a gradually rolled-out query optimizer (work led by Coiled), and Pandas 2.x PyArrow dtypes interoperate cleanly. On a single node it loses to Polars and DuckDB — by design. Tens of terabytes scattered across S3 is the real use case.
Ibis — A Backend-Agnostic DataFrame Spec
Ibis lets you write dataframe expressions and compile them to 20-plus backends: DuckDB, BigQuery, Snowflake, Polars, Pandas, Spark and more.
import ibis
t = ibis.read_parquet("sales.parquet") # default backend is DuckDB
expr = t.filter(t.country == "KR").group_by("user_id").agg(t.amount.sum().name("total"))
expr.execute() # returns Pandas
The same query runs against DuckDB, gets pushed to BigQuery, or hits Snowflake — without code changes. Push heavy work into the warehouse, test the same code locally.
Where Ibis sits in 2026:
- A candidate for the dataframe interface standard. The PyData ecosystem is gradually coalescing here.
- Strong choice for analytics teams that live in a cloud warehouse.
- Backend quirks still leak through occasionally — the same expression does not run identically everywhere.
10. Migration Traps — The Real Cost of Pandas to Polars
On performance alone the move to Polars looks obvious. In practice, people hit a handful of traps.
10.1 No Index
Pandas centers on the index. Polars has no index. .set_index(), .reset_index(), .loc[], MultiIndex — all gone. Joins use explicit key columns and sorts are explicit. At first it feels annoying, but you end up with fewer bugs — no more silent corruption from implicit index alignment.
10.2 No inplace
# Pandas
df["x"] = df["a"] + df["b"] # works
df.rename(columns={"a": "A"}, inplace=True)
# Polars
df = df.with_columns((pl.col("a") + pl.col("b")).alias("x"))
df = df.rename({"a": "A"})
Every operation returns a new DataFrame. You are forced into an immutable mindset.
10.3 The Death of apply (Kind Of)
# Pandas — apply is so common it feels like boilerplate
df["y"] = df.apply(lambda r: complicated_func(r["a"], r["b"]), axis=1)
In Polars, map_elements is the explicitly slow path. Combine expressions when you can. If you must, prefer map_batches for a vectorized batch function.
10.4 Group-By Result Shape
Whether a Pandas group-by returns a Series or a DataFrame depends on subtleties; Polars always returns a DataFrame. But the column-naming rules differ enough to cause KeyErrors — Polars makes you call .alias() explicitly.
10.5 Datetime Is Different
Pandas pins datetime64[ns]. Polars (via Arrow) uses Datetime("us", "UTC") and friends. Timezone comparisons get strict. Mixed-timezone columns will make Polars complain. Annoying at first, a blessing later.
10.6 CSV Type Inference Diverges
read_csv infers dtypes differently. Pandas peeks at a slice and guesses; Polars is more conservative. Get into the habit of declaring dtypes explicitly.
10.7 Method Names Are Subtly Different
- reset_index — does not exist.
- concat(axis=1) — pl.concat([...], how="horizontal").
- pd.melt — pl.DataFrame.unpivot.
- merge — join.
- rename — different signature.
Find-and-replace alone will not get you there; a human has to read the code.
10.8 An Incremental Migration Strategy
# Receive data in Polars from the start
df = pl.read_parquet("a.parquet")
# Heavy transforms in Polars
df = df.filter(...).group_by(...).agg(...)
# Only convert to pandas at the sklearn or seaborn boundary
pdf = df.to_pandas()
This is the realistic pattern. Do not rewrite everything in one go. New pipelines go to Polars; existing ones stay as they are.
11. When You Should Still Stay on Pandas
The most honest chapter of this post. In 2026 it is still right to stay on Pandas in several cases.
11.1 sklearn Is the Endpoint
sklearn treats pandas DataFrames as first-class citizens. ColumnTransformer, OneHotEncoder(sparse_output=False), every step of a pipeline preserves column names. If you are going to call .to_pandas() right before sklearn anyway, the case for moving thins out.
11.2 statsmodels and Causal-Inference Libraries
Most statistics libraries (statsmodels, lifelines, dowhy, econml) assume Pandas input. Formula notation (y ~ x1 + x2) is bound to column names.
11.3 Plotting Integration
seaborn, plotnine and many plotly examples want Pandas. Polars now supports the __dataframe__ protocol so more libraries can read it directly, but Pandas is still the smoothest path.
11.4 Small Data and Quick Prototyping
Below a million rows the difference is tens to hundreds of milliseconds — irrelevant. Familiarity wins.
11.5 The Team Only Knows Pandas, and Migration Cost Exceeds the Gain
If the bottleneck is somewhere else (network I/O, model inference), moving to Polars barely changes end-to-end latency. The migration is pure cost.
11.6 Education
Fifteen years of lectures, Stack Overflow and books are in Pandas. The mass of learning material drives the learning cost.
Polars is becoming the default for new pipelines. Pandas survives as a glue language. Knowing both is the rational position.
12. The 2026 DataFrame Stack — A Decision Map
The same table from a different angle — "which workload picks which tool".
| Workload | First choice | Second choice |
|---|---|---|
| In-memory analytical dataframe, from Python | Polars | DuckDB |
| Heavy SQL analytics in BI or Jupyter | DuckDB | Polars |
| TB scale, scattered across S3 | Dask or Spark | Ray Data |
| Push down to Snowflake or BigQuery | Ibis | (warehouse SDK) |
| ML feature engineering | Polars | Pandas (with Arrow) |
| sklearn model input | Pandas | Polars then to_pandas() |
| Time series (statsmodels) | Pandas | (simple cases in Polars) |
| Small data, quick prototyping | Pandas | Polars |
| Standardized backend-agnostic expressions | Ibis | (custom wrapper) |
Epilogue — The Balanced One-Liner
The one-sentence summary of this post:
Polars is winning almost all of the new workloads. Pandas survives in the swamp of sklearn, plotting and education. The 2026 answer is to know both and choose by workload.
A decision tree for the data engineer:
- New pipeline, single-node analytics → Polars by default.
- SQL is more natural and analysts share the work → DuckDB.
- The data does not fit on one node → Dask, Spark or Ray.
- The cloud warehouse is the real store → Ibis plus the warehouse.
- The ML endpoint is sklearn → transform in Polars, call to_pandas() last.
- Plotting or statistics is the final step → receive into Pandas.
12-Item Checklist
- Is the dataframe backend Arrow? (Pandas: dtype_backend="pyarrow".)
- Is your I/O Parquet? (Anchoring to CSV is a loss in any library.)
- Are you using a lazy plan? (Polars' biggest win.)
- Are filter and select near the top of the plan? (Predicate and projection pushdown.)
- Are expressions inside group-by collected into an expression list?
- Did you check whether the same subexpression appears twice? (CSE target.)
- Do the join-key dtypes match on both sides?
- Are timestamp timezones explicit?
- Are you abusing map_elements?
- Did you measure whether you need the streaming engine?
- Did you batch the final to_pandas() once, right before plotting or ML?
- Did you re-run the benchmarks on your own environment and data?
10 Anti-Patterns
- Loading into Polars and immediately calling to_pandas() to keep writing Pandas code — no gain.
- Skipping the lazy plan and collect()-ing at every step — the optimizer cannot help.
- Leaving map_elements in place where a vectorized expression would do.
- Storing everything in CSV instead of Parquet — half of the Arrow benefit gone.
- Bringing the inplace=True mindset into Polars — it does not exist.
- Comparing datetimes without timezones — Polars yells first; that is when you learn.
- Introducing Dask or Spark before you actually need distribution — complexity spikes, performance drops.
- Forcing SQL-shaped analytics into a dataframe API.
- Quoting benchmark tables without re-running on your own data — weak foundation.
- Mass-migrating off Pandas on the assumption it will "disappear soon" — pure cost.
Next Up
Candidates for the next post:
- DuckDB deep dive — everything about the embedded analytical engine
- Apache Arrow — Flight, DataFusion, ADBC
- PySpark 4.x with Polars and Ray Data — the 2026 distributed-dataframe map
- A real Pandas-to-Polars migration case study
"The engine is the same — it is Arrow. Only the dataframe API sitting on top differs. In 2026 the data engineer is not a believer in one library; they are someone who reads the whole stack."
— Polars 1.x vs Pandas, end.
References
- Polars official site
- Polars GitHub repository
- Polars GitHub Releases (1.x notes)
- Polars User Guide
- Polars Python API reference
- Polars benchmarks page
- Polars blog (streaming and more)
- Pandas official docs
- Pandas 2.2 What's New
- Pandas PyArrow backend guide
- Apache Arrow official
- Apache Arrow Columnar Format spec
- DuckDB official
- DuckDB Python API
- DuckDB Polars integration
- TPC-H specification
- Dask official
- Dask DataFrame Query Optimizer announcement
- Ibis official
- Ibis backend list
- PyO3 (Rust to Python)
- Apache Spark 4.x releases
- Modin official (Pandas-compatible distributed)
- Ray Data
- scikit-learn input types
- DataFrame Interchange Protocol
- Awkward Array (Arrow interop reference)
- DataFusion (Arrow plus Rust query engine)