Polars 1.x vs Pandas — End of an Era? A Modern DataFrame Deep Dive for 2026

Prologue — What "The End of the Pandas Era" Actually Means

In 2025 and 2026, the most-quoted sentence in the dataframe community was this:

"I switched to Polars and the same code got 10x faster."

And right next to it, always:

"But to push it into sklearn I ended up calling to_pandas() anyway."

Those two sentences summarize the 2026 dataframe world precisely. Polars is taking almost all of the new workloads. And yet Pandas is not going away. Fifteen years of ecosystem inertia — sklearn input, statsmodels, plotting libraries, hundreds of thousands of Stack Overflow answers, every classroom — make Pandas the safe default.

This post tries to locate the balance point. The Polars 1.x stability promise, the Rust plus Arrow architecture, what the query optimizer actually does, how far Pandas 2.2 plus PyArrow caught up and where it stopped, how DuckDB — the other Arrow-native engine — beats Polars on some workloads and loses on others, where Dask and Ibis sit, the real cost of migrating. And at the very end — when you should still stay on Pandas.


1. The Foundation Is Arrow — The Real Hero of the DataFrame Revolution

The most common omission in Polars vs Pandas comparisons: both libraries sit on top of Apache Arrow. The revolution is not Polars. It is Arrow.

Why Arrow Matters

Arrow is a columnar in-memory format. NumPy is columnar too, but the differences are huge:

  • Language-neutral memory layout — Python, Rust, C++, Java, Go all share the same memory zero-copy.
  • Column-major — values of the same column live contiguously in memory. Ideal for SIMD.
  • Built-in type system — strings, lists, structs, dictionaries (categories), date, time, decimal all standardized.
  • Separate null bitmap — no more abusing NaN as null the way NumPy and Pandas 1.x had to.
  • Chunked arrays — split big columns into chunks, the substrate for lazy and streaming execution.

The one-line takeaway:

On top of Arrow, Polars, Pandas, DuckDB and Spark see the same memory. Conversion cost collapses toward zero.

df.to_arrow() then DuckDB reads it directly. DuckDB writes Arrow then Polars reads it directly. Zero-copy. This is why the dataframe world of the 2020s is converging.

Arrow vs NumPy Backend in Practice

| Aspect | NumPy backend (Pandas 1.x) | Arrow backend (Polars / Pandas 2.2+) |
|---|---|---|
| Strings | object dtype, list of Python objects | string dtype, contiguous memory, 5–10x faster |
| Null | forced NaN; int column silently promoted to float | proper null bitmap, dtype preserved |
| Categorical | category dtype, custom impl | standard dictionary type |
| Timestamps | locked to datetime64[ns] | rich timestamp[us, tz] and friends |
| Zero-copy sharing | hard | immediate with DuckDB, Polars, Spark |
| SIMD utilization | limited | natural fit for columnar |

That table is most of the reason Polars beats Pandas. Polars started here; Pandas is catching up via an opt-in backend in 2.x.


2. Polars 1.x — A Rust Core and a Stability Promise

Polars is a Rust dataframe library started by Ritchie Vink in 2020. 1.0 shipped in August 2024, and since then minors have been moving fast while the project promises public API stability. As of 2026 we are deep in the 1.x line (see the release notes at Polars GitHub Releases); breaking changes arrive as opt-in new APIs, and old APIs go through a deprecation cycle.

A Rust Core With Python Bindings

  • Core is Rust — the polars crate. Memory safety, SIMD, no-GIL parallelism.
  • Python is a binding — py-polars. Wraps the Rust core via PyO3.
  • R and NodeJS bindings sit on the same core.

What this means: Python's GIL does not slow Polars down. Group-by genuinely uses every core. NumPy parallelizes via BLAS, but the dataframe operations themselves (group-by, join) are stuck behind the GIL in Pandas.

The Polars Data Model

import polars as pl

df = pl.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["KR", "US", "KR", "JP"],
    "amount": [120.0, 95.0, 200.0, 75.0],
})
print(df.schema)
# Schema({'user_id': Int64, 'country': String, 'amount': Float64})

The schema is an ordered map from column name to Arrow dtype. Every dtype is explicit; there is very little inference magic.

Eager vs Lazy

Polars has two modes:

  • Eager — pl.DataFrame. Executes immediately, just like Pandas.
  • Lazy — pl.LazyFrame. Builds a query plan and only executes at .collect(), after optimization.

# eager
df = pl.read_parquet("sales.parquet")
big_kr = df.filter(pl.col("country") == "KR").select(["user_id", "amount"])

# lazy — build the plan, optimize and execute on .collect()
plan = (
    pl.scan_parquet("sales.parquet")          # I/O is deferred
      .filter(pl.col("country") == "KR")
      .select(["user_id", "amount"])
)
big_kr = plan.collect()

See the difference? scan_parquet does not read the file yet. The next chapter shows why that matters so much.


3. Lazy Evaluation and the Query Optimizer — Why Polars Feels Magical

Half of Polars' performance comes from the optimizer that rewrites the lazy plan. What does it do?

3.1 Predicate Pushdown — Push the Filter Into I/O

plan = (
    pl.scan_parquet("sales.parquet")          # 100 million rows
      .filter(pl.col("country") == "KR")       # 1 million KR rows
      .select(["user_id", "amount"])
)

The naive execution: read all 100M rows, then filter in memory. A hundred times too expensive.

What the optimizer does: push the filter into the scan. It reads Parquet row-group statistics (min and max) and skips row groups where country != "KR". Disk I/O drops by 100x.

3.2 Projection Pushdown — Read Only the Columns You Need

Even though select(["user_id", "amount"]) is at the end, Parquet is columnar, so only those two columns are read off disk. Pandas reads everything at pd.read_parquet() time; selecting afterwards is too late. (Pandas does accept a columns= parameter, but there is no plan-level inference.)

3.3 Common Subexpression Elimination

plan = (
    df.lazy()
      .with_columns([
          (pl.col("price") * pl.col("qty")).alias("revenue"),
          (pl.col("price") * pl.col("qty") * 0.1).alias("commission"),
      ])
)

price * qty appears twice. The optimizer computes it once and reuses it.

3.4 Slice Pushdown, Type Coercion, Dead-Column Removal

If .head(100) sits at the bottom of a plan, only enough rows are pulled up to satisfy those 100. Columns unused after a join are dropped before the join. Implicit casts get reduced.

3.5 The Streaming Engine

collect(engine="streaming") — and in 2026 the new streaming engine is approaching stable status (see the Polars blog for the streaming write-ups; the older collect(streaming=True) spelling has been deprecated in its favor) — processes datasets that do not fit in memory in chunks. Out-of-core inside a single library.

Visualize the Plan

plan.explain()       # human-readable plan
plan.show_graph()    # graphviz; compare optimized vs naive

This is Polars' brain. Pandas has nothing like it. Pandas takes orders and executes them verbatim.


4. The Polars Expression API — One Series, One Expression

The biggest mental model shift in Polars is the expression.

import polars as pl

df.select([
    pl.col("amount").sum().alias("total"),
    pl.col("amount").mean().over("country").alias("country_avg"),
    (pl.col("amount") - pl.col("amount").mean()).alias("diff_from_mean"),
])

Each expression is a DAG that transforms columns. Where Pandas column manipulation is imperative, Polars expressions are declarative — closer to Spark DataFrame or SQL.

Mental model differences vs Pandas:

| Operation | Pandas | Polars |
|---|---|---|
| Create a new column | df["x"] = df["a"] + df["b"] (assign) | df.with_columns((pl.col("a") + pl.col("b")).alias("x")) (expression) |
| Group-by aggregation | df.groupby("k")["x"].sum() (chained) | df.group_by("k").agg(pl.col("x").sum()) (expression list) |
| Window | df.groupby("k")["x"].transform("mean") | pl.col("x").mean().over("k") |
| Multi-column transform | for loop or apply | one expression list, all at once |

It feels awkward for a few days, then it clicks. And the key point is that an expression is a node in the optimizer's plan. Because expressions are declarative, Polars can optimize them.


5. Pandas 2.2 — How Far Did It Catch Up With the PyArrow Backend?

Pandas introduced the PyArrow backend in 2.0 (2023) and stabilized it in 2.2 (2024). As of 2026, Pandas 2.2.x is the de-facto standard while 3.0 has been in a long beta.

Turning On the Arrow Backend

import pandas as pd

# Option A — explicit Arrow dtypes
df = pd.read_parquet("sales.parquet", dtype_backend="pyarrow")

# Option B — set a Series dtype to Arrow
s = pd.Series([1, 2, None], dtype="int64[pyarrow]")

What you gain:

  • String columns become genuinely fast and use much less memory.
  • An int column with nulls no longer silently promotes to float (Int64[pyarrow] is nullable).
  • Date and time stay as Arrow timestamps.
  • Zero-copy interop with DuckDB, Polars, Spark.

What Pandas 2.2 Still Cannot Do

  • No lazy evaluation. Everything is eager. Predicate pushdown only via read_parquet(filters=).
  • No query optimizer. Pandas runs operations in the order you wrote them.
  • Single-threaded inside the GIL by default. You opt into numba, numexpr, or pyarrow.compute case by case.
  • No expression API. Composing multi-column expressions inside a group-by stays awkward.
  • Huge API surface. There are several ways to do everything, and a lot of implicit magic happens.

To summarize: the data structures moved to Arrow, but the execution engine is still 1990s imperative. That gap is why Polars wins even on the same Arrow foundation.


6. Polars vs Pandas — The Full Comparison Matrix

The same task, side by side in both libraries.

6.1 Read Parquet, Filter, Group-By

Pandas:

import pandas as pd

df = pd.read_parquet("sales.parquet", dtype_backend="pyarrow")
kr = df[df["country"] == "KR"]
result = (
    kr.groupby("user_id", as_index=False)["amount"]
      .agg(["sum", "mean", "count"])
)

Polars (lazy):

import polars as pl

result = (
    pl.scan_parquet("sales.parquet")
      .filter(pl.col("country") == "KR")
      .group_by("user_id")
      .agg([
          pl.col("amount").sum().alias("total"),
          pl.col("amount").mean().alias("avg"),
          pl.col("amount").count().alias("n"),
      ])
      .collect()
)

The line counts look similar, but the plan-level work is not. Polars pushes the filter into the scan and only reads three columns off disk.

6.2 Join

Pandas:

merged = pd.merge(orders, users, on="user_id", how="left")

Polars:

merged = orders.lazy().join(users.lazy(), on="user_id", how="left").collect()

A Polars join is parallel hash-join by default. The gap widens fast as the data grows.

6.3 Window Functions

Pandas:

df["avg_country"] = df.groupby("country")["amount"].transform("mean")
df["rank_in_country"] = df.groupby("country")["amount"].rank(method="dense")

Polars:

df = df.with_columns([
    pl.col("amount").mean().over("country").alias("avg_country"),
    pl.col("amount").rank(method="dense").over("country").alias("rank_in_country"),
])

If you group multiple windows into a single plan, Polars computes each partition only once.

6.4 The Comparison Matrix

| Aspect | Pandas 2.2 + Arrow | Polars 1.x |
|---|---|---|
| Memory backend | NumPy default, Arrow optional | Arrow only |
| String performance | good with Arrow on | fast from day one |
| Null handling | OK with Arrow dtypes | standard null bitmap |
| Multi-core | partial via numba and numexpr | parallel by default, GIL-free |
| Lazy plan | none | LazyFrame |
| Query optimizer | none | predicate, projection, CSE and more |
| Streaming | none | streaming engine (out-of-core) |
| Expression API | none (chaining only) | first-class |
| Time series | mature and powerful | improving; some gaps remain |
| Plotting integration | seaborn and matplotlib directly | often needs to_pandas() |
| sklearn input | direct | needs to_pandas() or to_numpy() |
| Learning material | overwhelming | growing but smaller |

On raw performance Polars wins decisively. Add ecosystem integration, learning material and legacy and Pandas still has the safety edge.


7. TPC-H Benchmarks — The Real Numbers

The Polars team publishes its own runs of all 22 TPC-H queries (see the Polars benchmark page; concrete numbers depend on SF, hardware and version, so this table is shape-only and meant to anchor decisions, not replace the originals).

| Query | Pandas 2.x | Polars 1.x (lazy) | DuckDB | Dask |
|---|---|---|---|---|
| Q1 (simple group-by) | baseline | roughly 1/10 the time | roughly 1/15 | roughly 1/3 |
| Q3 (join + filter) | baseline | roughly 1/8 | roughly 1/10 | roughly 1/2 |
| Q9 (multi-join, complex) | baseline | roughly 1/7 | roughly 1/8 | similar or slightly faster |
| Q21 (correlated subquery) | often OOMs | runs fine | very fast | can OOM |
| Larger-than-memory dataset | impossible | streaming engine handles it | handles it | distributed handles it |

Key observations:

  • Polars and DuckDB are in the same league on a single node — query optimizer plus Arrow plus multi-core. Pandas trails far behind.
  • If you need distribution, you reach for Dask, Spark or Ray. Polars optimizes for single-node (or one large cloud VM).
  • For SQL-style analytics like TPC-H, DuckDB tops the chart most often. For a dataframe API, Polars wins.

Benchmarks live and die on "which workload". The table is a starting point, not the verdict.


8. Polars vs DuckDB — Two Faces of the Same Engine

This is the most interesting comparison of 2026. Both are Arrow-native, both are vectorized columnar engines, both massacre large data on a single node. The interface is what differs.

  • DuckDB — an embeddable SQL engine. duckdb.sql("SELECT ... FROM parquet_scan('sales.parquet') WHERE ..."). Heavy analytical SQL and OLAP. Windowing, complex joins and subqueries are unmatched.
  • Polars — a dataframe API. Declarative dataframe with expressions.
# Same task, two interfaces
# DuckDB
import duckdb
result = duckdb.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM 'sales.parquet'
    WHERE country = 'KR'
    GROUP BY user_id
""").to_df()

# Polars
import polars as pl
result = (
    pl.scan_parquet("sales.parquet")
      .filter(pl.col("country") == "KR")
      .group_by("user_id")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()
)

Where DuckDB Wins

  • Complex multi-table joins, deep CTEs, correlated subqueries — SQL reads more naturally.
  • Analysts already write SQL — zero cognitive cost.
  • BI tooling, Jupyter magic, dbt integration — nearly every BI tool can read DuckDB.

Where Polars Wins

  • Long pipelines of column-level transforms — expressions are cleaner than nested CTEs.
  • The codebase is already in Pandas — the migration surface is smaller.
  • ML feature engineering, where you want to hold the dataframe imperatively in Python.
  • Pipelines where user-defined Python functions sneak in (map_elements, pipe).

The Pattern of Using Both

# Receive in Polars, hand heavy SQL off to DuckDB, take the result back as Polars
import duckdb, polars as pl

lf = pl.scan_parquet("sales.parquet").filter(pl.col("amount") > 100)
out = duckdb.sql("""
    SELECT country,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY amount) AS p95
    FROM lf
    GROUP BY country
""").pl()

On Arrow, the two exchange data zero-copy. Drop the "pick exactly one" obsession.


9. Dask and Ibis — The Supporting Cast

Dask — When You Need Out-Of-Core or Distribution

Dask is a distributed dataframe with a Pandas-like API. It launches Pandas DataFrames per partition and binds them into a lazy graph, then executes across a cluster.

import dask.dataframe as dd
df = dd.read_parquet("s3://bucket/sales-*.parquet")
result = df[df.country == "KR"].groupby("user_id").amount.sum().compute()

Dask keeps its identity: "Pandas, but distributed". By 2026 the 2024.x line and beyond ships a gradually rolled-out query optimizer (work led by Coiled), and Pandas 2.x PyArrow dtypes interoperate cleanly. On a single node it loses to Polars and DuckDB — by design. Tens of terabytes scattered across S3 is the real use case.

Ibis — A Backend-Agnostic DataFrame Spec

Ibis lets you write dataframe expressions and compile them to 20-plus backends: DuckDB, BigQuery, Snowflake, Polars, Pandas, Spark and more.

import ibis
t = ibis.read_parquet("sales.parquet")  # default backend is DuckDB
expr = t.filter(t.country == "KR").group_by("user_id").agg(t.amount.sum().name("total"))
expr.execute()  # returns Pandas

The same query runs against DuckDB, gets pushed to BigQuery, or hits Snowflake — without code changes. Push heavy work into the warehouse, test the same code locally.

Where Ibis sits in 2026:

  • A candidate for the dataframe interface standard. The PyData ecosystem is gradually coalescing here.
  • Strong choice for analytics teams that live in a cloud warehouse.
  • Backend quirks still leak through occasionally — the same expression does not run identically everywhere.

10. Migration Traps — The Real Cost of Pandas to Polars

On performance alone the move to Polars looks obvious. In practice, people hit a handful of traps.

10.1 No Index

Pandas centers on the index. Polars has no index. .set_index(), .reset_index(), .loc[], MultiIndex — all gone. Joins use explicit key columns and sorts are explicit. At first it feels annoying, but you end up with fewer bugs — no more silent corruption from implicit index alignment.

10.2 No inplace

# Pandas
df["x"] = df["a"] + df["b"]            # works
df.rename(columns={"a": "A"}, inplace=True)

# Polars
df = df.with_columns((pl.col("a") + pl.col("b")).alias("x"))
df = df.rename({"a": "A"})

Every operation returns a new DataFrame. You are forced into an immutable mindset.

10.3 The Death of apply (Kind Of)

# Pandas — apply is so common it feels like boilerplate
df["y"] = df.apply(lambda r: complicated_func(r["a"], r["b"]), axis=1)

In Polars, map_elements is the explicitly slow path. Combine expressions when you can. If you must, prefer map_batches for a vectorized batch function.

10.4 Group-By Result Shape

Whether a Pandas group-by returns a Series or a DataFrame depends on subtleties; Polars always returns a DataFrame. But the column-naming rules differ enough to cause KeyErrors — Polars makes you call .alias() explicitly.

10.5 Datetime Is Different

Pandas pins datetime64[ns]. Polars (via Arrow) uses Datetime("us", "UTC") and friends. Timezone comparisons get strict. Mixed-timezone columns will make Polars complain. Annoying at first, a blessing later.

10.6 CSV Type Inference Diverges

read_csv infers dtypes differently. Pandas peeks at a slice and guesses; Polars is more conservative. Get into the habit of declaring dtypes explicitly.

10.7 Method Names Are Subtly Different

reset_index does not exist. pd.concat(axis=1) becomes pl.concat([...], how="horizontal"). pd.melt becomes pl.DataFrame.unpivot. merge becomes join. rename takes a different signature. Find-and-replace alone will not get you there; a human has to read the code.

10.8 An Incremental Migration Strategy

# Receive data in Polars from the start
df = pl.read_parquet("a.parquet")
# Heavy transforms in Polars
df = df.filter(...).group_by(...).agg(...)
# Only convert to pandas at the sklearn or seaborn boundary
pdf = df.to_pandas()

This is the realistic pattern. Do not rewrite everything in one go. New pipelines go to Polars; existing ones stay as they are.


11. When You Should Still Stay on Pandas

The most honest chapter of this post. In 2026 it is still right to stay on Pandas in several cases.

11.1 sklearn Is the Endpoint

sklearn treats pandas DataFrames as first-class citizens. ColumnTransformer, OneHotEncoder(sparse_output=False), every step of a pipeline preserves column names. If you are going to call .to_pandas() right before sklearn anyway, the case for moving thins out.

11.2 statsmodels and Causal-Inference Libraries

Most statistics libraries (statsmodels, lifelines, dowhy, econml) assume Pandas input. Formula notation (y ~ x1 + x2) is bound to column names.

11.3 Plotting Integration

seaborn, plotnine and many plotly examples want Pandas. Polars now supports the __dataframe__ protocol so more libraries can read it directly, but Pandas is still the smoothest path.

11.4 Small Data and Quick Prototyping

Below a million rows the difference is tens to hundreds of milliseconds — irrelevant. Familiarity wins.

11.5 The Team Only Knows Pandas, and Migration Cost Exceeds the Gain

If the bottleneck is somewhere else (network I/O, model inference), moving to Polars barely changes end-to-end latency. The migration is pure cost.

11.6 Education

Fifteen years of lectures, Stack Overflow and books are in Pandas. The mass of learning material drives the learning cost.

Polars is becoming the default for new pipelines. Pandas survives as a glue language. Knowing both is the rational position.


12. The 2026 DataFrame Stack — A Decision Map

The same table from a different angle — "which workload picks which tool".

| Workload | First choice | Second choice |
|---|---|---|
| In-memory analytical dataframe, from Python | Polars | DuckDB |
| Heavy SQL analytics in BI or Jupyter | DuckDB | Polars |
| TB scale, scattered across S3 | Dask or Spark | Ray Data |
| Push down to Snowflake or BigQuery | Ibis | (warehouse SDK) |
| ML feature engineering | Polars | Pandas (with Arrow) |
| sklearn model input | Pandas | Polars then to_pandas() |
| Time series (statsmodels) | Pandas | (simple cases in Polars) |
| Small data, quick prototyping | Pandas | Polars |
| Standardized backend-agnostic expressions | Ibis | (custom wrapper) |

Epilogue — The Balanced One-Liner

The one-sentence summary of this post:

Polars is winning almost all of the new workloads. Pandas survives in the swamp of sklearn, plotting and education. The 2026 answer is to know both and choose by workload.

A decision tree for the data engineer:

  1. New pipeline, single-node analytics → Polars by default.
  2. SQL is more natural and analysts share the work → DuckDB.
  3. The data does not fit on one node → Dask, Spark or Ray.
  4. The cloud warehouse is the real store → Ibis plus the warehouse.
  5. The ML endpoint is sklearn → transform in Polars, call to_pandas() last.
  6. Plotting or statistics is the final step → receive into Pandas.

12-Item Checklist

  1. Is the dataframe backend Arrow? (Pandas: dtype_backend="pyarrow".)
  2. Is your I/O Parquet? (Anchoring to CSV is a loss in any library.)
  3. Are you using a lazy plan? (Polars' biggest win.)
  4. Are filter and select near the top of the plan? (Predicate and projection pushdown.)
  5. Are expressions inside group-by collected into an expression list?
  6. Did you check whether the same subexpression appears twice? (CSE target.)
  7. Do the join-key dtypes match on both sides?
  8. Are timestamp timezones explicit?
  9. Are you abusing map_elements?
  10. Did you measure whether you need the streaming engine?
  11. Did you batch the final to_pandas() once, right before plotting or ML?
  12. Did you re-run the benchmarks on your own environment and data?

10 Anti-Patterns

  1. Loading into Polars and immediately calling to_pandas() to keep writing Pandas code — no gain.
  2. Skipping the lazy plan and collect()ing at every step — the optimizer cannot help.
  3. Leaving map_elements in place where a vectorized expression would do.
  4. Storing everything in CSV instead of Parquet — half of the Arrow benefit gone.
  5. Bringing inplace=True mindset into Polars — it does not exist.
  6. Comparing datetimes without timezones — Polars yells first; that is when you learn.
  7. Introducing Dask or Spark before you actually need distribution — complexity spikes, performance drops.
  8. Forcing SQL-shaped analytics into a dataframe API.
  9. Quoting benchmark tables without re-running on your own data — weak foundation.
  10. Mass-migrating off Pandas on the assumption it will "disappear soon" — pure cost.

Next Up

Candidates for the next post: DuckDB deep dive — everything about the embedded analytical engine, Apache Arrow — Flight, DataFusion, ADBC, PySpark 4.x with Polars and Ray Data — the 2026 distributed-dataframe map, A real Pandas to Polars migration case study.

"The engine is the same — it is Arrow. Only the dataframe API sitting on top differs. In 2026 the data engineer is not a believer in one library; they are someone who reads the whole stack."

— Polars 1.x vs Pandas, end.

