Columnar Storage Deep Dive 2025: Parquet, ORC, Apache Arrow, Dremel — Why Analytics DBs Are 1,000x Faster

Intro: Same data, 1000x difference

An experiment

100M rows of sales data. Each row: (id, timestamp, country, product, category, price, ...) with 50 columns.

Question: "Average price of products sold in Korea?"

SELECT AVG(price) FROM sales WHERE country = 'KR';

Row-oriented storage (CSV, MySQL):

  1. Read all 100M rows (all 50 columns).
  2. Filter by country.
  3. Average price.
  4. Disk I/O: tens of GB. Minutes.

Same data in Parquet:

  1. Read only the country column (maybe only partial via stats).
  2. Read only price for average.
  3. Disk I/O: hundreds of MB. Seconds.

100-1000x differences are common. This is why Snowflake, BigQuery, and Spark query TB-scale data in seconds.
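
You can reproduce the effect locally with DuckDB. A minimal sketch, assuming a sales.parquet file with the columns above (the file name is hypothetical):

import duckdb

# DuckDB reads only the country and price column chunks;
# the other 48 columns are never touched.
print(duckdb.query(
    "SELECT AVG(price) FROM 'sales.parquet' WHERE country = 'KR'"
).fetchall())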

Why columnar is the standard now

Until the early 2000s almost all DBMS were row-oriented. Then Michael Stonebraker's 2005 C-Store paper proved columnar's advantage, followed by Vertica (2005), Apache Parquet (2013), Apache ORC (2013), Apache Arrow (2016), ClickHouse, and Snowflake. Columnar is now the de facto standard for OLAP workloads.


1. Row-oriented vs Column-oriented

Row storage

Traditional DBs (PostgreSQL, MySQL) store each row as contiguous bytes:

Row 1: [id=1][time=T1][country=KR][product=A][price=100]
Row 2: [id=2][time=T2][country=US][product=B][price=200]
Row 3: [id=3][time=T3][country=KR][product=C][price=150]

Pros: fast full-row reads, natural transactions, optimal for OLTP. Cons: must read entire rows, wasted I/O, poor compression (mixed types).

Column storage

Columnar stores same-column values in contiguous blocks:

id column:      [1, 2, 3, 4, 5, ...]
country column: [KR, US, KR, JP, KR, ...]
price column:   [100, 200, 150, 300, 120, ...]

Pros: read only needed columns (projection pushdown), same-type runs compress well, CPU cache friendly (vectorizable), optimal for OLAP. Cons: row reconstruction needs multiple column accesses, slow inserts/updates, complex transactions.
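
Projection pushdown is a one-liner in PyArrow. A sketch, again assuming the hypothetical sales.parquet:

import pyarrow.parquet as pq

# Only the two requested column chunks are fetched from disk;
# the remaining 48 columns are skipped entirely.
table = pq.read_table("sales.parquet", columns=["country", "price"])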

OLTP vs OLAP

Aspect        | Row-oriented (OLTP)              | Column-oriented (OLAP)
Example       | MySQL, PostgreSQL                | Parquet, ClickHouse
Query pattern | Individual record access         | Aggregation, filter, scan
Write pattern | Small frequent writes            | Bulk load
Read pattern  | SELECT * FROM users WHERE id=... | SELECT AVG(price) FROM sales WHERE ...
Compression   | Low (2:1)                        | High (5:1 ~ 20:1)
Index need    | High                             | Low (replaced by column stats)

2. Parquet: Influence of Dremel

Birth

Apache Parquet was co-developed in 2013 by Twitter and Cloudera, influenced by Google's Dremel paper (2010) for storing nested data in columnar form. Nearly every analytics system supports it: Spark, Hive, Presto, Trino, Snowflake, BigQuery, Athena, Impala.

File structure

File
 ├── Magic "PAR1"
 ├── Row Group 1
 │    ├── Column Chunk 1 (Data Pages)
 │    ├── Column Chunk 2
 │    └── Column Chunk 3
 ├── Row Group 2 ...
 ├── Footer (schema, row group offsets, column stats)
 └── Magic "PAR1"

Layers:

  • Row Group: logical block of hundreds of MB to 1 GB. Parallel processing unit.
  • Column Chunk: one column within a row group.
  • Data Page: smallest unit (default 1 MB). Compression/encoding unit.

Metadata power

Parquet's footer carries schema, row group offsets, column stats (min/max/null count), optional Bloom filters and page indexes. This enables skipping row groups entirely.
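
You can inspect the footer directly with PyArrow (sketch, hypothetical file):

import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")
print(pf.metadata.num_row_groups)
# Per-column-chunk statistics stored in the footer:
stats = pf.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)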

Predicate Pushdown

SELECT * FROM sales WHERE price > 1000;

The engine checks each row group's price min/max. If max <= 1000, skip entirely. With well-sorted data this can eliminate 99% of I/O.
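
In PyArrow the same skipping is driven by the filters argument (sketch):

import pyarrow.parquet as pq

# Row groups whose price max <= 1000 are skipped before any data page is read.
expensive = pq.read_table("sales.parquet", filters=[("price", ">", 1000)])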

Bloom Filter

Parquet (parquet-mr 1.12+) can store optional Bloom filters for point lookups like user_id = 12345, where min/max stats are useless. They are especially effective for random IDs and UUIDs.
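
In Spark, Bloom filters are opt-in per column at write time. A hedged sketch using parquet-mr writer options (the user_id column is an assumption):

(df.write
    .option("parquet.bloom.filter.enabled#user_id", "true")            # build a Bloom filter for user_id
    .option("parquet.bloom.filter.expected.ndv#user_id", "100000000")  # size it for ~100M distinct values
    .parquet("out"))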

Page Index

The page index (parquet-mr 1.11+) stores per-page min/max separately from the data pages, enabling fine-grained skipping and targeted byte-range fetches from cloud storage such as S3.


3. Dremel Algorithm: Nested data as columns

Challenge

Columnar works for flat data. What about nested JSON or Protocol Buffers?

{
  "id": 1,
  "name": "Alice",
  "addresses": [
    {"city": "Seoul", "zip": "12345"},
    {"city": "Busan", "zip": "67890"}
  ]
}

Repetition and Definition levels

Dremel's solution: store two integers per value.

  • Repetition level (R): at which repeated field in the path a new repetition starts (0 means a new record).
  • Definition level (D): how many optional/repeated fields along the path are actually present; lower values encode where a NULL occurs.

For addresses.city:

Value   | R | D
"Seoul" | 0 | 2
"Busan" | 1 | 2
"LA"    | 0 | 2
NULL    | 0 | 1

With these two numbers, the original nested structure is losslessly reconstructible — mathematically proven.

This enables Parquet to store nested data columnar, support schema evolution, handle NULLs efficiently, and read only requested sub-fields.
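
A toy Python sketch of the shredding step for addresses.city may help (a deliberate simplification, not parquet-mr's implementation; here addresses is the repeated field at level 1 and city the optional field at level 2):

def shred_city(records):
    for rec in records:
        addrs = rec.get("addresses") or []
        if not addrs:
            yield (None, 0, 0)        # path entirely undefined: R=0, D=0
            continue
        for i, addr in enumerate(addrs):
            r = 0 if i == 0 else 1    # R=0 opens a new record, R=1 repeats addresses
            city = addr.get("city")
            d = 2 if city is not None else 1   # D=1: the address exists, city is NULL
            yield (city, r, d)

rows = [
    {"addresses": [{"city": "Seoul"}, {"city": "Busan"}]},
    {"addresses": [{"city": "LA"}, {"zip": "99999"}]},   # second address has no city
]
print(list(shred_city(rows)))
# [('Seoul', 0, 2), ('Busan', 1, 2), ('LA', 0, 2), (None, 1, 1)]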


4. Compression and Encoding

Encoding vs compression

Parquet applies encoding first, compression second.

Plain Encoding

Values stored as-is. The fallback when no other encoding pays off.

Dictionary Encoding

The compression king. For low-cardinality columns:

Original:   [KR, US, KR, JP, KR, KR, US, JP, ...]
Dictionary: [KR=0, US=1, JP=2]
Values:     [0, 1, 0, 2, 0, 0, 1, 2, ...]

Then bit-packed. On a country column, 10x+ size reductions are common. Parquet automatically falls back to plain encoding when the dictionary exceeds parquet.dictionary.page.size.
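
The core idea fits in a few lines of Python (illustrative only, not Parquet's actual codec):

values = ["KR", "US", "KR", "JP", "KR", "KR", "US", "JP"]
dictionary = {}     # value -> small integer code
codes = []
for v in values:
    codes.append(dictionary.setdefault(v, len(dictionary)))
print(dictionary)   # {'KR': 0, 'US': 1, 'JP': 2}
print(codes)        # [0, 1, 0, 2, 0, 0, 1, 2]
# 3 distinct values need only 2 bits per code once bit-packed.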

Run-Length Encoding (RLE)

Original: [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
RLE:      [(1,4), (2,2), (3,5)]

Parquet combines RLE with dictionary encoding as "RLE/Bit-Packing Hybrid".

Bit Packing

When integers fit in small ranges, use only required bits. 32-bit ints of value 0-3 pack into 2 bits each: 16x reduction.

Delta Encoding

Original: [100, 102, 105, 108, 110, 115]
Delta:    [100, 2, 3, 3, 2, 5]

Optimal for timestamps, sequence IDs. Parquet implements this as DELTA_BINARY_PACKED.
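
An illustrative round-trip (DELTA_BINARY_PACKED additionally bit-packs the deltas in blocks):

from itertools import accumulate

def delta_encode(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    return list(accumulate(ds))

ts = [100, 102, 105, 108, 110, 115]
enc = delta_encode(ts)           # [100, 2, 3, 3, 2, 5]
assert delta_decode(enc) == ts   # lossless round-trip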

Byte Stream Split

For IEEE 754 floats: collect same byte positions across values into separate streams, making general compression more effective. Useful for scientific data.

Compression algorithms

Algorithm    | Ratio | Speed     | Use
UNCOMPRESSED | 1x    | Fastest   | Benchmarks
SNAPPY       | 2-3x  | Very fast | Balanced (default)
LZ4          | 2-3x  | Very fast | Faster decompress
GZIP         | 4-5x  | Slow      | High ratio
BROTLI       | 4-5x  | Medium    | Web
ZSTD         | 4-5x  | Fast      | 2025 recommended

df.write.option("compression", "zstd").parquet("data.parquet")

Combined effect

100M rows of e-commerce data:

Format                 | Size   | Query
CSV                    | 12 GB  | 180 s
JSON                   | 28 GB  | 300 s
Parquet (uncompressed) | 6 GB   | 15 s
Parquet (snappy)       | 2 GB   | 8 s
Parquet (zstd)         | 1.2 GB | 7 s

5. ORC: Hive's partner

Apache ORC (Optimized Row Columnar) was developed in 2013 by Hortonworks for Hive. Similar to Parquet in concept but different terminology.

  • Stripe: Parquet's row group equivalent (default 250 MB).
  • Stream: column data within stripe.
  • Footer: file metadata.

ORC vs Parquet

Aspect      | Parquet                     | ORC
Ecosystem   | Spark, Trino, Flink, Pandas | Hive, Presto
Nested data | Dremel (strong)             | Flatten-focused
Stats       | Row group level             | Stripe + stride (finer)
ACID        | Limited                     | Hive transactional tables
Compression | Good                        | Slightly better

Rule of thumb: Spark/Python/general → Parquet. Hive-centric warehouse → ORC.

Stride level statistics

ORC maintains stats at stride (default 10,000 rows) within a stripe, enabling finer filtering. Parquet's page index has since narrowed the gap.

Hive ACID

ORC supports ACID transactions in Hive 0.14+ via base + delta files with periodic compaction.


6. Apache Arrow: Columnar in memory

Different problem

Parquet/ORC are disk formats. Arrow is a memory format for system-to-system exchange. When Python pandas hands data to Spark JVM to another service, serialization/deserialization can dominate query cost.

Arrow's birth

Released in 2016 by Wes McKinney (pandas creator) and others with the goal: "standard cross-language, cross-system in-memory columnar format."

Characteristics

  • Zero-copy reads: no serialization.
  • Language-neutral: C++, Java, Python, R, Rust, Go, etc.
  • SIMD-friendly: vectorization directly.
  • Interoperability: Arrow arrays move across systems without conversion.

Memory layout

Int32 array: [1, 2, 3, NULL, 5]

Validity bitmap: [1, 1, 1, 0, 1]
Values buffer:   [1, 2, 3, 0, 5]

Each column has validity bitmap, values buffer, and (for variable-length) offsets buffer. Perfectly aligned for CPU cache and SIMD.
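
PyArrow exposes these buffers directly:

import pyarrow as pa

# One primitive Arrow array = validity bitmap + contiguous values buffer.
arr = pa.array([1, 2, 3, None, 5], type=pa.int32())
validity, values = arr.buffers()
print(arr.null_count)                 # 1 (slot 3 is NULL)
print(bin(validity.to_pybytes()[0]))  # 0b10111: the bit for slot 3 is cleared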

Zero-copy power

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df.toPandas()

10x+ faster than Pickle-based serialization.

Arrow Flight

gRPC-based network protocol for Arrow. 20x+ faster than ODBC/JDBC.

DataFusion

Rust-based SQL engine in the Arrow project. Vectorized and embeddable; InfluxDB IOx builds on it.

DuckDB

import duckdb
import pyarrow.parquet as pq

arrow_table = pq.read_table("data.parquet")
result = duckdb.query("SELECT country, AVG(price) FROM arrow_table GROUP BY country").to_df()

In-memory SQL over Arrow without disk writes. Revolutionary for local analytics.


7. Vectorized Execution: Columnar's real power

Traditional tuple-at-a-time (Volcano)

while ((tuple = child->next()) != NULL) {   // one virtual call per row
    if (predicate(tuple)) emit(tuple);      // data-dependent branch per row
}

Function call overhead, poor CPU pipeline use, branch misses, cache misses.

Vectorized (batch-at-a-time)

Process thousands of rows at once:

// AVX2 fragment (requires <immintrin.h>)
int32_t prices[1024];
int32_t matches[1024];                        // per-lane comparison masks
__m256i threshold = _mm256_set1_epi32(1000);  // broadcast the constant once
for (int i = 0; i < 1024; i += 8) {
    __m256i v = _mm256_loadu_si256((__m256i*)&prices[i]);  // load 8 prices
    __m256i cmp = _mm256_cmpgt_epi32(v, threshold);        // 8 comparisons per instruction
    _mm256_storeu_si256((__m256i*)&matches[i], cmp);       // store the selection mask
}

SIMD processes 8-16 values per instruction. Typically 2-10x faster.

Snowflake, ClickHouse (blocks of tens of thousands of rows), DuckDB (2,048-row vectors), and Spark SQL (whole-stage codegen) all use this. TU Munich's HyPer/Umbra goes further, compiling whole queries to machine code (HyPer via LLVM IR).


8. Partitioning and Z-Ordering

Partitioning

sales/
├── year=2023/
│   ├── month=01/
│   └── ...
├── year=2024/
└── year=2025/

WHERE year=2024 AND month=03 reads only that folder.

Hive-style

s3://bucket/sales/year=2025/month=04/day=15/

Directory names become partition columns — not stored in files (saves space).
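
Spark produces this layout with partitionBy (sketch; assumes df already has year and month columns):

(df.write
    .partitionBy("year", "month")   # values become directories, not file contents
    .parquet("s3://bucket/sales/"))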

Partition explosion

Too-granular partitioning (e.g., minute-level) creates millions of small files — list API latency, metadata burden. Rule: each partition should be at least hundreds of MB.

Z-Ordering

Z-order (Morton code) interleaves the bits of multiple columns into a single 1-D sort order, so min/max pruning works for several columns simultaneously. Delta Lake exposes it as OPTIMIZE ... ZORDER BY (country, product); Iceberg offers z-ordering through its data-file rewrite (compaction) actions.
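
The bit interleaving itself is simple. A toy two-column Morton key in Python:

def morton2(x, y, bits=16):
    z = 0
    for i in range(bits):
        z |= (x >> i & 1) << (2 * i)        # x takes the even bit positions
        z |= (y >> i & 1) << (2 * i + 1)    # y takes the odd bit positions
    return z

# Sorting rows by morton2(country_code, product_code) keeps rows that are
# close in BOTH dimensions close on disk, so min/max pruning works for both.
print(bin(morton2(0b0011, 0b0101)))   # 0b100111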


9. Delta Lake / Iceberg / Hudi

Parquet's limits

Parquet is just a file format — no ACID, manual schema evolution, no time travel, complex delete/update.

Lakehouse formats

  • Delta Lake (Databricks): Parquet + JSON transaction log. ACID, time travel, Z-ordering.
  • Apache Iceberg (Netflix/Apple): Parquet/ORC + manifest files. Hidden partitioning, snapshot isolation.
  • Apache Hudi (Uber): Parquet + timeline. Upsert/delete optimized.

Since 2024, Iceberg has been converging toward industry-standard status (AWS, Snowflake, and Databricks all support it).

Hidden partitioning

Iceberg lets you declare partition transforms at table-creation time. Users query with the logical column; Iceberg computes month(ts) automatically.
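
A hedged sketch of the DDL (assumes a Spark session wired to an Iceberg catalog named demo):

spark.sql("""
    CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP)
    USING iceberg
    PARTITIONED BY (months(ts))
""")
# Queries filter on the logical column; Iceberg prunes partitions by itself:
spark.sql("SELECT * FROM demo.db.events WHERE ts >= TIMESTAMP '2025-04-01'")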


10. Practical tips and pitfalls

Tip 1: Row group size

df.write.option("parquet.block.size", 134217728).parquet("out")  # 128MB

128 MB - 512 MB is the sweet spot.

Tip 2: Column order

Parquet stores in write order. Put hot columns first for I/O locality.

Tip 3: Dictionary max size

df.write.option("parquet.dictionary.page.size", 2097152).parquet("out")

Tip 4: Compression

df.write.option("compression", "zstd").parquet("out")

ZSTD is the near-universal answer in 2025.

Tip 5: File size

Sweet spot 256 MB - 1 GB. Use coalesce() or repartition() in Spark.

Pitfall 1: Small files

Disaster: list overhead, metadata load, schema mismatch risk. Fix with periodic compaction or Delta/Iceberg OPTIMIZE.

Pitfall 2: Schema evolution

A Parquet file's schema is fixed at write time. Use Delta/Iceberg for managed evolution, or Spark's mergeSchema when reading a mix of schemas.
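
With plain Parquet, Spark can at least merge compatible file schemas at read time (extra footer reads; off by default):

df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/sales/")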

Pitfall 3: Decimal and Timestamp

INT96 timestamps are deprecated — use INT64. Decimal encoding varies. Always check compatibility when moving files between systems.
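
In Spark the writer type is a config switch (several versions still default to INT96 for legacy compatibility):

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")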

Pitfall 4: Partition explosion

Restrict to monthly/daily; use column filters for finer time grain.

Pitfall 5: Over-nesting

Three or more levels of nesting cause metadata bloat (repetition/definition levels must be tracked for every leaf). Flatten for better query performance.


11. Real-world performance

100M rows of e-commerce

Format           | Size   | SELECT * | SELECT price WHERE country='KR'
CSV (gzip)       | 2.5 GB | 180 s    | 180 s
JSON (gzip)      | 4 GB   | 300 s    | 300 s
Avro             | 2 GB   | 60 s     | 60 s
Parquet (snappy) | 1 GB   | 25 s     | 3 s
Parquet (zstd)   | 700 MB | 22 s     | 3 s
ORC (zstd)       | 650 MB | 20 s     | 2.5 s

Column selection cost

SELECT col_x FROM table_100cols:

Format  | Work          | I/O    | Time
Parquet | 1 column read | 100 MB | 5 s
JSON    | Full parse    | 10 GB  | 120 s
CSV     | Full parse    | 5 GB   | 90 s

Parquet's projection pushdown: read exactly 1/100 of the data.


Quiz

Q1. Where does the biggest row-vs-column performance gap come from?

A. Three factors: projection (read only needed columns), compression (same-type columns compress much better via dictionary, RLE, delta), and vectorization (contiguous same-type values enable SIMD, better cache use). Combined, OLAP queries see 100-1000x differences. Row-oriented still wins for OLTP.

Q2. Why are Dremel's repetition and definition levels needed?

A. To decompose nested data into columns without loss. Repetition level tracks array-position reconstruction; definition level tracks how deep the path was actually defined (for NULLs and optional fields). The two integers allow perfect JSON reconstruction — mathematically proven lossless encoding. This is why Parquet can give columnar benefits to Protocol Buffer/JSON data.

Q3. How do predicate pushdown and Bloom filter dramatically improve queries?

A. The fastest I/O is none at all. Predicate pushdown checks row-group min/max to skip entire groups that cannot match. Bloom filters answer point-lookup questions (user_id=12345) where min/max is useless. Combined, a 10 GB file may read only 100 MB — 100x faster queries.

Q4. Why does Apache Arrow solve a fundamentally different problem from Parquet?

A. Parquet is a disk format (minimize disk reads); Arrow is a memory format (minimize cross-system serialization). Defining a standard memory layout all languages/systems share enables zero-copy exchange. PySpark toPandas() with Arrow is 10x+ faster; Arrow Flight is 20x faster than ODBC. Modern pipelines use both: Parquet on disk, Arrow in memory.

Q5. Why is dictionary encoding so effective, and when does Parquet abandon it?

A. Real columns have low cardinality — country has ~200 unique values, status ~10. Dictionary encoding replaces strings with small ints, then bit-packs — often 10x reduction. Parquet auto-falls-back to plain encoding when dictionary exceeds parquet.dictionary.page.size (default 1 MB), typically on UUIDs, full URLs, log lines. Then plain + zstd is more efficient.


Closing

Key takeaways

  1. Row vs Column: OLTP → row, OLAP → column.
  2. Parquet: disk standard with Dremel nested support.
  3. ORC: Hive partner, similar performance.
  4. Arrow: memory standard, zero-copy.
  5. Compression: Dictionary, RLE, Delta, Bit-packing, ZSTD.
  6. Vectorization: columnar's true power via SIMD.
  7. Lakehouse: Delta Lake, Iceberg, Hudi add ACID.
  8. Tuning: row group size, compression, partition design.

Checklist

  • Analytics on CSV/JSON → switch to Parquet.
  • Parquet files under 16 MB → compact.
  • Too many/few partitions → redesign.
  • Sort by frequent filter columns for maximum pushdown.
  • Default compression → consider ZSTD.
  • Slow in-memory movement → adopt Arrow.
  • Need transactions → Delta Lake or Iceberg.

Final lesson

Stonebraker's 2005 C-Store revolution now dominates the data industry. Same hardware, same data, different structure — 1000x the speed. That's the beauty of computer science. Next SQL query, ask: how is this data laid out? The answer tells you why it's fast or slow.

