Columnar Storage Deep Dive 2025: Parquet, ORC, Apache Arrow, Dremel — Why Analytics DBs Are 10,000x Faster
By Youngju Kim (@fjvbn20031)
Intro: Same data, 1000x difference
An experiment
100M rows of sales data. Each row: (id, timestamp, country, product, category, price, ...) with 50 columns.
Question: "Average price of products sold in Korea?"
SELECT AVG(price) FROM sales WHERE country = 'KR';
Row-oriented storage (CSV, MySQL):
- Read all 100M rows (all 50 columns).
- Filter by country.
- Average price.
- Disk I/O: tens of GB. Minutes.
Same data in Parquet:
- Read only the country column (possibly only parts of it, via stats).
- Read only price for the average.
- Disk I/O: hundreds of MB. Seconds.
100-1000x differences are common. This is why Snowflake, BigQuery, and Spark query TB-scale data in seconds.
Why columnar is the standard now
Until the early 2000s almost all DBMS were row-oriented. Then Michael Stonebraker's 2005 C-Store paper proved columnar's advantage, followed by Vertica (2005), Apache Parquet (2013), Apache ORC (2013), Apache Arrow (2016), ClickHouse, and Snowflake. Columnar is now the de facto standard for OLAP workloads.
1. Row-oriented vs Column-oriented
Row storage
Traditional DBs (PostgreSQL, MySQL) store each row as contiguous bytes:
Row 1: [id=1][time=T1][country=KR][product=A][price=100]
Row 2: [id=2][time=T2][country=US][product=B][price=200]
Row 3: [id=3][time=T3][country=KR][product=C][price=150]
Pros: fast full-row reads, natural transactions, optimal for OLTP. Cons: must read entire rows, wasted I/O, poor compression (mixed types).
Column storage
Columnar stores same-column values in contiguous blocks:
id column: [1, 2, 3, 4, 5, ...]
country column: [KR, US, KR, JP, KR, ...]
price column: [100, 200, 150, 300, 120, ...]
Pros: read only needed columns (projection pushdown), same-type runs compress well, CPU cache friendly (vectorizable), optimal for OLAP. Cons: row reconstruction needs multiple column accesses, slow inserts/updates, complex transactions.
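The contrast can be sketched in plain Python. This is a toy model, not a real storage engine: the point is that the row layout forces every record (all fields) to be materialized, while the columnar layout touches only the two lists the query needs.

```python
# Toy model of the two layouts (illustrative only, not a real storage engine)
rows = [
    {"id": 1, "country": "KR", "price": 100, "product": "A"},
    {"id": 2, "country": "US", "price": 200, "product": "B"},
    {"id": 3, "country": "KR", "price": 150, "product": "C"},
]
columns = {
    "id": [1, 2, 3],
    "country": ["KR", "US", "KR"],
    "price": [100, 200, 150],
    "product": ["A", "B", "C"],
}

# Row layout: every record, with all of its fields, is touched
kr_prices = [r["price"] for r in rows if r["country"] == "KR"]
row_avg = sum(kr_prices) / len(kr_prices)

# Column layout: only the country and price lists are ever read
kept = [p for c, p in zip(columns["country"], columns["price"]) if c == "KR"]
col_avg = sum(kept) / len(kept)
```

Both produce the same answer; the difference is how many bytes had to be scanned to get there.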
OLTP vs OLAP
| Aspect | Row-oriented (OLTP) | Column-oriented (OLAP) |
|---|---|---|
| Example | MySQL, PostgreSQL | Parquet, ClickHouse |
| Query pattern | Individual record access | Aggregation, filter, scan |
| Write pattern | Small frequent writes | Bulk load |
| Read pattern | SELECT * FROM users WHERE id=... | SELECT AVG(price) FROM sales WHERE ... |
| Compression | Low (2:1) | High (5:1 ~ 20:1) |
| Index need | High | Low (replaced by column stats) |
2. Parquet: Influence of Dremel
Birth
Apache Parquet was co-developed in 2013 by Twitter and Cloudera, influenced by Google's Dremel paper (2010) for storing nested data in columnar form. Nearly every analytics system supports it: Spark, Hive, Presto, Trino, Snowflake, BigQuery, Athena, Impala.
File structure
File
├── Magic "PAR1"
├── Row Group 1
│ ├── Column Chunk 1 (Data Pages)
│ ├── Column Chunk 2
│ └── Column Chunk 3
├── Row Group 2 ...
├── Footer (schema, row group offsets, column stats)
└── Magic "PAR1"
Layers:
- Row Group: horizontal slice of the table, typically 128 MB to 1 GB. The unit of parallel processing.
- Column Chunk: one column within a row group.
- Data Page: smallest unit (default 1 MB). Compression/encoding unit.
Metadata power
Parquet's footer carries schema, row group offsets, column stats (min/max/null count), optional Bloom filters and page indexes. This enables skipping row groups entirely.
Predicate Pushdown
SELECT * FROM sales WHERE price > 1000;
The engine checks each row group's price min/max. If max <= 1000, skip entirely. With well-sorted data this can eliminate 99% of I/O.
Bloom Filter
Recent Parquet versions (parquet-mr 1.12+) can store optional Bloom filters for point lookups like user_id = 12345, where min/max stats are useless. Especially effective for random IDs/UUIDs.
Page Index
The page index (parquet-mr 1.11+) stores per-page min/max separately from the data pages, enabling page-level skipping and targeted byte-range fetches from cloud object storage such as S3.
3. Dremel Algorithm: Nested data as columns
Challenge
Columnar works for flat data. What about nested JSON or Protocol Buffers?
{
"id": 1,
"name": "Alice",
"addresses": [
{"city": "Seoul", "zip": "12345"},
{"city": "Busan", "zip": "67890"}
]
}
Repetition and Definition levels
Dremel's solution: store two integers per value.
- Repetition level (R): at which repeated field in the path this value starts a new list entry (0 means the start of a new record).
- Definition level (D): how many optional/repeated fields along the path are actually present, which encodes exactly where a NULL occurred.
For addresses.city:
| Value | R | D |
|---|---|---|
| "Seoul" | 0 | 2 |
| "Busan" | 1 | 2 |
| "LA" | 0 | 2 |
| NULL | 0 | 1 |
With these two numbers, the original nested structure is losslessly reconstructible — mathematically proven.
This enables Parquet to store nested data columnar, support schema evolution, handle NULLs efficiently, and read only requested sub-fields.
4. Compression and Encoding
Encoding vs compression
Parquet applies encoding first, compression second.
Plain Encoding
Values as-is. Fallback when other encodings lose.
Dictionary Encoding
The compression king. For low-cardinality columns:
Original: [KR, US, KR, JP, KR, KR, US, JP, ...]
Dictionary: [KR=0, US=1, JP=2]
Values: [0, 1, 0, 2, 0, 0, 1, 2, ...]
Then bit-packed. On a country column a 10x+ size reduction is common. Parquet automatically falls back to plain encoding when the dictionary exceeds parquet.dictionary.page.size.
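The mechanism fits in a few lines of plain Python. A minimal sketch of the encode/decode cycle; real writers additionally bit-pack the codes, which this omits:

```python
# Minimal sketch of dictionary encoding (real writers also bit-pack the codes)
values = ["KR", "US", "KR", "JP", "KR", "KR", "US", "JP"]

dictionary = {}  # value -> code, assigned in first-seen order
codes = []
for v in values:
    codes.append(dictionary.setdefault(v, len(dictionary)))

# Decoding: dicts preserve insertion order, so list(dictionary) is the lookup table
decoded = [list(dictionary)[c] for c in codes]
```

Eight variable-length strings become eight tiny integers plus a three-entry dictionary.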
Run-Length Encoding (RLE)
Original: [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
RLE: [(1,4), (2,2), (3,5)]
Parquet combines RLE with dictionary encoding as "RLE/Bit-Packing Hybrid".
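A toy RLE codec makes the transformation concrete (a sketch of the idea only, not Parquet's actual hybrid encoding):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in pairs for _ in range(n)]
```

Long runs (sorted columns, repeated dictionary codes) collapse to almost nothing.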
Bit Packing
When integers fit in small ranges, use only required bits. 32-bit ints of value 0-3 pack into 2 bits each: 16x reduction.
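Packing 2-bit codes four-per-byte can be sketched directly (an illustration of the idea, not Parquet's exact bit layout):

```python
# Pack codes in range 0-3 (2 bits each) four-per-byte: 16x smaller than int32
def pack2(codes):
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= (c & 0b11) << (2 * j)  # code j occupies bits 2j..2j+1
        out.append(b)
    return bytes(out)

def unpack2(data, n):
    """Recover n 2-bit codes from the packed bytes."""
    return [(data[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]
```

Six codes fit in two bytes instead of twenty-four.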
Delta Encoding
Original: [100, 102, 105, 108, 110, 115]
Delta: [100, 2, 3, 3, 2, 5]
Optimal for timestamps, sequence IDs. Parquet implements this as DELTA_BINARY_PACKED.
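The delta transform itself is trivial; DELTA_BINARY_PACKED then bit-packs the small deltas in blocks, which this sketch leaves out:

```python
from itertools import accumulate

def delta_encode(values):
    """First value as-is, then each value minus its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Running sum reverses the transform exactly."""
    return list(accumulate(deltas))
```

Monotonic sequences (timestamps, auto-increment IDs) turn into streams of tiny integers that bit-pack extremely well.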
Byte Stream Split
For IEEE 754 floats: collect same byte positions across values into separate streams, making general compression more effective. Useful for scientific data.
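The byte regrouping can be sketched with the stdlib struct module (a toy version of the idea; the real encoding operates on the raw column buffer):

```python
import struct

def byte_stream_split(floats):
    """Gather byte position i of every float64 into stream i (8 streams)."""
    raw = [struct.pack("<d", f) for f in floats]
    return [bytes(b[i] for b in raw) for i in range(8)]

def byte_stream_join(streams):
    """Reassemble each value from one byte per stream."""
    n = len(streams[0])
    return [struct.unpack("<d", bytes(streams[i][k] for i in range(8)))[0]
            for k in range(n)]
```

Exponent and high-mantissa bytes of similar values end up adjacent, so a general-purpose compressor finds far more repetition than in the interleaved original.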
Compression algorithms
| Algorithm | Ratio | Speed | Use |
|---|---|---|---|
| UNCOMPRESSED | 1x | Fastest | Benchmarks |
| SNAPPY | 2-3x | Very fast | Balanced (default) |
| LZ4 | 2-3x | Very fast | Faster decompress |
| GZIP | 4-5x | Slow | High ratio |
| BROTLI | 4-5x | Medium | Web |
| ZSTD | 4-5x | Fast | 2025 recommended |
df.write.option("compression", "zstd").parquet("data.parquet")
Combined effect
100M rows of e-commerce data:
| Format | Size | Query |
|---|---|---|
| CSV | 12 GB | 180 s |
| JSON | 28 GB | 300 s |
| Parquet (uncompressed) | 6 GB | 15 s |
| Parquet (snappy) | 2 GB | 8 s |
| Parquet (zstd) | 1.2 GB | 7 s |
5. ORC: Hive's partner
Apache ORC (Optimized Row Columnar) was developed in 2013 by Hortonworks for Hive. Similar to Parquet in concept but different terminology.
- Stripe: Parquet's row group equivalent (64 MB default in modern ORC; earlier releases used 256 MB).
- Stream: column data within stripe.
- Footer: file metadata.
ORC vs Parquet
| Aspect | Parquet | ORC |
|---|---|---|
| Ecosystem | Spark, Trino, Flink, Pandas | Hive, Presto |
| Nested data | Dremel (strong) | Flatten-focused |
| Stats | Row group level | Stripe + stride (finer) |
| ACID | Limited | Hive transactional tables |
| Compression | Good | Slightly better |
Rule of thumb: Spark/Python/general → Parquet. Hive-centric warehouse → ORC.
Stride level statistics
ORC maintains stats at stride (default 10,000 rows) within a stripe, enabling finer filtering. Parquet's page index has since narrowed the gap.
Hive ACID
ORC supports ACID transactions in Hive 0.14+ via base + delta files with periodic compaction.
6. Apache Arrow: Columnar in memory
Different problem
Parquet/ORC are disk formats. Arrow is a memory format for system-to-system exchange. When Python pandas hands data to Spark JVM to another service, serialization/deserialization can dominate query cost.
Arrow's birth
Released in 2016 by Wes McKinney (pandas creator) and others with the goal: "standard cross-language, cross-system in-memory columnar format."
Characteristics
- Zero-copy reads: no serialization.
- Language-neutral: C++, Java, Python, R, Rust, Go, etc.
- SIMD-friendly: vectorization directly.
- Interoperability: Arrow arrays move across systems without conversion.
Memory layout
Int32 array: [1, 2, 3, NULL, 5]
Validity bitmap: [1, 1, 1, 0, 1]
Values buffer: [1, 2, 3, 0, 5]
Each column has validity bitmap, values buffer, and (for variable-length) offsets buffer. Perfectly aligned for CPU cache and SIMD.
Zero-copy power
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df.toPandas()
10x+ faster than Pickle-based serialization.
Arrow Flight
gRPC-based network protocol for Arrow. 20x+ faster than ODBC/JDBC.
DataFusion
Rust-based SQL engine in the Arrow project. Vectorized, embeddable. Polars and InfluxDB IOx use it.
DuckDB
import duckdb
import pyarrow.parquet as pq
arrow_table = pq.read_table("data.parquet")
result = duckdb.query("SELECT country, AVG(price) FROM arrow_table GROUP BY country").to_df()
In-memory SQL over Arrow without disk writes. Revolutionary for local analytics.
7. Vectorized Execution: Columnar's real power
Traditional tuple-at-a-time (Volcano)
while ((tuple = child->next()) != NULL) {
if (predicate(tuple)) emit(tuple);
}
Function call overhead, poor CPU pipeline use, branch misses, cache misses.
Vectorized (batch-at-a-time)
Process thousands of rows at once:
int32_t prices[1024];
__m256i threshold = _mm256_set1_epi32(1000);  // broadcast 1000 into all 8 lanes
for (int i = 0; i < 1024; i += 8) {
    __m256i v = _mm256_loadu_si256((__m256i*)&prices[i]);  // load 8 prices
    __m256i mask = _mm256_cmpgt_epi32(v, threshold);       // 8 comparisons at once
    // mask marks the lanes where price > 1000
}
SIMD processes 8-16 values per instruction. Typically 2-10x faster.
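The same batch-at-a-time style is what NumPy gives Python users every day (assuming numpy is installed): one expression evaluates the whole contiguous batch, with the comparison loop running in compiled, SIMD-capable code rather than branching per row:

```python
import numpy as np

# Contiguous same-type values: one vectorized comparison over the whole batch,
# no Python-level branch per row
prices = np.array([500, 1500, 800, 2000], dtype=np.int32)
mask = prices > 1000
selected = prices[mask]
```

This is the everyday analogue of the intrinsics above: the predicate becomes a mask over a batch, and selection is a single masked gather.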
Snowflake, ClickHouse (blocks of up to 65,536 rows), DuckDB (2,048-value vectors), and Spark SQL (whole-stage codegen) all use this. TU Munich's HyPer goes further by compiling queries to LLVM IR; its successor Umbra uses its own even faster compilation backend.
8. Partitioning and Z-Ordering
Partitioning
sales/
├── year=2023/
│ ├── month=01/
│ └── ...
├── year=2024/
└── year=2025/
WHERE year=2024 AND month=03 reads only that folder.
Hive-style
s3://bucket/sales/year=2025/month=04/day=15/
Directory names become partition columns — not stored in files (saves space).
Partition explosion
Too-granular partitioning (e.g., minute-level) creates millions of small files — list API latency, metadata burden. Rule: each partition should be at least hundreds of MB.
Z-Ordering
Z-order (Morton code) interleaves bits of multiple columns into a 1-D order. Min/max works for multiple columns simultaneously. Delta Lake and Iceberg implement OPTIMIZE ... ZORDER BY (country, product).
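The bit interleaving is simple enough to write out (a sketch of the Morton key itself; Delta Lake and Iceberg additionally rewrite files sorted by this key):

```python
def morton(x, y, bits=16):
    """Interleave bits: x occupies even positions, y odd positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Rows sorted by the interleaved key stay clustered in BOTH dimensions,
# so row-group min/max stats can prune on either column
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
ordered = sorted(points, key=lambda p: morton(*p))
```

A sort on a single column clusters only that column; the Morton key trades a little locality in each dimension for usable locality in all of them.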
9. Delta Lake / Iceberg / Hudi
Parquet's limits
Parquet is just a file format — no ACID, manual schema evolution, no time travel, complex delete/update.
Lakehouse formats
- Delta Lake (Databricks): Parquet + JSON transaction log. ACID, time travel, Z-ordering.
- Apache Iceberg (Netflix/Apple): Parquet/ORC + manifest files. Hidden partitioning, snapshot isolation.
- Apache Hudi (Uber): Parquet + timeline. Upsert/delete optimized.
After 2024 Iceberg is converging as the industry standard (AWS, Snowflake, Databricks all support it).
Hidden partitioning
Iceberg lets you declare partition transforms at table-creation time. Users query with the logical column; Iceberg computes month(ts) automatically.
10. Practical tips and pitfalls
Tip 1: Row group size
df.write.option("parquet.block.size", 134217728).parquet("out") # 128MB
128 MB - 512 MB is the sweet spot.
Tip 2: Column order
Parquet stores in write order. Put hot columns first for I/O locality.
Tip 3: Dictionary max size
df.write.option("parquet.dictionary.page.size", 2097152).parquet("out")
Tip 4: Compression
df.write.option("compression", "zstd").parquet("out")
ZSTD is the near-universal answer in 2025.
Tip 5: File size
Sweet spot 256 MB - 1 GB. Use coalesce() or repartition() in Spark.
Pitfall 1: Small files
Disaster: list overhead, metadata load, schema mismatch risk. Fix with periodic compaction or Delta/Iceberg OPTIMIZE.
Pitfall 2: Schema evolution
Parquet schema is fixed. Use Delta/Iceberg for managed evolution, or Spark's mergeSchema.
Pitfall 3: Decimal and Timestamp
INT96 timestamps are deprecated — use INT64. Decimal encoding varies. Always check compatibility when moving files between systems.
Pitfall 4: Partition explosion
Restrict to monthly/daily; use column filters for finer time grain.
Pitfall 5: Over-nesting
3+ levels of nesting causes metadata bloat. Flatten for better query performance.
11. Real-world performance
100M rows of e-commerce
| Format | Size | SELECT * | SELECT price WHERE country='KR' |
|---|---|---|---|
| CSV (gzip) | 2.5 GB | 180 s | 180 s |
| JSON (gzip) | 4 GB | 300 s | 300 s |
| Avro | 2 GB | 60 s | 60 s |
| Parquet (snappy) | 1 GB | 25 s | 3 s |
| Parquet (zstd) | 700 MB | 22 s | 3 s |
| ORC (zstd) | 650 MB | 20 s | 2.5 s |
Column selection cost
SELECT col_x FROM table_100cols:
| Format | Time |
|---|---|
| Parquet | 1 column read, 100 MB I/O, 5 s |
| JSON | Full parse, 10 GB I/O, 120 s |
| CSV | Full parse, 5 GB I/O, 90 s |
Parquet's projection pushdown: read exactly 1/100 of the data.
Quiz
Q1. Where does the biggest row-vs-column performance gap come from?
A. Three factors: projection (read only needed columns), compression (same-type columns compress much better via dictionary, RLE, delta), and vectorization (contiguous same-type values enable SIMD, better cache use). Combined, OLAP queries see 100-1000x differences. Row-oriented still wins for OLTP.
Q2. Why are Dremel's repetition and definition levels needed?
A. To decompose nested data into columns without loss. Repetition level tracks array-position reconstruction; definition level tracks how deep the path was actually defined (for NULLs and optional fields). The two integers allow perfect JSON reconstruction — mathematically proven lossless encoding. This is why Parquet can give columnar benefits to Protocol Buffer/JSON data.
Q3. How do predicate pushdown and Bloom filter dramatically improve queries?
A. The fastest I/O is none at all. Predicate pushdown checks row-group min/max to skip entire groups that cannot match. Bloom filters answer point-lookup questions (user_id=12345) where min/max is useless. Combined, a 10 GB file may read only 100 MB — 100x faster queries.
Q4. Why does Apache Arrow solve a fundamentally different problem from Parquet?
A. Parquet is a disk format (minimize disk reads); Arrow is a memory format (minimize cross-system serialization). Defining a standard memory layout all languages/systems share enables zero-copy exchange. PySpark toPandas() with Arrow is 10x+ faster; Arrow Flight is 20x faster than ODBC. Modern pipelines use both: Parquet on disk, Arrow in memory.
Q5. Why is dictionary encoding so effective, and when does Parquet abandon it?
A. Real columns have low cardinality — country has ~200 unique values, status ~10. Dictionary encoding replaces strings with small ints, then bit-packs — often 10x reduction. Parquet auto-falls-back to plain encoding when dictionary exceeds parquet.dictionary.page.size (default 1 MB), typically on UUIDs, full URLs, log lines. Then plain + zstd is more efficient.
Closing
Key takeaways
- Row vs Column: OLTP → row, OLAP → column.
- Parquet: disk standard with Dremel nested support.
- ORC: Hive partner, similar performance.
- Arrow: memory standard, zero-copy.
- Compression: Dictionary, RLE, Delta, Bit-packing, ZSTD.
- Vectorization: columnar's true power via SIMD.
- Lakehouse: Delta Lake, Iceberg, Hudi add ACID.
- Tuning: row group size, compression, partition design.
Checklist
- Analytics on CSV/JSON → switch to Parquet.
- Parquet files under 16 MB → compact.
- Too many/few partitions → redesign.
- Sort by frequent filter columns for maximum pushdown.
- Default compression → consider ZSTD.
- Slow in-memory movement → adopt Arrow.
- Need transactions → Delta Lake or Iceberg.
Final lesson
Stonebraker's 2005 C-Store revolution now dominates the data industry. Same hardware, same data, different structure — 1000x the speed. That's the beauty of computer science. Next SQL query, ask: how is this data laid out? The answer tells you why it's fast or slow.
References
- Dremel: Interactive Analysis of Web-Scale Datasets (Melnik et al., 2010)
- C-Store: A Column-oriented DBMS (Stonebraker et al., 2005)
- Apache Parquet Documentation
- Apache ORC Specification
- Apache Arrow Format
- DuckDB Paper (SIGMOD 2019)
- Delta Lake Paper (VLDB 2020)
- Apache Iceberg
- Snowflake Paper
- ClickHouse MergeTree
- Vectorization vs Compilation (Kersten et al., 2018)