✍️ Transcription Mode: Compression Algorithms Deep Dive — LZ77, Huffman, Arithmetic, ANS, Zstandard, Brotli (2025)
TL;DR
- Lossless compression is almost always a two-stage combination: (1) dictionary-based duplicate removal (LZ77 family — replace repeated bytes with back-references), and (2) entropy coding (Huffman/Arithmetic/ANS — short codes for frequent symbols).
- LZ77 (1977): find repeats in a sliding window and emit (offset, length, next_char). The root of every modern compressor.
- Huffman coding: frequency-based prefix-free codes. The entropy stage of DEFLATE/gzip.
- Arithmetic coding: better ratio than Huffman but slow. Patents plus performance kept it niche for 30 years.
- ANS (Asymmetric Numeral Systems), Jarek Duda 2009: arithmetic quality at Huffman speed. Used by Zstd, LZFSE, AV1 — the quiet revolution of the last decade.
- DEFLATE (1993, RFC 1951): LZ77 + Huffman. The basis of gzip, PNG, ZIP.
- LZ4 (2011): sacrifices ratio for extreme speed (GB/s). Competes with Snappy; embedded in modern systems (RocksDB, Cassandra, the Linux kernel), with LZ4-HC as its higher-ratio mode.
- Zstandard (2016, Yann Collet): LZ77 + FSE (ANS). Roughly 3x faster than gzip with better ratios. Dictionary training makes even tiny messages compress well.
- Brotli (2015, Google): a Zstd peer tuned for web traffic with a built-in static dictionary.
- In practice: speed -> LZ4, general -> Zstd, static web assets -> Brotli, legacy -> gzip, many small messages -> Zstd with dictionary.
1. Fundamentals
1.1 Why compress
- Disk: lower storage cost.
- Network: saves bandwidth, reduces latency.
- Memory: better RAM utilization.
- Backup/archive: long-term retention.
Data growth in 2025 only increases the importance of compression. CPU is plentiful; I/O and network are the bottlenecks.
1.2 Lossless vs Lossy
Lossless (this post):
- Perfect reconstruction.
- Text, code, DB, executables.
- gzip, zstd, LZ4, Brotli.
Lossy:
- Accepts information loss (quality vs ratio trade-off).
- Images (JPEG), audio (MP3), video (H.264).
- Out of scope here.
1.3 Shannon's limit
Claude Shannon (1948) proved every data source has a theoretical floor — its entropy:
H = -Σ_i p_i · log2(p_i)
where p_i is the probability of symbol i. Units are bits per symbol.
English text has entropy around 1.5 bits/char. An 8-bit ASCII file could theoretically be compressed more than 5x. Real compressors approach but never reach that bound.
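To make the floor concrete, here is a minimal sketch in C (illustrative, not a library API) that measures the empirical byte entropy of a file; compile with `cc -O2 entropy.c -lm`:

```c
/* Empirical entropy of a byte stream (Shannon, 1948).
   Counts byte frequencies and prints H = -sum p_i * log2(p_i). */
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned long long count[256] = {0}, total = 0;
    int c;
    while ((c = fgetc(f)) != EOF) { count[c]++; total++; }
    fclose(f);

    double h = 0.0;
    for (int i = 0; i < 256; i++) {
        if (count[i] == 0) continue;
        double p = (double)count[i] / (double)total;
        h -= p * log2(p);      /* each symbol ideally costs -log2(p) bits */
    }
    printf("entropy: %.3f bits/byte (floor: %.1f%% of original size)\n",
           h, 100.0 * h / 8.0);
    return 0;
}
```

Run it on a text file and then on its gzipped version: the former sits well below 8 bits/byte, the latter very close to it.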
1.4 The two-stage strategy
Modern lossless compressors combine two stages:
- Dictionary / LZ: replace repeated sequences such as "ABCDEF...ABCDEF" with back-references. Removes structural redundancy.
- Entropy coding: exploit statistical bias in the remaining bytes. Frequent bytes get short codes; rare bytes get long ones.
The two stages evolve independently; different compressors pair different LZ and entropy components.
2. LZ77 — the root of it all
2.1 Idea
Lempel and Ziv's 1977 paper: replace repeated sequences with a back-reference to an earlier position.
Input: ABABABAB
A -> "A" (literal)
B -> "B" (literal)
AB -> "copy from -2, length 2"
AB -> "copy from -2, length 2"
Output:
(0, 0, A)
(0, 0, B)
(2, 2, _)
(2, 2, _)
Each token is (offset, length, next_char).
2.2 Sliding window
Real implementations only search the last N bytes.
[search buffer (N bytes)] [lookahead buffer (M bytes)] [unprocessed data]
Each step: find where the lookahead prefix last appeared in the search buffer, pick the longest match, emit (offset, length, next_char), advance by length + 1.
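A brute-force sketch of that loop in C, assuming no window limit and no minimum match length (real encoders add both, plus the hash chains of 2.6):

```c
/* Toy LZ77 tokenizer: brute-force longest-match search over the
   processed prefix, emitting (offset, length, next) triples as in 2.1. */
#include <stdio.h>
#include <string.h>

static void lz77_tokens(const char *s) {
    size_t n = strlen(s), i = 0;
    while (i < n) {
        size_t best_len = 0, best_off = 0;
        for (size_t j = 0; j < i; j++) {           /* candidate match start */
            size_t len = 0;
            while (i + len < n - 1 && s[j + len] == s[i + len]) len++;
            if (len > best_len) { best_len = len; best_off = i - j; }
        }
        printf("(%zu, %zu, %c)\n", best_off, best_len, s[i + best_len]);
        i += best_len + 1;                         /* advance by length + 1 */
    }
}

int main(void) { lz77_tokens("ABABABAB"); return 0; }
```

On ABABABAB this prints (0, 0, A), (0, 0, B), (2, 5, B): a longest-match encoder folds the two short copies above into a single overlapping copy, which is legal because the copy source may run into bytes the copy itself just produced.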
2.3 LZSS (1982)
Improvement over LZ77: emitting a 3-byte reference for a 3-byte match is a wash. LZSS uses a flag bit to distinguish literal from reference and only emits references of 3+ bytes. Most modern compressors are LZSS variants.
2.4 LZ78 and LZW
LZ78: builds an explicit dictionary instead of using a sliding window. LZW (Welch, 1984) is the variant used by GIF, Unix compress, and old PDF. Rarely used today — the LZ77 family dominates.
2.5 Back-reference efficiency
1 MB of repeated text:
- Raw: 1 MB.
- LZ77 (32 KB window): the first ~32 KB is literal, everything afterwards becomes "copy back 32 KB, length ...".
- Result: about 32 KB + metadata.
32x compression from a single mechanism: "point at the repetition".
2.6 Hash-based matching
Finding the longest match at each position is the bottleneck. The classic fix: hash chains. Hash every 3-byte prefix; chain the previous positions with the same hash. Candidate matches are found quickly.
Speed vs ratio trade-off:
- Longer chains: more candidates -> longer matches -> better ratio -> slower.
- Shorter chains: faster, lower ratio.
This is exactly gzip -1 vs gzip -9.
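A sketch of the head/prev scheme (constants and the hash function are illustrative, not gzip's actual source):

```c
/* Hash-chain match finder, sketched. head[h] holds the most recent
   position whose 3-byte prefix hashes to h; prev_[pos] chains to the
   previous such position. max_chain caps the walk (short for -1, long for -9). */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define HASH_BITS 15
#define HASH_SIZE (1u << HASH_BITS)
#define WINDOW    32768u

static uint32_t head[HASH_SIZE];   /* most recent position + 1, 0 = empty */
static uint32_t prev_[WINDOW];     /* previous position with the same hash */

static uint32_t hash3(const uint8_t *p) {
    uint32_t v = p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - HASH_BITS);  /* multiplicative hash */
}

static void insert_pos(const uint8_t *buf, uint32_t pos) {
    uint32_t h = hash3(buf + pos);
    prev_[pos & (WINDOW - 1)] = head[h];
    head[h] = pos + 1;
}

/* Walk at most max_chain candidates, newest first; return best length. */
static size_t best_match(const uint8_t *buf, uint32_t pos, size_t avail,
                         int max_chain, uint32_t *match_pos) {
    size_t best = 0;
    uint32_t cand = head[hash3(buf + pos)];
    while (cand != 0 && max_chain-- > 0) {
        uint32_t c = cand - 1;
        if (pos - c > WINDOW) break;          /* fell out of the window */
        size_t len = 0;
        while (len < avail && buf[c + len] == buf[pos + len]) len++;
        if (len > best) { best = len; *match_pos = c; }
        cand = prev_[c & (WINDOW - 1)];       /* older candidate */
    }
    return best;
}

int main(void) {
    const uint8_t *s = (const uint8_t *)"abcabcabcabc";
    uint32_t n = (uint32_t)strlen((const char *)s);
    for (uint32_t i = 0; i + 3 <= n; i++) {
        uint32_t mp = 0;
        size_t len = best_match(s, i, n - i, 8 /* like a low level */, &mp);
        if (len >= 3) printf("pos %u: match at %u, len %zu\n", i, mp, len);
        insert_pos(s, i);
    }
    return 0;
}
```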
3. Huffman coding
3.1 Idea
David Huffman, 1952 (a graduate assignment at MIT): frequent symbols get short codes, rare symbols get long codes.
Example frequencies:
a: 40%
b: 20%
c: 15%
d: 15%
e: 10%
Fixed 3-bit codes: 3 bits/symbol. Huffman yields:
a -> 0 (1 bit)
b -> 100 (3 bits)
c -> 101 (3 bits)
d -> 110 (3 bits)
e -> 111 (3 bits)
Average: 0.4x1 + 0.2x3 + 0.15x3 + 0.15x3 + 0.10x3 = 2.2 bits/symbol. About a 27% saving.
3.2 Prefix-free property
No code is a prefix of another, so the stream decodes unambiguously with no separators:
01000101110 -> 0 100 0 101 110 -> a b a c d
3.3 Huffman tree
A binary tree where leaves are symbols, the path is the code (left=0, right=1), and frequent symbols sit on shorter paths.
3.4 Tree construction
- Put every symbol into a priority queue by frequency.
- Pop the two smallest nodes, merge into a parent whose frequency is their sum.
- Push the parent back.
- Repeat until one node remains.
O(n log n). Greedy but provably optimal for prefix-free codes.
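A runnable sketch of that loop for the 3.1 frequencies; a linear scan stands in for the priority queue, which is fine at this alphabet size:

```c
/* Huffman tree construction, minimal sketch: repeatedly merge the two
   lowest-frequency nodes, then read code lengths off the tree. Real
   coders (DEFLATE) keep only the lengths and assign canonical codes (3.5). */
#include <stdio.h>

#define NSYM 5
#define MAXN (2 * NSYM)                  /* leaves + internal nodes */

static int freq[MAXN] = {40, 20, 15, 15, 10};   /* a b c d e (percent) */
static int left_[MAXN], right_[MAXN];
static int alive[MAXN];

static int pop_min(int n) {              /* index of the smallest live node */
    int best = -1;
    for (int i = 0; i < n; i++)
        if (alive[i] && (best < 0 || freq[i] < freq[best])) best = i;
    if (best >= 0) alive[best] = 0;
    return best;
}

static void depths(int node, int d, int *len) {
    if (node < NSYM) { len[node] = d; return; }  /* leaf: depth = code length */
    depths(left_[node], d + 1, len);
    depths(right_[node], d + 1, len);
}

int main(void) {
    for (int i = 0; i < NSYM; i++) alive[i] = 1;
    int n = NSYM;
    for (int m = 0; m < NSYM - 1; m++) { /* NSYM-1 merges build the tree */
        int a = pop_min(n), b = pop_min(n);
        left_[n] = a; right_[n] = b;
        freq[n] = freq[a] + freq[b];
        alive[n] = 1;
        n++;
    }
    int len[NSYM];
    depths(n - 1, 0, len);               /* last node created = root */
    double avg = 0.0;
    for (int i = 0; i < NSYM; i++) {
        printf("%c: %d bits\n", "abcde"[i], len[i]);
        avg += freq[i] / 100.0 * len[i];
    }
    printf("average: %.2f bits/symbol\n", avg);
    return 0;
}
```

It prints 1 bit for a and 3 bits for each of b, c, d, e: average 2.2 bits/symbol, matching the table above.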
3.5 Canonical Huffman
The same frequency table admits many tree shapes. Canonical form stores only the code lengths and assigns codes by rule, shrinking metadata. DEFLATE and most modern compressors use canonical Huffman.
3.6 Limits
Huffman has a floor of 1 bit/symbol. For a symbol with 90% probability, the ideal is about 0.15 bits but Huffman must emit 1 — an 85% overhead. Arithmetic coding and ANS fix this.
4. Arithmetic coding
4.1 Idea
Represent the entire message as a single real number, approaching the Shannon bound for any distribution.
4.2 Example
Alphabet a (50%), b (25%), c (25%). Encode "abc" starting from the interval [0, 1).
- 'a': range = [0, 0.5).
- 'b': range = [0.25, 0.375).
- 'c': range = [0.34375, 0.375).
Emit any real number inside the final interval (for example 0.3515625, binary 0.01011010). The decoder recovers the sequence by running the same probability model.
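The same narrowing, sketched with doubles (real coders use fixed-point integers and renormalize, since doubles run out of precision after a few dozen symbols):

```c
/* Interval narrowing for the 4.2 example, a demonstration sketch. */
#include <stdio.h>
#include <string.h>

int main(void) {
    /* model: a=50%, b=25%, c=25% -> cumulative bounds [0, .5, .75, 1.0] */
    const char *sym = "abc";
    const double cum[4] = {0.0, 0.5, 0.75, 1.0};

    const char *msg = "abc";
    double lo = 0.0, hi = 1.0;
    for (const char *p = msg; *p; p++) {
        int s = (int)(strchr(sym, *p) - sym);
        double w = hi - lo;
        hi = lo + w * cum[s + 1];        /* narrow to the symbol's sub-interval */
        lo = lo + w * cum[s];
        printf("'%c': [%g, %g)\n", *p, lo, hi);
    }
    printf("emit any number in [%g, %g), e.g. %g\n", lo, hi, (lo + hi) / 2);
    return 0;
}
```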
4.3 Efficiency
Each symbol costs exactly -log2(p) bits — essentially reaching Shannon's bound. Huffman uses integer bits; arithmetic uses fractional bits.
4.4 Why it stayed niche
- Speed: floating-point or precise integer arithmetic, 2-5x slower than Huffman.
- Patents: IBM patents restricted use for decades (expired in the 2000s).
- Complexity: hard to implement, easy to get wrong.
For 30 years it was "better in theory, impractical." Seen only in H.264 and JPEG 2000.
4.5 Range coding
An arithmetic-coding variant that works in integer arithmetic and sidestepped the patents. Used by several compressors, notably LZMA in 7-Zip.
5. ANS — the game changer
5.1 Background
In 2009, Polish mathematician Jarek Duda proposed ANS (Asymmetric Numeral Systems), quietly published in papers and mailing lists. Adoption accelerated once people realized ANS delivers arithmetic-quality compression at Huffman speed.
5.2 Idea
Encode the whole message into a single large integer, but using ordinary register operations. Behaves like arithmetic coding without floating point and with almost no division — reformulated as a finite state machine.
5.3 Finite State Entropy (FSE)
The practical variant, implemented by Yann Collet (author of Zstd).
- Each symbol has a next state assignment.
- Encode: symbol in -> state transition -> emit some bits.
- Decode: reverse.
The state machine is pre-computed into a table; runtime is lookup + shift — very fast.
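To see the single-integer idea without the table machinery, here is a toy coder for the range variant, rANS (FSE/tANS tabulates equivalent transitions). It is deliberately non-streaming: a 64-bit state holds a few dozen symbols, where a real coder would renormalize by flushing bytes:

```c
/* Toy rANS: the whole message lives in one integer state x. */
#include <stdio.h>
#include <stdint.h>

#define M 16u                               /* total frequency, power of two */
static const uint32_t f[3] = {8, 4, 4};     /* a: 50%, b: 25%, c: 25% */
static const uint32_t B[3] = {0, 8, 12};    /* cumulative starts */

int main(void) {
    const char *msg = "abacab";
    uint64_t x = M;                         /* state starts at the lower bound */

    /* encode: each symbol grows the state by ~log2(M / f[s]) bits */
    for (const char *p = msg; *p; p++) {
        uint32_t s = (uint32_t)(*p - 'a');
        x = (x / f[s]) * M + (x % f[s]) + B[s];
    }
    printf("final state: %llu\n", (unsigned long long)x);

    /* decode pops symbols in reverse order of encoding */
    char out[64]; int n = 0;
    while (x > M) {
        uint32_t slot = (uint32_t)(x % M);
        uint32_t s = slot < 8 ? 0u : (slot < 12 ? 1u : 2u);  /* find bucket */
        out[n++] = (char)('a' + s);
        x = f[s] * (x / M) + slot - B[s];
    }
    while (n--) putchar(out[n]);            /* un-reverse */
    putchar('\n');
    return 0;
}
```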
5.4 tANS vs rANS
- tANS (table-based ANS): state-transition table; fast, memory-heavy.
- rANS (range-ANS): range-based; no table, slightly slower, more flexible.
Zstd's FSE is tANS; some Facebook internal tools use rANS.
5.5 Performance
| Property | Huffman | Arithmetic | FSE |
|---|---|---|---|
| Output size | baseline | 0.5-3% smaller | 0.5-3% smaller |
| Encode speed | 1x | 0.3-0.5x | 1.2-2x (faster!) |
| Decode speed | 1x | 0.3-0.5x | 1.5-2x |
An upgrade with essentially no downside — "why didn't this exist earlier?"
5.6 Adoption
ANS-based compressors:
- Zstandard (2016): FSE. The most widespread.
- LZFSE (Apple, 2015): LZ + FSE, default on iOS/macOS.
- AV1 video codec: ANS entropy stage.
- Oodle (RAD Game Tools): commercial game compression.
(Brotli, by contrast, kept context-modeled Huffman rather than adopting ANS.)
Effectively the standard entropy coder for new compressors built after 2015.
6. DEFLATE — the gzip engine
6.1 Overview
Phil Katz, 1993, used in PKZIP; standardized as RFC 1951 (1996). LZ77 + Huffman. The most widely used compression format ever: gzip, ZIP, PNG, Content-Encoding: gzip. Still a baseline after 30 years.
6.2 Structure
Compressed data = a sequence of blocks. Each block has:
- Header: block type (stored, fixed Huffman, dynamic Huffman).
- Huffman tree (dynamic only).
- Compressed data: LZ77 tokens encoded with Huffman.
6.3 LZ77 stage
- Window: 32 KB.
- Match length: 3-258 bytes.
- Offset: 1-32768.
Each match is a (length, distance) pair; literals are 0-255 character codes.
6.4 Huffman stage
Two trees:
- Literal/length tree: literals (0-255) plus length codes (257-285).
- Distance tree: offset codes (0-29).
Lengths and distances map to log-scale buckets — small values exact, large values encoded with extra bits.
6.5 Levels
gzip -1 through gzip -9:
- -1: greedy matching, short hash chains.
- -9: lazy matching, long hash chains, near-exhaustive search.
Time/space trade-off — roughly 20% ratio difference across a 5-10x range of speed.
6.6 Performance
- Compress: ~50 MB/s (single core).
- Decompress: ~300 MB/s.
- Ratio: 3-5x on text.
Still serviceable, but modern alternatives (zstd, brotli) are better.
7. LZ4 — extreme speed
7.1 Philosophy
Yann Collet, 2011. Trade ratio for speed.
7.2 Design
- No entropy coding: LZ77 only.
- Simple format: byte-aligned, no bit tricks.
- Hash table: trivial hash function, single lookup.
- Small minimum match: 4 bytes.
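Those choices are visible in the block format itself. A simplified decoder sketch follows; it does no bounds checking and ignores the spec's end-of-block safety rules, so it is an illustration, not a safe decoder:

```c
/* LZ4 block format, simplified:
   [token][literal-length*][literals][offset: 2B LE][match-length*]...
   High token nibble = literal count, low nibble = match length - 4;
   a nibble value of 15 means "read extension bytes". */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static size_t lz4_block_decode(const uint8_t *src, size_t src_len, uint8_t *dst) {
    const uint8_t *ip = src, *iend = src + src_len;
    uint8_t *op = dst;
    while (ip < iend) {
        uint32_t token = *ip++;
        size_t lit = token >> 4;
        if (lit == 15) { uint8_t b; do { b = *ip++; lit += b; } while (b == 255); }
        memcpy(op, ip, lit); op += lit; ip += lit;
        if (ip >= iend) break;                 /* last sequence: literals only */
        size_t off = ip[0] | (ip[1] << 8); ip += 2;
        size_t mlen = (token & 15) + 4;
        if ((token & 15) == 15) { uint8_t b; do { b = *ip++; mlen += b; } while (b == 255); }
        const uint8_t *match = op - off;
        while (mlen--) *op++ = *match++;       /* byte copy handles overlap */
    }
    return (size_t)(op - dst);
}

int main(void) {
    /* hand-built block for "ABABABAB": 2 literals "AB", then copy
       offset 2, length 6 (token 0x22: lit nibble 2, match nibble 6-4=2).
       Real blocks must end in a literals-only sequence; this toy decoder
       tolerates ending on a match. */
    const uint8_t block[] = { 0x22, 'A', 'B', 0x02, 0x00 };
    uint8_t out[64];
    size_t n = lz4_block_decode(block, sizeof block, out);
    printf("%.*s\n", (int)n, out);             /* ABABABAB */
    return 0;
}
```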
7.3 Speed
- Compress: 500-800 MB/s (single core).
- Decompress: 2-5 GB/s (single core).
- Ratio: about 2-2.5x on text (half of gzip).
Perfect for environments where CPU is faster than the disk or network.
7.4 Where it is used
- RocksDB, Cassandra: on-disk storage compression.
- Zstandard: LZ4-like mode at low levels.
- MongoDB: snapshots and journals.
- Kernel: some Linux compression paths.
- Kafka: some configurations.
Anywhere you need real-time compression on a constrained CPU budget.
7.5 LZ4 HC
"High Compression" variant: slower, better ratio. Still faster and less-compressing than zstd.
7.6 Snappy
Google's similar compressor (2011): very fast, low ratio; used in LevelDB, BigTable, HBase. LZ4 is slightly faster and slightly denser — LZ4 usually wins.
8. Zstandard — the general-purpose king
8.1 Background
Yann Collet (again) at Facebook, released 2016. Goal: "3x faster than gzip with better ratios."
8.2 Components
- LZ77-style matching: 3+ byte matches, large window (8 MB by default).
- Entropy coding: FSE (ANS-based) for the literal-length/match-length/offset codes, Huffman for the literal bytes.
- Frame format: chunked, streaming-friendly.
8.3 Dictionary feature
Killer feature: pre-trained dictionaries.
```
# Train a dictionary from 100 JSON samples
zstd --train samples/*.json -o dict.bin

# Compress using the dictionary
zstd -D dict.bin input.json -o output.zst

# Decompress using the dictionary
zstd -D dict.bin -d output.zst -o input.json
```
Effect: tiny messages (hundreds of bytes) see 10-100x better ratios. Regular compressors do poorly on small payloads; Zstd dictionaries fix that. Ideal for Kafka, small HTTP APIs, chat messages.
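The same dictionary works from the C API. A minimal sketch against the public zstd.h (link with -lzstd; dict.bin is the file trained above; error handling abbreviated):

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void) {
    /* load the trained dictionary (path is illustrative) */
    FILE *df = fopen("dict.bin", "rb");
    if (!df) { perror("dict.bin"); return 1; }
    fseek(df, 0, SEEK_END); long dict_size = ftell(df); rewind(df);
    void *dict = malloc(dict_size);
    fread(dict, 1, dict_size, df); fclose(df);

    const char msg[] = "{\"name\":\"Alice\",\"age\":30}";
    size_t bound = ZSTD_compressBound(sizeof msg);
    void *dst = malloc(bound);

    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    size_t csize = ZSTD_compress_usingDict(cctx, dst, bound,
                                           msg, sizeof msg,
                                           dict, (size_t)dict_size,
                                           3 /* level */);
    if (ZSTD_isError(csize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(csize)); return 1; }
    printf("%zu -> %zu bytes\n", sizeof msg, csize);

    ZSTD_freeCCtx(cctx);
    free(dict); free(dst);
    return 0;
}
```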
8.4 Long-range mode
zstd --long searches matches over GB-scale windows. Great for backups and disk images.
8.5 Compression levels
zstd -1 (fast) through zstd -22 (max):
- -1: ~500 MB/s, ratio 2x. Competes with LZ4.
- -3 (default): ~200 MB/s, ratio 3x.
- -9: ~50 MB/s, ratio 3.5x.
- -19: ~10 MB/s, ratio 4.2x.
- -22: ~1 MB/s, ratio 4.5x.
8.6 Decompression speed
2-3 GB/s single core, independent of compression level. The biggest win.
8.7 Adoption
- Linux kernel compression (zstd option).
- Btrfs and ZFS filesystems.
- Arch Linux packages (the entire repo switched to zstd).
- Kafka, RocksDB, and other modern systems.
- Facebook's entire infrastructure.
The de facto new standard.
8.8 Code complexity
Zstd source is ~25,000 lines of C; gzip is ~5,000. Trade-off: complexity for performance. But it ships as a single binary with no external dependencies, so porting is straightforward.
9. Brotli — Google's pick
9.1 Background
Jyrki Alakuijala et al. at Google, 2015. Designed for HTTP/2 content compression.
Features:
- Built-in static dictionary: 120 KB of English/HTML/CSS/JS tokens.
- Context modeling: nearby context shapes the Huffman tree.
- Long back-references: 2 MB window.
9.2 Ratio
Similar to zstd, but slightly slower. Tuned for text — especially HTML/CSS/JS — where the static dictionary makes it a touch better than zstd.
9.3 Decompression
~400 MB/s. Slightly slower than zstd, still fast enough.
9.4 Where it is used
Mainly the web:
- HTTP Content-Encoding: br — supported by Chrome and Firefox.
- WOFF2 (Web Open Font Format 2).
- Android APK.
- Cloudflare edge compression.
General-purpose systems tend to use zstd.
9.5 Brotli vs Zstd (web)
10 MB HTML + CSS + JS
gzip (-9): 2.5 MB
brotli (11): 1.8 MB (28% better than gzip)
zstd (19): 1.9 MB
Brotli wins slightly on the web thanks to its static dictionary.
10. LZMA / 7z
10.1 Background
Igor Pavlov's LZMA (Lempel-Ziv-Markov chain Algorithm), developed in the late 1990s and the default codec in 7-Zip.
10.2 Features
- Huge window: up to 4 GB — finds very distant matches.
- Arithmetic coding (range coder).
- Markov model: conditional probabilities from literal context.
- Very complex implementation.
10.3 Performance
- Compress: very slow (a few MB/s).
- Decompress: tens of MB/s.
- Ratio: excellent — 5-15% better than Zstd.
10.4 Where it is used
- 7-Zip: the GUI/CLI archiver.
- xz (.tar.xz).
- Arch Linux packages (before the move to zstd).
- Some Debian packages.
"When you need extreme ratios and have time." Backups, distribution files.
10.5 Why zstd is displacing it
Zstd: slightly worse ratio, far faster. When Arch switched from xz to zstd it reported roughly 0.8% larger packages in exchange for about 13x faster decompression, a clear user-side net positive.
11. Comparison tables
11.1 Ratio and speed
Text input (10 MB Linux kernel source):
| Algorithm | Ratio | Compress | Decompress |
|---|---|---|---|
| LZ4 -1 | 2.1x | 700 MB/s | 3500 MB/s |
| LZ4 -9 (HC) | 2.5x | 30 MB/s | 3500 MB/s |
| Snappy | 2.0x | 600 MB/s | 2500 MB/s |
| gzip -1 | 2.7x | 200 MB/s | 400 MB/s |
| gzip -9 | 3.2x | 20 MB/s | 400 MB/s |
| zstd -1 | 2.8x | 500 MB/s | 2500 MB/s |
| zstd -3 | 3.4x | 250 MB/s | 2500 MB/s |
| zstd -9 | 3.6x | 80 MB/s | 2500 MB/s |
| zstd -19 | 4.0x | 10 MB/s | 2500 MB/s |
| brotli -11 | 4.1x | 8 MB/s | 400 MB/s |
| xz -9 | 4.3x | 3 MB/s | 70 MB/s |
11.2 Recommendations
- Real-time streaming / high-throughput I/O: LZ4.
- General-purpose file compression: Zstd -3.
- Backup/archive: Zstd -19 or xz.
- Web content: Brotli (fallback gzip).
- Many small messages: Zstd with dictionary.
- DB / storage engines: LZ4 for hot data, Zstd when space matters.
12. SIMD acceleration
12.1 Opportunities
Compression is sequential at heart, but several inner loops vectorize well:
- hash computation across many positions at once;
- match scanning via wide byte comparisons;
- parallel table lookups in Huffman decoding;
- interleaved state updates in ANS entropy decoding.
12.2 Modern implementations
- Zstd: SSE2/AVX2 support; SIMD accelerates literal copying.
- LZ4: the format is too simple for heavy SIMD, but long copies go through memcpy (SIMD internally).
- libdeflate: a SIMD-accelerated zlib replacement, 2-3x faster than gzip.
12.3 Bandwidth ceiling
Beyond a point, memory bandwidth is the real bottleneck. At multi-GB/s compression, L3/DRAM limits dominate.
12.4 Hardware offload
- Intel QAT: dedicated DEFLATE offload engines (around 100 Gbps per device).
- NVIDIA GPUs: the nvCOMP compression library; some DB engines use it.
- SmartNICs: compression offloaded to the NIC.
Large services (Facebook, Google) invest heavily here.
13. Dictionary training
13.1 Why
Normal compressors exploit redundancy inside the input. A 100-byte JSON has almost no internal redundancy:
{"name":"Alice","age":30}
It may even grow after compression due to metadata.
13.2 What a dictionary does
Across thousands of similar JSONs, strings like "name": and "age": repeat constantly. A dictionary stores those shared tokens so back-references can point into it. References become very short.
13.3 Training
```
zstd --train samples/*.json --maxdict=110K -o dict
```
- Read samples.
- Find the most frequent substrings statistically.
- Pack the winners into the dictionary.
- Write the dictionary file.
13.4 Usage
```
zstd -D dict input.json -o output.zst
```
Both encoder and decoder must share the dictionary.
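On the decode side that means handing the same dictionary bytes to the decompression context. A minimal sketch, again against zstd.h with error handling abbreviated:

```c
#include <zstd.h>

size_t decompress_with_dict(void *dst, size_t dst_cap,
                            const void *src, size_t src_size,
                            const void *dict, size_t dict_size) {
    ZSTD_DCtx *dctx = ZSTD_createDCtx();
    size_t n = ZSTD_decompress_usingDict(dctx, dst, dst_cap,
                                         src, src_size, dict, dict_size);
    ZSTD_freeDCtx(dctx);
    return n;   /* decompressed size, or an error code (check ZSTD_isError) */
}
```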
13.5 Numbers
- A 100-byte JSON with plain zstd: 90 bytes (10% saved).
- The same JSON with a 10 KB dictionary trained on 1000 samples: 30 bytes (70% saved).
A 3x smaller output, seven times the bytes saved. Essential when shipping many small messages.
13.6 Use cases
- Kafka: per-message payloads; 50-80% reduction.
- HTTP APIs: responses with fixed schema.
- Protobuf messages: schema-specific dictionaries.
- NoSQL KV: similar key-value pairs.
Facebook uses this at petabyte scale to cut storage.
14. Real examples
14.1 HTTP server
```
Request:   Accept-Encoding: gzip, br, zstd
Response:  Content-Encoding: zstd
```
Nginx (the zstd directives come from the third-party zstd module):
```
http {
    zstd on;
    zstd_comp_level 3;
    zstd_types text/html text/css application/javascript application/json;
}
```
For static files, precompress at build time (.br, .gz files) to save runtime CPU.
14.2 Kafka
```
compression.type=zstd
```
Producers compress batches. Level 3 reduces text messages by 3-5x. Brokers store and forward the compressed batches without decompressing — CPU efficient.
14.3 Kubernetes container images
OCI image layers default to gzip; the ecosystem is moving to zstd:
```
# BuildKit syntax; plain `docker build` has no compression flag
docker buildx build --output type=image,compression=zstd .
```
Pull times drop 30-50%. Big wins for large images (ML models).
14.4 Backup
```
tar -cf - /data | zstd -19 --long > backup.tar.zst
```
--long finds cross-file redundancy. Rsync + zstd is 10-30% faster and smaller than gzip pipelines.
14.5 Databases
RocksDB (the C++ enum is kZSTD):
```
options.compression = rocksdb::kZSTD;
options.bottommost_compression = rocksdb::kZSTD;
options.compression_opts.level = 3;
```
Tiered compression: LZ4 for L0-L4 (fast reads), zstd for L5-L6 (dense cold data).
ClickHouse:
```
CREATE TABLE events (
    ...
) ENGINE = MergeTree
ORDER BY ...                     -- MergeTree requires a sort key
SETTINGS index_granularity = 8192;

ALTER TABLE events MODIFY COLUMN data CODEC(ZSTD(3));
```
Different codecs per column.
15. Modern trends
15.1 Asymmetric speed
Exploit decode-faster-than-encode asymmetry: compress once, decompress many times (web, CDN, images). Invest encoder CPU so the decoder stays fast. Brotli -11, zstd -19.
15.2 Hardware acceleration
- Intel QAT on server CPUs.
- NVIDIA decompression engines (Ampere+).
- AWS Nitro cards.
Major investment area for hyperscalers.
15.3 ML-based compression
Google DeepMind experiments with LLM-based compression: the model predicts the next token, only the delta is stored. 5-10x compression on images and text — but slow and GPU-bound. Interesting for specialty data (genome, scientific).
15.4 Lossy text?
"Some error allowed" text compression is being researched. Natural-language entropy is about 1-1.5 bits/char, and LLMs may push the floor lower. Still academic.
15.5 A successor to ANS?
Ten years in, proposals like Pelican and rANS variants exist, but ANS is good enough that mainstream replacement is unlikely soon.
16. FAQ
Q: Recompressing an already-compressed file?
A: Practically no gain. JPEG, MP3, ZIP, already-gzipped files — usually under 1% savings, sometimes larger due to metadata.
Q: Solid vs per-file compression?
A: Solid (the whole thing is one stream) exploits cross-file redundancy for a better ratio, but you must decompress the whole archive to extract one file. Per-file is independent with random access but loses shared redundancy. ZIP is per-file; tar.gz / tar.zst / 7z are solid.
Q: Best ratio combination?
A: LZMA or zstd --long with multi-GB windows for large files.
Q: Compression with encryption?
A: Compress, then encrypt. Encrypted data is random and uncompressible. Beware CRIME/BREACH-style length attacks: mixing compression with sensitive data over HTTPS is dangerous.
Q: Streaming support?
A: Zstd, Brotli, gzip all have streaming APIs — you can compress without holding the full input in memory.
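A sketch of that loop with zstd's streaming API (public zstd.h, link with -lzstd): fixed-size chunks in, compressed chunks out, constant memory however large the input:

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int stream_compress(FILE *fin, FILE *fout, int level) {
    size_t in_cap = ZSTD_CStreamInSize(), out_cap = ZSTD_CStreamOutSize();
    void *inb = malloc(in_cap), *outb = malloc(out_cap);
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);

    for (;;) {
        size_t got = fread(inb, 1, in_cap, fin);
        int last = got < in_cap;               /* short read = end of input */
        ZSTD_inBuffer in = { inb, got, 0 };
        for (;;) {                             /* drain until chunk consumed */
            ZSTD_outBuffer out = { outb, out_cap, 0 };
            size_t rem = ZSTD_compressStream2(cctx, &out, &in,
                                              last ? ZSTD_e_end : ZSTD_e_continue);
            fwrite(outb, 1, out.pos, fout);
            if (last ? rem == 0 : in.pos == in.size) break;
        }
        if (last) break;
    }
    ZSTD_freeCCtx(cctx);
    free(inb); free(outb);
    return 0;
}

int main(void) { return stream_compress(stdin, stdout, 3); }
```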
17. One-page summary
```
+------------------------------------------------------+
| Compression Cheat Sheet |
+------------------------------------------------------+
| Basics: |
| Two stages: LZ (dedup) + Entropy (statistics) |
| |
| LZ family: |
| LZ77 (sliding window, back-reference) |
| LZSS (minimum match length) |
| LZ78/LZW (explicit dictionary, deprecated) |
| |
| Entropy coders: |
| Huffman (1952): integer bits, fast |
| Arithmetic (1970s): fractional bits, slow, patents |
| ANS/FSE (2009): arithmetic quality at Huffman speed|
| |
| Main algorithms: |
| DEFLATE (1993): LZ77 + Huffman |
| LZ4 (2011): LZ77 only, ultra-fast |
| Zstandard (2016): LZ77 + FSE |
| Brotli (2015): static dictionary + context |
| LZMA (1998): LZ77 + Markov + range coder |
| |
| Speed vs ratio: |
| LZ4/Snappy: fast, low ratio |
| gzip -1: fast, medium |
| zstd -3: balanced (default pick) |
| gzip -9: slow, good ratio |
| zstd -19: slow, better ratio |
| brotli -11: slow, web-optimal |
| xz -9: very slow, best ratio |
| |
| Use cases: |
| Real-time: LZ4 |
| General: zstd -3 |
| Backup: zstd -19 or xz |
| Web: brotli |
| Small messages: zstd + dictionary |
| |
| Dictionary training: |
| zstd --train |
| 10-100x improvement on small messages |
| Great for Kafka, HTTP APIs, NoSQL |
| |
| Acceleration: |
| SIMD (libdeflate, zstd) |
| Hardware (Intel QAT) |
| GPU |
| |
| Caveats: |
| Do not recompress compressed data |
| Compress before encryption |
| Watch for CRIME/BREACH over HTTPS |
+------------------------------------------------------+
```
18. Quiz
Q1. Why is most lossless compression "two stages"?
A. Because two kinds of redundancy need different tools. (1) Structural redundancy: repeated sequences like "ABCDEFABCDEF" — LZ77's back-references handle this perfectly but cannot address per-symbol statistical bias. (2) Statistical bias: imbalanced symbol distributions such as "e" (roughly 13% of letters) vs "z" (0.1%) — entropy coding (Huffman/ANS) assigns short codes to frequent symbols. Separating the two stages means each can be optimized independently. DEFLATE, Zstd, and Brotli all follow this pattern. "Do it all at once" schemes like PPM or context mixing exist but are too complex and slow to be practical — textbook divide-and-conquer.
Q2. What is Huffman coding's fundamental limitation?
A. The 1-bit/symbol integer floor. Huffman gives each symbol a prefix-free code of integer length. Symbols with 50% probability get the ideal 1 bit. Symbols with 25% get 2 bits (ideal). But a symbol with 90% probability ideally needs ~0.15 bits while Huffman emits 1 bit — 85% overhead. High-frequency symbols like spaces or 'e' in text live in this regime, and Huffman cannot reach the Shannon bound. Arithmetic coding and ANS encode "0.15 bits per symbol" by packing multiple symbols into the same integer width, gaining 5-15% better ratios. Huffman dominated from 1952 to the 2000s because of its speed and patent freedom — then ANS (2009) solved both problems, which is why Zstd and LZFSE moved to it.
Q3. Why is ANS called a "game changer"?
A. It delivers arithmetic-coding ratios at Huffman speed. For 30 years the trade-off was "Huffman fast but suboptimal, Arithmetic optimal but slow and patent-encumbered." ANS (2009, Jarek Duda) solved both: (1) ratio near arithmetic, 5-10% better than Huffman; (2) speed 1.5-2x faster than Huffman (shift + lookup, no floats or division); (3) patent-free — Duda deliberately released it openly. Commercial adoption: LZFSE (Apple 2015), Zstd (Facebook 2016), AV1 (2018). "Huffman is replaced by FSE/ANS" is the new standard in modern compressors — a rare case where a classic CS trade-off was actually dissolved. Why it was not found earlier is a CS-history mystery.
Q4. What problem does Zstandard's dictionary training solve?
A. The lack of internal redundancy in small messages. Normal compressors (gzip, LZ4) only find repeats within the input. A 100-byte JSON like {"name":"Alice","age":30} has almost none, so the output is about 5% smaller — or sometimes larger due to metadata. But across thousands of similar JSONs, keys like "name": and "age": repeat heavily. Dictionary training (1) extracts the most common substrings from samples into a ~10 KB dictionary, (2) ships that dictionary to encoder and decoder, (3) uses it as a "virtual prefix" during compression. Result: the 100-byte JSON goes from 90 bytes (plain zstd) to 30 bytes (with dictionary) — a 3x smaller output and 7x the bytes saved. Facebook, Kafka, and HTTP APIs use this to shrink payloads 50-80%. Caveats: the dictionary must be shared; it helps only similar data. The principle is the old one — compression feeds on correlation — with the correlation supplied as shared context.
Q5. Why do databases prefer LZ4 even though gzip has a better ratio?
A. Decompression speed is the DB bottleneck. Read path: disk read -> decompress -> query. If gzip decode (~400 MB/s) is 10x slower than SSD reads (~5 GB/s), the CPU cannot keep up with I/O. LZ4 decompresses at 3-5 GB/s, close to SSD throughput. Yes, LZ4's 2x ratio uses more disk than gzip's 3x, but reads are an order of magnitude faster — critical for online queries. RocksDB uses tiered compression: LZ4 for hot L0-L1, zstd for cold L5-L6. Cassandra takes a similar approach. The lesson: what matters is pipeline throughput, not isolated ratio.
Q6. Why is Brotli better than Zstd for HTTP?
A. Static built-in dictionary plus HTML/CSS/JS tuning. Brotli ships a 120 KB built-in dictionary of English words, HTML tags, CSS properties, and JavaScript keywords. Web pages effectively get a huge virtual prefix, so small pages compress very well. Without a dictionary, Zstd lags slightly on web content. Benchmarks: on 10 MB of web content, brotli -11 beats zstd -19 by 5-10%. Brotli's context modeling (using nearby context to select the Huffman tree) is also more sophisticated. The trade-off is slow compression — Brotli is typically used for build-time precompression (*.html.br files), with runtime workloads going to zstd. Modern web stacks: static assets precompressed with Brotli; dynamic responses compressed with gzip or zstd; Cloudflare-style CDNs manage both at the edge.
Q7. Why does recompressing an already-compressed file (JPEG, MP3, ZIP) save almost nothing?
A. Compressed data is statistically near-random. Compression feeds on "redundancy plus bias"; well-compressed data, by definition, has almost none of either. LZ77 finds no back-references (every subsequence is unique), Huffman/ANS find no bias (every symbol is nearly equiprobable). The data already sits near Shannon's bound. Recompression yields (1) ~0-1% savings (noise), (2) small growth due to metadata, (3) wasted CPU. Exceptions: (a) redundancy between compressed files — archiving many copies of the same compressed file into a solid tar can still help, (b) lossy -> lossless re-encoding (e.g. JPEG to PNG) — different data. Operational rule: skip HTTP compression for already-compressed MIME types (image/jpeg, video/mp4, application/zip), which is why Nginx's gzip_types excludes them. Compression eats redundancy — on randomness, it starves.
19. Learning resources
Papers:
- Lempel & Ziv, "A Universal Algorithm for Sequential Data Compression" (1977).
- Huffman, "A Method for the Construction of Minimum-Redundancy Codes" (1952).
- Duda, "Asymmetric Numeral Systems" (2009).
- Yann Collet's Zstd blog posts.
Books:
- "Introduction to Data Compression" — Khalid Sayood.
- "Managing Gigabytes" — Witten, Moffat, Bell.
Blogs:
- https://fastcompression.blogspot.com — Yann Collet (LZ4, Zstd author).
- https://www.righto.com/2019/07/shanghaied-by-zsr.html — compression blog.
Tools:
- zstd, gzip, xz, brotli CLIs.
- lzbench: benchmarks many compressors.
- Squash benchmark (https://quixdb.github.io/squash-benchmark).
Source:
- Facebook/zstd on GitHub.
- LZ4 on GitHub.
- Brotli on GitHub.
If this post helped, check out these related ones:
- "SIMD / AVX / NEON Deep Dive" — acceleration techniques used by compressors.
- "RocksDB and LSM-Tree" — DB engines built on LZ4/Zstd.
- "CDN and Edge Caching" — deploying web compression.
- "Binary Serialization (Protobuf/Thrift/Avro)" — pre-compacted data formats.