- Published on
Binary Serialization Complete Guide 2025: Protobuf, Thrift, Avro, MessagePack, FlatBuffers, Cap'n Proto — Choosing Between Performance and Flexibility
- Authors

- Name
- Youngju Kim
- @fjvbn20031
Introduction: Is JSON Really the Best Choice?
A Common Scenario
When building a REST API, most developers naturally pick this:
{
"id": 12345,
"name": "Alice",
"email": "alice@example.com",
"age": 30
}
JSON is human-readable, language-neutral, and supported everywhere. It looks perfect.
But what about a service handling 1 billion requests per day? If each request is 1 KB of JSON, that's 1 TB of data per day. Parsing cost, network bandwidth, storage — all get expensive.
The same data in Protobuf is ~100 bytes. Parsing is over 10x faster. 100 GB instead of 1 TB per day. Over a year, that's hundreds of TB of difference.
Why Binary Serialization?
JSON's weaknesses:
- Size: Field names repeat in every message ("email" appears dozens of times).
- Parsing cost: Converting text to numbers/objects.
- No type information: Runtime type checking required.
- No schema: Documentation and validation separate.
Strengths of binary formats:
- Small: Numeric tags instead of field names.
- Fast: Direct memory mapping or simple decoding.
- Type-safe: Schema-based.
- Evolvable: Schema evolution support.
What This Post Compares
- Protocol Buffers (Google): The most widely used standard.
- Thrift (Facebook/Apache): RPC integration.
- Avro (Apache Hadoop): The de facto standard for big data.
- MessagePack: Binary replacement for JSON.
- FlatBuffers (Google): Zero-copy reads.
- Cap'n Proto (Sandstorm): Zero-copy plus RPC.
This post dives deep into each format's internal structure, encoding scheme, and when to use it.
1. Protocol Buffers: King of the Standard
History
- 2001: Google starts internal development.
- 2008: Open-sourced (proto2).
- 2016: proto3 released. Simpler syntax.
- 2023: Edition 2023 released. Gradual feature additions.
The default format for gRPC. All of Google uses it, and externally it's the most popular binary serialization.
IDL (Interface Definition Language)
Protobuf defines schemas in .proto files:
syntax = "proto3";
message Person {
int32 id = 1;
string name = 2;
string email = 3;
optional int32 age = 4;
repeated string phone_numbers = 5;
}
- Each field has a tag number (1, 2, 3...).
- The tag number is the key in wire format. Field names are used only for code generation.
Code Generation
The protoc compiler generates code for the target language:
protoc --java_out=. --python_out=. --go_out=. person.proto
Result: Java, Python, and Go classes. Each includes serialize/deserialize methods.
Wire Format: The Core Idea
Protobuf encoding is a key-value stream. Each field is:
[tag + wire_type][value]
Wire types:
- 0: Varint (int32/int64, bool, enum).
- 1: 64-bit (fixed64, sfixed64, double).
- 2: Length-delimited (string, bytes, embedded message, packed repeated).
- 5: 32-bit (fixed32, sfixed32, float).
Varint: Variable-Length Integer
Protobuf's key optimization: varint encoding. Small numbers use fewer bytes:
150 as varint:
150 = 0b10010110
Step 1: Split into 7-bit groups, least significant group first
→ [0b0010110, 0b0000001]
Step 2: Set the MSB of every byte except the last as a "continuation flag"
→ [0b10010110, 0b00000001]
Step 3: Emit as bytes
→ [0x96, 0x01]
Result:
- 0~127: 1 byte.
- 128~16383: 2 bytes.
- 16384~2097151: 3 bytes.
- And so on.
Small numbers take 1 byte, larger ones take more. Since most fields are small, this is efficient on average.
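The encoding steps above can be sketched in a few lines of Python (a minimal illustration, not the protobuf library's implementation):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, MSB as continuation flag."""
    out = bytearray()
    while True:
        byte = n & 0x7F              # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # MSB = 1: more bytes follow
        else:
            out.append(byte)         # MSB = 0: last byte
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Reassemble the 7-bit groups, least significant group first."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return result
```

`encode_varint(150)` yields `b'\x96\x01'`, matching the walkthrough above.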
Encoding Example
message Person {
int32 id = 1;
string name = 2;
}
id = 150, name = "Bob"
Wire:
Field 1 (id):
tag = (1 << 3) | 0 = 0x08 (field 1, wire type 0 = varint)
value = 150 varint = 0x96 0x01
Field 2 (name):
tag = (2 << 3) | 2 = 0x12 (field 2, wire type 2 = length-delimited)
length = 3
value = "Bob" = 0x42 0x6F 0x62
Total: 08 96 01 12 03 42 6F 62 (8 bytes)
8 bytes! The JSON is:
{"id":150,"name":"Bob"}
23 bytes. 2.9x smaller.
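Those 8 bytes can be reproduced by hand (a sketch of the tag/value framing, reusing the varint logic from earlier):

```python
def varint(n: int) -> bytes:
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def tag(field_number: int, wire_type: int) -> bytes:
    # The key byte: field number in the high bits, wire type in the low 3 bits.
    return varint((field_number << 3) | wire_type)

wire = (
    tag(1, 0) + varint(150)            # field 1 (id), wire type 0 = varint
    + tag(2, 2) + varint(3) + b"Bob"   # field 2 (name), wire type 2 = length-delimited
)
# wire.hex() == "0896011203426f62", 8 bytes
```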
ZigZag Encoding: Optimizing for Negative Numbers
Varint is optimized for positive numbers. Negative numbers become very large varints (due to two's complement):
-1 → sign-extended to 64 bits (0xFFFFFFFFFFFFFFFF) → a 10-byte varint!
Solution: sint32 and sint64 types use ZigZag encoding:
0 → 0
-1 → 1
1 → 2
-2 → 3
2 → 4
...
Formula: (n << 1) ^ (n >> 31)
Negative and positive numbers are mapped alternately, so smaller absolute values give shorter encodings.
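A sketch of that mapping in Python (masking to 32 bits to mimic fixed-width integers; Python's `>>` is an arithmetic shift, matching the formula):

```python
def zigzag32_encode(n: int) -> int:
    # (n << 1) ^ (n >> 31): maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

def zigzag32_decode(z: int) -> int:
    # Undo the interleaving: the low bit selects the sign.
    return (z >> 1) ^ -(z & 1)
```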
Schema Evolution
Protobuf supports forward/backward compatibility:
Rules:
- Never change tag numbers: the tag is the ID.
- Renaming fields is OK: unrelated to wire format.
- Adding new fields is OK: old clients ignore them.
- Deleting fields is OK, but mark tag reserved: prevents reuse.
- Don't use required: removed in proto3.
Example:
// v1
message Person {
int32 id = 1;
string name = 2;
}
// v2 (compatible)
message Person {
int32 id = 1;
string name = 2;
string email = 3; // new field
reserved 4, 5; // reserved for future
}
A v1 server receiving a message from a v2 client will simply ignore the email field. Vice versa works too. This is critical in distributed systems during deployments.
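This tolerance falls directly out of the wire format: a decoder reads each key, and even for an unknown field number the wire type tells it how to skip the value. A minimal Python sketch (handles only varint and length-delimited wire types):

```python
def read_varint(buf: bytes, i: int):
    result = shift = 0
    while True:
        b = buf[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def decode_known(buf: bytes, known_fields) -> dict:
    """Walk the key-value stream; keep known field numbers, skip the rest."""
    fields, i = {}, 0
    while i < len(buf):
        key, i = read_varint(buf, i)
        field, wire_type = key >> 3, key & 0x7
        if wire_type == 0:                # varint
            value, i = read_varint(buf, i)
        elif wire_type == 2:              # length-delimited
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if field in known_fields:         # a v1 client simply drops v2's new tags
            fields[field] = value
    return fields
```

Feeding it the 8-byte Person wire from earlier with `known_fields={1}` returns only the id; the name field is skipped, not a parse error.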
Default Values in Proto3
In proto3, all fields are optional. Absent fields get default values:
- Numeric: 0
- String: ""
- bool: false
- message: null
Pitfall: You cannot distinguish "is the value 0?" from "is the value missing?" To solve this, use the optional keyword (supported in proto3 since protobuf 3.15):
message Person {
optional int32 age = 1; // explicit presence tracking
}
Performance
Protobuf is very fast:
- 3~10x smaller than JSON.
- Parsing speed: 5~100x (depending on language implementation).
- Low memory allocation (reusable).
2. Apache Thrift: RPC Integration
Origin
- 2007: Facebook releases it.
- 2008: Apache Incubator.
- Today: Internal RPC standard at many large companies.
Emerged around the same time as Protobuf. Battle-tested at Facebook scale.
Differences from Protobuf
Thrift provides serialization + RPC + server framework all together:
service UserService {
Person getUser(1: i32 id) throws (1: NotFoundException e),
list<Person> listUsers(1: i32 limit, 2: i32 offset)
}
struct Person {
1: i32 id,
2: string name,
3: optional string email
}
Data structures and service definitions live in one file. Protobuf originally shipped without an RPC framework (gRPC came years later), whereas Thrift was integrated from the start.
Transport Layer
Thrift's unique aspect: separation of transport, protocol, and server.
Transport: how data moves.
- TSocket, TFramedTransport, TMemoryBuffer, ...
Protocol: how data is encoded.
- TBinaryProtocol: simple, fixed size.
- TCompactProtocol: variable length (varint-like).
- TJSONProtocol: JSON format.
Server: how requests are handled.
- TSimpleServer, TThreadPoolServer, TNonblockingServer, ...
This composability allows choosing the right combination per situation.
TCompactProtocol
Thrift's most efficient encoding. Similar ideas to Protobuf:
- Varint: integers.
- ZigZag: negatives.
- Field ID delta: stores only the difference from the previous field ID (smaller).
- Type + ID merge: wire type and field id in a single byte.
Result: Nearly identical size to Protobuf. In some cases slightly smaller.
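The delta and merge tricks can be illustrated like this (a simplified sketch of TCompactProtocol's short-form field header; the real long form encodes the id as a zigzag varint, shortened here to a single byte):

```python
def compact_field_header(prev_id: int, field_id: int, type_id: int) -> bytes:
    """Short form: if the id delta fits in 4 bits, pack (delta << 4) | type
    into one byte. Otherwise fall back to a type byte plus an explicit id."""
    delta = field_id - prev_id
    if 0 < delta <= 15:
        return bytes([(delta << 4) | type_id])
    return bytes([type_id, field_id])  # simplified long form

# Consecutive small field ids cost one byte each:
# compact_field_header(0, 1, 5) -> b'\x15', compact_field_header(1, 2, 8) -> b'\x18'
```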
Pros and Cons
Pros:
- Integrated RPC: self-contained without separate gRPC setup.
- Language support: 27+ languages.
- Mature: validated inside Facebook, Uber, Twitter.
- Flexible transport/protocol.
Cons:
- Fewer recent updates: Thrift 4 is in discussion but slow.
- Scattered docs: less organized than Protobuf.
- Pushed aside by gRPC: Google's marketing power.
Selection guide:
- New project: gRPC (Protobuf).
- Existing Thrift systems: keep as-is.
3. Apache Avro: Big Data's Friend
Background
A format born in the Hadoop ecosystem. Currently the de facto standard in Kafka, Spark, and Hive.
Core Feature: Schema Travels with Data
In Protobuf/Thrift, schemas are used at code generation time. Avro is different:
Avro's philosophy: Schema is delivered with the data or managed centrally.
Two modes:
- Container File Mode: schema embedded in file header.
- Schema Registry Mode: referenced by schema ID from a central registry.
Schema Example
Avro schemas are written in JSON:
{
"type": "record",
"name": "Person",
"fields": [
{ "name": "id", "type": "int" },
{ "name": "name", "type": "string" },
{ "name": "email", "type": ["null", "string"], "default": null },
{ "name": "age", "type": "int", "default": 0 }
]
}
Wire Format
Avro's encoding is extremely simple:
- Fixed-size types: stored as-is (int, long, float, double).
- ZigZag varint: for int and long.
- Variable size: length + data (string, bytes).
- Record: fields listed in schema order.
Important: No tag numbers! Field order is the structure.
Result: Very compact binary. Impossible to interpret without the schema.
Example
Same Person:
id=150, name="Bob", email=null, age=0
Avro wire:
150 (zigzag varint) = 0xAC 0x02 (2 bytes)
length 3 + "Bob" = 0x06 0x42 0x6F 0x62 (4 bytes)
null union tag = 0x00 (1 byte)
0 (zigzag varint) = 0x00 (1 byte)
Total: 8 bytes
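The layout can be reproduced by hand (a sketch; real Avro encoders are driven by the schema rather than hard-coded like this):

```python
def zigzag_varint(n: int) -> bytes:
    """Avro writes int/long as zigzag followed by varint (64-bit here)."""
    z = (n << 1) ^ (n >> 63)   # Python's >> is arithmetic, so negatives work
    out = bytearray()
    while True:
        b, z = z & 0x7F, z >> 7
        out.append(b | 0x80 if z else b)
        if not z:
            return bytes(out)

# Person{id=150, name="Bob", email=null, age=0}, fields in schema order, no tags:
record = (
    zigzag_varint(150)               # id -> 0xAC 0x02
    + zigzag_varint(3) + b"Bob"      # name: length + UTF-8 bytes
    + zigzag_varint(0)               # email: union branch 0 = null
    + zigzag_varint(0)               # age
)
# len(record) == 8
```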
Schema Evolution: Writer vs Reader Schema
Avro's killer feature: Writer schema is not equal to Reader schema.
- Writer schema: the schema used when writing the data.
- Reader schema: the schema expected when reading the data.
Avro automatically resolves differences between the two schemas:
Writer: {id, name, age}
Reader: {id, name, email}
- age is absent in the reader → ignored.
- email is absent in the writer → the reader's default is used.
This makes compatibility checks and migration smooth.
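The name-matching idea can be sketched with plain dicts (an illustration only; Avro's real resolution also handles aliases, type promotion, and union branches):

```python
def resolve(writer_record: dict, reader_defaults: dict) -> dict:
    """Match fields by name: writer-only fields are dropped,
    reader-only fields fall back to the reader's declared default."""
    return {
        name: writer_record.get(name, default)
        for name, default in reader_defaults.items()
    }

# Writer wrote {id, name, age}; the reader expects {id, name, email}.
decoded = resolve(
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 0, "name": "", "email": None},
)
# age is dropped; email falls back to the reader's default
```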
Schema Registry
Confluent Schema Registry centrally manages Avro schemas:
Producer:
schema_id = registry.register(schema)
message = [magic byte][schema_id (4B)][avro payload]
Consumer:
schema_id = extract_from_message
schema = registry.get(schema_id)
data = decode(payload, schema)
Each message contains only a 4-byte schema ID. The payload is pure Avro. Accumulated across tens to hundreds of GB of Kafka topics, these savings become enormous.
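The framing is simple enough to sketch directly (the Confluent wire format: magic byte 0, a 4-byte big-endian schema ID, then the raw Avro payload):

```python
import struct

MAGIC_BYTE = 0

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix the payload with the 5-byte Confluent header."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not Confluent-framed data")
    return schema_id, message[5:]
```

A consumer calls unframe, fetches the schema for that ID from the registry (caching it), then decodes the payload.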
Kafka + Avro + Schema Registry
These three form the golden combination for big data streaming:
- Kafka: message broker.
- Avro: efficient encoding + schema evolution.
- Schema Registry: central schema management + compatibility validation.
The default architecture of Confluent Platform. Adopted by Netflix, LinkedIn, Uber, and more.
Protobuf vs Avro
| Item | Protobuf | Avro |
|---|---|---|
| When schema needed | At code gen | At write/read |
| Field identification | Tag number | Order |
| Wire size | Similar | Similar (Avro slightly smaller) |
| Schema evolution | Good | Best |
| Big data friendly | Medium | Best |
| RPC support | gRPC | None (separate) |
| Language support | Broad | Medium |
Selection criteria:
- RPC: Protobuf (gRPC).
- Kafka / big data streaming: Avro.
- File storage: Avro (self-describing files).
4. MessagePack: A Direct JSON Replacement
Philosophy
MessagePack's slogan: "It's like JSON. but fast and small."
- 1:1 mapping with JSON.
- No schema required.
- Binary, so smaller and faster.
Example
JSON:
{"id": 150, "name": "Bob"}
MessagePack (hex):
82 A2 69 64 CC 96 A4 6E 61 6D 65 A3 42 6F 62
- 82: fixmap with 2 entries.
- A2 69 64: fixstr of length 2, "id".
- CC 96: uint8 value 150.
- A4 6E 61 6D 65: fixstr of length 4, "name".
- A3 42 6F 62: fixstr of length 3, "Bob".
Total 15 bytes, down from the 26-byte JSON above: roughly a 40% reduction.
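The walkthrough can be reproduced by hand without the msgpack library (a sketch covering just the marker types this example needs):

```python
def fixstr(s: str) -> bytes:
    """fixstr marker: 0xA0 | length, for strings under 32 bytes."""
    encoded = s.encode("utf-8")
    assert len(encoded) < 32
    return bytes([0xA0 | len(encoded)]) + encoded

packed = (
    bytes([0x80 | 2])        # fixmap with 2 entries
    + fixstr("id")
    + bytes([0xCC, 150])     # uint8: 150 exceeds the positive fixint range (0~127)
    + fixstr("name")
    + fixstr("Bob")
)
# packed.hex() == "82a26964cc96a46e616d65a3426f62", 15 bytes
```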
Characteristics
- No schema: works immediately.
- Type preservation: int, string, array, map, binary, etc.
- Extension: supports user-defined types (timestamp, UUID, etc.).
- Language support: 50+ languages.
Use Cases
- Redis: Lua script results, some internal communication.
- Fluentd: log collection protocol.
- ZeroMQ: message bindings.
- Games: real-time network protocols.
Differences from Protobuf
| Item | MessagePack | Protobuf |
|---|---|---|
| Schema | None | Required |
| Size | Medium | Small |
| Parse speed | Fast | Very fast |
| Self-describing | Yes | No |
| Schema evolution | Weak | Strong |
| IDE support | Little | Much |
MessagePack is the best fit as a JSON replacement when you want something faster and smaller than JSON without a schema system burden.
5. FlatBuffers: The Answer to Zero-Copy
Motivation
Developed by Google for games and mobile. Goal: zero deserialization time.
Problems with existing formats:
- Protobuf/Avro/MessagePack: reading requires full parsing. New object allocations.
- Small messages are fine, but large messages or frequently accessed data incurs overhead.
The Idea: Use Directly from Memory
FlatBuffers uses the memory layout itself as the serialization format:
Byte stream:
[root offset][vtable 1][table 1][vtable 2][table 2]...
When reading:
- Get the byte buffer (memcpy or mmap).
- Follow the root offset.
- Access fields directly without object allocation.
auto person = GetPerson(buffer);
int id = person->id(); // read memory directly without actual parsing
Result: Deserialization time approaches zero. No memory allocation. Cache-friendly.
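The access pattern can be illustrated with a toy fixed layout (this is not the real FlatBuffers vtable format; it only shows the "compute an offset and read" idea):

```python
import struct

# Toy layout: [int32 id][int32 name_length][name bytes], little-endian.
buf = struct.pack("<ii", 150, 3) + b"Bob"

def read_id(buffer: bytes) -> int:
    # One O(1) offset read; no parse pass, no object allocation.
    return struct.unpack_from("<i", buffer, 0)[0]

def read_name(buffer: bytes) -> bytes:
    (length,) = struct.unpack_from("<i", buffer, 4)
    return buffer[8:8 + length]   # a slice of the original buffer
```

read_id touches 4 bytes of the buffer and nothing else; that locality is the whole point of zero-copy.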
IDL
A schema similar to Protobuf:
namespace Game;
table Person {
id: int;
name: string;
email: string;
}
root_type Person;
Performance
- Deserialization: 1000x faster than Protobuf (near zero).
- Memory: no allocation.
- Serialization: slightly slower than Protobuf (vtable construction).
- Size: 20~50% larger than Protobuf.
Use Cases
- Game engines: Cocos2d, various game data.
- Mobile apps: data file formats.
- Apache Arrow: uses FlatBuffers for its IPC metadata.
- Facebook: major user.
- TensorFlow Lite: model files.
When to Use
- Accessed often but rarely written.
- Large data structures.
- Memory-constrained.
- Fast loading matters.
Downsides:
- Size: larger than Protobuf.
- Lower flexibility: writing is complex.
- Hard to debug: difficult to inspect the binary directly.
The Value of Random Access
A major strength of FlatBuffers: random access. If you want to read a specific field in a 1 GB file:
auto file = mmap(filename); // actual disk loading is lazy
auto root = GetRoot(file);
auto item_100 = root->items()->Get(100);
int value = item_100->value();
// only the pages actually needed are loaded from disk
No full parsing required. Ideal for large config files, game assets, and ML models.
6. Cap'n Proto: FlatBuffers' Cousin
Origin
Kenton Varda (a main developer of Protobuf v2) left Google to build this next-generation format. Philosophy:
"Protocol Buffers, infinity times faster."
Zero-Copy Plus RPC
FlatBuffers' zero-copy advantage plus a built-in RPC system. Used in the Sandstorm project.
Schema
struct Person {
id @0 :Int32;
name @1 :Text;
email @2 :Text;
}
Similar to Protobuf but with @N syntax for ordering.
Encoding Characteristics
- Aligned memory: 8-byte boundaries.
- Zero-copy capable: like FlatBuffers.
- Packed compression: optionally applied for size reduction.
- Random access: access a portion of a large message.
Canonical Form
Cap'n Proto tries to produce the same bytes for the same data (useful for hash-based comparison).
RPC Features
Time-traveling RPC (promise pipelining): the client can invoke methods on the promised result of a previous call before that result has even returned.
Client → A: getUser(1)
Client → A: (promise of user) → B.emailSummary(user)
Data flows directly between A and B. The client is not in the middle. Reduces latency.
Use Cases
Less commonly used than FlatBuffers, but used internally by large companies like Cloudflare.
FlatBuffers vs Cap'n Proto
| Item | FlatBuffers | Cap'n Proto |
|---|---|---|
| Zero-copy | Yes | Yes |
| Size | Medium | Similar |
| RPC | None | Yes |
| Maturity | More mature | Younger |
| Language support | Many | Few |
Cap'n Proto has some technical advantages, but FlatBuffers is more widely used in practice.
7. Other Notable Formats
CBOR (Concise Binary Object Representation)
IETF standard (RFC 8949). The "real" binary version of JSON. Used in IoT and CoAP.
- No schema.
- 1:1 mapping with JSON.
- Similar to MessagePack but standardized.
BSON (Binary JSON)
MongoDB's storage format. JSON-style with additional types (ObjectId, Date, Binary).
- No schema.
- Slightly larger than JSON but faster to parse.
- Mainly internal use at MongoDB.
Apache Arrow
Columnar memory format. Described earlier. Can serialize, but its main purpose is zero-copy interchange between processes.
Parquet
Columnar disk format. Described earlier. Less a serialization format and more a storage format, but provides cross-language compatibility.
Bencode
BitTorrent's protocol. Simple but inefficient.
Smile
JSON-like binary. Part of the Jackson library.
8. Performance Comparison: By the Numbers
Benchmark Scenario
A typical message (Person object, 100 fields, with nesting):
| Format | Size | Serialize | Deserialize |
|---|---|---|---|
| JSON | 5.2 KB | 85 us | 120 us |
| BSON | 4.8 KB | 75 us | 105 us |
| MessagePack | 3.8 KB | 45 us | 60 us |
| Protobuf | 2.8 KB | 15 us | 25 us |
| Avro | 2.6 KB | 20 us | 35 us |
| Thrift (Compact) | 2.7 KB | 18 us | 30 us |
| Cap'n Proto | 3.2 KB | 10 us | 1 us |
| FlatBuffers | 3.5 KB | 12 us | 1 us |
Key observations:
- Size (most to least compact): Avro &lt; Thrift &lt; Protobuf &lt; Cap'n Proto &lt; FlatBuffers &lt; MessagePack &lt; JSON.
- Serialization speed: all are faster than JSON.
- Deserialization: Cap'n Proto/FlatBuffers are overwhelming (zero-copy).
Real-World Throughput
Processing 1 GB of data (on a machine with 100 GB of system memory, so everything fits in RAM):
| Format | Time |
|---|---|
| JSON | 120 s |
| MessagePack | 60 s |
| Protobuf | 25 s |
| Avro (columnar compression) | 30 s |
| FlatBuffers | 2 s (mostly mmap time) |
FlatBuffers' advantage is stark for bulk data processing.
9. Schema Evolution Comparison
Schemas change over time. How does each format handle different versions?
Protobuf
- Tag-number based: never change.
- Adding fields: OK (new tag).
- Removing fields: OK (mark tag reserved).
- Type changes: only among compatible types (int32 is interchangeable with int64).
- Forward/backward compatibility: both possible.
Safe changes (v1 to v2):
- Adding new fields.
- Renaming fields.
- required to optional.
Dangerous changes:
- Changing tag numbers.
- Drastic type change (string to int).
- Reusing a tag after deleting the field.
Thrift
Similar to Protobuf. Tag-based and well-supports compatibility.
Avro
Handled via Writer plus Reader schema. Avro's schema resolution rules:
- Field matching: by name (aliases supported).
- Missing fields: default values used.
- Type promotion: int to long, float to double.
- Union evolution: can add null.
Avro has the most powerful evolution system. Complex changes are possible.
JSON
No schema, making evaluation difficult but effectively allowing anything.
- Adding new fields: easy (clients ignore).
- Removing fields: risky (clients may depend on them).
- Type changes: risky.
JSON Schema enables validation, but there's no code-level automatic compatibility.
FlatBuffers / Cap'n Proto
Tag-based. Similar rules to Protobuf. However, due to zero-copy, field rearrangement is forbidden.
Practical Rules
- Tag/field numbers are forever: never reuse them.
- Additions only: adding is safer than deleting.
- Optional by default: new fields should be optional.
- Default values: set them explicitly.
- Mark deprecated: note deprecation in code comments.
- Validate compatibility in CI: buf breaking (Protobuf), Schema Registry (Avro).
10. Selection Guide
Recommendations by Scenario
REST API (public API):
- JSON: still the default. Anyone can parse it.
- Internal optimization: MessagePack or Protobuf over HTTP.
gRPC services:
- Protobuf: the default. gRPC is designed around it.
Kafka / streaming:
- Avro plus Schema Registry: the standard combination.
- Alternative: Protobuf with Schema Registry.
Games / real-time networking:
- FlatBuffers: game data.
- Cap'n Proto: when RPC is also needed.
- MessagePack: when flexibility is needed.
Config files / CLIs:
- YAML/JSON/TOML: must be human-readable.
ML models:
- FlatBuffers: TensorFlow Lite style.
- Protobuf: ONNX style.
- SafeTensors: latest trend.
Logs / events:
- Avro: big data pipelines.
- Protobuf: gRPC logs.
- JSON: for human parsing.
IoT / embedded:
- CBOR: IoT standard.
- Protobuf: when performance matters.
- MessagePack: for flexibility.
MongoDB storage:
- BSON: fixed (MongoDB only).
Decision Tree
Do you want a schema?
|-- Yes
| |-- Need RPC integration? -> gRPC + Protobuf or Thrift
| |-- Big data evolution important? -> Avro
| |-- Need zero-copy? -> FlatBuffers or Cap'n Proto
| +-- General -> Protobuf
+-- No
|-- JSON compatibility? -> MessagePack or CBOR
|-- MongoDB? -> BSON
+-- Human-readable? -> JSON/YAML
11. Pitfalls and Real-World Lessons
Pitfall 1: "Using Protobuf When JSON Is Enough"
Problem: A complex code-generation pipeline for a small-scale API.
Lesson: The tool must justify the problem. At low traffic, JSON's developer convenience outweighs Protobuf's performance gains.
Pitfall 2: "Deleting an Old Field in Protobuf and Reusing the Tag"
Problem: v2 code misinterprets v1 data.
Lesson: Tag numbers are forever. If deletion is needed, mark as reserved.
Pitfall 3: "Losing the Writer Schema in Avro"
Problem: Old messages in a Kafka topic become unreadable.
Lesson: Use Schema Registry. Always preserve the writer schema.
Pitfall 4: "File Size Increase with FlatBuffers"
Problem: After migrating from Protobuf to FlatBuffers, file sizes increase.
Lesson: FlatBuffers is speed-first. If size matters, Protobuf is better. Or add a compression layer on top of FlatBuffers.
Pitfall 5: "Maintaining Code Generation for Dozens of Target Languages"
Problem: protoc version management, per-language builds.
Lesson: Use tools like monorepo plus buf. Automate in CI.
Pitfall 6: "Violating Schema Evolution Rules"
Problem: Adding a field as required, or changing types.
Lesson: Automated validation. buf breaking, Schema Registry compatibility check.
12. Practical Tools
buf: Modern Protobuf Tooling
Created by Uber and the community, buf dramatically improves Protobuf development:
# Lint
buf lint
# Breaking change detection
buf breaking --against .git#branch=main
# Generate code
buf generate
# Remote plugins (no need to install protoc)
buf generate --template buf.gen.yaml
Recommended for every project.
Confluent Schema Registry
Centrally manages Avro/Protobuf/JSON Schema:
- REST API for registering/retrieving schemas.
- Compatibility validation.
- Kafka integration.
// Kafka producer
properties.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
properties.put("schema.registry.url", "http://localhost:8081");
Essential in production Kafka plus Avro environments.
protoc-gen-validate (PGV)
Adds validation rules to Protobuf messages:
message Person {
int32 age = 1 [(validate.rules).int32 = { gte: 0, lte: 150 }];
string email = 2 [(validate.rules).string.email = true];
}
Generated code includes validate methods.
grpcurl / grpc_cli
Test gRPC services like curl:
grpcurl -d '{"id": 1}' localhost:50051 myservice.UserService/GetUser
Automatically obtains schemas via server reflection.
Quiz Review
Q1. Why is Protobuf's varint effective for small numbers?
A. Varint encodes values 7 bits at a time and uses the most significant bit (MSB) of each byte as a "is there a next byte?" flag.
Example:
- 0~127: 1 byte (MSB=0).
- 128~16383: 2 bytes.
- Larger numbers: more bytes.
A regular int32 always takes 4 bytes. Varint:
- 0: 1 byte
- 100: 1 byte
- 10000: 2 bytes
- 1000000: 3 bytes
- 2^31-1: 5 bytes (larger!)
Real-world observation: Most IDs, counts, bitmasks, and indices are small values. Across millions of messages, if most integers encode as 1~2 bytes, the total size shrinks significantly.
Exception: Negative numbers
What if you encode -1 as an int32 varint? Due to two's complement sign extension to 64 bits, it becomes 0xFFFFFFFFFFFFFFFF, requiring 10 bytes. That's a disaster.
Solution: ZigZag
The sint32 type uses ZigZag encoding:
- 0 → 0, -1 → 1, 1 → 2, -2 → 3, 2 → 4, ...
- The smaller the absolute value, the shorter the encoding.
So fields where negatives are common should use sint32/sint64. This is an important Protobuf tutorial lesson that's often overlooked.
Lesson: Knowing your data distribution and choosing the format accordingly makes a big difference. Protobuf's varint is a clever design leveraging the common distribution of small numbers.
Q2. Why does Avro "deliver schema with the data" unlike Protobuf, and what are the benefits?
A. Protobuf and Avro have fundamentally different philosophies.
Protobuf:
- Schema is only needed at code generation time.
- Wire format contains only tag numbers.
- Field names aren't needed; tags identify fields.
- At runtime, schemas are assumed to be "already known."
Avro:
- Schema is needed at write/read time.
- Wire format has no field names or tags (only order).
- Without the schema, interpretation is impossible.
- Two operational modes:
- Container file: schema embedded in file header.
- Schema Registry: only schema ID in messages; actual schema centrally stored.
Why this difference?
Avro was designed for big data and streaming. Characteristics of this environment:
- Data lives long and schemas evolve.
- Consumers may be unknown in the future.
- Writer and reader schemas routinely differ.
Avro's benefits:
- Writer vs Reader schema resolution:
  - Avro takes both schemas and resolves the differences automatically.
  - Protobuf is tag-based, so each side only needs its own schema.
  - In complex migrations, Avro is more flexible.
- Self-describing files:
  - Avro container files embed the schema, so any tool can read them.
  - Protobuf data can't be read without the .proto file.
- Schema Registry pattern:
  - Central schema management and compatibility validation.
  - The reason Kafka plus Avro became the big-data standard.
Counterpoint for Protobuf:
- Tag-based is more efficient (no field names).
- Code generation provides type safety.
- Integrated with gRPC.
Conclusion: Avro is optimized for the assumption that "data lives long and schema changes", while Protobuf is optimized for the assumption that "schema is deployed alongside code". Which is right depends on your operational environment:
- RPC systems (Protobuf favored)
- Big data streams (Avro favored)
- Storage files (Avro favored)
Both are proven technologies and the best choice in their respective domains.
Q3. How does FlatBuffers achieve "zero deserialization time"?
A. By using the memory layout itself as the serialization format.
Problem with traditional serialization: Protobuf, Avro, MessagePack, and others sequentially parse the serialized bytes:
// Protobuf style
Person person;
person.ParseFromArray(buffer, size); // parse entire data
int id = person.id(); // access already-parsed field
- Must read the entire buffer.
- Must allocate new objects and populate fields.
- Time and memory proportional to size.
FlatBuffers' approach: The serialized bytes are already a structure usable directly from memory:
// FlatBuffers
auto person = GetPerson(buffer); // just cast a pointer, no parsing
int id = person->id(); // internally compute offset, read memory directly
How it works:
- Root offset stored at the start of the buffer.
- Before each object, a vtable offset (handles variable structure).
- Vtable: "field X is at offset Y" information.
- Field access is offset addition plus memory read. O(1).
Benefits:
- Deserialization time: 0 (only buffer-mapping time).
- No memory allocation: reuse the existing buffer.
- Random access: for a 1 GB file needing only one field, only that part is loaded.
- mmap-friendly: even large files load lazily.
- Cache efficient: once read, stays in CPU cache for reuse.
Costs:
- Size: 20~50% larger than Protobuf due to vtable overhead.
- Complex serialization: constructed stepwise with a builder.
- Low flexibility: hard to modify the buffer once built.
- Debugging: hard to inspect binary structure directly.
When to use FlatBuffers:
- Game data: character info, level data. Read often, rarely changed.
- ML model files: TensorFlow Lite is a prime example.
- Mobile app data: fast loading matters.
- Embedded: memory-constrained, allocation is expensive.
When not to use:
- RPC: every request has a different message; can't reap zero-copy benefits.
- Small messages: at under a few KB, parse time is negligible.
- Frequent modifications: writing is cumbersome.
- Network size matters: Protobuf is smaller.
FlatBuffers vs typical parser:
- Parsing a 1 MB message 1000 times:
- Protobuf: 1000 times 20 ms = 20 seconds.
- FlatBuffers: 1000 times 0.001 ms = 1 ms.
That's a 20,000x difference; the gap is stark in bulk processing. This is why FlatBuffers became the standard for game engines and ML runtimes.
Lesson: A "serialization format" doesn't have to be "a parseable byte stream." FlatBuffers opened new performance territory with the bold choice of "memory layout equals wire format." Engineering sometimes makes leaps by overturning default assumptions.
Q4. Why is "deleting a field" in schema evolution dangerous?
A. There are three levels of risk.
1. Tag number reuse risk
In Protobuf/Thrift, tags are eternal IDs:
// v1
message Person {
int32 id = 1;
string name = 2;
string deprecated_field = 3; // delete this and...
}
// v2 (dangerous!)
message Person {
int32 id = 1;
string name = 2;
int32 new_field = 3; // reusing tag 3
}
Problem scenario:
- v1 data stored in a Kafka topic or DB still has deprecated_field = "hello".
- v2 code reads this data; tag 3 is interpreted as int32.
- Trying to read "hello" as int32 errors out or produces garbage data.
Solution: Use the reserved keyword to prevent reuse:
message Person {
int32 id = 1;
string name = 2;
reserved 3;
reserved "deprecated_field";
}
Now attempting to reuse tag 3 in v2 is a compile error.
2. When reader expects the field
In Avro, field deletion is OK if the field isn't in the reader schema:
- Writer v1: {id, name, age}
- Reader v2: {id, name} (no age)
- → age is ignored. Safe.
Problem: If you delete the field in the reader and the writer stops writing it?
- Reader v2 can read v1 data (ignores age).
- Reader v2 can read v2 data (no age).
- But when Reader v1 reads v2 data... v1 expected age but v2 doesn't have it → default or null. Business logic may break.
Lesson: Delete only after all readers are gone.
3. Data dependency
If client code is using the field, deletion is catastrophic:
# Old client code
age = person.age # KeyError / AttributeError / 0 (default)
send_birthday_wishes(person.age) # wrong results
Field deletion is an API contract change. All usage sites must be tracked.
Safe field-removal procedure:
- Mark (deprecated): note in code comments and docs.
- Stop using: modify all writer code to not set the field.
- Soak period: confirm over weeks/months that it's really unused.
- Update readers: remove the field from reader code.
- Update writers: remove from writer schema.
- Reserved: mark the tag as reserved (prevent reuse).
- Monitor: track errors related to this tag.
Practical advice:
- Adding is easy, deleting is hard: a "we'll remove it later" field lives forever.
- Add fields rather than new versions: prefer incremental additions to breaking changes.
- CI validation: buf breaking, Schema Registry compatibility checks.
- Graceful deprecation: mark old fields "DEPRECATED" and keep them for several quarters.
Conclusion: The "delete" button is easy, but the consequences ripple across multiple layers of distributed systems. The best approach is designing schemas to be extensible from the start. If deletion is necessary, proceed with a careful process.
This is what "schema design is hard to change once decided" means. A sloppy schema becomes eternal technical debt.
Q5. When should you keep using JSON and when should you switch to binary?
A. There's no simple answer. Multiple factors must be considered.
Reasons to keep using JSON:
- Public API: external developers use it. Must be testable with standard tools.
- Low traffic: if throughput is low, optimization benefits are less than complexity costs.
- Fast development: schema-generation pipeline setup is overhead.
- Debugging convenience: readable directly in logs.
- Diverse clients: JavaScript browsers, shell scripts, etc.
- Small messages: parsing overhead is negligible relative to network.
Signs to switch to binary:
- Traffic growth: 1B+ requests/day, bandwidth costs unignorable.
- CPU bottleneck: JSON parsing accounts for 10%+ in profiling.
- Internal services: not a public API but microservice-to-microservice communication.
- Schema evolution needed: version compatibility becomes important.
- Bugs from lack of type safety: outages from "typo in field name."
- Mobile/IoT: bandwidth and battery constrained.
Concrete thresholds:
Keep JSON:
- Under 1M requests/day.
- Average message under 10 KB.
- API is public.
- Small dev team.
Switch to MessagePack (gentle JSON upgrade):
- Keep JSON structure with slight size/speed improvement.
- OK without a schema.
- Limited internal communication.
Switch to Protobuf (full binary):
- Microservice architecture.
- Using gRPC.
- Type safety and schema evolution needed.
- Team can operate the tooling.
Switch to Avro (big data):
- Kafka-centric.
- Long-lived storage data.
- Schema evolution is core.
FlatBuffers (special cases):
- Games, mobile, ML models.
- Read often, written rarely.
Mixed strategy (most common):
Most real-world systems use a mix:
- External APIs: JSON (REST).
- Internal services: Protobuf (gRPC).
- Message queues: Avro (Kafka).
- DB storage: text or format-specific.
- Config files: YAML/JSON.
Don't try to unify on one format. Choose the right one for each boundary.
Decision framework:
- Measure: is JSON actually the bottleneck in the current system?
- Compare: benchmark with your own workload.
- Gradual migration: start with one service.
- Consider complexity cost: tooling, debugging, hiring.
- Make it reversible: maintain JSON compatibility or a gateway layer.
Common mistakes:
- "Read that Protobuf is fast, so adopt unconditionally": adds complexity without a real problem.
- "Size matters, so FlatBuffers": FlatBuffers is actually larger. Protobuf is smaller.
- "JSON parsing is slow": often not the actual bottleneck. Network or DB may be the real cause.
- "Same format everywhere": forcing Protobuf on a public API → developer attrition.
Conclusion: Solutions must solve actual problems. Start with JSON, measure real problems, then switch to binary at the right boundary. "JSON is slow" is a myth, not a fact. Millions of successful systems run on JSON.
This follows the general engineering principle: simplicity by default, complexity only when proven necessary. Binary serialization is a powerful tool, but the judgment behind choosing it matters more than the tool itself.
Closing: The Choice of Bytes
Key Takeaways
- Protobuf: the standard. RPC plus type safety.
- Thrift: Protobuf's cousin. Integrated RPC.
- Avro: the big data standard. Strongest schema evolution.
- MessagePack: the gentle JSON replacement.
- FlatBuffers: zero-copy. Games/ML.
- Cap'n Proto: FlatBuffers plus RPC.
- JSON: still the default for public APIs.
Core of the Choice
"Which format is best?" is the wrong question. "Which format is optimal for this situation?" is the right question.
- RPC: Protobuf
- Kafka: Avro
- Games: FlatBuffers
- Public API: JSON
- File storage: Avro (self-describing)
- IoT: CBOR/MessagePack
Final Lesson
A serialization format is the clothing of data. Change it to fit the situation. Poor choice means lifelong pain; good choice invisibly prevents countless problems.
Next time you design a new system, don't just say "I'll use JSON." Ask:
- How much data will flow?
- Will the schema change?
- Who are the consumers?
- Is performance the bottleneck?
- Is type safety important?
Choose the format based on those answers. That is the art of engineering.
References
- Protocol Buffers Language Guide (proto3)
- Protobuf Encoding Reference
- Apache Thrift Documentation
- Apache Avro Specification
- FlatBuffers Documentation
- Cap'n Proto
- MessagePack Specification
- buf: Modern Protobuf Tooling
- Confluent Schema Registry
- Designing Data-Intensive Applications, Ch.4 (Encoding)
- Martin Kleppmann: Thinking in Events
- Comparing Binary Formats: MessagePack, BSON, CBOR