
Binary Serialization Complete Guide 2025: Protobuf, Thrift, Avro, MessagePack, FlatBuffers, Cap'n Proto — Choosing Between Performance and Flexibility


Introduction: Is JSON Really the Best Choice?

A Common Scenario

When building a REST API, most developers naturally pick this:

{
  "id": 12345,
  "name": "Alice",
  "email": "alice@example.com",
  "age": 30
}

JSON is human-readable, language-neutral, and supported everywhere. It looks perfect.

But what about a service handling 1 billion requests per day? If each request is 1 KB of JSON, that's 1 TB of data per day. Parsing cost, network bandwidth, storage — all get expensive.

The same data in Protobuf is ~100 bytes. Parsing is over 10x faster. 100 GB per day instead of 1 TB. Over a year, that's hundreds of TB of difference.

Why Binary Serialization?

JSON's weaknesses:

  • Size: Field names repeat in every record; the key "email" is transmitted over and over.
  • Parsing cost: Converting text to numbers/objects.
  • No type information: Runtime type checking required.
  • No schema: Documentation and validation separate.

Strengths of binary formats:

  • Small: Numeric tags instead of field names.
  • Fast: Direct memory mapping or simple decoding.
  • Type-safe: Schema-based.
  • Evolvable: Schema evolution support.

What This Post Compares

  1. Protocol Buffers (Google): The most widely used standard.
  2. Thrift (Facebook/Apache): RPC integration.
  3. Avro (Apache Hadoop): The de facto standard for big data.
  4. MessagePack: Binary replacement for JSON.
  5. FlatBuffers (Google): Zero-copy reads.
  6. Cap'n Proto (Sandstorm): Zero-copy plus RPC.

This post dives deep into each format's internal structure, encoding scheme, and when to use it.


1. Protocol Buffers: King of the Standard

History

  • 2001: Google starts internal development.
  • 2008: Open-sourced (proto2).
  • 2016: proto3 released. Simpler syntax.
  • 2023: Edition 2023 released. Gradual feature additions.

The default format for gRPC. All of Google uses it, and externally it's the most popular binary serialization.

IDL (Interface Definition Language)

Protobuf defines schemas in .proto files:

syntax = "proto3";

message Person {
  int32 id = 1;
  string name = 2;
  string email = 3;
  optional int32 age = 4;
  repeated string phone_numbers = 5;
}
  • Each field has a tag number (1, 2, 3...).
  • The tag number is the key in wire format. Field names are used only for code generation.

Code Generation

The protoc compiler generates code for the target language:

protoc --java_out=. --python_out=. --go_out=. person.proto

Result: Java, Python, and Go classes. Each includes serialize/deserialize methods.

Wire Format: The Core Idea

Protobuf encoding is a key-value stream. Each field is:

[tag + wire_type][value]

Wire types:

  • 0: Varint (integers, boolean, enum).
  • 1: 64-bit (fixed64, double).
  • 2: Length-delimited (string, bytes, message, packed repeated).
  • 5: 32-bit (fixed32, float).

Varint: Variable-Length Integer

Protobuf's key optimization: varint encoding. Small numbers use fewer bytes:

150 as varint:
150 = 0b10010110

Step 1: Split into 7-bit groups
[0b0000001, 0b0010110]
Step 2: Reorder little-endian (least-significant group first)
[0b0010110, 0b0000001]
Step 3: Set the MSB of each byte except the last as a "continuation flag"
[0b10010110, 0b00000001]
= [0x96, 0x01]

Result:

  • 0~127: 1 byte.
  • 128~16383: 2 bytes.
  • 16384~2097151: 3 bytes.
  • And so on.

Small numbers take 1 byte, larger ones take more. Since most fields are small, this is efficient on average.
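
To make this concrete, here is a minimal Python sketch of varint encoding (the function name is mine, for illustration only):

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protobuf varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F            # take the low 7 bits
        n >>= 7
        if n:                      # more bits remain: set the continuation MSB
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

assert encode_varint(150) == b"\x96\x01"  # matches the worked example above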

Encoding Example

message Person {
  int32 id = 1;
  string name = 2;
}
id = 150, name = "Bob"

Wire:

Field 1 (id):
  tag = (1 << 3) | 0 = 0x08  (field 1, wire type 0 = varint)
  value = 150 varint = 0x96 0x01
Field 2 (name):
  tag = (2 << 3) | 2 = 0x12  (field 2, wire type 2 = length-delimited)
  length = 3
  value = "Bob" = 0x42 0x6F 0x62

Total: 08 96 01 12 03 42 6F 62  (8 bytes)

8 bytes! The JSON is:

{"id":150,"name":"Bob"}

23 bytes. 2.9x smaller.
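
Using the encode_varint sketch from above, the same 8-byte message can be assembled by hand (a pedagogical sketch, not code generated by protoc):

def tag(field_number: int, wire_type: int) -> bytes:
    # A tag is itself a varint: (field_number << 3) | wire_type
    return encode_varint((field_number << 3) | wire_type)

msg = (
    tag(1, 0) + encode_varint(150)            # field 1, varint: id = 150
    + tag(2, 2) + encode_varint(3) + b"Bob"   # field 2, length-delimited: name
)
assert msg == bytes.fromhex("08 96 01 12 03 42 6f 62")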

ZigZag Encoding: Optimizing for Negative Numbers

Varint is optimized for positive numbers. Negative numbers become very large varints (due to two's complement):

-1 as int32 = 0xFFFFFFFFFFFFFFFF (sign-extended to 64 bits) = a 10-byte varint!

Solution: sint32 and sint64 types use ZigZag encoding:

 0 → 0
-1 → 1
 1 → 2
-2 → 3
 2 → 4
...

Formula: (n << 1) ^ (n >> 31)

Negative and positive numbers are mapped alternately, so smaller absolute values give shorter encodings.
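
A minimal Python sketch of both directions (Python's unbounded ints make the arithmetic shift behave correctly; a C implementation would mask to 32 or 64 bits):

def zigzag_encode(n: int, bits: int = 32) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> (bits - 1))

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
assert zigzag_decode(zigzag_encode(-10)) == -10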

Schema Evolution

Protobuf supports forward/backward compatibility:

Rules:

  1. Never change tag numbers: the tag is the ID.
  2. Renaming fields is OK: unrelated to wire format.
  3. Adding new fields is OK: old clients ignore them.
  4. Deleting fields is OK, but mark tag reserved: prevents reuse.
  5. Don't use required: removed in proto3.

Example:

// v1
message Person {
  int32 id = 1;
  string name = 2;
}

// v2 (compatible)
message Person {
  int32 id = 1;
  string name = 2;
  string email = 3;  // new field
  reserved 4, 5;     // reserved for future
}

A v1 server receiving a message from a v2 client will simply ignore the email field. Vice versa works too. This is critical in distributed systems during deployments.

Default Values in Proto3

In proto3, all fields are optional. Absent fields get default values:

  • Numeric: 0
  • String: ""
  • bool: false
  • message: null

Pitfall: You cannot distinguish "is the value 0?" from "is the value missing?" To solve this, use the optional keyword (supported since protobuf 3.15):

message Person {
  optional int32 age = 1;  // explicit presence tracking
}
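
In generated code this shows up as an explicit presence check. A sketch with Python generated code (assuming the .proto above was compiled with protoc --python_out, producing a hypothetical person_pb2 module):

from person_pb2 import Person  # hypothetical generated module

p = Person()
print(p.HasField("age"))  # False: never set
p.age = 0
print(p.HasField("age"))  # True: explicitly set to 0, distinguishable from absent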

Performance

Protobuf is very fast:

  • 3~10x smaller than JSON.
  • Parsing speed: 5~100x (depending on language implementation).
  • Low memory allocation (reusable).

2. Apache Thrift: RPC Integration

Origin

  • 2007: Facebook releases it.
  • 2008: Apache Incubator.
  • Today: Internal RPC standard at many large companies.

Emerged around the same time as Protobuf. Battle-tested at Facebook scale.

Differences from Protobuf

Thrift provides serialization + RPC + server framework all together:

service UserService {
  Person getUser(1: i32 id) throws (1: NotFoundException e),
  list<Person> listUsers(1: i32 limit, 2: i32 offset)
}

struct Person {
  1: i32 id,
  2: string name,
  3: optional string email
}

Data structures and service definitions live in one file. Protobuf was originally separate from gRPC, but Thrift was integrated.

Transport Layer

Thrift's unique aspect: separation of transport, protocol, and server.

Transport: how data moves.

  • TSocket, TFramedTransport, TMemoryBuffer, ...

Protocol: how data is encoded.

  • TBinaryProtocol: simple, fixed size.
  • TCompactProtocol: variable length (varint-like).
  • TJSONProtocol: JSON format.

Server: how requests are handled.

  • TSimpleServer, TThreadPoolServer, TNonblockingServer, ...

This composability allows choosing the right combination per situation.

TCompactProtocol

Thrift's most efficient encoding. Similar ideas to Protobuf:

  • Varint: integers.
  • ZigZag: negatives.
  • Field ID delta: stores only the difference from the previous field ID (smaller).
  • Type + ID merge: wire type and field id in a single byte.

Result: Nearly identical size to Protobuf. In some cases slightly smaller.
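
For illustration, a sketch of the field header, reusing the encode_varint and zigzag_encode sketches from the Protobuf section (type codes abbreviated; see the Thrift compact protocol spec for the full table):

I32 = 5  # TCompactProtocol's compact type code for i32

def compact_field_header(prev_id: int, field_id: int, compact_type: int) -> bytes:
    delta = field_id - prev_id
    if 1 <= delta <= 15:
        # Short form: high nibble = ID delta, low nibble = type
        return bytes([(delta << 4) | compact_type])
    # Long form: type byte, then the full field ID as a zigzag varint
    return bytes([compact_type]) + encode_varint(zigzag_encode(field_id, 16))

assert compact_field_header(0, 1, I32) == b"\x15"  # delta 1, type i32: one byte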

Pros and Cons

Pros:

  • Integrated RPC: self-contained without separate gRPC setup.
  • Language support: 27+ languages.
  • Mature: validated inside Facebook, Uber, Twitter.
  • Flexible transport/protocol.

Cons:

  • Fewer recent updates: Thrift 4 is in discussion but slow.
  • Scattered docs: less organized than Protobuf.
  • Pushed aside by gRPC: Google's marketing power.

Selection guide:

  • New project: gRPC (Protobuf).
  • Existing Thrift systems: keep as-is.

3. Apache Avro: Big Data's Friend

Background

A format born in the Hadoop ecosystem. Currently the de facto standard in Kafka, Spark, and Hive.

Core Feature: Schema Travels with Data

In Protobuf/Thrift, schemas are used at code generation time. Avro is different:

Avro's philosophy: Schema is delivered with the data or managed centrally.

Two modes:

  1. Container File Mode: schema embedded in file header.
  2. Schema Registry Mode: referenced by schema ID from a central registry.

Schema Example

Avro schemas are written in JSON:

{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name", "type": "string" },
    { "name": "email", "type": ["null", "string"], "default": null },
    { "name": "age", "type": "int", "default": 0 }
  ]
}

Wire Format

Avro's encoding is extremely simple:

  • Fixed-size types: stored as-is (int, long, float, double).
  • ZigZag varint: for int and long.
  • Variable size: length + data (string, bytes).
  • Record: fields listed in schema order.

Important: No tag numbers! Field order is the structure.

Result: Very compact binary. Impossible to interpret without the schema.

Example

Same Person:

id=150, name="Bob", email=null, age=0

Avro wire:

150 (zigzag varint) = 0xAC 0x02  (2 bytes)
length 3 + "Bob" = 0x06 0x42 0x6F 0x62  (4 bytes)
null union tag = 0x00  (1 byte)
0 (zigzag varint) = 0x00  (1 byte)

Total: 8 bytes
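
The whole record can be reproduced in a few lines of Python, reusing the varint and zigzag sketches from the Protobuf section (Avro's long encoding is the same zigzag varint):

def avro_long(n: int) -> bytes:
    return encode_varint(zigzag_encode(n, 64))

def avro_string(s: str) -> bytes:
    raw = s.encode("utf-8")
    return avro_long(len(raw)) + raw        # length prefix, then UTF-8 bytes

record = (
    avro_long(150)          # id: zigzag varint -> AC 02
    + avro_string("Bob")    # 06 (length 3), then 42 6F 62
    + avro_long(0)          # union branch 0 = null for email
    + avro_long(0)          # age = 0 -> 00
)
assert record == bytes.fromhex("ac 02 06 42 6f 62 00 00")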

Schema Evolution: Writer vs Reader Schema

Avro's killer feature: Writer schema is not equal to Reader schema.

  • Writer schema: the schema used when writing the data.
  • Reader schema: the schema expected when reading the data.

Avro automatically resolves differences between the two schemas:

Writer: {id, name, age}
Reader: {id, name, email}
  • age is absent in reader → ignored.
  • email is absent in writer → reader's default is used.

This makes compatibility checks and migration smooth.
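
A sketch of this resolution in Python using the third-party fastavro library (assuming it is installed; the schemas mirror the example above):

from io import BytesIO
import fastavro

writer_schema = {"type": "record", "name": "Person", "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
]}
reader_schema = {"type": "record", "name": "Person", "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": None},
]}

buf = BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"id": 1, "name": "Bob", "age": 30})
buf.seek(0)
# age is ignored, email falls back to its default:
print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
# {'id': 1, 'name': 'Bob', 'email': None}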

Schema Registry

Confluent Schema Registry centrally manages Avro schemas:

Producer:
  schema_id = registry.register(schema)
  message = [magic byte][schema_id (4B)][avro payload]

Consumer:
  schema_id = extract_from_message
  schema = registry.get(schema_id)
  data = decode(payload, schema)

Each message contains only a 4-byte schema ID. Payload is pure Avro. When accumulated across tens to hundreds of GB of Kafka topics, this savings becomes enormous.
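
The framing itself is trivial; a sketch of the documented Confluent wire format (magic byte 0x00, then the schema ID as a 4-byte big-endian integer, then the raw Avro payload):

import struct

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return b"\x00" + struct.pack(">I", schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    assert message[0] == 0, "unknown magic byte"
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]

schema_id, payload = unframe(frame(42, b"..."))  # round-trips: (42, b'...')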

Kafka + Avro + Schema Registry

These three form the golden combination for big data streaming:

  1. Kafka: message broker.
  2. Avro: efficient encoding + schema evolution.
  3. Schema Registry: central schema management + compatibility validation.

The default architecture of Confluent Platform. Adopted by Netflix, LinkedIn, Uber, and more.

Protobuf vs Avro

Item                  Protobuf     Avro
When schema needed    At code gen  At write/read
Field identification  Tag number   Order
Wire size             Similar      Similar (Avro slightly smaller)
Schema evolution      Good         Best
Big data friendly     Medium       Best
RPC support           gRPC         None (separate)
Language support      Broad        Medium

Selection criteria:

  • RPC: Protobuf (gRPC).
  • Kafka / big data streaming: Avro.
  • File storage: Avro (self-describing files).

4. MessagePack: A Direct JSON Replacement

Philosophy

MessagePack's slogan: "It's like JSON. but fast and small."

  • 1:1 mapping with JSON.
  • No schema required.
  • Binary, so smaller and faster.

Example

JSON:

{"id": 150, "name": "Bob"}

MessagePack (hex):

82 A2 69 64 CC 96 A4 6E 61 6D 65 A3 42 6F 62
  • 82: map with 2 items.
  • A2: fixstr of length 2 ("id").
  • CC 96: uint8 value 150.
  • A4: fixstr of length 4 ("name").
  • A3 42 6F 62: fixstr of length 3 "Bob".

Total 15 bytes, down from the JSON's 26 bytes: roughly a 40% reduction.
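
The same bytes can be produced with the msgpack Python package (an illustrative sketch, assuming the package is installed):

import msgpack

packed = msgpack.packb({"id": 150, "name": "Bob"})
print(packed.hex())             # 82a26964cc96a46e616d65a3426f62
print(len(packed))              # 15
print(msgpack.unpackb(packed))  # {'id': 150, 'name': 'Bob'} -- no schema needed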

Characteristics

  • No schema: works immediately.
  • Type preservation: int, string, array, map, binary, etc.
  • Extension: supports user-defined types (timestamp, UUID, etc.).
  • Language support: 50+ languages.

Use Cases

  1. Redis: Lua script results, some internal communication.
  2. Fluentd: log collection protocol.
  3. ZeroMQ: message bindings.
  4. Games: real-time network protocols.

Differences from Protobuf

Item              MessagePack  Protobuf
Schema            None         Required
Size              Medium       Small
Parse speed       Fast         Very fast
Self-describing   Yes          No
Schema evolution  Weak         Strong
IDE support       Little       Much

MessagePack is the best fit as a JSON replacement when you want something faster and smaller than JSON without a schema system burden.


5. FlatBuffers: The Answer to Zero-Copy

Motivation

Developed by Google for games and mobile. Goal: zero deserialization time.

Problems with existing formats:

  • Protobuf/Avro/MessagePack: reading requires full parsing. New object allocations.
  • Small messages are fine, but large messages or frequently accessed data incurs overhead.

The Idea: Use Directly from Memory

FlatBuffers uses the memory layout itself as the serialization format:

Byte stream:
[root offset][vtable 1][table 1][vtable 2][table 2]...

When reading:

  1. Get the byte buffer (memcpy or mmap).
  2. Follow the root offset.
  3. Access fields directly without object allocation.
auto person = GetPerson(buffer);
int id = person->id();  // read memory directly without actual parsing

Result: Deserialization time approaches zero. No memory allocation. Cache-friendly.

IDL

A schema similar to Protobuf:

namespace Game;

table Person {
  id: int;
  name: string;
  email: string;
}

root_type Person;

Performance

  • Deserialization: 1000x faster than Protobuf (near zero).
  • Memory: no allocation.
  • Serialization: slightly slower than Protobuf (vtable construction).
  • Size: 20~50% larger than Protobuf.

Use Cases

  1. Game engines: Cocos2d, various game data.
  2. Mobile apps: data file formats.
  3. Apache Arrow: some format sharing.
  4. Facebook: major user.
  5. TensorFlow Lite: model files.

When to Use

  • Accessed often but rarely written.
  • Large data structures.
  • Memory-constrained.
  • Fast loading matters.

Downsides:

  • Size: larger than Protobuf.
  • Lower flexibility: writing is complex.
  • Hard to debug: difficult to inspect the binary directly.

The Value of Random Access

A major strength of FlatBuffers: random access. If you want to read a specific field in a 1 GB file:

auto file = mmap(filename);  // actual disk loading is lazy
auto root = GetRoot(file);
auto item_100 = root->items()->Get(100);
int value = item_100->value();
// only the pages actually needed are loaded from disk

No full parsing required. Ideal for large config files, game assets, and ML models.


6. Cap'n Proto: FlatBuffers' Cousin

Origin

Kenton Varda (a main developer of Protobuf v2) left Google to build this next-generation format. Philosophy:

"Protocol Buffers, infinity times faster."

Zero-Copy Plus RPC

FlatBuffers' zero-copy advantage plus a built-in RPC system. Used in the Sandstorm project.

Schema

struct Person {
  id @0 :Int32;
  name @1 :Text;
  email @2 :Text;
}

Similar to Protobuf but with @N syntax for ordering.

Encoding Characteristics

  • Aligned memory: 8-byte boundaries.
  • Zero-copy capable: like FlatBuffers.
  • Packed compression: optionally applied for size reduction.
  • Random access: access a portion of a large message.

Canonical Form

Cap'n Proto tries to produce the same bytes for the same data (useful for hash-based comparison).

RPC Features

Time Traveling RPC: the server can forward results to another service before responding. Promise pipelining.

Client → A: getUser(1)                      (returns a promise immediately)
Client → B: emailSummary(promise of user)   (pipelined before A responds)

Data flows directly between A and B. The client is not in the middle. Reduces latency.

Use Cases

Less commonly used than FlatBuffers, but used internally by large companies like Cloudflare.

FlatBuffers vs Cap'n Proto

Item              FlatBuffers  Cap'n Proto
Zero-copy         Yes          Yes
Size              Medium       Similar
RPC               None         Yes
Maturity          More mature  Younger
Language support  Many         Few

Cap'n Proto has some technical advantages, but FlatBuffers is more widely used in practice.


7. Other Notable Formats

CBOR (Concise Binary Object Representation)

IETF standard (RFC 8949). The "real" binary version of JSON. Used in IoT and CoAP.

  • No schema.
  • 1:1 mapping with JSON.
  • Similar to MessagePack but standardized.

BSON (Binary JSON)

MongoDB's storage format. JSON-style with additional types (ObjectId, Date, Binary).

  • No schema.
  • Slightly larger than JSON but faster to parse.
  • Mainly internal use at MongoDB.

Apache Arrow

Columnar memory format. Described earlier. Can serialize, but its main purpose is zero-copy interchange between processes.

Parquet

Columnar disk format. Described earlier. Less a serialization format and more a storage format, but provides cross-language compatibility.

Bencode

BitTorrent's protocol. Simple but inefficient.

Smile

JSON-like binary. Part of the Jackson library.


8. Performance Comparison: By the Numbers

Benchmark Scenario

A typical message (Person object, 100 fields, with nesting):

Format            Size    Serialize  Deserialize
JSON              5.2 KB  85 us      120 us
BSON              4.8 KB  75 us      105 us
MessagePack       3.8 KB  45 us      60 us
Protobuf          2.8 KB  15 us      25 us
Avro              2.6 KB  20 us      35 us
Thrift (Compact)  2.7 KB  18 us      30 us
Cap'n Proto       3.2 KB  10 us      1 us
FlatBuffers       3.5 KB  12 us      1 us

Key observations:

  1. Size: Avro < Thrift < Protobuf < FlatBuffers < MessagePack < JSON (smallest to largest).
  2. Serialization speed: all are faster than JSON.
  3. Deserialization: Cap'n Proto/FlatBuffers are overwhelming (zero-copy).

Real-World Throughput

Processing 1 GB of data (on a machine with 100 GB of RAM, so everything fits in memory):

Format                    Time
JSON                      120 s
MessagePack               60 s
Protobuf                  25 s
Avro (block compression)  30 s
FlatBuffers               2 s (mostly mmap time)

FlatBuffers' advantage is stark for bulk data processing.


9. Schema Evolution Comparison

Schemas change over time. How does each format handle different versions?

Protobuf

  • Tag-number based: never change.
  • Adding fields: OK (new tag).
  • Removing fields: OK (mark tag reserved).
  • Type changes: only among compatible types (int32 is interchangeable with int64).
  • Forward/backward compatibility: both possible.

Safe changes (v1 to v2):

  • Adding new fields.
  • Renaming fields.
  • required to optional.

Dangerous changes:

  • Changing tag numbers.
  • Drastic type change (string to int).
  • Reusing a tag after deleting the field.

Thrift

Similar to Protobuf. Tag-based and well-supports compatibility.

Avro

Handled via Writer plus Reader schema. Avro's schema resolution rules:

  1. Field matching: by name (aliases supported).
  2. Missing fields: default values used.
  3. Type promotion: int to long, float to double.
  4. Union evolution: can add null.

Avro has the most powerful evolution system. Complex changes are possible.

JSON

No schema, so compatibility is hard to evaluate formally; in practice anything goes.

  • Adding new fields: easy (clients ignore).
  • Removing fields: risky (clients may depend on them).
  • Type changes: risky.

JSON Schema enables validation, but there's no code-level automatic compatibility.

FlatBuffers / Cap'n Proto

Tag-based. Similar rules to Protobuf. However, due to zero-copy, field rearrangement is forbidden.

Practical Rules

  1. Tag/field numbers are forever: never reuse them.
  2. Additions only: adding is safer than deleting.
  3. Optional by default: new fields should be optional.
  4. Default values: set them explicitly.
  5. Mark deprecated: note deprecation in code comments.
  6. Validate compatibility in CI: buf breaking (Protobuf), Schema Registry (Avro).

10. Selection Guide

Recommendations by Scenario

REST API (public API):

  • JSON: still the default. Anyone can parse it.
  • Internal optimization: MessagePack or Protobuf over HTTP.

gRPC services:

  • Protobuf: the default. gRPC is designed around it.

Kafka / streaming:

  • Avro plus Schema Registry: the standard combination.
  • Alternative: Protobuf with Schema Registry.

Games / real-time networking:

  • FlatBuffers: game data.
  • Cap'n Proto: when RPC is also needed.
  • MessagePack: when flexibility is needed.

Config files / CLIs:

  • YAML/JSON/TOML: must be human-readable.

ML models:

  • FlatBuffers: TensorFlow Lite style.
  • Protobuf: ONNX style.
  • SafeTensors: latest trend.

Logs / events:

  • Avro: big data pipelines.
  • Protobuf: gRPC logs.
  • JSON: for human parsing.

IoT / embedded:

  • CBOR: IoT standard.
  • Protobuf: when performance matters.
  • MessagePack: for flexibility.

MongoDB storage:

  • BSON: fixed (MongoDB only).

Decision Tree

Do you want a schema?
|-- Yes
|   |-- Need RPC integration? -> gRPC + Protobuf or Thrift
|   |-- Big data evolution important? -> Avro
|   |-- Need zero-copy? -> FlatBuffers or Cap'n Proto
|   +-- General -> Protobuf
+-- No
    |-- JSON compatibility? -> MessagePack or CBOR
    |-- MongoDB? -> BSON
    +-- Human-readable? -> JSON/YAML

11. Pitfalls and Real-World Lessons

Pitfall 1: "Using Protobuf When JSON Is Enough"

Problem: A complex code-generation pipeline for a small-scale API.

Lesson: The tool must justify the problem. At low traffic, JSON's developer convenience outweighs Protobuf's performance gains.

Pitfall 2: "Deleting an Old Field in Protobuf and Reusing the Tag"

Problem: v2 code misinterprets v1 data.

Lesson: Tag numbers are forever. If deletion is needed, mark as reserved.

Pitfall 3: "Losing the Writer Schema in Avro"

Problem: Old messages in a Kafka topic become unreadable.

Lesson: Use Schema Registry. Always preserve the writer schema.

Pitfall 4: "File Size Increase with FlatBuffers"

Problem: After migrating from Protobuf to FlatBuffers, file sizes increase.

Lesson: FlatBuffers is speed-first. If size matters, Protobuf is better. Or add a compression layer on top of FlatBuffers.

Pitfall 5: "Maintaining Code Generation for Dozens of Target Languages"

Problem: protoc version management, per-language builds.

Lesson: Use tools like monorepo plus buf. Automate in CI.

Pitfall 6: "Violating Schema Evolution Rules"

Problem: Adding a field as required, or changing types.

Lesson: Automated validation. buf breaking, Schema Registry compatibility check.


12. Practical Tools

buf: Modern Protobuf Tooling

Created by Uber and the community, buf dramatically improves Protobuf development:

# Lint
buf lint

# Breaking change detection
buf breaking --against .git#branch=main

# Generate code
buf generate

# Remote plugins (no need to install protoc)
buf generate --template buf.gen.yaml

Recommended for every project.

Confluent Schema Registry

Centrally manages Avro/Protobuf/JSON Schema:

  • REST API for registering/retrieving schemas.
  • Compatibility validation.
  • Kafka integration.
// Kafka producer
properties.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
properties.put("schema.registry.url", "http://localhost:8081");

Essential in production Kafka plus Avro environments.

protoc-gen-validate (PGV)

Adds validation rules to Protobuf messages:

message Person {
  int32 age = 1 [(validate.rules).int32 = { gte: 0, lte: 150 }];
  string email = 2 [(validate.rules).string.email = true];
}

Generated code includes validate methods.

grpcurl / grpc_cli

Test gRPC services like curl:

grpcurl -d '{"id": 1}' localhost:50051 myservice.UserService/GetUser

Automatically obtains schemas via server reflection.


Quiz Review

Q1. Why is Protobuf's varint effective for small numbers?

A. Varint encodes values 7 bits at a time and uses the most significant bit (MSB) of each byte as a "is there a next byte?" flag.

Example:

  • 0~127: 1 byte (MSB=0).
  • 128~16383: 2 bytes.
  • Larger numbers: more bytes.

Regular int32 case: always 4 bytes. Varint:

  • 0: 1 byte
  • 100: 1 byte
  • 10000: 2 bytes
  • 1000000: 3 bytes
  • 2^31-1: 5 bytes (larger!)

Real-world observation: Most IDs, counts, bitmasks, and indices are small values. Across millions of messages, if most integers encode as 1~2 bytes, the total size shrinks significantly.

Exception: negative numbers. What if you encode -1 as an int32 varint? Due to two's complement (sign-extended to 64 bits), it becomes 0xFFFFFFFFFFFFFFFF, requiring 10 bytes. That's a disaster.

Solution: ZigZag. The sint32 type uses ZigZag encoding:

  • 0 → 0, -1 → 1, 1 → 2, -2 → 3, 2 → 4, ...
  • The smaller the absolute value, the shorter the encoding.

So fields where negatives are common should use sint32/sint64. This is an important Protobuf lesson that's often overlooked.

Lesson: Knowing your data distribution and choosing the format accordingly makes a big difference. Protobuf's varint is a clever design leveraging the common distribution of small numbers.

Q2. Why does Avro "deliver schema with the data" unlike Protobuf, and what are the benefits?

A. Protobuf and Avro have fundamentally different philosophies.

Protobuf:

  • Schema is only needed at code generation time.
  • Wire format contains only tag numbers.
  • Field names aren't needed; tags identify fields.
  • At runtime, schemas are assumed to be "already known."

Avro:

  • Schema is needed at write/read time.
  • Wire format has no field names or tags (only order).
  • Without the schema, interpretation is impossible.
  • Two operational modes:
    1. Container file: schema embedded in file header.
    2. Schema Registry: only schema ID in messages; actual schema centrally stored.

Why this difference?

Avro was designed for big data and streaming. Characteristics of this environment:

  • Data lives long and schemas evolve.
  • Consumers may be unknown in the future.
  • A writer schema that differs from the reader schema is routine.

Avro's benefits:

  1. Writer vs Reader schema resolution:

    • Avro takes both schemas and converts automatically.
    • Protobuf is tag-based, so only one side needs to be known.
    • In complex migrations, Avro is more flexible.
  2. Self-describing files:

    • Avro container files contain the schema, so "any tool" can read them.
    • Protobuf can't be read without the .proto file.
  3. Schema Registry pattern:

    • Central schema management and compatibility validation.
    • The reason Kafka plus Avro became the big-data standard.

Counterpoint for Protobuf:

  • Tag-based is more efficient (no field names).
  • Code generation provides type safety.
  • Integrated with gRPC.

Conclusion: Avro is optimized for the assumption that "data lives long and schema changes", while Protobuf is optimized for the assumption that "schema is deployed alongside code". Which is right depends on your operational environment:

  • RPC systems (Protobuf favored)
  • Big data streams (Avro favored)
  • Storage files (Avro favored)

Both are proven technologies and the best choice in their respective domains.

Q3. How does FlatBuffers achieve "zero deserialization time"?

A. By using the memory layout itself as the serialization format.

Problem with traditional serialization: Protobuf, Avro, MessagePack, and others sequentially parse the serialized bytes:

// Protobuf style
Person person;
person.ParseFromArray(buffer, size);  // parse entire data
int id = person.id();                   // access already-parsed field
  • Must read the entire buffer.
  • Must allocate new objects and populate fields.
  • Time and memory proportional to size.

FlatBuffers' approach: The serialized bytes are already a structure usable directly from memory:

// FlatBuffers
auto person = GetPerson(buffer);  // just cast a pointer, no parsing
int id = person->id();             // internally compute offset, read memory directly

How it works:

  1. Root offset stored at the start of the buffer.
  2. Before each object, a vtable offset (handles variable structure).
  3. Vtable: "field X is at offset Y" information.
  4. Field access is offset addition plus memory read. O(1).

Benefits:

  1. Deserialization time: 0 (only buffer-mapping time).
  2. No memory allocation: reuse the existing buffer.
  3. Random access: for a 1 GB file needing only one field, only that part is loaded.
  4. mmap-friendly: even large files load lazily.
  5. Cache efficient: once read, stays in CPU cache for reuse.

Costs:

  1. Size: 20~50% larger than Protobuf due to vtable overhead.
  2. Complex serialization: constructed stepwise with a builder.
  3. Low flexibility: hard to modify the buffer once built.
  4. Debugging: hard to inspect binary structure directly.

When to use FlatBuffers:

  • Game data: character info, level data. Read often, rarely changed.
  • ML model files: TensorFlow Lite is a prime example.
  • Mobile app data: fast loading matters.
  • Embedded: memory-constrained, allocation is expensive.

When not to use:

  • RPC: every request has a different message; can't reap zero-copy benefits.
  • Small messages: at under a few KB, parse time is negligible.
  • Frequent modifications: writing is cumbersome.
  • Network size matters: Protobuf is smaller.

FlatBuffers vs typical parser:

  • Parsing a 1 MB message 1000 times:
    • Protobuf: 1000 times 20 ms = 20 seconds.
    • FlatBuffers: 1000 times 0.001 ms = 1 ms.

That is a 20,000x difference; the gap is stark in bulk processing. This is why FlatBuffers became the standard for game engines and ML runtimes.

Lesson: A "serialization format" doesn't have to be "a parseable byte stream." FlatBuffers opened new performance territory with the bold choice of "memory layout equals wire format." Engineering sometimes makes leaps by overturning default assumptions.

Q4. Why is "deleting a field" in schema evolution dangerous?

A. There are three levels of risk.

1. Tag number reuse risk

In Protobuf/Thrift, tags are eternal IDs:

// v1
message Person {
  int32 id = 1;
  string name = 2;
  string deprecated_field = 3;  // delete this and...
}

// v2 (dangerous!)
message Person {
  int32 id = 1;
  string name = 2;
  int32 new_field = 3;  // reusing tag 3
}

Problem scenario:

  • v1 data stored in a Kafka topic or DB still has deprecated_field = "hello".
  • v2 code reads this data; tag 3 is interpreted as int32.
  • Trying to read "hello" as int32 errors out or produces garbage data.

Solution: Use the reserved keyword to prevent reuse:

message Person {
  int32 id = 1;
  string name = 2;
  reserved 3;
  reserved "deprecated_field";
}

Now attempting to reuse tag 3 in v2 is a compile error.

2. When reader expects the field

In Avro, field deletion is OK if the field isn't in the reader schema:

  • Writer v1: {id, name, age}
  • Reader v2: {id, name} (no age)
  • → age ignored. Safe.

Problem: If you delete the field in the reader and the writer stops writing it?

  • Reader v2 can read v1 data (ignores age).
  • Reader v2 can read v2 data (no age).
  • But when Reader v1 reads v2 data... v1 expected age but v2 doesn't have it → default or null. Business logic may break.

Lesson: Delete only after all readers are gone.

3. Data dependency

If client code is using the field, deletion is catastrophic:

# Old client code
age = person.age  # KeyError / AttributeError / 0 (default)
send_birthday_wishes(person.age)  # wrong results

Field deletion is an API contract change. All usage sites must be tracked.

Safe field-removal procedure:

  1. Mark (deprecated): note in code comments and docs.
  2. Stop using: modify all writer code to not set the field.
  3. Soak period: confirm over weeks/months that it's really unused.
  4. Update readers: remove the field from reader code.
  5. Update writers: remove from writer schema.
  6. Reserved: mark the tag as reserved (prevent reuse).
  7. Monitor: track errors related to this tag.

Practical advice:

  • Adding is easy, deleting is hard: a "we'll remove it later" field lives forever.
  • Add fields rather than new versions: prefer incremental additions to breaking changes.
  • CI validation: buf breaking, Schema Registry compatibility check.
  • Graceful deprecation: mark old fields "DEPRECATED" and keep for several quarters.

Conclusion: The "delete" button is easy, but the consequences ripple across multiple layers of distributed systems. The best approach is designing schemas to be extensible from the start. If deletion is necessary, proceed with a careful process.

This is what "schema design is hard to change once decided" means. A sloppy schema becomes eternal technical debt.

Q5. When should you keep using JSON and when should you switch to binary?

A. There's no simple answer. Multiple factors must be considered.

Reasons to keep using JSON:

  1. Public API: external developers use it. Must be testable with standard tools.
  2. Low traffic: if throughput is low, optimization benefits are less than complexity costs.
  3. Fast development: schema-generation pipeline setup is overhead.
  4. Debugging convenience: readable directly in logs.
  5. Diverse clients: JavaScript browsers, shell scripts, etc.
  6. Small messages: parsing overhead is negligible relative to network.

Signs to switch to binary:

  1. Traffic growth: 1B+ requests/day, bandwidth costs unignorable.
  2. CPU bottleneck: JSON parsing accounts for 10%+ in profiling.
  3. Internal services: not a public API but microservice-to-microservice communication.
  4. Schema evolution needed: version compatibility becomes important.
  5. Bugs from lack of type safety: outages from "typo in field name."
  6. Mobile/IoT: bandwidth and battery constrained.

Concrete thresholds:

Keep JSON:

  • Under 1M requests/day.
  • Average message under 10 KB.
  • API is public.
  • Small dev team.

Switch to MessagePack (gentle JSON upgrade):

  • Keep JSON structure with slight size/speed improvement.
  • OK without a schema.
  • Limited internal communication.

Switch to Protobuf (full binary):

  • Microservice architecture.
  • Using gRPC.
  • Type safety and schema evolution needed.
  • Team can operate the tooling.

Switch to Avro (big data):

  • Kafka-centric.
  • Long-lived storage data.
  • Schema evolution is core.

FlatBuffers (special cases):

  • Games, mobile, ML models.
  • Read often, written rarely.

Mixed strategy (most common):

Most real-world systems use a mix:

  • External APIs: JSON (REST).
  • Internal services: Protobuf (gRPC).
  • Message queues: Avro (Kafka).
  • DB storage: text or format-specific.
  • Config files: YAML/JSON.

Don't try to unify on one format. Choose the right one for each boundary.

Decision framework:

  1. Measure: is JSON actually the bottleneck in the current system?
  2. Compare: benchmark with your own workload.
  3. Gradual migration: start with one service.
  4. Consider complexity cost: tooling, debugging, hiring.
  5. Make it reversible: maintain JSON compatibility or a gateway layer.

Common mistakes:

  • "Read that Protobuf is fast, so adopt unconditionally": adds complexity without a real problem.
  • "Size matters, so FlatBuffers": FlatBuffers is actually larger. Protobuf is smaller.
  • "JSON parsing is slow": often not the actual bottleneck. Network or DB may be the real cause.
  • "Same format everywhere": forcing Protobuf on a public API → developer attrition.

Conclusion: Solutions must solve actual problems. Start with JSON, measure real problems, then switch to binary at the right boundary. "JSON is slow" is a myth, not a fact. Millions of successful systems run on JSON.

This follows the general engineering principle: simplicity by default, complexity only when proven necessary. Binary serialization is a powerful tool, but the judgment behind choosing it matters more than the tool itself.


Closing: The Choice of Bytes

Key Takeaways

  1. Protobuf: the standard. RPC plus type safety.
  2. Thrift: Protobuf's cousin. Integrated RPC.
  3. Avro: the big data standard. Strongest schema evolution.
  4. MessagePack: the gentle JSON replacement.
  5. FlatBuffers: zero-copy. Games/ML.
  6. Cap'n Proto: FlatBuffers plus RPC.
  7. JSON: still the default for public APIs.

Core of the Choice

"Which format is best?" is the wrong question. "Which format is optimal for this situation?" is the right question.

  • RPC: Protobuf
  • Kafka: Avro
  • Games: FlatBuffers
  • Public API: JSON
  • File storage: Avro (self-describing)
  • IoT: CBOR/MessagePack

Final Lesson

A serialization format is the clothing of data. Change it to fit the situation. Poor choice means lifelong pain; good choice invisibly prevents countless problems.

Next time you design a new system, don't just say "I'll use JSON." Ask:

  • How much data will flow?
  • Will the schema change?
  • Who are the consumers?
  • Is performance the bottleneck?
  • Is type safety important?

Choose the format based on those answers. That is the art of engineering.

