Chaos and Order

Introduction — The Pattern You See Everywhere Once You Learn It
Location-Addressing vs Content-Addressing
Three Things You Get for Free — Dedup, Integrity, Caching
Immutability — Why It Naturally Follows
Merkle Trees and DAGs — Weaving Hashes into Structure
The Real-World Cases — Five Faces of the Same Idea
The Shadows — Limits of Content-Addressing
Conclusion — Learn One, See Ten
References

Introduction — The Pattern You See Everywhere Once You Learn It

Work with systems long enough and you have a strange experience: you learn a handful of completely different tools, and then one day you realize they were all variations on the same idea. Content-addressable storage (CAS) is exactly one of those ideas.

Consider how Git stores commits, how Docker shares image layers, how IPFS distributes files, how Nix isolates packages, and how BitTorrent verifies pieces. These five look unrelated, but underneath them lies one and the same notion: you address data not by where it is, but by what it is.

This post digs into that single idea: why addressing by content instead of location hands you deduplication, integrity, and caching for free, why immutability naturally follows, and how the idea scales up into Merkle trees and DAGs to hold up much of today's infrastructure.

Location-Addressing vs Content-Addressing

The traditional way we handle data is location-addressing. A file lives at a path like /home/user/report.pdf, a web resource lives at a URL like https://example.com/logo.png. Here the address points at "where it is." If you know the address, you go there and grab whatever is present.

The fundamental property of this scheme is that address and content are decoupled. You can completely replace the contents of report.pdf and the path stays the same. The same URL can serve different content today and tomorrow. The address is just the name of a container; it makes no promise about what is inside.

Content-addressing inverts this relationship. You compute the address from the content. Concretely, you run a cryptographic hash function (like SHA-256) over the data and use the resulting hash as that data's address.

  Location address:
    "This data is at /path/to/file"
    (address points at a location — contents can change)

  Content address:
    "The address of this data is hash(data)"
    (address is a fingerprint of the content — change the content, the address changes)

From this tiny inversion, a cascade of remarkable properties follows. To recap briefly what a hash is: a hash function takes data of arbitrary length and crushes it into a short, fixed-length value. A good cryptographic hash has two decisive properties. It is deterministic: the same input always yields the same output. And it is collision resistant: it is computationally infeasible to find two different inputs that yield the same output (such a pair is called a collision). Those two properties hold up all the magic of CAS. If you want to compute hashes yourself, the hash generator lets you see with your own eyes that the same input always yields the same hash, and that changing a single character completely changes the hash.
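A minimal sketch of those two properties in Python, using the standard library's hashlib. The helper name address is an assumption made for illustration, not part of any tool discussed here.

```python
import hashlib

def address(data: bytes) -> str:
    """Return the content address of data: its SHA-256 digest as hex."""
    return hashlib.sha256(data).hexdigest()

a = address(b"hello world")
b = address(b"hello world")   # same input again
c = address(b"hello worle")   # one character changed

print(a == b)    # True: identical input, identical address
print(a == c)    # False: a one-character change gives a different address
print(len(a))    # 64: the digest length is fixed regardless of input size
```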

Three Things You Get for Free — Dedup, Integrity, Caching

The real appeal of content-addressing is that the moment you set the address to be the hash of the content, three useful properties tag along on their own. You do not implement them separately; they fall out of the structure itself.

Deduplication is free. If two pieces of data are exactly the same, their hashes are the same too. That is, identical content has an identical address. So if that address already exists in the store, there is no need to store it again. Even if you try to store the same file a hundred times, it is actually stored only once. You need no separate comparison logic for dedup. You just check whether the address already exists.
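As a rough illustration, here is a toy in-memory store in Python. ContentStore and its put method are hypothetical names; the point is only that dedup reduces to a membership check on the address.

```python
import hashlib

class ContentStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Dedup is just a membership check on the address.
        if key not in self._objects:
            self._objects[key] = data
        return key

    def __len__(self) -> int:
        return len(self._objects)

store = ContentStore()
keys = {store.put(b"same report contents") for _ in range(100)}
store.put(b"a different report")

print(len(keys), len(store))   # 1 2: a hundred puts of one file, one stored object
```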

Integrity verification is free. When you receive data, all you do is re-hash it and check that it matches the address you requested. If it matches, the data is, barring a hash collision, exactly what that address names. If even a single bit was tampered with, the hash differs and no longer matches the address. With no separate checksum field or signature attached to the data, the address itself is the integrity check; the only thing you still have to trust is where you got the address from. This is why BitTorrent is safe even while receiving pieces from untrusted strangers.
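The check can be sketched in a few lines of Python. The names fetch_verified, honest_peer, and tampering_peer are assumptions for illustration; a real client would fetch over the network rather than call a local function.

```python
import hashlib

class IntegrityError(Exception):
    """Raised when received bytes do not hash to the requested address."""

def fetch_verified(expected: str, fetch) -> bytes:
    """Fetch data for an address from any source and accept it only if it re-hashes to that address."""
    data = fetch(expected)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise IntegrityError(f"expected {expected}, got {actual}")
    return data

original = b"piece of a file"
addr = hashlib.sha256(original).hexdigest()

def honest_peer(_addr: str) -> bytes:
    return original

def tampering_peer(_addr: str) -> bytes:
    return b"piece of a fi1e"   # one byte altered

print(fetch_verified(addr, honest_peer) == original)   # True
```

Calling fetch_verified(addr, tampering_peer) raises IntegrityError, without the caller needing to know anything about the peer.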

Caching is free. A content address never points at different content. The address hash(X) forever means only X. So once you have received something, you can cache it indefinitely. You never worry about the cache going stale. The dreaded "cache invalidation" problem of location-addressing simply never arises here. Because the address is the content, an identical address guarantees identical content.
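A sketch of what "cache indefinitely" looks like in code, assuming a dict stands in for the remote origin. CachingClient and remote_reads are hypothetical names. Note that the cache has no expiry time and no revalidation step, because an entry can never become stale.

```python
import hashlib

class CachingClient:
    """Toy client that caches by content address and never invalidates."""

    def __init__(self, origin: dict[str, bytes]) -> None:
        self._origin = origin          # stand-in for a remote store
        self._cache: dict[str, bytes] = {}
        self.remote_reads = 0

    def get(self, addr: str) -> bytes:
        if addr in self._cache:
            return self._cache[addr]   # no TTL, no revalidation needed
        self.remote_reads += 1
        data = self._origin[addr]
        self._cache[addr] = data
        return data

logo = b"logo bytes"
addr = hashlib.sha256(logo).hexdigest()
client = CachingClient({addr: logo})

client.get(addr)
client.get(addr)
print(client.remote_reads)   # 1: the second read is served from the cache
```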

There is a reason I emphasize that these three are "free." In a location-addressed system, deduplication, integrity, and cache invalidation are each hard problems that demand complex engineering. Content-addressing does not solve these problems. It makes them not exist in the first place.

Immutability — Why It Naturally Follows

Another fundamental property that follows from content-addressing is immutability. Objects in a content-addressable store are never modified in place. They cannot be.

The reason is simple. If you change the data, the hash changes, and if the hash changes, it is already a different object with a different address. The very concept of "modifying" what is stored at hash(X) does not hold. Changing X into Y is not a modification of hash(X); it is the creation of a new object hash(Y). The old object hash(X) remains untouched.

The implications of this immutability in practice are large.

Concurrency gets easier: Since objects never change, multiple processes can read simultaneously with no locks. There is no chance the content changes mid-read.
Versioning is natural: A new version does not overwrite the old one; it is added at a new address. Every version coexists at its own address. This is exactly how Git's history accumulates.
References are robust: A reference pointing at an address is guaranteed that its target has not been tampered with, because the address is a fingerprint of the content.

So how do you express "change"? Instead of modifying an immutable object, you create a new object and move a pointer (a mutable name) that points at it. This is exactly what a branch does in Git. The commit object itself is immutable, but the branch name main is a mutable pointer that points at the latest commit and moves to a new one when it appears. This pattern of layering a mutable pointer on top of immutable data repeats everywhere in CAS systems.
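The pattern can be sketched with two dictionaries: one for immutable objects, one for mutable names. This is a simplified model under assumed names (objects, refs, put), not how Git lays out its data on disk.

```python
import hashlib

objects: dict[str, bytes] = {}   # immutable layer: address -> content
refs: dict[str, str] = {}        # mutable layer: name -> address

def put(data: bytes) -> str:
    """Store data under its content address and return the address."""
    key = hashlib.sha256(data).hexdigest()
    objects[key] = data
    return key

v1 = put(b"draft 1")
refs["main"] = v1        # the name points at the first version

v2 = put(b"draft 2")     # a "change" creates a new object...
refs["main"] = v2        # ...and only the pointer moves

print(objects[refs["main"]])   # b'draft 2'
print(objects[v1])             # b'draft 1': the old object is untouched
```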

Merkle Trees and DAGs — Weaving Hashes into Structure

So far we have talked about addressing a single piece of data. But real systems deal not with one file but with complex structures: directory trees, commit histories, layer stacks. Merkle trees and Merkle DAGs are what you get when you extend content-addressing to such structures.

The core idea is this. When an object references other objects, it stores each reference as the hash of the child it points to. The children's hashes are thereby embedded in the parent's content, and therefore the parent's hash depends on the children's hashes.

  Root hash  (one fingerprint representing the whole)
      |
   +--+--+
   |     |
 hashA hashB      <- these two hashes are part of the root's content
   |     |
 data  data

The property that emerges from this structure is decisive. If even one piece of the data below changes, that piece's hash changes, the parent's hash that references it changes, its parent's hash changes, and eventually the root hash itself changes. Conversely, a single matching root hash guarantees that the entire tree below is identical down to the last byte.
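A minimal sketch of a binary Merkle root in Python shows the propagation. It assumes one simple convention among several (a node without a sibling is promoted unchanged); production designs usually also tag leaves and interior nodes differently, which is omitted here.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Root of a binary Merkle tree over the given leaves."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # The parent's content is the concatenation of its children's hashes.
                nxt.append(h(level[i] + level[i + 1]))
            else:
                nxt.append(level[i])   # odd node out: promote unchanged
        level = nxt
    return level[0]

files = [b"a.txt contents", b"b.txt contents", b"c.txt contents"]
root = merkle_root(files)
same = merkle_root(list(files))
changed = merkle_root([b"a.txt contents", b"b.txt contentz", b"c.txt contents"])

print(root == same)      # True: identical data, identical root
print(root == changed)   # False: one changed byte reaches the root
```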

Let's see why this is powerful.

Summarizing a huge structure with a single hash: An entire state made of millions of files is fingerprinted by one root hash. If two systems have the same root hash, everything is the same. That single comparison stands in for a vast verification.
Partial verification and partial transfer: Without receiving the whole tree, you can prove a specific piece's authenticity with just that piece and the sibling hashes along its path to the root. BitTorrent v2 verifies pieces this way before the whole file arrives; the original protocol gets the same effect with a flat list of piece hashes in the torrent metadata.
Efficient diffing: If two trees have different roots, you descend into the children and follow only the diverging branches. Branches with the same hash are skipped wholesale. This is Git's secret to finding changes quickly even across enormous histories.
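Partial verification can be sketched as a Merkle proof: the verifier holds only the root, receives one piece plus the sibling hashes on its path, and recomputes the root. The function names below are assumptions, and the tree convention (an unpaired node is promoted unchanged) is one choice among several.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """All levels of the tree, from leaf hashes up to the single root."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        cur, nxt = levels[-1], []
        for i in range(0, len(cur), 2):
            nxt.append(h(cur[i] + cur[i + 1]) if i + 1 < len(cur) else cur[i])
        levels.append(nxt)
    return levels

def make_proof(levels: list[list[bytes]], index: int) -> list[tuple[str, bytes]]:
    """Sibling hashes from leaf to root, each tagged with the sibling's side."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):   # a promoted node has no sibling at this level
            side = "left" if sibling < index else "right"
            proof.append((side, level[sibling]))
        index //= 2
    return proof

def verify(leaf: bytes, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    node = h(leaf)
    for side, sibling in proof:
        node = h(sibling + node) if side == "left" else h(node + sibling)
    return node == root

pieces = [b"piece-0", b"piece-1", b"piece-2", b"piece-3", b"piece-4"]
levels = build_levels(pieces)
root = levels[-1][0]
proof = make_proof(levels, 3)

print(verify(b"piece-3", proof, root))    # True
print(verify(b"piece-X", proof, root))    # False
```

The proof grows with the height of the tree rather than with the number of pieces, which is what makes checking one piece cheap.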

A Merkle DAG goes one step further. In a tree each node has a single parent, but in a DAG (directed acyclic graph) multiple parents can share the same child. Combined with deduplication, this is powerful. When several commits or several images reference the same subobject, that object is stored exactly once and everyone shares it by its hash.
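A small sketch of that sharing, using a toy object store. The "blob" and "tree" prefixes and the JSON encoding are assumptions made for illustration, not the format of any real system.

```python
import hashlib
import json

objects: dict[str, bytes] = {}

def put_blob(data: bytes) -> str:
    key = hashlib.sha256(b"blob\0" + data).hexdigest()
    objects[key] = data
    return key

def put_tree(entries: dict[str, str]) -> str:
    """A tree's content is the sorted mapping of names to child addresses."""
    body = json.dumps(entries, sort_keys=True).encode()
    key = hashlib.sha256(b"tree\0" + body).hexdigest()
    objects[key] = body
    return key

readme = put_blob(b"shared readme")
v1 = put_tree({"README": readme, "main.py": put_blob(b"print(1)")})
v2 = put_tree({"README": readme, "main.py": put_blob(b"print(2)")})

# Two snapshots, three distinct files: 2 trees + 3 blobs.
print(len(objects))   # 5: the shared README is stored once, with two parents
```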

The Real-World Cases — Five Faces of the Same Idea

Now let's see how this idea appears in real systems. The remarkable thing is that although the surfaces are completely different, the skeleton is identical.

Git. Git is the textbook example of a content-addressable store. Everything in Git (file contents as blobs, directories as trees, snapshots as commits) is an object addressed by the hash of its content. Even if the same file appears in many commits, the blob is stored only once (dedup). A commit's hash covers the commit's entire content, which includes the hash of its root tree and of its parent commits, so tampering with any point in history changes every hash after it (integrity). Git's commit graph is quite literally a Merkle DAG. If you want to practice how Git objects accumulate right in the browser, the Git playground is a good starting point.
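Git's blob addressing is simple enough to reproduce by hand. In a repository using the default SHA-1 object format, a blob's ID is the SHA-1 of a short header ("blob", the content length, a NUL byte) followed by the content. The function name git_blob_id is an assumption for this sketch.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

This should match what `echo 'hello world' | git hash-object --stdin` prints, which is a quick way to convince yourself that the address is nothing more than a function of the content.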

Docker and OCI images. A container image is a stack of layers, and each layer is identified by the hash (digest) of its content. If two images use the same base layer, that layer is stored and shared only once, both on disk and in the registry (dedup). This is why docker pull skips layers you already have: a layer whose digest is already in the local store is not fetched again (caching). For the layers it does download, the client recomputes the digest and checks it against the expected one (integrity). The manifest that represents the whole image is likewise a content-addressed structure woven from layer digests.
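The pull logic can be modeled roughly as follows. This is a simplified sketch, not Docker's implementation: the registry and the local store are plain dicts, layers are arbitrary bytes, and the function names are assumptions. The "sha256:" prefix follows the OCI digest format.

```python
import hashlib

def digest(blob: bytes) -> str:
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def pull(layer_digests: list[str], registry: dict[str, bytes],
         local: dict[str, bytes]) -> list[str]:
    """Fetch only missing layers, verify each, and return the digests downloaded."""
    downloaded = []
    for d in layer_digests:
        if d in local:
            continue                     # already present: skip the download
        blob = registry[d]
        if digest(blob) != d:            # verify before accepting
            raise ValueError(f"digest mismatch for {d}")
        local[d] = blob
        downloaded.append(d)
    return downloaded

base, app_a, app_b = b"base layer", b"app A layer", b"app B layer"
registry = {digest(x): x for x in (base, app_a, app_b)}
local: dict[str, bytes] = {}

first = pull([digest(base), digest(app_a)], registry, local)
second = pull([digest(base), digest(app_b)], registry, local)
print(len(first), len(second))   # 2 1: the shared base layer is fetched once
```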

IPFS. IPFS is a distributed file system that pushes content-addressing to web scale. A file is addressed not by location but by a CID (Content Identifier). The CID carries the content's hash, so no matter which node you receive it from, you can verify integrity via the CID. Large files are split into chunks woven into a Merkle DAG, so partial transfer and deduplication come naturally. When you ask the network "give me this CID," any node that has it can respond, because the address is decoupled from location.

Nix. The Nix package manager addresses each build output by the hash of its entire build input and isolates it under /nix/store. Strictly speaking this is input-addressing: the hash in the path fingerprints everything that went into the build rather than the output bytes, so the path changes whenever any input changes. The same input produces the same path, so the same package is not built twice. Different inputs produce different paths, so different versions coexist without conflict. This is the core of how Nix achieves reproducible builds and atomic rollbacks. The hash is, in effect, the isolation boundary.
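The idea of deriving a path from build inputs can be sketched as below. This is not Nix's actual path computation, which differs in detail; store_path, the JSON encoding, and the example inputs are all assumptions chosen to show the principle.

```python
import hashlib
import json

def store_path(name: str, inputs: dict) -> str:
    """Toy input-addressed path: hash a canonical encoding of the build inputs."""
    canonical = json.dumps(inputs, sort_keys=True).encode()
    fingerprint = hashlib.sha256(canonical).hexdigest()[:32]
    return f"/nix/store/{fingerprint}-{name}"

inputs_a = {"src": "hello-2.12.tar.gz", "compiler": "gcc-13", "flags": ["-O2"]}
inputs_b = {"src": "hello-2.12.tar.gz", "compiler": "gcc-14", "flags": ["-O2"]}

path_a = store_path("hello", inputs_a)
path_a_again = store_path("hello", dict(inputs_a))
path_b = store_path("hello", inputs_b)

print(path_a == path_a_again)   # True: same inputs, same path, no rebuild needed
print(path_a == path_b)         # False: a different compiler gives a separate path
```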

BitTorrent. BitTorrent splits a file into pieces and puts each piece's hash into the torrent metadata. A downloader receives pieces from untrusted strangers, but because it verifies each piece by hash, it instantly filters out corrupted or malicious pieces. It does not matter who you received it from; if the piece's hash matches, it is genuine. This is the archetype of content-addressing enabling zero-trust distributed transfer.
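A sketch of the piece check as described in the original BitTorrent specification, which uses SHA-1 per piece. The function names and the tiny piece length are assumptions for illustration; real torrents use much larger pieces.

```python
import hashlib

def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """What the torrent creator publishes: one SHA-1 digest per piece."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]

def accept_piece(index: int, piece: bytes, hashes: list[bytes]) -> bool:
    """What the downloader does with a piece from any peer, trusted or not."""
    return hashlib.sha1(piece).digest() == hashes[index]

payload = b"0123456789abcdefghijklmnopqrstuvwxyz"
hashes = piece_hashes(payload, piece_length=16)

print(len(hashes))                                  # 3 pieces: 16 + 16 + 4 bytes
print(accept_piece(1, payload[16:32], hashes))      # True: genuine piece
print(accept_piece(1, b"malicious bytes!", hashes)) # False: rejected
```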

Line these five up and the pattern becomes sharp. Version control, containers, distributed file systems, package management, P2P transfer — completely different problem domains, yet the skeleton of the solution is one. Address data by content, and weave those by hash into structure.

The Shadows — Limits of Content-Addressing

Powerful as this idea is, it is not a silver bullet. Content-addressing has clear costs and limits. You need to know them when designing tools.

Poor fit for mutable data: A content address is inherently for immutable data. Data that changes often produces a new address every time it changes, so to point at "the latest version of this name" you need a separate mutable-pointer layer (name → current hash). Git's branches and IPFS's IPNS are that layer. In other words, content addresses alone cannot express "change"; you have to layer a naming scheme on top.
The garbage collection problem: Immutable objects keep piling up. When to delete an object that nothing references anymore is a headache. You need a GC that finds and reaps objects unreachable from any mutable pointer, and that is non-trivial work. This is also why a Git repository can bloat over time until git gc prunes objects that are no longer reachable.
The cost of hashing: Running a cryptographic hash over all data is not free. On large data, the hashing itself can be a burden. That said, on modern hardware it is usually tolerable.
Collisions and aging hash functions: Security depends on the collision resistance of the hash function, and there is precedent for once-widely-used hash functions growing weak over time: practical collisions have been demonstrated for both MD5 and SHA-1. A content-addressed system's integrity guarantee collapses if its hash function breaks. Worse, migrating to a stronger hash is an extraordinarily hard task in a large system, because hundreds of millions of addresses are already baked in with the old hash.
The disappearance of location: Since the address does not carry a location, "where do I get this data" has to be solved separately. This is why IPFS keeps a separate layer — a DHT (distributed hash table) — to find "who has this CID." A content address tells you what it is, but not where it is.
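The garbage collection point above is the one that most directly turns into code. A minimal mark-and-sweep sketch, assuming each object records the addresses of its children and that refs is the mutable-pointer layer; short strings stand in for real hashes, and collect_garbage is a hypothetical name.

```python
def collect_garbage(objects: dict[str, list[str]], refs: dict[str, str]) -> set[str]:
    """Delete every object unreachable from a ref; return the deleted addresses."""
    reachable: set[str] = set()
    stack = list(refs.values())          # mark phase starts at the mutable pointers
    while stack:
        addr = stack.pop()
        if addr in reachable or addr not in objects:
            continue
        reachable.add(addr)
        stack.extend(objects[addr])      # follow child addresses
    garbage = set(objects) - reachable
    for addr in garbage:                 # sweep phase
        del objects[addr]
    return garbage

# address -> child addresses
objects = {
    "commit2": ["tree2"], "tree2": ["blobA", "blobC"],
    "commit1": ["tree1"], "tree1": ["blobA", "blobB"],
    "blobA": [], "blobB": [], "blobC": [],
}
refs = {"main": "commit2"}               # nothing points at commit1 anymore

print(sorted(collect_garbage(objects, refs)))
# ['blobB', 'commit1', 'tree1']
```

blobA survives because the live snapshot still shares it, which is exactly the reachability question that makes GC in these systems more than a simple delete.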

Taken together, these limits mean content-addressing is optimal for "immutable, integrity-critical, heavily duplicated, widely cached-and-shared data." For frequently changing state, or data where location is inherently important, a hybrid that layers content-addressing partially on top of location-addressing is realistic. In fact, every system above is a hybrid that layers a mutable naming scheme on top of content addresses (the immutable objects).

Conclusion — Learn One, See Ten

Content-addressing is, on its face, a modest idea. "Address data by its hash." But from that single sentence, deduplication, integrity, caching, and immutability tumble out one after another, and when extended into Merkle trees it gains the ability to verify a huge structure with a single hash. One shift in perspective — addressing by content rather than location — makes so many hard problems simply not exist.

Once you internalize this idea, you see it everywhere afterward. When you learn a new tool, ask "is this maybe content-addressing?" and surprisingly often it is. Beyond Git, Docker, IPFS, Nix, and BitTorrent, the same idea keeps showing up as content-addressable block stores, content-addressable caches, and content-addressable artifact stores. They wear different labels, but the skeleton is the same.

Understanding systems deeply is, in the end, the ability to recognize these recurring ideas. See the common notion flowing beneath surface differences, and learning a new tool stops meaning a fresh start each time and becomes meeting another face of a pattern you already know. Content-addressing is one of the most elegant and widespread of those patterns.

References

Git Internals (Pro Git): https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
OCI Image Specification: https://github.com/opencontainers/image-spec
IPFS docs (content addressing): https://docs.ipfs.tech/concepts/content-addressing/
Nix store: https://nixos.org/guides/nix-pills/
Merkle tree (Wikipedia): https://en.wikipedia.org/wiki/Merkle_tree
BitTorrent specification (BEP 3): https://www.bittorrent.org/beps/bep_0003.html