How Git Actually Stores Your Data

Introduction — We Use It Daily, Nobody Looks Inside

Git is a tool developers use every day, yet surprisingly few know what its insides actually look like. Most people memorize git add, git commit, and git push like incantations, and when something goes wrong they barely escape with a command copied from Stack Overflow.

But Git's internals are astonishingly simple and elegant. In fact, Git's core data model is small enough to understand completely in a few hours. And once you understand that model, all the behaviors that used to look like magic become logically explainable. Why creating a branch is so fast, what git checkout actually does, why commit hashes look the way they do — all of it fits inside one picture.

This post looks at Git not as a "collection of commands" but as a "data storage system." We dig from the bottom up into what structure Git actually stores your files in, and why that structure is designed the way it is. By the end, Git will feel a lot less scary.

The Core Insight — Git Is a Content-Addressable Store

If there is one key to understanding Git, it is this: Git is fundamentally a content-addressable store.

Content-addressable storage means addressing data not by a "name" or "location" but by the content itself. Concretely, when Git stores some content, it computes a SHA hash of that content and uses the hash value as that content's address (its name).

The consequences of this idea are powerful:

  • Identical content always has the same address. If two files have exactly the same content, they have the same hash, and so they are stored only once inside the Git repository.
  • Any change to the content completely changes the address. By the nature of hash functions, changing even a single byte yields a totally different hash.
  • Integrity comes for free. Since the address is the hash of the content, any corruption of the data makes the hash mismatch, and Git detects it whenever it re-hashes the object (for example, when you run git fsck).
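
The three consequences above can be sketched with a toy store. This is a minimal illustration, not Git's implementation: the class name ContentStore is hypothetical, and it hashes the raw bytes directly, whereas Git also hashes a small header.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-1 of the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        address = hashlib.sha1(content).hexdigest()
        self._objects[address] = content  # identical content lands on the same key
        return address

    def get(self, address: str) -> bytes:
        content = self._objects[address]
        # Integrity check: re-hash on read and compare with the address.
        if hashlib.sha1(content).hexdigest() != address:
            raise ValueError(f"object {address} is corrupted")
        return content

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
a = store.put(b"hello\n")
b = store.put(b"hello\n")   # same content, same address
c = store.put(b"hello!\n")  # one byte added, unrelated address
print(a == b, a == c, len(store))
```

Storing the same bytes twice leaves a single entry in the store, and get refuses to return an object whose bytes no longer match its address.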

On top of this content-addressing scheme, Git stacks four kinds of objects. These four objects are the entirety of Git's data model.

The Four Objects — blob, tree, commit, tag

Everything Git stores is ultimately one of four kinds of object. Let's look at each.

blob — a file's content

A blob holds the content of a single file. The important thing is that a blob holds only content. No filename, no path, no permissions live inside the blob. It is just a lump of bytes. So two files with different names but identical content point to the same single blob.

tree — directory structure

A tree corresponds to a directory. It holds a list of entries, and each entry records a file mode, a name, and the hash of the object that name refers to: either a blob (a file) or another tree (a subdirectory). In other words, the tree gives names and structure to blobs. If a blob is file content, a tree organizes that content like a filesystem.
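
As a rough sketch of how names end up in the tree rather than in the blob, the code below serializes tree entries in the layout Git uses with SHA-1 (mode, space, name, a NUL byte, then the 20 raw bytes of the object id). The helper names object_id and tree_body and the sample file contents are made up for illustration.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over a "<type> <size>" header, a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_id) entries the way a tree object lists them."""
    def sort_key(entry):
        mode, name, _ = entry
        # Subdirectories sort as if their name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


readme = object_id("blob", b"# demo\n")
util = object_id("blob", b"pass\n")
src = object_id("tree", tree_body([("100644", "util.py", util)]))

root = tree_body([
    ("100644", "README.md", readme),
    ("100644", "COPY.md", readme),  # a second name for the same blob id
    ("40000", "src", src),          # a subdirectory is just another tree
])
print(object_id("tree", root))
```

README.md and COPY.md appear as two entries in the tree, yet both carry the same blob id, so the content itself exists once.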

commit — snapshot and history

A commit is what we usually call "a commit." A single commit object holds:

  • A reference to one top-level tree — the full snapshot of the project at this commit.
  • References to zero or more parent commits (the first commit has no parent; an ordinary commit has one; a merge commit has two or more).
  • Author and committer information, and timestamps.
  • The commit message.

The key here is that a commit points to a snapshot. A common misconception is that "Git stores diffs (changes)," but in fact each commit points to the entire tree at that moment, whole. A diff is computed by comparing two snapshots when needed; it is not the basic unit of storage. (The packfiles we'll see later use deltas for storage efficiency, but that is a storage optimization, not the logical model.)
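
To make the fields listed above concrete, here is a small sketch of the text a commit object carries. The Commit dataclass is a hypothetical helper, the author identity is invented, and the parent ids are placeholder strings rather than real commits. Note that nothing in the serialized form is a diff: there is only a pointer to a tree.

```python
from dataclasses import dataclass, field

# Id of the empty tree under SHA-1, used here only as a stand-in snapshot.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "A U Thor <author@example.com> 1700000000 +0000"  # invented identity


@dataclass
class Commit:
    tree: str
    author: str
    committer: str
    message: str
    parents: list = field(default_factory=list)

    def serialize(self) -> bytes:
        lines = [f"tree {self.tree}"]
        lines += [f"parent {p}" for p in self.parents]  # zero, one, or many
        lines += [f"author {self.author}", f"committer {self.committer}"]
        return ("\n".join(lines) + "\n\n" + self.message).encode()


root = Commit(EMPTY_TREE, WHO, WHO, "Initial commit\n")
child = Commit(EMPTY_TREE, WHO, WHO, "Second commit\n", parents=["a" * 40])
merge = Commit(EMPTY_TREE, WHO, WHO, "Merge feature\n", parents=["b" * 40, "c" * 40])

print(merge.serialize().decode())
```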

tag — a name label

A tag object is used to attach a permanent name to a particular object (usually a commit). Release tags like v1.0.0 are the classic case. A tag can include the tagger, date, message, and a signature.

Drawn as a picture, the relationship between these four objects looks like this:

  commit  ──parent──▶  commit  ──parent──▶  commit
    │                    │                    │
    ▼ (tree)             ▼                    ▼
  tree ──────────────▶ blob  (README.md content)
    ├──────────────▶ blob  (main.py content)
    └──────────────▶ tree ──────▶ blob (src/util.py content)
                    (subdirectory)

A commit points to a tree, a tree points to blobs and other trees, and a commit points to parent commits. This simple reference structure is all of Git.

SHA Hashes — Why That Long String

Using Git, you keep seeing 40-character (or the abbreviated 7-character) hexadecimal strings like a1b2c3d4.... This is exactly the address of an object — the SHA hash.

When Git stores an object, it computes a SHA hash over a short header (the object's type and its size in bytes) followed by the content. That hash becomes the object's unique name. That is why commit hashes look the way they do: a hash is not an arbitrarily assigned number, but a fingerprint computed from the entire content of that commit (the tree it points to, its parents, message, author, and so on).

A beautiful property falls out of this. A commit hash includes the tree the commit points to, a tree hash includes the blobs it points to, and a commit includes its parent commit's hash. In other words, each object's hash depends on everything it can reach.

As a result, if a single byte of a file changes in some old commit deep in history, that commit's hash changes, and the hash of every descendant commit changes in a chain, because each one embeds its parent's hash. This is how Git guarantees integrity. Tampering with history unnoticed is practically infeasible as long as the hash function resists collisions, because every later hash would stop matching. This kind of structure is called a Merkle DAG (a generalization of the Merkle tree).
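
The chain reaction described above can be observed directly. The sketch below builds a one-file history (blob, tree, first commit, second commit) using the SHA-1 object layout, then repeats the exercise with a single byte changed in the file. The function history_ids and the author identity are made up for the example.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over a "<type> <size>" header, a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def history_ids(readme: bytes):
    """Build blob -> tree -> commit -> child commit and return their ids."""
    who = "A U Thor <author@example.com> 1700000000 +0000"  # invented identity
    blob = object_id("blob", readme)
    # The tree embeds the blob id, so it changes whenever the blob does.
    tree = object_id("tree", b"100644 README.md\x00" + bytes.fromhex(blob))
    first = object_id(
        "commit",
        f"tree {tree}\nauthor {who}\ncommitter {who}\n\nfirst\n".encode(),
    )
    # The child embeds its parent's id, so the change keeps propagating.
    second = object_id(
        "commit",
        f"tree {tree}\nparent {first}\nauthor {who}\ncommitter {who}\n\nsecond\n".encode(),
    )
    return blob, tree, first, second


original = history_ids(b"hello\n")
tampered = history_ids(b"hellO\n")  # one byte differs in the oldest file
for name, before, after in zip(("blob", "tree", "first", "second"), original, tampered):
    print(f"{name:6} {before[:7]} -> {after[:7]}")
```

Every id in the second column differs from the first, including the second commit, whose own content (message, author, tree contents aside from the file) was never touched.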

(Historically Git has used SHA-1, and a transition to SHA-256 has been underway due to collision concerns. But the principle of the data model is the same regardless of which hash function is used.)

The Commit DAG — History Is a Graph

Many people imagine Git history as "commits lined up in a single row." More precisely, it is a DAG (Directed Acyclic Graph).

  • Directed: each commit points to its parents. The arrows point toward the past.
  • Acyclic: a commit cannot have one of its own descendants as a parent. There are no cycles.

When branches diverge and merge, this graph becomes a real graph rather than a single line.

                  o---o---o   (feature branch)
                 /         \
  o---o---o---o-------------o---o   (main branch)
      │           │              │
    past ─────────────────────▶ present

A merge commit (where the two lines meet in the diagram above) has two parents: one is main's previous commit, the other is feature's last commit. Because a commit can have multiple parents like this, history becomes a graph.

This DAG view crisply explains many of Git's behaviors. git log traverses this graph from the current point backward toward parents; git merge finds the common ancestor of two points and combines the changes made since then; rebase replays commits on top of a different point, creating new commit objects with new hashes because their parents have changed. All of it is manipulation over a graph.
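
A small sketch of that graph view, assuming a hypothetical history shaped like the diagram above with single-letter commit names instead of hashes. The function ancestors is the kind of walk git log performs, and merge_bases is a naive version of the common-ancestor search; real Git uses far more efficient algorithms.

```python
# Hypothetical history: commit name -> list of parent names.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],         # main continues after the branch point
    "F1": ["B"],        # feature branches off B
    "F2": ["F1"],
    "M": ["C", "F2"],   # merge commit: two parents
}


def ancestors(commit: str) -> set:
    """All commits reachable from `commit` by following parent links (inclusive)."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def merge_bases(a: str, b: str) -> set:
    """Common ancestors of a and b that are not ancestors of another common one."""
    common = ancestors(a) & ancestors(b)
    return {
        c for c in common
        if not any(c in ancestors(other) - {other} for other in common)
    }


print(sorted(ancestors("M")))
print(merge_bases("C", "F2"))
```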

Branches Are Just Pointers — Git's Most Liberating Fact

The biggest aha moment in learning Git is usually this: a branch is not some heavy, complex thing — it is just a pointer to a single commit.

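Precisely, a branch is a tiny text file containing one commit hash. With SHA-1, the main branch is nothing more than a 41-byte file (40 hex characters plus a newline) that says "the hash of the latest commit that main points to is this." Creating a new branch is just making one more pointer file that points to the same commit. (Git may later consolidate many such files into a single packed-refs file, but the model is unchanged.)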

This fact explains a lot:

  • Why branch creation is instant. It is just writing one file, so no matter how large the repository, creating a branch happens in an instant. No heavy copy occurs.
  • What happens when you commit. When you commit on the current branch, a new commit object is created, and the branch pointer is updated to point to that new commit. The pointer moves one step forward.
  • What HEAD is. HEAD is yet another pointer, one that indicates "which branch am I on right now." It usually points to a branch, which in turn points to a commit.
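
The three points above can be imitated with ordinary files. The sketch below builds a fake .git layout in a temporary directory; it does not call Git, the two commit ids are placeholder strings, and the helpers resolve and current_branch_file are hypothetical names. It also assumes HEAD is attached to a branch.

```python
import tempfile
from pathlib import Path


def resolve(git_dir: Path, ref: str = "HEAD") -> str:
    """Follow symbolic refs ("ref: ...") until a commit hash is reached."""
    text = (git_dir / ref).read_text().strip()
    if text.startswith("ref: "):
        return resolve(git_dir, text[len("ref: "):])
    return text


def current_branch_file(git_dir: Path) -> Path:
    """Path of the branch file HEAD points to (assumes HEAD is not detached)."""
    head = (git_dir / "HEAD").read_text().strip()
    return git_dir / head[len("ref: "):]


git_dir = Path(tempfile.mkdtemp()) / ".git"
(git_dir / "refs" / "heads").mkdir(parents=True)
(git_dir / "HEAD").write_text("ref: refs/heads/main\n")

old = "3c1d" + "0" * 36  # placeholder 40-character ids, not real commits
new = "f9a2" + "0" * 36
(git_dir / "refs" / "heads" / "main").write_text(old + "\n")

# Creating a branch: one more small file holding the same hash.
(git_dir / "refs" / "heads" / "feature").write_text(old + "\n")

# Committing on main: the current branch file is rewritten with the new hash.
current_branch_file(git_dir).write_text(new + "\n")

print(resolve(git_dir))                        # where HEAD ends up
print(resolve(git_dir, "refs/heads/feature"))  # feature did not move
```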

Drawn as a picture, this relationship looks like this:

  HEAD ──▶ main ──▶ commit(f9a2...) ──▶ commit(3c1d...) ──▶ ...
                        (latest)            (before that)

  after git commit:
  HEAD ──▶ main ──▶ commit(new!) ──▶ commit(f9a2...) ──▶ ...
                     the pointer moved forward

Once you internalize "a branch is a pointer," Git suddenly stops being scary. Deleting a branch does not make the commit objects themselves disappear (as long as they are referenced elsewhere), and a branch you accidentally moved is fixed by simply pointing the pointer back to the original commit. A tag name is a pointer much like a branch: a lightweight tag points straight at a commit, while an annotated tag points at a tag object that in turn points at the commit. The one difference from a branch is that a branch moves forward with every commit, while a tag stays pinned and does not move.

Let's Look Inside the .git Directory

Every concept so far actually exists inside the .git directory. Open the .git folder at your project root and you can see the real things behind what we've only discussed abstractly. Roughly, the structure looks like this:

  .git/
  ├── HEAD              # points to the currently checked-out branch (e.g. ref: refs/heads/main)
  ├── config            # repository configuration
  ├── objects/          # all objects (blob, tree, commit, tag) are stored here
  │   ├── 3c/
  │   │   └── 1d8f...   # first 2 chars of the hash are the folder, the rest is the filename
  │   ├── f9/
  │   │   └── a2b7...
  │   └── pack/         # packfiles (explained below)
  ├── refs/
  │   ├── heads/        # local branches — each file holds one commit hash
  │   │   ├── main
  │   │   └── feature
  │   └── tags/         # tags
  └── logs/             # the reflog — records how references moved

Let's hit the key points:

  • The objects/ directory is the actual data store. Every blob, tree, commit, and tag is stored here using its own hash as the filename. Splitting off the first 2 characters of the hash as a folder name is a practical device to keep any one folder from holding too many files.
  • Each file in refs/heads/ is a branch. Open the main file and you find one line: a commit hash. The earlier "a branch is a pointer" is confirmed here, literally.
  • The HEAD file usually holds something like ref: refs/heads/main, indicating which branch you are on.
  • The reflog in logs/ records how branches and HEAD have moved over time. This is why you can recover a commit with the reflog after you accidentally lose it.
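
As a hedged sketch of the objects/ layout described above, the code below writes and reads a loose object in a temporary directory, using the SHA-1 layout: the header and content are hashed, then stored zlib-compressed under a folder named after the first two hex characters. The function names write_loose and read_loose are made up for illustration.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose(objects_dir: Path, obj_type: str, body: bytes) -> str:
    """Store one object as objects/<first 2 hex chars>/<remaining 38>."""
    raw = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    oid = hashlib.sha1(raw).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return oid


def read_loose(objects_dir: Path, oid: str):
    """Load an object by id, verifying that its bytes still match the id."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    if hashlib.sha1(raw).hexdigest() != oid:
        raise ValueError(f"object {oid} is corrupted")
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError(f"object {oid} has a wrong size header")
    return obj_type, body


objects_dir = Path(tempfile.mkdtemp()) / "objects"
oid = write_loose(objects_dir, "blob", b"hello\n")
print(oid[:2], oid[2:])
print(read_loose(objects_dir, oid))
```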

If you want to learn Git's internals by actually touching them, the Git playground lets you stack commits and create branches and watch this structure change with your own eyes. Reading concepts in prose and watching the graph actually grow are different depths of understanding.

Identical Content Is Stored Only Once — The Elegance of Deduplication

One of the most practical consequences of content addressing is automatic deduplication.

As said earlier, a blob's name is the hash of its content. So even if a repository has several files with exactly the same content, they all point to the same single blob. Physically, it is stored only once.

This principle works powerfully between commits too. Suppose you made 100 commits, and some file never changed once across them. Then that file's blob exists exactly once across all 100 commits. The trees of each commit simply all point to the same blob hash. Likewise, if a single commit changes just one file, only a new blob for the changed file and the trees containing it are created anew; the remaining unchanged blobs and trees are reused as-is from the previous commit.

This is why Git can hold a vast history without the repository ballooning as much as you'd expect. Each commit "points to" a full snapshot, but the unchanged parts are physically shared. You get both the conceptual simplicity of the snapshot model and the storage efficiency of deduplication at the same time.
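
The sharing between snapshots can be counted in a toy model. The ObjectStore class below is hypothetical and handles only a flat directory of regular files, but it addresses objects the same way (SHA-1 over header plus content), so unchanged files map onto objects that already exist.

```python
import hashlib


class ObjectStore:
    """In-memory object store keyed by object id."""

    def __init__(self):
        self.objects = {}

    def put(self, obj_type: str, body: bytes) -> str:
        raw = f"{obj_type} {len(body)}".encode() + b"\x00" + body
        oid = hashlib.sha1(raw).hexdigest()
        self.objects.setdefault(oid, raw)  # already present: nothing new is stored
        return oid

    def snapshot(self, files: dict) -> str:
        """Store a flat directory (name -> content) and return its tree id."""
        body = b""
        for name in sorted(files):
            blob = self.put("blob", files[name])
            body += f"100644 {name}".encode() + b"\x00" + bytes.fromhex(blob)
        return self.put("tree", body)


store = ObjectStore()
files = {"README.md": b"# demo\n", "main.py": b"print('hi')\n", "LICENSE": b"MIT\n"}

store.snapshot(files)
before = len(store.objects)   # three blobs and one tree

files["main.py"] = b"print('hello')\n"
store.snapshot(files)
after = len(store.objects)    # one new blob and one new tree were added

print(before, after)
```

The second snapshot describes all three files, yet it adds only two objects: the changed blob and the tree that names it.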

Packfiles — Compressing Storage One More Time

So far I've described each object as stored as an individual file under objects/ (these are called loose objects). That alone is fairly efficient thanks to deduplication, but Git goes one step further: packfiles.

As a repository grows, individual object files can balloon into the hundreds of thousands. That many small files strains the filesystem, and each loose object is compressed on its own, so nothing is saved when two objects are nearly identical. So Git periodically (or when you run git gc) gathers many objects into a single packfile.

The two key optimizations of packfiles are:

  • One file instead of many. Objects are gathered into a single packfile with an accompanying index, which removes the per-file overhead of hundreds of thousands of loose objects and keeps lookups by hash fast.
  • Delta encoding. Here comes an interesting twist. Earlier I said "Git stores snapshots," but inside a packfile, similar objects can store only the difference (a delta). For example, if there are several versions of some large file, one is stored whole and the rest are expressed only as differences from it.

The important thing here is the distinction of layers. In the logical model, Git is still snapshot-based. Each commit conceptually points to a complete tree. But in physical storage, packfiles save space with delta encoding. These two are not a contradiction. What users and commands see is the snapshot model; deltas are merely an optimization of the storage layer beneath it. When Git reads a delta-stored object, it automatically reconstructs it and hands you back a complete object.
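
The layering can be illustrated with a toy delta. This is not Git's packfile format: it is a simplified stand-in that borrows the same idea of copy and insert instructions against a base, built here with Python's difflib. The names make_delta and apply_delta are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Encode `target` as copy/insert instructions against `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse bytes from the base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))   # carry new bytes literally
        # a pure deletion needs no instruction: those bytes are never copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the full target; callers only ever see the complete object."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


v1 = b"".join(b"line %d\n" % i for i in range(200))
v2 = v1.replace(b"line 100\n", b"line one hundred\n")

delta = make_delta(v1, v2)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(v2), literal_bytes)
```

Only a handful of literal bytes need storing for the second version, while apply_delta still hands back the whole file, which mirrors how the snapshot model survives on top of delta storage.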

This design echoes the lessons of SQLite and ripgrep seen earlier: show the user a clean, simple logical model, while hiding practical optimizations underneath.

The Whole Picture — Tying It Together

Combine all the pieces so far into one picture and you can see the entirety of Git's data model.

  HEAD ──▶ refs/heads/main ──▶ commit  ──parent──▶  commit  ──▶ ...
           (pointers)            │
                                 ▼ (snapshot: points to a tree)
                               tree ──────▶ blob   (file content, addressed by hash)
                                 └────────▶ tree ──▶ blob
                                           (subdirectory)

  * every object is addressed by the SHA hash of its content (content addressing)
  * identical content = same hash = stored once (deduplication)
  * loose objects in objects/, later compressed into packfiles

This single picture contains every concept in this post. Pointers (branches, HEAD) point to commits, commits point to snapshots (trees), trees point to file content (blobs), and everything is addressed by the hash of its content and stored without duplication.

What Changes When You Know This Model

Understanding Git's data model changes several things in practice:

  • Commands look logical. checkout changes the working directory to the content of a particular tree and moves HEAD; merge finds a common ancestor and makes a merge commit (or simply moves the branch pointer forward when a fast-forward is possible); reset moves the branch pointer. All of it is explained as manipulation of objects and pointers.
  • Mistakes are less scary. Commit objects do not disappear as long as they are referenced, so even if you moved a branch wrong, you can find the original commit hash with the reflog and point the pointer back. Commits that "look lost" are usually still there.
  • Performance makes sense. Why branches are lightweight, and why certain operations are fast in a large repository, is explained by the data structure.
  • Collaboration becomes clear. push and fetch are ultimately about exchanging objects and references between repositories. Once you can picture what is being exchanged, conflicts and sync issues become less confusing.

Conclusion

Git tends to feel hard mostly because people memorize its commands without knowing its internal model. But that model itself is astonishingly small and elegant. Four kinds of objects (blob, tree, commit, tag), SHA hashes that address by content, the DAG the commits form, and branches — lightweight pointers pointing on top of it all. That is everything.

Once you get this picture into your head, all the Git behaviors that used to look like magic or feel scary become understandable on the same logic. Creating a branch, reverting a commit, rewriting history — in the end, they are all a matter of creating objects and moving pointers.

The next time Git behaves unexpectedly and throws you off, recall the data model instead of the commands. "What object was just created, and which pointer points where?" This one question dissolves most of the confusion. And if you build commits and branches yourself in the Git playground and watch the graph grow with your own eyes, this model becomes fully yours.
