Git Internals Deep Dive — Object Model, Packfile, Merge Algorithms, Reflog, Protocol (2025)

TL;DR

  • Git is a content-addressable filesystem. Version control is a layer built on top. Every piece of data (files, directories, commits) is addressed by SHA-1 hash.
  • Four object types: blob (file content), tree (directory), commit (snapshot + metadata), tag (annotated tag object, optionally signed).
  • Loose object vs Packfile: each object starts as its own file (.git/objects/ab/cd...), later packed and delta-compressed.
  • Refs: branches/tags are just text files pointing at commit hashes. HEAD points to whatever is currently checked out.
  • Index: staging area. A binary file that pre-builds the next commit's tree.
  • Merge: in 2021 the default changed from recursive to ort. Up to 500x faster in some cases.
  • Rebase: cherry-picks commits one by one onto a new base. Changes commit hashes.
  • Reflog: a local log of every HEAD and ref update (kept 30 to 90 days by default). The secret to recovering "lost commits."
  • Packfile: delta compression + index (.idx) + bitmap (.bitmap). Makes push/fetch KB-scale.
  • Protocol: Smart HTTP (most common), SSH, Git protocol. Client declares "what I have" and server sends only what's needed.

1. Git's Philosophy — "A Filesystem, Not Version Control"

In 2005, BitKeeper (then used by the Linux kernel) revoked its free license. Linus built Git in two weeks. Design goals: distributed, fast, data-integrity, first-class branching/merging.

Linus's insight: existing VCS picked the wrong abstraction. Instead of storing diffs between files, Git stores snapshots.

Content-Addressable Filesystem

At its core, Git is a key-value store:

SHA-1 hash → object content

Store: "save this content" → returns a SHA-1. Retrieve: "give me the content for this hash" → returns the object. Everything else (branches, merges, history) is built on top.

mkdir /tmp/gitrepo && cd /tmp/gitrepo
git init
echo "hello world" | git hash-object -w --stdin
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

git cat-file -p 3b18e512
# hello world

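hash-object -w prepends a header (blob <size>\0), computes the SHA-1 of header plus content, zlib-compresses the result, and stores it in the file .git/objects/3b/18e5... (first two hex characters as the directory, the rest as the filename).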
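The hashing step can be reproduced outside Git. The following is a minimal sketch, assuming a SHA-1 repository; the function name git_blob_hash is made up for illustration and is not part of any Git API.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content (SHA-1 repos)."""
    # Header is the object type, a space, the byte length, and a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    print(git_blob_hash(b"hello world\n"))
    # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Note that echo appends a newline, so the hashed content is 12 bytes, not 11.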


2. The Object Model

2.1 Four Types

  • Blob: pure file content (no metadata). Same content = same blob.
  • Tree: directory listing. Maps name → mode → blob/tree hash.
  • Commit: snapshot (tree pointer) + metadata (parent, author, message).
  • Tag: annotated tag. Object pointing at a commit with a signature/message.

2.2 Blob

Format: blob <byte_length>\0<content>, then SHA-1, then zlib. 10,000 files with identical "Hello World" contents share one blob.

2.3 Tree

tree <length>\0
100644 blob abc123... README.md
040000 tree def456... src
100755 blob ghi789... build.sh

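Each raw entry on disk is: mode in ASCII octal (100644 regular, 100755 executable, 40000 tree, 120000 symlink, 160000 submodule), a space, the name, a NUL byte, then the 20-byte binary SHA-1. The type is not stored; it is implied by the mode. The listing above is the human-readable form printed by git cat-file -p, which pads the tree mode to 040000.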

git cat-file -p HEAD^{tree}

Trees compose recursively: a directory of directories is a tree containing trees.
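To make the raw entry layout concrete, here is a small sketch that serializes and parses a tree body. It assumes SHA-1 (20-byte IDs) and simplifies entry ordering: real Git sorts directory names as if they ended in "/", which this sketch ignores. The helper names are hypothetical.

```python
import hashlib


def serialize_tree(entries):
    """entries: list of (mode, name, raw_20_byte_id). Returns the raw tree body."""
    body = b""
    # Simplified ordering by name; Git's real comparison treats trees as "name/".
    for mode, name, raw_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + raw_id
    return body


def parse_tree(body):
    """Parse a raw tree body into (mode, name, hex_id) tuples."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        raw_id = body[nul + 1:nul + 21]  # 20 raw bytes follow the NUL
        entries.append((mode, name, raw_id.hex()))
        pos = nul + 21
    return entries


def tree_hash(body: bytes) -> str:
    """Object ID of a tree with the given raw body."""
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()
```

The well-known ID of the empty tree, 4b825dc642cb6eb9a060e54bf8d69288fbee4904, falls out of tree_hash(b"").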

2.4 Commit

commit <length>\0
tree 3c4e9cd789...
parent 5a9d8b41...
author Linus Torvalds <torvalds@linux.org> 1234567890 -0700
committer Linus Torvalds <torvalds@linux.org> 1234567890 -0700

Initial commit

Fields: tree (full snapshot), parent(s) (initial has none, merges have 2+), author (who made change), committer (who applied — differs on rebase/cherry-pick), message.

2.5 Tag

Annotated tag (git tag -a) is a real object with object, type, tag, tagger, message, and optional PGP signature. Lightweight tags are just refs.

2.6 Hash Chain

commit → tree → blob
              → blob
              → tree → blob
commit → parent commit → parent commit → ...

Every edge is a hash. Change one byte in a blob and the hash of every tree and commit containing it changes, along with every descendant commit. That is Git's integrity guarantee.
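The cascade can be demonstrated with a toy chain of one-file snapshots. This is a sketch under simplifying assumptions: a single README.md per tree, a fixed made-up author line, and hypothetical helper names. The object formats follow the layouts described above.

```python
import hashlib
from typing import Optional


def object_id(kind: str, body: bytes) -> str:
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_snapshot(readme: bytes, parent: Optional[str]) -> str:
    """Build blob -> tree -> commit for a one-file snapshot; return the commit ID."""
    blob_id = object_id("blob", readme)
    tree_body = b"100644 README.md\0" + bytes.fromhex(blob_id)
    tree_id = object_id("tree", tree_body)
    lines = [f"tree {tree_id}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [
        "author Example Author <author@example.com> 0 +0000",
        "committer Example Author <author@example.com> 0 +0000",
        "",
        "snapshot",
        "",
    ]
    return object_id("commit", "\n".join(lines).encode())


if __name__ == "__main__":
    a1 = commit_snapshot(b"v1\n", None)
    b1 = commit_snapshot(b"v2\n", a1)
    a2 = commit_snapshot(b"v1!\n", None)  # one extra byte in the first snapshot
    b2 = commit_snapshot(b"v2\n", a2)     # same content, different parent
    print(a1 != a2, b1 != b2)
    # True True
```

The second commits hold identical file content, yet their IDs differ because the parent ID is part of the hashed commit body.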

2.7 SHA-1 vs SHA-256

Git originally used SHA-1. After Google and CWI Amsterdam demonstrated a practical SHA-1 collision (SHAttered) in 2017, SHA-256 repository support was added (git init --object-format=sha256, experimental since Git 2.29). The two formats are not interoperable. Most projects still use SHA-1.

2.8 Loose Object Storage

.git/objects/
├── 3b/
│   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
├── 5a/
│   └── 9d8b41...

First 2 hex chars form directory, remaining 38 form filename — 256 buckets to avoid filesystem slowdown with huge directories.
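A rough sketch of writing and reading a loose object with this layout. The function names and the objects_dir argument are illustrative assumptions; real Git also writes via a temporary file and sets read-only permissions, which is omitted here.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose(objects_dir: str, kind: str, body: bytes) -> str:
    """Store an object as objects_dir/<first 2 hex>/<remaining 38 hex>."""
    store = f"{kind} {len(body)}\0".encode() + body
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return oid


def read_loose(objects_dir: str, oid: str):
    """Return (kind, body) for a loose object, checking the recorded size."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, body = store.partition(b"\0")
    kind, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind, body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:
        oid = write_loose(objects_dir, "blob", b"hello world\n")
        print(oid[:2], oid[2:])
        print(read_loose(objects_dir, oid))
```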


3. Refs — Meaningful Names

3.1 Ref = Alias for a Hash

cat .git/refs/heads/main
# 3c4e9cd789abc...

That's it. A branch is one text file containing a commit hash.

3.2 HEAD

cat .git/HEAD
# ref: refs/heads/main

Detached HEAD holds a commit hash directly.

3.3 Tag and Remote Refs

Lightweight tags point at commits; annotated tags point at tag objects. Remote refs live under .git/refs/remotes/origin/ and hold the last fetched state, fully independent from local branches.

3.4 packed-refs

With many refs, Git packs them into a single .git/packed-refs file. Git consults both unpacked files and packed-refs (unpacked wins).

3.5 Symbolic Ref

A ref may point at another ref (HEAD being the canonical example). See with git symbolic-ref HEAD.
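The lookup order (symbolic ref, then loose file, then packed-refs) can be sketched as below. This is an illustration with hypothetical function names, not how Git's ref backends are implemented; it handles only the files-based layout described in this section.

```python
import os
import tempfile


def lookup_packed(git_dir: str, name: str):
    """Find a ref in packed-refs; returns the hash or None."""
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        for line in f:
            # Skip the header comment and "^" lines (peeled annotated tags).
            if line.startswith("#") or line.startswith("^"):
                continue
            oid, _, ref = line.strip().partition(" ")
            if ref == name:
                return oid
    return None


def resolve_ref(git_dir: str, name: str = "HEAD", max_depth: int = 5):
    """Follow symbolic refs until a hash is found; loose files win over packed-refs."""
    for _ in range(max_depth):
        loose = os.path.join(git_dir, name)
        if os.path.isfile(loose):
            with open(loose) as f:
                value = f.read().strip()
        else:
            value = lookup_packed(git_dir, name)
            if value is None:
                return None
        if value.startswith("ref: "):
            name = value[len("ref: "):]  # symbolic ref: resolve the target next
            continue
        return value
    raise ValueError("symbolic ref chain too deep")
```

Calling resolve_ref(".git") inside a repository should return the same hash as git rev-parse HEAD for this simple layout.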


4. Index — The Staging Area

4.1 Role

Working directory → (git add) → Index → (git commit) → Commit

The index pre-builds the next commit's tree.

4.2 Structure

.git/index is binary: header + entries (ctime, mtime, dev, ino, mode, uid, gid, size, SHA-1, flags, path) + checksum.

git ls-files --stage
# 100644 a906cb2a4a904a15... 0 README.md
# 100644 3b18e512dba79e4c... 0 src/main.c

4.3 Why an Index?

  1. Partial commits: stage only a subset.
  2. Performance: file change detection via inode/mtime/size is cheap.
  3. Merge state: on conflict, the index stores multiple stages (1 ancestor, 2 ours, 3 theirs) per path.

4.4 The Three-Tree Model

HEAD              Index            Working Tree
(last commit)    (staged)        (unstaged)

git status compares HEAD vs Index ("Changes to be committed") and Index vs Working Tree ("Changes not staged").
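The two comparisons can be modeled with three dictionaries mapping path to content hash. This is a deliberately simplified sketch (no stat cache, no rename detection, no ignore rules), and the function name is made up.

```python
def status(head: dict, index: dict, worktree: dict) -> dict:
    """Classify paths the way git status groups them, given path -> hash maps."""
    # HEAD vs index: "Changes to be committed" (added, modified, or deleted).
    staged = sorted(p for p in set(head) | set(index) if head.get(p) != index.get(p))
    # Index vs working tree: "Changes not staged" (modified or deleted on disk).
    unstaged = sorted(p for p in index if worktree.get(p) != index[p])
    # In the working tree but unknown to the index.
    untracked = sorted(p for p in worktree if p not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


if __name__ == "__main__":
    head = {"a.txt": "h1", "b.txt": "h2"}
    index = {"a.txt": "h1", "b.txt": "h3", "c.txt": "h4"}
    worktree = {"a.txt": "h9", "b.txt": "h3", "d.txt": "h5"}
    print(status(head, index, worktree))
    # {'staged': ['b.txt', 'c.txt'], 'unstaged': ['a.txt', 'c.txt'], 'untracked': ['d.txt']}
```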


5. Packfiles — Efficient Storage

5.1 Loose Object Limits

10,000 files with 10 modifications = 100,000 loose objects (inode waste). Worse: no delta compression — a 10-byte edit of a 100MB file produces a new 100MB blob.

5.2 Packfile Layout

.git/objects/pack/
├── pack-abc123.idx     # index for offset lookup
├── pack-abc123.pack    # actual data
├── pack-abc123.bitmap  # reachability bitmap (optional)

5.3 Pack Format

Header (PACK + version + count), objects (type + compressed data; deltas carry OFS_DELTA or REF_DELTA base ref), 20-byte checksum.

5.4 Delta Compression

If object A is similar to B:

Pack's B: full content (zlib-compressed)
Pack's A: "reference B, then apply these edit instructions"

Edit instructions: COPY (offset X, N bytes from base) or INSERT (N new bytes).

Example: turning "Hello, World\n" (base B) into "Hello, Git\n" (A). Offset 12 is the trailing newline:

A = COPY(0, 7) + INSERT("Git") + COPY(12, 1)
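A minimal sketch of applying such instructions. It uses plain tuples rather than the packed binary encoding Git stores on disk, so the instruction representation here is an assumption made for readability.

```python
def apply_delta(base: bytes, instructions) -> bytes:
    """Rebuild a target object from a base plus COPY/INSERT instructions."""
    out = bytearray()
    for op in instructions:
        if op[0] == "copy":
            _, offset, size = op
            if offset < 0 or offset + size > len(base):
                raise ValueError("copy range outside base object")
            out += base[offset:offset + size]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown instruction: {op[0]!r}")
    return bytes(out)


if __name__ == "__main__":
    base = b"Hello, World\n"
    delta = [("copy", 0, 7), ("insert", b"Git"), ("copy", 12, 1)]
    print(apply_delta(base, delta))
    # b'Hello, Git\n'
```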

5.5 Base Object Selection

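Heuristics: candidates are sorted by type, path name, and size, then each object is compared against a sliding window of neighbors (pack.window, default 10), so earlier versions of the same file are likely bases. git repack -f recomputes deltas instead of reusing existing ones.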

5.6 Delta Chain

V1 (full) ← V2 (delta of V1) ← V3 (delta of V2) ← ...

Walk backwards to reconstruct. Chains are capped (default pack.depth=50).

5.7 Pack Index (.idx)

Fanout table (256 entries keyed on first byte) + sorted SHA-1 list + CRC32 + offsets. Lookup is O(log n) via binary search.
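The fanout-then-binary-search lookup can be sketched in memory as follows. This models only the lookup logic, not the .idx file format, and the function names are hypothetical.

```python
import bisect
import hashlib


def build_fanout(sorted_ids):
    """fanout[b] = number of object IDs whose first byte is <= b."""
    counts = [0] * 256
    for oid in sorted_ids:
        counts[oid[0]] += 1
    fanout, total = [], 0
    for count in counts:
        total += count
        fanout.append(total)
    return fanout


def find(sorted_ids, fanout, oid: bytes):
    """Return the position of oid in sorted_ids, or None if absent."""
    first = oid[0]
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    # Binary search only within the slice that shares the first byte.
    i = bisect.bisect_left(sorted_ids, oid, lo, hi)
    return i if i < hi and sorted_ids[i] == oid else None


if __name__ == "__main__":
    ids = sorted(hashlib.sha1(str(n).encode()).digest() for n in range(1000))
    fanout = build_fanout(ids)
    print(find(ids, fanout, ids[123]))
    # 123
```

In a real index the returned position would then select the matching entry in the offset table.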

5.8 Reachability Bitmap

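Git 2.0+. Each selected commit stores a bitmap of the objects reachable from it. When serving a fetch or clone, the server can compute "client wants X, already has Y" with bitwise operations instead of walking the whole object graph. One reason git clone from GitHub is fast.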
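The idea reduces to set arithmetic on bit positions. Below is a toy sketch using Python integers as bitsets; the data layout (an introduced map of commit to the objects it adds) and the function names are assumptions for illustration, and real bitmaps are compressed and stored only for selected commits.

```python
def build_bitmaps(parents: dict, introduced: dict, objects: list) -> dict:
    """Map each commit to an int whose set bits mark the objects reachable from it."""
    index_of = {obj: i for i, obj in enumerate(objects)}
    cache = {}

    def bitmap(commit):
        if commit not in cache:
            bits = 0
            for obj in introduced[commit]:
                bits |= 1 << index_of[obj]
            for parent in parents.get(commit, []):
                bits |= bitmap(parent)  # everything the parent reaches is reachable too
            cache[commit] = bits
        return cache[commit]

    for commit in introduced:
        bitmap(commit)
    return cache


def objects_to_send(bitmaps: dict, objects: list, wants, haves) -> list:
    """Objects reachable from wants but not from haves."""
    want_bits = 0
    for commit in wants:
        want_bits |= bitmaps[commit]
    have_bits = 0
    for commit in haves:
        have_bits |= bitmaps[commit]
    missing = want_bits & ~have_bits
    return [obj for i, obj in enumerate(objects) if (missing >> i) & 1]


if __name__ == "__main__":
    objects = ["blobA", "tree1", "blobB", "tree2"]
    parents = {"c1": [], "c2": ["c1"]}
    introduced = {"c1": ["blobA", "tree1"], "c2": ["blobB", "tree2"]}
    bitmaps = build_bitmaps(parents, introduced, objects)
    print(objects_to_send(bitmaps, objects, wants=["c2"], haves=["c1"]))
    # ['blobB', 'tree2']
```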


6. Object Storage Operations

6.1 Create

git add file.txt: read → compute blob hash → check .git/objects/ → write if absent → update index.

6.2 Commit

git commit -m "...": build tree objects from index (recursive per dir) → create commit object with tree, parent, author, message → update HEAD's ref.

6.3 Checkout

git checkout main: read refs/heads/main → read its commit tree → materialize tree into working dir → sync index → update HEAD.

6.4 GC and Prune

git gc

Compacts loose objects into packfiles and deletes unreachable objects older than 2 weeks. Auto-triggers at ~6700 loose objects. git prune --expire=now purges everything unreachable (reflog entries keep objects alive).


7. Merge Algorithms

7.1 Three-way Merge

  • Common Ancestor: shared ancestor commit.
  • Ours: current branch tip.
  • Theirs: branch being merged.

Per file: only Ours changed → Ours; only Theirs → Theirs; both same change → that change; both diverge → conflict.
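The per-file rule fits in a few lines. This sketch works at whole-file granularity on content hashes (None meaning the file is absent); real merges go further and attempt a line-level merge when both sides changed the same file. The function name is made up.

```python
def merge_file(base, ours, theirs):
    """Return (result, conflict) for one path, given the three versions' hashes."""
    if ours == theirs:
        return ours, False    # both sides agree (including both unchanged)
    if ours == base:
        return theirs, False  # only theirs changed
    if theirs == base:
        return ours, False    # only ours changed
    return None, True         # both changed differently: conflict


if __name__ == "__main__":
    print(merge_file("v1", "v1", "v2"))  # ('v2', False)
    print(merge_file("v1", "v2", "v3"))  # (None, True)
```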

7.2 Fast-forward

When main is a strict ancestor of branch, main is simply advanced. No merge commit.

7.3 Three-way Merge Commit

When histories diverge, Git creates a merge commit M with two parents.
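Whether a merge fast-forwards or needs a merge commit comes down to an ancestry check on the commit graph. A small sketch, assuming the graph is given as a dict of commit to parent list; the names are illustrative.

```python
def ancestors(commit, parents: dict) -> set:
    """All commits reachable from commit via parent links, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def plan_merge(ours, theirs, parents: dict) -> str:
    """Decide what merging theirs into ours requires."""
    if theirs in ancestors(ours, parents):
        return "already up to date"
    if ours in ancestors(theirs, parents):
        return "fast-forward"
    return "merge commit"


if __name__ == "__main__":
    # A <- B <- C (main), and B <- D (feature)
    parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
    print(plan_merge("B", "D", parents))  # fast-forward
    print(plan_merge("C", "D", parents))  # merge commit
```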

7.4 Recursive to Ort

Until 2021, Git's default was recursive. Criss-cross merges (multiple common ancestors) recursed into merging ancestors — sometimes pathologically slow.

Ort ("Ostensibly Recursive's Twin"), a from-scratch rewrite by Elijah Newren: up to 500x faster in some cases, better rename detection, improved subtree merges, much lower memory. Default since Git 2.34 (2021). Most developers never noticed; merges just got faster.

7.5 Rename Detection

A renamed file (50%+ content similarity by default) is tracked as a rename, so edits made to the old path on another branch apply to the new name. Ort makes this faster and more accurate.

7.6 Conflict Resolution

<<<<<<< HEAD
int x = 1;
=======
int x = 2;
>>>>>>> branch

Edit, git add, git commit. Multi-stage index entries drive this.

7.7 Strategy Options

  • -X ours / -X theirs: auto-pick on conflict.
  • -X ignore-all-space: ignore whitespace.
  • -X rename-threshold=80: tune rename detection.

Strategies (-s): ort (default), recursive (legacy), resolve, octopus (multi-branch, conflict-free only), ours (discard Theirs content).


8. Rebase Mechanics

8.1 Idea

main:    A → B → C
branch:  A → B → D → E

After git rebase main on branch:

main:    A → B → C
branch:  A → B → C → D' → E'

D', E' are new commits — same changes, different parent, different hashes.

8.2 How It Works

  1. Start at main's tip as temporary HEAD.
  2. List branch-only commits in order.
  3. Cherry-pick each: compute ancestor, three-way merge, create new commit with new parent.
  4. Move the branch ref to the new tip.
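The steps above can be sketched with a toy model in which a commit ID is derived from its parent ID plus its change. This is an assumption made for brevity: the three-way merge in step 3 is skipped entirely, and only the "same change, new parent, new hash" effect is shown.

```python
import hashlib


def commit_id(parent, change: str) -> str:
    """Toy commit ID: hash of the parent ID and the change description."""
    return hashlib.sha1(f"{parent}\n{change}".encode()).hexdigest()[:7]


def rebase(branch_changes, new_base: str):
    """Replay each change in order on top of new_base; return the new commit IDs."""
    tip = new_base
    new_ids = []
    for change in branch_changes:
        tip = commit_id(tip, change)  # same change, different parent
        new_ids.append(tip)
    return new_ids


if __name__ == "__main__":
    a = commit_id(None, "A")
    b = commit_id(a, "B")
    c = commit_id(b, "C")
    d = commit_id(b, "D")
    e = commit_id(d, "E")
    d2, e2 = rebase(["D", "E"], c)
    print(d != d2, e != e2)
    # True True
```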

8.3 Interactive Rebase

git rebase -i HEAD~5

Actions: pick, reword, edit, squash, fixup, drop. History rewrite — new hashes.

8.4 Danger on Public Branches

Rebase changes hashes. Anyone working on old hashes will have divergent history after git push --force. Rebase only your own branches; merge shared ones.

8.5 Force-with-lease

git push --force-with-lease

Refuses the push if the remote advanced since you last fetched — prevents clobbering teammates' work.


9. Reflog — The Safety Net

9.1 What It Is

Local log of every HEAD/ref change. Even after git reset --hard or git rebase, you can recover.

git reflog
# abc123 (HEAD -> main) HEAD@{0}: commit: Add feature
# def456 HEAD@{1}: commit: Fix bug
# ghi789 HEAD@{2}: rebase finished

9.2 Recovery Example

Accidentally dropped a commit with reset:

git reflog
# def456 HEAD@{0}: reset: moving to HEAD~1
# abc123 HEAD@{1}: commit: the lost one

git reset --hard abc123

Objects remain in .git/objects/ as long as reflog holds the hash.

9.3 Storage and Expiry

.git/logs/HEAD and .git/logs/refs/.... Default expiry: 90 days for reachable, 30 days for unreachable. git gc cleans up.
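Each line in those log files has the form "old-hash new-hash name <email> timestamp timezone", a tab, then the message. A minimal parser sketch follows; the function names are made up, and the sample identity in the usage is invented.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one line of .git/logs/<ref> into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    # The identity may contain spaces, so split timestamp and timezone from the right.
    who, timestamp, tz = rest.rsplit(" ", 2)
    return {
        "old": old,
        "new": new,
        "who": who,
        "time": int(timestamp),
        "tz": tz,
        "message": message,
    }


def reflog_entries(text: str) -> list:
    """Return entries newest first, so index 0 corresponds to ref@{0}."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [parse_reflog_line(line) for line in reversed(lines)]


if __name__ == "__main__":
    sample = (
        "0" * 40 + " " + "a" * 40
        + " Jane Doe <jane@example.com> 1700000000 +0100\tcommit (initial): first\n"
    )
    print(reflog_entries(sample)[0]["message"])
    # commit (initial): first
```

The file is append-only, which is why the newest entry is the last line on disk but the first in git reflog output.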

9.4 Recovery Routine

Don't panic → git reflog → find hash → git reset --hard <hash> or git cherry-pick <hash>. Within 30 days, almost any mistake is recoverable.


10. Partial Clone and Sparse Checkout

10.1 Big Repos

Linux kernel: 3GB. Chromium: 50GB. Full clones waste time and space.

10.2 Shallow Clone

git clone --depth=1 https://github.com/...

Latest commit only. Common in CI. Limits some git pull workflows.

10.3 Partial Clone (Git 2.19+)

git clone --filter=blob:none https://github.com/...

All commits + trees; blobs fetched lazily on demand.

10.4 Sparse Checkout

git sparse-checkout init
git sparse-checkout set src/feature-x docs

Only specified directories materialize. Essential for monorepos.

10.5 Combined

git clone --filter=blob:none --sparse https://...
cd <cloned-repo>   # --sparse already enabled sparse checkout; no separate init needed
git sparse-checkout set src/mymodule

50GB repo can operate in under 1GB.

10.6 Git LFS

Stores large files on a separate server; Git tracks only a small text pointer (roughly 130 bytes: spec version, SHA-256 OID, size). Good for media/binaries, but no real diff.
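For a sense of what Git actually stores in place of the large file, here is a sketch that builds a pointer in the Git LFS v1 pointer layout (version line, SHA-256 OID, size). The function name is hypothetical; the real tool also handles upload and the clean/smudge filters, none of which is modeled here.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the small text pointer that stands in for a large file."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    pointer = lfs_pointer(b"\x00" * 10_000_000)  # stand-in for a 10 MB binary
    print(pointer)
    print(len(pointer), "bytes tracked by Git")
```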


11. Git Protocols

11.1 Clone/Fetch Flow

  1. Refs exchange.
  2. "have"/"want" negotiation.
  3. Server builds and sends a packfile.

11.2 Smart HTTP

Most common (GitHub, GitLab):

GET  /repo.git/info/refs?service=git-upload-pack
POST /repo.git/git-upload-pack

Firewall-friendly over port 443.
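Both requests carry their payload as pkt-lines: each line is prefixed with its total length (payload plus the 4-byte prefix) as four hex digits, and "0000" is a flush packet marking the end of a section. A minimal encoder/decoder sketch follows; the function names are made up, and the special v2 packets 0001 and 0002 are not handled.

```python
FLUSH = b"0000"
MAX_PKT_LEN = 65520  # maximum total length of one pkt-line, prefix included


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length as 4 hex digits."""
    length = len(payload) + 4
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def read_pkt_lines(data: bytes) -> list:
    """Decode a byte stream into payloads; a flush packet is returned as None."""
    out, pos = [], 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            out.append(None)
            pos += 4
        elif length < 4:
            raise ValueError("unsupported or invalid pkt-line length")
        else:
            out.append(data[pos + 4:pos + length])
            pos += length
    return out


if __name__ == "__main__":
    print(pkt_line(b"a\n"))
    # b'0006a\n'
    stream = pkt_line(b"want " + b"a" * 40 + b"\n") + FLUSH
    print(read_pkt_lines(stream))
```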

11.3 SSH

ssh git@github.com "git-upload-pack 'user/repo.git'"

Same protocol over SSH — often faster (no HTTP overhead).

11.4 Git Protocol (port 9418)

Original, anonymous-only, rarely used.

11.5 Protocol V2 (2018+)

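Capability negotiation, server-side ref filtering via ls-refs (no full ref advertisement), stateless request/response design. Introduced in Git 2.18 (2018) and later made the client default (first in Git 2.26); supported by GitHub and GitLab.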

11.6 Transfer Optimization

Server uses reachability bitmaps to compute needed objects instantly, skipping what the client already has, then delta-compresses into a pack. Large hosts pre-compute packs plus per-user deltas.


12. Debugging and Inspection Tools

git cat-file -t abc123    # type
git cat-file -s abc123    # size
git cat-file -p abc123    # pretty-print
git cat-file --batch-all-objects --unordered

git rev-list HEAD
git rev-list --count HEAD
git rev-list --objects HEAD

git verify-pack -v .git/objects/pack/pack-abc123.idx
git fsck --full
git show-ref
git count-objects -v

13. Common Real-World Scenarios

13.1 Accidental Force Push

git reflog show origin/main   # on a clone that fetched before the push: find the old tip
git push --force-with-lease origin <old-hash>:refs/heads/main

13.2 Brick-wall Conflict

git merge --abort
git checkout feature
git rebase -i main
git checkout main
git merge feature

13.3 Split a Commit

git rebase -i HEAD~3
# mark target commit 'edit'
git reset HEAD^
git add file1 && git commit -m "first half"
git add file2 && git commit -m "second half"
git rebase --continue

13.4 Committed a Secret

git reset --soft HEAD~1   # if recent
# or, for older commits:
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch secrets.txt' HEAD
# or use git filter-repo (the replacement recommended by the filter-branch docs) or BFG Repo-Cleaner

Needs force push; assume the secret is already leaked and rotate it immediately.

13.5 When Was This File Added?

git log --all --full-history --source -- <file>
git log --diff-filter=A -- <file>
git log -p -- <file>
git blame <file>

14. Changes in 2024-2025 Git

14.1 Worktree Improvements

git worktree add ../feature-x feature-x

Multiple branches checked out in parallel — no context-switch pain.

14.2 Reftable

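A new binary ref storage backend: sorted, block-indexed tables with O(log n) lookup, replacing one-file-per-ref plus packed-refs. Available since Git 2.45 (2024) via git init --ref-format=reftable; still maturing in 2025.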

14.3 Scalar

Microsoft's big-repo wrapper (integrated into Git 2.38+): partial clone + sparse checkout + background maintenance configured automatically.

14.4 SSH Signing

git config gpg.format ssh
git config user.signingkey ~/.ssh/id_ed25519.pub
git commit -S -m "Signed"

Sign commits with SSH keys, no GPG setup required (Git 2.34+). GitHub has supported SSH signature verification since 2022.


15. Summary Cheat Sheet

Essence: Content-addressable filesystem, SHA-1 hash → object
Objects: blob (content), tree (dir), commit (snapshot+meta), tag
Storage: loose (.git/objects/ab/cd...) / pack (*.pack+*.idx+*.bitmap)
Refs:    refs/heads, refs/tags, refs/remotes, HEAD, packed-refs
Index:   .git/index binary, three-tree model (HEAD/Index/Working)
Merge:   3-way, fast-forward, ort (2021+), rename detection
Rebase:  cherry-pick new commits, new hashes, not on public branches
Reflog:  local history of ref changes, 30-90 day retention
Big repo: --depth=1, --filter=blob:none, sparse-checkout, LFS
Protocol: Smart HTTP / SSH / Protocol V2
Tools:   cat-file, rev-list, fsck, verify-pack, count-objects, reflog

16. Quiz

Q1. What are the four Git object types and how do they link?

A. blob (file contents), tree (directory listing), commit (snapshot + metadata), tag (annotated tag). A commit points at a top-level tree; trees point at blobs and sub-trees. Commits also point at parent commit(s), forming a DAG. Every edge is a SHA-1 hash. Trees are recursive. Identical blobs are stored once (deduplication). Any byte change alters the hash of every tree and commit that contains it, and of every descendant commit: the integrity guarantee.

Q2. How does packfile delta compression work?

A. Similar objects are stored as base + edit instructions. The base is full (zlib-compressed); similar objects say "copy offset X for N bytes from base" and "insert these M new bytes." A 10-byte edit of a 100MB file costs a few dozen bytes of delta instead of a new 100MB blob. Base selection is heuristic (same filename previous version, size similarity). Chain depth is capped (default 50) to bound reconstruction cost. This is why incremental fetches and pushes are often only kilobytes.

Q3. Why does the index exist?

A. Three reasons: (1) Partial commits — stage only selected files. (2) Performance — the index caches inode/mtime/size so git status avoids full diffs. (3) Merge state — during conflicts the index holds multiple stages (ancestor/ours/theirs) per path, enabling git checkout --theirs file. In the three-tree model (HEAD/Index/Working), the index is the explicit "what the next commit will be" layer.

Q4. Why is rebase dangerous on public branches?

A. Rebase changes commit hashes. Same changes, different parent, entirely new SHA-1 — effectively new commits. If teammates branched off the old hashes, their work depends on commits that "disappear" after a force push. Shared branches should be merged, not rebased. --force-with-lease narrows the window by refusing push if the remote advanced since you last fetched.

Q5. Why is reflog a safety net for mistakes?

A. Reflog records every HEAD/ref update locally with timestamps under .git/logs/. Objects are not immediately deleted — even after git reset --hard, the blob/commit still lives in .git/objects/ as long as the reflog references it. So git reflog + git reset --hard <hash> recovers almost anything within 30-90 days. Mistakes are nearly always reversible until git gc actually prunes.

Q6. Partial clone vs shallow clone — what's the difference?

A. Shallow (--depth=N) fetches only the last N commits; history is truncated and some git pull paths break. Partial (--filter=blob:none) fetches all commits and trees but fetches blobs lazily — full log history, file contents on demand. Combined with sparse checkout a 50GB repo can operate under 1GB. Microsoft Windows runs on this model.

Q7. Why did ort replace recursive?

A. Performance and correctness. Recursive could explode on criss-cross merges (recursively merging multiple ancestors), slowing or OOM-ing on large repos. Ort ("Ostensibly Recursive's Twin"), a from-scratch rewrite by Elijah Newren, is up to 500x faster in some cases, has more accurate rename detection, and uses far less memory. Default since Git 2.34 (2021). Most developers never noticed; merges just got faster.


If this was useful, check out:

  • "Binary Serialization: Protobuf/Thrift/Avro/FlatBuffers" — compare with Git's binary formats.
  • "Docker BuildKit & Image Layers Deep Dive" — another content-addressable system.
  • "RocksDB & LSM-Tree Deep Dive" — append-only + background compaction philosophy.
  • "Consistent Hashing & Virtual Nodes" — distributed content addressing.