Git Internals Deep Dive: Object Model, Packfile, Merge Algorithms, Reflog, Protocol (2025)
TL;DR
- Git is a content-addressable filesystem. Version control is a layer built on top. Every piece of data (files, directories, commits) is addressed by SHA-1 hash.
- Four object types: blob (file content), tree (directory), commit (snapshot + metadata), tag (annotated, optionally signed reference).
- Loose object vs Packfile: each object starts as its own file (.git/objects/ab/cd...), later packed and delta-compressed.
- Refs: branches/tags are just text files pointing at commit hashes. HEAD points to whatever is currently checked out.
- Index: staging area. A binary file that pre-builds the next commit's tree.
- Merge: in 2021 the default changed from recursive to ort. Up to 500x faster in some cases.
- Rebase: cherry-picks commits one by one onto a new base. Changes commit hashes.
- Reflog: complete history of HEAD and refs. The secret to recovering "lost commits."
- Packfile: delta compression + index (.idx) + bitmap (.bitmap). Makes push/fetch KB-scale.
- Protocol: Smart HTTP (most common), SSH, Git protocol. Client declares "what I have" and server sends only what's needed.
1. Git's Philosophy — "A Filesystem, Not Version Control"
In 2005, BitKeeper (then used by the Linux kernel) revoked its free license. Linus Torvalds wrote the first version of Git in about two weeks. Design goals: distributed operation, speed, data integrity, and first-class branching and merging.
Linus's insight: existing VCS picked the wrong abstraction. Instead of storing diffs between files, Git stores snapshots.
Content-Addressable Filesystem
At its core, Git is a key-value store:
SHA-1 hash → object content
Store: "save this content" → returns a SHA-1. Retrieve: "give me the content for this hash" → returns the object. Everything else (branches, merges, history) is built on top.
mkdir /tmp/gitrepo && cd /tmp/gitrepo
git init
echo "hello world" | git hash-object -w --stdin
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
git cat-file -p 3b18e512
# hello world
hash-object -w prepends a header, computes the SHA-1, zlib-compresses the result, and stores it at .git/objects/3b/18e5... Note that echo appends a newline, and that newline is part of the hashed content.
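The header, hash, and compression steps can be reproduced outside Git. The following is a minimal sketch in Python, assuming a SHA-1 repository; the helper names (encode_blob, blob_id, loose_path) are made up for illustration and are not part of Git.

```python
import hashlib
import zlib


def encode_blob(content: bytes) -> bytes:
    """Prepend Git's object header: the type, the byte length, and a NUL."""
    return b"blob %d\x00" % len(content) + content


def blob_id(content: bytes) -> str:
    """SHA-1 over header + content, the id that `git hash-object` prints."""
    return hashlib.sha1(encode_blob(content)).hexdigest()


def loose_path(object_id: str) -> str:
    """First two hex characters name the directory, the rest the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


content = b"hello world\n"  # what `echo "hello world"` writes to stdin
object_id = blob_id(content)
stored = zlib.compress(encode_blob(content))  # bytes a loose object file would hold

print(object_id)
print(loose_path(object_id))
print(zlib.decompress(stored))
```

Because the id depends only on the bytes, two files with identical content always map to the same object, whatever their names.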
2. The Object Model
2.1 Four Types
- Blob: pure file content (no metadata). Same content = same blob.
- Tree: directory listing. Maps name → mode → blob/tree hash.
- Commit: snapshot (tree pointer) + metadata (parent, author, message).
- Tag: annotated tag. Object pointing at a commit with a signature/message.
2.2 Blob
Format: blob <byte_length>\0<content>, then SHA-1, then zlib. 10,000 files with identical "Hello World" contents share one blob.
2.3 Tree
tree <length>\0
100644 blob abc123... README.md
040000 tree def456... src
100755 blob ghi789... build.sh
In the raw object each entry is stored as <mode> <name>\0<20-byte binary SHA-1>. Modes: 100644 regular file, 100755 executable, 40000 tree (printed as 040000), 120000 symlink, 160000 submodule. The type column shown by cat-file -p is derived from the mode rather than stored.
git cat-file -p HEAD^{tree}
Trees compose recursively: a directory of directories is a tree containing trees.
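To make the binary layout concrete, here is a small sketch that serializes and parses tree entries. It assumes SHA-1 (20-byte ids) and simplifies the sort order: real Git sorts directory names as if they ended in "/". The function names are illustrative only.

```python
import hashlib


def encode_tree(entries):
    """Serialize (mode, name, hex_id) tuples into a raw tree body."""
    body = b""
    # Simplification: plain sort by name (Git treats directories as "name/").
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


def parse_tree(body):
    """Parse a raw tree body back into (mode, name, hex_id) tuples."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the mode
        nul = body.index(b"\x00", space)   # end of the name
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        hex_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes follow the NUL
        entries.append((mode, name, hex_id))
        pos = nul + 21
    return entries


def tree_id(body):
    """Hash the body with a `tree <length>\\0` header, as Git does."""
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


demo = [
    ("100644", "README.md", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"),
    ("40000", "src", "4b825dc642cb6eb9a060e54bf8d69288fbee4904"),
]
print(parse_tree(encode_tree(demo)))
print(tree_id(encode_tree(demo)))
```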
2.4 Commit
commit <length>\0
tree 3c4e9cd789...
parent 5a9d8b41...
author Linus Torvalds <torvalds@linux.org> 1234567890 -0700
committer Linus Torvalds <torvalds@linux.org> 1234567890 -0700
Initial commit
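Fields: tree (the full snapshot), parent(s) (the initial commit has none, merges have 2 or more), author (who wrote the change), committer (who applied it; differs from the author after rebase or cherry-pick), and the message, which a blank line separates from the headers.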
2.5 Tag
Annotated tag (git tag -a) is a real object with object, type, tag, tagger, message, and optional PGP signature. Lightweight tags are just refs.
2.6 Hash Chain
commit → tree → blob
              → blob
              → tree → blob
commit → parent commit → parent commit → ...
Every edge is a hash. Change one byte in a blob and the hash of every tree and commit above it changes, as does every later commit that descends from it. This is Git's integrity guarantee.
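The commit layout from section 2.4 and the chaining effect can be sketched together. This is a simplified illustration, assuming unsigned commits with no extra headers; the identity string and the commit_id helper are hypothetical.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Serialize a commit the way section 2.4 lays it out, then hash it."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


WHO = "A U Thor <author@example.com> 1234567890 +0000"  # hypothetical identity
TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"       # id of the empty tree

root = commit_id(TREE, [], WHO, WHO, "Initial commit\n")
child = commit_id(TREE, [root], WHO, WHO, "Second commit\n")
# Same tree, author, and message, but a different parent: a different id.
child_other_parent = commit_id(TREE, ["0" * 40], WHO, WHO, "Second commit\n")

print(root)
print(child)
print(child_other_parent)
```

Since the parent id is part of the hashed bytes, rewriting any commit forces new ids on every commit built on top of it.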
2.7 SHA-1 vs SHA-256
Git originally used SHA-1. After Google and CWI Amsterdam demonstrated a practical SHA-1 collision (SHAttered) in 2017, Git gained experimental SHA-256 support in version 2.29 (git init --object-format=sha256). SHA-1 and SHA-256 repositories are not interoperable. Most projects still use SHA-1.
2.8 Loose Object Storage
.git/objects/
├── 3b/
│ └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
├── 5a/
│ └── 9d8b41...
The first 2 hex characters form the directory name and the remaining 38 form the filename. That spreads objects over 256 buckets and avoids the slowdown filesystems show with huge directories.
3. Refs — Meaningful Names
3.1 Ref = Alias for a Hash
cat .git/refs/heads/main
# 3c4e9cd789abc...
That's it. A branch is one text file containing a commit hash.
3.2 HEAD
cat .git/HEAD
# ref: refs/heads/main
Detached HEAD holds a commit hash directly.
3.3 Tag and Remote Refs
Lightweight tags point at commits; annotated tags point at tag objects. Remote refs live under .git/refs/remotes/origin/ and hold the last fetched state, fully independent from local branches.
3.4 packed-refs
With many refs, Git packs them into a single .git/packed-refs file. Git consults both unpacked files and packed-refs (unpacked wins).
3.5 Symbolic Ref
A ref may point at another ref (HEAD being the canonical example). See with git symbolic-ref HEAD.
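The lookup rules in this section (symbolic refs, loose files first, packed-refs as fallback) fit in a few lines. This is a sketch under the assumption of the classic files-based ref storage; resolve_ref and lookup_packed are illustrative names, not Git functions.

```python
import os


def lookup_packed(git_dir, name):
    """Scan packed-refs for `name`; return its hash or None."""
    packed = os.path.join(git_dir, "packed-refs")
    if not os.path.isfile(packed):
        return None
    with open(packed) as fh:
        for line in fh:
            if line.startswith(("#", "^")):  # header line or peeled-tag line
                continue
            value, _, ref = line.strip().partition(" ")
            if ref == name:
                return value
    return None


def resolve_ref(git_dir, name="HEAD", max_depth=5):
    """Follow `ref: ...` indirections until a hash is found."""
    for _ in range(max_depth):
        path = os.path.join(git_dir, name)
        if not os.path.isfile(path):
            return lookup_packed(git_dir, name)  # no loose file: try packed-refs
        with open(path) as fh:
            value = fh.read().strip()
        if not value.startswith("ref: "):
            return value                          # a loose ref holding a hash wins
        name = value[len("ref: "):]               # symbolic ref: follow it
    raise ValueError("symbolic ref chain too deep")
```

On a real repository, resolve_ref(".git") should agree with git rev-parse HEAD as long as the repository uses the files backend.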
4. Index — The Staging Area
4.1 Role
Working directory → (git add) → Index → (git commit) → Commit
The index pre-builds the next commit's tree.
4.2 Structure
.git/index is binary: header + entries (ctime, mtime, dev, ino, mode, uid, gid, size, SHA-1, flags, path) + checksum.
git ls-files --stage
# 100644 a906cb2a4a904a15... 0 README.md
# 100644 3b18e512dba79e4c... 0 src/main.c
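As a rough illustration of the binary layout, the sketch below parses the header and entries of an index file and returns the same columns that git ls-files --stage prints. It assumes index version 2 with no extended flags and ignores extensions and the trailing checksum; parse_index is a hypothetical helper.

```python
import struct


def parse_index(data: bytes):
    """Return (mode, hex_id, stage, path) for each entry of a version 2 index."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    if version != 2:
        raise ValueError("this sketch only handles index version 2")
    entries, pos = [], 12
    for _ in range(count):
        # ctime(2), mtime(2), dev, ino, mode, uid, gid, size, 20-byte id, flags
        fields = struct.unpack(">10I20sH", data[pos:pos + 62])
        mode, object_id, flags = fields[6], fields[10], fields[11]
        name_len = flags & 0x0FFF          # low 12 bits: path length
        stage = (flags >> 12) & 0x3        # 0 normal, 1 to 3 during a conflict
        path = data[pos + 62:pos + 62 + name_len].decode()
        entries.append((format(mode, "o"), object_id.hex(), stage, path))
        # Entries are NUL-padded (1 to 8 bytes) to a multiple of 8 bytes.
        pos += (62 + name_len + 8) & ~7
    return entries
```

A possible use on a real repository is parse_index(open(".git/index", "rb").read()), keeping the version assumption in mind.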
4.3 Why an Index?
- Partial commits: stage only a subset.
- Performance: file change detection via inode/mtime/size is cheap.
- Merge state: on conflict, the index stores multiple stages (1 ancestor, 2 ours, 3 theirs) per path.
4.4 The Three-Tree Model
HEAD Index Working Tree
(last commit) → (staged) → (unstaged)
git status compares HEAD vs Index ("Changes to be committed") and Index vs Working Tree ("Changes not staged for commit").
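The two comparisons can be modelled with three dictionaries that map a path to a content id. This is only a toy model of the three-tree idea (it ignores untracked files, renames, and modes), and the function name is illustrative.

```python
def status(head, index, worktree):
    """Each argument maps path -> content id. Returns (staged, unstaged)."""
    # HEAD vs index: what the next commit would change.
    staged = sorted(p for p in head.keys() | index.keys()
                    if head.get(p) != index.get(p))
    # Index vs working tree: tracked files edited but not yet added.
    unstaged = sorted(p for p in index if worktree.get(p) != index.get(p))
    return staged, unstaged


head = {"a.txt": "1", "b.txt": "2"}
index = {"a.txt": "1", "b.txt": "3", "c.txt": "4"}
worktree = {"a.txt": "9", "b.txt": "3", "c.txt": "4", "d.txt": "5"}
print(status(head, index, worktree))
```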
5. Packfiles — Efficient Storage
5.1 Loose Object Limits
10,000 files with 10 revisions each means roughly 100,000 loose blobs, one file each (inode waste). Worse: loose objects are never delta-compressed, so a 10-byte edit to a 100MB file stores a second full copy (zlib-compressed, but otherwise complete).
5.2 Packfile Layout
.git/objects/pack/
├── pack-abc123.idx # index for offset lookup
├── pack-abc123.pack # actual data
├── pack-abc123.bitmap # reachability bitmap (optional)
5.3 Pack Format
Header (PACK + version + count), objects (type + compressed data; deltas carry OFS_DELTA or REF_DELTA base ref), 20-byte checksum.
5.4 Delta Compression
If object A is similar to B:
Pack's B: full content (zlib-compressed)
Pack's A: "reference B, then apply these edit instructions"
Edit instructions: COPY (offset X, N bytes from base) or INSERT (N new bytes).
Example: "Hello, World\n" to "Hello, Git\n" (both end in a newline, which is the byte at offset 12 of the base):
A = COPY(0, 7) + INSERT("Git") + COPY(12, 1)
5.5 Base Object Selection
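Heuristics: objects are sorted by type, file name, and size, and each one is compared against a sliding window of neighbors (pack.window, default 10), so a previous version of the same file is a likely base. git repack -f discards existing deltas and recomputes them.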
5.6 Delta Chain
V1 (full) ← V2 (delta of V1) ← V3 (delta of V2) ← ...
Walk backwards to reconstruct. Chains are capped (default pack.depth=50).
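The COPY/INSERT instructions from 5.4 and the chain walk from 5.6 can be sketched as follows. Real packfiles encode these instructions in a compact binary form; the tuples and the in-memory store used here are simplifications for illustration.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild an object from its base and a list of copy/insert operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, size = op
            if offset + size > len(base):
                raise ValueError("copy runs past the end of the base")
            out += base[offset:offset + size]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown op {op[0]!r}")
    return bytes(out)


def reconstruct(store, object_id, max_depth=50):
    """store maps id -> ("full", data) or ("delta", base_id, ops)."""
    chain, current = [], object_id
    while store[current][0] == "delta":
        if len(chain) >= max_depth:
            raise ValueError("delta chain too deep")
        _, base_id, ops = store[current]
        chain.append(ops)
        current = base_id
    data = store[current][1]          # the full object at the end of the chain
    for ops in reversed(chain):       # replay deltas from oldest to newest
        data = apply_delta(data, ops)
    return data


store = {
    "v1": ("full", b"Hello, World\n"),
    "v2": ("delta", "v1", [("copy", 0, 7), ("insert", b"Git"), ("copy", 12, 1)]),
    "v3": ("delta", "v2", [("copy", 0, 10), ("insert", b"!"), ("copy", 10, 1)]),
}
print(reconstruct(store, "v3"))
```

The depth cap is the reason long chains get re-based: every extra link adds one more apply_delta pass on read.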
5.7 Pack Index (.idx)
Fanout table (256 entries keyed on first byte) + sorted SHA-1 list + CRC32 + offsets. Lookup is O(log n) via binary search.
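A fanout lookup can be sketched in memory like this. It is not a parser for the .idx file format; it only shows how 256 cumulative counts narrow the binary search to ids that share a first byte. The function names are illustrative.

```python
import bisect
import hashlib


def build_fanout(sorted_ids):
    """fanout[b] = number of ids whose first byte is <= b."""
    fanout = [0] * 256
    for object_id in sorted_ids:
        fanout[object_id[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return fanout


def find(sorted_ids, fanout, object_id):
    """Return the position of object_id in sorted_ids, or None."""
    first = object_id[0]
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    pos = bisect.bisect_left(sorted_ids, object_id, lo, hi)
    if pos < hi and sorted_ids[pos] == object_id:
        return pos  # in a real .idx this position selects the pack offset
    return None


ids = sorted(hashlib.sha1(str(n).encode()).digest() for n in range(1000))
fanout = build_fanout(ids)
print(find(ids, fanout, ids[42]))
```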
5.8 Reachability Bitmap
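Git 2.0+. Each selected commit stores a bitmap of the objects reachable from it, so the server can answer "need X, already have Y" with bitwise operations instead of walking the graph. This is one reason clones and fetches from large hosts such as GitHub are fast.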
6. Object Storage Operations
6.1 Create
git add file.txt: read → compute blob hash → check .git/objects/ → write if absent → update index.
6.2 Commit
git commit -m "...": build tree objects from index (recursive per dir) → create commit object with tree, parent, author, message → update HEAD's ref.
6.3 Checkout
git checkout main: read refs/heads/main → read its commit tree → materialize tree into working dir → sync index → update HEAD.
6.4 GC and Prune
git gc
Compacts loose objects into packfiles and deletes unreachable objects older than 2 weeks. Auto-triggers at ~6700 loose objects. git prune --expire=now purges everything unreachable (reflog entries keep objects alive).
7. Merge Algorithms
7.1 Three-way Merge
- Common Ancestor: shared ancestor commit.
- Ours: current branch tip.
- Theirs: branch being merged.
Per file: only Ours changed → Ours; only Theirs → Theirs; both same change → that change; both diverge → conflict.
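The per-file rule reads naturally as a small function. This sketch works on whole-file content ids only; Git applies the same idea per hunk inside a file, which is not modelled here.

```python
def merge_file(base, ours, theirs):
    """Return (result, conflict) for one path given three versions."""
    if ours == theirs:
        return ours, False      # both sides agree (or neither changed)
    if ours == base:
        return theirs, False    # only theirs changed
    if theirs == base:
        return ours, False      # only ours changed
    return None, True           # both changed differently: conflict


for case in [("x", "x", "y"), ("x", "y", "x"), ("x", "y", "y"), ("x", "y", "z")]:
    print(case, merge_file(*case))
```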
7.2 Fast-forward
When main is a strict ancestor of branch, main is simply advanced. No merge commit.
7.3 Three-way Merge Commit
When histories diverge, Git creates a merge commit M with two parents.
7.4 Recursive to Ort
Until 2021, Git's default was recursive. Criss-cross merges (multiple common ancestors) recursed into merging ancestors — sometimes pathologically slow.
Ort (short for "Ostensibly Recursive's Twin"), written by Elijah Newren: up to 500x faster in some cases, better rename detection, improved subtree merges, much lower memory. Default from Git 2.34 (2021). Most developers never noticed; merges just got faster.
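The fast-forward test from 7.2 and the multiple common ancestors of a criss-cross history can both be expressed as ancestry queries on the commit graph. The sketch below uses a plain dictionary of parents and a brute-force search; it is not how Git computes merge bases internally.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def can_fast_forward(parents, current_tip, target_tip):
    """True when the current tip is an ancestor of the target."""
    return current_tip in ancestors(parents, target_tip)


def merge_bases(parents, a, b):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}


# Criss-cross: D merges B and C, E merges C and B.
graph = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C", "B"]}
print(can_fast_forward(graph, "A", "B"))
print(sorted(merge_bases(graph, "D", "E")))
```

With two merge bases there is no single ancestor to diff against, which is the case where recursive first merged the bases themselves.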
7.5 Rename Detection
A deleted path and an added path whose contents are at least 50% similar are treated as a rename, so edits made on the other branch apply to the new name rather than the old one. Ort makes this faster and more accurate.
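A rough picture of threshold-based rename pairing is sketched below. Note the assumption: difflib.SequenceMatcher stands in for the similarity score, whereas Git uses its own estimate, so scores will not match Git's. The function name is illustrative.

```python
from difflib import SequenceMatcher


def detect_renames(deleted, added, threshold=0.5):
    """deleted/added map path -> text. Returns {old_path: new_path}."""
    candidates = []
    for old, old_text in deleted.items():
        for new, new_text in added.items():
            # Stand-in similarity metric, not Git's own algorithm.
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= threshold:
                candidates.append((score, old, new))
    renames = {}
    for _, old, new in sorted(candidates, reverse=True):  # best matches first
        if old not in renames and new not in renames.values():
            renames[old] = new
    return renames


old_text = "def a():\n    return 1\n\ndef b():\n    return 2\n"
deleted = {"util.py": old_text}
added = {
    "helpers.py": old_text + "\ndef c():\n    return 3\n",
    "notes.txt": "0123456789",
}
print(detect_renames(deleted, added))
```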
7.6 Conflict Resolution
<<<<<<< HEAD
int x = 1;
=======
int x = 2;
>>>>>>> branch
Edit, git add, git commit. Multi-stage index entries drive this.
7.7 Strategy Options
- -X ours / -X theirs: auto-pick on conflict.
- -X ignore-all-space: ignore whitespace.
- -X rename-threshold=80: tune rename detection.
Strategies (-s): ort (default), recursive (legacy), resolve, octopus (multi-branch, conflict-free only), ours (discard Theirs content).
8. Rebase Mechanics
8.1 Idea
main: A → B → C
branch: A → B → D → E
After git rebase main on branch:
main: A → B → C
branch: A → B → C → D' → E'
D', E' are new commits — same changes, different parent, different hashes.
8.2 How It Works
- Start at main's tip as temporary HEAD.
- List branch-only commits in order.
- Cherry-pick each: compute ancestor, three-way merge, create new commit with new parent.
- Move the branch ref to the new tip.
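The steps above can be sketched with a toy commit model in which an id is the hash of the parent id plus the change. It leaves out the three-way merge that each cherry-pick performs and only shows why replayed commits get new hashes.

```python
import hashlib


def make_commit(parent, change):
    """Toy commit id: depends on the parent id and the change description."""
    return hashlib.sha1(f"{parent}\n{change}".encode()).hexdigest()


def rebase(changes, new_base):
    """Replay each change on top of new_base; return the new ids in order."""
    tip, new_ids = new_base, []
    for change in changes:
        tip = make_commit(tip, change)
        new_ids.append(tip)
    return new_ids


a = make_commit("", "A")
b = make_commit(a, "B")
c = make_commit(b, "C")          # main: A -> B -> C
d = make_commit(b, "D")
e = make_commit(d, "E")          # branch: A -> B -> D -> E

d_new, e_new = rebase(["D", "E"], c)
print(d[:7], "->", d_new[:7])
print(e[:7], "->", e_new[:7])
```

The old commits d and e still exist after the rebase; only the branch ref moves, which is why the reflog can bring them back.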
8.3 Interactive Rebase
git rebase -i HEAD~5
Actions: pick, reword, edit, squash, fixup, drop. History rewrite — new hashes.
8.4 Danger on Public Branches
Rebase changes hashes. Anyone working on old hashes will have divergent history after git push --force. Rebase only your own branches; merge shared ones.
8.5 Force-with-lease
git push --force-with-lease
Refuses the push if the remote advanced since you last fetched — prevents clobbering teammates' work.
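The check behaves like a compare-and-swap on the remote ref. A minimal model, assuming the remote is just a dictionary and expected_id is the value of your remote-tracking ref:

```python
def push_with_lease(remote_refs, ref, new_id, expected_id):
    """Move `ref` to new_id only if the remote still holds expected_id."""
    if remote_refs.get(ref) != expected_id:
        return False             # someone pushed since our last fetch
    remote_refs[ref] = new_id
    return True


remote = {"refs/heads/main": "c1"}
print(push_with_lease(remote, "refs/heads/main", "c2-rebased", expected_id="c1"))
remote["refs/heads/main"] = "c3-teammate"   # a teammate pushes in the meantime
print(push_with_lease(remote, "refs/heads/main", "c4-rebased", expected_id="c2-rebased"))
```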
9. Reflog — The Safety Net
9.1 What It Is
Local log of every HEAD/ref change. Even after git reset --hard or git rebase, you can recover.
git reflog
# abc123 (HEAD -> main) HEAD@{0}: commit: Add feature
# def456 HEAD@{1}: commit: Fix bug
# ghi789 HEAD@{2}: rebase finished
9.2 Recovery Example
Accidentally dropped a commit with reset:
git reflog
# def456 HEAD@{0}: reset: moving to HEAD~1
# abc123 HEAD@{1}: commit: the lost one
git reset --hard abc123
Objects remain in .git/objects/ as long as reflog holds the hash.
9.3 Storage and Expiry
.git/logs/HEAD and .git/logs/refs/.... Default expiry: 90 days for reachable, 30 days for unreachable. git gc cleans up.
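Each line in those log files records the old hash, the new hash, who made the change, when, and a message after a tab. The sketch below parses that layout; the sample lines and identity are made up for illustration.

```python
def parse_reflog_line(line):
    """Split one reflog line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    who, timestamp, tz = rest.rsplit(" ", 2)
    return {"old": old, "new": new, "who": who,
            "time": int(timestamp), "tz": tz, "message": message}


def newest_first(lines):
    """The file is append-only, so position i from the end is HEAD@{i}."""
    return [parse_reflog_line(line) for line in reversed(lines)]


WHO = "A U Thor <author@example.com>"  # hypothetical identity
sample = [
    f"{'0' * 40} {'a' * 40} {WHO} 1700000000 +0900\tcommit (initial): Start\n",
    f"{'a' * 40} {'b' * 40} {WHO} 1700000100 +0900\tcommit: the lost one\n",
    f"{'b' * 40} {'a' * 40} {WHO} 1700000200 +0900\treset: moving to HEAD~1\n",
]
for i, entry in enumerate(newest_first(sample)):
    print(f"HEAD@{{{i}}}", entry["new"][:7], entry["message"])
```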
9.4 Recovery Routine
Don't panic → git reflog → find hash → git reset --hard <hash> or git cherry-pick <hash>. Within 30 days, almost any mistake is recoverable.
10. Partial Clone and Sparse Checkout
10.1 Big Repos
Linux kernel: roughly 3GB. Chromium: roughly 50GB. Full clones cost time and disk space.
10.2 Shallow Clone
git clone --depth=1 https://github.com/...
Latest commit only. Common in CI. Limits some git pull workflows.
10.3 Partial Clone (Git 2.19+)
git clone --filter=blob:none https://github.com/...
All commits + trees; blobs fetched lazily on demand.
10.4 Sparse Checkout
git sparse-checkout init
git sparse-checkout set src/feature-x docs
Only specified directories materialize. Essential for monorepos.
10.5 Combined
git clone --filter=blob:none --sparse https://...
cd <repo> # placeholder for the cloned directory; --sparse has already enabled sparse checkout
git sparse-checkout set src/mymodule
A 50GB repo can then be worked on in under 1GB.
10.6 Git LFS
Stores large files on a separate server; Git tracks only a small text pointer (roughly 130 bytes). Good for media/binaries, but there is no meaningful diff.
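For a sense of what Git actually stores, here is a sketch that builds an LFS pointer for some content, following the three-line layout of the Git LFS pointer format (version, oid, size). The function name is illustrative.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text pointer that replaces a large file in the repository."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


pointer = lfs_pointer(b"x" * 5_000_000)   # stands in for a 5MB binary
print(pointer)
print(len(pointer), "bytes tracked by Git")
```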
11. Git Protocols
11.1 Clone/Fetch Flow
- Refs exchange.
- "have"/"want" negotiation.
- Server builds and sends a packfile.
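The negotiation boils down to a set difference over the object graph: everything reachable from what the client wants, minus everything reachable from what it already has. The sketch below uses a plain graph walk with made-up object names; real servers speed this up with bitmaps.

```python
def reachable(objects, roots):
    """objects maps id -> ids it references (parents, trees, blobs)."""
    seen, stack = set(), list(roots)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(objects.get(current, ()))
    return seen


def objects_to_send(objects, wants, haves):
    """What the server must put in the packfile."""
    return reachable(objects, wants) - reachable(objects, haves)


server = {
    "C1": ["T1"], "C2": ["C1", "T2"], "C3": ["C2", "T3"],
    "T1": ["B1"], "T2": ["B1", "B2"], "T3": ["B1", "B3"],
}
print(sorted(objects_to_send(server, wants=["C3"], haves=["C2"])))
```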
11.2 Smart HTTP
Most common (GitHub, GitLab):
GET /repo.git/info/refs?service=git-upload-pack
POST /repo.git/git-upload-pack
Firewall-friendly over port 443.
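Both requests carry data framed as pkt-lines: four hex digits giving the total length (including those four bytes), then the payload, with 0000 as a flush marker. A minimal encoder and decoder, with illustrative function names:

```python
FLUSH = b"0000"


def pkt_line(payload: bytes) -> bytes:
    """Prefix the payload with its length (payload + 4) as four hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(stream: bytes):
    """Decode a byte string into payloads; None marks a special packet."""
    lines, pos = [], 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length < 4:               # 0000 flush (protocol v2 also uses 0001)
            lines.append(None)
            pos += 4
            continue
        lines.append(stream[pos + 4:pos + length])
        pos += length
    return lines


first = pkt_line(b"# service=git-upload-pack\n")
print(first)
print(read_pkt_lines(first + FLUSH))
```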
11.3 SSH
ssh git@github.com "git-upload-pack 'user/repo.git'"
Same protocol over SSH — often faster (no HTTP overhead).
11.4 Git Protocol (port 9418)
Original, anonymous-only, rarely used.
11.5 Protocol V2 (2018+)
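Capability negotiation, server-side ref filtering via ls-refs, and a stateless-friendly request model. Introduced in Git 2.18 (2018) and the client default since Git 2.26 (2020); supported by GitHub and GitLab.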
11.6 Transfer Optimization
Server uses reachability bitmaps to compute needed objects instantly, skipping what the client already has, then delta-compresses into a pack. Large hosts pre-compute packs plus per-user deltas.
12. Debugging and Inspection Tools
git cat-file -t abc123 # type
git cat-file -s abc123 # size
git cat-file -p abc123 # pretty-print
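git cat-file --batch-check --batch-all-objects --unordered # every object: id, type, size
git rev-list HEAD # commits reachable from HEAD
git rev-list --count HEAD # number of those commits
git rev-list --objects HEAD # commits plus the trees and blobs they reference
git verify-pack -v .git/objects/pack/pack-abc123.idx # per-object pack details, including delta depth
git fsck --full # verify connectivity and object integrity
git show-ref # all refs with their hashes
git count-objects -v # loose and packed object counts and sizes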
13. Common Real-World Scenarios
13.1 Accidental Force Push
git reflog # find the hash the branch pointed at before the force push
git push --force-with-lease origin <old-hash>:main
13.2 Brick-wall Conflict
git merge --abort
git checkout feature
git rebase -i main
git checkout main
git merge feature
13.3 Split a Commit
git rebase -i HEAD~3
# mark target commit 'edit'
git reset HEAD^
git add file1 && git commit -m "first half"
git add file2 && git commit -m "second half"
git rebase --continue
13.4 Committed a Secret
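git reset --soft HEAD~1 # if it is the most recent commit; then unstage the secret and recommit
# or, for older commits:
git filter-branch --index-filter \
'git rm --cached --ignore-unmatch secrets.txt' HEAD
# or use BFG Repo-Cleaner or git filter-repo (the Git documentation now recommends filter-repo over filter-branch)
Rewriting history requires a force push. Assume the secret is already leaked and rotate it immediately.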
13.5 When Was This File Added?
git log --all --full-history --source -- <file>
git log --diff-filter=A -- <file>
git log -p -- <file>
git blame <file>
14. Changes in 2024-2025 Git
14.1 Worktree Improvements
git worktree add ../feature-x feature-x
Multiple branches checked out in parallel — no context-switch pain.
14.2 Reftable
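New binary ref storage backend: refs live in a stack of sorted, block-indexed table files instead of one file per ref, giving O(log n) lookup. Introduced in Git 2.45 (2024) via git init --ref-format=reftable and stabilizing through 2025.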
14.3 Scalar
Microsoft's big-repo wrapper (integrated into Git 2.38+): partial clone + sparse checkout + background maintenance configured automatically.
14.4 SSH Signing
git config gpg.format ssh
git config user.signingkey ~/.ssh/id_ed25519.pub
git commit -S -m "Signed"
Sign commits with SSH keys, with no GPG setup (requires Git 2.34+). GitHub has supported SSH signature verification since 2022.
15. Summary Cheat Sheet
Essence: Content-addressable filesystem, SHA-1 hash → object
Objects: blob (content), tree (dir), commit (snapshot+meta), tag
Storage: loose (.git/objects/ab/cd...) / pack (*.pack+*.idx+*.bitmap)
Refs: refs/heads, refs/tags, refs/remotes, HEAD, packed-refs
Index: .git/index binary, three-tree model (HEAD/Index/Working)
Merge: 3-way, fast-forward, ort (2021+), rename detection
Rebase: cherry-pick new commits, new hashes, not on public branches
Reflog: local history of ref changes, 30-90 day retention
Big repo: --depth=1, --filter=blob:none, sparse-checkout, LFS
Protocol: Smart HTTP / SSH / Protocol V2
Tools: cat-file, rev-list, fsck, verify-pack, count-objects, reflog
16. Quiz
Q1. What are the four Git object types and how do they link?
A. blob (file contents), tree (directory listing), commit (snapshot + metadata), tag (annotated tag). A commit points at a top-level tree; trees point at blobs and sub-trees. Commits also point at parent commit(s), forming a DAG. Every edge is a SHA-1 hash. Trees are recursive. Identical blobs are stored once (deduplication). Any byte change cascades through every ancestor hash — the integrity guarantee.
Q2. How does packfile delta compression work?
A. Similar objects are stored as base + edit instructions. The base is full (zlib-compressed); similar objects say "copy offset X for N bytes from base" and "insert these M new bytes." A 10-byte edit of a 100MB file costs roughly the size of the edit instead of a new 100MB blob. Base selection is heuristic (same filename previous version, size similarity). Chain depth is capped (default 50) to bound reconstruction cost. This is why incremental fetches and pushes are often only kilobytes.
Q3. Why does the index exist?
A. Three reasons: (1) Partial commits — stage only selected files. (2) Performance — the index caches inode/mtime/size so git status avoids full diffs. (3) Merge state — during conflicts the index holds multiple stages (ancestor/ours/theirs) per path, enabling git checkout --theirs file. In the three-tree model (HEAD/Index/Working), the index is the explicit "what the next commit will be" layer.
Q4. Why is rebase dangerous on public branches?
A. Rebase changes commit hashes. Same changes, different parent, entirely new SHA-1 — effectively new commits. If teammates branched off the old hashes, their work depends on commits that "disappear" after a force push. Shared branches should be merged, not rebased. --force-with-lease narrows the window by refusing push if the remote advanced since you last fetched.
Q5. Why is reflog a safety net for mistakes?
A. Reflog records every HEAD/ref update locally with timestamps under .git/logs/. Objects are not immediately deleted — even after git reset --hard, the blob/commit still lives in .git/objects/ as long as the reflog references it. So git reflog + git reset --hard <hash> recovers almost anything within 30-90 days. Mistakes are nearly always reversible until git gc actually prunes.
Q6. Partial clone vs shallow clone — what's the difference?
A. Shallow (--depth=N) fetches only the last N commits; history is truncated and some git pull paths break. Partial (--filter=blob:none) fetches all commits and trees but fetches blobs lazily, giving full log history with file contents on demand. Combined with sparse checkout, a 50GB repo can operate under 1GB. Microsoft's Scalar tooling for very large repositories builds on this model.
Q7. Why did ort replace recursive?
A. Performance and correctness. Recursive could explode on criss-cross merges (recursively merging multiple ancestors), slowing or OOM-ing on large repos. Ort (short for "Ostensibly Recursive's Twin"), written by Elijah Newren, is up to 500x faster in some cases, has more accurate rename detection, and uses far less memory. Default from Git 2.34 (2021). Most developers never noticed; merges just got faster.
If this was useful, check out:
- "Binary Serialization: Protobuf/Thrift/Avro/FlatBuffers" — compare with Git's binary formats.
- "Docker BuildKit & Image Layers Deep Dive" — another content-addressable system.
- "RocksDB & LSM-Tree Deep Dive" — append-only + background compaction philosophy.
- "Consistent Hashing & Virtual Nodes" — distributed content addressing.