Author: Youngju Kim (@fjvbn20031)
Introduction: Why Understand Git Internals
Most developers only use git add, git commit, and git push. However, understanding Git internals provides significant advantages:
- Better debugging: Understand and recover from detached HEAD, conflicts, and lost commits
- Advanced workflows: Use rebase, cherry-pick, and bisect with confidence
- Interview preparation: FAANG-level companies frequently ask about Git internals
- Troubleshooting: Resolve repository corruption, large file issues, and slow clones
This article dissects Git from the ground up: the object model, hashing, directory structure, DAG, the true nature of branches, merge/rebase internals, reflog, and pack files.
1. Git Object Model — The Foundation of Everything
At its core, Git is a Content-Addressable Storage system. It stores four types of objects using the SHA-1 hash of their content as keys.
1.1 Blob (Binary Large Object)
A blob stores only the file content. It does not include the filename or path information.
# Create a blob from content
echo "Hello, Git!" | git hash-object --stdin
# Output: 0907f4a3c4740fa3a5c919cb4447fdb1f1a66aec
# Inspect blob content
git cat-file -p 0907f4a
# Output: Hello, Git!
# Check blob type
git cat-file -t 0907f4a
# Output: blob
Key point: Even if you have 100 files with identical content, Git stores only one blob. This is what makes Git efficient.
1.2 Tree
A tree represents a directory structure. It contains filenames, file modes, and references to blobs or other trees.
# Inspect the tree of the latest commit
git cat-file -p HEAD^{tree}
# Output:
# 100644 blob a1b2c3d... README.md
# 100644 blob d4e5f6a... package.json
# 040000 tree b7c8d9e... src
Tree (root)
├── blob: README.md (100644)
├── blob: package.json (100644)
└── tree: src/
    ├── blob: index.ts (100644)
    └── blob: utils.ts (100644)
File mode meanings:
| Mode | Meaning |
|---|---|
| 100644 | Regular file |
| 100755 | Executable file |
| 120000 | Symbolic link |
| 040000 | Directory (tree) |
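To make the tree layout concrete, here is a minimal Python sketch of how a tree object can be serialized and hashed. The helper names (`serialize_tree`, `object_id`, `tree_sort_key`) are hypothetical and only for illustration. It assumes SHA-1 object IDs and relies on the `type size\0content` hashing rule described in section 2. Note that on disk the directory mode is written as `40000`; `git cat-file -p` pads it to `040000` for display.

```python
import hashlib

def object_id(obj_type, body):
    """Hash '<type> <size>\\0<body>' with SHA-1, as Git does for every object."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def tree_sort_key(entry):
    """Git orders entries by name, comparing directories as if they ended in '/'."""
    mode, name, _ = entry
    return name.encode() + (b"/" if mode == "40000" else b"")

def serialize_tree(entries):
    """entries: list of (mode, name, hex_object_id); returns the raw tree body."""
    body = b""
    for mode, name, hex_id in sorted(entries, key=tree_sort_key):
        # Each entry is "<mode> <name>\0" followed by the 20 raw bytes of the ID.
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body

EMPTY_BLOB = object_id("blob", b"")
EMPTY_TREE = object_id("tree", b"")

# A root tree with one (empty) file and one (empty) subdirectory.
tree_body = serialize_tree([
    ("40000", "src", EMPTY_TREE),
    ("100644", "README.md", EMPTY_BLOB),
])
print(object_id("tree", tree_body))
```

The tree stores names and modes, while the blobs it points to store only content, which is why renaming a file changes the tree but not the blob.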
1.3 Commit
A commit object stores metadata about a snapshot.
git cat-file -p HEAD
# Output:
# tree a1b2c3d4e5f6...
# parent 9f8e7d6c5b4a...
# author Kim <kim@example.com> 1711234567 +0900
# committer Kim <kim@example.com> 1711234567 +0900
#
# feat: add user authentication
Commit object components:
- tree: Complete project snapshot at this point (references the root tree)
- parent: Parent commit SHA-1 (initial commit has no parent; merge commits have 2+)
- author: Person who wrote the code
- committer: Person who created the commit (can differ in cherry-pick)
- message: Commit message
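The components above can be sketched as plain text that is then hashed like any other object. This is an illustrative Python sketch, not Git's implementation: `commit_body` and `object_id` are hypothetical helper names, the author line reuses the example identity from the output above, and the tree ID is Git's well-known empty-tree ID.

```python
import hashlib

def object_id(obj_type, body):
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree, parents, author, committer, message):
    """Build the text of a commit object: headers, blank line, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]   # zero, one, or several parents
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "Kim <kim@example.com> 1711234567 +0900"

root = object_id("commit", commit_body(EMPTY_TREE, [], who, who, "initial commit"))
child = object_id(
    "commit",
    commit_body(EMPTY_TREE, [root], who, who, "feat: add user authentication"),
)
print(root)
print(child)
```

Because the parent ID is part of the hashed text, changing a commit's parent always produces a different commit ID, even when the tree and message stay the same.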
1.4 Tag
Annotated tags are stored as separate objects.
# Create annotated tag
git tag -a v1.0.0 -m "Release version 1.0.0"
# Inspect tag object
git cat-file -p v1.0.0
# Output:
# object d4e5f6a7b8c9...
# type commit
# tag v1.0.0
# tagger Kim <kim@example.com> 1711234567 +0900
#
# Release version 1.0.0
1.5 Object Relationship Diagram
Tag ──▶ Commit ──▶ Tree ──▶ Blob
          │          │
          ▼          ▼
        Commit     Tree ──▶ Blob
       (parent)
2. SHA-1 Hashing and Content-Addressable Storage
2.1 How SHA-1 Works in Git
Git computes the SHA-1 hash by prepending a header to the object content.
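# What Git does internally: hash "<type> <size>\0<content>"
content="Hello, Git!"
# echo appends a newline in the hash-object example, so the stored content is 12 bytes
size=$(( ${#content} + 1 ))
printf 'blob %d\0%s\n' "$size" "$content" | sha1sum
# Prints the same ID as: echo "Hello, Git!" | git hash-object --stdin
Header format: type size\0content
blob 12\0Hello, Git!\n
└─┬┘ └┬┘└┬┘└─────┬─────┘
type size null actual content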
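The same header-plus-content rule can be sketched in a few lines of Python. The function name `git_object_id` is hypothetical; the sketch assumes a SHA-1 repository and uses only `hashlib`.

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to this content."""
    # The size in the header is the byte length of the content alone.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

empty_blob = git_object_id("blob", b"")
hello_blob = git_object_id("blob", b"hello world\n")
print(empty_blob)
print(hello_blob)
```

Inside a repository, the results can be compared against `git hash-object --stdin` fed with the same bytes.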
2.2 What Content-Addressable Means
The same content always produces the same hash. This guarantees:
- Integrity: Tampered data produces a different hash
- Deduplication: Identical files are stored only once
- Efficient comparison: Comparing hashes instantly determines content equality
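These three guarantees can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: `ObjectStore` is a hypothetical class, it handles blobs only, and it keeps everything in a dictionary rather than on disk.

```python
import hashlib

class ObjectStore:
    """Toy content-addressable store keyed by Git-style blob IDs."""

    def __init__(self):
        self._objects = {}

    @staticmethod
    def _key(content: bytes) -> str:
        return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()

    def put(self, content: bytes) -> str:
        key = self._key(content)
        self._objects.setdefault(key, content)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        content = self._objects[key]
        if self._key(content) != key:           # integrity check on every read
            raise ValueError(f"object {key} is corrupt")
        return content

    def __len__(self):
        return len(self._objects)

store = ObjectStore()
ids = {store.put(b"same bytes\n") for _ in range(100)}  # 100 identical "files"
other = store.put(b"different bytes\n")
print(len(ids), len(store))
```

Comparing two files then reduces to comparing their keys, without reading either file's content.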
2.3 SHA-1 Collisions and SHA-256 Migration
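In 2017, researchers from CWI Amsterdam and Google demonstrated the first practical SHA-1 collision (the SHAttered attack). Git responded by adopting a collision-detecting SHA-1 implementation and by starting a transition to SHA-256.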
# Create a SHA-256 repository (Git 2.29+)
git init --object-format=sha256
# Check current repository hash algorithm
git rev-parse --show-object-format
| Aspect | SHA-1 | SHA-256 |
|---|---|---|
| Hash length | 40 chars | 64 chars |
| Security | Collision demonstrated (2017) | No known collisions |
| Compatibility | All Git versions | Git 2.29+ |
| Status | Default | Experimental |
3. .git Directory Anatomy
Let us examine the structure of a typical .git directory. git init creates the skeleton; the object and branch files shown below appear once commits exist.
.git/
├── HEAD # Reference to currently checked-out branch
├── config # Repository-specific settings
├── description # For GitWeb (rarely used)
├── hooks/ # Client/server hook scripts
│ ├── pre-commit.sample
│ ├── commit-msg.sample
│ └── ...
├── info/
│ └── exclude # Local version of .gitignore
├── objects/ # All Git objects
│ ├── 09/
│ │ └── 07f4a3c4740fa3a5c919cb4447fdb1f1a66aec
│ ├── info/
│ └── pack/ # Pack files
├── refs/ # Branch and tag references
│ ├── heads/ # Local branches
│ │ └── main
│ ├── remotes/ # Remote branches
│ │ └── origin/
│ └── tags/ # Tags
└── index # Staging area (binary)
3.1 HEAD File
# HEAD points to the current branch
cat .git/HEAD
# ref: refs/heads/main
# In detached HEAD state, it points directly to a commit
git checkout a1b2c3d
cat .git/HEAD
# a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0
3.2 refs Directory
# A branch is simply a file containing a commit hash
cat .git/refs/heads/main
# d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3
# You can even create branches manually!
echo "d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3" > .git/refs/heads/my-branch
3.3 objects Directory
Objects are stored using the first 2 characters of the SHA-1 hash as a directory name, and the remaining 38 characters as the filename.
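The path rule can be sketched as a write and read round trip in Python. The function names `write_loose_object` and `read_loose_object` are hypothetical, and the sketch writes into a temporary directory rather than a real repository. It assumes SHA-1 IDs and loose (unpacked) objects.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(git_dir, obj_type, content):
    """Store an object the way Git stores loose objects; return its ID."""
    store = f"{obj_type} {len(content)}".encode() + b"\x00" + content
    oid = hashlib.sha1(store).hexdigest()
    # First 2 hex characters name the directory, the other 38 name the file.
    path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return oid

def read_loose_object(git_dir, oid):
    """Decompress a loose object and split it into (type, content)."""
    path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content

with tempfile.TemporaryDirectory() as git_dir:
    oid = write_loose_object(git_dir, "blob", b"hello world\n")
    loaded = read_loose_object(git_dir, oid)
    print(oid[:2], oid[2:], loaded)
```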
# Hash: 0907f4a3c4740fa3a5c919cb4447fdb1f1a66aec
# Location: .git/objects/09/07f4a3c4740fa3a5c919cb4447fdb1f1a66aec
# Objects are compressed with zlib
python3 -c "
import zlib
with open('.git/objects/09/07f4a3c4740fa3a5c919cb4447fdb1f1a66aec', 'rb') as f:
    print(zlib.decompress(f.read()))
"
3.4 Index File (Staging Area)
# Check staging area contents
git ls-files --stage
# 100644 a1b2c3d4... 0 README.md
# 100644 d4e5f6a7... 0 src/index.ts
# Index is a binary file starting with "DIRC" signature
hexdump -C .git/index | head -3
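The first 12 bytes that hexdump shows can be decoded with a short Python sketch. The function name `parse_index_header` is hypothetical, and the sample bytes are synthetic. It assumes the documented index layout: a 4-byte `DIRC` signature followed by a 4-byte version and a 4-byte entry count, both big-endian.

```python
import struct

def parse_index_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count

# Synthetic header: version 2 index with 2 staged entries.
sample = b"DIRC" + struct.pack(">II", 2, 2)
header = parse_index_header(sample)
print(header)
```

Against a real repository the call would look like `parse_index_header(open(".git/index", "rb").read())`.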
4. DAG and Commit Graph
4.1 What is a DAG (Directed Acyclic Graph)
Git's commit history is a Directed Acyclic Graph (DAG) data structure:
- Directed: Commits point to their parents (child-to-parent direction)
- Acyclic: No circular references (a commit can never be its own ancestor)
- Graph: Composed of nodes (commits) and edges (parent references)
A ◀── B ◀── C ◀── D (main)
▲
└── E ◀── F (feature)
4.2 Visualizing the Commit Graph
# Visualize the graph
git log --oneline --graph --all
# Example output:
# * f1a2b3c (HEAD -> main) Merge branch 'feature'
# |\
# | * d4e5f6a (feature) feat: add search
# | * a7b8c9d feat: add filter
# |/
# * 1e2f3a4 initial commit
4.3 Parent Pointers
# First parent only (useful for merge commits)
git log --first-parent --oneline
# Access parent commits
git cat-file -p HEAD^1 # First parent
git cat-file -p HEAD^2 # Second parent (merge commits)
git cat-file -p HEAD~3 # 3 ancestors up
Parent reference syntax:
HEAD~1 = HEAD^ = first parent
HEAD~2 = HEAD^^ = first parent's first parent
HEAD^2 = second parent (only on merge commits)
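The difference between `~` and `^` becomes clearer on a small in-memory graph. This is a toy sketch: the commit names, the `PARENTS` table, and the `resolve` helper are all hypothetical, and only the `~n` and `^n` suffixes are handled.

```python
import re

# child -> ordered list of parents; M is a merge commit with two parents.
PARENTS = {
    "M": ["C", "E"],
    "C": ["B"], "B": ["A"],
    "E": ["D"], "D": ["A"],
    "A": [],
}

def nth_parent(commit, n):
    """commit^n: pick the n-th parent (1-based) of a single commit."""
    return PARENTS[commit][n - 1]

def nth_ancestor(commit, n):
    """commit~n: follow the first parent n times."""
    for _ in range(n):
        commit = nth_parent(commit, 1)
    return commit

def resolve(expr):
    """Resolve expressions such as 'M~2', 'M^2' or 'M^2~1'."""
    name = re.match(r"[A-Za-z]+", expr)
    commit = name.group(0)
    for op, num in re.findall(r"([~^])(\d*)", expr[name.end():]):
        n = int(num) if num else 1
        commit = nth_ancestor(commit, n) if op == "~" else nth_parent(commit, n)
    return commit

for expr in ["M~1", "M^", "M^2", "M~2", "M^^", "M^2~1"]:
    print(expr, "->", resolve(expr))
```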
5. Branches and Tags — Just Pointers
5.1 Branches Are Pointers
A Git branch is nothing more than a small file containing the 40-character SHA-1 hash of a specific commit (41 bytes including the trailing newline).
# Creating a branch = creating a file
git branch feature
# This is essentially:
# echo $(git rev-parse HEAD) > .git/refs/heads/feature
# Switching branches = modifying the HEAD file
git checkout feature
# This essentially:
# 1. Writes "ref: refs/heads/feature" to .git/HEAD
# 2. Updates working tree to match the commit's tree
# 3. Updates index to match the tree
.git/refs/heads/
├── main → d4e5f6a (commit)
├── feature → a7b8c9d (commit)
└── hotfix → 1e2f3a4 (commit)
.git/HEAD → ref: refs/heads/main
5.2 The Role of HEAD
HEAD is a pointer to the currently checked-out location.
Normal state:
HEAD → refs/heads/main → commit d4e5f6a
Detached HEAD state:
HEAD → commit a7b8c9d (directly points to a commit)
Detached HEAD warning: Commits made in this state are not referenced by any branch. After you check out another branch, they are reachable only through the reflog, and once those reflog entries expire git gc may delete them.
# Save work from detached HEAD state
git checkout -b recovery-branch
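The two HEAD states can be told apart by reading the files directly. This is a simplified Python sketch: `resolve_head` is a hypothetical helper, it works on a temporary directory that mimics `.git`, and it assumes the ref exists as a loose file (it does not consult `packed-refs`).

```python
import os
import tempfile

def resolve_head(git_dir):
    """Return (ref_name or None, commit_hash) for the given .git directory."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        with open(os.path.join(git_dir, *ref.split("/"))) as f:
            return ref, f.read().strip()   # normal state: HEAD -> branch -> commit
    return None, head                      # detached: HEAD holds a commit hash

with tempfile.TemporaryDirectory() as git_dir:
    commit = "d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3"
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
        f.write(commit + "\n")

    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")
    attached = resolve_head(git_dir)

    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write(commit + "\n")
    detached = resolve_head(git_dir)

print(attached)
print(detached)
```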
5.3 Lightweight vs Annotated Tags
# Lightweight tag: simple reference (just a commit hash in refs/tags/)
git tag v1.0.0-rc1
# Annotated tag: creates a separate tag object
git tag -a v1.0.0 -m "Release 1.0.0"
# See the difference
git cat-file -t v1.0.0-rc1 # commit
git cat-file -t v1.0.0 # tag
6. Merge Internals
6.1 Fast-Forward Merge
When main has no additional commits after the feature branch diverged:
Before:
A ◀── B ◀── C (main)
            ▲
            └── D ◀── E (feature)
After (fast-forward):
A ◀── B ◀── C ◀── D ◀── E (main, feature)
# Fast-forward merge
git checkout main
git merge feature
# "Fast-forward" message displayed
# Force a merge commit instead of fast-forward
git merge --no-ff feature
Fast-forward only moves the pointer — no new commit is created.
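A fast-forward is possible exactly when the current branch tip is an ancestor of the branch being merged. The following Python sketch models that check on a toy graph. The names `PARENTS`, `refs`, `is_ancestor`, and `fast_forward` are hypothetical.

```python
# Toy history: A <- B <- C <- D <- E, with main at C and feature at E.
PARENTS = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"]}
refs = {"main": "C", "feature": "E", "old": "B"}

def is_ancestor(ancestor, commit):
    """True if `ancestor` is reachable from `commit` by following parents."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return False

def fast_forward(branch, other):
    """Move `branch` to `other`'s tip if no merge commit is needed."""
    if not is_ancestor(refs[branch], refs[other]):
        return False            # histories diverged: a real merge is required
    refs[branch] = refs[other]  # only the pointer moves
    return True

commits_before = len(PARENTS)
moved = fast_forward("main", "feature")
refused = fast_forward("feature", "old")
print(moved, refs["main"], refused)
```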
6.2 Three-Way Merge
When both branches have new commits:
Before:
A ◀── B ◀── C ◀── F (main)
      ▲
      └── D ◀── E (feature)
After (3-way merge):
A ◀── B ◀── C ◀── F ◀── G (main) [merge commit]
      ▲                 │
      └── D ◀── E ◀─────┘ (feature)
Three-Way Merge Algorithm:
- Find the common ancestor (Merge Base): B is the common ancestor
- Calculate diffs from both branches: B to F, and B to E
- Combine changes: Auto-merge if no conflict; manual resolution if conflicted
# Find merge base
git merge-base main feature
# Output: B's SHA-1 hash
# Compare the 3 trees for merge
git diff $(git merge-base main feature) main # base vs main
git diff $(git merge-base main feature) feature # base vs feature
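The decision rule of a three-way merge can be sketched at whole-file granularity. This is a deliberate simplification: real Git also merges within a file, hunk by hunk. The function `three_way_merge` and the sample snapshots (dictionaries mapping path to content) are hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Merge path->content snapshots; return (merged, conflicting_paths)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (same edit, or both deleted)
            result = o
        elif o == b:      # only their side changed this path
            result = t
        elif t == b:      # only our side changed this path
            result = o
        else:             # both changed it differently
            conflicts.append(path)
            continue
        if result is not None:   # None means the path was deleted
            merged[path] = result
    return merged, conflicts

base   = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours   = {"a.txt": "2", "b.txt": "1", "c.txt": "ours"}
theirs = {"a.txt": "1", "b.txt": "1", "c.txt": "theirs", "d.txt": "new"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)
```

Without the base, `a.txt` would look like a conflict, since the two sides differ. The common ancestor shows that only one side changed it.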
6.3 Recursive Strategy
When two branches have more than one merge base (a criss-cross history), the recursive strategy, Git's default before 2.34, resolves the ambiguity as follows.
[Diagram: cross-merge history in which main (tip E) and feature (tip F) share more than one merge base]
In this case, Git:
- Finds multiple merge bases
- First merges the merge bases into a virtual merge
- Uses that result as the actual merge base
# Specify merge strategy
git merge -s recursive feature
git merge -s ort feature # Git 2.34+ default (Ostensibly Recursive's Twin)
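To see how a history ends up with more than one merge base, the following Python sketch computes the "best" common ancestors of two commits, meaning common ancestors that are not themselves ancestors of another common ancestor. The graphs and the helper names `ancestors` and `merge_bases` are hypothetical, and the algorithm is a plain set-based illustration rather than Git's optimized implementation.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another one."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return sorted(
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    )

# Simple fork: feature (E) branched from main at B.
simple = {"A": [], "B": ["A"], "C": ["B"], "F": ["C"], "D": ["B"], "E": ["D"]}

# Criss-cross: each side has merged the other's earlier commit.
criss_cross = {"A": [], "X": ["A"], "Y": ["A"],
               "M1": ["X", "Y"], "M2": ["Y", "X"]}

print(merge_bases(simple, "F", "E"))
print(merge_bases(criss_cross, "M1", "M2"))
```

In the criss-cross case neither X nor Y is a better base than the other, which is the situation the virtual merge of merge bases is meant to handle.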
6.4 Octopus Merge
Used when merging 3 or more branches simultaneously:
git merge feature1 feature2 feature3
A ◀── B (main)
▲
├── C (feature1)
├── D (feature2)
└── E (feature3)
→
A ◀── B ◀── M (main) [4 parents: B, C, D, E]
▲           │
├── C ◀─────┤
├── D ◀─────┤
└── E ◀─────┘
Note: Octopus merge fails if there are conflicts.
7. Rebase Internals
7.1 The Essence of Rebase: Recreating Commits
Rebase copies existing commits and recreates them on top of a new base.
Before:
A ◀── B ◀── C (main)
      ▲
      └── D ◀── E (feature)
After rebase (git checkout feature && git rebase main):
A ◀── B ◀── C (main)
            ▲
            └── D' ◀── E' (feature)
Important: D' and E' are different commits from D and E. They introduce the same changes, but each has a different parent, resulting tree, and committer timestamp, so their SHA-1 hashes differ.
7.2 Rebase Internal Steps
git checkout feature
git rebase main
What Git does internally:
- Find commits on feature that are not on main (D, E)
- Save patches to temporary storage (.git/rebase-apply/ or .git/rebase-merge/)
- Reset feature to main's latest commit (C)
- Apply saved patches one by one to create new commits (D', E')
# Check temporary files during rebase
ls .git/rebase-merge/
# done # completed commits
# git-rebase-todo # remaining commits
# head-name # original branch name
# onto # rebase target commit
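The replay step can be modeled in memory. This is a toy Python sketch under loud assumptions: commits are dictionaries, each "patch" is just a mapping of changed paths, IDs are shortened SHA-1 digests of a Python `repr` (not Git's real object format), history is linear, and conflicts are ignored. All helper names are hypothetical.

```python
import hashlib

commits = {}  # id -> {"parent": id or None, "patch": dict, "msg": str}

def make_commit(parent, patch, msg):
    """Create a commit whose ID depends on its parent, patch and message."""
    payload = repr((parent, sorted(patch.items()), msg)).encode()
    cid = hashlib.sha1(payload).hexdigest()[:7]
    commits[cid] = {"parent": parent, "patch": patch, "msg": msg}
    return cid

def commits_between(upstream, tip):
    """Commits reachable from tip but not from upstream, oldest first."""
    on_upstream = set()
    current = upstream
    while current is not None:
        on_upstream.add(current)
        current = commits[current]["parent"]
    todo = []
    current = tip
    while current is not None and current not in on_upstream:
        todo.append(current)
        current = commits[current]["parent"]
    return todo[::-1]

def rebase(upstream, tip):
    """Re-create each commit on top of upstream; return the new tip."""
    new_tip = upstream
    for cid in commits_between(upstream, tip):
        old = commits[cid]
        new_tip = make_commit(new_tip, old["patch"], old["msg"])
    return new_tip

a = make_commit(None, {"f": "1"}, "A")
b = make_commit(a, {"g": "1"}, "B")
c = make_commit(b, {"h": "1"}, "C")   # main
d = make_commit(b, {"x": "1"}, "D")
e = make_commit(d, {"y": "1"}, "E")   # feature

e_new = rebase(c, e)
d_new = commits[e_new]["parent"]
print(d, "->", d_new)
print(e, "->", e_new)
```

The original D and E stay in `commits` after the replay, which mirrors why the pre-rebase commits remain recoverable for a while.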
7.3 Interactive Rebase
git rebase -i HEAD~3
# Contents of .git/rebase-merge/git-rebase-todo:
pick a1b2c3d feat: add login
pick d4e5f6a feat: add signup
pick 7b8c9d0 fix: typo in login
# Commands:
# pick = use commit
# reword = change message
# edit = modify commit then continue
# squash = merge with previous (combine messages)
# fixup = merge with previous (discard message)
# drop = remove commit
7.4 Rebase vs Merge Comparison
| Aspect | Merge | Rebase |
|---|---|---|
| History | Preserves branches (non-linear) | Linear |
| Existing commits | Unchanged | New commits created |
| Conflict resolution | Once | Per commit |
| Shared branches | Safe | Dangerous (force push needed) |
| Merge commit | Created | None |
Golden rule: Never rebase commits that have already been pushed to a shared branch. Others may have based their work on those commits, and the rewritten history forces them to reconcile diverged branches.
8. Cherry-Pick and Revert
8.1 Cherry-Pick Internals
Cherry-pick applies only the changes from a specific commit onto the current branch.
git cherry-pick d4e5f6a
Internal operation:
- Calculate the diff between the target commit (d4e5f6a) and its parent
- Apply that diff to the current HEAD
- Create a new commit (same message, different SHA-1)
Before:
A ◀── B ◀── C (main)
      ▲
      └── D ◀── E (feature)
git checkout main && git cherry-pick E:
A ◀── B ◀── C ◀── E' (main) [only E's changes copied]
      ▲
      └── D ◀── E (feature)
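The "diff against the parent, then apply to HEAD" idea can be sketched with snapshots modeled as dictionaries. This is a simplified illustration: the helpers `diff`, `apply_changes`, and `cherry_pick` are hypothetical, changes are tracked per whole file, and conflicts are not handled (real Git performs a three-way merge here).

```python
def diff(old, new):
    """Map each changed path to its new content (None means deleted)."""
    return {p: new.get(p) for p in set(old) | set(new) if old.get(p) != new.get(p)}

def apply_changes(snapshot, changes):
    """Return a copy of snapshot with the changes applied."""
    result = dict(snapshot)
    for path, content in changes.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result

def cherry_pick(head, commit, parent):
    """Apply only what `commit` changed relative to its own parent."""
    return apply_changes(head, diff(parent, commit))

D = {"app.py": "v1", "search.py": "draft"}                      # E's parent
E = {"app.py": "v1", "search.py": "draft", "fix.py": "patch"}   # picked commit
C = {"app.py": "v1", "docs.md": "main only"}                    # HEAD on main

E_prime = cherry_pick(C, E, D)
print(E_prime)
```

Only `fix.py` arrives on main. `search.py` was introduced by D, so it is not part of E's own change.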
8.2 Revert Internals
Revert creates a new commit that applies the reverse patch of a specific commit.
git revert d4e5f6a
Before:
A ◀── B ◀── C (main)
git revert B:
A ◀── B ◀── C ◀── B' (main) [new commit undoing B's changes]
Revert does not rewrite history, making it safe for shared branches.
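A revert can be modeled as appending one more snapshot rather than removing one. In this toy Python sketch, history is a list of path-to-content dictionaries, `revert` is a hypothetical helper, and conflicts with later commits are ignored.

```python
def revert(history, index):
    """Append a snapshot that undoes the change introduced by history[index]."""
    parent = history[index - 1] if index > 0 else {}
    target = history[index]
    head = dict(history[-1])
    for path in set(parent) | set(target):
        if parent.get(path) != target.get(path):   # path was touched by target
            if path in parent:
                head[path] = parent[path]          # restore the earlier content
            else:
                head.pop(path, None)               # file was added: remove it
    return history + [head]

A = {"README.md": "hello"}
B = {"README.md": "hello", "debug.log": "oops"}
C = {"README.md": "hello", "debug.log": "oops", "app.py": "print('hi')"}

history = revert([A, B, C], 1)   # undo what B introduced
print(len(history))
print(history[-1])
```

The first three snapshots are untouched, and the work added by C survives in the new tip.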
8.3 Reverting Merge Commits
# When reverting a merge commit, specify which parent to use as reference
git revert -m 1 MERGE_COMMIT_HASH
# -m 1: treat the first parent (main) as the mainline, undoing what the merge brought in from feature
# -m 2: treat the second parent (feature) as the mainline, undoing what came from main
9. Reflog — The Safety Net
9.1 What is Reflog
Reflog records all changes to HEAD and branch references. It is the essential tool for recovery when you accidentally delete commits or mess up history with rebase.
# View reflog
git reflog
# d4e5f6a HEAD@{0}: commit: feat: add auth
# a1b2c3d HEAD@{1}: checkout: moving from feature to main
# 7b8c9d0 HEAD@{2}: commit: feat: add search
# 1e2f3a4 HEAD@{3}: rebase finished
# ...
# Reflog for a specific branch
git reflog show feature
9.2 Recovery Scenarios Using Reflog
Scenario 1: Accidental hard reset
# Mistake!
git reset --hard HEAD~3
# Find previous state in reflog
git reflog
# a1b2c3d HEAD@{1}: previous state
# Recover
git reset --hard a1b2c3d
Scenario 2: Undo rebase
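# Pre-rebase state is in the reflog
git reflog
# ... HEAD@{5}: rebase (start): checkout main
# ... HEAD@{6}: commit: (last commit made before the rebase)
# Recover to pre-rebase state: the entry just before "rebase (start)"
git reset --hard HEAD@{6}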
Scenario 3: Recover deleted branch
# Delete branch
git branch -D feature
# Find the branch's last commit in reflog
git reflog | grep feature
# Recreate branch
git branch feature a1b2c3d
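The recovery scenarios all come down to reading old positions out of the reflog, which Git keeps as plain text under `.git/logs/`. This is a hedged Python sketch of parsing that file. The `parse_reflog` helper and the sample log are hypothetical. It assumes the usual line layout: old ID, new ID, identity with timestamp, then a tab and the message.

```python
def parse_reflog(text):
    """Parse reflog lines; return entries newest first (index n = HEAD@{n})."""
    entries = []
    for line in text.splitlines():
        meta, _, message = line.partition("\t")
        old, new = meta.split(" ")[:2]
        entries.append({"old": old, "new": new, "message": message})
    return entries[::-1]   # the file is append-only, so the newest line is last

ZERO = "0" * 40
A, B = "a" * 40, "b" * 40   # stand-ins for real commit IDs
who = "Kim <kim@example.com> 1711234567 +0900"

log = (
    f"{ZERO} {A} {who}\tcommit (initial): initial commit\n"
    f"{A} {B} {who}\tcommit: feat: add auth\n"
    f"{B} {A} {who}\treset: moving to HEAD~1\n"
)

entries = parse_reflog(log)
lost_commit = entries[0]["old"]   # where HEAD pointed just before the reset
print(entries[0]["message"])
print(lost_commit)
```

Against a real repository, the input would be the contents of `.git/logs/HEAD`.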
9.3 Reflog Expiry
# Default expiry periods
# Reachable entries: 90 days
# Unreachable entries: 30 days
# Configure expiry
git config gc.reflogExpire "180 days"
git config gc.reflogExpireUnreachable "60 days"
# Manual expiry
git reflog expire --expire=now --all
10. Pack Files and Garbage Collection
10.1 Loose Objects vs Packed Objects
Git initially stores each object as an individual file (loose object). When objects accumulate, they are compressed into pack files.
# Count loose objects
find .git/objects -type f | grep -v 'pack\|info' | wc -l
# View pack files
ls .git/objects/pack/
# pack-a1b2c3d4e5f6.idx (index)
# pack-a1b2c3d4e5f6.pack (data)
10.2 Delta Compression
Pack files use delta compression: an object that closely resembles another is stored as a delta (only the differences) against that base object.
# Inspect pack file contents
git verify-pack -v .git/objects/pack/pack-*.idx
# SHA-1 type size size-in-pack offset depth base-SHA-1
# a1b2c3d blob 10240 3521 12 0
# d4e5f6a blob 10245 45 1200 1 a1b2c3d # delta!
In the example above, d4e5f6a stores only the 45-byte difference from a1b2c3d.
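The idea behind a delta can be sketched as a list of copy and insert instructions against a base. This is a simplified Python illustration, not Git's binary delta encoding. It uses `difflib.SequenceMatcher` from the standard library to find shared runs, and the helper names `make_delta` and `apply_delta` are hypothetical.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes):
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))    # bytes not found in base
        # "delete" needs no instruction: the bytes are simply not copied
    return ops

def apply_delta(base: bytes, ops):
    """Rebuild the target from the base and the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

base = b"".join(b"line %d\n" % i for i in range(200))
target = base.replace(b"line 100\n", b"line one hundred\n")

delta = make_delta(base, target)
inserted_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(target), inserted_bytes)
```

Only the literal inserts carry new data. The copy instructions are a few numbers each, which is why near-identical versions of a file pack so well.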
10.3 git gc (Garbage Collection)
# Run garbage collection
git gc
# What it does:
# 1. Compresses loose objects into pack files
# 2. Removes unreachable objects
# 3. Cleans up reflog
# 4. Compresses refs into packed-refs
# More aggressive GC
git gc --aggressive --prune=now
# View GC statistics
git count-objects -v
# count: 0 (loose objects)
# size: 0 (loose objects size, KB)
# in-pack: 1234 (packed objects)
# packs: 1 (number of pack files)
# size-pack: 5678 (pack file size, KB)
# prune-packable: 0
# garbage: 0
10.4 Large File Issues and Solutions
# Find the largest files in the repository
git rev-list --objects --all | \
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
sed -n 's/^blob //p' | sort -rnk2 | head -10
# Remove large files with BFG Repo-Cleaner
bfg --strip-blobs-bigger-than 100M
# git filter-repo (recommended)
git filter-repo --strip-blobs-bigger-than 100M
11. Plumbing vs Porcelain Commands
Git commands are divided into two layers.
11.1 Porcelain — User-Friendly
High-level commands used daily:
git add, git commit, git push, git pull
git branch, git checkout, git merge, git rebase
git log, git diff, git status
git stash, git tag, git remote
11.2 Plumbing — Low-Level
Low-level commands used internally by Git:
# Object manipulation
git hash-object # Compute and store object hash
git cat-file # Inspect object content/type/size
git write-tree # Create tree object from index
git commit-tree # Create commit object from tree
git update-ref # Update references
# Index manipulation
git update-index # Add files to index
git ls-files # List index contents
git read-tree # Read tree into index
# Transfer
git pack-objects # Create pack files
git unpack-objects # Unpack pack files
git send-pack # Send objects
git receive-pack # Receive objects
11.3 Creating a Commit with Plumbing Only
The process of creating a commit without porcelain commands:
# 1. Create blob and capture its full hash
blob=$(echo "Hello World" | git hash-object -w --stdin)
# 2. Add to index (--cacheinfo expects the full object ID, not an abbreviation)
git update-index --add --cacheinfo 100644,"$blob",hello.txt
# 3. Create tree from the index
tree=$(git write-tree)
# 4. Create commit (no -p option, so this is a root commit)
commit=$(echo "first commit" | git commit-tree "$tree")
# 5. Point branch to this commit
git update-ref refs/heads/main "$commit"
12. Interview Questions (10 Questions)
Q1. What happens internally when you run git add?
Model answer: The file content is hashed with SHA-1 and stored as a blob object in .git/objects/. Then the blob's hash and file path are recorded in .git/index (staging area). The previous version remains intact; only the new version is added.
Q2. Explain the internal structure of a Git branch.
Model answer: A branch is a 41-byte file in .git/refs/heads/ (a 40-character SHA-1 hash plus a newline), unless it has been packed into .git/packed-refs. It contains only the hash of the commit the branch points to. Switching branches modifies the HEAD file and updates the working tree and index to match that commit's tree.
Q3. Explain the internal differences between merge and rebase.
Model answer: Merge creates a new merge commit using a three-way merge of both branches' latest commits and their common ancestor. Rebase copies current branch commits and recreates them on top of the target branch one by one. Since rebase creates new commits, the SHA-1 hashes change.
Q4. What is detached HEAD and why is it dangerous?
Model answer: Detached HEAD occurs when HEAD points directly to a commit instead of through a branch. Commits made in this state are not referenced by any branch. When you check out another branch, those commits become unreachable and may be removed by git gc.
Q5. How do you recover deleted commits using reflog?
Model answer: Use git reflog to view the history of HEAD changes and find the hash of the commit to recover. Then use git reset --hard HASH or git checkout -b recovery HASH. Reflog retains entries for 30 days (unreachable) to 90 days (reachable) by default.
Q6. What are pack files and why are they needed?
Model answer: Git initially stores each object as an individual file (loose object), but file system performance degrades as objects accumulate. Pack files compress multiple objects into a single file and use delta compression to store only differences between similar objects. They are automatically created during git gc or git push.
Q7. What impact do SHA-1 collisions have on Git, and how is it being addressed?
Model answer: A SHA-1 collision means different content produces the same hash, breaking data integrity. The 2017 SHAttered attack demonstrated this. Git responded by adding collision detection logic and transitioning to SHA-256. Git 2.29+ supports --object-format=sha256.
Q8. Explain the internal workings of git clone --depth 1.
Model answer: Shallow clone fetches only part of the history. The server sends a pack file containing only the latest commit and its required trees and blobs. The .git/shallow file records the shallow boundary commits, making earlier history inaccessible.
Q9. Explain the three-way merge algorithm.
Model answer: Find the common ancestor (merge base) of both branches, then calculate the diff from the base to each branch. Changes on only one side are applied automatically. When both sides modify the same section differently, it is marked as a conflict. Git uses git merge-base to find the common ancestor.
Q10. Describe the process of creating a commit using only plumbing commands.
Model answer: (1) Create a blob with git hash-object -w, (2) add to index with git update-index, (3) create a tree with git write-tree, (4) create a commit with git commit-tree, (5) update the branch reference with git update-ref.
13. Quiz (5 Questions)
Q1. Which of the following is NOT a Git object type? (a) blob (b) tree (c) branch (d) commit (e) tag
Answer: (c) branch
A branch is not a Git object. It is a reference stored as a text file in .git/refs/heads/. Git has four object types: blob, tree, commit, and tag.
Q2. When do git rebase main and git merge main produce the same result?
Answer: When fast-forward is possible
When the current branch has no commits of its own that main lacks (its tip is an ancestor of main), merge performs a fast-forward and rebase has nothing to replay. Both simply move the branch to main's tip and yield the same linear history.
Q3. How do you recover a deleted commit after git reset --hard HEAD~1?
Answer: Find the deleted commit's hash in git reflog and recover with git reset --hard HASH or git branch recovery HASH. The reflog records all HEAD changes, so the pre-reset commit hash is preserved.
Q4. If 3 files have identical content but different names, how many blobs does Git store?
Answer: 1
Git uses content-addressable storage, so the SHA-1 hash of the content is the key. Identical content produces the same hash, so only 1 blob is stored. Filenames and paths are stored in tree objects.
Q5. When can a merge commit have 3 or more parents?
Answer: Octopus merge
Running git merge branch1 branch2 branch3 performs an octopus merge, creating a merge commit with 3 or more parents. However, octopus merge fails if there are conflicts.
14. References
- Pro Git Book - Git Internals — Official documentation
- Git from the Bottom Up — John Wiegley
- How Git Works Internally — GitHub Blog
- Git Object Model — Official documentation
- SHAttered Attack — SHA-1 collision demonstration
- Git SHA-256 Transition — SHA-256 transition plan
- Merge Strategies in Git — Official documentation
- Git Rebase Documentation — Official documentation
- Git Internals - Transfer Protocols
- Unpacking Git Packfiles — Recurse Center
- Git Delta Compression — Matthew McCullough
- Think Like a Git — Understanding Git through DAG and graph theory