Systematic Debugging — Finding Bugs by Reasoning, Not Guessing

Author: Youngju Kim (@fjvbn20031)
Prologue — Most Debugging Is Guessing
A bug report comes in. The developer opens the code. "This part looks suspicious," they think, and drop in a console.log. Rebuild, reproduce, no log fires — so they drop another one somewhere else. Thirty minutes later there are log statements scattered all over the code and the bug is still right there.
This is 90% of debugging as it actually happens in the field. Guessing with no structure. "Maybe it's this?" "Maybe it's that?" If you're lucky, you find it fast. If you're not, you burn a day. And even after you find it, you can't explain exactly why it was a bug. If you can't explain it, the same bug comes back.
Debugging is not luck. It's a methodology. A good debugger (the person) isn't smarter — they narrow the search space systematically. Instead of randomly poking at 100 possibilities, they eliminate 50 with a single experiment. Then 25. Then 12. This is not talent; it's a trainable skill.
This post lays out that skill: the core loop, securing reproducibility first, applying the scientific method to bugs, binary search, reading stack traces properly, observability tooling, the hard bug classes, getting unstuck, and debugging with AI agents. It's language- and stack-agnostic — it applies anywhere there's a bug.
The one-liner: debugging is forming a hypothesis and designing the experiment that disproves it most efficiently. It is not staring at code.
Chapter 1 · The Core Loop — Reproduce → Isolate → Hypothesize → Test → Fix → Verify
Every systematic debugging session runs the same loop. The order matters. Skip a step and you fall back into guessing.
┌─────────────────────────────────────────────────┐
│ 1. Reproduce │
│ Can you trigger the bug with one command? │
│ ↓ │
│ 2. Isolate │
│ Halve the region the bug lives in │
│ ↓ │
│ 3. Hypothesize │
│ "It's caused by X" — one testable sentence │
│ ↓ │
│ 4. Test │
│ Design and run an experiment that disproves │
│ ↓ │
│ 5. Fix │
│ Fix the root cause (not the symptom) │
│ ↓ │
│ 6. Verify │
│ Does the repro case pass now? Regression? │
└─────────────────────────────────────────────────┘
↑________________________________│
If the hypothesis was wrong, go back to 3
The key to each step:
| Step | Question to ask | Common mistake |
|---|---|---|
| Reproduce | "Can I trigger it again with one command?" | Starting to guess with no repro |
| Isolate | "Did I halve the scope?" | Staring at the whole codebase at once |
| Hypothesize | "Is it one testable sentence?" | A vague hypothesis like "something's off" |
| Test | "Can this experiment disprove the hypothesis?" | An experiment that only confirms |
| Fix | "Did I fix the cause, not the symptom?" | Covering the symptom with try/catch |
| Verify | "Does the repro case pass?" | Eyeballing it and saying "looks fixed" |
The power of this loop is that each step shrinks the next one. A solid repro makes isolation faster. Good isolation narrows the hypothesis. A narrow hypothesis means the test ends in one shot. Conversely, if you skip reproduction, everything else is built on a guess.
Chapter 2 · Reproducibility First — Deterministic Repro and Minimizing the Case
You cannot debug a bug you cannot reproduce, because you have no way to confirm you fixed it. So the first step is always triggering the bug again with a single command.
Building a Deterministic Repro
"It happens sometimes" is an un-debuggable state. You have to turn "sometimes" into "always." Pin down the sources of non-determinism one by one.
| Source of non-determinism | How to pin it |
|---|---|
| Time / date | Inject the clock and use a fixed value (mock clock.now()) |
| Randomness | Fix the seed (seed=42) |
| Concurrency / scheduling | Set thread count to 1, or force the schedule |
| Network responses | Record/replay responses (VCR, fixtures) |
| Input data | Save the exact failing input to a file |
| Environment variables / config | Capture the whole .env |
| Build / dependency versions | Pin the lockfile, pin the container image |
The goal is a line like this:
# This command always produces the same failure
SEED=42 TZ=UTC ./repro.sh fixtures/bug-1234.json
Even if it takes 30 minutes to build that one line, those 30 minutes almost always pay for themselves. Once you have a repro command, every remaining step becomes measurable.
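As a sketch of what "pinning" can look like in code (TypeScript here; the `Clock` interface, `createSession`, and the use of the well-known mulberry32 PRNG are illustrative, not tied to any particular framework): the code under test takes the clock and the random source as parameters, so the repro can fix both.

```typescript
// Illustrative sketch: inject the clock and seed the randomness so every
// repro run behaves identically. Names here are hypothetical.
interface Clock {
  now(): Date;
}

// Fixed clock for the repro; the production implementation would wrap Date.
const fixedClock: Clock = { now: () => new Date("2024-01-01T00:00:00Z") };

// mulberry32: a tiny seeded PRNG, so seed 42 yields the same sequence every run.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Code under test accepts the clock and RNG instead of calling Date.now()
// and Math.random() directly, so the repro can pin them.
function createSession(clock: Clock, rand: () => number) {
  return { id: Math.floor(rand() * 1e9), createdAt: clock.now() };
}

console.log(createSession(fixedClock, mulberry32(42))); // identical on every run
```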
Minimizing the Case
Once it reproduces, the next move is to shrink the repro case to the minimum. If a 1000-line input triggers the bug, see whether deleting 990 of those lines still triggers it. If it does, those 990 lines are irrelevant.
This is the core idea of "delta debugging." Halve the input; if it still fails, halve that half again. When the failure disappears, back up one step.
input 1000 lines → fails ✗
input first 500 → fails ✗ (drop the back 500)
input first 250 → passes ✓ (back up one step)
input first 375 → fails ✗
input first 312 → fails ✗
...
input 3 lines → fails ✗ ← minimal repro case
A 3-line repro case is overwhelmingly easier to debug than a 1000-line one. With the noise stripped out, the remaining 3 lines are the clue. For C/C++ compiler bugs there's creduce; for plain text there's manual binary deletion; property-based testing tools have automatic shrinking built in.
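For plain text or structured inputs, the halving idea is also simple enough to script yourself. Below is a rough TypeScript sketch of a greedy minimizer in that spirit (not the full ddmin algorithm); `stillFails` stands in for the repro command from earlier in this chapter.

```typescript
// Greedy input minimization: try to delete chunks of the input and keep any
// deletion after which the failure still reproduces. `stillFails` is a
// placeholder for "run the repro command against this candidate input".
function minimize(
  lines: string[],
  stillFails: (candidate: string[]) => boolean
): string[] {
  let current = lines;
  let chunk = Math.max(1, Math.floor(current.length / 2));

  while (chunk >= 1) {
    let removedSomething = false;

    for (let start = 0; start < current.length; start += chunk) {
      // Candidate: the current input with one chunk of lines removed.
      const candidate = [
        ...current.slice(0, start),
        ...current.slice(start + chunk),
      ];
      if (candidate.length > 0 && stillFails(candidate)) {
        current = candidate;  // that chunk was irrelevant, drop it for good
        start -= chunk;       // re-test the same position in the shorter input
        removedSomething = true;
      }
    }

    // Only move to smaller chunks once a full pass removes nothing.
    if (!removedSomething) chunk = Math.floor(chunk / 2);
  }

  return current; // a locally minimal repro case
}
```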
The process of building a minimal repro case is debugging. While deleting those 990 lines, you see which line, when removed, makes the bug vanish — and that's the location of the cause.
Chapter 3 · The Scientific Method, Applied to Bugs — One Variable at a Time
The essence of systematic debugging is the scientific method: observe → hypothesize → experiment → conclude. And the most important rule of the scientific method: change one variable at a time.
Bad Debugging vs Good Debugging
Bad debugging:
"Hmm, let me turn off the cache, raise the log level,
bump the timeout, and try a different version of this library."
→ The bug disappears. But you don't know what made it disappear.
→ Which of the 4 was it? No idea. It may come back if you re-enable.
Good debugging:
Hypothesis: "It's the cache."
Experiment: Turn off only the cache. Leave everything else as-is.
Outcome A: bug gone → cache is the cause (or interacts with it)
Outcome B: bug remains → cache is irrelevant. Reject, next hypothesis.
When you change several variables at once, even a clear result tells you nothing about which variable was responsible. It's an experiment that yields no information. Change one variable and any outcome is informative: either that variable is the cause, or it isn't.
Aim to Disprove
Confirmation bias is debugging's enemy. Once you believe "X is the cause," you only look for evidence that confirms X. A good experiment is designed to disprove the hypothesis.
| Hypothesis | Confirming experiment (weak) | Disproving experiment (strong) |
|---|---|---|
| "A null pointer is the cause" | Log at that spot | Guarantee that variable is never null — does the bug vanish? |
| "It's a race condition" | Stare at the concurrent code | Add a coarse lock — does it vanish? Then it's a race |
| "This PR broke it" | Read the PR diff | Revert that PR — does the bug vanish? |
If the hypothesis survives (disproof fails), your confidence rises. If it dies, you move on. Either way the search space shrinks.
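As a concrete sketch of the first row of that table (TypeScript, with hypothetical names): temporarily guarantee that the suspect value can never be null and see whether the bug vanishes. The guard is scaffolding for the experiment, not the fix.

```typescript
// Disproving experiment for the hypothesis "a null user is the cause".
interface User { id: string; name: string }

const FALLBACK_USER: User = { id: "debug-fallback", name: "(fallback)" };

function getUserName(user: User | null | undefined): string {
  const safeUser = user ?? FALLBACK_USER; // force the hypothesis to be false
  // Bug vanishes → the null value was (part of) the cause; now find where it enters.
  // Bug remains  → the hypothesis is rejected; remove this guard and move on.
  return safeUser.name;
}
```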
What Changed?
If something that used to work stopped working, the most powerful question is "What changed?" Code, data, config, dependencies, infrastructure, traffic — if a healthy system stopped, something changed.
"It worked yesterday, not today" checklist:
□ Code — git log, recent deploys
□ Config — env vars, feature flags, secrets
□ Data — a new edge-case input arrived
□ Dependencies — lockfile change, auto-updates
□ Infrastructure — instance swap, OS patch, cert expiry
□ Traffic — volume/pattern shift, a new client
□ Time — month-end, midnight, DST, leap year, epoch boundary
Chapter 4 · Binary Search — git bisect and Binary Search Everywhere
Binary search is the single most powerful technique in debugging. It cuts N candidates down in log2(N) steps. Even with 1000 candidates, you're done in 10 steps. Binary search applies in three places: commit history, input, and the code path.
git bisect — Binary Search the Commit History
If "it worked before and doesn't now," some commit broke it. git bisect finds that commit in log2(commit count) steps. Even across 500 commits, you only test 9 times.
# Start the session
git bisect start
git bisect bad # the current commit is broken
git bisect good v2.3.0 # it worked at this tag
# git checks out a middle commit. Test and judge:
#   ./repro.sh passes → git bisect good
#   ./repro.sh fails  → git bisect bad
# Repeat ~9 times and git prints the culprit:
#
# a1b2c3d is the first bad commit
# Author: ...
# Date: ...
# refactor: switch session store to redis
git bisect reset # return to the original branch
The key is automation. If the repro command reports its result through its exit code (0 when the repro case passes, non-zero when the bug reproduces), no human needs to repeat the loop:
git bisect start HEAD v2.3.0
git bisect run ./repro.sh
# git binary-searches the whole range itself and spits out the culprit commit
git bisect run is exactly where the deterministic repro command from Chapter 2 shows its worth. With one repro command, debugging across 500 commits collapses into a single line.
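What that repro command might look like on the inside, sketched in TypeScript (the fixture path matches the Chapter 2 example, while `parseOrders` and its module are hypothetical stand-ins): it communicates only through its exit code, which is exactly what `git bisect run` needs.

```typescript
// repro.ts: a repro script that git bisect run can drive automatically.
// Exit 0 = repro case passes (good commit), exit 1 = bug reproduces (bad commit).
// git bisect also treats exit code 125 as "cannot test this commit, skip it".
import { readFileSync } from "node:fs";
import { parseOrders } from "./orders"; // hypothetical module under test

const input = JSON.parse(readFileSync("fixtures/bug-1234.json", "utf8"));

try {
  const result = parseOrders(input);
  if (result.total < 0) {
    console.error("[bug-1234] reproduced: negative total", result.total);
    process.exit(1);
  }
  process.exit(0);
} catch (err) {
  console.error("[bug-1234] reproduced: threw", err);
  process.exit(1);
}
```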
Binary Search the Input
The case minimization from Chapter 2 is binary search on the input. Delete half of a 1000-line input. Which of 10,000 records breaks the pipeline — feed them in by halves. You find it in 14 steps.
Binary Search the Code Path
When you don't know which function the bug originates in, plant a check at the midpoint of the code path. If the processing flow is A → B → C → D → E, inspect the state at point C.
A → B → [C: state OK?] → D → E
├─ OK → bug is between C and E (inspect D)
└─ broken → bug is between A and C (inspect B)
You're asking "is the data still intact at the midpoint?" If it's intact, the bug is downstream; if broken, upstream. Each check halves the code path. This is the difference between blindly scattering console.log and systematic print debugging.
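A minimal sketch of that midpoint check (TypeScript; the stage names and the `State` shape are made up for illustration):

```typescript
// Halving a code path A → B → C → D → E with one check at the midpoint.
type State = { total: number; items: string[] };

const stageA = (s: State): State => ({ ...s });                        // A: normalize
const stageB = (s: State): State => ({ ...s, total: s.total * 0.9 });  // B: discount
const stageC = (s: State): State => ({ ...s, total: s.total * 1.1 });  // C: tax
const stageD = (s: State): State => ({ ...s, total: s.total + 5 });    // D: shipping
const stageE = (s: State): string => `total: ${s.total.toFixed(2)}`;   // E: render

function pipeline(input: State): string {
  const afterC = stageC(stageB(stageA(input)));

  // Midpoint check: is the state still intact after C?
  //   intact → the bug is downstream (D or E); put the next check after D
  //   broken → the bug is upstream (A, B, or C); put the next check after B
  console.assert(afterC.total >= 0, "[bug-1234] total already broken after C:", afterC);

  return stageE(stageD(afterC));
}

console.log(pipeline({ total: 100, items: ["book"] }));
```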
Chapter 5 · Reading Stack Traces and Error Messages Properly
A stack trace is the most precise clue you get for free. Yet many developers read only the top line and close it. A stack trace has a way it should be read.
Anatomy of a Stack Trace
TypeError: Cannot read properties of undefined (reading 'id') ← (1) exception type + message
at getUserName (src/user.js:42:18) ← (2) where it was thrown (deepest frame)
at formatProfile (src/profile.js:17:9) ← (3) the caller
at renderPage (src/page.js:88:14) ← (4) its caller
at processRequest (src/server.js:120:5) ← (5) toward the entry point
Reading order:
- Exception type and message (1) — `TypeError`, `Cannot read properties of undefined`. It even tells you what was `undefined` (`reading 'id'`). Read the message to the end.
- The deepest frame (2) — `src/user.js:42`. The line where the exception was actually thrown. 90% of the time the cause is near here. But "near" is not "exactly there" — line 42 blew up, but the `undefined` may have entered because of a caller.
- Walking down the call stack (3, 4, 5) — trace where the `undefined` value originated. If the argument that entered `getUserName` was already `undefined`, look at `formatProfile`; if it was `undefined` there too, look at `renderPage`. This is the same as binary search on the code path from Chapter 4.
- Distinguish your code from library code — `node_modules/` frames can usually be skipped. The boundary frame between your code and the library is the key. That's where you passed a bad value into the library.
Read the Error Message to the End
The most common mistake: reading only the first line of an error message and starting to guess. The message usually carries information that's close to the answer.
| Message clue | What people often miss |
|---|---|
| `ECONNREFUSED 127.0.0.1:5432` | The port number — tells you which service |
| `expected 3 arguments but got 2` | The exact count — narrows which call |
| `... at line 42 column 18` | The column number — exact spot on the same line |
| `Caused by: ...` (nested exception) | The real cause is the bottom-most `Caused by` |
| Warning logs (`WARN`) | A warning logged before the error is a clue |
In async code the stack trace breaks. async/await stitches it reasonably well, but callback- and event-based code leaves "where it was called from" out of the trace. In that case you have to catch the error at the entry point and log the context (request ID, input) alongside it.
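One way to do that, sketched with an Express-style handler (the route, `handleOrder`, and the log shape are illustrative): issue a request ID at the entry point, and when anything below it throws, log the ID and the input together so the truncated trace can still be tied back to a concrete request.

```typescript
// Catch at the entry point and log the context alongside the (possibly
// truncated) async stack trace.
import express from "express";
import { randomUUID } from "node:crypto";

const app = express();
app.use(express.json());

// Hypothetical business logic somewhere deep in async code.
async function handleOrder(body: unknown): Promise<{ ok: boolean }> {
  return { ok: true };
}

app.post("/orders", async (req, res) => {
  const requestId = randomUUID();
  try {
    res.json(await handleOrder(req.body));
  } catch (err) {
    // Even if the stack lost its caller frames, we know which request and
    // which input to replay.
    console.error(JSON.stringify({ requestId, input: req.body }), err);
    res.status(500).json({ error: "internal error", requestId });
  }
});

app.listen(3000);
```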
Chapter 6 · Observability for Debugging — Logs, Traces, Metrics, Debuggers, and Print Done Right
Tools are windows into the search space. Each tool shows you something different.
| Tool | What it shows | Strength | Weakness |
|---|---|---|---|
| Debugger (breakpoints) | The whole state at one instant | Freely explore variables and call stack | Changes timing; unfit for distributed/prod |
| print / logs | How a specific value changes over time | Works anywhere, low timing impact | You must decide what to print up front |
| Structured logs | A queryable stream of events | Post-mortem analysis, production | Requires design |
| Distributed traces | A request flow crossing service boundaries | "Which service is slow/broken" | Requires instrumentation |
| Metrics | Aggregates over time (latency, error rate) | Trend and anomaly detection | Can't see individual cases |
| Core dumps / profilers | The moment of a crash, or resource usage | Memory and performance bugs | Analysis needs expertise |
Try the Debugger First
Before falling into print debugging, first ask: can I attach a debugger? A debugger shows you every variable at that instant with a single breakpoint. Print shows only the variables you picked in advance. For a bug that reproduces locally, the debugger is almost always faster. Conditional breakpoints (stop only when i == 9999), watch expressions, call-stack navigation — not using these is leaving value on the table.
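When the IDE setup is awkward, a guarded `debugger` statement gives a similar effect in Node/TypeScript: with an inspector attached it pauses only on the iteration under suspicion (the loop below is purely illustrative, and the statement is a no-op without an inspector).

```typescript
const items = Array.from({ length: 20_000 }, (_, i) => i);

function processItem(item: number): number {
  return item * 2; // stand-in for the real work
}

for (let i = 0; i < items.length; i++) {
  if (i === 9999) debugger; // conditional breakpoint: stop only when i == 9999
  processItem(items[i]);
}
```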
Print Debugging Done Right
In situations where you can't use a debugger (production, distributed, timing-sensitive), print is the answer. But systematically, as in Chapter 4:
Bad print debugging:
Spray log("here1"), log("here2") at every suspicious spot
→ Log bomb. Unclear what you're looking for.
Good print debugging:
- Form a hypothesis first: "at point C, user.id is already null"
- Print only the value that tests that hypothesis: log("at C, user.id =", user?.id)
- Start at the midpoint of the code path (binary search)
- Attach an identifiable tag: log("[bug-1234] at C:", ...)
- Remove them all when done (or gate them behind a log level)
Print debugging is not shameful. Unplanned print is. A print that tests a hypothesis is a legitimate experiment.
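Put together, a hypothesis-testing print might look like this (TypeScript; the `User` shape, the `[bug-1234]` tag, and the env-var gate are illustrative):

```typescript
// One tagged line at the midpoint of the path, printing only the value the
// hypothesis is about, gated so it stays out of normal runs.
const DEBUG_BUG_1234 = process.env.DEBUG_BUG_1234 === "1";

interface User { id: string | null; name: string }

function formatProfile(user: User | undefined): string {
  if (DEBUG_BUG_1234) {
    // Hypothesis: "at point C, user.id is already null"
    console.log("[bug-1234] at C, user.id =", user?.id);
  }
  return user?.name ?? "(unknown)";
}
```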
Production: Logs, Traces, and Metrics Working Together
A production bug usually uses all three together. Metrics tell you "when did the error rate climb" (narrowing the time window), traces tell you "which service breaks" (narrowing the space), and logs tell you "exactly what value inside that service caused it" (pinning the cause). Start wide, narrow down — the same structure as the isolation step in Chapter 1.
Chapter 7 · The Hard Bug Classes — Heisenbugs, Races, Memory, "Works on My Machine"
Some bugs resist a straight application of the core loop, because they won't reproduce or they vanish when observed. Each class needs a different strategy.
Heisenbugs — Bugs That Vanish When Observed
A bug that disappears when you set a breakpoint or add a log. Usually the cause is timing or uninitialized memory. One print line changes the timing, or a debug build zeroes out the memory.
Strategy: keep your observation tools from changing the timing. Async or buffered logging, or a post-mortem core dump. Reproduce on the release build, not the debug build. And the fact that it "vanishes when observed" is itself a clue — it's almost always a timing or memory-initialization problem.
Race Conditions — Bugs That Depend on Ordering
A bug whose outcome depends on the execution order of two threads or processes. It might reproduce only once in a thousand runs.
Strategy:
- Raise the reproduction rate. Insert a `sleep` in the suspect region to artificially widen the race window; when 1-in-1000 becomes 1-in-10, it becomes debuggable (see the sketch after this list).
- Disproving experiment. Put a coarse lock around the suspect region. If the bug vanishes, it's a race (Chapter 3).
- Tools. ThreadSanitizer, Go's `-race`, and Java concurrency checkers catch data races statically and dynamically.
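Here is what widening the window can look like (TypeScript; the shared `balance` and the 50 ms delay are purely illustrative). Without the delay the lost update is rare; with it, two concurrent withdrawals reliably collide:

```typescript
// Artificially widening a race window to turn a rare bug into a reliable one.
// Remove the sleep once the interleaving that triggers the bug is understood.
let balance = 100; // shared state with no synchronization: the bug

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withdraw(amount: number): Promise<void> {
  const read = balance;    // read
  await sleep(50);         // widen the gap between read and write
  balance = read - amount; // write based on a stale read
}

async function main() {
  await Promise.all([withdraw(30), withdraw(30)]);
  // The correct answer is 40; with the widened window this reliably prints 70.
  console.log("balance =", balance);
}

main();
```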
Memory Bugs — Leaks, Corruption, Use-After-Free
Strategy: use AddressSanitizer/Valgrind to catch corruption and use-after-free. For leaks, take a heap snapshot at two points in time and diff them — what keeps piling up is the clue. A defining trait of memory bugs is that the symptom shows up far from the cause. The cause is where the corrupted memory was written, not where it was read.
"Works on My Machine"
An environment-difference bug. Build a diff list between your environment and the one that breaks.
Environment diff checklist:
□ OS / architecture (arm64 vs x86_64, CRLF/LF line endings)
□ Language / runtime version
□ Dependency versions (did you actually commit the lockfile?)
□ Environment variables / config files
□ Locale / timezone / encoding
□ File system (case-sensitive or not)
□ Data (local DB vs the real data in the production DB)
□ Permissions / network / firewall
The fundamental fix is to codify the environment — containers, lockfiles, IaC. Drive "the difference" close to zero and this bug class disappears entirely.
Distributed and Timing Bugs
A bug spanning multiple services. A single stack trace can't see it. The key is a correlation ID — issue an ID when a request starts and stamp every service's logs with that ID. Then distributed tracing can reconstruct the full journey of one request. Watch out for clock skew too — timestamps across services can be off by several seconds.
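A minimal sketch of a correlation ID in an Express-style service (the header name, routes, and downstream URL are illustrative; real systems usually lean on a tracing library such as OpenTelemetry for this):

```typescript
import express from "express";
import { randomUUID } from "node:crypto";

const app = express();

// Issue an ID at the edge, or reuse the one an upstream service already set.
app.use((req, res, next) => {
  const correlationId = req.header("x-correlation-id") ?? randomUUID();
  res.locals.correlationId = correlationId;
  res.setHeader("x-correlation-id", correlationId);
  next();
});

app.get("/checkout", async (req, res) => {
  const correlationId = res.locals.correlationId as string;

  // Stamp every log line with the ID...
  console.log(JSON.stringify({ correlationId, msg: "checkout started" }));

  // ...and forward it on every downstream call so the next service logs it too.
  await fetch("http://payments.internal/charge", {
    method: "POST",
    headers: { "x-correlation-id": correlationId },
  });

  res.json({ ok: true });
});

app.listen(3000);
```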
Chapter 8 · When You're Truly Stuck — Rubber Duck, Take a Break, Question Assumptions
Even following the methodology, you get stuck. Being stuck is usually a sign that you're searching on top of a wrong assumption.
Rubber Duck Debugging
Explain the problem from start to finish, line by line, out loud — to a rubber duck, a colleague, or an empty chat window. While explaining, you hit a point where you go "wait... why is this like this?" Saying out loud an assumption you'd been skipping in your head exposes it.
Take a Break
If you've been stuck for 2 hours, staring harder won't solve it. Walk, sleep, switch tasks. Your subconscious reorganizes the search space in the background. "The answer came to me in the shower" isn't a cliché — it's cognitive science. When stuck, a break is a strategy, not laziness.
Question Assumptions
The most powerful move when stuck: question, one by one, the things you've believed to be true.
Verify the things "you thought were certain":
- "This function definitely gets called" → Really? Did you confirm with a log?
- "This value is never null" → Really? Did you add an assert?
- "The code I fixed got deployed" → Really? Did you check the build hash?
- "The DB has the correct data" → Really? Did you query it directly?
- "This config file gets read" → Really? Did you see which path it reads?
- "The library behaves as documented" → Really? Did you read the source?
An unverified assumption is not an assumption — it's a guess. The moment you say "it's definitely X," that's exactly what you need to verify.
The Bug Is Not Where You Think
If you've been stuck a long time, the search scope itself is probably wrong. You spent a week looking at application code, but the cause was actually a config file, a library version, the data, or the infrastructure. If you've searched one area thoroughly and it's not there — it's not that area. Widen the scope. Go back to the core loop in Chapter 1 and start over from "reproduce." Often it turns out you did the reproduction step sloppily.
Chapter 9 · Debugging With AI Agents — Don't Let Agents Guess Either
AI coding agents are powerful debugging partners, but used wrong they guess faster than humans. The core principle is the same: give the agent material to reason with, and stop it from guessing.
What to Give the Agent
An agent is more accurate the better its context. When you ask it to debug, hand over the following:
| What to give | Why |
|---|---|
| The repro command (Ch. 2) | So the agent can verify its own hypotheses |
| The exact error / stack trace (Ch. 5) | The full text, not "it errors" |
| What changed (Ch. 3) | Recent diffs, deploys — narrows the search scope |
| What you already tried and ruled out | So it doesn't repeat the same hypothesis |
| Expected behavior vs actual behavior | Make the definition of the bug explicit |
Giving the repro command matters most. Then the agent can run the core loop itself — not "this looks like the cause" but "form a hypothesis → verify with the repro command → report pass/fail."
Don't Let the Agent Guess
The anti-patterns an agent falls into are exactly the human ones — changing several places at once, modifying code with no hypothesis, covering the symptom with try/catch. You have to block them:
How to manage agent debugging:
- Instruct: "Don't guess. State a hypothesis first, then verify it."
- Explicitly require the core loop: reproduce → isolate → hypothesize → test
- One variable at a time — "don't change several things at once"
- Before any fix, make it explain "what the root cause is"
- After the fix, have it verify with the repro command and show the result
- Review the agent's hypotheses yourself — plausible is still a guess until verified
The agent's advantage is that it runs binary search tirelessly, skims vast logs fast, and can run git bisect automatically. Its weakness is that it states plausible hypotheses with confidence. So the human's role is to enforce the methodology — make the agent follow the core loop and never accept an unverified hypothesis as fact.
Human or agent, the rule is the same: an unverified hypothesis is a guess. You don't fix code on a guess.
Epilogue — Checklist and Anti-Patterns
Systematic debugging is not talent; it's a trainable procedure. Follow the core loop, change one variable at a time, narrow the search space with binary search, and question unverified assumptions. Reasoning, not guessing.
Debugging Checklist
- Did you reproduce it? — Can you trigger the bug reliably with one command? (If not, stop here.)
- Did you minimize it? — Have you removed every irrelevant part from the repro case?
- What changed? — Of code/config/data/dependencies/infrastructure, which?
- Is the hypothesis one sentence? — "It's caused by X" — in a testable form?
- One variable at a time? — Is there only one thing you're changing in this experiment?
- Are you aiming to disprove? — Can this experiment kill the hypothesis?
- Can you use binary search? — On git bisect / the input / the code path?
- Did you read the error message to the end? — The whole stack trace?
- Is it the right tool? — Did you try the debugger first?
- Did you fix the root cause? — The cause, not the symptom? Can you explain why it was a bug?
- Did you verify? — Does the repro case pass now? Did you add a regression test?
- If you're stuck — Rubber duck, take a break, question assumptions, widen the scope.
Anti-Patterns
| Anti-pattern | Why it's bad | Instead |
|---|---|---|
| Start guessing with no repro | Can't confirm you fixed it | Build the repro command first |
| Change several variables at once | You don't know what the cause is | One at a time |
| Read only the first line of a stack trace | You miss the real clue | Read to the end, including Caused by |
| Spray `console.log` | Log bomb, no plan | Only the value that tests a hypothesis |
| Cover the symptom with `try/catch` | The bug remains, hidden deeper | Fix the root cause |
| Confirmation bias (only confirm) | You can't drop a wrong hypothesis | Design a disproving experiment |
| Search on top of an unverified assumption | You dig forever in the wrong place | Verify every "it's definitely..." |
| Eyeball it and say "looks fixed" | The regression comes back | Verify with the repro case + regression test |
| Keep staring while stuck | You repeat the same guess | Break / rubber duck / widen the scope |
| Delegate guessing to the agent | A fast guess is still a guess | Give it the repro and logs, enforce the loop |
Next Post Teaser
The next post is "Designing Regression Tests — How Not to Meet a Bug Twice." If this post is how to find a bug, the next is how not to meet the same bug again. It covers how to harden a repro case into a permanent test, which bugs deserve a regression test, how to verify the test actually catches the bug, and how not to write flaky regression tests. As important as catching bugs well is caging the bugs you've caught.