Code That Fails Well — A Deep Dive into Error Handling and Resilience Design (2026)
Youngju Kim (@fjvbn20031)
Prologue — error handling is design
In most code reviews, error handling gets the last five minutes. "You're missing a try/catch here", "let's add some logging to this error" — treated like decoration bolted onto a feature that's already built.
But spend an hour staring at a system that breaks in production and you realize: the cause of an outage is almost always a "mishandled failure", not an "unhandled success". The happy path works fine on its own. What brings a system down is the call with no timeout, the loop that retries forever, the transaction applied only halfway, the error message that hands a user a raw undefined.
Error handling is not something you add after the feature is built. Error handling is the design. A function's signature, a module's boundaries, the shape of the call graph — all of it flows from the question, "what happens when this fails?"
This post is about designing software that fails well at the code level. Chaos engineering — deliberately breaking infrastructure to verify it — is another post's topic. Here we look at how you write a single function, a single class, a single call so that the system as a whole fails gracefully.
The route we'll take: classifying failure → exceptions vs error values → Result types → the boundary principle → timeouts → retries and backoff → idempotency → circuit breakers and bulkheads → graceful degradation → error messages and observability. Code is a mix of TypeScript, Go, and Python — because the pattern, not the language, is the point.
Chapter 1 · Failures come in kinds
Code that treats all failures the same treats all failures wrong. To pick a handling strategy, you first have to classify. There are three axes.
Axis 1 — expected failure vs unexpected failure
An expected failure is part of normal operation. A user looks up a nonexistent ID → "404"; the balance is insufficient → "payment declined". This is not a bug — it is part of the domain. It should be handled with control flow, not thrown as an exception.
An unexpected failure is a broken assumption. A value that should never be null is null; a branch that should never be reached was reached. This is a bug, and it should fail fast and loud so a developer hears about it.
Axis 2 — transient failure vs permanent failure
A transient failure might succeed if you try again. A momentary network blip, a drained DB connection pool, 429 Too Many Requests, 503 Service Unavailable. Retrying is meaningful.
A permanent failure fails identically a hundred times. 400 Bad Request, 401 Unauthorized, 404 Not Found, 422 Unprocessable Entity. Retrying only adds load.
Axis 3 — recoverable failure vs fatal failure
A recoverable failure means this one request fails but the process can keep running. An external API call fails → just this request gets an error response.
A fatal failure means a process invariant is broken. Memory corruption, a config file that won't parse (at boot), a write to a closed channel. Here it is safer to crash and restart. A clean restart beats a zombie process.
The classification table
| Class | Example | Strategy |
|---|---|---|
| Expected + permanent | Insufficient balance, validation failure | Return as a domain error value, explain to the user |
| Expected + transient | 429, 503, lock contention | Retry with backoff, give up at the limit |
| Unexpected + recoverable | An unknown 500 from an external API | Log + fail only this request, alert |
| Unexpected + fatal | Corrupt config, broken invariant | Fail fast, crash, restart |
Without this classification in your head, two anti-patterns appear: (1) you retry everything and loop forever on permanent failures, or (2) you swallow everything the same way and a fatal bug gets buried silently.
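To make the table concrete, here is a minimal sketch of the classification as code — FailureClass and classifyHttpStatus are illustrative names, not from any library:
type FailureClass =
  | "expected_permanent" // domain error: explain to the user, don't retry
  | "expected_transient" // retry with backoff, give up at the limit
  | "unexpected_recoverable" // log + alert, fail only this request
  | "unexpected_fatal" // broken invariant: fail fast, crash, restart
function classifyHttpStatus(status: number): FailureClass {
  if (status === 408 || status === 429 || status === 503 || status === 504) return "expected_transient"
  if (status >= 400 && status < 500) return "expected_permanent"
  return "unexpected_recoverable" // an unknown 5xx: stay conservative
}
The point is that the handling strategy is decided by the class once, not re-argued at every call site.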
Chapter 2 · Exceptions vs error values
How do you represent a failure? There are two camps.
Exceptions: on failure, you throw, the throw climbs the call stack, and someone catches it. The default model of Java, Python, C#, JavaScript.
Error as value: failure is represented as an ordinary return value. A function returns "a result or an error", and the caller checks explicitly. The model of Go and Rust.
The problem with exceptions
The biggest problem with exceptions is that they are invisible in the signature.
function getUser(id: string): User {
// Can this function throw? You can't tell from the signature.
// If it does, throw what? Where should it be caught?
}
The type says User, but it actually means "a User, or something thrown somewhere". The caller doesn't know what to catch, so they do one of two things: catch nothing (crash), or catch everything (catch (e) {} — swallows it all).
On top of that, exceptions make control flow non-local. A throw is effectively "a goto from here to somewhere unknown". When you read the code, the normal path and the error path don't appear separated.
The problem with error values
Error values are honest, but verbose. The famous landscape of Go code:
user, err := getUser(id)
if err != nil {
return nil, err
}
account, err := getAccount(user.ID)
if err != nil {
return nil, err
}
balance, err := getBalance(account.ID)
if err != nil {
return nil, err
}
An if err != nil every three lines. And if you forget? Go compiles even when you ignore a return value (you have to stop it with a linter).
A good compromise: mix both, but with rules
The pragmatic answer is not "pick one", it is use each for its job.
- Expected domain failures → as error values / types. The caller must handle them, so they should be visible in the signature.
- Unexpected bugs → as exceptions (or panic). They're unrecoverable anyway, so let them climb the stack and get caught, logged, and crashed at the top.
In JavaScript/TypeScript: return domain failures as a discriminated union, and throw only for genuine exceptional situations. In Python: domain failures as explicit result objects or a narrow custom exception hierarchy; system failures caught by a broad handler at the top.
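A rough sketch of that split in TypeScript — the account shape and error names are made up for illustration. Domain failures come back as values; a broken invariant throws:
type Account = { balance: number; frozen: boolean }
type WithdrawError = "insufficient_balance" | "account_frozen"
function withdraw(
  account: Account,
  amount: number,
): { ok: true; balance: number } | { ok: false; error: WithdrawError } {
  if (!Number.isFinite(amount) || amount <= 0) {
    // a caller passing a non-positive amount is a bug, not a domain failure — fail fast
    throw new Error(`withdraw called with invalid amount: ${amount}`)
  }
  if (account.frozen) return { ok: false, error: "account_frozen" } // expected, visible in the signature
  if (account.balance < amount) return { ok: false, error: "insufficient_balance" }
  return { ok: true, balance: account.balance - amount }
}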
Chapter 3 · Result types — making failure a type
The most refined form of the error-value camp is the Result type. It is "a success value or an error value", and the type system enforces that it is exactly one of them.
Rust's Result
enum Result<T, E> {
Ok(T),
Err(E),
}
fn parse_port(s: &str) -> Result<u16, ParseError> {
s.parse().map_err(|_| ParseError::InvalidPort)
}
In Rust you cannot simply ignore a Result. To get the value out, you must go through code that checks whether it's Ok or Err. The ? operator collapses the "return early if error" boilerplate into a single character.
Go's multiple returns
Go has no sum type at the type level, so it solves it by convention — a function returns (value, error), and the caller checks err. It isn't as enforced as a sum type (you can ignore it and still compile), but culture and linters fill the gap. Go 1.13+ standardized error wrapping and inspection with errors.Is / errors.As.
if errors.Is(err, sql.ErrNoRows) {
// "not found" is an expected failure — into domain flow
return defaultUser(), nil
}
TypeScript's discriminated union
TypeScript has no built-in Result, but you can build one.
type Result<T, E> =
| { ok: true; value: T }
| { ok: false; error: E }
function parsePort(s: string): Result<number, "invalid_port"> {
const n = Number(s)
if (!Number.isInteger(n) || n < 1 || n > 65535) {
return { ok: false, error: "invalid_port" }
}
return { ok: true, value: n }
}
const r = parsePort(input)
if (!r.ok) {
// the compiler blocks access to r.value here
return reject(r.error)
}
use(r.value) // here r.value is safe
The value of Result types — and the cost
The value is clear. Failure shows up in the signature, and the compiler asks "did you handle this error?" You can't forget.
Be honest about the cost too. (1) Boilerplate — without sugar like ?, it's verbose. (2) Contagiousness — a function that calls a function returning Result usually has to return Result itself. (3) It's wrong for real bugs (null dereference, array out of bounds) — that's the territory of exceptions/panic.
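On the boilerplate point: without a built-in ?, a couple of tiny helpers recover most of the ergonomics. A sketch on top of the Result type above — map and andThen are illustrative names, not a library API:
function map<T, U, E>(r: Result<T, E>, f: (v: T) => U): Result<U, E> {
  return r.ok ? { ok: true, value: f(r.value) } : r
}
function andThen<T, U, E>(r: Result<T, E>, f: (v: T) => Result<U, E>): Result<U, E> {
  return r.ok ? f(r.value) : r
}
// usage: transform the success value; the error path passes through untouched
const portOffset = map(parsePort(input), (n) => n + 1000)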
Rule: use Result types for expected domain failures. Don't try to represent a "bug" as a Result. Go back to the Chapter 1 classification — expected failures as types, unexpected bugs as fail fast.
Chapter 4 · The boundary principle — validate at the edge, trust the core
The most powerful resilience pattern is not a library — it is a single architectural rule.
Validate input at the boundary of the system. Once data is inside, trust it throughout the core.
A boundary is any point where untrusted data enters the system — an HTTP handler, a message queue consumer, a CLI argument parser, the place that receives an external API response, a row read from the DB. At this point you validate strictly, exactly once. Data that passes is converted to a well-defined type and handed inward.
// Boundary: an HTTP handler. Validate here.
function handleCreateOrder(req: Request): Response {
const parsed = OrderSchema.safeParse(req.body) // zod, etc.
if (!parsed.success) {
return badRequest(parsed.error) // reject at the boundary
}
// parsed.data is now a validated Order type.
return createOrder(parsed.data) // hand to the core
}
// Core: does not re-validate. It trusts that Order is valid.
function createOrder(order: Order): Response {
// does not re-check whether order.quantity > 0.
// the boundary guaranteed it. focus on business logic.
}
Why this matters
When validation is scattered through the core, three things break: (1) the same validation runs multiple times — nobody knows how far it got. (2) Inconsistent validation — path A checks it, path B misses it. (3) Business logic gets buried under defensive code and becomes unreadable.
Validate once at the boundary, and a core function's signature becomes the contract. When createOrder(order: Order) says "give me a valid Order", that is a real guarantee — the function body doesn't have to second-guess it every time.
Guard core invariants with assertions
The core has its own assumptions of "this must never happen". Those are not validation — they are expressed as assertions. Validation is premised on "the user might be wrong"; an assertion is premised on "my own code might be wrong". When an assertion breaks, it's a bug — it should fail fast. (A minimal helper sketch follows the table below.)
| | Validation | Assertion |
|---|---|---|
| Location | The boundary | Anywhere in the core |
| Premise | External input can't be trusted | An invariant of my code |
| On failure | A polite error response | Crash / panic (it's a bug) |
| Target | The user / external systems | The developer |
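A minimal sketch of such an assertion helper — invariant is an illustrative name; many runtimes ship an equivalent:
function invariant(condition: boolean, message: string): asserts condition {
  if (!condition) throw new Error(`invariant violated: ${message}`) // a bug: crash loudly
}
// used inside the core — a developer-facing check, not user-facing validation
function nthBusinessDay(days: Date[], n: number): Date {
  invariant(n >= 0 && n < days.length, `index ${n} out of range (have ${days.length} days)`)
  return days[n]
}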
Chapter 5 · Timeouts — every remote call needs a deadline
If you had to name the single most commonly omitted thing in resilience: the timeout.
Every call that crosses a process boundary — HTTP, DB query, cache, gRPC, message publish — must have a timeout. No exceptions.
What happens without a timeout? A downstream service gets slow (not dead — just slow). Threads, goroutines, connections waiting for a response pile up. The pool drains. Now even healthy requests can't get resources. One slow dependency stops the entire service. This is the most common path to a cascading failure.
The default is almost always too long
Most HTTP clients have no default timeout (wait forever) or one so high — 30s, 60s — it is effectively infinite. Waiting 30 seconds inside a user-facing request is nearly the same as "having no timeout" — resources are held the whole time.
import requests
# Bad: no timeout. Can hang forever.
r = requests.get(url)
# Good: (connect timeout, read timeout)
r = requests.get(url, timeout=(1.0, 3.0))
// Go: propagate the timeout with a context
ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
resp, err := httpClient.Do(req.WithContext(ctx))
A timeout budget — the sum can't exceed the parent
Timeouts should be treated as a budget across the whole call graph. If handler A's budget is 3s but it sequentially calls B (2s) and C (2s), the sum is 4s — A has already timed out before the work completes. The sum of child timeouts must fit inside the parent's budget. Use context (Go), AbortSignal (JS), or explicit deadline propagation to flow this budget downward.
Timeout and cancellation are a pair
If a timeout fires but the work keeps running in the background, it keeps eating resources. A timeout must be bound to actual cancellation (context cancel, AbortController.abort(), closing the connection). "Stop waiting" and "stop the work" are different things — you have to do both.
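In modern JavaScript runtimes, AbortSignal.timeout gives you both halves at once: the promise rejects and the underlying request is torn down. A minimal sketch — the Quote shape and the URL are illustrative:
type Quote = { symbol: string; price: number }
async function getQuote(url: string): Promise<Quote> {
  const signal = AbortSignal.timeout(2_000) // 2s budget for this one call
  const res = await fetch(url, { signal }) // on timeout: rejects AND aborts the request
  if (!res.ok) throw new Error(`quote service responded ${res.status}`)
  return (await res.json()) as Quote
}
Passing the same signal (or a tighter one) into child calls is the JS analogue of propagating a Go context — the budget from the previous section flows downward with it.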
Chapter 6 · Retries, exponential backoff, jitter — and when NOT to retry
For transient failures (Chapter 1), retrying is the answer. But only when done correctly. A bad retry escalates a small fault into a large one.
Rule 1 — only retry what is retryable
Retry OK: 408, 429, 503, 504, connection refused, connection timeout
Do not retry: 400, 401, 403, 404, 409, 422 (will fail identically again)
Careful: 500 (depends on the cause — default to conservative)
Retrying a 400 only sends the same bad request five times. 5x the load, nothing else.
Rule 2 — exponential backoff
Don't fix the retry interval (every 1s). If the downstream is overloaded, fixed-interval retries keep applying the same pressure. Grow the interval exponentially: 1s, 2s, 4s, 8s.
Rule 3 — add jitter (this is the key one)
Pure exponential backoff has a hidden trap. If 100 clients hit the same failure at the same time, they all retry 1s later, then 2s later, then 4s later — as a synchronized thundering herd. The downstream gets hit by another 100 right when it had a chance to recover.
The fix is jitter — mix in randomness to spread the retries out in time.
function backoffWithJitter(attempt: number): number {
const base = 100 // ms
const cap = 10_000 // cap at 10s
const exp = Math.min(cap, base * 2 ** attempt)
// "full jitter": uniform random between 0 and exp
return Math.random() * exp
}
AWS's well-known analysis concluded that "full jitter" (a uniform random from 0 up to the computed cap) is the most stable. The herd spreads evenly across the time axis.
Rule 4 — limit both the retry count and the total budget
"At most 5 times" is not enough. Because of backoff, 5 retries can take over 30 seconds. Set both a retry-count cap and a total time budget — give up when either one is hit.
Rule 5 — don't nest retries
The most dangerous anti-pattern. A retries B 3 times, B retries C 3 times, C retries D 3 times → the actual attempts against D are 27. Retry at exactly one layer of the stack. Usually the outermost one (or the one closest to the client).
Rule 6 — retries and idempotency are inseparable
The moment you retry a write operation (POST, payment, order creation), you run straight into the next chapter's question — what if that request already succeeded? That's Chapter 7.
Chapter 7 · Idempotency — retrying safely
An idempotent operation produces the same result whether you apply it once or five times. GET is inherently idempotent; PUT/DELETE are usually idempotent. The problem is POST — things like "create order", "charge payment", "publish message".
Look at the retry scenario.
client ──POST /payments──▶ server: charges payment, succeeds ✅
client ◀───(response lost)─ the network swallows the response ❌
client: "looks like a timeout, let's retry"
client ──POST /payments──▶ server: charge again?? 💸💸
The client cannot know whether the first attempt succeeded. Retry and you double-charge; don't retry and you lose the payment. Both are bad.
The fix — idempotency keys
The client generates a unique key (a UUID, etc.) per request and carries it in a header. To retry the same operation, it uses the same key. The server stores the result per key.
first request: Idempotency-Key: abc-123 → no key → execute → store result → respond
retry: Idempotency-Key: abc-123 → key exists → don't execute → return stored result
async function createPayment(key: string, body: PaymentBody) {
const existing = await store.get(key)
if (existing) return existing.response // replay: don't execute again
// race-condition care: claim the key as "in progress" first
const claimed = await store.claim(key) // atomic insert
if (!claimed) return await store.waitFor(key) // another worker is handling it
const response = await reallyCharge(body)
await store.complete(key, response) // persist the result
return response
}
Designing idempotency keys — the details are everything
- The client generates the key. If the server generates it, a retry gets a new key and idempotency breaks.
- Guard against race conditions. Two retries can arrive at the same time. Claiming the key must be atomic (INSERT ... ON CONFLICT, a unique constraint, a distributed lock).
- Bind the request body to the key. Same key + different body = a client bug. Reject it with 409 (don't silently hand back the old result).
- Set a TTL. You can't hold keys forever. Around 24 hours usually covers the retry window comfortably.
- Handle partial failure. What if a worker dies while "in progress"? That key is stuck. Make it retryable after a timeout, or put an expiry on the "in progress" state.
Idempotency is also a client responsibility
It is not the server's job alone. The client must keep the same key across all retries. If it generates a new key inside the retry loop, the whole idempotency scheme collapses. Generate the key once, outside the retry loop.
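A client-side sketch that ties the chapters together — crypto.randomUUID exists in modern runtimes; PaymentBody, the /payments path, and RetryableHttpError are illustrative; retry is the wrapper from Chapter 6:
type PaymentBody = { orderId: string; amountCents: number } // illustrative shape
class RetryableHttpError extends Error {
  constructor(readonly status: number) { super(`retryable http ${status}`) }
}
async function chargeWithRetry(body: PaymentBody): Promise<Response> {
  const idempotencyKey = crypto.randomUUID() // generated once, OUTSIDE the retry loop
  return retry(async () => {
    const res = await fetch("/payments", {
      method: "POST",
      headers: { "Content-Type": "application/json", "Idempotency-Key": idempotencyKey },
      body: JSON.stringify(body),
      signal: AbortSignal.timeout(3_000), // Chapter 5: every remote call gets a deadline
    })
    if (res.status === 429 || res.status >= 500) throw new RetryableHttpError(res.status)
    return res
  }, (err) => err instanceof RetryableHttpError || (err as Error)?.name === "TimeoutError")
}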
Chapter 8 · Circuit breakers, bulkheads, fallbacks — caging the fault
Timeouts, retries, and idempotency deal with a single call. This chapter's patterns work on a bigger picture — caging a fault so it doesn't spread.
Circuit breaker
If a downstream dependency is definitely dead, waiting out a timeout on every request is a waste. A circuit breaker tracks failures, and when a threshold is crossed it opens the circuit — after that, it doesn't even call the downstream, it fails immediately (fail fast). It stops knocking on a dead service, giving it room to recover.
It has three states.
failure rate exceeds threshold
CLOSED ───────────────────────────▶ OPEN
▲ │
│ │ cooldown timer expires
│ trial call succeeds ▼
└──────────────────────────── HALF_OPEN
trial call fails ──▶ back to OPEN
CLOSED : normal. calls pass. counts failures.
OPEN : broken. no calls. fail immediately. wait out the cooldown.
HALF_OPEN : testing. let only a few calls through to check for recovery.
class CircuitOpenError extends Error {}
class CircuitBreaker {
  private state: "CLOSED" | "OPEN" | "HALF_OPEN" = "CLOSED"
  private failures = 0
  private openedAt = 0
  constructor(private threshold = 5, private cooldownMs = 10_000) {} // illustrative defaults
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new CircuitOpenError() // don't even try
      }
      this.state = "HALF_OPEN" // cooldown over — try once
    }
    try {
      const result = await fn()
      this.onSuccess() // back to CLOSED, reset the counter
      return result
    } catch (err) {
      this.onFailure() // cross the threshold and go OPEN
      throw err
    }
  }
  private onSuccess() {
    this.failures = 0
    this.state = "CLOSED"
  }
  private onFailure() {
    this.failures += 1
    if (this.failures >= this.threshold) {
      this.state = "OPEN"
      this.openedAt = Date.now()
    }
  }
}
The key value: when the circuit is open, the caller gets a failure response immediately, not after 30 seconds. That fast failure makes room to trigger a fallback (below).
Bulkhead
The name comes from a ship's bulkheads — compartments so that water in one section doesn't sink the whole ship. In software, it is splitting resources into separate pools.
Say a service calls dependencies A, B, C and they share one connection pool of 100. If A gets slow, all 100 connections are held waiting on A — B and C are fine but have no connection to call with. One fault in A drags B and C down with it.
Bulkhead: give A, B, C each their own pool (say 40/30/30). Now even if A eats its whole pool, that's only A's 40 — B and C operate normally on their own pools. The fault is caged in one compartment.
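In code, a bulkhead doesn't need special infrastructure — at its simplest it's a per-dependency concurrency cap. A sketch (the class and the 40/30/30 limits are illustrative):
class Bulkhead {
  private available: number
  private waiters: Array<() => void> = []
  constructor(limit: number) { this.available = limit }
  private async acquire(): Promise<void> {
    if (this.available > 0) { this.available--; return }
    await new Promise<void>((resolve) => this.waiters.push(resolve)) // wait for a slot
  }
  private release(): void {
    const next = this.waiters.shift()
    if (next) next() // hand the slot directly to the next waiter
    else this.available++
  }
  async run<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire()
    try { return await fn() } finally { this.release() }
  }
}
// each dependency gets its own compartment — A going slow can no longer starve B and C
const callA = new Bulkhead(40)
const callB = new Bulkhead(30)
const callC = new Bulkhead(30)
A stricter variant rejects instead of queueing when the compartment is full — that's the load-shedding flavor coming in Chapter 9.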
Fallback
When a call fails (or the circuit is open), can you give a second-best answer? That is a fallback.
| Primary (failed) | Fallback |
|---|---|
| Live recommendation engine | A cached list of popular items |
| Live exchange-rate API | The last known rate + a "stale" marker |
| Personalized home feed | A generic curated feed |
| Exact inventory count | Just "in stock / out of stock" |
The iron rule of fallbacks: a fallback must be simpler and have fewer dependencies than the primary. A fallback that dies for the very same reason the primary dies is not a fallback. A good fallback is usually cached data, a static default, or an honest partial response that says "I can't show you this right now".
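Wiring breaker and fallback together is mostly plumbing. A sketch — recommendationClient, getCachedPopularItems, and the Item type are illustrative, and CircuitBreaker is the class from above:
type Item = { id: string; title: string }
declare const recommendationClient: { get(userId: string): Promise<Item[]> } // illustrative
declare function getCachedPopularItems(): Item[] // the simpler, dependency-free fallback
const recsBreaker = new CircuitBreaker()
async function getRecommendationsOrFallback(userId: string): Promise<Item[]> {
  try {
    return await recsBreaker.call(() => recommendationClient.get(userId))
  } catch {
    // circuit open OR the call failed — either way, degrade to the cheap path
    return getCachedPopularItems()
  }
}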
Chapter 9 · Graceful degradation — failing soft
Tie Chapter 8's patterns into a single mindset and you get graceful degradation.
When one part of the system fails, it should be able to drop to something between 0 and 1. A system with only two states — fully working or fully stopped — is brittle.
All-or-nothing is brittle
Look at an e-commerce product page. It calls: product info, price, inventory, reviews, recommendations, a personalized banner. What happens when the reviews service dies?
Brittle design: the whole page is a 500. The user cannot buy a product they could have bought — because the reviews didn't load.
Resilient design: the page renders. Product, price, inventory, the buy button — all fine. In the reviews slot it shows "couldn't load reviews". The critical function (buying) is alive. Only the enhancement (reviews) drops out gracefully.
Split features into critical and enhancement
To do this, you have to classify deliberately.
- Critical: if this dies, the request dies. The "product info + price" of a product page. Fail honestly.
- Enhancement: if this dies, the request must survive. The "recommendations + reviews". Replace it with an empty slot, a cache, a placeholder.
The code's structure should reflect this classification. Enhancement calls isolate their own errors — their failure must not propagate to the whole handler.
async function getProductPage(id: string): Promise<ProductPage> {
// critical: if it fails, the whole thing fails. honestly.
const [product, price] = await Promise.all([
getProduct(id),
getPrice(id),
])
// enhancement: each isolated. the page survives a failure.
const reviews = await getReviews(id).catch(() => null)
const recs = await getRecommendations(id).catch(() => [])
return {
product,
price,
reviews, // may be null — the UI handles it
recommendations: recs, // may be an empty array
degraded: reviews === null, // tell the client honestly
}
}
Load shedding — deliberate degradation
Degradation isn't only for partial outages. If the system is overloaded, deliberately rejecting some of what comes in beats slowly killing all of it. Load shedding: when the queue is full, reject new requests fast with 503 + Retry-After. Rejecting 90% fast and honestly while handling 10% well beats accepting 100% and timing out on all of it. Under saturation, rejection is a feature.
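A sketch of load shedding at the entry point — the in-flight cap and the handler shape are illustrative; the real number should come from measured capacity, not guesswork:
declare function processRequest(req: Request): Promise<Response> // illustrative normal path
const MAX_IN_FLIGHT = 200 // illustrative: derive it from load tests
let inFlight = 0
async function handle(req: Request): Promise<Response> {
  if (inFlight >= MAX_IN_FLIGHT) {
    // saturated: reject fast and honestly instead of queueing into a timeout
    return new Response("overloaded", { status: 503, headers: { "Retry-After": "2" } })
  }
  inFlight++
  try {
    return await processRequest(req)
  } finally {
    inFlight--
  }
}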
Chapter 10 · Error messages and observability — errors a human can act on
Even if you handle errors well, if those errors are opaque, you've only done half the job. An error is, in the end, read by a human — the developer debugging, the user looking at the screen.
The three audiences for a good error message
| Audience | What they need | Bad | Good |
|---|---|---|---|
| End user | What to do | "Error 500" | "We couldn't process your payment. Your card was not charged. Please try again." |
| Operator / on-call | Where it broke | "Something went wrong" | "payment-svc → stripe call timed out (2s), order_id=789" |
| Future developer | Why it broke | A stack trace only | Stack + input context + which invariant broke |
You cannot satisfy three audiences with one string. So you structure the error.
Structured errors — data, not strings
Don't throw a string like "user 123 not found in tenant 9". You can't search, aggregate, or route on it. Instead, a struct with fields:
type AppError = {
code: string // stable, machine-readable: "PAYMENT_TIMEOUT"
message: string // for humans (developers)
retryable: boolean // may the caller retry?
context: Record<string, unknown> // order_id, tenant, attempt count...
cause?: unknown // the original error — preserve the chain
}
With this: you can aggregate metrics by code, retry logic can branch on retryable, context carries what you need to reproduce, and cause preserves the chain down to the root cause.
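A small constructor sketch on top of that shape — appError and callStripe are illustrative names; the point is that each layer adds its own context and keeps cause intact:
function appError(
  code: string,
  message: string,
  opts: { retryable?: boolean; context?: Record<string, unknown>; cause?: unknown } = {},
): AppError {
  return {
    code,
    message,
    retryable: opts.retryable ?? false,
    context: opts.context ?? {},
    cause: opts.cause, // keep the original — never crush the chain
  }
}
declare function callStripe(orderId: string): Promise<{ chargeId: string }> // illustrative
async function chargeOrder(orderId: string, attempt: number) {
  try {
    return await callStripe(orderId)
  } catch (err) {
    // this layer knows the order and the attempt; the layer below knows the root cause
    throw appError("PAYMENT_TIMEOUT", "stripe charge timed out", {
      retryable: true,
      context: { order_id: orderId, attempt },
      cause: err,
    })
  }
}
Whether you throw the structured error or return it as a Result's error branch is a Chapter 2/3 decision — the shape is the same either way.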
Wrap errors, but add context
As an error climbs the stack, each layer should add its own context while preserving the original. Don't crush it — wrap it.
// Bad: throws away the original error. you'll never know where it came from.
if err != nil {
return errors.New("something failed")
}
// Good: add context, preserve the cause with %w
if err != nil {
return fmt.Errorf("charging order %s: %w", orderID, err)
}
// the caller can still inspect with errors.Is(err, context.DeadlineExceeded)
Structured logging — events, not prose
A line like log.Error("payment failed for user " + id) cannot be searched or aggregated. Log structured key-values.
logger.error("payment_failed", extra={
"error_code": "PAYMENT_TIMEOUT",
"order_id": order_id,
"downstream": "stripe",
"duration_ms": 2013,
"attempt": 2,
"trace_id": trace_id, # tie it to distributed tracing
})
Now you can query "error_code=PAYMENT_TIMEOUT over the last hour, broken down by downstream". The trace_id connects this log to the full request flow in distributed tracing. That is the difference between prose logs and an observable system.
What to count
Error handling is only complete when it also includes error metrics. At a minimum, these as counters:
- Error rate — by code, by endpoint
- Retry count / retries-exhausted count
- Circuit breaker state transitions (how many times it opened)
- Timeouts — by call target
- Fallback triggers (how often an enhancement drops out)
Without this visibility, the system degrades silently — until someone complains. Good error handling makes failure visible.
Epilogue — failing well is designing well
Back to the opening claim. Error handling is not decoration you bolt on after the feature is built — it is the design itself.
The single thread running through this post is this — make failure explicit. Visible in the signature, enforced in the type, measured in the metrics, searchable in the logs. Hidden failure kills a system. Exposed failure can be dealt with.
Every one of these patterns — classification, Result types, boundary validation, timeouts, backoff, idempotency, circuit breakers, graceful degradation — compresses into the same sentence.
For every call, ask: what happens when this fails? And write the answer down in code — not in your head.
A 14-item checklist
- Did you classify every failure by the Chapter 1 axes (expected/transient/recoverable)?
- Are expected domain failures error values/types, visible in the signature?
- Are unexpected bugs fail fast (exception/panic), not swallowed?
- Do you validate input once at the boundary and trust the core?
- Are core invariants guarded with assertions, not validation?
- Does every remote call have a timeout? (no exceptions)
- Are timeouts set deliberately, not left at the default (usually too long)?
- Is the timeout bound to actual cancellation?
- Do you retry only retryable errors? (not 4xx)
- Do retries have exponential backoff + jitter?
- Are retries un-nested, with both a count and a time budget capped?
- Are retries of write operations safe via an idempotency key?
- Do critical dependencies have a circuit breaker, and enhancements a fallback?
- Are errors structured (code/retryable/context) and measured as metrics?
Ten anti-patterns
- Empty catch — catch (e) {}. Silently swallows the failure. The root of undebuggable systems.
- Retry everything — retrying 4xx loops forever on a permanent failure and only amplifies load.
- Nested retries — 3×3×3 = 27 attempts. Retry at one layer only.
- Calls without timeouts — one slow dependency drains the pool and stops everything.
- Timeout without cancellation — you stopped waiting, but the work keeps eating resources in the background.
- Non-idempotent writes that get retried — double-charges, duplicate orders. A payment API with no idempotency key.
- Scattered validation — the same validation repeated all over the core, inconsistent, business logic buried.
- Crushing errors — after a catch, throwing away the cause and throwing a new generic error. Root cause untraceable.
- Opaque error messages — "Error 500". Neither the user nor the operator nor the future developer can act.
Next post preview
We've seen how to fail well at the code level. But one assumption is left standing — how do we know this code actually fails well? It's common to test only the happy path and ship without ever once running the "timeout handling code".
The next post is about testing failure. Putting fault injection into unit and integration tests, deterministically reproducing timeouts, partial failures, and an open circuit, and from there leading naturally into chaos engineering — deliberately causing failure in a near-production environment to verify the system really does fail gracefully. If you designed it to fail well, the next step is proving that design is real.