Author: Youngju Kim (@fjvbn20031)
- Introduction — A Log Is a Crime Scene
- Reading a Stack Trace — Top Is the Scene, Bottom Is the Origin
- The "Caused by" Chain — The Real Root Cause
- Structured Logging — Data, Not Prose
- Correlation IDs and trace_id — Threading a Single Request
- grep, jq, ripgrep — Tools for Digging Through Logs
- Log Levels — The Dial That Controls the Noise
- Sampling — When You Can't Keep It All
- Putting It Together — Tracking a 3 a.m. Outage
- Wrapping Up
- References
Introduction — A Log Is a Crime Scene
It's 3 a.m. and the pager goes off. Production is down. All you have in hand is an outage you can't reproduce, hundreds of thousands of lines of logs, and one red stack trace. The good news is that these are, in fact, evidence. The stack trace is a photo of the crime scene; the logs are the witnesses' statements. The detective's job is to read this evidence and name the culprit.
The trouble is that many developers can't read this evidence properly. They see a stack trace, their eyes glaze over, they read the top few lines and jump to a guess; the logs are too voluminous, so they give up and grep for something random. But reading evidence has a method. This post lays out the skill of reading stack traces and logs like a detective — systematically, based on the evidence.
Reading a Stack Trace — Top Is the Scene, Bottom Is the Origin
The first time you meet a stack trace, it overwhelms you. But once you know the structure, it's a treasure trove of information. Remember one core rule: one end is where it blew up, the other end is where it began.
Take a Python stack trace as an example.
Traceback (most recent call last):
  File "app.py", line 42, in <module>
    result = process_order(order_id)
  File "services/order.py", line 18, in process_order
    total = calculate_total(items)
  File "services/pricing.py", line 7, in calculate_total
    return sum(item.price for item in items)
  File "services/pricing.py", line 7, in <genexpr>
    return sum(item.price for item in items)
AttributeError: 'NoneType' object has no attribute 'price'
The order to read it, and what to extract:
- Read the exception line at the bottom first. AttributeError: 'NoneType' object has no attribute 'price'. This is the summary of what happened: something tried to read .price off a None.
- The frame just above it is where it blew up. Line 7 of pricing.py, item.price. Here, item was None. This is the crime scene.
- Scan upward through the chain of calls. app.py (42) → process_order (18) → calculate_total (7). This shows the path the request took to get here: the origin.
There's one important judgment to make: which frame is "my code"? A stack trace is often stuffed with frames from inside frameworks and libraries. The exception blew up deep in a library, but the true cause is almost always in your code, which handed the library a weird value. So scan the stack and find "the last frame that's in your project's path." Nine times out of ten, that's where to start the investigation.
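The "last frame in your project's path" rule can be sketched in a few lines of Python. This is only an illustration, not a standard utility: last_project_frame, the /srv/app root, and the pricing_lib package are hypothetical, and treating anything under a site-packages directory as library code is an assumption about how dependencies are installed. It also assumes POSIX-style paths.

```python
import traceback
from pathlib import PurePosixPath


def last_project_frame(frames, project_root):
    """Return the innermost frame whose file lives under project_root, or None.

    `frames` is ordered like traceback.extract_tb(): oldest call first,
    so the frame closest to the blow-up comes last.
    """
    root = PurePosixPath(project_root)
    for frame in reversed(list(frames)):
        path = PurePosixPath(frame.filename)
        # Assumption: installed dependencies live under a site-packages directory.
        if root in path.parents and "site-packages" not in path.parts:
            return frame
    return None


# Hypothetical stack: the exception surfaced inside a library, one frame below our code.
stack = traceback.StackSummary.from_list([
    ("/srv/app/app.py", 42, "<module>", "result = process_order(order_id)"),
    ("/srv/app/services/order.py", 18, "process_order", "total = pricing_lib.total(items)"),
    ("/srv/app/.venv/lib/python3.12/site-packages/pricing_lib/core.py", 7, "total",
     "return sum(i.price for i in items)"),
])
frame = last_project_frame(stack, "/srv/app")
print(f"start at {frame.filename}:{frame.lineno} in {frame.name}")
```

In a live exception handler you would pass traceback.extract_tb(exc.__traceback__) instead of the hand-built stack.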
The direction can be confusing across languages. Python says "most recent call last," so the bottom is where it blew up; but JavaScript and Java usually put the blow-up at the top, with the origin toward the bottom. The rule itself ("one end is the scene, the other end is the origin") is the same — you just have to confirm which way that language runs.
The "Caused by" Chain — The Real Root Cause
A pattern you meet especially often in the Java world, and absolutely must know, is the "Caused by" chain: one error wrapping another, stacked in layers.
Exception in thread "main" com.example.OrderProcessingException: Failed to process order 4821
    at com.example.OrderService.process(OrderService.java:52)
    at com.example.Main.main(Main.java:14)
Caused by: java.sql.SQLException: Connection timed out
    at com.example.db.Pool.getConnection(Pool.java:88)
    at com.example.OrderService.process(OrderService.java:48)
    ... 1 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.base/java.net.Socket.connect(Socket.java:507)
    at com.example.db.Pool.getConnection(Pool.java:85)
    ... 2 more
The common beginner mistake here is judging from the top exception alone. OrderProcessingException: Failed to process order 4821 is really just the outermost wrapper. "Order processing failed" is a symptom, not a cause.
The true cause is in the last "Caused by" at the bottom of the chain. In the example above:
- Top level: order processing failed (symptom)
- Caused by: SQL exception, connection timed out (intermediate cause)
- Caused by: socket connect timed out (root cause — couldn't reach the DB over the network)
So when you see "Caused by," scroll all the way down. The bottom-most root exception is what you actually have to fix. The exceptions above it are just that same one, re-wrapped as it passed through each layer. Reading this chain as "this, because of this, because ultimately of this" is the detective's causal trace.
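Python has the same layering through raise ... from ..., where the wrapped exception hangs off __cause__ (or __context__ for implicit chaining). One difference worth knowing: Python prints the chain in the opposite order from Java, root cause first and outermost wrapper last. The sketch below walks the chain programmatically; root_cause and OrderProcessingError are hypothetical names that mirror the Java example, not part of any library.

```python
def root_cause(exc):
    """Follow the exception chain down to the innermost exception."""
    seen = set()
    while id(exc) not in seen:
        seen.add(id(exc))  # guard against cyclic chains
        # __cause__ is set by `raise ... from ...`;
        # __context__ is set when an exception is raised inside an except block.
        inner = exc.__cause__
        if inner is None and not exc.__suppress_context__:
            inner = exc.__context__
        if inner is None:
            return exc
        exc = inner
    return exc


class OrderProcessingError(Exception):
    """Hypothetical wrapper, mirroring OrderProcessingException above."""


def process_order(order_id):
    try:
        try:
            raise TimeoutError("connect timed out")
        except TimeoutError as err:
            raise ConnectionError("Connection timed out") from err
    except ConnectionError as err:
        raise OrderProcessingError(f"Failed to process order {order_id}") from err


try:
    process_order(4821)
except OrderProcessingError as err:
    cause = root_cause(err)
    print(type(cause).__name__, cause)  # TimeoutError connect timed out
```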
Structured Logging — Data, Not Prose
If a stack trace is a snapshot of "the moment it blew up," logs are a record of "events over time." And how you write your logs completely determines whether an investigation is even possible later. The key is to write logs as structured data a machine can parse, not prose for a human to read.
A traditional log looks like this:
2026-06-28 03:14:22 ERROR Failed to process order 4821 for user 91 after 3 retries
Human-readable, but to handle it in a program you have to parse it with a regex, and if one field is added the parser breaks. Structured logging (JSON), by contrast, looks like this:
{
  "timestamp": "2026-06-28T03:14:22.481Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "failed to process order",
  "order_id": 4821,
  "user_id": 91,
  "retries": 3,
  "duration_ms": 2140,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
The benefits of structured logs are overwhelming.
- Search and filter: you can query precisely for "logs where order_id is 4821" or "logs where duration_ms exceeds 2000." That's a field query, not a string match.
- Aggregation: pull statistics straight from the logs, like "error count per service" or "average retries."
- Parsing stability: adding a field doesn't break the existing pipeline.
- Correlation: embed a field like trace_id and, as we'll see, you can stitch logs from many services into one request.
The principle: don't cram values you want to log into a string — put them in separate fields. Instead of "order 4821 failed", keep message as "order failed" and attach order_id: 4821 as a field. When it's time to investigate, this difference is life or death.
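One way to follow this principle with nothing but Python's standard logging module is a small JSON formatter. Treat it as a sketch under assumptions: JsonFormatter is a hypothetical class, and the convention of passing values as extra={"fields": {...}} is made up for this example (dedicated structured-logging libraries offer richer versions of the same idea).

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Assumption: callers pass structured values as extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The message stays constant; the variable parts travel as separate fields.
logger.error("failed to process order",
             extra={"fields": {"order_id": 4821, "user_id": 91, "retries": 3}})
```

Because the message text never changes, every occurrence of this failure groups under one string, while order_id and retries stay queryable as fields.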
Correlation IDs and trace_id — Threading a Single Request
In a microservices setup, one user request passes through several services: API gateway → order service → payment service → notification service... If each service writes its own logs separately, how do you gather up "the logs this one request produced" when something goes wrong? The answer is a correlation ID, whose standardized form is the trace_id.
The principle is simple. When a request first enters the system, you issue a unique ID, and every service the request touches writes that ID into its own logs. On calls between services, you propagate the ID in an HTTP header or similar. Then, later, simply filtering by that one ID reconstructs the logs scattered across many services into the timeline of a single request.
{ "service": "gateway", "trace_id": "abc123", "message": "request received", "path": "/checkout" }
{ "service": "order-service", "trace_id": "abc123", "message": "order created", "order_id": 4821 }
{ "service": "payment-service", "trace_id": "abc123", "message": "charge failed", "reason": "card_declined" }
{ "service": "gateway", "trace_id": "abc123", "message": "responded 402", "duration_ms": 890 }
Filter for trace_id equal to abc123 and the whole journey is visible at a glance: it came in through the gateway, an order was created, payment was declined, and it responded 402. Without a correlation ID, these four lines are scattered among hundreds of thousands, buried and unconnected.
A practical tip: issue the correlation ID once at the front line of the request (the gateway or a middleware), and have it injected automatically into every subsequent log. Rather than adding it by hand to each log, plant it once in the logger's context and it attaches automatically to every log during that request. And if you also expose this ID on the error screen shown to users ("please quote this code when you contact us: abc123"), you can tie a user's support request straight to the logs.
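"Plant it once in the logger's context" could look like the following in Python, using contextvars and a logging filter. This is a minimal sketch: trace_id_var, TraceIdFilter, and start_request are hypothetical names, and a real service would call start_request from its gateway or middleware with the ID taken from the incoming request header.

```python
import contextvars
import logging
import uuid

# Holds the current request's ID; each thread or asyncio task sees its own value.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace_id onto every record that passes through."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def start_request(incoming_trace_id=None):
    """Call once at the edge: reuse the caller's ID, or issue a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

start_request("abc123")
logger.info("charge failed")  # no trace_id passed by hand
```

The call site never mentions the ID, yet the emitted line carries trace_id=abc123, which is what makes the later filter-by-ID step possible.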
grep, jq, ripgrep — Tools for Digging Through Logs
Even with all your logs gathered, they're useless if you can't extract what you want from them. The command line has powerful tools for exactly this job.
grep / ripgrep (rg): the basic tools for finding patterns in text logs. ripgrep is usually faster than grep on large files and has friendlier defaults, such as searching directories recursively.
# All logs for a specific trace_id
rg "abc123" app.log
# Only the ERROR level (\s* tolerates a space after the colon)
rg '"level":\s*"ERROR"' app.log
# See 3 lines before and after each match (for context)
rg -C 3 "card_declined" app.log
# Multiple conditions: errors from the payment service
rg '"service":\s*"payment-service"' app.log | rg '"level":\s*"ERROR"'
The -C (context) option matters especially to a detective. Seeing one error line versus seeing a few lines around it — what happened just before — is night and day. An event is usually caused by the event right before it.
jq: the dedicated knife for JSON logs. If you use structured logging, jq is far more precise than grep.
# Pull just the order_id from ERROR-level logs
jq 'select(.level == "ERROR") | .order_id' app.log
# Slow requests where duration_ms exceeds 2000
jq 'select(.duration_ms > 2000)' app.log
# Logs for a specific trace_id, just service and message
jq 'select(.trace_id == "abc123") | {service, message}' app.log
# Count errors by reason
jq -r 'select(.level=="ERROR") | .reason' app.log | sort | uniq -c | sort -rn
This is the reward for structured logging. A numeric comparison like duration_ms > 2000, or a field-level aggregation, is essentially impossible with string grep, but with JSON it's a one-liner in jq.
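Real log files often mix JSON lines with plain text, such as a stack trace printed by a crashing process. For that case, the "count errors by reason" aggregation can be sketched in Python so that lines which don't parse are skipped. count_error_reasons and the sample lines are hypothetical, made up for this illustration.

```python
import json
from collections import Counter


def count_error_reasons(lines):
    """Count ERROR entries by their `reason` field, skipping non-JSON lines."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray stack-trace line mixed into the file
        if isinstance(entry, dict) and entry.get("level") == "ERROR":
            counts[entry.get("reason", "unknown")] += 1
    return counts


# Hypothetical sample; in practice, pass an open file object instead.
sample = [
    '{"level": "ERROR", "reason": "card_declined"}',
    '{"level": "INFO", "message": "order created"}',
    'Traceback (most recent call last):',
    '{"level": "ERROR", "reason": "timeout"}',
    '{"level": "ERROR", "reason": "timeout"}',
]
for reason, count in count_error_reasons(sample).most_common():
    print(count, reason)
```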
If your service deals in HTTP, status codes show up in the logs constantly. When you can't recall what 402 was, or how 499 differs from 500, check each code's meaning quickly in the HTTP Status Codes reference, so you can tell at a glance whether a status code in the logs means "the client's fault or the server's fault."
Log Levels — The Dial That Controls the Noise
If you log everything indiscriminately, the signal drowns in noise exactly when you need it. That's what log levels are for. You tag each log with a severity, then adjust which level and above you look at depending on the situation. The standard hierarchy:
- DEBUG: detailed info for development and debugging. "This variable's value is this." Usually turned off in production.
- INFO: normal, significant events. "Order 4821 created," "server started." A record of what the system is doing.
- WARN: not a problem right now, but suspicious. "Succeeded after 3 retries," "config value missing, using default." An early warning of a potential problem.
- ERROR: something failed and needs attention. "Payment processing failed," "DB connection lost."
Using levels well is both courtesy and skill. Two common anti-patterns:
- Logging everything as ERROR: if even trivial things are ERROR, real ERRORs get buried. Cry wolf too often and no one looks when the wolf actually comes.
- Logging nothing: with not one INFO on the happy path, you have no way to ask "did it even get this far?" when something breaks.
When investigating, use levels as a filter. First look only at ERROR to catch the big events; if a particular moment looks suspicious, drop down to the INFO and DEBUG around it to fill in the context. The trick is narrowing from wide (ERROR) to narrow (all levels at that moment).
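The wide-then-narrow pass can be sketched over a list of parsed log entries. The function names and the numeric ranking below are assumptions for illustration; the ranking only has to preserve the DEBUG < INFO < WARN < ERROR order described above.

```python
# Assumed numeric ranking; only the relative order matters.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def indexes_at_or_above(entries, minimum):
    """Wide pass: positions of entries at or above the given level."""
    floor = LEVELS[minimum]
    return [i for i, e in enumerate(entries)
            if LEVELS.get(e["level"], 0) >= floor]


def context_around(entries, index, radius=2):
    """Narrow pass: every level, but only the entries near one position."""
    return entries[max(0, index - radius): index + radius + 1]


# Hypothetical entries, already sorted by time.
entries = [
    {"level": "INFO", "message": "order created"},
    {"level": "DEBUG", "message": "calling payment processor"},
    {"level": "WARN", "message": "retrying charge"},
    {"level": "ERROR", "message": "charge failed"},
    {"level": "INFO", "message": "responded 402"},
]
for i in indexes_at_or_above(entries, "ERROR"):
    for entry in context_around(entries, i):
        print(entry["level"], entry["message"])
```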
Sampling — When You Can't Keep It All
In a high-traffic service, keeping 100% of logs is often impractical or uneconomical. If tens of thousands of requests per second each emit several lines, the storage cost and processing load quickly become unbearable. So you sample, keeping only a representative fraction.
The key principle is to sample the normal, but keep the problems in full.
- Successful, normal requests: no need to keep them all. Sampling just 1%, say, is plenty to grasp the overall trends.
- Errors and slow requests: keep 100%. These are exactly what needs investigating, and they're relatively rare, so keeping all of them stays manageable in volume.
This way, storage drops dramatically while you miss none of the events you'll actually dig into (errors, anomalies). The tail sampling for traces covered in the earlier observability post is exactly the trace version of this philosophy: the rule "once it's done, if it was an error or slow, always keep it."
A trap to watch for: when you apply sampling, the decision must be made per trace_id, so that the logs of one request are kept or dropped together. If some logs from a single request are kept and others dropped, investigating by that trace_id later leaves holes in the timeline. Handle the whole request atomically — if "this trace is sampled in," keep all of that trace's logs.
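Both rules (keep problems in full, decide per trace_id) can be combined in one decision made after the request has finished. The sketch below assumes a request's log entries are buffered until then; trace_bucket, keep_request, the 1% rate, and the 2000 ms threshold are illustrative choices, not a standard algorithm. Hashing the trace_id makes the decision deterministic, so every service that sees the same ID reaches the same verdict for normal requests.

```python
import hashlib


def trace_bucket(trace_id):
    """Map a trace_id to a stable number in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def keep_request(trace_id, entries, sample_rate=0.01, slow_ms=2000):
    """Decide once per finished request whether to keep all of its logs."""
    for entry in entries:
        if entry.get("level") == "ERROR" or entry.get("duration_ms", 0) > slow_ms:
            return True  # problems are always kept in full
    # Normal requests: keep a stable fraction, chosen by trace_id alone.
    return trace_bucket(trace_id) < sample_rate


# Hypothetical buffered logs for one request.
failed = [
    {"level": "INFO", "message": "order created"},
    {"level": "ERROR", "message": "charge failed"},
]
print(keep_request("abc123", failed))  # True: the request contains an error
```

Because the verdict covers the whole list of entries, the INFO line that preceded the error survives along with it, and the timeline for that trace_id has no holes.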
Putting It Together — Tracking a 3 a.m. Outage
Let's weave the tools so far into a single investigation. The pager fires: "payment success rate plummeting."
- Get the stack trace: open a representative stack trace in the error-reporting tool. Follow the bottom "Caused by" all the way down and there's SocketTimeoutException. Something can't reach the DB or an external API over the network.
- Which side?: look at the "my code" frame in the stack; it's the payment gateway client. That narrows it to the external card processor's API, not our DB.
- Cross-check with logs: grab the trace_id of one failed request and use jq to pull that request's full timeline. In payment-service, the call to the processor died right at the timeout value.
- Assess the scope: aggregate errors by reason (jq ... | sort | uniq -c). All timeouts on the same processor endpoint. Likely not our bug but a processor outage.
- Get the context: use rg -C 5 to see the logs just before the first timeout. Latency spiked starting at a specific time. Check the processor's status page and there's an outage notice at exactly that time.
The stack trace answered "what" (network timeout), the trace_id-based logs answered "where" (the processor's API), the aggregation answered "how much" (all of it), and the context answered "since when" (that time). With no guessing, on evidence alone, you showed the culprit was outside your walls.
Wrapping Up
Logs and stack traces aren't annoying red text — they're the evidence an event leaves behind. Gathering the detective's skills again:
- In a stack trace, one end is the scene, the other end is the origin. Find the "my code" frame.
- Follow "Caused by" all the way down. The root cause is at the bottom.
- Write logs as structured data (JSON), not prose.
- Thread scattered logs into one request with trace_id.
- Dig with grep/rg/jq, but view the context (-C) alongside.
- Control the noise with log levels, and with sampling reduce the normal but keep the problems in full.
The next time you meet a stack trace, don't close your eyes and guess. It's a confession — read it to the end. The next time you meet an ocean of logs, don't despair. It's witness testimony — summon it by trace_id. The evidence is already all there. You just have to read it.
References
- "Structured Logging" (honeycomb.io blog): https://www.honeycomb.io/blog/structured-logging-and-your-team
- jq manual: https://jqlang.github.io/jq/manual/
- ripgrep User Guide: https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md
- W3C Trace Context (the trace_id propagation standard): https://www.w3.org/TR/trace-context/
- Python official docs — reading tracebacks: https://docs.python.org/3/tutorial/errors.html