Incident Communication: Talking Through an Outage

Introduction — The Fix Is Only Half the Job
The Incident Commander — The Person Who Doesn't Fix It
The Status-Update Cadence — Keep the Drumbeat Even With No News
The Three Questions — What Every Update Must Answer
Blameless Postmortems — Blame the System, Not the Person
The 5 Whys — From Symptom to Root Cause
Customer Comms vs. Internal Comms
Staying Calm — Tone Sets the Air in the Room
Wrapping Up
References

Introduction — The Fix Is Only Half the Job

It is 3 a.m. and your pager goes off. The payments API is spewing 5xx errors, and every dashboard is red. You dig into logs, narrow down the cause, and prepare a rollback. This is the work you were trained for. And yet it is only half the work.

The other half is talking: telling the team what is happening, sorting out who owns what, keeping customers informed, and answering the executive who keeps asking "how's it going?" Fail at this half, and an outage that was technically a 20-minute blip becomes an all-day trust crisis. People act on conflicting information and duplicate each other's work, customers imagine the worst in the silence, and long after recovery the question "what actually happened?" drags on for days.

This post is about that other half. We will talk about the Incident Commander role, the rhythm of status updates, the three questions every update must answer, blameless postmortems and the 5 Whys, the different flavors of customer and internal communication, and the calm that holds all of it together.

The Incident Commander — The Person Who Doesn't Fix It

The first thing a chaotic incident needs is a single coordinator. We call this person the Incident Commander, or IC. The essence of the role is paradoxical.

The IC's job is not to fix the outage. It is to coordinate the people who fix the outage.

The moment the IC grabs a keyboard and dives into debugging, the coordinator vanishes. Now nobody holds the whole picture, several people chase the same hypothesis in parallel, and no one is telling customers or executives what is going on. The IC has to keep their hands free so they can make decisions, allocate people, and own the flow of communication.

Effective response separates the roles clearly. In a small incident one person may wear several hats, but the roles themselves should be conceptually distinct.

Incident Commander (IC): the single coordinator over the whole response. Makes decisions, delegates work, sets priorities. Does not fix it themselves.
Communications lead (comms lead): owns communication that flows outward and upward. Status-page updates, customer notices, and executive briefings, so the IC and the responders are freed from that burden.
Ops / subject-matter expert (SME): the people actually on the keys, diagnosing and fixing. The engineers who know the failing system best belong here.
Scribe: keeps the timeline. Records, in real time, what happened at each minute, what decisions were made, and what was tried. This record becomes the skeleton of the postmortem later.

The power of this structure is the distribution of cognitive load. If one person tries to diagnose while also writing customer notices and answering executive questions, all three come out sloppy. Split the roles and each person can focus on one thing. And everyone must know exactly who the IC is. A single explicit sentence, "I am taking command," instantly aligns a channel where ten people were milling around in confusion.
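
To make the scribe's job concrete, here is a minimal sketch of an append-only timeline in Python. The names `TimelineEntry` and `Timeline`, and the `kind` labels, are invented for illustration; a shared document or a chat bot would serve just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    kind: str  # hypothetical labels: "event", "decision", "action"
    note: str


@dataclass
class Timeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, at: datetime, kind: str, note: str) -> None:
        """Append an entry; existing entries are never edited or removed."""
        self.entries.append(TimelineEntry(at, kind, note))

    def render(self) -> str:
        """Return the entries in chronological order, one per line."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(f"{e.at:%H:%M} [{e.kind}] {e.note}" for e in ordered)


timeline = Timeline()
timeline.record(datetime(2024, 1, 1, 14, 0), "event", "5xx spike begins")
timeline.record(datetime(2024, 1, 1, 14, 5), "decision", "Roll back to v2.7.2")
print(timeline.render())
```

Keeping the record append-only means that corrections are added as new entries, which preserves what the team believed at each moment.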

The Status-Update Cadence — Keep the Drumbeat Even With No News

During an outage, silence is not the absence of information. It is an amplifier of anxiety. When updates go quiet, the people watching start to imagine the worst: "Has everyone given up?" "Did it get worse?" So the core principle of good incident communication is this:

Update on a regular rhythm, even when there is nothing new.

This is often called the drumbeat. Pick a cadence based on severity, say every 15, 30, or 60 minutes, and honor it. Even on a cycle with no progress, you say "still investigating the cause; no new developments." That one sentence signals "we are still here, and we have not let go of this." It is far more reassuring than silence.

The decisive habit here is to name the next update time and honor it. End every update with a line like:

"Next update at 14:30."
"More news in 30 minutes, no later than 15:00."

Now the people watching know when the next word is coming, so they do not refresh the channel or repeatedly ping the responder with "any progress?" in the meantime. And this is a promise. If you said 14:30, then at 14:30 you must post something, even if that something is only "still investigating; next update at 15:00." The moment you miss the time you named, the trust you worked to build starts to crumble.
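
The cadence and the promise can be sketched in a few lines of Python. The severity labels and the minutes in `CADENCE_MINUTES` are assumptions for illustration, not a standard; choose values that match your own policy.

```python
from datetime import datetime, timedelta

# Hypothetical severity-to-cadence mapping, in minutes.
CADENCE_MINUTES = {"sev1": 15, "sev2": 30, "sev3": 60}


def next_update_time(posted_at: datetime, severity: str) -> datetime:
    """Return the time to promise for the next update."""
    return posted_at + timedelta(minutes=CADENCE_MINUTES[severity])


def closing_line(posted_at: datetime, severity: str) -> str:
    """Build the last line of an update, naming the next update time."""
    return f"Next update at {next_update_time(posted_at, severity):%H:%M}."


def is_overdue(promised: datetime, now: datetime) -> bool:
    """True once the promised time has passed without a new update."""
    return now > promised


posted = datetime(2024, 1, 1, 14, 0)
print(closing_line(posted, "sev2"))  # Next update at 14:30.
```

A check like `is_overdue` could drive a reminder to the comms lead a few minutes before the promised time, so the promise is kept even when nobody is watching the clock.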

The Three Questions — What Every Update Must Answer

When you are staring at a blank box wondering how to write a status update, just remember three questions. A good update always answers these.

What do we know — the facts established so far. What is affected, and how far does it reach. Do not mix fact and speculation; state only what is confirmed as fact.
What are we doing — the action in progress. Which hypothesis you are testing, which mitigation you have applied, and what the next step is.
When's the next update — the promise emphasized above. Always include an explicit time.

The three-question frame works because it covers exactly what a watcher wants to know. People do not want your stack traces or your internal debates. They want to know how things stand right now, whether something is being done, and when they will hear again. Knowing that much puts them at ease.

Early in triage, it matters most to establish "what do we know" quickly. Simply telling a spike of 5xx errors from a spike of 4xx errors gives a strong first hint as to whether the problem is on the server side or in the requests clients are sending. When the meaning of a status code is fuzzy, keeping a reference like HTTP Status Codes at hand for a quick check can save you from burning 30 minutes chasing the wrong direction.
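
As a rough illustration of that first split, the sketch below groups a sample of response codes by class. The function names and the sample data are hypothetical; in practice the codes would come from your logs or metrics system.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 503 -> '5xx'."""
    return f"{code // 100}xx"


def summarize(codes: list[int]) -> dict[str, float]:
    """Return the share of responses that fall in each status class."""
    total = len(codes)
    if total == 0:
        return {}
    counts = Counter(status_class(c) for c in codes)
    return {cls: n / total for cls, n in counts.items()}


# Made-up sample: two successes, two server errors, one client error.
sample = [200, 200, 503, 500, 404]
print(summarize(sample))
```

A large 5xx share points the first hypothesis at the server side, while a large 4xx share suggests looking at what clients are sending. Either way it is a hint to be confirmed, not a diagnosis.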

A real internal Slack update looks like this. It answers all three questions and nails down the next update time.

[Incident #4021] Payments API outage — Update 3 (14:05)

What we know:
  - Since 14:00, ~40% of payment authorization API calls are failing with 5xx.
  - Cause narrowed to payments-service v2.7.3, deployed today at 13:50.
  - Other features (browse, cart, etc.) are unaffected.

What we're doing:
  - Rolling back to v2.7.2 (deploy pipeline running, ETA ~14:15).
  - Will monitor authorization success rate after rollback to confirm recovery.

Impact:
  - New payment attempts fail or hang. No data loss observed.

Next update: 14:20, or immediately on rollback completion (whichever is sooner).
IC: Youngju Kim / Comms: Seojun Lee

Once this format is in your bones, you can fill in the blanks like a form even at 3 a.m. with a foggy head.
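
If you would rather have the form enforced than remembered, a small template can refuse to render an update that skips one of the three questions. This is only a sketch: the `StatusUpdate` class and its field names are invented here, and the header uses a colon as its separator.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    incident_id: int
    title: str
    sequence: int
    posted_at: str
    known: list[str]   # What do we know
    doing: list[str]   # What are we doing
    next_update: str   # When's the next update

    def render(self) -> str:
        """Render the update, rejecting one that leaves a question unanswered."""
        if not (self.known and self.doing and self.next_update):
            raise ValueError("an update must answer all three questions")
        header = (
            f"[Incident #{self.incident_id}] {self.title}: "
            f"Update {self.sequence} ({self.posted_at})"
        )
        lines = [header, "", "What we know:"]
        lines += [f"  - {item}" for item in self.known]
        lines += ["", "What we're doing:"]
        lines += [f"  - {item}" for item in self.doing]
        lines += ["", f"Next update: {self.next_update}"]
        return "\n".join(lines)


update = StatusUpdate(
    incident_id=4021,
    title="Payments API outage",
    sequence=3,
    posted_at="14:05",
    known=["~40% of payment authorization API calls are failing with 5xx."],
    doing=["Rolling back to v2.7.2."],
    next_update="14:20",
)
print(update.render())
```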

Blameless Postmortems — Blame the System, Not the Person

After the outage ends, you write a postmortem. The single most important principle here is that it is blameless.

The point of a blameless postmortem is not to find the person who made the mistake, but to blame the system and process that made that mistake possible. If an engineer deployed a bad config file and caused the outage, the question is not "who did it" but "why did our system allow one person's single mistake to bring everything down." Why did validation not catch it? Why could a bad config reach production so easily? Why did the rollback take so long?

This principle is practical for a reason that has less to do with morality than with the quality of information. In a blaming culture, nobody tells the truth. Someone who knows that admitting a mistake means punishment will blur the timeline, hide what they pushed, and turn defensive. Then the real weaknesses of the system are never exposed, and the same outage repeats in someone else's hands.

Conversely, when there is psychological safety, people speak honestly. Only when someone can candidly say "I ran this command, I ignored this warning, I skipped checking this part" can you put a guardrail at that exact point. Honest timelines come only from psychological safety, and without honest timelines there are no real lessons.

A blameless postmortem should answer roughly the following.

What happened: a fact-based timeline (this is where the scribe's record shines).
What was the impact: how long, how many users/requests, in what form.
Why did it happen: the root cause(s). Conditions and systems, not people.
How did we detect and respond: what went well and what was slow.
How do we prevent recurrence: concrete action items with owners and deadlines.
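
The last item is the one that most often goes soft, so it can be worth checking mechanically. A minimal sketch, assuming action items are tracked as simple records; `ActionItem` and `incomplete_items` are hypothetical names, and the sample items are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items: list[ActionItem]) -> list[str]:
    """Return descriptions of action items missing an owner or a deadline."""
    return [i.description for i in items if not i.owner or i.due is None]


items = [
    ActionItem("Add config validation to the deploy pipeline", "Seojun Lee", date(2024, 2, 1)),
    ActionItem("Shorten rollback time"),  # no owner, no deadline
]
print(incomplete_items(items))  # ['Shorten rollback time']
```

An action item without an owner and a date is a wish, and a postmortem full of wishes changes nothing.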

The 5 Whys — From Symptom to Root Cause

The classic technique for drilling down to a root cause is the 5 Whys. The method, which originated in the Toyota Production System, is simple: start at the surface symptom and repeatedly ask "why?", descending toward deeper causes.

Here is an example.

The payments API failed. Why? — The payments service exhausted its database connections.
Why did it exhaust connections? — A slow query held connections open too long without releasing them.
Why was the query slow? — A newly added lookup scanned a column with no index.
Why was it deployed without an index? — Code review did not check the query's execution plan.
Why did review miss it? — There was no checklist or automated check to review execution plans.

By the fifth "why," you arrive at a completely different place from the initial "the API failed." The real thing to fix is not the payments service but the review process that failed to catch a dangerous query. That is the power of the 5 Whys. Instead of patching the symptom, it takes you down to the conditions that produced it.

That said, the 5 Whys has clear limits, so do not treat it as gospel.

There is usually more than one cause. Real outages happen when several causes overlap. Following "why" down a single strand misses the other contributing factors. In practice, a single "why" can have several branching answers, so laying causes out as a tree is often more accurate than a linear 5 Whys.
"5" is not a magic number. You might arrive in three, or need seven. Do not fixate on the count.
If the answer stops at a person, you drilled wrong. Stopping at "why? because someone made a mistake" slides into blame. You have to take one more step and ask "why was that mistake possible" to reach the system.
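
The branching idea from the first limit can be sketched as a small tree. The `Cause` class and `root_causes` function are names made up for this example, and the second branch (the missing alert) is a hypothetical addition to the payments example, included only to show a "why" with more than one answer.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    statement: str
    because: list["Cause"] = field(default_factory=list)


def root_causes(node: Cause) -> list[str]:
    """Collect the leaves: causes with no deeper 'why' recorded yet."""
    if not node.because:
        return [node.statement]
    leaves: list[str] = []
    for child in node.because:
        leaves.extend(root_causes(child))
    return leaves


tree = Cause("Payments API failed", [
    Cause("Database connections exhausted", [
        Cause("Slow query held connections open", [
            Cause("Lookup scanned an unindexed column", [
                Cause("Review did not check the execution plan", [
                    Cause("No checklist or automated check for execution plans"),
                ]),
            ]),
        ]),
        # Hypothetical second branch, not part of the original example.
        Cause("No alert on connection pool saturation"),
    ]),
])

for cause in root_causes(tree):
    print(cause)
```

Each leaf is a candidate for an action item. A leaf that names a person rather than a condition is a sign that the branch needs one more "why."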

The 5 Whys is a good starting point for structuring your thinking, but it is not the only tool. Complex outages need a wider view that considers the interaction of multiple causes together.

Customer Comms vs. Internal Comms

A common mistake in outage communication is saying the same thing to different audiences. Communication toward customers and communication internally differ in purpose, in the information wanted, and in tone.

Customers want three things: impact (what is broken), an estimated time to recovery (ETA), and an acknowledgment and apology (confirmation that you know and are on it). Customers do not care about your internal architecture or which microservice exhausted its connection pool. Such internal detail only breeds confusion and needless worry. Customer notices should be clear, plain, and empathetic.

Internal communication is the opposite. Responders and executives want the technical truth: the exact error rate, the suspected cause, what was tried and failed, how long the rollback takes. Here you should share even the uncertainty honestly. A nuance like "likely the cause but not yet confirmed" is useful information internally. Throw that raw detail at customers, though, and you buy anxiety instead of trust.

Placing the two side by side looks like this.

Aspect	Customer comms	Internal comms
Audience	Users, customer organizations	Responders, executives, stakeholders
What they want	Impact + ETA + acknowledgment/apology	Precise technical truth, cause, progress
Tone	Clear, plain, empathetic	Direct, specific, candid (uncertainty included)
Channel	Status page, email, notice banner	Incident Slack channel, phone bridge

The key is translation. The comms lead translates the raw technical truth flowing through the internal channel into the impact, ETA, and apology that customers need. Same event, but it must become entirely different sentences depending on the audience.
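
That translation step can be pictured as a function that takes the internal facts and deliberately drops most of them. This is a sketch only: `InternalUpdate`, its fields, and the notice wording are invented for illustration, and a real notice would be written and reviewed by a person.

```python
from dataclasses import dataclass


@dataclass
class InternalUpdate:
    affected_feature: str  # customer-facing name of what is broken
    error_rate_pct: int    # internal detail, not shown to customers
    suspected_cause: str   # internal detail, not shown to customers
    eta: str               # expected recovery time


def customer_notice(update: InternalUpdate) -> str:
    """Keep impact, ETA, and acknowledgment; leave internal detail out."""
    return (
        f"Some customers are currently unable to use {update.affected_feature}. "
        f"We are aware of the problem and expect to restore service by {update.eta}. "
        "We are sorry for the disruption."
    )


internal = InternalUpdate(
    affected_feature="card payments",
    error_rate_pct=40,
    suspected_cause="payments-service v2.7.3",
    eta="14:15",
)
print(customer_notice(internal))
```

The error rate and the suspected cause stay in the internal channel. The customer sees only the impact, the ETA, and the apology.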

Staying Calm — Tone Sets the Air in the Room

Finally, a word about the attitude that precedes technique. In incident response, the IC's calm is contagious. And so is the opposite: panic is just as contagious.

If the IC types frantically, raises their voice, and radiates "we're doomed," the whole response team absorbs that tension. Tense people make hasty decisions, run commands without checking, and make mistakes. Conversely, when the IC says in a steady voice "okay, let's take this one step at a time; first, let's confirm the blast radius," the heart rate of the whole room drops. People find room to think again, and better judgment comes out of that room.

Tone is not merely a matter of manners; it is a real variable that governs the quality of the response. A long-standing maxim in crisis response captures it well.

Slow is smooth, smooth is fast.

It sounds like a paradox, but it is true. Rushing produces mistakes, and mistakes cost more time to undo. Slow down by a beat and move accurately, and that smoothness turns out to be the fastest path. The more urgent it feels, the more you take a breath, say the next action aloud ("I am starting the rollback to v2.7.2 now"), and double-check any irreversible step — this is the rhythm the IC must plant in the room.

Calm is not suppressing emotion; it is a leadership tool that gives the team space to think. When you are calm, the team is calm. And a calm team ends the outage faster and more safely.

Wrapping Up

When an outage hits, we instinctively run to the code. But as we have seen, the technical fix is only half the job. The other half is talking. A hands-free Incident Commander splits the roles, a status update that keeps the drumbeat even with no news calms the anxiety, the three questions keep every update clear, a blameless postmortem leaves honest lessons, and translating the same event differently for customers and for the internal channel reassures each audience. And beneath all of it lies the IC's calm.

Good incident communication cannot prevent outages. Outages will always happen eventually. But whether an outage ends as a 20-minute technical problem or spreads into an all-day trust crisis is decided by the talking. The fix is only half the job. The team that does the other half well comes out of the outage stronger, not weaker.

References

Google SRE Book — Managing Incidents: https://sre.google/sre-book/managing-incidents/
Google SRE Book — Postmortem Culture: Learning from Failure: https://sre.google/sre-book/postmortem-culture/
PagerDuty Incident Response Documentation: https://response.pagerduty.com/
Atlassian Incident Management Handbook: https://www.atlassian.com/incident-management
Five whys (overview): https://en.wikipedia.org/wiki/Five_whys