Practical guide to writing system failure reports and incident response communication in Japanese



## Introduction

System failures are something every IT engineer would rather avoid but must eventually face. In responding to a failure, communication skills matter as much as technical recovery skills. In the Japanese IT field in particular, the format and wording of failure reports are highly standardized, so if you do not know the conventions, your response will be rated poorly in the post-incident review no matter how quickly you recover.

In Japanese corporate culture, a failure report is not simply a record of facts. It is an act of protecting trust in the company and the starting point of organizational learning to prevent recurrence. A single failure report sent to a customer can decide whether a contract is renewed, and a single internal audit report can shape the team's reputation.

Korean engineers who handle failures at Japanese IT sites are often skilled at resolving the technical problem but struggle to write reports and communicate with stakeholders in Japanese. This article systematically covers Japanese communication across the entire incident life cycle, from the moment a failure occurs to the postmortem.

## Basic structure of failure reports

Failure reports used in the Japanese IT industry fall broadly into three types. Each has a different purpose, readership, and deadline, so you must choose the document that fits the situation.

### Report type comparison table

| Category | 障害報告書 (しょうがいほうこくしょ) | 経緯書 (けいいしょ) | ポストモーテム |
| --- | --- | --- | --- |
| Purpose | Official report of the facts of the failure and the results of the response | Detailed chronological record of events | Root cause analysis and learning to prevent recurrence |
| Readers | Customers, management, related departments | In-house managers, quality assurance department | Engineering team, related developers |
| When written | Within 24 to 48 hours after recovery | Within 1 week after recovery | Within 1 to 2 weeks after recovery |
| Tone | Formal, with emphasis on apology and recurrence prevention | Objective listing of facts | Blameless, learning-oriented |
| Length | 1 to 2 A4 pages | 2 to 3 A4 pages | 3 to 5 A4 pages |
| Honorific level | Highest (including 丁重語) | High (mainly 丁寧語) | Standard (です/ます style) |

### Standard structure of a failure report

A standard Japanese failure report consists of the following items. If even one item is missing, it may be flagged as 「記載漏れ(きさいもれ)」 (an omission), so use the list like a checklist.

```text
障害報告書

作成日：2026年3月7日
作成者：開発部 金 英柱

■ 障害概要
・障害件名：決済システム応答遅延障害
・障害番号：INC-2026-0307-001
・発生日時：2026年3月6日（金）14:23 JST
・復旧日時：2026年3月6日（金）16:45 JST
・障害時間：2時間22分
・影響範囲：EC決済サービス全ユーザー（約50,000名）
・重要度　：Critical（緊急）

■ 障害内容
本番環境の決済APIサーバーにおいて、データベースコネクションプールの枯渇が発生し、
決済処理の応答時間が通常の約30倍（平均15秒）に増大しました。
これにより、決済タイムアウトが多発し、約3,200件の決済処理が失敗いたしました。

■ 原因
・直接原因：バッチ処理の実行タイミングが重複し、DBコネクションが上限（200）に到達
・根本原因：バッチスケジューラーの排他制御が未実装であったこと

■ 対応経緯（タイムライン）
  14:23 監視アラート検知（Datadog）
  14:25 オンコールエンジニア確認開始
  14:30 障害対策本部設置、関係者召集
  14:45 原因特定（DBコネクションプール枯渇）
  15:00 暫定対応実施（バッチ処理停止、コネクションプール拡張）
  15:30 応答時間正常化確認
  16:00 失敗トランザクションの再処理完了
  16:45 全面復旧宣言

■ 暫定対応
・バッチ処理の手動停止
・コネクションプール上限を200→500に拡張
・監視閾値の引き下げ（応答時間3秒→1秒）

■ 恒久対応（再発防止策）
・バッチスケジューラーに排他制御を実装（対応期限：3月14日）
・コネクションプール監視ダッシュボードの整備（対応期限：3月10日）
・負荷試験シナリオにバッチ同時実行ケースを追加（対応期限：3月21日）

■ お客様への影響と対応
・決済失敗3,200件は全件再処理完了済み
・影響を受けたお客様には個別にお詫びメールを送信済み

この度はご迷惑をおかけし、深くお詫び申し上げます。
再発防止に向けて、上記対策を確実に実施してまいります。

以上
```
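
Because a missing item is flagged as 記載漏れ, the checklist idea can be automated before a draft goes out. The following is a minimal sketch, not an established tool: `REQUIRED_SECTIONS` simply mirrors the ■ headings of the sample report above, and `find_missing_sections` is a hypothetical helper that assumes every section starts with a line beginning with `■`.

```python
# Section names copied from the sample report; the checker is a hypothetical helper.
REQUIRED_SECTIONS = [
    "障害概要",
    "障害内容",
    "原因",
    "対応経緯",
    "暫定対応",
    "恒久対応",
    "お客様への影響と対応",
]


def find_missing_sections(report_text: str) -> list[str]:
    """Return the required section names that have no '■ <name>' heading."""
    headings = [
        line.lstrip("■").strip()
        for line in report_text.splitlines()
        if line.startswith("■")
    ]
    # A heading such as "対応経緯（タイムライン）" counts as "対応経緯".
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(heading.startswith(name) for heading in headings)
    ]


draft = """障害報告書
■ 障害概要
■ 障害内容
■ 原因
■ 対応経緯（タイムライン）
■ 暫定対応
"""
print(find_missing_sections(draft))
```

For this draft the helper reports 恒久対応 and お客様への影響と対応 as missing. A team with a different house template would adjust the list accordingly.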


## Core Japanese expressions

The Japanese used in failure response is more specialized and formal than everyday business Japanese. This section summarizes the expressions you need at each stage of the failure life cycle.

### Reporting a failure

| Japanese expression | Reading | Meaning | When to use |
| --- | --- | --- | --- |
| 障害が発生しました | しょうがいがはっせいしました | A failure has occurred | First report |
| サービスが停止しております | サービスがていししております | The service is down | When the service is down |
| 応答遅延が発生しております | おうとうちえんがはっせいしております | Response delays are occurring | When performance degrades |
| アラートを検知しました | アラートをけんちしました | An alert was detected | Monitoring alert |
| 影響範囲を調査中です | えいきょうはんいをちょうさちゅうです | The scope of impact is under investigation | Initial investigation phase |
| 障害対策本部を設置しました | しょうがいたいさくほんぶをせっちしました | A failure response headquarters has been set up | For major failures |

### Cause analysis

| Japanese expression | Reading | Meaning |
| --- | --- | --- |
| 原因を特定しました | げんいんをとくていしました | The cause has been identified |
| 根本原因 (RCA) | こんぽんげんいん | Root cause |
| 直接原因 | ちょくせつげんいん | Direct cause |
| 原因調査を継続しております | げんいんちょうさをけいぞくしております | The investigation into the cause is continuing |
| 設定ミスが原因と判明しました | せっていミスがげんいんとはんめいしました | A configuration mistake turned out to be the cause |
| 想定外の負荷が原因です | そうていがいのふかがげんいんです | Unexpected load is the cause |

### Scope of impact

| Japanese expression | Reading | Meaning |
| --- | --- | --- |
| 影響範囲は限定的です | えいきょうはんいはげんていてきです | The scope of impact is limited |
| 全ユーザーに影響があります | ぜんユーザーにえいきょうがあります | All users are affected |
| 一部機能が利用不可です | いちぶきのうがりようふかです | Some features are unavailable |
| データ損失はございません | データそんしつはございません | There is no data loss |
| 約〇〇件のトランザクションに影響 | やく〇〇けんのトランザクションにえいきょう | Approximately 〇〇 transactions affected |

### Recovery work

| Japanese expression | Reading | Meaning |
| --- | --- | --- |
| 復旧作業を開始しました | ふっきゅうさぎょうをかいししました | Recovery work has begun |
| 暫定対応を実施しました | ざんていたいおうをじっししました | A temporary response has been implemented |
| ロールバックを実施します | ロールバックをじっしします | We will perform a rollback |
| サービスが正常に復旧しました | サービスがせいじょうにふっきゅうしました | The service has been restored to normal |
| 経過観察中です | けいかかんさつちゅうです | Under observation |
| 全面復旧を宣言します | ぜんめんふっきゅうをせんげんします | We declare full recovery |

## Emergency reporting communication (Slack / phone)

In the early stages of a failure, real-time communication takes priority over official reports. The Japanese expressions used in Slack and on the phone have a different tone than in reports, but they should convey both accuracy and a sense of urgency.

### Slack failure reporting message template

The following format is commonly used in failure response Slack channels (usually #incident or #障害対応) at Japanese IT companies.

```text
🚨 【障害発生】決済システム応答遅延

@channel 障害が発生しました。

■ 発生日時：2026/03/06 14:23 JST
■ 影響：決済APIのレスポンスタイムが15秒超
■ 影響範囲：EC決済サービス全体
■ ステータス：🔴 調査中
■ 担当：@tanaka @kim
■ 対応チャンネル：#inc-20260306-payment

現在原因調査中です。状況分かり次第、随時更新します。

---

[14:30 更新] @kim
原因の手がかりを掴みました。DBコネクションプールが枯渇している
模様です。詳細調査を続けます。

[14:45 更新] @kim
原因特定しました。バッチ処理の重複実行によるDBコネクション枯渇です。
暫定対応としてバッチ停止＋コネクションプール拡張を実施します。

[15:30 更新] @tanaka
暫定対応完了。レスポンスタイムが正常値に戻りました。
引き続き失敗トランザクションの再処理を行います。

[16:45 更新] @kim
全面復旧しました。失敗トランザクション3,200件の再処理も完了。
ステータスを🟢に変更します。
明日以降、ポストモーテムを実施予定です。
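```

Key points of the Slack report:

- **@channel** notifies everyone. It reaches people even late at night, so use it only for Critical failures. For Major and below, use **@here** (notifies online members only).
- Express the status visually with emoji (🔴 調査中 → 🟡 対応中 → 🟢 復旧済み).
- Record the time of every update and post it directly to the channel, which is the convention at many Japanese workplaces.
- 「模様です」 (it appears that ~) marks an estimate, while 「判明しました」 (it has been determined) marks a confirmed fact. Avoid assertive wording while you are still estimating.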
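
The template and notes above can be turned into a small formatter so that the first report has the same shape every time. This is a sketch under stated assumptions: `build_first_report` and `build_update` are hypothetical helpers that only build strings (no Slack API is called), and the mention rule follows the note above (@channel for Critical, @here otherwise).

```python
def build_first_report(title, occurred_at, impact, scope, owners, severity="Critical"):
    """Build the first-report text in the same layout as the template above."""
    # Mention rule from the notes: @channel only for Critical, @here otherwise.
    mention = "@channel" if severity == "Critical" else "@here"
    lines = [
        f"🚨 【障害発生】{title}",
        "",
        f"{mention} 障害が発生しました。",
        "",
        f"■ 発生日時：{occurred_at}",
        f"■ 影響：{impact}",
        f"■ 影響範囲：{scope}",
        "■ ステータス：🔴 調査中",
        f"■ 担当：{' '.join(owners)}",
        "",
        "現在原因調査中です。状況分かり次第、随時更新します。",
    ]
    return "\n".join(lines)


def build_update(hhmm, author, body):
    """Build a follow-up post that starts with the update time and the author."""
    return f"[{hhmm} 更新] {author}\n{body}"


message = build_first_report(
    title="決済システム応答遅延",
    occurred_at="2026/03/06 14:23 JST",
    impact="決済APIのレスポンスタイムが15秒超",
    scope="EC決済サービス全体",
    owners=["@tanaka", "@kim"],
)
print(message)
print(build_update("14:30", "@kim", "DBコネクションプールが枯渇している模様です。"))
```

Keeping the wording in one function makes it harder to forget a field such as 影響範囲 when posting under pressure.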

### Phone escalation expressions (電話でのエスカレーション表現)

At Japanese IT sites, a Critical failure calls for escalation by phone in addition to Slack, including at night.

On the phone, use more formal spoken expressions than you would on Slack.

**First report to your superior**

「お疲れ様です。開発部の金です。緊急のご連絡です。本日14時23分頃より、決済システムにて重大な障害が発生しております。現在、原因調査を開始しておりますが、障害対策本部の設置をお願いしたく、ご連絡いたしました。」

(Thank you for your hard work. This is Kim from the development department. This is an urgent call. There has been a major failure in the payment system since around 14:23 today. We have started investigating the cause, and I am contacting you to request the establishment of a failure response headquarters.)

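**Contacting the customer representative**

「お世話になっております。〇〇の金でございます。大変申し訳ございませんが、現在弊社の決済システムにて障害が発生しており、貴社のサービスにも影響が出ている状況でございます。現在、全力で復旧作業を進めております。復旧の目処が立ち次第、改めてご連絡いたします。」

(Thank you for your continued support. This is Kim from 〇〇. We are very sorry, but a failure is currently occurring in our payment system, and it is also affecting your company's service. We are making every effort to recover. We will contact you again as soon as there is a prospect for recovery.)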

The point to note here is the expression 「復旧の目処(ふっきゅうのめど)」 (prospect for recovery). What the other party most wants to know is when service will be restored, but you should not promise a time you cannot be sure of. The principle is to present a target, such as 「〇時間以内の復旧を目指しております」 (we are aiming for recovery within 〇 hours), while avoiding a firm commitment.

## How to write a chronology report (経緯書の書き方)

A chronology report (経緯書) records the course of events in more detail than a failure report. While a failure report focuses on “what happened and how you responded,” a chronology report records “exactly in what order the events unfolded, minute by minute.”

### Principles of writing a chronology report

1. **事実のみを記載する** (Record only objective facts) - Write only facts, excluding guesses and emotions.
2. **時系列で記載する** (Write in time series) - Always organize entries in chronological order.
3. **主語を明確にする** (Clarify the subject) - Clearly record who did what.
4. **根拠となるログやエビデンスを添付する** (Attach supporting logs or evidence)

### Chronology report timeline template

```text
経緯書

件名：決済システム応答遅延障害に関する経緯書
作成日：2026年3月7日
作成者：開発部 金 英柱
承認者：開発部部長 田中 太郎

1. 障害発生前の状況
  2026年3月6日（金）は通常の営業日であり、特別なリリースや
  メンテナンスは予定されていなかった。
  14:00より、月次集計バッチ処理が自動実行される設定となっていた。

2. 障害発生から復旧までの経緯

  時刻      | 担当者 | 対応内容
  ----------|--------|--------------------------------------------------
  14:00     | (自動) | 月次集計バッチ処理が自動実行開始
  14:15     | (自動) | 日次レポートバッチ処理が自動実行開始（重複実行）
  14:23     | (自動) | Datadogアラート発報（API応答時間 > 5秒）
  14:23     | 金     | PagerDutyよりオンコール通知受信
  14:25     | 金     | Datadogダッシュボード確認、障害認定
  14:26     | 金     | #incident Slackチャンネルに障害発生を投稿
  14:28     | 金     | 田中部長に電話でエスカレーション
  14:30     | 田中   | 障害対策本部設置を指示
  14:30     | 金     | DBメトリクス確認開始
  14:35     | 金     | コネクションプール使用率100%を確認
  14:40     | 金     | バッチ処理の重複実行をCloudWatchログで確認
  14:45     | 金     | 原因特定：バッチ重複実行によるDB接続枯渇
  14:50     | 田中   | 暫定対応方針を承認
  14:55     | 金     | 月次集計バッチ処理を手動停止
  15:00     | 金     | コネクションプール上限を200→500に変更、デプロイ
  15:10     | 金     | API応答時間の改善を確認（15秒→0.8秒）
  15:15     | 金     | Slackにて暫定対応完了を報告
  15:20     | 佐藤   | 失敗トランザクションの抽出開始（SQL実行）
  15:45     | 佐藤   | 失敗トランザクション3,200件を特定
  16:00     | 佐藤   | 再処理バッチ実行、全件正常完了確認
  16:30     | 金     | 経過観察（30分間異常なし）
  16:45     | 田中   | 全面復旧宣言

3. 障害の原因
  （省略：障害報告書と同内容を詳細に記載）

4. 再発防止策
  （省略：後述の再発防止策セクションと同内容）

5. 添付資料
  ・別紙1：Datadogアラート画面キャプチャ
  ・別紙2：CloudWatchログ抜粋
  ・別紙3：DBコネクション数推移グラフ

以上
```

Expressions frequently used in chronology reports include 「〇〇を確認」 (confirm 〇〇), 「〇〇を実施」 (carry out 〇〇), 「〇〇を指示」 (instruct 〇〇), and 「〇〇を報告」 (report 〇〇). All of them take the form of a Sino-Japanese noun (漢語) + する, which suits concise and objective description.
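
The timeline principles above (chronological order, an explicit subject on every row) are easy to check mechanically. The sketch below is illustrative only: `render_timeline` is a hypothetical helper, and it assumes times are given as `HH:MM` strings within a single day.

```python
from datetime import datetime


def render_timeline(events):
    """Render (time, person, action) tuples as chronologically sorted rows.

    Raises ValueError when the person is blank, because every entry must say
    who acted. Use "(自動)" for automated events, as in the template.
    """
    events = list(events)
    for time_text, person, action in events:
        if not person.strip():
            raise ValueError(f"{time_text}: 担当者 is missing for '{action}'")
    # sorted() is stable, so events sharing a minute keep their input order.
    ordered = sorted(events, key=lambda e: datetime.strptime(e[0], "%H:%M"))
    rows = ["時刻 | 担当者 | 対応内容"]
    rows += [f"{t} | {who} | {what}" for t, who, what in ordered]
    return "\n".join(rows)


events = [
    ("14:25", "金", "Datadogダッシュボード確認、障害認定"),
    ("14:23", "(自動)", "Datadogアラート発報（API応答時間 > 5秒）"),
    ("14:23", "金", "PagerDutyよりオンコール通知受信"),
]
print(render_timeline(events))
```

Entries can then be collected in any order during the incident and sorted once when the document is written.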

## Writing recurrence prevention measures

Recurrence prevention measures (再発防止策, さいはつぼうしさく) are the core of a failure report. In the Japanese IT industry, there is a strong tendency to judge a team's capability by **the quality of its recurrence prevention measures** rather than by the failure itself. The determination to “never cause the same problem twice” must be shown through specific action items.

### Three-part structure of recurrence prevention measures

Effective recurrence prevention measures in the Japanese IT field are classified and developed from the following three perspectives.

| Category | Reading | Description | Example |
| --- | --- | --- | --- |
| 発生防止 | はっせいぼうし | Prevent the same thing from happening | Implement exclusive control, automate configuration verification |
| 早期検知 | そうきけんち | Detect it immediately even if it happens | Strengthen monitoring, adjust alert thresholds |
| 影響軽減 | えいきょうけいげん | Minimize the impact after detection | Introduce automatic failover and circuit breakers |

### Expressions to avoid and correct expressions when writing measures to prevent recurrence

| Bad expression (NG) | Problem | Good expression (OK) |
| ------------------------------ | ------------------ | ------------------------------------------------------ |
| 注意します | Not specific | チェックリストを作成し、ダブルチェック体制を導入する |
| 気をつけます | Relies on human will | 自動チェックの仕組みを導入する |
| 見直しを検討します | Action unclear | CIパイプラインにバッチ排他制御のテストケースを追加する |
| Next question | unfounded assertion | As soon as I got my money back, I decided to do it again |
| ヒューマンエラーなので仕方ない | Avoidance of responsibility | ヒューマンエラーを防止するための自動化を実装する |

As this table shows, measures to prevent recurrence in Japanese IT sites should focus on improving systems and processes, not human attention. “Be careful” is not recognized as a measure to prevent recurrence, and technical or process measures must be described in detail.

## Postmortem culture (ポストモーテム)

At Japan's leading IT companies (Mercari, LINE, Rakuten, and others), a postmortem (ポストモーテム) culture has taken root under the influence of Google SRE. Unlike a failure report or a chronology report, a postmortem is a document that focuses on technical learning under the **blameless** principle.

### Differences between postmortems and traditional reports

While failure reports and chronology reports focus on “誰が何をした” (who did what), postmortems dig into why the failure happened and what the system was lacking.

### Postmortem template

```text
ポストモーテム報告書

タイトル：決済システム応答遅延障害（INC-2026-0307-001）
作成日：2026年3月13日
ファシリテーター：金 英柱
参加者：田中太郎、佐藤花子、鈴木一郎

■ サマリー
2026年3月6日14:23〜16:45（2時間22分）にわたり、
決済システムの応答遅延が発生。約3,200件の決済処理が失敗した。
直接原因はバッチ処理の重複実行によるDBコネクション枯渇。
根本原因はバッチスケジューラーの排他制御の欠如。

■ インパクト
・影響期間：2時間22分
・影響ユーザー数：約50,000名
・失敗トランザクション：3,200件（全件再処理完了）
・推定売上影響：約480万円（遅延による離脱を含む）
・SLA違反：あり（月次稼働率99.95%目標に対し影響あり）

■ タイムライン
（経緯書のタイムラインを転記）

■ 根本原因分析（5 Whys）
1. なぜ決済APIが遅延したか？
   → DBコネクションプールが枯渇したため
2. なぜコネクションプールが枯渇したか？
   → 2つのバッチ処理が同時に大量のDB接続を使用したため
3. なぜバッチ処理が同時実行されたか？
   → スケジューラーに排他制御が未実装だったため
4. なぜ排他制御が未実装だったか？
   → 初期設計時にバッチ処理の同時実行シナリオが考慮されていなかったため
5. なぜ設計レビューで検知できなかったか？
   → バッチ処理の負荷試験シナリオが不十分だったため

■ うまくいったこと（What went well）
・アラート検知から2分以内にオンコール担当が対応を開始した
・原因特定まで22分と迅速だった
・関係者間の情報共有がSlackで適切に行われた
・失敗トランザクションの再処理手順が整備されていた

■ うまくいかなかったこと（What didn't go well）
・バッチ排他制御が設計段階で考慮されていなかった
・コネクションプールの監視アラートの閾値が高すぎた（80%→実質的に検知不能）
・顧客への第一報に35分を要した（目標は15分以内）

■ 幸運だったこと（Where we got lucky）
・障害発生がピーク時間帯（12:00〜13:00）を過ぎていた
・データ損失が発生しなかった

■ アクションアイテム
| No. | アクション | 担当者 | 期限 | ステータス |
|-----|-----------|--------|------|-----------|
| 1   | バッチスケジューラー排他制御実装 | 金 | 3/14 | 対応中 |
| 2   | コネクションプール監視強化 | 佐藤 | 3/10 | 完了 |
| 3   | 負荷試験シナリオ追加 | 鈴木 | 3/21 | 未着手 |
| 4   | 顧客通知フロー見直し | 田中 | 3/17 | 未着手 |
| 5   | 障害訓練（ゲームデー）実施 | 金 | 4/末 | 未着手 |

■ 学んだこと（Lessons Learned）
・バッチ処理の設計時には、同時実行・リソース競合を必ず考慮する
・「起きないだろう」という前提でなく、「起きたらどうなるか」で設計する
・監視アラートの閾値は、障害に至る前に検知できる値に設定する

以上
```

In postmortem meetings, facilitation phrases such as the following are used.

- 「このポストモーテムはブレームレスで行います。個人の責任を追及する場ではありません。」 (This postmortem will be conducted blamelessly. It is not a place to pursue individual responsibility.)
- 「個人ではなく、仕組みの改善に焦点を当てましょう。」 (Let's focus on improvements to the system.)
- 「なぜそうなったか、5回掘り下げてみましょう。」 (Let's dig into why it happened five times.)

## Real-life conversation simulation

Let's walk through a simulation of the conversations that take place during an actual failure response. Following the communication at each stage, from detection to recovery, gives a feel for how it works in practice.

### Scenario: Friday afternoon payment system failure

**Scene 1: Slack reporting immediately after failure detection**

Kim (エンジニア): 「@channel 緊急です。決済APIのレスポンスタイムが15秒を超えています。Datadogのアラートを14:23に検知しました。調査を開始します。」

(This is urgent. The response time of the payment API is exceeding 15 seconds. Datadog alert detected at 14:23. Starting investigation.)

**Scene 2: Call Escalation to Boss**

Kim: 「田中部長、お疲れ様です。金です。緊急のご連絡です。」

Tanaka: 「はい、何かありましたか。」

Kim: 「決済システムに重大な障害が発生しております。14時23分にDatadogアラートを検知し、現在APIの応答時間が通常の約30倍に増大しております。障害対策本部の設置をお願いいたします。」

Tanaka: 「分かりました。影響範囲は分かりますか。」

Kim: 「現時点では、EC決済サービス全体に影響が出ていると見ております。詳細な影響範囲は調査中です。」

Tanaka: 「了解しました。障害対策本部を設置します。原因調査を最優先でお願いします。」

Kim: 「承知いたしました。状況が分かり次第、Slackで随時報告いたします。」

**Scene 3: Identifying the cause and confirming response policy**

Kim (Slackにて): 「@tanaka_bucho 原因を特定しました。月次集計バッチと日次レポートバッチが同時に実行され、DBコネクションプールが枯渇しております。暫定対応として、(1)バッチ処理の手動停止、(2)コネクションプールの上限拡張を実施したく、ご承認をお願いいたします。」

Tanaka: 「了解です。その方針で進めてください。本番環境への変更なので、変更内容をSlackに記録した上で実施してください。」

Kim: 「承知いたしました。変更内容をSlackに記録した上で実施いたします。」

**Scene 4: Post-Recovery Debrief**

Kim (Slackにて): 「全面復旧いたしました。16:45に全面復旧を宣言させていただきます。失敗トランザクション3,200件の再処理も完了しております。障害報告書は明日中に作成いたします。ポストモーテムは来週実施予定です。日程が決まり次第、ご連絡いたします。」

What stands out in this simulation is that Kim **asks for approval (お伺いを立てる)** at every step. In the Japanese IT field, even when the technical judgment is correct, it is important to obtain approval from your superior before acting. Changes to the production environment (本番環境) in particular must be approved in advance.

## Precautions and common mistakes

We summarize the mistakes that Korean engineers frequently make when responding to failures at Japanese IT sites and how to deal with them.

### Mistake 1: Expressing guesses as assertions

**NG**: 「原因はDBの負荷です。」 (The cause is DB load.)
**OK**: 「原因はDBの負荷と思われます。調査を継続しております。」 (The cause is thought to be DB load. Investigation is continuing.)

If you state unconfirmed information as fact, you will lose trust when the cause later turns out to be different. Japanese has a rich set of expressions for estimation, so use them actively.

- 「〇〇と思われます」 (It is thought to be 〇〇)
- 「〇〇の可能性があります」 (There is a possibility of 〇〇)
- 「〇〇と見ております」 (We see it as 〇〇)
- 「〇〇の模様です」 (It appears to be 〇〇)

### Mistake 2: Responding independently without reporting

In Korea's IT field, it is often considered a virtue for engineers to make autonomous decisions and respond quickly. In Japan, however, 報告・連絡・相談 (reporting, contact, and consultation, known as ほうれんそう) is the baseline. Changes to the production environment in particular, no matter how urgent, must first be approved by your superior.

**NG**: (Restart the server immediately without reporting to Slack)
**OK**: 「サーバーの再起動を実施したいのですが、よろしいでしょうか。」 (I would like to restart the server. May I proceed?)

### Mistake 3: Lack or excess of apology

If a failure report contains no apology, readers conclude that there is no remorse. Conversely, excessive apology causes anxiety.

**Lacking**: 「障害が発生しました。復旧済みです。」 (A failure occurred. Recovery is complete.) - No apology.
**Excessive**: 「大変申し訳ございません、本当に申し訳ございません、誠に...」 - Excessive repetition.
**Appropriate**: 「この度はご迷惑をおかけし、深くお詫び申し上げます。」 (We deeply apologize for the inconvenience caused.) - One sincere sentence.

### Mistake 4: Recurrence prevention that relies on willpower

We covered this earlier, but it is worth emphasizing again. 「気をつけます」 (I will be careful) and 「今後注意します」 (I will be careful in the future) are never accepted as recurrence prevention measures in the Japanese IT industry. Specific technical measures or process changes must be stated.
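
A draft list of measures can be screened for such wording before review. This is only a sketch: `VAGUE_PHRASES` contains the phrases quoted in this article and is not an exhaustive list, and `flag_vague_measures` is a hypothetical helper based on plain substring matching.

```python
# Phrases quoted in the article as unacceptable; illustrative, not exhaustive.
VAGUE_PHRASES = ("気をつけます", "注意します")


def flag_vague_measures(measures):
    """Return (measure, matched_phrase) pairs for measures that rely on willpower."""
    flagged = []
    for measure in measures:
        for phrase in VAGUE_PHRASES:
            if phrase in measure:
                flagged.append((measure, phrase))
                break  # one match is enough to flag the measure
    return flagged


measures = [
    "今後注意します",
    "バッチスケジューラーに排他制御を実装",
    "次回から気をつけます",
]
for measure, phrase in flag_vague_measures(measures):
    print(f"NG: {measure} (contains {phrase})")
```

A check like this catches only the obvious cases. Whether a measure is genuinely specific and verifiable still needs a human reviewer.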

### Mistake 5: Inaccurate recording of time

Write the time precisely as 「14時23分」 (14:23), not 「14時ごろ」 (around 14:00). In failure reports and chronology reports, time accuracy is a key factor in the credibility of the document. The principle is to record times to the minute, based on the logs of the monitoring tool.
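
As a rough illustration, minute-level precision can be checked with a regular expression. The pattern below is an assumption made for this sketch: it accepts `HH:MM` and 「〇時〇分」 and does not handle every way of writing a time in Japanese.

```python
import re

# Matches "14:23" or "14時23分"; an illustrative pattern, not a complete one.
PRECISE_TIME = re.compile(r"\d{1,2}:\d{2}|\d{1,2}時\d{1,2}分")


def has_minute_precision(text):
    """Return True when the text contains a time written down to the minute."""
    return bool(PRECISE_TIME.search(text))


for entry in ["14時ごろ 監視アラート検知", "14時23分 監視アラート検知", "14:23 監視アラート検知"]:
    status = "OK" if has_minute_precision(entry) else "NG"
    print(f"{status}: {entry}")
```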

## Learn from failures

Let's look at lessons from failure-reporting mistakes that can happen in the field.

### Case 1: Loss of trust due to reporting delays

A Korean engineer detected a problem during a nighttime on-call shift. The cause was identified quickly and the system was restored within 30 minutes, but the incident was reported on Slack only after recovery. The next day, the engineer was criticized by a supervisor: “Why didn't you report it right away?”

**Lesson**: In Japan, reporting comes before recovery. The principle is 「まず報告、それから対応」 (report first, then respond). As soon as a failure is detected, file a first report along with starting the investigation. It is fine if the content is incomplete. A single line such as “There may be a failure. We are checking.” is sufficient.

### Case 2: When measures to prevent recurrence are not accepted

In a failure report, the recurrence prevention measure was written as “the person in charge will check thoroughly before deployment.” The supervisor's feedback was: “That is a mindset, not a measure. Please propose technical measures.”

**Lesson**: Recurrence prevention measures must be **verifiable and automatable**. Describe them as specific technical actions, such as 「CI/CDパイプラインに〇〇のチェックを追加する」 (add a 〇〇 check to the CI/CD pipeline) or 「〇〇のアラート閾値を変更する」 (change the alert threshold of 〇〇).

### Case 3: Excessive technical terminology in customer reports

In a failure report sent to a customer, the technical details were written out as they were: "DB control function...". The customer representative responded with a request: “Could you write it in a way that is easier to understand?”

**Lesson**: Customer-facing reports should minimize technical jargon and focus on impact and response. Rephrase in non-technical terms, such as 「システム内部の処理能力の上限に達したため、一時的にサービスの応答が遅くなりました」 (the service response temporarily slowed down because the system's internal processing capacity reached its upper limit).

## Failure response email template

Both the timing and the content of emails sent to customers or stakeholders during a failure are important. The following is a template for the first-report email sent when a failure occurs.

```text
件名：【緊急・障害報告】決済システム障害発生のお知らせ

株式会社ABCコマース
システム管理部
佐々木 様

いつもお世話になっております。
株式会社XYZテック 開発部の金 英柱でございます。

大変申し訳ございませんが、弊社が運用しております
決済システムにおいて、下記の通り障害が発生しております。

■ 障害概要
・発生日時：2026年3月6日（金）14:23頃
・影響内容：決済処理の応答遅延
・影響範囲：決済サービスをご利用の全てのお客様
・現在のステータス：原因調査中

■ 現在の対応状況
現在、原因の特定および復旧作業を最優先で進めております。
復旧の目処が立ち次第、改めてご連絡いたします。

ご利用のお客様にはご不便をおかけしておりますこと、
深くお詫び申し上げます。

取り急ぎ、第一報としてご連絡申し上げます。

何卒よろしくお願い申し上げます。

──────────────────────────
株式会社XYZテック 開発部
金 英柱（キム・ヨンジュ）
TEL：03-XXXX-XXXX
Email：kim@xyztech.co.jp
──────────────────────────
```

Noteworthy expressions in this email:

- 「取り急ぎ、第一報としてご連絡申し上げます」 (I am contacting you in haste with a first report) - Signals that the information is not yet complete.
- 「復旧の目処が立ち次第、改めてご連絡いたします」 (We will contact you again as soon as there is a prospect for recovery) - Promises no specific time but announces a follow-up.
- 「深くお詫び申し上げます」 (We deeply apologize) - A formal apology.

## Checklist

The items that are easy to miss during a failure are organized into a checklist below. Sharing it within your team helps you respond to a real failure without omissions.

### Immediately after the failure occurs

- Confirm the monitoring alert and recognize the failure
- Post the first report to the Slack #incident channel (Slack第一報)
- Escalate to your superior
- Request the establishment of a failure response headquarters (障害対策本部) ※ for Critical failures
- Get an initial grasp of the scope of impact

### During investigation and response (調査・対応中)

- Update the investigation status on Slack every 15 minutes (15分ごとに更新)
- Obtain approval of the response policy from your superior (対応方針の承認取得)
- Record the changes when changing the production environment (本番変更内容の記録)
- Send the first-report email to the customer representative (顧客への第一報)

### After recovery

- Monitor the system (confirm that there are no abnormalities for at least 30 minutes)
- Declare full recovery
- Send the recovery completion report to the customer by email (復旧完了報告)
- Handle remaining tasks such as failed transactions (残対応の実施)

### Post-incident follow-up

- Write the failure report (within 24 to 48 hours)
- Write the chronology report (within 1 week)
- Hold the postmortem (within 1 to 2 weeks) (ポストモーテム実施)
- Implement and track the recurrence prevention measures
- Report completion of the recurrence prevention measures
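
The post-incident deadlines in the checklist can be derived from the recovery time. The sketch below uses the outer bound of each range (48 hours, 1 week, 2 weeks) and counts calendar time rather than business days. Both choices are assumptions made for illustration, and `follow_up_deadlines` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def follow_up_deadlines(recovered_at):
    """Latest due date for each post-incident document, counted from recovery."""
    return {
        "障害報告書": recovered_at + timedelta(hours=48),
        "経緯書": recovered_at + timedelta(weeks=1),
        "ポストモーテム": recovered_at + timedelta(weeks=2),
    }


# Recovery time of the sample incident: 2026-03-06 16:45.
deadlines = follow_up_deadlines(datetime(2026, 3, 6, 16, 45))
for document, due in deadlines.items():
    print(f"{document}: {due:%Y-%m-%d %H:%M} まで")
```

For the sample incident this gives March 8 for the 障害報告書, March 13 for the 経緯書, and March 20 for the ポストモーテム, so the sample documents dated March 7 and March 13 fall within the limits.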

## Japanese abbreviations and jargon frequently used in failure response

Abbreviations and technical terms are used frequently in Slack and in spoken communication in the Japanese IT field. If you do not know them, it is hard to keep up with the pace of the conversation.

| Abbreviation/term | Full form | Meaning |
| --- | --- | --- |
| 障害票 | 障害管理票 | Failure management ticket |
| 一次切り分け | 一次切り分け調査 | Initial triage to isolate where the fault lies |
| 暫定/恒久 | 暫定対応/恒久対応 | Temporary response / permanent response |
| 横展開 | 横展開(よこてんかい) | Horizontal rollout: checking similar systems for the same problem |
| 水平展開 | 水平展開確認 | Check whether the same cause exists elsewhere |
| 切り戻し | 切り戻し(きりもどし) | Rollback |
| 縮退運転 | 縮退運転(しゅくたいうんてん) | Operating with reduced functionality (degraded mode) |
| 冗長化 | 冗長化(じょうちょうか) | Redundancy |
| フェイルオーバー | フェイルオーバー | Failover |
| ゲームデー | ゲームデー | Failure response drill (Game Day) |

In particular, 「横展開(よこてんかい)」 is a concept that is rarely used in Korea but is treated as very important in the Japanese IT field. When a failure occurs, a request almost always follows: "Please do a 横展開 to check whether the same problem exists in other systems."

## Communication by failure severity level

The scope and formality of the communication required depend on the severity of the failure. The table below shows the response for each level at a glance.

| Item | Critical (P1) | Major (P2) | Minor (P3) | Low (P4) |
| --- | --- | --- | --- | --- |
| Japanese | 緊急 | 重大 | 軽微 | 低 |
| Slack notification | @channel | @here | Post in channel | Thread |
| Phone escalation | Required | Depending on the situation | Not necessary | Not necessary |
| Customer contact | Immediately | Within 1 hour | When needed | After-action report |
| Failure report | Required (within 24h) | Required (within 48h) | Brief report | Ticket record only |
| Postmortem | Required | Recommended | Optional | Not necessary |
| Management report | Immediately | Same day | Weekly report | Not necessary |
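
The matrix above is essentially a lookup table, so it can be kept as data next to the runbook. The sketch below encodes the table as-is. The dictionary keys and the `communication_plan` helper are hypothetical names chosen for illustration.

```python
# Values transcribed from the severity table above; key names are illustrative.
SEVERITY_RULES = {
    "Critical": {
        "japanese": "緊急", "slack": "@channel", "phone": "required",
        "customer_contact": "immediately", "failure_report": "required (within 24h)",
        "postmortem": "required", "management_report": "immediately",
    },
    "Major": {
        "japanese": "重大", "slack": "@here", "phone": "depending on the situation",
        "customer_contact": "within 1 hour", "failure_report": "required (within 48h)",
        "postmortem": "recommended", "management_report": "same day",
    },
    "Minor": {
        "japanese": "軽微", "slack": "post in channel", "phone": "not necessary",
        "customer_contact": "when needed", "failure_report": "brief report",
        "postmortem": "optional", "management_report": "weekly report",
    },
    "Low": {
        "japanese": "低", "slack": "thread", "phone": "not necessary",
        "customer_contact": "after-action report", "failure_report": "ticket record only",
        "postmortem": "not necessary", "management_report": "not necessary",
    },
}


def communication_plan(severity):
    """Look up the communication requirements for a severity level."""
    try:
        return SEVERITY_RULES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


plan = communication_plan("Major")
print(plan["slack"], "/", plan["failure_report"])
```

Keeping the rules in one structure means that a change in policy, for example a shorter reporting deadline, is made in a single place.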

## References

To learn more about writing failure reports and incident response, the resources below can be helpful.

- [About failure reports - Qiita](https://qiita.com/hirokidaichi/items/f9f4549c88aaf8b38bda) - Article covering the basic principles and practical examples of writing a failure report
- [Failure report template - NotePM](https://notepm.jp/template/failure-report) - Collection of ready-to-use failure report templates
- [SHIFT](https://service.shiftinc.jp/column/5344/) - Guide to writing a failure report from a quality assurance perspective
- [Failure reports - Qbook](https://www.qbook.jp/column/1793.html) - Failure reporting methodology from a testing perspective
- [ポストモーテムの書き方 - Qiita](https://qiita.com/Ping/items/7cc93bcae87583184121) - How to write a Google SRE style postmortem
- [Google SRE Book - Postmortem Culture](https://sre.google/sre-book/postmortem-culture/) - Original source of postmortem culture
- [PagerDuty Incident Response Guide](https://response.pagerduty.com/) - Incident response process guide

To put the content of this article into practice, it is important to read many Japanese failure reports on a regular basis. Read your own organization's past failure reports or study the templates in the references above to become familiar with Japanese failure response communication. Failures can occur at any time, but prepared engineers can turn them into learning opportunities for the team.