Incident Response Playbook That Actually Gets Used

Most incident playbooks end up as wiki pages nobody reads during an actual incident. Here’s what survives contact with a real 3am pager.

The first five minutes

Exactly one person is the Incident Commander (IC). If it’s not clear who that is, declare yourself IC.

Acknowledge the page
Post in #incidents: “Incident: [brief]. I am IC.”
Start a timeline document (even just a text file)
Check the public status page and update it if the incident is user-visible

Don’t dig into the problem yet. Set up the command structure first.
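
The checklist above is mechanical enough to script, so the first minute costs no thought. Here is a minimal Python sketch, assuming a hypothetical helper named `open_incident` and a plain text file as the timeline. The file name layout is an illustration, and posting the message to #incidents is left to whatever chat tooling you already use.

```python
from datetime import datetime, timezone
from pathlib import Path


def open_incident(brief: str, ic: str, directory: str = ".") -> tuple[str, Path]:
    """Build the #incidents announcement and start a timeline file.

    Hypothetical helper: the file name and layout are assumptions,
    not part of any standard tool.
    """
    now = datetime.now(timezone.utc)
    # Same wording as the announcement line in the checklist.
    announcement = f"Incident: {brief}. I am IC."
    timeline = Path(directory) / f"incident-{now:%Y%m%d-%H%M}.txt"
    timeline.write_text(
        f"Incident: {brief}\nIC: {ic}\n\nTimeline (UTC)\n"
        f"- {now:%H:%M} - {ic} ACK, declared IC\n",
        encoding="utf-8",
    )
    return announcement, timeline
```

Calling `open_incident("checkout errors", "alice")` returns the text to paste into the channel plus the path of the freshly created timeline file.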

Comms are the hardest part

During an incident:

Stakeholders will ask for updates constantly
Engineers will speculate in chat
Customers will escalate through support

Designate a Comms Lead (separate from IC). Their job:

Updates to stakeholders every 15-30 minutes
Status page updates
Coordinate with support/CS teams

IC stays focused on technical resolution.
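
The update cadence is the part that slips first under pressure. As a rough sketch, a Comms Lead could use a check like the one below to decide whether stakeholders are overdue for an update. The function name and the 30-minute default (the upper end of the cadence above) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def update_overdue(last_update: datetime, now: datetime,
                   max_gap_minutes: int = 30) -> bool:
    """Return True when the last stakeholder update is older than the allowed gap."""
    return now - last_update > timedelta(minutes=max_gap_minutes)


# Example: the last update went out at 14:30 UTC.
last = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
print(update_overdue(last, last + timedelta(minutes=15)))  # within the window
print(update_overdue(last, last + timedelta(minutes=31)))  # past the window
```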

Timeline document template

## Incident: [Service] degraded on [date]

### Severity: SEV-2 (impacting users)
### IC: @alice
### Comms: @bob

### Timeline (UTC)
- 14:23 - First page from Pingdom
- 14:25 - alice ACK, declared SEV-2
- 14:28 - Initial hypothesis: DB connection pool exhausted
- ...

### Actions taken
- 14:32 - Scaled DB connection pool from 100 to 200
- 14:35 - Latency improved, error rate dropping

### Open questions
- Why did connection pool fill up?
- Was there a traffic spike?

Update it in real time. This becomes the basis for the post-mortem.
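
Typing timestamps by hand is where timelines drift. A small sketch of a helper that writes entries in the template’s `- HH:MM - text` format, always in UTC. The names `timeline_entry` and `append_entry` are hypothetical, and the timeline is assumed to be a local text file.

```python
from datetime import datetime, timezone
from typing import Optional


def timeline_entry(text: str, when: Optional[datetime] = None) -> str:
    """Format one timeline line; `when` should be timezone-aware."""
    when = when or datetime.now(timezone.utc)
    # Convert to UTC so every responder's entries line up.
    return f"- {when.astimezone(timezone.utc):%H:%M} - {text}"


def append_entry(path: str, text: str, when: Optional[datetime] = None) -> None:
    """Append a formatted entry to the timeline file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(timeline_entry(text, when) + "\n")
```

Appending rather than rewriting the file keeps earlier entries untouched, which matters when the timeline later feeds the post-mortem.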

Escalation criteria

Know when to escalate:

You’re solo and it’s been 20 minutes without progress
Multiple systems affected
Data integrity concerns (corruption, loss)
Security incident (breach, exposed secrets)

Paging someone is not a failure. Staying alone too long is.
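
Writing the criteria down as code removes the temptation to argue with them at 3am. The sketch below encodes the four criteria above. The `IncidentState` fields are assumptions about what you would track, not an existing schema.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    """Hypothetical snapshot of an incident, for illustration only."""
    responders: int
    minutes_without_progress: int
    systems_affected: int
    data_integrity_risk: bool = False
    security_incident: bool = False


def escalation_reasons(state: IncidentState) -> list[str]:
    """Return every escalation criterion the incident currently meets."""
    reasons = []
    if state.responders <= 1 and state.minutes_without_progress >= 20:
        reasons.append("solo for 20+ minutes without progress")
    if state.systems_affected > 1:
        reasons.append("multiple systems affected")
    if state.data_integrity_risk:
        reasons.append("data integrity concern")
    if state.security_incident:
        reasons.append("security incident")
    return reasons
```

An empty list means no criterion is met yet. Any non-empty list is a reason to page someone.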

Mitigation before root cause

In the moment, stop the bleeding:

Roll back the recent deploy
Flip the feature flag
Drain the problem instance from the load balancer
Fail over to the standby

Don’t try to fully understand root cause while users are down. Mitigate first, investigate afterward.

Post-mortem (within 72 hours)

Blameless post-mortem covering:

Summary: what happened, user impact, duration
Timeline: what the responders did, and when
Root cause analysis: dig to 5-whys depth
Contributing factors: not just “the bug”, but what let the bug cause this
Action items: specific, assigned, with due dates

The action items are the whole point. Without them, the post-mortem is therapy, not prevention.
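
Two of those three requirements, assigned and with a due date, can be checked mechanically before the post-mortem is closed. Here is a minimal sketch with a hypothetical `ActionItem` record. Whether an item is specific enough still needs a human reviewer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical action item record, for illustration only."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def missing_fields(item: ActionItem) -> list[str]:
    """List what an action item still lacks before it can be tracked."""
    problems = []
    if not item.owner:
        problems.append("owner")
    if item.due is None:
        problems.append("due date")
    return problems


def unassigned(items: list[ActionItem]) -> list[ActionItem]:
    """Return the items that would silently rot: no owner or no due date."""
    return [item for item in items if missing_fields(item)]
```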

Blameless means blameless

“The engineer didn’t test enough” is a blame statement. “Our test suite didn’t cover this code path” is a systems statement.

Reframe everything as a systems issue. People make mistakes; systems should catch them.

Practice

Run game days: intentionally break things in staging, practice the runbooks. Most incident responses are bad because the team hasn’t practiced.

Running a failover for the first time during an actual outage is terrifying. Do it during a game day first.

What to cut

Don’t include in your playbook:

Detailed diagnostic steps for specific services (too many, and they go out of date)
Long context paragraphs (nobody reads them at 3am)
Complex decision trees (hard to follow under stress)

Do include:

Phone numbers for vendors/cloud providers
Command cheatsheet for the specific platform (kubectl, terraform, etc.)
Escalation contacts for adjacent teams
Links to dashboards and runbooks

Keep it under 2 pages. If it runs longer, split it by service.
