Most incident playbooks end up as wiki pages nobody reads during an actual incident. Here’s what survives contact with a real 3am pager.
The first five minutes
One person is the Incident Commander (IC). If that’s not clear, declare yourself IC.
- Acknowledge the page
- Post in #incidents: “Incident: [brief]. I am IC.”
- Start a timeline document (even just a text file)
- Check the public status page and update it if the impact is user-visible
Don’t dig into the problem yet. Set up the command structure first.
Comms are the hardest part
During an incident:
- Stakeholders will ask for updates constantly
- Engineers will speculate in chat
- Customers will escalate through support
Designate a Comms Lead (separate from IC). Their job:
- Updates to stakeholders every 15-30 minutes
- Status page updates
- Coordinate with support/CS teams
IC stays focused on technical resolution.
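The 15-30 minute cadence is easier to keep when the due times are written down at the start. A minimal sketch, assuming a hypothetical helper named `update_schedule` (the name, defaults, and the range check are illustrative, not part of any tool):

```python
from datetime import datetime, timedelta, timezone
from typing import List


def update_schedule(start: datetime, interval_minutes: int = 15, count: int = 4) -> List[datetime]:
    """Return the times at which the next `count` stakeholder updates are due."""
    # Assumption: enforce the 15-30 minute window suggested for stakeholder updates.
    if not 15 <= interval_minutes <= 30:
        raise ValueError("interval should stay within the 15-30 minute window")
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


if __name__ == "__main__":
    declared = datetime.now(timezone.utc)
    for due in update_schedule(declared, interval_minutes=20):
        print(f"Next update due at {due:%H:%M} UTC")
```

The Comms Lead can paste the printed times into the timeline document and tick them off as updates go out.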
Timeline document template
## Incident: [Service] degraded on [date]
### Severity: SEV-2 (impacting users)
### IC: @alice
### Comms: @bob
### Timeline (UTC)
- 14:23 - First page from Pingdom
- 14:25 - alice ACK, declared SEV-2
- 14:28 - Initial hypothesis: DB connection pool exhausted
- ...
### Actions taken
- 14:32 - Scaled DB connection pool from 100 to 200
- 14:35 - Latency improved, error rate dropping
### Open questions
- Why did connection pool fill up?
- Was there a traffic spike?
Update it in real time. This becomes the basis for the post-mortem.
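Keeping the timeline current is less of a chore if adding an entry takes one call. The following is a rough sketch, assuming the timeline lives in a local text file; `log_event` and the file name are hypothetical, and the line format simply mirrors the template above:

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def log_event(path: Path, message: str, now: Optional[datetime] = None) -> str:
    """Append one '- HH:MM - message' line, stamped in UTC, to the timeline file."""
    now = now or datetime.now(timezone.utc)
    line = f"- {now.astimezone(timezone.utc):%H:%M} - {message}"
    # Append mode keeps earlier entries intact and creates the file if it is missing.
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
    return line


if __name__ == "__main__":
    timeline = Path("incident-timeline.md")  # hypothetical file name
    print(log_event(timeline, "alice ACK, declared SEV-2"))
```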
Escalation criteria
Know when to escalate:
- You’re solo and it’s been 20 minutes without progress
- Multiple systems affected
- Data integrity concerns (corruption, loss)
- Security incident (breach, exposed secrets)
Paging someone is not a failure. Staying alone too long is.
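The criteria above are simple enough to write down as a check, which helps make them unambiguous when the team reviews them. This is only an illustrative sketch: `IncidentState`, its fields, and `escalation_reasons` are made-up names, and the thresholds are the ones from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    responders: int                 # people actively working the incident
    minutes_without_progress: int
    systems_affected: int
    data_integrity_risk: bool = False  # corruption or loss suspected
    security_incident: bool = False    # breach or exposed secrets


def escalation_reasons(state: IncidentState) -> List[str]:
    """Return every escalation criterion the incident currently meets."""
    reasons = []
    if state.responders <= 1 and state.minutes_without_progress >= 20:
        reasons.append("solo for 20+ minutes without progress")
    if state.systems_affected > 1:
        reasons.append("multiple systems affected")
    if state.data_integrity_risk:
        reasons.append("data integrity concerns")
    if state.security_incident:
        reasons.append("security incident")
    return reasons


if __name__ == "__main__":
    state = IncidentState(responders=1, minutes_without_progress=25, systems_affected=1)
    print(escalation_reasons(state) or "no escalation criteria met")
```

Any non-empty result means it is time to page someone.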
Mitigation before root cause
In the moment, stop the bleeding:
- Roll back the recent deploy
- Flip the feature flag
- Drain the problem instance from the load balancer
- Fail over to the standby
Don’t try to fully understand root cause while users are down. Mitigate first, investigate afterward.
Post-mortem (within 72 hours)
Blameless post-mortem covering:
- Summary: what happened, user impact, duration
- Timeline: what the responders did, when
- Root cause analysis: 5-whys depth
- Contributing factors: not just “the bug”, what let the bug cause this
- Action items: specific, assigned, with due dates
The action items are the whole point. Without them, the post-mortem is therapy, not prevention.
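"Assigned, with due dates" can be checked mechanically before the post-mortem is closed; "specific" still needs a human reviewer. A small sketch under those assumptions, where `ActionItem` and `incomplete_items` are hypothetical names rather than part of any tracker's API:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items: List[ActionItem]) -> List[str]:
    """Return one problem line for each item that lacks an owner or a due date."""
    problems = []
    for item in items:
        checks = (("owner", item.owner), ("due date", item.due))
        missing = [name for name, value in checks if not value]
        if missing:
            problems.append(f"{item.description}: missing {', '.join(missing)}")
    return problems


if __name__ == "__main__":
    items = [
        ActionItem("Add alert on connection pool saturation", "alice", date(2024, 1, 15)),
        ActionItem("Investigate traffic spike"),
    ]
    for problem in incomplete_items(items):
        print(problem)
```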
Blameless means blameless
“The engineer didn’t test enough” is a blame statement. “Our test suite didn’t cover this code path” is a systems statement.
Reframe everything as a systems issue. People make mistakes; systems should catch them.
Practice
Run game days: intentionally break things in staging and practice the runbooks. Many incident responses go badly simply because the team hasn’t practiced.
Running a failover for the first time during an actual outage is terrifying. Do it during a game day first.
What to cut
Don’t include in your playbook:
- Detailed diagnostic steps for specific services (there are too many, and they go out of date)
- Long context paragraphs (nobody reads them at 3am)
- Complex decision trees (hard to follow under stress)
Do include:
- Phone numbers for vendors/cloud providers
- Command cheatsheet for the specific platform (kubectl, terraform, etc.)
- Escalation contacts for adjacent teams
- Links to dashboards and runbooks
Keep it under 2 pages. If it runs longer, split it by service.