Most incident playbooks end up as wiki pages nobody reads during an actual incident. Here’s what survives contact with a real 3am pager.

The first five minutes

One person is the Incident Commander (IC). If it’s not clear who that is, declare yourself IC.

  1. Acknowledge the page
  2. Post in #incidents: “Incident: [brief]. I am IC.”
  3. Start a timeline document (even just a text file)
  4. Check the public status page and update it if the incident is user-visible

Don’t dig into the problem yet. Set up the command structure first.
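Steps 2 and 3 are mechanical enough to script ahead of time. Here is a minimal sketch: the function names, the file-naming scheme, and the header lines are assumptions for illustration, and it stops short of posting to chat, since that depends on whatever tooling you use.

```python
from datetime import datetime, timezone
from pathlib import Path


def announcement(brief: str) -> str:
    """Build the #incidents post from step 2."""
    return f"Incident: {brief}. I am IC."


def start_timeline(directory: Path, brief: str, now: datetime) -> Path:
    """Create a plain-text timeline file (step 3) and return its path.

    `now` must be timezone-aware; it is converted to UTC before formatting.
    The file name pattern is a hypothetical convention, not a standard.
    """
    stamp = now.astimezone(timezone.utc)
    path = directory / f"incident-{stamp:%Y%m%d-%H%M}.txt"
    path.write_text(
        f"Incident: {brief}\n"
        "Timeline (UTC)\n"
        f"- {stamp:%H:%M} - IC declared\n",
        encoding="utf-8",
    )
    return path
```

Something like `start_timeline(Path.home(), "checkout errors", datetime.now(timezone.utc))` gives you a file to type into before you have looked at a single dashboard.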

Comms are the hardest part

During an incident:

  • Stakeholders will ask for updates constantly
  • Engineers will speculate in chat
  • Customers will escalate through support

Designate a Comms Lead (separate from IC). Their job:

  • Updates to stakeholders every 15-30 minutes
  • Status page updates
  • Coordinate with support/CS teams

IC stays focused on technical resolution.
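The 15-30 minute cadence is easy to lose track of mid-incident, so it helps to make it checkable. This is a rough sketch only: the "ok" / "due" / "overdue" labels and the function name are made up here, and the thresholds simply mirror the range given above.

```python
from datetime import datetime, timedelta

# Thresholds taken from the "every 15-30 minutes" guidance.
MIN_GAP = timedelta(minutes=15)
MAX_GAP = timedelta(minutes=30)


def update_status(last_update: datetime, now: datetime) -> str:
    """Classify how urgently the Comms Lead owes stakeholders an update."""
    elapsed = now - last_update
    if elapsed >= MAX_GAP:
        return "overdue"  # past the upper bound of the cadence
    if elapsed >= MIN_GAP:
        return "due"      # inside the 15-30 minute window
    return "ok"           # updated recently enough
```

A chat bot or a plain kitchen timer can do the same job. What matters is that the reminder lands on the Comms Lead, not the IC.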

Timeline document template

## Incident: [Service] degraded on [date]

### Severity: SEV-2 (impacting users)
### IC: @alice
### Comms: @bob

### Timeline (UTC)
- 14:23 - First page from Pingdom
- 14:25 - alice ACK, declared SEV-2
- 14:28 - Initial hypothesis: DB connection pool exhausted
- ...

### Actions taken
- 14:32 - Scaled DB connection pool from 100 to 200
- 14:35 - Latency improved, error rate dropping

### Open questions
- Why did connection pool fill up?
- Was there a traffic spike?

Update it in real time, as things happen. This becomes the basis for the post-mortem.
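If typing timestamps by hand is what stops people from keeping the timeline current, a tiny helper can remove that friction. A minimal sketch, assuming the "- HH:MM - text" line format from the template; `timeline_entry` and `append_entry` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Optional


def timeline_entry(text: str, now: Optional[datetime] = None) -> str:
    """Format one line in the template's '- HH:MM - text' style, in UTC.

    Pass a timezone-aware `now`; a naive datetime would be interpreted
    as local time by astimezone().
    """
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"- {moment:%H:%M} - {text}"


def append_entry(path: str, text: str, now: Optional[datetime] = None) -> None:
    """Append a timestamped entry to the timeline file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(timeline_entry(text, now) + "\n")
```

Converting to UTC in one place keeps responders in different time zones from interleaving local timestamps.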

Escalation criteria

Know when to escalate:

  • You’re solo and it’s been 20 minutes without progress
  • Multiple systems affected
  • Data integrity concerns (corruption, loss)
  • Security incident (breach, exposed secrets)

Paging someone is not a failure. Staying alone too long is.
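Writing the criteria down as a check makes them harder to rationalize away at 3am. The sketch below encodes the four bullets as-is; the `IncidentState` fields and the reason strings are assumptions for illustration, not part of any real paging tool.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    responders: int                 # people actively working the incident
    minutes_without_progress: int
    systems_affected: int
    data_integrity_risk: bool = False
    security_incident: bool = False


def escalation_reasons(state: IncidentState) -> list[str]:
    """Return every escalation criterion the incident currently meets."""
    reasons = []
    if state.responders == 1 and state.minutes_without_progress >= 20:
        reasons.append("solo for 20+ minutes without progress")
    if state.systems_affected > 1:
        reasons.append("multiple systems affected")
    if state.data_integrity_risk:
        reasons.append("data integrity concern")
    if state.security_incident:
        reasons.append("security incident")
    return reasons
```

Any non-empty result means page someone. The list of reasons doubles as the first line of the escalation message.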

Mitigation before root cause

In the moment, stop the bleeding:

  • Roll back the recent deploy
  • Flip the feature flag
  • Drain the problem instance from the load balancer
  • Fail over to the standby

Don’t try to fully understand root cause while users are down. Mitigate first, investigate afterward.

Post-mortem (within 72 hours)

Blameless post-mortem covering:

  1. Summary: what happened, user impact, duration
  2. Timeline: what the responders did, when
  3. Root cause analysis: keep asking “why” (the 5-whys technique) until you reach a systemic cause
  4. Contributing factors: not just “the bug”, what let the bug cause this
  5. Action items: specific, assigned, with due dates

The action items are the whole point. Without them, the post-mortem is therapy, not prevention.
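"Specific, assigned, with due dates" can be checked before the post-mortem is published. A minimal sketch: `ActionItem` and `problems` are hypothetical names, and the four-word minimum is a crude stand-in for "specific" that a human reviewer should override.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def problems(item: ActionItem) -> list[str]:
    """List what keeps an action item from being actionable."""
    issues = []
    # Arbitrary heuristic (an assumption): very short descriptions
    # such as "Fix it" are treated as too vague.
    if len(item.description.split()) < 4:
        issues.append("description too vague")
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    return issues
```

An item like `ActionItem("Add alert on DB connection pool saturation", "@alice", date(2024, 1, 16))` passes cleanly, while `ActionItem("Fix it")` fails all three checks.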

Blameless means blameless

“The engineer didn’t test enough” is a blame statement. “Our test suite didn’t cover this code path” is a systems statement.

Reframe everything as a systems issue. People make mistakes; systems should catch them.

Practice

Run game days: intentionally break things in staging and practice the runbooks. Incident response often goes badly simply because the team hasn’t practiced.

Running a failover for the first time during an actual outage is terrifying. Do it during a game day first.

What to cut

Don’t include in your playbook:

  • Detailed diagnostic steps for specific services (there are too many, and they go out of date)
  • Long context paragraphs (nobody reads them at 3am)
  • Complex decision trees (hard to follow under stress)

Do include:

  • Phone numbers for vendors/cloud providers
  • Command cheatsheet for the specific platform (kubectl, terraform, etc.)
  • Escalation contacts for adjacent teams
  • Links to dashboards and runbooks

Keep it under two pages. If it runs longer, split it by service.