Hook
In incidents, clarity beats memory.
Problem
Runbooks are often missing or outdated. During incidents and incident response, teams waste time hunting for information or asking the same questions repeatedly.
Why it matters
Clear, current runbooks reduce mean time to recovery and make on-call sustainable. They also encode knowledge so it scales beyond a few experts.
Signals you are here
- On-call engineers ask for basic service details
- Incidents require ad hoc guidance from senior staff
- Runbooks are stored in multiple inconsistent places
- Recovery steps differ by environment
Anti-patterns
- Runbooks that are not updated after changes
- Docs stored outside the codebase with no review
- Vague procedures without command examples
- Assuming everyone knows the system
Try this
- Store runbooks alongside code and review them
- Include clear steps, rollback, and escalation paths
- Automate verification of runbook steps
- Use templates for consistency
- Update runbooks as part of change reviews
Example
A team embedded runbooks in the repo and linked them in alerts. On-call engineers resolved issues without waiting for senior staff.
Reflection prompt
Which service has the weakest runbook? Improve it this week.
More like this
Heuristic
You Cannot Rely on People Under Stress
Design for tired humans.
Heuristic
Fail Closed, Log Everything, Recover Gracefully
Safe failure beats quiet failure.
Heuristic
Increase Contrast, Not Volume
Prompt length does not guarantee novelty. Context contrast does.
Heuristic
You Build It, You Run It
Build it, run it.
Heuristic
Test Where It Breaks, Not Where It Works
Test the breaks, not the breeze.
Heuristic
Blame the Process, Not People
Fix the system.