Runbooks Are a Bridge Between Dev and Ops

Runbooks turn knowledge into action.

Operations, Reliability, SRE

Heuristic

Document operations so anyone can act safely.

Hook

In incidents, clarity beats memory.

Problem

Runbooks are often missing or outdated. During incidents, teams waste time hunting for information or asking the same questions repeatedly.

Why it matters

Clear, current runbooks reduce mean time to recovery and make on-call sustainable. They also encode knowledge so it scales beyond a few experts.

Signals you are here

  • On-call engineers ask for basic service details
  • Incidents require ad hoc guidance from senior staff
  • Runbooks are stored in multiple inconsistent places
  • Recovery steps differ by environment

Anti-patterns

  • Runbooks that are not updated after changes
  • Docs stored outside the codebase with no review
  • Vague procedures without command examples
  • Assuming everyone knows the system

Try this

  • Store runbooks alongside code and review them
  • Include clear steps, rollback, and escalation paths
  • Automate verification of runbook steps
  • Use templates for consistency
  • Update runbooks as part of change reviews
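One way to combine the template and automated-verification ideas above is a small check that runs in review. The sketch below is a minimal illustration, not a prescribed tool: it assumes runbooks are Markdown files with "#"-style headings, and the section names in REQUIRED_SECTIONS are hypothetical placeholders for whatever your template requires.

```python
import re

# Hypothetical template: section names every runbook is expected to contain.
REQUIRED_SECTIONS = ("Steps", "Rollback", "Escalation")


def missing_sections(runbook_text, required=REQUIRED_SECTIONS):
    """Return the required section names that have no heading in the runbook."""
    # Assumption: any Markdown-style heading line ("# Title", "## Title")
    # counts as a section, regardless of heading level.
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+[ \t]*(.+)$", runbook_text, flags=re.MULTILINE)
    }
    return [name for name in required if name.lower() not in headings]
```

A check like this could run against every changed runbook during change review and fail when the returned list is not empty. It only verifies structure; whether the steps themselves still work needs a separate exercise, such as rehearsing them in a non-production environment.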

Example

A team embedded runbooks in the repo and linked them in alerts. On-call engineers resolved issues without waiting for senior staff.
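Linking runbooks from alerts only helps if the links stay valid. As a rough sketch of how that could be checked, the function below flags alerts that have no runbook link or point to a file that is not in the repo. The alert structure (a dict with a "runbook" key) and the example paths are assumptions for illustration, not the format of any particular alerting tool.

```python
def alerts_without_runbooks(alerts, runbook_paths):
    """Return names of alerts whose runbook link is missing or points nowhere."""
    known = set(runbook_paths)
    return sorted(
        name
        for name, alert in alerts.items()
        # A missing "runbook" key yields None, which is never a known path.
        if alert.get("runbook") not in known
    )


# Hypothetical alert definitions and repo contents.
alerts = {
    "HighErrorRate": {"runbook": "runbooks/api-errors.md"},
    "DiskFull": {},
    "QueueLag": {"runbook": "runbooks/gone.md"},
}
unlinked = alerts_without_runbooks(alerts, ["runbooks/api-errors.md"])
```

Here `unlinked` lists the alerts an on-call engineer would hit without usable guidance, which gives a concrete backlog to work through.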

Reflection prompt

Which service has the weakest runbook? Improve it this week.
