SRE Incident Runbook

Incidents are inevitable. The difference between a company that learns from them and one that drowns in
them is process. Not heavy process — just enough process to coordinate humans under stress, communicate
clearly, and extract maximum learning. Every minute of ambiguity during an incident costs you in MTTR
and customer trust.

Core Mental Model

An incident has exactly one Incident Commander (IC) who coordinates — not debugs. The IC is the
air traffic controller: they don't fly the plane, they ensure all planes know where they are. The
Operations Lead debugs. The Communications Lead talks to stakeholders. Everyone else is
either actively helping or should leave the war room. The IC's job is to prevent the war room from
becoming a mob, keep the timeline moving, and make the call when nobody else can.

The 5-Phase Incident Lifecycle

Phase 1: DETECTION
  ─────────────────
  Sources:
  • Alert fires (PagerDuty/Opsgenie)
  • Customer support ticket
  • Social media mention
  • Automated synthetic monitor
  • Internal dogfooding
  
  Target: < 5 minutes from event to notification
  Action: On-call acknowledges within SLA (typically 5-15 min)
  
  Questions to answer:
  ✓ What is the observable symptom?
  ✓ When did it start?
  ✓ What changed recently (deployments, config, traffic)?

Phase 2: DECLARE
  ────────────────
  Criteria (declare if ANY apply):
  • > 1% of users experiencing errors/degradation
  • SLO burn rate > 5x for > 5 minutes
  • Complete service unavailability for any customers
  • Data loss or potential data loss
  
  Actions:
  1. Open incident channel: #incident-YYYY-MM-DD-NNN
  2. Page Incident Commander
  3. Assign severity (P1/P2/P3)
  4. Post initial status page update (P1/P2 only)
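The burn-rate criterion above can be sketched as a small helper. This is a minimal sketch, assuming burn rate is defined as observed error rate divided by the error budget rate (for a 99.9% SLO, the budget rate is 0.1%); the function name is hypothetical, and the 5x threshold comes from the criteria above.

```shell
# Hypothetical helper: does the SLO burn-rate criterion trigger a declare?
# Burn rate = observed error rate / error budget rate.
burn_rate_declare() {
  observed_pct=$1   # observed error rate, percent (e.g. 0.6)
  slo_pct=$2        # SLO target, percent (e.g. 99.9)
  rate=$(awk -v o="$observed_pct" -v s="$slo_pct" \
    'BEGIN { printf "%.1f", o / (100 - s) }')
  if awk -v r="$rate" 'BEGIN { exit !(r > 5) }'; then
    echo "DECLARE (burn rate ${rate}x)"
  else
    echo "watch (burn rate ${rate}x)"
  fi
}

burn_rate_declare 0.6 99.9   # -> DECLARE (burn rate 6.0x)
```

Pair this with the "> 5 minutes" duration condition before declaring; a single bad scrape should not open an incident.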

Phase 3: COMMAND
  ─────────────────
  IC establishes:
  • Single thread of updates in #incident channel (no side conversations)
  • Operations Lead assigned (the person debugging)
  • Communications Lead assigned (for P1)
  • 15-minute update cadence timer set
  
  IC checklist every 15 minutes:
  □ New information? Post update to channel
  □ Do we have a hypothesis? If not, who is forming one?
  □ Do we need additional people? (escalate or bring in SMEs)
  □ Do we have a mitigation path? If not, what's blocking us?
  □ Customer impact changing? Update status page

Phase 4: RESOLVE
  ─────────────────
  Mitigation vs Resolution distinction:
  • Mitigation: Stops the bleeding (rollback, kill feature flag, redirect traffic)
    → Verify with 5 min of clean metrics before declaring mitigated
  • Resolution: Fixes root cause (may take days after mitigation)
  
  Rollback criteria (just do it, don't debate):
  ✓ Deploy happened in last 2 hours
  ✓ Error rate > pre-deploy baseline
  ✓ No contraindication (e.g., an irreversible DB migration already applied)
  
  Escalation criteria:
  ✓ > 30 min without meaningful progress → escalate to team lead
  ✓ Data loss suspected → notify legal/privacy team immediately
  ✓ Security incident suspected → hand off to security team
  ✓ > 1 hour P1 → notify VP Engineering
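The rollback criteria above are mechanical enough to encode. A minimal sketch, assuming the three inputs (deploy age, error rates, contraindication flag) are gathered by the on-call; the function name and labels are illustrative:

```shell
# Hypothetical rollback decision, mirroring the criteria above:
# deploy within 2 hours + errors above baseline + no contraindication.
should_rollback() {
  deploy_age_min=$1; current_err=$2; baseline_err=$3; contraindicated=$4
  if [ "$contraindicated" = "yes" ]; then
    echo "HOLD: contraindicated, mitigate forward instead"
  elif [ "$deploy_age_min" -le 120 ] && \
       awk -v c="$current_err" -v b="$baseline_err" 'BEGIN { exit !(c > b) }'; then
    echo "ROLLBACK"
  else
    echo "HOLD: criteria not met, keep investigating"
  fi
}

should_rollback 45 8.0 0.3 no    # -> ROLLBACK
```

The point of writing it down is that nobody debates it at 3 a.m.: if the function says ROLLBACK, you roll back.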

Phase 5: RETROSPECTIVE
  ───────────────────────
  Requirements: Within 5 business days of incident close
  Must be: Blameless (systems, not people)
  Output: Action items with owners and due dates

Incident Commander Responsibilities

WHAT THE IC DOES:
✅ Sets the pace and structure of the incident
✅ Makes decisions when the team is stuck ("We're rolling back, execute now")
✅ Ensures someone is always investigating (no gaps in ops)
✅ Tracks timeline of events in the incident channel
✅ Decides when to escalate, bring in SMEs, or page additional on-call
✅ Declares mitigation and resolution
✅ Assigns post-incident review owner

WHAT THE IC DOES NOT DO:
❌ Debug the problem (that's the Operations Lead)
❌ Write code during the incident
❌ Go silent to investigate
❌ Have side conversations off the incident channel
❌ Assign blame or speculate on cause in customer communications

IC SCRIPT (copy-paste these):
─────────────────────────────
Opening:
"I'm [Name], IC for this incident. Operations Lead is [Name], Comms Lead is [Name].
All updates go through this channel. I'll post status every 15 minutes.
Operations Lead — what do we know so far?"

Every 15 minutes:
"[TIME] UPDATE: Error rate [X]%. Hypothesis: [Y]. Next action: [Z] by [Name]. ETA: [T]"

When team is stuck:
"We've been investigating [X] for [Y] minutes without progress.
Do we want to try [alternative] or do we need a different approach?"

Declaring mitigation:
"Mitigation: Rolled back to v2.14.2. Error rate returning to baseline. 
Monitoring for 10 minutes before declaring resolved."

Declaring resolved:
"RESOLVED: [TIME]. Service restored. PIR will be filed by [Date].
Thank you all. [Name] please file the incident ticket. Channel archived in 24h."
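The 15-minute update line is easy to fill inconsistently under stress. A minimal sketch of a template filler (the function name is hypothetical; the fields mirror the script above):

```shell
# Fill the IC's 15-minute update template so every update has the same shape.
post_update() {
  # Args: time, error %, hypothesis, next action, owner, ETA
  printf '[%s] UPDATE: Error rate %s%%. Hypothesis: %s. Next action: %s by %s. ETA: %s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

post_update "14:23 UTC" 8 "connection pool leak" \
  "rollback to v2.14.2" "@ops-lead" "14:38 UTC"
```

Pipe the result to your chat CLI of choice, or just paste it; the value is the fixed field order, not the tooling.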

Communication Templates

Incident Channel Message (Every 15 min)

⚡ INCIDENT UPDATE — 14:23 UTC

STATUS: Investigating
IMPACT: ~8% of API requests returning 503. Order creation affected.
STARTED: ~14:07 UTC (15 min ago)
TRIGGER: Deploy of v2.14.3 at 14:05 UTC

CURRENT THEORY: Memory leak in new connection pool implementation
LAST ACTION: Rolled back to v2.14.2 at 14:20 UTC
NEXT ACTION: Monitoring error rate for 5 min — @ops-lead watching graphs
ETA TO UPDATE: 14:38 UTC

NEED MORE HELP? Ping @ic-name

Status Page — Initial (Within 5 min of declaring P1)

INVESTIGATING: Elevated Error Rate — Order API

We are currently investigating elevated error rates affecting the Order API.
Some users may experience errors when attempting to create or view orders.

Impact: Approximately 8% of API requests are failing.
Started: 14:07 UTC

We are actively working to resolve this issue.
Next update: 14:35 UTC

Status Page — Identified

IDENTIFIED: Order API Service Degradation

We have identified the cause of the elevated error rates affecting the Order API.
A deployment at 14:05 UTC introduced a regression in connection handling.

Impact: Order creation is failing for approximately 8% of requests.
We are executing a rollback to the previous version.

Next update: 14:40 UTC

Status Page — Resolved

RESOLVED: Order API Service Restored — 14:28 UTC

The order API is now operating normally. The issue was caused by a deployment
at 14:05 UTC that introduced a regression in connection pool handling. We
rolled back to the previous version (v2.14.2) at 14:20 UTC and service
returned to normal at 14:28 UTC.

Duration: 21 minutes
Impact: Approximately 8% of order API requests failed during the incident.

We are conducting a post-incident review to prevent recurrence. We apologize
for the disruption.

Customer Email (P1 — After Resolution)

Subject: Service Disruption Resolved — [Date]

Dear [Customer Name],

We are writing to inform you that our Order API experienced a service 
disruption today between 14:07 UTC and 14:28 UTC (21 minutes).

WHAT HAPPENED
A software deployment at 14:05 UTC introduced a regression that caused 
approximately 8% of API requests to fail with 503 errors.

IMPACT TO YOUR ACCOUNT
[If you have per-customer impact data: "During this period, X requests 
from your account may have been affected."]
[If not: "Customers attempting to create or retrieve orders during this 
period may have experienced errors."]

HOW WE RESOLVED IT
We identified the issue within 15 minutes and rolled back to the previous 
software version. Service was fully restored by 14:28 UTC.

WHAT WE'RE DOING TO PREVENT RECURRENCE
We are conducting a detailed review of this incident and will implement 
additional safeguards in our deployment process. We will share the outcome
of this review on our status page.

We sincerely apologize for the disruption and appreciate your patience.

[Team Name] Engineering

Diagnostic Decision Trees

ALERT: Elevated Error Rate
  │
  ├─ What error codes?
  │   ├─ 5xx: Server-side problem (go to Server Errors)
  │   ├─ 4xx: Client errors — check if this is expected traffic
  │   └─ Connection timeout: Network/infra problem
  │
  └─ Server Errors branch:
      │
      ├─ Recent deployment?
  │   └─ YES → Roll back immediately (see the 2-hour rollback rule)
      │
      ├─ Check service logs
      │   ├─ OOM errors → Memory leak, scale up
      │   ├─ DB connection errors → Check DB health, pool size
      │   ├─ Timeout errors → Check dependency latency (upstream services)
      │   └─ Exception in specific code path → Find the broken request type
      │
      ├─ Check dependencies
      │   ├─ Database: query latency? connection count? replication lag?
      │   ├─ Cache: hit rate? connection errors? memory usage?
      │   └─ Downstream APIs: error rate? latency P99?
      │
      └─ Check infrastructure
          ├─ CPU saturation? (> 90% sustained)
          ├─ Memory pressure? (swap usage, OOM events)
          ├─ Disk I/O? (await > 100ms)
          └─ Network? (packet loss, bandwidth saturation)
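The first fork of the tree can be scripted so triage starts on the right branch. A minimal sketch; the input labels and branch names are illustrative:

```shell
# Route an observed error class to the branch of the decision tree above.
triage_branch() {
  case "$1" in
    5[0-9][0-9]) echo "server-errors" ;;         # 5xx: server-side branch
    4[0-9][0-9]) echo "check-client-traffic" ;;  # 4xx: expected traffic?
    timeout)     echo "network-infra" ;;         # connection timeouts
    *)           echo "unknown" ;;
  esac
}

triage_branch 503   # -> server-errors
```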

Critical Diagnostic Commands

# Kubernetes — quick pod health check
kubectl get pods -n production --no-headers | grep -v Running
kubectl describe pod <failing-pod> -n production
kubectl logs <failing-pod> -n production --previous  # Previous container logs
kubectl top pods -n production                        # CPU/memory usage

# Check recent events
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

# Get all deployments and their rollout status
kubectl rollout status deployment/order-api -n production
kubectl rollout history deployment/order-api -n production

# Roll back immediately
kubectl rollout undo deployment/order-api -n production

# Check resource limits vs requests
kubectl describe nodes | grep -A5 "Allocated resources"

# Database quick health (PostgreSQL)
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"

# Quick log grep for errors
kubectl logs -l app=order-api -n production --since=5m | grep -i "error\|exception\|fatal"

Escalation Matrix

# escalation-matrix.yaml
severity_matrix:
  P1:  # Complete service outage or > 10% error rate
    initial_responder: on-call engineer
    escalate_after: 30 minutes without mitigation
    escalate_to: engineering team lead
    notify_at_start: [engineering-manager, product-manager]
    notify_vp_after: 60 minutes without mitigation
    customer_comms: required (status page + email)
    
  P2:  # Partial outage, < 10% errors, or degraded performance
    initial_responder: on-call engineer
    escalate_after: 60 minutes without mitigation
    escalate_to: engineering team lead
    notify_at_start: [engineering-manager]
    customer_comms: status page update
    
  P3:  # Minor issues, single user reports, non-critical degradation
    initial_responder: on-call engineer
    escalate_after: 4 hours without resolution
    customer_comms: optional

special_cases:
  data_loss:
    action: Immediately notify engineering manager AND legal/privacy team
    escalate: Skip P-level system, go directly to executive team
  
  security_incident:
    action: Immediately hand off to security team
    do_not: Share details in #general or public channels
    
  third_party_vendor:
    action: Open vendor support ticket immediately
    document: Reference number in incident channel

War Room Checklist

📋 WAR ROOM SETUP (first 5 minutes):
□ Incident channel opened (#incident-YYYY-MM-DD-NNN)
□ Incident Commander identified and acknowledged
□ Operations Lead assigned
□ Communications Lead assigned (P1 only)
□ Severity assessed and documented
□ Status page updated (P1/P2)
□ Incident ticket created (link in channel)
□ Timeline started (first event: when did it start?)

📋 DURING INCIDENT:
□ 15-minute update timer running
□ All significant findings posted to channel immediately
□ No parallel debugging paths without IC awareness
□ Rollback/mitigation authority clear (IC decides)
□ New joiners briefed with "current state" summary

📋 POST-MITIGATION (before declaring resolved):
□ 5 minutes of clean metrics (errors at baseline)
□ Synthetic monitors green
□ No new customer reports
□ All debugging paths confirmed closed
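The "clean metrics" gate can be checked mechanically. A minimal sketch, assuming you can pull per-minute error percentages from your metrics system; the function name is hypothetical:

```shell
# Succeed only if every per-minute error-rate sample is at or below baseline.
# Usage: clean_metrics <baseline> <sample1> <sample2> ...
clean_metrics() {
  baseline=$1; shift
  for sample in "$@"; do
    awk -v s="$sample" -v b="$baseline" 'BEGIN { exit !(s <= b) }' || return 1
  done
}

clean_metrics 0.3 0.2 0.3 0.1 0.2 0.3 && echo "clean: OK to resolve"
```

Feed it the last five minutes of samples before declaring resolved; one elevated sample restarts the clock.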

📋 INCIDENT CLOSE:
□ Resolved announced in channel and status page
□ Incident ticket updated with timeline
□ PIR owner assigned with due date
□ Stakeholders notified (email if P1)
□ On-call brief for handoff (if > 4 hours)

Anti-Patterns

IC also debugging — you can't coordinate AND investigate; pick one (coordinate)
Multiple people "taking a look" without coordination — you get duplicated effort, not parallel coverage
Side conversations in DM or a separate channel — information must flow to IC
Blaming in the incident channel — "X broke it" poisons the investigation
Not declaring until 100% sure it's an incident — declare early, downgrade later
Status page silence > 30 minutes — silence terrifies customers more than bad news
Rollback debate during active incident — agree on the rollback criteria in advance and just roll back
Skipping PIR for P2 incidents — P2 incidents often reveal systemic issues
PIR with no action items — review with no actions changes nothing

Quick Reference

Severity definitions:
  P1: Complete outage OR SLO burning fast (> 10x) OR data loss
  P2: Partial outage OR degraded performance affecting > 1% users
  P3: Minor degradation, single user, non-SLO-affecting

Communication timers:
  Internal (incident channel):  Every 15 minutes
  External (status page):       Every 30 minutes
  Silence max:                  30 minutes (then post "still investigating")

Rollback decision rule:
  Deploy within 2 hours + errors elevated = ROLL BACK FIRST, investigate after

MTTR breakdown targets:
  Detection:  < 5 min
  Triage:     < 10 min
  Mitigation: < 30 min (P1)
  Resolution: < 4 hours (P1)

Post-incident actions:
  Status page: Resolved notice within 1 hour of resolution
  Customer email: Within 24 hours for P1
  PIR draft: Within 5 business days
  Action items: Assigned with due dates
