sre-incident-runbook
Expert incident response covering the 5-phase lifecycle, incident commander responsibilities, communication cadence, escalation matrix design, war room setup, diagnostic decision trees, rollback criteria, customer communication templates, and post-incident review facilitation.
SRE Incident Runbook
Incidents are inevitable. The difference between a company that learns from them and one that drowns in
them is process. Not heavy process — just enough process to coordinate humans under stress, communicate
clearly, and extract maximum learning. Every minute of ambiguity during an incident costs you in MTTR
and customer trust.
Core Mental Model
An incident has exactly one Incident Commander (IC) who coordinates — not debugs. The IC is the
air traffic controller: they don't fly the plane, they ensure all planes know where they are. The
Operations Lead debugs. The Communications Lead talks to stakeholders. Everyone else is
either actively helping or should leave the war room. The IC's job is to prevent the war room from
becoming a mob, keep the timeline moving, and make the call when nobody else can.
The 5-Phase Incident Lifecycle
Phase 1: DETECTION
─────────────────
Sources:
• Alert fires (PagerDuty/Opsgenie)
• Customer support ticket
• Social media mention
• Automated synthetic monitor
• Internal dogfooding
Target: < 5 minutes from event to notification
Action: On-call acknowledges within SLA (typically 5-15 min)
Questions to answer:
✓ What is the observable symptom?
✓ When did it start?
✓ What changed recently (deployments, config, traffic)?
Phase 2: DECLARE
────────────────
Criteria (declare if ANY apply):
• > 1% of users experiencing errors/degradation
• SLO burn rate > 5x for > 5 minutes
• Complete service unavailability for any customers
• Data loss or potential data loss
Actions:
1. Open incident channel: #incident-YYYY-MM-DD-NNN
2. Page Incident Commander
3. Assign severity (P1/P2/P3)
4. Post initial status page update (P1/P2 only)
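The burn-rate criterion is quick arithmetic you can sanity-check under stress. A minimal sketch with hypothetical numbers: a 99.9% availability SLO leaves a 0.1% error budget, and burn rate is observed error rate divided by that budget.

```shell
#!/bin/sh
# Hypothetical numbers for illustration: a 99.9% availability SLO
# allows a 0.1% error budget. Burn rate = observed error % / budget %.
slo_target=99.9
observed_error_pct=0.8   # hypothetical: 0.8% of requests failing

budget_pct=$(awk -v t="$slo_target" 'BEGIN { printf "%.1f", 100 - t }')
burn_rate=$(awk -v e="$observed_error_pct" -v b="$budget_pct" \
  'BEGIN { printf "%.1f", e / b }')
echo "burn rate: ${burn_rate}x"   # 0.8 / 0.1 = 8.0x, above the 5x declare threshold
```

At 8x burn this example crosses the 5x line, so the on-call would declare without waiting for more data.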
Phase 3: COMMAND
─────────────────
IC establishes:
• Single thread of updates in #incident channel (no side conversations)
• Operations Lead assigned (the person debugging)
• Communications Lead assigned (for P1)
• 15-minute update cadence timer set
IC checklist every 15 minutes:
□ New information? Post update to channel
□ Do we have a hypothesis? If not, who is forming one?
□ Do we need additional people? (escalate or bring in SMEs)
□ Do we have a mitigation path? If not, what's blocking us?
□ Customer impact changing? Update status page
Phase 4: RESOLVE
─────────────────
Mitigation vs Resolution distinction:
• Mitigation: Stops the bleeding (rollback, kill feature flag, redirect traffic)
→ Verify with 5 min of clean metrics before declaring mitigated
• Resolution: Fixes root cause (may take days after mitigation)
Rollback criteria (just do it, don't debate):
✓ Deploy happened in last 2 hours
✓ Error rate > pre-deploy baseline
✓ No contraindication (e.g., a DB migration has already been applied on top of the deploy)
Escalation criteria:
✓ > 30 min without meaningful progress → escalate to team lead
✓ Data loss suspected → notify legal/privacy team immediately
✓ Security incident suspected → hand off to security team
✓ > 1 hour P1 → notify VP Engineering
Phase 5: RETROSPECTIVE
───────────────────────
Requirements: Within 5 business days of incident close
Must be: Blameless (systems, not people)
Output: Action items with owners and due dates
Incident Commander Responsibilities
WHAT THE IC DOES:
✅ Sets the pace and structure of the incident
✅ Makes decisions when the team is stuck ("We're rolling back, execute now")
✅ Ensures someone is always investigating (no gaps in ops)
✅ Tracks timeline of events in the incident channel
✅ Decides when to escalate, bring in SMEs, or page additional on-call
✅ Declares mitigation and resolution
✅ Assigns post-incident review owner
WHAT THE IC DOES NOT DO:
❌ Debug the problem (that's the Operations Lead)
❌ Write code during the incident
❌ Go silent to investigate
❌ Have side conversations off the incident channel
❌ Assign blame or speculate on cause in customer communications
IC SCRIPT (copy-paste these):
─────────────────────────────
Opening:
"I'm [Name], IC for this incident. Operations Lead is [Name], Comms Lead is [Name].
All updates go through this channel. I'll post status every 15 minutes.
Operations Lead — what do we know so far?"
Every 15 minutes:
"[TIME] UPDATE: Error rate [X]%. Hypothesis: [Y]. Next action: [Z] by [Name]. ETA: [T]"
When team is stuck:
"We've been investigating [X] for [Y] minutes without progress.
Do we want to try [alternative] or do we need a different approach?"
Declaring mitigation:
"Mitigation: Rolled back to v2.14.2. Error rate returning to baseline.
Monitoring for 10 minutes before declaring resolved."
Declaring resolved:
"RESOLVED: [TIME]. Service restored. PIR will be filed by [Date].
Thank you all. [Name] please file the incident ticket. Channel archived in 24h."
Communication Templates
Incident Channel Message (Every 15 min)
⚡ INCIDENT UPDATE — 14:23 UTC
STATUS: Investigating
IMPACT: ~8% of API requests returning 503. Order creation affected.
STARTED: ~14:07 UTC (15 min ago)
TRIGGER: Deploy of v2.14.3 at 14:05 UTC
CURRENT THEORY: Memory leak in new connection pool implementation
LAST ACTION: Rolled back to v2.14.2 at 14:20 UTC
NEXT ACTION: Monitoring error rate for 5 min — @ops-lead watching graphs
ETA TO UPDATE: 14:38 UTC
NEED MORE HELP? Ping @ic-name
Status Page — Initial (Within 5 min of declaring P1)
INVESTIGATING: Elevated Error Rate — Order API
We are currently investigating elevated error rates affecting the Order API.
Some users may experience errors when attempting to create or view orders.
Impact: Approximately 8% of API requests are failing.
Started: 14:07 UTC
We are actively working to resolve this issue.
Next update: 14:35 UTC
Status Page — Identified
IDENTIFIED: Order API Service Degradation
We have identified the cause of the elevated error rates affecting the Order API.
A deployment at 14:05 UTC introduced a regression in connection handling.
Impact: Order creation is failing for approximately 8% of requests.
We are executing a rollback to the previous version.
Next update: 14:40 UTC
Status Page — Resolved
RESOLVED: Order API Service Restored — 14:28 UTC
The order API is now operating normally. The issue was caused by a deployment
at 14:05 UTC that introduced a regression in connection pool handling. We
rolled back to the previous version (v2.14.2) at 14:20 UTC and service
returned to normal at 14:28 UTC.
Duration: 21 minutes
Impact: Approximately 8% of order API requests failed during the incident.
We are conducting a post-incident review to prevent recurrence. We apologize
for the disruption.
Customer Email (P1 — After Resolution)
Subject: Service Disruption Resolved — [Date]
Dear [Customer Name],
We are writing to inform you that our Order API experienced a service
disruption today between 14:07 UTC and 14:28 UTC (21 minutes).
WHAT HAPPENED
A software deployment at 14:05 UTC introduced a regression that caused
approximately 8% of API requests to fail with 503 errors.
IMPACT TO YOUR ACCOUNT
[If you have per-customer impact data: "During this period, X requests
from your account may have been affected."]
[If not: "Customers attempting to create or retrieve orders during this
period may have experienced errors."]
HOW WE RESOLVED IT
We identified the issue within 15 minutes and rolled back to the previous
software version. Service was fully restored by 14:28 UTC.
WHAT WE'RE DOING TO PREVENT RECURRENCE
We are conducting a detailed review of this incident and will implement
additional safeguards in our deployment process. We will share the outcome
of this review on our status page.
We sincerely apologize for the disruption and appreciate your patience.
[Team Name] Engineering
Diagnostic Decision Trees
ALERT: Elevated Error Rate
│
├─ What error codes?
│   ├─ 5xx: Server-side problem (go to Server Errors)
│   ├─ 4xx: Client errors — check if this is expected traffic
│   └─ Connection timeout: Network/infra problem
│
└─ Server Errors branch:
    │
    ├─ Recent deployment?
    │   └─ YES → Roll back immediately (within the 2-hour rollback window)
    │
    ├─ Check service logs
    │   ├─ OOM errors → Memory leak, scale up
    │   ├─ DB connection errors → Check DB health, pool size
    │   ├─ Timeout errors → Check dependency latency (upstream services)
    │   └─ Exception in specific code path → Find the broken request type
    │
    ├─ Check dependencies
    │   ├─ Database: query latency? connection count? replication lag?
    │   ├─ Cache: hit rate? connection errors? memory usage?
    │   └─ Downstream APIs: error rate? latency P99?
    │
    └─ Check infrastructure
        ├─ CPU saturation? (> 90% sustained)
        ├─ Memory pressure? (swap usage, OOM events)
        ├─ Disk I/O? (await > 100ms)
        └─ Network? (packet loss, bandwidth saturation)
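The first branch of the tree ("what error codes?") is often answerable with one awk pass over an access log. A sketch using a synthetic log; the status-code field position (col) depends on your log format — it is field 9 in Apache/nginx combined log format, and field 7 in the simplified sample below.

```shell
#!/bin/sh
# Tally status-code classes (2xx/4xx/5xx) from an access log to drive
# the first branch of the decision tree. The sample log is synthetic;
# set col to wherever the status code sits in your format.
printf '%s\n' \
  '10.0.0.1 - - [ts] "GET /orders" 503 0' \
  '10.0.0.2 - - [ts] "GET /orders" 200 118' \
  '10.0.0.3 - - [ts] "POST /orders" 503 0' > /tmp/access.log

awk -v col=7 '{ n[substr($col, 1, 1) "xx"]++ }
  END { for (c in n) print c, n[c] }' /tmp/access.log | sort
```

A 5xx-dominated tally sends you down the Server Errors branch; a spike that is all 4xx usually means a misbehaving client or expected traffic change.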
Critical Diagnostic Commands
# Kubernetes — quick pod health check
kubectl get pods -n production | grep -v Running
kubectl describe pod <failing-pod> -n production
kubectl logs <failing-pod> -n production --previous # Previous container logs
kubectl top pods -n production # CPU/memory usage
# Check recent events
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
# Get all deployments and their rollout status
kubectl rollout status deployment/order-api -n production
kubectl rollout history deployment/order-api -n production
# Roll back immediately
kubectl rollout undo deployment/order-api -n production
# Check resource limits vs requests
kubectl describe nodes | grep -A5 "Allocated resources"
# Database quick health (PostgreSQL)
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"
# Quick log grep for errors
kubectl logs -l app=order-api -n production --since=5m | grep -i "error\|exception\|fatal"
Escalation Matrix
# escalation-matrix.yaml
severity_matrix:
  P1:  # Complete service outage or > 10% error rate
    initial_responder: on-call engineer
    escalate_after: 30 minutes without mitigation
    escalate_to: engineering team lead
    notify_at_start: [engineering-manager, product-manager]
    notify_vp_after: 60 minutes without mitigation
    customer_comms: required (status page + email)
  P2:  # Partial outage, < 10% errors, or degraded performance
    initial_responder: on-call engineer
    escalate_after: 60 minutes without mitigation
    escalate_to: engineering team lead
    notify_at_start: [engineering-manager]
    customer_comms: status page update
  P3:  # Minor issues, single user reports, non-critical degradation
    initial_responder: on-call engineer
    escalate_after: 4 hours without resolution
    customer_comms: optional

special_cases:
  data_loss:
    action: Immediately notify engineering manager AND legal/privacy team
    escalate: Skip P-level system, go directly to executive team
  security_incident:
    action: Immediately hand off to security team
    do_not: Share details in #general or public channels
  third_party_vendor:
    action: Open vendor support ticket immediately
    document: Reference number in incident channel
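If you need to pull values out of this matrix from a shell script without a yq dependency, a small awk lookup works for a flat two-level layout. A sketch against a trimmed copy of the matrix written to /tmp; the file path and the severity value are illustrative, and it assumes conventional two-space YAML indentation.

```shell
#!/bin/sh
# Look up a severity's escalation deadline from the matrix with plain
# awk (no yq dependency). Trimmed, hypothetical copy of the matrix.
severity="P1"
cat > /tmp/escalation-matrix.yaml <<'EOF'
severity_matrix:
  P1:
    escalate_after: 30 minutes without mitigation
  P2:
    escalate_after: 60 minutes without mitigation
EOF

awk -v sev="$severity:" '
  $1 == sev            { in_sev = 1; next }
  in_sev && /^  [A-Z]/ { in_sev = 0 }   # next severity block begins
  in_sev && $1 == "escalate_after:" {
    sub(/^[^:]*: */, ""); print; exit
  }' /tmp/escalation-matrix.yaml
```

For anything beyond this flat shape, reach for yq instead of extending the awk.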
War Room Checklist
📋 WAR ROOM SETUP (first 5 minutes):
□ Incident channel opened (#incident-YYYY-MM-DD-NNN)
□ Incident Commander identified and acknowledged
□ Operations Lead assigned
□ Communications Lead assigned (P1 only)
□ Severity assessed and documented
□ Status page updated (P1/P2)
□ Incident ticket created (link in channel)
□ Timeline started (first event: when did it start?)
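The channel-naming step can be scripted so nobody has to think about date formats under stress. A sketch; the NNN sequence number is assumed to come from wherever you track the per-day incident count.

```shell
#!/bin/sh
# Generate the #incident-YYYY-MM-DD-NNN channel name. seq_num is a
# hypothetical stand-in for your per-day incident counter.
seq_num=003
echo "#incident-$(date -u +%Y-%m-%d)-${seq_num}"
```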
📋 DURING INCIDENT:
□ 15-minute update timer running
□ All significant findings posted to channel immediately
□ No parallel debugging paths without IC awareness
□ Rollback/mitigation authority clear (IC decides)
□ New joiners briefed with "current state" summary
📋 POST-MITIGATION (before declaring resolved):
□ 5 minutes of clean metrics (errors at baseline)
□ Synthetic monitors green
□ No new customer reports
□ All debugging paths confirmed closed
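The "clean metrics" gate above can be automated as a polling loop so mitigation is never declared off one good data point. A sketch with a stubbed metrics query; check_error_rate is a hypothetical stand-in for your real query (for example, a PromQL instant lookup).

```shell
#!/bin/sh
# Gate before declaring mitigated: require N consecutive clean samples
# at baseline. check_error_rate is a hypothetical stand-in for your
# real metrics query; stubbed here to always return 0.4%.
check_error_rate() { echo "0.4"; }

threshold=1.0   # hypothetical baseline error %
needed=5        # consecutive clean samples required
clean=0
while [ "$clean" -lt "$needed" ]; do
  rate=$(check_error_rate)
  if awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r < t) }'; then
    clean=$((clean + 1))
  else
    clean=0       # any dirty sample resets the window
  fi
  # sleep 60      # in production: sample once per minute
done
echo "metrics clean for $needed samples; safe to declare mitigated"
```

Resetting the counter on any dirty sample is deliberate: a flapping error rate should restart the observation window, not count toward it.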
📋 INCIDENT CLOSE:
□ Resolved announced in channel and status page
□ Incident ticket updated with timeline
□ PIR owner assigned with due date
□ Stakeholders notified (email if P1)
□ On-call brief for handoff (if > 4 hours)
Anti-Patterns
❌ IC also debugging — you can't coordinate AND investigate; pick one (coordinate)
❌ Multiple people "taking a look" without coordination — parallel investigation paths must be assigned by the IC, not self-selected
❌ Side conversations in DM or a separate channel — information must flow to IC
❌ Blaming in the incident channel — "X broke it" poisons the investigation
❌ Not declaring until 100% sure it's an incident — declare early, downgrade later
❌ Status page silence > 30 minutes — silence terrifies customers more than bad news
❌ Rollback debate during active incident — agree on rollback criteria in advance and just roll back
❌ Skipping PIR for P2 incidents — P2 incidents often reveal systemic issues
❌ PIR with no action items — review with no actions changes nothing
Quick Reference
Severity definitions:
P1: Complete outage OR SLO burning fast (> 10x) OR data loss
P2: Partial outage OR degraded performance affecting > 1% users
P3: Minor degradation, single user, non-SLO-affecting
Communication timers:
Internal (incident channel): Every 15 minutes
External (status page): Every 30 minutes
Silence max: 30 minutes (then post "still investigating")
Rollback decision rule:
Deploy within 2 hours + errors elevated = ROLL BACK FIRST, investigate after
MTTR breakdown targets:
Detection: < 5 min
Triage: < 10 min
Mitigation: < 30 min (P1)
Resolution: < 4 hours (P1)
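Duration numbers for the PIR timeline are easy to get wrong by hand; shell date arithmetic avoids that. A sketch using GNU date with hypothetical timestamps mirroring the example incident above (on BSD/macOS, use date -j -f in place of -d).

```shell
#!/bin/sh
# Compute incident duration for the PIR timeline. Timestamps are
# hypothetical, mirroring the example incident. Requires GNU date.
start="2025-01-15 14:07:00 UTC"
mitigated="2025-01-15 14:28:00 UTC"
s=$(date -u -d "$start" +%s)
m=$(date -u -d "$mitigated" +%s)
echo "duration: $(( (m - s) / 60 )) minutes"   # 21 minutes
```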
Post-incident actions:
Status page: Resolved notice within 1 hour of resolution
Customer email: Within 24 hours for P1
PIR draft: Within 5 business days
Action items: Assigned with due dates