sre-practices
Expert Site Reliability Engineering practices covering SLI/SLO/SLA hierarchy, error budget management, burn rate alerting, toil elimination, reliability decomposition, incident management lifecycle, blameless postmortems, chaos engineering, capacity planning, and on-call design.
SRE Practices
SRE is not just DevOps with a fancier title. It's a discipline that treats reliability as a feature,
measures it with math, and makes explicit trade-offs between reliability and velocity using error budgets.
The fundamental insight: 100% reliability is the wrong target — it means you're moving too slowly. The goal is to
be exactly reliable enough while shipping as fast as the business needs.
Core Mental Model
The SRE contract with the business is: "Define how reliable you need to be, we'll measure it, and if we
have remaining error budget, we'll spend it shipping features. When it runs out, we freeze releases and
focus on reliability." This transforms reliability from a fuzzy aspiration into a budget that product,
engineering, and operations can reason about together. SLIs are the measurements. SLOs are the
targets. SLAs are the customer-facing commitments (always looser than SLOs — you need margin to recover).
Error budget = 1 - SLO. Everything else flows from this.
SLI / SLO / SLA Hierarchy
```
SLA (customer commitment): "We'll refund credits if availability < 99%/month"
  └── SLO (internal target): "We aim for 99.9% availability" (margin above SLA)
        └── SLI (measurement): "% of HTTP requests returning 2xx or 3xx in < 500ms"
```
Error Budget:
SLO = 99.9% → 0.1% allowed failure
Monthly: 30 days × 24h × 60m × 60s = 2,592,000 seconds
Error budget = 2,592 seconds of downtime (43 minutes/month)
Burn Rate:
Budget depleted in 30 days → 1x burn rate (normal)
Budget depleted in 3 days → 10x burn rate (page immediately)
Budget depleted in 1 hour → 720x burn rate (all hands)
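The budget and burn-rate arithmetic above can be sketched as a few helpers (illustrative function names, assuming a 30-day rolling window, not tied to any monitoring API):

```python
# Error budget and burn rate for a 30-day rolling SLO window.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000

def error_budget_seconds(slo: float) -> float:
    """Allowed 'bad' seconds per month for a given SLO (e.g. 0.999)."""
    return (1 - slo) * SECONDS_PER_MONTH

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than break-even the budget is burning.
    1.0 means the budget lasts exactly the whole month."""
    return observed_error_rate / (1 - slo)

def hours_until_exhausted(observed_error_rate: float, slo: float) -> float:
    """Hours until the monthly budget is gone at the current error rate."""
    return (30 * 24) / burn_rate(observed_error_rate, slo)

# A 1% error rate against a 99.9% SLO is a 10x burn:
# the whole monthly budget (~2,592 seconds) is gone in 72 hours (3 days).
```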
SLO Definition YAML
```yaml
# slo.yaml — store in version control alongside service code
apiVersion: openslo/v1
kind: SLO
metadata:
  name: order-api-availability
  displayName: Order API Availability
  labels:
    team: platform
    tier: "1"
spec:
  service: order-api
  description: "Availability of the Order API for external customers"
  indicator:
    ratioMetric:  # good/total ratio (rawMetric is for single pre-computed SLI values)
      metricSource:
        type: Prometheus
      good:
        metricQuery: |
          sum(rate(http_requests_total{job="order-api",status!~"5.."}[{{.Window}}]))
      total:
        metricQuery: |
          sum(rate(http_requests_total{job="order-api"}[{{.Window}}]))
  objectives:
    - displayName: "Monthly availability"
      op: gte
      target: 0.999  # 99.9%
  timeWindow:
    - duration: 30d
      isRolling: true
  alertPolicies:
    - name: order-api-high-burn
      alertWhenBreaching: true
      conditions:
        - kind: burnrate
          op: gte
          threshold: 14.4  # 2% of monthly budget burned in 1h
          lookbackWindow: 1h
          alertAfter: 2m
```
Error Budget Burn Rate Alerting
Google's recommendation: two pairs of windows to catch both fast and slow burns.
```yaml
# Prometheus alerting rules
groups:
  - name: slo-burn-rates
    rules:
      # Page: 2% of monthly budget in 1h (14.4x fast burn)
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (job) (rate(http_requests_total[1h]))
          ) > 0.0144
          and
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) > 0.0144
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "SLO fast burn: {{ $labels.job }}"
          description: |
            Error rate {{ $value | humanizePercentage }} exceeds the 14.4x burn threshold.
            At this rate, 2% of the monthly error budget burns in 1 hour.
          runbook_url: "https://runbooks.internal/order-api/high-error-rate"
      # Ticket: 5% of monthly budget in 6h (6x slow burn)
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum by (job) (rate(http_requests_total[6h]))
          ) > 0.006
          and
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum by (job) (rate(http_requests_total[30m]))
          ) > 0.006
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SLO slow burn: {{ $labels.job }}"
```
Toil: Identification and Elimination
Toil is manual, repetitive, automatable operational work that scales with traffic but adds no enduring value. SRE teams should spend ≤50% time on toil (Google's rule of thumb).
Toil classification:
✅ Toil: Manually restarting crashed pods when OOM
✅ Toil: Provisioning access requests via ticket
✅ Toil: Copying logs from prod to debug a ticket
❌ Not toil: Writing the automation that prevents the crash
❌ Not toil: Designing the oncall rotation
❌ Not toil: Writing a postmortem
Elimination hierarchy:
1. Eliminate: Does the task need to exist at all? (auto-restart solves the crash case)
2. Automate: Write code to handle it
3. Reduce frequency: Better monitoring to catch earlier
4. Reduce duration: Runbooks, tooling, pre-provisioned access
Reliability Decomposition
Availability = MTBF / (MTBF + MTTR)
Where:
MTBF = Mean Time Between Failures
MTTR = Mean Time To Restore (detect + respond + diagnose + fix + verify)
To improve availability, you have two levers:
MTBF (reduce failure frequency):
- Better testing, canary deploys, feature flags
- Dependency isolation, circuit breakers
- Chaos engineering to find weaknesses proactively
MTTR (reduce recovery time):
- Faster detection (alerting latency)
- Better runbooks (on-call tooling)
- Rollback automation (< 5 min to roll back a deploy)
- Pre-provisioned access (no ticket required to access prod)
Reliability calculation:
Single service: 99.9%
Two services in series: 99.9% × 99.9% = 99.8%
Two services in parallel (either handles): 1 - (1-0.999)² = 99.9999%
→ Use parallel/redundant paths for critical flows
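The availability and series/parallel formulas above are easy to check numerically; a quick sketch (illustrative helper names):

```python
from functools import reduce

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*avail: float) -> float:
    """All components must be up: availabilities multiply."""
    return reduce(lambda a, b: a * b, avail)

def parallel(*avail: float) -> float:
    """Any component suffices: failure probabilities multiply."""
    return 1 - reduce(lambda a, b: a * b, (1 - a for a in avail))

# Failing once every ~6 weeks (999h MTBF) with a 1h MTTR gives 99.9%;
# halving MTTR buys as much availability as doubling MTBF.
print(f"{availability(999, 1):.3%}")    # 99.900%
print(f"{series(0.999, 0.999):.4%}")    # 99.8001% (series hurts)
print(f"{parallel(0.999, 0.999):.6%}")  # 99.999900% (redundancy helps)
```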
Incident Management Lifecycle
Phase 1: DETECT
Sources: Alert fires, customer report, anomaly detection, synthetic monitor
Target: < 5 min from event to alert
Phase 2: DECLARE
Criteria: > N users impacted OR SLO burning at > Mx rate
Action: Page incident commander, open incident channel (#incident-YYYY-MM-DD-NNN)
Phase 3: COMMAND
Incident Commander (IC) roles:
- IC: Coordinates response, not doing the debugging
- Operations Lead: Does the technical investigation/mitigation
- Communications Lead: Updates stakeholders
Phase 4: COMMUNICATE
Internal: Every 15 minutes to incident channel
External: Every 30 minutes to status page
Format: "We are investigating elevated error rates in Order API.
Impact: ~5% of checkout attempts failing.
Next update: 14:30 UTC"
Phase 5: RESOLVE
Mitigation vs Resolution:
Mitigation = stop the bleeding (rollback, disable feature flag, redirect traffic)
Resolution = fix the root cause (may come later)
Always mitigate first, investigate second
Phase 6: RETROSPECTIVE (within 5 business days)
Blameless postmortem — see template below
On-Call Escalation Policy
```yaml
# PagerDuty / Opsgenie escalation policy
escalation_policy:
  name: "Order API On-Call"
  rules:
    - delay: 0   # Immediately
      targets:
        - type: schedule
          name: "Order API Primary On-Call"
    - delay: 10  # After 10 minutes with no ack
      targets:
        - type: schedule
          name: "Order API Secondary On-Call"
    - delay: 20  # After 20 minutes
      targets:
        - type: user
          name: "[email protected]"
    - delay: 30  # After 30 minutes (escalate to VP)
      targets:
        - type: user
          name: "[email protected]"
  repeat_enabled: true
  num_loops: 3
```
Postmortem Template (Blameless)
```markdown
# Postmortem: [Service] [Brief Description] — [Date]
**Severity**: P1 / P2 / P3
**Duration**: [Start] to [End] ([X hours Y minutes])
**Impact**: [N users affected], [X% of requests failed/degraded]
**Status**: Complete / In Progress

## Summary
[2-3 sentence executive summary: what happened, impact, how it was resolved]

## Timeline (all times UTC)
| Time  | Event |
|-------|-------|
| 13:42 | Deployment of v2.14.3 began |
| 13:47 | Alert: ErrorBudgetFastBurn fired (error rate 8%) |
| 13:49 | On-call acknowledged, began investigation |
| 13:52 | Root cause identified: missing DB migration |
| 13:55 | Rollback to v2.14.2 initiated |
| 14:01 | Error rate returned to baseline |
| 14:10 | Incident resolved, monitoring for 30 min |

## Root Cause
[Single clear sentence: The deployment of v2.14.3 introduced a query
using a column that did not yet exist in the production database.]

## Contributing Factors
- Migration was present in the codebase but not applied before deploy
- No pre-deploy migration check in CI/CD pipeline
- Staging database schema was ahead of migration (masked issue locally)

## What Went Well
- Alert fired within 2 minutes of first errors
- Rollback procedure executed in < 6 minutes
- Clear incident channel communication kept stakeholders informed

## What Could Have Gone Better
- 8 minutes to identify root cause (migration history not checked first)
- No automated migration-vs-schema check in deployment pipeline

## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add pre-deploy migration check to CI pipeline | @platform-team | 2024-02-01 | P1 |
| Document migration verification in deployment runbook | @platform-team | 2024-02-03 | P2 |
| Add staging DB parity check to pre-deploy checklist | @platform-team | 2024-02-07 | P2 |

## Lessons Learned
[What knowledge does the team have now that it didn't before?]
```
Chaos Engineering
```yaml
# LitmusChaos experiment: pod network latency
# litmus-network-chaos.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-api-network-chaos
spec:
  appinfo:
    appns: production
    applabel: app=order-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_LATENCY
              value: "200"  # 200ms added latency
            - name: TOTAL_CHAOS_DURATION
              value: "60"   # 60 seconds
            - name: TARGET_PODS
              value: "order-api-xxx"
        # Steady-state hypothesis: verify the SLO holds during chaos,
        # expressed as a continuously evaluated Prometheus probe
        probe:
          - name: error-rate-within-slo
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: http://prometheus:9090
              query: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
              comparator:
                criteria: "<"
                value: "0.001"
```
Chaos experiment ladder:
- Kill a single pod (basic resilience)
- Kill all pods in one AZ (zone failure)
- Add 200ms latency to a dependency (degraded mode)
- Drop 10% of packets to a downstream service (partial outage)
- Fill disk on a node (resource exhaustion)
- Terminate the database primary (failover testing)
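At its core, a steady-state hypothesis is just an assertion over an SLI that gates the experiment. A minimal sketch of the decision logic (hypothetical function, independent of any chaos framework):

```python
def steady_state_holds(error_count: float, total_count: float,
                       allowed_error_fraction: float = 0.001) -> bool:
    """True while the observed error ratio stays within the SLO's
    allowed error fraction (0.1% for a 99.9% availability SLO)."""
    if total_count == 0:
        return True  # no traffic in the window; nothing to judge
    return (error_count / total_count) < allowed_error_fraction

# Poll this every few seconds during injection and halt the
# experiment the moment it returns False:
# steady_state_holds(20, 10_000) is False (0.2% error ratio > 0.1% allowed)
```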
Anti-Patterns
❌ SLO tighter than SLA — you need margin between internal target and external commitment
❌ Threshold alerts instead of SLO burn rate alerts — "CPU > 80%" has no user impact correlation
❌ Incident postmortems with blame — blame-finding stops at people; blameless finds systemic causes
❌ Treating toil as unavoidable — every toil task should have a reduction ticket filed
❌ On-call without runbooks — alerting without guidance inflates MTTR and burns out engineers
❌ Chaos in production without a steady-state hypothesis — chaos should verify resilience, not break things randomly
❌ Tracking MTTD/MTTR but not doing anything with the data — metrics without action are theater
❌ Postmortem action items with no owner or due date — they will never get done
Quick Reference
Error budget math:
Monthly budget in minutes = (1 - SLO) × 43,200 min (30-day month)
99.9% SLO = 43.2 min/month
99.95% SLO = 21.6 min/month
99.99% SLO = 4.32 min/month
Burn rate thresholds (for a 99.9% SLO, i.e. 0.1% error budget):
14.4x (1h + 5m windows, 2% of budget per hour) → Critical/Page
6x (6h + 30m windows, 5% of budget per 6h) → Warning/Ticket
1x (3d + 6h windows, 10% of budget per 3 days) → Info/Track
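These thresholds all fall out of one relation: burn rate = (fraction of budget you tolerate spending) ÷ (alert window as a fraction of the 30-day SLO period). A quick sketch (illustrative helper name):

```python
SLO_PERIOD_HOURS = 30 * 24  # 720h rolling window

def burn_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the monthly error
    budget is consumed within `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours

print(burn_threshold(0.02, 1))  # 14.4 (page: 2% of budget in 1h)
print(burn_threshold(0.05, 6))  # 6.0 (ticket: 5% of budget in 6h)
```

Multiply the resulting burn rate by (1 - SLO) to get the raw error-rate threshold used in alerting rules: 14.4 × 0.001 = 0.0144.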
DORA metrics targets (elite performers):
Deployment frequency: Multiple times/day
Lead time for changes: < 1 hour
Change failure rate: 0-5%
MTTR: < 1 hour
On-call health metrics:
< 2 pages/shift (interrupt budget)
> 80% alert actionability (low noise)
< 10% time spent on toil
Postmortem filed within 5 business days of P1/P2