sre-practices

Expert Site Reliability Engineering practices covering SLI/SLO/SLA hierarchy, error budget management, burn rate alerting, toil elimination, reliability decomposition, incident management lifecycle, blameless postmortems, chaos engineering, capacity planning, and on-call design.

SRE Practices

SRE is not just DevOps with a fancier title. It is a discipline that treats reliability as a feature,
measures it with math, and makes the trade-off between reliability and velocity explicit through error
budgets. The fundamental insight: 100% is the wrong reliability target — chasing it means you are
shipping too slowly. The goal is to be exactly reliable enough while moving as fast as the business needs.

Core Mental Model

The SRE contract with the business is: "Define how reliable you need to be, we'll measure it, and if we
have remaining error budget, we'll spend it shipping features. When it runs out, we freeze releases and
focus on reliability." This transforms reliability from a fuzzy aspiration into a budget that product,
engineering, and operations can reason about together. SLIs are the measurements. SLOs are the
targets. SLAs are the customer-facing commitments (always looser than SLOs — you need margin to recover).
Error budget = 1 - SLO. Everything else flows from this.

SLI / SLO / SLA Hierarchy

SLA (customer commitment): "We'll refund credits if availability < 99%/month"
  └── SLO (internal target): "We aim for 99.9% availability" (margin above SLA)
        └── SLI (measurement): "% of HTTP requests returning 2xx or 3xx in < 500ms"

Error Budget:
  SLO = 99.9% → 0.1% allowed failure
  Monthly: 30 days × 24h × 60m × 60s = 2,592,000 seconds
  Error budget = 2,592 seconds of downtime (43 minutes/month)

Burn Rate:
  Budget depleted in 30 days → 1x burn rate (normal)
  Budget depleted in 3 days  → 10x burn rate (page immediately)
  Budget depleted in 1 hour  → 720x burn rate (all hands)
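The budget and burn-rate arithmetic above can be checked with a short script (plain Python, no SLO library; the function name is illustrative):

```python
# Error budget and burn rate for a 30-day SLO window.

SLO = 0.999                      # availability target
PERIOD_S = 30 * 24 * 60 * 60     # 30 days = 2,592,000 seconds

budget_s = (1 - SLO) * PERIOD_S  # allowed "bad" seconds per period
print(f"Error budget: {budget_s:.0f} s ({budget_s / 60:.1f} min/month)")

def burn_rate(time_to_depletion_s: float, period_s: float = PERIOD_S) -> float:
    """How many times faster than 'exactly exhausting the budget over
    the full period' the budget is currently being consumed."""
    return period_s / time_to_depletion_s

print(burn_rate(30 * 24 * 3600))  # depleted in 30 days -> 1.0 (normal)
print(burn_rate(3 * 24 * 3600))   # depleted in 3 days  -> 10.0 (page)
print(burn_rate(3600))            # depleted in 1 hour  -> 720.0 (all hands)
```

Note the budget comes out at 2,592 seconds, i.e. 43.2 minutes for a 30-day month.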

SLO Definition YAML

# slo.yaml — store in version control alongside service code
apiVersion: openslo/v1
kind: SLO
metadata:
  name: order-api-availability
  displayName: Order API Availability
  labels:
    team: platform
    tier: "1"
spec:
  service: order-api
  description: "Availability of the Order API for external customers"
  
  indicator:
    ratioMetric:
      metricSource:
        type: Prometheus
      good:
        metricQuery: |
          sum(rate(http_requests_total{job="order-api",status!~"5.."}[{{.Window}}]))
      total:
        metricQuery: |
          sum(rate(http_requests_total{job="order-api"}[{{.Window}}]))
  
  objectives:
    - displayName: "Monthly availability"
      op: gte
      target: 0.999        # 99.9%
      timeWindow:
        duration: 30d
        isRolling: true
  
  alertPolicies:
    - name: order-api-high-burn
      alertWhenBreaching: true
      conditions:
        - kind: burnrate
          op: gte
          threshold: 14.4   # 2% budget in 1h
          lookbackWindow: 1h
          alertAfter: 2m

Error Budget Burn Rate Alerting

Google's recommendation: multiwindow, multi-burn-rate alerting — two pairs of windows, a fast pair that pages and a slow pair that tickets. Each pair combines a long window (enough budget has burned) with a short window (it is still burning now), so alerts fire quickly but stop once the burn subsides.

# Prometheus alerting rules
groups:
  - name: slo-burn-rates
    rules:
      # Page: 2% budget in 1h (fast burn)
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1h]) /
            rate(http_requests_total[1h])
          ) > 0.0144
          and
          (
            rate(http_requests_total{status=~"5.."}[5m]) /
            rate(http_requests_total[5m])
          ) > 0.0144
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "SLO fast burn: {{ $labels.job }}"
          description: |
            Error rate {{ $value | humanizePercentage }} exceeds 14.4x burn.
            At this rate, 2% of monthly error budget burns in 1 hour.
          runbook_url: "https://runbooks.internal/order-api/high-error-rate"
      
      # Ticket: 5% budget in 6h (slow burn)  
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[6h]) /
            rate(http_requests_total[6h])
          ) > 0.006
          and
          (
            rate(http_requests_total{status=~"5.."}[30m]) /
            rate(http_requests_total[30m])
          ) > 0.006
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SLO slow burn: {{ $labels.job }}"
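The 0.0144 and 0.006 thresholds in the rules above fall out of one formula: error-rate threshold = burn-rate multiplier × (1 − SLO), where the multiplier is (fraction of budget consumed) × (period ÷ window). A quick derivation (plain Python, not tied to any Prometheus tooling; function names are illustrative):

```python
# Derive burn-rate alert thresholds from "X% of budget within window W".

PERIOD_H = 30 * 24  # 30-day SLO window, in hours

def burn_multiplier(budget_fraction: float, window_h: float) -> float:
    """Burn-rate multiple implied by spending `budget_fraction` of the
    monthly budget within `window_h` hours."""
    return budget_fraction * (PERIOD_H / window_h)

def error_rate_threshold(slo: float, budget_fraction: float, window_h: float) -> float:
    """Raw error-rate value to compare against in the alert expression."""
    return burn_multiplier(budget_fraction, window_h) * (1 - slo)

# Page: 2% of budget in 1h  -> 14.4x multiplier -> ~0.0144 at a 99.9% SLO
print(burn_multiplier(0.02, 1), error_rate_threshold(0.999, 0.02, 1))
# Ticket: 5% of budget in 6h -> 6x multiplier -> ~0.006 at a 99.9% SLO
print(burn_multiplier(0.05, 6), error_rate_threshold(0.999, 0.05, 6))
```

The multipliers depend only on the window choices, not on the SLO; only the raw error-rate threshold changes if the SLO changes.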

Toil: Identification and Elimination

Toil is operational work that is manual, repetitive, automatable, scales linearly with traffic, and adds no enduring value. Google's rule of thumb: cap toil at 50% of each SRE's time, spending the rest on engineering that reduces future toil.

Toil classification:
  ✅ Toil: Manually restarting crashed pods when OOM
  ✅ Toil: Provisioning access requests via ticket
  ✅ Toil: Copying logs from prod to debug a ticket
  ❌ Not toil: Writing the automation that prevents the crash
  ❌ Not toil: Designing the oncall rotation
  ❌ Not toil: Writing a postmortem

Elimination hierarchy:
  1. Eliminate: Does the task need to exist at all? (auto-restart solves the crash case)
  2. Automate: Write code to handle it
  3. Reduce frequency: Better monitoring to catch earlier
  4. Reduce duration: Runbooks, tooling, pre-provisioned access

Reliability Decomposition

Availability = MTBF / (MTBF + MTTR)

Where:
  MTBF = Mean Time Between Failures
  MTTR = Mean Time To Restore (detect + respond + diagnose + fix + verify)

To improve availability, you have two levers:
  MTBF (reduce failure frequency):
    - Better testing, canary deploys, feature flags
    - Dependency isolation, circuit breakers
    - Chaos engineering to find weaknesses proactively

  MTTR (reduce recovery time):
    - Faster detection (alerting latency)
    - Better runbooks (on-call tooling)
    - Rollback automation (< 5 min to roll back a deploy)
    - Pre-provisioned access (no ticket required to access prod)

Reliability calculation:
  Single service: 99.9%
  Two services in series: 99.9% × 99.9% = 99.8%
  Two services in parallel (either handles): 1 - (1-0.999)² = 99.9999%
  → Use parallel/redundant paths for critical flows
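The availability formula and the series/parallel composition above can be verified with a few lines (plain Python; function names are illustrative):

```python
# Availability from MTBF/MTTR, and composition in series vs parallel.

def availability(mtbf_h: float, mttr_h: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

def series(*avails: float) -> float:
    """All components must be up: multiply availabilities."""
    out = 1.0
    for a in avails:
        out *= a
    return out

def parallel(*avails: float) -> float:
    """At least one component up: 1 - product of unavailabilities."""
    out = 1.0
    for a in avails:
        out *= (1 - a)
    return 1 - out

a = 0.999
print(f"{series(a, a):.4%}")    # two services in series   -> ~99.80%
print(f"{parallel(a, a):.6%}")  # two redundant services   -> ~99.9999%

# MTTR is a direct lever: halving it on a 720h MTBF service
print(f"{availability(720, 1.0):.4%}")  # 1h MTTR
print(f"{availability(720, 0.5):.4%}")  # 30 min MTTR
```

This is also why the MTTR lever is usually cheaper than the MTBF lever: cutting recovery time improves availability without having to prevent any failures.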

Incident Management Lifecycle

Phase 1: DETECT
  Sources: Alert fires, customer report, anomaly detection, synthetic monitor
  Target: < 5 min from event to alert
  
Phase 2: DECLARE
  Criteria: > N users impacted OR SLO burning at > Mx rate
  Action: Page incident commander, open incident channel (#incident-YYYY-MM-DD-NNN)
  
Phase 3: COMMAND
  Incident Commander (IC) roles:
  - IC: Coordinates response, not doing the debugging
  - Operations Lead: Does the technical investigation/mitigation
  - Communications Lead: Updates stakeholders
  
Phase 4: COMMUNICATE
  Internal: Every 15 minutes to incident channel
  External: Every 30 minutes to status page
  Format: "We are investigating elevated error rates in Order API.
           Impact: ~5% of checkout attempts failing.
           Next update: 14:30 UTC"

Phase 5: RESOLVE
  Mitigation vs Resolution:
    Mitigation = stop the bleeding (rollback, disable feature flag, redirect traffic)
    Resolution = fix the root cause (may come later)
  Always mitigate first, investigate second

Phase 6: RETROSPECTIVE (within 5 business days)
  Blameless postmortem — see template below

On-Call Escalation Policy

# PagerDuty / Opsgenie escalation policy
escalation_policy:
  name: "Order API On-Call"
  rules:
    - delay: 0         # Immediately
      targets:
        - type: schedule
          name: "Order API Primary On-Call"
    
    - delay: 10        # After 10 minutes with no ack
      targets:
        - type: schedule
          name: "Order API Secondary On-Call"
    
    - delay: 20        # After 20 minutes
      targets:
        - type: user
          name: "[email protected]"
    
    - delay: 30        # After 30 minutes (escalate to VP)
      targets:
        - type: user
          name: "[email protected]"
  
  repeat_enabled: true
  num_loops: 3

Postmortem Template (Blameless)

# Postmortem: [Service] [Brief Description] — [Date]

**Severity**: P1 / P2 / P3
**Duration**: [Start] to [End] ([X hours Y minutes])
**Impact**: [N users affected], [X% of requests failed/degraded]
**Status**: Complete / In Progress

## Summary
[2-3 sentence executive summary: what happened, impact, how it was resolved]

## Timeline (all times UTC)
| Time  | Event |
|-------|-------|
| 13:42 | Deployment of v2.14.3 began |
| 13:47 | Alert: ErrorBudgetFastBurn fired (error rate 8%) |
| 13:49 | On-call acknowledged, began investigation |
| 13:52 | Root cause identified: missing DB migration |
| 13:55 | Rollback to v2.14.2 initiated |
| 14:01 | Error rate returned to baseline |
| 14:10 | Incident resolved, monitoring for 30 min |

## Root Cause
[Single clear sentence: The deployment of v2.14.3 introduced a query 
using a column that did not yet exist in the production database.]

## Contributing Factors
- Migration was present in the codebase but not applied before deploy
- No pre-deploy migration check in CI/CD pipeline
- Staging database schema was ahead of migration (masked issue locally)

## What Went Well
- Alert fired within 2 minutes of first errors
- Rollback procedure executed in < 6 minutes
- Clear incident channel communication kept stakeholders informed

## What Could Have Gone Better
- 8 minutes to identify root cause (migration history not checked first)
- No automated migration-vs-schema check in deployment pipeline

## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add pre-deploy migration check to CI pipeline | @platform-team | 2024-02-01 | P1 |
| Document migration verification in deployment runbook | @platform-team | 2024-02-03 | P2 |
| Add staging DB parity check to pre-deploy checklist | @platform-team | 2024-02-07 | P2 |

## Lessons Learned
[What knowledge does the team have now that it didn't before?]

Chaos Engineering

# LitmusChaos experiment: pod network latency
# litmus-network-chaos.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-api-network-chaos
spec:
  appinfo:
    appns: production
    applabel: app=order-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_LATENCY
              value: "200"    # 200ms added latency
            - name: TOTAL_CHAOS_DURATION
              value: "60"     # 60 seconds
            - name: TARGET_PODS
              value: "order-api-xxx"
  
        # Steady-state hypothesis: verify the SLO holds during chaos.
        # Litmus expresses this as a probe attached to the experiment
        # (promProbe field names sketched here; exact schema varies by version)
        probe:
          - name: error-rate-within-slo
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: http://prometheus:9090
              query: |
                sum(rate(http_requests_total{status=~"5.."}[5m])) /
                sum(rate(http_requests_total[5m]))
              comparator:
                criteria: "<"
                value: "0.001"
            runProperties:
              probeTimeout: 5s
              interval: 10s

Chaos experiment ladder:

  1. Kill a single pod (basic resilience)

  2. Kill all pods in one AZ (zone failure)

  3. Add 200ms latency to a dependency (degraded mode)

  4. Drop 10% of packets to a downstream service (partial outage)

  5. Fill disk on a node (resource exhaustion)

  6. Terminate the database primary (failover testing)


Anti-Patterns

SLA as tight as (or tighter than) your SLO — you need margin between the internal target and the external commitment
Threshold alerts instead of SLO burn rate alerts — "CPU > 80%" has no user impact correlation
Incident postmortems with blame — blame-finding stops at people; blameless finds systemic causes
Treating toil as unavoidable — every toil task should have a reduction ticket filed
On-call without runbooks — alerting without guidance inflates MTTR and burns out engineers
Chaos in production without a steady-state hypothesis — chaos should verify resilience, not break things randomly
Tracking MTTD/MTTR but not doing anything with the data — metrics without action are theater
Postmortem action items with no owner or due date — they will never get done

Quick Reference

Error budget math:
  Monthly budget in minutes = (1 - SLO) × 43,800 min
  (43,800 min = average month, 365 days ÷ 12; a 30-day month is 43,200 min)
  99.9% SLO  = 43.8 min/month
  99.95% SLO = 21.9 min/month
  99.99% SLO = 4.38 min/month

Burn rate thresholds (multipliers are SLO-independent):
  14.4x (1h + 5m windows)   → Critical/Page   (2% of budget in 1h)
  6x    (6h + 30m windows)  → Warning/Ticket  (5% of budget in 6h)
  1x    (3d + 6h windows)   → Info/Track      (10% of budget in 3d)

DORA metrics targets (elite performers):
  Deployment frequency:  Multiple times/day
  Lead time for changes: < 1 hour
  Change failure rate:   0-5%
  MTTR:                  < 1 hour

On-call health metrics:
  < 2 pages/shift (interrupt budget)
  > 80% alert actionability (low noise)
  < 10% time spent on toil
  Postmortem filed within 5 business days of P1/P2
