
observability-expert

Expert-level observability covering the three pillars (metrics, logs, traces), OpenTelemetry instrumentation, Prometheus metric types and PromQL, Grafana dashboard design using RED/USE methods, structured logging, distributed tracing with sampling strategies, SLO-based alerting, and Loki log

MoltbotDen
DevOps & Cloud

Observability Expert

Observability is not monitoring. Monitoring tells you when something is broken by checking known failure
modes. Observability lets you debug unknown failures by asking questions of your system after the fact.
The three pillars — metrics, logs, traces — are complementary, not interchangeable. You need all three.

Core Mental Model

Metrics tell you what is happening at aggregate level (request rate, error rate, latency). They're
cheap to store and query but lose individual request detail. Logs tell you what happened for specific
events, with full context. They're expensive at scale but essential for debugging. Traces tell you where
time was spent across service boundaries for a single request. The power is in correlation: a metric
spike leads you to a time window, logs give you the error details, and a trace shows you the slow span. Design
your telemetry so all three can be linked by a common trace ID.

The Three Pillars

Metrics (Prometheus/OTEL):
  "5% of requests are failing" → WHERE to look
  Aggregated, sampled, cheap to store and query
  
Logs (Loki/CloudWatch/ELK):
  "ORDER-123 failed: constraint violation on user_id" → WHAT happened
  Event-level detail, expensive at scale, essential for context
  
Traces (Jaeger/Tempo/Zipkin):
  "Payment service took 2.3s; 1.8s was in the DB query" → WHY it was slow
  Request-level, cross-service, shows causality

Correlation: trace_id links all three
  Metric alert fires → log query filters by time + service → trace ID found → full trace loaded
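The glue for that correlation is the W3C `traceparent` header, which carries the trace ID between services. A minimal parser, as a sketch (the header value in the example is illustrative):

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C Trace Context 'traceparent' header.

    Format: '<2 hex version>-<32 hex trace_id>-<16 hex span_id>-<2 hex flags>'
    """
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError(f"malformed traceparent: {header}")
    return {
        "version": version,
        "trace_id": trace_id,            # the ID that links metrics, logs, and traces
        "parent_span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # least-significant flag bit
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

In practice the OTel SDK propagates and parses this header for you; the point is that the 32-hex-character `trace_id` is the value you want stamped on every log line.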

OpenTelemetry: Architecture

Your App
  │
  ▼
OTel SDK (instrumentation)
  │ OTLP (gRPC/HTTP)
  ▼
OTel Collector
  ├── Receivers: OTLP, Jaeger, Zipkin, Prometheus scrape
  ├── Processors: batch, memory_limiter, resource detection, sampling
  └── Exporters: Jaeger, Tempo, Prometheus, CloudWatch, Datadog, OTLP
  
Backends:
  Traces → Jaeger / Grafana Tempo / Zipkin
  Metrics → Prometheus / Mimir / Datadog
  Logs → Loki / ELK / CloudWatch Logs

FastAPI + OpenTelemetry Auto-Instrumentation

# requirements.txt
# opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi
# opentelemetry-instrumentation-httpx opentelemetry-instrumentation-sqlalchemy

import os

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.sdk.resources import Resource

def setup_telemetry(app):
    resource = Resource.create({
        "service.name": "order-api",
        "service.version": os.environ.get("APP_VERSION", "unknown"),
        "deployment.environment": os.environ.get("ENVIRONMENT", "development"),
    })
    
    # Traces
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
    )
    trace.set_tracer_provider(tracer_provider)
    
    # Metrics
    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://otel-collector:4317"),
        export_interval_millis=10000
    )
    metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
    
    # Auto-instrument
    FastAPIInstrumentor.instrument_app(app)
    HTTPXClientInstrumentor().instrument()

Manual Spans with Semantic Conventions

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

async def process_payment(order_id: str, amount: float) -> dict:
    with tracer.start_as_current_span(
        "payment.process",
        kind=trace.SpanKind.CLIENT,
        attributes={
            SpanAttributes.DB_SYSTEM: "postgresql",
            "app.order.id": order_id,
            "app.payment.amount": amount,
        }
    ) as span:
        try:
            result = await payment_gateway.charge(order_id, amount)
            span.set_attribute("app.payment.transaction_id", result["transaction_id"])
            span.set_status(trace.Status(trace.StatusCode.OK))
            return result
        except PaymentDeclinedError as e:
            # Business error — mark as error but not exception
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            span.set_attribute("app.payment.decline_reason", e.reason)
            raise
        except Exception as e:
            # Unexpected error — record exception with stack trace
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise

OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  
  # Scrape Prometheus metrics from services
  prometheus:
    config:
      scrape_configs:
        - job_name: 'fastapi-services'
          scrape_interval: 15s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 1s
    send_batch_size: 1000
    send_batch_max_size: 2000
  
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  
  # Tail-based sampling: keep 100% of errors, 10% of successful traces
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        deployment.environment: "environment"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Prometheus Metric Types

Type       Use When                                 Example
Counter    Things that only go up                   requests_total, errors_total
Gauge      Things that go up and down               active_connections, queue_depth, memory_bytes
Histogram  Distribution of values (latency, size)   request_duration_seconds
Summary    Client-side quantiles (avoid at scale)   Legacy; prefer Histogram

Critical rule: Counters always use rate() or increase() in PromQL. Gauges are used directly.

from prometheus_client import Counter, Histogram, Gauge

# Counter: always increment, never decrement
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status_code']
)

# Histogram: buckets should match your SLO
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'path'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Gauge: current state
active_users = Gauge('active_users_current', 'Currently active users')

# Usage
with request_duration.labels(method='GET', path='/orders').time():
    result = process_request()
    requests_total.labels(method='GET', path='/orders', status_code='200').inc()
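The "counters need rate()" rule exists because the raw counter value only ever climbs, until a process restart resets it to zero. What rate() computes can be sketched in plain Python (a simplification: it ignores PromQL's extrapolation to the window boundaries):

```python
def prom_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second rate from chronological (timestamp, counter_value) samples.

    A decrease between samples is treated as a counter reset, as rate() does:
    the post-reset value counts as the increase since the restart.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur  # reset: restarted from 0
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 120 requests over 60s -> 2 req/s
print(prom_rate([(0, 0), (30, 60), (60, 120)]))
```

This is also why graphing a raw counter is meaningless: the interesting signal is the slope, not the value.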

PromQL Essentials

# Rate of requests per second (5m window)
rate(http_requests_total[5m])

# Error rate percentage
rate(http_requests_total{status_code=~"5.."}[5m]) /
  rate(http_requests_total[5m]) * 100

# 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Apdex score (SLO: 100ms target, 300ms frustrated)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m])) +
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))

# Multi-window burn rate for SLO alerting
# 1h window: fast burn detection
rate(http_requests_total{status_code=~"5.."}[1h]) /
  rate(http_requests_total[1h])

# Recording rules (precompute expensive queries)
# In prometheus rules YAML:
# - record: job:http_requests:rate5m
#   expr: sum(rate(http_requests_total[5m])) by (job)
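histogram_quantile works by linear interpolation inside cumulative buckets, which is why bucket boundaries must bracket your SLO. A simplified pure-Python sketch of that interpolation, assuming sorted cumulative buckets ending in +Inf (real PromQL has additional edge-case handling):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs ending with +Inf,
    mirroring the _bucket series a Prometheus histogram exposes.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # falls in +Inf: return last finite bound
            # linear interpolation inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

The interpolation assumes observations are uniformly distributed within a bucket, so a p99 estimate can never be more precise than the bucket it lands in: another reason to place a boundary exactly at the SLO threshold.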

Grafana Dashboard Design

RED Method (for services receiving requests)

  • Rate: Requests per second
  • Errors: Error rate %
  • Duration: Latency (p50, p95, p99)

USE Method (for resources: CPU, disk, network)

  • Utilization: % of time resource is busy
  • Saturation: Queued/waiting work
  • Errors: Error events

// Grafana panel: SLO status with threshold coloring
{
  "title": "Error Rate (SLO: <0.1%)",
  "type": "stat",
  "targets": [{
    "expr": "sum(rate(http_requests_total{status_code=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100",
    "legendFormat": "Error Rate %"
  }],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.05},
          {"color": "red", "value": 0.1}
        ]
      },
      "unit": "percent"
    }
  }
}

Structured Logging

import structlog
import logging

# Configure structlog with JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.stdlib.add_logger_name,
        structlog.processors.CallsiteParameterAdder([
            structlog.processors.CallsiteParameter.FILENAME,
            structlog.processors.CallsiteParameter.LINENO,
        ]),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

log = structlog.get_logger()

# Always log with structured fields — never string interpolation
log.info("order.processed",
    order_id="ord-123",
    user_id="usr-456",
    amount=99.99,
    duration_ms=142,
    trace_id=get_current_trace_id()  # Correlate with traces!
)

# Output: {"event": "order.processed", "order_id": "ord-123", "level": "info",
#          "timestamp": "2024-01-15T10:30:00Z", "trace_id": "abc123..."}
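When structlog isn't an option, the same single-line JSON shape can be approximated with the stdlib. A minimal sketch (the `fields` key on `extra` is a convention invented here, not a logging API):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON for log aggregators."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "logger": record.name,
        }
        # Structured fields passed as: log.info("event", extra={"fields": {...}})
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order.processed", extra={"fields": {"order_id": "ord-123", "duration_ms": 142}})
```

structlog remains the better choice at any scale (processors, context binding, contextvars), but this keeps logs machine-parseable without a new dependency.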

SLO-Based Alerting

# Prometheus alerting rule: multi-window burn rate
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 2% budget consumed in 1 hour → page immediately
      - alert: SLOHighBurnRate
        expr: |
          (
            rate(http_requests_total{job="order-api",status_code=~"5.."}[1h])
            / rate(http_requests_total{job="order-api"}[1h])
          ) > (14.4 * 0.001)  # 14.4x burn rate on 0.1% error budget
          and
          (
            rate(http_requests_total{job="order-api",status_code=~"5.."}[5m])
            / rate(http_requests_total{job="order-api"}[5m])
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          runbook_url: "https://wiki.example.com/runbooks/order-api-error-rate"
        annotations:
          summary: "Order API SLO burn rate critical"
          description: "Error rate {{ $value | humanizePercentage }} exceeds 14.4x burn rate"
      
      # Slow burn: 5% budget consumed in 6 hours → ticket
      - alert: SLOLowBurnRate
        expr: |
          (
            rate(http_requests_total{job="order-api",status_code=~"5.."}[6h])
            / rate(http_requests_total{job="order-api"}[6h])
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
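The 14.4 and 6 multipliers fall out of simple arithmetic: burn rate = (fraction of budget consumed × SLO window) / alert window. A quick check, assuming a 30-day (720-hour) SLO window:

```python
def burn_rate(budget_consumed: float, alert_window_hours: float,
              slo_window_hours: float = 720.0) -> float:
    """Burn-rate multiplier: how many times faster than 'exactly exhausting
    the budget over the SLO window' errors are arriving."""
    return budget_consumed * slo_window_hours / alert_window_hours

# Page: 2% of budget in 1h -> 14.4x; with a 0.1% budget, alert at 1.44% errors
fast = burn_rate(0.02, 1)
# Ticket: 5% of budget in 6h -> 6x; alert threshold 0.6% errors
slow = burn_rate(0.05, 6)
```

Multiplying the burn rate by the error budget (0.001 here) gives the actual error-rate threshold in the alert expression.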

Anti-Patterns

Logging at DEBUG level in production — use sampling or dynamic log level adjustment
High-cardinality label values in Prometheus (user IDs, order IDs as labels) — causes cardinality explosion
Histograms with wrong bucket boundaries — buckets must bracket your SLO threshold
Tracing 100% of requests — tail-based sampling keeps 100% of errors, sample the rest
Threshold-based alerts instead of SLO-based — "CPU > 80%" tells you nothing about user impact
Alerts without runbook URLs — an alert that fires without guidance causes MTTR inflation
Ignoring cold start metrics — Lambda/Cloud Run p99 is dominated by cold starts; track separately
String interpolation in log messages — log.info(f"processed {order_id}") is unsearchable
Missing trace ID in logs — without this, metrics → logs → traces correlation is manual and slow

Quick Reference

Metric naming conventions:
  <namespace>_<unit>_total         (counter: requests_total, errors_total)
  <namespace>_<unit>_bytes         (gauge: memory_bytes, queue_bytes)  
  <namespace>_duration_seconds     (histogram: http_request_duration_seconds)
  <namespace>_<unit>_created       (auto-created timestamp for counters)

PromQL cheat sheet:
  rate(counter[5m])                → per-second rate over 5min window
  increase(counter[1h])            → total increase over 1 hour
  histogram_quantile(0.99, ...)    → 99th percentile
  avg_over_time(gauge[5m])         → average of gauge over time
  topk(5, metric)                  → top 5 series by value
  sum by (label) (metric)          → aggregate by label

OTel span status guide:
  UNSET  → default, didn't set status (treated as OK by backends)
  OK     → explicitly mark successful (use sparingly)
  ERROR  → something went wrong (set description)
  Use record_exception() for stack traces on unexpected errors

Sampling strategy:
  Head-based: decision at trace start (fast, lose tail errors)
  Tail-based: decision after trace complete (catches errors, needs collector buffer)
  Parent-based: inherit parent's sampling decision (distributed systems default)
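A head-based ratio sampler can be made consistent across services by deriving the decision from the trace ID itself, so every hop agrees without coordination. A sketch loosely modeled on OTel's TraceIdRatioBased sampler (the exact bit manipulation varies across SDK versions):

```python
def sample_trace(trace_id: int, percentage: float) -> bool:
    """Deterministic head-based sampling decision from a 128-bit trace ID.

    Compares the low 64 bits of the ID against a threshold, so the same
    trace gets the same decision in every service that sees it.
    """
    threshold = int(percentage / 100.0 * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold
```

Because trace IDs are generated uniformly at random, roughly `percentage` percent of traces fall below the threshold; the decision still happens before you know whether the trace will error, which is the gap tail-based sampling closes.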
