platform-engineering

Expert platform engineering covering Internal Developer Platforms (IDP), golden paths, Backstage setup and configuration, DORA metrics and developer experience, service catalog design, secret management self-service, dependency tracking, and developer onboarding automation.

Platform Engineering

Platform engineering is product management applied to developer infrastructure. Your customers are the
engineers who build the product. Platform success means faster onboarding, fewer toil tasks, fewer
"how do I deploy this?" Slack messages, and measurable improvement in DORA metrics. The platform team
builds the bike lanes — developers ride them without thinking about the asphalt.

Core Mental Model

An Internal Developer Platform (IDP) is a curated layer of abstraction over your cloud infrastructure.
It doesn't replace Kubernetes or Terraform — it hides the complexity behind self-service interfaces.
The golden path is the opinionated, pre-approved way to do common tasks: create a service, add a
database, configure CI/CD, handle secrets. Engineers can go off-path when needed, but the path should
be so smooth they rarely need to. Platform engineering fails when it over-abstracts and loses the
escape hatch, or under-abstracts and just creates a fancy ticket system.

IDP Core Concepts

Golden Path (paved road):
  "Creating a new microservice" → template fills in:
  ✅ GitHub repo with CI/CD pre-configured
  ✅ Service catalog entry created
  ✅ Default observability (OTel, Grafana dashboard)
  ✅ Deployment to staging on first push
  ✅ Secret management integration
  ✅ Runbook template created
  ✅ PagerDuty service registration

Self-Service Capabilities (target state):
  Developer does themselves:     Ops/Platform does on request:
  ─ Create new service           ─ Create new GCP project
  ─ Deploy to staging            ─ Production access (RBAC)
  ─ Add environment variables    ─ Budget alerts
  ─ Access logs/metrics          ─ Cross-account access
  ─ Create feature flags         ─ Compliance exceptions
  ─ Run load tests
  ─ Provision preview environments

Platform Team: Build the rails, not the trains.
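
The self-service split above can also be kept as data, so a portal or chat bot can tell an engineer immediately whether a task is something they can do themselves. A minimal sketch, assuming the capability names from the table; the function name `route_request` and the three return labels are hypothetical:

```python
# Capabilities copied from the target-state table; the labels are illustrative.
SELF_SERVICE = {
    "create new service",
    "deploy to staging",
    "add environment variables",
    "access logs/metrics",
    "create feature flags",
    "run load tests",
    "provision preview environments",
}

ON_REQUEST = {
    "create new gcp project",
    "production access (rbac)",
    "budget alerts",
    "cross-account access",
    "compliance exceptions",
}


def route_request(capability: str) -> str:
    """Return how a request for the given capability should be handled."""
    key = capability.strip().lower()
    if key in SELF_SERVICE:
        return "self-service"
    if key in ON_REQUEST:
        return "platform-request"
    # Anything not listed yet is a signal to extend the table, not to open a ticket queue.
    return "unclassified"
```

Tracking how many requests come back as "unclassified" is one way to see where the golden path has gaps.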

Backstage: Setup and Configuration

Catalog: catalog-info.yaml

# Every service, library, website, API, and resource should have this file
# Commit it at the root of the repo

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-api
  title: Order API
  description: Core order processing service. Handles cart → checkout → fulfillment.
  labels:
    tier: "1"                    # Criticality tier
    team: "platform"
  annotations:
    # Source control
    github.com/project-slug: my-org/order-api
    
    # CI/CD
    github.com/workflow-dispatch: build
    
    # Observability
    grafana/dashboard-selector: "title=Order API"
    prometheus.io/rule: "job=order-api"
    
    # Docs
    backstage.io/techdocs-ref: dir:.    # TechDocs from this repo

    # On-call
    pagerduty.com/service-id: "PXXXXXX"

    # Cost tracking
    finops.example.com/cost-center: "engineering-platform"
  tags:
    - go
    - rest
    - postgresql
    - tier-1
  links:
    - url: https://order-api.example.com/docs
      title: API Docs
    - url: https://runbooks.example.com/order-api
      title: Runbook
    - url: https://grafana.example.com/d/order-api
      title: Dashboard

spec:
  type: service
  lifecycle: production
  owner: group:platform-team
  system: ecommerce-platform
  dependsOn:
    - component:payment-service
    - component:inventory-service
    - resource:orders-postgres-db
  providesApis:
    - order-api-v1

Software Template (Golden Path)

# template.yaml — Backstage scaffolder template for new Go microservice
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice
  title: Go Microservice
  description: Create a new Go microservice with CI/CD, observability, and deployment
  tags:
    - go
    - microservice
    - recommended

spec:
  owner: group:platform-team
  type: service

  parameters:
    - title: Service Details
      required: [name, description, owner]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]{1,40}[a-z0-9]$'
          description: "Lowercase letters, digits, and hyphens (e.g. payment-processor)"
        description:
          title: Description
          type: string
        owner:
          title: Team Owner
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['1', '2', '3']
          description: "Tier 1 = SLO 99.9%, Tier 2 = 99.5%, Tier 3 = 99%"

    - title: Infrastructure
      properties:
        hasDatabase:
          title: Needs PostgreSQL database?
          type: boolean
          default: false
        hasCaching:
          title: Needs Redis cache?
          type: boolean
          default: false
        minInstances:
          title: Minimum Cloud Run instances
          type: integer
          default: 1

  steps:
    - id: fetch-template
      name: Fetch service template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
          tier: ${{ parameters.tier }}
          hasDatabase: ${{ parameters.hasDatabase }}

    - id: publish-github
      name: Create GitHub repo
      action: publish:github
      input:
        repoUrl: github.com?owner=my-org&repo=${{ parameters.name }}
        defaultBranch: main
        description: ${{ parameters.description }}
        topics: [go, microservice]
        requireCodeOwnerReviews: true
        dismissStaleReviews: true

    - id: register-catalog
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish-github'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

    - id: create-pagerduty-service
      name: Create PagerDuty service
      action: pagerduty:service:create
      input:
        name: ${{ parameters.name }}
        # Scaffolder expressions are Nunjucks, so use an inline if rather than && / ||
        escalationPolicyId: "${{ 'P-TIER1' if parameters.tier == '1' else 'P-TIER23' }}"

    - id: notify-slack
      name: Notify team channel
      action: http:backstage:request
      input:
        method: POST
        path: /api/slack/send
        body:
          channel: "#platform-new-services"
          text: "New service created: ${{ parameters.name }} by ${{ parameters.owner }}"

  output:
    links:
      - title: Repository
        url: ${{ steps['publish-github'].output.remoteUrl }}
      - title: Service Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}

DORA Metrics

Four Key Metrics (measure these, improve these).
Thresholds vary between State of DevOps report years, so treat the bands as approximate.

1. Deployment Frequency
   Elite: Multiple times/day
   High:  Once/day to once/week
   Medium: Once/week to once/month
   Low:   Less than once/month
   → Proxy for: small batch size, CI/CD maturity

2. Lead Time for Changes
   Elite: < 1 hour (commit to production)
   High:  1 day to 1 week
   Medium: 1 week to 1 month
   Low:   > 1 month
   → Proxy for: PR review speed, CI speed, deployment automation

3. Change Failure Rate
   Elite: 0-5%
   High:  0-15%
   Medium: 16-30%
   Low:   46-60%
   → Proxy for: test coverage, feature flags, canary deploys

4. Mean Time to Restore (MTTR)
   Elite: < 1 hour
   High:  < 1 day
   Medium: 1 day to 1 week
   Low:   > 1 week
   → Proxy for: observability, runbook quality, on-call access

# DORA metrics collection from GitHub deployments (PyGithub)
# Covers deployment frequency and lead time; change failure rate and MTTR
# additionally need incident data (for example from PagerDuty).

import datetime
import os
import statistics

from github import Github


def _as_utc(dt: datetime.datetime) -> datetime.datetime:
    # Older PyGithub releases return naive UTC datetimes; newer ones are timezone-aware
    return dt if dt.tzinfo else dt.replace(tzinfo=datetime.timezone.utc)


def calculate_dora_metrics(repo_name: str, days: int = 90):
    gh = Github(os.environ['GITHUB_TOKEN'])
    repo = gh.get_repo(repo_name)

    since = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=days)

    # Deployment frequency: prod deploys with at least one "success" status.
    # The state lives on the deployment's statuses, not on the deployment itself.
    deployments = [
        d for d in repo.get_deployments(environment="production")
        if _as_utc(d.created_at) > since
        and any(s.state == "success" for s in d.get_statuses())
    ]

    deploy_frequency = len(deployments) / days  # per day

    # Lead time: commit author date to production deployment creation
    lead_times = []
    for deploy in deployments[:50]:  # Sample the 50 most recent (API lists newest first)
        commit = repo.get_commit(deploy.sha)
        delta = _as_utc(deploy.created_at) - _as_utc(commit.commit.author.date)
        lead_times.append(delta.total_seconds() / 3600)

    median_lead_time = statistics.median(lead_times) if lead_times else None

    return {
        "deployment_frequency": deploy_frequency,
        "median_lead_time_hours": median_lead_time,
        "period_days": days,
        "total_deployments": len(deployments),
    }

Helmfile for Multi-Environment Management

# helmfile.yaml
# (newer Helmfile versions may expect a .gotmpl extension when the file uses templates)
environments:
  development:
    values:
      - environments/development.yaml
  staging:
    values:
      - environments/staging.yaml
  production:
    values:
      - environments/production.yaml

repositories:
  - name: bitnami
    url: https://charts.bitnami.com/bitnami
  - name: my-charts
    url: us-central1-docker.pkg.dev/my-project/helm-charts
    oci: true                       # OCI registries are declared without the oci:// scheme

releases:
  - name: order-api
    chart: my-charts/order-api
    version: "1.2.3"
    namespace: order-service
    createNamespace: true
    values:
      - values/order-api.yaml
      - values/order-api.{{ .Environment.Name }}.yaml
    set:
      - name: image.tag
        value: {{ requiredEnv "IMAGE_TAG" | quote }}   # Quote so numeric-looking tags stay strings

  - name: postgresql
    chart: bitnami/postgresql
    version: "12.x.x"
    namespace: order-service
    condition: postgresql.enabled   # Only deploy if enabled in env values
    values:
      - values/postgresql.yaml

  - name: redis
    chart: bitnami/redis
    version: "17.x.x"
    namespace: shared
    condition: redis.enabled

Secret Management Self-Service

# Platform API: self-service secret creation
# Engineers request secrets via PR to a secrets config, platform auto-provisions

# secrets-config.yaml (in version control, reviewed by platform team)
# secrets:
#   - name: stripe-api-key
#     service: order-api
#     environment: production
#     type: external    # Managed by team, stored in Vault
#   - name: db-password
#     service: order-api
#     environment: production
#     type: generated   # Platform generates and rotates

import os

import hvac  # HashiCorp Vault client

# Where Kubernetes mounts the pod's service account token
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class SecretManager:
    def __init__(self):
        self.client = hvac.Client(url=os.environ['VAULT_ADDR'])
        # Kubernetes auth needs the service account JWT alongside the role
        with open(SA_TOKEN_PATH) as token_file:
            jwt = token_file.read().strip()
        self.client.auth.kubernetes.login(role="platform-provisioner", jwt=jwt)

    def provision_secret(self, service: str, env: str, secret_name: str) -> str:
        """Create a Vault policy and role for a service secret and return the secret path."""
        # For a KV v2 mount the policy path needs a data/ segment (secrets/data/...)
        vault_path = f"secrets/{env}/{service}/{secret_name}"

        # Create policy that allows service SA to read this path
        policy_name = f"{env}-{service}-{secret_name}"
        policy_hcl = f"""
path "{vault_path}" {{
  capabilities = ["read"]
}}
"""
        self.client.sys.create_or_update_policy(
            name=policy_name,
            policy=policy_hcl,
        )

        # Bind policy to k8s service account (assumes the namespace is named after the env).
        # Writing the role replaces its policy list, so merge in existing policies
        # if a service owns more than one secret.
        self.client.auth.kubernetes.create_role(
            name=f"{env}-{service}",
            bound_service_account_names=[service],
            bound_service_account_namespaces=[env],
            policies=[policy_name],
            ttl="1h",
        )

        return vault_path

Developer Onboarding Automation

#!/usr/bin/env bash
# onboard-developer.sh: run for new engineers
# Usage: ./onboard-developer.sh <email>

set -euo pipefail

EMAIL="${1:?Usage: $0 <email>}"
USERNAME="${EMAIL%%@*}"   # Assumes the GitHub username matches the email local part

# Placeholders: set these in the environment before running
: "${ENGINEERS_GROUP:?set to the email address of the engineers Google group}"
: "${SLACK_WEBHOOK:?set to the Slack incoming webhook URL}"

echo "🚀 Onboarding $USERNAME..."

# 1. GitHub access
gh api orgs/my-org/invitations \
  --method POST \
  -f email="$EMAIL" \
  -f role=direct_member

# 2. Add to standard teams ("your-squad" is a placeholder for the engineer's team)
for team in "all-engineers" "your-squad" "on-call-rotation"; do
  gh api "orgs/my-org/teams/$team/memberships/$USERNAME" \
    --method PUT -f role=member
done

# 3. Cloud access (GCP: add to group, IAM propagates)
gcloud identity groups memberships add \
  --group-email="$ENGINEERS_GROUP" \
  --member-email="$EMAIL"

# 4. 1Password team invite (check the flags against your op CLI version)
op user provision --name "$USERNAME" --email "$EMAIL" --team engineers

# 5. Send welcome message with checklist
cat <<EOF | curl -fsS -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK"
{
  "text": "👋 Welcome $USERNAME! Your access has been provisioned.\n\n*Checklist:*\n• ✅ GitHub access granted\n• ✅ GCP read access granted\n• ✅ 1Password invite sent\n• 📖 Read: https://wiki.example.com/onboarding\n• 🔧 Set up local dev: https://wiki.example.com/local-dev"
}
EOF

echo "✅ Done! $USERNAME has been onboarded."

Anti-Patterns

❌ Golden paths with no escape hatch: developers will route around a platform they can't customize
❌ Platform team as gatekeeper, not enabler: every request through a ticket defeats self-service
❌ Backstage without owners: a stale catalog is worse than no catalog; enforce ownership
❌ Measuring platform success by features shipped: measure developer time saved and DORA metrics
❌ IDP that just wraps existing complexity: if the UI requires understanding Terraform, you failed
❌ Template scaffolding without keeping it updated: outdated templates breed security debt
❌ Missing delete/teardown automation: spinning up is easy, cleanup is where cost runs away
❌ Ignoring mobile/non-cloud developers: platform must work for all personas

Quick Reference

Platform maturity stages:
  Level 0: Manual everything, tribal knowledge
  Level 1: Runbooks exist, some automation scripts
  Level 2: Self-service deployment (push to deploy)
  Level 3: Service catalog + golden path templates
  Level 4: Internal developer portal (Backstage), DORA tracked
  Level 5: Platform as product with NPS, roadmap, on-call

DORA quick wins by metric:
  Deployment frequency: Feature flags + trunk-based development
  Lead time: PR size limit (< 400 lines), automated PR review bots
  Change failure rate: Canary deploys + automated rollback
  MTTR: Runbook links in alerts + pre-provisioned prod access

Backstage entity types:
  Component   → Services, websites, libraries, documentation
  API         → OpenAPI, GraphQL, AsyncAPI specs
  Resource    → Databases, S3 buckets, queues
  Group       → Teams
  User        → Engineers
  System      → Logical grouping of related components
  Domain      → Business domain grouping systems
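
The catalog and template examples refer to entities with strings such as group:platform-team and resource:orders-postgres-db. These follow Backstage's entity reference shape, `[kind:][namespace/]name`. When writing tooling around the catalog (ownership reports, dependency checks), a small parser is often enough. The sketch below is a hypothetical helper, not Backstage's own implementation; defaulting the kind to "component" and lowercasing the kind are choices made for this example:

```python
from typing import NamedTuple


class EntityRef(NamedTuple):
    kind: str
    namespace: str
    name: str


def parse_entity_ref(
    ref: str,
    default_kind: str = "component",
    default_namespace: str = "default",
) -> EntityRef:
    """Split a reference of the form [kind:][namespace/]name into its parts."""
    kind, colon, rest = ref.partition(":")
    if not colon:
        # No "kind:" prefix, so the whole string is [namespace/]name
        kind, rest = default_kind, ref

    namespace, slash, name = rest.partition("/")
    if not slash:
        # No "namespace/" prefix
        namespace, name = default_namespace, rest

    if not kind or not namespace or not name:
        raise ValueError(f"malformed entity reference: {ref!r}")

    return EntityRef(kind.lower(), namespace, name)


if __name__ == "__main__":
    for ref in ("group:platform-team", "resource:orders-postgres-db", "payment-service"):
        print(ref, "->", parse_entity_ref(ref))
```

Grouping parsed references by kind is one way to confirm that every `owner` points at a Group and every `dependsOn` entry points at a Component or Resource.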
