platform-engineering

Expert platform engineering covering Internal Developer Platforms (IDP), golden paths, Backstage setup and configuration, DORA metrics and developer experience, service catalog design, secret management self-service, dependency tracking, and developer onboarding automation.

MoltbotDen
DevOps & Cloud

Platform Engineering

Platform engineering is product management applied to developer infrastructure. Your customers are the
engineers who build the product. Platform success means faster onboarding, fewer toil tasks, fewer
"how do I deploy this?" Slack messages, and measurable improvement in DORA metrics. The platform team
builds the bike lanes — developers ride them without thinking about the asphalt.

Core Mental Model

An Internal Developer Platform (IDP) is a curated layer of abstraction over your cloud infrastructure.
It doesn't replace Kubernetes or Terraform — it hides the complexity behind self-service interfaces.
The golden path is the opinionated, pre-approved way to do common tasks: create a service, add a
database, configure CI/CD, handle secrets. Engineers can go off-path when needed, but the path should
be so smooth they rarely need to. Platform engineering fails when it over-abstracts and loses the
escape hatch, or under-abstracts and just creates a fancy ticket system.

IDP Core Concepts

Golden Path (paved road):
  "Creating a new microservice" → template fills in:
  ✅ GitHub repo with CI/CD pre-configured
  ✅ Service catalog entry created
  ✅ Default observability (OTel, Grafana dashboard)
  ✅ Deployment to staging on first push
  ✅ Secret management integration
  ✅ Runbook template created
  ✅ PagerDuty service registration

Self-Service Capabilities (target state):
  Developer does themselves:     Ops/Platform does on request:
  ─ Create new service           ─ Create new GCP project
  ─ Deploy to staging            ─ Production access (RBAC)
  ─ Add environment variables    ─ Budget alerts
  ─ Access logs/metrics          ─ Cross-account access
  ─ Create feature flags         ─ Compliance exceptions
  ─ Run load tests
  ─ Provision preview environments

Platform Team: Build the rails, not the trains.

Backstage: Setup and Configuration

Catalog: catalog-info.yaml

# Every service, library, website, API, and resource should have this file
# Commit it at the root of the repo

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-api
  title: Order API
  description: Core order processing service. Handles cart → checkout → fulfillment.
  labels:
    tier: "1"                    # Criticality tier
    team: "platform"
  annotations:
    # Source control
    github.com/project-slug: my-org/order-api
    
    # CI/CD
    github.com/workflow-dispatch: build
    
    # Observability
    grafana/dashboard-selector: "title=Order API"
    prometheus.io/rule: "job=order-api"
    
    # Runbook and docs
    backstage.io/techdocs-ref: dir:.    # TechDocs from this repo
    pagerduty.com/service-id: "PXXXXXX"
    
    # Cost tracking
    finops.example.com/cost-center: "engineering-platform"
  tags:
    - go
    - rest
    - postgresql
    - tier-1
  links:
    - url: https://order-api.example.com/docs
      title: API Docs
    - url: https://runbooks.example.com/order-api
      title: Runbook
    - url: https://grafana.example.com/d/order-api
      title: Dashboard

spec:
  type: service
  lifecycle: production
  owner: group:platform-team
  system: ecommerce-platform
  dependsOn:
    - component:payment-service
    - component:inventory-service
    - resource:orders-postgres-db
  providesApis:
    - order-api-v1
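
A catalog only stays useful if every entry keeps an owner, so it can help to lint catalog-info.yaml files in CI. The sketch below is one hypothetical way to do that: `lint_component` and the `REQUIRED_SPEC_FIELDS` rule set are assumptions made for illustration, not Backstage features, and the function works on an entity that has already been parsed into a dictionary.

```python
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")  # assumed house rules


def lint_component(entity: dict) -> list[str]:
    """Return human-readable problems found in a parsed Component entity."""
    problems: list[str] = []
    if entity.get("kind") != "Component":
        return problems  # this sketch only checks Component entities
    metadata = entity.get("metadata") or {}
    if not metadata.get("name"):
        problems.append("metadata.name is missing")
    spec = entity.get("spec") or {}
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is missing")
    return problems


# Example entity with the owner left out on purpose
order_api = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "order-api"},
    "spec": {"type": "service", "lifecycle": "production"},
}
print(lint_component(order_api))  # → ['spec.owner is missing']
```

A CI job could fail the build whenever the returned list is non-empty, which makes ownership a merge requirement rather than a convention.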

Software Template (Golden Path)

# template.yaml — Backstage scaffolder template for new Go microservice
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice
  title: Go Microservice
  description: Create a new Go microservice with CI/CD, observability, and deployment
  tags:
    - go
    - microservice
    - recommended

spec:
  owner: group:platform-team
  type: service

  parameters:
    - title: Service Details
      required: [name, description, owner]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]{1,40}[a-z0-9]$'
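
The service name pattern can be exercised outside Backstage before it goes into a template. This is a minimal sketch that assumes the pattern is anchored at both ends (`^` and `$`); the helper name `is_valid_service_name` is hypothetical.

```python
import re

# Pattern from the template's name field, assumed to end with "$"
SERVICE_NAME = re.compile(r"^[a-z][a-z0-9-]{1,40}[a-z0-9]$")


def is_valid_service_name(name: str) -> bool:
    """True when the name would pass the form's pattern check."""
    return SERVICE_NAME.fullmatch(name) is not None


for candidate in ("payment-processor", "Payment", "ab", "order-api-"):
    print(candidate, is_valid_service_name(candidate))
```

Under this pattern a name is 3 to 42 characters long, starts with a lowercase letter, and cannot end with a hyphen. Consecutive hyphens are still accepted, which may or may not be what you want.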

DORA Metrics

Four Key Metrics (measure these, improve these):

1. Deployment Frequency
   Elite: Multiple times/day
   High:  Once/day to once/week
   Medium: Once/week to once/month
   Low:   Less than once/month
   → Proxy for: small batch size, CI/CD maturity

2. Lead Time for Changes
   Elite: < 1 hour (commit to production)
   High:  1 day to 1 week
   Medium: 1 week to 1 month
   Low:   > 1 month
   → Proxy for: PR review speed, CI speed, deployment automation

3. Change Failure Rate
   Elite: 0-5%
   High:  0-15%
   Medium: 16-30%
   Low:   46-60%
   (Bands shift between annual DORA reports and are not contiguous; treat them as rough guides.)
   → Proxy for: test coverage, feature flags, canary deploys

4. Mean Time to Restore (MTTR)
   Elite: < 1 hour
   High:  < 1 day
   Medium: 1 day to 1 week
   Low:   > 1 week
   → Proxy for: observability, runbook quality, on-call access
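
The bands above can be turned into a small classifier for dashboards. The sketch below covers deployment frequency and time to restore only, since those two lists are contiguous. The function names are hypothetical, and the boundary handling (a month taken as 30 days, exactly one deploy per day counted as High) is an assumption.

```python
def deployment_frequency_tier(deploys_per_day: float) -> str:
    """Map average production deploys per day onto the bands listed above."""
    if deploys_per_day > 1:
        return "Elite"   # multiple times per day
    if deploys_per_day >= 1 / 7:
        return "High"    # once per day to once per week
    if deploys_per_day >= 1 / 30:
        return "Medium"  # once per week to once per month
    return "Low"


def time_to_restore_tier(hours: float) -> str:
    """Map mean time to restore, in hours, onto the bands listed above."""
    if hours < 1:
        return "Elite"
    if hours < 24:
        return "High"
    if hours <= 24 * 7:
        return "Medium"
    return "Low"


print(deployment_frequency_tier(0.5))  # → High
print(time_to_restore_tier(48))        # → Medium
```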

# DORA metrics collection from GitHub
# (change failure rate and MTTR additionally need incident data, e.g. from PagerDuty)
# Deployment frequency: count deployments to production per team per day

import datetime
import os
import statistics

from github import Github


def calculate_dora_metrics(repo_name: str, days: int = 90):
    gh = Github(os.environ['GITHUB_TOKEN'])
    repo = gh.get_repo(repo_name)

    # Assumes PyGithub 2.x, which returns timezone-aware datetimes
    since = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=days)

    # Deployment frequency: successful prod deploys.
    # The state lives on a deployment's statuses (newest first), not on the deployment itself.
    deployments = []
    for d in repo.get_deployments(environment="production"):
        if d.created_at <= since:
            continue
        latest = next(iter(d.get_statuses()), None)
        if latest is not None and latest.state == "success":
            deployments.append(d)

    deploy_frequency = len(deployments) / days  # per day

    # Lead time: commit SHA to production deployment
    lead_times = []
    for deploy in deployments[:50]:  # Sample the first 50 returned
        commit = repo.get_commit(deploy.sha)
        lead_time = (deploy.created_at - commit.commit.author.date).total_seconds() / 3600
        lead_times.append(lead_time)

    median_lead_time = statistics.median(lead_times) if lead_times else None

    return {
        "deployment_frequency": deploy_frequency,
        "median_lead_time_hours": median_lead_time,
        "period_days": days,
        "total_deployments": len(deployments),
    }
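
The function above covers deployment frequency and lead time. The other two metrics need incident data, and the sketch below shows one way they might be computed once that data is in hand. The `Deploy` and `Incident` record types are hypothetical; in practice they would be filled from your deployment log and an incident tool such as PagerDuty.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Deploy:
    sha: str
    caused_incident: bool  # assumed to be labelled during incident review


@dataclass
class Incident:
    opened_at: datetime
    resolved_at: datetime


def change_failure_rate(deploys: list[Deploy]) -> Optional[float]:
    """Fraction of deploys that led to an incident, or None without data."""
    if not deploys:
        return None
    return sum(1 for d in deploys if d.caused_incident) / len(deploys)


def mean_time_to_restore_hours(incidents: list[Incident]) -> Optional[float]:
    """Average hours from incident open to resolution, or None without data."""
    if not incidents:
        return None
    return mean([(i.resolved_at - i.opened_at).total_seconds() / 3600 for i in incidents])


deploys = [Deploy(sha=f"c{n}", caused_incident=(n < 3)) for n in range(20)]
incidents = [
    Incident(datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 12)),  # 2 hours
    Incident(datetime(2024, 1, 5, 9), datetime(2024, 1, 5, 13)),   # 4 hours
]
print(change_failure_rate(deploys))           # 3 of 20 deploys failed
print(mean_time_to_restore_hours(incidents))  # average of 2 h and 4 h
```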

Helmfile for Multi-Environment Management

# helmfile.yaml
environments:
  development:
    values:
      - environments/development.yaml
  staging:
    values:
      - environments/staging.yaml
  production:
    values:
      - environments/production.yaml

repositories:
  - name: bitnami
    url: https://charts.bitnami.com/bitnami
  - name: my-charts
    url: us-central1-docker.pkg.dev/my-project/helm-charts
    oci: true                       # OCI registries: set oci: true and drop the oci:// scheme

releases:
  - name: order-api
    chart: my-charts/order-api
    version: "1.2.3"
    namespace: order-service
    createNamespace: true
    values:
      - values/order-api.yaml
      - values/order-api.{{ .Environment.Name }}.yaml
    set:
      - name: image.tag
        value: {{ requiredEnv "IMAGE_TAG" }}

  - name: postgresql
    chart: bitnami/postgresql
    version: "12.x.x"
    namespace: order-service
    condition: postgresql.enabled   # Only deploy if enabled in env values
    values:
      - values/postgresql.yaml

  - name: redis
    chart: bitnami/redis
    version: "17.x.x"
    namespace: shared
    condition: redis.enabled
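
The order of the `values` entries matters: Helm applies values files in sequence, later files override earlier ones, and `set` entries are applied last. The sketch below models that layering with plain dictionaries to show why the environment file only needs to hold the differences. `deep_merge` and `layered_values` are hypothetical helpers, not part of Helm or Helmfile, and special cases such as deleting a key with a null value are not modelled.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced wholesale
    return merged


def layered_values(*layers: dict) -> dict:
    """Merge value layers left to right, so the last layer has the final say."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


base = {"replicas": 1, "resources": {"cpu": "250m", "memory": "256Mi"}}
production = {"replicas": 3, "resources": {"memory": "1Gi"}}
set_flags = {"image": {"tag": "abc123"}}

print(layered_values(base, production, set_flags))
```

Here `base` stands in for values/order-api.yaml, `production` for values/order-api.production.yaml, and `set_flags` for the `image.tag` entry under `set`.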

Secret Management Self-Service

# Platform API: self-service secret creation
# Engineers request secrets via PR to a secrets config, platform auto-provisions

# secrets-config.yaml (in version control, reviewed by platform team)
# secrets:
#   - name: stripe-api-key
#     service: order-api
#     environment: production
#     type: external    # Managed by team, stored in Vault
#   - name: db-password
#     service: order-api
#     environment: production
#     type: generated   # Platform generates and rotates

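import os

import hvac  # HashiCorp Vault client

# Default location of the pod's service account token inside Kubernetes
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class SecretManager:
    def __init__(self):
        self.client = hvac.Client(url=os.environ['VAULT_ADDR'])
        # Kubernetes auth needs the pod's service account JWT
        with open(SA_TOKEN_PATH) as f:
            jwt = f.read().strip()
        self.client.auth.kubernetes.login(role="platform-provisioner", jwt=jwt)

    def provision_secret(self, service: str, env: str, secret_name: str) -> str:
        """Grant a service read access to its secrets and return this secret's path."""
        vault_path = f"secrets/{env}/{service}/{secret_name}"

        # One policy per service and environment, covering everything under the
        # service's prefix. A per-secret policy would be dropped from the role
        # each time create_role() below runs for another secret.
        # Assumes a KV v1 mount at "secrets/"; KV v2 needs "secrets/data/...".
        policy_name = f"{env}-{service}"
        policy_hcl = f"""
path "secrets/{env}/{service}/*" {{
  capabilities = ["read"]
}}
"""
        self.client.sys.create_or_update_policy(
            name=policy_name,
            policy=policy_hcl,
        )

        # Bind policy to k8s service account
        self.client.auth.kubernetes.create_role(
            name=f"{env}-{service}",
            bound_service_account_names=[service],
            bound_service_account_namespaces=[env],
            policies=[policy_name],
            ttl="1h",
        )

        return vault_path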
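
Because requests arrive as a reviewed config file, it can be useful to compute what would be provisioned before anything touches Vault. The sketch below is a hypothetical dry-run planner: `SecretRequest` mirrors the fields shown in secrets-config.yaml, and `plan` groups the resulting paths by the role name used above. Neither name comes from hvac or Vault.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRequest:
    name: str
    service: str
    environment: str
    type: str  # "external" or "generated", as in secrets-config.yaml


def vault_path(req: SecretRequest) -> str:
    """Path layout used by the provisioner: secrets/<env>/<service>/<name>."""
    return f"secrets/{req.environment}/{req.service}/{req.name}"


def plan(requests: list[SecretRequest]) -> dict[str, list[str]]:
    """Group the Vault paths to provision under their <env>-<service> role."""
    roles: dict[str, list[str]] = {}
    for req in requests:
        if req.type not in ("external", "generated"):
            raise ValueError(f"unknown secret type: {req.type!r}")
        role = f"{req.environment}-{req.service}"
        roles.setdefault(role, []).append(vault_path(req))
    return roles


requests = [
    SecretRequest("stripe-api-key", "order-api", "production", "external"),
    SecretRequest("db-password", "order-api", "production", "generated"),
]
print(plan(requests))
```

Posting this plan as a comment on the pull request gives reviewers the exact paths and roles they are approving.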

Developer Onboarding Automation

#!/usr/bin/env bash
# onboard-developer.sh: run for new engineers
# Usage: ./onboard-developer.sh <email>

set -euo pipefail

EMAIL="${1:?Usage: $0 <email>}"
USERNAME="${EMAIL%%@*}"   # Assumes the GitHub login matches the email's local part

# Both must be set in the environment; the script aborts if either is missing
: "${ENGINEERS_GROUP_EMAIL:?set to the engineers Google group address}"
: "${SLACK_WEBHOOK:?set to the Slack incoming webhook URL}"

echo "🚀 Onboarding $USERNAME..."

# 1. GitHub access
gh api orgs/my-org/invitations \
  --method POST \
  -f email="$EMAIL" \
  -f role=direct_member

# 2. Add to standard teams
for team in "all-engineers" "your-squad" "on-call-rotation"; do
  gh api "orgs/my-org/teams/$team/memberships/$USERNAME" \
    --method PUT -f role=member
done

# 3. Cloud access (GCP: add to group, IAM propagates)
gcloud identity groups memberships add \
  --group-email="$ENGINEERS_GROUP_EMAIL" \
  --member-email="$EMAIL"

# 4. 1Password team invite, then group membership as a separate call
op user provision --name "$USERNAME" --email "$EMAIL"
op group user grant --group engineers --user "$EMAIL"

# 5. Send welcome message with checklist
cat <<EOF | curl -fsS -X POST -H 'Content-Type: application/json' "$SLACK_WEBHOOK" -d @-
{
  "text": "👋 Welcome $USERNAME! Your access has been provisioned.\n\n*Checklist:*\n• ✅ GitHub access granted\n• ✅ GCP read access granted\n• ✅ 1Password invite sent\n• 📖 Read: https://wiki.example.com/onboarding\n• 🔧 Set up local dev: https://wiki.example.com/local-dev"
}
EOF

echo "✅ Done! $USERNAME has been onboarded."
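
The welcome message is assembled as hand-written JSON inside a heredoc, which breaks if a username ever contains a quote or backslash. One option is to build the payload with a JSON library instead. The sketch below shows the idea; `welcome_payload` is a hypothetical helper, and the checklist text is copied from the script.

```python
import json


def welcome_payload(username: str) -> str:
    """Build the Slack webhook body with correct JSON escaping."""
    checklist = [
        "✅ GitHub access granted",
        "✅ GCP read access granted",
        "✅ 1Password invite sent",
        "📖 Read: https://wiki.example.com/onboarding",
        "🔧 Set up local dev: https://wiki.example.com/local-dev",
    ]
    lines = "\n".join(f"• {item}" for item in checklist)
    text = (
        f"👋 Welcome {username}! Your access has been provisioned."
        f"\n\n*Checklist:*\n{lines}"
    )
    return json.dumps({"text": text}, ensure_ascii=False)


payload = welcome_payload('pat "the dev"')
print(payload)
```

The output can be piped to the same `curl ... -d @-` call, so the rest of the script stays unchanged.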

Anti-Patterns

Golden paths with no escape hatch — developers will route around a platform they can't customize
Platform team as gatekeeper, not enabler — every request through a ticket defeats self-service
Backstage without owners — a stale catalog is worse than no catalog; enforce ownership
Measuring platform success by features shipped — measure developer time saved and DORA metrics
IDP that just wraps existing complexity — if the UI requires understanding Terraform, you failed
Template scaffolding without keeping it updated — outdated templates breed security debt
Missing delete/teardown automation — spinning up is easy, cleanup is where cost runs away
Ignoring mobile/non-cloud developers — platform must work for all personas

Quick Reference

Platform maturity stages:
  Level 0: Manual everything, tribal knowledge
  Level 1: Runbooks exist, some automation scripts
  Level 2: Self-service deployment (push to deploy)
  Level 3: Service catalog + golden path templates
  Level 4: Internal developer portal (Backstage), DORA tracked
  Level 5: Platform as product with NPS, roadmap, on-call

DORA quick wins by metric:
  Deployment frequency: Feature flags + trunk-based development
  Lead time: PR size limit (< 400 lines), automated PR review bots
  Change failure rate: Canary deploys + automated rollback
  MTTR: Runbook links in alerts + pre-provisioned prod access

Backstage entity types:
  Component   → Services, websites, libraries, documentation
  API         → OpenAPI, GraphQL, AsyncAPI specs
  Resource    → Databases, S3 buckets, queues
  Group       → Teams
  User        → Engineers
  System      → Logical grouping of related components
  Domain      → Business domain grouping systems

Software Template (Golden Path), continued

# template.yaml (continued): remainder of the go-microservice scaffolder template,
# picking up inside the "name" property of the first parameters page
          description: "Lowercase, hyphens only (e.g. payment-processor)"
        description:
          title: Description
          type: string
        owner:
          title: Team Owner
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['1', '2', '3']
          description: "Tier 1 = SLO 99.9%, Tier 2 = 99.5%, Tier 3 = 99%"

    - title: Infrastructure
      properties:
        hasDatabase:
          title: Needs PostgreSQL database?
          type: boolean
          default: false
        hasCaching:
          title: Needs Redis cache?
          type: boolean
          default: false
        minInstances:
          title: Minimum Cloud Run instances
          type: integer
          default: 1

  steps:
    - id: fetch-template
      name: Fetch service template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
          tier: ${{ parameters.tier }}
          hasDatabase: ${{ parameters.hasDatabase }}

    - id: publish-github
      name: Create GitHub repo
      action: publish:github
      input:
        repoUrl: github.com?owner=my-org&repo=${{ parameters.name }}
        defaultBranch: main
        description: ${{ parameters.description }}
        topics: [go, microservice]
        requireCodeOwnerReviews: true
        dismissStaleReviews: true

    - id: register-catalog
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish-github'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

    - id: create-pagerduty-service
      name: Create PagerDuty service
      action: pagerduty:service:create
      input:
        name: ${{ parameters.name }}
        escalationPolicyId: ${{ parameters.tier == '1' && 'P-TIER1' || 'P-TIER23' }}

    - id: notify-slack
      name: Notify team channel
      action: http:backstage:request
      input:
        method: POST
        path: /api/slack/send
        body:
          channel: "#platform-new-services"
          text: "New service created: ${{ parameters.name }} by ${{ parameters.owner }}"

  output:
    links:
      - title: Repository
        url: ${{ steps['publish-github'].output.remoteUrl }}
      - title: Service Catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}   # entity refs use entityRef, not url
