platform-engineering
Expert platform engineering covering Internal Developer Platforms (IDP), golden paths, Backstage setup and configuration, DORA metrics and developer experience, service catalog design, secret management self-service, dependency tracking, and developer onboarding automation.
Platform Engineering
Platform engineering is product management applied to developer infrastructure. Your customers are the
engineers who build the product. Platform success means faster onboarding, fewer toil tasks, fewer
"how do I deploy this?" Slack messages, and measurable improvement in DORA metrics. The platform team
builds the bike lanes — developers ride them without thinking about the asphalt.
Core Mental Model
An Internal Developer Platform (IDP) is a curated layer of abstraction over your cloud infrastructure.
It doesn't replace Kubernetes or Terraform — it hides the complexity behind self-service interfaces.
The golden path is the opinionated, pre-approved way to do common tasks: create a service, add a
database, configure CI/CD, handle secrets. Engineers can go off-path when needed, but the path should
be so smooth they rarely need to. Platform engineering fails when it over-abstracts and loses the
escape hatch, or under-abstracts and just creates a fancy ticket system.
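One way to keep the escape hatch explicit is to model the golden path as a set of defaults that a team may override, with a short list of settings that stay locked. The sketch below is a hypothetical illustration: the setting names and the `resolve_config` helper are assumptions made for the example, not part of any real platform API.

```python
from typing import Any

# Hypothetical golden-path defaults; the keys are illustrative only.
GOLDEN_PATH_DEFAULTS: dict[str, Any] = {
    "ci_pipeline": "standard",
    "runtime": "go",
    "replicas": 2,
    "audit_logging": True,
}

# Settings a team may not change without a platform-team exception.
LOCKED_SETTINGS = {"audit_logging"}


def resolve_config(overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge team overrides onto the defaults, rejecting unknown or locked keys."""
    unknown = set(overrides) - set(GOLDEN_PATH_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    locked = set(overrides) & LOCKED_SETTINGS
    if locked:
        raise PermissionError(f"locked settings need an exception: {sorted(locked)}")
    return {**GOLDEN_PATH_DEFAULTS, **overrides}


on_path = resolve_config({})                # stays on the golden path
off_path = resolve_config({"replicas": 5})  # uses the escape hatch for one setting
```

Most services would call `resolve_config({})`. The override argument is the escape hatch, and the locked set is where the platform draws the line.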
IDP Core Concepts
Golden Path (paved road):
  "Creating a new microservice" → template fills in:
    ✅ GitHub repo with CI/CD pre-configured
    ✅ Service catalog entry created
    ✅ Default observability (OTel, Grafana dashboard)
    ✅ Deployment to staging on first push
    ✅ Secret management integration
    ✅ Runbook template created
    ✅ PagerDuty service registration
Self-Service Capabilities (target state):
  Developer does themselves:          Ops/Platform does on request:
  ─ Create new service                ─ Create new GCP project
  ─ Deploy to staging                 ─ Production access (RBAC)
  ─ Add environment variables         ─ Budget alerts
  ─ Access logs/metrics               ─ Cross-account access
  ─ Create feature flags              ─ Compliance exceptions
  ─ Run load tests
  ─ Provision preview environments
Platform Team: Build the rails, not the trains.
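The two columns above can be expressed as a small routing table that a platform API consults before acting. This is only a sketch: the action identifiers and the `route` function are hypothetical names chosen for the example.

```python
from enum import Enum


class Route(Enum):
    SELF_SERVICE = "self-service"
    PLATFORM_REQUEST = "platform-request"


# Mirrors the two columns above; the identifiers are hypothetical.
SELF_SERVICE_ACTIONS = {
    "create-service", "deploy-staging", "add-env-var", "access-logs",
    "create-feature-flag", "run-load-test", "provision-preview-env",
}
PLATFORM_ACTIONS = {
    "create-gcp-project", "production-access", "budget-alert",
    "cross-account-access", "compliance-exception",
}


def route(action: str) -> Route:
    """Decide whether an action runs immediately or opens a platform request."""
    if action in SELF_SERVICE_ACTIONS:
        return Route.SELF_SERVICE
    if action in PLATFORM_ACTIONS:
        return Route.PLATFORM_REQUEST
    # Unknown actions are rejected so nothing is silently allowed.
    raise ValueError(f"unknown action: {action}")
```

Moving an action from the right column to the left then becomes a one-line change that can be reviewed like any other.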
Backstage: Setup and Configuration
Catalog: catalog-info.yaml
# Every service, library, website, API, and resource should have this file
# Commit it at the root of the repo
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-api
  title: Order API
  description: Core order processing service. Handles cart → checkout → fulfillment.
  labels:
    tier: "1"  # Criticality tier
    team: "platform"
  annotations:
    # Source control
    github.com/project-slug: my-org/order-api
    # CI/CD
    github.com/workflow-dispatch: build
    # Observability
    grafana/dashboard-selector: "title=Order API"
    prometheus.io/rule: "job=order-api"
    # Runbook and docs
    backstage.io/techdocs-ref: dir:.  # TechDocs from this repo
    pagerduty.com/service-id: "PXXXXXX"
    # Cost tracking
    finops.example.com/cost-center: "engineering-platform"
  tags:
    - go
    - rest
    - postgresql
    - tier-1
  links:
    - url: https://order-api.example.com/docs
      title: API Docs
    - url: https://runbooks.example.com/order-api
      title: Runbook
    - url: https://grafana.example.com/d/order-api
      title: Dashboard
spec:
  type: service
  lifecycle: production
  owner: group:platform-team
  system: ecommerce-platform
  dependsOn:
    - component:payment-service
    - component:inventory-service
    - resource:orders-postgres-db
  providesApis:
    - order-api-v1
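A catalog stays useful only if entries keep their required fields, so a CI check on the parsed file can help. The sketch below assumes the YAML has already been loaded into a dict. The `catalog_problems` helper and the rule that owners must be groups are assumptions for illustration, not Backstage behaviour.

```python
# Fields Backstage requires in the spec of a Component entity.
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def catalog_problems(entity: dict) -> list[str]:
    """Return human-readable problems found in a parsed catalog-info.yaml."""
    problems = []
    metadata = entity.get("metadata") or {}
    spec = entity.get("spec") or {}
    if not metadata.get("name"):
        problems.append("metadata.name is missing")
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is missing")
    owner = spec.get("owner") or ""
    # Assumed house rule: owners must be groups, not individuals.
    if owner and not owner.startswith("group:"):
        problems.append("spec.owner should reference a group")
    return problems


order_api = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "order-api"},
    "spec": {"type": "service", "lifecycle": "production", "owner": "group:platform-team"},
}
print(catalog_problems(order_api))  # an empty list means the entry passes
```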
Software Template (Golden Path)
# template.yaml: Backstage scaffolder template for a new Go microservice
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice
  title: Go Microservice
  description: Create a new Go microservice with CI/CD, observability, and deployment
  tags:
    - go
    - microservice
    - recommended
spec:
  owner: group:platform-team
  type: service
  parameters:
    - title: Service Details
      required: [name, description, owner]
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]{1,40}[a-z0-9]$'
DORA Metrics
Four Key Metrics (measure these, improve these):
  1. Deployment Frequency
     Elite:  Multiple times/day
     High:   Once/day to once/week
     Medium: Once/week to once/month
     Low:    Less than once/month
     → Proxy for: small batch size, CI/CD maturity
  2. Lead Time for Changes (commit to production)
     Elite:  < 1 hour
     High:   1 day to 1 week
     Medium: 1 week to 1 month
     Low:    > 1 month
     → Proxy for: PR review speed, CI speed, deployment automation
  3. Change Failure Rate
     Elite:  0-5%
     High:   6-15%
     Medium: 16-30%
     Low:    46-60%
     → Proxy for: test coverage, feature flags, canary deploys
  4. Mean Time to Restore (MTTR)
     Elite:  < 1 hour
     High:   < 1 day
     Medium: 1 day to 1 week
     Low:    > 1 week
     → Proxy for: observability, runbook quality, on-call access
  Note: band boundaries differ between annual DORA reports; treat them as approximate.
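The bands above can be turned into a small classifier for dashboards. This is a sketch under stated assumptions: the function names are hypothetical, a month is taken as 30 days, and any lead time between one hour and one week is treated as High.

```python
def classify_deployment_frequency(deploys_per_day: float) -> str:
    """Map average production deploys per day to a DORA band."""
    if deploys_per_day > 1:
        return "Elite"      # multiple times per day
    if deploys_per_day >= 1 / 7:
        return "High"       # at least once per week
    if deploys_per_day >= 1 / 30:
        return "Medium"     # at least once per month (30-day month assumed)
    return "Low"


def classify_lead_time(hours: float) -> str:
    """Map commit-to-production lead time in hours to a DORA band."""
    if hours < 1:
        return "Elite"
    if hours <= 24 * 7:
        return "High"
    if hours <= 24 * 30:
        return "Medium"
    return "Low"
```

For example, a team that deploys every other day with a 30-hour lead time lands in High on both metrics.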
# DORA metrics from GitHub deployments: deployment frequency and lead time.
# Change failure rate and MTTR need incident data (e.g. PagerDuty) and are not computed here.
import datetime
import os
import statistics

from github import Github  # PyGithub


def latest_state(deployment):
    """A deployment has no state of its own; use its most recent status."""
    statuses = list(deployment.get_statuses())
    return max(statuses, key=lambda s: s.created_at).state if statuses else None


def calculate_dora_metrics(repo_name: str, days: int = 90) -> dict:
    gh = Github(os.environ["GITHUB_TOKEN"])
    repo = gh.get_repo(repo_name)
    # Assumes PyGithub 2.x, which returns timezone-aware (UTC) datetimes
    since = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=days)

    # Deployment frequency: successful prod deploys in the window
    deployments = [d for d in repo.get_deployments(environment="production")
                   if d.created_at > since and latest_state(d) == "success"]
    deploy_frequency = len(deployments) / days  # per day

    # Lead time: commit author date to production deployment, in hours
    lead_times = []
    for deploy in deployments[:50]:  # Sample at most 50 deployments
        commit = repo.get_commit(deploy.sha)
        lead_time = (deploy.created_at - commit.commit.author.date).total_seconds() / 3600
        lead_times.append(lead_time)
    median_lead_time = statistics.median(lead_times) if lead_times else None

    return {
        "deployment_frequency": deploy_frequency,
        "median_lead_time_hours": median_lead_time,
        "period_days": days,
        "total_deployments": len(deployments),
    }
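The script above covers two of the four metrics. Change failure rate and time to restore need incident data, and the sketch below shows the arithmetic on plain records. The `Deploy` and `Incident` record shapes are assumptions for illustration. In practice they would be filled from a deployment log and an incident tool such as PagerDuty.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deploy:
    sha: str
    caused_incident: bool  # assumed to be linked by the incident process


@dataclass
class Incident:
    opened_at: datetime
    resolved_at: Optional[datetime]  # None while the incident is still open


def change_failure_rate(deploys: list[Deploy]) -> Optional[float]:
    """Fraction of deploys that led to an incident, or None with no deploys."""
    if not deploys:
        return None
    return sum(d.caused_incident for d in deploys) / len(deploys)


def mean_time_to_restore_hours(incidents: list[Incident]) -> Optional[float]:
    """Mean duration of resolved incidents in hours; open incidents are skipped."""
    durations = [
        (i.resolved_at - i.opened_at).total_seconds() / 3600
        for i in incidents
        if i.resolved_at is not None
    ]
    return mean(durations) if durations else None


start = datetime(2024, 1, 1, 12, 0)
deploys = [Deploy("a1", False), Deploy("b2", True), Deploy("c3", False), Deploy("d4", False)]
incidents = [
    Incident(start, start + timedelta(hours=2)),
    Incident(start, start + timedelta(hours=4)),
    Incident(start, None),
]
cfr = change_failure_rate(deploys)             # 1 failing deploy out of 4
mttr = mean_time_to_restore_hours(incidents)   # mean of the 2 h and 4 h incidents
```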
Helmfile for Multi-Environment Management
# helmfile.yaml (newer Helmfile releases render Go templates only in *.gotmpl files; name it helmfile.yaml.gotmpl there)
environments:
  development:
    values:
      - environments/development.yaml
  staging:
    values:
      - environments/staging.yaml
  production:
    values:
      - environments/production.yaml
---
# Newer Helmfile releases expect environments in their own YAML part, hence the separator above
repositories:
  - name: bitnami
    url: https://charts.bitnami.com/bitnami
  - name: my-charts
    url: us-central1-docker.pkg.dev/my-project/helm-charts
    oci: true  # OCI registries take oci: true and a URL without the oci:// scheme
releases:
  - name: order-api
    chart: my-charts/order-api
    version: "1.2.3"
    namespace: order-service
    createNamespace: true
    values:
      - values/order-api.yaml
      - values/order-api.{{ .Environment.Name }}.yaml
    set:
      - name: image.tag
        value: {{ requiredEnv "IMAGE_TAG" }}
  - name: postgresql
    chart: bitnami/postgresql
    version: "12.x.x"
    namespace: order-service
    condition: postgresql.enabled  # Only deploy if enabled in env values
    values:
      - values/postgresql.yaml
  - name: redis
    chart: bitnami/redis
    version: "17.x.x"
    namespace: shared
    condition: redis.enabled
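The order of the `values` files matters because later files override earlier ones, and nested maps are merged key by key. The sketch below is a simplified model of that layering, not Helm's implementation. The `deep_merge` helper and the file contents are hypothetical.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = value
    return merged


# Hypothetical contents of values/order-api.yaml and values/order-api.production.yaml
base_values = {"replicaCount": 1, "resources": {"cpu": "250m", "memory": "256Mi"}}
production_values = {"replicaCount": 3, "resources": {"memory": "1Gi"}}

effective = deep_merge(base_values, production_values)
```

Keeping the shared file small and putting only the differences in the per-environment file makes the effective configuration easier to reason about.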
Secret Management Self-Service
# Platform API: self-service secret creation
# Engineers request secrets via PR to a secrets config, platform auto-provisions
# secrets-config.yaml (in version control, reviewed by platform team)
# secrets:
#   - name: stripe-api-key
#     service: order-api
#     environment: production
#     type: external   # Managed by team, stored in Vault
#   - name: db-password
#     service: order-api
#     environment: production
#     type: generated  # Platform generates and rotates
import os

import hvac  # HashiCorp Vault client

# Default location of the pod's service account token inside Kubernetes
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class SecretManager:
    def __init__(self):
        self.client = hvac.Client(url=os.environ["VAULT_ADDR"])
        # Kubernetes auth needs the pod's service account JWT
        with open(SA_TOKEN_PATH) as f:
            jwt = f.read()
        self.client.auth.kubernetes.login(role="platform-provisioner", jwt=jwt)

    def provision_secret(self, service: str, env: str, secret_name: str) -> str:
        """Grant a service read access to one secret path and return that path."""
        # On a KV v2 mount the policy path needs a "data/" segment after the mount name
        vault_path = f"secrets/{env}/{service}/{secret_name}"

        # Create policy that allows service SA to read this path
        policy_name = f"{env}-{service}-{secret_name}"
        policy_hcl = f"""
path "{vault_path}" {{
  capabilities = ["read"]
}}
"""
        self.client.sys.create_or_update_policy(
            name=policy_name,
            policy=policy_hcl,
        )

        # Bind policy to k8s service account.
        # NOTE: writing the role replaces its policy list, so a second secret for the
        # same service drops the first policy unless the lists are merged first.
        self.client.auth.kubernetes.create_role(
            name=f"{env}-{service}",
            bound_service_account_names=[service],
            bound_service_account_namespaces=[env],
            policies=[policy_name],
            ttl="1h",
        )
        return vault_path
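Before any Vault call is made, the entries in `secrets-config.yaml` can be validated in the PR check. The sketch below assumes the file has already been parsed into a list of dicts. The `SecretRequest` class and `parse_requests` function are hypothetical names, and the path layout simply mirrors the one used in `provision_secret`.

```python
from dataclasses import dataclass

REQUIRED_KEYS = {"name", "service", "environment", "type"}
ALLOWED_TYPES = {"external", "generated"}


@dataclass(frozen=True)
class SecretRequest:
    name: str
    service: str
    environment: str
    type: str

    def vault_path(self) -> str:
        # Same layout as the provisioning code: secrets/{env}/{service}/{name}
        return f"secrets/{self.environment}/{self.service}/{self.name}"


def parse_requests(entries: list[dict]) -> list[SecretRequest]:
    """Validate raw config entries and convert them to SecretRequest objects."""
    requests = []
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {index}: missing {sorted(missing)}")
        if entry["type"] not in ALLOWED_TYPES:
            raise ValueError(f"entry {index}: unknown type {entry['type']!r}")
        requests.append(SecretRequest(
            name=entry["name"],
            service=entry["service"],
            environment=entry["environment"],
            type=entry["type"],
        ))
    return requests


config = [
    {"name": "stripe-api-key", "service": "order-api", "environment": "production", "type": "external"},
    {"name": "db-password", "service": "order-api", "environment": "production", "type": "generated"},
]
paths = [request.vault_path() for request in parse_requests(config)]
```

Failing fast here keeps malformed requests out of Vault and gives the engineer an error on the PR rather than a half-provisioned secret.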
Developer Onboarding Automation
#!/usr/bin/env bash
# onboard-developer.sh: run for new engineers
# Usage: ./onboard-developer.sh EMAIL [GITHUB_USERNAME]
# Requires SLACK_WEBHOOK and GCP_GROUP_EMAIL in the environment
set -euo pipefail

EMAIL="${1:?Usage: $0 EMAIL [GITHUB_USERNAME]}"
USERNAME="${EMAIL%%@*}"
# The GitHub login often differs from the email local part, so allow an override
GH_USER="${2:-$USERNAME}"
: "${SLACK_WEBHOOK:?SLACK_WEBHOOK must be set}"
# The group address was lost from the original script; supply your own
: "${GCP_GROUP_EMAIL:?GCP_GROUP_EMAIL must be set}"

echo "🚀 Onboarding $USERNAME..."

# 1. GitHub access
gh api orgs/my-org/invitations \
  --method POST \
  -f email="$EMAIL" \
  -f role=direct_member

# 2. Add to standard teams
for team in "all-engineers" "your-squad" "on-call-rotation"; do
  gh api "orgs/my-org/teams/$team/memberships/$GH_USER" \
    --method PUT -f role=member
done

# 3. Cloud access (GCP: add to group, IAM propagates)
gcloud identity groups memberships add \
  --group-email="$GCP_GROUP_EMAIL" \
  --member-email="$EMAIL"

# 4. 1Password team invite (flags as in the original; check them against your op CLI version)
op user provision --name "$USERNAME" --email "$EMAIL" --team engineers

# 5. Send welcome message with checklist
cat <<EOF | curl -fsS -X POST -H 'Content-Type: application/json' "$SLACK_WEBHOOK" -d @-
{
  "text": "👋 Welcome $USERNAME! Your access has been provisioned.\n\n*Checklist:*\n• ✅ GitHub access granted\n• ✅ GCP read access granted\n• ✅ 1Password invite sent\n• 📖 Read: https://wiki.example.com/onboarding\n• 🔧 Set up local dev: https://wiki.example.com/local-dev"
}
EOF

echo "✅ Done! $USERNAME has been onboarded."
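Interpolating shell variables into a JSON string breaks as soon as a name contains a quote or backslash. One alternative is to build the Slack payload with a JSON encoder. The sketch below shows the idea; the `welcome_payload` helper is a hypothetical name and only produces the request body, leaving the HTTP call to the caller.

```python
import json


def welcome_payload(username: str, checklist: list[str]) -> str:
    """Build the JSON body for a Slack incoming-webhook welcome message."""
    lines = [
        f"👋 Welcome {username}! Your access has been provisioned.",
        "",
        "*Checklist:*",
    ]
    lines += [f"• {item}" for item in checklist]
    # json.dumps escapes quotes, backslashes, and newlines correctly.
    return json.dumps({"text": "\n".join(lines)})


payload = welcome_payload(
    'dana "dee" smith',
    ["✅ GitHub access granted", "✅ GCP read access granted", "✅ 1Password invite sent"],
)
```

The resulting string can be written to a file or piped to `curl -d @-` in place of the heredoc.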
Anti-Patterns
❌ Golden paths with no escape hatch — developers will route around a platform they can't customize
❌ Platform team as gatekeeper, not enabler — every request through a ticket defeats self-service
❌ Backstage without owners — a stale catalog is worse than no catalog; enforce ownership
❌ Measuring platform success by features shipped — measure developer time saved and DORA metrics
❌ IDP that just wraps existing complexity — if the UI requires understanding Terraform, you failed
❌ Template scaffolding without keeping it updated — outdated templates breed security debt
❌ Missing delete/teardown automation — spinning up is easy, cleanup is where cost runs away
❌ Ignoring mobile/non-cloud developers — platform must work for all personas
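For the teardown anti-pattern, one common approach is to give every preview environment a time-to-live and sweep on a schedule. The sketch below only selects the environments to delete. The `PreviewEnv` record and `expired_envs` function are hypothetical, and the actual deletion call depends on how environments are provisioned.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PreviewEnv:
    name: str
    created_at: datetime
    ttl: timedelta


def expired_envs(envs: list[PreviewEnv], now: datetime) -> list[str]:
    """Return the names of environments that have outlived their TTL."""
    return [env.name for env in envs if now - env.created_at > env.ttl]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
envs = [
    PreviewEnv("pr-101", datetime(2024, 1, 1, tzinfo=timezone.utc), timedelta(days=3)),
    PreviewEnv("pr-102", datetime(2024, 1, 9, tzinfo=timezone.utc), timedelta(days=3)),
]
stale = expired_envs(envs, now)  # only pr-101 is older than its 3-day TTL
```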
Quick Reference
Platform maturity stages:
  Level 0: Manual everything, tribal knowledge
  Level 1: Runbooks exist, some automation scripts
  Level 2: Self-service deployment (push to deploy)
  Level 3: Service catalog + golden path templates
  Level 4: Internal developer portal (Backstage), DORA tracked
  Level 5: Platform as product with NPS, roadmap, on-call
DORA quick wins by metric:
  Deployment frequency: Feature flags + trunk-based development
  Lead time:            PR size limit (< 400 lines), automated PR review bots
  Change failure rate:  Canary deploys + automated rollback
  MTTR:                 Runbook links in alerts + pre-provisioned prod access
Backstage entity types:
  Component → Services, websites, libraries, documentation
  API       → OpenAPI, GraphQL, AsyncAPI specs
  Resource  → Databases, S3 buckets, queues
  Group     → Teams
  User      → Engineers
  System    → Logical grouping of related components
  Domain    → Business domain grouping systems
Software Template (Golden Path), continued
# template.yaml (continued): rest of the name property, remaining parameters, steps, and output
          description: "Lowercase letters, digits, and hyphens (e.g. payment-processor)"
        description:
          title: Description
          type: string
        owner:
          title: Team Owner
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['1', '2', '3']
          description: "Tier 1 = SLO 99.9%, Tier 2 = 99.5%, Tier 3 = 99%"
    - title: Infrastructure
      properties:
        hasDatabase:
          title: Needs PostgreSQL database?
          type: boolean
          default: false
        hasCaching:
          title: Needs Redis cache?
          type: boolean
          default: false
        minInstances:
          title: Minimum Cloud Run instances
          type: integer
          default: 1
  steps:
    - id: fetch-template
      name: Fetch service template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          description: ${{ parameters.description }}
          owner: ${{ parameters.owner }}
          tier: ${{ parameters.tier }}
          hasDatabase: ${{ parameters.hasDatabase }}
          # Pass the remaining inputs too, so the skeleton can use them
          hasCaching: ${{ parameters.hasCaching }}
          minInstances: ${{ parameters.minInstances }}
    - id: publish-github
      name: Create GitHub repo
      action: publish:github
      input:
        repoUrl: github.com?owner=my-org&repo=${{ parameters.name }}
        defaultBranch: main
        description: ${{ parameters.description }}
        topics: [go, microservice]
        requireCodeOwnerReviews: true
        dismissStaleReviews: true
    - id: register-catalog
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish-github'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
    - id: create-pagerduty-service
      name: Create PagerDuty service
      action: pagerduty:service:create
      input:
        name: ${{ parameters.name }}
        # Nunjucks inline-if; JavaScript-style && and || are not template operators
        escalationPolicyId: ${{ 'P-TIER1' if parameters.tier == '1' else 'P-TIER23' }}
    - id: notify-slack
      name: Notify team channel
      action: http:backstage:request
      input:
        method: POST
        path: /api/slack/send
        body:
          channel: "#platform-new-services"
          text: "New service created: ${{ parameters.name }} by ${{ parameters.owner }}"
  output:
    links:
      - title: Repository
        url: ${{ steps['publish-github'].output.remoteUrl }}
      - title: Service Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}
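The name pattern and the tier-based escalation choice in the template are easy to test outside Backstage. The sketch below mirrors both in plain Python, assuming the pattern is anchored with a trailing `$`. The helper names are hypothetical, and the policy IDs are the placeholders used in the template.

```python
import re

# Assumes the template pattern ends with "$" so trailing characters are rejected.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{1,40}[a-z0-9]$")


def is_valid_service_name(name: str) -> bool:
    """True if the name is lowercase, starts with a letter, and is 3 to 42 characters."""
    return NAME_PATTERN.fullmatch(name) is not None


def escalation_policy_for(tier: str) -> str:
    """Tier 1 gets its own escalation policy; tiers 2 and 3 share one."""
    return "P-TIER1" if tier == "1" else "P-TIER23"


for candidate in ["payment-processor", "Payment", "ab", "order-api-"]:
    print(candidate, is_valid_service_name(candidate))
```

Note that the pattern needs at least three characters (a leading letter, one or more middle characters, and a final letter or digit), so two-letter names are rejected.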
Skill Information
- Source: MoltbotDen
- Category: DevOps & Cloud