gcp-architect
Expert-level GCP architecture covering resource hierarchy, Cloud Run vs GKE vs App Engine trade-offs, IAM with workload identity federation, VPC networking, Cloud Build CI/CD, database selection, Pub/Sub vs Dataflow, BigQuery design, Secret Manager, and cost optimization.
GCP Architect
GCP's design philosophy centers on managed services done exceptionally well. Cloud Run handles auto-scaling
to zero better than any other platform. BigQuery rewrote the rules on analytical queries. Pub/Sub is
planet-scale messaging with at-least-once semantics built in. Expert GCP architecture means knowing when
to accept Google's opinions and when to reach for lower-level primitives.
Core Mental Model
GCP organizes everything into a resource hierarchy: Organization → Folder → Project → Resources.
IAM policies are additive going down the hierarchy — a policy at the Org level applies to all projects.
Projects are the billing and security boundary: one project per environment (dev/staging/prod) is the
standard pattern. Networking in GCP is global by default (VPCs span regions), which is powerful but
requires deliberate subnet/firewall design. Cloud Run is the default compute target unless you have a
specific reason to use GKE — it handles TLS, scaling, and rollouts for you.
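The inheritance rule is mechanical enough to model: a member's effective roles on a resource are the union of bindings on that resource and every ancestor. A toy sketch of that resolution (hypothetical names and structure, not a GCP API):

```python
# Toy model of additive IAM inheritance. Illustrative only, not a GCP API.
# Effective roles at a node = union of bindings on the node and all ancestors.
HIERARCHY = {                     # child -> parent (None = root)
    "org": None,
    "folder/prod": "org",
    "project/moltbot-prod": "folder/prod",
}
BINDINGS = {                      # node -> member -> roles granted there
    "org": {"alice@example.com": {"roles/viewer"}},
    "folder/prod": {"bob@example.com": {"roles/editor"}},
    "project/moltbot-prod": {"alice@example.com": {"roles/run.admin"}},
}

def effective_roles(node: str, member: str) -> set:
    roles = set()
    while node is not None:
        roles |= BINDINGS.get(node, {}).get(member, set())
        node = HIERARCHY[node]
    return roles
```

Because allow policies only accumulate downward, a role granted at the Org cannot be revoked at a project beneath it, which is why broad Org-level grants are dangerous.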
Resource Hierarchy
Organization (moltbotden.com)
├── Folder: Infrastructure
│   ├── Project: vpc-host-prod (Shared VPC host)
│   └── Project: artifact-registry
├── Folder: Production
│   ├── Project: moltbot-prod (Service VMs, Cloud Run)
│   └── Project: moltbot-data-prod (BigQuery, Cloud SQL)
└── Folder: Non-Production
    ├── Project: moltbot-dev
    └── Project: moltbot-staging
Org policies to set at root:
# Restrict resource locations to approved regions
constraints/gcp.resourceLocations: ["us-central1", "us-east1"]
# Disable public IPs on Cloud SQL
constraints/sql.restrictPublicIp: true
# Require OS login on Compute Engine
constraints/compute.requireOsLogin: true
# Disable serial port access
constraints/compute.disableSerialPortAccess: true
Compute Decision Tree
Is it a containerized HTTP service?
├─ Yes, stateless, traffic-driven → Cloud Run (default choice)
├─ Yes, but need websockets/gRPC streaming → Cloud Run (supports WebSockets and HTTP/2 gRPC streaming)
├─ Yes, need GPU/TPU → GKE with node pools
└─ Yes, complex orchestration, 20+ microservices → GKE
Is it a legacy app or monolith?
└─ App Engine Standard (autoscale to 0) or Flexible (custom runtimes)
Is it batch / ML training?
└─ Cloud Batch, Vertex AI Training, or GKE Batch
Is it event-driven background work?
└─ Cloud Run Jobs, Cloud Functions (2nd gen = Cloud Run under the hood)
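The decision tree above collapses into a first-pass selector. A sketch under simplified inputs (real choices also weigh team skills, latency budgets, and cost, not just these flags):

```python
def choose_compute(*, containerized: bool, stateless: bool = True,
                   needs_gpu: bool = False, microservice_count: int = 1,
                   batch: bool = False, event_driven: bool = False) -> str:
    """First-pass mapping of the compute decision tree. Illustrative only."""
    if batch:
        return "Cloud Batch / Vertex AI Training / GKE Batch"
    if needs_gpu:
        return "GKE with GPU/TPU node pools"
    if containerized and microservice_count >= 20:
        return "GKE"                      # complex orchestration pays for itself
    if event_driven:
        return "Cloud Run Jobs / Cloud Functions (2nd gen)"
    if containerized and stateless:
        return "Cloud Run"                # the default choice
    return "App Engine Standard or Flexible"
```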
Cloud Run: Production Configuration
# cloud-run-service.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: order-api
  namespace: "my-project"
  annotations:
    run.googleapis.com/ingress: internal-and-cloud-load-balancing
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"         # Avoid cold starts in prod
        autoscaling.knative.dev/maxScale: "100"
        run.googleapis.com/cpu-throttling: "false"    # CPU always allocated (for background work)
        run.googleapis.com/startup-cpu-boost: "true"  # Extra CPU during cold start
        run.googleapis.com/vpc-access-connector: "projects/vpc-host-prod/locations/us-central1/connectors/app-connector"
        run.googleapis.com/vpc-access-egress: "private-ranges-only"
    spec:
      serviceAccountName: [email protected]
      containerConcurrency: 80  # Requests per instance before scaling out
      timeoutSeconds: 30
      containers:
        - image: us-central1-docker.pkg.dev/my-project/app/order-api:latest
          resources:
            limits:
              cpu: "2"
              memory: 512Mi
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-connection-string
                  key: latest
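The containerConcurrency value above interacts with request latency to determine fleet size: by Little's Law, in-flight requests ≈ RPS × mean latency. A back-of-envelope sizing helper (a sketch, not Cloud Run's actual autoscaling algorithm):

```python
import math

def estimated_instances(rps: float, mean_latency_s: float,
                        container_concurrency: int = 80) -> int:
    """Little's Law: in-flight requests = rps * latency. Dividing by
    per-instance concurrency approximates the steady-state instance count."""
    in_flight = rps * mean_latency_s
    return max(1, math.ceil(in_flight / container_concurrency))
```

At 1000 RPS and 200 ms mean latency, roughly 200 requests are in flight, so a concurrency of 80 implies about 3 instances; shaving latency often reduces the fleet more cheaply than tuning maxScale.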
IAM and Workload Identity Federation
Workload Identity: CI/CD Without Service Account Keys
# Allow GitHub Actions to impersonate a GCP service account — no JSON keys!
gcloud iam workload-identity-pools create "github-pool" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --display-name="GitHub Actions Pool"

gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --workload-identity-pool="github-pool" \
  --display-name="GitHub provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
  --issuer-uri="https://token.actions.githubusercontent.com"

# Bind to a service account (repo-scoped)
gcloud iam service-accounts add-iam-policy-binding \
  "deploy-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --project="${PROJECT_ID}" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/github-pool/attribute.repository/my-org/my-repo"
# GitHub Actions workflow
- name: Authenticate to Google Cloud
  uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: 'projects/123/locations/global/workloadIdentityPools/github-pool/providers/github-provider'
    service_account: '[email protected]'
Minimal IAM for Cloud Run Service
# Service account with only what it needs
gcloud iam service-accounts create order-api-sa \
--display-name="Order API Service Account"
# Cloud SQL client
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:[email protected]" \
--role="roles/cloudsql.client"
# Secret accessor for specific secrets only
gcloud secrets add-iam-policy-binding db-connection-string \
--member="serviceAccount:[email protected]" \
--role="roles/secretmanager.secretAccessor"
VPC Networking: Shared VPC
Shared VPC Architecture:
Host Project (vpc-host-prod)
└── VPC Network "shared-vpc"
├── Subnet us-central1-private (10.0.0.0/20) — shared with service projects
└── Subnet us-central1-data (10.0.16.0/24)
Service Project A (moltbot-prod)
├── Cloud Run → accesses shared-vpc via VPC connector
└── Cloud SQL → Private IP in data subnet
Service Project B (moltbot-data-prod)
└── BigQuery → Private Google Access via subnet flag
Private Google Access (no internet egress needed for GCP APIs):
gcloud compute networks subnets update us-central1-private \
--region=us-central1 \
--enable-private-ip-google-access
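The `private-ranges-only` egress mode on the Cloud Run service earlier sends only RFC 1918 destinations through the VPC connector; everything else egresses directly. The routing predicate is easy to illustrate with the stdlib `ipaddress` module (an illustration of the rule, not the platform's implementation):

```python
import ipaddress

# RFC 1918 ranges: what "private-ranges-only" VPC egress sends via the connector.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def routes_via_connector(dest_ip: str) -> bool:
    """True if a destination is private and would be routed through the connector."""
    addr = ipaddress.ip_address(dest_ip)
    return any(addr in net for net in PRIVATE_RANGES)
```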
Cloud Build Pipeline
# cloudbuild.yaml
steps:
  # Run tests
  - name: 'python:3.12'
    entrypoint: bash
    args:
      - '-c'
      - |
        pip install -r requirements.txt
        pytest tests/ -v --tb=short

  # Build and push to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - build
      - '-t'
      - 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:$SHORT_SHA'
      - '-t'
      - 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:latest'
      - '--cache-from'
      - 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:latest'
      - .
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', '--all-tags', 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api']

  # Vulnerability scan (fail build on HIGH/CRITICAL)
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: bash
    args:
      - '-c'
      - |
        gcloud artifacts docker images scan \
          us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:$SHORT_SHA \
          --format='value(response.scan)' > scan_id.txt
        gcloud artifacts docker images list-vulnerabilities $(cat scan_id.txt) \
          --format='value(vulnerability.effectiveSeverity)' | grep -E 'CRITICAL|HIGH' && exit 1 || true

  # Deploy to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud  # this image's default entrypoint is not gcloud
    args:
      - run
      - deploy
      - order-api
      - '--image=us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:$SHORT_SHA'
      - '--region=us-central1'
      - '--platform=managed'
      - '--no-traffic'  # Deploy without routing traffic (canary deploy)

  # Shift 10% traffic, verify, then 100%
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: bash
    args:
      - '-c'
      - |
        gcloud run services update-traffic order-api \
          --to-revisions=LATEST=10 --region=us-central1
        sleep 30
        # Health check
        curl -f https://order-api-xxx-uc.a.run.app/health || exit 1
        gcloud run services update-traffic order-api \
          --to-revisions=LATEST=100 --region=us-central1

options:
  logging: CLOUD_LOGGING_ONLY
  machineType: E2_HIGHCPU_8
Database Selection Guide
Cloud SQL vs AlloyDB vs Spanner
|  | Cloud SQL | AlloyDB | Spanner |
| --- | --- | --- | --- |
| Engine | PostgreSQL / MySQL / SQL Server | PostgreSQL-compatible | GoogleSQL (proprietary) |
| Best for | Standard OLTP, lift-and-shift | High-perf PostgreSQL OLTP | Global, multi-region writes |
| Scaling | Vertical + read replicas | Horizontal read pools | Horizontal, automatic sharding |
| Cost | ~$0.02/vCPU-hr | ~$0.08/vCPU-hr | ~$0.90/node-hr |
| Failover | ~60s | ~60s | Near-zero (multi-region) |
| Use when | < 10K QPS, familiar PG | > 10K QPS PG, HA critical | Multi-region writes required |
-- AlloyDB columnar engine accelerates analytical queries 10-100x on OLTP data.
-- Enable via the google_columnar_engine.enabled database flag, then add tables:
SELECT google_columnar_engine_add('orders');
-- No ETL needed for in-database analytics
Pub/Sub vs Dataflow vs Kafka
Pub/Sub:
- Managed, serverless, global, at-least-once delivery
- Push (HTTP) or pull subscribers
- Message ordering with ordering keys
- Dead letter topics for failed messages
- Use for: event ingestion, fan-out, Cloud Run triggers
Dataflow:
- Managed Apache Beam execution
- Streaming AND batch in same pipeline
- Auto-scaling, exactly-once semantics (streaming)
- Use for: complex ETL, aggregations, ML feature pipelines
Kafka (on GKE or Confluent Cloud):
- Replayable log, consumer groups, compacted topics
- Exactly-once production semantics
- Use for: event sourcing, audit logs, multi-consumer replay needs
Decision: Pub/Sub for most event-driven architectures. Add Dataflow when
you need stateful processing. Kafka only when you need replay/compaction.
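At-least-once delivery means duplicates will reach consumers, so handlers must be idempotent regardless of which transport wins. A minimal dedup sketch keyed on message ID (in-memory for illustration; production dedup state belongs in a durable shared store such as a unique database constraint):

```python
class IdempotentHandler:
    """Process each message ID at most once. In-memory set for illustration;
    real deployments need durable, shared dedup state."""

    def __init__(self, process):
        self._process = process
        self._seen = set()
        self.processed = 0

    def handle(self, message_id: str, payload: dict) -> bool:
        if message_id in self._seen:
            return False               # duplicate delivery: ack, skip reprocessing
        self._process(payload)
        self._seen.add(message_id)     # record only after successful processing
        self.processed += 1
        return True
```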
Pub/Sub with Dead Letter Topic
import json
import logging

from google.cloud import pubsub_v1

logger = logging.getLogger(__name__)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    try:
        data = json.loads(message.data.decode("utf-8"))
        process_message(data)
        message.ack()  # Must ack within ack deadline (default 10s)
    except ValueError as e:
        logger.error("Invalid message format", extra={"error": str(e), "message_id": message.message_id})
        message.nack()  # Will retry; after max_delivery_attempts → dead letter topic
    except Exception:
        logger.exception("Processing failed", extra={"message_id": message.message_id})
        message.nack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

# Dead-letter policy lives on the subscription, not in consumer code.
# Configure via Terraform/CLI: max_delivery_attempts=5, dead_letter_topic=projects/.../topics/dlq
BigQuery: IAM and Dataset Design
Project-level IAM (broad):
roles/bigquery.dataViewer → read data in all datasets (pair with roles/bigquery.jobUser to run queries)
roles/bigquery.dataEditor → read + INSERT/UPDATE/DELETE
roles/bigquery.admin → Full control
Dataset-level IAM (preferred):
Grant at dataset, not project, to scope access.
Use authorized views for row/column-level security.
-- Row-level security with authorized views: filter rows per querying user.
-- SESSION_USER() returns the caller's email; join a mapping table
-- (user_regions here is illustrative) to resolve their allowed regions.
CREATE VIEW analytics.orders_by_region AS
SELECT o.*
FROM raw.orders AS o
JOIN access_control.user_regions AS u
  ON o.region = u.region
WHERE u.user_email = SESSION_USER();
-- Column-level security with policy tags
-- Tag sensitive columns in Data Catalog; IAM on the policy tag controls who can read them
Cost controls for BigQuery:
-- Always use partition filters (massive cost impact)
SELECT * FROM `project.dataset.events`
WHERE DATE(timestamp) BETWEEN '2024-01-01' AND '2024-01-31' -- Partition pruning
AND user_id = '123' -- Clustering key (scan reduction)
-- Set cost controls per job
-- bq query --maximum_bytes_billed=1073741824 "SELECT..." (1GB limit)
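A dry run reports the bytes a query would scan, and on-demand billing charges per byte scanned, so the figure converts directly to dollars. A converter assuming roughly $6.25 per TiB (the rate is an assumption; check current BigQuery pricing for your region and edition):

```python
def estimate_query_cost(bytes_processed: int, usd_per_tib: float = 6.25) -> float:
    """Convert a dry-run bytes-processed figure into an on-demand cost estimate.
    usd_per_tib is an assumed rate, not authoritative pricing."""
    return bytes_processed / (1024 ** 4) * usd_per_tib
```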
Secret Manager Patterns
import functools
import os

from google.cloud import secretmanager

def get_secret(secret_id: str, version: str = "latest") -> str:
    """Fetch one secret version from Secret Manager."""
    # In production, prefer env vars populated at startup, not per-request calls
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{os.environ['GOOGLE_CLOUD_PROJECT']}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Startup pattern: load secrets once, cache in module scope
@functools.lru_cache(maxsize=None)
def get_db_password() -> str:
    return get_secret("db-password")
GCP Cost Optimization
| Strategy | Savings | Effort |
| --- | --- | --- |
| Cloud Run min-instances = 0 for dev/staging | 100% of idle cost | Low |
| Preemptible/Spot VMs for batch | 60-91% | Low |
| Committed Use Discounts (1yr/3yr) for stable workloads | 37-55% | Low |
| BigQuery slot commitments for predictable query volume | 30-50% | Medium |
| Cloud SQL: rightsizing via Cloud Monitoring | 20-40% | Medium |
| Network egress: keep traffic within region | Varies; cross-region and internet egress billed per GB | Medium |
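Committed Use Discounts deserve a break-even check before signing: you pay the discounted rate for the entire term whether or not the resource runs, so a commitment beats on-demand only above a utilization threshold. The arithmetic (illustrative, using the discount range from the table):

```python
def cud_breakeven_utilization(discount: float) -> float:
    """A CUD charges (1 - discount) of the on-demand price for 100% of the
    term, while on-demand charges only for actual utilization u. The CUD
    wins when u > 1 - discount."""
    return 1.0 - discount
```

At a 37% one-year discount the break-even is 63% utilization: a VM running less than roughly 15 hours a day is cheaper on-demand.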
Anti-Patterns
❌ Service account keys in code/environment variables — use Workload Identity Federation
❌ Granting roles/editor or roles/owner to service accounts — use specific roles
❌ Public Cloud SQL instances — always use Private IP + Cloud SQL Auth Proxy
❌ BigQuery SELECT * without partition filter — full table scans are expensive and slow
❌ Cloud Run without min-instances in prod — cold starts hurt p99 latency
❌ Deploying directly from main without artifact promotion — build once, promote image
❌ Pub/Sub subscriptions without dead letter topics — poison messages block consumption
❌ Shared service accounts across services — one SA per service for audit trail
Quick Reference
Cloud Run deployment (one-liner):
gcloud run deploy SERVICE --image IMAGE --region REGION --platform managed \
--service-account [email protected] \
--set-secrets=DB_URL=db-url:latest \
--vpc-connector projects/HOST/locations/REGION/connectors/CONNECTOR
Useful IAM roles:
Cloud Run invoker: roles/run.invoker
Secret accessor: roles/secretmanager.secretAccessor
Artifact Registry: roles/artifactregistry.writer (CI), reader (Cloud Run)
Cloud SQL: roles/cloudsql.client
Pub/Sub publisher: roles/pubsub.publisher
Pub/Sub subscriber: roles/pubsub.subscriber
Debugging Cloud Run:
gcloud run services describe SERVICE --region REGION # Check env, SA, VPC
gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" --limit=50
BigQuery cost estimate:
bq query --dry_run --use_legacy_sql=false "SELECT ..." # Shows bytes processed