aws-architect

Expert-level AWS architecture patterns covering the Well-Architected Framework, IAM least privilege design, VPC networking, CDK infrastructure-as-code, compute tradeoffs, database selection, and cost optimization. Trigger phrases: designing AWS infrastructure, AWS CDK, IAM policies, VPC design, Lamb

MoltbotDen

DevOps & Cloud

AWS Architect

AWS is not a collection of services — it's a composable platform with opinionated primitives. Expert AWS
architecture means understanding why each service exists, when it's the right choice, and how to wire
them together securely, cheaply, and operationally sustainably.

Core Mental Model

The Well-Architected Framework is your north star: Operational Excellence, Security, Reliability,
Performance Efficiency, Cost Optimization, and Sustainability. Every architecture decision maps to at least one pillar.
Security is the load-bearing wall — you cannot retrofit least-privilege after the fact. Design IAM first,
networking second, compute third. The blast radius of any failure should be bounded by your account/VPC/subnet
topology before code runs. Cost is architecture — a Lambda with a 1 GB memory ceiling and a 15-minute timeout
is a design decision, not a tuning knob.

IAM: Least Privilege in Depth

Principal Hierarchy

AWS Organizations (SCPs)
  └── Account boundary (resource-based policies)
        └── IAM Role (identity-based policy)
              └── Permission boundary (ceiling)
                    └── Session policy (further narrowing)

SCPs are guardrails, not grants. An SCP Allow means "accounts in this OU may have this permission" — the
IAM policy still needs to grant it explicitly.
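One way to picture the layering above is as set intersection: an action is effective only if every layer that is present allows it and nothing explicitly denies it. The sketch below is deliberately simplified. Real IAM evaluation also handles resource-based policies, wildcards, conditions, and cross-account rules. The function name and the action sets are hypothetical.

```python
def effective_permissions(scp_allow, identity_allow, boundary_allow=None,
                          session_allow=None, explicit_deny=frozenset()):
    """Simplified model: intersect every layer that is present,
    then subtract anything explicitly denied."""
    allowed = set(scp_allow) & set(identity_allow)
    if boundary_allow is not None:
        allowed &= set(boundary_allow)
    if session_allow is not None:
        allowed &= set(session_allow)
    return allowed - set(explicit_deny)


# Hypothetical action sets for illustration only.
scp = {"s3:GetObject", "s3:PutObject", "dynamodb:GetItem", "iam:CreateRole"}
identity = {"s3:GetObject", "s3:PutObject", "iam:CreateRole"}
boundary = {"s3:GetObject", "s3:PutObject", "dynamodb:GetItem"}

result = effective_permissions(scp, identity, boundary_allow=boundary)
print(sorted(result))  # ['s3:GetObject', 's3:PutObject']
```

Note that `iam:CreateRole` drops out because the boundary does not include it, and `dynamodb:GetItem` drops out because the identity policy never granted it, even though the SCP allows both.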

Permission Boundaries Pattern

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowServicesInBoundary",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "dynamodb:GetItem",
        "dynamodb:PutItem"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket/*",
        "arn:aws:dynamodb:us-east-1:123456789012:table/my-app-*"
      ]
    },
    {
      "Sid": "DenyPrivilegeEscalation",
      "Effect": "Deny",
      "Action": [
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy",
        "sts:AssumeRole"
      ],
      "Resource": "*"
    }
  ]
}

IAM Conditions to Always Include

{
  "Condition": {
    "StringEquals": {
      "aws:RequestedRegion": ["us-east-1", "us-west-2"]
    },
    "Bool": {
      "aws:SecureTransport": "true",
      "aws:MultiFactorAuthPresent": "true"
    },
    "ArnLike": {
      "aws:PrincipalArn": "arn:aws:iam::*:role/allowed-role-*"
    }
  }
}
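How a block like this combines is worth spelling out: every operator and key in the block must match (AND), while multiple values for one key are alternatives (OR). The sketch below models just the three operators used above. `evaluate_condition` and the request-context dict are hypothetical helpers, not an AWS API, and `fnmatchcase` is only an approximation of `ArnLike` matching.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


# Simplified matchers for the three operators used in the block above.
OPERATORS = {
    "StringEquals": lambda actual, expected: actual == expected,
    "Bool": lambda actual, expected: str(actual).lower() == expected.lower(),
    "ArnLike": lambda actual, expected: fnmatchcase(actual, expected),
}


def evaluate_condition(condition, context):
    """Return True only if every operator/key pair matches (AND);
    any one of a key's listed values is enough (OR)."""
    for operator, keys in condition.items():
        match = OPERATORS[operator]
        for key, expected_values in keys.items():
            if key not in context:
                return False  # simplification: a missing key fails the check
            if not any(match(context[key], v) for v in _as_list(expected_values)):
                return False
    return True


condition = {
    "StringEquals": {"aws:RequestedRegion": ["us-east-1", "us-west-2"]},
    "Bool": {"aws:SecureTransport": "true", "aws:MultiFactorAuthPresent": "true"},
    "ArnLike": {"aws:PrincipalArn": "arn:aws:iam::*:role/allowed-role-*"},
}
request = {
    "aws:RequestedRegion": "us-west-2",
    "aws:SecureTransport": True,
    "aws:MultiFactorAuthPresent": True,
    "aws:PrincipalArn": "arn:aws:iam::123456789012:role/allowed-role-ci",
}

print(evaluate_condition(condition, request))                                         # True
print(evaluate_condition(condition, {**request, "aws:RequestedRegion": "eu-west-1"}))  # False
```

Because the keys are ANDed, a request without MFA fails the whole block. The MFA key therefore fits policies for human-facing roles better than policies for service roles.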

VPC Design: Three-Tier Architecture

Internet Gateway
        │
┌───────▼────────┐  AZ-A           AZ-B
│  Public Subnet  │  10.0.1.0/24   10.0.2.0/24
│  (ALB, NAT GW)  │
└───────┬─────────┘
        │ (private route via NAT GW)
┌───────▼────────┐  10.0.11.0/24  10.0.12.0/24
│ Private Subnet  │  (App servers, ECS, Lambda)
│ (App Tier)      │
└───────┬─────────┘
        │ (VPC endpoint or isolated)
┌───────▼────────┐  10.0.21.0/24  10.0.22.0/24
│ Isolated Subnet │  (RDS, ElastiCache)
│ (Data Tier)     │  No outbound route
└─────────────────┘
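A quick sanity check for a layout like the one in the diagram: every subnet should sit inside the VPC range and no two subnets should overlap. The sketch uses only the standard-library `ipaddress` module. The `10.0.0.0/16` VPC range is an assumption (the diagram does not state it), and `validate_layout` is a hypothetical helper.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")  # assumed VPC range

# Subnet CIDRs taken from the diagram above (AZ-A, AZ-B per tier).
TIERS = {
    "public": ["10.0.1.0/24", "10.0.2.0/24"],
    "private": ["10.0.11.0/24", "10.0.12.0/24"],
    "isolated": ["10.0.21.0/24", "10.0.22.0/24"],
}


def validate_layout(vpc, tiers):
    """Return a list of human-readable problems; an empty list means OK."""
    subnets = [ipaddress.ip_network(c) for cidrs in tiers.values() for c in cidrs]
    problems = []
    for net in subnets:
        if not net.subnet_of(vpc):
            problems.append(f"{net} is outside {vpc}")
    for i, a in enumerate(subnets):
        for b in subnets[i + 1:]:
            if a.overlaps(b):
                problems.append(f"{a} overlaps {b}")
    return problems


problems = validate_layout(VPC_CIDR, TIERS)
print(problems or "layout OK")
```

Leaving gaps between tiers (1-2, 11-12, 21-22) keeps room to add a third AZ per tier later without renumbering.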

Security Groups vs NACLs

Dimension	Security Group	NACL
Stateful?	Yes (return traffic auto-allowed)	No (must allow inbound AND outbound)
Scope	ENI-level	Subnet-level
Rules	Allow only	Allow + Deny
Use for	Fine-grained resource access	Subnet-level DDoS/IP blocking

VPC Endpoints vs NAT Gateway

Interface endpoint (PrivateLink): S3, DynamoDB, SSM, Secrets Manager, and most other AWS services. Keeps traffic off the internet, but is not free: each endpoint bills hourly per AZ plus a per-GB data processing charge
Gateway endpoint: S3 and DynamoDB only, free, route table entry
NAT Gateway: ~$0.045/hr + $0.045/GB — expensive at scale. Use endpoints to avoid it for AWS APIs.

# Terraform: VPC with endpoints to avoid NAT Gateway for AWS APIs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
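To see where endpoints pay off against the NAT Gateway pricing quoted above, a rough monthly estimate helps. The NAT figures come from the list above. The interface endpoint rates ($0.01/hr per AZ and $0.01/GB) and the 730-hour month are assumptions for illustration, so check current regional pricing before relying on them.

```python
HOURS_PER_MONTH = 730   # assumption: average month

NAT_HOURLY = 0.045      # USD/hr, from the list above
NAT_PER_GB = 0.045      # USD/GB processed, from the list above
ENDPOINT_HOURLY = 0.01  # USD/hr per AZ (assumed, verify against current pricing)
ENDPOINT_PER_GB = 0.01  # USD/GB processed (assumed, verify against current pricing)


def nat_monthly_cost(gb, gateways=1):
    """Fixed hourly charge per gateway plus per-GB processing."""
    return gateways * NAT_HOURLY * HOURS_PER_MONTH + gb * NAT_PER_GB


def interface_endpoint_monthly_cost(gb, endpoints=1, azs=2):
    """Fixed hourly charge per endpoint per AZ plus per-GB processing."""
    return endpoints * azs * ENDPOINT_HOURLY * HOURS_PER_MONTH + gb * ENDPOINT_PER_GB


for gb in (100, 1_000, 10_000):
    nat = round(nat_monthly_cost(gb), 2)
    endpoint = round(interface_endpoint_monthly_cost(gb), 2)
    print(f"{gb:>6} GB  NAT ${nat:>8}  interface endpoint ${endpoint:>8}")
```

Gateway endpoints for S3 and DynamoDB carry no charge, so moving that traffic off the NAT Gateway removes its per-GB cost entirely. Interface endpoints win mainly as data volume grows, since their fixed cost scales with the number of endpoints and AZs.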

AWS CDK: L1 / L2 / L3 Constructs

L3 (Patterns)  ┌─ ApplicationLoadBalancedFargateService
               │  aws-ecs-patterns — opinionated, fewer knobs
L2 (Constructs)┌─ aws_ecs.FargateService, aws_ec2.Vpc
               │  Sensible defaults, escape hatches via .node.defaultChild
L1 (Cfn*)     ┌─ CfnTaskDefinition — direct CloudFormation
               │  Full control, verbose, no defaults

CDK VPC with Full Tier Isolation

import * as ec2 from 'aws-cdk-lib/aws-ec2';

const vpc = new ec2.Vpc(this, 'AppVpc', {
  ipAddresses: ec2.IpAddresses.cidr('10.0.0.0/16'),
  maxAzs: 3,
  natGateways: 1, // Cost optimization: single NAT GW (add per-AZ for HA)
  subnetConfiguration: [
    {
      name: 'Public',
      subnetType: ec2.SubnetType.PUBLIC,
      cidrMask: 24,
    },
    {
      name: 'Private',
      subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
      cidrMask: 24,
    },
    {
      name: 'Isolated',
      subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
      cidrMask: 24,
    },
  ],
  gatewayEndpoints: {
    S3: { service: ec2.GatewayVpcEndpointAwsService.S3 },
    DYNAMODB: { service: ec2.GatewayVpcEndpointAwsService.DYNAMODB },
  },
});

// Interface endpoint for SSM. Session Manager without NAT typically also needs the
// SSM_MESSAGES and EC2_MESSAGES interface endpoints.
vpc.addInterfaceEndpoint('SsmEndpoint', {
  service: ec2.InterfaceVpcEndpointAwsService.SSM,
  subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
});

Lambda Handler with Structured Logging

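import json

from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.metrics import MetricUnit
from aws_lambda_powertools.utilities.typing import LambdaContext

logger = Logger(service="order-processor")
tracer = Tracer(service="order-processor")
metrics = Metrics(namespace="MyApp", service="order-processor")


def process_order(order_id: str) -> dict:
    """Hypothetical placeholder; replace with the real business logic."""
    if not order_id:
        raise ValueError("order_id is required")
    return {"order_id": order_id, "status": "processed"}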

@logger.inject_lambda_context(log_event=True)
@tracer.capture_lambda_handler
@metrics.log_metrics(capture_cold_start_metric=True)
def handler(event: dict, context: LambdaContext) -> dict:
    order_id = event.get("order_id")
    
    logger.info("Processing order", extra={"order_id": order_id, "source": event.get("source")})
    
    try:
        result = process_order(order_id)
        metrics.add_metric(name="OrdersProcessed", unit=MetricUnit.Count, value=1)
        return {"statusCode": 200, "body": json.dumps(result)}
    except ValueError as e:
        logger.warning("Validation error", extra={"order_id": order_id, "error": str(e)})
        return {"statusCode": 400, "body": json.dumps({"error": str(e)})}
    except Exception:
        logger.exception("Unexpected error processing order", extra={"order_id": order_id})
        metrics.add_metric(name="OrderProcessingErrors", unit=MetricUnit.Count, value=1)
        raise  # Re-raise for Lambda retry / DLQ

Lambda: Cold Starts and SnapStart

Cold Start Anatomy

Container provisioning (~100–500ms)
  → Runtime init (JVM: ~1–5s, Python: ~100ms, Node: ~100ms)
    → Handler init code (your module-level code)
      → Handler invocation

Mitigation strategies:

Provisioned concurrency: Pre-warm N instances (costs money even when idle)

SnapStart (Java/Kotlin, and more recently Python and .NET runtimes): Snapshot after init, restore from snapshot (up to ~10x faster for Java)

ARM64 (Graviton2): ~20% cheaper, often faster cold starts

Minimize package size: Fewer imports = faster init

# ✅ Initialize heavyweight clients OUTSIDE handler (reused across warm invocations)
import os

import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['TABLE_NAME'])

# ✅ Use environment variable for config (not SSM on every invocation)
REGION = os.environ['AWS_REGION']

def handler(event, context):
    # table is already initialized — no cold start penalty here
    response = table.get_item(Key={'id': event['id']})
    return response.get('Item')
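The effect of module-level initialization can be demonstrated without AWS at all. The sketch below stands in a slow, hypothetical `create_client` for a real SDK client and counts how often it runs across several invocations in the same process, which is roughly what a warm execution environment looks like.

```python
import time

INIT_CALLS = 0


def create_client():
    """Stand-in for an expensive SDK client constructor (hypothetical)."""
    global INIT_CALLS
    INIT_CALLS += 1
    time.sleep(0.05)  # pretend this is connection setup / credential loading
    return {"name": "fake-client"}


# Module scope: runs once per execution environment (the cold start).
CLIENT = create_client()


def handler(event, context=None):
    # Warm invocations reuse CLIENT instead of paying the setup cost again.
    return {"id": event["id"], "client": CLIENT["name"]}


for i in range(3):
    handler({"id": i})

print(INIT_CALLS)  # 1
```

Had `create_client()` been called inside `handler`, the counter would read 3 and every invocation would pay the setup delay.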

ECS Fargate vs EKS Trade-offs

Dimension	ECS Fargate	EKS (Managed)
Control plane cost	Free	~$0.10/hr/cluster
Operational burden	Low	Medium-High
Container density	One task per isolated vCPU/memory allocation (from 0.25 vCPU)	Bin-packing via scheduler
Networking	awsvpc (ENI per task), simpler	CNI (VPC-native or overlay)
Ecosystem	AWS-native	CNCF ecosystem (Helm, Argo, etc.)
Auto-scaling	Service auto-scaling	HPA/VPA/KEDA, cluster-autoscaler
Debugging access	ECS Exec	kubectl exec
Best for	Simpler workloads, cost-sensitive	Complex orchestration, 20+ services

Decision rule: If you're not running 20+ distinct services and don't need the CNCF ecosystem, ECS Fargate wins on operational overhead. EKS pays off at scale with custom scheduling, Argo Workflows, or advanced networking requirements.

RDS Multi-AZ vs Aurora Global

RDS Multi-AZ:
  Primary (writes+reads) ──sync replication──► Standby (AZ-B)
  Failover: ~1-2 min (DNS flip), standby promotes

Aurora Cluster:
  Writer instance ──shared storage (6 copies, 3 AZs)──► Reader instance(s)
  Failover: ~30s (reader promotes), storage always consistent

Aurora Global:
  Primary region ──async replication (<1s lag)──► Secondary region(s)
  Use for: DR, read scaling across regions, near-local latency reads

Choose Aurora when: connection count > 1000 (pgBouncer helps RDS), need <30s failover, multi-region reads,
serverless auto-pause for dev/staging, or storage auto-scaling without pre-provisioning.

S3: Storage Classes and Lifecycle

Standard            → Hot data, frequent access ($0.023/GB-month)
Intelligent-Tiering → Unknown access patterns (monitoring fee + tiering)
Standard-IA         → Infrequent access, retrieval fee, 30-day minimum
Glacier Instant     → Archive, ms retrieval, 90-day minimum
Glacier Flexible    → Archive, minutes-hours retrieval, 90-day minimum
Deep Archive        → Long-term archive, ~12hr retrieval, 180-day minimum, cheapest storage ($0.00099/GB-month)

Lifecycle rule: Standard → Standard-IA (30d) → Glacier Instant (90d) → Deep Archive (365d)

{
  "Rules": [{
    "ID": "auto-archive",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER_IR"},
      {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
    ],
    "Expiration": {"Days": 2555},
    "NoncurrentVersionTransitions": [
      {"NoncurrentDays": 7, "StorageClass": "GLACIER_IR"}
    ],
    "NoncurrentVersionExpiration": {"NoncurrentDays": 90}
  }]
}
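To check what a rule like this does to an object over time, the transitions can be replayed against an object's age. This is a simplified sketch: `storage_class_at` is a hypothetical helper, it covers current versions only, and real S3 applies lifecycle actions asynchronously, so actual transition times are approximate.

```python
# Values copied from the lifecycle rule above.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555  # roughly 7 years


def storage_class_at(age_days, transitions=TRANSITIONS, expiration_days=EXPIRATION_DAYS):
    """Return the storage class for a current object version of the given age,
    or None once the object has expired."""
    if age_days >= expiration_days:
        return None
    current = "STANDARD"
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 30, 100, 400, 2555):
    print(age, storage_class_at(age))
```

Running it against a few representative ages is a cheap way to catch ordering mistakes (for example a transition day that exceeds the expiration day) before applying the rule to a bucket.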

CloudFront with Lambda@Edge

Viewer Request  → Lambda@Edge (A/B testing, auth header injection, URL rewrites)
Origin Request  → Lambda@Edge (cache key manipulation, auth to origin)
Origin Response → Lambda@Edge (response header normalization, fallback origins)
Viewer Response → Lambda@Edge (security headers, cookie manipulation)

CloudFront Functions (cheaper, faster, limited):
  - JS only, <1ms execution, no network calls
  - Use for: URL normalization, query string manipulation, simple auth

Always add security headers, for example with a CloudFront Function attached to the viewer response event:

// CloudFront Function: security-headers
function handler(event) {
  var response = event.response;
  var headers = response.headers;
  
  headers['strict-transport-security'] = { value: 'max-age=63072000; includeSubdomains; preload' };
  headers['x-content-type-options'] = { value: 'nosniff' };
  headers['x-frame-options'] = { value: 'DENY' };
  headers['x-xss-protection'] = { value: '1; mode=block' };
  headers['referrer-policy'] = { value: 'strict-origin-when-cross-origin' };
  headers['content-security-policy'] = { 
    value: "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
  };
  
  return response;
}

AWS Organizations and Cost Management

Root
  ├── Management Account (billing only, no workloads)
  ├── Security OU
  │     ├── Audit Account (CloudTrail, Config aggregator)
  │     └── Log Archive Account (centralized S3 logs)
  ├── Infrastructure OU
  │     └── Shared Services Account (Transit Gateway, Route 53, ECR)
  └── Workloads OU
        ├── Production OU → prod account(s)
        └── SDLC OU → dev/staging accounts

SCP: Prevent disabling CloudTrail

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyCloudTrailDisable",
    "Effect": "Deny",
    "Action": [
      "cloudtrail:DeleteTrail",
      "cloudtrail:StopLogging",
      "cloudtrail:UpdateTrail"
    ],
    "Resource": "*"
  }]
}

Anti-Patterns

❌ S3 public-read on entire bucket — use CloudFront OAC (Origin Access Control) instead
❌ Hardcoded credentials — use IAM roles, instance profiles, IRSA
❌ Security groups with 0.0.0.0/0 ingress on port 22/3389 — use SSM Session Manager
❌ Single-AZ RDS in production — always Multi-AZ, even for cost-sensitive workloads
❌ Lambda timeouts set to 15 minutes as default — set the minimum viable timeout + 20% buffer
❌ All infrastructure in default VPC — default VPC has no isolation; create purpose-built VPCs
❌ Missing resource tagging — you cannot do cost allocation, compliance, or automation without tags
❌ IAM users with long-lived access keys — use roles, OIDC federation, or IAM Identity Center
❌ CloudWatch alarms without actions — alarms that don't alert are theater
❌ ECS tasks with task role = AdministratorAccess — scope task roles to exactly what the app needs
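The Lambda timeout rule above reduces to simple arithmetic. The sketch below reads "minimum viable timeout" as the worst observed duration, which is an assumption, and `recommended_timeout` is a hypothetical helper. The result is rounded up to whole seconds and capped at the 15-minute Lambda limit.

```python
import math

LAMBDA_MAX_TIMEOUT_S = 900  # 15 minutes, the Lambda ceiling


def recommended_timeout(observed_max_seconds, buffer=0.20):
    """Worst observed duration plus a buffer, rounded up, capped at the maximum."""
    if observed_max_seconds <= 0:
        raise ValueError("observed_max_seconds must be positive")
    padded = math.ceil(observed_max_seconds * (1 + buffer))
    return min(padded, LAMBDA_MAX_TIMEOUT_S)


print(recommended_timeout(4.2))  # 6  (4.2 s * 1.2 = 5.04 s, rounded up)
print(recommended_timeout(800))  # 900 (capped at the Lambda limit)
```

A function whose padded value hits the cap is a hint that the workload may belong on ECS Fargate or AWS Batch instead.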

Quick Reference

IAM Decision Tree:
  Human needs access?       → IAM Identity Center (SSO)
  EC2/ECS needs AWS access? → Instance profile / Task role
  Lambda needs AWS access?  → Execution role
  Cross-account access?     → Role assumption (add ExternalId for third-party access)
  CI/CD pipeline?           → OIDC identity provider (GitHub Actions → OIDC → role)

VPC CIDR Planning (avoid overlaps with corporate/VPN):
  Dev:     10.0.0.0/16
  Staging: 10.1.0.0/16
  Prod:    10.2.0.0/16
  Shared:  10.100.0.0/16

Compute Decision Tree:
  Stateless, event-driven, <15min? → Lambda
  Long-running, simple container?  → ECS Fargate
  Complex orchestration, 20+ svcs? → EKS
  Batch processing, spot-friendly?  → ECS/EKS with Spot Instances or AWS Batch
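The compute decision tree can be written as a small function, which makes the thresholds explicit and easy to argue about. This is a sketch: the `Workload` fields are hypothetical, and checking the batch case first is an ordering choice made here, not something the tree prescribes.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    stateless: bool
    event_driven: bool
    max_runtime_minutes: float
    service_count: int = 1
    needs_cncf_ecosystem: bool = False
    batch: bool = False
    spot_friendly: bool = False


def choose_compute(w: Workload) -> str:
    """Map a workload description onto the decision tree above."""
    if w.batch and w.spot_friendly:
        return "ECS/EKS with Spot Instances or AWS Batch"
    if w.stateless and w.event_driven and w.max_runtime_minutes < 15:
        return "Lambda"
    if w.service_count >= 20 or w.needs_cncf_ecosystem:
        return "EKS"
    return "ECS Fargate"


print(choose_compute(Workload(stateless=True, event_driven=True, max_runtime_minutes=2)))
print(choose_compute(Workload(stateless=False, event_driven=False, max_runtime_minutes=600, service_count=3)))
print(choose_compute(Workload(stateless=False, event_driven=False, max_runtime_minutes=600, service_count=25)))
```

The three calls return "Lambda", "ECS Fargate", and "EKS" respectively.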

Cost Quick Wins:
  1. Right-size EC2 (Compute Optimizer recommendations)
  2. S3 Intelligent-Tiering for unknown patterns
  3. Delete unattached EBS volumes (often forgotten)
  4. NAT Gateway → VPC endpoints for AWS API traffic
  5. Reserved Instances / Savings Plans for stable workloads (1yr = ~30% savings)

Skill Information

Source: MoltbotDen
Category: DevOps & Cloud