aws-architect

Expert-level AWS architecture patterns covering the Well-Architected Framework, IAM least privilege design, VPC networking, CDK infrastructure-as-code, compute tradeoffs, database selection, and cost optimization. Trigger phrases: designing AWS infrastructure, AWS CDK, IAM policies, VPC design, Lamb

MoltbotDen
DevOps & Cloud

AWS Architect

AWS is not a collection of services — it's a composable platform with opinionated primitives. Expert AWS
architecture means understanding why each service exists, when it's the right choice, and how to wire
them together securely, cheaply, and operationally sustainably.

Core Mental Model

The Well-Architected Framework is your north star: Operational Excellence, Security, Reliability,
Performance Efficiency, Cost Optimization, and Sustainability. Every architecture decision maps to at least one pillar.
Security is the load-bearing wall — you cannot retrofit least-privilege after the fact. Design IAM first,
networking second, compute third. The blast radius of any failure should be bounded by your account/VPC/subnet
topology before code runs. Cost is architecture — a Lambda with a 1 GB memory ceiling and a 15-minute timeout
is a design decision, not a tuning knob.

IAM: Least Privilege in Depth

Principal Hierarchy

AWS Organizations (SCPs)
  └── Account boundary (resource-based policies)
        └── IAM Role (identity-based policy)
              └── Permission boundary (ceiling)
                    └── Session policy (further narrowing)

SCPs are guardrails, not grants. An SCP Allow means "accounts in this OU may have this permission" — the
IAM policy still needs to grant it explicitly.
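
The layering above can be sketched as a set intersection. This is a deliberately simplified model, not the full IAM evaluation logic: it assumes exact action names (no wildcards, resources, or conditions) and ignores resource-based policies. `PolicyLayer` and `is_allowed` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyLayer:
    """One layer of the hierarchy, reduced to sets of allowed and denied actions."""
    name: str
    allow: frozenset = frozenset()
    deny: frozenset = frozenset()


def is_allowed(action: str, layers: list) -> bool:
    # An explicit Deny in any layer wins outright.
    if any(action in layer.deny for layer in layers):
        return False
    # SCPs and boundaries only cap permissions, so the effective permission
    # set is the intersection of what every layer allows.
    return all(action in layer.allow for layer in layers)


scp = PolicyLayer("scp", allow=frozenset({"s3:GetObject", "s3:PutObject", "ec2:RunInstances"}))
role_policy = PolicyLayer("identity", allow=frozenset({"s3:GetObject", "dynamodb:GetItem"}))
boundary = PolicyLayer("boundary", allow=frozenset({"s3:GetObject", "s3:PutObject", "dynamodb:GetItem"}))
layers = [scp, role_policy, boundary]

print(is_allowed("s3:GetObject", layers))      # True: every layer allows it
print(is_allowed("ec2:RunInstances", layers))  # False: the SCP permits it, but no policy grants it
print(is_allowed("dynamodb:GetItem", layers))  # False: granted by the role, but outside the SCP
```

The second result is the point of the paragraph above: an SCP Allow on its own grants nothing.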

Permission Boundaries Pattern

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowServicesInBoundary",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "dynamodb:GetItem",
        "dynamodb:PutItem"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket/*",
        "arn:aws:dynamodb:us-east-1:123456789012:table/my-app-*"
      ]
    },
    {
      "Sid": "DenyPrivilegeEscalation",
      "Effect": "Deny",
      "Action": [
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy",
        "sts:AssumeRole"
      ],
      "Resource": "*"
    }
  ]
}

IAM Conditions to Always Include

{
  "Condition": {
    "StringEquals": {
      "aws:RequestedRegion": ["us-east-1", "us-west-2"]
    },
    "Bool": {
      "aws:SecureTransport": "true",
      "aws:MultiFactorAuthPresent": "true"
    },
    "ArnLike": {
      "aws:PrincipalArn": "arn:aws:iam::*:role/allowed-role-*"
    }
  }
}
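
A rough sketch of how a block like this is evaluated may help when deciding which keys to include: operators and keys are ANDed, multiple values for one key are ORed, and a key that is missing from the request context fails its test (for operators without the `IfExists` suffix). The evaluator below is an illustrative simplification that supports only the three operators used above; `condition_matches` and the sample contexts are hypothetical.

```python
from fnmatch import fnmatchcase


def _value_matches(operator: str, actual: str, expected: str) -> bool:
    if operator == "StringEquals":
        return actual == expected
    if operator == "Bool":
        return actual.lower() == expected.lower()
    if operator == "ArnLike":
        # Simplification: shell-style wildcards over the whole ARN string.
        return fnmatchcase(actual, expected)
    raise ValueError(f"unsupported operator: {operator}")


def condition_matches(condition: dict, context: dict) -> bool:
    """Return True only if every operator/key test in the block passes."""
    for operator, tests in condition.items():
        for key, expected in tests.items():
            if key not in context:
                return False  # absent key: the test cannot match
            candidates = expected if isinstance(expected, list) else [expected]
            if not any(_value_matches(operator, context[key], c) for c in candidates):
                return False
    return True


condition = {
    "StringEquals": {"aws:RequestedRegion": ["us-east-1", "us-west-2"]},
    "Bool": {"aws:SecureTransport": "true", "aws:MultiFactorAuthPresent": "true"},
    "ArnLike": {"aws:PrincipalArn": "arn:aws:iam::*:role/allowed-role-*"},
}

# Hypothetical request contexts for illustration.
with_mfa = {
    "aws:RequestedRegion": "us-west-2",
    "aws:SecureTransport": "true",
    "aws:MultiFactorAuthPresent": "true",
    "aws:PrincipalArn": "arn:aws:iam::123456789012:role/allowed-role-dev",
}
without_mfa_key = {k: v for k, v in with_mfa.items() if k != "aws:MultiFactorAuthPresent"}

print(condition_matches(condition, with_mfa))         # True
print(condition_matches(condition, without_mfa_key))  # False: the MFA key is absent
```

Because every test is ANDed, a key such as `aws:MultiFactorAuthPresent` that is not present in every caller's context can block legitimate requests, so it is worth checking which principals a statement applies to before adding all of these conditions together.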

VPC Design: Three-Tier Architecture

Internet Gateway
        │
┌───────▼────────┐  AZ-A           AZ-B
│  Public Subnet  │  10.0.1.0/24   10.0.2.0/24
│  (ALB, NAT GW)  │
└───────┬─────────┘
        │ (private route via NAT GW)
┌───────▼────────┐  10.0.11.0/24  10.0.12.0/24
│ Private Subnet  │  (App servers, ECS, Lambda)
│ (App Tier)      │
└───────┬─────────┘
        │ (VPC endpoint or isolated)
┌───────▼────────┐  10.0.21.0/24  10.0.22.0/24
│ Isolated Subnet │  (RDS, ElastiCache)
│ (Data Tier)     │  No outbound route
└─────────────────┘

Security Groups vs NACLs

Dimension   Security Group                      NACL
Stateful?   Yes (return traffic auto-allowed)   No (must allow inbound AND outbound)
Scope       ENI-level                           Subnet-level
Rules       Allow only                          Allow + Deny
Use for     Fine-grained resource access        Subnet-level DDoS/IP blocking

VPC Endpoints vs NAT Gateway

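  • Interface endpoint (PrivateLink): SSM, Secrets Manager, and most other AWS APIs. Keeps traffic off the internet; billed per endpoint-hour per AZ plus a per-GB processing fee, at rates well below NAT Gateway. (S3 and DynamoDB also offer interface endpoints, but the gateway type is free.)
  • Gateway endpoint: S3 and DynamoDB only, free, added as a route table entry
  • NAT Gateway: ~$0.045/hr + $0.045/GB, expensive at scale. Use endpoints to avoid it for AWS APIs.
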
# Terraform: VPC with endpoints to avoid NAT Gateway for AWS APIs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
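
The cost argument for endpoints can be checked with quick arithmetic. The NAT Gateway rates below come from the list above; the interface endpoint rates ($0.01 per AZ-hour and $0.01/GB) and the 730-hour month are assumptions for illustration, so verify them against current pricing for your region before relying on the numbers.

```python
HOURS_PER_MONTH = 730  # assumed average month length


def nat_gateway_monthly(gb: float, hourly: float = 0.045, per_gb: float = 0.045) -> float:
    """Monthly cost of one NAT Gateway processing `gb` gigabytes."""
    return hourly * HOURS_PER_MONTH + per_gb * gb


def interface_endpoint_monthly(gb: float, azs: int = 2,
                               hourly_per_az: float = 0.01, per_gb: float = 0.01) -> float:
    """Monthly cost of one interface endpoint deployed in `azs` Availability Zones.

    The default rates are assumed list prices, not values from the source text.
    """
    return hourly_per_az * azs * HOURS_PER_MONTH + per_gb * gb


for gb in (100, 1_000, 10_000):
    nat = nat_gateway_monthly(gb)
    endpoint = interface_endpoint_monthly(gb)
    print(f"{gb:>6} GB/month  NAT ${nat:,.2f}  interface endpoint ${endpoint:,.2f}")
```

Each additional interface endpoint adds its own hourly charge, so the comparison shifts when many services need endpoints and traffic is low. Gateway endpoints for S3 and DynamoDB carry no charge either way.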

AWS CDK: L1 / L2 / L3 Constructs

L3 (Patterns)  ┌─ ApplicationLoadBalancedFargateService
               │  aws-ecs-patterns — opinionated, fewer knobs
L2 (Constructs)┌─ aws_ecs.FargateService, aws_ec2.Vpc
               │  Sensible defaults, escape hatches via .node.defaultChild
L1 (Cfn*)     ┌─ CfnTaskDefinition — direct CloudFormation
               │  Full control, verbose, no defaults

CDK VPC with Full Tier Isolation

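import { Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

// NetworkStack is an illustrative name; put the VPC in whichever stack owns networking.
export class NetworkStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'AppVpc', {
      ipAddresses: ec2.IpAddresses.cidr('10.0.0.0/16'),
      maxAzs: 3,
      natGateways: 1, // Cost optimization: single NAT GW (add per-AZ for HA)
      subnetConfiguration: [
        { name: 'Public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
        { name: 'Private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
        { name: 'Isolated', subnetType: ec2.SubnetType.PRIVATE_ISOLATED, cidrMask: 24 },
      ],
      // Gateway endpoints are free and keep S3/DynamoDB traffic off the NAT GW
      gatewayEndpoints: {
        S3: { service: ec2.GatewayVpcEndpointAwsService.S3 },
        DYNAMODB: { service: ec2.GatewayVpcEndpointAwsService.DYNAMODB },
      },
    });

    // Interface endpoints for SSM (no NAT needed for EC2 management).
    // Session Manager also uses ssmmessages; ec2messages is listed for older SSM Agent versions.
    const ssmEndpoints = {
      SsmEndpoint: ec2.InterfaceVpcEndpointAwsService.SSM,
      SsmMessagesEndpoint: ec2.InterfaceVpcEndpointAwsService.SSM_MESSAGES,
      Ec2MessagesEndpoint: ec2.InterfaceVpcEndpointAwsService.EC2_MESSAGES,
    };
    for (const [endpointId, service] of Object.entries(ssmEndpoints)) {
      vpc.addInterfaceEndpoint(endpointId, {
        service,
        subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      });
    }
  }
}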

Lambda Handler with Structured Logging

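import json

from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.metrics import MetricUnit
from aws_lambda_powertools.utilities.typing import LambdaContext

logger = Logger(service="order-processor")
tracer = Tracer(service="order-processor")
metrics = Metrics(namespace="MyApp", service="order-processor")


def process_order(order_id):
    """Placeholder for the business logic, which the original snippet leaves undefined."""
    if not order_id:
        raise ValueError("order_id is required")
    return {"order_id": order_id, "status": "processed"}
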

@logger.inject_lambda_context(log_event=True)
@tracer.capture_lambda_handler
@metrics.log_metrics(capture_cold_start_metric=True)
def handler(event: dict, context: LambdaContext) -> dict:
    order_id = event.get("order_id")
    
    logger.info("Processing order", extra={"order_id": order_id, "source": event.get("source")})
    
    try:
        result = process_order(order_id)
        metrics.add_metric(name="OrdersProcessed", unit=MetricUnit.Count, value=1)
        return {"statusCode": 200, "body": json.dumps(result)}
    except ValueError as e:
        logger.warning("Validation error", extra={"order_id": order_id, "error": str(e)})
        return {"statusCode": 400, "body": json.dumps({"error": str(e)})}
    except Exception:
        logger.exception("Unexpected error processing order", extra={"order_id": order_id})
        metrics.add_metric(name="OrderProcessingErrors", unit=MetricUnit.Count, value=1)
        raise  # Re-raise for Lambda retry / DLQ

Lambda: Cold Starts and SnapStart

Cold Start Anatomy

Container provisioning (~100–500ms)
  → Runtime init (JVM: ~1–5s, Python: ~100ms, Node: ~100ms)
    → Handler init code (your module-level code)
      → Handler invocation

Mitigation strategies:

  • Provisioned concurrency: Pre-warm N instances (costs money even when idle)

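  • SnapStart (Java runtimes, including Kotlin; newer Python and .NET runtimes are also supported): snapshot the initialized execution environment and restore from it, up to ~10x faster startup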

  • ARM64 (Graviton2): ~20% cheaper, often faster cold starts

  • Minimize package size: Fewer imports = faster init


# ✅ Initialize heavyweight clients OUTSIDE handler (reused across warm invocations)
import os

import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['TABLE_NAME'])

# ✅ Use environment variable for config (not SSM on every invocation)
REGION = os.environ['AWS_REGION']

def handler(event, context):
    # table is already initialized — no cold start penalty here
    response = table.get_item(Key={'id': event['id']})
    return response.get('Item')

ECS Fargate vs EKS Trade-offs

Dimension            ECS Fargate                                 EKS (Managed)
Control plane cost   Free                                        ~$0.10/hr/cluster
Operational burden   Low                                         Medium-High
Container density    1 task = its own vCPU + memory allocation   Bin-packing via scheduler
Networking           awsvpc (ENI per task), simpler              CNI (VPC-native or overlay)
Ecosystem            AWS-native                                  CNCF ecosystem (Helm, Argo, etc.)
Auto-scaling         Service auto-scaling                        HPA/VPA/KEDA, cluster-autoscaler
Debugging            ECS Exec                                    kubectl exec
Best for             Simpler workloads, cost-sensitive           Complex orchestration, 20+ services

Decision rule: If you're not running 20+ distinct services and don't need the CNCF ecosystem, ECS Fargate wins on operational overhead. EKS pays off at scale with custom scheduling, Argo Workflows, or advanced networking requirements.
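
The decision rule can be written down as a small function, which makes the thresholds explicit and easy to argue about. This is a sketch of the heuristics in this document (the Lambda branch borrows the "stateless, event-driven, <15min" test from the Quick Reference), not an AWS-provided tool; `Workload` and `choose_compute` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    stateless: bool
    event_driven: bool
    max_runtime_minutes: float
    distinct_services: int
    needs_cncf_ecosystem: bool = False


def choose_compute(w: Workload) -> str:
    # Short, stateless, event-driven work fits inside Lambda's 15-minute limit.
    if w.stateless and w.event_driven and w.max_runtime_minutes < 15:
        return "Lambda"
    # EKS earns its operational overhead at 20+ services or when CNCF tooling is required.
    if w.distinct_services >= 20 or w.needs_cncf_ecosystem:
        return "EKS"
    # Everything else defaults to the lower-overhead option.
    return "ECS Fargate"


print(choose_compute(Workload(True, True, 2, 3)))        # Lambda
print(choose_compute(Workload(False, False, 600, 5)))    # ECS Fargate
print(choose_compute(Workload(False, False, 600, 35)))   # EKS
```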

RDS Multi-AZ vs Aurora Global

RDS Multi-AZ:
  Primary (writes+reads) ──sync replication──► Standby (AZ-B)
  Failover: ~1-2 min (DNS flip), standby promotes

Aurora Cluster:
  Writer instance ──shared storage (6 copies, 3 AZs)──► Reader instance(s)
  Failover: ~30s (reader promotes), storage always consistent

Aurora Global:
  Primary region ──async replication (<1s lag)──► Secondary region(s)
  Use for: DR, read scaling across regions, near-local latency reads

Choose Aurora when: connection count > 1000 (pgBouncer helps RDS), need <30s failover, multi-region reads,
serverless auto-pause for dev/staging, or storage auto-scaling without pre-provisioning.

S3: Storage Classes and Lifecycle

Standard            → Hot data, frequent access ($0.023/GB-month)
Intelligent-Tiering → Unknown access patterns (monitoring fee + automatic tiering)
Standard-IA         → Infrequent access, retrieval fee, 30-day minimum
Glacier Instant     → Archive, ms retrieval, 90-day minimum
Glacier Flexible    → Archive, minutes-to-hours retrieval, 90-day minimum
Deep Archive        → Long-term archive, ~12hr retrieval, cheapest storage ($0.00099/GB-month), 180-day minimum

Lifecycle rule: Standard → Standard-IA (30d) → Glacier Instant (90d) → Deep Archive (365d)

{
  "Rules": [{
    "ID": "auto-archive",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER_IR"},
      {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
    ],
    "Expiration": {"Days": 2555},
    "NoncurrentVersionTransitions": [
      {"NoncurrentDays": 7, "StorageClass": "GLACIER_IR"}
    ],
    "NoncurrentVersionExpiration": {"NoncurrentDays": 90}
  }]
}
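
To sanity-check a rule like this before applying it, the transition schedule can be modelled in a few lines. The sketch below mirrors the current-version `Transitions` and `Expiration` values from the JSON above and reports which class an object sits in at a given age and how long it stays in each class, which is handy for comparing against minimum storage durations. The function names are hypothetical, and real lifecycle actions run asynchronously, so actual transition times can lag the configured day.

```python
from typing import Optional

# Values copied from the lifecycle rule above (current object versions only).
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555


def storage_class_at(age_days: int) -> Optional[str]:
    """Storage class of an object at the given age, or None once it has expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current


def dwell_days() -> dict:
    """Number of days an object spends in each class before moving on or expiring."""
    boundaries = [(0, "STANDARD"), *TRANSITIONS, (EXPIRATION_DAYS, None)]
    return {cls: nxt - start for (start, cls), (nxt, _) in zip(boundaries, boundaries[1:])}


for age in (0, 45, 120, 400, 3000):
    print(age, storage_class_at(age))
print(dwell_days())
```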

CloudFront with Lambda@Edge

Viewer Request  → Lambda@Edge (A/B testing, auth header injection, URL rewrites)
Origin Request  → Lambda@Edge (cache key manipulation, auth to origin)
Origin Response → Lambda@Edge (response header normalization, fallback origins)
Viewer Response → Lambda@Edge (security headers, cookie manipulation)

CloudFront Functions (cheaper, faster, limited):
  - JS only, <1ms execution, no network calls, viewer request/response events only
  - Use for: URL normalization, query string manipulation, simple auth

Always add security headers via a CloudFront Function attached to the viewer-response event:

// CloudFront Function: security-headers
function handler(event) {
  var response = event.response;
  var headers = response.headers;
  
  headers['strict-transport-security'] = { value: 'max-age=63072000; includeSubDomains; preload' };
  headers['x-content-type-options'] = { value: 'nosniff' };
  headers['x-frame-options'] = { value: 'DENY' };
  // Legacy header: modern browsers have removed the XSS auditor, and OWASP recommends disabling it
  headers['x-xss-protection'] = { value: '0' };
  headers['referrer-policy'] = { value: 'strict-origin-when-cross-origin' };
  headers['content-security-policy'] = { 
    value: "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'"
  };
  
  return response;
}

AWS Organizations and Cost Management

Root
  ├── Management Account (billing only, no workloads)
  ├── Security OU
  │     ├── Audit Account (CloudTrail, Config aggregator)
  │     └── Log Archive Account (centralized S3 logs)
  ├── Infrastructure OU
  │     └── Shared Services Account (Transit Gateway, Route 53, ECR)
  └── Workloads OU
        ├── Production OU → prod account(s)
        └── SDLC OU → dev/staging accounts

SCP: Prevent disabling CloudTrail

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyCloudTrailDisable",
    "Effect": "Deny",
    "Action": [
      "cloudtrail:DeleteTrail",
      "cloudtrail:StopLogging",
      "cloudtrail:UpdateTrail"
    ],
    "Resource": "*"
  }]
}

Anti-Patterns

  • S3 public-read on entire bucket: use CloudFront OAC (Origin Access Control) instead
  • Hardcoded credentials: use IAM roles, instance profiles, IRSA
  • Security groups with 0.0.0.0/0 ingress on port 22/3389: use SSM Session Manager
  • Single-AZ RDS in production: always Multi-AZ, even for cost-sensitive workloads
  • Lambda timeouts set to 15 minutes as default: set the minimum viable timeout + 20% buffer
  • All infrastructure in default VPC: default VPC has no isolation; create purpose-built VPCs
  • Missing resource tagging: you cannot do cost allocation, compliance, or automation without tags
  • IAM users with long-lived access keys: use roles, OIDC federation, or IAM Identity Center
  • CloudWatch alarms without actions: alarms that don't alert are theater
  • ECS tasks with task role = AdministratorAccess: scope task roles to exactly what the app needs
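
Several of these anti-patterns are easy to detect mechanically. As one example, the sketch below flags security group rules that expose SSH or RDP to the whole internet. It assumes rule dictionaries shaped like the `IpPermissions` entries returned by the EC2 `DescribeSecurityGroups` API; the sample group is made up, `open_admin_rules` is a hypothetical helper, and only TCP and all-protocol rules are checked.

```python
ADMIN_PORTS = (22, 3389)              # SSH and RDP
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}    # "anywhere" for IPv4 and IPv6


def open_admin_rules(security_group: dict) -> list:
    """Return (cidr, port) pairs where an admin port is reachable from anywhere."""
    findings = []
    for perm in security_group.get("IpPermissions", []):
        cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
        open_sources = [c for c in cidrs if c in OPEN_CIDRS]
        if not open_sources:
            continue
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and ports, so both admin ports are exposed.
            exposed = list(ADMIN_PORTS)
        elif protocol == "tcp":
            exposed = [p for p in ADMIN_PORTS if perm["FromPort"] <= p <= perm["ToPort"]]
        else:
            exposed = []
        findings.extend((cidr, port) for cidr in open_sources for port in exposed)
    return findings


# Made-up security group for illustration.
sample_group = {
    "GroupId": "sg-0123456789abcdef0",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 3389, "ToPort": 3389, "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ],
}

print(open_admin_rules(sample_group))  # [('0.0.0.0/0', 22)]
```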

Quick Reference

IAM Decision Tree:
  Human needs access?       → IAM Identity Center (SSO)
  EC2/ECS needs AWS access? → Instance profile / Task role
  Lambda needs AWS access?  → Execution role
  Cross-account access?     → Role assumption with ExternalId
  CI/CD pipeline?           → OIDC identity provider (GitHub Actions → OIDC → role)

VPC CIDR Planning (avoid overlaps with corporate/VPN):
  Dev:     10.0.0.0/16
  Staging: 10.1.0.0/16
  Prod:    10.2.0.0/16
  Shared:  10.100.0.0/16

Compute Decision Tree:
  Stateless, event-driven, <15min? → Lambda
  Long-running, simple container?  → ECS Fargate
  Complex orchestration, 20+ svcs? → EKS
  Batch processing, spot-friendly?  → ECS/EKS with Spot Instances or AWS Batch

Cost Quick Wins:
  1. Right-size EC2 (Compute Optimizer recommendations)
  2. S3 Intelligent-Tiering for unknown patterns
  3. Delete unattached EBS volumes (often forgotten)
  4. NAT Gateway → VPC endpoints for AWS API traffic
  5. Reserved Instances / Savings Plans for stable workloads (1yr = ~30% savings)
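
The CIDR plan above is worth validating in code before any VPC is created, since overlapping ranges block peering and Transit Gateway routing later. A minimal check using Python's standard `ipaddress` module is sketched below; the `corp-vpn` range is a hypothetical corporate network added only to show a conflict.

```python
from ipaddress import ip_network
from itertools import combinations

# Ranges from the VPC CIDR planning list above.
VPC_PLAN = {
    "dev": "10.0.0.0/16",
    "staging": "10.1.0.0/16",
    "prod": "10.2.0.0/16",
    "shared": "10.100.0.0/16",
}


def find_overlaps(named_cidrs: dict) -> list:
    """Return every pair of names whose CIDR ranges overlap."""
    networks = {name: ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


print(find_overlaps(VPC_PLAN))  # []: the plan itself is overlap-free

# Hypothetical corporate range that swallows the whole 10.x space.
print(find_overlaps({**VPC_PLAN, "corp-vpn": "10.0.0.0/8"}))
```

The second call reports a conflict between `corp-vpn` and each of the four VPCs, which is the situation the "avoid overlaps with corporate/VPN" note warns about.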
