incident-response

Complete incident response lifecycle: detection, triage, containment, eradication, recovery, and lessons learned. IR runbooks, forensic preservation, cloud-specific IR (CloudTrail, GuardDuty), communication templates, IOC hunting with SIEM queries, and tabletop exercise

Incident Response

A security incident handled well is a company stress test you survive. Handled poorly, it becomes a data breach disclosure, a regulatory fine, or a company-ending event. The difference between the two is almost always preparation — documented runbooks, practiced procedures, and clear communication chains — not technical sophistication.

Core Mental Model

The six-phase IR lifecycle (the SANS PICERL model) runs Preparation → Identification → Containment → Eradication → Recovery → Lessons Learned. NIST SP 800-61 Rev. 2 groups the same work into four phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Incident Activity. In a real incident, these phases overlap and loop back. Containment may reveal new scope that requires returning to identification. Eradication may trigger another containment step. Think of it as a cycle, not a waterfall. The most important phase is Preparation: everything you do before an incident happens.
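The loop-backs described above can be sketched as a small transition table. This is a minimal illustration, not part of any standard: the `Phase` enum, the `ALLOWED_TRANSITIONS` edge set, and both helper functions are hypothetical names, and the only backward edges included are the two the paragraph mentions.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = "Preparation"
    IDENTIFICATION = "Identification"
    CONTAINMENT = "Containment"
    ERADICATION = "Eradication"
    RECOVERY = "Recovery"
    LESSONS_LEARNED = "Lessons Learned"


# Forward edges follow the lifecycle order. The two backward edges are the
# loop-backs named in the text (an illustrative assumption, not a standard).
ALLOWED_TRANSITIONS = {
    Phase.PREPARATION: {Phase.IDENTIFICATION},
    Phase.IDENTIFICATION: {Phase.CONTAINMENT},
    Phase.CONTAINMENT: {Phase.ERADICATION, Phase.IDENTIFICATION},
    Phase.ERADICATION: {Phase.RECOVERY, Phase.CONTAINMENT},
    Phase.RECOVERY: {Phase.LESSONS_LEARNED},
    Phase.LESSONS_LEARNED: {Phase.PREPARATION},
}


def can_transition(current: Phase, target: Phase) -> bool:
    """Return True if moving from current to target is a permitted step."""
    return target in ALLOWED_TRANSITIONS[current]


def validate_timeline(phases: list[Phase]) -> bool:
    """Check that every consecutive pair in a recorded timeline is permitted."""
    return all(can_transition(a, b) for a, b in zip(phases, phases[1:]))
```

A recorded timeline that goes Identification, Containment, back to Identification, then Containment and Eradication passes `validate_timeline`, which is the point: returning to an earlier phase is expected, skipping phases is not.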

IR Lifecycle

Phase 1: PREPARATION (before incident)
  ✓ Document asset inventory and crown jewels
  ✓ Deploy detection: SIEM, EDR, CloudTrail logs
  ✓ Write and test runbooks
  ✓ Establish contact tree (legal, PR, exec, IR team)
  ✓ Practice with tabletop exercises quarterly

Phase 2: IDENTIFICATION
  ✓ Alert fires from SIEM / EDR / user report
  ✓ Triage: Is this a real incident? Severity? Scope?
  ✓ Declare incident and open incident channel
  ✓ Assign Incident Commander (IC) and Comms Lead

Phase 3: CONTAINMENT
  ✓ Preserve evidence BEFORE isolating or wiping
  ✓ Short-term: Stop the bleeding (network isolation, account lock)
  ✓ Long-term: Apply patches, rotate credentials, segment

Phase 4: ERADICATION
  ✓ Remove malware / malicious access
  ✓ Patch the vulnerability
  ✓ Harden the environment

Phase 5: RECOVERY
  ✓ Restore from clean backups
  ✓ Monitor closely for 72 hours
  ✓ Gradual service restoration

Phase 6: LESSONS LEARNED
  ✓ Post-incident review within 5 business days
  ✓ Root cause analysis
  ✓ Action items with owners and due dates

Triage Checklist

# Incident Triage — First 15 Minutes

**Incident ID:** INC-YYYY-NNN
**Declared:** [timestamp + timezone]
**Incident Commander:** [name]
**Comms Lead:** [name]

## Scope Assessment
- [ ] What systems are potentially affected?
  Systems: _______________
- [ ] What data may have been accessed?
  Data types: _______________
- [ ] What is the earliest possible compromise date?
  Est. start: _______________
- [ ] Is the attacker still active?
  Active: YES / NO / UNKNOWN

## Detection Source
- [ ] SIEM alert: [alert name]
- [ ] EDR detection: [detection]
- [ ] User report
- [ ] Third-party notification
- [ ] Automated scan finding

## Severity Classification
- P1 CRITICAL: Active breach, data exfiltration in progress, production down
- P2 HIGH: Confirmed breach, contained; sensitive data at risk
- P3 MEDIUM: Indicators of compromise, investigation ongoing
- P4 LOW: Security event, likely not a breach

**Current Severity:** ___

## Immediate Actions Required
- [ ] Open #incident-INC-YYYY-NNN Slack channel
- [ ] Notify IC chain per severity level
- [ ] Start forensic evidence collection NOW (before any remediation)
- [ ] Begin incident timeline log
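
The severity definitions in the checklist can be turned into a first-pass classifier so that two responders reach the same answer from the same triage inputs. The sketch below is one possible reading of the P1 to P4 definitions; `TriageAnswers` and `classify_severity` are hypothetical names, and treating an UNKNOWN attacker status as active is an assumption made here for caution.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageAnswers:
    breach_confirmed: bool            # evidence of unauthorized access
    attacker_active: Optional[bool]   # True, False, or None for UNKNOWN
    exfil_in_progress: bool
    production_down: bool
    iocs_found: bool


def classify_severity(t: TriageAnswers) -> str:
    """Map triage answers to P1-P4 using the definitions in the checklist."""
    # Assumption: an UNKNOWN attacker status is treated as active.
    possibly_active = t.attacker_active is not False
    if t.exfil_in_progress or t.production_down or (t.breach_confirmed and possibly_active):
        return "P1"
    if t.breach_confirmed:
        return "P2"  # confirmed breach, attacker known to be contained
    if t.iocs_found:
        return "P3"
    return "P4"
```

The result is a starting point for the Incident Commander, who can still raise or lower the severity by hand.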

Containment Runbook

Order matters: preserve evidence first, then isolate, then investigate.

# AWS Containment Runbook — Compromised EC2 Instance

# STEP 1: Snapshot everything BEFORE touching the instance
INSTANCE_ID="i-0abc123"
REGION="us-east-2"

# Create a forensic snapshot of every attached EBS volume, not only the first mapping
VOLUME_IDS=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].BlockDeviceMappings[].Ebs.VolumeId' \
  --output text --region "$REGION")

for VOLUME_ID in $VOLUME_IDS; do
  SNAPSHOT_ID=$(aws ec2 create-snapshot \
    --volume-id "$VOLUME_ID" \
    --description "FORENSIC: Incident INC-2024-042 - $VOLUME_ID - $(date -u +%Y%m%dT%H%M%SZ)" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=incident,Value=INC-2024-042},{Key=forensic,Value=true}]" \
    --query 'SnapshotId' --output text --region "$REGION")
  echo "Forensic snapshot created: $SNAPSHOT_ID (volume $VOLUME_ID)"
done

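# STEP 2: Capture instance memory (via SSM before isolation; assumes avml is installed)
# send-command is asynchronous: wait for it to finish, or the isolation in STEP 3
# will cut off the S3 upload
COMMAND_ID=$(aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["sudo avml /tmp/memory.lime && aws s3 cp /tmp/memory.lime s3://forensic-evidence-bucket/INC-2024-042/memory.lime"]' \
  --query 'Command.CommandId' --output text --region "$REGION")

# The waiter stops polling after a limited number of attempts; re-run it for large dumps
aws ssm wait command-executed \
  --command-id "$COMMAND_ID" --instance-id "$INSTANCE_ID" --region "$REGION"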

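# STEP 3: Isolate by applying a restrictive security group (deny all traffic)
# The group must be created in the instance's own VPC
VPC_ID=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].VpcId' --output text --region "$REGION")

ISOLATE_SG=$(aws ec2 create-security-group \
  --group-name "FORENSIC-ISOLATION-INC-2024-042" \
  --description "Blocks all traffic for forensic isolation" \
  --vpc-id "$VPC_ID" \
  --query 'GroupId' --output text --region "$REGION")

# A new group has no ingress rules but DOES get a default allow-all egress rule.
# Remove it (if the VPC has an IPv6 CIDR, also check for a ::/0 egress rule).
aws ec2 revoke-security-group-egress \
  --group-id "$ISOLATE_SG" --region "$REGION" \
  --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'

# Replace all of the instance's security groups with the isolation group
aws ec2 modify-instance-attribute \
  --instance-id "$INSTANCE_ID" \
  --groups "$ISOLATE_SG" --region "$REGION"

echo "Instance $INSTANCE_ID isolated with SG $ISOLATE_SG"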

# STEP 4: Revoke compromised IAM credentials
# Get the instance profile attached to the instance, then resolve its role
# (the profile name and the role name are not guaranteed to match)
PROFILE_NAME=$(aws ec2 describe-iam-instance-profile-associations \
  --filters "Name=instance-id,Values=$INSTANCE_ID" \
  --query 'IamInstanceProfileAssociations[0].IamInstanceProfile.Arn' \
  --output text --region "$REGION" | awk -F/ '{print $NF}')

ROLE_NAME=$(aws iam get-instance-profile \
  --instance-profile-name "$PROFILE_NAME" \
  --query 'InstanceProfile.Roles[0].RoleName' --output text)

# Deny every session whose credentials were issued before now
aws iam put-role-policy \
  --role-name "$ROLE_NAME" \
  --policy-name "INCIDENT-REVOKE-ALL" \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"*","Resource":"*","Condition":{"DateLessThan":{"aws:TokenIssueTime":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}}]}'

echo "Sessions issued before now are denied for role $ROLE_NAME"
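
The inline policy in STEP 4 is easy to get wrong because of the nested shell quoting. As a sketch, the same document can be generated in Python and passed to `aws iam put-role-policy`; `build_revoke_sessions_policy` is a hypothetical helper name, and the policy body mirrors the one in the runbook above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_revoke_sessions_policy(cutoff: Optional[datetime] = None) -> str:
    """Return an inline policy denying all actions for sessions issued before cutoff."""
    if cutoff is None:
        cutoff = datetime.now(timezone.utc)
    if cutoff.tzinfo is None:
        # A naive datetime would be silently interpreted as local time.
        raise ValueError("cutoff must be timezone-aware")
    cutoff_utc = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff_utc}},
            }
        ],
    }
    return json.dumps(policy)
```

Printing the return value to a file and passing it with `--policy-document file://...` avoids the quote juggling; recording the cutoff in the incident timeline makes it clear which sessions were cut off.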

SIEM Queries for IOC Hunting

-- Splunk: Detect lateral movement via unusual internal connections
index=vpc_flow action=ACCEPT 
| eval is_internal=if(match(dst_ip,"^10\.|^172\.(1[6-9]|2[0-9]|3[0-1])\.|^192\.168\."), 1, 0)
| stats count by src_ip, dst_ip, dst_port, is_internal
| where is_internal=1 AND count > 50
| sort -count

-- AWS CloudTrail: Detect successful privilege escalation calls
-- (AttachRolePolicy, CreateAccessKey, PutUserPolicy from unusual IAM)
index=cloudtrail eventSource=iam.amazonaws.com
  (eventName=AttachRolePolicy OR eventName=CreateAccessKey OR
   eventName=PutUserPolicy OR eventName=CreateLoginProfile)
| where 'userIdentity.type' != "AWSService"
| fillnull value="None" errorCode
| stats count by userIdentity.arn, eventName, sourceIPAddress, errorCode
| where errorCode="None"
| sort -count

-- GuardDuty: High-severity findings in last 24h
-- (via Athena on GuardDuty findings exported to S3)
SELECT 
  type,
  severity,
  title,
  description,
  json_extract_scalar(resource, '$.instanceDetails.instanceId') as instance_id,
  updatedAt
FROM guardduty_findings
WHERE severity >= 7.0
  AND updatedAt > date_add('hour', -24, now())
ORDER BY severity DESC;

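-- Okta: Impossible travel detection (login from geographically distant locations)
-- geo_distance() is a placeholder for a geo-IP distance function returning km;
-- date_diff() is Athena/Trino syntax and assumes published is a timestamp column
WITH logins AS (
  SELECT
    actor_id,
    actor_login,
    client_ip,
    outcome_result,
    published,
    LAG(client_ip) OVER (PARTITION BY actor_id ORDER BY published) AS prev_ip,
    LAG(published) OVER (PARTITION BY actor_id ORDER BY published) AS prev_published
  FROM okta_system_log
  WHERE event_type = 'user.session.start'
    AND outcome_result = 'SUCCESS'
)
SELECT *
FROM logins
WHERE prev_ip IS NOT NULL
  AND geo_distance(client_ip, prev_ip) > 500  -- km
  AND date_diff('minute', prev_published, published) < 120;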
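
The impossible-travel check relies on a distance function that the query leaves undefined. A minimal sketch of that logic, assuming each login has already been resolved to a latitude and longitude by a geo-IP lookup (not shown), uses the haversine formula. `haversine_km` and `is_impossible_travel` are hypothetical names; the 500 km and 120 minute thresholds come from the query.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0
MIN_DISTANCE_KM = 500.0
MAX_GAP = timedelta(minutes=120)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(prev, curr, min_km=MIN_DISTANCE_KM, max_gap=MAX_GAP) -> bool:
    """prev and curr are (timestamp, lat, lon) tuples for consecutive logins."""
    (t1, lat1, lon1), (t2, lat2, lon2) = prev, curr
    far_apart = haversine_km(lat1, lon1, lat2, lon2) > min_km
    close_in_time = abs(t2 - t1) < max_gap
    return far_apart and close_in_time
```

Geo-IP data is approximate and VPN exits move users around, so a hit is a lead for investigation, not proof of compromise.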

Forensic Log Collection

#!/bin/bash
# forensic_collect.sh — Collect volatile evidence before containment changes

INCIDENT="INC-2024-042"
OUTPUT_DIR="/forensic/${INCIDENT}/$(hostname)"
mkdir -p "$OUTPUT_DIR"

echo "[$(date -u)] Starting forensic collection for $INCIDENT" | tee "$OUTPUT_DIR/collection.log"

# 1. Running processes (volatile — collect first)
ps aux > "$OUTPUT_DIR/processes.txt"
ps auxf > "$OUTPUT_DIR/process_tree.txt"

# 2. Network connections (all sockets, not only listeners, so established sessions are captured)
netstat -anp > "$OUTPUT_DIR/netstat.txt" 2>&1
ss -tuanp > "$OUTPUT_DIR/ss.txt" 2>&1

# 3. Active logins
who > "$OUTPUT_DIR/who.txt"
last -F > "$OUTPUT_DIR/last.txt"
lastlog > "$OUTPUT_DIR/lastlog.txt"

# 4. Scheduled tasks (common persistence mechanism)
crontab -l > "$OUTPUT_DIR/crontab_root.txt" 2>&1
ls -la /etc/cron* > "$OUTPUT_DIR/cron_dirs.txt" 2>&1
cat /etc/cron.d/* >> "$OUTPUT_DIR/cron_dirs.txt" 2>&1
systemctl list-units --type=service > "$OUTPUT_DIR/systemd_services.txt"

# 5. Recent file modifications (last 7 days)
find /etc /usr /bin /sbin -mtime -7 -type f 2>/dev/null > "$OUTPUT_DIR/recent_modifications.txt"
find /tmp /var/tmp -type f 2>/dev/null -ls >> "$OUTPUT_DIR/recent_modifications.txt"

# 6. Auth logs
cp /var/log/auth.log "$OUTPUT_DIR/" 2>/dev/null
cp /var/log/secure "$OUTPUT_DIR/" 2>/dev/null

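# 7. Write the final log entry, then hash all collected files for chain of custody
#    (hashing after the last log write keeps the collection.log checksum valid)
echo "[$(date -u)] Collection complete, hashing and uploading" | tee -a "$OUTPUT_DIR/collection.log"
sha256sum "$OUTPUT_DIR"/* > "$OUTPUT_DIR/CHECKSUMS.sha256"

# 8. Upload to forensic evidence bucket (immutable, versioned)
#    S3 bucket names cannot contain uppercase letters, so the incident ID goes in the key prefix
aws s3 cp "$OUTPUT_DIR" "s3://forensic-evidence-bucket/${INCIDENT}/$(hostname)/" --recursive \
  --no-guess-mime-type \
  --metadata "incident=${INCIDENT},collected=$(date -u +%Y%m%dT%H%M%SZ),collector=$(whoami)"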
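
Chain of custody only holds if the checksums are re-verified when the evidence is pulled back for analysis. The sketch below re-checks a `sha256sum` manifest such as `CHECKSUMS.sha256` against files in a local directory. `sha256_of` and `verify_manifest` are hypothetical helpers; the `root` argument is an assumption added because the manifest written by the script records absolute paths from the original host.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large evidence files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, root: Optional[Path] = None) -> list[str]:
    """Return the manifest entries whose files are missing or no longer match."""
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, _, rest = line.partition(" ")
        listed = rest[1:]  # drop the mode marker: " " (text) or "*" (binary)
        # With root set, look the file up by name in the local evidence copy.
        target = (root / Path(listed).name) if root else Path(listed)
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(listed)
    return failures
```

An empty list means every listed file is present and unchanged. File names that `sha256sum` escapes (those containing backslashes or newlines) are not handled by this sketch.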

Communication Templates

# Internal Escalation (P1 Incident — send within 15 minutes)

**TO:** [CISO, CTO, Legal, CEO]
**SUBJECT:** [P1 SECURITY INCIDENT] INC-2024-042 — Active Investigation

We have declared a P1 security incident at [time] UTC.

**What we know:**
- Detection source: [GuardDuty / EDR / user report]
- Affected systems: [system names]
- Potential data exposure: [data types or "investigating"]
- Attacker status: [active / contained / unknown]

**Actions taken:**
- Incident Commander assigned: [Name]
- Systems isolated: [yes/no]
- Evidence preservation: [in progress / complete]

**Next update:** [time + 30 minutes] or sooner if material changes.

Incident channel: #incident-INC-2024-042
IC: [Name] | [phone]

---

# Regulatory Notification Template (GDPR — 72-hour deadline)

[Company] hereby notifies [supervisory authority] of a personal data breach pursuant to 
Article 33 of the GDPR.

**Nature of the breach:** Unauthorized access to [system] resulting in potential exposure of 
[data categories] affecting approximately [N] data subjects.

**Date of breach:** [date or "investigation ongoing"]
**Date discovered:** [date]
**Date of notification:** [date]

**Categories of personal data:** [names, emails, etc.]
**Approximate number of data subjects:** [N]
**Categories of recipients:** [internal / third parties if shared]

**Likely consequences:** [risk assessment]

**Measures taken:**
1. [Containment action]
2. [Remediation action]
3. [Prevention measure]

**Contact:** [DPO name, email, phone]
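
The templates carry three clocks: the 15-minute internal escalation window, the 30-minute update cadence, and the 72-hour GDPR window. A small helper can compute the concrete due times at declaration so nobody does date arithmetic under pressure. This is a sketch; the function and key names are hypothetical, and the durations are taken from the templates above.

```python
from datetime import datetime, timedelta, timezone

INTERNAL_ESCALATION = timedelta(minutes=15)  # P1 escalation window from the template
UPDATE_INTERVAL = timedelta(minutes=30)      # cadence stated under "Next update"
GDPR_WINDOW = timedelta(hours=72)            # Article 33 window, counted from awareness


def communication_deadlines(declared_at: datetime, aware_at: datetime) -> dict[str, datetime]:
    """Return due times for the internal escalation and the GDPR notification."""
    if declared_at.tzinfo is None or aware_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return {
        "internal_escalation_due": declared_at + INTERNAL_ESCALATION,
        "gdpr_notification_due": aware_at + GDPR_WINDOW,
    }


def next_update_due(last_update_at: datetime) -> datetime:
    """Return when the next status update is owed after the previous one."""
    return last_update_at + UPDATE_INTERVAL
```

Keeping every timestamp in UTC, as the escalation template already does, avoids confusion when responders sit in different time zones.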

Tabletop Exercise Design

# Tabletop Scenario: Ransomware via Phishing
Duration: 90 minutes | Participants: IR team, IT, legal, comms, exec

## Inject Timeline

T+0:00 — User reports their files have strange extensions
T+0:05 — EDR shows Emotet → Cobalt Strike → ransomware chain on 3 endpoints
T+0:10 — Business asks: should we pay the ransom?

**Discussion 1:** What is your immediate containment action?

T+0:20 — Backup systems found encrypted (attacker had 14-day dwell time)
T+0:25 — PR receives press inquiry from reporter

**Discussion 2:** Who approves the PR response? What do you say?

T+0:40 — Legal confirms customer PII was on compromised systems

**Discussion 3:** What is your GDPR/CCPA notification timeline and obligation?

T+0:55 — Attacker posts sample data on darkweb forum

**Discussion 4:** How does this change your response strategy?

## Questions to Drive Discussion
- Who has authority to isolate production systems?
- What's the process for notifying regulators in each jurisdiction?
- At what point do we engage an external IR firm?
- How do we communicate with customers before we know full scope?
- What evidence must we preserve for law enforcement?
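
For the facilitator, the inject timeline can be converted from `T+h:mm` offsets into wall-clock times once the start time is known. The sketch below assumes the offsets use the format shown in the scenario; `parse_offset` and `schedule` are hypothetical names, and the inject text is copied from the timeline above.

```python
import re
from datetime import datetime, timedelta

INJECTS = [
    ("T+0:00", "User reports their files have strange extensions"),
    ("T+0:05", "EDR shows Emotet → Cobalt Strike → ransomware chain on 3 endpoints"),
    ("T+0:10", "Business asks: should we pay the ransom?"),
    ("T+0:20", "Backup systems found encrypted (attacker had 14-day dwell time)"),
    ("T+0:25", "PR receives press inquiry from reporter"),
    ("T+0:40", "Legal confirms customer PII was on compromised systems"),
    ("T+0:55", "Attacker posts sample data on darkweb forum"),
]


def parse_offset(label: str) -> timedelta:
    """Convert a 'T+h:mm' label into a timedelta."""
    match = re.fullmatch(r"T\+(\d+):(\d{2})", label)
    if match is None:
        raise ValueError(f"unrecognized inject offset: {label!r}")
    return timedelta(hours=int(match.group(1)), minutes=int(match.group(2)))


def schedule(start: datetime, injects=INJECTS) -> list[tuple[datetime, str]]:
    """Return (wall-clock time, inject text) pairs for an exercise starting at start."""
    return [(start + parse_offset(label), text) for label, text in injects]
```

Printing the schedule before the session gives the facilitator a run sheet and shows how much of the 90 minutes is left for discussion after the final inject.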

Anti-Patterns

❌ Remediating before preserving evidence
The instinct is to patch and clean immediately. This destroys forensic evidence. Always take snapshots, dump memory, and collect logs before any remediation action.

❌ No pre-approved communication templates
During an incident, you don't have time to write communications from scratch. Legal approval takes hours. Pre-approve templates for all scenarios before an incident.

❌ IC trying to do everything
The IC coordinates and does not execute. Assign specific roles: forensics lead, comms lead, legal liaison, exec briefer. An IC who does not delegate becomes a bottleneck.

❌ Not practicing with tabletop exercises
Incident response is a skill that degrades without practice. Teams that have never run a tabletop exercise will make basic coordination mistakes in a real incident.

❌ Declaring victory too early
Attackers frequently maintain persistence after initial remediation. Monitor for 72 hours after "eradication." A missed persistence mechanism commonly leads to re-compromise in the following weeks.

Quick Reference

Severity levels:
  P1 CRITICAL → Active breach, data exfil, production down → IC + exec NOW
  P2 HIGH     → Confirmed breach, contained → IC + legal within 1h
  P3 MEDIUM   → IOCs found, investigation → IC + IR team
  P4 LOW      → Security event, no breach → IR team

Containment order:
  1. Preserve evidence (snapshot, memory dump, logs)
  2. Isolate (network block, account disable)
  3. Investigate (forensics on preserved evidence)
  NEVER: remediate before preserving

Regulatory timelines:
  GDPR  → 72 hours after becoming aware
  CCPA  → California breach law (Civ. Code §1798.82): "most expedient time possible and without unreasonable delay"
  HIPAA → No later than 60 calendar days after discovery

Evidence preservation:
  EC2: EBS snapshot → memory dump via avml → VPC flow logs
  SaaS: Export audit logs immediately (often 90-day retention)
  Endpoints: EDR telemetry, process dump, disk image

Skill Information

Source: MoltbotDen
Category: Security & Passwords