networking-expert
Expert networking covering TCP internals, DNS resolution chain, HTTP/1.1 vs HTTP/2 vs HTTP/3, TLS handshake and certificate chains, load balancing algorithms, proxy patterns, CDN mechanics, WebSockets, and a comprehensive network troubleshooting toolkit with real command examples.
Networking Expert
Networking is the invisible substrate of every distributed system. When something fails in production,
it's networking more often than you think — not always at the socket level, but DNS timeouts, TLS
misconfigurations, connection pool exhaustion, and TIME_WAIT floods are the silent killers.
Core Mental Model
Every network interaction is a series of layers, and bugs live at specific layers. DNS failure is layer 7
logic, but it manifests as a connection timeout at layer 4. TLS failure happens at layer 6, but you see
it as a 502 from your load balancer. The debugging approach is always the same: start from the outside
(can the client reach the server at all?) and work inward (DNS → TCP → TLS → HTTP). Each tool in the
troubleshooting chain targets a specific layer. Master the handoff between tools and you'll find any
network problem in under 10 minutes.
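The outside-in walk can be scripted. A minimal sketch of the first two layers — `check_host` is a hypothetical helper, the hostname is a placeholder, and `/dev/tcp` requires bash:

```shell
# Outside-in check: stop at the first failing layer (DNS, then TCP).
check_host() {
  host=$1; port=${2:-443}
  getent hosts "$host" >/dev/null \
    || { echo "DNS FAIL: $host does not resolve"; return 1; }
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null \
    || { echo "TCP FAIL: $host:$port unreachable"; return 2; }
  echo "DNS+TCP OK for $host:$port"
}
check_host no-such-host.invalid || true  # .invalid is reserved to never resolve (RFC 2606)
```

The return code tells you which tool to reach for next: dig for DNS failures, mtr/tcpdump for TCP failures.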
TCP Deep Internals
Three-Way Handshake
Client Server
│─── SYN (seq=x) ────────────►│ Client wants to connect
│◄── SYN-ACK (seq=y, ack=x+1)─│ Server acknowledges, sends its seq
│─── ACK (ack=y+1) ───────────►│ Client acknowledges — connection ESTABLISHED
[Data transfer]
│─── FIN ──────────────────────►│ Client initiates close
│◄── ACK ───────────────────────│ Server acknowledges
│◄── FIN ───────────────────────│ Server sends its FIN
│─── ACK ──────────────────────►│ Client acknowledges → TIME_WAIT (2×MSL = 60-120s)
TIME_WAIT: Why It Exists and How to Handle It
TIME_WAIT prevents late duplicate packets from a closed connection from being misinterpreted as part of a new connection using the same 4-tuple (src IP, src port, dst IP, dst port). It lasts 2 × MSL (Linux uses MSL = 30s, so 60s total). At high connection rates you can exhaust ephemeral ports (default range 32768–60999, ~28K ports).
# Diagnosis
ss -o state time-wait | wc -l # Count TIME_WAIT connections
ss -o state time-wait # Details
# Solutions (pick by context)
# 1. Enable SO_REUSEADDR (most HTTP servers already do this)
# 2. Increase ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# 3. Enable tcp_tw_reuse (allows reusing TIME_WAIT for new outbound connections)
sysctl -w net.ipv4.tcp_tw_reuse=1
# 4. Use persistent connections (HTTP keep-alive) to avoid connection churn
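That ~28K-port pool sets a hard ceiling on outbound connection churn toward a single destination. A back-of-envelope calculation, assuming default settings:

```shell
# Ephemeral ports available with the default ip_local_port_range,
# divided by the TIME_WAIT hold time, gives the sustainable rate of
# new outbound connections to one (dst IP, dst port) pair.
ports=$(( 60999 - 32768 + 1 ))   # 28232 ports
tw=60                            # TIME_WAIT = 2 x MSL on Linux
echo "$(( ports / tw )) new connections/sec before port exhaustion"
# → 470 new connections/sec before port exhaustion
```

If your service opens more than ~470 connections/sec to one backend, keep-alive pooling isn't optional — it's the fix.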
TCP Tuning for High-Traffic Services
# /etc/sysctl.d/99-tcp-tuning.conf
# Increase listen backlog (default 128 is too low for busy servers)
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Keep-alive: detect dead connections faster
net.ipv4.tcp_keepalive_time = 300 # Start probing after 5 min idle
net.ipv4.tcp_keepalive_intvl = 30 # Probe every 30s
net.ipv4.tcp_keepalive_probes = 5 # Drop after 5 failed probes
# Reduce FIN_WAIT2 and TIME_WAIT lingering
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
# Increase receive/send buffer sizes for high-bandwidth connections
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# Congestion control (BBR is better for modern networks)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
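Before setting tcp_congestion_control=bbr, confirm the running kernel actually offers it. A sketch that degrades gracefully when /proc isn't readable:

```shell
# List the congestion control algorithms the kernel currently offers.
available=$(cat /proc/sys/net/ipv4/tcp_available_congestion_control \
              2>/dev/null || echo "unknown")
case " $available " in
  *" bbr "*) echo "bbr is available" ;;
  *)         echo "bbr not loaded (have: $available); try: modprobe tcp_bbr" ;;
esac
```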
DNS Resolution Chain
Application: "resolve api.example.com"
│
▼
Stub Resolver (/etc/resolv.conf, systemd-resolved)
│ cache miss
▼
Recursive Resolver (8.8.8.8, 1.1.1.1, or your VPC DNS)
│ cache miss
▼
Root Nameserver → "com. is at a.gtld-servers.net"
│
▼
TLD Nameserver (com.) → "example.com. is at ns1.cloudflare.com"
│
▼
Authoritative Nameserver → "api.example.com. IN A 93.184.216.34"
│
▼
Answer cached at each level with TTL
DNS Troubleshooting with dig
# Basic lookup
dig api.example.com
# Trace full resolution chain
dig +trace api.example.com
# Query specific nameserver
dig @8.8.8.8 api.example.com
# All record types (many servers now refuse ANY queries per RFC 8482)
dig api.example.com ANY
# Reverse DNS (PTR)
dig -x 93.184.216.34
# Check propagation (hit auth nameserver directly)
dig @ns1.cloudflare.com api.example.com +short
# Check TTL and authoritative answer
dig api.example.com +noall +answer +ttlunits
# MX records for email
dig example.com MX
# TXT records (SPF, DKIM, DMARC)
dig example.com TXT
dig _dmarc.example.com TXT
# DNSSEC validation
dig api.example.com +dnssec +short
HTTP/1.1 vs HTTP/2 vs HTTP/3
HTTP/1.1:
+ Universal compatibility
- Head-of-line blocking (one request/connection)
- 6 parallel connections per host (browser limit workaround)
- Text protocol overhead
HTTP/2 (TCP):
+ Multiplexing (many streams over one TCP connection)
+ Header compression (HPACK)
+ Server push (mostly unused/deprecated)
- TCP head-of-line blocking (one lost packet stalls all streams)
- TLS required in practice (browser enforcement)
HTTP/3 (QUIC over UDP):
+ No head-of-line blocking (streams are independent at transport layer)
+ 0-RTT connection resumption (repeat visitors pay 0 handshake cost)
+ Connection migration (phone switches networks, connection survives)
+ Built-in TLS 1.3
- Requires UDP (firewalls often block it)
- Less mature tooling
When to upgrade to HTTP/2: Always for server-to-server and browser APIs.
When to try HTTP/3: High-latency users, mobile apps, video streaming.
TLS Handshake and Certificate Chains
TLS 1.3 Handshake (1-RTT):
Client Server
│─── ClientHello (key share) ──────►│
│◄── ServerHello + Certificate ──────│
│◄── CertificateVerify + Finished ───│
│─── Finished ──────────────────────►│
│═══════════ Encrypted Data ══════════│
Certificate chain verification:
Your cert (example.com)
└── Signed by Intermediate CA (Let's Encrypt E1)
└── Signed by Root CA (ISRG Root X1)
└── In browser trust store → VALID
Common failure modes:
- Missing intermediate: server sends leaf only (browsers often recover via cached or AIA-fetched intermediates; strict clients and openssl verify fail)
- Expired cert: check BOTH leaf and intermediate
- Wrong hostname: CN or SAN doesn't match request hostname
- Clock skew: cert valid range check uses system clock
# Debug TLS with openssl s_client
openssl s_client -connect api.example.com:443 -servername api.example.com
openssl s_client -connect api.example.com:443 -tls1_3 # Force TLS 1.3
# Check cert expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Verify cert chain
openssl s_client -connect api.example.com:443 -showcerts 2>/dev/null | \
openssl x509 -noout -subject -issuer
# Test mutual TLS
openssl s_client -connect api.example.com:443 \
-cert client.crt -key client.key -CAfile ca.crt
# Check OCSP stapling
openssl s_client -connect api.example.com:443 -status 2>/dev/null | \
grep -A 20 "OCSP response"
Load Balancing Algorithms
| Algorithm | How It Works | Best For |
| Round Robin | Rotate through backends sequentially | Stateless, homogeneous backends |
| Least Connections | Send to backend with fewest active connections | Long-lived connections (WebSocket) |
| IP Hash | Hash client IP → always same backend | Session affinity without sticky cookies |
| Weighted Round Robin | Backends get weight-proportional traffic | Mixed capacity backends |
| Least Response Time | Send to fastest-responding backend | Heterogeneous backend performance |
| Consistent Hashing | Hash request key → minimal reshuffling when backends change | Cache servers, stateful backends |
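IP Hash from the table fits in a few lines — the property that matters is determinism: the same client IP always lands on the same backend. The backend count and the cksum-based hash are illustrative:

```shell
# Deterministic backend selection by client IP (3 backends assumed).
pick_backend() {
  n=3
  h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)   # CRC as a cheap stand-in hash
  echo "backend-$(( h % n ))"
}
pick_backend 203.0.113.7    # same IP → same backend on every call
pick_backend 198.51.100.42  # different IP may map elsewhere
```

Note the weakness: plain modulo reshuffles most keys when n changes. That reshuffle is exactly what Consistent Hashing avoids.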
# Nginx upstream with health checks and weights
upstream order_api {
least_conn; # Algorithm
server 10.0.1.10:8080 weight=3; # Gets 3x traffic
server 10.0.1.11:8080 weight=1;
server 10.0.1.12:8080 backup; # Only used when others fail
keepalive 32; # Connection pool to backends
keepalive_requests 1000;
keepalive_timeout 60s;
}
Proxy Patterns
Forward Proxy:
Client → [Forward Proxy] → Internet
Client configures the proxy explicitly.
Use: corporate filtering, anonymization, bypassing geo-blocks.
Reverse Proxy:
Internet → [Reverse Proxy] → Backend servers
Client doesn't know about the proxy.
Use: load balancing, TLS termination, caching, auth.
CONNECT Tunnel (HTTP proxy for HTTPS):
Client → CONNECT example.com:443 → Proxy
Proxy opens TCP tunnel. Client negotiates TLS through it.
The proxy can't see the encrypted content.
Transparent Proxy:
Traffic redirected to proxy without client knowledge (iptables/WCCP).
Used in enterprise networks for filtering.
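The CONNECT tunnel is just a plaintext request before the TLS bytes start flowing. A sketch that builds (but does not send) one, with placeholder credentials:

```shell
# Raw CONNECT request a client sends to an HTTP proxy; after the proxy
# replies "200 Connection established", raw TLS bytes flow through.
creds=$(printf 'user:pass' | openssl base64)   # Basic auth is base64, not encryption
printf 'CONNECT example.com:443 HTTP/1.1\r\nHost: example.com:443\r\nProxy-Authorization: Basic %s\r\n\r\n' "$creds"
```

This is why the proxy can log which hosts you reach but not what you say to them: it only ever sees the CONNECT line and opaque ciphertext.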
Network Troubleshooting Toolkit
# Layer by layer diagnostic:
# 1. DNS resolution
dig api.example.com +short
# → returns IP? Good. Times out? DNS problem.
# 2. TCP connectivity
curl -v --max-time 5 https://api.example.com/health 2>&1 | head -30
# Look for: "Trying X.X.X.X..." (DNS OK), "Connected" (TCP OK), "SSL handshake" (TLS)
# 3. Headers + TLS only (HEAD request, no response body)
curl --head --max-time 5 https://api.example.com
# 4. Network path (latency per hop)
mtr --report --report-cycles 10 api.example.com
traceroute -n api.example.com # Simpler, no loss stats
# 5. Port reachability (no TLS)
nc -zv api.example.com 443 # TCP connect test
nc -zv -w 3 api.example.com 443 # With 3s timeout
# 6. Packet capture
tcpdump -i eth0 -nn 'host api.example.com and port 443' -w /tmp/debug.pcap
# 7. HTTP timing breakdown
curl -w "\n%{time_namelookup} DNS\n%{time_connect} TCP\n%{time_appconnect} TLS\n\
%{time_pretransfer} start\n%{time_starttransfer} TTFB\n%{time_total} total\n" \
-o /dev/null -s https://api.example.com/
Debugging TLS Certificate Issues
# Check what certificate a server is presenting
echo Q | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null | \
openssl x509 -text -noout | grep -E "Subject:|DNS:|Not After"
# Verify your cert against CA
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt cert.pem
# Check cert expiry in monitoring (run e.g. every 15 min)
cert_expiry=$(echo | openssl s_client -connect "$HOST:443" -servername "$HOST" 2>/dev/null | \
  openssl x509 -noout -enddate | cut -d= -f2)
expiry_epoch=$(date -d "$cert_expiry" +%s)
now_epoch=$(date +%s)
days_left=$(( (expiry_epoch - now_epoch) / 86400 ))
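Wrapped as functions, the expiry date can be injected, which makes the alert logic testable without a live endpoint. The 14-day threshold is an arbitrary choice; `date -d` is GNU-specific:

```shell
# $1 is a notAfter string as produced by `openssl x509 -noout -enddate`.
cert_days_left() {
  expiry_epoch=$(date -d "$1" +%s)                   # GNU date
  echo $(( (expiry_epoch - $(date +%s)) / 86400 ))
}
# Exit non-zero (page someone) when expiry is within the threshold.
alert_if_expiring() {
  days=$(cert_days_left "$1")
  [ "$days" -ge "${2:-14}" ] || { echo "ALERT: cert expires in $days days"; return 1; }
  echo "OK: $days days left"
}
alert_if_expiring "Jan  1 00:00:00 2099 GMT" || true
```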
CDN Edge Caching Mechanics
User Request → PoP (Point of Presence) → Origin Server
Cache hierarchy:
Browser cache (private) → CDN edge (shared) → CDN shield/mid-tier → Origin
Cache control headers:
Cache-Control: public, max-age=31536000, immutable # Static assets (forever)
Cache-Control: public, max-age=300, stale-while-revalidate=60 # API responses
Cache-Control: private, no-cache # Auth-dependent responses
Cache-Control: no-store # Truly uncacheable (PII)
Vary: Accept-Encoding # Cache separate compressed version
CDN cache hit/miss debugging:
X-Cache: HIT from cloudfront (or MISS)
CF-Cache-Status: HIT | MISS | EXPIRED | STALE | DYNAMIC
Purge strategies:
URL purge: Purge specific path after deploy
Tag purge (Fastly/Cloudflare): Tag assets, purge by tag
Surrogate-Key: product-123 → purge all pages showing product 123
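The header semantics above reduce to a small decision. A simplified sketch of how an edge might derive a TTL from Cache-Control — real CDNs honour many more directives (s-maxage, Surrogate-Control, Vary):

```shell
# Simplified: private/no-store → never cache; else use max-age if present.
cache_ttl() {
  case "$1" in
    *no-store*|*private*) echo 0; return ;;
  esac
  ttl=$(printf '%s' "$1" | sed -n 's/.*max-age=\([0-9]*\).*/\1/p')
  echo "${ttl:-0}"
}
cache_ttl "public, max-age=300, stale-while-revalidate=60"   # → 300
cache_ttl "private, no-cache"                                # → 0
```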
WebSocket and SSE
# WebSocket proxying in Nginx
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
server {
location /ws/ {
proxy_pass http://websocket_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# WebSocket connections are long-lived
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
# Don't buffer WebSocket
proxy_buffering off;
}
}
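During the upgrade handshake the server proves it speaks WebSocket by hashing the client's Sec-WebSocket-Key with a GUID fixed by RFC 6455. The computation is small enough to check by hand:

```shell
# Sec-WebSocket-Accept = base64(SHA1(client key + RFC 6455 GUID)).
ws_accept() {
  printf '%s258EAFA5-E914-47DA-95CA-C5AB0DC85B11' "$1" \
    | openssl sha1 -binary | openssl base64
}
ws_accept "dGhlIHNhbXBsZSBub25jZQ=="   # → s3pPLMBiTxaQ9kYGzzhZRbK+xOo= (RFC 6455 worked example)
```

If your proxied WebSocket gets a 200 instead of "101 Switching Protocols", the Upgrade headers in the nginx config above were dropped somewhere along the chain.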
Anti-Patterns
❌ Hardcoded IP addresses — DNS exists for a reason; IPs change
❌ Ignoring TIME_WAIT — it's a symptom; check your connection pooling before tuning sysctl
❌ Missing SNI in TLS connections — -servername must match hostname for multi-cert servers
❌ Debugging HTTP/2 services with HTTP/1.1 tooling — pass --http2 to curl (or use grpcurl for gRPC); many debug tools still default to HTTP/1.1
❌ CDN caching auth-dependent responses — always Cache-Control: private or Vary: Authorization
❌ Trusting traceroute hop order — ECMP routing means different hops per packet; use mtr
❌ Ignoring DNS TTL during incidents — if you changed DNS, old TTL is still being served
❌ WebSocket timeout not increased — default proxy timeout (60s) kills long-lived connections
Quick Reference
curl debugging flags:
-v → verbose (headers, TLS handshake)
-k → ignore TLS verification (test only!)
--http2 → force HTTP/2
--http1.1 → force HTTP/1.1
-H "Host: foo.com" → override Host header (test virtual hosting)
--resolve HOST:PORT:IP → bypass DNS (test specific IP)
--cacert ca.pem → custom CA cert
-w "%{time_total}" → timing
dig flags:
+short → answer only
+trace → full delegation chain
+nocmd +noall +answer → clean answer section
@8.8.8.8 → query specific resolver
-x → reverse lookup
Port state meanings:
ESTABLISHED → active connection
LISTEN → server waiting for connections
TIME_WAIT → closed, waiting for late packets (60-120s)
CLOSE_WAIT → remote closed, local hasn't called close() (app bug if persists)
SYN_SENT → TCP handshake in progress (timeout = network/firewall issue)