claude-api-expert
Expert-level Anthropic Claude API usage: Messages API structure, model selection (Haiku vs Sonnet vs Opus), tool use with parallel calls, extended thinking, vision, streaming with content block events, prompt caching with cache_control, and context window management.
Claude API Expert
The Anthropic Messages API has several unique capabilities — extended thinking for hard reasoning tasks, prompt caching that can cut costs by 80%+, computer use, and a streaming format that differs meaningfully from OpenAI's. This skill covers expert-level usage patterns and the subtle gotchas that trip up production implementations.
Core Mental Model
Claude's API is built around the Messages paradigm: a conversation is an ordered list of user and assistant turns, with an optional system prompt. Unlike OpenAI's Assistants, Claude has no server-side thread management — you always send the full conversation history. The key cost-performance insight: prompt caching is extraordinarily valuable when you have a long system prompt or large context that repeats across requests. A 10,000-token system prompt cached costs the same to re-read as 10 tokens after the first write.
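To make the caching economics concrete, here is a rough cost model using the multipliers quoted in the Prompt Caching section below (writes ~1.25x base input price, reads ~0.1x). The per-token base price is an illustrative placeholder, not a real rate card:

```python
# Rough cost model for prompt caching. Multipliers follow the figures in this
# skill: the first request writes the cache at ~1.25x the base input price,
# subsequent requests read it at ~0.1x.
def caching_cost(prefix_tokens: int, n_requests: int,
                 base_price_per_token: float = 1.0) -> dict:
    """Compare total input cost for a repeated prefix with and without caching."""
    uncached = prefix_tokens * n_requests * base_price_per_token
    # One cache write, then (n_requests - 1) cache reads
    cached = prefix_tokens * base_price_per_token * (1.25 + 0.1 * (n_requests - 1))
    return {
        "uncached": uncached,
        "cached": cached,
        "savings_pct": round(100 * (1 - cached / uncached), 1),
    }

# A 10,000-token system prompt reused across 10 requests
print(caching_cost(10_000, 10))
```

The savings grow with request count, since the one-time write premium is amortized across every cache read.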
Model Selection
| Model | Best For | Context | Latency | Cost |
|---|---|---|---|---|
| claude-haiku-4-5 | High-volume, classification, extraction, simple Q&A | 200K | Fast (~1s) | Lowest |
| claude-sonnet-4-5 | Most production tasks — best quality/speed/cost balance | 200K | Medium (~3-8s) | Medium |
| claude-opus-4-5 | Maximum quality — complex reasoning, nuanced judgment | 200K | Slow (~10-30s) | Highest |
# Haiku: routing, classification, structured extraction
# Sonnet: coding, analysis, complex Q&A, agent reasoning steps
# Opus: peer review quality, PhD-level reasoning, sensitive judgment calls
def select_claude_model(task_complexity: str, latency_budget_ms: int) -> str:
    if task_complexity == "simple" or latency_budget_ms < 2000:
        return "claude-haiku-4-5"
    elif task_complexity == "complex" and latency_budget_ms > 20000:
        return "claude-opus-4-5"
    return "claude-sonnet-4-5"  # Default for most cases
Messages API Structure
import anthropic

client = anthropic.Anthropic()

# Basic completion
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system="You are an expert Python developer. Provide concise, idiomatic code.",
    messages=[
        {"role": "user", "content": "Write a function to parse ISO 8601 dates."},
    ],
)
print(message.content[0].text)
print(f"Tokens used: input={message.usage.input_tokens}, output={message.usage.output_tokens}")
# Multi-turn conversation
messages = []

def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system="You are a helpful assistant.",
        messages=messages,
    )
    assistant_text = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_text})
    return assistant_text
Tool Use (Function Calling)
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather conditions for a city. Use when asked about weather.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "country_code": {
                    "type": "string",
                    "description": "ISO 3166-1 alpha-2 country code (e.g., 'US', 'GB')",
                },
            },
            "required": ["city"],
        },
    },
    {
        "name": "search_database",
        "description": "Search internal database for records. Returns matching records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "table": {
                    "type": "string",
                    "enum": ["customers", "orders", "products"],
                },
                "limit": {"type": "integer", "default": 10},
            },
            "required": ["query", "table"],
        },
    },
]
def run_agent_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=tools,
            tool_choice={"type": "auto"},  # "any" forces tool use; {"type": "tool", "name": "X"} forces a specific tool
            messages=messages,
        )
        # Check stop reason
        if response.stop_reason == "end_turn":
            # Extract final text response
            text_blocks = [b for b in response.content if b.type == "text"]
            return text_blocks[-1].text if text_blocks else ""
        if response.stop_reason == "tool_use":
            # Add assistant's response (with tool_use blocks) to history
            messages.append({"role": "assistant", "content": response.content})
            # Execute all tool calls (may be multiple — parallel tool use)
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                try:
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
                except Exception as e:
                    # Return errors to the model — it can decide how to handle them
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": f"Error: {str(e)}",
                        "is_error": True,
                    })
            # Add all tool results as a single user turn
            messages.append({"role": "user", "content": tool_results})
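The loop above assumes an `execute_tool` helper. A minimal dispatch sketch — the handler bodies here are hypothetical stand-ins for your real weather and database calls:

```python
# Minimal dispatcher for the tools defined above. The handler bodies are
# hypothetical stand-ins — wire in your real implementations.
def get_weather(city: str, country_code: str = "US") -> dict:
    return {"city": city, "country_code": country_code, "temp_c": 21}

def search_database(query: str, table: str, limit: int = 10) -> list:
    return [{"table": table, "query": query, "rank": i} for i in range(min(limit, 3))]

TOOL_HANDLERS = {
    "get_weather": get_weather,
    "search_database": search_database,
}

def execute_tool(name: str, tool_input: dict):
    """Look up the handler by tool name and call it with the model's arguments."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Surface unknown tools as errors the loop can report back to the model
        raise ValueError(f"Unknown tool: {name}")
    return handler(**tool_input)
```

Keeping the dispatch table separate from the loop makes it easy to add tools without touching the agent logic.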
Extended Thinking
Extended thinking lets Claude reason through complex problems before responding. Use it for: math proofs, logic puzzles, complex coding, multi-step analysis.
# Extended thinking — costs extra tokens but dramatically improves accuracy
response = client.messages.create(
    model="claude-sonnet-4-5",  # thinking supported on Sonnet and Opus
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Max tokens for thinking (minimum 1,024)
        # Claude may use fewer — you're charged for actual tokens used
    },
    messages=[{
        "role": "user",
        "content": "Prove that sqrt(2) is irrational. Show all steps.",
    }],
)

# Response contains thinking blocks AND text blocks
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars]")
        # Optionally show thinking for debugging
    elif block.type == "text":
        print(block.text)

# Thinking blocks MUST be preserved in conversation history
# Include them when adding the assistant response to the messages list
messages.append({"role": "assistant", "content": response.content})

# Budget guidance:
# 1,024 tokens → Quick reasoning, structured problems
# 10,000 tokens → Complex multi-step reasoning
# 32,000 tokens → PhD-level problems, maximum accuracy
Streaming
import anthropic

client = anthropic.Anthropic()

# Streaming response
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Write a detailed analysis of..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
# Full streaming with event types (for tool use + thinking)
import json

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=tools,
    messages=messages,
) as stream:
    current_tool_input = {}
    for event in stream:
        match event.type:
            case "content_block_start":
                if event.content_block.type == "tool_use":
                    print(f"\n[Using tool: {event.content_block.name}]")
                    current_tool_input = {"id": event.content_block.id,
                                          "name": event.content_block.name,
                                          "input_json": ""}
            case "content_block_delta":
                if event.delta.type == "text_delta":
                    print(event.delta.text, end="", flush=True)
                elif event.delta.type == "input_json_delta":
                    # Accumulate tool input JSON
                    current_tool_input["input_json"] += event.delta.partial_json
            case "content_block_stop":
                if current_tool_input:
                    # Tools with no arguments may stream an empty JSON string
                    args = json.loads(current_tool_input["input_json"] or "{}")
                    print(f"\nTool args: {args}")
                    current_tool_input = {}
            case "message_stop":
                final_message = stream.get_final_message()
                print(f"\nStop reason: {final_message.stop_reason}")
# Async streaming — requires the async client
async def async_stream(messages: list):
    async_client = anthropic.AsyncAnthropic()
    async with async_client.messages.stream(...) as stream:
        async for text in stream.text_stream:
            yield text
Prompt Caching
Prompt caching is Claude's most impactful cost optimization — cache writes cost 25% more than base, but cache reads cost ~10% of base. Any prompt you send repeatedly with a large constant prefix is a candidate.
# Cache a large system prompt — saves ~90% on reads after the first write
system_with_cache = [
    {
        "type": "text",
        "text": """You are an expert customer service agent for AcmeCorp.
Here is our complete product catalog and pricing:
[10,000 tokens of product data, policies, FAQs...]
""",
        "cache_control": {"type": "ephemeral"},  # Cache this block
        # TTL: ~5 minutes by default, refreshed each time the cache is read
        # Cache is invalidated when the cached prefix changes
    }
]
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=system_with_cache,
    messages=[{"role": "user", "content": user_question}],
)

# Check cache performance in usage
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Regular input tokens: {response.usage.input_tokens}")
# Multi-turn with cached conversation history
def build_cacheable_messages(history: list[dict], new_message: str) -> list[dict]:
    """Mark the conversation history as cacheable; the new message stays uncached."""
    if not history:
        return [{"role": "user", "content": new_message}]
    # cache_control on the final history message caches the entire prefix up to
    # that point — earlier messages don't need their own markers
    cached_history = []
    for i, msg in enumerate(history):
        if i == len(history) - 1 and isinstance(msg["content"], str):
            # Last message — add cache_control
            cached_history.append({
                "role": msg["role"],
                "content": [
                    {
                        "type": "text",
                        "text": msg["content"],
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            })
        else:
            cached_history.append(msg)
    cached_history.append({"role": "user", "content": new_message})
    return cached_history
Vision
import base64
from pathlib import Path
# From URL (simpler, no upload needed)
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "url",
"url": "https://example.com/chart.png",
},
},
{"type": "text", "text": "Describe what this chart shows."},
],
}],
)
# From base64 (for local files or private images)
def encode_image(path: str) -> tuple[str, str]:
    """Returns (base64_data, media_type)"""
    path = Path(path)
    suffix_to_media_type = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif",
        ".webp": "image/webp",
    }
    media_type = suffix_to_media_type[path.suffix.lower()]
    b64_data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    return b64_data, media_type

b64_data, media_type = encode_image("screenshot.png")
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": b64_data,
                },
            },
            {"type": "text", "text": "What bugs do you see in this UI?"},
        ],
    }],
)
Error Handling
from anthropic import (
    RateLimitError,
    APIStatusError,
    APIConnectionError,
    APITimeoutError,
)
import time, random

def create_with_retry(max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # 429: Rate limit — exponential backoff with jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited, waiting {wait:.1f}s")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code == 529:
                # 529: Anthropic overloaded — retry with longer wait
                if attempt == max_retries - 1:
                    raise
                wait = 30 + random.uniform(0, 10)
                print(f"API overloaded (529), waiting {wait:.1f}s")
                time.sleep(wait)
            elif e.status_code in (400, 401, 403):
                raise  # Don't retry client errors
            else:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
        except (APIConnectionError, APITimeoutError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
Token Counting API
# Count tokens before sending (avoids surprises)
token_count = client.messages.count_tokens(
model="claude-sonnet-4-5",
system="You are a helpful assistant.",
messages=[
{"role": "user", "content": "What is machine learning?"},
],
)
print(f"Input tokens: {token_count.input_tokens}")
# Use this to:
# - Warn users before expensive requests
# - Decide whether to truncate conversation history
# - Validate requests before sending to avoid 400 errors
MAX_CONTEXT = 190_000 # Leave headroom for response
if token_count.input_tokens > MAX_CONTEXT:
# Trim conversation history
messages = trim_oldest_messages(messages)
Anti-Patterns
❌ Not preserving thinking blocks in conversation history
If you use extended thinking, the thinking blocks must be included when you add the assistant's response to the messages array. Stripping them causes errors or degraded behavior on the next turn.
❌ Using tool_choice: {"type": "any"} always
This forces the model to use a tool even when a text response is appropriate. Use auto unless you specifically need to force tool use.
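The three tool_choice shapes mentioned above, as request payload fragments:

```python
# tool_choice modes, as described in the tool-use section of this skill:
tool_choice_auto = {"type": "auto"}    # model decides; plain text answers allowed
tool_choice_any = {"type": "any"}      # model must call some tool
tool_choice_forced = {"type": "tool", "name": "get_weather"}  # must call this tool
```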
❌ Not caching long system prompts
If your system prompt is >1,000 tokens and you send it with every request, you're paying full price every time. Adding cache_control takes 3 lines and can reduce costs by 80%+.
❌ Sending tool_results as separate messages instead of combined
All tool results from a single assistant response must be combined into a single user-role message as a list of tool_result blocks. Sending them as separate messages causes API errors.
❌ Not handling stop_reason="max_tokens"
When output is truncated, the response is silently cut off mid-sentence. Always check stop_reason and either increase max_tokens or handle truncation gracefully.
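One way to handle truncation is a continuation loop. This sketch takes the API call as an injected `send` callable returning `(text, stop_reason)` so the pattern is testable without the network; in practice `send` would wrap `client.messages.create` from the earlier examples:

```python
# Sketch: retry with a "continue" turn while stop_reason == "max_tokens".
# `send` is any callable taking a messages list and returning (text, stop_reason).
def complete_with_continuation(send, prompt: str, max_rounds: int = 3) -> str:
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        text, stop_reason = send(messages)
        parts.append(text)
        if stop_reason != "max_tokens":
            break
        # Keep the truncated output in history and ask the model to resume
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)
```

Capping the rounds prevents runaway loops if the model keeps hitting the token limit.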
Quick Reference
Model selection:
Classification/extraction → claude-haiku-4-5 (fastest, cheapest)
Most production tasks → claude-sonnet-4-5 (best balance)
Maximum quality needed → claude-opus-4-5
Extended thinking budget:
Quick check → 1,024 tokens
Moderate reasoning → 5,000 tokens
Complex problem → 16,000 tokens
Maximum accuracy → 32,000 tokens
Prompt caching ROI:
Break-even → 2nd request (write costs 25% more; read costs 10%)
Payback on 10 calls → ~80% savings on input tokens
Best for → Large system prompts, document context, conversation history
Stop reasons:
end_turn → Normal completion
tool_use → Model wants to call a tool
max_tokens → Truncated — increase max_tokens
stop_sequence → Hit a custom stop sequence
Error codes:
400 → Bad request (check message format, tool definitions)
401 → Invalid API key
403 → Permission denied
429 → Rate limit (backoff and retry)
529 → Anthropic overloaded (wait 30s+, retry)