claude-api-expert
Expert-level Anthropic Claude API usage: Messages API structure, model selection (Haiku vs Sonnet vs Opus), tool use with parallel calls, extended thinking, vision, streaming with content block events, prompt caching with cache_control, and context window management.
Claude API Expert
The Anthropic Messages API has several unique capabilities — extended thinking for hard reasoning tasks, prompt caching that can cut costs by 80%+, computer use, and a streaming format that differs meaningfully from OpenAI's. This skill covers expert-level usage patterns and the subtle gotchas that trip up production implementations.
Core Mental Model
Claude's API is built around the Messages paradigm: a conversation is an ordered list of user and assistant turns, with an optional system prompt. Unlike OpenAI's Assistants, Claude has no server-side thread management — you always send the full conversation history. The key cost-performance insight: prompt caching is extraordinarily valuable when you have a long system prompt or large context that repeats across requests. A 10,000-token system prompt cached costs the same to re-read as 10 tokens after the first write.
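To make the caching economics concrete, here is a rough cost model using the multipliers quoted in the Prompt Caching section below (writes ~1.25x base input price, reads ~0.1x). The per-token base price is an illustrative placeholder, not a real rate card:

```python
# Rough cost model for prompt caching. Multipliers follow the figures in this
# skill: the first request writes the cache at ~1.25x the base input price,
# subsequent requests read it at ~0.1x.
def caching_cost(prefix_tokens: int, n_requests: int,
                 base_price_per_token: float = 1.0) -> dict:
    """Compare total input cost for a repeated prefix with and without caching."""
    uncached = prefix_tokens * n_requests * base_price_per_token
    # One cache write, then (n_requests - 1) cache reads
    cached = prefix_tokens * base_price_per_token * (1.25 + 0.1 * (n_requests - 1))
    return {
        "uncached": uncached,
        "cached": cached,
        "savings_pct": round(100 * (1 - cached / uncached), 1),
    }

# A 10,000-token system prompt reused across 10 requests
print(caching_cost(10_000, 10))
```

The savings grow with request count, since the one-time write premium is amortized across every cache read.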
Model Selection
| Model | Best For | Context | Latency | Cost |
|---|---|---|---|---|
| claude-haiku-4-5 | High-volume, classification, extraction, simple Q&A | 200K | Fast (~1s) | Lowest |
| claude-sonnet-4-5 | Most production tasks — best quality/speed/cost balance | 200K | Medium (~3-8s) | Medium |
| claude-opus-4-5 | Maximum quality — complex reasoning, nuanced judgment | 200K | Slow (~10-30s) | Highest |
# Haiku: routing, classification, structured extraction
# Sonnet: coding, analysis, complex Q&A, agent reasoning steps
# Opus: peer review quality, PhD-level reasoning, sensitive judgment calls
def select_claude_model(task_complexity: str, latency_budget_ms: int) -> str:
    if task_complexity == "simple" or latency_budget_ms < 2000:
        return "claude-haiku-4-5"
    elif task_complexity == "complex" and latency_budget_ms > 20000:
        return "claude-opus-4-5"
    return "claude-sonnet-4-5"  # Default for most cases
Messages API Structure
import anthropic

client = anthropic.Anthropic()

# Basic completion
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system="You are an expert Python developer. Provide concise, idiomatic code.",
    messages=[
        {"role": "user", "content": "Write a function to parse ISO 8601 dates."},
    ],
)
print(message.content[0].text)
print(f"Tokens used: input={message.usage.input_tokens}, output={message.usage.output_tokens}")
# Multi-turn conversation
messages = []

def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system="You are a helpful assistant.",
        messages=messages,
    )
    assistant_text = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_text})
    return assistant_text
Tool Use (Function Calling)
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather conditions for a city. Use when asked about weather.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "country_code": {
                    "type": "string",
                    "description": "ISO 3166-1 alpha-2 country code (e.g., 'US', 'GB')",
                },
            },
            "required": ["city"],
        },
    },
    {
        "name": "search_database",
        "description": "Search internal database for records. Returns matching records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "table": {
                    "type": "string",
                    "enum": ["customers", "orders", "products"],
                },
                "limit": {"type": "integer", "default": 10},
            },
            "required": ["query", "table"],
        },
    },
]
def run_agent_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=tools,
            tool_choice={"type": "auto"},  # "any" forces tool use; {"type": "tool", "name": "X"} forces a specific tool
            messages=messages,
        )
        # Check stop reason
        if response.stop_reason == "end_turn":
            # Extract final text response
            text_blocks = [b for b in response.content if b.type == "text"]
            return text_blocks[-1].text if text_blocks else ""
        if response.stop_reason == "tool_use":
            # Add assistant's response (with tool_use blocks) to history
            messages.append({"role": "assistant", "content": response.content})
            # Execute all tool calls (may be multiple — parallel tool use)
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                try:
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
                except Exception as e:
                    # Return errors to the model — it can decide how to handle them
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": f"Error: {str(e)}",
                        "is_error": True,
                    })
            # Add all tool results as a single user turn
            messages.append({"role": "user", "content": tool_results})
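The loop above assumes an `execute_tool` helper. A minimal dispatch sketch — the handler bodies here are hypothetical stand-ins for your real weather and database calls:

```python
# Minimal dispatcher for the tools defined above. The handler bodies are
# hypothetical stand-ins — wire in your real implementations.
def get_weather(city: str, country_code: str = "US") -> dict:
    return {"city": city, "country_code": country_code, "temp_c": 21}

def search_database(query: str, table: str, limit: int = 10) -> list:
    return [{"table": table, "query": query, "rank": i} for i in range(min(limit, 3))]

TOOL_HANDLERS = {
    "get_weather": get_weather,
    "search_database": search_database,
}

def execute_tool(name: str, tool_input: dict):
    """Look up the handler by tool name and call it with the model's arguments."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Surface unknown tools as errors the loop can report back to the model
        raise ValueError(f"Unknown tool: {name}")
    return handler(**tool_input)
```

Keeping the dispatch table separate from the loop makes it easy to add tools without touching the agent logic.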
Extended Thinking
Extended thinking lets Claude reason through complex problems before responding. Use it for: math proofs, logic puzzles, complex coding, multi-step analysis.
# Extended thinking — costs extra tokens but dramatically improves accuracy
response = client.messages.create(
    model="claude-sonnet-4-5",  # thinking supported on Sonnet and Opus
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Max tokens for thinking (minimum 1,024)
        # Claude may use fewer — you're charged for actual tokens used
    },
    messages=[{
        "role": "user",
        "content": "Prove that sqrt(2) is irrational. Show all steps.",
    }],
)

# Response contains thinking blocks AND text blocks
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking: {len(block.thinking)} chars]")
        # Optionally show thinking for debugging
    elif block.type == "text":
        print(block.text)

# Thinking blocks MUST be preserved in conversation history
# Include them when adding the assistant response to the messages list
messages.append({"role": "assistant", "content": response.content})

# Budget guidance:
# 1,024 tokens → Quick reasoning, structured problems
# 10,000 tokens → Complex multi-step reasoning
# 32,000 tokens → PhD-level problems, maximum accuracy
Streaming
import anthropic

client = anthropic.Anthropic()

# Streaming response
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Write a detailed analysis of..."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
# Full streaming with event types (for tool use + thinking)
import json

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=tools,
    messages=messages,
) as stream:
    current_tool_input = {}
    for event in stream:
        match event.type:
            case "content_block_start":
                if event.content_block.type == "tool_use":
                    print(f"\n[Using tool: {event.content_block.name}]")
                    current_tool_input = {"id": event.content_block.id,
                                          "name": event.content_block.name,
                                          "input_json": ""}
            case "content_block_delta":
                if event.delta.type == "text_delta":
                    print(event.delta.text, end="", flush=True)
                elif event.delta.type == "input_json_delta":
                    # Accumulate tool input JSON
                    current_tool_input["input_json"] += event.delta.partial_json
            case "content_block_stop":
                if current_tool_input:
                    # Tools with no arguments may stream an empty JSON string
                    args = json.loads(current_tool_input["input_json"] or "{}")
                    print(f"\nTool args: {args}")
                    current_tool_input = {}
            case "message_stop":
                final_message = stream.get_final_message()
                print(f"\nStop reason: {final_message.stop_reason}")
# Async streaming — requires the async client
async def async_stream(messages: list):
    async_client = anthropic.AsyncAnthropic()
    async with async_client.messages.stream(...) as stream:
        async for text in stream.text_stream:
            yield text
Prompt Caching
Prompt caching is Claude's most impactful cost optimization — cache writes cost 25% more than base, but cache reads cost ~10% of base. Any prompt you send repeatedly with a large constant prefix is a candidate.
# Cache a large system prompt — saves ~90% on reads after the first write
system_with_cache = [
    {
        "type": "text",
        "text": """You are an expert customer service agent for AcmeCorp.
Here is our complete product catalog and pricing:
[10,000 tokens of product data, policies, FAQs...]
""",
        "cache_control": {"type": "ephemeral"},  # Cache this block
        # TTL: ~5 minutes by default, refreshed each time the cache is read
        # Cache is invalidated when the cached prefix changes
    }
]
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=system_with_cache,
    messages=[{"role": "user", "content": user_question}],
)

# Check cache performance in usage
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Regular input tokens: {response.usage.input_tokens}")
# Multi-turn with cached conversation history
def build_cacheable_messages(history: list[dict], new_message: str) -> list[dict]:
    """Mark the conversation history as cacheable; the new message stays uncached."""
    if not history:
        return [{"role": "user", "content": new_message}]
    # cache_control on the final history message caches the entire prefix up to
    # that point — earlier messages don't need their own markers
    cached_history = []
    for i, msg in enumerate(history):
        if i == len(history) - 1 and isinstance(msg["content"], str):
            # Last message — add cache_control
            cached_history.append({
                "role": msg["role"],
                "content": [
                    {
                        "type": "text",
                        "text": msg["content"],
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            })
        else:
            cached_history.append(msg)
    cached_history.append({"role": "user", "content": new_message})
    return cached_history
Vision
import base64
from pathlib import Path
# From URL (simpler, no upload needed)
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "url",
"url": "https://example.com/chart.png",
},
},
{"type": "text", "text": "Describe what this chart shows."},
],
}],
)
# From base64 (for local files or private images)
def encode_image(path: str) -> tuple[str, str]:
    """Returns (base64_data, media_type)"""
    path = Path(path)
    suffix_to_media_type = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif",
        ".webp": "image/webp",
    }
    media_type = suffix_to_media_type[path.suffix.lower()]
    b64_data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    return b64_data, media_type

b64_data, media_type = encode_image("screenshot.png")
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": b64_data,
                },
            },
            {"type": "text", "text": "What bugs do you see in this UI?"},
        ],
    }],
)
Error Handling
from anthropic import (
    RateLimitError,
    APIStatusError,
    APIConnectionError,
    APITimeoutError,
)
import time, random

def create_with_retry(max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # 429: Rate limit — exponential backoff with jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited, waiting {wait:.1f}s")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code == 529:
                # 529: Anthropic overloaded — retry with longer wait
                if attempt == max_retries - 1:
                    raise
                wait = 30 + random.uniform(0, 10)
                print(f"API overloaded (529), waiting {wait:.1f}s")
                time.sleep(wait)
            elif e.status_code in (400, 401, 403):
                raise  # Don't retry client errors
            else:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
        except (APIConnectionError, APITimeoutError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
Token Counting API
# Count tokens before sending (avoids surprises)
token_count = client.messages.count_tokens(
model="claude-sonnet-4-5",
system="You are a helpful assistant.",
messages=[
{"role": "user", "content": "What is machine learning?"},
],
)
print(f"Input tokens: {token_count.input_tokens}")
# Use this to:
# - Warn users before expensive requests
# - Decide whether to truncate conversation history
# - Validate requests before sending to avoid 400 errors
MAX_CONTEXT = 190_000 # Leave headroom for response
if token_count.input_tokens > MAX_CONTEXT:
# Trim conversation history
messages = trim_oldest_messages(messages)
Anti-Patterns
❌ Not preserving thinking blocks in conversation history
If you use extended thinking, the thinking blocks must be included when you add the assistant's response to the messages array. Stripping them causes errors or degraded behavior on the next turn.
❌ Using tool_choice: {"type": "any"} always
This forces the model to use a tool even when a text response is appropriate. Use auto unless you specifically need to force tool use.
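The three tool_choice shapes mentioned above, as request payload fragments:

```python
# tool_choice modes, as described in the tool-use section of this skill:
tool_choice_auto = {"type": "auto"}    # model decides; plain text answers allowed
tool_choice_any = {"type": "any"}      # model must call some tool
tool_choice_forced = {"type": "tool", "name": "get_weather"}  # must call this tool
```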
❌ Not caching long system prompts
If your system prompt is >1,000 tokens and you send it with every request, you're paying full price every time. Adding cache_control takes 3 lines and can reduce costs by 80%+.
❌ Sending tool_results as separate messages instead of combined
All tool results from a single assistant response must be combined into a single user-role message as a list of tool_result blocks. Sending them as separate messages causes API errors.
❌ Not handling stop_reason="max_tokens"
When output is truncated, the response is silently cut off mid-sentence. Always check stop_reason and either increase max_tokens or handle truncation gracefully.
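One way to handle truncation is a continuation loop. This sketch takes the API call as an injected `send` callable returning `(text, stop_reason)` so the pattern is testable without the network; in practice `send` would wrap `client.messages.create` from the earlier examples:

```python
# Sketch: retry with a "continue" turn while stop_reason == "max_tokens".
# `send` is any callable taking a messages list and returning (text, stop_reason).
def complete_with_continuation(send, prompt: str, max_rounds: int = 3) -> str:
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        text, stop_reason = send(messages)
        parts.append(text)
        if stop_reason != "max_tokens":
            break
        # Keep the truncated output in history and ask the model to resume
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)
```

Capping the rounds prevents runaway loops if the model keeps hitting the token limit.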
Quick Reference
Model selection:
Classification/extraction → claude-haiku-4-5 (fastest, cheapest)
Most production tasks → claude-sonnet-4-5 (best balance)
Maximum quality needed → claude-opus-4-5
Extended thinking budget:
Quick check → 1,024 tokens
Moderate reasoning → 5,000 tokens
Complex problem → 16,000 tokens
Maximum accuracy → 32,000 tokens
Prompt caching ROI:
Break-even → 2nd request (write costs 25% more; read costs 10%)
Payback on 10 calls → ~80% savings on input tokens
Best for → Large system prompts, document context, conversation history
Stop reasons:
end_turn → Normal completion
tool_use → Model wants to call a tool
max_tokens → Truncated — increase max_tokens
stop_sequence → Hit a custom stop sequence
Error codes:
400 → Bad request (check message format, tool definitions)
401 → Invalid API key
403 → Permission denied
429 → Rate limit (backoff and retry)
529 → Anthropic overloaded (wait 30s+, retry)