---
name: openai-api-expert
description: "Expert-level OpenAI API usage: model selection (GPT-4o vs o1 vs o3-mini), Chat Completions vs Assistants API, function calling with parallel tools, structured outputs, streaming SSE, embeddings, vision, token counting, rate limit handling, Batch API, and fine-tuning."
---

# OpenAI API Expert
The OpenAI API has a surprising number of sharp edges — model selection nuances, streaming edge cases, function calling subtleties, and cost optimization opportunities most developers miss. This skill covers the patterns that matter in production.
## Core Mental Model
Choose your API and model based on the interaction pattern: Chat Completions for stateless single-turn or multi-turn calls where you manage history; Assistants for long-running, stateful threads with file management and code execution. For model selection, the key question is: "Does this task require multi-step reasoning, or just fast pattern matching?" Reasoning models (o1, o3) are worth their 5-10x cost premium only when the task genuinely benefits from thinking through steps.
## Model Selection Guide

| Model | Best For | Context | Latency | Cost |
|---|---|---|---|---|
| gpt-4o-mini | High-volume tasks, classification, simple extraction | 128K | Fast | Low |
| gpt-4o | Complex tasks requiring world knowledge + reasoning | 128K | Medium | Medium |
| o3-mini | Math, coding, logical reasoning (when accuracy matters) | 200K | Slow | Medium |
| o1 | Maximum reasoning quality: complex analysis, PhD-level tasks | 200K | Very Slow | High |
| o3 | Frontier reasoning; only when cost is irrelevant | 200K | Slow | Very High |
```python
# Model selection decision tree
def select_model(task_type: str, latency_budget_ms: int, monthly_budget_usd: float) -> str:
    if task_type in ("classification", "extraction", "summarization"):
        return "gpt-4o-mini"  # ~95% as good for these tasks, ~10x cheaper
    if task_type in ("coding", "math", "logical_reasoning"):
        if latency_budget_ms > 30000:  # Can wait
            return "o3-mini"  # Much better at step-by-step reasoning
        return "gpt-4o"  # Faster, slightly worse at multi-step
    if task_type == "general_complex":
        return "gpt-4o"
    return "gpt-4o-mini"  # Default to cheap
```
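The cost side of that decision is simple arithmetic; a sketch with illustrative per-1M-token prices (the numbers in `PRICES` are assumptions for this example; check the current OpenAI pricing page before relying on them):

```python
# Illustrative (input, output) prices in USD per 1M tokens -- an assumption,
# not live pricing data.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough per-request cost: token counts times per-million prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

At these rates, a 100K-input / 10K-output call is about $0.35 on gpt-4o versus about $0.02 on gpt-4o-mini, which is why defaulting to the small model for routine tasks pays off.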
## Chat Completions vs Assistants API
```python
from openai import OpenAI

client = OpenAI()

# CHAT COMPLETIONS -- stateless; you manage message history.
# Use for most applications, when you need full control.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=1000,
    temperature=0.7,
)
print(response.choices[0].message.content)

# ASSISTANTS API -- stateful; OpenAI manages threads + context.
# Use for long conversations, file uploads, code interpreter, retrieval.
# Note: higher latency, less control, more expensive per token.
assistant = client.beta.assistants.create(
    name="Data Analyst",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}],
    instructions="Analyze data files and provide insights.",
)
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Analyze the attached CSV",
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
```
## Function Calling (Tool Use)
```python
import json

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "default": "celsius",
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "num_results": {"type": "integer", "default": 5},
                },
                "required": ["query"],
            },
        },
    },
]
```
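Unless you set `"strict": true` on a function definition, schema adherence is best effort: the model can occasionally omit a required field or invent an extra one. A minimal defensive check before dispatching arguments (the helper is a sketch, not part of the SDK):

```python
def check_required_args(args: dict, parameters_schema: dict) -> None:
    """Raise if model-generated arguments omit a field the schema requires."""
    missing = [key for key in parameters_schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")

# Same shape as the get_weather parameters schema above
schema = {
    "type": "object",
    "properties": {"location": {"type": "string"}},
    "required": ["location"],
}
check_required_args({"location": "Paris"}, schema)  # passes silently
```

Run this on `json.loads(tc.function.arguments)` before calling the real tool, and return the error text as the tool result so the model can retry with corrected arguments.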
```python
import concurrent.futures

# Agentic loop with parallel tool calls
def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",  # "required" to force tool use, "none" to disable
        )
        message = response.choices[0].message
        messages.append(message)  # Add assistant message (with tool_calls if any)
        if message.tool_calls is None:
            return message.content  # Final answer -- no more tool calls
        # Execute ALL tool calls in parallel (the model may request several at once)
        tool_results = {}
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = {
                executor.submit(
                    execute_tool, tc.function.name, json.loads(tc.function.arguments)
                ): tc.id
                for tc in message.tool_calls
            }
            for future, tool_call_id in futures.items():
                tool_results[tool_call_id] = future.result()
        # Add all tool results to messages
        for tc in message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(tool_results[tc.id]),
            })

def execute_tool(name: str, args: dict):
    if name == "get_weather":
        return get_weather(**args)
    if name == "search_web":
        return search_web(**args)
    raise ValueError(f"Unknown tool: {name}")
```
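The fan-out step in the loop above can be exercised without an API key. This sketch uses stub `get_weather`/`search_web` implementations (both hypothetical; real ones would call external services) to show the thread-pool dispatch in isolation:

```python
import concurrent.futures

# Stub tool implementations -- hypothetical, for demonstration only
def get_weather(location: str, units: str = "celsius") -> dict:
    return {"location": location, "temp": 21, "units": units}

def search_web(query: str, num_results: int = 5) -> list:
    return [f"result {i} for {query!r}" for i in range(num_results)]

TOOLS = {"get_weather": get_weather, "search_web": search_web}

def run_parallel(calls: list[tuple[str, dict]]) -> list:
    """Execute (name, args) tool calls concurrently, preserving input order."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[name], **args) for name, args in calls]
        return [f.result() for f in futures]

results = run_parallel([
    ("get_weather", {"location": "Paris"}),
    ("search_web", {"query": "Eiffel Tower", "num_results": 2}),
])
```

Because the tool calls are independent, total latency is the slowest tool rather than the sum, which matters when the model requests three or four slow calls at once.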
## Structured Outputs
```python
from typing import List, Literal
from pydantic import BaseModel

# JSON mode (response_format={"type": "json_object"}) only guarantees valid JSON,
# not your schema -- AVOID unless you validate manually.
# PREFERRED: structured outputs (strict schema enforcement, guaranteed-parseable JSON).

class NewsArticle(BaseModel):
    headline: str
    summary: str
    key_facts: List[str]
    sentiment: Literal["positive", "negative", "neutral"]  # Enforced by strict schema
    confidence: float

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # Structured outputs require this version or later
    messages=[
        {"role": "system", "content": "Extract structured information from news articles."},
        {"role": "user", "content": article_text},
    ],
    response_format=NewsArticle,
)
article: NewsArticle = response.choices[0].message.parsed
print(article.headline)
print(article.key_facts)

# Nested structures work too (Finding and Recommendation are Pydantic models
# defined the same way)
class AnalysisReport(BaseModel):
    executive_summary: str
    findings: List[Finding]
    recommendations: List[Recommendation]
    risk_level: str
```
## Streaming with SSE
```python
import json

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_chat(messages: list) -> str:
    """Stream response, accumulate, return full text."""
    full_response = ""
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=2000,
        stream=True,
    )
    async for chunk in stream:
        if not chunk.choices:  # e.g. trailing usage chunk
            continue
        choice = chunk.choices[0]
        # Safely extract delta content
        if choice.delta.content:
            delta = choice.delta.content
            full_response += delta
            print(delta, end="", flush=True)  # Real-time output
        # Check finish reason
        if choice.finish_reason == "length":
            print("\n[TRUNCATED -- increase max_tokens]")
    return full_response

# Streaming with tool calls (more complex -- accumulate tool call deltas)
async def stream_with_tools(messages: list):
    tool_call_accumulator = {}
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        stream=True,
    )
    async for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta.tool_calls:
            for tc_delta in delta.tool_calls:
                idx = tc_delta.index
                if idx not in tool_call_accumulator:
                    tool_call_accumulator[idx] = {
                        "id": tc_delta.id,
                        "name": tc_delta.function.name or "",
                        "arguments": "",
                    }
                if tc_delta.function.arguments:
                    tool_call_accumulator[idx]["arguments"] += tc_delta.function.arguments
        if chunk.choices[0].finish_reason == "tool_calls":
            # All tool call deltas received -- execute them
            for tc in tool_call_accumulator.values():
                args = json.loads(tc["arguments"])
                result = execute_tool(tc["name"], args)
                print(f"Tool {tc['name']} result: {result}")
```
## Embeddings
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Model comparison:
#   text-embedding-3-small: 1536 dims default, $0.02/1M tokens -- use for most cases
#   text-embedding-3-large: 3072 dims default, $0.13/1M tokens -- use when quality matters
#   text-embedding-ada-002: legacy; don't use for new projects

# Batch embedding (always batch -- minimize API calls)
def embed_texts(texts: list[str], model: str = "text-embedding-3-small",
                dimensions: int = 1536) -> np.ndarray:
    """Embed multiple texts in as few API calls as possible."""
    # OpenAI supports up to 2048 inputs per request
    batch_size = 2048
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Clean inputs -- literal newlines can degrade embedding quality
        batch = [text.replace("\n", " ").strip() for text in batch]
        response = client.embeddings.create(
            model=model,
            input=batch,
            dimensions=dimensions,  # Matryoshka: can reduce without fine-tuning
        )
        all_embeddings.extend(item.embedding for item in response.data)
    return np.array(all_embeddings)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dimension reduction (Matryoshka embeddings)
# text-embedding-3 models support truncating dimensions without retraining
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Sample text",
    dimensions=256,  # Reduce from 3072 to 256 -- 12x smaller, ~5% quality loss
)
small_dims_embedding = response.data[0].embedding  # list of 256 floats
```
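Once vectors are in hand, nearest-neighbor lookup over a small corpus is a single matrix-vector product; no vector database needed. A minimal in-memory top-k search (pure NumPy, independent of the API; the function name is this sketch's own):

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> list[int]:
    """Indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q  # Cosine similarity against every document at once
    return np.argsort(-sims)[:k].tolist()

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
top_k_similar(np.array([1.0, 0.05]), docs, k=2)  # -> [0, 2]
```

For corpora beyond a few hundred thousand vectors, swap this for an approximate index, but the brute-force version is exact and often fast enough.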
## Token Counting with tiktoken
```python
import tiktoken

def get_encoding(model: str):
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:  # Model not yet known to the installed tiktoken
        return tiktoken.get_encoding("o200k_base")  # gpt-4o family encoding

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    return len(get_encoding(model).encode(text))

def count_messages_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Approximate message token count including per-message overhead."""
    enc = get_encoding(model)
    tokens_per_message = 3  # <|start|>{role}\n{content}<|end|>
    tokens_per_name = 1     # If name is present
    num_tokens = 3          # Reply priming tokens
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(enc.encode(str(value)))
            if key == "name":
                num_tokens += tokens_per_name
    return num_tokens

# Auto-trim conversation to fit context window
def trim_messages_to_fit(messages: list[dict], max_tokens: int = 100000,
                         model: str = "gpt-4o") -> list[dict]:
    """Remove oldest non-system messages until the conversation fits."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    while count_messages_tokens(system_msgs + non_system, model) > max_tokens:
        if len(non_system) <= 2:
            break  # Keep at least the last exchange
        non_system.pop(0)  # Remove oldest non-system message
    return system_msgs + non_system
```
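The trimming logic is independent of the tokenizer, so it can be demonstrated with a stub "tokenizer" that counts whitespace-separated words (an assumption for the demo only). Note how the system message survives and the oldest turn drops first:

```python
def count_stub(messages: list[dict]) -> int:
    # Stand-in for count_messages_tokens: one "token" per whitespace word
    return sum(len(m["content"].split()) for m in messages)

def trim_to_fit(messages: list[dict], max_tokens: int) -> list[dict]:
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    while count_stub(system_msgs + non_system) > max_tokens and len(non_system) > 2:
        non_system.pop(0)  # Drop the oldest non-system message first
    return system_msgs + non_system

history = [
    {"role": "system", "content": "be brief"},            # 2 "tokens"
    {"role": "user", "content": "one two three"},         # 3
    {"role": "assistant", "content": "four five"},        # 2
    {"role": "user", "content": "six"},                   # 1
]
trimmed = trim_to_fit(history, max_tokens=5)  # Drops the oldest user turn
```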
## Rate Limit Handling
```python
import random
import time

from openai import APITimeoutError, InternalServerError, RateLimitError

def chat_with_retry(messages: list, max_retries: int = 5, **kwargs):
    """Exponential backoff with jitter for rate limit handling."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(messages=messages, **kwargs)
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Honor the server's retry-after hint if available
            retry_after = getattr(e, "retry_after", None)
            if retry_after:
                wait = retry_after + random.uniform(0, 1)
            else:
                # Exponential backoff: 1s, 2s, 4s, 8s, 16s + jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except (APITimeoutError, InternalServerError):
            if attempt == max_retries - 1:
                raise
            time.sleep((2 ** attempt) + random.uniform(0, 1))
```
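(The SDK also retries transient errors automatically; the `OpenAI` client accepts a `max_retries` argument.) Stripped of jitter, the wait schedule above is just capped doubling, which is easy to verify in isolation:

```python
def backoff_schedule(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Deterministic part of the wait schedule: base * 2^attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

backoff_schedule(5)  # -> [1.0, 2.0, 4.0, 8.0, 16.0]
backoff_schedule(8)  # caps at 60.0 once 2^attempt exceeds the cap
```

The cap matters in long-running workers: without it, a few consecutive failures produce multi-minute sleeps that look like a hang.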
## Batch API (50% Cost Reduction)
```python
import json
import time

# Create batch requests (one JSON object per request)
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "user", "content": f"Classify sentiment: {text}"}
            ],
            "max_tokens": 10,
        },
    }
    for i, text in enumerate(texts_to_classify)
]

# Write to JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # Up to 24h processing time
)

# Poll for completion
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired"):
        break
    time.sleep(60)

# Download results
if batch.status == "completed":
    result_file = client.files.content(batch.output_file_id)
    for line in result_file.text.splitlines():
        result = json.loads(line)
        content = result["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{result['custom_id']}: {content}")
```
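Batch output lines are not guaranteed to arrive in submission order, so results should be re-keyed by `custom_id` before use. A stdlib sketch over the output JSONL shape used above (the sample string is synthetic):

```python
import json

def index_batch_results(jsonl_text: str) -> dict[str, str]:
    """Map custom_id -> message content for batch response lines."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        body = row["response"]["body"]
        results[row["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Synthetic sample, deliberately out of order
sample = (
    '{"custom_id": "request-1", "response": {"body": {"choices": '
    '[{"message": {"content": "negative"}}]}}}\n'
    '{"custom_id": "request-0", "response": {"body": {"choices": '
    '[{"message": {"content": "positive"}}]}}}'
)
index_batch_results(sample)
```

In production, also check the batch's `error_file_id` for per-request failures; a completed batch can still contain individual errors.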
## Anti-Patterns
**❌ Using `json_object` mode without schema validation**

JSON mode only guarantees syntactically valid JSON, not your expected schema; the model can return any valid JSON. Always validate against your Pydantic model after parsing.
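A minimal stdlib version of that post-parse check (field names echo the `NewsArticle` example; in practice `NewsArticle.model_validate_json(raw)` does the same with better error messages):

```python
import json

# Required fields and their expected Python types -- illustrative, per-schema
REQUIRED_FIELDS = {"headline": str, "summary": str, "confidence": float}

def validate_article(raw: str) -> dict:
    """json.loads plus a type check on required fields."""
    data = json.loads(raw)  # Raises ValueError on invalid JSON
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"Bad or missing field: {field!r}")
    return data

validate_article('{"headline": "h", "summary": "s", "confidence": 0.9}')  # passes
```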
**❌ Not handling `finish_reason == "length"`**

When you hit `max_tokens`, responses are silently truncated. Always check `finish_reason` and increase `max_tokens` (or summarize) when truncated.

**❌ Embedding one text at a time**

Each API call has fixed overhead, and per-request rate limits bite quickly. Batch up to 2048 texts per request; token pricing is the same either way, but one-at-a-time embedding is orders of magnitude slower and burns your request quota.

**❌ Using the Assistants API for simple use cases**

The Assistants API adds meaningful latency (run polling, thread overhead). For stateless Q&A, use Chat Completions; it is simpler and faster.

**❌ Ignoring the Batch API for bulk processing**

The Batch API gives a 50% cost reduction on non-time-sensitive work. If you're classifying 10,000 documents, use it; it's trivial to implement.
## Quick Reference

**Model selection shortcut:**

- Fast + cheap → gpt-4o-mini
- Quality + fast → gpt-4o
- Hard reasoning → o3-mini (wait for it)
- Maximum quality → o1 (wait a lot for it)

**Structured outputs vs JSON mode:**

- JSON mode → valid JSON only, schema not enforced (avoid)
- Structured → schema enforced, guaranteed parse (use this)
- Structured outputs require model `gpt-4o-2024-08-06` or later

**Embeddings:**

- Most cases → text-embedding-3-small (1536 dims)
- High quality → text-embedding-3-large (3072 dims)
- Smaller vectors → pass `dimensions=256` to truncate (Matryoshka)

**Cost optimization:**

- Same prompt, many inputs → Batch API (50% off)
- Long system prompt reuse → prompt caching (50% off cached input tokens, applied automatically)
- Classification/extraction → gpt-4o-mini (~10x cheaper than gpt-4o)