
openai-api-expert

Expert-level OpenAI API usage: model selection (GPT-4o vs o1 vs o3-mini), Chat Completions vs Assistants API, function calling with parallel tools, structured outputs, streaming SSE, embeddings, vision, token counting, rate limit handling, Batch API, and fine-tuning.


OpenAI API Expert

The OpenAI API has a surprising number of sharp edges — model selection nuances, streaming edge cases, function calling subtleties, and cost optimization opportunities most developers miss. This skill covers the patterns that matter in production.

Core Mental Model

Choose your API and model based on the interaction pattern: Chat Completions for stateless single-turn or multi-turn calls where you manage history; Assistants for long-running, stateful threads with file management and code execution. For model selection, the key question is: "Does this task require multi-step reasoning, or just fast pattern matching?" Reasoning models (o1, o3) are worth their 5-10x cost premium only when the task genuinely benefits from thinking through steps.

Model Selection Guide

Model        Best For                                                  Context  Latency    Cost
gpt-4o-mini  High-volume tasks, classification, simple extraction     128K     Fast       Low
gpt-4o       Complex tasks requiring world knowledge + reasoning      128K     Medium     Medium
o3-mini      Math, coding, logical reasoning when accuracy matters    200K     Slow       Medium
o1           Maximum reasoning quality: complex analysis, PhD-level   200K     Very slow  High
o3           Frontier reasoning; reserve for when cost is irrelevant  200K     Slow       Very high
# Model selection decision tree
def select_model(task_type: str, latency_budget_ms: int, monthly_budget_usd: float) -> str:
    if task_type in ("classification", "extraction", "summarization"):
        return "gpt-4o-mini"  # 95% as good for these tasks, 10x cheaper
    
    if task_type in ("coding", "math", "logical_reasoning"):
        if latency_budget_ms > 30000:  # Can wait
            return "o3-mini"  # Much better at step-by-step reasoning
        return "gpt-4o"  # Faster, slightly worse at multi-step
    
    if task_type == "general_complex":
        return "gpt-4o"
    
    return "gpt-4o-mini"  # Default to cheap

Chat Completions vs Assistants API

from openai import OpenAI
client = OpenAI()

# CHAT COMPLETIONS — Stateless, you manage message history
# Use: most applications, when you need full control
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=1000,
    temperature=0.7,
)
print(response.choices[0].message.content)

# ASSISTANTS API — Stateful, OpenAI manages threads + context
# Use: long conversations, file uploads, code interpreter, retrieval
# Note: higher latency, less control, extra costs for tools and file storage
assistant = client.beta.assistants.create(
    name="Data Analyst",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}],
    instructions="Analyze data files and provide insights.",
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Analyze the attached CSV",
)

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
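
Once the run finishes, the assistant's reply lives in the thread; list its messages to read it back (newest first by default). A minimal sketch:

# Read the assistant's reply back out of the thread
if run.status == "completed":
    thread_messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(thread_messages.data[0].content[0].text.value)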

Function Calling (Tool Use)

import json
import concurrent.futures

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "default": "celsius",
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "num_results": {"type": "integer", "default": 5},
                },
                "required": ["query"],
            },
        },
    },
]

# Agentic loop with parallel tool calls
def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",  # "required" to force tool use, "none" to disable
        )
        
        message = response.choices[0].message
        messages.append(message)  # Add assistant message (with tool_calls if any)
        
        if message.tool_calls is None:
            return message.content  # Final answer — no more tool calls
        
        # Execute ALL tool calls in parallel (OpenAI may request multiple)
        tool_results = {}
        
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = {
                executor.submit(execute_tool, tc.function.name, json.loads(tc.function.arguments)): tc.id
                for tc in message.tool_calls
            }
            for future, tool_call_id in futures.items():
                tool_results[tool_call_id] = future.result()
        
        # Add all tool results to messages
        for tc in message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(tool_results[tc.id]),
            })

def execute_tool(name: str, args: dict):
    if name == "get_weather":
        return get_weather(**args)
    elif name == "search_web":
        return search_web(**args)
    raise ValueError(f"Unknown tool: {name}")
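
The loop assumes get_weather and search_web exist somewhere; hypothetical stubs to make the example runnable end to end:

# Hypothetical stubs; replace with real implementations
def get_weather(location: str, units: str = "celsius") -> dict:
    return {"location": location, "temp": 21, "units": units}

def search_web(query: str, num_results: int = 5) -> list:
    return [{"title": f"Stub result for {query}", "url": "https://example.com"}][:num_results]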

Structured Outputs

from pydantic import BaseModel
from typing import List, Literal

# JSON mode (response_format={"type": "json_object"}): guarantees valid JSON,
# but not your schema. AVOID unless you validate the output yourself.

# PREFERRED: structured outputs (strict schema enforcement, guaranteed valid JSON)
class NewsArticle(BaseModel):
    headline: str
    summary: str
    key_facts: List[str]
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # Must use this version or later for structured outputs
    messages=[
        {"role": "system", "content": "Extract structured information from news articles."},
        {"role": "user", "content": article_text},
    ],
    response_format=NewsArticle,
)

article: NewsArticle = response.choices[0].message.parsed
print(article.headline)
print(article.key_facts)

# Nested structures work too; submodels are enforced recursively
# (Finding and Recommendation fields here are illustrative)
class Finding(BaseModel):
    title: str
    detail: str

class Recommendation(BaseModel):
    action: str

class AnalysisReport(BaseModel):
    executive_summary: str
    findings: List[Finding]
    recommendations: List[Recommendation]
    risk_level: str

Streaming with SSE

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_chat(messages: list) -> str:
    """Stream response, accumulate, return full text."""
    full_response = ""
    
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=2000,
        stream=True,
    )
    async for chunk in stream:
        # Safely extract delta content (some chunks carry no choices)
        if chunk.choices and chunk.choices[0].delta.content:
            delta = chunk.choices[0].delta.content
            full_response += delta
            print(delta, end="", flush=True)  # Real-time output
        
        # Check finish reason: "length" means the response was cut off
        if chunk.choices and chunk.choices[0].finish_reason == "length":
            print("\n[TRUNCATED: increase max_tokens]")
    
    return full_response

# Streaming with tool calls (more complex: accumulate tool call deltas)
async def stream_with_tools(messages: list):
    tool_call_accumulator = {}
    
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        stream=True,
    )
    async for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        
        if delta.tool_calls:
            for tc_delta in delta.tool_calls:
                idx = tc_delta.index
                if idx not in tool_call_accumulator:
                    tool_call_accumulator[idx] = {
                        "id": tc_delta.id,
                        "name": tc_delta.function.name or "",
                        "arguments": "",
                    }
                if tc_delta.function.arguments:
                    tool_call_accumulator[idx]["arguments"] += tc_delta.function.arguments
        
        if chunk.choices[0].finish_reason == "tool_calls":
            # All tool call deltas received; execute them
            for tc in tool_call_accumulator.values():
                args = json.loads(tc["arguments"])
                result = execute_tool(tc["name"], args)
                print(f"Tool {tc['name']} result: {result}")

Embeddings

import numpy as np
from openai import OpenAI

# Model comparison:
# text-embedding-3-small: 1536 dims default, $0.02/1M tokens — use for most cases
# text-embedding-3-large: 3072 dims default, $0.13/1M tokens — use when quality matters
# text-embedding-ada-002: legacy, don't use for new projects

# Batch embedding (always batch — minimize API calls)
def embed_texts(texts: list[str], model="text-embedding-3-small", 
                dimensions=1536) -> np.ndarray:
    """Embed multiple texts in a single API call."""
    # OpenAI supports up to 2048 inputs per request
    batch_size = 2048
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Clean inputs: strip newlines, which can degrade embedding quality
        batch = [text.replace("\n", " ").strip() for text in batch]
        
        response = client.embeddings.create(
            model=model,
            input=batch,
            dimensions=dimensions,  # Matryoshka: can reduce without fine-tuning
        )
        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)
    
    return np.array(all_embeddings)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
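
A quick usage sketch with toy data, ranking documents against a query:

# Rank documents by similarity to a query (toy data)
docs = ["OpenAI releases a new model", "Pasta recipes for beginners"]
doc_vecs = embed_texts(docs)
query_vec = embed_texts(["latest AI model news"])[0]
ranked = sorted(zip((cosine_similarity(query_vec, v) for v in doc_vecs), docs), reverse=True)
print(ranked)  # Expect the AI headline to rank first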

# Dimension reduction (Matryoshka embeddings)
# text-embedding-3 models support truncating dimensions without retraining
small_dims_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input="Sample text",
    dimensions=256,  # Reduce from 3072 to 256 — 12x smaller, ~5% quality loss
)
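
Vectors requested with the dimensions parameter come back already normalized. If you instead slice stored full-size vectors yourself, re-normalize after truncating; a minimal sketch:

# Manual truncation of a stored full-size embedding: slice, then L2-renormalize
def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    v = vec[:dims]
    return v / np.linalg.norm(v)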

Token Counting with tiktoken

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Older tiktoken releases may not know newer model names
        enc = tiktoken.get_encoding("o200k_base")  # gpt-4o family encoding
    return len(enc.encode(text))

def count_messages_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Accurate message token count including per-message overhead."""
    enc = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # <|start|>{role}\n{content}<|end|>
    tokens_per_name = 1    # If name is present
    
    num_tokens = 3  # Reply priming tokens
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(enc.encode(str(value)))
            if key == "name":
                num_tokens += tokens_per_name
    
    return num_tokens

# Auto-trim conversation to fit context window
def trim_messages_to_fit(messages: list[dict], max_tokens: int = 100000, 
                          model: str = "gpt-4o") -> list[dict]:
    """Remove oldest messages (preserve system) until fits in context."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    
    while count_messages_tokens(system_msgs + non_system, model) > max_tokens:
        if len(non_system) <= 2:
            break  # Keep at least last exchange
        non_system.pop(0)  # Remove oldest non-system message
    
    return system_msgs + non_system

Rate Limit Handling

import time
import random
from openai import RateLimitError, APITimeoutError, InternalServerError

def chat_with_retry(messages: list, max_retries: int = 5, **kwargs):
    """Exponential backoff with jitter for rate limit handling."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(messages=messages, **kwargs)
        
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            
            # Honor the Retry-After header when the API provides one
            retry_after = None
            if getattr(e, "response", None) is not None:
                retry_after = e.response.headers.get("retry-after")
            if retry_after:
                wait = float(retry_after) + random.uniform(0, 1)
            else:
                # Exponential backoff: 1s, 2s, 4s, 8s, 16s + jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
            
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        
        except (APITimeoutError, InternalServerError) as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
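
You can also throttle proactively instead of waiting for 429s: every response carries x-ratelimit-* headers, readable through the SDK's with_raw_response wrapper. A minimal sketch:

# Inspect rate-limit headers before deciding whether to send more traffic
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(raw.headers.get("x-ratelimit-remaining-requests"))
print(raw.headers.get("x-ratelimit-reset-requests"))
completion = raw.parse()  # Recover the usual ChatCompletion object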

Batch API (50% Cost Reduction)

# Create batch file (JSONL: one request object per line)
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "user", "content": f"Classify sentiment: {text}"}
            ],
            "max_tokens": 10,
        },
    }
    for i, text in enumerate(texts_to_classify)
]

# Write to JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit
batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch",
)

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # Up to 24h processing time
)

# Poll for completion
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# Download results
if batch.status == "completed":
    result_file = client.files.content(batch.output_file_id)
    for line in result_file.text.splitlines():
        result = json.loads(line)
        print(f"{result['custom_id']}: {result['response']['body']['choices'][0]['message']['content']}")

Anti-Patterns

❌ Using json_object mode without schema validation
JSON mode only guarantees valid JSON, not your expected schema. The model can return any valid JSON. Always validate against your Pydantic model after parsing.
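
If you must use JSON mode, a minimal validation sketch (reusing the NewsArticle model from above, and assuming response came from a json_object-mode call; model_validate_json is pydantic v2):

from pydantic import ValidationError

raw = response.choices[0].message.content
try:
    article = NewsArticle.model_validate_json(raw)
except ValidationError as err:
    # Valid JSON, wrong shape: retry, repair, or reject
    print(f"Schema violation: {err}")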

❌ Not handling finish_reason="length"
When you hit max_tokens, responses are silently truncated. Always check finish_reason and increase max_tokens or summarize if truncated.
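
A minimal sketch of the check, retrying once with a larger budget:

# Detect truncation and retry once with more headroom
resp = client.chat.completions.create(model="gpt-4o", messages=messages, max_tokens=500)
if resp.choices[0].finish_reason == "length":
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, max_tokens=2000)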

❌ Embedding one text at a time
Each API call has fixed network and request overhead. Batch up to 2048 texts per request: token pricing is the same either way, but one-at-a-time embedding is dramatically slower and burns through your requests-per-minute limit.

❌ Using the Assistants API for simple use cases
Assistants adds meaningful latency (polling runs, thread overhead). For stateless Q&A, use Chat Completions — it's simpler and faster.

❌ Ignoring the Batch API for bulk processing
50% cost reduction on non-time-sensitive batch work. If you're classifying 10,000 documents, use the Batch API — it's trivial to implement.

Quick Reference

Model selection shortcut:
  Fast + cheap   → gpt-4o-mini
  Quality + fast → gpt-4o
  Hard reasoning → o3-mini (wait for it)
  Maximum quality → o1 (wait a lot for it)

Structured outputs vs JSON mode:
  JSON mode     → Valid JSON only, schema not enforced (avoid)
  Structured    → Schema enforced, guaranteed parse (use this)
  Model support → gpt-4o-2024-08-06 or later, or gpt-4o-mini

Embeddings:
  Most cases   → text-embedding-3-small (1536 dims)
  High quality → text-embedding-3-large (3072 dims)
  Smaller dims → add dimensions=256 to reduce (Matryoshka)

Cost optimization:
  Same prompt, many inputs  → Batch API (50% off)
  Long system prompt reuse  → Prompt caching (automatic, ~50% off cached input)
  Classification/extraction → gpt-4o-mini (10x cheaper than 4o)
