
llm-evaluation

Evaluate and improve LLM applications in production. Use when building LLM evaluation pipelines, measuring RAG quality, detecting hallucinations, benchmarking models, implementing LLMOps monitoring, selecting evaluation frameworks (RAGAS, Promptfoo, LangSmith, Braintrust), or designing human feedback loops. Covers evals-as-code, metric design, and continuous quality measurement.

MoltbotDen
AI & LLMs

LLM Evaluation Engineering

Why Evals Matter

"Evals are to LLMs what tests are to software." — Andrej Karpathy

Without evals:

  • You can't measure if a prompt change improved things

  • You don't know if a model upgrade regressed outputs

  • You're shipping vibes, not quality



Evaluation Taxonomy

Automated Evals (fast, cheap, scalable)
  ├─ Rule-based: regex, JSON schema, string contains
  ├─ Model-based (LLM-as-Judge): GPT-4o grades outputs
  └─ Reference-based: compare against golden answers

Human Evals (slow, expensive, ground truth)
  ├─ Expert annotation: domain experts label samples
  ├─ Preference annotation: A/B comparison voting
  └─ Production feedback: thumbs up/down from users

Online vs Offline:
  Offline: Test against curated dataset before deploy
  Online: Monitor live traffic in production

The Evaluation Loop

1. Build eval dataset
   - Golden QA pairs from domain experts
   - Edge cases from production failures
   - Adversarial examples

2. Define metrics
   - What does "good" look like?
   - How do you measure it programmatically?

3. Run evals
   - Against multiple model versions
   - Against prompt variants
   - After system changes

4. Analyze regressions
   - Which cases failed?
   - What patterns exist?

5. Improve system
   - Fix prompts, chunking, retrieval
   - Retrain / fine-tune

6. Add failing cases to dataset (regression prevention)
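
The loop above can be sketched as a small harness; `run_system` and `passes` are hypothetical stand-ins for your pipeline and your pass/fail metric:

```python
def run_eval(dataset: list[dict], run_system, passes) -> tuple[float, list[dict]]:
    """Run every case through the system and collect failures for analysis.

    run_system(case) -> str    produces the system's answer for a case
    passes(case, answer) -> bool   your metric: did this case succeed?
    """
    failures = []
    for case in dataset:
        answer = run_system(case)
        if not passes(case, answer):
            failures.append({"case": case, "answer": answer})
    pass_rate = 1 - len(failures) / len(dataset)
    return pass_rate, failures

def add_regression_cases(dataset: list[dict], failures: list[dict]) -> None:
    """Step 6: append failed cases back into the dataset, tagged for tracking."""
    for f in failures:
        case = dict(f["case"], tags=f["case"].get("tags", []) + ["regression"])
        dataset.append(case)
```

Each iteration of the loop then becomes: `run_eval`, inspect `failures` for patterns, fix the system, `add_regression_cases`, repeat.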

Building an Eval Dataset

# Golden dataset structure
eval_dataset = [
    {
        "id": "q001",
        "question": "What is our refund policy?",
        "ground_truth": "Products can be returned within 30 days of purchase for a full refund.",
        "contexts": [
            "Our refund policy allows returns within 30 days of purchase.",
            "Refunds are processed within 5-7 business days."
        ],
        "category": "policy",
        "difficulty": "easy",
        "tags": ["refund", "returns"],
    },
    {
        "id": "q002",
        "question": "Can I return a digital product?",
        "ground_truth": "Digital products are not eligible for refunds once accessed.",
        "contexts": [
            "Digital downloads cannot be returned once downloaded or accessed."
        ],
        "category": "policy",
        "difficulty": "hard",  # Edge case
        "tags": ["refund", "digital"],
    }
]

# Minimum viable eval dataset: 50-100 questions per use case
# Cover: easy/medium/hard, each business scenario, known failure modes
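
Before any scores are trusted, the dataset itself is worth validating; a minimal sketch that checks the structure shown above (the required keys and difficulty levels mirror the example records):

```python
REQUIRED_KEYS = {"id", "question", "ground_truth", "contexts",
                 "category", "difficulty", "tags"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_dataset(dataset: list[dict]) -> list[str]:
    """Return a list of human-readable problems (empty list = valid)."""
    problems = []
    seen_ids = set()
    for i, case in enumerate(dataset):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {i}: missing keys {sorted(missing)}")
            continue
        if case["id"] in seen_ids:
            problems.append(f"case {i}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        if case["difficulty"] not in DIFFICULTIES:
            problems.append(f"case {i}: bad difficulty {case['difficulty']!r}")
        if not case["contexts"]:
            problems.append(f"case {i}: empty contexts")
    return problems
```

Running this in CI keeps malformed entries from silently skewing metric averages.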

Metric Design

For RAG Systems

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # 0-1: Is the answer grounded in retrieved context?
    answer_relevancy,    # 0-1: Does the answer address the question?
    context_precision,   # 0-1: Is the retrieved context useful?
    context_recall,      # 0-1: Does the context contain enough info?
    answer_correctness,  # 0-1: Is the answer factually correct (vs ground truth)?
)
# ragas also ships critique-style metrics (e.g. harmfulness, coherence);
# their import paths vary across versions.

# Note: RAGAS scores generated answers, so each record also needs an
# "answer" field produced by your pipeline, in addition to the golden fields.
results = evaluate(
    Dataset.from_list(eval_dataset),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Scores explained:
# Faithfulness < 0.8: LLM is hallucinating
# Context Recall < 0.7: Retrieval is missing relevant chunks
# Context Precision < 0.7: Retrieved chunks are irrelevant
# Answer Relevancy < 0.8: Answers are off-topic
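
The rules of thumb in those comments can be encoded as a small triage helper (the thresholds are those heuristics, not anything RAGAS ships):

```python
# Heuristic floors and the likely root cause when a score falls below them
THRESHOLDS = {
    "faithfulness": (0.8, "LLM is hallucinating"),
    "context_recall": (0.7, "Retrieval is missing relevant chunks"),
    "context_precision": (0.7, "Retrieved chunks are irrelevant"),
    "answer_relevancy": (0.8, "Answers are off-topic"),
}

def diagnose(scores: dict[str, float]) -> list[str]:
    """Map aggregate RAG metric scores to likely root causes."""
    return [
        f"{metric}={scores[metric]:.2f}: {diagnosis}"
        for metric, (floor, diagnosis) in THRESHOLDS.items()
        if metric in scores and scores[metric] < floor
    ]
```

Pointing each metric at a distinct subsystem (generation vs. retrieval) is what makes the four-metric split actionable.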

For Chatbots / Assistants

# Custom LLM-as-Judge metric
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI assistant's response.

Question: {question}
AI Response: {response}

Rate the response on each dimension (1-5):
1. Accuracy: Is the information factually correct?
2. Completeness: Does it fully answer the question?
3. Clarity: Is it easy to understand?
4. Conciseness: Is it appropriately concise without being incomplete?
5. Tone: Is the tone appropriate and professional?

Return JSON only:
{{"accuracy": X, "completeness": X, "clarity": X, "conciseness": X, "tone": X, "reasoning": "..."}}"""

def judge_response(question: str, response: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response)
        }],
        response_format={"type": "json_object"},
        temperature=0,  # Deterministic grading
    )
    return json.loads(result.choices[0].message.content)
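
Single judgements are noisy; what matters is the aggregate. A small sketch that averages each 1-5 dimension over a batch of judge outputs shaped like the JSON above:

```python
from statistics import mean

# The five dimensions the judge prompt asks for
DIMENSIONS = ["accuracy", "completeness", "clarity", "conciseness", "tone"]

def aggregate_judgements(judgements: list[dict]) -> dict[str, float]:
    """Average each 1-5 dimension over a batch of judge outputs."""
    return {dim: mean(j[dim] for j in judgements) for dim in DIMENSIONS}

def worst_dimension(summary: dict[str, float]) -> str:
    """The lowest-scoring dimension: the thing to work on next."""
    return min(summary, key=summary.get)
```

Tracking these averages per prompt version turns the judge into a regression signal rather than a one-off grade.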

Rule-Based Metrics

import json
import re
from typing import Callable

def make_evaluator(checks: list[Callable]) -> Callable:
    """Compose multiple checks into a single evaluator."""
    def evaluate(response: str, metadata: dict) -> dict:
        results = {}
        for check in checks:
            name = check.__name__
            results[name] = check(response, metadata)
        results["passed"] = all(results.values())
        return results
    return evaluate

def check_format_json(response: str, _) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_length(response: str, metadata: dict) -> bool:
    min_len = metadata.get("min_length", 50)
    max_len = metadata.get("max_length", 2000)
    return min_len <= len(response) <= max_len

def check_no_hallucination_markers(response: str, _) -> bool:
    """Flag responses that suggest hallucination."""
    hallucination_phrases = [
        "I'm not sure but",
        "I believe but I'm not certain",
        "I might be wrong",
        "as far as I know",
        "I think (but don't quote me)",
    ]
    response_lower = response.lower()
    return not any(phrase.lower() in response_lower for phrase in hallucination_phrases)

def check_cites_source(response: str, _) -> bool:
    """Check if response includes source citation."""
    patterns = [r'\[Source:', r'\[Ref:', r'According to', r'Based on']
    return any(re.search(p, response) for p in patterns)
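
Composed with `make_evaluator`, the checks run as one pass/fail gate. A self-contained usage sketch (it re-declares two of the small checks so the snippet runs on its own):

```python
import json
from typing import Callable

def make_evaluator(checks: list[Callable]) -> Callable:
    """Compose multiple checks into a single evaluator."""
    def evaluate(response: str, metadata: dict) -> dict:
        results = {check.__name__: check(response, metadata) for check in checks}
        results["passed"] = all(results.values())
        return results
    return evaluate

def check_format_json(response: str, _) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_length(response: str, metadata: dict) -> bool:
    return metadata.get("min_length", 50) <= len(response) <= metadata.get("max_length", 2000)

evaluator = make_evaluator([check_format_json, check_length])
report = evaluator('{"refund_days": 30, "note": "full refund within 30 days"}',
                   {"min_length": 10})
# report["passed"] is True; each check reports under its own function name
```

The per-check breakdown in the report is what tells you which rule a failing response broke.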

Promptfoo — Evals as Code

# promptfooconfig.yaml
prompts:
  - id: prompt-v1
    raw: |
      You are a helpful customer service agent.
      Answer questions about our products based on this knowledge base:
      {{context}}
      
      Question: {{question}}
      Answer:
  
  - id: prompt-v2
    raw: |
      You are a precise customer service agent. Answer ONLY using information 
      from the provided context. If unsure, say "I don't have that information."
      
      Context: {{context}}
      
      Question: {{question}}
      Answer:

providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0
  - id: openai:gpt-4o
    config:
      temperature: 0
  - id: anthropic:claude-3-5-haiku-20241022

tests:
  - description: Basic refund policy query
    vars:
      question: "What is your return policy?"
      context: "Products can be returned within 30 days for a full refund."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response accurately describes the return policy and is helpful"
      - type: cost
        threshold: 0.01  # Max $0.01 per call

  - description: Edge case - digital products
    vars:
      question: "Can I return downloaded software?"
      context: "Digital downloads cannot be returned once accessed."
    assert:
      - type: contains
        value: "cannot"
      - type: not-contains
        value: "30 days"  # Should NOT apply physical policy to digital
      - type: factuality
        value: "Digital products cannot be returned once downloaded"

  - description: Hallucination test - question not in context
    vars:
      question: "What are your store hours?"
      context: "We sell premium software products."
    assert:
      - type: llm-rubric
        value: "The response should NOT invent store hours. It should say the information is not available."

# Run evals
npx promptfoo eval
npx promptfoo eval --output results.json
npx promptfoo view  # Browser UI with comparison table

# Compare two configs
npx promptfoo eval --config config-v1.yaml
npx promptfoo eval --config config-v2.yaml
npx promptfoo view  # See side-by-side comparison

LangSmith Tracing and Eval

import os

from langsmith import Client
from langsmith.evaluation import evaluate

# Auto-tracing: set these env vars in any LangChain app
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# Create an evaluator
from langchain.evaluation import load_evaluator

correctness_evaluator = load_evaluator(
    "labeled_score_string",
    criteria={
        "correctness": "Is the response factually correct based on the reference?"
    }
)

# Evaluate a dataset
client = Client()

# Create dataset
dataset = client.create_dataset("refund-questions")
for example in eval_dataset:
    client.create_example(
        inputs={"question": example["question"], "context": example["contexts"]},
        outputs={"answer": example["ground_truth"]},
        dataset_id=dataset.id,
    )

# Run evaluation
def run_rag_chain(inputs):
    return {"answer": my_rag_chain.invoke(inputs)}

results = evaluate(
    run_rag_chain,
    data=dataset.name,
    evaluators=["qa", "context_qa"],  # Built-in evaluators
    experiment_prefix="rag-v2",
)

Hallucination Detection

HALLUCINATION_CHECK_PROMPT = """Given the following context and response, determine if the response 
contains any information NOT supported by the context (hallucination).

Context:
{context}

Response:
{response}

Return JSON:
{{
    "has_hallucination": true/false,
    "hallucinated_claims": ["specific claim not in context", ...],
    "confidence": 0.0-1.0,
    "reasoning": "explanation"
}}"""

def detect_hallucination(
    context: str,
    response: str,
    model: str = "gpt-4o"
) -> dict:
    result = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": HALLUCINATION_CHECK_PROMPT.format(
                context=context, response=response
            )
        }],
        response_format={"type": "json_object"},
        temperature=0,
        seed=42,  # Best-effort reproducibility (seed does not guarantee determinism)
    )
    return json.loads(result.choices[0].message.content)

# NLI-based hallucination detection (faster, cheaper)
from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def is_faithful(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """Check if the hypothesis is entailed by the premise."""
    # Pass the pair explicitly (a "[SEP]"-joined string is not tokenized as
    # a sentence pair), and request all label scores with top_k=None.
    scores = nli({"text": premise, "text_pair": hypothesis}, top_k=None)
    entailment = next((s for s in scores if s["label"].lower() == "entailment"), None)
    return entailment is not None and entailment["score"] > threshold
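
For multi-sentence answers, entailment is usually checked claim by claim. A sketch with a pluggable `entails` predicate, so it works with the NLI pipeline above or any stand-in; the naive splitter is an assumption for illustration, not part of any library:

```python
import re
from typing import Callable

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; swap in nltk/spacy for production use."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def faithfulness_score(
    context: str,
    response: str,
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of response sentences entailed by the context (0-1)."""
    sentences = split_sentences(response)
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if entails(context, s))
    return supported / len(sentences)
```

A per-sentence score also tells you *which* claim is unsupported, which a whole-answer entailment check cannot.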

CI/CD Integration for LLM Quality

# .github/workflows/eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths: ['prompts/**', 'rag/**', 'src/llm/**']
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo eval --output eval-results.json
          
      - name: Check quality gates
        run: |
          python scripts/check_eval_gates.py eval-results.json \
            --min-pass-rate 0.90 \
            --max-latency-p95 2000 \
            --max-cost-per-1k 0.50
          
      - name: Comment results on PR
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./eval-results.json')
            const passRate = results.stats.successes / results.stats.total
            const comment = `## LLM Eval Results
            - Pass rate: ${(passRate * 100).toFixed(1)}%
            - Avg latency: ${results.stats.avgLatency}ms
            - Total cost: ${results.stats.totalCost.toFixed(4)}`
            github.rest.issues.createComment({...})
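
The workflow calls a `scripts/check_eval_gates.py` that is not shown; a minimal sketch of its gate logic (the `stats` field names mirror the PR-comment step and are assumptions about the results schema):

```python
def check_gates(
    results: dict,
    min_pass_rate: float = 0.90,
    max_latency_p95: float = 2000,   # milliseconds
    max_cost_per_1k: float = 0.50,   # dollars per 1k calls
) -> list[str]:
    """Return a list of gate violations (empty list = all gates pass)."""
    stats = results["stats"]
    violations = []
    pass_rate = stats["successes"] / stats["total"]
    if pass_rate < min_pass_rate:
        violations.append(f"pass rate {pass_rate:.1%} below {min_pass_rate:.1%}")
    if stats.get("latencyP95", 0) > max_latency_p95:
        violations.append(f"p95 latency {stats['latencyP95']}ms above {max_latency_p95}ms")
    cost_per_1k = stats.get("totalCost", 0.0) / max(stats["total"], 1) * 1000
    if cost_per_1k > max_cost_per_1k:
        violations.append(f"cost ${cost_per_1k:.2f}/1k calls above ${max_cost_per_1k:.2f}")
    return violations

# The real script would map the CLI flags from the workflow
# (--min-pass-rate, --max-latency-p95, --max-cost-per-1k) onto these
# parameters, load the JSON file named in argv, print each violation,
# and sys.exit(1) if any gate failed.
```

Exiting non-zero on any violation is what makes the workflow step actually block the merge.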

Production Monitoring

import random

import prometheus_client as prom

# Metrics to track
llm_request_latency = prom.Histogram(
    "llm_request_duration_seconds",
    "LLM request latency",
    labelnames=["model", "prompt_version"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
)

llm_token_usage = prom.Counter(
    "llm_tokens_total",
    "LLM token usage",
    labelnames=["model", "type"],  # type: input/output
)

response_quality = prom.Histogram(
    "llm_response_quality_score",
    "LLM response quality (0-1)",
    labelnames=["metric_name"],
)

def monitored_llm_call(prompt: str, model: str = "gpt-4o-mini"):
    with llm_request_latency.labels(model=model, prompt_version="v2").time():
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
    
    # Track token usage
    llm_token_usage.labels(model=model, type="input").inc(response.usage.prompt_tokens)
    llm_token_usage.labels(model=model, type="output").inc(response.usage.completion_tokens)
    
    # Sample 10% for quality scoring (expensive)
    if random.random() < 0.1:
        quality = judge_response_quality(prompt, response.choices[0].message.content)
        response_quality.labels(metric_name="relevancy").observe(quality["relevancy"])
    
    return response.choices[0].message.content

Eval Framework Comparison

Framework        Best For                    Pros                              Cons
---------------  --------------------------  --------------------------------  -------------------------
Promptfoo        Prompt comparison, CI/CD    Evals-as-code, fast, cheap        Fewer enterprise features
LangSmith        LangChain apps              Deep tracing, dataset management  Vendor lock-in
Braintrust       Production monitoring       Nice UI, A/B testing              Newer
RAGAS            RAG systems                 RAG-specific metrics, free        Limited to RAG
Phoenix (Arize)  ML + LLM observability      Good observability UX             Complex setup
DeepEval         Comprehensive evals         Many metrics, LLM-agnostic        Slower
Opik             Open-source eval + tracing  Self-hostable                     Less mature
