
llm-evaluation

LLMOps and evaluation engineering. RAGAS metrics, Promptfoo evals-as-code, LangSmith tracing, hallucination detection, CI/CD for LLM quality, and production monitoring. Build LLM apps you can actually measure.


Installation

npx clawhub@latest install llm-evaluation

View the full skill documentation and source below.

Documentation

LLM Evaluation Engineering

Why Evals Matter

"Evals are to LLMs what tests are to software." — Andrej Karpathy

Without evals:

  • You can't measure if a prompt change improved things

  • You don't know if a model upgrade regressed outputs

  • You're shipping vibes, not quality



Evaluation Taxonomy

Automated Evals (fast, cheap, scalable)
  ├─ Rule-based: regex, JSON schema, string contains
  ├─ Model-based (LLM-as-Judge): GPT-4o grades outputs
  └─ Reference-based: compare against golden answers

Human Evals (slow, expensive, ground truth)
  ├─ Expert annotation: domain experts label samples
  ├─ Preference annotation: A/B comparison voting
  └─ Production feedback: thumbs up/down from users

Online vs Offline:
  Offline: Test against curated dataset before deploy
  Online: Monitor live traffic in production

The Evaluation Loop

1. Build eval dataset
   - Golden QA pairs from domain experts
   - Edge cases from production failures
   - Adversarial examples

2. Define metrics
   - What does "good" look like?
   - How do you measure it programmatically?

3. Run evals
   - Against multiple model versions
   - Against prompt variants
   - After system changes

4. Analyze regressions
   - Which cases failed?
   - What patterns exist?

5. Improve system
   - Fix prompts, chunking, retrieval
   - Retrain / fine-tune

6. Add failing cases to dataset (regression prevention)
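
The loop above can be sketched as a small harness; `system` and `metric` below are placeholders for your app and scoring function, not a specific library API:

```python
def collect_failures(system, dataset, metric, threshold=0.8):
    """Run the system over the golden set; return cases scoring below threshold.

    Failures feed step 4 (analysis) and, once fixed, step 6: append them
    to the dataset so the next run guards against the same regression.
    """
    failures = []
    for case in dataset:
        answer = system(case["question"])
        score = metric(answer, case["ground_truth"])
        if score < threshold:
            failures.append({**case, "answer": answer, "score": score})
    return failures
```

Each failing case becomes a new dataset entry, so the next eval run catches the same regression automatically.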

Building an Eval Dataset

# Golden dataset structure
eval_dataset = [
    {
        "id": "q001",
        "question": "What is our refund policy?",
        "ground_truth": "Products can be returned within 30 days of purchase for a full refund.",
        "contexts": [
            "Our refund policy allows returns within 30 days of purchase.",
            "Refunds are processed within 5-7 business days."
        ],
        "category": "policy",
        "difficulty": "easy",
        "tags": ["refund", "returns"],
    },
    {
        "id": "q002",
        "question": "Can I return a digital product?",
        "ground_truth": "Digital products are not eligible for refunds once accessed.",
        "contexts": [
            "Digital downloads cannot be returned once downloaded or accessed."
        ],
        "category": "policy",
        "difficulty": "hard",  # Edge case
        "tags": ["refund", "digital"],
    }
]

# Minimum viable eval dataset: 50-100 questions per use case
# Cover: easy/medium/hard, each business scenario, known failure modes
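
A cheap structural check before spending tokens catches malformed rows early. `REQUIRED_FIELDS` here follows the schema above and is an assumption of this sketch, not a requirement of any particular tool:

```python
REQUIRED_FIELDS = {"id", "question", "ground_truth", "contexts"}

def validate_dataset(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the dataset is usable."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Every row needs the fields downstream metrics expect
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            problems.append(f"row {i}: missing fields {sorted(missing)}")
        # Duplicate IDs make regression tracking ambiguous
        if row.get("id") in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']!r}")
        seen_ids.add(row.get("id"))
    return problems
```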

Metric Design

For RAG Systems

from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # 0-1: Is the answer grounded in retrieved context?
    answer_relevancy,    # 0-1: Does the answer address the question?
    context_precision,   # 0-1: Is the retrieved context useful?
    context_recall,      # 0-1: Does the context contain enough info?
    answer_correctness,  # 0-1: Is the answer factually correct (vs ground truth)?
)

from datasets import Dataset

# Note: RAGAS also expects an "answer" column (your system's generated
# output) alongside question/contexts/ground_truth in each row.
results = evaluate(
    Dataset.from_list(eval_dataset),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Scores explained:
# Faithfulness < 0.8: LLM is hallucinating
# Context Recall < 0.7: Retrieval is missing relevant chunks
# Context Precision < 0.7: Retrieved chunks are irrelevant
# Answer Relevancy < 0.8: Answers are off-topic
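
Those rule-of-thumb cutoffs can be encoded as a simple gate; the threshold values are the illustrative ones from the comments above, not RAGAS defaults:

```python
THRESHOLDS = {
    "faithfulness": 0.8,       # below: hallucination risk
    "answer_relevancy": 0.8,   # below: off-topic answers
    "context_precision": 0.7,  # below: irrelevant chunks retrieved
    "context_recall": 0.7,     # below: retrieval missing relevant chunks
}

def failing_metrics(scores: dict) -> list[str]:
    """Return the metric names whose score falls below its threshold."""
    return [name for name, cutoff in THRESHOLDS.items()
            if scores.get(name, 0.0) < cutoff]
```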

For Chatbots / Assistants

# Custom LLM-as-Judge metric
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI assistant's response.

Question: {question}
AI Response: {response}

Rate the response on each dimension (1-5):
1. Accuracy: Is the information factually correct?
2. Completeness: Does it fully answer the question?
3. Clarity: Is it easy to understand?
4. Conciseness: Is it appropriately concise without being incomplete?
5. Tone: Is the tone appropriate and professional?

Return JSON only:
{{"accuracy": X, "completeness": X, "clarity": X, "conciseness": X, "tone": X, "reasoning": "..."}}"""

def judge_response(question: str, response: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, response=response)
        }],
        response_format={"type": "json_object"},
        temperature=0,  # Deterministic grading
    )
    return json.loads(result.choices[0].message.content)

Rule-Based Metrics

import json
import re
from typing import Callable

def make_evaluator(checks: list[Callable]) -> Callable:
    """Compose multiple checks into a single evaluator."""
    def evaluate(response: str, metadata: dict) -> dict:
        results = {}
        for check in checks:
            name = check.__name__
            results[name] = check(response, metadata)
        results["passed"] = all(results.values())
        return results
    return evaluate

def check_format_json(response: str, _) -> bool:
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def check_length(response: str, metadata: dict) -> bool:
    min_len = metadata.get("min_length", 50)
    max_len = metadata.get("max_length", 2000)
    return min_len <= len(response) <= max_len

def check_no_hallucination_markers(response: str, _) -> bool:
    """Flag responses that suggest hallucination."""
    hallucination_phrases = [
        "I'm not sure but",
        "I believe but I'm not certain",
        "I might be wrong",
        "as far as I know",
        "I think (but don't quote me)",
    ]
    response_lower = response.lower()
    return not any(phrase.lower() in response_lower for phrase in hallucination_phrases)

def check_cites_source(response: str, _) -> bool:
    """Check if response includes source citation."""
    patterns = [r'\[Source:', r'\[Ref:', r'According to', r'Based on']
    return any(re.search(p, response) for p in patterns)

Promptfoo — Evals as Code

# promptfooconfig.yaml
prompts:
  - id: prompt-v1
    raw: |
      You are a helpful customer service agent.
      Answer questions about our products based on this knowledge base:
      {{context}}
      
      Question: {{question}}
      Answer:
  
  - id: prompt-v2
    raw: |
      You are a precise customer service agent. Answer ONLY using information 
      from the provided context. If unsure, say "I don't have that information."
      
      Context: {{context}}
      
      Question: {{question}}
      Answer:

providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0
  - id: openai:gpt-4o
    config:
      temperature: 0
  - id: anthropic:claude-3-5-haiku-20241022

tests:
  - description: Basic refund policy query
    vars:
      question: "What is your return policy?"
      context: "Products can be returned within 30 days for a full refund."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "The response accurately describes the return policy and is helpful"
      - type: cost
        threshold: 0.01  # Max $0.01 per call

  - description: Edge case - digital products
    vars:
      question: "Can I return downloaded software?"
      context: "Digital downloads cannot be returned once accessed."
    assert:
      - type: contains
        value: "cannot"
      - type: not-contains
        value: "30 days"  # Should NOT apply physical policy to digital
      - type: factuality
        value: "Digital products cannot be returned once downloaded"

  - description: Hallucination test - question not in context
    vars:
      question: "What are your store hours?"
      context: "We sell premium software products."
    assert:
      - type: llm-rubric
        value: "The response should NOT invent store hours. It should say the information is not available."

# Run evals
npx promptfoo eval
npx promptfoo eval --output results.json
npx promptfoo view  # Browser UI with comparison table

# Compare two configs
npx promptfoo eval --config config-v1.yaml
npx promptfoo eval --config config-v2.yaml
npx promptfoo view  # See side-by-side comparison

LangSmith Tracing and Eval

import os

from langsmith import Client
from langsmith.evaluation import evaluate

# Auto-tracing — add to any LangChain app
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# Create an evaluator
from langchain.evaluation import load_evaluator

correctness_evaluator = load_evaluator(
    "labeled_score_string",
    criteria={
        "correctness": "Is the response factually correct based on the reference?"
    }
)

# Evaluate a dataset
client = Client()

# Create dataset
dataset = client.create_dataset("refund-questions")
for example in eval_dataset:
    client.create_example(
        inputs={"question": example["question"], "context": example["contexts"]},
        outputs={"answer": example["ground_truth"]},
        dataset_id=dataset.id,
    )

# Run evaluation
def run_rag_chain(inputs):
    return {"answer": my_rag_chain.invoke(inputs)}

results = evaluate(
    run_rag_chain,
    data=dataset.name,
    evaluators=["qa", "context_qa"],  # Built-in evaluators
    experiment_prefix="rag-v2",
)

Hallucination Detection

HALLUCINATION_CHECK_PROMPT = """Given the following context and response, determine if the response 
contains any information NOT supported by the context (hallucination).

Context:
{context}

Response:
{response}

Return JSON:
{{
    "has_hallucination": true/false,
    "hallucinated_claims": ["specific claim not in context", ...],
    "confidence": 0.0-1.0,
    "reasoning": "explanation"
}}"""

def detect_hallucination(
    context: str,
    response: str,
    model: str = "gpt-4o"
) -> dict:
    result = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": HALLUCINATION_CHECK_PROMPT.format(
                context=context, response=response
            )
        }],
        response_format={"type": "json_object"},
        temperature=0,
        seed=42,  # Deterministic output
    )
    return json.loads(result.choices[0].message.content)

# NLI-based hallucination detection (faster, cheaper)
from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def is_faithful(premise: str, hypothesis: str, threshold: float = 0.5) -> bool:
    """Check if hypothesis is entailed by premise."""
    # Pass the sentence pair explicitly and request scores for every label,
    # not just the top one; label casing varies by model, so compare lowercased.
    scores = nli({"text": premise, "text_pair": hypothesis}, top_k=None)
    entailment = next((s for s in scores if s["label"].lower() == "entailment"), None)
    return entailment is not None and entailment["score"] > threshold

CI/CD Integration for LLM Quality

# .github/workflows/eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths: ['prompts/**', 'rag/**', 'src/llm/**']
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npx promptfoo eval --output eval-results.json
          
      - name: Check quality gates
        run: |
          python scripts/check_eval_gates.py eval-results.json \
            --min-pass-rate 0.90 \
            --max-latency-p95 2000 \
            --max-cost-per-1k 0.50
          
      - name: Comment results on PR
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./eval-results.json')
            const passRate = results.stats.successes / results.stats.total
            const comment = `## LLM Eval Results
            - Pass rate: ${(passRate * 100).toFixed(1)}%
            - Avg latency: ${results.stats.avgLatency}ms
            - Total cost: ${results.stats.totalCost.toFixed(4)}`
            github.rest.issues.createComment({...})
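
The `scripts/check_eval_gates.py` referenced in the workflow is not shown in this skill's source; a minimal sketch of the pass-rate gate, assuming promptfoo's `results.json` exposes `stats.successes` and `stats.total` (verify against your promptfoo version; the latency and cost gates would follow the same pattern):

```python
# scripts/check_eval_gates.py (hypothetical sketch)
import argparse
import json
import sys

def pass_rate(stats: dict) -> float:
    """Fraction of eval cases that passed."""
    return stats["successes"] / stats["total"]

def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("results_file")
    parser.add_argument("--min-pass-rate", type=float, default=0.90)
    args = parser.parse_args(argv)

    with open(args.results_file) as f:
        stats = json.load(f)["stats"]

    rate = pass_rate(stats)
    if rate < args.min_pass_rate:
        # Non-zero exit fails the CI job
        sys.exit(f"GATE FAILED: pass rate {rate:.1%} < {args.min_pass_rate:.0%}")
    print(f"Gates passed: pass rate {rate:.1%}")

if __name__ == "__main__":
    main()
```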

Production Monitoring

import random

import prometheus_client as prom

# Metrics to track
llm_request_latency = prom.Histogram(
    "llm_request_duration_seconds",
    "LLM request latency",
    labelnames=["model", "prompt_version"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
)

llm_token_usage = prom.Counter(
    "llm_tokens_total",
    "LLM token usage",
    labelnames=["model", "type"],  # type: input/output
)

response_quality = prom.Histogram(
    "llm_response_quality_score",
    "LLM response quality (0-1)",
    labelnames=["metric_name"],
)

def monitored_llm_call(prompt: str, model: str = "gpt-4o-mini"):
    with llm_request_latency.labels(model=model, prompt_version="v2").time():
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
    
    # Track token usage
    llm_token_usage.labels(model=model, type="input").inc(response.usage.prompt_tokens)
    llm_token_usage.labels(model=model, type="output").inc(response.usage.completion_tokens)
    
    # Sample 10% for quality scoring (expensive)
    if random.random() < 0.1:
        quality = judge_response_quality(prompt, response.choices[0].message.content)
        response_quality.labels(metric_name="relevancy").observe(quality["relevancy"])
    
    return response.choices[0].message.content

Eval Framework Comparison

| Framework       | Best For                   | Pros                             | Cons                      |
|-----------------|----------------------------|----------------------------------|---------------------------|
| Promptfoo       | Prompt comparison, CI/CD   | Evals-as-code, fast, cheap       | Fewer enterprise features |
| LangSmith       | LangChain apps             | Deep tracing, dataset management | Vendor lock-in            |
| Braintrust      | Production monitoring      | Nice UI, A/B testing             | Newer                     |
| RAGAS           | RAG systems                | RAG-specific metrics, free       | Limited to RAG            |
| Phoenix (Arize) | ML + LLM observability     | Good observability UX            | Complex setup             |
| DeepEval        | Comprehensive evals        | Many metrics, LLM-agnostic       | Slower                    |
| Opik            | Open source eval + tracing | Self-hostable                    | Less mature               |