LLM Evaluation Pipeline

Purpose

An LLM evaluation pipeline measures whether a generative AI system is meeting quality, safety, and correctness requirements — across versions, prompt changes, and model swaps. Without systematic evaluation, regressions are invisible. This note covers: building a golden dataset, defining an LLM-as-judge rubric, running evaluations with LangSmith, comparing runs, and setting up automated regression checks in CI.

Examples

RAG Q&A evaluation: Measure faithfulness (does the answer follow from the context?) and answer relevance (does the answer address the question?) on 200 golden QA pairs after each prompt update.

Code generation evaluation: Run pass@1 tests on 50 coding tasks before and after a model swap; fail CI if pass@1 drops more than 3 points.
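The pass@1 gate in the second example can be sketched as a plain function. The sample pass/fail counts below are illustrative, not real results.

```python
def pass_at_1(task_passed: list[bool]) -> float:
    """pass@1: fraction of tasks whose single generated solution passes its tests."""
    return sum(task_passed) / len(task_passed)

# Illustrative results for 50 coding tasks before and after a model swap
baseline  = pass_at_1([True] * 40 + [False] * 10)  # 0.80
candidate = pass_at_1([True] * 38 + [False] * 12)  # 0.76

# Gate: fail if pass@1 drops by more than 3 points (0.03 absolute)
regressed = (baseline - candidate) > 0.03
print(f"baseline={baseline:.2f} candidate={candidate:.2f} regressed={regressed}")
```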


Architecture

Golden dataset (prompt, [context], expected_output)
        ↓
   LLM / pipeline generates actual outputs
        ↓
   Evaluation: automated metrics + LLM-as-judge
        ├── Exact match / ROUGE / BLEU (reference-based)
        └── LLM judge: rubric scoring (faithfulness, relevance, etc.)
        ↓
   Results stored in LangSmith experiment
        ↓
   Compare across runs (A/B, before/after)
        ↓
   CI gate: fail if metric drops below threshold

Setup

pip install langsmith langchain-openai
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_PROJECT="my-eval-project"

Create a Golden Evaluation Dataset

from langsmith import Client
 
client = Client()
 
# Create dataset
dataset = client.create_dataset(
    dataset_name="rag-qa-golden-v1",
    description="200 curated QA pairs with grounding context",
)
 
# Upload examples: (inputs, expected outputs)
examples = [
    {
        "inputs":  {"question": "What is the refund window for enterprise accounts?",
                    "context":  "Enterprise customers may request refunds within 45 days..."},
        "outputs": {"answer":   "Enterprise accounts have a 45-day refund window."},
    },
    {
        "inputs":  {"question": "Is multi-factor authentication required?",
                    "context":  "MFA is mandatory for all admin accounts as of Q1 2024."},
        "outputs": {"answer":   "Yes, MFA is required for all admin accounts."},
    },
    # ... add 198 more examples
]
 
client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
print(f"Dataset created: {dataset.id}")
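Before uploading, it can be worth checking that every example matches the schema the pipeline and evaluators expect; `validate_examples` below is a hypothetical helper, not part of the LangSmith SDK.

```python
REQUIRED_INPUTS = {"question", "context"}
REQUIRED_OUTPUTS = {"answer"}

def validate_examples(examples: list[dict]) -> int:
    """Raise ValueError if any example is missing a required key; return the count."""
    for i, ex in enumerate(examples):
        missing_in = REQUIRED_INPUTS - ex.get("inputs", {}).keys()
        missing_out = REQUIRED_OUTPUTS - ex.get("outputs", {}).keys()
        if missing_in or missing_out:
            raise ValueError(f"example {i}: missing keys {missing_in | missing_out}")
    return len(examples)

count = validate_examples([
    {"inputs": {"question": "Q?", "context": "C."}, "outputs": {"answer": "A."}},
])
```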

LLM-as-Judge Rubric

from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator
 
# Built-in evaluators: "qa" (correctness), "context_qa", "cot_qa", "criteria"
 
# Custom rubric evaluator using "criteria"
faithfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "faithfulness": (
                "Does the answer accurately reflect the information in the provided context? "
                "Score 1 (faithful) or 0 (hallucinated or contradicts context)."
            )
        },
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "input":      example.inputs["context"],
        "reference":  example.outputs["answer"],
    },
)
 
# Note: the built-in "qa" evaluator grades correctness against the reference;
# for answer relevance, use a "criteria" rubric keyed to the question instead.
relevance_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "relevance": (
                "Does the answer directly address the question asked? "
                "Score 1 (relevant) or 0 (off-topic or evasive)."
            )
        },
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "input":      example.inputs["question"],
    },
)
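Not every evaluator needs a judge LLM. `evaluate` also accepts plain Python callables that take the run and example and return a dict with `key` and `score`; a deterministic exact-match check is a cheap complement to the rubric evaluators:

```python
def exact_match(run, example) -> dict:
    """Reference-based string equality: deterministic, free, no judge LLM."""
    predicted = (run.outputs or {}).get("answer", "").strip().lower()
    expected = example.outputs["answer"].strip().lower()
    return {"key": "exact_match", "score": float(predicted == expected)}

# Passed alongside the LLM judges, e.g.:
# evaluate(answer_question, data="rag-qa-golden-v1",
#          evaluators=[faithfulness_evaluator, relevance_evaluator, exact_match])
```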

Running Evaluations

from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI
 
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
 
def answer_question(inputs: dict) -> dict:
    """Pipeline under evaluation."""
    prompt = f"Context: {inputs['context']}\n\nQuestion: {inputs['question']}\nAnswer:"
    response = llm.invoke(prompt)
    return {"answer": response.content}
 
# Run evaluation on the golden dataset
results = evaluate(
    answer_question,
    data="rag-qa-golden-v1",
    evaluators=[faithfulness_evaluator, relevance_evaluator],
    experiment_prefix="gpt-4o-mini-baseline",
    num_repetitions=1,
)
print(results)
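The returned results carry per-example feedback scores; aggregating them into per-metric means can be sketched without the SDK (a hypothetical `summarize_feedback` over plain score dicts, standing in for `results.to_pandas()` aggregation):

```python
from statistics import mean

def summarize_feedback(rows: list[dict]) -> dict:
    """Collapse per-example 0/1 judge scores into a mean per metric."""
    by_metric: dict[str, list[float]] = {}
    for row in rows:
        for metric, score in row.items():
            by_metric.setdefault(metric, []).append(score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}

summary = summarize_feedback([
    {"faithfulness": 1, "relevance": 1},
    {"faithfulness": 0, "relevance": 1},
])
```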

Comparing Runs (A/B Testing)

# After running two experiments (e.g., baseline vs. prompt-v2)
from langsmith import Client
 
client = Client()
 
# Each experiment is stored as a project. Note that evaluate() appends a
# unique suffix to experiment_prefix, so use the full name it prints.
baseline = client.read_project(project_name="gpt-4o-mini-baseline")
improved = client.read_project(project_name="gpt-4o-mini-prompt-v2")
 
# Side-by-side comparison is easiest in the LangSmith dashboard:
# https://smith.langchain.com → Datasets → rag-qa-golden-v1 → Compare runs
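For a scripted A/B check, the mean scores of two experiments (e.g. pulled via `results.to_pandas()`) can be diffed with a small helper; `compare_runs` and the numbers below are illustrative:

```python
def compare_runs(baseline: dict, candidate: dict, max_drop: float = 0.03) -> dict:
    """Per-metric delta (candidate - baseline), flagging drops beyond max_drop."""
    report = {}
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        delta = None if cand_score is None else cand_score - base_score
        report[metric] = {
            "baseline": base_score,
            "candidate": cand_score,
            "delta": delta,
            "regressed": delta is not None and delta < -max_drop,
        }
    return report

report = compare_runs(
    {"faithfulness": 0.91, "relevance": 0.88},
    {"faithfulness": 0.84, "relevance": 0.90},
)
```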

CI Regression Check (GitHub Actions)

# .github/workflows/llm_eval.yml
name: LLM Evaluation CI
 
on:
  pull_request:
    paths: ["src/prompts/**", "src/pipeline/**"]
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install langsmith langchain-openai
      - name: Run evaluation
        env:
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
          OPENAI_API_KEY:    ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/run_eval.py --fail-threshold 0.85

# scripts/run_eval.py
import argparse, sys
from langsmith.evaluation import evaluate
# ... (same as above) ...
 
parser = argparse.ArgumentParser()
parser.add_argument("--fail-threshold", type=float, default=0.85)
args = parser.parse_args()
 
results = evaluate(
    answer_question,
    data="rag-qa-golden-v1",
    evaluators=[faithfulness_evaluator, relevance_evaluator],
)
mean_faithfulness = results.to_pandas()["feedback.faithfulness"].mean()
 
print(f"Faithfulness: {mean_faithfulness:.3f} (threshold: {args.fail_threshold})")
if mean_faithfulness < args.fail_threshold:
    print("FAIL: evaluation below threshold")
    sys.exit(1)
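The single-threshold script generalises to several metrics at once; a hypothetical gate helper (not part of any SDK) might look like:

```python
def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names of metrics that fall below their minimum acceptable score."""
    return [m for m, floor in thresholds.items() if scores.get(m, 0.0) < floor]

failures = failing_metrics(
    {"faithfulness": 0.82, "relevance": 0.93},
    {"faithfulness": 0.85, "relevance": 0.90},
)
if failures:
    print(f"FAIL: below threshold: {failures}")
    # sys.exit(1) here when running in CI
```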

Evaluation Metric Summary

Metric             Method                       When to use
Exact match        String equality              SQL, structured outputs, classification
ROUGE-L            Longest common subsequence   Summarisation baseline
Faithfulness       LLM judge                    RAG: does the answer follow from the context?
Answer relevance   LLM judge                    RAG: does the answer address the question?
Correctness        LLM judge vs. reference      Open-ended Q&A
Pass@k             Code execution               Code generation tasks
Refusal rate       String match / classifier    Safety evaluation
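As a reference point, ROUGE-L is computed from the longest common subsequence of the token sequences rather than fixed n-grams; a minimal F1 sketch:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
```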

References

AI Engineering

System Patterns

End-to-End Examples