---
title: AI Evaluation Methodologies Library
canonical: https://www.slavin.ai/data/ai-evaluation-methodologies.json
sourceJSON: https://www.slavin.ai/data/ai-evaluation-methodologies.json
license: CC-BY-4.0
lastUpdated: 2026-06-20
totalMethods: 18
categories: ["offline", "online", "human-in-loop", "llm-as-judge", "drift-detection", "infrastructure"]
companionTo: ai-architecture-patterns
---

# AI Evaluation Methodologies Library

18 named, structured evaluation methods for production LLM systems.
Companion to the AI Architecture Patterns Library — **patterns = how
to build**, **evaluations = how to verify it works**. Together they
cover the full production lifecycle.

Cite by method `@id`:
`https://www.slavin.ai/data/ai-evaluation-methodologies.json#<method-id>`

Pair with `ai-architecture-patterns` to design + verify a complete
production AI system.

---

## Offline (8)

### Golden Dataset Eval (`golden-dataset`)
**Measures:** Output correctness against curated ground-truth pairs.
**Tooling:** promptfoo, deepeval, Langfuse, OpenAI Evals, pytest.
**Use when:** Any production AI with definable correct answers — mandatory before every model/prompt change.
**Cost:** 1-3 person-weeks initial; 2-4 h/month maintenance.
**Example:** `pass_rate=0.87, f1=0.81, regressions=3/156`

### RAGAS (`ragas`)
**Measures:** RAG-specific failure modes — faithfulness, answer-relevance, context-relevance, context-recall.
**Tooling:** RAGAS framework (Python), TruLens, DeepEval RagasMetric.
**Use when:** Any RAG system in production or pre-launch.
**Cost:** 2-3 days to wire up; ~10-15% of generation cost.
**Example:** `faithfulness=0.92, answer_relevance=0.88, context_precision=0.76`

### Self-Consistency Eval (`self-consistency-check`)
**Measures:** Output stability across paraphrased inputs.
**Tooling:** custom pytest, paraphrase-mining via embeddings.
**Use when:** Reasoning + factual tasks where stability matters.
**Cost:** 1-2 days; 3-5× generation per eval run.

### Adversarial / Red-Team Eval (`red-team-eval`)
**Measures:** Robustness against prompt injection, jailbreaks, encoded payloads.
**Tooling:** PromptInject, Garak (NVIDIA), PyRIT (Microsoft), custom suite.
**Use when:** User-facing LLM. Mandatory before launch for regulated.
**Cost:** 2-4 person-weeks initial; quarterly refresh.

### Long-Context Recall / Needle-in-Haystack (`long-context-recall`)
**Measures:** Whether the model retrieves specific facts placed deep in context.
**Tooling:** LangChain needle-in-haystack utilities, custom benchmarks.
**Use when:** Systems pushing context-window limits (long PDFs, code repos).

### Tool / Function Call Eval (`tool-call-eval`)
**Measures:** Tool selection accuracy, argument extraction, plan completion, hallucinated-tool rate.
**Tooling:** custom test harness, DeepEval ToolCorrectness, LangChain test utils.
**Use when:** Any agent with function-calling.
**Example:** `tool_correctness=0.91, arg_exact_match=0.78, plan_completion=0.83`

### Structured Output Adherence (`format-adherence`)
**Measures:** Compliance with declared JSON Schema / fixed format.
**Tooling:** jsonschema, Pydantic, Outlines, Instructor library.
**Use when:** Any system using JSON-mode / function-calling / structured generation.
**Cost:** 1-2 days wiring.

### Safety / Policy Filter Eval (`safety-filter-eval`)
**Measures:** Compliance with PII / profanity / brand-voice / off-topic policy.
**Tooling:** OpenAI Moderation API, Llama Guard 3, NeMo Guardrails.
**Use when:** User-facing AI in regulated/branded contexts.

---

## Online (3)

### Production A/B Test (`ab-test`)
**Measures:** Real business outcomes (conversion, completion, satisfaction) caused by AI change.
**Tooling:** GrowthBook, Optimizely, Statsig, in-house feature flags.
**Use when:** Any consequential AI change touching real users — final validation gate.
**Example:** `conversion_lift=+4.2% (95% CI [+1.8, +6.6])`

### Shadow Evaluation (`shadow-eval`)
**Measures:** Candidate model/prompt quality on real traffic without affecting users.
**Tooling:** LiteLLM shadow mode, in-house dual-call wrapper, Helicone.
**Use when:** Backend AI changes. Pre-A/B validation.
**Cost:** 2× generation cost while shadow runs.

### Inline User Feedback (`user-feedback`)
**Measures:** User-perceived quality via thumbs-up / thumbs-down + optional comment.
**Tooling:** Helicone feedback widget, Langfuse user feedback, custom UI.
**Use when:** Any user-facing AI. Earliest production-quality signal you'll get.
**Cost:** 1-3 days UI + aggregation.

---

## Human-in-Loop (1)

### Human Spot-Check Sampling (`human-spot-check`)
**Measures:** Ground-truth quality. Calibration source for all automated evals.
**Tooling:** Argilla, Label Studio, Prodigy, in-house labeling UI.
**Use when:** Always pair with LLM-as-judge as calibration source. Mandatory in regulated domains.
**Cost:** 4-20 person-hours per week per domain expert.
**Example:** `human_quality=4.1±0.3, judge_correlation=0.79`

---

## LLM-as-Judge (2)

### LLM-as-Judge with Rubric (`llm-as-judge-rubric`)
**Measures:** Subjective quality dimensions (helpfulness, factuality, tone, safety) at scale.
**Tooling:** Langfuse evaluators, promptfoo LLM rubric, OpenAI Evals, DeepEval GEval.
**Use when:** Subjective quality at production scale. Drift detection. Pre-deployment gating.
**Cost:** 1-2 person-weeks rubric design; ~5-10% of generation cost.

### Factuality + Citation Check (`factuality-citation-check`)
**Measures:** Whether stated facts in long-form output are supported by evidence.
**Tooling:** custom claim extractor + retriever, FActScore, DeepEval FactualityMetric.
**Use when:** Long-form generation with claims (research summaries, legal opinions).
**Cost:** 1-3 person-weeks; 2-3× generation per eval run.

---

## Drift Detection (2)

### Drift Monitoring (`drift-monitoring`)
**Measures:** Distribution shift in inputs or outputs over time.
**Tooling:** Evidently AI, Whylogs, in-house Pandas, Langfuse trends.
**Use when:** Production systems > 3 months old. Critical when corpus/user-base evolves.
**Example:** `topic_drift PSI=0.15 (yellow), refusal_rate=+3% MoM`

### Embedding Drift Detection (`embedding-drift`)
**Measures:** Whether embedding model + corpus continue to represent semantics as expected.
**Tooling:** custom Python + vector store, Pinecone Vector Eval.
**Use when:** RAG systems where corpus updates frequently or embedding model version changes.

---

## Infrastructure (2)

### Latency + Cost Tracking (`latency-cost-eval`)
**Measures:** p50/p95/p99 latency, tokens-per-request, $/request — by route, user-segment, model.
**Tooling:** OpenTelemetry + Honeycomb/Datadog, Helicone, Langfuse, Phoenix.
**Use when:** Always. Mandatory ops hygiene.
**Example:** `p95=1.8s, p99=4.2s, $/req=$0.012`

### Regression Suite — Pre-Deploy Gate (`regression-suite`)
**Measures:** Does new version preserve previous quality on a fixed test set?
**Tooling:** GitHub Actions + promptfoo CI, DeepEval CI integration.
**Use when:** Any production AI with version control.
**Cost:** 1-2 person-weeks initial; ongoing maintenance with new features.

---

## How to combine

A typical production AI system runs **5-7** of these methods continuously:
1. **Golden Dataset** + **Regression Suite** — CI gate before every change
2. **Latency + Cost** — always-on infra hygiene
3. **LLM-as-Judge sampling** — 5-10% of production traffic
4. **Human Spot-Check** — weekly calibration
5. **Drift Monitoring** — daily aggregation, alert at threshold
6. **Red-Team Eval** — quarterly refresh
7. RAG-specific: add **RAGAS** + **Embedding Drift**
8. Agent-specific: add **Tool Call Eval**

---

End of evaluation methodologies library.
