The Testing Crisis in LLM Development
You write a prompt. It works on your example. You ship it. Two days later, users report it's "giving weird answers." You can't reproduce the issue with the same input. You try different phrasings. Sometimes it works, sometimes it doesn't. You add more instructions to the prompt. It fixes one case but breaks another. You're debugging by vibes.
This isn't sustainable. Traditional software engineering has test-driven development (TDD): write tests, write code, verify tests pass, ship with confidence. But how do you write a test for "generates a helpful product description" or "answers customer questions accurately"?
Evaluation-driven development (EDD) adapts the rigor of TDD to the non-deterministic nature of LLMs. Instead of asserting exact outputs, you define quality metrics, build evaluation datasets, and measure improvements quantitatively. Every change—prompt tweaks, model switches, retrieval adjustments—is validated against benchmarks before shipping.
The Core Principles of Evaluation-Driven Development
1. Metrics Before Code
Before writing a single prompt, define what good looks like:
- For a customer support bot: "Resolves issue without escalation" (target: 80% of cases)
- For a content generator: "On-brand, factually accurate, engaging" (3 separately scorable dimensions)
- For a code assistant: "Syntactically correct, solves the stated problem, includes error handling"
These become your acceptance criteria, just like in traditional development. But instead of boolean pass/fail, they're scored on continuous scales (0-10, 0-100%, etc.).
2. Build a Golden Dataset First
Before iterating on prompts, collect 50-200 representative examples of the task:
- Real user queries (anonymized)
- Edge cases you know are hard
- Common variations (different phrasings, typos, multi-step requests)
- Adversarial inputs (jailbreak attempts, prompt injections)
For each example, annotate:
- The input
- Expected output (or output characteristics)
- Metadata (difficulty level, category, required reasoning steps)
This becomes your eval set—the benchmark against which every iteration is measured.
3. Automate Evaluation
Manual review doesn't scale. You need automated scorers that can evaluate hundreds of outputs in minutes:
- LLM-as-judge: Use GPT-4 or Claude to rate outputs ("How well does this answer the question? 0-10")
- Rule-based checks: Regex for format compliance, keyword presence, length constraints
- Embedding similarity: Compare output to reference answers semantically
- Custom logic: Domain-specific validators (e.g., check if generated SQL is valid)
Aim for correlation with human judgment >0.7. If your automated scores don't match what you'd rate manually, refine the scoring prompt or switch methods.
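A quick way to check that threshold is to score a small sample both ways and correlate the results; a minimal sketch using scipy (the scores below are illustrative, not taken from the examples in this post):
from scipy.stats import spearmanr

# Automated judge scores vs. human ratings for the same 10 outputs (illustrative values)
judge_scores = [8, 6, 9, 3, 7, 5, 8, 4, 9, 6]
human_scores = [7, 6, 9, 2, 8, 5, 7, 5, 9, 6]

corr, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.3f})")

if corr < 0.7:
    print("Judge disagrees with humans too often -- refine the rubric or switch scorers.")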
4. Iterate with Data, Not Intuition
Every change generates new metrics:
Baseline prompt:
- Accuracy: 72%
- Hallucination rate: 18%
- Avg tokens: 245
After adding "Be concise":
- Accuracy: 71% (-1%)
- Hallucination rate: 19% (+1%)
- Avg tokens: 180 (-27%)
After adding few-shot examples:
- Accuracy: 81% (+9%)
- Hallucination rate: 12% (-6%)
- Avg tokens: 210 (-14%)
Data-driven decisions: The few-shot approach is clearly superior. Ship it.
5. Regression Testing on Every Change
Traditional software has CI that runs tests on every commit. EDD is the same: every prompt update, model switch, or retrieval tuning triggers automatic eval runs.
If scores drop below thresholds, the deployment is blocked. No more "seems fine" launches that break production silently.
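In practice the gate is a comparison of the candidate run against the current baseline with a per-metric tolerance; a minimal sketch of that check (the tolerance and metric names are illustrative):
def has_regression(candidate: dict, baseline: dict, tolerance: float = 0.02) -> bool:
    """Flag any metric that drops more than `tolerance` below the baseline."""
    return any(
        candidate[name] < baseline[name] - tolerance
        for name in baseline
    )

# Block the deploy if the new prompt regresses on any tracked metric
if has_regression({"accuracy": 0.79, "safety": 0.98}, {"accuracy": 0.82, "safety": 1.00}):
    raise SystemExit("Evaluation regression detected -- deployment blocked")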
The Evaluation-Driven Development Workflow
Phase 1: Define Success Metrics
Start with the business goal. What are users trying to achieve? What does failure look like?
Example: Building a customer support QA bot
Goals:
- Answer customer questions without needing human escalation
- Stay on-brand (friendly but professional tone)
- Never leak PII or make promises about refunds/policies
Translate to metrics:
Accuracy (primary): Does the answer correctly address the question?
- Scorer: LLM-as-judge with rubric
- Target: >85%
Completeness: Does it include all necessary steps?
- Scorer: Custom function checking for key phrases
- Target: >80%
Tone compliance: Is it appropriately friendly?
- Scorer: LLM-as-judge comparing to brand guidelines
- Target: >7/10
Safety: No PII leakage, no unauthorized promises
- Scorer: Regex + NER model
- Target: 100% (hard requirement)
Efficiency: Token usage reasonable
- Scorer: Token count
- Target: <500 tokens per response
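One way to keep these targets from living only in a planning doc is to encode them in a small config that both the eval harness and the CI gate read; a sketch (the structure and scorer names are illustrative, not a specific EvalOps format):
SUCCESS_METRICS = {
    "accuracy":     {"scorer": "llm_judge_rubric", "target": 0.85, "hard": False},
    "completeness": {"scorer": "key_phrase_check", "target": 0.80, "hard": False},
    "tone":         {"scorer": "llm_judge_brand",  "target": 7.0,  "hard": False},
    "safety":       {"scorer": "regex_plus_ner",   "target": 1.0,  "hard": True},   # hard requirement
    "max_tokens":   {"scorer": "token_count",      "target": 500,  "hard": False},  # upper bound, not a floor
}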
Phase 2: Build the Golden Dataset
Collect 100 real customer questions from support tickets, anonymize them, and categorize:
- 40 common questions (password reset, shipping info, returns)
- 30 medium-difficulty (edge cases, multi-part questions)
- 20 complex (requires policy interpretation, multi-step resolution)
- 10 adversarial (attempts to extract PII, requests for refunds bot shouldn't grant)
For each, create a reference answer or, at minimum, note the key points that must be included.
Store in a structured format:
{
"dataset": "customer-support-qa-v1",
"examples": [
{
"id": "cs-001",
"input": "I forgot my password and the reset email isn't coming",
"category": "common",
"difficulty": "easy",
"expected_elements": [
"Check spam folder",
"Verify email on file",
"Offer alternative reset method (SMS)",
"Provide support contact if still unresolved"
],
"must_not_include": [
"Direct password reset without verification"
]
}
]
}
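The expected_elements and must_not_include fields map directly onto the completeness scorer mentioned above. A minimal sketch in plain Python; a real implementation would use fuzzy or LLM-based matching, since both fields describe behaviors rather than exact strings:
def completeness_scorer(output: str, example: dict) -> float:
    """Fraction of expected elements present; zero if forbidden content appears."""
    text = output.lower()
    expected = example.get("expected_elements", [])
    forbidden = example.get("must_not_include", [])

    if any(phrase.lower() in text for phrase in forbidden):
        return 0.0  # hard fail on content the response must never contain
    if not expected:
        return 1.0
    hits = sum(1 for phrase in expected if phrase.lower() in text)
    return hits / len(expected)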
Phase 3: Build the Baseline
Write a simple initial prompt and run it through the eval set:
PROMPT_V1 = """
You are a customer support agent for our e-commerce platform.
Answer the following customer question helpfully and professionally.
Question: {question}
"""
# Run evaluation
results = evalops.evaluate(
prompt=PROMPT_V1,
dataset="customer-support-qa-v1",
model="gpt-4",
scorers=[
accuracy_scorer,
completeness_scorer,
tone_scorer,
safety_scorer,
token_counter
]
)
print(results.summary())
Baseline results:
- Accuracy: 68%
- Completeness: 62%
- Tone: 7.2/10
- Safety: 95% (5 failures!)
- Avg tokens: 380
This gives you a starting point. Every future iteration must beat these numbers.
Phase 4: Iterate and Measure
Try improvements:
Iteration 1: Add context and constraints
PROMPT_V2 = """
You are a customer support agent for [Company Name].
Context:
- Users expect friendly but professional responses
- Always verify user identity before discussing account details
- If you don't know something, direct to human support
Answer the following customer question:
Question: {question}
"""
Results:
- Accuracy: 74% (+6%)
- Completeness: 70% (+8%)
- Tone: 7.8/10 (+0.6)
- Safety: 98% (+3%)
- Avg tokens: 420 (+40)
Better, but safety isn't 100% yet and tokens increased.
Iteration 2: Add few-shot examples
PROMPT_V3 = """
You are a customer support agent for [Company Name].
Examples of good responses:
Q: I forgot my password
A: I can help with that! First, check your spam folder for the reset email. If it's not there, verify the email address on your account matches the one you're using. You can also reset via SMS if you have a phone number on file. If none of this works, contact our support team at support@company.com.
Q: Where is my order?
A: I'd be happy to check on your order status. To look this up, I'll need your order number or the email address you used to place the order. For security, I can't access account details without verification. Can you provide your order number?
Now answer this question:
Question: {question}
"""
Results:
- Accuracy: 82% (+8%)
- Completeness: 85% (+15%)
- Tone: 8.4/10 (+0.6)
- Safety: 100% ✓
- Avg tokens: 450 (+30)
Decision point: Safety is now perfect. Accuracy and completeness meet targets. Tone is excellent. Token usage is acceptable for the quality gain. Ship v3.
Phase 5: Continuous Evaluation in Production
Deploy with instrumentation:
@app.post("/support/chat")
@evalops.trace(scenario="customer-support-qa")
async def handle_support_query(question: str, session_id: str):
response = await generate_response(question, PROMPT_V3)
# Async scoring in production
evalops.score_async(
trace_id=response.trace_id,
scorers=[accuracy_scorer, safety_scorer, tone_scorer]
)
return {"answer": response.text}
Monitor production metrics daily. If they drift from eval set results:
- Investigate: Are users asking new types of questions?
- Update eval set: Add new patterns to golden dataset
- Re-evaluate: Does current prompt still perform well on expanded set?
- Iterate: Improve prompt or switch models if needed
Phase 6: Regression Testing on Changes
A few weeks later, you want to switch from GPT-4 to Claude 3.5 to save costs. Don't just deploy—evaluate first:
results_gpt4 = evalops.evaluate(
prompt=PROMPT_V3,
dataset="customer-support-qa-v1",
model="gpt-4"
)
results_claude = evalops.evaluate(
prompt=PROMPT_V3,
dataset="customer-support-qa-v1",
model="claude-3-5-sonnet"
)
comparison = results_claude.compare_to(results_gpt4)
print(comparison)
Output:
Metric        GPT-4     Claude 3.5    Delta
Accuracy      82%       79%           -3%
Completeness  85%       83%           -2%
Tone          8.4/10    8.6/10        +0.2
Safety        100%      100%          0%
Cost/query    $0.04     $0.02         -50%
Decision: Claude is slightly worse on accuracy/completeness but massively cheaper. Is the 3% accuracy drop acceptable for 50% cost savings?
Run an A/B test in production:
- 10% of traffic to Claude
- Monitor user feedback and escalation rates
- If real-world performance matches eval results, roll out fully
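Routing the 10% deterministically (for example, by hashing the session ID) keeps each user on a consistent variant for the duration of the test; a sketch with the models from this example:
import hashlib

def pick_model(session_id: str, rollout_pct: int = 10) -> str:
    """Send a fixed percentage of sessions to the candidate model, consistently per user."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "claude-3-5-sonnet" if bucket < rollout_pct else "gpt-4"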
Building Domain-Specific Scorers
Generic metrics (accuracy, tone) are a start, but domain-specific scorers are where EDD shines.
Example: E-commerce product description generator
You need descriptions that are:
- Accurate: Match product specifications
- SEO-optimized: Include target keywords naturally
- Engaging: High click-through language
- On-brand: Match company style guide
Custom scorers:
def spec_accuracy_scorer(output: str, product_specs: dict) -> float:
    """Fraction of key specifications mentioned in the description"""
    mentions = 0
    for spec_value in product_specs.values():
        # Cast to str so numeric specs (e.g. battery capacity) compare correctly
        if str(spec_value).lower() in output.lower():
            mentions += 1
    return mentions / len(product_specs)
def seo_keyword_scorer(output: str, keywords: list[str]) -> float:
"""Check if target keywords are present"""
found = sum(1 for kw in keywords if kw.lower() in output.lower())
return found / len(keywords)
import textstat  # pip install textstat

def readability_scorer(output: str) -> float:
    """Flesch reading ease score, normalized to 0-1"""
    score = textstat.flesch_reading_ease(output)
    # Flesch scores can fall outside 0-100; clamp before normalizing
    return max(0.0, min(score, 100.0)) / 100
def brand_voice_scorer(output: str) -> float:
"""LLM-as-judge with brand guidelines"""
judge_prompt = f"""
Our brand voice is: friendly, informative, and aspirational.
We avoid: hype, superlatives without backing, technical jargon.
Rate this product description for brand alignment (0-10):
{output}
"""
rating = llm_judge(judge_prompt)
return rating / 10
Now every product description is scored on these dimensions:
results = evalops.evaluate(
prompt=PRODUCT_DESCRIPTION_PROMPT,
dataset="product-catalog-sample-100",
scorers=[
spec_accuracy_scorer,
seo_keyword_scorer,
readability_scorer,
brand_voice_scorer
]
)
This tells you exactly where your prompt is weak. If spec_accuracy is low, you need better instructions to reference product data. If seo_keyword_scorer is low, explicitly list keywords in the prompt.
Integrating with CI/CD
Evaluation-driven development requires automation. On every pull request that touches prompts or models:
GitHub Actions workflow:
name: Evaluate AI Changes
on:
pull_request:
paths:
- 'prompts/**'
- 'src/ai/**'
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run evaluation suite
env:
EVALOPS_API_KEY: ${{ secrets.EVALOPS_API_KEY }}
run: |
pip install evalops
evalops run --suite customer-support-qa \
--compare-to main \
--fail-on-regression
- name: Post results to PR
uses: evalops/pr-comment-action@v1
with:
comparison: true
The PR now shows:
Evaluation Results: customer-support-qa
Metric       Current   Main Branch   Delta
Accuracy     84%       82%           +2%  ✅
Safety       100%      100%          0%   ✅
Avg tokens   430       450           -20  ✅
All metrics passed. Safe to merge.
Or, if it regresses:
⚠️ Evaluation Failed
Metric     Current   Main Branch   Delta
Accuracy   79%       82%           -3%  ❌
Safety     98%       100%          -2%  ❌
Regressions detected. Review before merging.
No more shipping changes that degrade quality silently.
Handling Non-Determinism
LLMs are stochastic. The same prompt can produce different outputs. How do you handle this in evaluation?
Strategy 1: Multiple Runs Per Example
Run each eval example 3-5 times and aggregate:
results = evalops.evaluate(
prompt=PROMPT,
dataset="eval-set",
runs_per_example=5,
aggregation="mean"
)
If accuracy is 80%, 85%, 82%, 78%, and 84% across runs, report the mean (81.8%) and sample standard deviation (≈2.9%). A tighter standard deviation means a more consistent model.
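The aggregation itself is just descriptive statistics over the per-run scores; a quick check with the standard library, using the numbers above:
import statistics

run_accuracies = [0.80, 0.85, 0.82, 0.78, 0.84]           # the five runs above
print(f"mean={statistics.mean(run_accuracies):.3f}")       # 0.818
print(f"stdev={statistics.stdev(run_accuracies):.3f}")     # 0.029 (sample standard deviation)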
Strategy 2: Temperature Control
Set temperature=0 during evaluation for maximum determinism:
results = evalops.evaluate(
prompt=PROMPT,
dataset="eval-set",
model_params={"temperature": 0}
)
This reduces variance but doesn't eliminate it (models still have some randomness even at temp=0).
Strategy 3: Statistical Significance Testing
When comparing two prompts, use statistical tests to ensure differences aren't noise:
comparison = evalops.compare(
prompt_a=PROMPT_V1,
prompt_b=PROMPT_V2,
dataset="eval-set",
runs_per_example=10,
test="t-test"
)
print(comparison.significance)
# "Prompt B is significantly better (p < 0.05)"
Only ship changes that are statistically significant improvements, not just lucky samples.
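Under the hood, a comparison like this is typically a paired test over per-example scores from the two prompts; a sketch with scipy, independent of the API above (scores are illustrative):
from scipy.stats import ttest_rel

# Per-example accuracy for the same eval set under each prompt (illustrative values)
scores_v1 = [0.70, 0.80, 0.60, 0.75, 0.72, 0.68, 0.74, 0.71]
scores_v2 = [0.78, 0.85, 0.70, 0.80, 0.79, 0.74, 0.82, 0.77]

t_stat, p_value = ttest_rel(scores_v2, scores_v1)
if p_value < 0.05 and t_stat > 0:
    print(f"Prompt V2 is significantly better (p={p_value:.3f})")
else:
    print(f"No significant difference detected (p={p_value:.3f})")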
Common Pitfalls and How to Avoid Them
Pitfall 1: Overfitting to the Eval Set
You iterate on prompts while watching eval metrics. Eventually, you're optimizing for the 100 examples in your golden dataset, not real-world performance.
Solution: Split your data:
- Dev set (80 examples): Use for active iteration
- Test set (20 examples): Hold out, only evaluate final prompt candidates
- Production monitoring: Continuously collect new examples and refresh eval sets monthly
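A fixed random seed keeps the held-out test set stable across iterations; a minimal split sketch (the 80/20 sizes follow the list above):
import random

def split_dataset(examples: list[dict], test_size: int = 20, seed: int = 42):
    """Shuffle once with a fixed seed, then carve off a held-out test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]   # dev_set, test_set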
Pitfall 2: Proxy Metrics That Don't Correlate
You optimize for "response length < 200 tokens" because it's easy to measure, but users actually prefer longer, more detailed answers.
Solution: Validate that automated metrics correlate with user satisfaction. If your LLM-as-judge "accuracy" score doesn't match human ratings or user feedback, refine the judge prompt or use a different approach.
Pitfall 3: Ignoring Edge Cases
Your eval set is 80% easy questions, 20% medium. You optimize for the easy ones and average metrics look great, but production users hit the hard cases you didn't test.
Solution: Stratified sampling in your eval set. Ensure adequate coverage of:
- Common cases (to avoid breaking the basics)
- Edge cases (rare but important)
- Adversarial cases (misuse attempts)
Track metrics per category, not just overall averages.
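Per-category tracking takes only a few lines once each result row carries its category tag and metric score, as in the dataset format above; a sketch in plain Python (field names are illustrative):
from collections import defaultdict

def scores_by_category(results: list[dict]) -> dict[str, float]:
    """Average a metric per category instead of reporting one overall number."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["accuracy"])
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}

# e.g. {"common": 0.91, "medium": 0.78, "complex": 0.64, "adversarial": 0.95}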
Pitfall 4: Evaluation Theater
You run evals, generate pretty dashboards, but don't actually block bad deployments. Teams ship "because we're on a deadline" even when metrics regress.
Solution: Make evaluation a gate, not a suggestion. CI fails if metrics drop. Product managers understand that shipping a 75% accurate bot creates more support load than delaying a week to get to 85%.
Pitfall 5: Analysis Paralysis
You have 20 different metrics. Some go up, some go down. You can't decide if a change is good or bad.
Solution: Define a composite score or hierarchy of metrics:
1. Safety (must be 100%, non-negotiable)
2. Accuracy (primary: must be >80%)
3. Tone (secondary: should be >7/10)
4. Efficiency (tertiary: optimize if above thresholds met)
Ship changes that maintain #1, improve #2, and don't regress #3 or #4 significantly.
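That hierarchy is straightforward to encode as a release gate; a minimal sketch (thresholds follow the list above; the tolerances for #3 and #4 are illustrative):
def release_gate(metrics: dict, baseline: dict) -> bool:
    """Apply the metric hierarchy: safety is absolute, accuracy is primary, the rest must not regress badly."""
    if metrics["safety"] < 1.0:                                  # 1. non-negotiable
        return False
    if metrics["accuracy"] < 0.80:                               # 2. primary target
        return False
    if metrics["tone"] < baseline["tone"] - 0.5:                 # 3. allow only a small dip
        return False
    if metrics["avg_tokens"] > baseline["avg_tokens"] * 1.2:     # 4. efficiency guardrail
        return False
    return True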
Scaling Evaluation as Your Application Grows
Start Small: Single Scenario
Begin with your highest-value use case:
- 50-100 examples
- 3-5 core metrics
- Manual review to validate scorer accuracy
Expand: Multiple Scenarios
As you add features, each gets its own eval set:
- Customer support QA
- Product recommendation
- Content generation
- Code completion
Share scorers where possible (e.g., safety checks across all scenarios).
Automate: Continuous Eval Pipeline
Run evaluations:
- On every PR (regression testing)
- Nightly (full suite against latest prod data)
- Weekly (expanded test sets including new edge cases)
Integrate: Feedback Loops
Connect eval results to:
- User feedback (correlate low-scoring traces with negative ratings)
- A/B test outcomes (which variant performs better in production)
- Support tickets (tag traces that led to escalations)
Use this to continuously improve eval sets: production failures become tomorrow's test cases.
Measuring the Impact of EDD
Teams that adopt evaluation-driven development report:
- Faster iteration: No more "try and see" deployments. Changes are validated in minutes, not weeks of production monitoring.
- Higher confidence: Ship knowing exactly how quality and cost trade off, not guessing.
- Fewer production incidents: Regressions caught in CI, not by users.
- Better collaboration: Product, engineering, and ML teams align on quantitative goals instead of debating subjective quality.
- Improved model utilization: Data-driven model selection (GPT-4 vs Claude vs Llama) based on task-specific benchmarks, not vendor marketing.
Getting Started: Your First Evaluation-Driven Feature
Week 1:
- Pick one LLM feature to evaluate
- Define 3 success metrics
- Collect 50 representative examples
- Annotate expected outputs or key characteristics
Week 2:
- Write scorers (start with LLM-as-judge for subjective metrics)
- Run baseline eval with current prompt
- Document results
Week 3:
- Iterate on prompt (3-5 variants)
- Evaluate each variant
- Pick winner based on data
- Deploy with instrumentation
Week 4:
- Monitor production metrics
- Compare to eval set performance
- Refine scorers if production diverges from expectations
- Add new examples to eval set based on production edge cases
Repeat monthly with expanded eval sets and additional scenarios.
Conclusion
Evaluation-driven development brings the discipline of test-driven development to LLM applications. It replaces "seems fine" with "measurably better," "probably works" with "verified against 200 examples," and "ship and pray" with "ship with confidence."
The investment is small—days to set up your first eval pipeline—but the returns compound. Every subsequent feature is faster to build, cheaper to maintain, and higher quality at launch.
Most importantly, EDD shifts the conversation from "Does this feel good?" to "Does this meet our quantitative quality bar?" That's the difference between hobbyist AI and production-grade systems.
Start simple. Pick your highest-impact feature. Define metrics. Build a golden dataset. Automate evaluation. Then never ship an AI feature without data again.
Next Steps:
- Set up your first eval pipeline with EvalOps
- Import pre-built scorers from Spellbook
- Join the community to share eval patterns
Questions about implementing EDD for your use case? Email us at hello@evalops.dev.