The Testing Crisis in LLM Development
You write a prompt. It works on your example. You ship it. Two days later, users report it's "giving weird answers." You can't reproduce the issue with the same input. You try different phrasings. Sometimes it works, sometimes it doesn't. You add more instructions to the prompt. It fixes one case but breaks another. You're debugging by vibes.
This isn't sustainable. Traditional software engineering has test-driven development (TDD): write tests, write code, verify tests pass, ship with confidence. But how do you write a test for "generates a helpful product description" or "answers customer questions accurately"?
Evaluation-driven development (EDD) adapts the rigor of TDD to the non-deterministic nature of LLMs. Instead of asserting exact outputs, you define quality metrics, build evaluation datasets, and measure improvements quantitatively. Every change—prompt tweaks, model switches, retrieval adjustments—is validated against benchmarks before shipping.
The Core Principles of Evaluation-Driven Development
1. Metrics Before Code
Before writing a single prompt, define what good looks like:
- For a customer support bot: "Resolves issue without escalation" (target: 80% of cases)
- For a content generator: "On-brand, factually accurate, engaging" (3 separately scorable dimensions)
- For a code assistant: "Syntactically correct, solves the stated problem, includes error handling"
These become your acceptance criteria, just like in traditional development. But instead of boolean pass/fail, they're scored on continuous scales (0-10, 0-100%, etc.).
2. Build a Golden Dataset First
Before iterating on prompts, collect 50-200 representative examples of the task:
- Real user queries (anonymized)
- Edge cases you know are hard
- Common variations (different phrasings, typos, multi-step requests)
- Adversarial inputs (jailbreak attempts, prompt injections)
For each example, annotate:
- The input
- Expected output (or output characteristics)
- Metadata (difficulty level, category, required reasoning steps)
This becomes your eval set—the benchmark against which every iteration is measured.
3. Automate Evaluation
Manual review doesn't scale. You need automated scorers that can evaluate hundreds of outputs in minutes:
- LLM-as-judge: Use GPT-4 or Claude to rate outputs ("How well does this answer the question? 0-10")
- Rule-based checks: Regex for format compliance, keyword presence, length constraints
- Embedding similarity: Compare output to reference answers semantically
- Custom logic: Domain-specific validators (e.g., check if generated SQL is valid)
Aim for correlation with human judgment >0.7. If your automated scores don't match what you'd rate manually, refine the scoring prompt or switch methods.
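A quick way to check that threshold is to score a small sample both ways and correlate the results; a minimal sketch using scipy (the scores below are illustrative, not taken from the examples in this post):
from scipy.stats import spearmanr

# Automated judge scores vs. human ratings for the same 10 outputs (illustrative values)
judge_scores = [8, 6, 9, 3, 7, 5, 8, 4, 9, 6]
human_scores = [7, 6, 9, 2, 8, 5, 7, 5, 9, 6]

corr, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.3f})")

if corr < 0.7:
    print("Judge disagrees with humans too often -- refine the rubric or switch scorers.")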
4. Iterate with Data, Not Intuition
Every change generates new metrics:
Baseline prompt:
- Accuracy: 72%
- Hallucination rate: 18%
- Avg tokens: 245
After adding "Be concise":
- Accuracy: 71% (-1%)
- Hallucination rate: 19% (+1%)
- Avg tokens: 180 (-27%)
After adding few-shot examples:
- Accuracy: 81% (+9%)
- Hallucination rate: 12% (-6%)
- Avg tokens: 210 (-14%)
Data-driven decisions: The few-shot approach is clearly superior. Ship it.
5. Regression Testing on Every Change
Traditional software has CI that runs tests on every commit. EDD is the same: every prompt update, model switch, or retrieval tuning triggers automatic eval runs.
If scores drop below thresholds, the deployment is blocked. No more "seems fine" launches that break production silently.
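In practice the gate is a comparison of the candidate run against the current baseline with a per-metric tolerance; a minimal sketch of that check (the tolerance and metric names are illustrative):
def has_regression(candidate: dict, baseline: dict, tolerance: float = 0.02) -> bool:
    """Flag any metric that drops more than `tolerance` below the baseline."""
    return any(
        candidate[name] < baseline[name] - tolerance
        for name in baseline
    )

# Block the deploy if the new prompt regresses on any tracked metric
if has_regression({"accuracy": 0.79, "safety": 0.98}, {"accuracy": 0.82, "safety": 1.00}):
    raise SystemExit("Evaluation regression detected -- deployment blocked")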
The Evaluation-Driven Development Workflow
Phase 1: Define Success Metrics
Start with the business goal. What are users trying to achieve? What does failure look like?
Example: Building a customer support QA bot
Goals:
- Answer customer questions without needing human escalation
- Stay on-brand (friendly but professional tone)
- Never leak PII or make promises about refunds/policies
Translate to metrics:
Accuracy (primary): Does the answer correctly address the question?
- Scorer: LLM-as-judge with rubric
- Target: >85%
Completeness: Does it include all necessary steps?
- Scorer: Custom function checking for key phrases
- Target: >80%
Tone compliance: Is it appropriately friendly?
- Scorer: LLM-as-judge comparing to brand guidelines
- Target: >7/10
Safety: No PII leakage, no unauthorized promises
- Scorer: Regex + NER model
- Target: 100% (hard requirement)
Efficiency: Token usage reasonable
- Scorer: Token count
- Target: <500 tokens per response
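One way to keep these targets from living only in a planning doc is to encode them in a small config that both the eval harness and the CI gate read; a sketch (the structure and scorer names are illustrative, not a specific EvalOps format):
SUCCESS_METRICS = {
    "accuracy":     {"scorer": "llm_judge_rubric", "target": 0.85, "hard": False},
    "completeness": {"scorer": "key_phrase_check", "target": 0.80, "hard": False},
    "tone":         {"scorer": "llm_judge_brand",  "target": 7.0,  "hard": False},
    "safety":       {"scorer": "regex_plus_ner",   "target": 1.0,  "hard": True},   # hard requirement
    "max_tokens":   {"scorer": "token_count",      "target": 500,  "hard": False},  # upper bound, not a floor
}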
Phase 2: Build the Golden Dataset
Collect 100 real customer questions from support tickets, anonymize them, and categorize:
- 40 common questions (password reset, shipping info, returns)
- 30 medium-difficulty (edge cases, multi-part questions)
- 20 complex (requires policy interpretation, multi-step resolution)
- 10 adversarial (attempts to extract PII, requests for refunds bot shouldn't grant)
For each, create a reference answer or, at minimum, note the key points that must be included.
Store in a structured format:
{
"dataset": "customer-support-qa-v1",
"examples": [
{
"id": "cs-001",
"input": "I forgot my password and the reset email isn't coming",
"category": "common",
"difficulty": "easy",
"expected_elements": [
"Check spam folder",
"Verify email on file",
"Offer alternative reset method (SMS)",
"Provide support contact if still unresolved"
],
"must_not_include": [
"Direct password reset without verification"
]
}
]
}
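The expected_elements and must_not_include fields map directly onto the completeness scorer mentioned above. A minimal sketch in plain Python; a real implementation would use fuzzy or LLM-based matching, since both fields describe behaviors rather than exact strings:
def completeness_scorer(output: str, example: dict) -> float:
    """Fraction of expected elements present; zero if forbidden content appears."""
    text = output.lower()
    expected = example.get("expected_elements", [])
    forbidden = example.get("must_not_include", [])

    if any(phrase.lower() in text for phrase in forbidden):
        return 0.0  # hard fail on content the response must never contain
    if not expected:
        return 1.0
    hits = sum(1 for phrase in expected if phrase.lower() in text)
    return hits / len(expected)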
Phase 3: Build the Baseline
Write a simple initial prompt and run it through the eval set:
PROMPT_V1 = """
You are a customer support agent for our e-commerce platform.
Answer the following customer question helpfully and professionally.
Question: {question}
"""
# Run evaluation
results = evalops.evaluate(
prompt=PROMPT_V1,
dataset="customer-support-qa-v1",
model="gpt-4",
scorers=[
accuracy_scorer,
completeness_scorer,
tone_scorer,
safety_scorer,
token_counter
]
)
print(results.summary())
Baseline results:
- Accuracy: 68%
- Completeness: 62%
- Tone: 7.2/10
- Safety: 95% (5 failures!)
- Avg tokens: 380
This gives you a starting point. Every future iteration must beat these numbers.
Phase 4: Iterate and Measure
Try improvements:
Iteration 1: Add context and constraints
PROMPT_V2 = """
You are a customer support agent for [Company Name].
Context:
- Users expect friendly but professional responses
- Always verify user identity before discussing account details
- If you don't know something, direct to human support
Answer the following customer question:
Question: {question}
"""
Results:
- Accuracy: 74% (+6%)
- Completeness: 70% (+8%)
- Tone: 7.8/10 (+0.6)
- Safety: 98% (+3%)
- Avg tokens: 420 (+40)
Better, but safety isn't 100% yet and tokens increased.
Iteration 2: Add few-shot examples
PROMPT_V3 = """
You are a customer support agent for [Company Name].
Examples of good responses:
Q: I forgot my password
A: I can help with that! First, check your spam folder for the reset email. If it's not there, verify the email address on your account matches the one you're using. You can also reset via SMS if you have a phone number on file. If none of this works, contact our support team at support@company.com.
Q: Where is my order?
A: I'd be happy to check on your order status. To look this up, I'll need your order number or the email address you used to place the order. For security, I can't access account details without verification. Can you provide your order number?
Now answer this question:
Question: {question}
"""
Results:
- Accuracy: 82% (+8%)
- Completeness: 85% (+15%)
- Tone: 8.4/10 (+0.6)
- Safety: 100% ✓
- Avg tokens: 450 (+30)
Decision point: Safety is now perfect. Accuracy and completeness meet targets. Tone is excellent. Token usage is acceptable for the quality gain. Ship v3.
Phase 5: Continuous Evaluation in Production
Deploy with instrumentation:
@app.post("/support/chat")
@evalops.trace(scenario="customer-support-qa")
async def handle_support_query(question: str, session_id: str):
response = await generate_response(question, PROMPT_V3)
# Async scoring in production
evalops.score_async(
trace_id=response.trace_id,
scorers=[accuracy_scorer, safety_scorer, tone_scorer]
)
return {"answer": response.text}
Monitor production metrics daily. If they drift from eval set results:
- Investigate: Are users asking new types of questions?
- Update eval set: Add new patterns to golden dataset
- Re-evaluate: Does current prompt still perform well on expanded set?
- Iterate: Improve prompt or switch models if needed
Phase 6: Regression Testing on Changes
A few weeks later, you want to switch from GPT-4 to Claude 3.5 to save costs. Don't just deploy—evaluate first:
results_gpt4 = evalops.evaluate(
prompt=PROMPT_V3,
dataset="customer-support-qa-v1",
model="gpt-4"
)
results_claude = evalops.evaluate(
prompt=PROMPT_V3,
dataset="customer-support-qa-v1",
model="claude-3-5-sonnet"
)
comparison = results_claude.compare_to(results_gpt4)
print(comparison)
Output:
Metric        GPT-4     Claude 3.5    Delta
Accuracy      82%       79%           -3%
Completeness  85%       83%           -2%
Tone          8.4/10    8.6/10        +0.2
Safety        100%      100%          0%
Cost/query    $0.04     $0.02         -50%
Decision: Claude is slightly worse on accuracy/completeness but massively cheaper. Is the 3% accuracy drop acceptable for 50% cost savings?
Run an A/B test in production:
- 10% of traffic to Claude
- Monitor user feedback and escalation rates
- If real-world performance matches eval results, roll out fully
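Routing the 10% deterministically (for example, by hashing the session ID) keeps each user on a consistent variant for the duration of the test; a sketch with the models from this example:
import hashlib

def pick_model(session_id: str, rollout_pct: int = 10) -> str:
    """Send a fixed percentage of sessions to the candidate model, consistently per user."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "claude-3-5-sonnet" if bucket < rollout_pct else "gpt-4"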
Building Domain-Specific Scorers
Generic metrics (accuracy, tone) are a start, but domain-specific scorers are where EDD shines.
Example: E-commerce product description generator
You need descriptions that are:
- Accurate: Match product specifications
- SEO-optimized: Include target keywords naturally
- Engaging: High click-through language
- On-brand: Match company style guide
Custom scorers:
def spec_accuracy_scorer(output: str, product_specs: dict) -> float:
    """Fraction of key specifications mentioned in the description"""
    mentions = 0
    for spec_value in product_specs.values():
        # Cast to str so numeric specs (e.g. battery capacity) compare correctly
        if str(spec_value).lower() in output.lower():
            mentions += 1
    return mentions / len(product_specs)
def seo_keyword_scorer(output: str, keywords: list[str]) -> float:
"""Check if target keywords are present"""
found = sum(1 for kw in keywords if kw.lower() in output.lower())
return found / len(keywords)
import textstat  # pip install textstat

def readability_scorer(output: str) -> float:
    """Flesch reading ease score, normalized to 0-1"""
    score = textstat.flesch_reading_ease(output)
    # Flesch scores can fall outside 0-100; clamp before normalizing
    return max(0.0, min(score, 100.0)) / 100
def brand_voice_scorer(output: str) -> float:
"""LLM-as-judge with brand guidelines"""
judge_prompt = f"""
Our brand voice is: friendly, informative, and aspirational.
We avoid: hype, superlatives without backing, technical jargon.
Rate this product description for brand alignment (0-10):
{output}
"""
rating = llm_judge(judge_prompt)
return rating / 10
Now every product description is scored on these dimensions:
results = evalops.evaluate(
prompt=PRODUCT_DESCRIPTION_PROMPT,
dataset="product-catalog-sample-100",
scorers=[
spec_accuracy_scorer,
seo_keyword_scorer,
readability_scorer,
brand_voice_scorer
]
)
This tells you exactly where your prompt is weak. If spec_accuracy is low, you need better instructions to reference product data. If seo_keyword_scorer is low, explicitly list keywords in the prompt.
Integrating with CI/CD
Evaluation-driven development requires automation. On every pull request that touches prompts or models:
GitHub Actions workflow:
name: Evaluate AI Changes
on:
pull_request:
paths:
- 'prompts/**'
- 'src/ai/**'
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run evaluation suite
env:
EVALOPS_API_KEY: ${{ secrets.EVALOPS_API_KEY }}
run: |
pip install evalops
evalops run --suite customer-support-qa \
--compare-to main \
--fail-on-regression
- name: Post results to PR
uses: evalops/pr-comment-action@v1
with:
comparison: true
The PR now shows:
Evaluation Results: customer-support-qa
Metric       Current   Main Branch   Delta
Accuracy     84%       82%           +2%  ✅
Safety       100%      100%          0%   ✅
Avg tokens   430       450           -20  ✅
All metrics passed. Safe to merge.
Or, if it regresses:
⚠️ Evaluation Failed
Metric     Current   Main Branch   Delta
Accuracy   79%       82%           -3%  ❌
Safety     98%       100%          -2%  ❌
Regressions detected. Review before merging.
No more shipping changes that degrade quality silently.
Handling Non-Determinism
LLMs are stochastic. The same prompt can produce different outputs. How do you handle this in evaluation?
Strategy 1: Multiple Runs Per Example
Run each eval example 3-5 times and aggregate:
results = evalops.evaluate(
prompt=PROMPT,
dataset="eval-set",
runs_per_example=5,
aggregation="mean"
)
If accuracy is 80%, 85%, 82%, 78%, and 84% across runs, report the mean (81.8%) and sample standard deviation (≈2.9%). A tighter standard deviation means a more consistent model.
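The aggregation itself is just descriptive statistics over the per-run scores; a quick check with the standard library, using the numbers above:
import statistics

run_accuracies = [0.80, 0.85, 0.82, 0.78, 0.84]           # the five runs above
print(f"mean={statistics.mean(run_accuracies):.3f}")       # 0.818
print(f"stdev={statistics.stdev(run_accuracies):.3f}")     # 0.029 (sample standard deviation)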
Strategy 2: Temperature Control
Set temperature=0 during evaluation for maximum determinism:
results = evalops.evaluate(
prompt=PROMPT,
dataset="eval-set",
model_params={"temperature": 0}
)
This reduces variance but doesn't eliminate it (models still have some randomness even at temp=0).
Strategy 3: Statistical Significance Testing
When comparing two prompts, use statistical tests to ensure differences aren't noise:
comparison = evalops.compare(
prompt_a=PROMPT_V1,
prompt_b=PROMPT_V2,
dataset="eval-set",
runs_per_example=10,
test="t-test"
)
print(comparison.significance)
# "Prompt B is significantly better (p < 0.05)"
Only ship changes that are statistically significant improvements, not just lucky samples.
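Under the hood, a comparison like this is typically a paired test over per-example scores from the two prompts; a sketch with scipy, independent of the API above (scores are illustrative):
from scipy.stats import ttest_rel

# Per-example accuracy for the same eval set under each prompt (illustrative values)
scores_v1 = [0.70, 0.80, 0.60, 0.75, 0.72, 0.68, 0.74, 0.71]
scores_v2 = [0.78, 0.85, 0.70, 0.80, 0.79, 0.74, 0.82, 0.77]

t_stat, p_value = ttest_rel(scores_v2, scores_v1)
if p_value < 0.05 and t_stat > 0:
    print(f"Prompt V2 is significantly better (p={p_value:.3f})")
else:
    print(f"No significant difference detected (p={p_value:.3f})")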
Common Pitfalls and How to Avoid Them
Pitfall 1: Overfitting to the Eval Set
You iterate on prompts while watching eval metrics. Eventually, you're optimizing for the 100 examples in your golden dataset, not real-world performance.
Solution: Split your data:
- Dev set (80 examples): Use for active iteration
- Test set (20 examples): Hold out, only evaluate final prompt candidates
- Production monitoring: Continuously collect new examples and refresh eval sets monthly
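A fixed random seed keeps the held-out test set stable across iterations; a minimal split sketch (the 80/20 sizes follow the list above):
import random

def split_dataset(examples: list[dict], test_size: int = 20, seed: int = 42):
    """Shuffle once with a fixed seed, then carve off a held-out test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]   # dev_set, test_set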
Pitfall 2: Proxy Metrics That Don't Correlate
You optimize for "response length < 200 tokens" because it's easy to measure, but users actually prefer longer, more detailed answers.
Solution: Validate that automated metrics correlate with user satisfaction. If your LLM-as-judge "accuracy" score doesn't match human ratings or user feedback, refine the judge prompt or use a different approach.
Pitfall 3: Ignoring Edge Cases
Your eval set is 80% easy questions, 20% medium. You optimize for the easy ones and average metrics look great, but production users hit the hard cases you didn't test.
Solution: Stratified sampling in your eval set. Ensure adequate coverage of:
- Common cases (to avoid breaking the basics)
- Edge cases (rare but important)
- Adversarial cases (misuse attempts)
Track metrics per category, not just overall averages.
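Per-category tracking takes only a few lines once each result row carries its category tag and metric score, as in the dataset format above; a sketch in plain Python (field names are illustrative):
from collections import defaultdict

def scores_by_category(results: list[dict]) -> dict[str, float]:
    """Average a metric per category instead of reporting one overall number."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["accuracy"])
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}

# e.g. {"common": 0.91, "medium": 0.78, "complex": 0.64, "adversarial": 0.95}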
Pitfall 4: Evaluation Theater
You run evals, generate pretty dashboards, but don't actually block bad deployments. Teams ship "because we're on a deadline" even when metrics regress.
Solution: Make evaluation a gate, not a suggestion. CI fails if metrics drop. Product managers understand that shipping a 75% accurate bot creates more support load than delaying a week to get to 85%.
Pitfall 5: Analysis Paralysis
You have 20 different metrics. Some go up, some go down. You can't decide if a change is good or bad.
Solution: Define a composite score or hierarchy of metrics:
1. Safety (must be 100%, non-negotiable)
2. Accuracy (primary: must be >80%)
3. Tone (secondary: should be >7/10)
4. Efficiency (tertiary: optimize if above thresholds met)
Ship changes that maintain #1, improve #2, and don't regress #3 or #4 significantly.
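That hierarchy is straightforward to encode as a release gate; a minimal sketch (thresholds follow the list above; the tolerances for #3 and #4 are illustrative):
def release_gate(metrics: dict, baseline: dict) -> bool:
    """Apply the metric hierarchy: safety is absolute, accuracy is primary, the rest must not regress badly."""
    if metrics["safety"] < 1.0:                                  # 1. non-negotiable
        return False
    if metrics["accuracy"] < 0.80:                               # 2. primary target
        return False
    if metrics["tone"] < baseline["tone"] - 0.5:                 # 3. allow only a small dip
        return False
    if metrics["avg_tokens"] > baseline["avg_tokens"] * 1.2:     # 4. efficiency guardrail
        return False
    return True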
Scaling Evaluation as Your Application Grows
Start Small: Single Scenario
Begin with your highest-value use case:
- 50-100 examples
- 3-5 core metrics
- Manual review to validate scorer accuracy
Expand: Multiple Scenarios
As you add features, each gets its own eval set:
- Customer support QA
- Product recommendation
- Content generation
- Code completion
Share scorers where possible (e.g., safety checks across all scenarios).
Automate: Continuous Eval Pipeline
Run evaluations:
- On every PR (regression testing)
- Nightly (full suite against latest prod data)
- Weekly (expanded test sets including new edge cases)
Integrate: Feedback Loops
Connect eval results to:
- User feedback (correlate low-scoring traces with negative ratings)
- A/B test outcomes (which variant performs better in production)
- Support tickets (tag traces that led to escalations)
Use this to continuously improve eval sets: production failures become tomorrow's test cases.
Measuring the Impact of EDD
Teams that adopt evaluation-driven development report:
- Faster iteration: No more "try and see" deployments. Changes are validated in minutes, not weeks of production monitoring.
- Higher confidence: Ship knowing exactly how quality and cost trade off, not guessing.
- Fewer production incidents: Regressions caught in CI, not by users.
- Better collaboration: Product, engineering, and ML teams align on quantitative goals instead of debating subjective quality.
- Improved model utilization: Data-driven model selection (GPT-4 vs Claude vs Llama) based on task-specific benchmarks, not vendor marketing.
Getting Started: Your First Evaluation-Driven Feature
Week 1:
- Pick one LLM feature to evaluate
- Define 3 success metrics
- Collect 50 representative examples
- Annotate expected outputs or key characteristics
Week 2:
- Write scorers (start with LLM-as-judge for subjective metrics)
- Run baseline eval with current prompt
- Document results
Week 3:
- Iterate on prompt (3-5 variants)
- Evaluate each variant
- Pick winner based on data
- Deploy with instrumentation
Week 4:
- Monitor production metrics
- Compare to eval set performance
- Refine scorers if production diverges from expectations
- Add new examples to eval set based on production edge cases
Repeat monthly with expanded eval sets and additional scenarios.
Conclusion
Evaluation-driven development brings the discipline of test-driven development to LLM applications. It replaces "seems fine" with "measurably better," "probably works" with "verified against 200 examples," and "ship and pray" with "ship with confidence."
The investment is small—days to set up your first eval pipeline—but the returns compound. Every subsequent feature is faster to build, cheaper to maintain, and higher quality at launch.
Most importantly, EDD shifts the conversation from "Does this feel good?" to "Does this meet our quantitative quality bar?" That's the difference between hobbyist AI and production-grade systems.
Start simple. Pick your highest-impact feature. Define metrics. Build a golden dataset. Automate evaluation. Then never ship an AI feature without data again.
Next Steps:
- Set up your first eval pipeline with EvalOps
- Import pre-built scorers from Spellbook
- Join the community to share eval patterns
Questions about implementing EDD for your use case? Email us at hello@evalops.dev.