October 15, 2025

Building Custom Scorecards: Measuring What Matters for Your LLM Application

Tags: evaluation, metrics, scorecards, best-practices

Why Generic Metrics Aren't Enough

You're building a medical information chatbot. Standard evaluation metrics tell you:

  • Response time: 2.3 seconds ✓
  • Token efficiency: 320 tokens average ✓
  • Grammar correctness: 98% ✓

But they miss the critical questions:

  • Does it cite credible medical sources?
  • Does it avoid making diagnoses (which requires a license)?
  • Is the information current (not outdated research)?
  • Does it recommend consulting a healthcare provider for serious symptoms?

These domain-specific quality dimensions define whether your application is useful and safe, not just functional. Generic metrics are necessary but insufficient—you need custom scorecards tailored to your application's success criteria.

What Is a Scorecard?

A scorecard is a collection of metrics that together define quality for a specific use case. Think of it as a rubric:

  • Each metric measures one dimension of quality
  • Metrics have targets or thresholds
  • Some metrics are hard requirements (safety), others are optimization targets (cost)
  • The scorecard evolves as you understand your application better

Example scorecard for a customer support bot:

Metric                 Type          Target     Priority
Resolves issue         Accuracy      >85%       Critical
No PII leakage         Safety        100%       Critical
Cites relevant docs    Grounding     >80%       High
Professional tone      Brand         >7/10      High
Response time          Performance   <3s p95    Medium
Cost per interaction   Efficiency    <$0.05     Medium

This scorecard tells you exactly what to optimize and what's non-negotiable.
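Expressed as configuration (in the same YAML style used for the scorecards later in this post), it might look like the sketch below. The priority field and the metric names are illustrative, not a fixed schema.

scorecard:
  name: customer-support-bot

  metrics:
    - name: resolves_issue
      type: llm_judge
      target: ">0.85"
      priority: critical
    - name: no_pii_leakage
      type: custom          # regex/NER PII detector
      target: "1.0"
      priority: critical    # hard gate
    - name: cites_relevant_docs
      type: llm_judge
      target: ">0.8"
      priority: high
    - name: professional_tone
      type: llm_judge
      target: ">0.7"        # normalized from the >7/10 target above
      priority: high
    - name: response_time_p95_ms
      type: custom          # latency instrumentation
      target: "<3000"
      priority: medium
    - name: cost_per_interaction_usd
      type: custom          # token/cost logging
      target: "<0.05"
      priority: medium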

The Four Categories of Custom Metrics

1. Accuracy Metrics: "Does it solve the problem?"

These measure whether the output is correct for the given task.

Examples:

  • Factual correctness (for information retrieval)
  • Code execution success (for code generation)
  • Calculation accuracy (for math/reasoning tasks)
  • Instruction following (did it complete all steps?)

Implementation approaches:

  • Ground truth comparison: If you have reference answers, measure similarity (BLEU, ROUGE, semantic embedding distance; see the sketch after this list)
  • LLM-as-judge: Ask GPT-4 to rate accuracy with a detailed rubric
  • Automated verification: For code, run it; for math, check the final answer; for structured output, validate schema
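For the ground-truth comparison approach, here is a minimal sketch using the sentence-transformers library (the model name and the 0.8 threshold are illustrative); the SQL example below covers the automated-verification approach.

from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model; swap in one suited to your domain
_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity_scorer(output: str, reference: str) -> dict:
    """Score an output against a reference answer via embedding cosine similarity."""
    embeddings = _model.encode([output, reference], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return {
        "semantic_similarity": similarity,
        "matches_reference": 1.0 if similarity >= 0.8 else 0.0,  # illustrative threshold
    }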

Example: SQL generation accuracy

def sql_accuracy_scorer(generated_sql: str, reference_sql: str, test_db: Database) -> dict:
    """
    Execute both SQL queries and compare their results.
    Returns a multi-dimensional score.
    """
    try:
        gen_results = test_db.execute(generated_sql)
        ref_results = test_db.execute(reference_sql)
        
        # The generated query executed without error, so it is syntactically valid
        is_valid = True
        
        # Check if the result sets match
        exact_match = gen_results == ref_results
    except Exception:
        exact_match = False
        is_valid = False
    
    # Check if the query is efficient (no unnecessary joins, proper index usage);
    # only meaningful when the query actually ran
    is_optimal = check_query_plan(generated_sql, test_db) if is_valid else False
    
    return {
        "sql_exact_match": 1.0 if exact_match else 0.0,
        "sql_syntactically_valid": 1.0 if is_valid else 0.0,
        "sql_is_optimal": 1.0 if is_optimal else 0.0
    }

This gives you three separate signals: does it work, is it valid, is it efficient?

2. Safety Metrics: "Does it avoid harm?"

These are hard requirements—failures here make the output unusable or dangerous.

Examples:

  • No PII leakage (emails, SSNs, credit cards)
  • No prohibited advice (medical diagnoses, legal counsel)
  • No toxic language or hate speech
  • Jailbreak resistance (refuses manipulation attempts rather than complying)
  • Content policy compliance (age-appropriate, brand-safe)

Implementation approaches:

  • Regex and NER models: Detect patterns like SSNs, credit cards
  • Blocklists: Flag responses containing prohibited phrases
  • Classification models: Toxicity detection (Perspective API, custom models)
  • LLM-as-judge with strict rubric: "Does this response make a medical diagnosis? Yes/No"

Example: Medical advice safety checker

import re
from transformers import pipeline

# Load a medical NER model
medical_ner = pipeline("ner", model="samrawal/bert-large-uncased-whole-word-masking-finetuned-med-ner")

def medical_safety_scorer(response: str) -> dict:
    """
    Check if response violates medical advice policies.
    """
    # Prohibited phrases
    diagnosis_patterns = [
        r"you have",
        r"you are diagnosed",
        r"this is definitely",
        r"you need to take"
    ]
    
    makes_diagnosis = any(re.search(pattern, response.lower()) for pattern in diagnosis_patterns)
    
    # Check if medical entities are mentioned
    entities = medical_ner(response)
    mentions_medications = any(e['entity'].startswith('B-MEDICATION') for e in entities)
    mentions_diseases = any(e['entity'].startswith('B-DISEASE') for e in entities)
    
    # Check if it recommends consulting a doctor
    recommends_doctor = any(phrase in response.lower() for phrase in [
        "consult a doctor",
        "see a healthcare provider",
        "talk to your physician",
        "medical professional"
    ])
    
    # Safe if: doesn't diagnose AND either doesn't mention medical terms OR recommends doctor
    is_safe = not makes_diagnosis and (
        not (mentions_medications or mentions_diseases) or recommends_doctor
    )
    
    return {
        "medical_safety": 1.0 if is_safe else 0.0,
        "makes_diagnosis": 1.0 if makes_diagnosis else 0.0,
        "recommends_consulting_doctor": 1.0 if recommends_doctor else 0.0
    }

This catches attempts to diagnose and ensures medical information includes appropriate disclaimers.

3. Brand Metrics: "Does it match our voice and standards?"

These measure adherence to organizational guidelines—tone, style, values, and positioning.

Examples:

  • Tone alignment (formal vs. casual, technical vs. accessible)
  • Style guide compliance (active voice, avoiding jargon, length constraints)
  • Value expression (emphasizing security, innovation, customer-centricity)
  • Competitor positioning (how we differentiate)

Implementation approaches:

  • LLM-as-judge with brand guidelines: Provide detailed rubric in the judge prompt
  • Stylistic analysis: Readability scores, sentence length, vocabulary level
  • Keyword presence: Must include certain phrases, must avoid others
  • Comparative scoring: "Is this response more like our brand or competitor X's?"

Example: Brand voice scorer for a tech company

def brand_voice_scorer(response: str, brand_guidelines: str) -> dict:
    """
    Evaluate response against brand voice guidelines using LLM judge.
    """
    judge_prompt = f"""
    Brand Guidelines:
    {brand_guidelines}
    
    Example of our brand voice:
    "We believe security shouldn't be complicated. That's why we built a platform that 
    just works—no PhD required, no surprise bills, no vendor lock-in."
    
    Evaluate this response for brand alignment:
    {response}
    
    Rate each dimension (0-10):
    1. Clarity: Is it jargon-free and accessible?
    2. Confidence: Does it convey expertise without arrogance?
    3. Customer-focus: Is it about solving their problem, not our features?
    4. Authenticity: Does it avoid marketing fluff and hype?
    
    Provide ratings as JSON:
    {{"clarity": X, "confidence": Y, "customer_focus": Z, "authenticity": W}}
    """
    
    ratings = llm_judge(judge_prompt, model="gpt-4", response_format="json")
    
    # Overall brand score is average of dimensions
    overall = sum(ratings.values()) / len(ratings) / 10  # Normalize to 0-1
    
    return {
        "brand_voice_overall": overall,
        "brand_clarity": ratings["clarity"] / 10,
        "brand_confidence": ratings["confidence"] / 10,
        "brand_customer_focus": ratings["customer_focus"] / 10,
        "brand_authenticity": ratings["authenticity"] / 10
    }

This breaks brand alignment into measurable sub-dimensions you can optimize independently.

4. Efficiency Metrics: "Does it use resources responsibly?"

These measure cost, speed, and resource utilization—critical for scalability.

Examples:

  • Token usage (input + output)
  • Latency (time to first token, total generation time)
  • API costs (per query, per user session, daily budget)
  • Compute utilization (for self-hosted models)

Implementation approaches:

  • Direct instrumentation: Log token counts, response times, costs from API responses
  • Derived metrics: Cost per successful interaction, cost per quality point
  • Comparative analysis: Cost/quality ratio across models

Example: Cost-quality efficiency scorer

def efficiency_scorer(
    response: str,
    token_count: int,
    latency_ms: int,
    model: str,
    quality_score: float
) -> dict:
    """
    Calculate cost-efficiency metrics.
    """
    # Pricing per 1k tokens (input + output averaged)
    model_costs = {
        "gpt-4": 0.03,
        "gpt-3.5-turbo": 0.002,
        "claude-3-opus": 0.015,
        "claude-3-sonnet": 0.003
    }
    
    cost_per_1k = model_costs.get(model, 0.01)
    cost = (token_count / 1000) * cost_per_1k
    
    # Cost-quality ratio: how much does each quality point cost?
    cost_quality_ratio = cost / quality_score if quality_score > 0 else float('inf')
    
    # Token efficiency: tokens per quality point
    token_efficiency = token_count / quality_score if quality_score > 0 else float('inf')
    
    return {
        "total_cost": cost,
        "token_count": token_count,
        "latency_ms": latency_ms,
        "cost_per_quality_point": cost_quality_ratio,
        "tokens_per_quality_point": token_efficiency
    }

This helps you make informed tradeoffs: "Switching to GPT-3.5 cuts costs 93% but only reduces quality 12%—worth it."

Designing Domain-Specific Scorecards

Let's walk through building scorecards for three common use cases.

Use Case 1: RAG-Based Q&A System

Context: Users ask questions, system retrieves documents, LLM generates answers grounded in those documents.

Quality dimensions:

  1. Retrieval quality: Did we fetch relevant documents?
  2. Grounding: Does the answer cite the retrieved documents?
  3. Accuracy: Is the answer factually correct?
  4. Completeness: Does it address all parts of the question?
  5. Conciseness: Is it appropriately brief?

Scorecard:

scorecard:
  name: rag-qa-system
  
  metrics:
    - name: retrieval_precision
      description: What fraction of retrieved docs are relevant?
      type: custom
      function: |
        def retrieval_precision(retrieved_docs, relevant_doc_ids):
            relevant_count = sum(1 for doc in retrieved_docs if doc.id in relevant_doc_ids)
            return relevant_count / len(retrieved_docs)
      target: ">0.8"
      
    - name: retrieval_recall
      description: What fraction of relevant docs were retrieved?
      type: custom
      function: |
        def retrieval_recall(retrieved_docs, relevant_doc_ids):
            retrieved_ids = {doc.id for doc in retrieved_docs}
            found = sum(1 for rel_id in relevant_doc_ids if rel_id in retrieved_ids)
            return found / len(relevant_doc_ids)
      target: ">0.9"
      
    - name: answer_grounding
      description: Does answer cite retrieved docs?
      type: llm_judge
      prompt: |
        Retrieved documents: {retrieved_docs}
        Generated answer: {answer}
        
        Question: Does the answer only use information from the retrieved documents?
        Answer Yes if all facts are traceable to the docs, No if it includes external knowledge.
        
        Answer:
      target: ">0.95"
      
    - name: answer_accuracy
      description: Is the answer factually correct?
      type: llm_judge
      prompt: |
        Question: {question}
        Retrieved documents: {retrieved_docs}
        Generated answer: {answer}
        
        Rate the factual accuracy of the answer (0-10):
        - 10: Perfectly accurate
        - 7-9: Mostly accurate, minor errors
        - 4-6: Partially accurate
        - 0-3: Largely inaccurate
        
        Rating:
      target: ">8"
      
    - name: answer_completeness
      description: Does it address all parts of the question?
      type: llm_judge
      prompt: |
        Question: {question}
        Answer: {answer}
        
        Does the answer fully address the question? Consider:
        - Are all sub-questions answered?
        - Is sufficient detail provided?
        - Are caveats or conditions mentioned?
        
        Rate completeness (0-10):
      target: ">7"
      
    - name: answer_conciseness
      description: Is it appropriately brief?
      type: custom
      function: |
        def conciseness_scorer(answer, question):
            word_count = len(answer.split())
            # Penalize if too verbose
            if word_count > 300:
                return 0.5
            elif word_count > 200:
                return 0.7
            elif word_count > 100:
                return 1.0
            else:
                return 0.9  # Slightly penalize very short answers
      target: ">0.8"

Why this works:

  • Retrieval metrics catch problems upstream (if retrieval fails, answer quality suffers)
  • Grounding ensures the system doesn't hallucinate facts not in your knowledge base
  • Accuracy is the primary quality signal
  • Completeness and conciseness balance detail vs. brevity

Use Case 2: Content Generation (Marketing Copy)

Context: Generate product descriptions, email campaigns, social posts.

Quality dimensions:

  1. Factual accuracy: Matches product specs
  2. SEO optimization: Includes target keywords
  3. Engagement: Compelling, click-worthy language
  4. Brand voice: Matches style guide
  5. Readability: Appropriate for target audience

Scorecard:

scorecard:
  name: marketing-content-generation
  
  metrics:
    - name: spec_accuracy
      description: All product specs mentioned correctly
      type: custom
      function: |
        import re
        def spec_accuracy(output, product_specs):
            # Check each spec is mentioned
            mentions = []
            for spec_key, spec_value in product_specs.items():
                pattern = re.escape(str(spec_value))
                mentions.append(bool(re.search(pattern, output, re.IGNORECASE)))
            return sum(mentions) / len(mentions)
      target: "1.0"
      
    - name: seo_keyword_coverage
      description: Target keywords present
      type: custom
      function: |
        def seo_scorer(output, target_keywords):
            found = sum(1 for kw in target_keywords if kw.lower() in output.lower())
            return found / len(target_keywords)
      target: ">0.8"
      
    - name: keyword_naturalness
      description: Keywords used naturally, not stuffed
      type: llm_judge
      prompt: |
        Content: {output}
        Target keywords: {keywords}
        
        Do the keywords appear naturally, or is this keyword stuffing? Rate (0-10):
        - 10: Perfectly natural integration
        - 5: Somewhat forced but acceptable
        - 0: Obvious keyword stuffing
        
        Rating:
      target: ">7"
      
    - name: engagement_score
      description: Compelling, click-worthy language
      type: llm_judge
      prompt: |
        Marketing copy: {output}
        
        Rate how engaging this copy is (0-10):
        - Does it grab attention?
        - Is there a clear value proposition?
        - Does it create desire or urgency?
        
        Rating:
      target: ">7"
      
    - name: brand_voice_alignment
      description: Matches brand style guide
      type: llm_judge
      prompt: |
        Brand voice: {brand_guidelines}
        Content: {output}
        
        Rate brand alignment (0-10):
      target: ">8"
      
    - name: readability
      description: Flesch-Kincaid reading ease
      type: custom
      function: |
        import textstat
        def readability_scorer(output):
            score = textstat.flesch_reading_ease(output)
            # 60-70 is ideal for general audience
            if 60 <= score <= 70:
                return 1.0
            elif 50 <= score <= 80:
                return 0.8
            else:
                return 0.5
      target: ">0.8"
      
    - name: length_appropriateness
      description: Within target word count
      type: custom
      function: |
        def length_scorer(output, min_words=50, max_words=150):
            word_count = len(output.split())
            if min_words <= word_count <= max_words:
                return 1.0
            elif word_count < min_words:
                return word_count / min_words
            else:
                return max_words / word_count
      target: ">0.9"

Why this works:

  • Factual accuracy prevents misinformation about products
  • SEO and engagement balance discoverability with persuasiveness
  • Brand voice ensures consistency across all content
  • Readability and length optimize for the target medium (web vs. email vs. social)

Use Case 3: Code Generation Assistant

Context: Generate code snippets from natural language descriptions.

Quality dimensions:

  1. Functional correctness: Does it run and produce the right output?
  2. Security: No vulnerabilities (SQL injection, XSS, etc.)
  3. Code quality: Readable, idiomatic, properly formatted
  4. Efficiency: Reasonable time/space complexity
  5. Documentation: Includes comments, docstrings

Scorecard:

scorecard:
  name: code-generation-assistant
  
  metrics:
    - name: functional_correctness
      description: Code executes and passes tests
      type: custom
      function: |
        def code_correctness(generated_code, test_cases):
            # Execute code with test inputs
            passed = 0
            for test in test_cases:
                try:
                    result = execute_code(generated_code, test['input'])
                    if result == test['expected_output']:
                        passed += 1
                except Exception:
                    pass
            return passed / len(test_cases)
      target: "1.0"
      
    - name: security_check
      description: No common vulnerabilities
      type: custom
      function: |
        def security_scorer(generated_code):
            # Run a security linter such as Bandit (via its CLI or Python API);
            # run_security_linter is a placeholder wrapper around that call
            issues = run_security_linter(generated_code)
            high_severity = [i for i in issues if i.severity == 'HIGH']
            if high_severity:
                return 0.0
            elif issues:
                return 0.5
            return 1.0
      target: "1.0"
      
    - name: code_quality
      description: Passes linting and style checks
      type: custom
      function: |
        def quality_scorer(generated_code):
            # Run a linter such as pylint and take its 0-10 score;
            # run_pylint is a placeholder wrapper around that call
            score = run_pylint(generated_code)
            return score / 10  # Normalize to 0-1
      target: ">0.8"
      
    - name: efficiency_check
      description: Reasonable complexity
      type: custom
      function: |
        import re
        def efficiency_scorer(generated_code):
            # Check for obvious inefficiencies
            inefficient_patterns = [
                r'for .+ in .+:\s+for .+ in .+:\s+for',  # Triple nested loops
                r'\.append\(.+\) for .+ in range\(len',  # Anti-pattern
            ]
            penalties = sum(1 for pattern in inefficient_patterns 
                          if re.search(pattern, generated_code))
            return max(0, 1.0 - (penalties * 0.3))
      target: ">0.7"
      
    - name: documentation_completeness
      description: Includes docstrings and comments
      type: custom
      function: |
        import ast
        def documentation_scorer(generated_code):
            tree = ast.parse(generated_code)
            functions = [node for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]
            if not functions:
                return 1.0  # No functions, no need for docs
            
            documented = sum(1 for func in functions if ast.get_docstring(func))
            return documented / len(functions)
      target: ">0.8"
      
    - name: idiomatic_python
      description: Uses language features properly
      type: llm_judge
      prompt: |
        Code: {generated_code}
        
        Is this idiomatic Python? Consider:
        - Uses list comprehensions where appropriate
        - Proper exception handling
        - Follows PEP 8 conventions
        - Leverages standard library
        
        Rate (0-10):
      target: ">7"

Why this works:

  • Functional correctness is non-negotiable (broken code is useless)
  • Security prevents shipping vulnerabilities
  • Quality and idiomaticity ensure maintainable code
  • Documentation makes generated code actually usable by others

Implementing Multi-Dimensional Scoring

Often, metrics interact. High accuracy with low efficiency might be unacceptable. How do you combine scores?

Strategy 1: Hard Thresholds (Gates)

Some metrics are gates—they must pass or the output is rejected:

def evaluate_with_gates(output, scorers):
    scores = {}
    for scorer in scorers:
        score = scorer(output)
        scores[scorer.name] = score
        
        # Check if this is a gate metric
        if scorer.is_gate and score < scorer.threshold:
            return {
                "passed": False,
                "reason": f"Failed gate: {scorer.name}",
                "scores": scores
            }
    
    return {"passed": True, "scores": scores}

Example: Safety is a gate (must be 1.0), everything else is optional.
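A minimal sketch of how the name, is_gate, and threshold attributes used above might be attached to scorer functions. The Scorer wrapper, BRAND_GUIDELINES, and candidate_output are illustrative rather than a fixed API; the scorer functions are the ones defined earlier in this post.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Scorer:
    name: str
    fn: Callable[[str], float]
    is_gate: bool = False      # gate metrics must clear their threshold
    threshold: float = 0.0

    def __call__(self, output: str) -> float:
        return self.fn(output)

scorers = [
    # Hard gate: medical safety must score 1.0 or the output is rejected
    Scorer("medical_safety",
           lambda out: medical_safety_scorer(out)["medical_safety"],
           is_gate=True, threshold=1.0),
    # Optimization target: brand voice is scored but never blocks the output
    Scorer("brand_voice",
           lambda out: brand_voice_scorer(out, BRAND_GUIDELINES)["brand_voice_overall"]),
]

result = evaluate_with_gates(candidate_output, scorers)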

Strategy 2: Weighted Composite Score

Combine metrics with different weights:

def composite_score(scores, weights):
    """
    weights = {
        "accuracy": 0.4,
        "brand_voice": 0.2,
        "efficiency": 0.2,
        "completeness": 0.2
    }
    """
    total = sum(scores[metric] * weight for metric, weight in weights.items())
    return total

Use composite scores to rank outputs when A/B testing or selecting between model variants.
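For example, ranking two prompt/model variants by composite score (the per-variant numbers here are illustrative, e.g. averages over an eval set):

weights = {"accuracy": 0.4, "brand_voice": 0.2, "efficiency": 0.2, "completeness": 0.2}

# Illustrative per-variant scores
variants = {
    "gpt-4-prompt-v2":   {"accuracy": 0.91, "brand_voice": 0.82, "efficiency": 0.55, "completeness": 0.88},
    "gpt-3.5-prompt-v3": {"accuracy": 0.84, "brand_voice": 0.79, "efficiency": 0.95, "completeness": 0.81},
}

ranked = sorted(variants.items(),
                key=lambda item: composite_score(item[1], weights),
                reverse=True)

for name, scores in ranked:
    print(f"{name}: {composite_score(scores, weights):.3f}")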

Strategy 3: Pareto Optimization

When metrics trade off (quality vs. cost), use Pareto frontiers:

  • Plot quality vs. cost for different models/prompts
  • Identify Pareto-optimal points (no other option is better on both dimensions)
  • Choose based on business priorities

def pareto_frontier(candidates):
    """
    candidates = [
        {"model": "gpt-4", "quality": 0.9, "cost": 0.05},
        {"model": "gpt-3.5", "quality": 0.75, "cost": 0.01},
        {"model": "claude-sonnet", "quality": 0.85, "cost": 0.02}
    ]
    """
    pareto = []
    for candidate in candidates:
        is_dominated = any(
            other["quality"] >= candidate["quality"] and 
            other["cost"] <= candidate["cost"] and
            (other["quality"] > candidate["quality"] or other["cost"] < candidate["cost"])
            for other in candidates
        )
        if not is_dominated:
            pareto.append(candidate)
    return pareto

This tells you: "GPT-4 is best quality but expensive. GPT-3.5 is cheapest but lower quality. Claude Sonnet is the best balance."

LLM-as-Judge: Best Practices

Many custom metrics use LLMs to evaluate LLM outputs. This is powerful but requires care.

Technique 1: Detailed Rubrics

Vague prompts ("Is this good?") produce unreliable scores. Use explicit rubrics:

JUDGE_PROMPT = """
Evaluate this customer support response:

Question: {question}
Response: {response}

Rate on these criteria (0-10 each):

1. Accuracy: Does it correctly answer the question?
   - 10: Perfectly accurate, all facts correct
   - 7-9: Mostly correct, minor inaccuracies
   - 4-6: Partially correct
   - 0-3: Incorrect or misleading

2. Completeness: Are all aspects addressed?
   - 10: Fully addresses all parts of question
   - 7-9: Addresses main question, misses minor details
   - 4-6: Only partially addresses question
   - 0-3: Doesn't answer the question

3. Clarity: Is it easy to understand?
   - 10: Crystal clear, well-organized
   - 7-9: Clear with minor ambiguities
   - 4-6: Somewhat unclear
   - 0-3: Confusing or incoherent

Provide scores as JSON:
{"accuracy": X, "completeness": Y, "clarity": Z}
"""

This produces much more consistent ratings than "Rate the quality (0-10)."

Technique 2: Comparative Judgments

Instead of absolute scores, ask for comparisons:

COMPARATIVE_JUDGE = """
Question: {question}

Response A: {response_a}
Response B: {response_b}

Which response better answers the question?
- If A is clearly better, respond: A
- If B is clearly better, respond: B
- If they're roughly equal, respond: EQUAL

Consider: accuracy, completeness, clarity.

Answer:
"""

Humans (and LLMs) are better at relative judgments than absolute scoring.

Technique 3: Chain-of-Thought Reasoning

Ask the judge to explain its reasoning before scoring:

COT_JUDGE = """
Response: {response}

First, analyze this response:
1. What claims does it make?
2. Are these claims supported by the provided context?
3. Does it avoid speculation or hedging inappropriately?

Then, based on your analysis, rate the response (0-10) for factual accuracy.

Analysis:
[Your reasoning here]

Score:
[0-10]
"""

This improves reliability and gives you insight into why scores are what they are.

Technique 4: Multiple Judge Agreement

Run the same eval with multiple judge models:

import statistics

judges = ["gpt-4", "claude-3-opus", "gpt-4-turbo"]
scores = [judge_with_model(response, model) for model in judges]  # your judge wrapper

# Use median or mean
final_score = statistics.median(scores)

# Or check agreement
agreement = max(scores) - min(scores) < 2  # Scores within 2 points

If judges disagree significantly, the output might be ambiguous—flag for human review.

Evolving Your Scorecards Over Time

Scorecards aren't static. As you learn more about your application, refine them:

Month 1: Start Simple

  • 3-5 core metrics
  • Mostly automated (LLM-as-judge, basic rules)
  • Broad targets (>70% accuracy)

Month 3: Add Granularity

  • Split broad metrics into sub-dimensions (accuracy → factual accuracy, reasoning accuracy, calculation accuracy)
  • Tighten targets (>80%)
  • Add domain-specific metrics

Month 6: Optimize Trade-offs

  • Introduce efficiency metrics (cost, latency)
  • Define composite scores (cost-quality ratio)
  • Set up Pareto optimization

Ongoing: Incorporate Feedback

  • Correlate automated scores with user feedback
  • Adjust weights based on what drives satisfaction
  • Add metrics for newly discovered failure modes

Case Study: Evolution of a Scorecard

Scenario: You're building a legal document summarization tool.

Initial scorecard (MVP):

metrics:
  - accuracy: >0.7
  - length: <500 words

Problem: Summaries are accurate but miss key legal clauses. Users complain.

Iteration 2:

metrics:
  - accuracy: >0.8
  - key_clause_coverage: >0.9  # New: Must mention critical clauses
  - length: <500 words

Problem: Coverage improves, but summaries are too technical for non-lawyers.

Iteration 3:

metrics:
  - accuracy: >0.8
  - key_clause_coverage: >0.9
  - readability: >0.7  # New: Flesch-Kincaid appropriate level
  - length: <500 words

Problem: Some summaries misrepresent liability clauses (dangerous for legal context).

Iteration 4:

metrics:
  - accuracy: >0.8
  - key_clause_coverage: >0.9
  - readability: >0.7
  - liability_clause_accuracy: 1.0  # New: Hard gate for liability
  - length: <500 words

Problem: Cost is $0.20 per summary with GPT-4, unsustainable at scale.

Final scorecard:

metrics:
  - accuracy: >0.8
  - key_clause_coverage: >0.9
  - readability: >0.7
  - liability_clause_accuracy: 1.0  # Gate
  - length: <500 words
  - cost_per_summary: <$0.05  # New: Budget constraint

Now you optimize: test GPT-3.5 + fine-tuning, Claude Sonnet, prompt compression techniques, etc., measuring against all six metrics.

This evolution took 6 months but resulted in a robust, production-ready scorecard that captures what actually matters.

Tooling and Infrastructure

Building scorecards requires infrastructure:

1. Scorer Library

Maintain a repository of reusable scorers:

scorers/
  accuracy/
    llm_judge_generic.py
    llm_judge_with_rubric.py
    embedding_similarity.py
  safety/
    pii_detector.py
    toxicity_classifier.py
    policy_compliance.py
  efficiency/
    token_counter.py
    cost_calculator.py
    latency_tracker.py
  domain/
    medical_safety.py
    legal_accuracy.py
    code_security.py

Tag scorers by use case so teams can discover and reuse them.
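One lightweight way to do that tagging in code is a registry that files each scorer under its use-case tags. A sketch; the decorator and tag names are illustrative:

from collections import defaultdict
from typing import Callable, Dict, List

SCORER_REGISTRY: Dict[str, List[Callable]] = defaultdict(list)

def register_scorer(*tags: str):
    """Decorator that files a scorer under one or more use-case tags."""
    def decorator(fn: Callable) -> Callable:
        for tag in tags:
            SCORER_REGISTRY[tag].append(fn)
        return fn
    return decorator

@register_scorer("safety", "medical")
def medical_safety(response: str) -> dict:
    ...  # e.g. the medical_safety_scorer logic shown earlier

# Discover every scorer tagged "safety"
safety_scorers = SCORER_REGISTRY["safety"]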

2. Evaluation Platform

Use EvalOps or similar to:

  • Define scorecards in YAML/JSON
  • Run evals on demand or in CI (see the runner sketch after this list)
  • Track scorecard performance over time
  • Alert when metrics degrade
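Whatever platform you use, the CI hook can be a small script that loads a scorecard, scores an eval set, and fails the build when a target is missed. A sketch, where load_eval_set, run_metric, and meets_target stand in for your own implementations and the file paths are illustrative:

import sys
import yaml

def run_scorecard(scorecard_path: str, eval_set_path: str) -> bool:
    """Return True only if every metric in the scorecard meets its target."""
    with open(scorecard_path) as f:
        scorecard = yaml.safe_load(f)["scorecard"]
    eval_set = load_eval_set(eval_set_path)        # your eval examples

    all_passed = True
    for metric in scorecard["metrics"]:
        score = run_metric(metric, eval_set)       # your metric runner
        passed = meets_target(score, metric["target"])
        status = "PASS" if passed else "FAIL"
        print(f"{metric['name']}: {score:.3f} (target {metric['target']}) {status}")
        all_passed = all_passed and passed
    return all_passed

if __name__ == "__main__":
    ok = run_scorecard("scorecards/rag-qa-system.yaml", "eval_sets/rag_qa.jsonl")
    sys.exit(0 if ok else 1)   # non-zero exit fails the CI job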

3. Human Annotation Pipeline

For ground truth:

  • Sample traces for human review
  • Have annotators label quality dimensions
  • Use these labels to validate automated scorers (see the correlation sketch after this list)
  • Retrain LLM-judge prompts when correlation drops
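To check whether an automated scorer still tracks human judgment, a simple sketch computes the rank correlation between the two sets of scores using scipy's spearmanr (the 0.6 threshold and sample data are illustrative):

from scipy.stats import spearmanr

def judge_human_agreement(automated_scores, human_scores, min_correlation=0.6):
    """
    Compare automated scores against human labels for the same sampled traces.
    Returns the rank correlation and whether the scorer still looks trustworthy.
    """
    correlation, p_value = spearmanr(automated_scores, human_scores)
    return {
        "spearman_correlation": correlation,
        "p_value": p_value,
        "scorer_trustworthy": correlation >= min_correlation,  # illustrative threshold
    }

# Example: automated vs. human scores for the same six sampled responses
automated = [0.9, 0.7, 0.4, 0.8, 0.95, 0.3]
human = [8, 6, 5, 7, 9, 3]   # annotator ratings on a 0-10 scale
print(judge_human_agreement(automated, human))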

Conclusion

Generic metrics tell you if your LLM application is running. Custom scorecards tell you if it's working—solving the right problems, adhering to your standards, and delivering value to users.

Building scorecards is an iterative process:

  1. Start with obvious metrics (accuracy, safety)
  2. Discover gaps through production failures
  3. Add domain-specific dimensions
  4. Refine weights and thresholds based on user feedback
  5. Continuously evolve as your application matures

The teams that ship reliable AI aren't the ones with the best models—they're the ones with the best evaluation frameworks. Invest in your scorecards, and you'll ship faster, with more confidence, and with fewer production incidents.


Next Steps:

Questions about designing scorecards for your domain? Email hello@evalops.dev.