
October 2, 2025

Evaluating RAG Systems: Beyond Simple Accuracy Metrics

rag · retrieval · evaluation · patterns

The RAG Evaluation Challenge

Your RAG system looks great in demos. Given a question, it retrieves relevant documents and generates a well-formatted answer. Ship it!

Then production happens: users report the system "makes things up," answers are "off-topic," or responses reference documents that weren't retrieved. What went wrong?

Traditional LLM evaluation assumes a direct input-output relationship. But RAG is a pipeline: query → retrieval → generation → response. Failure can occur at any stage:

  • Retrieval fails: Wrong documents fetched
  • Generation ignores context: Model hallucinates instead of using retrieved docs
  • Contradictory sources: Retrieved docs disagree, model picks wrong one
  • Context overload: Too many docs, signal gets lost in noise

Evaluating RAG systems requires measuring each stage independently plus end-to-end behavior. Generic accuracy metrics miss these nuances entirely.

The Four Dimensions of RAG Quality

1. Retrieval Quality: "Did we fetch the right documents?"

Metrics:

Precision@K: What fraction of top-K retrieved documents are relevant?

def precision_at_k(retrieved_docs, relevant_doc_ids, k=5):
    """
    retrieved_docs: List of document IDs, ranked by relevance
    relevant_doc_ids: Set of ground-truth relevant doc IDs
    """
    top_k = set(retrieved_docs[:k])
    relevant_in_top_k = top_k.intersection(relevant_doc_ids)
    return len(relevant_in_top_k) / k

Recall@K: What fraction of relevant documents are in top-K?

def recall_at_k(retrieved_docs, relevant_doc_ids, k=5):
    top_k = set(retrieved_docs[:k])
    relevant_in_top_k = top_k.intersection(relevant_doc_ids)
    return len(relevant_in_top_k) / len(relevant_doc_ids) if relevant_doc_ids else 0

Mean Reciprocal Rank (MRR): How high in the ranking is the first relevant document?

def mean_reciprocal_rank(retrieved_docs, relevant_doc_ids):
    """
    Reciprocal rank for a single query; average this value across
    all eval queries to get the MRR.
    """
    for i, doc_id in enumerate(retrieved_docs, 1):
        if doc_id in relevant_doc_ids:
            return 1.0 / i
    return 0.0

NDCG (Normalized Discounted Cumulative Gain): Accounts for graded relevance and position

import numpy as np

def ndcg_at_k(retrieved_docs, relevance_scores, k=5):
    """
    relevance_scores: Dict mapping doc_id -> relevance (0-3 scale)
    """
    dcg = sum(
        relevance_scores.get(doc_id, 0) / np.log2(i + 2)
        for i, doc_id in enumerate(retrieved_docs[:k])
    )
    
    # Ideal DCG: docs sorted by relevance
    ideal_scores = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = sum(score / np.log2(i + 2) for i, score in enumerate(ideal_scores))
    
    return dcg / idcg if idcg > 0 else 0

Example evaluation:

# Ground truth: For question "How do I reset my password?",
# docs 15, 42, and 103 are relevant
relevant_docs = {15, 42, 103}

# System retrieved: [42, 15, 88, 103, 12]
retrieved = [42, 15, 88, 103, 12]

print(f"Precision@5: {precision_at_k(retrieved, relevant_docs, k=5):.2f}")
# Output: 0.60 (3 out of 5 are relevant)

print(f"Recall@5: {recall_at_k(retrieved, relevant_docs, k=5):.2f}")
# Output: 1.00 (all 3 relevant docs are in top 5)

print(f"MRR: {mean_reciprocal_rank(retrieved, relevant_docs):.2f}")
# Output: 1.00 (first retrieved doc is relevant)
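
For NDCG, graded judgments are needed. With hypothetical relevance grades for the same retrieval (3 = highly relevant, 2 = relevant), the ndcg_at_k function above gives:

# Hypothetical graded judgments for the same query (0-3 scale)
relevance_scores = {42: 3, 15: 2, 103: 2}

print(f"NDCG@5: {ndcg_at_k(retrieved, relevance_scores, k=5):.2f}")
# Output: 0.97 (relevant docs rank high, but doc 103 sits below the irrelevant doc 88)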

2. Context Grounding: "Does the answer use the retrieved documents?"

The model has context but might ignore it and hallucinate. Measure faithfulness to the provided documents.

Approach 1: Claim Verification

Extract claims from the answer, check if each is supported by retrieved documents.

async def check_grounding(answer: str, retrieved_docs: list[str]) -> dict:
    """
    Use an LLM to verify if answer claims are in the documents.
    """
    judge_prompt = f"""
    Retrieved documents:
    {format_docs(retrieved_docs)}
    
    Generated answer:
    {answer}
    
    For each factual claim in the answer, determine if it is supported by the retrieved documents.
    
    Output JSON:
    {{
        "claims": [
            {{"claim": "...", "supported": true/false, "source_doc": "doc_id or null"}},
            ...
        ],
        "grounding_score": 0.0-1.0  // fraction of claims that are supported
    }}
    """
    
    result = await llm_judge(judge_prompt, model="gpt-4", response_format="json")
    return result

Approach 2: Entailment Scoring

Use a smaller model fine-tuned for natural language inference (NLI):

from transformers import pipeline

# Any NLI-finetuned cross-encoder works here, e.g. microsoft/deberta-large-mnli
nli_model = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def entailment_score(answer: str, retrieved_docs: list[str]) -> float:
    """
    Check if answer is entailed by (logically follows from) retrieved docs.
    """
    # Concatenate docs as the premise (long contexts may need truncation
    # or per-document scoring to fit the model's input limit)
    premises = " ".join(retrieved_docs)
    
    # Score the (premise, hypothesis) pair
    result = nli_model([{"text": premises, "text_pair": answer}])
    
    # Top label is one of: ENTAILMENT, NEUTRAL, CONTRADICTION
    if result[0]['label'] == 'ENTAILMENT':
        return result[0]['score']
    elif result[0]['label'] == 'NEUTRAL':
        return 0.5
    else:  # CONTRADICTION
        return 0.0

Approach 3: Citation Checking

If your system generates citations (e.g., "[1]"), verify they're accurate:

import re

def check_citations(answer: str, retrieved_docs: list[dict]) -> dict:
    """
    retrieved_docs: [{"id": 1, "content": "..."}, ...]
    """
    # Extract citations like [1], [2]
    citations = re.findall(r'\[(\d+)\]', answer)
    
    valid_citations = set(str(doc['id']) for doc in retrieved_docs)
    cited_doc_ids = set(citations)
    
    # Check if all citations reference retrieved docs
    invalid_citations = cited_doc_ids - valid_citations
    
    # Check if cited content actually supports the claim
    # (would require more sophisticated analysis)
    
    return {
        "total_citations": len(citations),
        "invalid_citations": len(invalid_citations),
        "citation_accuracy": 1.0 - (len(invalid_citations) / len(citations)) if citations else 1.0
    }
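
A quick check with a made-up answer against three retrieved documents shows how out-of-range citations get caught:

docs = [{"id": 1, "content": "..."}, {"id": 2, "content": "..."}, {"id": 3, "content": "..."}]
answer = "Reset your password from the account settings page [1]. The reset link expires after 24 hours [4]."

print(check_citations(answer, docs))
# {'total_citations': 2, 'invalid_citations': 1, 'citation_accuracy': 0.5}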

3. Answer Quality: "Is the answer correct and complete?"

Even with perfect retrieval and grounding, the answer might be poorly formatted, incomplete, or miss nuances.

Metrics:

Correctness: Does it answer the question accurately?

async def answer_correctness(question: str, answer: str, reference_answer: str) -> float:
    """
    Compare generated answer to reference answer (human-written or GPT-4 generated).
    """
    judge_prompt = f"""
    Question: {question}
    Reference answer: {reference_answer}
    Generated answer: {answer}
    
    Rate how well the generated answer matches the reference answer in correctness (0-10):
    - 10: Equivalent in correctness
    - 7-9: Mostly correct, minor differences
    - 4-6: Partially correct
    - 0-3: Incorrect or misleading
    
    Consider:
    - Factual accuracy
    - Completeness
    - Relevance to the question
    
    Score:
    """
    
    score = await llm_judge(judge_prompt, model="gpt-4")
    return float(score) / 10

Completeness: Does it address all parts of the question?

async def answer_completeness(question: str, answer: str) -> float:
    judge_prompt = f"""
    Question: {question}
    Answer: {answer}
    
    Does the answer fully address the question? Consider:
    - Are all sub-questions answered?
    - Is sufficient detail provided?
    - Are important caveats or conditions mentioned?
    
    Rate completeness (0-10):
    """
    
    score = await llm_judge(judge_prompt, model="gpt-4")
    return float(score) / 10

Relevance: Does it stay on topic?

def relevance_score(question: str, answer: str) -> float:
    """
    Use semantic similarity between question and answer.
    """
    import numpy as np
    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    q_emb = model.encode(question)
    a_emb = model.encode(answer)
    
    similarity = np.dot(q_emb, a_emb) / (np.linalg.norm(q_emb) * np.linalg.norm(a_emb))
    return float(similarity)

4. End-to-End Performance: "Does it solve the user's problem?"

Ultimately, users care about the final answer, not intermediate steps.

User Satisfaction Proxy:

  • Explicit feedback (thumbs up/down)
  • Implicit signals (time spent reading, follow-up questions, task completion)
  • A/B test metrics (resolution rate, escalation rate)
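
These signals can be folded into a rough per-conversation satisfaction proxy. A minimal sketch, assuming feedback events are already logged as dicts (the field names here are hypothetical):

def satisfaction_proxy(events: list[dict]) -> float:
    """
    events: log records like {"thumbs": "up"/"down"/None, "follow_up": bool, "escalated": bool}
    Returns a rough 0-1 score; the weights are illustrative, not tuned.
    """
    if not events:
        return 0.0
    
    total = 0.0
    for e in events:
        s = 0.5  # neutral baseline when no explicit feedback
        if e.get("thumbs") == "up":
            s = 1.0
        elif e.get("thumbs") == "down":
            s = 0.0
        if e.get("follow_up"):   # follow-up question hints the first answer was incomplete
            s -= 0.1
        if e.get("escalated"):   # asking for a human is a strong negative signal
            s -= 0.3
        total += max(0.0, min(1.0, s))
    
    return total / len(events)

Tracking this proxy alongside the offline composite score below makes it easier to notice when the eval set has drifted away from real traffic.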

Composite Score:

Combine retrieval, grounding, and quality into a single metric:

def rag_composite_score(
    precision_at_5: float,
    recall_at_5: float,
    grounding_score: float,
    answer_correctness: float,
    answer_completeness: float
) -> float:
    """
    Weighted composite score for RAG system.
    """
    weights = {
        "precision": 0.15,
        "recall": 0.15,
        "grounding": 0.25,  # Critical: don't hallucinate
        "correctness": 0.30,  # Primary goal
        "completeness": 0.15
    }
    
    score = (
        weights["precision"] * precision_at_5 +
        weights["recall"] * recall_at_5 +
        weights["grounding"] * grounding_score +
        weights["correctness"] * answer_correctness +
        weights["completeness"] * answer_completeness
    )
    
    return score
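
Plugging in hypothetical per-example scores shows how the weighting plays out:

# Hypothetical scores for a single eval example
score = rag_composite_score(
    precision_at_5=0.60,
    recall_at_5=1.00,
    grounding_score=0.85,
    answer_correctness=0.80,
    answer_completeness=0.70
)
print(f"Composite: {score:.2f}")
# Output: 0.80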

Building a RAG Evaluation Dataset

Unlike simple Q&A, RAG eval sets require additional annotations.

Required Components

For each example:

  1. Query: User's question
  2. Relevant document IDs: Ground truth for retrieval evaluation
  3. Reference answer: Expected output (optional but helpful)
  4. Key facts: Claims that must be included (alternative to full reference answer)
  5. Metadata: Difficulty, category, multi-hop reasoning required?

Example:

{
  "id": "eval-001",
  "query": "What are the symptoms of type 2 diabetes?",
  "relevant_doc_ids": [42, 103, 217],
  "reference_answer": "Common symptoms of type 2 diabetes include increased thirst, frequent urination, increased hunger, unexplained weight loss, fatigue, blurred vision, slow-healing sores, and frequent infections. Many people with type 2 diabetes have no symptoms initially.",
  "key_facts": [
    "increased thirst",
    "frequent urination",
    "fatigue",
    "many have no symptoms initially"
  ],
  "metadata": {
    "category": "medical-info",
    "difficulty": "medium",
    "requires_multi_hop": false
  }
}

How to Annotate Relevant Documents

Option 1: Manual annotation (most accurate)

  • For each query, have humans review the document corpus
  • Mark which documents contain relevant information
  • Time-consuming but produces high-quality ground truth

Option 2: Pseudo-labeling with strong retrieval

  • Use a strong retrieval model (e.g., state-of-the-art embedding model)
  • Have humans verify top-10 results, correct mistakes
  • Faster than Option 1, still high quality

Option 3: Synthetic generation

  • For each document, generate questions it should answer
  • Map questions → documents automatically
  • Risk: biased toward generation model's view of relevance

Best practice: Combine approaches. Start with pseudo-labeling, manually verify subset, use for critical evals.
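
A sketch of the pseudo-labeling step from Option 2: retrieve the top-10 candidates per query with a strong retriever and write them out for human verification. retrieve_top_k and the JSONL output format are assumptions, not a fixed API:

import json

def build_annotation_queue(queries: list[str], retrieve_top_k, out_path: str, k: int = 10):
    """
    queries: eval queries to annotate
    retrieve_top_k: assumed callable (query, k) -> list of {"id": ..., "content": ...}
    Writes one JSON line per query; annotators flip "relevant" to true/false.
    """
    with open(out_path, "w") as f:
        for query in queries:
            candidates = retrieve_top_k(query, k)
            record = {
                "query": query,
                "candidates": [
                    {"doc_id": c["id"], "snippet": c["content"][:300], "relevant": None}
                    for c in candidates
                ],
            }
            f.write(json.dumps(record) + "\n")

After review, the verified document IDs become the relevant_doc_ids field in the eval-set format above.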

Evaluating Retrieval Systems

Comparing Embedding Models

Test different embedding models on your domain:

embedding_models = [
    "text-embedding-3-small",
    "text-embedding-3-large",
    "text-embedding-ada-002",
    "voyage-02",
    "sentence-transformers/all-MiniLM-L6-v2"
]

results = {}
for model in embedding_models:
    # Re-index corpus with this model
    index = build_vector_index(corpus, embedding_model=model)
    
    # Evaluate retrieval
    metrics = evaluate_retrieval(
        queries=eval_queries,
        index=index,
        ground_truth=relevant_docs_map
    )
    
    results[model] = metrics

# Compare
for model, metrics in results.items():
    print(f"{model}:")
    print(f"  Precision@5: {metrics['precision@5']:.3f}")
    print(f"  Recall@5: {metrics['recall@5']:.3f}")
    print(f"  MRR: {metrics['mrr']:.3f}")

Example output:

text-embedding-3-small:
  Precision@5: 0.720
  Recall@5: 0.680
  MRR: 0.810

text-embedding-3-large:
  Precision@5: 0.780
  Recall@5: 0.750
  MRR: 0.850

sentence-transformers/all-MiniLM-L6-v2:
  Precision@5: 0.650
  Recall@5: 0.620
  MRR: 0.730

Decision: text-embedding-3-large performs best. Worth the extra cost vs. -small?

Hybrid Retrieval: BM25 + Vector Search

Combine keyword-based (BM25) and semantic (vector) retrieval:

def hybrid_retrieval(query: str, corpus: list, alpha=0.5) -> list:
    """
    alpha: weight for vector search (1-alpha for BM25)
    """
    # BM25 scores
    bm25_scores = bm25_search(query, corpus)
    
    # Vector search scores
    vector_scores = vector_search(query, corpus)
    
    # Combine with weighted average (assumes both searches return scores on a
    # comparable 0-1 scale; otherwise normalize each, e.g. min-max, before mixing)
    combined_scores = {}
    for doc_id in set(bm25_scores.keys()).union(vector_scores.keys()):
        bm25_score = bm25_scores.get(doc_id, 0)
        vector_score = vector_scores.get(doc_id, 0)
        combined_scores[doc_id] = alpha * vector_score + (1 - alpha) * bm25_score
    
    # Rank by combined score
    ranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, score in ranked]

# Evaluate different alpha values
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    metrics = evaluate_retrieval(
        queries=eval_queries,
        retrieval_fn=lambda q: hybrid_retrieval(q, corpus, alpha=alpha),
        ground_truth=relevant_docs_map
    )
    print(f"alpha={alpha}: Recall@5={metrics['recall@5']:.3f}")

Find optimal alpha for your domain (often around 0.6-0.7).

Query Rewriting

Sometimes user queries are ambiguous or poorly phrased. Rewrite before retrieving:

async def rewrite_query(user_query: str) -> str:
    """
    Expand abbreviations, clarify intent, add domain keywords.
    """
    rewrite_prompt = f"""
    User query: {user_query}
    
    Rewrite this query to be more effective for document retrieval:
    - Expand abbreviations
    - Add relevant domain keywords
    - Clarify ambiguous terms
    - Maintain the user's intent
    
    Rewritten query:
    """
    
    return await llm_call(rewrite_prompt, model="gpt-3.5-turbo", max_tokens=100)

# Evaluate: original query vs. rewritten
original_metrics = evaluate_retrieval(eval_queries, index, ground_truth)
rewritten_queries = [await rewrite_query(q) for q in eval_queries]
rewritten_metrics = evaluate_retrieval(rewritten_queries, index, ground_truth)

print(f"Original Recall@5: {original_metrics['recall@5']:.3f}")
print(f"Rewritten Recall@5: {rewritten_metrics['recall@5']:.3f}")

Evaluating Generation: Grounding and Faithfulness

Technique 1: Self-Consistency Checks

Sample several answers to the same question at a nonzero temperature. If the generation is grounded in the retrieved context, the answers should agree; wide variation suggests the model is improvising beyond the documents.

async def check_self_consistency(question: str, context: str, n=3) -> float:
    """
    Generate n answers, measure agreement.
    """
    answers = []
    for _ in range(n):
        answer = await rag_generate(question, context, temperature=0.7)
        answers.append(answer)
    
    # Measure pairwise semantic similarity
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(answers)
    
    # Average pairwise cosine similarity
    similarities = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = np.dot(embeddings[i], embeddings[j])
            sim /= (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
            similarities.append(sim)
    
    return np.mean(similarities)

# High consistency (>0.9) suggests grounded, low (<0.7) suggests hallucination
consistency = await check_self_consistency(query, retrieved_docs)
print(f"Self-consistency: {consistency:.2f}")

Technique 2: Adversarial Context Injection

Add irrelevant or contradictory documents to the context. A robust system should ignore them.

def adversarial_context_test(query: str, relevant_docs: list, irrelevant_docs: list) -> dict:
    """
    Test if model is distracted by irrelevant context.
    """
    # Baseline: answer with only relevant docs
    baseline_answer = rag_generate(query, relevant_docs)
    
    # Test: answer with relevant + irrelevant docs
    mixed_context = relevant_docs + irrelevant_docs
    test_answer = rag_generate(query, mixed_context)
    
    # Compare answers (should be similar if model focuses on relevant docs)
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    baseline_emb = model.encode(baseline_answer)
    test_emb = model.encode(test_answer)
    
    similarity = np.dot(baseline_emb, test_emb) / (
        np.linalg.norm(baseline_emb) * np.linalg.norm(test_emb)
    )
    
    return {
        "robustness_score": similarity,
        "baseline_answer": baseline_answer,
        "test_answer": test_answer
    }

High similarity (>0.85) suggests robust grounding. Low similarity suggests the model is distracted by noise.

Technique 3: Counterfactual Testing

Modify a document's content and see if the answer changes appropriately.

def counterfactual_test(query: str, original_doc: str, modified_doc: str) -> dict:
    """
    Change a key fact in a document, ensure answer reflects the change.
    """
    # Original
    original_answer = rag_generate(query, [original_doc])
    
    # Modified
    modified_answer = rag_generate(query, [modified_doc])
    
    # Answers should differ if model is grounded
    from difflib import SequenceMatcher
    
    similarity = SequenceMatcher(None, original_answer, modified_answer).ratio()
    
    return {
        "answer_changed": similarity < 0.8,  # Threshold for "different"
        "original_answer": original_answer,
        "modified_answer": modified_answer,
        "similarity": similarity
    }

# Example
original = "The company was founded in 2010 and has 500 employees."
modified = "The company was founded in 2020 and has 50 employees."

result = counterfactual_test("When was the company founded?", original, modified)
if result['answer_changed']:
    print("Model is grounded (answer changed with modified context)")
else:
    print("Model may be hallucinating (answer didn't change)")

Diagnosing RAG Failures

When end-to-end metrics are poor, isolate the failure stage.

Failure Mode 1: Retrieval Miss

Symptom: Answer is generic or says "I don't have that information."

Diagnosis: Check retrieval metrics (Recall@K is low).

Solution:

  • Improve embedding model
  • Add query rewriting
  • Use hybrid retrieval (BM25 + vectors)
  • Expand document chunking strategy

Failure Mode 2: Context Overload

Symptom: Answer is vague or misses key details despite relevant docs being retrieved.

Diagnosis: Too many documents in context, signal lost in noise.

Solution:

  • Reduce K (retrieve fewer docs)
  • Re-rank documents by relevance before passing to LLM
  • Use a better generation model (longer context window, better instruction following)

Failure Mode 3: Hallucination

Symptom: Answer includes facts not in retrieved documents.

Diagnosis: Grounding score is low.

Solution:

  • Improve prompt: explicitly instruct "only use provided documents"
  • Add negative examples (few-shot) showing what not to do
  • Use a model with better instruction-following (GPT-4 > GPT-3.5 for grounding)
  • Post-process: verify claims against context before returning

Failure Mode 4: Poor Document Quality

Symptom: Retrieval metrics are good, but answers are still wrong.

Diagnosis: Retrieved docs contain incorrect/outdated information.

Solution:

  • Improve document curation (remove outdated content)
  • Add metadata filtering (only retrieve docs updated after date X)
  • Implement source credibility scoring
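
Once the per-stage metrics above exist, this triage can be automated. A rough sketch with illustrative thresholds, assuming recall@5, grounding, and correctness are computed per eval example:

def diagnose_failure(recall_at_5: float, grounding: float, correctness: float) -> str:
    """
    Route a failing eval example to the most likely failure mode.
    Thresholds are starting points, not tuned values.
    """
    if correctness >= 0.7:
        return "ok"
    if recall_at_5 < 0.5:
        return "retrieval-miss"        # Failure Mode 1: fix retrieval first
    if grounding < 0.7:
        return "hallucination"         # Failure Mode 3: answer not supported by context
    # Right docs retrieved and answer grounded, yet still wrong: either
    # context overload (Mode 2) or bad source documents (Mode 4)
    return "context-overload-or-doc-quality"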

Advanced RAG Patterns

Multi-Hop Reasoning

Some questions require synthesizing information from multiple documents.

Example: "Who was the CEO of Apple when the iPhone was released?"

  • Retrieve: "iPhone was released in 2007"
  • Retrieve: "Steve Jobs was CEO of Apple from 1997-2011"
  • Synthesize: "Steve Jobs"

Evaluation:

def evaluate_multi_hop(query: str, answer: str, required_docs: list[int]) -> dict:
    """
    Check if answer requires (and uses) multiple documents.
    """
    # Retrieve docs for query
    retrieved_doc_ids = retrieval_system.retrieve(query)
    
    # Check if all required docs were retrieved
    retrieval_complete = all(doc_id in retrieved_doc_ids for doc_id in required_docs)
    
    # Check if answer synthesizes information (not just copying one doc)
    # Use LLM judge
    judge_prompt = f"""
    Query: {query}
    Required documents: {required_docs}
    Retrieved documents: {retrieved_doc_ids}
    Answer: {answer}
    
    Does this answer correctly synthesize information from multiple documents?
    Rate (0-10):
    """
    
    synthesis_score = llm_judge(judge_prompt, model="gpt-4")
    
    return {
        "retrieval_complete": retrieval_complete,
        "synthesis_score": float(synthesis_score) / 10,
        "multi_hop_success": retrieval_complete and (float(synthesis_score) >= 7)
    }

Conversational RAG

Multi-turn conversations require maintaining context and updating retrieval based on history.

Evaluation considerations:

  • Track context across turns
  • Measure if retrieval adapts to conversation (e.g., resolves pronouns)
  • Evaluate coherence across turns

def evaluate_conversational_rag(conversation_history: list[dict], eval_turn: int) -> dict:
    """
    conversation_history: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, ...]
    eval_turn: Which turn to evaluate
    """
    # Get query at eval turn
    current_query = conversation_history[eval_turn]['content']
    
    # Retrieve with conversation context
    retrieved_docs = rag_retrieval_with_history(current_query, conversation_history[:eval_turn])
    
    # Generate answer
    answer = rag_generate_with_history(current_query, retrieved_docs, conversation_history[:eval_turn])
    
    # Evaluate:
    # 1. Does retrieval use conversation context (e.g., resolve "it" referring to previous topic)?
    # 2. Is answer coherent with conversation history?
    # 3. Does it maintain persona/tone established earlier?
    
    return {
        "retrieval_context_aware": check_context_awareness(current_query, retrieved_docs, conversation_history),
        "answer_coherence": check_conversation_coherence(answer, conversation_history),
        "tone_consistency": check_tone_consistency(answer, conversation_history)
    }

Production Monitoring for RAG

Track these metrics in production:

Retrieval metrics (sample 1% of queries):

  • Average Precision@5
  • Average MRR
  • Queries with zero results

Grounding metrics (all queries):

  • Grounding score distribution
  • Rate of "I don't have that information" responses
  • Citation accuracy (if using citations)

Quality metrics (from user feedback):

  • Explicit feedback rate (thumbs up/down)
  • Follow-up question rate (suggests incomplete first answer)
  • Escalation rate (user asks for human help)
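
For the sampled retrieval metrics, a minimal production hook might look like the sketch below; log_metric and enqueue_for_review are assumed integration points, and relevance labels typically come from an offline human or LLM review queue rather than being available inline:

import random

SAMPLE_RATE = 0.01  # review ~1% of production queries offline

def log_retrieval_sample(query: str, retrieved_doc_ids: list, log_metric, enqueue_for_review):
    """
    log_metric: assumed hook into your metrics pipeline
    enqueue_for_review: assumed hook into a human/LLM relevance-labeling queue
    """
    # Zero-result rate needs no labels, so track it on every query
    log_metric("retrieval.zero_results", 1.0 if not retrieved_doc_ids else 0.0)
    
    # Precision@5 and MRR need relevance labels, so sample queries for offline scoring
    if random.random() < SAMPLE_RATE:
        enqueue_for_review({"query": query, "retrieved": retrieved_doc_ids[:10]})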

Alerting:

alerts:
  - name: retrieval-degradation
    condition: precision@5 < 0.65 for 1 hour
    action: "alert #rag-team"

  - name: grounding-issues
    condition: grounding_score < 0.8 for 15 minutes
    action: "alert #rag-team"

  - name: zero-results-spike
    condition: zero_results_rate > 5% for 30 minutes
    action: "alert #rag-team"

Conclusion

Evaluating RAG systems requires moving beyond generic accuracy metrics to measure:

  1. Retrieval quality: Are we fetching the right documents?
  2. Grounding: Does the answer use those documents faithfully?
  3. Answer quality: Is the final response correct, complete, and relevant?
  4. End-to-end performance: Does it solve the user's problem?

Each dimension needs its own metrics and evaluation methods. Failures at any stage compromise the entire system.

Start with:

  • Build a 100-example eval set with annotated relevant documents
  • Measure retrieval Precision@5 and Recall@5
  • Implement grounding score (claim verification or NLI)
  • Track end-to-end answer correctness
  • Monitor production metrics and correlate with user satisfaction

Within weeks, you'll have visibility into your RAG system's true performance—not just whether it runs, but whether it's actually helpful.


Next Steps:

Questions about evaluating your RAG system? Email hello@evalops.dev.