The RAG Evaluation Challenge
Your RAG system looks great in demos. Given a question, it retrieves relevant documents and generates a well-formatted answer. Ship it!
Then production happens: users report the system "makes things up," answers are "off-topic," or responses reference documents that weren't retrieved. What went wrong?
Traditional LLM evaluation assumes a direct input-output relationship. But RAG is a pipeline: query → retrieval → generation → response. Failure can occur at any stage:
- Retrieval fails: Wrong documents fetched
- Generation ignores context: Model hallucinates instead of using retrieved docs
- Contradictory sources: Retrieved docs disagree, model picks wrong one
- Context overload: Too many docs, signal gets lost in noise
Evaluating RAG systems requires measuring each stage independently plus end-to-end behavior. Generic accuracy metrics miss these nuances entirely.
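In practice, that means logging each stage's output separately so it can be scored on its own. Here is a minimal sketch of such a per-request trace; the RAGTrace name and fields are illustrative, not taken from any particular framework:
from dataclasses import dataclass, field

@dataclass
class RAGTrace:
    """One RAG request with each pipeline stage captured separately."""
    query: str
    rewritten_query: str = ""                                   # if query rewriting is used
    retrieved_doc_ids: list[int] = field(default_factory=list)  # scored with Precision@K, Recall@K, MRR
    context: str = ""                                           # what was actually sent to the model
    answer: str = ""                                            # scored for grounding and answer quality
Each field feeds one of the four dimensions below, so a poor end-to-end score can be traced back to the stage that caused it.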
The Four Dimensions of RAG Quality
1. Retrieval Quality: "Did we fetch the right documents?"
Metrics:
Precision@K: What fraction of top-K retrieved documents are relevant?
def precision_at_k(retrieved_docs, relevant_doc_ids, k=5):
"""
retrieved_docs: List of document IDs, ranked by relevance
relevant_doc_ids: Set of ground-truth relevant doc IDs
"""
top_k = set(retrieved_docs[:k])
relevant_in_top_k = top_k.intersection(relevant_doc_ids)
return len(relevant_in_top_k) / k
Recall@K: What fraction of relevant documents are in top-K?
def recall_at_k(retrieved_docs, relevant_doc_ids, k=5):
top_k = set(retrieved_docs[:k])
relevant_in_top_k = top_k.intersection(relevant_doc_ids)
return len(relevant_in_top_k) / len(relevant_doc_ids) if relevant_doc_ids else 0
Mean Reciprocal Rank (MRR): How highly ranked is the first relevant document?
def mean_reciprocal_rank(retrieved_docs, relevant_doc_ids):
for i, doc_id in enumerate(retrieved_docs, 1):
if doc_id in relevant_doc_ids:
return 1.0 / i
return 0.0
NDCG (Normalized Discounted Cumulative Gain): Accounts for graded relevance and position
import numpy as np
def ndcg_at_k(retrieved_docs, relevance_scores, k=5):
"""
relevance_scores: Dict mapping doc_id -> relevance (0-3 scale)
"""
dcg = sum(
relevance_scores.get(doc_id, 0) / np.log2(i + 2)
for i, doc_id in enumerate(retrieved_docs[:k])
)
# Ideal DCG: docs sorted by relevance
ideal_scores = sorted(relevance_scores.values(), reverse=True)[:k]
idcg = sum(score / np.log2(i + 2) for i, score in enumerate(ideal_scores))
return dcg / idcg if idcg > 0 else 0
Example evaluation:
# Ground truth: For question "How do I reset my password?",
# docs 15, 42, and 103 are relevant
relevant_docs = {15, 42, 103}
# System retrieved: [42, 15, 88, 103, 12]
retrieved = [42, 15, 88, 103, 12]
print(f"Precision@5: {precision_at_k(retrieved, relevant_docs, k=5):.2f}")
# Output: 0.60 (3 out of 5 are relevant)
print(f"Recall@5: {recall_at_k(retrieved, relevant_docs, k=5):.2f}")
# Output: 1.00 (all 3 relevant docs are in top 5)
print(f"MRR: {mean_reciprocal_rank(retrieved, relevant_docs):.2f}")
# Output: 1.00 (first retrieved doc is relevant)
2. Context Grounding: "Does the answer use the retrieved documents?"
The model has context but might ignore it and hallucinate. Measure faithfulness to the provided documents.
Approach 1: Claim Verification
Extract claims from the answer, check if each is supported by retrieved documents.
async def check_grounding(answer: str, retrieved_docs: list[str]) -> dict:
"""
Use an LLM to verify if answer claims are in the documents.
"""
judge_prompt = f"""
Retrieved documents:
{format_docs(retrieved_docs)}
Generated answer:
{answer}
For each factual claim in the answer, determine if it is supported by the retrieved documents.
Output JSON:
{{
"claims": [
{{"claim": "...", "supported": true/false, "source_doc": "doc_id or null"}},
...
],
"grounding_score": 0.0-1.0 // fraction of claims that are supported
}}
"""
result = await llm_judge(judge_prompt, model="gpt-4", response_format="json")
return result
Approach 2: Entailment Scoring
Use a smaller model fine-tuned for natural language inference (NLI):
from transformers import pipeline
# NLI model whose labels are CONTRADICTION / NEUTRAL / ENTAILMENT
nli_model = pipeline("text-classification", model="microsoft/deberta-large-mnli")
def entailment_score(answer: str, retrieved_docs: list[str]) -> float:
    """
    Check if the answer is entailed by (logically follows from) the retrieved docs.
    """
    # Concatenate docs into a single premise
    premise = " ".join(retrieved_docs)
    # Score the (premise, hypothesis) pair; very long contexts may need truncation or per-doc scoring
    result = nli_model([{"text": premise, "text_pair": answer}])[0]
    # Label is one of: ENTAILMENT, NEUTRAL, CONTRADICTION
    if result["label"] == "ENTAILMENT":
        return result["score"]
    elif result["label"] == "NEUTRAL":
        return 0.5
    else:  # CONTRADICTION
        return 0.0
Approach 3: Citation Checking
If your system generates citations (e.g., "[1]"), verify they're accurate:
import re
def check_citations(answer: str, retrieved_docs: list[dict]) -> dict:
"""
retrieved_docs: [{"id": 1, "content": "..."}, ...]
"""
# Extract citations like [1], [2]
citations = re.findall(r'\[(\d+)\]', answer)
valid_citations = set(str(doc['id']) for doc in retrieved_docs)
cited_doc_ids = set(citations)
# Check if all citations reference retrieved docs
invalid_citations = cited_doc_ids - valid_citations
# Check if cited content actually supports the claim
# (would require more sophisticated analysis)
    return {
        "total_citations": len(citations),
        "invalid_citations": len(invalid_citations),
        # fraction of cited doc IDs that actually exist in the retrieved set
        "citation_accuracy": (1.0 - len(invalid_citations) / len(cited_doc_ids)) if cited_doc_ids else 1.0
    }
3. Answer Quality: "Is the answer correct and complete?"
Even with perfect retrieval and grounding, the answer might be poorly formatted, incomplete, or miss nuances.
Metrics:
Correctness: Does it answer the question accurately?
async def answer_correctness(question: str, answer: str, reference_answer: str) -> float:
"""
Compare generated answer to reference answer (human-written or GPT-4 generated).
"""
judge_prompt = f"""
Question: {question}
Reference answer: {reference_answer}
Generated answer: {answer}
Rate how well the generated answer matches the reference answer in correctness (0-10):
- 10: Equivalent in correctness
- 7-9: Mostly correct, minor differences
- 4-6: Partially correct
- 0-3: Incorrect or misleading
Consider:
- Factual accuracy
- Completeness
- Relevance to the question
Score:
"""
score = await llm_judge(judge_prompt, model="gpt-4")
return float(score) / 10
Completeness: Does it address all parts of the question?
async def answer_completeness(question: str, answer: str) -> float:
judge_prompt = f"""
Question: {question}
Answer: {answer}
Does the answer fully address the question? Consider:
- Are all sub-questions answered?
- Is sufficient detail provided?
- Are important caveats or conditions mentioned?
Rate completeness (0-10):
"""
score = await llm_judge(judge_prompt, model="gpt-4")
return float(score) / 10
Relevance: Does it stay on topic?
def relevance_score(question: str, answer: str) -> float:
"""
Use semantic similarity between question and answer.
"""
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
q_emb = model.encode(question)
a_emb = model.encode(answer)
similarity = np.dot(q_emb, a_emb) / (np.linalg.norm(q_emb) * np.linalg.norm(a_emb))
return float(similarity)
4. End-to-End Performance: "Does it solve the user's problem?"
Ultimately, users care about the final answer, not intermediate steps.
User Satisfaction Proxy:
- Explicit feedback (thumbs up/down)
- Implicit signals (time spent reading, follow-up questions, task completion)
- A/B test metrics (resolution rate, escalation rate)
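These signals are cheap to aggregate once feedback events are logged. A minimal sketch, assuming one event dict per conversation with illustrative field names:
def satisfaction_proxies(events: list[dict]) -> dict:
    """
    events: one dict per conversation, e.g.
      {"feedback": "up" | "down" | None, "follow_up_questions": 2, "escalated": False}
    """
    total = len(events)
    rated = [e for e in events if e.get("feedback") in ("up", "down")]
    return {
        "feedback_rate": len(rated) / total if total else 0.0,
        "thumbs_up_rate": sum(e["feedback"] == "up" for e in rated) / len(rated) if rated else 0.0,
        "follow_up_rate": sum(e.get("follow_up_questions", 0) > 0 for e in events) / total if total else 0.0,
        "escalation_rate": sum(bool(e.get("escalated")) for e in events) / total if total else 0.0,
    }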
Composite Score:
Combine retrieval, grounding, and quality into a single metric:
def rag_composite_score(
precision_at_5: float,
recall_at_5: float,
grounding_score: float,
answer_correctness: float,
answer_completeness: float
) -> float:
"""
Weighted composite score for RAG system.
"""
weights = {
"precision": 0.15,
"recall": 0.15,
"grounding": 0.25, # Critical: don't hallucinate
"correctness": 0.30, # Primary goal
"completeness": 0.15
}
score = (
weights["precision"] * precision_at_5 +
weights["recall"] * recall_at_5 +
weights["grounding"] * grounding_score +
weights["correctness"] * answer_correctness +
weights["completeness"] * answer_completeness
)
return score
Building a RAG Evaluation Dataset
Unlike simple Q&A, RAG eval sets require additional annotations.
Required Components
For each example:
- Query: User's question
- Relevant document IDs: Ground truth for retrieval evaluation
- Reference answer: Expected output (optional but helpful)
- Key facts: Claims that must be included (alternative to full reference answer)
- Metadata: Difficulty, category, multi-hop reasoning required?
Example:
{
"id": "eval-001",
"query": "What are the symptoms of type 2 diabetes?",
"relevant_doc_ids": [42, 103, 217],
"reference_answer": "Common symptoms of type 2 diabetes include increased thirst, frequent urination, increased hunger, unexplained weight loss, fatigue, blurred vision, slow-healing sores, and frequent infections. Many people with type 2 diabetes have no symptoms initially.",
"key_facts": [
"increased thirst",
"frequent urination",
"fatigue",
"many have no symptoms initially"
],
"metadata": {
"category": "medical-info",
"difficulty": "medium",
"requires_multi_hop": false
}
}
How to Annotate Relevant Documents
Option 1: Manual annotation (most accurate)
- For each query, have humans review the document corpus
- Mark which documents contain relevant information
- Time-consuming but produces high-quality ground truth
Option 2: Pseudo-labeling with strong retrieval
- Use a strong retrieval model (e.g., state-of-the-art embedding model)
- Have humans verify top-10 results, correct mistakes
- Faster than Option 1, still high quality
Option 3: Synthetic generation
- For each document, generate questions it should answer
- Map questions → documents automatically
- Risk: biased toward generation model's view of relevance
Best practice: Combine approaches. Start with pseudo-labeling, manually verify a subset, and use those verified examples for your most critical evals (see the sketch below).
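A sketch of the pseudo-labeling flow from Option 2: retrieve candidates with a strong retriever, then hand the top-K to human reviewers for verification. The strong_retriever.retrieve interface is an assumption, not a specific library:
def build_annotation_queue(queries: list[str], strong_retriever, k: int = 10) -> list[dict]:
    """
    For each query, retrieve top-k candidates and emit an annotation task;
    a human reviewer then keeps or rejects each candidate document.
    """
    tasks = []
    for query in queries:
        candidate_ids = strong_retriever.retrieve(query, top_k=k)  # assumed: returns ranked doc IDs
        tasks.append({
            "query": query,
            "candidate_doc_ids": candidate_ids,
            "verified_relevant_doc_ids": None,  # filled in during human review
        })
    return tasks
The verified tasks then become the relevant_doc_ids ground truth used throughout this post.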
Evaluating Retrieval Systems
Comparing Embedding Models
Test different embedding models on your domain:
embedding_models = [
"text-embedding-3-small",
"text-embedding-3-large",
"text-embedding-ada-002",
"voyage-02",
"sentence-transformers/all-MiniLM-L6-v2"
]
results = {}
for model in embedding_models:
# Re-index corpus with this model
index = build_vector_index(corpus, embedding_model=model)
# Evaluate retrieval
metrics = evaluate_retrieval(
queries=eval_queries,
index=index,
ground_truth=relevant_docs_map
)
results[model] = metrics
# Compare
for model, metrics in results.items():
print(f"{model}:")
print(f" Precision@5: {metrics['precision@5']:.3f}")
print(f" Recall@5: {metrics['recall@5']:.3f}")
print(f" MRR: {metrics['mrr']:.3f}")
Example output:
text-embedding-3-small:
Precision@5: 0.720
Recall@5: 0.680
MRR: 0.810
text-embedding-3-large:
Precision@5: 0.780
Recall@5: 0.750
MRR: 0.850
sentence-transformers/all-MiniLM-L6-v2:
Precision@5: 0.650
Recall@5: 0.620
MRR: 0.730
Decision: text-embedding-3-large performs best. Whether it justifies the extra cost over -small depends on your accuracy requirements and query volume.
Hybrid Retrieval: BM25 + Vector Search
Combine keyword-based (BM25) and semantic (vector) retrieval:
def hybrid_retrieval(query: str, corpus: list, alpha=0.5) -> list:
"""
alpha: weight for vector search (1-alpha for BM25)
"""
# BM25 scores
bm25_scores = bm25_search(query, corpus)
# Vector search scores
vector_scores = vector_search(query, corpus)
# Combine with weighted average
combined_scores = {}
for doc_id in set(bm25_scores.keys()).union(vector_scores.keys()):
bm25_score = bm25_scores.get(doc_id, 0)
vector_score = vector_scores.get(doc_id, 0)
combined_scores[doc_id] = alpha * vector_score + (1 - alpha) * bm25_score
# Rank by combined score
ranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
return [doc_id for doc_id, score in ranked]
# Evaluate different alpha values
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
metrics = evaluate_retrieval(
queries=eval_queries,
retrieval_fn=lambda q: hybrid_retrieval(q, corpus, alpha=alpha),
ground_truth=relevant_docs_map
)
print(f"alpha={alpha}: Recall@5={metrics['recall@5']:.3f}")
Find the optimal alpha for your domain (often around 0.6-0.7).
Query Rewriting
Sometimes user queries are ambiguous or poorly phrased. Rewrite before retrieving:
async def rewrite_query(user_query: str) -> str:
"""
Expand abbreviations, clarify intent, add domain keywords.
"""
rewrite_prompt = f"""
User query: {user_query}
Rewrite this query to be more effective for document retrieval:
- Expand abbreviations
- Add relevant domain keywords
- Clarify ambiguous terms
- Maintain the user's intent
Rewritten query:
"""
return await llm_call(rewrite_prompt, model="gpt-3.5-turbo", max_tokens=100)
# Evaluate: original query vs. rewritten
original_metrics = evaluate_retrieval(eval_queries, index, ground_truth)
rewritten_queries = [await rewrite_query(q) for q in eval_queries]
rewritten_metrics = evaluate_retrieval(rewritten_queries, index, ground_truth)
print(f"Original Recall@5: {original_metrics['recall@5']:.3f}")
print(f"Rewritten Recall@5: {rewritten_metrics['recall@5']:.3f}")
Evaluating Generation: Grounding and Faithfulness
Technique 1: Self-Consistency Checks
Ask the model multiple variations of the same question. Consistent answers suggest grounded responses.
async def check_self_consistency(question: str, context: str, n=3) -> float:
"""
Generate n answers, measure agreement.
"""
answers = []
for _ in range(n):
answer = await rag_generate(question, context, temperature=0.7)
answers.append(answer)
# Measure pairwise semantic similarity
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(answers)
# Average pairwise cosine similarity
similarities = []
for i in range(len(embeddings)):
for j in range(i + 1, len(embeddings)):
sim = np.dot(embeddings[i], embeddings[j])
sim /= (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
similarities.append(sim)
return np.mean(similarities)
# High consistency (>0.9) suggests grounded, low (<0.7) suggests hallucination
consistency = await check_self_consistency(query, retrieved_docs)
print(f"Self-consistency: {consistency:.2f}")
Technique 2: Adversarial Context Injection
Add irrelevant or contradictory documents to the context. A robust system should ignore them.
async def adversarial_context_test(query: str, relevant_docs: list, irrelevant_docs: list) -> dict:
"""
Test if model is distracted by irrelevant context.
"""
# Baseline: answer with only relevant docs
    baseline_answer = await rag_generate(query, relevant_docs)
# Test: answer with relevant + irrelevant docs
mixed_context = relevant_docs + irrelevant_docs
    test_answer = await rag_generate(query, mixed_context)
# Compare answers (should be similar if model focuses on relevant docs)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
baseline_emb = model.encode(baseline_answer)
test_emb = model.encode(test_answer)
similarity = np.dot(baseline_emb, test_emb) / (
np.linalg.norm(baseline_emb) * np.linalg.norm(test_emb)
)
return {
"robustness_score": similarity,
"baseline_answer": baseline_answer,
"test_answer": test_answer
}
High similarity (>0.85) suggests robust grounding. Low similarity suggests the model is distracted by noise.
Technique 3: Counterfactual Testing
Modify a document's content and see if the answer changes appropriately.
async def counterfactual_test(query: str, original_doc: str, modified_doc: str) -> dict:
"""
Change a key fact in a document, ensure answer reflects the change.
"""
# Original
    original_answer = await rag_generate(query, [original_doc])
# Modified
    modified_answer = await rag_generate(query, [modified_doc])
# Answers should differ if model is grounded
from difflib import SequenceMatcher
similarity = SequenceMatcher(None, original_answer, modified_answer).ratio()
return {
"answer_changed": similarity < 0.8, # Threshold for "different"
"original_answer": original_answer,
"modified_answer": modified_answer,
"similarity": similarity
}
# Example
original = "The company was founded in 2010 and has 500 employees."
modified = "The company was founded in 2020 and has 50 employees."
result = await counterfactual_test("When was the company founded?", original, modified)
if result['answer_changed']:
print("Model is grounded (answer changed with modified context)")
else:
print("Model may be hallucinating (answer didn't change)")
Diagnosing RAG Failures
When end-to-end metrics are poor, isolate the failure stage.
Failure Mode 1: Retrieval Miss
Symptom: Answer is generic or says "I don't have that information."
Diagnosis: Check retrieval metrics (Recall@K is low).
Solution:
- Improve embedding model
- Add query rewriting
- Use hybrid retrieval (BM25 + vectors)
- Revisit the document chunking strategy (chunk size, overlap)
Failure Mode 2: Context Overload
Symptom: Answer is vague or misses key details despite relevant docs being retrieved.
Diagnosis: Too many documents in context, signal lost in noise.
Solution:
- Reduce K (retrieve fewer docs)
- Re-rank documents by relevance before passing them to the LLM (see the sketch below)
- Use a better generation model (longer context window, better instruction following)
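The re-ranking step mentioned above is typically a cross-encoder that scores each (query, document) pair and keeps only the best few for the prompt. A sketch using sentence-transformers; the specific model is one common public re-ranker, not a requirement:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_trim(query: str, docs: list[str], keep: int = 3) -> list[str]:
    """Score each (query, doc) pair and keep only the highest-scoring docs for the prompt."""
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]
Retrieving a larger K and trimming after re-ranking preserves recall while keeping the prompt small.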
Failure Mode 3: Hallucination
Symptom: Answer includes facts not in retrieved documents.
Diagnosis: Grounding score is low.
Solution:
- Improve prompt: explicitly instruct "only use provided documents"
- Add negative examples (few-shot) showing what not to do
- Use a model with better instruction-following (GPT-4 > GPT-3.5 for grounding)
- Post-process: verify claims against context before returning
Failure Mode 4: Poor Document Quality
Symptom: Retrieval metrics are good, but answers are still wrong.
Diagnosis: Retrieved docs contain incorrect/outdated information.
Solution:
- Improve document curation (remove outdated content)
- Add metadata filtering (only retrieve docs updated after date X)
- Implement source credibility scoring
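These four modes can be turned into a rough triage helper that points debugging at the right stage. A sketch, assuming per-example scores from the metrics earlier in this post; the thresholds are illustrative starting points, not recommendations:
def triage_rag_failure(metrics: dict) -> str:
    """Map one example's stage metrics to the most likely failure mode above."""
    if metrics["recall@5"] < 0.5:
        return "retrieval-miss"        # Failure Mode 1: relevant docs never reached the context
    if metrics["grounding_score"] < 0.7:
        return "hallucination"         # Failure Mode 3: answer not supported by retrieved docs
    if metrics["answer_completeness"] < 0.6 and metrics.get("num_context_docs", 0) > 8:
        return "context-overload"      # Failure Mode 2: signal lost among too many docs
    if metrics["answer_correctness"] < 0.6:
        return "document-quality"      # Failure Mode 4: retrieval looks fine, content itself is wrong
    return "ok"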
Advanced RAG Patterns
Multi-Hop Reasoning
Some questions require synthesizing information from multiple documents.
Example: "Who was the CEO of Apple when the iPhone was released?"
- Retrieve: "iPhone was released in 2007"
- Retrieve: "Steve Jobs was CEO of Apple from 1997-2011"
- Synthesize: "Steve Jobs"
Evaluation:
async def evaluate_multi_hop(query: str, answer: str, required_docs: list[int]) -> dict:
"""
Check if answer requires (and uses) multiple documents.
"""
# Retrieve docs for query
retrieved_doc_ids = retrieval_system.retrieve(query)
# Check if all required docs were retrieved
retrieval_complete = all(doc_id in retrieved_doc_ids for doc_id in required_docs)
# Check if answer synthesizes information (not just copying one doc)
# Use LLM judge
judge_prompt = f"""
Query: {query}
Required documents: {required_docs}
Retrieved documents: {retrieved_doc_ids}
Answer: {answer}
Does this answer correctly synthesize information from multiple documents?
Rate (0-10):
"""
    synthesis_score = await llm_judge(judge_prompt, model="gpt-4")
return {
"retrieval_complete": retrieval_complete,
"synthesis_score": float(synthesis_score) / 10,
"multi_hop_success": retrieval_complete and (float(synthesis_score) >= 7)
}
Conversational RAG
Multi-turn conversations require maintaining context and updating retrieval based on history.
Evaluation considerations:
- Track context across turns
- Measure if retrieval adapts to conversation (e.g., resolves pronouns)
- Evaluate coherence across turns
def evaluate_conversational_rag(conversation_history: list[dict], eval_turn: int) -> dict:
"""
conversation_history: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, ...]
eval_turn: Which turn to evaluate
"""
# Get query at eval turn
current_query = conversation_history[eval_turn]['content']
# Retrieve with conversation context
retrieved_docs = rag_retrieval_with_history(current_query, conversation_history[:eval_turn])
# Generate answer
answer = rag_generate_with_history(current_query, retrieved_docs, conversation_history[:eval_turn])
# Evaluate:
# 1. Does retrieval use conversation context (e.g., resolve "it" referring to previous topic)?
# 2. Is answer coherent with conversation history?
# 3. Does it maintain persona/tone established earlier?
return {
"retrieval_context_aware": check_context_awareness(current_query, retrieved_docs, conversation_history),
"answer_coherence": check_conversation_coherence(answer, conversation_history),
"tone_consistency": check_tone_consistency(answer, conversation_history)
}
Production Monitoring for RAG
Track these metrics in production:
Retrieval metrics (sample 1% of queries):
- Average Precision@5
- Average MRR
- Queries with zero results
Grounding metrics (all queries):
- Grounding score distribution
- Rate of "I don't have that information" responses
- Citation accuracy (if using citations)
Quality metrics (from user feedback):
- Explicit feedback rate (thumbs up/down)
- Follow-up question rate (suggests incomplete first answer)
- Escalation rate (user asks for human help)
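A minimal sketch of the production hook behind these numbers; log_metric and enqueue_for_offline_eval stand in for whatever metrics backend and labeling queue you already run:
import random

SAMPLE_RATE = 0.01  # evaluate retrieval quality on ~1% of production queries

async def monitor_rag_request(query: str, retrieved_doc_ids: list[int],
                              retrieved_docs: list[str], answer: str) -> None:
    """Emit monitoring signals for one production request."""
    # Grounding is checked on every request (reuses check_grounding from earlier)
    grounding = await check_grounding(answer, retrieved_docs)
    log_metric("rag.grounding_score", grounding["grounding_score"])  # placeholder metrics call
    log_metric("rag.zero_results", int(len(retrieved_doc_ids) == 0))

    # Retrieval metrics need ground-truth labels, so sample queries for offline evaluation
    if random.random() < SAMPLE_RATE:
        enqueue_for_offline_eval({  # placeholder: push to a labeling / eval queue
            "query": query,
            "retrieved_doc_ids": retrieved_doc_ids,
            "answer": answer,
        })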
Alerting:
alerts:
- name: retrieval-degradation
condition: precision@5 < 0.65 for 1 hour
action: alert #rag-team
- name: grounding-issues
condition: grounding_score < 0.8 for 15 minutes
action: alert #rag-team
- name: zero-results-spike
condition: zero_results_rate > 5% for 30 minutes
action: alert #rag-team
Conclusion
Evaluating RAG systems requires moving beyond generic accuracy metrics to measure:
- Retrieval quality: Are we fetching the right documents?
- Grounding: Does the answer use those documents faithfully?
- Answer quality: Is the final response correct, complete, and relevant?
- End-to-end performance: Does it solve the user's problem?
Each dimension needs its own metrics and evaluation methods. Failures at any stage compromise the entire system.
Start with:
- Build a 100-example eval set with annotated relevant documents
- Measure retrieval Precision@5 and Recall@5
- Implement grounding score (claim verification or NLI)
- Track end-to-end answer correctness
- Monitor production metrics and correlate with user satisfaction
Within weeks, you'll have visibility into your RAG system's true performance—not just whether it runs, but whether it's actually helpful.
Next Steps:
- Set up RAG evaluation with EvalOps
- Import RAG evaluation recipes from Spellbook
- Join discussions on RAG patterns
Questions about evaluating your RAG system? Email hello@evalops.dev.