The RAG Evaluation Challenge
Your RAG system looks great in demos. Given a question, it retrieves relevant documents and generates a well-formatted answer. Ship it!
Then production happens: users report the system "makes things up," answers are "off-topic," or responses reference documents that weren't retrieved. What went wrong?
Traditional LLM evaluation assumes a direct input-output relationship. But RAG is a pipeline: query → retrieval → generation → response. Failure can occur at any stage:
- Retrieval fails: Wrong documents fetched
- Generation ignores context: Model hallucinates instead of using retrieved docs
- Contradictory sources: Retrieved docs disagree, model picks wrong one
- Context overload: Too many docs, signal gets lost in noise
Evaluating RAG systems requires measuring each stage independently plus end-to-end behavior. Generic accuracy metrics miss these nuances entirely.
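In practice, that means logging each stage's output separately so it can be scored on its own. Here is a minimal sketch of such a per-request trace; the RAGTrace name and fields are illustrative, not taken from any particular framework:
from dataclasses import dataclass, field

@dataclass
class RAGTrace:
    """One RAG request with each pipeline stage captured separately."""
    query: str
    rewritten_query: str = ""                                   # if query rewriting is used
    retrieved_doc_ids: list[int] = field(default_factory=list)  # scored with Precision@K, Recall@K, MRR
    context: str = ""                                           # what was actually sent to the model
    answer: str = ""                                            # scored for grounding and answer quality
Each field feeds one of the four dimensions below, so a poor end-to-end score can be traced back to the stage that caused it.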
The Four Dimensions of RAG Quality
1. Retrieval Quality: "Did we fetch the right documents?"
Metrics:
Precision@K: What fraction of top-K retrieved documents are relevant?
def precision_at_k(retrieved_docs, relevant_doc_ids, k=5):
"""
retrieved_docs: List of document IDs, ranked by relevance
relevant_doc_ids: Set of ground-truth relevant doc IDs
"""
top_k = set(retrieved_docs[:k])
relevant_in_top_k = top_k.intersection(relevant_doc_ids)
return len(relevant_in_top_k) / k
Recall@K: What fraction of relevant documents are in top-K?
def recall_at_k(retrieved_docs, relevant_doc_ids, k=5):
top_k = set(retrieved_docs[:k])
relevant_in_top_k = top_k.intersection(relevant_doc_ids)
return len(relevant_in_top_k) / len(relevant_doc_ids) if relevant_doc_ids else 0
Mean Reciprocal Rank (MRR): How highly ranked is the first relevant document?
def mean_reciprocal_rank(retrieved_docs, relevant_doc_ids):
for i, doc_id in enumerate(retrieved_docs, 1):
if doc_id in relevant_doc_ids:
return 1.0 / i
return 0.0
NDCG (Normalized Discounted Cumulative Gain): Accounts for graded relevance and position
import numpy as np
def ndcg_at_k(retrieved_docs, relevance_scores, k=5):
"""
relevance_scores: Dict mapping doc_id -> relevance (0-3 scale)
"""
dcg = sum(
relevance_scores.get(doc_id, 0) / np.log2(i + 2)
for i, doc_id in enumerate(retrieved_docs[:k])
)
# Ideal DCG: docs sorted by relevance
ideal_scores = sorted(relevance_scores.values(), reverse=True)[:k]
idcg = sum(score / np.log2(i + 2) for i, score in enumerate(ideal_scores))
return dcg / idcg if idcg > 0 else 0
Example evaluation:
# Ground truth: For question "How do I reset my password?",
# docs 15, 42, and 103 are relevant
relevant_docs = {15, 42, 103}
# System retrieved: [42, 15, 88, 103, 12]
retrieved = [42, 15, 88, 103, 12]
print(f"Precision@5: {precision_at_k(retrieved, relevant_docs, k=5):.2f}")
# Output: 0.60 (3 out of 5 are relevant)
print(f"Recall@5: {recall_at_k(retrieved, relevant_docs, k=5):.2f}")
# Output: 1.00 (all 3 relevant docs are in top 5)
print(f"MRR: {mean_reciprocal_rank(retrieved, relevant_docs):.2f}")
# Output: 1.00 (first retrieved doc is relevant)
2. Context Grounding: "Does the answer use the retrieved documents?"
The model has context but might ignore it and hallucinate. Measure faithfulness to the provided documents.
Approach 1: Claim Verification
Extract claims from the answer, check if each is supported by retrieved documents.
async def check_grounding(answer: str, retrieved_docs: list[str]) -> dict:
"""
Use an LLM to verify if answer claims are in the documents.
"""
judge_prompt = f"""
Retrieved documents:
{format_docs(retrieved_docs)}
Generated answer:
{answer}
For each factual claim in the answer, determine if it is supported by the retrieved documents.
Output JSON:
{{
"claims": [
{{"claim": "...", "supported": true/false, "source_doc": "doc_id or null"}},
...
],
"grounding_score": 0.0-1.0 // fraction of claims that are supported
}}
"""
result = await llm_judge(judge_prompt, model="gpt-4", response_format="json")
return result
Approach 2: Entailment Scoring
Use a smaller model fine-tuned for natural language inference (NLI):
from transformers import pipeline
# NLI model whose labels are CONTRADICTION / NEUTRAL / ENTAILMENT
nli_model = pipeline("text-classification", model="microsoft/deberta-large-mnli")
def entailment_score(answer: str, retrieved_docs: list[str]) -> float:
    """
    Check if the answer is entailed by (logically follows from) the retrieved docs.
    """
    # Concatenate docs into a single premise
    premise = " ".join(retrieved_docs)
    # Score the (premise, hypothesis) pair; very long contexts may need truncation or per-doc scoring
    result = nli_model([{"text": premise, "text_pair": answer}])[0]
    # Label is one of: ENTAILMENT, NEUTRAL, CONTRADICTION
    if result["label"] == "ENTAILMENT":
        return result["score"]
    elif result["label"] == "NEUTRAL":
        return 0.5
    else:  # CONTRADICTION
        return 0.0
Approach 3: Citation Checking
If your system generates citations (e.g., "[1]"), verify they're accurate:
import re
def check_citations(answer: str, retrieved_docs: list[dict]) -> dict:
"""
retrieved_docs: [{"id": 1, "content": "..."}, ...]
"""
# Extract citations like [1], [2]
citations = re.findall(r'\[(\d+)\]', answer)
valid_citations = set(str(doc['id']) for doc in retrieved_docs)
cited_doc_ids = set(citations)
# Check if all citations reference retrieved docs
invalid_citations = cited_doc_ids - valid_citations
# Check if cited content actually supports the claim
# (would require more sophisticated analysis)
    return {
        "total_citations": len(citations),
        "invalid_citations": len(invalid_citations),
        # fraction of cited doc IDs that actually exist in the retrieved set
        "citation_accuracy": (1.0 - len(invalid_citations) / len(cited_doc_ids)) if cited_doc_ids else 1.0
    }
3. Answer Quality: "Is the answer correct and complete?"
Even with perfect retrieval and grounding, the answer might be poorly formatted, incomplete, or miss nuances.
Metrics:
Correctness: Does it answer the question accurately?
async def answer_correctness(question: str, answer: str, reference_answer: str) -> float:
"""
Compare generated answer to reference answer (human-written or GPT-4 generated).
"""
judge_prompt = f"""
Question: {question}
Reference answer: {reference_answer}
Generated answer: {answer}
Rate how well the generated answer matches the reference answer in correctness (0-10):
- 10: Equivalent in correctness
- 7-9: Mostly correct, minor differences
- 4-6: Partially correct
- 0-3: Incorrect or misleading
Consider:
- Factual accuracy
- Completeness
- Relevance to the question
Score:
"""
score = await llm_judge(judge_prompt, model="gpt-4")
return float(score) / 10
Completeness: Does it address all parts of the question?
async def answer_completeness(question: str, answer: str) -> float:
judge_prompt = f"""
Question: {question}
Answer: {answer}
Does the answer fully address the question? Consider:
- Are all sub-questions answered?
- Is sufficient detail provided?
- Are important caveats or conditions mentioned?
Rate completeness (0-10):
"""
score = await llm_judge(judge_prompt, model="gpt-4")
return float(score) / 10
Relevance: Does it stay on topic?
def relevance_score(question: str, answer: str) -> float:
"""
Use semantic similarity between question and answer.
"""
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
q_emb = model.encode(question)
a_emb = model.encode(answer)
similarity = np.dot(q_emb, a_emb) / (np.linalg.norm(q_emb) * np.linalg.norm(a_emb))
return float(similarity)
4. End-to-End Performance: "Does it solve the user's problem?"
Ultimately, users care about the final answer, not intermediate steps.
User Satisfaction Proxy:
- Explicit feedback (thumbs up/down)
- Implicit signals (time spent reading, follow-up questions, task completion)
- A/B test metrics (resolution rate, escalation rate)
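These signals are cheap to aggregate once feedback events are logged. A minimal sketch, assuming one event dict per conversation with illustrative field names:
def satisfaction_proxies(events: list[dict]) -> dict:
    """
    events: one dict per conversation, e.g.
      {"feedback": "up" | "down" | None, "follow_up_questions": 2, "escalated": False}
    """
    total = len(events)
    rated = [e for e in events if e.get("feedback") in ("up", "down")]
    return {
        "feedback_rate": len(rated) / total if total else 0.0,
        "thumbs_up_rate": sum(e["feedback"] == "up" for e in rated) / len(rated) if rated else 0.0,
        "follow_up_rate": sum(e.get("follow_up_questions", 0) > 0 for e in events) / total if total else 0.0,
        "escalation_rate": sum(bool(e.get("escalated")) for e in events) / total if total else 0.0,
    }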
Composite Score:
Combine retrieval, grounding, and quality into a single metric:
def rag_composite_score(
precision_at_5: float,
recall_at_5: float,
grounding_score: float,
answer_correctness: float,
answer_completeness: float
) -> float:
"""
Weighted composite score for RAG system.
"""
weights = {
"precision": 0.15,
"recall": 0.15,
"grounding": 0.25, # Critical: don't hallucinate
"correctness": 0.30, # Primary goal
"completeness": 0.15
}
score = (
weights["precision"] * precision_at_5 +
weights["recall"] * recall_at_5 +
weights["grounding"] * grounding_score +
weights["correctness"] * answer_correctness +
weights["completeness"] * answer_completeness
)
return score
Building a RAG Evaluation Dataset
Unlike simple Q&A, RAG eval sets require additional annotations.
Required Components
For each example:
- Query: User's question
- Relevant document IDs: Ground truth for retrieval evaluation
- Reference answer: Expected output (optional but helpful)
- Key facts: Claims that must be included (alternative to full reference answer)
- Metadata: Difficulty, category, multi-hop reasoning required?
Example:
{
"id": "eval-001",
"query": "What are the symptoms of type 2 diabetes?",
"relevant_doc_ids": [42, 103, 217],
"reference_answer": "Common symptoms of type 2 diabetes include increased thirst, frequent urination, increased hunger, unexplained weight loss, fatigue, blurred vision, slow-healing sores, and frequent infections. Many people with type 2 diabetes have no symptoms initially.",
"key_facts": [
"increased thirst",
"frequent urination",
"fatigue",
"many have no symptoms initially"
],
"metadata": {
"category": "medical-info",
"difficulty": "medium",
"requires_multi_hop": false
}
}
How to Annotate Relevant Documents
Option 1: Manual annotation (most accurate)
- For each query, have humans review the document corpus
- Mark which documents contain relevant information
- Time-consuming but produces high-quality ground truth
Option 2: Pseudo-labeling with strong retrieval
- Use a strong retrieval model (e.g., state-of-the-art embedding model)
- Have humans verify top-10 results, correct mistakes
- Faster than Option 1, still high quality
Option 3: Synthetic generation
- For each document, generate questions it should answer
- Map questions → documents automatically
- Risk: biased toward generation model's view of relevance
Best practice: Combine approaches. Start with pseudo-labeling, manually verify a subset, and use those verified examples for your most critical evals (see the sketch below).
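A sketch of the pseudo-labeling flow from Option 2: retrieve candidates with a strong retriever, then hand the top-K to human reviewers for verification. The strong_retriever.retrieve interface is an assumption, not a specific library:
def build_annotation_queue(queries: list[str], strong_retriever, k: int = 10) -> list[dict]:
    """
    For each query, retrieve top-k candidates and emit an annotation task;
    a human reviewer then keeps or rejects each candidate document.
    """
    tasks = []
    for query in queries:
        candidate_ids = strong_retriever.retrieve(query, top_k=k)  # assumed: returns ranked doc IDs
        tasks.append({
            "query": query,
            "candidate_doc_ids": candidate_ids,
            "verified_relevant_doc_ids": None,  # filled in during human review
        })
    return tasks
The verified tasks then become the relevant_doc_ids ground truth used throughout this post.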
Evaluating Retrieval Systems
Comparing Embedding Models
Test different embedding models on your domain:
embedding_models = [
"text-embedding-3-small",
"text-embedding-3-large",
"text-embedding-ada-002",
"voyage-02",
"sentence-transformers/all-MiniLM-L6-v2"
]
results = {}
for model in embedding_models:
# Re-index corpus with this model
index = build_vector_index(corpus, embedding_model=model)
# Evaluate retrieval
metrics = evaluate_retrieval(
queries=eval_queries,
index=index,
ground_truth=relevant_docs_map
)
results[model] = metrics
# Compare
for model, metrics in results.items():
print(f"{model}:")
print(f" Precision@5: {metrics['precision@5']:.3f}")
print(f" Recall@5: {metrics['recall@5']:.3f}")
print(f" MRR: {metrics['mrr']:.3f}")
Example output:
text-embedding-3-small:
Precision@5: 0.720
Recall@5: 0.680
MRR: 0.810
text-embedding-3-large:
Precision@5: 0.780
Recall@5: 0.750
MRR: 0.850
sentence-transformers/all-MiniLM-L6-v2:
Precision@5: 0.650
Recall@5: 0.620
MRR: 0.730
Decision: text-embedding-3-large performs best. Whether it justifies the extra cost over -small depends on your accuracy requirements and query volume.
Hybrid Retrieval: BM25 + Vector Search
Combine keyword-based (BM25) and semantic (vector) retrieval:
def hybrid_retrieval(query: str, corpus: list, alpha=0.5) -> list:
"""
alpha: weight for vector search (1-alpha for BM25)
"""
# BM25 scores
bm25_scores = bm25_search(query, corpus)
# Vector search scores
vector_scores = vector_search(query, corpus)
# Combine with weighted average
combined_scores = {}
for doc_id in set(bm25_scores.keys()).union(vector_scores.keys()):
bm25_score = bm25_scores.get(doc_id, 0)
vector_score = vector_scores.get(doc_id, 0)
combined_scores[doc_id] = alpha * vector_score + (1 - alpha) * bm25_score
# Rank by combined score
ranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
return [doc_id for doc_id, score in ranked]
# Evaluate different alpha values
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
metrics = evaluate_retrieval(
queries=eval_queries,
retrieval_fn=lambda q: hybrid_retrieval(q, corpus, alpha=alpha),
ground_truth=relevant_docs_map
)
print(f"alpha={alpha}: Recall@5={metrics['recall@5']:.3f}")
Find the optimal alpha for your domain (often around 0.6-0.7).
Query Rewriting
Sometimes user queries are ambiguous or poorly phrased. Rewrite before retrieving:
async def rewrite_query(user_query: str) -> str:
"""
Expand abbreviations, clarify intent, add domain keywords.
"""
rewrite_prompt = f"""
User query: {user_query}
Rewrite this query to be more effective for document retrieval:
- Expand abbreviations
- Add relevant domain keywords
- Clarify ambiguous terms
- Maintain the user's intent
Rewritten query:
"""
return await llm_call(rewrite_prompt, model="gpt-3.5-turbo", max_tokens=100)
# Evaluate: original query vs. rewritten
original_metrics = evaluate_retrieval(eval_queries, index, ground_truth)
rewritten_queries = [await rewrite_query(q) for q in eval_queries]
rewritten_metrics = evaluate_retrieval(rewritten_queries, index, ground_truth)
print(f"Original Recall@5: {original_metrics['recall@5']:.3f}")
print(f"Rewritten Recall@5: {rewritten_metrics['recall@5']:.3f}")
Evaluating Generation: Grounding and Faithfulness
Technique 1: Self-Consistency Checks
Ask the model multiple variations of the same question. Consistent answers suggest grounded responses.
async def check_self_consistency(question: str, context: str, n=3) -> float:
"""
Generate n answers, measure agreement.
"""
answers = []
for _ in range(n):
answer = await rag_generate(question, context, temperature=0.7)
answers.append(answer)
# Measure pairwise semantic similarity
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(answers)
# Average pairwise cosine similarity
similarities = []
for i in range(len(embeddings)):
for j in range(i + 1, len(embeddings)):
sim = np.dot(embeddings[i], embeddings[j])
sim /= (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
similarities.append(sim)
return np.mean(similarities)
# High consistency (>0.9) suggests grounded, low (<0.7) suggests hallucination
consistency = await check_self_consistency(query, retrieved_docs)
print(f"Self-consistency: {consistency:.2f}")
Technique 2: Adversarial Context Injection
Add irrelevant or contradictory documents to the context. A robust system should ignore them.
async def adversarial_context_test(query: str, relevant_docs: list, irrelevant_docs: list) -> dict:
"""
Test if model is distracted by irrelevant context.
"""
# Baseline: answer with only relevant docs
    baseline_answer = await rag_generate(query, relevant_docs)
# Test: answer with relevant + irrelevant docs
mixed_context = relevant_docs + irrelevant_docs
    test_answer = await rag_generate(query, mixed_context)
# Compare answers (should be similar if model focuses on relevant docs)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
baseline_emb = model.encode(baseline_answer)
test_emb = model.encode(test_answer)
similarity = np.dot(baseline_emb, test_emb) / (
np.linalg.norm(baseline_emb) * np.linalg.norm(test_emb)
)
return {
"robustness_score": similarity,
"baseline_answer": baseline_answer,
"test_answer": test_answer
}
High similarity (>0.85) suggests robust grounding. Low similarity suggests the model is distracted by noise.
Technique 3: Counterfactual Testing
Modify a document's content and see if the answer changes appropriately.
async def counterfactual_test(query: str, original_doc: str, modified_doc: str) -> dict:
"""
Change a key fact in a document, ensure answer reflects the change.
"""
# Original
    original_answer = await rag_generate(query, [original_doc])
# Modified
    modified_answer = await rag_generate(query, [modified_doc])
# Answers should differ if model is grounded
from difflib import SequenceMatcher
similarity = SequenceMatcher(None, original_answer, modified_answer).ratio()
return {
"answer_changed": similarity < 0.8, # Threshold for "different"
"original_answer": original_answer,
"modified_answer": modified_answer,
"similarity": similarity
}
# Example
original = "The company was founded in 2010 and has 500 employees."
modified = "The company was founded in 2020 and has 50 employees."
result = await counterfactual_test("When was the company founded?", original, modified)
if result['answer_changed']:
print("Model is grounded (answer changed with modified context)")
else:
print("Model may be hallucinating (answer didn't change)")
Diagnosing RAG Failures
When end-to-end metrics are poor, isolate the failure stage.
Failure Mode 1: Retrieval Miss
Symptom: Answer is generic or says "I don't have that information."
Diagnosis: Check retrieval metrics (Recall@K is low).
Solution:
- Improve embedding model
- Add query rewriting
- Use hybrid retrieval (BM25 + vectors)
- Revisit the document chunking strategy (chunk size, overlap)
Failure Mode 2: Context Overload
Symptom: Answer is vague or misses key details despite relevant docs being retrieved.
Diagnosis: Too many documents in context, signal lost in noise.
Solution:
- Reduce K (retrieve fewer docs)
- Re-rank documents by relevance before passing them to the LLM (see the sketch below)
- Use a better generation model (longer context window, better instruction following)
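The re-ranking step mentioned above is typically a cross-encoder that scores each (query, document) pair and keeps only the best few for the prompt. A sketch using sentence-transformers; the specific model is one common public re-ranker, not a requirement:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_trim(query: str, docs: list[str], keep: int = 3) -> list[str]:
    """Score each (query, doc) pair and keep only the highest-scoring docs for the prompt."""
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]
Retrieving a larger K and trimming after re-ranking preserves recall while keeping the prompt small.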
Failure Mode 3: Hallucination
Symptom: Answer includes facts not in retrieved documents.
Diagnosis: Grounding score is low.
Solution:
- Improve prompt: explicitly instruct "only use provided documents"
- Add negative examples (few-shot) showing what not to do
- Use a model with better instruction-following (GPT-4 > GPT-3.5 for grounding)
- Post-process: verify claims against context before returning
Failure Mode 4: Poor Document Quality
Symptom: Retrieval metrics are good, but answers are still wrong.
Diagnosis: Retrieved docs contain incorrect/outdated information.
Solution:
- Improve document curation (remove outdated content)
- Add metadata filtering (only retrieve docs updated after date X)
- Implement source credibility scoring
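These four modes can be turned into a rough triage helper that points debugging at the right stage. A sketch, assuming per-example scores from the metrics earlier in this post; the thresholds are illustrative starting points, not recommendations:
def triage_rag_failure(metrics: dict) -> str:
    """Map one example's stage metrics to the most likely failure mode above."""
    if metrics["recall@5"] < 0.5:
        return "retrieval-miss"        # Failure Mode 1: relevant docs never reached the context
    if metrics["grounding_score"] < 0.7:
        return "hallucination"         # Failure Mode 3: answer not supported by retrieved docs
    if metrics["answer_completeness"] < 0.6 and metrics.get("num_context_docs", 0) > 8:
        return "context-overload"      # Failure Mode 2: signal lost among too many docs
    if metrics["answer_correctness"] < 0.6:
        return "document-quality"      # Failure Mode 4: retrieval looks fine, content itself is wrong
    return "ok"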
Advanced RAG Patterns
Multi-Hop Reasoning
Some questions require synthesizing information from multiple documents.
Example: "Who was the CEO of Apple when the iPhone was released?"
- Retrieve: "iPhone was released in 2007"
- Retrieve: "Steve Jobs was CEO of Apple from 1997-2011"
- Synthesize: "Steve Jobs"
Evaluation:
async def evaluate_multi_hop(query: str, answer: str, required_docs: list[int]) -> dict:
"""
Check if answer requires (and uses) multiple documents.
"""
# Retrieve docs for query
retrieved_doc_ids = retrieval_system.retrieve(query)
# Check if all required docs were retrieved
retrieval_complete = all(doc_id in retrieved_doc_ids for doc_id in required_docs)
# Check if answer synthesizes information (not just copying one doc)
# Use LLM judge
judge_prompt = f"""
Query: {query}
Required documents: {required_docs}
Retrieved documents: {retrieved_doc_ids}
Answer: {answer}
Does this answer correctly synthesize information from multiple documents?
Rate (0-10):
"""
    synthesis_score = await llm_judge(judge_prompt, model="gpt-4")
return {
"retrieval_complete": retrieval_complete,
"synthesis_score": float(synthesis_score) / 10,
"multi_hop_success": retrieval_complete and (float(synthesis_score) >= 7)
}
Conversational RAG
Multi-turn conversations require maintaining context and updating retrieval based on history.
Evaluation considerations:
- Track context across turns
- Measure if retrieval adapts to conversation (e.g., resolves pronouns)
- Evaluate coherence across turns
def evaluate_conversational_rag(conversation_history: list[dict], eval_turn: int) -> dict:
"""
conversation_history: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, ...]
eval_turn: Which turn to evaluate
"""
# Get query at eval turn
current_query = conversation_history[eval_turn]['content']
# Retrieve with conversation context
retrieved_docs = rag_retrieval_with_history(current_query, conversation_history[:eval_turn])
# Generate answer
answer = rag_generate_with_history(current_query, retrieved_docs, conversation_history[:eval_turn])
# Evaluate:
# 1. Does retrieval use conversation context (e.g., resolve "it" referring to previous topic)?
# 2. Is answer coherent with conversation history?
# 3. Does it maintain persona/tone established earlier?
return {
"retrieval_context_aware": check_context_awareness(current_query, retrieved_docs, conversation_history),
"answer_coherence": check_conversation_coherence(answer, conversation_history),
"tone_consistency": check_tone_consistency(answer, conversation_history)
}
Production Monitoring for RAG
Track these metrics in production:
Retrieval metrics (sample 1% of queries):
- Average Precision@5
- Average MRR
- Queries with zero results
Grounding metrics (all queries):
- Grounding score distribution
- Rate of "I don't have that information" responses
- Citation accuracy (if using citations)
Quality metrics (from user feedback):
- Explicit feedback rate (thumbs up/down)
- Follow-up question rate (suggests incomplete first answer)
- Escalation rate (user asks for human help)
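A minimal sketch of the production hook behind these numbers; log_metric and enqueue_for_offline_eval stand in for whatever metrics backend and labeling queue you already run:
import random

SAMPLE_RATE = 0.01  # evaluate retrieval quality on ~1% of production queries

async def monitor_rag_request(query: str, retrieved_doc_ids: list[int],
                              retrieved_docs: list[str], answer: str) -> None:
    """Emit monitoring signals for one production request."""
    # Grounding is checked on every request (reuses check_grounding from earlier)
    grounding = await check_grounding(answer, retrieved_docs)
    log_metric("rag.grounding_score", grounding["grounding_score"])  # placeholder metrics call
    log_metric("rag.zero_results", int(len(retrieved_doc_ids) == 0))

    # Retrieval metrics need ground-truth labels, so sample queries for offline evaluation
    if random.random() < SAMPLE_RATE:
        enqueue_for_offline_eval({  # placeholder: push to a labeling / eval queue
            "query": query,
            "retrieved_doc_ids": retrieved_doc_ids,
            "answer": answer,
        })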
Alerting:
alerts:
- name: retrieval-degradation
condition: precision@5 < 0.65 for 1 hour
action: alert #rag-team
- name: grounding-issues
condition: grounding_score < 0.8 for 15 minutes
action: alert #rag-team
- name: zero-results-spike
condition: zero_results_rate > 5% for 30 minutes
action: alert #rag-team
Conclusion
Evaluating RAG systems requires moving beyond generic accuracy metrics to measure:
- Retrieval quality: Are we fetching the right documents?
- Grounding: Does the answer use those documents faithfully?
- Answer quality: Is the final response correct, complete, and relevant?
- End-to-end performance: Does it solve the user's problem?
Each dimension needs its own metrics and evaluation methods. Failures at any stage compromise the entire system.
Start with:
- Build a 100-example eval set with annotated relevant documents
- Measure retrieval Precision@5 and Recall@5
- Implement grounding score (claim verification or NLI)
- Track end-to-end answer correctness
- Monitor production metrics and correlate with user satisfaction
Within weeks, you'll have visibility into your RAG system's true performance—not just whether it runs, but whether it's actually helpful.
Next Steps:
- Set up RAG evaluation with EvalOps
- Import RAG evaluation recipes from Spellbook
- Join discussions on RAG patterns
Questions about evaluating your RAG system? Email hello@evalops.dev.