1. Metrics to Measure Accuracy of RAG
When building RAG (Retrieval Augmented Generation) systems, you need to measure how well your retrieval component is working. These three metrics are the most commonly used:
1.1. MRR (Mean Reciprocal Rank)
1.1.1. What it measures
MRR measures how quickly you find the first relevant result. It rewards systems that put relevant documents at the top.
1.1.2. The Formula
For a single query:

RR = 1 / rank_i

For multiple queries:

MRR = (1 / |Q|) × Σ (1 / rank_i), summed over i = 1 … |Q|

Where |Q| is the number of queries, and rank_i is the position of the first relevant result for query_i (a query with no relevant result contributes 0).
1.1.3. Example
Imagine you search for "Who is Avery?" and get these results:
| Position | Document | Relevant? |
|---|---|---|
| 1 | Company Overview | ❌ |
| 2 | Product Info | ❌ |
| 3 | Avery's Profile | ✅ |
| 4 | Team Page | ❌ |
Reciprocal Rank = 1/3 = 0.333
If the relevant document was at position 1: Reciprocal Rank = 1/1 = 1.0 (perfect!)
1.1.4. Multiple Queries Example
| Query | First Relevant Position | RR |
|---|---|---|
| "Who is Avery?" | 3 | 1/3 = 0.333 |
| "What is HomeProtect?" | 1 | 1/1 = 1.000 |
| "Company address?" | 2 | 1/2 = 0.500 |
MRR = (0.333 + 1.000 + 0.500) / 3 = 0.611
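As a quick sanity check, the same average can be reproduced in a couple of lines of Python:

```python
# Reciprocal ranks from the table above
reciprocal_ranks = [1/3, 1/1, 1/2]

# MRR is simply their mean
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(round(mrr, 3))  # 0.611
```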
1.1.5. How it's implemented
```python
def calculate_mrr(keyword: str, retrieved_docs: list) -> float:
    """Calculate reciprocal rank for a single keyword."""
    keyword_lower = keyword.lower()
    for rank, doc in enumerate(retrieved_docs, start=1):
        if keyword_lower in doc.page_content.lower():
            return 1.0 / rank
    return 0.0  # Keyword not found
```
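A minimal usage sketch: the Document dataclass below is a stand-in for whatever your retriever returns (anything with a page_content attribute, e.g. a LangChain document), and the keywords are illustrative. It averages the per-keyword reciprocal ranks into an MRR:

```python
from dataclasses import dataclass

@dataclass
class Document:  # stand-in for the retriever's document type
    page_content: str

retrieved = [
    Document("Company overview and history"),
    Document("Product info for AutoInsure"),
    Document("Avery's profile: co-founder and CTO"),
]

keywords = ["Avery", "AutoInsure", "TravelGuard"]
reciprocal_ranks = [calculate_mrr(kw, retrieved) for kw in keywords]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print([round(rr, 3) for rr in reciprocal_ranks], round(mrr, 3))  # [0.333, 0.5, 0.0] 0.278
```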
1.1.6. Interpretation
| MRR Score | Meaning |
|---|---|
| 1.0 | Perfect - relevant doc always first |
| 0.9+ | Excellent - usually in top 1-2 |
| 0.75+ | Good - usually in top 2-3 |
| 0.5+ | Acceptable - often in top 2-4 |
| < 0.5 | Poor - relevant docs buried |
1.1.7. Limitations
- Only considers the first relevant result
- Ignores all other relevant documents
- A query with 10 relevant docs but first at position 2 scores the same as one with 1 relevant doc at position 2
1.2. nDCG (Normalized Discounted Cumulative Gain)
1.2.1. What it measures
nDCG measures the quality of the entire ranking, not just the first result. It considers:
- How many relevant documents you retrieved
- Where they appear in the ranking (higher is better)
1.2.2. The Formula
Step 1: Calculate DCG (Discounted Cumulative Gain)
DCG@k = Σ rel_i / log₂(i + 1), summed over positions i = 1 … k

Where i is the retrieval position of a document and rel_i is the relevance score at position i; a detailed example is given in 1.2.3. Example (Binary Relevance: 0 or 1).

The choice of log₂(i + 1) keeps the denominator at least 1, so every summand is non-negative; and because the logarithm grows more slowly than i itself, relevant results at later positions are penalized less severely than a linear discount would penalize them.
Step 2: Calculate IDCG (Ideal DCG)

IDCG@k is the best possible DCG@k, obtained by ranking all relevant documents at the top.
Step 3: Calculate nDCG
nDCG@k = DCG@k / IDCG@k
1.2.3. Example (Binary Relevance: 0 or 1)
Query: "insurance products"
Your retrieved results:
| Position | Document | Relevant? | Relevance Score |
|---|---|---|---|
| 1 | Company News | ❌ | 0 |
| 2 | HomeProtect | ✅ | 1 |
| 3 | About Us | ❌ | 0 |
| 4 | AutoInsure | ✅ | 1 |
| 5 | CarePlus | ✅ | 1 |
Calculate DCG:
DCG = 0/log₂(2) + 1/log₂(3) + 0/log₂(4) + 1/log₂(5) + 1/log₂(6) = 0 + 0.631 + 0 + 0.431 + 0.387 = 1.449
Calculate Ideal DCG (if all 3 relevant docs were at top):
IDCG = 1/log₂(2) + 1/log₂(3) + 1/log₂(4) = 1.0 + 0.631 + 0.5 = 2.131
nDCG = 1.449 / 2.131 = 0.680
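The same numbers can be verified in a few lines of Python (positions are 1-based, hence log2(position + 1)):

```python
import math

relevances = [0, 1, 0, 1, 1]  # from the table above
dcg = sum(rel / math.log2(pos + 1)
          for pos, rel in enumerate(relevances, start=1))
idcg = sum(rel / math.log2(pos + 1)
           for pos, rel in enumerate(sorted(relevances, reverse=True), start=1))
print(round(dcg, 3), round(idcg, 3), round(dcg / idcg, 3))
# 1.448 2.131 0.68  (the 1.449 above comes from summing the already-rounded terms)
```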
1.2.4. Why the "Discount"?
The logarithmic discount penalizes relevant results that appear lower:
| Position | Discount Factor (1/log₂(i+1)) |
|---|---|
| 1 | 1.000 |
| 2 | 0.631 |
| 3 | 0.500 |
| 4 | 0.431 |
| 5 | 0.387 |
| 10 | 0.289 |
A relevant document at position 1 contributes 1.0 to DCG.
A relevant document at position 10 contributes only 0.289.
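The discount factors in the table come straight from the formula:

```python
import math

for position in (1, 2, 3, 4, 5, 10):
    print(position, round(1 / math.log2(position + 1), 3))
# 1→1.0, 2→0.631, 3→0.5, 4→0.431, 5→0.387, 10→0.289
```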
1.2.5. How it's implemented
```python
import math

def calculate_dcg(relevances: list[int], k: int) -> float:
    """Calculate Discounted Cumulative Gain."""
    dcg = 0.0
    for i in range(min(k, len(relevances))):
        dcg += relevances[i] / math.log2(i + 2)  # i + 2 because ranks start at 1
    return dcg


def calculate_ndcg(keyword: str, retrieved_docs: list, k: int = 10) -> float:
    """Calculate nDCG for a single keyword (binary relevance)."""
    keyword_lower = keyword.lower()
    # Binary relevance: 1 if the keyword appears in the document, 0 otherwise
    relevances = [
        1 if keyword_lower in doc.page_content.lower() else 0
        for doc in retrieved_docs[:k]
    ]
    # DCG of the actual ranking
    dcg = calculate_dcg(relevances, k)
    # Ideal DCG (best case: all relevant docs at the top)
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = calculate_dcg(ideal_relevances, k)
    return dcg / idcg if idcg > 0 else 0.0
```
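Reusing the Document stand-in from the MRR sketch above, a quick check against the "insurance products" example (binary relevance derived from a keyword; the document texts are made up):

```python
docs = [
    Document("Company news: quarterly update"),
    Document("HomeProtect home insurance product details"),
    Document("About us"),
    Document("AutoInsure car insurance product details"),
    Document("CarePlus health insurance plan"),
]
print(round(calculate_ndcg("insurance", docs, k=5), 3))  # 0.68, matching the worked example
```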
1.2.6. Interpretation
| nDCG Score | Meaning |
|---|---|
| 1.0 | Perfect ranking |
| 0.9+ | Excellent |
| 0.75+ | Good |
| 0.5+ | Acceptable |
| < 0.5 | Poor - relevant docs ranked too low |
1.2.7. Graded Relevance (Advanced)
nDCG also supports graded relevance (not just binary):
| Relevance | Score | Meaning |
|---|---|---|
| Perfect | 3 | Exactly what the user wanted |
| Highly Relevant | 2 | Very useful |
| Somewhat Relevant | 1 | Partially useful |
| Not Relevant | 0 | Useless |
This gives more nuanced evaluation but requires human annotation.
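A minimal sketch of graded nDCG, assuming human-assigned scores per retrieved position (the scores below are invented for illustration; calculate_dcg from above is reused unchanged, and a common variant uses 2^rel − 1 as the gain instead of the raw score):

```python
# Human-assigned relevance for each retrieved position, on the 0-3 scale above
graded_relevances = [2, 0, 3, 1, 0]

dcg = calculate_dcg(graded_relevances, k=5)
idcg = calculate_dcg(sorted(graded_relevances, reverse=True), k=5)
print(round(dcg / idcg, 3))  # 0.825
```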
1.3. Recall@K
1.3.1. What it measures
Recall@K measures what percentage of all relevant documents you retrieved in the top K results.
1.3.2. The Formula
Recall@K = (Number of relevant docs in top K) / (Total number of relevant docs)
1.3.3. Example
You have a database with 5 documents about HomeProtect.
Query: "Tell me about HomeProtect"
Top 10 retrieved results contain 3 HomeProtect documents.
Recall@10 = 3/5 = 0.60 = 60%
1.3.4. Different K values
| Metric | Meaning |
|---|---|
| Recall@1 | What % of relevant docs appear at position 1? |
| Recall@5 | What % of relevant docs are in the top 5? |
| Recall@10 | What % of relevant docs are in the top 10? |
| Recall@100 | What % of relevant docs are in the top 100? |
1.3.5. Example across K values
Total relevant documents: 4
| K | Relevant in top K | Recall@K |
|---|---|---|
| 1 | 1 | 25% |
| 3 | 2 | 50% |
| 5 | 3 | 75% |
| 10 | 4 | 100% |
1.3.6. Implementation
```python
def calculate_recall_at_k(retrieved_docs: list, relevant_docs: set, k: int) -> float:
    """Calculate Recall@K."""
    retrieved_ids = set(doc.metadata.get('id') for doc in retrieved_docs[:k])
    found = len(retrieved_ids.intersection(relevant_docs))
    return found / len(relevant_docs) if relevant_docs else 0.0
```
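A usage sketch, assuming each document carries an 'id' in its metadata (the ids and texts below are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Document:  # stand-in with metadata, as the function expects
    page_content: str
    metadata: dict = field(default_factory=dict)

retrieved = [
    Document("HomeProtect overview", {"id": "doc-homeprotect-1"}),
    Document("Team page", {"id": "doc-team"}),
    Document("HomeProtect pricing", {"id": "doc-homeprotect-2"}),
]
relevant = {"doc-homeprotect-1", "doc-homeprotect-2", "doc-homeprotect-3"}

print(round(calculate_recall_at_k(retrieved, relevant, k=3), 2))  # 2 of 3 relevant found -> 0.67
```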
1.3.7. Interpretation
| Recall@K | Meaning |
|---|---|
| 100% | Found all relevant documents |
| 75%+ | Good coverage |
| 50%+ | Acceptable |
| < 50% | Missing important information |
1.3.8. Trade-off with Precision
Higher K → Higher Recall (find more relevant docs)
Higher K → Lower Precision (more irrelevant docs too)
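To make the trade-off concrete, here is a small sketch with a hypothetical ranking (4 relevant documents among 10 retrieved ids) that prints recall and precision side by side as K grows:

```python
# Hypothetical ranking: retrieved doc ids in order, and the set that is actually relevant
retrieved_ids = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
relevant_ids = {"d3", "d1", "d4", "d6"}

for k in (1, 3, 5, 10):
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    recall = hits / len(relevant_ids)   # share of all relevant docs found
    precision = hits / k                # share of retrieved docs that are relevant
    print(f"K={k:2d}  recall={recall:.2f}  precision={precision:.2f}")
# Recall climbs from 0.25 to 1.00 while precision falls from 1.00 to 0.40
```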
1.4. Comparing the Metrics
| Metric | Focus | Question it answers |
|---|---|---|
| MRR | First relevant result | "How quickly do I find something useful?" |
| nDCG | Ranking quality | "Are relevant docs ranked higher than irrelevant ones?" |
| Recall@K | Coverage | "Did I find all the relevant information?" |
1.5. When to use which?
| Use Case | Best Metric |
|---|---|
| Search engine (user wants 1 good result) | MRR |
| RAG system (need complete context) | Recall@K |
| Recommendation system (order matters) | nDCG |
| General retrieval evaluation | All three! |
2. Practical Example: RAG Evaluation
2.1. Test Case
```python
test = TestQuestion(
    question="What products does InsureLLM offer?",
    keywords=["HomeProtect", "AutoInsure", "CarePlus", "TravelGuard"],
    answer="InsureLLM offers four products: HomeProtect, AutoInsure, CarePlus, and TravelGuard."
)
```
2.2. Retrieved Documents (top 5)
- Company Overview (mentions "HomeProtect") ✅
- Team Page (no keywords) ❌
- AutoInsure Product Page ✅
- News Article (no keywords) ❌
- CarePlus Details ✅
2.3. Calculations
Reciprocal rank (RR) per keyword:
- HomeProtect: found at position 1 → RR = 1/1 = 1.0
- AutoInsure: found at position 3 → RR = 1/3 = 0.333
- CarePlus: found at position 5 → RR = 1/5 = 0.2
- TravelGuard: not found → RR = 0
MRR = (1.0 + 0.333 + 0.2 + 0) / 4 = 0.383
Recall@5:
- Found 3 out of 4 keywords
- Recall@5 = 3/4 = 75%
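Putting the pieces together, a sketch of the whole check using the Document stand-in and calculate_mrr from earlier (the document texts are illustrative, not the real corpus):

```python
docs = [
    Document("Company overview: InsureLLM sells HomeProtect and more"),  # position 1
    Document("Meet the team"),                                           # position 2
    Document("AutoInsure product page"),                                 # position 3
    Document("News article"),                                            # position 4
    Document("CarePlus details"),                                        # position 5
]

keywords = ["HomeProtect", "AutoInsure", "CarePlus", "TravelGuard"]
reciprocal_ranks = [calculate_mrr(kw, docs) for kw in keywords]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

# Keyword-level Recall@5: share of keywords found anywhere in the top 5
found = sum(1 for kw in keywords
            if any(kw.lower() in d.page_content.lower() for d in docs[:5]))
recall_at_5 = found / len(keywords)

print(round(mrr, 3), recall_at_5)  # 0.383 0.75
```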
3. Summary
| Metric | Key question | Quick reference |
|---|---|---|
| MRR | How fast do you find the FIRST relevant result? | Position 1 → 1.0, position 2 → 0.5, position 3 → 0.33, not found → 0.0 |
| nDCG | How good is the ENTIRE ranking? | Relevant docs at top → high score, buried → low score; normalized to a 0-1 scale (1 = perfect) |
| Recall@K | How many relevant docs did you FIND? | Found 3 of 4 relevant → 75% recall; higher K → more coverage but slower |