1. Metrics to Measure Accuracy of RAG
When building RAG (Retrieval Augmented Generation) systems, you need to measure how well your retrieval component is working. These three metrics are the most commonly used:
1.1. MRR (Mean Reciprocal Rank)
1.1.1. What it measures
MRR measures how quickly you find the first relevant result. It rewards systems that put relevant documents at the top.
1.1.2. The Formula
For a single query:

$$\text{RR} = \frac{1}{\text{rank}_i}$$

For $N$ queries:

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$

Where $N$ is the number of queries, and $\text{rank}_i$ is the position of the first relevant result for query $i$.
1.1.3. Example
Imagine you search for "Who is Avery?" and get these results:
| Position | Document | Relevant? |
|---|---|---|
| 1 | Company Overview | ❌ |
| 2 | Product Info | ❌ |
| 3 | Avery's Profile | ✅ |
| 4 | Team Page | ❌ |
Reciprocal Rank = 1/3 = 0.333
If the relevant document was at position 1: Reciprocal Rank = 1/1 = 1.0 (perfect!)
1.1.4. Multiple Queries Example
| Query | First Relevant Position | RR |
|---|---|---|
| "Who is Avery?" | 3 | 1/3 = 0.333 |
| "What is HomeProtect?" | 1 | 1/1 = 1.000 |
| "Company address?" | 2 | 1/2 = 0.500 |
MRR = (0.333 + 1.000 + 0.500) / 3 = 0.611
1.1.5. How it's implemented
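A minimal sketch in Python (the function names and the 0/1 relevance-list encoding are my own, not taken from a specific library):

```python
from typing import Sequence

def reciprocal_rank(relevances: Sequence[int]) -> float:
    """Reciprocal rank for one query.

    `relevances` is the ranked result list encoded as 1 (relevant) / 0 (not relevant),
    e.g. the "Who is Avery?" example above is [0, 0, 1, 0].
    """
    for position, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / position
    return 0.0  # no relevant document retrieved at all

def mean_reciprocal_rank(per_query_relevances: Sequence[Sequence[int]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    return sum(reciprocal_rank(r) for r in per_query_relevances) / len(per_query_relevances)

# The three-query example above:
print(mean_reciprocal_rank([[0, 0, 1, 0], [1, 0, 0], [0, 1, 0]]))  # ≈ 0.611
```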
1.1.6. Interpretation
| MRR Score | Meaning |
|---|---|
| 1.0 | Perfect - relevant doc always first |
| 0.9+ | Excellent - usually in top 1-2 |
| 0.75+ | Good - usually in top 2-3 |
| 0.5+ | Acceptable - often in top 2-4 |
| < 0.5 | Poor - relevant docs buried |
1.1.7. Limitations
- Only considers the first relevant result
- Ignores all other relevant documents
- A query with 10 relevant docs but first at position 2 scores the same as one with 1 relevant doc at position 2
1.2. nDCG (Normalized Discounted Cumulative Gain)
1.2.1. What it measures
nDCG measures the quality of the entire ranking, not just the first result. It considers:
- How many relevant documents you retrieved
- Where they appear in the ranking (higher is better)
1.2.2. The Formula
Step 1: Calculate DCG (Discounted Cumulative Gain)
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}$$

Where $i$ is the retrieval position of the document and $rel_i$ is the relevance score at position $i$; a detailed example is given in 〈1.2.3. Example (Binary Relevance: 0 or 1)〉.

The $+1$ inside the logarithm keeps the denominator at 1 or above, so the top result is not discounted and every summand stays non-negative. And because $\log_2(i+1)$ grows more slowly than $i$, relevant results further down the list receive a gentler penalty than a $1/i$ discount would give them.
Step 2: Calculate IDCG (Ideal DCG), the best possible DCG you would get by ranking all relevant docs at the top:

$$\text{IDCG@K} = \sum_{i=1}^{K} \frac{rel_i^{\,ideal}}{\log_2(i + 1)}$$

where $rel_i^{\,ideal}$ comes from sorting the relevance scores in descending order.

Step 3: Calculate nDCG:

$$\text{nDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
1.2.3. Example (Binary Relevance: 0 or 1)
Query: "insurance products"
Your retrieved results:
| Position | Document | Relevant? | Relevance Score |
|---|---|---|---|
| 1 | Company News | ❌ | 0 |
| 2 | HomeProtect | ✅ | 1 |
| 3 | About Us | ❌ | 0 |
| 4 | AutoInsure | ✅ | 1 |
| 5 | CarePlus | ✅ | 1 |
Calculate DCG:

$$\text{DCG@5} = \frac{0}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5} + \frac{1}{\log_2 6} = 0 + 0.631 + 0 + 0.431 + 0.387 = 1.449$$

Calculate Ideal DCG (if all 3 relevant docs were at the top):

$$\text{IDCG@5} = \frac{1}{\log_2 2} + \frac{1}{\log_2 3} + \frac{1}{\log_2 4} = 1.000 + 0.631 + 0.500 = 2.131$$
nDCG = 1.449 / 2.131 = 0.680
1.2.4. Why the "Discount"?
The logarithmic discount penalizes relevant results that appear lower:
| Position | Discount Factor (1/log₂(i+1)) |
|---|---|
| 1 | 1.000 |
| 2 | 0.631 |
| 3 | 0.500 |
| 4 | 0.431 |
| 5 | 0.387 |
| 10 | 0.289 |
A relevant document at position 1 contributes 1.0 to DCG.
A relevant document at position 10 contributes only 0.289.
1.2.5. How it's implemented
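A minimal sketch in Python, following the DCG formula above; note that it computes the ideal DCG from the relevance scores of the retrieved list itself, which matches the worked example:

```python
import math
from typing import Sequence

def dcg(relevances: Sequence[float], k: int) -> float:
    """DCG@k: relevance discounted by log2(position + 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances: Sequence[float], k: int) -> float:
    """nDCG@k: actual DCG divided by the ideal DCG.

    The ideal ranking is obtained by sorting the retrieved relevance
    scores in descending order, as in the worked example above.
    """
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# The "insurance products" example above:
print(round(ndcg([0, 1, 0, 1, 1], k=5), 2))  # 0.68
```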
1.2.6. Interpretation
| nDCG Score | Meaning |
|---|---|
| 1.0 | Perfect ranking |
| 0.9+ | Excellent |
| 0.75+ | Good |
| 0.5+ | Acceptable |
| < 0.5 | Poor - relevant docs ranked too low |
1.2.7. Graded Relevance (Advanced)
nDCG also supports graded relevance (not just binary):
| Relevance | Score | Meaning |
|---|---|---|
| Perfect | 3 | Exactly what the user wanted |
| Highly Relevant | 2 | Very useful |
| Somewhat Relevant | 1 | Partially useful |
| Not Relevant | 0 | Useless |
This gives more nuanced evaluation but requires human annotation.
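As a sketch of how graded scores plug into the same formula (the ranking below is made up, and it keeps the document's linear gain; some implementations use a gain of 2^rel − 1 instead, which weights highly relevant documents more heavily):

```python
import math
from typing import Sequence

def dcg(relevances: Sequence[float], k: int) -> float:
    # Same discounted sum as before, but rel_i can now be 0, 1, 2, or 3.
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

# Graded judgments: 3 = perfect, 2 = highly relevant, 1 = somewhat, 0 = not relevant.
ranking = [2, 0, 3, 1, 0]                # hypothetical retrieved order
ideal = sorted(ranking, reverse=True)    # [3, 2, 1, 0, 0]
print(round(dcg(ranking, 5) / dcg(ideal, 5), 3))  # ≈ 0.825
```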
1.3. Recall@K
1.3.1. What it measures
Recall@K measures what percentage of all relevant documents you retrieved in the top K results.
1.3.2. The Formula

$$\text{Recall@K} = \frac{\text{number of relevant documents retrieved in the top } K}{\text{total number of relevant documents}}$$
1.3.3. Example
You have a database with 5 documents about HomeProtect.
Query: "Tell me about HomeProtect"
Top 10 retrieved results contain 3 HomeProtect documents.

Recall@10 = 3 / 5 = 60%
1.3.4. Different K values
| Metric | Meaning |
|---|---|
| Recall@1 | What % of relevant docs are in the top 1 result? |
| Recall@5 | What % of relevant docs are in the top 5? |
| Recall@10 | What % of relevant docs are in the top 10? |
| Recall@100 | What % of relevant docs are in the top 100? |
1.3.5. Example across K values
Total relevant documents: 4
| K | Relevant in top K | Recall@K |
|---|---|---|
| 1 | 1 | 25% |
| 3 | 2 | 50% |
| 5 | 3 | 75% |
| 10 | 4 | 100% |
1.3.6. Implementation
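A minimal sketch in Python; the document IDs below are hypothetical, chosen to reproduce the HomeProtect example (5 relevant docs exist, 3 of them appear in the top 10):

```python
from typing import Sequence, Set

def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# The HomeProtect example above: 5 relevant docs exist, 3 show up in the top 10.
retrieved = ["hp-1", "news-7", "hp-2", "team-1", "auto-3",
             "hp-4", "about-1", "care-2", "faq-9", "blog-5"]
relevant = {"hp-1", "hp-2", "hp-3", "hp-4", "hp-5"}
print(recall_at_k(retrieved, relevant, k=10))  # 0.6
```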
1.3.7. Interpretation
| Recall@K | Meaning |
|---|---|
| 100% | Found all relevant documents |
| 75%+ | Good coverage |
| 50%+ | Acceptable |
| < 50% | Missing important information |
1.3.8. Trade-off with Precision
Higher K → Higher Recall (find more relevant docs)
Higher K → Lower Precision (more irrelevant docs too)
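To make the trade-off concrete, here is a small sketch (hypothetical document IDs, using the same 4-relevant-document ranking as the Recall@K table above) that prints recall and precision side by side as K grows:

```python
from typing import Sequence, Set

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    # Share of all relevant documents found in the top k.
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    # Share of the top k results that are relevant.
    return sum(d in relevant for d in retrieved[:k]) / k

# 4 relevant docs (r1..r4) scattered through a ranked list of 10.
retrieved = ["r1", "x1", "r2", "x2", "r3", "x3", "x4", "x5", "x6", "r4"]
relevant = {"r1", "r2", "r3", "r4"}
for k in (1, 3, 5, 10):
    print(f"K={k}: recall={recall_at_k(retrieved, relevant, k):.2f}, "
          f"precision={precision_at_k(retrieved, relevant, k):.2f}")
# Recall rises from 0.25 to 1.00 while precision falls from 1.00 to 0.40.
```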
1.4. Comparing the Metrics
| Metric | Focus | Question it answers |
|---|---|---|
| MRR | First relevant result | "How quickly do I find something useful?" |
| nDCG | Ranking quality | "Are relevant docs ranked higher than irrelevant ones?" |
| Recall@K | Coverage | "Did I find all the relevant information?" |
1.5. When to use which?
| Use Case | Best Metric |
|---|---|
| Search engine (user wants 1 good result) | MRR |
| RAG system (need complete context) | Recall@K |
| Recommendation system (order matters) | nDCG |
| General retrieval evaluation | All three! |
2. Practical Example: RAG Evaluation
2.1. Test Case
The retrieval is evaluated against four expected keywords: HomeProtect, AutoInsure, CarePlus, and TravelGuard. A retrieved document counts as relevant to a keyword if it mentions that keyword.
2.2. Retrieved Documents (top 5)
| Position | Document | Relevant? |
|---|---|---|
| 1 | Company Overview (mentions "HomeProtect") | ✅ |
| 2 | Team Page (no keywords) | ❌ |
| 3 | AutoInsure Product Page | ✅ |
| 4 | News Article (no keywords) | ❌ |
| 5 | CarePlus Details | ✅ |
2.3. Calculations
Reciprocal rank (RR) per keyword:
- HomeProtect: found at position 1 → RR = 1/1 = 1.0
- AutoInsure: found at position 3 → RR = 1/3 = 0.333
- CarePlus: found at position 5 → RR = 1/5 = 0.2
- TravelGuard: not found → RR = 0
MRR = (1.0 + 0.333 + 0.2 + 0) / 4 = 0.383
Recall@5:
- Found 3 out of 4 keywords
- Recall@5 = 3/4 = 75%
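A sketch that reproduces these numbers, assuming relevance is judged by case-insensitive keyword matching as in the hand calculation (the document texts below are abbreviated):

```python
from typing import Sequence, Tuple

def keyword_metrics(docs: Sequence[str], keywords: Sequence[str], k: int) -> Tuple[float, float]:
    """MRR over per-keyword reciprocal ranks, plus Recall@k, via keyword matching."""
    reciprocal_ranks = []
    found = 0
    for keyword in keywords:
        # First position (1-based) in the top-k docs that mentions the keyword, else None.
        rank = next((i for i, doc in enumerate(docs[:k], start=1)
                     if keyword.lower() in doc.lower()), None)
        if rank is None:
            reciprocal_ranks.append(0.0)
        else:
            reciprocal_ranks.append(1.0 / rank)
            found += 1
    return sum(reciprocal_ranks) / len(keywords), found / len(keywords)

# The five retrieved documents above, reduced to the text that matters for matching:
docs = ["Company Overview ... HomeProtect ...", "Team Page",
        "AutoInsure Product Page", "News Article", "CarePlus Details"]
mrr, recall = keyword_metrics(docs, ["HomeProtect", "AutoInsure", "CarePlus", "TravelGuard"], k=5)
print(round(mrr, 3), recall)  # 0.383 0.75
```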