1. Metrics to Measure Accuracy of RAG
When building RAG (Retrieval Augmented Generation) systems, you need to measure how well your retrieval component is working. These three metrics are the most commonly used:
1.1. MRR (Mean Reciprocal Rank)
1.1.1. What it measures
MRR measures how quickly you find the first relevant result. It rewards systems that put relevant documents at the top.
1.1.2. The Formula
For a single query:

RR = 1 / rank_i

For multiple queries:

MRR = (1 / |Q|) × Σ (1 / rank_i), summed over i = 1 … |Q|

Where |Q| is the number of queries, and rank_i is the position of the first relevant result for query_i (a query with no relevant result contributes 0).
1.1.3. Example
Imagine you search for "Who is Avery?" and get these results:
| Position | Document | Relevant? |
|---|---|---|
| 1 | Company Overview | ❌ |
| 2 | Product Info | ❌ |
| 3 | Avery's Profile | ✅ |
| 4 | Team Page | ❌ |
Reciprocal Rank = 1/3 = 0.333
If the relevant document was at position 1: Reciprocal Rank = 1/1 = 1.0 (perfect!)
1.1.4. Multiple Queries Example
| Query | First Relevant Position | RR |
|---|---|---|
| "Who is Avery?" | 3 | 1/3 = 0.333 |
| "What is HomeProtect?" | 1 | 1/1 = 1.000 |
| "Company address?" | 2 | 1/2 = 0.500 |
MRR = (0.333 + 1.000 + 0.500) / 3 = 0.611
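As a quick sanity check, the same average can be reproduced in a couple of lines of Python:

```python
# Reciprocal ranks from the table above
reciprocal_ranks = [1/3, 1/1, 1/2]

# MRR is simply their mean
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(round(mrr, 3))  # 0.611
```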
1.1.5. How it's implemented
```python
def calculate_mrr(keyword: str, retrieved_docs: list) -> float:
    """Calculate reciprocal rank for a single keyword."""
    keyword_lower = keyword.lower()
    for rank, doc in enumerate(retrieved_docs, start=1):
        if keyword_lower in doc.page_content.lower():
            return 1.0 / rank
    return 0.0  # Keyword not found
```
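A minimal usage sketch: the Document dataclass below is a stand-in for whatever your retriever returns (anything with a page_content attribute, e.g. a LangChain document), and the keywords are illustrative. It averages the per-keyword reciprocal ranks into an MRR:

```python
from dataclasses import dataclass

@dataclass
class Document:  # stand-in for the retriever's document type
    page_content: str

retrieved = [
    Document("Company overview and history"),
    Document("Product info for AutoInsure"),
    Document("Avery's profile: co-founder and CTO"),
]

keywords = ["Avery", "AutoInsure", "TravelGuard"]
reciprocal_ranks = [calculate_mrr(kw, retrieved) for kw in keywords]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print([round(rr, 3) for rr in reciprocal_ranks], round(mrr, 3))  # [0.333, 0.5, 0.0] 0.278
```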
1.1.6. Interpretation
| MRR Score | Meaning |
|---|---|
| 1.0 | Perfect - relevant doc always first |
| 0.9+ | Excellent - usually in top 1-2 |
| 0.75+ | Good - usually in top 2-3 |
| 0.5+ | Acceptable - often in top 2-4 |
| < 0.5 | Poor - relevant docs buried |
1.1.7. Limitations
- Only considers the first relevant result
- Ignores all other relevant documents
- A query with 10 relevant docs but first at position 2 scores the same as one with 1 relevant doc at position 2
1.2. nDCG (Normalized Discounted Cumulative Gain)
1.2.1. What it measures
nDCG measures the quality of the entire ranking, not just the first result. It considers:
- How many relevant documents you retrieved
- Where they appear in the ranking (higher is better)
1.2.2. The Formula
Step 1: Calculate DCG (Discounted Cumulative Gain)
DCG@k = Σ rel_i / log₂(i + 1), summed over positions i = 1 … k

Where i is the retrieval position of a document and rel_i is the relevance score at position i; a detailed example is given in 1.2.3. Example (Binary Relevance: 0 or 1).

The choice of log₂(i + 1) keeps the denominator at least 1, so every summand is non-negative; and because the logarithm grows more slowly than i itself, relevant results at later positions are penalized less severely than a linear discount would penalize them.
Step 2: Calculate IDCG (Ideal DCG)

IDCG@k is the best possible DCG@k, obtained by ranking all relevant documents at the top.
Step 3: Calculate nDCG
nDCG@k = DCG@k / IDCG@k
1.2.3. Example (Binary Relevance: 0 or 1)
Query: "insurance products"
Your retrieved results:
| Position | Document | Relevant? | Relevance Score |
|---|---|---|---|
| 1 | Company News | ❌ | 0 |
| 2 | HomeProtect | ✅ | 1 |
| 3 | About Us | ❌ | 0 |
| 4 | AutoInsure | ✅ | 1 |
| 5 | CarePlus | ✅ | 1 |
Calculate DCG:
DCG = 0/log₂(2) + 1/log₂(3) + 0/log₂(4) + 1/log₂(5) + 1/log₂(6) = 0 + 0.631 + 0 + 0.431 + 0.387 = 1.449
Calculate Ideal DCG (if all 3 relevant docs were at top):
IDCG = 1/log₂(2) + 1/log₂(3) + 1/log₂(4) = 1.0 + 0.631 + 0.5 = 2.131
nDCG = 1.449 / 2.131 = 0.680
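The same numbers can be verified in a few lines of Python (positions are 1-based, hence log2(position + 1)):

```python
import math

relevances = [0, 1, 0, 1, 1]  # from the table above
dcg = sum(rel / math.log2(pos + 1)
          for pos, rel in enumerate(relevances, start=1))
idcg = sum(rel / math.log2(pos + 1)
           for pos, rel in enumerate(sorted(relevances, reverse=True), start=1))
print(round(dcg, 3), round(idcg, 3), round(dcg / idcg, 3))
# 1.448 2.131 0.68  (the 1.449 above comes from summing the already-rounded terms)
```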
1.2.4. Why the "Discount"?
The logarithmic discount penalizes relevant results that appear lower:
| Position | Discount Factor (1/log₂(i+1)) |
|---|---|
| 1 | 1.000 |
| 2 | 0.631 |
| 3 | 0.500 |
| 4 | 0.431 |
| 5 | 0.387 |
| 10 | 0.289 |
A relevant document at position 1 contributes 1.0 to DCG.
A relevant document at position 10 contributes only 0.289.
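The discount factors in the table come straight from the formula:

```python
import math

for position in (1, 2, 3, 4, 5, 10):
    print(position, round(1 / math.log2(position + 1), 3))
# 1→1.0, 2→0.631, 3→0.5, 4→0.431, 5→0.387, 10→0.289
```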
1.2.5. How it's implemented
```python
import math

def calculate_dcg(relevances: list[int], k: int) -> float:
    """Calculate Discounted Cumulative Gain."""
    dcg = 0.0
    for i in range(min(k, len(relevances))):
        dcg += relevances[i] / math.log2(i + 2)  # i + 2 because ranks start at 1
    return dcg


def calculate_ndcg(keyword: str, retrieved_docs: list, k: int = 10) -> float:
    """Calculate nDCG for a single keyword (binary relevance)."""
    keyword_lower = keyword.lower()
    # Binary relevance: 1 if the keyword appears in the document, 0 otherwise
    relevances = [
        1 if keyword_lower in doc.page_content.lower() else 0
        for doc in retrieved_docs[:k]
    ]
    # DCG of the actual ranking
    dcg = calculate_dcg(relevances, k)
    # Ideal DCG (best case: all relevant docs at the top)
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = calculate_dcg(ideal_relevances, k)
    return dcg / idcg if idcg > 0 else 0.0
```
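Reusing the Document stand-in from the MRR sketch above, a quick check against the "insurance products" example (binary relevance derived from a keyword; the document texts are made up):

```python
docs = [
    Document("Company news: quarterly update"),
    Document("HomeProtect home insurance product details"),
    Document("About us"),
    Document("AutoInsure car insurance product details"),
    Document("CarePlus health insurance plan"),
]
print(round(calculate_ndcg("insurance", docs, k=5), 3))  # 0.68, matching the worked example
```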
1.2.6. Interpretation
| nDCG Score | Meaning |
|---|---|
| 1.0 | Perfect ranking |
| 0.9+ | Excellent |
| 0.75+ | Good |
| 0.5+ | Acceptable |
| < 0.5 | Poor - relevant docs ranked too low |
1.2.7. Graded Relevance (Advanced)
nDCG also supports graded relevance (not just binary):
| Relevance | Score | Meaning |
|---|---|---|
| Perfect | 3 | Exactly what the user wanted |
| Highly Relevant | 2 | Very useful |
| Somewhat Relevant | 1 | Partially useful |
| Not Relevant | 0 | Useless |
This gives more nuanced evaluation but requires human annotation.
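A minimal sketch of graded nDCG, assuming human-assigned scores per retrieved position (the scores below are invented for illustration; calculate_dcg from above is reused unchanged, and a common variant uses 2^rel − 1 as the gain instead of the raw score):

```python
# Human-assigned relevance for each retrieved position, on the 0-3 scale above
graded_relevances = [2, 0, 3, 1, 0]

dcg = calculate_dcg(graded_relevances, k=5)
idcg = calculate_dcg(sorted(graded_relevances, reverse=True), k=5)
print(round(dcg / idcg, 3))  # 0.825
```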
1.3. Recall@K
1.3.1. What it measures
Recall@K measures what percentage of all relevant documents you retrieved in the top K results.
1.3.2. The Formula
Recall@K = (Number of relevant docs in top K) / (Total number of relevant docs)
1.3.3. Example
You have a database with 5 documents about HomeProtect.
Query: "Tell me about HomeProtect"
Top 10 retrieved results contain 3 HomeProtect documents.
Recall@10 = 3/5 = 0.60 = 60%
1.3.4. Different K values
| Metric | Meaning |
|---|---|
| Recall@1 | What % of relevant docs appear at position 1? |
| Recall@5 | What % of relevant docs are in the top 5? |
| Recall@10 | What % of relevant docs are in the top 10? |
| Recall@100 | What % of relevant docs are in the top 100? |
1.3.5. Example across K values
Total relevant documents: 4
| K | Relevant in top K | Recall@K |
|---|---|---|
| 1 | 1 | 25% |
| 3 | 2 | 50% |
| 5 | 3 | 75% |
| 10 | 4 | 100% |
1.3.6. Implementation
```python
def calculate_recall_at_k(retrieved_docs: list, relevant_docs: set, k: int) -> float:
    """Calculate Recall@K."""
    retrieved_ids = set(doc.metadata.get('id') for doc in retrieved_docs[:k])
    found = len(retrieved_ids.intersection(relevant_docs))
    return found / len(relevant_docs) if relevant_docs else 0.0
```
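A usage sketch, assuming each document carries an 'id' in its metadata (the ids and texts below are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Document:  # stand-in with metadata, as the function expects
    page_content: str
    metadata: dict = field(default_factory=dict)

retrieved = [
    Document("HomeProtect overview", {"id": "doc-homeprotect-1"}),
    Document("Team page", {"id": "doc-team"}),
    Document("HomeProtect pricing", {"id": "doc-homeprotect-2"}),
]
relevant = {"doc-homeprotect-1", "doc-homeprotect-2", "doc-homeprotect-3"}

print(round(calculate_recall_at_k(retrieved, relevant, k=3), 2))  # 2 of 3 relevant found -> 0.67
```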
1.3.7. Interpretation
| Recall@K | Meaning |
|---|---|
| 100% | Found all relevant documents |
| 75%+ | Good coverage |
| 50%+ | Acceptable |
| < 50% | Missing important information |
1.3.8. Trade-off with Precision
Higher K → Higher Recall (find more relevant docs)
Higher K → Lower Precision (more irrelevant docs too)
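To make the trade-off concrete, here is a small sketch with a hypothetical ranking (4 relevant documents among 10 retrieved ids) that prints recall and precision side by side as K grows:

```python
# Hypothetical ranking: retrieved doc ids in order, and the set that is actually relevant
retrieved_ids = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
relevant_ids = {"d3", "d1", "d4", "d6"}

for k in (1, 3, 5, 10):
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    recall = hits / len(relevant_ids)   # share of all relevant docs found
    precision = hits / k                # share of retrieved docs that are relevant
    print(f"K={k:2d}  recall={recall:.2f}  precision={precision:.2f}")
# Recall climbs from 0.25 to 1.00 while precision falls from 1.00 to 0.40
```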
1.4. Comparing the Metrics
| Metric | Focus | Question it answers |
|---|---|---|
| MRR | First relevant result | "How quickly do I find something useful?" |
| nDCG | Ranking quality | "Are relevant docs ranked higher than irrelevant ones?" |
| Recall@K | Coverage | "Did I find all the relevant information?" |
1.5. When to use which?
| Use Case | Best Metric |
|---|---|
| Search engine (user wants 1 good result) | MRR |
| RAG system (need complete context) | Recall@K |
| Recommendation system (order matters) | nDCG |
| General retrieval evaluation | All three! |
2. Practical Example: RAG Evaluation
2.1. Test Case
```python
test = TestQuestion(
    question="What products does InsureLLM offer?",
    keywords=["HomeProtect", "AutoInsure", "CarePlus", "TravelGuard"],
    answer="InsureLLM offers four products: HomeProtect, AutoInsure, CarePlus, and TravelGuard."
)
```
2.2. Retrieved Documents (top 5)
- Company Overview (mentions "HomeProtect") ✅
- Team Page (no keywords) ❌
- AutoInsure Product Page ✅
- News Article (no keywords) ❌
- CarePlus Details ✅
2.3. Calculations
Reciprocal rank (RR) per keyword:
- HomeProtect: found at position 1 → RR = 1/1 = 1.0
- AutoInsure: found at position 3 → RR = 1/3 = 0.333
- CarePlus: found at position 5 → RR = 1/5 = 0.2
- TravelGuard: not found → RR = 0
MRR = (1.0 + 0.333 + 0.2 + 0) / 4 = 0.383
Recall@5:
- Found 3 out of 4 keywords
- Recall@5 = 3/4 = 75%
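Putting the pieces together, a sketch of the whole check using the Document stand-in and calculate_mrr from earlier (the document texts are illustrative, not the real corpus):

```python
docs = [
    Document("Company overview: InsureLLM sells HomeProtect and more"),  # position 1
    Document("Meet the team"),                                           # position 2
    Document("AutoInsure product page"),                                 # position 3
    Document("News article"),                                            # position 4
    Document("CarePlus details"),                                        # position 5
]

keywords = ["HomeProtect", "AutoInsure", "CarePlus", "TravelGuard"]
reciprocal_ranks = [calculate_mrr(kw, docs) for kw in keywords]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

# Keyword-level Recall@5: share of keywords found anywhere in the top 5
found = sum(1 for kw in keywords
            if any(kw.lower() in d.page_content.lower() for d in docs[:5]))
recall_at_5 = found / len(keywords)

print(round(mrr, 3), recall_at_5)  # 0.383 0.75
```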
3. Summary
| Metric | Key question | Quick reference |
|---|---|---|
| MRR | How fast do you find the FIRST relevant result? | Position 1 → 1.0, position 2 → 0.5, position 3 → 0.33, not found → 0.0 |
| nDCG | How good is the ENTIRE ranking? | Relevant docs at top → high score, buried → low score; normalized to a 0-1 scale (1 = perfect) |
| Recall@K | How many relevant docs did you FIND? | Found 3 of 4 relevant → 75% recall; higher K → more coverage but slower |