
Retrieval Evaluation Metrics: MRR, nDCG, and Recall@K

December 8, 2025

LLM

RAG

1. Metrics to Measure Accuracy of RAG

When building RAG (Retrieval-Augmented Generation) systems, you need to measure how well the retrieval component is working. These three metrics are the most commonly used:


1.1. MRR (Mean Reciprocal Rank)

1.1.1. What it measures

MRR measures how quickly you find the first relevant result. It rewards systems that put relevant documents at the top.

1.1.2. The Formula

For a single query:

RR = 1 / rank_i

For multiple queries:

MRR = (1 / |Q|) × Σ (1 / rank_i),  summed over i = 1 … |Q|

Where |Q| is the number of queries and rank_i is the position of the first relevant result for query i.

1.1.3. Example

Imagine you search for "Who is Avery?" and get these results:

Position   Document           Relevant?
1          Company Overview   No
2          Product Info       No
3          Avery's Profile    Yes
4          Team Page          No

Reciprocal Rank = 1/3 = 0.333

If the relevant document was at position 1: Reciprocal Rank = 1/1 = 1.0 (perfect!)

1.1.4. Multiple Queries Example
Query                    First Relevant Position   RR
"Who is Avery?"          3                         1/3 = 0.333
"What is HomeProtect?"   1                         1/1 = 1.000
"Company address?"       2                         1/2 = 0.500

MRR = (0.333 + 1.000 + 0.500) / 3 = 0.611
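
In code, MRR is just the average of the per-query reciprocal ranks. A minimal sketch (the helper name mean_reciprocal_rank and the position list are illustrative, not part of the original implementation; 0 marks a query with no relevant result):

def mean_reciprocal_rank(first_relevant_positions: list[int]) -> float:
    """Average the reciprocal ranks; a position of 0 means no relevant result was found."""
    reciprocal_ranks = [1.0 / p if p > 0 else 0.0 for p in first_relevant_positions]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Positions from the table above: 3, 1, 2
print(round(mean_reciprocal_rank([3, 1, 2]), 3))  # 0.611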

1.1.5. How it's implemented
def calculate_mrr(keyword: str, retrieved_docs: list) -> float:
    """Calculate reciprocal rank for a single keyword."""
    keyword_lower = keyword.lower()
    for rank, doc in enumerate(retrieved_docs, start=1):
        if keyword_lower in doc.page_content.lower():
            return 1.0 / rank
    return 0.0  # Keyword not found
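
The function above returns the reciprocal rank for one keyword; to get a single MRR you average it across all keywords (or queries). A minimal sketch reusing calculate_mrr, assuming documents expose a page_content attribute as in the snippet above (the helper name mrr_over_keywords is hypothetical):

def mrr_over_keywords(keywords: list[str], retrieved_docs: list) -> float:
    """Mean of the per-keyword reciprocal ranks returned by calculate_mrr."""
    if not keywords:
        return 0.0
    return sum(calculate_mrr(kw, retrieved_docs) for kw in keywords) / len(keywords)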
1.1.6. Interpretation
MRR Score   Meaning
1.0         Perfect - relevant doc always first
0.9+        Excellent - usually in top 1-2
0.75+       Good - usually in top 2-3
0.5+        Acceptable - often in top 2-4
< 0.5       Poor - relevant docs buried
1.1.7. Limitations
  • Only considers the first relevant result
  • Ignores all other relevant documents
  • A query with 10 relevant docs but first at position 2 scores the same as one with 1 relevant doc at position 2

1.2. nDCG (Normalized Discounted Cumulative Gain)

1.2.1. What it measures

nDCG measures the quality of the entire ranking, not just the first result. It considers:

  1. How many relevant documents you retrieved
  2. Where they appear in the ranking (higher is better)
1.2.2. The Formula

Step 1: Calculate DCG (Discounted Cumulative Gain)

DCG@k = Σ rel_i / log₂(i + 1),  summed over positions i = 1 … k

Where i is the retrieval position (rank) of a document and rel_i is the relevance score at position i; a detailed example is given in 〈1.2.3. Example (Binary Relevance: 0 or 1)〉.

The discount uses log₂(i + 1) rather than log₂(i) so the denominator is never zero (log₂(1) = 0), and because the logarithm grows more slowly than i itself, relevant results further down the list are penalized more gently than a linear discount would penalize them.

Step 2: Calculate IDCG (Ideal DCG)

IDCG@k is the best possible DCG@k, obtained by ranking all relevant documents at the top.

Step 3: Calculate nDCG

nDCG@k = DCG@k / IDCG@k
1.2.3. Example (Binary Relevance: 0 or 1)

Query: "insurance products"

Your retrieved results:

Position   Document       Relevant?   Relevance Score
1          Company News   No          0
2          HomeProtect    Yes         1
3          About Us       No          0
4          AutoInsure     Yes         1
5          CarePlus       Yes         1

Calculate DCG:

DCG = 0/log₂(2) + 1/log₂(3) + 0/log₂(4) + 1/log₂(5) + 1/log₂(6)
    = 0 + 0.631 + 0 + 0.431 + 0.387
    = 1.449

Calculate Ideal DCG (if all 3 relevant docs were at top):

IDCG = 1/log₂(2) + 1/log₂(3) + 1/log₂(4)
     = 1.0 + 0.631 + 0.5
     = 2.131

nDCG = 1.449 / 2.131 = 0.680

1.2.4. Why the "Discount"?

The logarithmic discount penalizes relevant results that appear lower:

Position   Discount Factor (1/log₂(i+1))
1          1.000
2          0.631
3          0.500
4          0.431
5          0.387
10         0.289

A relevant document at position 1 contributes 1.0 to DCG.
A relevant document at position 10 contributes only 0.289.
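
The discount factors in the table are just 1/log₂(i + 1) evaluated at each position; a quick sketch to reproduce them:

import math

for position in (1, 2, 3, 4, 5, 10):
    print(position, round(1 / math.log2(position + 1), 3))
# Prints 1.0, 0.631, 0.5, 0.431, 0.387 and 0.289, matching the table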

1.2.5. How it's implemented
import math


def calculate_dcg(relevances: list[int], k: int) -> float:
    """Calculate Discounted Cumulative Gain."""
    dcg = 0.0
    for i in range(min(k, len(relevances))):
        # i + 2 because i is 0-based and the discount is log2(rank + 1)
        dcg += relevances[i] / math.log2(i + 2)
    return dcg


def calculate_ndcg(keyword: str, retrieved_docs: list, k: int = 10) -> float:
    """Calculate nDCG for a single keyword (binary relevance)."""
    keyword_lower = keyword.lower()

    # Binary relevance: 1 if keyword found, 0 otherwise
    relevances = [
        1 if keyword_lower in doc.page_content.lower() else 0 
        for doc in retrieved_docs[:k]
    ]

    # DCG
    dcg = calculate_dcg(relevances, k)

    # Ideal DCG: the same relevance scores re-sorted so relevant docs come first
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = calculate_dcg(ideal_relevances, k)

    return dcg / idcg if idcg > 0 else 0.0
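
To sanity-check the implementation against the worked example in 1.2.3, you can feed it a tiny stand-in for the retrieved documents. SimpleDoc and the document texts below are hypothetical; only a page_content attribute is assumed, as in calculate_ndcg above:

from dataclasses import dataclass

@dataclass
class SimpleDoc:
    page_content: str

# Positions 2, 4, and 5 mention "insurance", mirroring the relevance pattern 0, 1, 0, 1, 1
docs = [
    SimpleDoc("Company news and press releases"),
    SimpleDoc("HomeProtect home insurance overview"),
    SimpleDoc("About us and company history"),
    SimpleDoc("AutoInsure car insurance overview"),
    SimpleDoc("CarePlus health insurance overview"),
]

print(f"{calculate_ndcg('insurance', docs, k=5):.3f}")  # 0.680, matching the hand calculation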
1.2.6. Interpretation
nDCG Score   Meaning
1.0          Perfect ranking
0.9+         Excellent
0.75+        Good
0.5+         Acceptable
< 0.5        Poor - relevant docs ranked too low
1.2.7. Graded Relevance (Advanced)

nDCG also supports graded relevance (not just binary):

Relevance           Score   Meaning
Perfect             3       Exactly what the user wanted
Highly Relevant     2       Very useful
Somewhat Relevant   1       Partially useful
Not Relevant        0       Useless

This gives more nuanced evaluation but requires human annotation.
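
The same DCG machinery handles graded scores; you simply feed in 0-3 grades instead of 0/1. A short sketch reusing calculate_dcg from above (the judgments are made up for illustration):

# Hypothetical graded judgments for five retrieved documents (3 = perfect ... 0 = not relevant)
graded = [2, 3, 0, 1, 0]

dcg = calculate_dcg(graded, k=5)
idcg = calculate_dcg(sorted(graded, reverse=True), k=5)
print(f"graded nDCG@5 = {dcg / idcg:.3f}")  # ≈ 0.908 for these made-up grades

Note that some implementations use an exponential gain, (2^rel − 1)/log₂(i + 1), which rewards highly relevant documents more strongly; the linear form here matches the formula used earlier in this post.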


1.3. Recall@K

1.3.1. What it measures

Recall@K measures what percentage of all relevant documents you retrieved in the top K results.

1.3.2. The Formula
Recall@K = (Number of relevant docs in top K) / (Total number of relevant docs)
1.3.3. Example

You have a database with 5 documents about HomeProtect.

Query: "Tell me about HomeProtect"

Top 10 retrieved results contain 3 HomeProtect documents.

Recall@10 = 3/5 = 0.60 = 60%
1.3.4. Different K values
Metric       Meaning
Recall@1     Did you get at least 1 relevant doc in the top result?
Recall@5     What % of relevant docs are in the top 5?
Recall@10    What % of relevant docs are in the top 10?
Recall@100   What % of relevant docs are in the top 100?
1.3.5. Example across K values

Total relevant documents: 4

K    Relevant in top K   Recall@K
1    1                   25%
3    2                   50%
5    3                   75%
10   4                   100%
1.3.6. Implementation
def calculate_recall_at_k(retrieved_docs: list, relevant_docs: set, k: int) -> float:
    """Calculate Recall@K."""
    retrieved_ids = set(doc.metadata.get('id') for doc in retrieved_docs[:k])
    found = len(retrieved_ids.intersection(relevant_docs))
    return found / len(relevant_docs) if relevant_docs else 0.0
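
A usage sketch, assuming each retrieved document carries an 'id' key in its metadata dict; DocWithId and the ids below are hypothetical:

from dataclasses import dataclass

@dataclass
class DocWithId:
    metadata: dict

retrieved = [DocWithId({"id": doc_id}) for doc_id in ["d3", "d7", "d1", "d9", "d2"]]
relevant_ids = {"d1", "d2", "d4", "d5"}

print(calculate_recall_at_k(retrieved, relevant_ids, k=5))  # 2 of 4 relevant ids retrieved -> 0.5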
1.3.7. Interpretation
Recall@K   Meaning
100%       Found all relevant documents
75%+       Good coverage
50%+       Acceptable
< 50%      Missing important information
1.3.8. Trade-off with Precision

Higher K → Higher Recall (find more relevant docs)
Higher K → Lower Precision (more irrelevant docs too)
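
Precision@K is the mirror-image metric: the fraction of the K retrieved documents that are relevant. Computing both side by side makes the trade-off concrete; the sketch below works on plain id lists and uses made-up ids:

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-K retrieved ids that are relevant."""
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / k if k else 0.0

def recall_at_k_ids(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant ids found in the top K."""
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids) if relevant_ids else 0.0

retrieved = ["d1", "d9", "d2", "d8", "d7", "d3", "d6", "d5", "d0", "d4"]
relevant = {"d1", "d2", "d3", "d4"}
for k in (3, 5, 10):
    print(k, round(precision_at_k(retrieved, relevant, k), 2), recall_at_k_ids(retrieved, relevant, k))
# As K grows, recall climbs (0.5 -> 0.5 -> 1.0) while precision drops (0.67 -> 0.4 -> 0.4)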


1.4. Comparing the Metrics

Metric     Focus                   Question it answers
MRR        First relevant result   "How quickly do I find something useful?"
nDCG       Ranking quality         "Are relevant docs ranked higher than irrelevant ones?"
Recall@K   Coverage                "Did I find all the relevant information?"

1.5. When to use which?

Use Case                                   Best Metric
Search engine (user wants 1 good result)   MRR
RAG system (need complete context)         Recall@K
Recommendation system (order matters)      nDCG
General retrieval evaluation               All three!

2. Practical Example: RAG Evaluation

2.1. Test Case

test = TestQuestion(
    question="What products does InsureLLM offer?",
    keywords=["HomeProtect", "AutoInsure", "CarePlus", "TravelGuard"],
    answer="InsureLLM offers four products: HomeProtect, AutoInsure, CarePlus, and TravelGuard."
)

2.2. Retrieved Documents (top 5)

  1. Company Overview (mentions "HomeProtect") ✅
  2. Team Page (no keywords) ❌
  3. AutoInsure Product Page ✅
  4. News Article (no keywords) ❌
  5. CarePlus Details ✅

2.3. Calculations

Reciprocal rank (RR) per keyword:

  • HomeProtect: found at position 1 → RR = 1/1 = 1.0
  • AutoInsure: found at position 3 → RR = 1/3 = 0.333
  • CarePlus: found at position 5 → RR = 1/5 = 0.2
  • TravelGuard: not found → RR = 0

Average MRR = (1.0 + 0.333 + 0.2 + 0) / 4 = 0.383

Recall@5:

  • Found 3 out of 4 keywords
  • Recall@5 = 3/4 = 75%
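
The same numbers fall out of the functions defined earlier. A sketch reusing calculate_mrr from 1.1.5 with a hypothetical minimal document class; the document texts are paraphrased placeholders for the retrieved chunks in 2.2:

from dataclasses import dataclass

@dataclass
class SimpleDoc:
    page_content: str

retrieved = [
    SimpleDoc("Company overview, featuring HomeProtect"),  # position 1
    SimpleDoc("Team page"),                                # position 2
    SimpleDoc("AutoInsure product page"),                  # position 3
    SimpleDoc("News article"),                             # position 4
    SimpleDoc("CarePlus details"),                         # position 5
]
keywords = ["HomeProtect", "AutoInsure", "CarePlus", "TravelGuard"]

reciprocal_ranks = [calculate_mrr(kw, retrieved) for kw in keywords]
print(round(sum(reciprocal_ranks) / len(reciprocal_ranks), 3))  # (1.0 + 0.333 + 0.2 + 0.0) / 4 ≈ 0.383

found = sum(1 for kw in keywords if any(kw.lower() in d.page_content.lower() for d in retrieved))
print(found / len(keywords))  # 3 of 4 keywords appear in the top 5 -> 0.75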

3. Summary

┌─────────────────────────────────────────────────────────────────┐
│  MRR: How fast do you find the FIRST relevant result?           │
│  ────────────────────────────────────────────────────           │
│  Position 1 → Score 1.0                                         │
│  Position 2 → Score 0.5                                         │
│  Position 3 → Score 0.33                                        │
│  Not found  → Score 0.0                                         │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  nDCG: How good is the ENTIRE ranking?                          │
│  ────────────────────────────────────────                       │
│  Relevant docs at top → High score                              │
│  Relevant docs buried → Low score                               │
│  Normalized: 0-1 scale (1 = perfect)                            │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  Recall@K: How many relevant docs did you FIND?                 │
│  ───────────────────────────────────────────────                │
│  Found 3 of 4 relevant → 75% recall                             │
│  Higher K → More coverage but slower                            │
└─────────────────────────────────────────────────────────────────┘
