ndcg@k metric is inaccurate #6707

Closed
PascalIversen opened this issue Feb 15, 2021 · 2 comments · Fixed by #8822
Labels
LTR Learning to rank

Comments

@PascalIversen

Hey,
I am using xgboost version 1.3.3 and I am getting invalid ndcg@k metric values. The values are > 1, while ndcg@k must lie in [0, 1] by definition.

Example:

# creating dummy ranking data
import numpy as np

groups = [60, 50, 32, 22, 22, 22, 20, 12]
labels = []
features = []
for group in groups:
    # per-group labels are the ranks 0 .. group_size - 1, so they reach 59 in the largest group
    label_group = list(np.argsort(np.argsort(np.random.randn(group))))
    labels = labels + label_group
    for label in label_group:
        features.append([label + 10, np.random.randn(), np.random.randn()])
features = np.stack(features)


# fitting a ranking model
import xgboost as xgb

model = xgb.XGBRanker(
    objective="rank:pairwise",
    max_depth=10,
    n_estimators=1000,
    verbosity=2)

ranker = model.fit(X=features, y=labels, group=groups,
                   eval_set=[(features, labels)], eval_group=[groups],
                   eval_metric="ndcg@10", early_stopping_rounds=2, verbose=True)
[16:15:01] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 146 extra nodes, 0 pruned nodes, max_depth=10
[0]	validation_0-ndcg@10:1.32964
[16:15:01] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 112 extra nodes, 0 pruned nodes, max_depth=10
[1]	validation_0-ndcg@10:112.34529
[16:15:01] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 130 extra nodes, 0 pruned nodes, max_depth=10
[2]	validation_0-ndcg@10:1.69752
[16:15:01] INFO: ../src/tree/updater_prune.cc:101: tree pruning end, 92 extra nodes, 0 pruned nodes, max_depth=10
@trivialfis added the LTR Learning to rank label Mar 30, 2021
@lcschv commented Mar 3, 2023

Hey, I have faced the same issue. After some investigation, I think the problem is that the ideal ranking is computed incorrectly.

For instance, take a query q and a set of relevant documents for this query, RelevantDocs = {d1, d2, d3}. Let's assume the relevance grades are the following: {d1=2, d2=1, d3=1}.

In order to compute the ideal ranking, we have to sort the relevant documents in the best possible order (based only on the set of relevant documents).

So, in this case, the ideal ranking is
R_Ideal = {d1, d2, d3} or, because of ties, {d1, d3, d2}.

Now, imagine we have a retrieval model (say BM25) which returns a ranking R = {d3, d2, d1}.
In order to compute NDCG, we first have to compute DCG(R_Ideal) and then DCG(R).
Assuming the discount is simply the rank position, each document contributes rel / rank.

DCG(R_Ideal) = (2/1) + (1/2) + (1/3) ≈ 2.83
DCG(R) = (1/1) + (1/2) + (2/3) ≈ 2.17

Then we can compute

NDCG = DCG(R) / DCG(R_Ideal), which leads to:
NDCG ≈ 0.765

The issue happening during the computation of ndcg@k is probably due to a wrong calculation of the ideal ranking. Basically, imagine now we want to compute ndcg@2. The result should be computed as follows:

DCG@2(R_Ideal) = (2/1) + (1/2) = 2.5
DCG@2(R) = (1/1) + (1/2) = 1.5
NDCG@2 = 0.60
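
As a sanity check, here is a minimal sketch of that textbook computation, using the simplified rel / rank gain from the example above (this is not XGBoost's implementation): the ideal DCG is taken over all relevant documents sorted by grade, and only then truncated at k.

def dcg_at_k(grades, k):
    # grades are relevance grades in ranked order; the discount is simply the rank position
    return sum(g / (i + 1) for i, g in enumerate(grades[:k]))

def ndcg_at_k(ranked_grades, k):
    ideal = sorted(ranked_grades, reverse=True)  # ideal ordering over ALL documents
    return dcg_at_k(ranked_grades, k) / dcg_at_k(ideal, k)

print(ndcg_at_k([1, 1, 2], k=2))  # R = {d3, d2, d1} with grades d1=2, d2=1, d3=1 -> 0.6
print(ndcg_at_k([1, 1, 2], k=3))  # full ranking -> ~0.765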

However, I think the ideal ranking is instead being computed by sorting only the documents that appear in the top-k of ranking R. So, with a cut-off at k=2, R_Ideal would look like this: R_Ideal = {d3, d2}.

DCG@2(R_Ideal) = (1/1) + (1/2) = 1.5
DCG@2(R) = (1/1) + (1/2) = 1.5
NDCG@2 = 1.0

In addition, it looks like the initial ranking (the baseline) is used to extract this top-k of relevant documents. So, if after re-sorting with LTR you get a ranking like R_LTR = {d1, d2}, then since the ideal ranking is still being computed as DCG@2(R_Ideal) = (1/1) + (1/2) = 1.5, when you compute DCG(R_LTR) you get something like this:

DCG@2(R_Ideal) = (1/1) + (1/2) = 1.5
DCG@2(R_LTR) = (2/1) + (1/2) = 2.5
NDCG@2 = 1.66

which results in an invalid NDCG@2 value greater than 1.
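
To make the hypothesis concrete, here is the same arithmetic in a few lines (an illustration of the suspected behaviour only, not XGBoost's actual code): freezing the ideal DCG@2 from the baseline top-2 {d3, d2} instead of the true ideal {d1, d2} lets a better re-ranking score above 1.

frozen_idcg_at_2 = 1 / 1 + 1 / 2        # 1.5, "ideal" taken from the baseline top-2 (grades 1, 1)
dcg_ltr_at_2 = 2 / 1 + 1 / 2            # 2.5, R_LTR = {d1, d2} (grades 2, 1)
print(dcg_ltr_at_2 / frozen_idcg_at_2)  # 1.666... > 1, an invalid NDCG@2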

This is what I think is happening.

Has anyone else faced this same issue and found a fix?

@trivialfis (Member)

Hi, thank you for sharing a detailed analysis. For the original issue posted here, the problem is just overflow. The exponential gain 2^{rel} - 1 is computed with a 32-bit integer, which limits the maximum relevance label to 31. The existing LTR implementation doesn't have a check on the input labels, hence the weird output.
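
For illustration only (this mirrors the explanation above, not the actual C++ code): with the exponential gain 2^rel - 1 held in a 32-bit integer, any label above 31 no longer fits. The reproduction at the top uses per-group ranks as labels, so they reach 59 in the largest group.

import numpy as np

int32_max = np.iinfo(np.int32).max  # 2**31 - 1, the largest gain a 32-bit signed int can hold
for rel in (10, 31, 59):
    gain = 2**rel - 1  # exact value; Python ints do not overflow
    print(rel, gain, "fits in int32" if gain <= int32_max else "overflows int32")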
