
rank:pairwise degrades performance in 2.0.0 #9625

Closed
xbanke opened this issue Oct 4, 2023 · 10 comments · Fixed by #10094

Comments

@xbanke

xbanke commented Oct 4, 2023

As mentioned in #9624

I retrain the model on a rolling basis, roughly every month. The training data is grouped by date, and the eval_func is the average of the daily IC (information coefficient). In 2.0.0 the in-sample loss descends quite slowly (the red line is in-sample, the blue line is out-of-sample):

[screenshot: 2.0.0 training curves]

In 1.7.6, the in-sample loss descends noticeably faster:

[screenshot: 1.7.6 training curves]
Here are the params:

import os
params = {
    'objective': 'rank:pairwise',
    'booster': 'gbtree',
    'tree_method': 'hist',
    'base_score': 0.5,  # specified in 2.0.0
    'verbosity': 1,
    'seed': 42,
    'learning_rate': 0.1,
    'max_depth': 6,
    'gamma': 1,
    'min_child_weight': 4,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'nthread': max(1, int(os.cpu_count() * 0.8)),
    'reg_lambda': 1,
    'reg_alpha': 1,
}
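
For context, here is a minimal sketch of the kind of eval described above, assuming "daily IC" means the per-date Spearman correlation between predictions and labels; dates/train_dates are hypothetical arrays aligned with the training rows, and this is illustrative only, not the exact eval_func:

import numpy as np
from scipy.stats import spearmanr

def make_daily_ic_eval(dates):
    # dates: hypothetical per-row date array, aligned with the DMatrix rows
    dates = np.asarray(dates)

    def daily_ic(preds, dtrain):
        labels = dtrain.get_label()
        # mean Spearman IC across all dates
        ics = [spearmanr(preds[dates == d], labels[dates == d])[0]
               for d in np.unique(dates)]
        return 'daily_ic', float(np.mean(ics))

    return daily_ic

# booster = xgb.train(params, dtrain, num_boost_round=100,
#                     evals=[(dtrain, 'train')],
#                     custom_metric=make_daily_ic_eval(train_dates))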
@hcho3
Collaborator

hcho3 commented Oct 4, 2023

Thanks. By any chance, is it possible to share the training data? If not, have you seen the same phenomenon occur with synthetic data?

@xbanke
Author

xbanke commented Oct 4, 2023

Thanks for your quick reply.

The model target is the rank of stock returns over the next 10 days, and the training data covers the last ten years of stock quotes from the Chinese exchanges. The data is about 5 GB.

@hcho3
Collaborator

hcho3 commented Oct 4, 2023

I suppose your data is confidential and the file cannot be shared. It would be great if we could reproduce the same issue with non-confidential data, so that we developers can identify the root cause.

@trivialfis
Member

It's fine; you can reproduce the old behavior by using the mean sampling method. We changed the default pair enumeration method to top-k.

@trivialfis
Member

trivialfis commented Oct 4, 2023

This is the relevant section: https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#constructing-pairs

The old behavior was lambdarank_num_pair_per_sample=1, lambdarank_pair_method=mean.
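
In other words, a minimal sketch of restoring the pre-2.0 pairing behavior on 2.x would look like this (parameter names are from the tutorial linked above; the remaining training parameters stay unchanged):

params = {
    'objective': 'rank:pairwise',
    'lambdarank_pair_method': 'mean',     # sample pairs rather than top-k enumeration
    'lambdarank_num_pair_per_sample': 1,  # one sampled pair per document
    # ... other parameters as before ...
}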

@xbanke
Author

xbanke commented Oct 4, 2023

> This is the relevant section: https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#constructing-pairs
>
> The old behavior was lambdarank_num_pair_per_sample=1, lambdarank_pair_method=mean.

Thanks, I'll try it out later.

The default value of lambdarank_pair_method is mean, per https://xgboost.readthedocs.io/en/stable/parameter.html#parameters-for-learning-to-rank-rank-ndcg-rank-map-rank-pairwise

And what is the default value of lambdarank_num_pair_per_sample? I didn't see it documented.

@xbanke
Author

xbanke commented Oct 4, 2023

> It's fine; you can reproduce the old behavior by using the mean sampling method. We changed the default pair enumeration method to top-k.

I tried with lambdarank_num_pair_per_sample=1 and lambdarank_pair_method=mean, but nothing changed. It looks like those are already the default values.

@trivialfis
Member

Let me try it later; I need a few days as I'm currently on PTO.

@xbanke
Author

xbanke commented Mar 2, 2024

Hi @trivialfis, I used some fake data to train with objective rank:pairwise in version 2.0.3, and the eval score never changes.

import numpy as np
import pandas as pd
import xgboost as xgb

np.random.seed(42)

n_groups = 1000
group_size = 2000
n_features = 100
n_levels = 20

rows = n_groups * group_size

# Random features; labels are per-group ranks bucketed into n_levels relevance levels
features = pd.DataFrame(np.random.randn(rows, n_features).astype('float32'), columns=[f'f{i:03d}' for i in range(n_features)])
qids = pd.Series(np.arange(rows, dtype='int') // group_size)
labels = pd.Series(np.random.randn(rows).astype('float32')).groupby(qids).rank(method='first').sub(1) // (group_size // n_levels)

dmatrix = xgb.DMatrix(features, label=labels, qid=qids)
params = {
    'objective': 'rank:pairwise',
    # 'objective': 'multi:softprob',
    # 'num_class': n_levels,
    
    'base_score': 0.5,
    'lambdarank_pair_method': 'mean',
    'lambdarank_num_pair_per_sample': 1,
    'booster': 'gbtree',
    'tree_method': 'hist',
    'verbosity': 1,
    # 'seed': 42,
    'learning_rate': 0.1,
    'max_depth': 6,
    'gamma': 1,
    'min_child_weight': 4,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'nthread': 20,
    'reg_lambda': 1,
    'reg_alpha': 1,
    'eval_metric': ['ndcg@100', 'ndcg@500', 'ndcg@1000'],
}

booster = xgb.train(params, dmatrix, 100, verbose_eval=10, evals=[(dmatrix, 'train')]) 

This will print something like this:

[0] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[10] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[20] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[30] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[40] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[50] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[60] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[70] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[80] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[90] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
[99] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313

However, in 1.7.6, with the same params, training prints this:

[18:43:38] WARNING: ../src/learner.cc:767:
Parameters: { "lambdarank_num_pair_per_sample", "lambdarank_pair_method" } are not used.

[0] train-ndcg@100:0.10620 train-ndcg@500:0.21695 train-ndcg@1000:0.36816
[10] train-ndcg@100:0.11941 train-ndcg@500:0.23102 train-ndcg@1000:0.38391
[20] train-ndcg@100:0.12586 train-ndcg@500:0.23888 train-ndcg@1000:0.39168
[30] train-ndcg@100:0.13033 train-ndcg@500:0.24376 train-ndcg@1000:0.39735
[40] train-ndcg@100:0.13584 train-ndcg@500:0.24908 train-ndcg@1000:0.40316
[50] train-ndcg@100:0.13938 train-ndcg@500:0.25338 train-ndcg@1000:0.40777
[60] train-ndcg@100:0.14304 train-ndcg@500:0.25724 train-ndcg@1000:0.41162
[70] train-ndcg@100:0.14532 train-ndcg@500:0.26067 train-ndcg@1000:0.41497
[80] train-ndcg@100:0.14740 train-ndcg@500:0.26288 train-ndcg@1000:0.41804
[90] train-ndcg@100:0.15038 train-ndcg@500:0.26610 train-ndcg@1000:0.42058
[99] train-ndcg@100:0.15154 train-ndcg@500:0.26784 train-ndcg@1000:0.42342

@trivialfis
Member

Opened a PR to make the normalization optional via a new parameter, lambdarank_normalization: #10094
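
Assuming that PR lands as described, disabling the normalization would be a one-line change (a sketch against the proposed parameter, not yet in a released version at the time of this comment):

params = {
    'objective': 'rank:pairwise',
    'lambdarank_normalization': False,  # proposed in #10094; turns off the normalization
    # ... other parameters as before ...
}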
