rank:pairwise degrades performance in 2.0.0 #9625
Thanks. By any chance is it possible to share the training data? If not, have you seen the same phenomenon with synthetic data?
Thanks for your quick reply. The model target is the rank of stock returns over the next 10 days, and the training data covers the last ten years of stock quotes from the Chinese stock exchanges. The size of the data is about 5 GB.
I suppose your data is confidential and the file cannot be shared. It would be great if we could reproduce the same issue using non-confidential data, so that we developers can identify the root cause.
It's fine; you can reproduce the old behavior by using mean sampling. We changed the default pair enumeration method to top-k.
This is the related section: https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#constructing-pairs. The old default was the mean sampling method.
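For intuition, the two pair-construction strategies discussed above can be sketched roughly as follows. This is a simplification for illustration, not XGBoost's actual implementation; the function names are made up here:

```python
import numpy as np

def mean_pairs(labels, num_pair_per_sample, rng):
    """'mean' style: for every document, sample a fixed number of
    partners with a different relevance label, uniformly at random."""
    labels = np.asarray(labels)
    pairs = []
    for i in range(len(labels)):
        candidates = np.flatnonzero(labels != labels[i])
        if candidates.size == 0:
            continue
        for j in rng.choice(candidates, size=num_pair_per_sample):
            pairs.append((i, int(j)))
    return pairs

def topk_pairs(scores, labels, k):
    """'topk' style: only documents currently ranked in the top k by
    model score form pairs, each with every differently-labeled doc."""
    labels = np.asarray(labels)
    top = np.argsort(-np.asarray(scores))[:k]
    return [(int(i), int(j)) for i in top
            for j in range(len(labels)) if labels[j] != labels[i]]

rng = np.random.default_rng(0)
labels = [0, 1, 2, 0, 1]
mp = mean_pairs(labels, 1, rng)          # one sampled pair per document
tp = topk_pairs([0.9, 0.1, 0.5, 0.3, 0.2], labels, 2)
```

With `mean`, every document contributes gradient pairs; with `topk`, pairs concentrate on the head of the ranking, which can change training dynamics when the metric cares about deep cutoffs like ndcg@1000.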
Thanks, I'll try it out later. What are the default values of `lambdarank_pair_method` and `lambdarank_num_pair_per_sample`?
I have tried it.
Let me try it later; I need a few days as I'm currently on PTO.
Hi @trivialfis, I used some fake data to train with objective `rank:pairwise`:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

np.random.seed(42)

n_groups = 1000
group_size = 2000
n_features = 100
n_levels = 20
rows = n_groups * group_size

features = pd.DataFrame(
    np.random.randn(rows, n_features).astype('float32'),
    columns=[f'f{i:03d}' for i in range(n_features)],
)
qids = pd.Series(np.arange(rows, dtype='int') // group_size)
labels = (
    pd.Series(np.random.randn(rows).astype('float32'))
    .groupby(qids)
    .rank(method='first')
    .sub(1)
    // (group_size // n_levels)
)
dmatrix = xgb.DMatrix(features, label=labels, qid=qids)

params = {
    'objective': 'rank:pairwise',
    # 'objective': 'multi:softprob',
    # 'num_class': n_levels,
    'base_score': 0.5,
    'lambdarank_pair_method': 'mean',
    'lambdarank_num_pair_per_sample': 1,
    'booster': 'gbtree',
    'tree_method': 'hist',
    'verbosity': 1,
    # 'seed': 42,
    'learning_rate': 0.1,
    'max_depth': 6,
    'gamma': 1,
    'min_child_weight': 4,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'nthread': 20,
    'reg_lambda': 1,
    'reg_alpha': 1,
    'eval_metric': ['ndcg@100', 'ndcg@500', 'ndcg@1000'],
}

booster = xgb.train(params, dmatrix, 100, verbose_eval=10, evals=[(dmatrix, 'train')])
```

This prints something like:

```
[0] train-ndcg@100:0.10149 train-ndcg@500:0.21251 train-ndcg@1000:0.36313
```

However, in 1.7.6, with the same params, training prints:

```
[18:43:38] WARNING: ../src/learner.cc:767:
[0] train-ndcg@100:0.10620 train-ndcg@500:0.21695 train-ndcg@1000:0.36816
```
Opened a PR to make the normalization optional by adding a new parameter.
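The new parameter's name is elided in the comment above; as a sketch only, assuming the PR exposes it as a boolean in the booster params (the name `lambdarank_normalization` below is an assumption for illustration, not confirmed by this thread), disabling the normalization might look like:

```python
# Config-fragment sketch; 'lambdarank_normalization' is an assumed
# name for the new switch -- check the merged PR for the real one.
params = {
    "objective": "rank:pairwise",
    "lambdarank_pair_method": "mean",   # restore pre-2.0 pair sampling
    "lambdarank_normalization": False,  # assumed: turn the normalization off
}
```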
As mentioned in #9624:
I retrain the model on a rolling basis, roughly every month. The training data is grouped by date, and the evaluation metric is the average of the daily IC (information coefficient). In 2.0.0 the in-sample loss descends quite slowly (the red line is in-sample and the blue line is out-of-sample).
In 1.7.6, the in-sample loss descends noticeably.
Here are the params:
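For context, the "average of daily IC" metric described above could be computed along these lines. This is a sketch under the common definition of IC as the Spearman rank correlation between predictions and realized returns within each date group; the function and column names are illustrative, not from the reporter's code:

```python
import pandas as pd

def daily_ic(dates, preds, returns):
    """Mean daily IC: Spearman correlation of predictions vs. realized
    returns within each date, averaged across dates."""
    df = pd.DataFrame({"date": dates, "pred": preds, "ret": returns})

    def spearman(g):
        # Pearson correlation of ranks == Spearman correlation.
        return g["pred"].rank().corr(g["ret"].rank())

    return df.groupby("date")[["pred", "ret"]].apply(spearman).mean()
```

A metric like this depends only on the within-day ordering of scores, which is why a change in pair enumeration (and hence in how the ranking objective distributes gradient across documents) can move it even when pointwise losses look similar.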