
Revisiting default parameter settings? #4986

Open
thvasilo opened this issue Oct 25, 2019 · 15 comments

@thvasilo
Contributor

thvasilo commented Oct 25, 2019

Hello all,

I came upon a recent JMLR paper that examined the "tunability" of the hyperparameters of multiple algorithms, including XGBoost.

Their methodology, as far as I understand it, is to take the default parameters of the package, find the (near) optimal parameters for each dataset in their evaluation and determine how valuable it is to tune a particular parameter.

In doing so they also come up with "optimal defaults" in Table 3, and an interactive Shiny app.

This made me curious about how the defaults for XGBoost were chosen and if it's something that the community would be interested in revisiting in the future.

@trivialfis
Member

@thvasilo Seems like a good read for the weekend. ;-)

@thvasilo
Contributor Author

I'll ping the main author @PhilippPro in case he wants to chime in on recommended defaults.

@PhilippPro

PhilippPro commented Oct 28, 2019

Hey @thvasilo! I think the defaults in xgboost were not chosen for the best performance; the user also has to specify nrounds themselves. The defaults were probably chosen to provide a basic version of gradient boosting. It would be nice if some good defaults were at least described in the help section. It is a bit annoying to want to use a package and have to search the internet to make it work properly.

There is also an autoxgboost package (https://github.com/ja-thomas/autoxgboost), but it does not work very well; sometimes it produces nonsense results, and its performance is worse than that of other auto-tuning packages such as tuneRanger or liquidSVM. See here for a graph of the benchmark that I ran on some regression datasets: https://github.com/PhilippPro/tuneRanger/blob/master/benchmark/figure/rsq_results.pdf


@RAMitchell
Member

This paper is very interesting, thanks @PhilippPro. There are a few issues with directly adopting the default parameters from Table 3. The paper covers classification datasets only; how confident are we that these parameters are effective for regression/ranking? This brings up a very interesting question: are our regularisation parameters invariant to the learning objective? For example, min_child_weight will be much more restrictive in binary classification, where the Hessians take on very small values, than in squared error regression, where the Hessian is a constant.
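To make the Hessian point concrete, here is a minimal sketch (not from the thread; the margin values are made up) comparing the per-example Hessians of squared error and the binary logistic objective:

```python
import numpy as np

# Per-example Hessians of the two objectives, for illustration only.
# Squared error: L = 0.5 * (y - yhat)^2  ->  hessian = 1 for every example.
# Logistic loss: hessian = p * (1 - p) with p = sigmoid(margin), so it is at
# most 0.25 and shrinks toward 0 as predictions become confident.

def hessian_squared_error(margin):
    return np.ones_like(margin)

def hessian_logistic(margin):
    p = 1.0 / (1.0 + np.exp(-margin))
    return p * (1.0 - p)

margins = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(hessian_squared_error(margins))  # [1. 1. 1. 1. 1.]
print(hessian_logistic(margins))       # approx. [0.018 0.197 0.25 0.197 0.018]
```

Because min_child_weight thresholds the sum of Hessians in a leaf, the same numeric value is far more restrictive under the logistic objective (each example contributes at most 0.25) than under squared error (each example contributes exactly 1).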

The proposed default of 4168 boosting rounds with a low eta is almost certainly better, but it imposes a much higher computational burden. What if someone goes to test the algorithm and it takes 5 minutes to run? I still think it might be a good idea to use a higher default, but this is just a consideration.

What is the role of dataset size in the effectiveness of hyperparameters? I feel like small datasets benefit strongly from regularisation, whereas large datasets often do not seem to benefit at all.

@PhilippPro if you are interested in going further with this I would love to adopt your work in xgboost. I think there is potential here to dramatically improve the results for a large portion of our user base.

@PhilippPro

PhilippPro commented Nov 15, 2019

Hi @RAMitchell! I am interested. I agree with you that the nrounds parameter is too high. I think it would be better to optimize the hyperparameters in a restricted hyperparameter space, e.g. with the maximum nrounds set to 300.

And yes, regression (can you do ranking in xgboost currently?) is another setting that should be considered. There is also the option of using different defaults for classification and regression, as is done, for example, in several random forest packages.

The other thing you mention is the problem of hyperparameters that depend on dataset characteristics. Here you should be very sure about the relationship before making a hyperparameter data-dependent, e.g. on the number of observations or the number of features p (in random forest, for example, mtry is set to the square root of p). A paper which tries to set these defaults empirically can be seen here, but maybe it is better to specify such a rule by "hand" (e.g. humans looking at plots that show the relationship between dataset characteristics, hyperparameters, and performance). I do not know much about theoretical considerations in this field regarding xgboost.

@pfistfl are the results of the new bot already usable, could we use this for the purpose here?

@PhilippPro

@RAMitchell
I could rerun the results with a restricted nrounds. To what value should I restrict it? Or what would be your idea?

@RAMitchell
Member

I think 100-500 is more reasonable - the goal would be for the user's application not to hang unreasonably when they first try the library. We have several places in the language bindings where default parameters arise; we could start with the Python API (or whatever your preference is) and worry about the rest later.

Another way of approaching this would be for us to provide some kind of dictionary of preset parameters in our API. This gives more flexibility and choice.
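Such a preset dictionary does not exist in the API today; a hypothetical sketch of what it could look like in Python (the preset names and values are invented for illustration, loosely following the numbers discussed in this thread):

```python
import xgboost as xgb

# Hypothetical presets -- none of these names exist in XGBoost today.
PRESETS = {
    "fast":     {"eta": 0.3,  "max_depth": 6, "num_boost_round": 100},
    "balanced": {"eta": 0.1,  "max_depth": 6, "subsample": 0.8,
                 "colsample_bytree": 0.8, "num_boost_round": 300},
    "thorough": {"eta": 0.05, "max_depth": 8, "subsample": 0.7,
                 "colsample_bytree": 0.7, "num_boost_round": 500},
}

def train_with_preset(dtrain, preset="balanced", **overrides):
    """Train a booster from a named preset, letting the user override anything."""
    params = dict(PRESETS[preset], **overrides)
    num_boost_round = params.pop("num_boost_round")
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)
```

A user could then call train_with_preset(dtrain, "thorough", eta=0.1) and still override individual values, which keeps the flexibility mentioned above.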

This is all really about user experience and lowering the barrier to training models effectively. This would go hand in hand with examples and documentation.

@RAMitchell
Member

@PhilippPro any update on this? Here is what I propose: rerun with 500 rounds and a couple of regression datasets. Confirm the run-time is acceptable. Set these as the default parameters, and create a documentation page on parameter tuning with a few notes on methodology, how you arrived at these parameters, and a link to your paper.

@PhilippPro

I have not forgotten it, but I currently do not have a lot of time. I can only rerun it on the existing datasets, as I have the data for those, but this is not a problem. Your proposal is good; I will follow it, thanks.

@PhilippPro

@RAMitchell I got some results now which are not really stunning. ;)

I created a blog post where I describe the results:
New xgboost defaults

Currently I am repeating the 5-fold CV to get more stable results and will update the results in the blog post tomorrow. Future work (as described in the blog post) will be a bit more interesting.

@RAMitchell
Member

Awesome work, thanks! I think there is moderate evidence for changing the default parameter settings. Shall we re-evaluate once you have results from CatBoost/LightGBM?

Note that we are mostly focusing new development on "tree_method":"hist" and "tree_method":"gpu_hist", which expose the "grow_policy":"lossguide"/"depthwise" parameter. One of these growth policies might be definitively better on your datasets.

@PhilippPro

PhilippPro commented Feb 26, 2020

Yes, that's fine. The results I got today (with 10-times-repeated 5-fold CV) were slightly better for my defaults (75% better in the case of Spearman's rho); I updated the post.

I will be happy to evaluate your new parameters once they are readily implemented in the package.

I think CatBoost will be better in the default mode, but I will see the results soon.
The other problem for xgboost is that it cannot handle categorical features directly, so I (automatically) created dummy variables out of them.
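For reference, a minimal sketch of that kind of dummy-variable encoding with pandas (the column names and data are made up):

```python
import pandas as pd
import xgboost as xgb

# Toy frame with one categorical column; names and values are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size":  [1.0, 2.5, 3.0, 0.5],
    "label": [0, 1, 1, 0],
})

# xgboost expects numeric features, so the categorical column is expanded
# into 0/1 dummy columns before building the DMatrix.
X = pd.get_dummies(df.drop(columns="label"), columns=["color"]).astype(float)
dtrain = xgb.DMatrix(X, label=df["label"])

booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    dtrain, num_boost_round=10)
```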

@thvasilo
Contributor Author

@RAMitchell how do you feel about the switch to a max_depth of 11, which the blog post suggests? Wouldn't that risk very large memory consumption? Would overfitting be an issue, or is that taken care of by early stopping/the loss-guided growth?

@PhilippPro

I looked at the graph again and thought a bit about the results. In the section with low R-squared, the default of xgboost performs much worse. These are datasets that are hard to fit, where few things can be learned. The higher eta (eta=0.1) leads to too much overfitting compared to my defaults (eta=0.05). If you set nrounds lower than 500, the effect would be smaller. You could leave eta=0.1 but set nrounds to a lower default value. Or leave it as it is, without specifying it.

I am not sure how big the effect of max_depth is. You could leave it at 5 to get better runtimes.

The subsample and colsample parameters could safely be set to values between 0.6 and 0.8; I guess there is not much danger here.
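Putting the numbers from this comment together, the suggestion translates roughly into the following sketch (not an agreed-upon default; dtrain is assumed to exist):

```python
import xgboost as xgb

# Rough translation of the values discussed above -- a sketch, not an
# official default.
params = {
    "eta": 0.05,              # lower learning rate to limit overfitting
    "max_depth": 5,           # kept at 5 for better runtimes, rather than 11
    "subsample": 0.8,         # within the suggested 0.6-0.8 range
    "colsample_bytree": 0.8,  # likewise
}
# With nrounds capped at 500 or lower, as discussed earlier in the thread:
# booster = xgb.train(params, dtrain, num_boost_round=500)
```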

@RAMitchell
Member

I'm not too worried about memory consumption from tree depth. As mentioned by @PhilippPro, overfitting is also offset by lower learning rates and sampling.

@PhilippPro the "grow_policy":"lossguide"/"depthwise" parameter already exists for both "tree_method":"hist" and "tree_method":"gpu_hist". We may even make "tree_method":"hist" the default learning mode at some point.
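For anyone following along, both settings go into the ordinary parameter dictionary; a small runnable sketch with random placeholder data:

```python
import numpy as np
import xgboost as xgb

# Random placeholder data, just to show where the parameters are set.
rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.random(1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",       # or "gpu_hist" on a CUDA-capable GPU
    "grow_policy": "lossguide",  # leaf-wise growth; "depthwise" is the default
    "max_leaves": 63,            # lossguide trees are usually bounded by leaves
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```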
