Extract interaction constraint from split evaluator. #5034
* Extract interaction constraints from split evaluator. The primary reason for doing so is that the split evaluator copies the `num_feature` parameter, which makes serialization and parameter validation difficult. Also, interaction constraints should be used for selecting features, like the column sampler, instead of computing weights.
* Clean up for colmaker. Remove support for `parallel_option` and `cache_opt`. We now use whatever settings were the defaults before this PR, as these parameters were never documented nor actually maintained.
* Enable for approx.
Looks good! This is quite a nice feature upgrade for the 'histmaker' algorithm.
@hcho3 @RAMitchell I enforced the row index to be |
Note:
|
Codecov Report
@@ Coverage Diff @@
## master #5034 +/- ##
=======================================
Coverage 71.52% 71.52%
=======================================
Files 11 11
Lines 2311 2311
=======================================
Hits 1653 1653
Misses 658 658
The reason for doing so is mostly model IO: `num_feature` and `interaction_constraints` are copied in the split evaluator. Also, an interaction constraint is by itself a feature selector, acting like the column sampler, and it's inefficient to bury it deep in the evaluator chain. Lastly, removing yet another copied parameter is a win.

Now that the implementation is split out from the evaluator class, it is also enabled for the approx method.
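To illustrate what "acting like a feature selector" means here, the following is a minimal sketch and not the code from this PR. The function name `AllowedFeatures`, the `FeatureSet` alias, and the rule that an empty path allows every feature are all assumptions made for the example. It computes, for one tree node, which features may be considered for the next split given the features already used on the path to that node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical alias for a sorted set of feature indices.
using FeatureSet = std::set<std::size_t>;

// Hypothetical helper (not an XGBoost API): select the features a node may
// split on. `groups` holds the interaction constraint groups and
// `used_on_path` holds the features already split on above this node.
FeatureSet AllowedFeatures(const std::vector<FeatureSet>& groups,
                           const FeatureSet& used_on_path,
                           std::size_t num_feature) {
  // Every feature index on the path must be a valid column.
  assert(used_on_path.empty() || *used_on_path.rbegin() < num_feature);

  FeatureSet allowed;
  if (used_on_path.empty()) {
    // Assumption for this sketch: nothing is restricted at the root.
    for (std::size_t f = 0; f < num_feature; ++f) {
      allowed.insert(f);
    }
    return allowed;
  }
  for (const auto& group : groups) {
    // A group stays usable only if it contains every feature on the path.
    // std::includes requires sorted ranges, which std::set guarantees.
    if (std::includes(group.begin(), group.end(),
                      used_on_path.begin(), used_on_path.end())) {
      allowed.insert(group.begin(), group.end());
    }
  }
  // Features already on the path can always be split on again.
  allowed.insert(used_on_path.begin(), used_on_path.end());
  return allowed;
}
```

With groups `{0, 1}` and `{2, 3, 4}`, a node whose path has used feature `0` gets back `{0, 1}`. The result is a set of candidate columns, which is the same kind of output a column sampler produces, so the split evaluator never needs to see the constraint at all.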
`parallel_option` and `cache_opt` were never documented nor actually used in the real world. There also isn't a single test for those code blocks.
As the size of input datasets marches toward the billions, incorrect use of `int` is subject to overflow; signed integer overflow is also undefined behaviour. This PR starts the process of unifying the index type to unsigned integers. There are optimizations that can exploit this undefined behaviour, but after some testing I don't see them being beneficial to XGBoost.

Related to #4732.
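As a rough illustration of the overflow concern, and not code taken from this PR, the sketch below uses two hypothetical helpers, `FitsInInt32` and `FlatIndex`. The first checks whether a count can be represented by a 32-bit signed `int`. The second computes a flattened row-major offset using a 64-bit unsigned type, where arithmetic that exceeds the range wraps around in a defined way instead of being undefined behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: true if `n` can be stored in a 32-bit signed integer.
bool FitsInInt32(std::uint64_t n) {
  return n <= static_cast<std::uint64_t>(
                  std::numeric_limits<std::int32_t>::max());
}

// Hypothetical helper: row-major offset of (row, col) in a dense matrix.
// All operands are unsigned 64-bit, so no signed overflow can occur.
std::uint64_t FlatIndex(std::uint64_t row, std::uint64_t col,
                        std::uint64_t n_cols) {
  assert(col < n_cols);
  return row * n_cols + col;
}
```

A row index of 3,000,000,000 does not fit in a 32-bit `int` (the maximum is 2,147,483,647), while `FlatIndex(3000000000, 1, 2)` evaluates to 6,000,000,001 without any overflow.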