Improved InitSampling function speed by 2.12 times #6410

Merged 2 commits on Dec 16, 2020


@RukhovichIV RukhovichIV commented Nov 19, 2020

This PR is related to #6411.
Discarding elements from the random generators takes up most of the running time of InitSampling.
Since libstdc++ doesn't provide any random engines with o(n) (little-o) discard complexity, the only thing we can optimize is the number of discarded elements.
std::bernoulli_distribution requires 64 bits of input, so the previous version had to discard twice as many elements as this one.
This small optimization speeds up InitSampling by ~2.12x, which translates into up to a 16% speedup of the whole training time when subsampling < 1.
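
The point about `std::bernoulli_distribution` needing 64 bits can be checked with a small experiment. The sketch below is not the code from this PR: `CountingEngine` and `CoinFlip32` are hypothetical helpers introduced only for illustration. It counts how many raw `std::mt19937` outputs each approach consumes. The exact count for `std::bernoulli_distribution` is implementation-dependent, so the code measures it instead of assuming a value.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical wrapper (not part of XGBoost): forwards to std::mt19937 and
// counts how many raw engine outputs have been consumed.
struct CountingEngine {
  using result_type = std::mt19937::result_type;
  static constexpr result_type min() { return std::mt19937::min(); }
  static constexpr result_type max() { return std::mt19937::max(); }
  result_type operator()() {
    ++calls;
    return engine();
  }
  std::mt19937 engine{42};
  std::size_t calls = 0;
};

// Hypothetical single-draw coin flip: compares one 32-bit output against
// the threshold p * 2^32, so it consumes exactly one engine output.
inline bool CoinFlip32(CountingEngine* rng, double p) {
  const auto threshold = static_cast<std::uint64_t>(p * 4294967296.0);
  return static_cast<std::uint64_t>((*rng)()) < threshold;
}

// Engine outputs consumed by n_samples draws of std::bernoulli_distribution.
std::size_t DrawsWithBernoulli(std::size_t n_samples, double p) {
  CountingEngine rng;
  std::bernoulli_distribution coin(p);
  std::size_t kept = 0;
  for (std::size_t i = 0; i < n_samples; ++i) kept += coin(rng) ? 1 : 0;
  (void)kept;  // only the draw count matters here
  return rng.calls;
}

// Engine outputs consumed by n_samples draws of the single-draw coin flip.
std::size_t DrawsWithCoinFlip32(std::size_t n_samples, double p) {
  CountingEngine rng;
  std::size_t kept = 0;
  for (std::size_t i = 0; i < n_samples; ++i) kept += CoinFlip32(&rng, p) ? 1 : 0;
  (void)kept;
  return rng.calls;
}
```

Comparing `DrawsWithBernoulli(n, p)` with `DrawsWithCoinFlip32(n, p)` on a given standard library shows how many engine outputs would have to be skipped per sample when a thread jumps ahead in the stream.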
The quality remains the same:

Mortgage dataset

| version | training time | RMSE |
| --- | --- | --- |
| original | 35.479 | 0.009271 |
| optimized | 30.538 | 0.009262 |

Santander dataset

| version | training time, s | InitData time, s | init improvement, % | Log Loss |
| --- | --- | --- | --- | --- |
| original | 249.196 | 28.784 | 0.00 | 0.16607 |
| optimized | 239.298 | 17.173 | 67.61 | 0.16610 |
| no discard | 224.347 | 11.221 | 156.51 | 0.16613 |

Higgs dataset

| version | training time, s | InitData time, s | init improvement, % | Log Loss |
| --- | --- | --- | --- | --- |
| original | 34.235 | 14.373 | 0.00 | 0.09339 |
| optimized | 29.201 | 8.316 | 72.84 | 0.09409 |
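
For context on why the discard cost matters, here is a minimal sketch of discard-based stream partitioning. It assumes a scheme where each worker copies the same seeded engine and skips ahead to its own chunk. The function names, the chunking, and the threshold comparison are assumptions made for illustration, not the InitSampling implementation. The chunks run in a plain loop here, but each iteration is independent and could run on its own thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: threshold for a one-draw coin flip, p * 2^32.
inline std::uint64_t ThresholdFor(double p) {
  return static_cast<std::uint64_t>(p * 4294967296.0);
}

// Reference: one engine, one raw output per row, rows visited in order.
std::vector<std::uint8_t> SampleSequential(std::uint32_t seed, std::size_t n_rows,
                                           double p) {
  const std::uint64_t threshold = ThresholdFor(p);
  std::mt19937 rng(seed);
  std::vector<std::uint8_t> mask(n_rows);
  for (std::size_t i = 0; i < n_rows; ++i) {
    mask[i] = static_cast<std::uint64_t>(rng()) < threshold ? 1 : 0;
  }
  return mask;
}

// Chunked variant (requires n_chunks > 0): every chunk starts from the same
// seed and discards the outputs that belong to earlier rows.
std::vector<std::uint8_t> SampleChunked(std::uint32_t seed, std::size_t n_rows,
                                        double p, std::size_t n_chunks) {
  const std::uint64_t threshold = ThresholdFor(p);
  const std::size_t chunk = (n_rows + n_chunks - 1) / n_chunks;
  std::vector<std::uint8_t> mask(n_rows);
  for (std::size_t c = 0; c < n_chunks; ++c) {
    const std::size_t begin = c * chunk;
    const std::size_t end = std::min(begin + chunk, n_rows);
    if (begin >= end) break;  // no rows left for this chunk
    std::mt19937 rng(seed);
    // One output per row, so skip `begin` outputs. A sampler that consumed
    // two outputs per row would have to skip 2 * begin here instead.
    rng.discard(begin);
    for (std::size_t i = begin; i < end; ++i) {
      mask[i] = static_cast<std::uint64_t>(rng()) < threshold ? 1 : 0;
    }
  }
  return mask;
}
```

Because every chunk reproduces the same stream positions, `SampleChunked` returns the same mask as `SampleSequential` for any chunk count, and the total number of discarded outputs scales with the number of engine outputs consumed per row.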

@hcho3 hcho3 merged commit 5c8ccf4 into dmlc:master Dec 16, 2020