
Removed discard from InitSampling #6411

Conversation

RukhovichIV
Contributor

@RukhovichIV RukhovichIV commented Nov 19, 2020

This PR is related to #6410
We can avoid discarding if we use a different seed for each generator. This is not quite the right approach, but the quality does not suffer much.

Mortgage dataset

| version | training time, s | improvement, % | RMSE |
| --- | --- | --- | --- |
| original | 35.479 | 0.00 | 0.00927 |
| optimized | 30.538 | 16.18 | 0.00926 |
| no discard | 26.994 | 31.43 | 0.00925 |

Santander dataset

| version | training time, s | InitData time, s | init improvement, % | Log Loss |
| --- | --- | --- | --- | --- |
| original | 249.196 | 28.784 | 0.00 | 0.16607 |
| optimized | 239.298 | 17.173 | 67.61 | 0.16610 |
| no discard | 224.347 | 11.221 | 156.51 | 0.16613 |

Higgs dataset

| version | training time, s | InitData time, s | init improvement, % | Log Loss |
| --- | --- | --- | --- | --- |
| original | 34.235 | 14.373 | 0.00 | 0.09339 |
| optimized | 29.201 | 8.316 | 72.84 | 0.09409 |
| no discard | 25.916 | 5.160 | 178.54 | 0.09475 |

@hcho3
Collaborator

hcho3 commented Dec 7, 2020

@RAMitchell Can you review this PR? To me, it seems reasonable to initialize each per-thread random generator with a number from the global random generator. I recall you saying that some parallel random number generators on GPUs take this approach.

@RAMitchell
Member

RAMitchell commented Dec 7, 2020

Some interesting reading on the subject: https://arxiv.org/pdf/0905.4238.pdf (see section 3).

According to this, it is neither recommended nor theoretically sound to seed parallel RNGs with another RNG. The document recommends either block discard or "leapfrogging" as the correct approach. However, in practice no standard library implements an efficient discard operation: https://stackoverflow.com/questions/47263584/which-c-random-number-engines-have-a-o1-discard-function. The RNG algorithms for which efficient discard operations are known (LCG/LFSR) are probably also much weaker statistically than the Mersenne Twister used here.

So if we have to discard inefficiently, there is no point in using threads at all, and the approach of this PR seems to be the only parallel approach possible with standard C++ today. Personally, I think the risk is acceptable for the xgboost use case, though probably not if we were doing Monte Carlo simulation.


@hcho3 hcho3 left a comment


Given @RAMitchell's comment, I would like to proceed with this PR.

@hcho3 hcho3 self-requested a review December 15, 2020 21:41
@@ -622,18 +580,6 @@ TEST(QuantileHist, InitData) {
maker_float.TestInitData();
}

TEST(QuantileHist, InitDataSampling) {
Collaborator


I am removing the unit test QuantileHist.InitDataSampling, since we no longer have the guarantee that using the same seed will lead to the same set of sampled rows. Changing the number of threads will now lead to a different sample, even when the seed is fixed.

Collaborator


@RAMitchell @trivialfis Can we live with the relaxed guarantee of reproducibility? Now the user needs to fix the seed as well as nthread to obtain fully reproducible results.

Member


In this case I don't think we should merge it; the reproducibility guarantee is important. Ideally we would have something like this: https://github.com/rabauke/trng4/blob/master/examples/pi_block_openmp.cc, but trng seems far too heavy a dependency for this purpose.

Collaborator


@RAMitchell Got it, let's review the alternative #6410 then, which does not break reproducibility.

@hcho3 hcho3 closed this Dec 16, 2020