-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removed discard from InitSampling #6411
Removed discard from InitSampling #6411
Conversation
@RAMitchell Can you review this PR? To me, it seems reasonable to initialize each per-thread random generator with a number from the global random generator. I recall you saying that some parallel random number generators on GPUs take this approach. |
Some interesting reading on the subject: https://arxiv.org/pdf/0905.4238.pdf (see section 3). According to this, it is not recommended or theoretically sound to seed parallel rngs with another rng. This document recommends either block discard or "leapfrogging" as the correct approach. However, in practice no standard library implements efficient discard operations: https://stackoverflow.com/questions/47263584/which-c-random-number-engines-have-a-o1-discard-function. The rng algorithms for which efficient discard algorithms are known (LCD/LSFS) are probably also much worse than the mersenne algorithm used here. So if we have to discard inefficiently, there is no point in using threads at all, and the approach of this PR seems to be the only parallel approach possible with standard C++ today. Personally I think the risk is acceptable for the xgboost use case, probably not if we were doing Monte Carlo simulation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given @RAMitchell's comment, I would like to proceed with this PR.
@@ -622,18 +580,6 @@ TEST(QuantileHist, InitData) { | |||
maker_float.TestInitData(); | |||
} | |||
|
|||
TEST(QuantileHist, InitDataSampling) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am removing the unit test QuantileHist.InitDataSampling
, since we no longer have the guarantee that using the same seed will lead to the same set of sampled rows. Changing the number of threads will now lead to a different sample, even when seed
is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RAMitchell @trivialfis Can we live with the relaxed guarantee of reproduciblity? Now the user needs to fix seed
as well as nthread
to obtain fully reproducible results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I don't think we should merge it, the reproducibility guarantee is important. Ideally we would have something like this https://github.com/rabauke/trng4/blob/master/examples/pi_block_openmp.cc, but trng seems way to heavy of a dependency for this purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RAMitchell Got it, let's review the alternative #6410 then, which does not break reproducibility.
This PR is connected with #6410
We can avoid discarding if use different seeds for each generator. This is not quite the right approach, but the quality does not suffer much.