
Removed discard from InitSampling #6411

Conversation

RukhovichIV
Contributor

@RukhovichIV RukhovichIV commented Nov 19, 2020

This PR is related to #6410
We can avoid discarding if we use a different seed for each generator. This is not quite the right approach, but the quality does not suffer much.

Mortgage dataset

| version | training time, s | improvement, % | RMSE |
| --- | --- | --- | --- |
| original | 35.479 | 0.00 | 0.00927 |
| optimized | 30.538 | 16.18 | 0.00926 |
| no discard | 26.994 | 31.43 | 0.00925 |

Santander dataset

| version | training time, s | InitData time, s | init improvement, % | Log Loss |
| --- | --- | --- | --- | --- |
| original | 249.196 | 28.784 | 0.00 | 0.16607 |
| optimized | 239.298 | 17.173 | 67.61 | 0.16610 |
| no discard | 224.347 | 11.221 | 156.51 | 0.16613 |

Higgs dataset

| version | training time, s | InitData time, s | init improvement, % | Log Loss |
| --- | --- | --- | --- | --- |
| original | 34.235 | 14.373 | 0.00 | 0.09339 |
| optimized | 29.201 | 8.316 | 72.84 | 0.09409 |
| no discard | 25.916 | 5.160 | 178.54 | 0.09475 |

@hcho3
Collaborator

hcho3 commented Dec 7, 2020

@RAMitchell Can you review this PR? To me, it seems reasonable to initialize each per-thread random generator with a number from the global random generator. I recall you saying that some parallel random number generators on GPUs take this approach.

@RAMitchell
Member

RAMitchell commented Dec 7, 2020

Some interesting reading on the subject: https://arxiv.org/pdf/0905.4238.pdf (see section 3).

According to this, it is neither recommended nor theoretically sound to seed parallel RNGs with another RNG. The document recommends either block discard or "leapfrogging" as the correct approach. However, in practice no standard library implements an efficient discard operation: https://stackoverflow.com/questions/47263584/which-c-random-number-engines-have-a-o1-discard-function. The RNG algorithms for which efficient discard operations are known (LCG/LFSR) are probably also much weaker statistically than the Mersenne Twister used here.

So if we have to discard inefficiently, there is no point in using threads at all, and the approach of this PR seems to be the only parallel approach possible with standard C++ today. Personally, I think the risk is acceptable for the xgboost use case, though probably not if we were doing Monte Carlo simulation.


@hcho3 hcho3 left a comment


Given @RAMitchell's comment, I would like to proceed with this PR.

@hcho3 hcho3 self-requested a review December 15, 2020 21:41
@@ -622,18 +580,6 @@ TEST(QuantileHist, InitData) {
maker_float.TestInitData();
}

TEST(QuantileHist, InitDataSampling) {
Collaborator


I am removing the unit test QuantileHist.InitDataSampling, since we no longer have the guarantee that using the same seed will lead to the same set of sampled rows. Changing the number of threads will now lead to a different sample, even when the seed is fixed.

Collaborator


@RAMitchell @trivialfis Can we live with the relaxed guarantee of reproducibility? Now the user needs to fix the seed as well as nthread to obtain fully reproducible results.

Member


In this case I don't think we should merge it; the reproducibility guarantee is important. Ideally we would have something like this: https://github.com/rabauke/trng4/blob/master/examples/pi_block_openmp.cc, but trng seems far too heavy a dependency for this purpose.

Collaborator


@RAMitchell Got it, let's review the alternative #6410 then, which does not break reproducibility.

@hcho3 hcho3 closed this Dec 16, 2020