
Reduce 'InitSampling' complexity and set gradients to zero #6922

Merged · 4 commits into dmlc:master from init_sampling_reduce_time_and_memory · May 28, 2021

Conversation

ShvetsKS
Contributor

@ShvetsKS ShvetsKS commented Apr 30, 2021

The discard operation of the MT19937 generator has linear complexity, so the current InitSampling implementation is inefficient due to unbalanced per-thread discards. This PR implements an approach similar to the GPU sampler:

  • generate a new seed for each iteration with the global mt19937 RNG;
  • apply a uniform distribution with default_random_engine (a linear_congruential_engine) seeded with the generated value;
  • set the gradient to zero for the i-th row when rnd(i) > p.
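A minimal sketch of the scheme above (illustrative only, not the actual xgboost code; the fixed block size and the `seed + block_begin` seeding are my assumptions): a global mt19937 produces one fresh seed per boosting iteration, and each fixed-size block of rows is processed by a cheap LCG seeded deterministically from that seed and the block start, so no expensive `discard()` on the Mersenne Twister is needed and the result does not depend on how blocks map to threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Mark rows kept by subsampling with probability p. The global mt19937
// supplies one seed per iteration; each fixed-size block of rows uses its
// own default_random_engine (an LCG on common implementations), seeded from
// the iteration seed and the block start, so the output is independent of
// the thread count and the block loop may be parallelized freely.
std::vector<std::uint8_t> SampleRows(std::size_t n_rows, float p,
                                     std::mt19937* global_rng) {
  const std::uint32_t iter_seed = (*global_rng)();
  std::vector<std::uint8_t> selected(n_rows, 0);
  const std::size_t kBlock = 4096;  // assumed block size, for illustration
  for (std::size_t begin = 0; begin < n_rows; begin += kBlock) {
    std::default_random_engine eng(iter_seed +
                                   static_cast<std::uint32_t>(begin));
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    const std::size_t end = std::min(begin + kBlock, n_rows);
    for (std::size_t i = begin; i < end; ++i) {
      // Row i is kept when rnd(i) < p; otherwise its gradient gets zeroed.
      selected[i] = dist(eng) < p ? 1 : 0;
    }
  }
  return selected;
}
```

Because every block reseeds its own engine, no engine state is shared across threads and no discard is required to skip ahead.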

Results for binary classification on synthetic data [61 x 10723651] with 'subsample': 0.8:

| Stage | Master | This PR |
| --- | --- | --- |
| InitSampling | 27.06s | 0.78s |
| ApplySplit | 10.25s | 11.83s |
| BuildLocalHistograms | 25.9s | 32.76s |
| UpdatePredictionCache | 14.26s | 2.84s |
| Full training | 87.8s | 58.9s |

The number of threads doesn't affect the generated sequence.

if (gpair[i].GetHess() >= 0.0f && rnds[tid]() < coin_flip_border) {
p_row_indices_used[ibegin + row_offsets_used[tid]++] = i;
} else {
if (!(gpair[i].GetHess() >= 0.0f && coin_flip(eng))) {
p_row_indices_unused[ibegin + row_offsets_unused[tid]++] = i;
Member

Since you have 0 gradient for ignored samples, is it still necessary for the partitioner to be aware of sampling?

Contributor Author

@trivialfis sorry for the late response (I was trying to fix the cpp/python tests).
The partitioner still has to be aware of the 'unused' rows to do UpdatePredictionCache correctly.

@ShvetsKS ShvetsKS marked this pull request as ready for review May 4, 2021 15:20
@trivialfis trivialfis self-requested a review May 8, 2021 04:49
@ShvetsKS
Contributor Author

@trivialfis could you take a look at this PR please?

Member

@trivialfis trivialfis left a comment


Sorry for the long delay. I think I don't quite understand the code in CPU hist now. Could you please simplify it down or do some refactoring first before piling up more code?

  • Why does the prediction cache update need to care about used rows? Since you have looked into the GPU impl, I think the approach there is simpler. WDYT?
  • Please try to avoid making special cases. Why does the number of trees have anything to do with sampling?
  • Why do you need to assign a special task on tid 1?
    ....

I think most of those are not needed. Thanks for optimizing it. But please consider making some cleanups before.

@@ -103,9 +103,9 @@ class RowSetCollection {
size_t* all_begin = dmlc::BeginPtr(row_indices_);
size_t* begin = all_begin + (e.begin - all_begin);

CHECK_EQ(n_left + n_right, e.Size());
CHECK_LE(n_left + n_right, e.Size());
Member

When is it less?

Contributor Author

reverted

src/tree/updater_quantile_hist.cc (outdated, resolved)
@@ -706,6 +729,8 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
unused_rows_.resize(info.num_row_);
size_t* p_row_indices_used = row_indices->data();
size_t* p_row_indices_unused = unused_rows_.data();
std::vector<GradientPair>& gpair_ref = const_cast<std::vector<GradientPair>&>(gpair);
Member

Can you just make a copy of gradient as a class member and use it through out the current iteration?

Contributor Author

A copy is an expensive operation to perform on every iteration, and there is no need to copy in the single-tree case (num_parallel_tree == 1).
So I simplified InitSampling (no dependency on the number of trees), and the copy is made only in the multi-tree case (num_parallel_tree != 1).

@@ -740,19 +761,34 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
const size_t ibegin = tid * discard_size;
const size_t iend = (tid == (nthread - 1)) ?
info.num_row_ : ibegin + discard_size;

rnds[tid].discard(discard_size * tid);
constexpr uint64_t kBase = 16807;
Member

Is this arbitrary?

Contributor Author

No, it's the multiplier used by std::minstd_rand0: https://en.cppreference.com/w/cpp/numeric/random/linear_congruential_engine
By the requirements for an LCG, the multiplier has to be a primitive root modulo n.
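For reference, std::minstd_rand0 is std::linear_congruential_engine<std::uint_fast32_t, 16807, 0, 2147483647>, so each step computes x_{n+1} = (16807 * x_n) mod (2^31 - 1). A small self-contained sketch of one such step (illustrative, not the PR code):

```cpp
#include <cstdint>
#include <random>

// One step of the classic MINSTD LCG: x_{n+1} = (a * x_n) mod m with
// a = 16807, c = 0, m = 2^31 - 1 (the parameters of std::minstd_rand0).
inline std::uint32_t MinStdStep(std::uint32_t x) {
  return static_cast<std::uint32_t>((16807ULL * x) % 2147483647ULL);
}
```

Iterating MinStdStep from the same seed reproduces the output of a std::minstd_rand0 engine exactly.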

Member

Can it be made into a more reusable module?

Contributor Author

Yes. I refactored the subsampling accordingly, thanks for the proposal!

p_row_indices_unused[ibegin + row_offsets_unused[tid]++] = i;
if (!(gpair[i].GetHess() >= 0.0f && coin_flip(eng)) || gpair[i].GetGrad() == 0.0f) {
p_row_indices_unused[ibegin + local_unused_offset++] = i;
if (is_single_tree) {
Member

Why is single tree different?

Contributor Author

For a single tree we can modify the initial gradient vector in place (setting unused rows to zero), while in the multi-tree case we have to restore the initial values for each tree of the iteration (the special case num_parallel_tree != 1). So the copy is made only in the multi-tree case, and this condition was removed from InitSampling.

} else {
for (auto rid : rid_span) {
if (pgh[rid*2] != 0) {
Member

?

Contributor Author

It makes it possible to reduce the work in the partition and histogram-building kernels, since unused rows are marked by zero gradients and can be excluded from the histogram build.

But I have reverted it for now, as the performance benefit from the refactored InitSampling and UpdatePredictionCache (no exact prediction for unused rows) is much bigger than the overhead in the Partition and BuildLocalHistograms kernels.
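For illustration, the skipped-row idea can be sketched as follows (hypothetical names, not the xgboost implementation; the check mirrors the `pgh[rid*2] != 0` test on the gradient above): rows whose gradient was zeroed by sampling contribute nothing to the per-bin sums, so the histogram kernel may skip them.

```cpp
#include <cstddef>
#include <vector>

struct GradPair { float grad; float hess; };

// Accumulate per-bin gradient sums over the given rows, skipping rows whose
// gradient was zeroed out by subsampling ('unused' rows).
void BuildHistSkipZero(const std::vector<std::size_t>& rows,
                       const std::vector<unsigned>& bin_of_row,
                       const std::vector<GradPair>& gpair,
                       std::vector<GradPair>* hist) {
  for (std::size_t rid : rows) {
    const GradPair& g = gpair[rid];
    if (g.grad == 0.0f) continue;  // sampled-out row, contributes nothing
    (*hist)[bin_of_row[rid]].grad += g.grad;
    (*hist)[bin_of_row[rid]].hess += g.hess;
  }
}
```

The trade-off discussed above is that the branch adds per-row overhead to the hot loop, which is why it was ultimately reverted.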

return {nleft_elems, nright_elems};
}

// Split row indexes (rid_span) to 2 parts (left_part, right_part) depending
// on comparison of indexes values (idx_span) and split point (split_cond).
// Handle sparse columns
template<bool default_left, typename BinIdxType>
template<bool default_left, typename BinIdxType, bool check_gradient = false>
Member

I don't see any check?

Contributor Author

deleted

@@ -706,6 +729,8 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
unused_rows_.resize(info.num_row_);
size_t* p_row_indices_used = row_indices->data();
size_t* p_row_indices_unused = unused_rows_.data();
std::vector<GradientPair>& gpair_ref = const_cast<std::vector<GradientPair>&>(gpair);
Member

Also, why do you need const cast?

Contributor Author

deleted

@ShvetsKS
Contributor Author

> Sorry for the long delay. I think I don't quite understand the code in CPU hist now. Could you please simplify it down or do some refactoring first before piling up more code?
>
>   • Why does the prediction cache update need to care about used rows? Since you have looked into the GPU impl, I think the approach there is simpler. WDYT?
>   • Please try to avoid making special cases. Why does the number of trees have anything to do with sampling?
>   • Why do you need to assign a special task on tid 1?
>     ....
>
> I think most of those are not needed. Thanks for optimizing it. But please consider making some cleanups before.

Thanks a lot for the review!

After applying your comments the code seems to have become more obvious and clear.

Regarding the prediction cache: there is no specific handling of 'used' rows, other than taking the leaf values from the latest partition.
For more effective partition and histogram-build kernels it's better to exclude the 'unused' row indices from partition_builder_ (this was done by checking (pgh[rid*2] != 0) at the root node). And because the 'unused' rows were excluded from the partition, we had to handle them explicitly in the UpdatePredictionCache kernel.
In the refactored implementation I simplified UpdatePredictionCache (removed the expensive handling of 'unused' rows); besides the simplicity, this brings a bigger performance benefit than excluding 'unused' rows from the partition and build-hist kernels.

@ShvetsKS ShvetsKS requested a review from trivialfis May 18, 2021 22:08
@ShvetsKS ShvetsKS force-pushed the init_sampling_reduce_time_and_memory branch 2 times, most recently from ba346d1 to 7b8562e on May 19, 2021 09:13
@ShvetsKS
Contributor Author

The failed tests seem unrelated to the current changes:
Windows build/test: an environment problem ("AssertionError: wheel_path = xgboost-1.5.0_SNAPSHOT-cp38-cp38-win_amd64.whl")
Linux test: a failure in /workspace/tests/python/test_linear.py, but it also fails on the master branch (checked locally; a similar problem here: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/master/858/pipeline)

@ShvetsKS ShvetsKS force-pushed the init_sampling_reduce_time_and_memory branch from 7b8562e to 3db70a6 on May 19, 2021 10:18
@trivialfis
Member

I'm running into it too. @hcho3 might have some insight.

@trivialfis
Member

Sorry for the long delay in reviewing this PR. I will have to look deeper into the current state of the CPU hist updater. My first ETA would be Monday next week. Thanks for your patience.

@ShvetsKS
Contributor Author

@trivialfis Thanks a lot for the update.
I'm also in the process of preparing a refactoring of the CPU hist updater: the first step is unifying the depthwise and lossguide strategies. But I would prefer to keep that work separate if possible, as the current PR doesn't hurt code simplicity and clarity.

@trivialfis
Member

Sure. I would love to see any code simplification. The part of this PR that confuses me is why the row partitioner has to know about unused rows. From my understanding, as long as the gradient is set to 0, the rest should play out without any specialized handling. Granted, if we keep track of the unused rows the performance might be better. But so far, looking at the code base, this optimization contributes a large portion of the complexity, and I'm not entirely sure whether it's worth it.

@ShvetsKS
Contributor Author

As you can see here: https://github.com/dmlc/xgboost/pull/6922/files#diff-010c75219801f6b68880c42b0138f3b28517a0addb3055d98539a430ef3f3222L657
the tracking of unused rows was deleted :)
We now rely on the fact that zero gradients don't affect the histogram calculation, while the partition is still applied to 'unused' rows without specialization (no big impact on performance).

@ShvetsKS
Contributor Author

ShvetsKS commented May 20, 2021

Maybe this PR should be divided into two parts?

  1. changing the type of random number generator to reduce the complexity of the discard operation
  2. setting gradients to zero for 'unused' rows

since most of the complexity of the current changes comes from setting the gradients to zero.

Member

@trivialfis trivialfis left a comment


> as most complexity of current changes is due to setting gradients to zero.

Do you have a branch of your refactoring code that you can share? I'm quite looking forward to it and to seeing how we can improve the current structure. We still have some time before the next release, so I think we don't have to rush the optimization PR. ;-)


@ShvetsKS
Contributor Author

@trivialfis

> as most complexity of current changes is due to setting gradients to zero.

> Do you have a branch of your refactoring code that you can share? I'm quite looking forward to it and to seeing how we can improve the current structure. We still have some time before the next release, so I think we don't have to rush the optimization PR. ;-)

I'll clean it up a little and push a draft PR merging the lossguide/depthwise strategies tomorrow.

But I think this PR has become simple enough to be merged before the refactoring. And it would be simpler to have this PR finished as a base for the refactoring process (the new sampling and the zeroed gradients should be taken into account during the refactoring).

@ShvetsKS ShvetsKS requested a review from trivialfis May 27, 2021 21:05
@trivialfis
Member

Thanks, will revisit today. ;-)

@ShvetsKS
Contributor Author

small refactoring: #7007

Member

@trivialfis trivialfis left a comment


Thanks for working on the optimization!

@trivialfis
Member

Hi, out of curiosity, what tools do you use for profiling? I have been using Linux perf, Valgrind, and the monitor in XGBoost, and they seem to contradict each other...

@trivialfis trivialfis merged commit 55b823b into dmlc:master May 28, 2021
@ShvetsKS
Contributor Author

> Hi, out of curiosity, what tools do you use for profiling? I have been using Linux perf, Valgrind, and the monitor in XGBoost, and they seem to contradict each other...

Basically I used the VTune Performance Analyzer to take a deeper look at the observed hotspots.

@trivialfis
Member

Thanks for sharing!
