
Reduce 'InitSampling' complexity and set gradients to zero #6922

Merged · 4 commits into dmlc:master from init_sampling_reduce_time_and_memory · May 28, 2021

Conversation

ShvetsKS
Contributor

@ShvetsKS ShvetsKS commented Apr 30, 2021

The discard operation of the MT19937 generator has linear complexity, so the current InitSampling implementation is inefficient due to unbalanced per-thread discards. This PR implements an approach similar to the GPU sampler:

  • generate a new seed for each iteration with the global mt19937 RNG;
  • apply a uniform distribution with default_random_engine (a linear_congruential_engine) seeded with the generated value;
  • set the gradient to zero for the i-th row when rnd(i) > p.
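A minimal sketch of the scheme above (illustrative only, not the actual xgboost code; the fixed block size and the `seed + block_begin` seeding are my assumptions): a global mt19937 produces one fresh seed per boosting iteration, and each fixed-size block of rows is processed by a cheap LCG seeded deterministically from that seed and the block start, so no expensive `discard()` on the Mersenne Twister is needed and the result does not depend on how blocks map to threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Mark rows kept by subsampling with probability p. The global mt19937
// supplies one seed per iteration; each fixed-size block of rows uses its
// own default_random_engine (an LCG on common implementations), seeded from
// the iteration seed and the block start, so the output is independent of
// the thread count and the block loop may be parallelized freely.
std::vector<std::uint8_t> SampleRows(std::size_t n_rows, float p,
                                     std::mt19937* global_rng) {
  const std::uint32_t iter_seed = (*global_rng)();
  std::vector<std::uint8_t> selected(n_rows, 0);
  const std::size_t kBlock = 4096;  // assumed block size, for illustration
  for (std::size_t begin = 0; begin < n_rows; begin += kBlock) {
    std::default_random_engine eng(iter_seed +
                                   static_cast<std::uint32_t>(begin));
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    const std::size_t end = std::min(begin + kBlock, n_rows);
    for (std::size_t i = begin; i < end; ++i) {
      // Row i is kept when rnd(i) < p; otherwise its gradient gets zeroed.
      selected[i] = dist(eng) < p ? 1 : 0;
    }
  }
  return selected;
}
```

Because every block reseeds its own engine, no engine state is shared across threads and no discard is required to skip ahead.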

Results for binary classification on synthetic data [61 x 10723651] with 'subsample': 0.8:

| Stage | Master | This PR |
| --- | --- | --- |
| InitSampling | 27.06s | 0.78s |
| ApplySplit | 10.25s | 11.83s |
| BuildLocalHistograms | 25.9s | 32.76s |
| UpdatePredictionCache | 14.26s | 2.84s |
| Full training | 87.8s | 58.9s |

The number of threads doesn't affect the generated sequence.

if (gpair[i].GetHess() >= 0.0f && rnds[tid]() < coin_flip_border) {
p_row_indices_used[ibegin + row_offsets_used[tid]++] = i;
} else {
if (!(gpair[i].GetHess() >= 0.0f && coin_flip(eng))) {
p_row_indices_unused[ibegin + row_offsets_unused[tid]++] = i;
Member

Since you have 0 gradient for ignored samples, is it still necessary for the partitioner to be aware of sampling?

Contributor Author

@trivialfis sorry for the late response (I was trying to fix the cpp/python tests).
The partitioner still has to be aware of the 'unused' rows to do UpdatePredictionCache correctly.

@ShvetsKS ShvetsKS marked this pull request as ready for review May 4, 2021 15:20
@trivialfis trivialfis self-requested a review May 8, 2021 04:49
@ShvetsKS
Contributor Author

@trivialfis could you take a look at this PR please?

Member

@trivialfis trivialfis left a comment


Sorry for the long delay. I think I don't quite understand the code in CPU hist now. Could you please simplify it down or do some refactoring first before piling up more code?

  • Why does the prediction cache update need to care about used rows? Since you have looked into the GPU impl, I think the approach there is simpler. WDYT?
  • Please try to avoid making special cases. Why does the number of trees have anything to do with sampling?
  • Why do you need to assign a special task on tid 1?
    ....

I think most of those are not needed. Thanks for optimizing it. But please consider making some cleanups before.

@@ -103,9 +103,9 @@ class RowSetCollection {
size_t* all_begin = dmlc::BeginPtr(row_indices_);
size_t* begin = all_begin + (e.begin - all_begin);

CHECK_EQ(n_left + n_right, e.Size());
CHECK_LE(n_left + n_right, e.Size());
Member

When is it less?

Contributor Author

reverted

src/tree/updater_quantile_hist.cc (outdated, resolved)
@@ -706,6 +729,8 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
unused_rows_.resize(info.num_row_);
size_t* p_row_indices_used = row_indices->data();
size_t* p_row_indices_unused = unused_rows_.data();
std::vector<GradientPair>& gpair_ref = const_cast<std::vector<GradientPair>&>(gpair);
Member

Can you just make a copy of gradient as a class member and use it through out the current iteration?

Contributor Author

A copy is an expensive operation to perform on every iteration, and there is no need to copy in the single-tree case (num_parallel_tree == 1).
So I simplified InitSampling (no dependency on the number of trees), and the copy is made only in the multi-tree case (num_parallel_tree != 1).

@@ -740,19 +761,34 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
const size_t ibegin = tid * discard_size;
const size_t iend = (tid == (nthread - 1)) ?
info.num_row_ : ibegin + discard_size;

rnds[tid].discard(discard_size * tid);
constexpr uint64_t kBase = 16807;
Member

Is this arbitrary?

Contributor Author

No, it's the multiplier used by std::minstd_rand0: https://en.cppreference.com/w/cpp/numeric/random/linear_congruential_engine
By the requirements for an LCG, the multiplier has to be a primitive root modulo n.
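For reference, std::minstd_rand0 is std::linear_congruential_engine<std::uint_fast32_t, 16807, 0, 2147483647>, so each step computes x_{n+1} = (16807 * x_n) mod (2^31 - 1). A small self-contained sketch of one such step (illustrative, not the PR code):

```cpp
#include <cstdint>
#include <random>

// One step of the classic MINSTD LCG: x_{n+1} = (a * x_n) mod m with
// a = 16807, c = 0, m = 2^31 - 1 (the parameters of std::minstd_rand0).
inline std::uint32_t MinStdStep(std::uint32_t x) {
  return static_cast<std::uint32_t>((16807ULL * x) % 2147483647ULL);
}
```

Iterating MinStdStep from the same seed reproduces the output of a std::minstd_rand0 engine exactly.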

Member

Can it be made into a more reusable module?

Contributor Author

Yes. I refactored the subsampling accordingly, thanks for the proposal!

p_row_indices_unused[ibegin + row_offsets_unused[tid]++] = i;
if (!(gpair[i].GetHess() >= 0.0f && coin_flip(eng)) || gpair[i].GetGrad() == 0.0f) {
p_row_indices_unused[ibegin + local_unused_offset++] = i;
if (is_single_tree) {
Member

Why is single tree different?

Contributor Author

For a single tree we can modify the initial gradient vector in place (setting unused rows to zero), while in the multi-tree case we have to restore the initial values for each tree of the iteration (the special case num_parallel_tree != 1). So the copy is made only in the multi-tree case, and this condition was removed from InitSampling.

} else {
for (auto rid : rid_span) {
if (pgh[rid*2] != 0) {
Member

?

Contributor Author

It makes it possible to reduce the work in the partition and histogram-building kernels, since unused rows are marked by zero gradients and can be excluded from the histogram build.

But I have reverted it for now, as the performance benefit from the refactored InitSampling and UpdatePredictionCache (no exact prediction for unused rows) is much bigger than the overhead in the Partition and BuildLocalHistograms kernels.
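For illustration, the skipped-row idea can be sketched as follows (hypothetical names, not the xgboost implementation; the check mirrors the `pgh[rid*2] != 0` test on the gradient above): rows whose gradient was zeroed by sampling contribute nothing to the per-bin sums, so the histogram kernel may skip them.

```cpp
#include <cstddef>
#include <vector>

struct GradPair { float grad; float hess; };

// Accumulate per-bin gradient sums over the given rows, skipping rows whose
// gradient was zeroed out by subsampling ('unused' rows).
void BuildHistSkipZero(const std::vector<std::size_t>& rows,
                       const std::vector<unsigned>& bin_of_row,
                       const std::vector<GradPair>& gpair,
                       std::vector<GradPair>* hist) {
  for (std::size_t rid : rows) {
    const GradPair& g = gpair[rid];
    if (g.grad == 0.0f) continue;  // sampled-out row, contributes nothing
    (*hist)[bin_of_row[rid]].grad += g.grad;
    (*hist)[bin_of_row[rid]].hess += g.hess;
  }
}
```

The trade-off discussed above is that the branch adds per-row overhead to the hot loop, which is why it was ultimately reverted.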

return {nleft_elems, nright_elems};
}

// Split row indexes (rid_span) to 2 parts (left_part, right_part) depending
// on comparison of indexes values (idx_span) and split point (split_cond).
// Handle sparse columns
template<bool default_left, typename BinIdxType>
template<bool default_left, typename BinIdxType, bool check_gradient = false>
Member

I don't see any check?

Contributor Author

deleted

@@ -706,6 +729,8 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
unused_rows_.resize(info.num_row_);
size_t* p_row_indices_used = row_indices->data();
size_t* p_row_indices_unused = unused_rows_.data();
std::vector<GradientPair>& gpair_ref = const_cast<std::vector<GradientPair>&>(gpair);
Member

Also, why do you need const cast?

Contributor Author

deleted

@ShvetsKS
Contributor Author

> Sorry for the long delay. I think I don't quite understand the code in CPU hist now. Could you please simplify it down or do some refactoring first before piling up more code?
>
>   • Why does the prediction cache update need to care about used rows? Since you have looked into the GPU impl, I think the approach there is simpler. WDYT?
>   • Please try to avoid making special cases. Why does the number of trees have anything to do with sampling?
>   • Why do you need to assign a special task on tid 1?
>     ....
>
> I think most of those are not needed. Thanks for optimizing it. But please consider making some cleanups before.

Thanks a lot for the review!

After applying your comments the code seems to have become more obvious and clear.

Regarding the prediction cache: there is no specific handling of 'used' rows, other than taking the leaf values from the latest partition.
For more effective partition and histogram-build kernels it's better to exclude the 'unused' row indices from partition_builder_ (this was done by checking (pgh[rid*2] != 0) at the root node). And because the 'unused' rows were excluded from the partition, we had to handle them explicitly in the UpdatePredictionCache kernel.
In the refactored implementation I simplified UpdatePredictionCache (removed the expensive handling of 'unused' rows); besides the simplicity, this brings a bigger performance benefit than excluding 'unused' rows from the partition and build-hist kernels.

@ShvetsKS ShvetsKS requested a review from trivialfis May 18, 2021 22:08
@ShvetsKS ShvetsKS force-pushed the init_sampling_reduce_time_and_memory branch 2 times, most recently from ba346d1 to 7b8562e on May 19, 2021 09:13
@ShvetsKS
Contributor Author

The failed tests seem unrelated to the current changes:
Windows build/test: an environment problem ("AssertionError: wheel_path = xgboost-1.5.0_SNAPSHOT-cp38-cp38-win_amd64.whl")
Linux test: a failure in /workspace/tests/python/test_linear.py, but it also fails on the master branch (checked locally; a similar problem here: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/master/858/pipeline)

@ShvetsKS ShvetsKS force-pushed the init_sampling_reduce_time_and_memory branch from 7b8562e to 3db70a6 on May 19, 2021 10:18
@trivialfis
Member

I'm running into it too. @hcho3 might have some insight.

@trivialfis
Member

Sorry for the long delay in reviewing this PR. I will have to look deeper into the current state of the CPU hist updater. My first ETA would be Monday next week. Thanks for your patience.

@ShvetsKS
Contributor Author

@trivialfis Thanks a lot for the update.
I'm also in the process of preparing a refactoring of the CPU hist updater: the first step is unifying the depthwise and lossguide strategies. But I would prefer to keep that work separate if possible, as the current PR doesn't hurt code simplicity and clarity.

@trivialfis
Member

Sure. I would love to see any code simplification. The part of this PR that confuses me is why the row partitioner has to know about unused rows. From my understanding, as long as the gradient is set to 0, the rest should play out without any specialized handling. Granted, if we keep track of the unused rows the performance might be better. But so far, looking at the code base, this optimization contributes a large portion of the complexity, and I'm not entirely sure whether it's worth it.

@ShvetsKS
Contributor Author

As you can see here: https://github.com/dmlc/xgboost/pull/6922/files#diff-010c75219801f6b68880c42b0138f3b28517a0addb3055d98539a430ef3f3222L657
the tracking of unused rows was deleted :)
We now rely on the fact that zero gradients don't affect the histogram calculation, while the partition is still applied to 'unused' rows without specialization (no big impact on performance).

@ShvetsKS
Contributor Author

ShvetsKS commented May 20, 2021

Maybe this PR should be divided into two parts?

  1. changing the type of random number generator to reduce the complexity of the discard operation
  2. setting gradients to zero for 'unused' rows

since most of the complexity of the current changes comes from setting the gradients to zero.

Member

@trivialfis trivialfis left a comment


> as most complexity of current changes is due to setting gradients to zero.

Do you have a branch of your refactoring code that you can share? I'm quite looking forward to it and to seeing how we can improve the current structure. We still have some time before the next release, so I think we don't have to rush the optimization PR. ;-)


@ShvetsKS
Contributor Author

@trivialfis

> as most complexity of current changes is due to setting gradients to zero.

> Do you have a branch of your refactoring code that you can share? I'm quite looking forward to it and to seeing how we can improve the current structure. We still have some time before the next release, so I think we don't have to rush the optimization PR. ;-)

I'll clean it up a little and push a draft PR merging the lossguide/depthwise strategies tomorrow.

But I think this PR has become simple enough to be merged before the refactoring. And it would be simpler to have this PR finished as a base for the refactoring process (the new sampling and the zeroed gradients should be taken into account during the refactoring).

@ShvetsKS ShvetsKS requested a review from trivialfis May 27, 2021 21:05
@trivialfis
Member

Thanks, will revisit today. ;-)

@ShvetsKS
Contributor Author

small refactoring: #7007

Member

@trivialfis trivialfis left a comment


Thanks for working on the optimization!

@trivialfis
Member

Hi, out of curiosity, what tools do you use for profiling? I have been using Linux perf, Valgrind, and the monitor in XGBoost, and they seem to contradict each other...

@trivialfis trivialfis merged commit 55b823b into dmlc:master May 28, 2021
@ShvetsKS
Contributor Author

> Hi, out of curiosity, what tools do you use for profiling? I have been using Linux perf, Valgrind, and the monitor in XGBoost, and they seem to contradict each other...

Basically I used the VTune Performance Analyzer to take a deeper look at the observed hotspots.

@trivialfis
Member

Thanks for sharing!
