Reduce 'InitSampling' complexity and set gradients to zero #6922
Conversation
src/tree/updater_quantile_hist.cc (outdated)

    -      if (gpair[i].GetHess() >= 0.0f && rnds[tid]() < coin_flip_border) {
    -        p_row_indices_used[ibegin + row_offsets_used[tid]++] = i;
    -      } else {
    +      if (!(gpair[i].GetHess() >= 0.0f && coin_flip(eng))) {
             p_row_indices_unused[ibegin + row_offsets_unused[tid]++] = i;
Since you have 0 gradient for ignored samples, is it still necessary for the partitioner to be aware of sampling?
@trivialfis sorry for the long response (I was trying to fix the cpp/python tests).
The partitioner still has to be aware of 'unused' rows to do UpdatePredictionCache correctly.
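A minimal sketch of what "aware of unused rows" means for the cache update (illustrative names such as UpdatePredictionCacheSketch, not the PR's actual code): the partitions only contain rows that participated in training, so only those rows can receive their leaf value from the partition directly; rows excluded by subsampling never appear in any partition and must be tracked separately.

    #include <cstddef>
    #include <vector>

    // Hypothetical sketch: each leaf's partition lists the rows that ended up
    // in that leaf, and their cached predictions are bumped by the leaf weight.
    void UpdatePredictionCacheSketch(
        const std::vector<int>& leaf_ids,                        // node id per leaf
        const std::vector<std::vector<std::size_t>>& leaf_rows,  // rows per leaf
        const std::vector<float>& leaf_weights,                  // indexed by node id
        float learning_rate,
        std::vector<float>* out_preds) {
      for (std::size_t k = 0; k < leaf_ids.size(); ++k) {
        const float w = learning_rate * leaf_weights[leaf_ids[k]];
        for (std::size_t rid : leaf_rows[k]) {
          (*out_preds)[rid] += w;  // unused rows are never touched here
        }
      }
    }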
@trivialfis could you take a look at this PR please?
Sorry for the long delay. I don't quite understand the code in CPU hist now. Could you please simplify it or do some refactoring first before piling up more code?
- Why does the prediction cache update need to care about used rows? Since you have looked into the GPU impl, I think the approach there is simpler. WDYT?
- Please try to avoid making special cases. Why does the number of trees have anything to do with sampling?
- Why do you need to assign a special task to tid 1?
...
I think most of those are not needed. Thanks for optimizing it. But please consider making some cleanups first.
src/common/row_set.h (outdated)

    @@ -103,9 +103,9 @@ class RowSetCollection {
       size_t* all_begin = dmlc::BeginPtr(row_indices_);
       size_t* begin = all_begin + (e.begin - all_begin);

    -  CHECK_EQ(n_left + n_right, e.Size());
    +  CHECK_LE(n_left + n_right, e.Size());
When is it less?
reverted
src/tree/updater_quantile_hist.cc (outdated)

    @@ -706,6 +729,8 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
       unused_rows_.resize(info.num_row_);
       size_t* p_row_indices_used = row_indices->data();
       size_t* p_row_indices_unused = unused_rows_.data();
    +  std::vector<GradientPair>& gpair_ref = const_cast<std::vector<GradientPair>&>(gpair);
Can you just make a copy of the gradient as a class member and use it throughout the current iteration?
Copying is an expensive operation to perform on each iteration, and hopefully there is no need to copy in the single-tree case (when num_parallel_tree == 1).
So I simplified InitSampling (no dependency on the number of trees), and the copy is done only in the multi-tree case (num_parallel_tree != 1).
src/tree/updater_quantile_hist.cc (outdated)

    @@ -740,19 +761,34 @@ void QuantileHistMaker::Builder<GradientSumT>::InitSampling(const std::vector<Gr
       const size_t ibegin = tid * discard_size;
       const size_t iend = (tid == (nthread - 1)) ?
                           info.num_row_ : ibegin + discard_size;

    -  rnds[tid].discard(discard_size * tid);
    +  constexpr uint64_t kBase = 16807;
Is this arbitrary?
No, it's the multiplier used by minstd_rand0: https://en.cppreference.com/w/cpp/numeric/random/linear_congruential_engine
By the requirements for an LCG we have to use a primitive root modulo n.
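For context, a minimal sketch of why an LCG with this base helps here (SimpleLcg is an illustrative class, not the PR's actual implementation): unlike std::mt19937, whose discard(n) costs O(n), a multiplicative LCG can jump ahead n states in O(log n) by modular exponentiation, so every thread can cheaply seek to its own slice of one logical stream.

    #include <cstdint>

    // Multiplicative LCG with minstd_rand0 parameters:
    // x <- (16807 * x) mod (2^31 - 1).
    class SimpleLcg {
     public:
      static constexpr uint64_t kBase = 16807;         // primitive root modulo kMod
      static constexpr uint64_t kMod = 2147483647ULL;  // Mersenne prime 2^31 - 1

      explicit SimpleLcg(uint64_t seed) : state_(seed % kMod) {
        if (state_ == 0) state_ = 1;  // 0 is a fixed point of x -> a*x mod m
      }

      uint64_t operator()() {
        state_ = (kBase * state_) % kMod;  // product stays below 2^62, no overflow
        return state_;
      }

      // Jump ahead n states in O(log n): x_{k+n} = (kBase^n mod kMod) * x_k mod kMod.
      void Discard(uint64_t n) {
        uint64_t mult = 1, base = kBase;
        while (n > 0) {
          if (n & 1) mult = (mult * base) % kMod;
          base = (base * base) % kMod;
          n >>= 1;
        }
        state_ = (mult * state_) % kMod;
      }

     private:
      uint64_t state_;
    };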
Can it be made into a more reusable module?
Yes, I applied the refactoring for subsampling, thanks for the proposal!
src/tree/updater_quantile_hist.cc
Outdated
p_row_indices_unused[ibegin + row_offsets_unused[tid]++] = i; | ||
if (!(gpair[i].GetHess() >= 0.0f && coin_flip(eng)) || gpair[i].GetGrad() == 0.0f) { | ||
p_row_indices_unused[ibegin + local_unused_offset++] = i; | ||
if (is_single_tree) { |
Why is single tree different?
For a single tree we can modify the initial gradient vector itself (setting it to zero for unused rows), while in the multi-tree case we have to restore the initial values for each tree built within the iteration (the special case num_parallel_tree != 1). So the copy is done only in the multi-tree case, and this condition was removed from InitSampling.
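A hedged sketch of the behaviour described above (InitSamplingSketch, the stand-in gradient type, and the backup vector are illustrative, not the PR's actual members): unused rows are marked by zeroing their gradient pairs in place, and a backup copy is taken only when more than one tree is built per iteration.

    #include <cstdint>
    #include <random>
    #include <vector>

    struct GradientPairSketch { float grad{0.f}, hess{0.f}; };  // stand-in type

    void InitSamplingSketch(std::vector<GradientPairSketch>* gpair,
                            std::vector<GradientPairSketch>* backup,
                            float subsample, bool is_single_tree,
                            std::uint64_t seed) {
      if (!is_single_tree) {
        // num_parallel_tree != 1: keep a copy so the original gradients can be
        // restored before the next tree of the same iteration is built.
        *backup = *gpair;
      }
      std::mt19937 eng(seed);
      std::bernoulli_distribution coin_flip(subsample);
      for (auto& g : *gpair) {
        if (!coin_flip(eng)) {
          g = GradientPairSketch{};  // zero gradient marks the row as unused
        }
      }
    }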
src/tree/updater_quantile_hist.cc
Outdated
} else { | ||
for (auto rid : rid_span) { | ||
if (pgh[rid*2] != 0) { |
?
It makes it possible to reduce the work in the partition and histogram-building kernels, since unused rows are marked by zero gradients: we can exclude those rows from the histogram-building process.
But I have reverted it for now, as the performance benefits from the refactored InitSampling and UpdatePredictionCache (no exact prediction for unused rows) are much bigger than the overheads in the Partition and BuildLocalHistograms kernels.
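A sketch of the (now reverted) idea, with illustrative names (BuildHistSketch is not XGBoost's actual kernel): since pgh stores interleaved gradient/hessian values per row, rows zeroed by sampling can simply be skipped while accumulating the histogram.

    #include <cstddef>
    #include <vector>

    void BuildHistSketch(const std::vector<float>& pgh,            // interleaved grad/hess per row
                         const std::vector<std::size_t>& rid_span, // rows of this node
                         const std::vector<unsigned>& bin_idx,     // one bin per row
                         std::vector<double>* hist) {              // interleaved sums per bin
      for (std::size_t rid : rid_span) {
        if (pgh[rid * 2] == 0.0f && pgh[rid * 2 + 1] == 0.0f) {
          continue;  // unused row: contributes nothing to any bin
        }
        const unsigned bin = bin_idx[rid];
        (*hist)[bin * 2] += pgh[rid * 2];          // gradient sum
        (*hist)[bin * 2 + 1] += pgh[rid * 2 + 1];  // hessian sum
      }
    }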
src/tree/updater_quantile_hist.cc
Outdated
return {nleft_elems, nright_elems}; | ||
} | ||
|
||
// Split row indexes (rid_span) to 2 parts (left_part, right_part) depending | ||
// on comparison of indexes values (idx_span) and split point (split_cond). | ||
// Handle sparse columns | ||
template<bool default_left, typename BinIdxType> | ||
template<bool default_left, typename BinIdxType, bool check_gradient = false> |
I don't see any check?
deleted
src/tree/updater_quantile_hist.cc (outdated)

    +  std::vector<GradientPair>& gpair_ref = const_cast<std::vector<GradientPair>&>(gpair);
Also, why do you need const cast?
deleted
Thanks a lot for the review! It seems that after applying your comments the code has become more obvious and clear. Regarding the prediction cache, there is no specific handling for 'used' rows, except taking the leaf values from the latest partition.
(force-pushed from ba346d1 to 7b8562e)
It seems the failed tests are not related to the current changes:
(force-pushed from 7b8562e to 3db70a6)
I'm running into it too. @hcho3 might have some insight.
Sorry for the long delay in reviewing this PR. I will have to look deeper into the current status of CPU hist. First ETA would be Monday next week. Thanks for the patience.
@trivialfis Thanks a lot for the update.
Sure, I would love to see any code simplification. The part of this PR that confuses me is why the row partitioner has to know about unused rows. From my understanding, as long as the gradient is set to 0, the rest should play out without any specialized handling. Granted, if we keep track of the unused rows the performance might be better. But so far, looking at the code base, this optimization contributes a large portion of the complexity and I'm not entirely sure whether it's worth it.
As you can see: https://github.com/dmlc/xgboost/pull/6922/files#diff-010c75219801f6b68880c42b0138f3b28517a0addb3055d98539a430ef3f3222L657
Maybe this PR should be divided into two parts, as most of the complexity of the current changes is due to setting gradients to zero.
> as most of the complexity of the current changes is due to setting gradients to zero.
Do you have a branch of your refactoring code that you can share? I'm quite looking forward to it and to seeing how we can improve the current structure. We still have some time before the next release, so I think we don't have to rush the optimization PR. ;-)
I'll clean it up a little and push a draft PR with merged lossguide/depthwise strategies tomorrow. But I think this PR has become simple enough to be merged before the refactoring. And it would be simpler to have this PR finished as the base for the refactoring process (the new sampling and setting gradients to zero should be taken into account during refactoring).
Thanks, will revisit today. ;-)
Small refactoring: #7007
Thanks for working on the optimization!
Hi, out of curiosity, what tools do you use for profiling? I have been using Linux perf, Valgrind, and the monitor in XGBoost; they seem to contradict each other...
Basically I used the VTune Performance Analyzer to have a deeper look at the observed hotspots.
Thanks for sharing!
The complexity of the discard operation of the MT (Mersenne Twister) generator is linear, so the current InitSampling implementation is not efficient due to the unbalanced per-thread discards. In this PR an approach similar to the GPU sampler was implemented.
Results for binary classification on synthetic data [61 x 10723651] with 'sample': 0.8:
The number of threads doesn't affect the generated sequence.
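A sketch of why the result is thread-count independent (it reuses the SimpleLcg sketch from earlier; MarkRows, the flag vector, and the OpenMP setup are illustrative assumptions, not the PR's actual code): each thread jumps straight to its own offset in one logical random stream, so row i always consumes the i-th random number no matter how many threads run.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <omp.h>

    void MarkRows(std::size_t num_row, int nthread, float subsample,
                  std::uint64_t seed, std::vector<char>* used) {
      const std::size_t chunk = (num_row + nthread - 1) / nthread;
      const uint64_t border = static_cast<uint64_t>(subsample * SimpleLcg::kMod);
      #pragma omp parallel num_threads(nthread)
      {
        const int tid = omp_get_thread_num();
        const std::size_t ibegin = tid * chunk;
        const std::size_t iend = std::min(num_row, ibegin + chunk);
        SimpleLcg rnd(seed);
        rnd.Discard(ibegin);  // O(log ibegin) jump: cheap and balanced across threads
        for (std::size_t i = ibegin; i < iend; ++i) {
          (*used)[i] = rnd() < border;  // mirrors rnds[tid]() < coin_flip_border
        }
      }
    }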