[proposed change] distinguish loss-guide and depth-guide in fast histogram #4077
Comments
Do you have some numbers on what the performance difference might be? I have generally tried to avoid specialising our implementation for loss-guided vs. depthwise growth for maintenance reasons. It could be justified if the performance difference is huge. |
With our internal dataset, when the local computation is intensive, we got a 2x+ speedup with the hist algorithm. However, when I applied colsample_bytree to make local computation lighter, hist performance didn't change, while approx improved a lot and surpassed hist. Looking at the profiling data, 96% of the time is spent in BuildHist in hist. |
This also affects multi-core CPU performance, since the amount of parallelism is limited when only a few data points end up at a node. See the discussion at hcho3/xgboost-fast-hist-perf-lab#15. Originally, when I designed the fast histogram algorithm, I intended a fully generic method for specifying the order of node expansion, but in practice only two choices are available (lossguide vs depthwise); see xgboost/src/tree/updater_quantile_hist.h, lines 178 to 191 at commit 4fac987.
If we just take depthwise, I think we can increase the amount of parallelism for multi-core CPUs and improve scalability. |
Maybe a logical next step is to try re-writing the histogram construction code (snippet available at hcho3/xgboost-fast-hist-perf-lab) so that histogram aggregation occurs level by level. Then we can measure the performance impact. |
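To make the idea concrete, here is a minimal sketch of level-by-level aggregation. The names `Hist`, `BuildHist`, and `BuildLevelHistograms` are hypothetical and only illustrate the shape of the loop, not the actual updater code:

```cpp
// Illustrative sketch only: build histograms for every node of a tree
// level inside one parallel region, so multi-core utilisation does not
// collapse when individual nodes hold only a few rows.
#include <vector>

struct Hist {
  std::vector<double> bins;  // flattened gradient/hessian statistics per bin
};

// Placeholder accumulation kernel; the real code walks the rows assigned
// to `node_id` and adds their gradient statistics into `hist`.
void BuildHist(int node_id, Hist* hist) {
  for (double& b : hist->bins) {
    b += static_cast<double>(node_id);
  }
}

void BuildLevelHistograms(const std::vector<int>& nodes_in_level,
                          std::vector<Hist>* hists) {
  // One OpenMP loop over all nodes of the level exposes as many
  // independent tasks as there are nodes, instead of a single node.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(nodes_in_level.size()); ++i) {
    BuildHist(nodes_in_level[i], &(*hists)[i]);
  }
}
```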
@hcho3 I think ideally we should have a reconstruction of the whole algorithm to cover both cases and at the same time grow level by level. I am not sure we have enough resources to finish that, and take the risk of a buggy re-implementation, in the next (coming) release. As a short-term solution I am going to make them two separate functions (even one for distributed and one for local) to ensure that the next release contains a performance-acceptable distributed fast histogram. In the release after that, as we talked offline, when multi-core scalability becomes a major release milestone, we can give it another shot. |
So, I have a prototype (please ignore the obvious copy-paste code for now, I will revise it) at CodingCat#8 (the diff is over the original distributed fast histogram implementation in #4011). The changes in that PR bring another 1x speedup compared to the original distributed fast histogram algorithm, and a 10-15% speedup compared to approx when I set colsample_bytree for an internal use case.

Details: the changes are to (1) not create histograms for unselected features, and (2) sync histograms per level.

But this 10-15% speedup is not exciting enough; I think we should focus on (1) improving the quality of rabit (out of maintenance for a while) and (2) improving the multi-core performance of fast histogram in the near future. |
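For illustration, change (1) could look roughly like the sketch below; `feature_is_selected` and `kNumBins` are made-up names, not the actual implementation:

```cpp
// Illustrative sketch of change (1): only allocate and fill histogram
// slices for features kept by colsample_bytree.
#include <cstddef>
#include <vector>

constexpr std::size_t kNumBins = 256;  // bins per feature (illustrative value)

std::vector<double> BuildSelectedHistogram(
    const std::vector<bool>& feature_is_selected) {
  std::vector<double> hist;
  for (std::size_t fid = 0; fid < feature_is_selected.size(); ++fid) {
    if (!feature_is_selected[fid]) {
      continue;  // unselected feature: no bins built, no bins synced
    }
    hist.insert(hist.end(), kNumBins, 0.0);  // reserve bins for this feature
    // ... accumulate gradient statistics for feature `fid` here ...
  }
  return hist;  // smaller buffer => less local work and less allreduce traffic
}
```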
Awesome. So for now, we can wrap up the distributed fast hist and then come back to items (1) and (2). |
@hcho3 you can start reviewing the current fast distributed histogram PR, as I want to separate the performance improvement into another PR to avoid packing too many things together.
|
I cleaned up my prototype to remove the part of |
While working on the distributed fast histogram, I found that when local computation is not the bottleneck, fast histogram is actually slower than approx.
By profiling the program, I think the reason is the histogram sync.
In fast histogram, we grow the tree in a loop over every node: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L141-L190
In approx, we instead loop over each level: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_histmaker.cc#L136-L148
The difference is in how we sync histograms: in the fast histogram algorithm, we trigger an allreduce for every node expansion (https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L160-L163), while in approx we put the histograms of a whole level together and sync them at once: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_histmaker.cc#L374-L386
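As a rough sketch of that difference in communication pattern (the helper functions here are hypothetical; only `rabit::Allreduce` corresponds to the call the tree updaters actually use):

```cpp
// Simplified sketch: per-node sync (hist today) vs. per-level sync (approx).
#include <algorithm>
#include <cstddef>
#include <vector>
#include <rabit/rabit.h>

void SyncHist(std::vector<double>* hist) {
  rabit::Allreduce<rabit::op::Sum>(hist->data(), hist->size());
}

// hist-like behaviour today: one allreduce per expanded node.
void SyncPerNode(std::vector<std::vector<double>>* node_hists) {
  for (auto& h : *node_hists) {
    SyncHist(&h);  // N nodes => N allreduce calls, N times the latency
  }
}

// approx-like behaviour: pack the whole level's histograms and sync once.
void SyncPerLevel(std::vector<std::vector<double>>* node_hists) {
  std::vector<double> packed;
  for (const auto& h : *node_hists) {
    packed.insert(packed.end(), h.begin(), h.end());
  }
  SyncHist(&packed);  // a single allreduce for the whole level
  std::size_t offset = 0;
  for (auto& h : *node_hists) {  // scatter the reduced values back per node
    std::copy(packed.begin() + offset, packed.begin() + offset + h.size(),
              h.begin());
    offset += h.size();
  }
}
```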
The current implementation in fast histogram is actually excellent at covering both loss-guided and depth-guided growth. To reduce the communication overhead in a distributed environment, I would propose (unfortunately) to break it into two parts: one for loss-guide (the current loop) and the other mimicking the behavior of approx.
The benefit is performance; the cost is code complexity.
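For a sense of what the split might look like, here is a hypothetical sketch of the two code paths; none of these names exist in the code base:

```cpp
// Rough sketch of the proposed split: the single expand loop would become
// two drivers, a loss-guided one (priority queue ordered by gain) and a
// depthwise one (whole level batched, then a single sync).
#include <queue>
#include <vector>

struct ExpandEntry {
  int nid;          // node id
  int depth;        // depth of the node in the tree
  double loss_chg;  // gain obtained by splitting this node
};

struct GainCmp {
  bool operator()(const ExpandEntry& a, const ExpandEntry& b) const {
    return a.loss_chg < b.loss_chg;  // max-heap: largest gain expanded first
  }
};

// Loss-guided: always expand the best node next, which means histograms
// are built and synced for one node at a time.
using LossGuideQueue =
    std::priority_queue<ExpandEntry, std::vector<ExpandEntry>, GainCmp>;

// Depthwise: collect every node of the current depth, build all of their
// histograms, sync them with one allreduce, then evaluate the splits.
using DepthwiseLevel = std::vector<ExpandEntry>;
```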
@hcho3 @trivialfis @RAMitchell thoughts?