[proposed change] distinguish loss-guide and depth-guide in fast histogram #4077

Closed
CodingCat opened this issue Jan 23, 2019 · 9 comments · Fixed by #4102
Comments

@CodingCat
Member

CodingCat commented Jan 23, 2019

While working on the distributed fast histogram, I found that when the local computation is not the bottleneck, fast histogram is actually slower than approx.

After profiling the program, I believe the reason is the histogram sync.

In fast histogram, we grow a tree by looping over individual nodes: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L141-L190

In approx, we instead loop over levels: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_histmaker.cc#L136-L148

The difference is in how histograms are synced:

in fast histogram, we trigger an allreduce for every node expansion (https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L160-L163), but in approx, we put all histograms of a level together and sync them at once: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_histmaker.cc#L374-L386

The current fast histogram implementation is actually elegant in that a single loop covers both loss-guided and depth-guided growth. To reduce the communication overhead in a distributed environment, I would (unfortunately) propose to break it into two paths: one for loss-guide (the current loop) and one that mimics the behavior of approx.

The benefit is performance; the cost is code complexity.
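
To make the proposal concrete, here is a rough sketch of the per-level sync, assuming rabit's public Allreduce API (rabit::Allreduce<rabit::op::Sum>); the GradStats struct and the function names below are invented for illustration and are not the actual updater code:

// Sketch only: contrast between the current per-node sync and a batched
// per-level sync. GradStats and both function names are placeholders.
#include <algorithm>
#include <cstddef>
#include <vector>
#include <rabit/rabit.h>

struct GradStats { double sum_grad{0.0}, sum_hess{0.0}; };

// Current hist behavior (simplified): one allreduce per expanded node.
void SyncPerNode(std::vector<GradStats>* node_hist) {
  rabit::Allreduce<rabit::op::Sum>(
      reinterpret_cast<double*>(node_hist->data()), node_hist->size() * 2);
}

// Proposed depth-wise behavior (simplified): flatten all node histograms of
// the current level into one buffer and sync with a single collective call,
// the way approx does.
void SyncPerLevel(std::vector<std::vector<GradStats>>* level_hists) {
  std::vector<GradStats> flat;
  for (const auto& h : *level_hists) {
    flat.insert(flat.end(), h.begin(), h.end());
  }
  rabit::Allreduce<rabit::op::Sum>(
      reinterpret_cast<double*>(flat.data()), flat.size() * 2);
  // Copy the reduced values back into the per-node histograms.
  std::size_t offset = 0;
  for (auto& h : *level_hists) {
    std::copy(flat.begin() + offset, flat.begin() + offset + h.size(), h.begin());
    offset += h.size();
  }
}

With this split, the depth-wise path issues one larger collective call per level instead of one small call per node, which is where the communication savings would come from.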

@hcho3 @trivialfis @RAMitchell thoughts?

@RAMitchell
Member

Do you have some numbers on what the performance difference might be? I generally have tried to avoid specialising our implementation for loss guided/depthwise for maintenance reasons. It could be justified if the performance difference is huge.

@CodingCat
Member Author

@RAMitchell

With our internal dataset, when the local computation is intensive, we got a 2+x speedup with the hist algorithm.

However, when I applied col_sampling_by_tree to make the local computation lighter, hist's performance didn't change, while approx improved a lot and surpassed hist.

Looking at the profiling data, 96% of the time is spent in BuildHist for hist.

@hcho3
Collaborator

hcho3 commented Jan 24, 2019

@CodingCat

In fast histogram, we are growing a tree in a loop over every node

This also affects multi-core CPU performance, since the amount of parallelism is limited when only a few data points end up at a node. See the discussion at hcho3/xgboost-fast-hist-perf-lab#15. When I originally designed the fast histogram algorithm, I intended a fully generic mechanism for specifying the order of node expansion, but in practice only two choices are available (lossguide vs depthwise):

inline static bool DepthWise(ExpandEntry lhs, ExpandEntry rhs) {
  if (lhs.depth == rhs.depth) {
    return lhs.timestamp > rhs.timestamp;  // favor small timestamp
  } else {
    return lhs.depth > rhs.depth;  // favor small depth
  }
}
inline static bool LossGuide(ExpandEntry lhs, ExpandEntry rhs) {
  if (lhs.loss_chg == rhs.loss_chg) {
    return lhs.timestamp > rhs.timestamp;  // favor small timestamp
  } else {
    return lhs.loss_chg < rhs.loss_chg;  // favor large loss_chg
  }
}

If we specialize for depthwise, I think we can increase the amount of parallelism on multi-core CPUs and improve scalability.
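
For context, here is a minimal sketch (assumed, not copied from updater_quantile_hist.cc) of how one of these comparators can drive a single generic expansion loop through a priority queue; the ExpandEntry fields mirror the snippet above, while the queue type and loop body are hypothetical:

#include <queue>
#include <vector>

struct ExpandEntry {
  int nid;             // node id
  int depth;           // depth of the node in the tree
  double loss_chg;     // loss reduction of the best split found for this node
  unsigned timestamp;  // insertion order, used as a tie-breaker
};

using Comparator = bool (*)(ExpandEntry, ExpandEntry);
using ExpandQueue =
    std::priority_queue<ExpandEntry, std::vector<ExpandEntry>, Comparator>;

// With DepthWise the queue pops shallow nodes first (effectively level by
// level); with LossGuide it always pops the node with the largest loss_chg.
void GrowTree(ExpandQueue* qexpand) {
  while (!qexpand->empty()) {
    ExpandEntry e = qexpand->top();
    qexpand->pop();
    // Per-node work happens here: build the histogram for e.nid, evaluate the
    // best split, and (in distributed mode) allreduce e.nid's histogram -- the
    // per-node sync discussed above. If the node is split, its two children
    // are pushed with depth e.depth + 1 and fresh timestamps.
    (void)e;  // no real training logic in this sketch
  }
}

// Usage: ExpandQueue q(DepthWise); q.push({0 /*root*/, 0, 0.0, 0}); GrowTree(&q);

Because the loop is generic, the expansion order is purely a property of the comparator, which is part of why batching work per level is awkward without splitting the depthwise path out.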

@hcho3
Collaborator

hcho3 commented Jan 24, 2019

Maybe a logical next step is to try rewriting the histogram construction code (a snippet is available at hcho3/xgboost-fast-hist-perf-lab) so that histogram aggregation occurs level by level. Then we can measure the performance impact.
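
To make the idea concrete, here is a rough sketch (the data layout and names are assumptions, not taken from the perf-lab code) of level-by-level histogram aggregation with per-thread scratch buffers, so that all nodes on a level share one parallel region instead of each small node being processed on its own:

#include <omp.h>
#include <cstddef>
#include <vector>

struct GradientPair { double grad, hess; };

// bin_idx[i]  = quantized bin of row i (one feature only, for brevity)
// row_node[i] = index, within the current level, of the node that row i falls into
void BuildLevelHist(const std::vector<unsigned>& bin_idx,
                    const std::vector<int>& row_node,
                    const std::vector<GradientPair>& gpair,
                    std::size_t nbins, std::size_t nodes_in_level,
                    std::vector<GradientPair>* level_hist) {
  level_hist->assign(nodes_in_level * nbins, GradientPair{0.0, 0.0});
  const int nthread = omp_get_max_threads();
  // Per-thread scratch histograms avoid atomics on the shared buffer.
  std::vector<std::vector<GradientPair>> tmp(
      nthread, std::vector<GradientPair>(nodes_in_level * nbins, {0.0, 0.0}));
  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t i = 0;
       i < static_cast<std::ptrdiff_t>(bin_idx.size()); ++i) {
    GradientPair& h =
        tmp[omp_get_thread_num()][row_node[i] * nbins + bin_idx[i]];
    h.grad += gpair[i].grad;
    h.hess += gpair[i].hess;
  }
  // Sequential reduction of the per-thread buffers into the level histogram.
  for (int t = 0; t < nthread; ++t) {
    for (std::size_t j = 0; j < level_hist->size(); ++j) {
      (*level_hist)[j].grad += tmp[t][j].grad;
      (*level_hist)[j].hess += tmp[t][j].hess;
    }
  }
}

The parallel loop runs over all rows of the level at once, so thread utilization no longer depends on how many rows any single node happens to contain.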

@CodingCat
Member Author

CodingCat commented Jan 24, 2019

@hcho3 I think, ideally, we should restructure the whole algorithm to cover both cases while growing level by level.

I am not sure we have enough resources to finish that, and to take on the risk of a buggy re-implementation, before the next (coming) release.

As a short-term solution, I am going to split them into two separate functions (possibly even one for distributed and one for local) to ensure that the next release contains a distributed fast histogram with acceptable performance.

And in the release after that, as we discussed offline, when multi-core scalability becomes a major release milestone, we can give the full rewrite another shot.

@CodingCat
Member Author

CodingCat commented Jan 28, 2019

So, I have a prototype (please ignore the obvious copy-paste code for now; I will revise it) at CodingCat#8 (the diff is against the original distributed fast histogram implementation in #4011).

The changes in that PR bring another 1x speedup compared to the original distributed fast histogram algorithm, and a 10-15% speedup compared to approx, when I set colsample_bytree for an internal use case.

Details:

The changes are to (1) not create histograms for unselected features (a small sketch is below), and (2) sync histograms per level.
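
A hypothetical illustration of change (1) only; feature_set, num_features, and BuildFeatureHist are placeholders, not functions from the PR:

#include <unordered_set>

// Only build histogram bins for features that survived colsample_bytree.
// Unselected features get no histogram at all, so they are neither
// accumulated locally nor included in the allreduce buffer.
void BuildSelectedHists(const std::unordered_set<int>& feature_set,
                        int num_features,
                        void (*BuildFeatureHist)(int fid)) {
  for (int fid = 0; fid < num_features; ++fid) {
    if (feature_set.count(fid) == 0) continue;  // feature not sampled for this tree
    BuildFeatureHist(fid);
  }
}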

But this 10-15% speedup is not exciting enough.

In the near future, I think we should focus on (1) improving the quality of rabit (which has been out of maintenance for a while) and (2) improving the multi-core performance of fast histogram.

@hcho3
Collaborator

hcho3 commented Jan 28, 2019

Awesome. So for now, we can wrap up the distributed fast hist and then come back to items (1) and (2).

@CodingCat
Member Author

CodingCat commented Jan 28, 2019 via email

@CodingCat
Member Author

I cleaned up my prototype and removed the "do not create histograms for unselected features" part. I found that the cost of building those histograms is negligible compared to the rest of hist, and skipping them made the code more complicated; the overall performance gain didn't change much.

lock bot locked this issue as resolved and limited the conversation to collaborators on May 14, 2019.