[proposed change] distinguish loss-guide and depth-guide in fast histogram #4077
Comments
Do you have some numbers on what the performance difference might be? I have generally tried to avoid specialising our implementation for loss-guided vs. depthwise growth for maintenance reasons. It could be justified if the performance difference is huge. |
With our internal dataset, when the local computation is intensive, we got a 2x+ speedup with the hist algorithm. However, when I applied colsample_bytree to make local computation lighter, hist performance didn't change, while approx improved a lot and surpassed hist. Looking at the profiling data, 96% of the time is spent in BuildHist in hist. |
This also affects multi-core CPU performance, since the amount of parallelism is limited when only a few data points end up at a node. See the discussion at hcho3/xgboost-fast-hist-perf-lab#15. Originally, when I designed the fast histogram algorithm, I intended a fully generic method for specifying the order of node expansion, but in practice only two choices are available (lossguide vs depthwise); see xgboost/src/tree/updater_quantile_hist.h, lines 178 to 191 at commit 4fac987.
If we just take depthwise, I think we can increase the amount of parallelism for multi-core CPUs and improve scalability. |
Maybe a logical next step is to try re-writing the histogram construction code (snippet available at hcho3/xgboost-fast-hist-perf-lab) so that histogram aggregation occurs level by level. Then we can measure the performance impact. |
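To make the idea concrete, here is a minimal sketch of level-by-level aggregation. The names `Hist`, `BuildHist`, and `BuildLevelHistograms` are hypothetical and only illustrate the shape of the loop, not the actual updater code:

```cpp
// Illustrative sketch only: build histograms for every node of a tree
// level inside one parallel region, so multi-core utilisation does not
// collapse when individual nodes hold only a few rows.
#include <vector>

struct Hist {
  std::vector<double> bins;  // flattened gradient/hessian statistics per bin
};

// Placeholder accumulation kernel; the real code walks the rows assigned
// to `node_id` and adds their gradient statistics into `hist`.
void BuildHist(int node_id, Hist* hist) {
  for (double& b : hist->bins) {
    b += static_cast<double>(node_id);
  }
}

void BuildLevelHistograms(const std::vector<int>& nodes_in_level,
                          std::vector<Hist>* hists) {
  // One OpenMP loop over all nodes of the level exposes as many
  // independent tasks as there are nodes, instead of a single node.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(nodes_in_level.size()); ++i) {
    BuildHist(nodes_in_level[i], &(*hists)[i]);
  }
}
```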
@hcho3 I think ideally we should have a reconstruction of the whole algorithm to cover both cases and at the same time grow level by level. I am not sure we have enough resources to finish that, and take the risk of a buggy re-implementation, in the next (coming) release. As a short-term solution I am going to make them two separate functions (even one for distributed and one for local) to ensure that the next release contains a performance-acceptable distributed fast histogram. In the release after that, as we talked offline, when multi-core scalability becomes a major release milestone, we can give it another shot. |
So, I have a prototype (please ignore the obvious copy-paste code for now, I will revise it) at CodingCat#8 (the diff is over the original distributed fast histogram implementation in #4011). The changes in that PR bring another 1x speedup compared to the original distributed fast histogram algorithm, and a 10-15% speedup compared to approx when I set colsample_bytree for an internal use case.

Details: the changes are to (1) not create histograms for unselected features, and (2) sync histograms per level.

But this 10-15% speedup is not exciting enough; I think we should focus on (1) improving the quality of rabit (out of maintenance for a while) and (2) improving the multi-core performance of fast histogram in the near future. |
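For illustration, change (1) could look roughly like the sketch below; `feature_is_selected` and `kNumBins` are made-up names, not the actual implementation:

```cpp
// Illustrative sketch of change (1): only allocate and fill histogram
// slices for features kept by colsample_bytree.
#include <cstddef>
#include <vector>

constexpr std::size_t kNumBins = 256;  // bins per feature (illustrative value)

std::vector<double> BuildSelectedHistogram(
    const std::vector<bool>& feature_is_selected) {
  std::vector<double> hist;
  for (std::size_t fid = 0; fid < feature_is_selected.size(); ++fid) {
    if (!feature_is_selected[fid]) {
      continue;  // unselected feature: no bins built, no bins synced
    }
    hist.insert(hist.end(), kNumBins, 0.0);  // reserve bins for this feature
    // ... accumulate gradient statistics for feature `fid` here ...
  }
  return hist;  // smaller buffer => less local work and less allreduce traffic
}
```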
Awesome. So for now, we can wrap up the distributed fast hist and then come back to items (1) and (2). |
@hcho3 you can start reviewing the current fast distributed histogram PR, as I want to separate the performance improvement into another PR to avoid packing too many things together.
|
I cleaned up my prototype to remove the part of |
While working on the distributed fast histogram, I found that when local computation is not the bottleneck, fast histogram is actually slower than approx.
By profiling the program, I think the reason is the histogram sync.
In fast histogram, we grow the tree in a loop over every node: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L141-L190
In approx, we instead loop over each level: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_histmaker.cc#L136-L148
The difference is in how we sync histograms: in the fast histogram algorithm, we trigger an allreduce for every node expansion (https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L160-L163), while in approx we put the histograms of a whole level together and sync them at once: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_histmaker.cc#L374-L386
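As a rough sketch of that difference in communication pattern (the helper functions here are hypothetical; only `rabit::Allreduce` corresponds to the call the tree updaters actually use):

```cpp
// Simplified sketch: per-node sync (hist today) vs. per-level sync (approx).
#include <algorithm>
#include <cstddef>
#include <vector>
#include <rabit/rabit.h>

void SyncHist(std::vector<double>* hist) {
  rabit::Allreduce<rabit::op::Sum>(hist->data(), hist->size());
}

// hist-like behaviour today: one allreduce per expanded node.
void SyncPerNode(std::vector<std::vector<double>>* node_hists) {
  for (auto& h : *node_hists) {
    SyncHist(&h);  // N nodes => N allreduce calls, N times the latency
  }
}

// approx-like behaviour: pack the whole level's histograms and sync once.
void SyncPerLevel(std::vector<std::vector<double>>* node_hists) {
  std::vector<double> packed;
  for (const auto& h : *node_hists) {
    packed.insert(packed.end(), h.begin(), h.end());
  }
  SyncHist(&packed);  // a single allreduce for the whole level
  std::size_t offset = 0;
  for (auto& h : *node_hists) {  // scatter the reduced values back per node
    std::copy(packed.begin() + offset, packed.begin() + offset + h.size(),
              h.begin());
    offset += h.size();
  }
}
```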
The current implementation in fast histogram is actually excellent at covering both loss-guided and depth-guided growth. To reduce the communication overhead in a distributed environment, I would propose (unfortunately) to break it into two parts: one for loss-guide (the current loop) and the other mimicking the behavior of approx.
The benefit is performance; the cost is code complexity.
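For a sense of what the split might look like, here is a hypothetical sketch of the two code paths; none of these names exist in the code base:

```cpp
// Rough sketch of the proposed split: the single expand loop would become
// two drivers, a loss-guided one (priority queue ordered by gain) and a
// depthwise one (whole level batched, then a single sync).
#include <queue>
#include <vector>

struct ExpandEntry {
  int nid;          // node id
  int depth;        // depth of the node in the tree
  double loss_chg;  // gain obtained by splitting this node
};

struct GainCmp {
  bool operator()(const ExpandEntry& a, const ExpandEntry& b) const {
    return a.loss_chg < b.loss_chg;  // max-heap: largest gain expanded first
  }
};

// Loss-guided: always expand the best node next, which means histograms
// are built and synced for one node at a time.
using LossGuideQueue =
    std::priority_queue<ExpandEntry, std::vector<ExpandEntry>, GainCmp>;

// Depthwise: collect every node of the current depth, build all of their
// histograms, sync them with one allreduce, then evaluate the splits.
using DepthwiseLevel = std::vector<ExpandEntry>;
```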
@hcho3 @trivialfis @RAMitchell thoughts?