Possible optimizations for Intel CPUs #15
I remember noticing that when going through the codebase as well. Is there any reason for the added accuracy, @hcho3? Using floats for histograms would also cut the communication cost in half for distributed training. This is what I was using, for example; Tencent Angel does the same, and LightGBM's default is also float.
@SmirnovEgorRu Here are some results for 2x Xeon Gold 6154: 2x 50 runs with gcc (your 512-row block) and icc (your 512-row block + the extra optimizations you did). Rebooted the server between each run, NUMA mode on. I got rid of the forced …
Past 18 threads it's expected to be slower.
Full table:
Thanks everyone for your great benchmarks and insights :)
I have tried summing into single-precision floats in my GPU version, and the accuracy degrades significantly on larger data sets (1M rows). We need accuracy first and speed second, so this will not be possible unless you have a scheme to reduce the loss of accuracy (e.g. tree reduction).
I don't think parallelism by node is possible if we are using the "lossguide" style algorithm that greedily opens one node at a time. This is a problem I also struggled with for the GPU version.
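For reference, a minimal sketch of one such scheme, pairwise (tree) reduction, where rounding error grows roughly as O(log n) rather than the O(n) of naive left-to-right float summation (names are illustrative, not from the codebase):

```cpp
#include <cstddef>

// Pairwise (tree) reduction: split the array in half, sum each half
// recursively, and add the two partial sums. Intermediate sums stay
// comparable in magnitude, which limits cancellation error.
float PairwiseSum(const float* x, std::size_t n) {
  if (n <= 8) {  // small base case: plain serial sum
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
  }
  const std::size_t half = n / 2;
  return PairwiseSum(x, half) + PairwiseSum(x + half, n - half);
}
```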
Hi,
I'm from the Intel® DAAL development team. We have our own implementation of gradient boosting that is highly optimized for CPUs. I have looked at the benchmark and applied the practices used in DAAL's optimizations to it; the results are below.
My code is available in my fork. I measured all examples on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz (2 sockets, 14 cores per socket).
Removing floating-point conversions:
We found that XGBoost uses "float" for gradient pairs but "double" for histograms. When XGBoost computes histograms, it therefore converts floats to doubles at runtime, and this operation is expensive. If we replace the "double" data type with "float" in the histograms, performance is as follows:
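To make the conversion concrete, here is a minimal sketch (types and names are illustrative, not taken from the codebase): gradient pairs are stored as floats, so accumulating into a double histogram widens every value at runtime, while a float histogram avoids the conversion and halves the histogram's memory footprint.

```cpp
struct GradientPair { float grad, hess; };

// Baseline: implicit float->double conversion on every accumulation.
void AddToHist(double* hist, const GradientPair& g, unsigned bin) {
  hist[2 * bin]     += g.grad;   // float widened to double here
  hist[2 * bin + 1] += g.hess;
}

// Proposed: float histogram, no conversion, half the memory traffic.
void AddToHist(float* hist, const GradientPair& g, unsigned bin) {
  hist[2 * bin]     += g.grad;
  hist[2 * bin + 1] += g.hess;
}
```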
Threading:
In the current implementation, XGBoost builds a histogram in blocks of 8 rows. The data sets "rowind-198-?.txt" contain the sample indices for some nodes in a tree, and we can see that many of them contain only 10-500 samples. This number is too small to be worth parallelizing: the overhead of OMP task creation can be larger than the useful work of the histogram computation. We tried limiting the block size to 512 rows, and with that change we no longer see performance degradation as the number of threads increases:
However, in this case we utilize only one core of the CPU for the lower levels of the tree and don't achieve optimal performance on a multi-core CPU. How do we solve this issue in Intel® DAAL? We use two levels of parallelism: within each node (present in XGBoost) and across nodes (not present in XGBoost). On the first levels of the tree, most of the threading is within nodes; for the last levels, it is across nodes. Adding parallelism across nodes to XGBoost is required for good thread scalability.
We have experimented with this block size in DAAL, and 512 rows was the optimal number for the data sets we used.
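A minimal sketch of the block-size idea, with assumed names and a simplified gradient-only histogram: parallelize only when the node has enough rows to amortize the threading overhead, otherwise build the histogram serially on one core.

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockSize = 512;  // empirically optimal in DAAL experiments

// Build a simplified (gradient-only) histogram for one tree node.
void BuildHistBlocked(const float* grad, const unsigned* bin_idx,
                      std::size_t n_rows, std::size_t n_bins, float* hist) {
  if (n_rows < kBlockSize) {
    // Small node (e.g. the 10-500 row nodes above): serial accumulation
    // avoids the OMP task-creation overhead entirely.
    for (std::size_t i = 0; i < n_rows; ++i) hist[bin_idx[i]] += grad[i];
    return;
  }
  // Large node: one private partial histogram per thread, reduced at the end.
  const int n_threads = omp_get_max_threads();
  std::vector<float> partial(static_cast<std::size_t>(n_threads) * n_bins, 0.0f);
  #pragma omp parallel
  {
    float* my_hist = partial.data() + omp_get_thread_num() * n_bins;
    #pragma omp for schedule(static)
    for (long long i = 0; i < static_cast<long long>(n_rows); ++i)
      my_hist[bin_idx[i]] += grad[i];
  }
  for (int t = 0; t < n_threads; ++t)
    for (std::size_t b = 0; b < n_bins; ++b)
      hist[b] += partial[t * n_bins + b];
}
```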
Implicit usage of C-arrays + no unroll:
When building histograms, you use STL containers and unroll by 8 rows. However, STL usage can bring some overhead, and building histograms is the hotspot, so let's use the data() method to access the internal data of the std::vector.
Also, an unroll like this is not efficient, because it copies data from the arrays into buffers: additional overhead and extra memory traffic.
If we remove this unroll and use C-style arrays, we get a significant performance improvement and simpler code :)
Modified code (build_hist.cc:171):
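The code block itself did not survive here (the full version is in the fork); the following is an illustrative sketch of the shape of the loop described above, with raw pointers obtained via std::vector::data(), no 8-row unroll, and no copies into temporary buffers. All identifiers are assumptions, not the actual names at build_hist.cc:171.

```cpp
#include <cstddef>
#include <cstdint>

// Plain doubly-nested loop over the rows of one node and their binned
// feature values; the compiler is free to vectorize it on its own.
void BuildHist(const float* pgh,             // interleaved grad/hess, 2 per row
               const std::size_t* rid,       // row indices belonging to this node
               const std::uint32_t* index,   // bin index per (row, feature)
               std::size_t ibegin, std::size_t iend,
               std::size_t n_features,
               float* hist) {                // 2 entries per bin: grad, hess
  for (std::size_t i = ibegin; i < iend; ++i) {
    const std::size_t icol_start = rid[i] * n_features;
    const std::size_t idx_gh = 2 * rid[i];
    for (std::size_t j = icol_start; j < icol_start + n_features; ++j) {
      const std::uint32_t idx_bin = 2 * index[j];
      hist[idx_bin]     += pgh[idx_gh];      // gradient
      hist[idx_bin + 1] += pgh[idx_gh + 1];  // hessian
    }
  }
}
```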
Let's take the indices for one node, rowind-198-1 (20.6k samples out of the 1M source samples, sparse variant), and measure performance on it.
Enabling modern instruction sets (AVX-512):
AVX-512 introduced the "scatter" instruction (stores to memory with a stride). It can be useful for histogram computation together with the "gather" instruction (loads from memory with a stride, available since AVX2). Unfortunately, the compiler can't emit these instructions in our case, so we have to write them with intrinsics. A "gather-add-scatter" template written with intrinsics provides some performance improvement over the baseline code produced by the compiler. The AVX-512 code (buildable with the Intel compiler at the moment; building it with gcc is possible but would take some extra work) is in the fork.
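A hedged sketch of what such a gather-add-scatter inner loop can look like with AVX-512F intrinsics; this is not the fork's actual code, and all names are illustrative. One important caveat: a scatter silently drops updates when two lanes share the same bin index, so production code must guarantee unique indices per vector (e.g. via AVX-512CD conflict detection); that handling is omitted here for brevity.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Gradient-only histogram update: gather 16 current bin values,
// add 16 gradients, scatter the sums back.
void GatherAddScatter(float* hist,              // histogram bins
                      const std::int32_t* bins, // bin index per row
                      const float* grad,        // gradient per row
                      std::size_t n) {          // n assumed a multiple of 16
  for (std::size_t i = 0; i < n; i += 16) {
    __m512i idx = _mm512_loadu_si512(bins + i);        // 16 bin indices
    __m512 cur = _mm512_i32gather_ps(idx, hist, 4);    // cur[k] = hist[idx[k]]
    __m512 g   = _mm512_loadu_ps(grad + i);            // 16 gradients
    _mm512_i32scatter_ps(hist, idx, _mm512_add_ps(cur, g), 4);  // store back
  }
}
```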