Currently, the hist tree-growing algorithm (tree_method=hist) of XGBoost scales poorly on multi-core CPUs: for some datasets, performance deteriorates as the number of threads is increased.
This issue was discovered by @Laurae2's Gradient Boosting Benchmark.
To make things easier for contributors, I went ahead and isolated the performance bottleneck. The vast majority of the run time (> 95%) is spent in a stage known as gradient histogram construction. This repository isolates that stage so that it is easy to fix and improve.
- Compile the script by running CMake:
mkdir build
cd build
cmake ..
make
- Download record.tar.bz2 into the same directory.
- Extract record.tar.bz2 by running:
tar xvf record.tar.bz2
- Run the script:
# Usage: ./perflab record/ [number of threads]
./perflab record/ 36
Running with different numbers of threads should produce the following performance trend:
The script reads from record.tar.bz2, which was derived from the Bosch dataset. Its job is to compute histograms of gradient pairs, where each bin of a histogram holds a partial sum.
Some background:
- A gradient for a given instance (X_i, y_i) is a pair of double values that quantify the distance between the true label y_i and the predicted label yhat_i.
- There are as many gradient pairs as there are instances in a training dataset.
- In order to find optimal splits for decision trees, we compute a histogram of gradients. Each bin of the histogram stands for a range of feature values. The value of the bin is given by the sum of gradients corresponding to the data points lying inside the range.
- In each boosting iteration, we have to compute multiple histograms, each histogram corresponding to a set of instances.
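As a rough illustration of the background above, building the histogram for one set of instances might look like the following C++ sketch. The names GradientPair, BuildHistogram, bin_index, and row_set are hypothetical and are not taken from this repository. The sketch assumes that feature values have already been quantized into per-feature bin indices stored in row-major order, and it runs on a single thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: the pair of double values attached to one instance.
struct GradientPair {
  double grad;
  double hess;
};

// Builds one histogram for the instances listed in row_set.
// Assumed layout (not the repository's): bin_index is a row-major
// num_rows x num_features matrix whose entries lie in [0, num_bins).
// The result holds num_features * num_bins cells; the cell for
// (feature f, bin b) is stored at position f * num_bins + b.
std::vector<GradientPair> BuildHistogram(
    const std::vector<GradientPair>& gpair,
    const std::vector<std::uint32_t>& bin_index,
    const std::vector<std::size_t>& row_set,
    std::size_t num_features,
    std::size_t num_bins) {
  std::vector<GradientPair> hist(num_features * num_bins,
                                 GradientPair{0.0, 0.0});
  for (std::size_t row : row_set) {
    for (std::size_t f = 0; f < num_features; ++f) {
      const std::uint32_t bin = bin_index[row * num_features + f];
      // Each bin accumulates a partial sum of the gradient pairs of the
      // instances whose feature value falls into that bin's range.
      GradientPair& cell = hist[f * num_bins + bin];
      cell.grad += gpair[row].grad;
      cell.hess += gpair[row].hess;
    }
  }
  return hist;
}
```

Calling BuildHistogram once per set of instances gives the multiple histograms needed in a boosting iteration. The inner update is a scattered read-modify-write over the histogram array, which is one reason this stage can be hard to parallelize efficiently.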
- By default, the 'Release' build type is used, with flags -O3 -DNDEBUG.
- For profiling, you may want to add debug symbols by choosing the 'RelWithDebInfo' build type instead:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
This build type uses the following flags: -O2 -g -DNDEBUG.
- For full control over the compilation flags, specify CMAKE_CXX_FLAGS_RELEASE:
cmake -DCMAKE_CXX_FLAGS_RELEASE="-O3 -g -DNDEBUG -march=native" ..
Here, we are compiling with the -O3 -g -DNDEBUG -march=native flags. You can check whether they are applied by running
make VERBOSE=1
and looking for the flags you specified in the C++ compilation lines:
/usr/bin/c++ -I/home/ubuntu/xgboost-fast-hist-perf-lab/include -O3 -g -DNDEBUG -march=native -fopenmp -std=gnu++11 -o CMakeFiles/perflab.dir/src/main.cc.o -c /home/ubuntu/xgboost-fast-hist-perf-lab/src/main.cc