Changing the threshold value of the method selection #5

Merged · 11 commits · Aug 30, 2021

Conversation

**RukhovichIV** (Owner) commented on Aug 24, 2021

As we know, the hist method works much faster than the alternative tree methods on CPU, and it does especially well on large datasets. At the moment, the threshold for choosing between hist and exact in the heuristic is too high (2^22, or ~4.2M rows). We compared the performance and quality metrics of hist and exact on many workloads and concluded that 2^18 (~260k rows) would be the optimal threshold. Below are brief tables with the best thresholds for different workloads.
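To make the proposed change concrete, here is a minimal sketch of the selection heuristic. The helper name `choose_tree_method` is hypothetical (it is not the actual XGBoost source); only the 2^18 threshold comes from this PR.

```python
# Hypothetical sketch of the proposed heuristic: with tree_method="auto" on CPU,
# resolve to "hist" once the training set reaches 2^18 rows, otherwise "exact".
HIST_THRESHOLD = 2 ** 18  # 262144 rows, the threshold proposed in this PR


def choose_tree_method(n_rows: int) -> str:
    """Return the tree method that "auto" would resolve to on CPU."""
    return "hist" if n_rows >= HIST_THRESHOLD else "exact"


print(choose_tree_method(100_000))    # below the threshold: "exact"
print(choose_tree_method(1_000_000))  # at or above the threshold: "hist"
```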

We chose the best threshold based on the training time and two testing metrics for each case. The threshold was grid-searched over powers of 2, starting from 256. We used accuracy + log_loss for classification and rmse + r2 for regression. "Optimal threshold" means the minimum data size at which hist starts performing at least as well as exact.

Before training, each dataset was randomly shuffled; the first N rows of the training set were then used for training, while the full testing set was used for evaluation. The procedure was repeated for both hist and exact.
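The procedure above can be sketched as follows. This is a simplified illustration, not the actual benchmark harness: the function name `train_on_first_n` and the objective/round settings are assumptions, and `xgboost` is assumed to be installed.

```python
import numpy as np


def train_on_first_n(X, y, n, tree_method, seed=0):
    """Shuffle the training set once, keep the first n rows, and train."""
    import xgboost as xgb  # assumed available in the benchmark environment
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))[:n]  # random shuffle, then truncate to n rows
    dtrain = xgb.DMatrix(X[idx], label=y[idx])
    params = {"tree_method": tree_method, "objective": "binary:logistic"}
    return xgb.train(params, dtrain, num_boost_round=100)


# Candidate thresholds were grid-searched as powers of two, from 256 (2^8)
# up to the old heuristic threshold of 4194304 (2^22):
thresholds = [2 ** p for p in range(8, 23)]
print(thresholds[0], thresholds[-1])
```

For each workload, each candidate size would be run with both `tree_method="hist"` and `tree_method="exact"`, and the resulting models evaluated on the full test set.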

Classification task:

| dataset | train size | optimal train threshold | optimal accuracy threshold | optimal cross-entropy threshold |
| --- | --- | --- | --- | --- |
| airline-ohe | 1M | 4096 | 256 | 262144 |
| higgs1m | 1M | 512 | 256 | 262144 |
| letters | 16k | 4096 | 256 | 2048 |
| plasticc | 7k | 2048 | 256 | 256 |
| santander | 190k | 32768 | 256 | 8192 |
| airline | 92M | 256 | 256 | 262144 |
| bosch | 1.184M | 131072 | 256 | 131072 |
| epsilon | 400k | 131072 | 256 | 400000 |
| fraud | 228k | 4096 | 256 | 65536 |
| higgs | 8.8M | 512 | 256 | 65536 |
| mlsr | 3.02M | 16384 | 16384 | 8192 |

Regression task:

| dataset | train size | optimal train threshold | optimal rmse threshold | optimal r2 threshold |
| --- | --- | --- | --- | --- |
| abalone | ~3.3k | 256 | 4096 | 4096 |
| year | 464k | 16384 | 262144 | 262144 |
| mortgage1q | 9.01M | 1024 | 65536 | 65536 |

HW:
- CPU: Intel(R) Xeon(R) Platinum 8280L @ 2.70GHz (2 sockets × 28 cores per socket × 2 threads per core)
- RAM: 24 × 16 GB

The full table with all numbers can be found here

This PR is part of making the hist method the default when tree_method==auto (dmlc#7049).

Igor Rukhovich and others added 11 commits on August 24, 2021. Commit messages include:

* Fix truncation.
* Lint.
* [CI] Automatically build GPU-enabled R package for Windows
  * Update Jenkinsfile-win64
  * Build R package for the release branch only
  * Update install doc
* On GPU we use a rounding factor to truncate the gradient for deterministic results. This change switches the gradient representation to a fixed-point number with the exponent aligned with the rounding factor, improving the performance of GPU Hist.
  * [breaking] Drop non-deterministic histogram.
  * Use fixed point for shared memory.

Co-authored-by: Andy Adinets <aadinets@nvidia.com>
@RukhovichIV merged commit 7719321 into default-hist on Aug 30, 2021