Changing the threshold value of the method selection #5

Merged · 11 commits · Aug 30, 2021

Conversation

**RukhovichIV** (Owner) commented on Aug 24, 2021

As we know, the hist method works much faster than the alternative tree methods on CPU, and it does especially well on large datasets. At the moment, the threshold for choosing between hist and exact in the heuristic is too high (2^22, or ~4.2M rows). We compared the performance and quality metrics of hist and exact on many workloads and concluded that 2^18 (~260k rows) would be the optimal threshold. Below are brief tables with the best thresholds for different workloads.
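To make the proposed change concrete, here is a minimal sketch of the selection heuristic. The helper name `choose_tree_method` is hypothetical (it is not the actual XGBoost source); only the 2^18 threshold comes from this PR.

```python
# Hypothetical sketch of the proposed heuristic: with tree_method="auto" on CPU,
# resolve to "hist" once the training set reaches 2^18 rows, otherwise "exact".
HIST_THRESHOLD = 2 ** 18  # 262144 rows, the threshold proposed in this PR


def choose_tree_method(n_rows: int) -> str:
    """Return the tree method that "auto" would resolve to on CPU."""
    return "hist" if n_rows >= HIST_THRESHOLD else "exact"


print(choose_tree_method(100_000))    # below the threshold: "exact"
print(choose_tree_method(1_000_000))  # at or above the threshold: "hist"
```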

We chose the best threshold based on the training time and two testing metrics for each case. The threshold was grid-searched over powers of 2, starting from 256. We used accuracy + log_loss for classification and rmse + r2 for regression. "Optimal threshold" means the minimum data size at which hist starts performing at least as well as exact.

Before training, each dataset was randomly shuffled; the first N rows of the training set were then used for training, while the full testing set was used for evaluation. The procedure was repeated for both hist and exact.
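The procedure above can be sketched as follows. This is a simplified illustration, not the actual benchmark harness: the function name `train_on_first_n` and the objective/round settings are assumptions, and `xgboost` is assumed to be installed.

```python
import numpy as np


def train_on_first_n(X, y, n, tree_method, seed=0):
    """Shuffle the training set once, keep the first n rows, and train."""
    import xgboost as xgb  # assumed available in the benchmark environment
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))[:n]  # random shuffle, then truncate to n rows
    dtrain = xgb.DMatrix(X[idx], label=y[idx])
    params = {"tree_method": tree_method, "objective": "binary:logistic"}
    return xgb.train(params, dtrain, num_boost_round=100)


# Candidate thresholds were grid-searched as powers of two, from 256 (2^8)
# up to the old heuristic threshold of 4194304 (2^22):
thresholds = [2 ** p for p in range(8, 23)]
print(thresholds[0], thresholds[-1])
```

For each workload, each candidate size would be run with both `tree_method="hist"` and `tree_method="exact"`, and the resulting models evaluated on the full test set.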

Classification task:

| dataset | train size | optimal train threshold | optimal accuracy threshold | optimal cross-entropy threshold |
| --- | --- | --- | --- | --- |
| airline-ohe | 1M | 4096 | 256 | 262144 |
| higgs1m | 1M | 512 | 256 | 262144 |
| letters | 16k | 4096 | 256 | 2048 |
| plasticc | 7k | 2048 | 256 | 256 |
| santander | 190k | 32768 | 256 | 8192 |
| airline | 92M | 256 | 256 | 262144 |
| bosch | 1.184M | 131072 | 256 | 131072 |
| epsilon | 400k | 131072 | 256 | 400000 |
| fraud | 228k | 4096 | 256 | 65536 |
| higgs | 8.8M | 512 | 256 | 65536 |
| mlsr | 3.02M | 16384 | 16384 | 8192 |

Regression task:

| dataset | train size | optimal train threshold | optimal rmse threshold | optimal r2 threshold |
| --- | --- | --- | --- | --- |
| abalone | ~3.3k | 256 | 4096 | 4096 |
| year | 464k | 16384 | 262144 | 262144 |
| mortgage1q | 9.01M | 1024 | 65536 | 65536 |

HW:
- CPU: Intel(R) Xeon(R) Platinum 8280L @ 2.70GHz (2 sockets × 28 cores per socket × 2 threads per core)
- RAM: 24 × 16 GB

The full table with all numbers can be found here

This PR is part of making the hist method the default when tree_method==auto (dmlc#7049).

Igor Rukhovich and others added 11 commits on August 24, 2021. Commit messages include:

* Fix truncation.
* Lint.
* [CI] Automatically build GPU-enabled R package for Windows
  * Update Jenkinsfile-win64
  * Build R package for the release branch only
  * Update install doc
* On GPU we use a rounding factor to truncate the gradient for deterministic results. This change switches the gradient representation to a fixed-point number with the exponent aligned with the rounding factor, improving the performance of GPU Hist.
  * [breaking] Drop non-deterministic histogram.
  * Use fixed point for shared memory.

Co-authored-by: Andy Adinets <aadinets@nvidia.com>
@RukhovichIV merged commit 7719321 into default-hist on Aug 30, 2021