
[Breaking] Let's make histogram method the default #7049

Closed
wants to merge 15 commits

Conversation

RukhovichIV
Contributor

@RukhovichIV RukhovichIV commented Jun 18, 2021

We suggest changing the default method for large datasets from approx to hist, as it is much faster. This way, XGBoost will perform better for users who don't choose the tree method themselves.
An attempt was already made in #5178, but that PR wasn't merged.
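
For context, this is roughly what opting into hist looks like today through the Python package (a minimal sketch with synthetic data, not one of the benchmark datasets below); with this change, users who leave `tree_method` unset would get this faster path automatically on large data:

```python
import numpy as np
import xgboost as xgb

# Synthetic data as a stand-in for a large real dataset.
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 20))
y = (X[:, 0] + rng.standard_normal(100_000) > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# Today the "auto" heuristic may resolve to exact/approx depending on data
# size; passing "hist" explicitly is how users opt into the faster method.
params = {"objective": "binary:logistic", "tree_method": "hist"}
booster = xgb.train(params, dtrain, num_boost_round=100)
```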

Here are the performance measurements:

Intel(R) Xeon(R) Platinum 8280L CPU
2 sockets, 28 cores per socket, HT:on

| dataset | hist training, s | approx training, s | hist training speedup | test metric name | metric difference (hist - approx) |
|---|---|---|---|---|---|
| airline | 120.38 | 5588.1 | 46.4 | accuracy | 0 |
| bosch | 21.26 | 56.39 | 2.7 | accuracy | 0 |
| covtype | 7.23 | 248.71 | 34.4 | accuracy | 0.00533 |
| epsilon | 319.06 | 226.47 | 0.7 | accuracy | 0 |
| fraud | 0.92 | 6.21 | 6.7 | accuracy | 0 |
| higgs | 18.7 | 448.81 | 24 | accuracy | 0 |
| airline-ohe | 53.04 | 1678.80 | 31.7 | accuracy | 0 |
| higgs1m | 23.43 | 352.28 | 15 | accuracy | 0 |
| letters | 68.68 | 157.31 | 2.3 | accuracy | 0.00100 |
| mlsr | 82.03 | 1428.85 | 17.4 | accuracy | 0.00112 |
| plasticc | 3.16 | 4.39 | 1.4 | accuracy | 0.22073 |
| santander | 220.01 | 281.5 | 1.3 | accuracy | 0 |
| year_prediction_msd | 4.44 | 16.7 | 3.8 | RMSE | -0.00039 |
| abalone | 2.56 | 4.68 | 1.8 | RMSE | -0.11244 |
| mortgage1Q | 16.35 | 360.9 | 22.1 | RMSE | -0.00012 |
| url | 34.09 | 92.63 | 2.7 | train RMSE | -0.00179 |

The geometric mean of the training-time speedup is 5.667. We still have a slowdown on epsilon, and we're working on that case right now.
Metrics are equal to or better than approx's in all cases.

@trivialfis
Member

Restarted the CI.

There are some issues with hist, like its incomplete support for external memory.

@SmirnovEgorRu
Contributor

@trivialfis, do you prefer to keep approx with auto when external memory is used? Here:

  } else if (!fmat->SingleColBlock()) {
    LOG(INFO) << "Tree method is automatically set to 'hist' "
                 "since external-memory data matrix is used.";
    tparam_.tree_method = TreeMethod::kHist;
  }

@trivialfis
Member

I made a similar attempt before, will look into this again. Thanks for running the comprehensive benchmark.

@trivialfis trivialfis self-requested a review June 28, 2021 07:29
@trivialfis trivialfis self-assigned this Jun 28, 2021
@trivialfis
Member

Hi, could you please take a look at the failing JVM tests?

@RukhovichIV
Contributor Author

RukhovichIV commented Jun 29, 2021

There are some issues with hist, like its incomplete support for external memory.

It seems like the problem still appears even if we use approx for external memory (`if (!fmat->SingleColBlock()) tparam_.tree_method = TreeMethod::kApprox;`). Do you have any other ideas?

Most of the tests (5/7) are failing in a place like this:
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala#L222

Here we perform training with ScalaXGBoost and with the plain C++ interface, then compare the results and see the difference.
The error only occurs when auto or hist is passed as the tree_method; everything works fine when approx is used (as in the last commit 715ca2b).

@trivialfis
Member

Sorry for the late reply.

Hey, @hcho3 @CodingCat @RAMitchell could you please join the discussion?

@hcho3
Collaborator

hcho3 commented Jul 6, 2021

I am in favor of making hist the default method, except when external memory is used.

@trivialfis
Member

I don't have a strong opinion. But I'm migrating the approx tree method to hist's code base (for example, #7079) to get uniform categorical data support. After the migration I expect the approx tree method to become much faster.

@hcho3
Collaborator

hcho3 commented Jul 6, 2021

@trivialfis Do you have a tracking issue for merging approx with hist?

@trivialfis
Member

trivialfis commented Jul 6, 2021

@hcho3 I don't. It's a mix of different refactors. I need to:

@trivialfis
Member

I can merge these items into the categorical data support tracker.

@hcho3
Collaborator

hcho3 commented Jul 6, 2021

@trivialfis So is it fair to say that there are some useful utility functions in approx that you'd like to see merged into hist? So far, our approach has been to direct all development effort to the hist method.

@trivialfis
Member

So is it fair to say that there are some useful utility functions in approx that you'd like to see merged into hist?

There are 2 features in approx that I'm not willing to remove:

  • External memory support.
  • Use hessian as weights.
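
As a rough illustration of the second item (this is not xgboost's actual streaming sketch implementation, just a dense NumPy sketch of the idea): with hessian-as-weights sketching, the split-candidate cut points are chosen so that each bin carries roughly the same total hessian mass, rather than the same number of rows.

```python
import numpy as np

def weighted_quantile_cuts(x, hess, max_bin=256):
    """Cut points for one feature, weighted by per-row hessians."""
    order = np.argsort(x)
    x_sorted, w_sorted = x[order], hess[order]
    cum_w = np.cumsum(w_sorted)
    total = cum_w[-1]
    # Place a cut wherever another 1/max_bin of the total hessian
    # mass has accumulated, instead of every total/max_bin rows.
    targets = total * np.arange(1, max_bin) / max_bin
    idx = np.searchsorted(cum_w, targets)
    return np.unique(x_sorted[np.minimum(idx, len(x_sorted) - 1)])

# Example: rows with larger hessians pull the cut points towards them.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
hess = np.where(x > 1.0, 10.0, 1.0)   # hypothetical per-row hessians
print(weighted_quantile_cuts(x, hess, max_bin=8))
```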

direct all development effort to the hist method.

That's the reason I need to migrate approx to hist's code base and make things as reusable as possible. Any improvement that goes into hist will then go into approx for free.

@trivialfis
Member

I can merge these items into the categorical data support tracker.

Done.

@SmirnovEgorRu SmirnovEgorRu marked this pull request as draft July 6, 2021 20:48
@SmirnovEgorRu
Contributor

@trivialfis, @hcho3,
If we look at the table above, we can observe two things:

  • hist matches or beats approx in accuracy/MSE, at least for the datasets we benchmarked. We could probably tune approx's parameters to improve its metrics, but out of the box it's worse.
  • hist is better in terms of performance (5.7x on geometric mean) and we see opportunities to make it even faster in the future.

Another point - alignment:

  • The GPU version of XGBoost has only the hist method, while the CPU default for large data is approx. hist on CPU and GPU is expected to give pretty similar accuracy (the math is the same; mostly floating-point error can affect the results). But today, when a user switches device CPU <-> GPU, they see different results, because different methods are used by default.
  • LightGBM uses a histogram-based method by default.

Based on the above, I prefer to make hist the default. The only exception is external memory; in that case we can use approx and think about how to support it fully in hist.

Unifying the code for approx and hist is of course a good idea, but it's not closely related to the topic of this PR; I think hist will be faster anyway.

@trivialfis
Member

trivialfis commented Jul 7, 2021

Thanks for the detailed explanation! First, I also prefer changing the default tree method, but I think we need more work than setting the parameter alone. A few reasons I haven't merged this PR yet:

  1. The comparison carried out here compares the implementations, not the algorithms. In theory:
  • Is approx inherently slower than hist? Yes and no. Yes, because it needs to run sketching at the beginning of every iteration; no, because if I add a condition to skip sketching for constant-hessian objectives, then it's exactly the same as hist.
  • Is it inherently less accurate than hist? No: with constant-hessian objectives like reg:squarederror they should produce identical results.

But we know these aren't true in practice from the accuracy results here. As for the different outputs, my guess is the difference in parameters. For hist the tuning parameter for the number of split candidates is max_bin, but for approx it's sketch_ratio + sketch_eps, and the default is much lower than 256 if you translate it back to max_bin (see the sketch after this list). I will unify them during the refactor.

  2. After "Export Python Interface for external memory" (#7070) is merged (I'm splitting it up for review, and a few of the smaller parts are already merged), the external memory implementation should be fairly easy. After that, we can have one algorithm that works out of the box for most scenarios, including most of the training parameters. I'm trying to avoid adding more auto-configurations that somehow change results and performance dramatically (hence "consistent"). Throw an error if something is not implemented; don't silently reconfigure.
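
To make point 1 concrete (a sketch under my own assumptions, not the exact internal translation): sketch_eps roughly corresponds to O(1 / sketch_eps) candidate bins, so the approx default (sketch_eps=0.03, about 33 bins) is far coarser than hist's max_bin=256. Something like the following makes the two comparable:

```python
import numpy as np
import xgboost as xgb

# Small synthetic binary-classification problem, just to make this runnable.
rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 20))
y = (X[:, 0] + 0.5 * rng.standard_normal(50_000) > 0).astype(int)
dtrain = xgb.DMatrix(X[:40_000], label=y[:40_000])
dtest = xgb.DMatrix(X[40_000:], label=y[40_000:])

common = {"objective": "binary:logistic", "eval_metric": "logloss"}

# hist: the number of split candidates is controlled by max_bin (default 256).
hist_params = dict(common, tree_method="hist", max_bin=256)

# approx: candidates come from per-iteration sketching controlled by
# sketch_eps (default 0.03, i.e. roughly 1/0.03 ~ 33 bins); 1/256 makes the
# candidate count comparable to hist's default.
approx_params = dict(common, tree_method="approx", sketch_eps=1.0 / 256)

for name, params in [("hist", hist_params), ("approx", approx_params)]:
    evals_result = {}
    xgb.train(params, dtrain, num_boost_round=50,
              evals=[(dtest, "test")], evals_result=evals_result,
              verbose_eval=False)
    print(name, evals_result["test"]["logloss"][-1])
```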

But now when a user switches a device CPU <-> GPU - they see different results, because they use different methods by default.

My personal preference is to remove the name gpu_hist completely, as discussed in #6212 (comment). We can continue that discussion there; I linked another PR with detailed notes on the caveats of the current parameter set.

Having said that, I'm looking forward to changing the default tree method to hist, but we need to handle external memory properly (without auto-configuration) and get comparison results from a more unified implementation.

These are my personal preferences. I'm looking forward to your replies. ;-)

@trivialfis trivialfis changed the title Let's make histogram method the default [Breaking] Let's make histogram method the default Jul 7, 2021
@trivialfis
Member

trivialfis commented Jul 13, 2021

Please ignore the R failure for now. It's caused by a stale R cache on GitHub Actions.

#7102

@codecov-commenter

codecov-commenter commented Jul 16, 2021

Codecov Report

Merging #7049 (59b40a6) into master (d7c1449) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #7049   +/-   ##
=======================================
  Coverage   81.60%   81.60%           
=======================================
  Files          13       13           
  Lines        3903     3903           
=======================================
  Hits         3185     3185           
  Misses        718      718           


@trivialfis
Member

Running some tests with the latest implementation of external memory. Hopefully we can narrow down the failures.

@RukhovichIV
Contributor Author

Let's take another look here.

As we know, the hist method works much faster on large datasets. At the moment, the threshold for choosing between hist and exact in the heuristic is too high (2^22, or ~4M rows). We compared the performance and metrics of hist and exact on many workloads and concluded that 2^18 (~260k rows) would be an optimal threshold. Below are brief tables with the best thresholds for different workloads.

We chose the best threshold based on training time and two test metrics for each case. It was grid-searched as a power of 2, starting from 256. We used accuracy + log_loss for classification and rmse + r2 for regression. "Optimal threshold" means the minimum data size at which hist starts performing at least as well as exact.

Before training, each dataset was randomly shuffled. Then the first N rows of the training dataset were selected for training; the full test datasets were used for testing. The procedure was repeated for hist and exact.
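
A rough sketch of that procedure (my reading of it, with a synthetic dataset standing in for the real workloads, and only the accuracy criterion shown): shuffle once, take the first N rows for N in powers of two starting at 256, train hist and exact on each prefix, and report the smallest N at which hist does at least as well as exact.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Synthetic stand-in for one of the benchmark datasets.
rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 30))
y = (X @ rng.standard_normal(30) + rng.standard_normal(200_000) > 0).astype(int)

# Shuffle once up front; keep a fixed held-out test split.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
X_train, y_train = X[:150_000], y[:150_000]
dtest = xgb.DMatrix(X[150_000:], label=y[150_000:])
y_test = y[150_000:]

def accuracy(method, n_rows):
    dtrain = xgb.DMatrix(X_train[:n_rows], label=y_train[:n_rows])
    params = {"objective": "binary:logistic", "tree_method": method}
    booster = xgb.train(params, dtrain, num_boost_round=100)
    return accuracy_score(y_test, (booster.predict(dtest) > 0.5).astype(int))

# Grid over powers of two starting from 256, as described above.
threshold = None
n = 256
while n <= len(X_train):
    if accuracy("hist", n) >= accuracy("exact", n):
        threshold = n
        break
    n *= 2
print("optimal accuracy threshold:", threshold)
```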

Classification task:

| dataset | train size | optimal train threshold | optimal accuracy threshold | optimal cross-entropy threshold |
|---|---|---|---|---|
| airline-ohe | 1M | 4096 | 256 | 262144 |
| higgs1m | 1M | 512 | 256 | 262144 |
| letters | 16k | 4096 | 256 | 2048 |
| plasticc | 7k | 2048 | 256 | 256 |
| santander | 190k | 32768 | 256 | 8192 |
| airline | 92M | 256 | 256 | 262144 |
| bosch | 1.184M | 131072 | 256 | 131072 |
| epsilon | 400k | 131072 | 256 | 400000 |
| fraud | 228k | 4096 | 256 | 65536 |
| higgs | 8.8M | 512 | 256 | 65536 |
| mlsr | 3.02M | 16384 | 16384 | 8192 |

Regression task:

| dataset | train size | optimal train threshold | optimal RMSE threshold | optimal R2 threshold |
|---|---|---|---|---|
| abalone | ~3.3k | 256 | 4096 | 4096 |
| year | 464k | 16384 | 262144 | 262144 |
| mortgage1q | 9.01M | 1024 | 65536 | 65536 |

HW:
CPU: Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Socket(s): 2
Core(s) per socket: 28
Thread(s) per core: 2
RAM: 24 × 16 GB

The full table with all numbers can be found here

@trivialfis
Member

the threshold for choosing between hist and exact in the heuristic is too high (it is 2^22 or ~4M).

If we can proceed with this change, let's remove the selection altogether: just use one algorithm (hist) as the default instead of "auto".
