Performance optimizations for Intel CPUs #3957

Merged: 15 commits merged into dmlc:master on Jan 9, 2019

Conversation

@SmirnovEgorRu (Contributor)

In a previous issue, some performance optimizations for multi-core CPUs were proposed. I have prepared them as this pull request, and also added software (SW) prefetching, which further improves performance.
The changes were tested on the HIGGS data set (1M rows x 28 features) in comparison with Intel® DAAL.

|          | Intel® DAAL | Current XGBoost | Proposed changes |
|----------|-------------|-----------------|------------------|
| Time, ms | 1344        | 8600            | 4238             |

HW: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz (2 sockets, 14 cores per socket).

Parameters:
"objective": "binary:hinge", "max_depth": "6", "eta": "0.3", "min_child_weight": "0" (DAAL's analog: minObservationsInLeafNode=0), "tree_method": "hist", "n_estimators": "50"

Accuracy before and after my changes is the same: 72.306801.

Some additional optimizations are still possible, e.g. parallelization over nodes and optimization of the "ApplySplit" functions.
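
For reference, here is a minimal sketch of how this configuration can be reproduced through XGBoost's C API; the HIGGS file path is hypothetical and "n_estimators": 50 is expressed as 50 calls to XGBoosterUpdateOneIter. This only illustrates the parameter set above, not necessarily how the benchmark itself was run.

```c++
// Sketch only: reproduces the benchmark parameters listed above via the
// XGBoost C API. The data path is hypothetical; error handling is omitted.
#include <xgboost/c_api.h>

int main() {
  DMatrixHandle dtrain;
  XGDMatrixCreateFromFile("higgs.train.libsvm", 0, &dtrain);  // hypothetical path

  BoosterHandle booster;
  XGBoosterCreate(&dtrain, 1, &booster);

  // Same parameters as in the benchmark above.
  XGBoosterSetParam(booster, "objective", "binary:hinge");
  XGBoosterSetParam(booster, "max_depth", "6");
  XGBoosterSetParam(booster, "eta", "0.3");
  XGBoosterSetParam(booster, "min_child_weight", "0");
  XGBoosterSetParam(booster, "tree_method", "hist");

  // "n_estimators": 50 corresponds to 50 boosting rounds.
  for (int iter = 0; iter < 50; ++iter) {
    XGBoosterUpdateOneIter(booster, iter, dtrain);
  }

  XGBoosterFree(booster);
  XGDMatrixFree(dtrain);
  return 0;
}
```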

@hcho3 (Collaborator) commented Dec 2, 2018

Thanks a lot for assembling this pull request. I will review it and make fixes so that it passes the CI tests.

In the longer run, we can also tweak compiler flags to take advantage of the latest features of Intel CPUs. Not all of these flags can be merged into the master CMake build for compatibility reasons, but we can provide a special pre-built binary distribution for AWS C5 instances (which have AVX-512).

@hcho3 (Collaborator) commented Dec 2, 2018

@RAMitchell I recall you mentioning some failure cases that were triggered by using single-precision float for the accumulator variable. Can you share some examples?

@RAMitchell (Member)

Let me do a couple of experiments with the GPU version. I could be wrong about needing doubles; I will need to try a few things.

@RAMitchell (Member)

Looks like it may be okay after some experimentation with large datasets. I am not sure how I reached the previous conclusion.

| Dataset | Metric | xgb-gpu-hist (single precision) | xgb-gpu-hist (double precision) |
|---|---|---|---|
| YearPredictionMSD | Time (s) | 18.9042 | 18.8883 |
| YearPredictionMSD | Train RMSE | 7.80142 | 7.79193 |
| YearPredictionMSD | Test RMSE | 8.88601 | 8.88281 |
| Synthetic | Time (s) | 23.2389 | 24.439 |
| Synthetic | Train RMSE | 13.369 | 13.3822 |
| Synthetic | Test RMSE | 13.5223 | 13.5334 |
| Higgs | Time (s) | 18.9742 | 23.429 |
| Higgs | Train Accuracy | 0.750733 | 0.750603 |
| Higgs | Test Accuracy | 0.747206 | 0.747163 |
| Cover Type | Time (s) | 64.3457 | 67.2516 |
| Cover Type | Train Accuracy | 0.909576 | 0.908491 |
| Cover Type | Test Accuracy | 0.892335 | 0.891793 |
| Bosch | Time (s) | 12.8371 | 12.0918 |
| Bosch | Train Accuracy | 0.995123 | 0.995085 |

@trivialfis (Member)

@RAMitchell The change happens within the CPU histogram; if one changes GradStat to float, some tests should fail.
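
To illustrate why the accumulator type matters, here is a small standalone sketch (not XGBoost code): summing many small gradient contributions into a single bin drifts noticeably in single precision, while a double accumulator stays close to the exact value.

```c++
// Standalone illustration (not XGBoost code): accumulating many small values
// in a float loses precision once the running sum is much larger than the
// increment, while a double accumulator stays close to the exact result.
#include <cstdio>

int main() {
  const int n = 10 * 1000 * 1000;  // e.g. 10M rows contributing to one bin
  const float grad = 1e-4f;        // a typical small per-row gradient

  float sum_f = 0.0f;
  double sum_d = 0.0;
  for (int i = 0; i < n; ++i) {
    sum_f += grad;   // float accumulator: rounding error grows with n
    sum_d += grad;   // double accumulator: error stays negligible
  }

  std::printf("float accumulator:  %.6f\n", sum_f);
  std::printf("double accumulator: %.6f\n", sum_d);
  std::printf("exact value:        %.6f\n", n * static_cast<double>(grad));
  return 0;
}
```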

@SmirnovEgorRu (Contributor, Author)

@hcho3 Can we use different types for GHistEntry on GPU and CPU?
I'm currently collecting data for the case of double precision in GHistEntry and will post the results soon.

@SmirnovEgorRu (Contributor, Author)

I also measured the case where GHistEntry contains doubles (as in the current source code) instead of floats:

|          | Intel® DAAL | Current XGBoost | Proposed changes (float) | Proposed changes (double) |
|----------|-------------|-----------------|--------------------------|---------------------------|
| Time, ms | 1344        | 8600            | 4238                     | 4455                      |

The performance gain from using float is not significant, so doubles can be used here.

@hcho3 (Collaborator) commented Dec 4, 2018

@SmirnovEgorRu I don't think GHistEntry is parameterized, so currently, no. We'll merge it once the tests pass.

@hcho3 (Collaborator) commented Dec 4, 2018

@SmirnovEgorRu Let me know if you need help passing the CI.

@SmirnovEgorRu (Contributor, Author)

@hcho3, it looks like the CI passes now.
Can we merge it?

@hcho3 (Collaborator) commented Dec 7, 2018

@SmirnovEgorRu I assume your new code relies on the existence of a built-in prefetch function, is that right? In that case, we want to keep the old code around as a fail-safe for when the user's machine doesn't support the prefetch function. (I think some people use non-x86 processors, such as ARM or Sun SPARC.) If you're okay with it, I can update this pull request to perform feature checking in the build system so as to check for the existence of the prefetch function. So, may I modify this pull request?

@SmirnovEgorRu (Contributor, Author)

@hcho3, yes, of course you can update it.
Do I need to do anything on my side?

@hcho3 (Collaborator) commented Dec 7, 2018

@SmirnovEgorRu Make sure the "Allow edits from maintainers" option is checked.

@SmirnovEgorRu (Contributor, Author)

@hcho3, yes, it is checked.

@hcho3 (Collaborator) commented Dec 7, 2018

@SmirnovEgorRu Thanks, I'll make the necessary changes in the build system to do feature checking.

@Denisevi4

> Looks like it may be okay after some experimentation with large datasets.

@RAMitchell May I ask how you ran gpu_hist with double precision? Thanks.

@SmirnovEgorRu (Contributor, Author)

@hcho3 Any news here?
Can I help, for example with the defines or something else?

@hcho3 (Collaborator) commented Dec 25, 2018

@SmirnovEgorRu I'm currently on holiday vacation. Sorry for the delay. I'll try to get to it by the end of the year.

@hcho3 (Collaborator) commented Jan 4, 2019

@trivialfis FYI, this is another reason why we should prefer CMake over Makefile, since CMake lets us explicitly test the existence of specific compiler intrinsics such as _mm_prefetch and __builtin_prefetch. With Makefile, we'll have to be overly conservative when adopting compiler intrinsics because we can't test their existence.

@hcho3 (Collaborator) commented Jan 4, 2019

@SmirnovEgorRu Sorry about the delay. I've now added a build-time check for the existence of SW prefetching. I'm assuming that the new code will still be functional even when the macro PREFETCH_READ_T0 is a no-op.
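
For illustration, the guard could look roughly like the sketch below. The XGBOOST_MM_PREFETCH_PRESENT / XGBOOST_BUILTIN_PREFETCH_PRESENT macro names stand in for whatever the build-system feature check actually defines; the point is that on platforms without either intrinsic, PREFETCH_READ_T0 expands to nothing and the surrounding code still compiles and runs, only without the prefetch hint.

```c++
// Sketch of a prefetch macro with a no-op fallback. The feature-test macro
// names are illustrative of what a CMake check could define.
#if defined(XGBOOST_MM_PREFETCH_PRESENT)
  #include <xmmintrin.h>
  #define PREFETCH_READ_T0(addr) \
    _mm_prefetch(reinterpret_cast<const char*>(addr), _MM_HINT_T0)
#elif defined(XGBOOST_BUILTIN_PREFETCH_PRESENT)
  #define PREFETCH_READ_T0(addr) \
    __builtin_prefetch(reinterpret_cast<const char*>(addr), 0, 3)
#else
  #define PREFETCH_READ_T0(addr) do {} while (0)  // no-op on other platforms
#endif
```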

@hcho3 (Collaborator) commented Jan 4, 2019

@trivialfis Can you take a look at this PR to see if it can be made compatible with your PR #3825?

@trivialfis (Member)

@hcho3 Please ignore #3825 and go ahead with finishing this PR; I will take care of the remaining rebase. After all, this one is more important, I think.

@codecov-io commented Jan 4, 2019

Codecov Report

Merging #3957 into master will increase coverage by 4.36%.
The diff coverage is 65.51%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #3957      +/-   ##
============================================
+ Coverage     56.35%   60.72%   +4.36%     
============================================
  Files           186      130      -56     
  Lines         14775    11722    -3053     
  Branches        498        0     -498     
============================================
- Hits           8327     7118    -1209     
+ Misses         6209     4604    -1605     
+ Partials        239        0     -239
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| include/xgboost/base.h | 100% <ø> (ø) | 0 <0> (ø) ⬇️ |
| src/common/hist_util.h | 73.91% <100%> (-7.34%) | 0 <0> (ø) |
| src/common/hist_util.cc | 42.61% <64.91%> (-0.12%) | 0 <0> (ø) |
| src/linear/updater_shotgun.cc | 91.07% <0%> (-2.6%) | 0% <0%> (ø) |
| src/gbm/gblinear.cc | 12.34% <0%> (-2.12%) | 0% <0%> (ø) |
| src/gbm/gbtree.cc | 17.1% <0%> (-1.57%) | 0% <0%> (ø) |
| src/learner.cc | 26.37% <0%> (-0.65%) | 0% <0%> (ø) |
| src/tree/param.h | 86.5% <0%> (-0.42%) | 0% <0%> (ø) |
| python-package/xgboost/core.py | 77.38% <0%> (-0.06%) | 0% <0%> (ø) |
| ... and 88 more | | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 48dddfd...6c37c3f.

@SmirnovEgorRu (Contributor, Author)

@hcho3 @trivialfis Do we need to do anything more to get this merged?

hcho3 merged commit 5f151c5 into dmlc:master on Jan 9, 2019
@hcho3 (Collaborator) commented Jan 9, 2019

@SmirnovEgorRu I went ahead and merged the pull request. I will follow up with some unit tests later.

@Laurae2 (Contributor) commented Jan 13, 2019

@SmirnovEgorRu @hcho3 I'll take a look at benchmarking this PR pre- and post-merge on my 72-thread server (dual Xeon 6154).

I will be using a2dc929 (pre) and 5f151c5 (post).

lock bot locked as resolved and limited conversation to collaborators on Apr 13, 2019