Performance optimizations for Intel CPUs #3957

Merged: 15 commits merged into dmlc:master on Jan 9, 2019

Conversation

@SmirnovEgorRu (Contributor)

In a previous issue, some performance optimizations for multi-core CPUs were proposed. I have prepared them as this pull request, and also added software (SW) prefetching, which further improves performance.
The changes were tested on the HIGGS data set (1M rows x 28 features) in comparison with Intel® DAAL.

|          | Intel® DAAL | Current XGBoost | Proposed changes |
|----------|-------------|-----------------|------------------|
| Time, ms | 1344        | 8600            | 4238             |

HW: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz (2 sockets, 14 cores per socket).

Parameters:
"objective": "binary:hinge", "max_depth": "6", "eta": "0.3", "min_child_weight": "0" (DAAL's analog: minObservationsInLeafNode=0), "tree_method": "hist", "n_estimators": "50"

Accuracy before and after my changes is the same: 72.306801.

Some additional optimizations are still possible, e.g. parallelization over nodes and optimization of the "ApplySplit" functions.
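
For reference, here is a minimal sketch of how this configuration can be reproduced through XGBoost's C API; the HIGGS file path is hypothetical and "n_estimators": 50 is expressed as 50 calls to XGBoosterUpdateOneIter. This only illustrates the parameter set above, not necessarily how the benchmark itself was run.

```c++
// Sketch only: reproduces the benchmark parameters listed above via the
// XGBoost C API. The data path is hypothetical; error handling is omitted.
#include <xgboost/c_api.h>

int main() {
  DMatrixHandle dtrain;
  XGDMatrixCreateFromFile("higgs.train.libsvm", 0, &dtrain);  // hypothetical path

  BoosterHandle booster;
  XGBoosterCreate(&dtrain, 1, &booster);

  // Same parameters as in the benchmark above.
  XGBoosterSetParam(booster, "objective", "binary:hinge");
  XGBoosterSetParam(booster, "max_depth", "6");
  XGBoosterSetParam(booster, "eta", "0.3");
  XGBoosterSetParam(booster, "min_child_weight", "0");
  XGBoosterSetParam(booster, "tree_method", "hist");

  // "n_estimators": 50 corresponds to 50 boosting rounds.
  for (int iter = 0; iter < 50; ++iter) {
    XGBoosterUpdateOneIter(booster, iter, dtrain);
  }

  XGBoosterFree(booster);
  XGDMatrixFree(dtrain);
  return 0;
}
```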

@hcho3 (Collaborator) commented Dec 2, 2018

Thanks a lot for assembling this pull request. I will review it and make fixes so that it passes the CI tests.

In the longer run, we can also tweak compiler flags to take advantage of the latest features of Intel CPUs. Not all of these flags can be merged into the master CMake build for compatibility reasons, but we can provide a special pre-built binary distribution for AWS C5 instances (which have AVX-512).

@hcho3 (Collaborator) commented Dec 2, 2018

@RAMitchell I recall you mentioning some failure cases that were triggered by using single-precision float for the accumulator variable. Can you share some examples?

@RAMitchell (Member)

Let me do a couple of experiments with the GPU version. I could be wrong about needing doubles; I will need to try a few things.

@RAMitchell (Member)

Looks like it may be okay after some experimentation with large datasets. I am not sure how I reached the previous conclusion.

| Dataset | Metric | xgb-gpu-hist (single precision) | xgb-gpu-hist (double precision) |
|---|---|---|---|
| YearPredictionMSD | Time (s) | 18.9042 | 18.8883 |
| YearPredictionMSD | Train RMSE | 7.80142 | 7.79193 |
| YearPredictionMSD | Test RMSE | 8.88601 | 8.88281 |
| Synthetic | Time (s) | 23.2389 | 24.439 |
| Synthetic | Train RMSE | 13.369 | 13.3822 |
| Synthetic | Test RMSE | 13.5223 | 13.5334 |
| Higgs | Time (s) | 18.9742 | 23.429 |
| Higgs | Train Accuracy | 0.750733 | 0.750603 |
| Higgs | Test Accuracy | 0.747206 | 0.747163 |
| Cover Type | Time (s) | 64.3457 | 67.2516 |
| Cover Type | Train Accuracy | 0.909576 | 0.908491 |
| Cover Type | Test Accuracy | 0.892335 | 0.891793 |
| Bosch | Time (s) | 12.8371 | 12.0918 |
| Bosch | Train Accuracy | 0.995123 | 0.995085 |

@trivialfis (Member)

@RAMitchell The change happens within the CPU histogram; if one changes GradStat to float, some tests should fail.
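
To illustrate why the accumulator type matters, here is a small standalone sketch (not XGBoost code): summing many small gradient contributions into a single bin drifts noticeably in single precision, while a double accumulator stays close to the exact value.

```c++
// Standalone illustration (not XGBoost code): accumulating many small values
// in a float loses precision once the running sum is much larger than the
// increment, while a double accumulator stays close to the exact result.
#include <cstdio>

int main() {
  const int n = 10 * 1000 * 1000;  // e.g. 10M rows contributing to one bin
  const float grad = 1e-4f;        // a typical small per-row gradient

  float sum_f = 0.0f;
  double sum_d = 0.0;
  for (int i = 0; i < n; ++i) {
    sum_f += grad;   // float accumulator: rounding error grows with n
    sum_d += grad;   // double accumulator: error stays negligible
  }

  std::printf("float accumulator:  %.6f\n", sum_f);
  std::printf("double accumulator: %.6f\n", sum_d);
  std::printf("exact value:        %.6f\n", n * static_cast<double>(grad));
  return 0;
}
```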

@SmirnovEgorRu (Contributor, Author)

@hcho3 Can we use different types for GHistEntry on GPU and CPU?
I'm currently collecting data for the case of double precision in GHistEntry and will post the results soon.

@SmirnovEgorRu (Contributor, Author)

I also measured the case where GHistEntry contains doubles (as in the current source code) instead of floats:

|          | Intel® DAAL | Current XGBoost | Proposed changes (float) | Proposed changes (double) |
|----------|-------------|-----------------|--------------------------|---------------------------|
| Time, ms | 1344        | 8600            | 4238                     | 4455                      |

The performance gain from using float is not significant, so doubles can be used here.

@hcho3 (Collaborator) commented Dec 4, 2018

@SmirnovEgorRu I don't think GHistEntry is parameterized, so currently, no. We'll merge it once the tests pass.

@hcho3 (Collaborator) commented Dec 4, 2018

@SmirnovEgorRu Let me know if you need help passing the CI.

@SmirnovEgorRu (Contributor, Author)

@hcho3, it looks like the CI passes now.
Can we merge it?

@hcho3 (Collaborator) commented Dec 7, 2018

@SmirnovEgorRu I assume your new code relies on the existence of a built-in prefetch function, is that right? In that case, we want to keep the old code around as a fail-safe for when the user's machine doesn't support the prefetch function. (I think some people use non-x86 processors, such as ARM or Sun SPARC.) If you're okay with it, I can update this pull request to perform feature checking in the build system so as to check for the existence of the prefetch function. So, may I modify this pull request?

@SmirnovEgorRu (Contributor, Author)

@hcho3, yes, of course you can update it.
Do I need to do anything on my side?

@hcho3 (Collaborator) commented Dec 7, 2018

@SmirnovEgorRu Make sure the "Allow edits from maintainers" option is checked.

@SmirnovEgorRu (Contributor, Author)

@hcho3, yes, it is checked.

@hcho3 (Collaborator) commented Dec 7, 2018

@SmirnovEgorRu Thanks, I'll make the necessary changes in the build system to do feature checking.

@Denisevi4

> Looks like it may be okay after some experimentation with large datasets.

@RAMitchell May I ask how you ran gpu_hist with double precision? Thanks.

@SmirnovEgorRu (Contributor, Author)

@hcho3 Any news here?
Can I help, for example with the defines or something else?

@hcho3 (Collaborator) commented Dec 25, 2018

@SmirnovEgorRu I'm currently on holiday vacation. Sorry for the delay. I'll try to get to it by the end of the year.

@hcho3 (Collaborator) commented Jan 4, 2019

@trivialfis FYI, this is another reason why we should prefer CMake over Makefile, since CMake lets us explicitly test the existence of specific compiler intrinsics such as _mm_prefetch and __builtin_prefetch. With Makefile, we'll have to be overly conservative when adopting compiler intrinsics because we can't test their existence.

@hcho3 (Collaborator) commented Jan 4, 2019

@SmirnovEgorRu Sorry about the delay. I've now added a build-time check for the existence of SW prefetching. I'm assuming that the new code will still be functional even when the macro PREFETCH_READ_T0 is a no-op.
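
For illustration, the guard could look roughly like the sketch below. The XGBOOST_MM_PREFETCH_PRESENT / XGBOOST_BUILTIN_PREFETCH_PRESENT macro names stand in for whatever the build-system feature check actually defines; the point is that on platforms without either intrinsic, PREFETCH_READ_T0 expands to nothing and the surrounding code still compiles and runs, only without the prefetch hint.

```c++
// Sketch of a prefetch macro with a no-op fallback. The feature-test macro
// names are illustrative of what a CMake check could define.
#if defined(XGBOOST_MM_PREFETCH_PRESENT)
  #include <xmmintrin.h>
  #define PREFETCH_READ_T0(addr) \
    _mm_prefetch(reinterpret_cast<const char*>(addr), _MM_HINT_T0)
#elif defined(XGBOOST_BUILTIN_PREFETCH_PRESENT)
  #define PREFETCH_READ_T0(addr) \
    __builtin_prefetch(reinterpret_cast<const char*>(addr), 0, 3)
#else
  #define PREFETCH_READ_T0(addr) do {} while (0)  // no-op on other platforms
#endif
```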

@hcho3 (Collaborator) commented Jan 4, 2019

@trivialfis Can you take a look at this PR to see if it can be made compatible with your PR #3825?

@trivialfis (Member)

@hcho3 Please ignore #3825 and go ahead with finishing this PR; I will take care of the remaining rebase. After all, this one is more important, I think.

@codecov-io commented Jan 4, 2019

Codecov Report

Merging #3957 into master will increase coverage by 4.36%.
The diff coverage is 65.51%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #3957      +/-   ##
============================================
+ Coverage     56.35%   60.72%   +4.36%     
============================================
  Files           186      130      -56     
  Lines         14775    11722    -3053     
  Branches        498        0     -498     
============================================
- Hits           8327     7118    -1209     
+ Misses         6209     4604    -1605     
+ Partials        239        0     -239
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| include/xgboost/base.h | 100% <ø> (ø) | 0 <0> (ø) ⬇️ |
| src/common/hist_util.h | 73.91% <100%> (-7.34%) | 0 <0> (ø) |
| src/common/hist_util.cc | 42.61% <64.91%> (-0.12%) | 0 <0> (ø) |
| src/linear/updater_shotgun.cc | 91.07% <0%> (-2.6%) | 0% <0%> (ø) |
| src/gbm/gblinear.cc | 12.34% <0%> (-2.12%) | 0% <0%> (ø) |
| src/gbm/gbtree.cc | 17.1% <0%> (-1.57%) | 0% <0%> (ø) |
| src/learner.cc | 26.37% <0%> (-0.65%) | 0% <0%> (ø) |
| src/tree/param.h | 86.5% <0%> (-0.42%) | 0% <0%> (ø) |
| python-package/xgboost/core.py | 77.38% <0%> (-0.06%) | 0% <0%> (ø) |
| ... and 88 more | | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 48dddfd...6c37c3f.

@SmirnovEgorRu (Contributor, Author)

@hcho3 @trivialfis Do we need to do anything more to get this merged?

hcho3 merged commit 5f151c5 into dmlc:master on Jan 9, 2019
@hcho3 (Collaborator) commented Jan 9, 2019

@SmirnovEgorRu I went ahead and merged the pull request. I will follow up with some unit tests later.

@Laurae2 (Contributor) commented Jan 13, 2019

@SmirnovEgorRu @hcho3 I'll take a look at benchmarking this PR pre- and post-merge on my 72-thread server (dual Xeon 6154).

I will be using a2dc929 (pre) and 5f151c5 (post).

lock bot locked as resolved and limited conversation to collaborators on Apr 13, 2019