Performance optimizations for Intel CPUs #3957
Conversation
Thanks a lot for assembling this pull request. I will review it and make fixes to pass CI tests. In the longer run, we can also tweak compiler flags to take advantage of the latest features of Intel CPUs. Not all of these flags can be merged into the master CMake build due to compatibility reasons, but we can make a special pre-built binary distribution for AWS C5 instances (which have AVX-512).
@RAMitchell I recall you mentioning some failure cases that were triggered by using single-precision float for the accumulator variable. Can you share some examples?
Let me do a couple of experiments with the GPU version. I could be wrong about needing doubles; I will need to try a few things.
Looks like it may be okay after some experimentation with large datasets. I am not sure how I reached the previous conclusion.
@RAMitchell The change happens within the CPU histogram; if one changes GradStat to float, some tests should fail.
@hcho3 can we use different types of GHistEntry for GPU and CPU?
Also, I measured the case when GHistEntry contains doubles (as in the source code) instead of floats. The performance gain from using float is not significant, so doubles can be used here.
@SmirnovEgorRu I don't think GHistEntry is parameterized, so currently, no. We'll merge it once the tests pass.
@SmirnovEgorRu Let me know if you need help passing the CI.
@hcho3, looks like it passes now.
@SmirnovEgorRu Your new code assumes the existence of a built-in prefetch function, is that right? In that case, we want to keep the old code around as a fail-safe, in case the user's machine doesn't support the prefetch function. (I think there are some people using non-x86 processors, such as ARM or SUN SPARC.) If you're okay with it, I can update this pull request to perform feature checking in the build system, so as to check for the existence of the prefetch function. May I modify this pull request?
@hcho3, yes, you can update it, of course.
@SmirnovEgorRu Make sure to check "Allow Edits from Maintainers"
@hcho3, yes, it is checked.
@SmirnovEgorRu Thanks, I'll make the necessary changes in the build system to do feature checking.
@RAMitchell May I ask how you run gpu_hist with double precision? Thanks
@hcho3 any news here?
@SmirnovEgorRu I'm currently on holiday vacation. Sorry for the delay. I'll try to get to it by the end of this year.
@trivialfis FYI, this is another reason why we should prefer CMake over Makefile, since CMake lets us explicitly test for the existence of specific compiler intrinsics.
@SmirnovEgorRu Sorry about the delay. I've now added a build-time check for the existence of SW prefetching. I'm assuming that the new code will remain functional even when the macro is not defined.
@trivialfis Can you take a look at this PR to see if it can be made compatible with your PR #3825?
Codecov Report

```
@@            Coverage Diff             @@
##           master    #3957      +/-   ##
============================================
+ Coverage   56.35%   60.72%     +4.36%
============================================
  Files         186      130        -56
  Lines       14775    11722      -3053
  Branches      498        0       -498
============================================
- Hits         8327     7118      -1209
+ Misses       6209     4604      -1605
+ Partials      239        0       -239
============================================
```

Continue to review full report at Codecov.
@hcho3 @trivialfis Do we need to do something new to push it?
@SmirnovEgorRu I went ahead and merged the pull request. I will follow up with some unit tests later. |
@SmirnovEgorRu @hcho3 I'll take a look at benchmarking before and after this PR on my 72-thread server (dual Xeon 6154).
In this issue, some performance optimizations for multi-core CPUs were proposed. I have prepared them as a pull request. I also added SW prefetching, which improves performance further.
It was tested on the HIGGS data set (1M x 28) in comparison with Intel® DAAL.
HW: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz (2 sockets, 14 cores per socket).
Parameters:
"objective": "binary:hinge", "max_depth": "6", "eta": "0.3", "min_child_weight": "0" (DAAL's analog: minObservationsInLeafNode=0), "tree_method": "hist", "n_estimators": "50"
Accuracy before and after my changes is the same: 72.306801.
Some additional optimizations can be done, e.g. parallelization by nodes and optimization of the "ApplySplit" functions.