Optimisations for gpu_hist. #4248
Conversation
Running tests/benchmark/benchmark_tree.py 5 times on each branch and averaging the result: old: 13.476s, new: 11.54s, a 14.4% improvement. This was on my Windows development machine with a single 1080ti.
@RAMitchell Could you elaborate on why moving the cudaSetDevice() calls helps? And I recently did some small testing of the CUDA APIs to see whether it's possible to add checks for issues like #4245, and for other things such as whether we are using the right device.
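The kind of check mentioned above is commonly done by wrapping every CUDA runtime call in an error-checking macro. A minimal sketch (the macro name is hypothetical, not xgboost's actual API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking wrapper. Checking the return code of every
// runtime call reports failures such as a bad device ordinal at the exact
// call site, instead of surfacing later as an unrelated error.
#define CHECK_CUDA(call)                                               \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      std::fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                   cudaGetErrorString(err_), __FILE__, __LINE__);      \
      std::abort();                                                    \
    }                                                                  \
  } while (0)

int main() {
  CHECK_CUDA(cudaSetDevice(0));  // fails loudly if device 0 is unavailable
  return 0;
}
```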
One more thing: please don't merge until the clang-tidy PR is done.
It seemed a little more consistent to do it this way: each device shard performs work for strictly one GPU, so if we set the device once before each shard performs its work it should be safe. The CUDA API calls show up in the profiler as taking a nontrivial amount of time, although I'm not sure they actually take that long.
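The per-shard pattern described here can be sketched as follows (the struct and method names are illustrative, not xgboost's actual code):

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: each shard owns exactly one GPU, so the device is
// selected once at the top of a unit of work instead of before every
// individual CUDA call, removing redundant cudaSetDevice() overhead.
struct DeviceShard {
  int device_idx;  // the single GPU this shard is bound to

  void DoWork() {
    cudaSetDevice(device_idx);  // once per work unit, not per API call
    // ... all subsequent allocations, copies, and kernel launches in this
    // method implicitly target device_idx ...
  }
};
```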
@RAMitchell It looks like the multi-GPU test is failing due to memory error: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4248/5/pipeline/51#step-88-log-1317
* Use streams to overlap operations.
* Reduce redundant calls to cudaSetDevice().
* ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
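The stream-based overlap mentioned in the summary can be sketched like this (a hedged illustration, not the PR's actual code; names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Work issued to different streams may execute concurrently, so a
// host-to-device copy of the next batch can hide behind the kernel
// processing the current one.
__global__ void Process(const float* data, int n) { /* ... */ }

void OverlapCopyAndCompute(const float* h_next, float* d_next,
                           float* d_current, int n) {
  cudaStream_t copy_stream, compute_stream;
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&compute_stream);

  // Enqueue the copy and the kernel on separate streams; the runtime is
  // free to overlap them (the host buffer must be pinned for a truly
  // asynchronous copy).
  cudaMemcpyAsync(d_next, h_next, n * sizeof(float),
                  cudaMemcpyHostToDevice, copy_stream);
  Process<<<256, 256, 0, compute_stream>>>(d_current, n);

  cudaStreamSynchronize(copy_stream);
  cudaStreamSynchronize(compute_stream);
  cudaStreamDestroy(copy_stream);
  cudaStreamDestroy(compute_stream);
}
```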
That's weird.
This PR is ready to be merged, just having some difficulty with R test failures on Travis. |
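The ColumnSampler/HostDeviceVector change amounts to caching feature data on the device after the first copy. An illustrative sketch of that idea (a hypothetical class, not xgboost's actual HostDeviceVector):

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sketch of the caching pattern: data is copied to the device on first
// use and reused afterwards, instead of being re-uploaded on every access.
class CachedDeviceVector {
 public:
  explicit CachedDeviceVector(std::vector<float> host)
      : host_(std::move(host)) {}
  ~CachedDeviceVector() { cudaFree(d_data_); }

  const float* DeviceData() {
    if (d_data_ == nullptr) {  // first access: allocate and copy once
      cudaMalloc(&d_data_, host_.size() * sizeof(float));
      cudaMemcpy(d_data_, host_.data(), host_.size() * sizeof(float),
                 cudaMemcpyHostToDevice);
    }
    return d_data_;  // later accesses reuse the cached device copy
  }

 private:
  std::vector<float> host_;
  float* d_data_ = nullptr;
};
```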