
Optimisations for gpu_hist. #4248

Merged: 5 commits into dmlc:master, Mar 20, 2019

Conversation

RAMitchell
Member

  • Use streams to overlap operations (a sketch follows below the list).

  • Reduce redundant calls to cudaSetDevice().

  • ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
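
A minimal sketch of the stream-overlap idea in the first bullet (illustrative only, not the actual xgboost code; the BuildHistogram kernel and the buffer setup are stand-ins): giving each device shard its own cudaStream_t lets the copies and kernels issued for different shards overlap instead of serialising on the default stream.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in for real per-shard work such as histogram building.
__global__ void BuildHistogram(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int kShards = 2;
  const int kN = 1 << 20;
  std::vector<cudaStream_t> streams(kShards);
  std::vector<float*> buffers(kShards);
  for (int s = 0; s < kShards; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc(reinterpret_cast<void**>(&buffers[s]), kN * sizeof(float));
    // Queue each shard's memset and kernel on its own stream so the
    // work for different shards can overlap on the device.
    cudaMemsetAsync(buffers[s], 0, kN * sizeof(float), streams[s]);
    BuildHistogram<<<(kN + 255) / 256, 256, 0, streams[s]>>>(buffers[s], kN);
  }
  for (int s = 0; s < kShards; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaFree(buffers[s]);
    cudaStreamDestroy(streams[s]);
  }
  return 0;
}
```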

@RAMitchell
Member Author

Running tests/benchmark/benchmark_tree.py 5 times on each branch and averaging the result: old 13.476s, new 11.54s, a 14.4% improvement.

This was on my Windows development machine with a single 1080ti.

@trivialfis
Member

@RAMitchell Could you elaborate on why you moved cudaSetDevice from the device shard methods into ExecuteIndexShards?

Also, I recently did some small tests of the CUDA APIs to see whether it's possible to add checks for issues like #4245, and for other things like whether we are using the right device, etc. cudaSetDevice is really cheap to call; I guess it just sets a simple global integer variable indicating the device id, and the redundant calls sometimes make debugging easier...
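
A sketch of the kind of check being suggested (CheckActiveDevice is a hypothetical helper, not part of xgboost's API): before launching work for a shard, assert that the calling thread's active device is the one the shard expects.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Hypothetical debug helper: verify the thread's current device
// matches the device a shard expects before launching work on it.
inline void CheckActiveDevice(int expected_device) {
  int current = -1;
  cudaGetDevice(&current);  // cheap: reads thread-local runtime state
  assert(current == expected_device &&
         "about to launch work on the wrong device");
}
```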

@trivialfis
Member

One more thing: please don't merge until the clang-tidy PR is done.

@RAMitchell
Member Author

It seemed a little more consistent to do it this way: each device shard performs work for strictly one GPU, so if we set the device once before each shard performs work, it should be safe.
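
Roughly, the pattern looks like this (a simplified sketch; the real ExecuteIndexShards and DeviceShard in xgboost differ): the dispatcher sets the device once per shard, and the shard methods assume the correct device is already current.

```cuda
#include <cuda_runtime.h>
#include <functional>
#include <vector>

struct DeviceShard {
  int device_id;
  // Shard methods assume the correct device is already current,
  // so they no longer call cudaSetDevice() themselves.
  void DoWork() { /* launch kernels, all on device_id */ }
};

// Simplified dispatcher: set the device once before each shard works.
void ExecuteIndexShards(std::vector<DeviceShard>* shards,
                        const std::function<void(DeviceShard&)>& f) {
  for (auto& shard : *shards) {
    cudaSetDevice(shard.device_id);
    f(shard);
  }
}

// Usage: ExecuteIndexShards(&shards, [](DeviceShard& s) { s.DoWork(); });
```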

I am seeing the CUDA API calls come up in the profiler taking a nontrivial amount of time, though I'm not sure they actually take this long.

[Profiler screenshot showing time spent in CUDA API calls]

@hcho3
Collaborator

hcho3 commented Mar 12, 2019

@RAMitchell It looks like the multi-GPU test is failing due to a memory error: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4248/5/pipeline/51#step-88-log-1317

[2019-03-12T22:18:57.198Z] tests/python-gpu/test_gpu_updaters.py::TestGPU::test_gpu_hist_mgpu Training on dataset: Boston
[2019-03-12T22:18:57.198Z] Using parameters: {'n_gpus': -1, 'eval_metric': 'rmse', 'max_bin': 2, 'objective': 'reg:linear', 'grow_policy': 'lossguide', 'tree_method': 'gpu_hist', 'max_depth': 2, 'max_leaves': 255}
[2019-03-12T22:18:57.582Z] terminate called after throwing an instance of 'thrust::system::system_error'
[2019-03-12T22:18:57.582Z]   what():  device free failed: an illegal memory access was encountered
[2019-03-12T22:18:58.403Z] tests/ci_build/test_mgpu.sh: line 7: 35100 Aborted                 (core dumped) pytest -v -s --fulltrace -m "(not slow) and mgpu" tests/python-gpu
[2019-03-12T22:18:59.338Z] Terminated

@trivialfis
Member

> I am seeing the CUDA API calls come up in the profiler taking a nontrivial amount of time

That's weird.

@RAMitchell
Member Author

This PR is ready to be merged; I'm just having some difficulty with R test failures on Travis.

@RAMitchell merged commit 00465d2 into dmlc:master Mar 20, 2019
@hcho3 mentioned this pull request Apr 21, 2019
lock bot locked as resolved and limited conversation to collaborators Jun 18, 2019