
Optimisations for gpu_hist. #4248

Merged: 5 commits into dmlc:master, Mar 20, 2019

Conversation

RAMitchell
Member

  • Use streams to overlap operations (a sketch follows below the list).

  • Reduce redundant calls to cudaSetDevice().

  • ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
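
A minimal sketch of the stream-overlap idea in the first bullet (illustrative only, not the actual xgboost code; the BuildHistogram kernel and the buffer setup are stand-ins): giving each device shard its own cudaStream_t lets the copies and kernels issued for different shards overlap instead of serialising on the default stream.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in for real per-shard work such as histogram building.
__global__ void BuildHistogram(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int kShards = 2;
  const int kN = 1 << 20;
  std::vector<cudaStream_t> streams(kShards);
  std::vector<float*> buffers(kShards);
  for (int s = 0; s < kShards; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc(reinterpret_cast<void**>(&buffers[s]), kN * sizeof(float));
    // Queue each shard's memset and kernel on its own stream so the
    // work for different shards can overlap on the device.
    cudaMemsetAsync(buffers[s], 0, kN * sizeof(float), streams[s]);
    BuildHistogram<<<(kN + 255) / 256, 256, 0, streams[s]>>>(buffers[s], kN);
  }
  for (int s = 0; s < kShards; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaFree(buffers[s]);
    cudaStreamDestroy(streams[s]);
  }
  return 0;
}
```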

@RAMitchell
Member Author

Running tests/benchmark/benchmark_tree.py 5 times on each branch and averaging the result: old 13.476s, new 11.54s, a 14.4% improvement.

This was on my Windows development machine with a single 1080ti.

@trivialfis
Member

@RAMitchell Could you elaborate on why you moved cudaSetDevice from the device shard methods into ExecuteIndexShards?

Also, I recently did some small tests of the CUDA APIs to see whether it's possible to add checks for issues like #4245, and for other things like whether we are using the right device, etc. cudaSetDevice is really cheap to call; I guess it just sets a simple global integer variable indicating the device id, and the redundant calls sometimes make debugging easier...
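
A sketch of the kind of check being suggested (CheckActiveDevice is a hypothetical helper, not part of xgboost's API): before launching work for a shard, assert that the calling thread's active device is the one the shard expects.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Hypothetical debug helper: verify the thread's current device
// matches the device a shard expects before launching work on it.
inline void CheckActiveDevice(int expected_device) {
  int current = -1;
  cudaGetDevice(&current);  // cheap: reads thread-local runtime state
  assert(current == expected_device &&
         "about to launch work on the wrong device");
}
```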

@trivialfis
Member

One more thing: please don't merge until the clang-tidy PR is done.

@RAMitchell
Member Author

It seemed a little more consistent to do it this way: each device shard performs work for strictly one GPU, so if we set the device once before each shard performs work, it should be safe.
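
Roughly, the pattern looks like this (a simplified sketch; the real ExecuteIndexShards and DeviceShard in xgboost differ): the dispatcher sets the device once per shard, and the shard methods assume the correct device is already current.

```cuda
#include <cuda_runtime.h>
#include <functional>
#include <vector>

struct DeviceShard {
  int device_id;
  // Shard methods assume the correct device is already current,
  // so they no longer call cudaSetDevice() themselves.
  void DoWork() { /* launch kernels, all on device_id */ }
};

// Simplified dispatcher: set the device once before each shard works.
void ExecuteIndexShards(std::vector<DeviceShard>* shards,
                        const std::function<void(DeviceShard&)>& f) {
  for (auto& shard : *shards) {
    cudaSetDevice(shard.device_id);
    f(shard);
  }
}

// Usage: ExecuteIndexShards(&shards, [](DeviceShard& s) { s.DoWork(); });
```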

I am seeing the CUDA API calls come up in the profiler taking a nontrivial amount of time, though I'm not sure they actually take this long.

[Profiler screenshot showing time spent in CUDA API calls]

@hcho3
Collaborator

hcho3 commented Mar 12, 2019

@RAMitchell It looks like the multi-GPU test is failing due to a memory error: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4248/5/pipeline/51#step-88-log-1317

[2019-03-12T22:18:57.198Z] tests/python-gpu/test_gpu_updaters.py::TestGPU::test_gpu_hist_mgpu Training on dataset: Boston
[2019-03-12T22:18:57.198Z] Using parameters: {'n_gpus': -1, 'eval_metric': 'rmse', 'max_bin': 2, 'objective': 'reg:linear', 'grow_policy': 'lossguide', 'tree_method': 'gpu_hist', 'max_depth': 2, 'max_leaves': 255}
[2019-03-12T22:18:57.582Z] terminate called after throwing an instance of 'thrust::system::system_error'
[2019-03-12T22:18:57.582Z]   what():  device free failed: an illegal memory access was encountered
[2019-03-12T22:18:58.403Z] tests/ci_build/test_mgpu.sh: line 7: 35100 Aborted                 (core dumped) pytest -v -s --fulltrace -m "(not slow) and mgpu" tests/python-gpu
[2019-03-12T22:18:59.338Z] Terminated

@trivialfis
Member

> I am seeing the CUDA API calls come up in the profiler taking a nontrivial amount of time

That's weird.

@RAMitchell
Member Author

This PR is ready to be merged; I'm just having some difficulty with R test failures on Travis.

@RAMitchell merged commit 00465d2 into dmlc:master Mar 20, 2019
@hcho3 mentioned this pull request Apr 21, 2019
lock bot locked as resolved and limited conversation to collaborators Jun 18, 2019