Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support cuda 10.1 #4223

Merged
merged 4 commits into from
Mar 7, 2019
Merged

support cuda 10.1 #4223

merged 4 commits into from
Mar 7, 2019

Conversation

rongou
Copy link
Contributor

@rongou rongou commented Mar 6, 2019

Doesn't quite work yet, I'm getting compilations errors:

/home/rou/src/xgboost/src/common/host_device_vector.cu: In instantiation of ‘void xgboost::HostDeviceVectorImpl<T>::DeviceShard::Init(xgboost::HostDeviceVectorImpl<T>*, int) [with T = long unsigned int]’:
/home/rou/src/xgboost/src/common/host_device_vector.cu:205:1:   required from ‘xgboost::HostDeviceVectorImpl<T>::InitShards()::<lambda(int, xgboost::HostDeviceVectorImpl<T>::DeviceShard&)> [with T = long unsigned int]’
/home/rou/src/xgboost/src/common/host_device_vector.cu:205:13:   required from ‘struct xgboost::HostDeviceVectorImpl<T>::InitShards() [with T = long unsigned int]::<lambda(int, struct xgboost::HostDeviceVectorImpl<long unsigned int>::DeviceShard&)>’
/home/rou/src/xgboost/src/common/host_device_vector.cu:204:23:   required from ‘void xgboost::HostDeviceVectorImpl<T>::InitShards() [with T = long unsigned int]’
/home/rou/src/xgboost/src/common/host_device_vector.cu:171:11:   required from ‘xgboost::HostDeviceVectorImpl<T>::HostDeviceVectorImpl(size_t, T, xgboost::GPUDistribution) [with T = long unsigned int; size_t = long unsigned int]’
/home/rou/src/xgboost/src/common/host_device_vector.cu:430:12:   required from ‘xgboost::HostDeviceVector<T>::HostDeviceVector(size_t, T, xgboost::GPUDistribution) [with T = long unsigned int; size_t = long unsigned int]’
/home/rou/src/xgboost/include/xgboost/data.h:186:14:   required from here
/home/rou/src/xgboost/src/common/host_device_vector.cu:56:16: error: cannot call member function ‘size_t xgboost::HostDeviceVectorImpl<T>::Size() const [with T = long unsigned int; size_t = long unsigned int]’ without object
       LazyResize(vec_->Size());
            ~~~~^~

@rongou
Copy link
Contributor Author

rongou commented Mar 6, 2019

The error is similar to protocolbuffers/protobuf#5805 and tensorflow/tensorflow#26155.

@RAMitchell
Copy link
Member

RAMitchell commented Mar 6, 2019

Can you please add a new entry to the Jenkinsfile build matrix for cuda 10.1.

I think we can also remove the second entry from the build matrix, we should have sufficient coverage without it and it will speed up CI:

def buildMatrix = [
    [ "enabled": true,  "os" : "linux", "withGpu": true, "withNccl": true,  "withOmp": true, "pythonVersion": "2.7", "cudaVersion": "9.2", "multiGpu": true],
    [ "enabled": true,  "os" : "linux", "withGpu": true, "withNccl": true,  "withOmp": true, "pythonVersion": "2.7", "cudaVersion": "9.2" ],
    [ "enabled": true,  "os" : "linux", "withGpu": true, "withNccl": true,  "withOmp": true, "pythonVersion": "2.7", "cudaVersion": "8.0" ],
    [ "enabled": true,  "os" : "linux", "withGpu": true, "withNccl": false, "withOmp": true, "pythonVersion": "2.7", "cudaVersion": "8.0" ],
]

EDIT: Looks like there are no cuda 10.1 images yet (https://hub.docker.com/r/nvidia/cuda/). Maybe we can track down who maintains these.

@rongou
Copy link
Contributor Author

rongou commented Mar 7, 2019

Thought this was a compiler bug, I built gcc 8.3 from source, but still getting the same error.

@rongou
Copy link
Contributor Author

rongou commented Mar 7, 2019

Wow, it is a compiler bug. Got everything built under gcc 6.

@rongou
Copy link
Contributor Author

rongou commented Mar 7, 2019

@RAMitchell added 10.1 to the build matrix.

This should work now as long as we stick with gcc 6.

@rongou rongou changed the title [WIP] support cuda 10.1 support cuda 10.1 Mar 7, 2019
@trivialfis
Copy link
Member

@rongou That's problematic. We might need to workaround that. Is it possible to file an issue at GCC with host code only?

@rongou
Copy link
Contributor Author

rongou commented Mar 7, 2019

Only partially fixes #4220 :(

@hcho3
Copy link
Collaborator

hcho3 commented Mar 7, 2019

Wait, is 10.1 container available already?

@rongou
Copy link
Contributor Author

rongou commented Mar 7, 2019

@hcho3 yeah just got pushed to docker hub.

@trivialfis agreed. Maybe we should be less ambitious and stick with cuda 10.0 for now. Let me try to figure out if I can reproduce the bug easily.

@rongou
Copy link
Contributor Author

rongou commented Mar 7, 2019

Hmm, maybe it's actually a cuda sdk bug. In the generated host_device_vector.cudafe1.cpp, that line is changed to LazyResize(Size());, which is obviously incorrect.

@rongou rongou changed the title support cuda 10.1 support cuda 10.0 Mar 7, 2019
@trivialfis
Copy link
Member

trivialfis commented Mar 7, 2019

@rongou Anyway I worked around it by de-referencing the pointer first:

(*vec_).Size();

@rongou
Copy link
Contributor Author

rongou commented Mar 7, 2019

@trivialfis cool, that's a good workaround for now. I made the change and added a TODO.

@rongou rongou changed the title support cuda 10.0 support cuda 10.1 Mar 7, 2019
@codecov-io
Copy link

codecov-io commented Mar 7, 2019

Codecov Report

Merging #4223 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4223      +/-   ##
==========================================
- Coverage   67.28%   67.27%   -0.01%     
==========================================
  Files         132      132              
  Lines       12219    12220       +1     
==========================================
  Hits         8221     8221              
- Misses       3998     3999       +1
Impacted Files Coverage Δ
src/tree/updater_colmaker.cc 39.6% <0%> (-0.08%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ac3d030...e86365e. Read the comment docs.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants