Downgrade Jenkins agent to CUDA 9.2, to fix broken NCCL2 installation #4232

Merged
hcho3 merged 1 commit into dmlc:master from fix_ci on Mar 8, 2019

Collaborator

@hcho3 hcho3 commented Mar 8, 2019

See https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4231/2/pipeline/51.

The current script downloads NCCL2 from the URL https://developer.download.nvidia.com/compute/redist/nccl/v2.2/nccl_2.2.13-1%2Bcuda${CUDA_VERSION}_x86_64.txz. This URL works fine when CUDA_VERSION is 8.x or 9.x, but it fails for 10.1.
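To make the failure easier to reproduce, here is a minimal Python sketch of how the URL above expands for a given CUDA_VERSION, plus a simple reachability probe. The function names `nccl_url` and `url_is_reachable` are hypothetical and are not part of the CI scripts. The probe only checks for an HTTP 200 on a HEAD request, so it may not catch a redirect to a login page.

```python
from string import Template
import urllib.request

# URL pattern quoted in the PR description; ${CUDA_VERSION} is filled in per agent.
NCCL_URL_TEMPLATE = Template(
    "https://developer.download.nvidia.com/compute/redist/nccl/v2.2/"
    "nccl_2.2.13-1%2Bcuda${CUDA_VERSION}_x86_64.txz"
)


def nccl_url(cuda_version: str) -> str:
    """Expand the download URL for one CUDA version, e.g. "9.2" or "10.1"."""
    return NCCL_URL_TEMPLATE.substitute(CUDA_VERSION=cuda_version)


def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to the URL answers with HTTP 200.

    Any network or HTTP error (including 404) is treated as unreachable.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError and timeouts all derive from OSError
        return False


if __name__ == "__main__":
    # Probe the versions mentioned in this PR; results depend on the remote server.
    for version in ("9.2", "10.1"):
        url = nccl_url(version)
        print(version, url, url_is_reachable(url))
```

Running a probe like this before the install step would let the job fail early with a clear message instead of failing midway through the image build.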

TODO. Either find a link that's not behind a login wall, OR upload installers to a private S3 bucket.
EDIT. I think it would be best if we place NCCL2 installers inside the worker image. This way, we don't generate unnecessary HTTP traffic.
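As a rough illustration of the "installers inside the worker image" idea, the install step could look for a pre-baked archive and only treat the network as a fallback. This is a sketch under assumptions: the directory `/opt/nccl-installers` and the helper names are hypothetical, and the file name simply mirrors the one in the download URL with `%2B` decoded to `+`.

```python
from pathlib import Path
from typing import Optional

# Hypothetical location inside the worker image; not taken from the CI scripts.
BAKED_INSTALLER_DIR = Path("/opt/nccl-installers")


def installer_name(cuda_version: str) -> str:
    """File name of the NCCL2 archive for one CUDA version ('+' is '%2B' decoded)."""
    return f"nccl_2.2.13-1+cuda{cuda_version}_x86_64.txz"


def find_baked_installer(
    cuda_version: str, image_dir: Path = BAKED_INSTALLER_DIR
) -> Optional[Path]:
    """Return the path of the pre-baked archive, or None if the image lacks it."""
    candidate = image_dir / installer_name(cuda_version)
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    local = find_baked_installer("10.1")
    if local is None:
        print("No baked installer found; a download would be needed.")
    else:
        print(f"Using installer from the image: {local}")
```

With this layout, a build on an up-to-date image never touches the network for NCCL2, which is the traffic saving mentioned in the EDIT.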

@hcho3 hcho3 changed the title from "Downgrade Jenkins agent to CUDA 9.2, since NCCL2 cannot be found" to "Downgrade Jenkins agent to CUDA 9.2, until NCCL2 URL is fixed" Mar 8, 2019
@hcho3 hcho3 changed the title from "Downgrade Jenkins agent to CUDA 9.2, until NCCL2 URL is fixed" to "Downgrade Jenkins agent to CUDA 9.2, to fix broken NCCL2 installation" Mar 8, 2019
@codecov-io

codecov-io commented Mar 8, 2019

Codecov Report

Merging #4232 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #4232   +/-   ##
=======================================
  Coverage   67.28%   67.28%           
=======================================
  Files         132      132           
  Lines       12220    12220           
=======================================
  Hits         8222     8222           
  Misses       3998     3998

Powered by Codecov. Last update 224786f...1af54b8.

@hcho3 hcho3 requested review from RAMitchell and trivialfis March 8, 2019 05:47
@hcho3
Collaborator Author

hcho3 commented Mar 8, 2019

I created a new issue to discuss improvements to Jenkins: #4234. I'd like to do a major refactor of our Jenkins setup, for the sake of simplicity, de-duplication, and wider coverage.

@RAMitchell
Member

Hmm, we can probably get the new NCCL binary up soon. I suggest merging this anyway so we can keep using CI; we can change it back to 10.1 later.

@trivialfis
Member

Now that NCCL is open source, can we compile it ourselves? I'm doing this for a CUDA 10.1 setup.

@rongou
Contributor

rongou commented Mar 8, 2019

That's a very old version of NCCL. Maybe upgrade to a newer version?

@hcho3 hcho3 merged commit 20845e8 into dmlc:master Mar 8, 2019
@hcho3
Collaborator Author

hcho3 commented Mar 8, 2019

Merging for now to unblock CI. I will follow up with another PR to support CUDA 10.1 properly.

@hcho3 hcho3 deleted the fix_ci branch March 8, 2019 17:11
@lock lock bot locked as resolved and limited conversation to collaborators Jun 6, 2019