How to reproduce a crash in Windows pipeline: an example #8

Closed
hcho3 opened this issue Dec 2, 2019 · 3 comments

Comments


hcho3 commented Dec 2, 2019

Consider this example: https://xgboost-ci.net/blue/organizations/jenkins/xgboost-win64/detail/PR-5078/4/pipeline/73
[Image: Screen Shot 2019-12-01 at 7.09.49 PM]

Let's try to reproduce the issue. The very first step is to find which machine image (AMI) was used. In this case, it's test-win64-gpu-cuda10.0, which corresponds to the image Windows2016_GPUTest_CUDA10_Dec2019.

  1. Log into the EC2 console and launch a new EC2 instance using the image Windows2016_GPUTest_CUDA10_Dec2019. Use the g4dn.xlarge instance type and the password you set in "How to build machine image (AMI) to run test pipeline on Windows" (#7).
  2. Locate the artifact that's causing the issue. In this example, it's testxgboost.exe.
  3. Go to the S3 console and navigate to the S3 bucket xgboost-ci-jenkins-artifacts. This works because in "How to set up a Jenkins master node from scratch" (#6) we configured Jenkins to store all artifacts in S3. The prefix for the artifact in this example is xgboost-win64/PR-5078/4/stashes. (In general, the prefix has the form <pipeline name>/<Pull Request ID>/<Build ID>/stashes.) Now we can download xgboost_cpp_tests.tgz, which contains testxgboost.exe.
  4. Copy over xgboost_cpp_tests.tgz to the EC2 instance.
  5. Install 7-zip to extract testxgboost.exe from the tgz file.
  6. Run testxgboost.exe
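
As a rough sketch of steps 3 and 5, the artifact location can be computed from the pipeline name, pull request ID, and build ID, and the executable can be pulled out of the tgz with Python's standard library instead of 7-zip. The helper names (`stash_prefix`, `s3_uri`, `extract_member`) are made up for illustration. The directory layout inside xgboost_cpp_tests.tgz is not described above, so the extraction helper searches by file name rather than assuming a path.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath

# Bucket name taken from step 3 above.
BUCKET = "xgboost-ci-jenkins-artifacts"


def stash_prefix(pipeline: str, pull_request: str, build_id: int) -> str:
    """Build <pipeline name>/<Pull Request ID>/<Build ID>/stashes."""
    return f"{pipeline}/{pull_request}/{build_id}/stashes"


def s3_uri(pipeline: str, pull_request: str, build_id: int, filename: str) -> str:
    """Full S3 URI of a stashed artifact (hypothetical helper)."""
    return f"s3://{BUCKET}/{stash_prefix(pipeline, pull_request, build_id)}/{filename}"


def extract_member(archive: Path, member_name: str, dest: Path) -> Path:
    """Copy the first regular file named `member_name` out of a .tgz archive.

    The archive's internal layout is an unknown here, so members are
    matched on their base name only.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and PurePosixPath(member.name).name == member_name:
                target = dest / member_name
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
                return target
    raise FileNotFoundError(f"{member_name} not found in {archive}")


if __name__ == "__main__":
    # Values from the example build in this issue.
    print(s3_uri("xgboost-win64", "PR-5078", 4, "xgboost_cpp_tests.tgz"))
    archive = Path("xgboost_cpp_tests.tgz")
    if archive.exists():  # only extract if the tgz was already downloaded
        print(extract_member(archive, "testxgboost.exe", Path(".")))
```

The printed URI can be handed to any S3 client that has credentials for the bucket, for example `aws s3 cp <uri> .` with the AWS CLI, assuming it is installed on the instance.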

hcho3 commented Dec 2, 2019

@trivialfis FYI, this is a blocking issue


hcho3 commented Dec 2, 2019

I resolved this particular problem by installing the latest driver from http://www.nvidia.com/drivers. It seems like using the latest driver (412.36) is causing an issue. I will try version 411.82 (the first to support Tesla T4) instead.


hcho3 commented Dec 2, 2019

I have to conclude that CUDA 10.0 can't really work with Tesla T4 (G4 instance), at least on Windows. I'll just use P2 type instead.

@hcho3 hcho3 pinned this issue Dec 23, 2019
@hcho3 hcho3 unpinned this issue Dec 23, 2019
@hcho3 hcho3 closed this as completed Sep 29, 2022