Forward port #19895 #19903

josephevans · 2021-02-16T17:47:48Z

Description

Forward port #19895 from v1.x to master.

This PR makes a number of changes to make CI more stable:

* Remove the SafeDocker client, which uses the Python docker package to run containers, and instead invoke the "docker run" command directly through subprocess.call(). The python-docker client does not support the gpus parameter that newer docker versions use, and we don't get timeout issues when using the docker command directly. This will allow us to update our AMIs to use newer docker versions.
* To support both docker variants simultaneously, first try the --gpus all parameter to docker run; if it fails with error code 125 (which means the docker run command itself failed), retry using the old --runtime nvidia parameter.
* Remove the extra custom codecov calls (which usually fail and have to retry multiple times, even though the initial codecov command works).
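The GPU fallback described above could look roughly like the following. This is only a minimal sketch, not the code from this PR: the names `build_docker_cmd`, `run_gpu_container`, and the injectable `runner` parameter are hypothetical and exist only for illustration.

```python
import subprocess

# Exit status the description associates with "docker run itself failed".
DOCKER_RUN_FAILED = 125


def build_docker_cmd(image, command, gpu_args):
    """Build the argv list for `docker run` (hypothetical helper)."""
    return ["docker", "run", "--rm", *gpu_args, image, *command]


def run_gpu_container(image, command, runner=subprocess.run):
    """Try `--gpus all` first; retry with `--runtime nvidia` on exit code 125.

    `runner` defaults to subprocess.run and is injectable so the logic can be
    exercised without a docker installation (an assumption of this sketch).
    """
    result = None
    for gpu_args in (["--gpus", "all"], ["--runtime", "nvidia"]):
        cmd = build_docker_cmd(image, command, gpu_args)
        # No shell is involved; the child inherits the parent's stdout/stderr.
        result = runner(cmd)
        if result.returncode != DOCKER_RUN_FAILED:
            break  # docker accepted the flags; success or failure is the container's
    return result.returncode
```

Any exit code other than 125 is returned unchanged, so a failing build inside the container is not retried with the older flag.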

* Test moving pipelines from p3 to g4.
* Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)
* Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395
* Remove old files.
* Fix comment
* Set default environment variables
* Fix GPU syntax.
* Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.
* Check if codecov works without providing parameters now.
* Send docker stderr to sys.stderr
* Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.

Co-authored-by: Joe Evans <joeev@amazon.com>
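To illustrate why the commit passes subprocess arguments as a list, here is a small sketch that needs no docker installation. The helper name `echo_argument` is hypothetical, and the child process is simply the Python interpreter; the point is only that a list element reaches the child as one literal argument and is never parsed by a shell.

```python
import subprocess
import sys


def echo_argument(value):
    """Pass `value` as a single argv entry to a child Python process.

    Returns what the child received, showing that shell metacharacters in
    `value` are not interpreted when the command is given as a list.
    """
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", value]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, text=True, check=True)
    return result.stdout.rstrip("\r\n")
```

With a string command and `shell=True`, a value such as `img; echo injected` would be split at the semicolon by the shell; with the list form it is passed through as-is.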

mxnet-bot · 2021-02-16T17:47:52Z

Hey @josephevans, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [miscellaneous, edge, centos-cpu, windows-gpu, website, sanity, clang, unix-gpu, windows-cpu, centos-gpu, unix-cpu]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

josephevans and others added 2 commits February 16, 2021 17:37

Fix merge

d6db57f

josephevans requested review from aaronmarkham and marcoabreu as code owners February 16, 2021 17:47

lanking520 added the pr-awaiting-testing and pr-awaiting-review labels and removed the pr-awaiting-testing label Feb 16, 2021

leezu merged commit 2c4f5aa into apache:master Feb 19, 2021