-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Google cloud CI with GPUs in docker #2152
Conversation
Each command tested to work. Need to fix authentication and test with GH actions on github.
This reverts commit e480144.
Start from base Ubuntu 18.04 image Multiple commits can be processed at the same time. Tries 5 times to create VM if not successful.
gcloud CLI reboot disables future ssh
Moved all workflow logic to shell script gce-ubuntu-docker-run.sh) CI_CONFIG_ID used instead of build matrix to limit CI to a few build combinations TODO: change cuda to 10.1 cuDNN 7 add cudnn to docker image
Always run VM and container image delete steps.
This will hit the quota limit (each zone quota is 4GPUs
Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 7 files at r1, 2 of 2 files at r2.
Reviewable status: 6 of 7 files reviewed, 14 unresolved discussions (waiting on @benjaminum, @germanros1987, @ssheorey, and @yxlao)
.github/workflows/gce-ubuntu-docker.yml, line 9 at r1 (raw file):
branches: - master - ssheorey/gce-gpu-ci-docker
this could be removed eventually since the branch is in pull_request
already
.github/workflows/gce-ubuntu-docker.yml, line 29 at r2 (raw file):
# Setup gcloud CLI - uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
give this step a name, so it looks nicer in web UI
.github/workflows/gce-ubuntu-docker.yml, line 31 at r2 (raw file):
- uses: GoogleCloudPlatform/github-actions/setup-gcloud@master with: version: '302.0.0' # Fixed to avoid dependency on API changes
put this in global env so it is easier to change?
.github/workflows/gce-ubuntu-docker.yml, line 38 at r2 (raw file):
# Build the Docker image - name: Build
name: Build docker image
.github/workflows/gce-ubuntu-docker.yml, line 43 at r2 (raw file):
# Push the Docker image to Google Container Registry - name: Publish
name: Push docker image
.github/workflows/gce-ubuntu-docker.yml, line 46 at r2 (raw file):
id: docker-push run: | ./util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh docker-push
if the docker image is built and consumed in local machine, do we need to push it to some repository?
.github/workflows/gce-ubuntu-docker.yml, line 92 at r2 (raw file):
cleanup: name: Cleanup container images and delete VM
"Cleanup container images and delete VM" seems too long, can we use a shorter name for this? Also seems like other job names are pretty long to be displayed on the webpage, we can consider shorter names.
.github/workflows/gce-ubuntu-docker.yml, line 99 at r2 (raw file):
steps: - name: GCloud setup uses: actions/checkout@v2 # Checkout code
do we need to checkout code again? also it looks like we are having nested folders
Open3D/Open3D/
.github/workflows/gce-ubuntu-readme.md, line 8 at r2 (raw file):
- Clones the repository and checks out the specific commit - Installs dependencies (including NVIDIA drivers and CUDA) - Builds the code and runs tests (`run-ci.sh`)
do we need to update workflow for docker?
util/docker/open3d-gpu/scripts/env-setup.sh, line 3 at r2 (raw file):
#!/usr/bin/env bash # Run this script from the CMakeLists.txt (top-level) directory
is this needed?
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 7 at r2 (raw file):
# exit when any command fails set -e
remove trailing space (probably enable this automatically in the editor)
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 102 at r2 (raw file):
until (( TRIES>=5 )) || gcloud compute images create ${VM_IMAGE} --source-disk=$VM_IMAGE \ --source-disk-zone=${GCE_INSTANCE_ZONE[$GCE_ZID]} \ --family=ubuntu-os-docker-gpu-1804-lts \
shall we specify it in the # Container configuration
section?
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 120 at r2 (raw file):
--boot-disk-size=$GCE_BOOT_DISK_SIZE \ --boot-disk-type=$GCE_BOOT_DISK_TYPE \ --image-family=ubuntu-os-docker-gpu-1804-lts \
same, shall we specify it in the # Container configuration
section?
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 134 at r2 (raw file):
--env NPROC=$NPROC \ --env SHARED=${SHARED[$CI_CONFIG_ID]} \ --env BUILD_DEPENDENCY_FROM_SOURCE=${BUILD_DEPENDENCY_FROM_SOURCE[$CI_CONFIG_ID]} \
seems like this has been removed from master
+ fix for propogating GCE_ZID to next step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 7 files reviewed, 14 unresolved discussions (waiting on @benjaminum, @germanros1987, and @yxlao)
.github/workflows/gce-ubuntu-docker.yml, line 9 at r1 (raw file):
Previously, yxlao (Yixing Lao) wrote…
this could be removed eventually since the branch is in
pull_request
already
Done.
.github/workflows/gce-ubuntu-docker.yml, line 29 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
give this step a name, so it looks nicer in web UI
Done.
.github/workflows/gce-ubuntu-docker.yml, line 31 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
put this in global env so it is easier to change?
Done.
.github/workflows/gce-ubuntu-docker.yml, line 38 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
name: Build docker image
Done.
.github/workflows/gce-ubuntu-docker.yml, line 43 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
name: Push docker image
Done.
.github/workflows/gce-ubuntu-docker.yml, line 46 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
if the docker image is built and consumed in local machine, do we need to push it to some repository?
These are all with GPU, so they are pushed to google container registry.
.github/workflows/gce-ubuntu-docker.yml, line 92 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
"Cleanup container images and delete VM" seems too long, can we use a shorter name for this? Also seems like other job names are pretty long to be displayed on the webpage, we can consider shorter names.
New job names:
- Build and push docker
- Build and run on VM
- Clean-up container image
.github/workflows/gce-ubuntu-docker.yml, line 99 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
do we need to checkout code again? also it looks like we are having nested folders
Open3D/Open3D/
Each job may be run on a different runner machine, so need to checkout for each job. I just need a single script for jobs 2 and 3, but the Google Cloud action needs a checkout. Haven't tried without it, though.
Not sure about the nested folders. I don't set the working directory anywhere. I see it in the other github actions as well. Looks like it's coming from the checkout@v2 action.
.github/workflows/gce-ubuntu-readme.md, line 8 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
do we need to update workflow for docker?
Done.
util/docker/open3d-gpu/scripts/env-setup.sh, line 3 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
is this needed?
Checking if installed cmake is new enough by giving it the CMakeLists.txt file to process and watching for the specific version error. Won't be needed if we replace the test with a hardcoded cmake minimum version value. Will need to sync it with the value in CMakeLists.txt though.
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 7 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
remove trailing space (probably enable this automatically in the editor)
Done. Added editor setting.
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 102 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
shall we specify it in the
# Container configuration
section?
Added to VM instance configuration [this is the VM OS]. Also version fixed to 20.04.
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 120 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
same, shall we specify it in the
# Container configuration
section?
Done.
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 134 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
seems like this has been removed from master
Removed. Should I keep BUILD_RPC_INTERFACE? Its not in master, but I can keep it if we expect an update with it soon. @benjaminum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r3, 2 of 2 files at r4.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @benjaminum, @germanros1987, and @yxlao)
util/docker/open3d-gpu/scripts/gce-ubuntu-docker-run.sh, line 134 at r2 (raw file):
Previously, ssheorey wrote…
Removed. Should I keep BUILD_RPC_INTERFACE? Its not in master, but I can keep it if we expect an update with it soon. @benjaminum
RPC is now merged to master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @benjaminum, @germanros1987, @ssheorey, and @yxlao)
util/run_ci.sh, line 95 at r4 (raw file):
# Build the rpc interface only if we do not build the cuda module and the # ml module to keep build times short if [ "$BUILD_CUDA_MODULE" == "OFF" -a "$BUILD_TENSORFLOW_OPS" == "OFF" ]; then
We can remove this. This code was used before defining the new build confiurations.
Required for Python 3.8 RPC is built based on build configs only Moved versions into variables at top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @benjaminum, @germanros1987, and @yxlao)
util/run_ci.sh, line 95 at r4 (raw file):
Previously, benjaminum (Benjamin Ummenhofer) wrote…
We can remove this. This code was used before defining the new build confiurations.
Done.
Here is an overview of what got changed by this pull request: Issues
======
+ Solved 2
See the complete overview on Codacy |
* GCE GPU CI scripts Each command tested to work. Need to fix authentication and test with GH actions on github. * Moved one time setup steps to readme * Manually trigger action for testing * Attempt to reset GitHub action * Revert "Attempt to reset GitHub action" This reverts commit e480144. * workflow rename * Removed document separators in yaml file * fixed branch name * Crete new VM instance for every commit Start from base Ubuntu 18.04 image Multiple commits can be processed at the same time. Tries 5 times to create VM if not successful. * Syntax error fix * Syntax fix isl-org#2 * GH actions yaml expressions syntax fix isl-org#3 * syntax fix isl-org#4 * Syntax fix isl-org#5 * Syntax error fix isl-org#5 * Syntax fix isl-org#7 * Syntax fix # 8 * Syntax fix # 9 * Syntax fix # 10 * Reboot with gcloud commands, always delete VM * VM reboot through ssh gcloud CLI reboot disables future ssh * Wait and retry for VM to boot up after reboot * Added Python 3 setup * Fix syntax errors # 12 * Link error fix * Documentation update * docker WIP * docker GPU CI shell script tested Moved all workflow logic to shell script gce-ubuntu-docker-run.sh) CI_CONFIG_ID used instead of build matrix to limit CI to a few build combinations TODO: change cuda to 10.1 cuDNN 7 add cudnn to docker image * Syntax error fix, recursive submodules checkout * Sequential order for init, CI and cleanup jobs Always run VM and container image delete steps. * Try different GCE zones if VM creation doesn't work This will hit the quota limit (each zone quota is 4GPUs * Path cleanup after rebase * Fix for install_deps_ubuntu.sh rename * Script name fix isl-org#2 * New dependency git for compilation * format md * Yixing's comments + fix for propogating GCE_ZID to next step * Markdown style fixes for codacy * itentially fail a unit test to test CI * fix unit test * Merge branch 'master' into ssheorey/gce-gpu-ci-docker * tensorflow version 2.0.0 -> 2.3.0 Required for Python 3.8 RPC is built based on build configs only Moved versions into variables at top. * Added BUILD_RPC_INTERFACE=ON Co-authored-by: Yixing Lao <yixing.lao@gmail.com>
Entire CI run takes ~25-30 minutes and runs build configurations in parallel.
Tasks:
This change is