
Build for GPU on CircleCI #8829

Closed · wants to merge 13 commits

Conversation

@luhenry (Contributor) commented Feb 22, 2024

This change adds just enough support to build Velox with GPU support on
CircleCI. That will help ensure the build with VELOX_ENABLE_GPU
doesn't regress, even if it doesn't run tests just yet.

Co-authored-by: Sergei Lewis slewis@rivosinc.com
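
As a rough illustration (not the PR's actual diff), a CircleCI job enabling this build might look like the sketch below; only the VELOX_ENABLE_GPU flag comes from the description above, while the job name, image, and exact commands are assumptions:

```yaml
# Hypothetical CircleCI job sketch. Only VELOX_ENABLE_GPU is taken from
# the PR description; job name, image, and commands are assumptions.
jobs:
  linux-build-gpu:
    docker:
      - image: velox-ci-cuda:latest  # placeholder; #8828 adds CUDA to the CI image
    resource_class: 2xlarge
    steps:
      - checkout
      - run:
          name: Build with GPU support (build only, no GPU tests yet)
          command: |
            cmake -B _build/release -DCMAKE_BUILD_TYPE=Release \
                  -DVELOX_ENABLE_GPU=ON
            cmake --build _build/release -j 8
```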

This will allow us to build for CUDA on CI
@facebook-github-bot added the CLA Signed label Feb 22, 2024
netlify bot commented Feb 22, 2024

Deploy Preview for meta-velox canceled.

Latest commit: ef7cd95
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/65d8b167efad3f0008006b7c

@luhenry (Contributor, Author) commented Feb 22, 2024

This depends on #8828 for CUDA support in the CircleCI Docker image.

@luhenry mentioned this pull request Feb 23, 2024
@luhenry (Contributor, Author) commented Feb 23, 2024

After trying to use the GPU executors, I've come to the realization that they don't seem compatible with running the steps in Docker. @assignUser @kgpai, I'm not sure what to do next. I think we still want to go ahead and add the build on a Linux/Docker executor for now, and we can add a downstream test job that uses the GPU executor.

@kgpai (Contributor) commented Feb 23, 2024

> After trying to use the GPU executors, I've come to the realization that they don't seem compatible with running the steps in Docker.

Sorry, can you add more detail? What is the problem you are seeing?

@kgpai (Contributor) commented Feb 23, 2024

Hi @luhenry, I think this is what should be done:

  1. We have 4-core-ubuntu-gpu-t4 as a GPU-supported runner in GHA.
  2. These runners are in GHA, so the changes in Add CUDA to the CircleCI docker image #8828 can't be used (as that's specific to CCI).
  3. You will need to make changes to the scripts/centos-8... or scripts/ubuntu Dockerfiles.
  4. Set up a new GHA job that uses these images and the new runner above, along with the changes in this PR (see the sketch after this list).
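
A minimal sketch of step 4, assuming the runner label from the list above and GHA's standard checkout action; the workflow name and build commands are illustrative guesses:

```yaml
# Hypothetical GHA job sketch. The runner label 4-core-ubuntu-gpu-t4 comes
# from the list above; everything else is an assumption.
name: GPU Build
on: [pull_request]

jobs:
  gpu-build:
    runs-on: 4-core-ubuntu-gpu-t4
    steps:
      - uses: actions/checkout@v4
      - name: Build with GPU support
        run: |
          cmake -B _build/release -DCMAKE_BUILD_TYPE=Release \
                -DVELOX_ENABLE_GPU=ON
          cmake --build _build/release -j 4
```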

@assignUser (Collaborator) commented

@kgpai @luhenry We are a bit blocked on the GHA migration due to limitations of actions/cache when it's used to cache a build cache like ccache, but I have a fix almost ready.
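
For context, the relevant limitation is that actions/cache entries are immutable once saved, so a continuously-updated build cache like ccache has to be saved under a fresh key each run and restored by prefix match; a minimal sketch (key names are illustrative):

```yaml
# Hypothetical sketch of caching ccache with actions/cache.
# Cache entries are immutable, so save under a per-run key and
# restore the most recent entry via a prefix match.
- uses: actions/cache@v4
  with:
    path: ~/.ccache
    key: ccache-gpu-${{ github.run_id }}  # unique key every run
    restore-keys: |
      ccache-gpu-
```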

I haven't used the official CUDA-capable runners yet, but with our self-hosted ones it's important to get the host and Docker CUDA versions to match, so we should either run this without Docker or use a CUDA base image. I'll check out the GPU runners and see what works.
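
A minimal sketch of the CUDA-base-image option, assuming a public nvidia/cuda devel tag; the exact tag and the container options are assumptions and must be compatible with the host driver:

```yaml
# Hypothetical sketch: run the job inside a CUDA devel base image so the
# toolkit version is pinned by the container, not the host.
jobs:
  gpu-test:
    runs-on: 4-core-ubuntu-gpu-t4
    container:
      image: nvidia/cuda:12.2.0-devel-ubuntu22.04  # assumed tag
      options: --gpus all  # assumed needed to expose the GPU to the container
    steps:
      - uses: actions/checkout@v4
      - run: nvcc --version  # sanity-check the toolkit inside the container
```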

Do we want to run this on every PR? Or would a scheduled nightly job be enough to spot regressions (+ maybe on PRs with direct changes to the experimental features?)

@luhenry (Contributor, Author) commented Feb 24, 2024

> Sorry, can you add more detail? What is the problem you are seeing?

It is a limitation of the CircleCI GPU runners: you have to specify a machine: executor, which makes it impossible to specify a docker: executor.
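
Concretely, CircleCI's GPU resource classes are only available with a machine executor, roughly as below; the image tag and resource class follow CircleCI's documented conventions but should be treated as assumptions here:

```yaml
# CircleCI GPU jobs require a machine executor; a docker executor
# cannot be combined with a GPU resource class.
jobs:
  gpu-test:
    machine:
      image: linux-cuda-12:default  # assumed CUDA machine image tag
    resource_class: gpu.nvidia.medium  # assumed GPU resource class
    steps:
      - checkout
```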

On using GHA to build and run with GPUs: when do you expect to have all of that ready? With #8828, we would be able to build the CUDA code on CircleCI today.

I'm happy to do the work to add a specific GHA workflow/job to build and test on GHA. But if that's not going to be merged in the coming weeks or months because it's blocked on something else, I would much rather get what's currently available merged (even if it's only building, given it doesn't even build today) and improve things later, even if it means the work I'm doing today gets thrown away and I have to redo it later.

> Do we want to run this on every PR? Or would a scheduled nightly job be enough to spot regressions (+ maybe on PRs with direct changes to the experimental features?)

I think the build step should run on every PR, since it's build-only and there shouldn't be any flakiness or slowness to it. For the testing, it entirely depends on the capacity and availability of the GPU runners. If there are enough, the tests shouldn't take much longer than a non-GPU test run.

@pedroerp what do you think?

@kgpai (Contributor) commented Feb 24, 2024

> On using GHA to build and run with GPUs: when do you expect to have all of that ready? With #8828, we would be able to build the CUDA code on CircleCI today.

We expect this to happen in a few weeks at most. While we could get this working in CCI, we will still need to get it working in GHA, since we are migrating from CCI to GHA in the very short term; changes to CCI are therefore additional, wasted effort.

> I think the build step should run on every PR, since it's build-only and there shouldn't be any flakiness or slowness to it. For the testing, it entirely depends on the capacity and availability of the GPU runners. If there are enough, the tests shouldn't take much longer than a non-GPU test run.

This makes sense: we can build the GPU part on every PR but only exercise the code, say, nightly on a GPU.
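
That split could be expressed with GHA triggers along these lines (the cron time is an arbitrary assumption):

```yaml
# Hypothetical trigger split: build the GPU code on every PR,
# run the GPU tests only on a nightly schedule.
on:
  pull_request:           # build-only check on every PR
  schedule:
    - cron: '0 3 * * *'   # nightly GPU test run
```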

@luhenry (Contributor, Author) commented Apr 2, 2024

Closed in favor of #9335

@assignUser closed this Apr 3, 2024