Run GPU tests on Jax + Torch #1160
Conversation
/gcbrun
/gcbrun
/gcbrun
/gcbrun
/gcbrun
@@ -1,7 +1,4 @@
keras-core
# Consider handling GPU here.
torch>=2.0.1+cpu
Curious what your take is here -- this can definitely be handled differently for the docker configs, but the way I see it these aren't really requirements for KerasNLP. wdyt?
My take is basically what is laid out here. `install_requires` should list the minimum needed for a valid install, but `requirements.txt` should be an exhaustive and repeatable recipe for a complete environment.
After the next tf release, I think we can get by with just a `requirements.txt` and a `requirements-cuda.txt`. These should be used by all our tooling (no listing extra deps in docker files, etc.), and used by contributors to set up a development environment that matches our CI.
Short term, we could consider a `requirements-jax.txt`, `requirements-tf.txt`, etc., that match our docker setups. Then all our version pinning, hacking, etc., is consolidated to requirements files.
No need to figure this out all on this PR though. Fine to land as is and keep tweaking.
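For illustration only, a minimal sketch of how the proposed split might be consumed; the file names come from the comment above, and their exact contents would be decided in a follow-up:
```
# Hypothetical usage of the proposed requirements split:
# CPU-only development environment, matching standard CI:
pip install -r requirements.txt

# GPU test environment (docker images and GPU CI would use this instead):
pip install -r requirements-cuda.txt
```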
Okay @mattdangerw I'm sending this for review, but many of the tests are still failing on GCB. For PyTorch, basically all of the failures are due to some sort of GPU device placement issue. I tried just updating the test cases, but it ended up breaking a bunch of other tests with some cascading failures. The Jax+TF failures should now be just the …
/gcbrun
/gcbrun
@ianstenbit flipping back to you with a few last comments on the test setup!
Thanks very much for tackling this! This last failure is a known flake; we can ignore it for now and I will open up a fix.
- Repeat the last two steps for Jax and Torch (replacing "tensorflow" with "jax"
  or "torch" in the docker image target name). `Dockerfile` for jax (see the
  sketch after this snippet):
  ```
  FROM nvidia/cuda:11.7.1-base-ubuntu20.04
  ```
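As a rough, hedged sketch of how the rest of the jax `Dockerfile` might look on top of that base image; the apt packages and the `jax[cuda11_pip]` install line are assumptions taken from the upstream JAX install docs, not from this PR's actual docker configs:
```
# Sketch only -- the real Dockerfile lives in this PR's docker configs.
FROM nvidia/cuda:11.7.1-base-ubuntu20.04

# The CUDA base image ships without Python, so install it first.
RUN apt-get update && \
    apt-get install -y python3 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

# Install jax with pip-managed CUDA libraries, then the repo requirements.
RUN python3 -m pip install --upgrade "jax[cuda11_pip]" \
    -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
COPY requirements.txt .
RUN python3 -m pip install -r requirements.txt
```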
Why not `11.8.0` here? That's the version tf depends on, and it would be nice to consolidate on one version of CUDA for our testing. https://www.tensorflow.org/install/pip
I did 11.7 because that's the current default for PyTorch, and I figured we might as well use the same base image for the two to reduce the possibility of variance. But 11.8 should work for both; we'd just need to update the pip install commands to use the correct CUDA version.
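If we did consolidate on 11.8, the torch install would presumably just point at the cu118 wheel index. A sketch, following PyTorch's standard install instructions rather than anything in this PR:
```
# Hypothetical cu118 install, if the base image moved to 11.8.0:
pip install torch --index-url https://download.pytorch.org/whl/cu118
```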
@@ -65,6 +65,8 @@ jobs:
      - name: Install dependencies
        run: |
          pip install -r requirements.txt --progress-bar off
          pip install torch>=2.0.1+cpu --progress-bar off
I think this would also need to go into the publish-to-PyPI action (though long term, I still favor a requirements file as a single source of truth to avoid duplication).
Yeah, this feels a little janky to me as well. It's possible that we could leave these in `requirements.txt` for now and just have the GPU CI workflow uninstall the CPU versions before installing the GPU versions.
I'm fine either way; this just seemed like less of a headache for the GPU tests.
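For concreteness, the alternative floated above might look roughly like this in the GPU workflow, assuming `requirements.txt` keeps the CPU wheels and only GPU CI swaps them out; the cu117 index URL matches the 11.7 base image discussed earlier and is an assumption, not part of this PR:
```
# Install everything as usual (pulls the CPU torch wheel from requirements.txt),
# then replace it with the CUDA build for the GPU test run.
pip install -r requirements.txt --progress-bar off
pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu117
```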
Here's a fix for that flake, but I think we can merge things in any order: #1171
I can't technically LGTM since I originally opened up the PR, but... LGTM. Feel free to merge this whenever you're ready.
Thanks!!
* Update requirements and README
* Update test configs
* Fix normal CI
* . - but that is actually the commit message
* Fix cloudbuild dockerfile
* Fix docker configs
* Some test case fixes
* Fix rich imports
* Fix test case
* Revert test case
* Fix gpt_neo_x saving
* Skip xlm roberta presets on jax/torch for now
* Fix torch GPU detach errors
* More detach fixes

Co-authored-by: Matt Watson <mattdangerw@gmail.com>
This was done for KerasCV in keras-team/keras-cv#1935.
I've made the relevant changes to our GCB config on the backend already, and I've built the docker images for each of the three test suites.