
[CI] Framework and hardware-specific CI tests #997

Merged · 37 commits · Nov 2, 2022

Conversation

anton-l
Member

@anton-l anton-l commented Oct 26, 2022

Now we have the following GitHub Actions runners as separate machines:

  • docker-gpu (V100 GPU)
  • docker-cpu (8 Intel CPU cores, MKL-ready)
  • docker-tpu (v3-8 TPU)
  • apple-m1 (Apple Silicon VM)

This PR sorts the tests to use the appropriate runners and base docker images:

  • ⚡ Fast/PR tests:
    • Test class name doesn't start with Flax or Onnx?
      • => Run with the diffusers-pytorch-cpu image on the docker-cpu runner
      • => Run with a conda env on the apple-m1 runner
    • Test class name starts with Flax?
      • => Run with the diffusers-flax-cpu image on the docker-cpu runner
    • Test class name starts with Onnx?
      • => Run with the diffusers-onnxruntime-cpu image on the docker-cpu runner
  • 🐢 Slow/Merge tests:
    • Test class name doesn't start with Flax or Onnx?
      • => Run with the diffusers-pytorch-cuda image on the docker-gpu runner
    • Test class name starts with Flax?
      • => Run with the diffusers-flax-tpu image on the docker-tpu runner
    • Test class name starts with Onnx?
      • => Run with the diffusers-onnxruntime-cuda image on the docker-gpu runner
  • 🎨 Examples tests:
    • => Run with the diffusers-pytorch-cuda image on the docker-gpu runner
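
The routing rules above can be sketched in a few lines. This is an illustrative helper, not the workflow's actual logic; the function name `route_slow_test` is hypothetical:

```python
# Route a slow test class to a (runner, image) pair based on its name
# prefix, following the rules described in the PR description.
def route_slow_test(class_name: str) -> tuple[str, str]:
    if class_name.startswith("Flax"):
        return ("docker-tpu", "diffusers/diffusers-flax-tpu")
    if class_name.startswith("Onnx"):
        return ("docker-gpu", "diffusers/diffusers-onnxruntime-cuda")
    # Everything else is assumed to be a PyTorch test.
    return ("docker-gpu", "diffusers/diffusers-pytorch-cuda")
```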

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Oct 26, 2022

The documentation is not available anymore as the PR was closed or merged.

Comment on lines +14 to +15
OMP_NUM_THREADS: 4
MKL_NUM_THREADS: 4
Member Author

The CPU runner has 8 cores => 2 pytest workers × 4 threads each.
Speed isn't affected by this change (it's only faster thanks to the new docker image).
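
The thread budget can be expressed as a small sketch (the constants mirror the comment above; nothing here is from the actual workflow code):

```python
import os

# 8 CPU cores split across 2 pytest-xdist workers leaves
# 4 BLAS/OpenMP threads per worker.
CORES = 8
WORKERS = 2
threads_per_worker = CORES // WORKERS

# These must be set before NumPy/PyTorch load their BLAS backends,
# otherwise the thread pools are already sized.
os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)
os.environ["MKL_NUM_THREADS"] = str(threads_per_worker)
```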

Comment on lines +23 to +39
matrix:
  config:
    - name: Fast PyTorch CPU tests on Ubuntu
      framework: pytorch
      runner: docker-cpu
      image: diffusers/diffusers-pytorch-cpu
      report: torch_cpu
    - name: Fast Flax CPU tests on Ubuntu
      framework: flax
      runner: docker-cpu
      image: diffusers/diffusers-flax-cpu
      report: flax_cpu
    - name: Fast ONNXRuntime CPU tests on Ubuntu
      framework: onnxruntime
      runner: docker-cpu
      image: diffusers/diffusers-onnxruntime-cpu
      report: onnx_cpu
Member Author

This matrix defines the different combinations of frameworks, docker images, and runners to test.
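
Conceptually, GitHub Actions expands the matrix into one job per `config` entry. A rough sketch of that expansion (hypothetical, for illustration only):

```python
# Each dict corresponds to one entry in the workflow's matrix.config list;
# the CI spawns one job per entry, substituting each field into the steps.
configs = [
    {"framework": "pytorch", "runner": "docker-cpu",
     "image": "diffusers/diffusers-pytorch-cpu", "report": "torch_cpu"},
    {"framework": "flax", "runner": "docker-cpu",
     "image": "diffusers/diffusers-flax-cpu", "report": "flax_cpu"},
    {"framework": "onnxruntime", "runner": "docker-cpu",
     "image": "diffusers/diffusers-onnxruntime-cpu", "report": "onnx_cpu"},
]
jobs = [f"{c['framework']} on {c['runner']} ({c['image']})" for c in configs]
```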

Contributor

Very nice

Comment on lines 38 to +46
class OnnxStableDiffusionPipelineIntegrationTests(unittest.TestCase):
    def test_inference(self):
        provider = (
            "CUDAExecutionProvider",
            {
                "gpu_mem_limit": "17179869184",  # 16GB.
                "arena_extend_strategy": "kSameAsRequested",
            },
        )
Member Author

Onnx tests now run with the CUDA provider. This enables us to add more integration tests without worrying about inference speed.
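
As a side note on the snippet above: ONNX Runtime provider options are passed as strings, and the `"17179869184"` limit is just 16 GiB in bytes. A quick sketch of the derivation:

```python
# The ORT CUDA provider takes its memory limit as a string of bytes;
# 16 GiB = 16 * 1024^3 bytes.
gpu_mem_limit_bytes = 16 * 1024**3

provider = (
    "CUDAExecutionProvider",
    {
        "gpu_mem_limit": str(gpu_mem_limit_bytes),
        "arena_extend_strategy": "kSameAsRequested",
    },
)
```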

Comment on lines +62 to +64
assert images.shape == (8, 1, 128, 128, 3)
assert np.abs(np.abs(images[0, 0, :2, :2, -2:], dtype=np.float32).sum() - 3.1111548) < 1e-3
assert np.abs(np.abs(images, dtype=np.float32).sum() - 199746.95) < 5e-1
Member Author

@anton-l anton-l Nov 1, 2022

Not sure why this slow (TPU) test had different values before. Updated the reference values to what I got on the TPU runner with jax[tpu].
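
The assertion style quoted above compares an absolute-sum "fingerprint" of the output tensor against a reference value with a tolerance. A minimal sketch of that pattern (the helper name `fingerprint_close` and the constant test array are hypothetical):

```python
import numpy as np

def fingerprint_close(images: np.ndarray, expected_sum: float, tol: float) -> bool:
    # Sum of absolute values acts as a cheap numerical fingerprint
    # of the whole tensor, compared within a backend-dependent tolerance.
    total = float(np.abs(images.astype(np.float32)).sum())
    return abs(total - expected_sum) < tol

# A constant array makes the expected sum exact and easy to verify.
images = np.full((8, 1, 128, 128, 3), 0.5, dtype=np.float32)
reference = 8 * 1 * 128 * 128 * 3 * 0.5
```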

Comment on lines +314 to +319
if jax_device == "tpu":
assert abs(result_sum - 255.0714) < 1e-2
assert abs(result_mean - 0.332124) < 1e-3
else:
assert abs(result_sum - 255.1113) < 1e-2
assert abs(result_mean - 0.332176) < 1e-3
Member Author

The scheduler tests needed some adjustments for when they're running on TPU.
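
The adjustment boils down to selecting per-backend reference values, since TPU arithmetic (bfloat16 matmuls) drifts slightly from CPU/GPU. A sketch of the pattern, using the reference values from the snippet above (the helper name is hypothetical):

```python
# Reference (sum, mean) statistics differ slightly between backends,
# so the expected values are chosen by device string.
def expected_stats(jax_device: str) -> tuple[float, float]:
    if jax_device == "tpu":
        return (255.0714, 0.332124)
    return (255.1113, 0.332176)
```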

Comment on lines +593 to +600
if jax_device == "tpu":
pass
# FIXME: both result_sum and result_mean are nan on TPU
# assert jnp.isnan(result_sum)
# assert jnp.isnan(result_mean)
else:
assert abs(result_sum - 149.0784) < 1e-2
assert abs(result_mean - 0.1941) < 1e-3
Member Author

@anton-l anton-l Nov 1, 2022

Probably not too urgent to fix

Contributor

Great! Glad the tests are catching these things, though!

Comment on lines -92 to -93
"jax>=0.2.8,!=0.3.2,<=0.3.6",
"jaxlib>=0.1.65,<=0.3.6",
Member Author

@anton-l anton-l Nov 1, 2022

Removing the version cap from jax, as we can use the latest version now (and it's required for docker support)

@anton-l anton-l changed the title [WIP][CI] Framework and hardware-specific docker images for CI tests [CI] Framework and hardware-specific docker images for CI tests Nov 1, 2022
@anton-l anton-l changed the title [CI] Framework and hardware-specific docker images for CI tests [CI] Framework and hardware-specific CI tests Nov 1, 2022
@anton-l
Member Author

anton-l commented Nov 1, 2022

The PR is now ready for review, lmk if something needs to be explained more!

cc @muellerzr @ydshieh for optional reviews and/or inspiration :)

@anton-l
Member Author

anton-l commented Nov 1, 2022

@patrickvonplaten
Contributor

Wow - super cool! Great job :-)

@patrickvonplaten
Contributor

Looks all good to me - happy to merge!

HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-s -v -k "Flax" \
Contributor

(nit) I think it's a bit safer/easier to work with environment variables, e.g. RUN_FLAX=True/False, and a test decorator, but OK for me for now!
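
A minimal sketch of that suggestion (names like `parse_flag_from_env` and `require_flax` are hypothetical here, not the actual diffusers helpers): gate tests on an environment variable instead of a `pytest -k` name filter.

```python
import os
import unittest

def parse_flag_from_env(name: str, default: bool = False) -> bool:
    # Treat "1"/"true"/"yes"/"on" (case-insensitive) as truthy;
    # fall back to the default when the variable is unset.
    value = os.environ.get(name, "")
    if value == "":
        return default
    return value.lower() in ("1", "true", "yes", "on")

def require_flax(test_case):
    """Skip the decorated test unless RUN_FLAX is set to a truthy value."""
    return unittest.skipUnless(
        parse_flag_from_env("RUN_FLAX"), "test requires RUN_FLAX=True"
    )(test_case)
```

The CI would then export `RUN_FLAX=True` only in the Flax jobs, and test classes would carry `@require_flax` rather than relying on a naming convention.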

Member Author

Good idea, will add it soon!

Contributor

@patil-suraj patil-suraj left a comment

Very cool! Looks good to me

@anton-l anton-l merged commit 4e59bcc into main Nov 2, 2022
@anton-l anton-l deleted the ci-docker-images branch November 2, 2022 13:07
container:
image: python:3.7
image: ${{ matrix.config.image }}
options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
Contributor

We don't need --gpus 0 or --gpus all if we want to use GPU in the docker? In transformers CI, we specified it.

Contributor

Oh, these are the PR tests, which run only on CPU. Sorry to bother you!

env:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
run: |
python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=tests_torch_gpu tests/
python -m pytest -n 0 \
Contributor

Do we use -n 0 to disable xdist for Flax?

Member Author

Precisely! It looks like jax[tpu] doesn't like being launched with multiprocessing at all: the TPU gets reserved by the parent process, and the tests can't get access to it afterwards: jax-ml/jax#10192

@ydshieh
Contributor

ydshieh commented Nov 2, 2022

Very nice and clean usage of matrix!

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* [WIP][CI] Framework and hardware-specific docker images for CI tests

* username

* fix cpu

* try out the image

* push latest

* update workspace

* no root isolation for actions

* add a flax image

* flax and onnx matrix

* fix runners

* add reports

* onnxruntime image

* retry tpu

* fix

* fix

* build onnxruntime

* naming

* onnxruntime-gpu image

* onnxruntime-gpu image, slow tests

* latest jax version

* trigger flax

* run flax tests in one thread

* fast flax tests on cpu

* fast flax tests on cpu

* trigger slow tests

* rebuild torch cuda

* force cuda provider

* fix onnxruntime tests

* trigger slow

* don't specify gpu for tpu

* optimize

* memory limit

* fix flax tests

* disable docker cache
6 participants