[CI] Framework and hardware-specific CI tests #997

Merged: 37 commits, Nov 2, 2022

Commits (all by anton-l):
b43319a  [WIP][CI] Framework and hardware-specific docker images for CI tests (Oct 26, 2022)
3247464  username (Oct 26, 2022)
f796f2b  fix cpu (Oct 26, 2022)
b30fadd  try out the image (Oct 26, 2022)
ff02418  push latest (Oct 26, 2022)
eaeadab  update workspace (Oct 26, 2022)
d463c79  no root isolation for actions (Oct 26, 2022)
9148936  add a flax image (Oct 26, 2022)
54d9357  flax and onnx matrix (Oct 26, 2022)
9f9ae16  fix runners (Oct 26, 2022)
24420c1  add reports (Oct 26, 2022)
f4fdf5c  onnxruntime image (Oct 26, 2022)
c3c03bd  retry tpu (Oct 27, 2022)
b5821a4  fix (Oct 27, 2022)
adede47  fix (Oct 27, 2022)
0c5cc43  build onnxruntime (Oct 27, 2022)
a6c4f31  naming (Oct 27, 2022)
45bb7be  onnxruntime-gpu image (Oct 31, 2022)
3a644b6  Merge remote-tracking branch 'origin/main' into ci-docker-images (Oct 31, 2022)
6c8bc3e  onnxruntime-gpu image, slow tests (Oct 31, 2022)
f3ac32f  Merge main (Oct 31, 2022)
a62cdd1  latest jax version (Oct 31, 2022)
85ce44b  trigger flax (Oct 31, 2022)
2b03693  run flax tests in one thread (Oct 31, 2022)
948b666  fast flax tests on cpu (Oct 31, 2022)
99bfc51  fast flax tests on cpu (Oct 31, 2022)
7436fd8  trigger slow tests (Oct 31, 2022)
cbc03a4  rebuild torch cuda (Oct 31, 2022)
0b7e57b  force cuda provider (Oct 31, 2022)
cb7db9b  fix onnxruntime tests (Oct 31, 2022)
47225c2  Merge branch 'main' into ci-docker-images (Oct 31, 2022)
e3cbd63  trigger slow (Oct 31, 2022)
2894f76  don't specify gpu for tpu (Oct 31, 2022)
735f4ee  optimize (Oct 31, 2022)
c5ffe37  memory limit (Oct 31, 2022)
c4e8dd6  fix flax tests (Nov 1, 2022)
cf7c438  disable docker cache (Nov 1, 2022)
50 changes: 50 additions & 0 deletions .github/workflows/build_docker_images.yml
@@ -0,0 +1,50 @@
name: Build Docker images (nightly)

on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *"  # every day at midnight

concurrency:
  group: docker-image-builds
  cancel-in-progress: false

env:
  REGISTRY: diffusers

jobs:
  build-docker-images:
    runs-on: ubuntu-latest

    permissions:
      contents: read
      packages: write

    strategy:
      fail-fast: false
      matrix:
        image-name:
          - diffusers-pytorch-cpu
          - diffusers-pytorch-cuda
          - diffusers-flax-cpu
          - diffusers-flax-tpu
          - diffusers-onnxruntime-cpu
          - diffusers-onnxruntime-cuda

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ env.REGISTRY }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v3
        with:
          no-cache: true
          context: ./docker/${{ matrix.image-name }}
          push: true
          tags: ${{ env.REGISTRY }}/${{ matrix.image-name }}:latest
78 changes: 63 additions & 15 deletions .github/workflows/pr_tests.yml
@@ -11,19 +11,45 @@ concurrency:

 env:
   DIFFUSERS_IS_CI: yes
-  OMP_NUM_THREADS: 8
-  MKL_NUM_THREADS: 8
+  OMP_NUM_THREADS: 4
+  MKL_NUM_THREADS: 4
Comment on lines +14 to +15 (Member Author): The CPU runner has 8 cores => 2 pytest workers * 4 cores. The speed isn't affected by this change (it's only faster, due to the new docker image).

   PYTEST_TIMEOUT: 60
   MPS_TORCH_VERSION: 1.13.0
 
 jobs:
-  run_tests_cpu:
-    name: CPU tests on Ubuntu
-    runs-on: [ self-hosted, docker-gpu ]
+  run_fast_tests:
+    strategy:
+      fail-fast: false
+      matrix:
+        config:
+          - name: Fast PyTorch CPU tests on Ubuntu
+            framework: pytorch
+            runner: docker-cpu
+            image: diffusers/diffusers-pytorch-cpu
+            report: torch_cpu
+          - name: Fast Flax CPU tests on Ubuntu
+            framework: flax
+            runner: docker-cpu
+            image: diffusers/diffusers-flax-cpu
+            report: flax_cpu
+          - name: Fast ONNXRuntime CPU tests on Ubuntu
+            framework: onnxruntime
+            runner: docker-cpu
+            image: diffusers/diffusers-onnxruntime-cpu
+            report: onnx_cpu
Comment on lines +23 to +39 (Member Author): This matrix defines the different combinations of frameworks, docker images and runners to test.

Reply (Contributor): Very nice
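
To make the fan-out concrete: Actions spawns one independent job per config entry. A rough Python picture of the expansion (illustrative only: the dicts mirror the matrix above and the printed pytest commands mirror the framework-specific steps further down; nothing here is a real Actions API):

# Illustrative sketch: GitHub Actions turns each `config` entry into its own job,
# running inside the listed docker image with the matching pytest -k filter.
configs = [
    {"framework": "pytorch", "image": "diffusers/diffusers-pytorch-cpu", "report": "torch_cpu"},
    {"framework": "flax", "image": "diffusers/diffusers-flax-cpu", "report": "flax_cpu"},
    {"framework": "onnxruntime", "image": "diffusers/diffusers-onnxruntime-cpu", "report": "onnx_cpu"},
]
k_filters = {"pytorch": "not Flax and not Onnx", "flax": "Flax", "onnxruntime": "Onnx"}

for cfg in configs:
    print(
        f'[{cfg["image"]}] python -m pytest -n 2 --max-worker-restart=0 '
        f'--dist=loadfile -s -v -k "{k_filters[cfg["framework"]]}" '
        f'--make-reports=tests_{cfg["report"]} tests/'
    )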


+    name: ${{ matrix.config.name }}
+
+    runs-on: ${{ matrix.config.runner }}
+
     container:
-      image: python:3.7
+      image: ${{ matrix.config.image }}
       options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
Comment (Contributor): Don't we need --gpus 0 or --gpus all if we want to use the GPU in the docker container? In the transformers CI we specify it.

Reply (Contributor): Oh, this is the PR tests workflow, and it's CPU-only. Sorry for the bother.


+    defaults:
+      run:
+        shell: bash

     steps:
       - name: Checkout diffusers
         uses: actions/checkout@v3
@@ -32,34 +58,56 @@ jobs:

       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
           python -m pip install -e .[quality,test]
           python -m pip install git+https://github.com/huggingface/accelerate
 
       - name: Environment
         run: |
           python utils/print_env.py

-      - name: Run all fast tests on CPU
+      - name: Run fast PyTorch CPU tests
+        if: ${{ matrix.config.framework == 'pytorch' }}
+        env:
+          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
+        run: |
+          python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
+            -s -v -k "not Flax and not Onnx" \
+            --make-reports=tests_${{ matrix.config.report }} \
+            tests/

+      - name: Run fast Flax TPU tests
+        if: ${{ matrix.config.framework == 'flax' }}
+        env:
+          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
+        run: |
+          python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
+            -s -v -k "Flax" \
Comment (Contributor): (nit) I think it's a bit safer/easier to work with environment variables, e.g. RUN_FLAX=True/False, plus a test decorator, but OK for me for now!

Reply (Member Author): Good idea, will add it soon!
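
A minimal sketch of that suggestion, for the record (RUN_FLAX and require_flax_run are hypothetical names, not existing diffusers helpers):

# Hypothetical env-var gate as suggested above; nothing here exists in the repo yet.
import os
import unittest

RUN_FLAX = os.environ.get("RUN_FLAX", "False").lower() in ("true", "1", "yes")

def require_flax_run(test_case):
    """Skip the decorated test unless RUN_FLAX is set to a truthy value."""
    return unittest.skipUnless(RUN_FLAX, "test requires RUN_FLAX=True")(test_case)

@require_flax_run
class FlaxExampleTests(unittest.TestCase):  # illustrative test class
    def test_placeholder(self):
        self.assertTrue(True)

The CI step would then export RUN_FLAX=True instead of relying on -k name filtering.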

+            --make-reports=tests_${{ matrix.config.report }} \
+            tests/

+      - name: Run fast ONNXRuntime CPU tests
+        if: ${{ matrix.config.framework == 'onnxruntime' }}
         env:
           HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
         run: |
-          python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=tests_torch_cpu tests/
+          python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
+            -s -v -k "Onnx" \
+            --make-reports=tests_${{ matrix.config.report }} \
+            tests/

       - name: Failure short reports
         if: ${{ failure() }}
-        run: cat reports/tests_torch_cpu_failures_short.txt
+        run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt

       - name: Test suite reports artifacts
         if: ${{ always() }}
         uses: actions/upload-artifact@v2
         with:
-          name: pr_torch_cpu_test_reports
+          name: pr_${{ matrix.config.report }}_test_reports
           path: reports

-  run_tests_apple_m1:
-    name: MPS tests on Apple M1
+  run_fast_tests_apple_m1:
+    name: Fast PyTorch MPS tests on MacOS
     runs-on: [ self-hosted, apple-m1 ]

     steps:
@@ -91,7 +139,7 @@ jobs:
         run: |
           ${CONDA_RUN} python utils/print_env.py
 
-      - name: Run all fast tests on MPS
+      - name: Run fast PyTorch tests on M1 (MPS)
         shell: arch -arch arm64 bash {0}
         env:
           HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
91 changes: 69 additions & 22 deletions .github/workflows/push_tests.yml
@@ -14,12 +14,38 @@ env:
   RUN_SLOW: yes

 jobs:
-  run_tests_single_gpu:
-    name: Diffusers tests
-    runs-on: [ self-hosted, docker-gpu, single-gpu ]
+  run_slow_tests:
+    strategy:
+      fail-fast: false
+      matrix:
+        config:
+          - name: Slow PyTorch CUDA tests on Ubuntu
+            framework: pytorch
+            runner: docker-gpu
+            image: diffusers/diffusers-pytorch-cuda
+            report: torch_cuda
+          - name: Slow Flax TPU tests on Ubuntu
+            framework: flax
+            runner: docker-tpu
+            image: diffusers/diffusers-flax-tpu
+            report: flax_tpu
+          - name: Slow ONNXRuntime CUDA tests on Ubuntu
+            framework: onnxruntime
+            runner: docker-gpu
+            image: diffusers/diffusers-onnxruntime-cuda
+            report: onnx_cuda
+
+    name: ${{ matrix.config.name }}
+
+    runs-on: ${{ matrix.config.runner }}
+
     container:
-      image: nvcr.io/nvidia/pytorch:22.07-py3
-      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache
+      image: ${{ matrix.config.image }}
+      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ ${{ matrix.config.runner == 'docker-tpu' && '--privileged' || '--gpus 0'}}
+
+    defaults:
+      run:
+        shell: bash

     steps:
       - name: Checkout diffusers
@@ -28,44 +54,68 @@
           fetch-depth: 2

       - name: NVIDIA-SMI
+        if: ${{ matrix.config.runner == 'docker-gpu' }}
         run: |
           nvidia-smi

       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          python -m pip uninstall -y torch torchvision torchtext
-          python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
           python -m pip install -e .[quality,test]
           python -m pip install git+https://github.com/huggingface/accelerate
 
       - name: Environment
         run: |
           python utils/print_env.py

-      - name: Run all (incl. slow) tests on GPU
+      - name: Run slow PyTorch CUDA tests
+        if: ${{ matrix.config.framework == 'pytorch' }}
+        env:
+          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
+        run: |
+          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
+            -s -v -k "not Flax and not Onnx" \
+            --make-reports=tests_${{ matrix.config.report }} \
+            tests/

+      - name: Run slow Flax TPU tests
+        if: ${{ matrix.config.framework == 'flax' }}
         env:
           HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
         run: |
-          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=tests_torch_gpu tests/
+          python -m pytest -n 0 \
Comment (Contributor): Do we use -n 0 to disable xdist for Flax?

Reply (Member Author): Precisely! It looks like jax[tpu] doesn't like being launched with multiprocessing at all: the TPU gets reserved by the parent process and the tests can't get access to it afterwards (jax-ml/jax#10192).
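
For context: -n 0 disables pytest-xdist entirely, so the tests run in the very process that initializes the TPU. A quick sanity check one might run inside the container (assuming only that jax is importable; device types depend on the runner):

# Minimal visibility check: with xdist enabled, the parent process would reserve
# the TPU and the worker processes could no longer attach to it.
import jax

print(jax.devices())       # expect TpuDevice entries on the docker-tpu runner
print(jax.device_count())  # number of visible accelerator cores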

-s -v -k "Flax" \
--make-reports=tests_${{ matrix.config.report }} \
tests/

+      - name: Run slow ONNXRuntime CUDA tests
+        if: ${{ matrix.config.framework == 'onnxruntime' }}
+        env:
+          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
+        run: |
+          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
+            -s -v -k "Onnx" \
+            --make-reports=tests_${{ matrix.config.report }} \
+            tests/

       - name: Failure short reports
         if: ${{ failure() }}
-        run: cat reports/tests_torch_gpu_failures_short.txt
+        run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt

       - name: Test suite reports artifacts
         if: ${{ always() }}
         uses: actions/upload-artifact@v2
         with:
-          name: torch_test_reports
+          name: ${{ matrix.config.report }}_test_reports
           path: reports

-  run_examples_single_gpu:
-    name: Examples tests
-    runs-on: [ self-hosted, docker-gpu, single-gpu ]
+  run_examples_tests:
+    name: Examples PyTorch CUDA tests on Ubuntu
+
+    runs-on: docker-gpu
+
     container:
-      image: nvcr.io/nvidia/pytorch:22.07-py3
-      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache
+      image: diffusers/diffusers-pytorch-cuda
+      options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/

     steps:
       - name: Checkout diffusers
@@ -79,9 +129,6 @@

       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          python -m pip uninstall -y torch torchvision torchtext
-          python -m pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
           python -m pip install -e .[quality,test,training]
           python -m pip install git+https://github.com/huggingface/accelerate

@@ -93,11 +140,11 @@ jobs:
         env:
           HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
         run: |
-          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_gpu examples/
+          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_cuda examples/

       - name: Failure short reports
         if: ${{ failure() }}
-        run: cat reports/examples_torch_gpu_failures_short.txt
+        run: cat reports/examples_torch_cuda_failures_short.txt

       - name: Test suite reports artifacts
         if: ${{ always() }}
42 changes: 42 additions & 0 deletions docker/diffusers-flax-cpu/Dockerfile
@@ -0,0 +1,42 @@
FROM ubuntu:20.04
LABEL maintainer="Hugging Face"
LABEL repository="diffusers"

ENV DEBIAN_FRONTEND=noninteractive

RUN apt update && \
    apt install -y bash \
    build-essential \
    git \
    git-lfs \
    curl \
    ca-certificates \
    python3.8 \
    python3-pip \
    python3.8-venv && \
    rm -rf /var/lib/apt/lists

# make sure to use venv
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
# follow the instructions here: https://cloud.google.com/tpu/docs/run-in-container#train_a_jax_model_in_a_docker_container
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    python3 -m pip install --upgrade --no-cache-dir \
    clu \
    "jax[cpu]>=0.2.16,!=0.3.2" \
    "flax>=0.4.1" \
    "jaxlib>=0.1.65" && \
    python3 -m pip install --no-cache-dir \
    accelerate \
    datasets \
    hf-doc-builder \
    huggingface-hub \
    modelcards \
    numpy \
    scipy \
    tensorboard \
    transformers

CMD ["/bin/bash"]