
Add RL9 cuda build variant #428

Merged: 17 commits from ci/cuda-build into main, Sep 6, 2024
Conversation

sjpb (Collaborator) commented Aug 14, 2024

Adds nvidia-driver and CUDA install to the image build workflow.

  • Now uses the open-source NVIDIA kernel drivers to work around an issue installing CUDA 12.6. See here for compatibility restrictions and background.
  • The nvidia-driver package appears to install the latest kernel (contrary to documentation), so the driver must be installed during the image build, when the kernel is updated, rather than as an additional "extra" build on top of a fat image.
  • Fixes distribution detection during the CUDA install (see the sketch after this list).
  • Fixes CUDA version detection during the CUDA samples tests.
  • Increases the CUDA image size (= builder VM root volume size) to avoid the build running out of disk space.
  • Fixes the GitHub runner running out of disk space during CUDA image scanning.
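
As a rough illustration of the distribution-detection and open-driver points above, the following Ansible sketch shows one way such tasks could look. The variable names, repo handling and package/stream selection are illustrative assumptions, not necessarily what ansible/roles/cuda actually does:

# Illustrative sketch only - not the actual role tasks
- name: Derive CUDA repo distribution from gathered facts
  ansible.builtin.set_fact:
    _cuda_repo_distro: "rhel{{ ansible_distribution_major_version }}"   # e.g. rhel9 on Rocky Linux 9

- name: Add the NVIDIA CUDA repository
  ansible.builtin.get_url:
    url: "https://developer.download.nvidia.com/compute/cuda/repos/{{ _cuda_repo_distro }}/x86_64/cuda-{{ _cuda_repo_distro }}.repo"
    dest: /etc/yum.repos.d/cuda.repo

- name: Enable the open kernel module driver stream
  ansible.builtin.command: dnf -y module enable nvidia-driver:open-dkms
  changed_when: true

- name: Install the NVIDIA driver and CUDA toolkit
  ansible.builtin.dnf:
    name:
      - nvidia-driver
      - cuda-toolkit
    state: present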

sjpb (Collaborator Author) commented Sep 4, 2024

Note to self: try running df -h during the workflow; apparently GH runners have ~29GB free on / (which presumably hosts the workspace dir) and ~10GB free on the temp disk on /mnt.

bertiethorpe (Member):

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   52G   22G  71% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
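
For reference, the kind of workflow change this points at might look like the following GitHub Actions sketch: reporting disk usage and staging the large CUDA image under /mnt rather than the workspace. Step names and paths are illustrative assumptions, not copied from the actual workflow:

# Illustrative sketch only - not the actual workflow steps
- name: Show runner disk usage
  run: df -h

- name: Prepare an image staging area on the runner scratch disk
  run: |
    sudo mkdir -p /mnt/images
    sudo chown "$USER" /mnt/images

- name: Download the built image to /mnt
  run: openstack image save --file "/mnt/images/${IMAGE_NAME}.qcow2" "${IMAGE_NAME}"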

bertiethorpe force-pushed the ci/cuda-build branch 2 times, most recently from 5bd4f30 to 945825d on September 4, 2024 08:44
github-advanced-security commented:

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@sjpb sjpb mentioned this pull request Sep 4, 2024
sjpb (Collaborator Author) commented Sep 4, 2024

@bertiethorpe's build failed for cuda: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/10702249427/job/29670015040

So it looks like unpinning from 12.5 doesn't work. Maybe revert bead06d and just bump the images in main.tf to the ones from the previous build? I can check whether they actually work on a GPU node.
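
For context, re-pinning could be as simple as something like the following in the cuda role defaults; the variable names and version string here are purely illustrative assumptions:

# defaults/main.yml (illustrative sketch only)
cuda_version: "12-5"                       # pin away from the problematic 12.6 packaging
cuda_packages:
  - "cuda-toolkit-{{ cuda_version }}"
  - nvidia-driver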

sjpb (Collaborator Author) commented Sep 4, 2024

Build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/10704463080

COMPLETED: openhpc-cuda-RL9-240904-1509-1687368f - this is/should be cuda 12.6 with open drivers

$ qemu-img info openhpc-cuda-RL9-240904-1509-1687368f
image: openhpc-cuda-RL9-240904-1509-1687368f
file format: qcow2
virtual size: 30 GiB (32212254720 bytes)
disk size: 15.3 GiB

DONE: download to leafcloud
DONE: upload to s3 prerelease
DONE: download to test cloud deploy VM
DONE: upload to test cloud openstack - NB not doing any of the BM node things
DONE: provisioning cluster with this
DONE: configuring cluster
DONE: test cuda:

ok: [stg-a40-02] => (item=Device 1) => {}

MSG:

Device 1: NVIDIA A40
Bandwidths: (Gb/s)
Host to Device: 25.7
Device to Host: 20.3
Device to Device: 540.8
Result: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
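
The "test cuda" step above exercises the CUDA samples; a minimal sketch of such a check, assuming a bandwidthTest binary built from the samples (the path and variable names are illustrative, not the appliance's actual test tasks):

# Illustrative sketch only
- name: Run the bandwidthTest CUDA sample
  ansible.builtin.command: ./bandwidthTest
  args:
    chdir: /var/lib/cuda-samples/bin        # assumed install location
  register: _bandwidth_test
  changed_when: false

- name: Assert the sample reports PASS
  ansible.builtin.assert:
    that: "'Result = PASS' in _bandwidth_test.stdout"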

@sjpb sjpb marked this pull request as ready for review September 5, 2024 16:36
@sjpb sjpb requested a review from a team as a code owner September 5, 2024 16:36
jovial (Collaborator) left a comment

This looks pretty clean to me. Do we not want to support older hardware?

Review threads: packer/openstack.pkr.hcl (outdated, resolved); ansible/roles/cuda/tasks/main.yml (resolved)
jovial (Collaborator) previously approved these changes Sep 5, 2024

Approving assuming that we are fine with dropping support for older cards.

jovial (Collaborator) commented Sep 6, 2024

Reran CI as failure looks transient:

│ Error: Error waiting for instance (a3434bd4-1ba2-41bd-900f-e8af53c57457) to become ready: The server timed out waiting for the request
│ 
│   with module.cluster.module.compute["standard"].openstack_compute_instance_v2.compute["compute-0"],
│   on ../../skeleton/{{cookiecutter.environment}}/terraform/compute/nodes.tf line 21, in resource "openstack_compute_instance_v2" "compute":
│   21: resource "openstack_compute_instance_v2" "compute" {
│ 

Is this running on SMS?

jovial (Collaborator) left a comment

Change to make volumes optional looks good to me. Fingers crossed for CI 🤞

sjpb (Collaborator Author) commented Sep 6, 2024

Is this running on SMS?
@jovial no, leafcloud at present. Thanks for kicking it off again.

@sjpb sjpb merged commit 6ec3a73 into main Sep 6, 2024
1 check passed
@sjpb sjpb deleted the ci/cuda-build branch September 6, 2024 08:52
@sjpb sjpb restored the ci/cuda-build branch September 6, 2024 08:52
MaxBed4d pushed a commit that referenced this pull request Oct 15, 2024
* determine cuda distro automatically

* fix typo in CUDA samples

* make facts available for cuda

* add RL9 cuda build variant

* fix typo in build definitions

* set packer build volume sizes depending on build variant

* fix volume size definition

* fix cuda version to workaround issue with 12-6-0-1

* don't fail all builds if one fails

* bump CUDA builder disk size (build ran out of space)

* download cuda image to /mnt on gh runner

* download cuda image to /mnt on gh runner

* fix fatimage.yml mnt permissions

* Update main.yml

* switch to open nvidia drivers

* bump CI images

* make packer build volume-backed optional again

---------

Co-authored-by: bertiethorpe <bertie443@gmail.com>
Co-authored-by: bertiethorpe <84867280+bertiethorpe@users.noreply.github.com>