
Add RL9 cuda build variant #428

Merged: 17 commits from ci/cuda-build into main, Sep 6, 2024
Conversation

sjpb (Collaborator) commented Aug 14, 2024

Adds nvidia-driver and CUDA install to the image build workflow.

  • Now uses the open-source NVIDIA kernel drivers to work around an issue installing CUDA 12.6. See here for compatibility restrictions and background.
  • The nvidia-driver package appears to install the latest kernel (contrary to documentation), so the driver must be installed during the image build, when the kernel is updated, rather than as an additional "extra" build on top of a fat image.
  • Fixes distribution detection during the CUDA install (see the sketch after this list).
  • Fixes CUDA version detection during the CUDA samples tests.
  • Increases the CUDA image size (= builder VM root volume size) to avoid the build running out of disk space.
  • Fixes the GitHub runner running out of disk space during CUDA image scanning.
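
As a rough illustration of the distribution-detection and open-driver points above, the following Ansible sketch shows one way such tasks could look. The variable names, repo handling and package/stream selection are illustrative assumptions, not necessarily what ansible/roles/cuda actually does:

# Illustrative sketch only - not the actual role tasks
- name: Derive CUDA repo distribution from gathered facts
  ansible.builtin.set_fact:
    _cuda_repo_distro: "rhel{{ ansible_distribution_major_version }}"   # e.g. rhel9 on Rocky Linux 9

- name: Add the NVIDIA CUDA repository
  ansible.builtin.get_url:
    url: "https://developer.download.nvidia.com/compute/cuda/repos/{{ _cuda_repo_distro }}/x86_64/cuda-{{ _cuda_repo_distro }}.repo"
    dest: /etc/yum.repos.d/cuda.repo

- name: Enable the open kernel module driver stream
  ansible.builtin.command: dnf -y module enable nvidia-driver:open-dkms
  changed_when: true

- name: Install the NVIDIA driver and CUDA toolkit
  ansible.builtin.dnf:
    name:
      - nvidia-driver
      - cuda-toolkit
    state: present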

sjpb (Collaborator Author) commented Sep 4, 2024

Note to self: try running df -h during the workflow; apparently GH runners have ~29GB free on / (which presumably hosts the workspace dir) and ~10GB free on the temp disk on /mnt.

bertiethorpe (Member):

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   52G   22G  71% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
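
For reference, the kind of workflow change this points at might look like the following GitHub Actions sketch: reporting disk usage and staging the large CUDA image under /mnt rather than the workspace. Step names and paths are illustrative assumptions, not copied from the actual workflow:

# Illustrative sketch only - not the actual workflow steps
- name: Show runner disk usage
  run: df -h

- name: Prepare an image staging area on the runner scratch disk
  run: |
    sudo mkdir -p /mnt/images
    sudo chown "$USER" /mnt/images

- name: Download the built image to /mnt
  run: openstack image save --file "/mnt/images/${IMAGE_NAME}.qcow2" "${IMAGE_NAME}"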

bertiethorpe force-pushed the ci/cuda-build branch 2 times, most recently from 5bd4f30 to 945825d on September 4, 2024 08:44
github-advanced-security commented:

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@sjpb sjpb mentioned this pull request Sep 4, 2024
sjpb (Collaborator Author) commented Sep 4, 2024

@bertiethorpe's build failed for cuda: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/10702249427/job/29670015040

So it looks like unpinning from 12.5 doesn't work. Maybe revert bead06d and just bump the images in main.tf to the ones from the previous build? I can check whether they actually work on a GPU node.
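
For context, re-pinning could be as simple as something like the following in the cuda role defaults; the variable names and version string here are purely illustrative assumptions:

# defaults/main.yml (illustrative sketch only)
cuda_version: "12-5"                       # pin away from the problematic 12.6 packaging
cuda_packages:
  - "cuda-toolkit-{{ cuda_version }}"
  - nvidia-driver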

sjpb (Collaborator Author) commented Sep 4, 2024

Build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/10704463080

COMPLETED: openhpc-cuda-RL9-240904-1509-1687368f - this is/should be cuda 12.6 with open drivers

$ qemu-img info openhpc-cuda-RL9-240904-1509-1687368f
image: openhpc-cuda-RL9-240904-1509-1687368f
file format: qcow2
virtual size: 30 GiB (32212254720 bytes)
disk size: 15.3 GiB

DONE: download to leafcloud
DONE: upload to s3 prerelease
DONE: download to test cloud deploy VM
DONE: upload to test cloud openstack - NB not doing any of the BM node things
DONE: provisioning cluster with this
DONE: configuring cluster
DONE: test cuda:

ok: [stg-a40-02] => (item=Device 1) => {}

MSG:

Device 1: NVIDIA A40
Bandwidths: (Gb/s)
Host to Device: 25.7
Device to Host: 20.3
Device to Device: 540.8
Result: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
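
The "test cuda" step above exercises the CUDA samples; a minimal sketch of such a check, assuming a bandwidthTest binary built from the samples (the path and variable names are illustrative, not the appliance's actual test tasks):

# Illustrative sketch only
- name: Run the bandwidthTest CUDA sample
  ansible.builtin.command: ./bandwidthTest
  args:
    chdir: /var/lib/cuda-samples/bin        # assumed install location
  register: _bandwidth_test
  changed_when: false

- name: Assert the sample reports PASS
  ansible.builtin.assert:
    that: "'Result = PASS' in _bandwidth_test.stdout"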

@sjpb sjpb marked this pull request as ready for review September 5, 2024 16:36
@sjpb sjpb requested a review from a team as a code owner September 5, 2024 16:36
jovial (Collaborator) left a comment

This looks pretty clean to me. Do we not want to support older hardware?

Review threads: packer/openstack.pkr.hcl (outdated, resolved); ansible/roles/cuda/tasks/main.yml (resolved)
jovial (Collaborator) previously approved these changes Sep 5, 2024

Approving assuming that we are fine with dropping support for older cards.

jovial (Collaborator) commented Sep 6, 2024

Reran CI as failure looks transient:

│ Error: Error waiting for instance (a3434bd4-1ba2-41bd-900f-e8af53c57457) to become ready: The server timed out waiting for the request
│ 
│   with module.cluster.module.compute["standard"].openstack_compute_instance_v2.compute["compute-0"],
│   on ../../skeleton/{{cookiecutter.environment}}/terraform/compute/nodes.tf line 21, in resource "openstack_compute_instance_v2" "compute":
│   21: resource "openstack_compute_instance_v2" "compute" {
│ 

Is this running on SMS?

jovial (Collaborator) left a comment

Change to make volumes optional looks good to me. Fingers crossed for CI 🤞

sjpb (Collaborator Author) commented Sep 6, 2024

Is this running on SMS?
@jovial no, leafcloud at present. Thanks for kicking it off again.

@sjpb sjpb merged commit 6ec3a73 into main Sep 6, 2024
1 check passed
@sjpb sjpb deleted the ci/cuda-build branch September 6, 2024 08:52
@sjpb sjpb restored the ci/cuda-build branch September 6, 2024 08:52
MaxBed4d pushed a commit that referenced this pull request Oct 15, 2024
* determine cuda distro automatically

* fix typo in CUDA samples

* make facts available for cuda

* add RL9 cuda build variant

* fix typo in build definitions

* set packer build volume sizes depending on build variant

* fix volume size definition

* fix cuda version to workaround issue with 12-6-0-1

* don't fail all builds if one fails

* bump CUDA builder disk size (build ran out of space)

* download cuda image to /mnt on gh runner

* download cuda image to /mnt on gh runner

* fix fatimage.yml mnt permissions

* Update main.yml

* switch to open nvidia drivers

* bump CI images

* make packer build volume-backed optional again

---------

Co-authored-by: bertiethorpe <bertie443@gmail.com>
Co-authored-by: bertiethorpe <84867280+bertiethorpe@users.noreply.github.com>