Add RL9 cuda build variant #428
Conversation
Note to self: try running `df -h` during the workflow; apparently GH runners have ~29GB free on `/` (which presumably hosts the workspace dir) and ~10GB free on the temp disk at `/mnt`.
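The disk-space check above can be sketched as a diagnostic step you might drop into a workflow (the fallback message is mine, not from the PR):

```shell
# Print free space on the root filesystem (which hosts the workspace dir)
# and on the temp disk at /mnt, when present.
df -h /
df -h /mnt 2>/dev/null || echo "/mnt not present on this machine"
```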
@bertiethorpe's build failed for cuda: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/10702249427/job/29670015040 So it looks like unpinning from 12.5 doesn't work. Maybe revert bead06d and just bump the images in main.tf to the ones from the previous build? I can check whether they actually work on a GPU node.
Build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/10704463080

COMPLETED: openhpc-cuda-RL9-240904-1509-1687368f - this is/should be CUDA 12.6 with open drivers

DONE: download to leafcloud
This looks pretty clean to me. Do we not want to support older hardware?
Approving assuming that we are fine with dropping support for older cards.
Reran CI as failure looks transient:
Is this running on SMS?
Change to make volumes optional looks good to me. Fingers crossed for CI 🤞
* determine cuda distro automatically
* fix typo in CUDA samples
* make facts available for cuda
* add RL9 cuda build variant
* fix typo in build definitions
* set packer build volume sizes depending on build variant
* fix volume size definition
* fix cuda version to workaround issue with 12-6-0-1
* don't fail all builds if one fails
* bump CUDA builder disk size (build ran out of space)
* download cuda image to /mnt on gh runner
* download cuda image to /mnt on gh runner
* fix fatimage.yml mnt permissions
* Update main.yml
* switch to open nvidia drivers
* bump CI images
* make packer build volume-backed optional again

---------

Co-authored-by: bertiethorpe <bertie443@gmail.com>
Co-authored-by: bertiethorpe <84867280+bertiethorpe@users.noreply.github.com>
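The "download cuda image to /mnt on gh runner" commits above work around the runner's small root disk. A minimal sketch of staging large downloads on the bigger temp disk (the directory name and the fallback are illustrative, not from the PR):

```shell
# Stage large image downloads on the runner's /mnt temp disk;
# fall back to a temp directory when /mnt is absent or unwritable.
DOWNLOAD_DIR=/mnt/images
mkdir -p "$DOWNLOAD_DIR" 2>/dev/null || DOWNLOAD_DIR=$(mktemp -d)
export TMPDIR="$DOWNLOAD_DIR"   # many download tools honour TMPDIR
echo "staging downloads in $DOWNLOAD_DIR"
```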
Adds nvidia-driver and CUDA install to the image build workflow.

The `nvidia-driver` package appears to install the latest kernel (contrary to documentation), so this driver must be installed during the image build, when the kernel is updated, rather than as an additional "extra" build on top of a fat image.
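A hedged sanity check for the kernel-mismatch issue described above (the DKMS path and the `nvidia` module name are assumptions; on a machine without the driver it just reports the running kernel):

```shell
# If an nvidia DKMS module tree exists, list the kernel versions it was
# built against, so a mismatch with the running kernel is easy to spot.
running=$(uname -r)
echo "running kernel: $running"
if [ -d /var/lib/dkms/nvidia ]; then
  ls /var/lib/dkms/nvidia
fi
```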