
Add build scripts for building Nvidia and Neuron AMIs based on AL2023 #1924

Merged: 8 commits, Aug 29, 2024

Conversation

nkvetsinski (Contributor)

Issue #, if available:

Description of changes:

Adds build scripts for building Nvidia and Neuron AMIs based on the AL2023 OS.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

Neuron:

  1. Built the AMI.
  2. Started an instance from the AMI.
  3. The instance was able to join the EKS cluster and, after installing the Neuron device plugin, Neuron devices were added as capacity on the node's K8s object.

Nvidia:

  1. Built 2 AMIs using the driver versions currently available in the Nvidia repo (555 and 560).
  2. Started instances from the AMIs.
  3. The instances were able to join the EKS cluster and, after installing the Nvidia device plugin, Nvidia GPUs were added as capacity on the nodes' K8s objects.
  4. Deployed a Pod and executed nvidia-smi from it (see the verification sketch after this list).
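
For reference, a minimal way to spot-check steps 3 and 4 above (node and pod names are placeholders, and the resource names assume the standard Nvidia and Neuron device plugins):

```bash
# Hypothetical node/pod names; adjust to your cluster.
NODE=ip-10-0-0-1.ec2.internal
POD=gpu-test

# Verify the device plugin advertised the accelerators as node capacity.
kubectl describe node "$NODE" | grep -E 'nvidia\.com/gpu|aws\.amazon\.com/neuron'

# Run nvidia-smi from a pod scheduled on the node (step 4).
kubectl exec "$POD" -- nvidia-smi
```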

Nodeadm E2E tests are also passing.

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

Resolved review threads: README.md, templates/al2023/variables-default.json
Commits: Remove multicard config script; Remove CUDA packages; Remove OCI config for Neuron
Resolved review threads: templates/al2023/runtime/gpu/bootstrap_gpu.sh, templates/shared/runtime/bin/gpu-ami-util, templates/al2023/provisioners/install-nvidia-driver.sh, nodeadm/internal/containerd/oci_config.go, templates/al2023/runtime/gpu/bootstrap_gpu.service, README.md, doc/usage/al2023.md
Comment on lines 54 to 64
nvidiaOptions := `
[plugins.'io.containerd.grpc.v1.cri'.containerd]
default_runtime_name = 'nvidia'
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
`
Member

  1. If this is a constant, let's move it out of this func.
  2. If you're having to include keys like SystemdCgroup = true, etc. so they don't get lost during the merge, you should be able to just swap the order of the args you're passing to util.Merge.

Contributor Author

  1. Moved it out of the function and declared it as a constant.
  2. I'm not sure how swapping will help. IIUC, [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] and [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] are not present in config.template.toml, so if we want them to be present in the resulting config, we'll have to pass everything we need as is (SystemdCgroup, BinaryName, etc.).

Member

I see what you mean. That's really not ideal; we don't want to have to keep this blob in sync with what the rest of the config flow does to these nested structs. Does nvidia-ctk clone your existing runtime config and just change the name? How does this work there?

Contributor Author

nvidia-ctk uses /etc/containerd/config.toml as a base and does an in-place change (adding whatever it needs to the base). This is why I was initially writing the config file first, calling nvidia-ctk, and then passing the resulting config back. Let me see what happens if we call it on an empty file and merge that with our template.
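
For context, the in-place flow described above looks roughly like this (a sketch only; the paths and flags are the commonly documented nvidia-ctk ones, not necessarily what nodeadm ends up doing):

```bash
# Render the containerd config from our template first, then let nvidia-ctk
# edit it in place to add the nvidia runtime. Illustrative only.
sudo nvidia-ctk runtime configure \
  --runtime=containerd \
  --config=/etc/containerd/config.toml \
  --set-as-default
```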

Member

I'm wondering about how nvidia-ctk handles options like SystemdCgroup, base_runtime_spec, etc. It must be copying our existing runc runtime config and just plugging in nvidia and BinaryName.

Could you just update our existing containerd config template such that you can swap in nvidia? https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/containerd/config.template.toml

Contributor Author

I think I addressed that, check the latest code.

Resolved review threads: nodeadm/internal/containerd/runtime_config.go, templates/al2023/provisioners/install-efa.sh, templates/al2023/provisioners/configure-nvidia-eula.sh, templates/al2023/template.json, templates/al2023/runtime/gpu/initialize-nvidia-clock.sh, templates/al2023/runtime/gpu/nvidia-kmod-load.service, templates/al2023/runtime/gpu/gpu-ami-util
Co-authored-by: Carter <cartermckinnon@gmail.com>
Resolved review thread: doc/usage/al2023.md
elif [[ $GPUNAME == *"M60"* ]]; then
nvidia-smi -ac 2505,1177
elif [[ $GPUNAME == *"H100"* ]]; then
nvidia-smi -ac 2619,1980
Contributor

if we aren't going to dynamically pull these at boot, we should proactively add all known cards - L40S (g6e) was just launched - B200, GB200, etc.?

Contributor

also, do we need to set nvidia-smi -lgc (--lock-gpu-clocks) xx and nvidia-smi -lmc (--lock-memory-clocks) yy? do we test and validate that the clock speeds are maintained across reboots?

Contributor Author

> if we aren't going to dynamically pull these at boot, we should proactively add all known cards - L40S (g6e) was just launched - B200, GB200, etc.?

I'll go over the existing instances and make sure we have all cards.

> also, do we need to set nvidia-smi -lgc (--lock-gpu-clocks) xx and nvidia-smi -lmc (--lock-memory-clocks) yy? do we test and validate that the clock speeds are maintained across reboots?

IIUC from the nvidia-smi help, the flag we currently set (--applications-clocks) specifies the clocks used while something is running on the GPU, whereas the flags you mention lock the GPU and memory clocks at all times (with or without load). We don't use --lock-gpu-clocks and --lock-memory-clocks for the other GPU AMI that we support, so if you think it's beneficial, we can set them.

I'll check the reboot persistence; because this is executed as a systemd unit, it should persist across reboots.
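
For illustration, the two flag families being discussed (the values are the H100 numbers from the script excerpt above, shown only as examples):

```bash
# Application clocks: apply while a workload is running on the GPU
# (this is what the script currently sets).
sudo nvidia-smi -ac 2619,1980      # --applications-clocks <memMHz>,<graphicsMHz>

# Locked clocks: pin GPU and memory clocks at all times, with or without load
# (the alternative raised above; ultimately left out).
sudo nvidia-smi -lgc 1980          # --lock-gpu-clocks
sudo nvidia-smi -lmc 2619          # --lock-memory-clocks
```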

Contributor Author

Clock settings persist across reboots; I left out the -lgc and -lmc flags.

Contributor

sounds good to me!

@cartermckinnon (Member) left a comment

Some final nits 👍

Resolved review threads: doc/usage/al2023.md
@Issacwww (Member)

please fix the lint, should be easy via make fmt && make lint under the root path

@Issacwww (Member)

/ci
+workflow:os_distros al2023
+build enable_accelerator=nvidia

Contributor

@Issacwww roger that! I've dispatched a workflow. 👍

Contributor

@Issacwww the workflow that you requested has completed. 🎉

| AMI variant | Build | Test |
| --- | --- | --- |
| 1.23 / al2023 | success ✅ | success ✅ |
| 1.24 / al2023 | success ✅ | success ✅ |
| 1.25 / al2023 | success ✅ | success ✅ |
| 1.26 / al2023 | success ✅ | success ✅ |
| 1.27 / al2023 | success ✅ | success ✅ |
| 1.28 / al2023 | success ✅ | success ✅ |
| 1.29 / al2023 | success ✅ | success ✅ |
| 1.30 / al2023 | success ✅ | success ✅ |

@nkvetsinski (Contributor Author)

> please fix the lint, should be easy via make fmt && make lint under the root path

done


# load the nvidia kernel module appropriate for the attached devices
MODULE_NAME=""
if devices-support-open; then
@bryantbiggs (Contributor), Aug 27, 2024

since the older cards that require the proprietary driver are a known quantity, and new cards that will be released in the future are unknown to this script (and therefore the PCI class codes are unknown here) - should we flip this logic to default to using the open GPU kernel module unless we know the proprietary driver is absolutely required?

this may require some changes above as well - but I think the appropriate end goal here is to default to the open GPU kernel module, unless we know for certain the proprietary driver is required

Contributor Author

Let's check with @Issacwww about that; I'm not the original author of this code.

Member

I think this is a fair callout. I will follow up in a subsequent PR, as the current one is big enough :)

Contributor Author

Sounds good, I'm also in favor of changing this in a different PR.

@cartermckinnon (Member), Aug 28, 2024

> new cards that will be released in the future are unknown to this script (and therefore the PCI class codes are unknown here)

The PCI class codes don't change per device; these have been the same for a few decades now. :)

> should we flip this logic to default to using the open GPU kernel module unless we know the proprietary driver is absolutely required

I didn't go this route because it's way messier. Compiling (and maintaining) an exhaustive list of every PCI device ID used by those EC2 instance types is quite a chore -- p3 uses V100s, but there are many revisions and variants of V100 in use. With the current approach, we generate the supported-device file from the driver archive and do a simple lookup.

Long term, yes we want to default to the open source kmod. NVIDIA is going to do that in the 560 driver branch, and we can revisit/refactor accordingly then, IMO.
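
For context, a minimal sketch of the lookup approach described above; the file path, JSON layout, and helper name are assumptions for illustration, not necessarily what this PR's gpu-ami-util does:

```bash
# Build time: extract device IDs flagged as supporting the open kmod from the
# driver archive's supported-gpus.json metadata (assumed layout).
jq -r '.chips[] | select(.features[]? == "kernelopen") | .devid' \
  supported-gpus/supported-gpus.json | sort -u > /etc/nvidia/open-supported-devices.txt

# Boot time: check every attached NVIDIA (vendor 10de) device against the list.
devices_support_open() {
  local id
  for id in $(lspci -n -d 10de: | awk '{print $3}' | cut -d: -f2); do
    grep -qi "0x${id}" /etc/nvidia/open-supported-devices.txt || return 1
  done
  return 0
}
```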


# gpu boost clock
sudo nvidia-smi -pm 1 # set persistence mode
sudo nvidia-smi --auto-boost-default=0
Contributor

in the final else case below, we should not disable auto boost if the card is unknown to this script

if a new instance is launched, we don't want to hamper its performance in any form if we skip updating this script - let it fall back to the default behavior of the card
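
A sketch of the fail-open shape being suggested, reusing the variable and clock values from the excerpts above (abbreviated to two known cards):

```bash
# Only disable auto boost and pin clocks for cards this script knows about;
# unknown cards keep the driver's default clock behavior.
if [[ $GPUNAME == *"H100"* ]]; then
  sudo nvidia-smi --auto-boost-default=0
  sudo nvidia-smi -ac 2619,1980
elif [[ $GPUNAME == *"M60"* ]]; then
  sudo nvidia-smi --auto-boost-default=0
  sudo nvidia-smi -ac 2505,1177
else
  echo "Unrecognized GPU '$GPUNAME'; leaving default boost/clock behavior"
fi
```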

Contributor Author

@bryantbiggs do you mind if we address this in another PR?

@bryantbiggs (Contributor), Aug 27, 2024

absolutely - as stated below, the two added comments here are great candidates for follow-up PRs if we think the changes are worthy. But I think this turkey is cooked as is, so - :shipit:

@bryantbiggs (Contributor)

looks great to me - two comments that can be followed up in later PRs to ensure we fail open to the appropriate path
:shipit:

Labels: enhancement (New feature or request), gpu
5 participants