
Add build scripts for building Nvidia and Neuron AMIs based on AL2023 #1924

Merged: 8 commits, Aug 29, 2024

Conversation

nkvetsinski (Contributor)

Issue #, if available:

Description of changes:

Adds build scripts for building Nvidia and Neuron AMIs based on the AL2023 OS.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

Neuron:

  1. Built the AMI.
  2. Started an instance from the AMI.
  3. The instance was able to join the EKS cluster and, after installing the Neuron device plugin, Neuron devices were added as capacity on the node's K8s object.

Nvidia:

  1. Built 2 AMIs using the driver versions currently available in the Nvidia repo (555 and 560).
  2. Started instances from the AMIs.
  3. The instances were able to join the EKS cluster and, after installing the Nvidia device plugin, Nvidia GPUs were added as capacity on the nodes' K8s objects.
  4. Deployed a Pod and executed nvidia-smi from it (see the verification sketch after this list).
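
For reference, a minimal way to spot-check steps 3 and 4 above (node and pod names are placeholders, and the resource names assume the standard Nvidia and Neuron device plugins):

```bash
# Hypothetical node/pod names; adjust to your cluster.
NODE=ip-10-0-0-1.ec2.internal
POD=gpu-test

# Verify the device plugin advertised the accelerators as node capacity.
kubectl describe node "$NODE" | grep -E 'nvidia\.com/gpu|aws\.amazon\.com/neuron'

# Run nvidia-smi from a pod scheduled on the node (step 4).
kubectl exec "$POD" -- nvidia-smi
```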

Nodeadm E2E tests are also passing.

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

Resolved review threads: README.md, templates/al2023/variables-default.json
Commits: Remove multicard config script; Remove CUDA packages; Remove OCI config for Neuron
Resolved review threads: templates/al2023/runtime/gpu/bootstrap_gpu.sh, templates/shared/runtime/bin/gpu-ami-util, templates/al2023/provisioners/install-nvidia-driver.sh, nodeadm/internal/containerd/oci_config.go, templates/al2023/runtime/gpu/bootstrap_gpu.service, README.md, doc/usage/al2023.md
Comment on lines 54 to 64
nvidiaOptions := `
[plugins.'io.containerd.grpc.v1.cri'.containerd]
default_runtime_name = 'nvidia'
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
`
Member

  1. If this is a constant, let's move it out of this func.
  2. If you're having to include keys like SystemdCgroup = true, etc. so they don't get lost during the merge, you should be able to just swap the order of the args you're passing to util.Merge.

Contributor Author

  1. Moved it out of the function and declared it as a constant.
  2. I'm not sure how swapping will help. IIUC, [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] and [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] are not present in config.template.toml, so if we want them to be present in the resulting config, we'll have to pass everything we need as is (SystemdCgroup, BinaryName, etc.).

Member

I see what you mean. That's really not ideal; we don't want to have to keep this blob in sync with what the rest of the config flow does to these nested structs. Does nvidia-ctk clone your existing runtime config and just change the name? How does this work there?

Contributor Author

nvidia-ctk uses /etc/containerd/config.toml as a base and does an in-place change (adding whatever it needs to the base). This is why I was initially writing the config file first, calling nvidia-ctk, and then passing the resulting config back. Let me see what happens if we call it on an empty file and merge that with our template.
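
For context, the in-place flow described above looks roughly like this (a sketch only; the paths and flags are the commonly documented nvidia-ctk ones, not necessarily what nodeadm ends up doing):

```bash
# Render the containerd config from our template first, then let nvidia-ctk
# edit it in place to add the nvidia runtime. Illustrative only.
sudo nvidia-ctk runtime configure \
  --runtime=containerd \
  --config=/etc/containerd/config.toml \
  --set-as-default
```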

Member

I'm wondering about how nvidia-ctk handles options like SystemdCgroup, base_runtime_spec, etc. It must be copying our existing runc runtime config and just plugging in nvidia and BinaryName.

Could you just update our existing containerd config template such that you can swap in nvidia? https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/containerd/config.template.toml

Contributor Author

I think I addressed that, check the latest code.

Resolved review threads: nodeadm/internal/containerd/runtime_config.go, templates/al2023/provisioners/install-efa.sh, templates/al2023/provisioners/configure-nvidia-eula.sh, templates/al2023/template.json, templates/al2023/runtime/gpu/initialize-nvidia-clock.sh, templates/al2023/runtime/gpu/nvidia-kmod-load.service, templates/al2023/runtime/gpu/gpu-ami-util
Co-authored-by: Carter <cartermckinnon@gmail.com>
Resolved review thread: doc/usage/al2023.md
elif [[ $GPUNAME == *"M60"* ]]; then
nvidia-smi -ac 2505,1177
elif [[ $GPUNAME == *"H100"* ]]; then
nvidia-smi -ac 2619,1980
Contributor

if we aren't going to dynamically pull these at boot, we should proactively add all known cards - L40S (g6e) was just launched - B200, GB200, etc.?

Contributor

also, do we need to set nvidia-smi -lgc (--lock-gpu-clocks) xx and nvidia-smi -lmc (--lock-memory-clocks) yy? do we test and validate that the clock speeds are maintained across reboots?

Contributor Author

> if we aren't going to dynamically pull these at boot, we should proactively add all known cards - L40S (g6e) was just launched - B200, GB200, etc.?

I'll go over the existing instances and make sure we have all cards.

> also, do we need to set nvidia-smi -lgc (--lock-gpu-clocks) xx and nvidia-smi -lmc (--lock-memory-clocks) yy? do we test and validate that the clock speeds are maintained across reboots?

IIUC from the nvidia-smi help, the flag we currently set (--applications-clocks) specifies the clocks used while something is running on the GPU, whereas the flags you mention lock the GPU and memory clocks at all times (with or without load). We don't use --lock-gpu-clocks and --lock-memory-clocks for the other GPU AMI that we support, so if you think it's beneficial, we can set them.

I'll check the reboot persistence; because this is executed as a systemd unit, it should persist across reboots.
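
For illustration, the two flag families being discussed (the values are the H100 numbers from the script excerpt above, shown only as examples):

```bash
# Application clocks: apply while a workload is running on the GPU
# (this is what the script currently sets).
sudo nvidia-smi -ac 2619,1980      # --applications-clocks <memMHz>,<graphicsMHz>

# Locked clocks: pin GPU and memory clocks at all times, with or without load
# (the alternative raised above; ultimately left out).
sudo nvidia-smi -lgc 1980          # --lock-gpu-clocks
sudo nvidia-smi -lmc 2619          # --lock-memory-clocks
```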

Contributor Author

Clock settings persist across reboots; I left out the -lgc and -lmc flags.

Contributor

sounds good to me!

@cartermckinnon (Member) left a comment

Some final nits 👍

Resolved review threads: doc/usage/al2023.md
@Issacwww (Member)

please fix the lint, should be easy via make fmt && make lint under the root path

@Issacwww (Member)

/ci
+workflow:os_distros al2023
+build enable_accelerator=nvidia

Contributor

@Issacwww roger that! I've dispatched a workflow. 👍

Contributor

@Issacwww the workflow that you requested has completed. 🎉

| AMI variant | Build | Test |
| --- | --- | --- |
| 1.23 / al2023 | success ✅ | success ✅ |
| 1.24 / al2023 | success ✅ | success ✅ |
| 1.25 / al2023 | success ✅ | success ✅ |
| 1.26 / al2023 | success ✅ | success ✅ |
| 1.27 / al2023 | success ✅ | success ✅ |
| 1.28 / al2023 | success ✅ | success ✅ |
| 1.29 / al2023 | success ✅ | success ✅ |
| 1.30 / al2023 | success ✅ | success ✅ |

@nkvetsinski (Contributor Author)

> please fix the lint, should be easy via make fmt && make lint under the root path

done


# load the nvidia kernel module appropriate for the attached devices
MODULE_NAME=""
if devices-support-open; then
@bryantbiggs (Contributor), Aug 27, 2024

since the older cards that require the proprietary driver are a known quantity, and new cards that will be released in the future are unknown to this script (and therefore the PCI class codes are unknown here) - should we flip this logic to default to using the open GPU kernel module unless we know the proprietary driver is absolutely required?

this may require some changes above as well - but I think the appropriate end goal here is to default to the open GPU kernel module, unless we know for certain the proprietary driver is required

Contributor Author

Let's check with @Issacwww about that; I'm not the original author of this code.

Member

I think this is a fair callout. I will follow up in a subsequent PR, as the current one is big enough :)

Contributor Author

Sounds good, I'm also in favor of changing this in a different PR.

@cartermckinnon (Member), Aug 28, 2024

> new cards that will be released in the future are unknown to this script (and therefore the PCI class codes are unknown here)

The PCI class codes don't change per device; these have been the same for a few decades now. :)

> should we flip this logic to default to using the open GPU kernel module unless we know the proprietary driver is absolutely required

I didn't go this route because it's way messier. Compiling (and maintaining) an exhaustive list of every PCI device ID used by those EC2 instance types is quite a chore -- p3 uses V100s, but there are many revisions and variants of V100 in use. With the current approach, we generate the supported-device file from the driver archive and do a simple lookup.

Long term, yes we want to default to the open source kmod. NVIDIA is going to do that in the 560 driver branch, and we can revisit/refactor accordingly then, IMO.
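
For context, a minimal sketch of the lookup approach described above; the file path, JSON layout, and helper name are assumptions for illustration, not necessarily what this PR's gpu-ami-util does:

```bash
# Build time: extract device IDs flagged as supporting the open kmod from the
# driver archive's supported-gpus.json metadata (assumed layout).
jq -r '.chips[] | select(.features[]? == "kernelopen") | .devid' \
  supported-gpus/supported-gpus.json | sort -u > /etc/nvidia/open-supported-devices.txt

# Boot time: check every attached NVIDIA (vendor 10de) device against the list.
devices_support_open() {
  local id
  for id in $(lspci -n -d 10de: | awk '{print $3}' | cut -d: -f2); do
    grep -qi "0x${id}" /etc/nvidia/open-supported-devices.txt || return 1
  done
  return 0
}
```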


# gpu boost clock
sudo nvidia-smi -pm 1 # set persistence mode
sudo nvidia-smi --auto-boost-default=0
Contributor

in the final else case below, we should not disable auto boost if the card is unknown to this script

if a new instance is launched, we don't want to hamper its performance in any form if we skip updating this script - let it fall back to the default behavior of the card
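
A sketch of the fail-open shape being suggested, reusing the variable and clock values from the excerpts above (abbreviated to two known cards):

```bash
# Only disable auto boost and pin clocks for cards this script knows about;
# unknown cards keep the driver's default clock behavior.
if [[ $GPUNAME == *"H100"* ]]; then
  sudo nvidia-smi --auto-boost-default=0
  sudo nvidia-smi -ac 2619,1980
elif [[ $GPUNAME == *"M60"* ]]; then
  sudo nvidia-smi --auto-boost-default=0
  sudo nvidia-smi -ac 2505,1177
else
  echo "Unrecognized GPU '$GPUNAME'; leaving default boost/clock behavior"
fi
```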

Contributor Author

@bryantbiggs do you mind if we address this in another PR?

@bryantbiggs (Contributor), Aug 27, 2024

absolutely - as stated below, the two added comments here are great candidates for follow-up PRs if we think the changes are worthy. But I think this turkey is cooked as is, so - :shipit:

@bryantbiggs (Contributor)

looks great to me - two comments that can be followed up in later PRs to ensure we fail open to the appropriate path
:shipit:

Labels: enhancement (New feature or request), gpu
5 participants