add support for enabling tcpx/o in a3 and a3mega vm, provide script for injecting rxdm sidecar and other required components into user workload #3012

Merged: 10 commits, Sep 13, 2024

@chengcongdu (Member) commented Sep 6, 2024

This change adds support for enabling TCPX/O on A3 and A3 Mega VMs. It includes:

  • Install the correct versions of the NCCL plugin and NRI plugin daemonsets for A3 and A3 Mega node pools
  • Provide a script for injecting the rxdm sidecar and other required components into a user workload
  • Provide a sample workload, as a Kubernetes Job manifest, to show how GPUDirect works and what performance it brings
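
To make the second item concrete, here is a minimal sketch of what adding a sidecar container to a Job manifest could look like. It is not the script added in this PR: the function name `inject_rxdm_sidecar`, the container name `rxdm-sidecar`, and the image path are hypothetical placeholders, and the other required components mentioned above are not shown.

```python
import copy

# Hypothetical placeholder: the real RxDM image path is not given in this PR description.
RXDM_IMAGE = "example.invalid/rxdm"
# Hypothetical container name used only for this illustration.
SIDECAR_NAME = "rxdm-sidecar"


def inject_rxdm_sidecar(job: dict, rxdm_version: str) -> dict:
    """Return a copy of a Job manifest with a sidecar container appended."""
    patched = copy.deepcopy(job)  # leave the caller's manifest untouched
    pod_spec = patched["spec"]["template"]["spec"]
    containers = pod_spec.setdefault("containers", [])
    # Skip injection if a container with the sidecar name is already present.
    if not any(c.get("name") == SIDECAR_NAME for c in containers):
        containers.append(
            {"name": SIDECAR_NAME, "image": f"{RXDM_IMAGE}:{rxdm_version}"}
        )
    return patched


# A tiny stand-in for a user workload, written as a plain dict.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "sample-job"},
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "workload", "image": "busybox"}]}
        }
    },
}

patched = inject_rxdm_sidecar(job, "v1.0.10")
print([c["name"] for c in patched["spec"]["template"]["spec"]["containers"]])
```

Working on a deep copy keeps the original manifest intact, which matches the behavior described in this PR of writing a new manifest rather than editing the input file.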

SAMPLE OUTPUT of the added enable_tcpxo_in_workload:
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0]: Creating...
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0]: Provisioning with 'local-exec'...
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): Executing: ["/bin/sh" "-c" "python3 modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpxo-in-workload.py --file modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job.yaml --rxdm v1.0.10"]
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): A new manifest has been generated and updated to have TCPXO enabled based on the provided workload
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): It can be found in /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): You can use the following commands to submit the sample job:
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): kubectl create -f /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0]: Creation complete after 0s [id=1932468328196680693]
...
...
Apply complete! Resources: 19 added, 1 changed, 1 destroyed.

Outputs:

instructions_a3-megagpu_pool = <<EOT
Since you are using the a3-megagpu-8g machine type, which has GPUDirect support, your nodepool has been configured with the required plugins.
To use GPUDirect, you will have to add some components to your workload manifest. Details below.

A sample GKE job that has GPUDirect enabled and an NCCL test included has been generated locally at:
/home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml

You can use the following commands to submit the sample job:
kubectl create -f /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml

Follow the steps below to enable GPUDirect for your own workload:
export WORKLOAD_PATH=<>
python3 /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpxo-in-workload.py --file $WORKLOAD_PATH --rxdm v1.0.10
WARNING
The "--rxdm" version is tied to the nccl-tcpx/o-installer that has been deployed to your cluster; changing it to another value might impact performance
WARNING

Or you can also follow our GPUDirect user guide to update your workload
https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-manifests

After the command runs, an updated manifest is generated, and you can deploy it to the cluster using kubectl
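
If several manifests need the same treatment, the invocation above can be assembled programmatically. The sketch below only builds the command line using the `--file` and `--rxdm` flags shown in the output; the helper name `enable_tcpxo_command` and the manifest file names are hypothetical, and the script path is assumed to be relative to the module directory.

```python
import shlex


def enable_tcpxo_command(script: str, workload: str, rxdm: str = "v1.0.10") -> list:
    """Build the argument list for one run of the enable-tcpxo script."""
    return ["python3", script, "--file", workload, "--rxdm", rxdm]


# Hypothetical manifest names, used only for illustration.
for manifest in ["my-job.yaml", "my-other-job.yaml"]:
    cmd = enable_tcpxo_command("scripts/enable-tcpxo-in-workload.py", manifest)
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))

cmd = enable_tcpxo_command("scripts/enable-tcpxo-in-workload.py", "my-job.yaml")
```

Keeping the `--rxdm` default at the version reported by the module output follows the warning above about not changing it.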


SAMPLE OUTPUT of running the Python script directly:
A new manifest has been generated and updated to have TCPXO enabled based on the provided workload
It can be found in /home/user/chdu/sample-tcpxo-workload-job-tcpxo.yaml
You can use the following commands to submit the sample job:
kubectl create -f /home/user/chdu/sample-tcpxo-workload-job-tcpxo.yaml
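
Both sample outputs suggest that the generated manifest is written next to the input file with a `-tcpxo` suffix before the extension. The sketch below reproduces that naming convention as inferred from the output; the helper name `tcpxo_manifest_path` is hypothetical, the input path for the direct run is assumed, and the actual script may derive the name differently.

```python
from pathlib import PurePosixPath


def tcpxo_manifest_path(workload: str) -> PurePosixPath:
    """Insert a '-tcpxo' suffix before the file extension of the workload path."""
    src = PurePosixPath(workload)
    return src.with_name(f"{src.stem}-tcpxo{src.suffix}")


# Assumed input path; only the output path appears in the sample output above.
out = tcpxo_manifest_path("/home/user/chdu/sample-tcpxo-workload-job.yaml")
print(out)
print(f"kubectl create -f {out}")
```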

…or injecting rxdm sidecar and other required components into user workload
@chengcongdu chengcongdu added the release-module-improvements Added to release notes under the "Module Improvements" heading. label Sep 6, 2024
Review threads (all resolved) in modules/compute/gke-node-pool: main.tf (7), variables.tf (2), gpu_direct.tf (3), outputs.tf (3)
@chengcongdu chengcongdu merged commit 0cf64ca into GoogleCloudPlatform:develop Sep 13, 2024
10 of 54 checks passed
@tpdownes tpdownes mentioned this pull request Oct 2, 2024
4 participants