add support for enabling tcpx/o in a3 and a3mega vm, provide script for injecting rxdm sidecar and other required components into user workload #3012
Merged: chengcongdu merged 10 commits into GoogleCloudPlatform:develop from chengcongdu:develop on Sep 13, 2024
…or injecting rxdm sidecar and other required components into user workload
chengcongdu added the release-module-improvements label (Added to release notes under the "Module Improvements" heading.) on Sep 6, 2024
sharabiani reviewed on Sep 9, 2024
ankitkinra reviewed on Sep 9, 2024
ankitkinra reviewed on Sep 9, 2024
nick-stroud reviewed on Sep 10, 2024
ankitkinra reviewed on Sep 11, 2024
modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpx-in-workload.py (5 resolved review threads, 2 of them outdated)
modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpxo-in-workload.py (1 resolved review thread, outdated)
ankitkinra reviewed on Sep 12, 2024
modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpx-in-workload.py (3 resolved review threads)
modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpxo-in-workload.py (1 resolved review thread)
nick-stroud approved these changes on Sep 12, 2024
ankitkinra approved these changes on Sep 12, 2024
ankitkinra approved these changes on Sep 13, 2024
chengcongdu merged commit 0cf64ca into GoogleCloudPlatform:develop on Sep 13, 2024
10 of 54 checks passed
This change adds support for enabling tcpx/o on a3 and a3mega VMs. It includes:
SAMPLE OUTPUT of the added enable_tcpxo_in_workload:
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0]: Creating...
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0]: Provisioning with 'local-exec'...
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): Executing: ["/bin/sh" "-c" "python3 modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpxo-in-workload.py --file modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job.yaml --rxdm v1.0.10"]
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): A new manifest has been generated and updated to have TCPXO enabled based on the provided workload
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): It can be found in /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): You can use the following commands to submit the sample job:
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0] (local-exec): kubectl create -f /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml
module.a3-megagpu_pool.null_resource.enable_tcpxo_in_workload[0]: Creation complete after 0s [id=1932468328196680693]
...
...
Apply complete! Resources: 19 added, 1 changed, 1 destroyed.
Outputs:
instructions_a3-megagpu_pool = <<EOT
Since you are using the a3-megagpu-8g machine type, which has GPUDirect support, your node pool has been configured with the required plugins.
To use GPUDirect you will have to add some components to your workload manifest. Details below.
A sample GKE job that has GPUDirect enabled and an NCCL test included has been generated locally at:
/home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml
You can use the following commands to submit the sample job:
kubectl create -f /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/sample-tcpxo-workload-job-tcpxo.yaml
Follow the steps below to enable GPUDirect for your own workload:
export WORKLOAD_PATH=<>
python3 /home/user/fork/hpc-toolkit/gke-a3-mega/primary/modules/embedded/modules/compute/gke-node-pool/gpu-direct-workload/scripts/enable-tcpxo-in-workload.py --file $WORKLOAD_PATH --rxdm v1.0.10
WARNING
The "--rxdm" version is tied to the nccl-tcpx/o-installer that has been deployed to your cluster; changing it to another value might impact performance.
WARNING
Alternatively, you can follow our GPUDirect user guide to update your workload:
https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-manifests
After the command runs, an updated manifest will be generated, and you can deploy it to the cluster using kubectl.
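The manual step in the output above can also be scripted. The following is a minimal sketch, not part of this PR, that assembles the same command line. Only the `--file` and `--rxdm` flags and the `v1.0.10` value are taken from the output above; the helper name `build_enable_tcpxo_command` and the example paths are hypothetical.

```python
import shlex
import sys
from pathlib import Path


def build_enable_tcpxo_command(script: Path, workload: Path, rxdm: str = "v1.0.10") -> list[str]:
    """Return the argv list for running enable-tcpxo-in-workload.py.

    Hypothetical helper: only --file and --rxdm appear in the sample output.
    """
    return [sys.executable, str(script), "--file", str(workload), "--rxdm", rxdm]


if __name__ == "__main__":
    cmd = build_enable_tcpxo_command(
        Path("scripts/enable-tcpxo-in-workload.py"),  # assumed script location
        Path("my-workload.yaml"),  # assumed workload manifest
    )
    # Print the command rather than running it; to execute it,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a shell string avoids quoting problems when the workload path contains spaces.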
SAMPLE OUTPUT of running python script directly:
A new manifest has been generated and updated to have TCPXO enabled based on the provided workload
It can be found in /home/user/chdu/sample-tcpxo-workload-job-tcpxo.yaml
You can use the following commands to submit the sample job:
kubectl create -f /home/user/chdu/sample-tcpxo-workload-job-tcpxo.yaml
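To illustrate what "injecting a sidecar" into a workload means, here is a rough sketch that works on an already-parsed Job manifest (a plain dict). It is not the logic of the script in this PR, which is not shown in this conversation. The container name `rxdm-sidecar` and the image reference are placeholders, and the real script adds other required components as well. The only behavior taken from the output above is the file naming: `sample-tcpxo-workload-job.yaml` becomes `sample-tcpxo-workload-job-tcpxo.yaml`.

```python
import copy
from pathlib import Path


def add_rxdm_sidecar(job: dict, rxdm_version: str) -> dict:
    """Return a copy of a Kubernetes Job manifest with a sidecar container appended."""
    patched = copy.deepcopy(job)  # leave the caller's manifest untouched
    pod_spec = patched["spec"]["template"]["spec"]
    sidecar = {
        "name": "rxdm-sidecar",  # hypothetical container name
        "image": f"example.invalid/rxdm:{rxdm_version}",  # placeholder image reference
    }
    containers = pod_spec.setdefault("containers", [])
    # Skip the injection if a container with this name already exists,
    # so that running the function twice gives the same result.
    if all(c.get("name") != sidecar["name"] for c in containers):
        containers.append(sidecar)
    return patched


def output_path(workload: Path, suffix: str = "-tcpxo") -> Path:
    """Mirror the naming seen in the sample output: job.yaml -> job-tcpxo.yaml."""
    return workload.with_name(f"{workload.stem}{suffix}{workload.suffix}")


if __name__ == "__main__":
    job = {"kind": "Job", "spec": {"template": {"spec": {"containers": [{"name": "nccl-test"}]}}}}
    patched = add_rxdm_sidecar(job, "v1.0.10")
    print([c["name"] for c in patched["spec"]["template"]["spec"]["containers"]])
    print(output_path(Path("sample-tcpxo-workload-job.yaml")))
```

Writing the result to a new file instead of overwriting the input matches the behavior shown in the sample output, where the original manifest is left in place.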