Add dynamic setup of gpu_limit in gke-job-template module #3319

Merged


mohitchaurasia91 commented Nov 28, 2024

PR tests triggered for all blueprints that use the gke-node-pool and gke-job-template modules completed with OK status.

Verification

  • Blueprint: gke-a3-megagpu
    • Test job logs from the daily test build on the develop branch: nvidia-smi shows stats for 1 GPU (the default value set in the gke-job-template module).
    • Test job logs from the PR build on the feature branch: nvidia-smi shows stats for 8 GPUs (the limit is set dynamically from the gke-node-pool module output). Users can still override it with the variable requested_gpu_per_pod.
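
For readers who want the selection order spelled out, the behavior described above can be sketched roughly as follows. This is an illustrative Python sketch, not the module's actual Terraform code: the function name `resolve_gpu_limit`, the parameter `node_pool_gpu_count`, and the use of `None` for "not set" are assumptions made for this example. Only `requested_gpu_per_pod` and the default of 1 come from the description above.

```python
from typing import Optional

# Default GPU limit per pod, per the develop-branch behavior described above.
DEFAULT_GPU_LIMIT = 1


def resolve_gpu_limit(
    requested_gpu_per_pod: Optional[int],
    node_pool_gpu_count: Optional[int],
) -> int:
    """Pick the GPU limit for a job pod (hypothetical helper).

    Order assumed here:
      1. an explicit user override (requested_gpu_per_pod),
      2. the GPU count reported by the node pool,
      3. the default of 1.
    """
    if requested_gpu_per_pod is not None:
        return requested_gpu_per_pod
    if node_pool_gpu_count is not None:
        return node_pool_gpu_count
    return DEFAULT_GPU_LIMIT


if __name__ == "__main__":
    # Node pool reports 8 GPUs and no override is given.
    print(resolve_gpu_limit(None, 8))
    # User override takes precedence over the node pool value.
    print(resolve_gpu_limit(2, 8))
    # Nothing known: fall back to the default.
    print(resolve_gpu_limit(None, None))
```

How the real module represents an unset override (for example a sentinel value instead of null) is not stated in this conversation, so treat that detail as a placeholder.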

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

mohitchaurasia91 added the release-module-improvements label (Added to release notes under the "Module Improvements" heading.) Nov 28, 2024
mohitchaurasia91 marked this pull request as ready for review November 29, 2024 12:21
mohitchaurasia91 self-assigned this Nov 29, 2024

annuay-google left a comment

Please add a section or comment on verifying the change. For example, is there a way to see how many GPUs nvidia-smi reports?

@mohitchaurasia91

Please add a section or comment on verifying the change. For example, is there a way to see how many GPUs nvidia-smi reports?

Updated the PR description with logs from test jobs running on the develop and feature branches.


annuay-google left a comment

LGTM

mohitchaurasia91 merged commit 549e252 into GoogleCloudPlatform:develop Dec 3, 2024
18 of 60 checks passed
mohitchaurasia91 deleted the fix_gpu_limit branch December 9, 2024 06:26
nick-stroud mentioned this pull request Dec 19, 2024
