Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slurm on GKE - Guide #864

Merged

Conversation

danielmarzini
Copy link
Contributor

This guide shows you how to deploy Slurm on a Google Kubernetes Engine (GKE) cluster.

This guide is intended for platform administrators in an enterprise environment who are already managing Kubernetes or GKE clusters, and who need to set up Slurm clusters for AI/ML teams on Kubernetes. This guide is also for AI/ML startups that already use Kubernetes or GKE to run their workloads, such as inference workloads or web apps, and want to use their existing infrastructure to run training workloads with a Slurm interface.

slurm-on-gke/image/Dockerfile Show resolved Hide resolved
slurm-on-gke/.gitignore Outdated Show resolved Hide resolved
slurm-on-gke/.gitignore Outdated Show resolved Hide resolved
slurm-on-gke/README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,196 @@
/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a top-level generic module for infrastructure here https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/infrastructure

Is it possible to reuse that instead of creating another infrastructure module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, the infra part is tied to the described example, I can do it but an a second release. wdyt?

slurm-on-gke/modules/slurm-cluster/main.tf Outdated Show resolved Hide resolved
@danielmarzini
Copy link
Contributor Author

danielmarzini commented Nov 4, 2024

Hello @andrewsykim I implemented the changes with 1 exception still open above, wdyt? Thanks!

@danielmarzini
Copy link
Contributor Author

@andrewsykim as agreed, can you also add @mwysokin as reviewer? Thanks!

@andrewsykim
Copy link
Collaborator

@andrewsykim as agreed, can you also add @mwysokin as reviewer? Thanks!

I'm unable to add him as a reviewer for some reason, but feel free to review anyways

Copy link

@mwysokin mwysokin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM 🖖

I think the module could be improved by making it more flexible when it comes to image versions because most of them are currently hardcoded but it's a nit and a possible room for improvement if people are going to use it.

@andrewsykim
Copy link
Collaborator

/gcbrun

Copy link
Collaborator

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left some comments for follow-up

impersonate_service_account = "YOUR_SERVICE_ACCOUNT_ID" <-- if you are using one
region = "europe-west3"
project_id = "YOUR_PROJECT_ID"
billing_account_id = "YOUR_BILLING_ACCOUNT_ID"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally expect users of ai-on-gke to bring an existing project, would that remove the need to specify a billing account ID and folder ID?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also has implications for how easy it is to use the guide, many people trying the slurm guide may not even know the billing ID and folder ID, but they may have an existing project to work with

# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

FROM ubuntu:22.04
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a distroless base image here in the future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewsykim andrewsykim merged commit ee94bdc into GoogleCloudPlatform:main Nov 18, 2024
5 of 7 checks passed
@andrewsykim
Copy link
Collaborator

I also suggest moving slurm-on-gke/ to tutorials-and-examples/slurm-on-gke. I had forgotten that we decided to consolidate guides to tutorials-and-examples when I initially reviewed this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants