-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slurm on GKE - Guide #864
Slurm on GKE - Guide #864
Conversation
@@ -0,0 +1,196 @@ | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a top-level generic module for infrastructure here https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/infrastructure
Is it possible to reuse that instead of creating another infrastructure module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately, the infra part is tied to the described example, I can do it but an a second release. wdyt?
Hello @andrewsykim I implemented the changes with 1 exception still open above, wdyt? Thanks! |
@andrewsykim as agreed, can you also add @mwysokin as reviewer? Thanks! |
This reverts commit 76703f0.
I'm unable to add him as a reviewer for some reason, but feel free to review anyways |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM 🖖
I think the module could be improved by making it more flexible when it comes to image versions because most of them are currently hardcoded but it's a nit and a possible room for improvement if people are going to use it.
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left some comments for follow-up
impersonate_service_account = "YOUR_SERVICE_ACCOUNT_ID" <-- if you are using one | ||
region = "europe-west3" | ||
project_id = "YOUR_PROJECT_ID" | ||
billing_account_id = "YOUR_BILLING_ACCOUNT_ID" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally expect users of ai-on-gke to bring an existing project, would that remove the need to specify a billing account ID and folder ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also has implications for how easy it is to use the guide, many people trying the slurm guide may not even know the billing ID and folder ID, but they may have an existing project to work with
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
# SOFTWARE. | ||
|
||
FROM ubuntu:22.04 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a distroless base image here in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a list of available images here https://github.com/GoogleContainerTools/distroless?tab=readme-ov-file#what-images-are-available
I also suggest moving |
This guide shows you how to deploy Slurm on a Google Kubernetes Engine (GKE) cluster.
This guide is intended for platform administrators in an enterprise environment who are already managing Kubernetes or GKE clusters, and who need to set up Slurm clusters for AI/ML teams on Kubernetes. This guide is also for AI/ML startups that already use Kubernetes or GKE to run their workloads, such as inference workloads or web apps, and want to use their existing infrastructure to run training workloads with a Slurm interface.