Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move ml-platform to GoogleCloudPlatform/accelerated-platforms #828

Merged
merged 1 commit into from
Sep 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitmodules
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
[submodule "best-practices/accelerated-platforms"]
path = best-practices/accelerated-platforms
url = https://github.com/GoogleCloudPlatform/accelerated-platforms.git
13 changes: 13 additions & 0 deletions best-practices/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Best Practices

## [AI/ML Platform for enabling AI/ML Ops on GKE Reference Architecture](best-practices/ml-platform)

Construct an Artificial Intelligence/Machine Learning (AI/ML) platform that streamlines AI/ML Operations (AIMLOps), this reference architecture utilizes Google Kubernetes Engine (GKE) as the underlying runtime environment. Additionally, it incorporates a collection of diverse use cases that illustrate practical workflows closely aligned with AI/ML operations.

## [Batch Processing Platform on GKE Reference Architecture](best-practices/gke-batch-refarch)

This reference architecture is designed to assist platform administrators, cloud architects, and operations professionals in deploying a batch processing platform on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) (GKE). Utilizing GKE [Standard](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#nodes) as its foundation, this architecture leverages [Kueue](https://kueue.sigs.k8s.io/) to manage resource quotas and borrowing rules between multiple tenant teams sharing the cluster. This enables these teams to run their batch workloads in a fair, cost-efficient, and high-performance manner. Key recommendations for effectively running batch workloads on GKE, as outlined in [Best practices for running batch workloads on GKE](https://cloud.google.com/kubernetes-engine/docs/best-practices/batch-platform-on-gke) are incorporated into this reference architecture.

## [Best Practices for Faster Workload Cold Start](best-practices/startup-latency.md)

To enhance cold start performance of workloads on Google Kubernetes Engine (GKE), this document provides best practices and examines the elements that influence startup latency.
1 change: 1 addition & 0 deletions best-practices/accelerated-platforms
Submodule accelerated-platforms added at 3b57e5
7 changes: 0 additions & 7 deletions best-practices/ml-platform/.gitignore

This file was deleted.

37 changes: 0 additions & 37 deletions best-practices/ml-platform/CONTRIBUTING.md

This file was deleted.

86 changes: 5 additions & 81 deletions best-practices/ml-platform/README.md
Original file line number Diff line number Diff line change
@@ -1,82 +1,6 @@
# Machine learning platform (MLP) on GKE reference architecture for enabling Machine Learning Operations (MLOps)
# Moved to the [GoogleCloudPlatform/accelerated-platforms](https://github.com/GoogleCloudPlatform/accelerated-platforms/blob/main/docs/platforms/gke-aiml/README.md) repository which a submodule [here](best-practices/accelerated-platforms)

## Platform Principles

This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles:

- The platform admin will create the GKE platform using IaC tool like [Terraform][terraform]. The IaC will come with re-usable modules that can be referred to create more resources as the demand grows.
- The platform will be based on [GitOps][gitops].
- After the GKE platform has been created, cluster scoped resources on it will be created through [Config Sync][config-sync] by the admins.
- Platform admins will create a namespace per application and provide the application team member full access to it.
- The namespace scoped resources will be created by the Application/ML teams either via [Config Sync][config-sync] or through a deployment tool like [Cloud Deploy][cloud-deploy]

For an outline of products and features used in the platform, see the [Platform Products and Features](/best-practices/ml-platform/docs/platform/products-and-features.md) document.

## Critical User Journeys (CUJs)

### Persona : Platform Admin

- Offer a platform that incorporates established best practices.
- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads.
- Establish secure channels for end users to interact seamlessly with the platform.
- Empower the enforcement of robust security policies across the platform.

### Persona : Machine Learning Engineer

- Deploy the model with ease and make the endpoints available only to the intended audience
- Continuously monitor the model performance and resource utilization
- Troubleshoot any performance or integration issues
- Ability to version, store and access the models and model artifacts:
- To debug & troubleshoot in production and track back to the specific model version & associated training data
- To quick & controlled rollback to a previous, more stable version
- Implement the feedback loop to adapt to changing data & business needs:
- Ability to retrain / fine-tune the model.
- Ability to split the traffic between models (A/B testing)
- Switching between the models without breaking inference system for the end-users
- Ability to scaling up/down the infra to accommodate changing needs
- Ability to share the insights and findings with stakeholders to take data-driven decisions

### Persona : Machine Learning Operator

- Provide and maintain software required by the end users of the platform.
- Operationalize experimental workload by providing guidance and best practices for running the workload on the platform.
- Deploy the workloads on the platform.
- Assist with enabling observability and monitoring for the workloads to ensure smooth operations.

## Prerequisites

- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com) which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial.
- Familiarity with following
- [Google Kubernetes Engine][gke]
- [Terraform][terraform]
- [git][git]
- [Google Configuration Management root-sync][root-sync]
- [Google Configuration Management repo-sync][repo-sync]
- [GitHub][github]

## Deploy the platform

[Playground Reference Architecture](examples/platform/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.

## Use cases

- [Distributed Data Processing with Ray](examples/use-case/data-processing/ray/README.md): Run a distributed data processing job using Ray.
- [Dataset Prepartion for Fine Tuning Gemma IT With Gemini Flash](examples/use-case/data-preparation/gemma-it/README.md): Generate prompts for fine tuning Gemma Instruction Tuned model with Vertex AI Gemini Flash
- [Fine Tuning Gemma2 9B IT model With FSDP](examples/use-case/fine-tuning/pytorch/README.md): Fine tune Gemma2 9B IT model with PyTorch FSDP

## Resources

- [Packaging Jupyter notebooks](docs/notebook/packaging.md): Patterns and tools to get your ipynb's ready for deployment in a container runtime.

[gitops]: https://about.gitlab.com/topics/gitops/
[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview
[cloud-deploy]: https://cloud.google.com/deploy?hl=en
[terraform]: https://www.terraform.io/
[gke]: https://cloud.google.com/kubernetes-engine?hl=en
[git]: https://git-scm.com/
[github]: https://github.com/
[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects
[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts
```
git clone --recurse-submodules https://github.com/GoogleCloudPlatform/ai-on-gke.git
cd best-practices/accelerated-platforms
```
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.

This file was deleted.

This file was deleted.

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
93 changes: 0 additions & 93 deletions best-practices/ml-platform/docs/notebook/packaging.md

This file was deleted.

This file was deleted.

Loading
Loading