Commit
Move ml-platform to GoogleCloudPlatform/accelerated-platforms (#828)
Showing 259 changed files with 22 additions and 182,411 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
[submodule "best-practices/accelerated-platforms"] | ||
path = best-practices/accelerated-platforms | ||
url = https://github.com/GoogleCloudPlatform/accelerated-platforms.git |
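The three lines above register the submodule's name, checkout path, and remote URL. As a minimal sketch (assuming `git` is installed; the scratch directory and the variable names `workdir`, `name`, `sub_path`, and `sub_url` are illustrative and not part of the commit), the same entries can be read back with `git config -f`:

```shell
# Recreate the .gitmodules entry in a scratch directory so no real checkout is touched.
workdir="$(mktemp -d)"
cat > "$workdir/.gitmodules" <<'EOF'
[submodule "best-practices/accelerated-platforms"]
  path = best-practices/accelerated-platforms
  url = https://github.com/GoogleCloudPlatform/accelerated-platforms.git
EOF

# The submodule name is the quoted subsection in the section header.
name="best-practices/accelerated-platforms"

# git config -f reads an arbitrary config-format file; it does not need a repository.
sub_path="$(git config -f "$workdir/.gitmodules" --get "submodule.$name.path")"
sub_url="$(git config -f "$workdir/.gitmodules" --get "submodule.$name.url")"

echo "path: $sub_path"
echo "url:  $sub_url"
```

In a real clone of the repository, the same `git config -f .gitmodules --get ...` calls can be run from the repository root.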
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Best Practices | ||
|
||
## [AI/ML Platform for enabling AI/ML Ops on GKE Reference Architecture](best-practices/ml-platform) | ||
|
||
This reference architecture shows how to construct an Artificial Intelligence/Machine Learning (AI/ML) platform that streamlines AI/ML Operations (AIMLOps), using Google Kubernetes Engine (GKE) as the underlying runtime environment. It also includes a collection of use cases that illustrate practical workflows closely aligned with AI/ML operations. | ||
|
||
## [Batch Processing Platform on GKE Reference Architecture](best-practices/gke-batch-refarch) | ||
|
||
This reference architecture is designed to assist platform administrators, cloud architects, and operations professionals in deploying a batch processing platform on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) (GKE). Built on GKE [Standard](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#nodes), this architecture uses [Kueue](https://kueue.sigs.k8s.io/) to manage resource quotas and borrowing rules between multiple tenant teams sharing the cluster. This enables these teams to run their batch workloads in a fair, cost-efficient, and high-performance manner. Key recommendations for effectively running batch workloads on GKE, as outlined in [Best practices for running batch workloads on GKE](https://cloud.google.com/kubernetes-engine/docs/best-practices/batch-platform-on-gke), are incorporated into this reference architecture. | ||
|
||
## [Best Practices for Faster Workload Cold Start](best-practices/startup-latency.md) | ||
|
||
To enhance cold start performance of workloads on Google Kubernetes Engine (GKE), this document provides best practices and examines the elements that influence startup latency. |
Submodule accelerated-platforms added at 3b57e5
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,82 +1,6 @@ | ||
# Machine learning platform (MLP) on GKE reference architecture for enabling Machine Learning Operations (MLOps) | ||
# Moved to the [GoogleCloudPlatform/accelerated-platforms](https://github.com/GoogleCloudPlatform/accelerated-platforms/blob/main/docs/platforms/gke-aiml/README.md) repository, which is a submodule [here](best-practices/accelerated-platforms) | ||
|
||
## Platform Principles | ||
|
||
This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles: | ||
|
||
- The platform admin will create the GKE platform using an IaC tool like [Terraform][terraform]. The IaC will come with reusable modules that can be used to create more resources as demand grows. | ||
- The platform will be based on [GitOps][gitops]. | ||
- After the GKE platform has been created, the admins will create cluster-scoped resources on it through [Config Sync][config-sync]. | ||
- Platform admins will create a namespace per application and give the application team members full access to it. | ||
- The namespace-scoped resources will be created by the Application/ML teams either via [Config Sync][config-sync] or through a deployment tool like [Cloud Deploy][cloud-deploy]. | ||
|
||
For an outline of products and features used in the platform, see the [Platform Products and Features](/best-practices/ml-platform/docs/platform/products-and-features.md) document. | ||
|
||
## Critical User Journeys (CUJs) | ||
|
||
### Persona : Platform Admin | ||
|
||
- Offer a platform that incorporates established best practices. | ||
- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads. | ||
- Establish secure channels for end users to interact seamlessly with the platform. | ||
- Empower the enforcement of robust security policies across the platform. | ||
|
||
### Persona : Machine Learning Engineer | ||
|
||
- Deploy the model with ease and make the endpoints available only to the intended audience | ||
- Continuously monitor the model performance and resource utilization | ||
- Troubleshoot any performance or integration issues | ||
- Ability to version, store, and access the models and model artifacts: | ||
  - To debug and troubleshoot in production and trace back to the specific model version and associated training data | ||
  - To perform a quick, controlled rollback to a previous, more stable version | ||
- Implement the feedback loop to adapt to changing data and business needs: | ||
  - Ability to retrain or fine-tune the model | ||
  - Ability to split the traffic between models (A/B testing) | ||
  - Switching between the models without breaking the inference system for end users | ||
- Ability to scale the infrastructure up or down to accommodate changing needs | ||
- Ability to share insights and findings with stakeholders to support data-driven decisions | ||
|
||
### Persona : Machine Learning Operator | ||
|
||
- Provide and maintain software required by the end users of the platform. | ||
- Operationalize experimental workload by providing guidance and best practices for running the workload on the platform. | ||
- Deploy the workloads on the platform. | ||
- Assist with enabling observability and monitoring for the workloads to ensure smooth operations. | ||
|
||
## Prerequisites | ||
|
||
- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com), which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and the other tools required to complete this tutorial. | ||
- Familiarity with the following: | ||
  - [Google Kubernetes Engine][gke] | ||
  - [Terraform][terraform] | ||
  - [git][git] | ||
  - [Google Configuration Management root-sync][root-sync] | ||
  - [Google Configuration Management repo-sync][repo-sync] | ||
  - [GitHub][github] | ||
|
||
## Deploy the platform | ||
|
||
[Playground Reference Architecture](examples/platform/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts. | ||
|
||
## Use cases | ||
|
||
- [Distributed Data Processing with Ray](examples/use-case/data-processing/ray/README.md): Run a distributed data processing job using Ray. | ||
- [Dataset Preparation for Fine Tuning Gemma IT With Gemini Flash](examples/use-case/data-preparation/gemma-it/README.md): Generate prompts for fine tuning the Gemma Instruction Tuned model with Vertex AI Gemini Flash | ||
- [Fine Tuning Gemma2 9B IT model With FSDP](examples/use-case/fine-tuning/pytorch/README.md): Fine tune Gemma2 9B IT model with PyTorch FSDP | ||
|
||
## Resources | ||
|
||
- [Packaging Jupyter notebooks](docs/notebook/packaging.md): Patterns and tools to get your ipynb's ready for deployment in a container runtime. | ||
|
||
[gitops]: https://about.gitlab.com/topics/gitops/ | ||
[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields | ||
[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields | ||
[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview | ||
[cloud-deploy]: https://cloud.google.com/deploy?hl=en | ||
[terraform]: https://www.terraform.io/ | ||
[gke]: https://cloud.google.com/kubernetes-engine?hl=en | ||
[git]: https://git-scm.com/ | ||
[github]: https://github.com/ | ||
[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects | ||
[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens | ||
[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts | ||
``` | ||
git clone --recurse-submodules https://github.com/GoogleCloudPlatform/ai-on-gke.git | ||
cd ai-on-gke/best-practices/accelerated-platforms | ||
``` |
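The clone command above fetches the submodule in one step. A checkout made without `--recurse-submodules` leaves the submodule directory empty until `git submodule update --init --recursive` is run. The following is a hedged, offline sketch of that behaviour. It uses throwaway local repositories instead of the real GitHub remotes; the helper `g`, the directory names under `work`, and the `protocol.file.allow=always` override (needed only because the stand-in remote is a local path) are assumptions made for illustration.

```shell
work="$(mktemp -d)"

# Hypothetical helper: git with demo identity and local-file transport enabled.
g() {
  git -c protocol.file.allow=always -c user.name=demo \
      -c user.email=demo@example.com -c init.defaultBranch=main "$@"
}

# Stand-in for the upstream repository that becomes the submodule.
g init -q "$work/upstream"
echo "moved content" > "$work/upstream/README.md"
g -C "$work/upstream" add README.md
g -C "$work/upstream" commit -q -m "initial"

# Stand-in superproject that records the submodule at the path used in this commit.
g init -q "$work/super"
g -C "$work/super" submodule --quiet add "$work/upstream" best-practices/accelerated-platforms
g -C "$work/super" commit -q -m "add submodule"

# A plain clone (no --recurse-submodules) leaves the submodule directory empty.
g clone -q "$work/super" "$work/clone"
before="$(ls -A "$work/clone/best-practices/accelerated-platforms" | wc -l)"

# Populate the submodule after the fact.
g -C "$work/clone" submodule --quiet update --init --recursive
after_file="$work/clone/best-practices/accelerated-platforms/README.md"

echo "entries before init: $before"
[ -f "$after_file" ] && echo "submodule populated"
```

Against the real repository, the equivalent step is running `git submodule update --init --recursive` from the root of an existing `ai-on-gke` checkout; the file-transport override should not be needed there because the submodule URL is HTTPS.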
Binary file removed (-93.3 KB): best-practices/ml-platform/docs/images/log-analytics-data-processing.png
Binary file removed (-226 KB): best-practices/ml-platform/docs/images/notebook/jupyter-generate-requirements.png
Binary file removed (-496 KB): best-practices/ml-platform/docs/images/notebook/jupyter-gpt-j-online-ipynb.png
Binary file removed (-204 KB): best-practices/ml-platform/docs/images/notebook/jupyter-gpt-j-online-py.png
Binary file removed (-112 KB): best-practices/ml-platform/docs/images/notebook/jupyter-nbconvert.png
Binary file removed (-114 KB): best-practices/ml-platform/docs/images/platform/oauth-consent-screen.png
263 changes: 0 additions & 263 deletions, file deleted: ...ces/ml-platform/docs/images/platform/playground/mlp_playground_architecture.svg
339 changes: 0 additions & 339 deletions, file deleted: ...l-platform/docs/images/platform/playground/mlp_playground_architecture_full.svg
Binary file removed (-603 KB): best-practices/ml-platform/docs/images/use-case/MLFlow_experiment_tracking.png
Binary file removed (-89.9 KB): best-practices/ml-platform/docs/images/use-case/data-processing-ray-workflow.png
Binary file removed (-98.7 KB): best-practices/ml-platform/docs/images/use-case/mlflow_epoch_loss.png
This file was deleted.
46 changes: 0 additions & 46 deletions, file deleted: best-practices/ml-platform/docs/platform/playground/architecture.md