GoogleCloudPlatform · arueth · Sep 20, 2024 · Sep 20, 2024
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "best-practices/accelerated-platforms"]
+	path = best-practices/accelerated-platforms
+	url = https://github.com/GoogleCloudPlatform/accelerated-platforms.git
diff --git a/best-practices/README.md b/best-practices/README.md
@@ -0,0 +1,13 @@
+# Best Practices
+
+## [AI/ML Platform for enabling AI/ML Ops on GKE Reference Architecture](best-practices/ml-platform)
+
+This reference architecture demonstrates how to construct an Artificial Intelligence/Machine Learning (AI/ML) platform that streamlines AI/ML Operations (AIMLOps), using Google Kubernetes Engine (GKE) as the underlying runtime environment. It also includes a collection of use cases that illustrate practical workflows closely aligned with AI/ML operations.
+
+## [Batch Processing Platform on GKE Reference Architecture](best-practices/gke-batch-refarch)
+
+This reference architecture is designed to assist platform administrators, cloud architects, and operations professionals in deploying a batch processing platform on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) (GKE). Utilizing GKE [Standard](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#nodes) as its foundation, this architecture leverages [Kueue](https://kueue.sigs.k8s.io/) to manage resource quotas and borrowing rules between multiple tenant teams sharing the cluster. This enables these teams to run their batch workloads in a fair, cost-efficient, and high-performance manner. Key recommendations for effectively running batch workloads on GKE, as outlined in [Best practices for running batch workloads on GKE](https://cloud.google.com/kubernetes-engine/docs/best-practices/batch-platform-on-gke), are incorporated into this reference architecture.
+
+## [Best Practices for Faster Workload Cold Start](best-practices/startup-latency.md)
+
+To enhance cold start performance of workloads on Google Kubernetes Engine (GKE), this document provides best practices and examines the elements that influence startup latency.
diff --git a/best-practices/accelerated-platforms b/best-practices/accelerated-platforms
diff --git a/best-practices/ml-platform/.gitignore b/best-practices/ml-platform/.gitignore
diff --git a/best-practices/ml-platform/CONTRIBUTING.md b/best-practices/ml-platform/CONTRIBUTING.md
diff --git a/best-practices/ml-platform/README.md b/best-practices/ml-platform/README.md
@@ -1,82 +1,6 @@
-# Machine learning platform (MLP) on GKE reference architecture for enabling Machine Learning Operations (MLOps)
+# Moved to the [GoogleCloudPlatform/accelerated-platforms](https://github.com/GoogleCloudPlatform/accelerated-platforms/blob/main/docs/platforms/gke-aiml/README.md) repository, which is a submodule [here](best-practices/accelerated-platforms)
 
-## Platform Principles
-
-This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles:
-
-- The platform admin will create the GKE platform using IaC tool like [Terraform][terraform]. The IaC will come with re-usable modules that can be referred to create more resources as the demand grows.
-- The platform will be based on [GitOps][gitops].
-- After the GKE platform has been created, cluster scoped resources on it will be created through [Config Sync][config-sync] by the admins.
-- Platform admins will create a namespace per application and provide the application team member full access to it.
-- The namespace scoped resources will be created by the Application/ML teams either via [Config Sync][config-sync] or through a deployment tool like [Cloud Deploy][cloud-deploy]
-
-For an outline of products and features used in the platform, see the [Platform Products and Features](/best-practices/ml-platform/docs/platform/products-and-features.md) document.
-
-## Critical User Journeys (CUJs)
-
-### Persona : Platform Admin
-
-- Offer a platform that incorporates established best practices.
-- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads.
-- Establish secure channels for end users to interact seamlessly with the platform.
-- Empower the enforcement of robust security policies across the platform.
-
-### Persona : Machine Learning Engineer
-
-- Deploy the model with ease and make the endpoints available only to the intended audience
-- Continuously monitor the model performance and resource utilization
-- Troubleshoot any performance or integration issues
-- Ability to version, store and access the models and model artifacts:
-  - To debug & troubleshoot in production and track back to the specific model version & associated training data
-  - To quick & controlled rollback to a previous, more stable version
-- Implement the feedback loop to adapt to changing data & business needs:
-  - Ability to retrain / fine-tune the model.
-  - Ability to split the traffic between models (A/B testing)
-  - Switching between the models without breaking inference system for the end-users
-- Ability to scaling up/down the infra to accommodate changing needs
-- Ability to share the insights and findings with stakeholders to take data-driven decisions
-
-### Persona : Machine Learning Operator
-
-- Provide and maintain software required by the end users of the platform.
-- Operationalize experimental workload by providing guidance and best practices for running the workload on the platform.
-- Deploy the workloads on the platform.
-- Assist with enabling observability and monitoring for the workloads to ensure smooth operations.
-
-## Prerequisites
-
-- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com) which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial.
-- Familiarity with following
-  - [Google Kubernetes Engine][gke]
-  - [Terraform][terraform]
-  - [git][git]
-  - [Google Configuration Management root-sync][root-sync]
-  - [Google Configuration Management repo-sync][repo-sync]
-  - [GitHub][github]
-
-## Deploy the platform
-
-[Playground Reference Architecture](examples/platform/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.
-
-## Use cases
-
-- [Distributed Data Processing with Ray](examples/use-case/data-processing/ray/README.md): Run a distributed data processing job using Ray.
-- [Dataset Prepartion for Fine Tuning Gemma IT With Gemini Flash](examples/use-case/data-preparation/gemma-it/README.md): Generate prompts for fine tuning Gemma Instruction Tuned model with Vertex AI Gemini Flash
-- [Fine Tuning Gemma2 9B IT model With FSDP](examples/use-case/fine-tuning/pytorch/README.md): Fine tune Gemma2 9B IT model with PyTorch FSDP
-
-## Resources
-
-- [Packaging Jupyter notebooks](docs/notebook/packaging.md): Patterns and tools to get your ipynb's ready for deployment in a container runtime.
-
-[gitops]: https://about.gitlab.com/topics/gitops/
-[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
-[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
-[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview
-[cloud-deploy]: https://cloud.google.com/deploy?hl=en
-[terraform]: https://www.terraform.io/
-[gke]: https://cloud.google.com/kubernetes-engine?hl=en
-[git]: https://git-scm.com/
-[github]: https://github.com/
-[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects
-[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
-[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts
+```
+git clone --recurse-submodules https://github.com/GoogleCloudPlatform/ai-on-gke.git
+cd ai-on-gke/best-practices/accelerated-platforms
+```
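Reviewer sketch (not part of the change): the submodule entry added to `.gitmodules` can be read back with `git config -f`, which is a quick way to confirm the path and URL a clone will use. The snippet below recreates the three added lines in a temporary directory as a stand-in for a real checkout; the `workdir`, `name`, `sub_path`, and `sub_url` variable names are illustrative assumptions, and git is assumed to be installed.

```shell
# Stand-in for a real clone: write the .gitmodules content from this diff
# into a temporary directory (assumption: mktemp and git are available).
workdir="$(mktemp -d)"
printf '[submodule "best-practices/accelerated-platforms"]\n\tpath = best-practices/accelerated-platforms\n\turl = https://github.com/GoogleCloudPlatform/accelerated-platforms.git\n' > "$workdir/.gitmodules"

# The submodule name is the quoted string in the section header.
name="best-practices/accelerated-platforms"

# Read the path and URL recorded for that submodule.
sub_path="$(git config -f "$workdir/.gitmodules" --get "submodule.$name.path")"
sub_url="$(git config -f "$workdir/.gitmodules" --get "submodule.$name.url")"

echo "$sub_path"
echo "$sub_url"
```

In a clone that was made without `--recurse-submodules`, the submodule directory stays empty until `git submodule update --init --recursive` is run from the repository root.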
diff --git a/best-practices/ml-platform/docs/images/create-log-based-metrics.png b/best-practices/ml-platform/docs/images/create-log-based-metrics.png
diff --git a/best-practices/ml-platform/docs/images/log-analytics-data-processing.png b/best-practices/ml-platform/docs/images/log-analytics-data-processing.png
diff --git a/best-practices/ml-platform/docs/images/log-analytics.png b/best-practices/ml-platform/docs/images/log-analytics.png
diff --git a/best-practices/ml-platform/docs/images/log-explorer-query.png b/best-practices/ml-platform/docs/images/log-explorer-query.png
diff --git a/best-practices/ml-platform/docs/images/mlflow-home.png b/best-practices/ml-platform/docs/images/mlflow-home.png
diff --git a/best-practices/ml-platform/docs/images/mlflow-model-experiment.png b/best-practices/ml-platform/docs/images/mlflow-model-experiment.png
diff --git a/best-practices/ml-platform/docs/images/monitor-job.png b/best-practices/ml-platform/docs/images/monitor-job.png
diff --git a/best-practices/ml-platform/docs/images/notebook/dockerfile.png b/best-practices/ml-platform/docs/images/notebook/dockerfile.png
diff --git a/best-practices/ml-platform/docs/images/notebook/jupyter-generate-requirements.png b/best-practices/ml-platform/docs/images/notebook/jupyter-generate-requirements.png
diff --git a/best-practices/ml-platform/docs/images/notebook/jupyter-gpt-j-online-ipynb.png b/best-practices/ml-platform/docs/images/notebook/jupyter-gpt-j-online-ipynb.png
diff --git a/best-practices/ml-platform/docs/images/notebook/jupyter-gpt-j-online-py.png b/best-practices/ml-platform/docs/images/notebook/jupyter-gpt-j-online-py.png
diff --git a/best-practices/ml-platform/docs/images/notebook/jupyter-nbconvert.png b/best-practices/ml-platform/docs/images/notebook/jupyter-nbconvert.png
diff --git a/best-practices/ml-platform/docs/images/notebook/jupyter-pairing.png b/best-practices/ml-platform/docs/images/notebook/jupyter-pairing.png
diff --git a/best-practices/ml-platform/docs/images/platform/configsync.png b/best-practices/ml-platform/docs/images/platform/configsync.png
diff --git a/best-practices/ml-platform/docs/images/platform/oauth-consent-screen.png b/best-practices/ml-platform/docs/images/platform/oauth-consent-screen.png
diff --git a/...ces/ml-platform/docs/images/platform/playground/mlp_playground_architecture.svg b/...ces/ml-platform/docs/images/platform/playground/mlp_playground_architecture.svg
diff --git a/...l-platform/docs/images/platform/playground/mlp_playground_architecture_full.svg b/...l-platform/docs/images/platform/playground/mlp_playground_architecture_full.svg
diff --git a/best-practices/ml-platform/docs/images/use-case/MLFlow_experiment_tracking.png b/best-practices/ml-platform/docs/images/use-case/MLFlow_experiment_tracking.png
diff --git a/best-practices/ml-platform/docs/images/use-case/MLOps_e2e.png b/best-practices/ml-platform/docs/images/use-case/MLOps_e2e.png
diff --git a/best-practices/ml-platform/docs/images/use-case/TensorBoard.png b/best-practices/ml-platform/docs/images/use-case/TensorBoard.png
diff --git a/best-practices/ml-platform/docs/images/use-case/data-processing-ray-workflow.png b/best-practices/ml-platform/docs/images/use-case/data-processing-ray-workflow.png
diff --git a/best-practices/ml-platform/docs/images/use-case/dataset_info.png b/best-practices/ml-platform/docs/images/use-case/dataset_info.png
diff --git a/best-practices/ml-platform/docs/images/use-case/mlflow_epoch_loss.png b/best-practices/ml-platform/docs/images/use-case/mlflow_epoch_loss.png
diff --git a/best-practices/ml-platform/docs/images/use-log-based-metrics.png b/best-practices/ml-platform/docs/images/use-log-based-metrics.png
diff --git a/best-practices/ml-platform/docs/notebook/packaging.md b/best-practices/ml-platform/docs/notebook/packaging.md
diff --git a/best-practices/ml-platform/docs/platform/playground/architecture.md b/best-practices/ml-platform/docs/platform/playground/architecture.md