Releases · GoogleCloudPlatform/ai-on-gke · GitHub

15 Sep 00:18

imreddy13

TPU support for Ray, persistent Ray logs & metrics, JupyterHub improvements

AI on GKE 1.0.1

The 1.0.1 patch introduces TPU support for Ray, persistent and searchable Ray logs and metrics, and pre-configured resource profiles for JupyterHub.

Support for TPUs with Ray

TPUs are now first-class resources in Ray’s resource orchestration layer, so scheduling work on them feels much like using GPUs. The user guide outlines how to get started with TPUs on Ray.
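As a rough illustration of what "just like GPUs" can look like in practice, the sketch below requests TPU chips for a Ray task through Ray's custom-resource mechanism. It assumes the TPU worker nodes advertise a resource named `TPU` and that each worker exposes 4 chips; both are assumptions, not details taken from this release. The helper names (`tpu_options`, `report_host`, `main`) are hypothetical.

```python
import socket
from typing import Any, Dict


def tpu_options(num_chips: int) -> Dict[str, Any]:
    """Build Ray task options that request `num_chips` units of a "TPU" resource.

    Assumption: TPU worker nodes register a custom resource called "TPU".
    """
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return {"resources": {"TPU": num_chips}}


def report_host() -> str:
    """Trivial task body: return the hostname of the worker that ran it."""
    return socket.gethostname()


def main() -> None:
    # Imported here so the helpers above can be used without Ray installed.
    import ray

    ray.init()  # connects to the cluster when run as a Ray job
    task = ray.remote(report_host).options(**tpu_options(4))  # 4 chips: assumed
    print(ray.get(task.remote()))
```

Calling `main()` from a job submitted to a cluster with TPU workers should schedule `report_host` only on a node that has the requested resource available; on a cluster without such nodes the task would stay pending.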

Improvements to Ray observability

Ray on GKE automatically writes Ray logs and metrics to GCP, so users can view persistent logs and metrics across multiple clusters. Even if your Ray cluster dies, you still have visibility into previous jobs via GCP.
See the Logging & Monitoring section for more details on usage.

Logs are exported via a Fluent Bit sidecar and tagged with the Ray job submission ID. The job submission ID can be used to filter Ray job logs in Cloud Logging.
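A minimal sketch of building such a filter programmatically follows. `resource.type="k8s_container"` and `resource.labels.cluster_name` are standard Cloud Logging fields for GKE containers, but the field that carries the job submission ID is an assumption here: the default `labels.ray_job_submission_id` is a hypothetical placeholder, so check a real log entry and pass the actual field name.

```python
from typing import Optional


def _quote(value: str) -> str:
    """Quote a value for the Cloud Logging query language."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def ray_job_log_filter(
    job_submission_id: str,
    label_key: str = "labels.ray_job_submission_id",  # hypothetical field name
    cluster_name: Optional[str] = None,
) -> str:
    """Return a Cloud Logging filter that selects one Ray job's log entries."""
    clauses = ['resource.type="k8s_container"']
    if cluster_name:
        clauses.append(f"resource.labels.cluster_name={_quote(cluster_name)}")
    clauses.append(f"{label_key}={_quote(job_submission_id)}")
    return " AND ".join(clauses)
```

The resulting string can be pasted into the Logs Explorer query box or passed to a logging client; `raysubmit_abc123` in a call such as `ray_job_log_filter("raysubmit_abc123")` would be an illustrative ID, not a real one.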

Metrics are exported via Prometheus and can be viewed in Cloud Monitoring.

Multiple user profiles support for JupyterHub

JupyterHub comes installed with several user profiles. Each profile specifies a different set of resources (GPU or CPU, memory, image). The user guide outlines how to get started with JupyterHub and configure profiles for your use case.
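For readers unfamiliar with how profiles are expressed, here is a hedged sketch using KubeSpawner's `profile_list` format. It is not the configuration shipped with this release: the profile names, sizes, and image tags are illustrative placeholders, and the actual deployment may define profiles through Helm values rather than Python.

```python
from typing import Any, Dict, List


def build_profile_list() -> List[Dict[str, Any]]:
    """Return two example profiles: a small CPU profile and a single-GPU profile."""
    return [
        {
            "display_name": "CPU (small)",
            "slug": "cpu-small",
            "description": "2 CPUs, 4 GB memory",
            "default": True,  # preselected on the spawn page
            "kubespawner_override": {
                "image": "jupyter/base-notebook:latest",  # placeholder image
                "cpu_limit": 2,
                "mem_limit": "4G",
            },
        },
        {
            "display_name": "GPU",
            "slug": "gpu",
            "description": "4 CPUs, 16 GB memory, 1 GPU",
            "kubespawner_override": {
                "image": "jupyter/tensorflow-notebook:latest",  # placeholder image
                "cpu_limit": 4,
                "mem_limit": "16G",
                # Request one GPU via the Kubernetes extended resource name.
                "extra_resource_limits": {"nvidia.com/gpu": "1"},
            },
        },
    ]


# In jupyterhub_config.py, where JupyterHub supplies the config object `c`:
# c.KubeSpawner.profile_list = build_profile_list()
```

Keeping the profiles in a function makes them easy to validate before deployment, for example checking that exactly one profile is marked as the default.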
