Ray TPU Webhook Reliability Improvements #723

Merged
merged 61 commits into GoogleCloudPlatform:main on Jul 25, 2024

Collaborator

@ryanaoleary ryanaoleary commented Jul 4, 2024

This PR depends on the changes in #740, which should be merged first (c4cdf31 marks the start of the changes in this PR).

This PR improves the reliability of the webhook by making it stateless between calls, which fixes issues caused by the sliceToWorkers mapping being cleared when the webhook restarts. The changes add a k8s PodInformer to the webhook that watches Pods in the GKE cluster with the ray.io/node-type=worker label. The webhook can then determine the next replicaIndex and TPU_WORKER_ID from the PodInformer cache, which removes the need to intercept Pod deletion requests.
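
As a rough illustration of deriving the next replicaIndex and TPU_WORKER_ID from cached Pods rather than from in-memory state, here is a minimal stdlib-only sketch. It is not the code in main.go: the `cachedPod` type, its `Group`, `ReplicaIndex`, and `WorkerID` fields, the `nextAssignment` helper, and the "lowest free slot" policy are all assumptions made for the example, and a plain slice stands in for the PodInformer cache. Only the ray.io/node-type=worker label comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// cachedPod is a hypothetical, simplified view of a Pod as it might be read
// from an informer cache; the real webhook works with Kubernetes Pod objects.
type cachedPod struct {
	Name         string
	Labels       map[string]string
	Group        string // worker group the Pod belongs to (assumed field)
	ReplicaIndex int    // slice index previously assigned to the Pod
	WorkerID     int    // value previously assigned as TPU_WORKER_ID
}

// nextAssignment derives the next (replicaIndex, TPU_WORKER_ID) pair purely
// from the Pods currently in the cache, so nothing has to survive a restart.
// It picks the lowest replica index with a free host slot, then the lowest
// unused worker ID inside that replica (an assumed policy for this sketch).
func nextAssignment(pods []cachedPod, group string, hostsPerSlice int) (int, int, error) {
	if hostsPerSlice < 1 {
		return 0, 0, errors.New("hostsPerSlice must be at least 1")
	}
	// used[replica][workerID] is true when a cached worker Pod holds that slot.
	used := make(map[int]map[int]bool)
	for _, p := range pods {
		if p.Labels["ray.io/node-type"] != "worker" || p.Group != group {
			continue
		}
		if used[p.ReplicaIndex] == nil {
			used[p.ReplicaIndex] = make(map[int]bool)
		}
		used[p.ReplicaIndex][p.WorkerID] = true
	}
	// The search stops at the latest when it reaches a replica with no cached Pods.
	for replica := 0; ; replica++ {
		for id := 0; id < hostsPerSlice; id++ {
			if !used[replica][id] {
				return replica, id, nil
			}
		}
	}
}

func main() {
	worker := map[string]string{"ray.io/node-type": "worker"}
	pods := []cachedPod{
		{Name: "w0", Labels: worker, Group: "tpu-group", ReplicaIndex: 0, WorkerID: 0},
		{Name: "w1", Labels: worker, Group: "tpu-group", ReplicaIndex: 0, WorkerID: 1},
		{Name: "w2", Labels: worker, Group: "tpu-group", ReplicaIndex: 1, WorkerID: 1},
		{Name: "head", Labels: map[string]string{"ray.io/node-type": "head"}},
	}
	replica, id, err := nextAssignment(pods, "tpu-group", 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("replicaIndex=%d TPU_WORKER_ID=%d\n", replica, id) // replicaIndex=1 TPU_WORKER_ID=0
}
```

Because the result is recomputed from whatever Pods are currently visible, a deleted Pod frees its slot simply by disappearing from the list, which is why a design like this would not need to intercept deletion requests.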

This PR has been tested as follows:

  • Unit tests
  • Manual tests covering single-host, multi-host, and multi-slice RayClusters; an autoscaling RayCluster with a TPU worker group added; interception of additional Pods after a webhook restart; and multiple webhook replicas with a KubeRay operator restart

@ryanaoleary ryanaoleary self-assigned this Jul 4, 2024
@ryanaoleary ryanaoleary changed the title from "Ray TPU Webhook Auto-scaling Support and Reliability Improvements" to "Ray TPU Webhook Autoscaling Support and Reliability Improvements" Jul 4, 2024
@ryanaoleary ryanaoleary deleted the autoscaling-changes branch July 15, 2024 22:25
@ryanaoleary ryanaoleary restored the autoscaling-changes branch July 15, 2024 23:22
@ryanaoleary ryanaoleary reopened this Jul 15, 2024
@ryanaoleary ryanaoleary changed the title from "Ray TPU Webhook Autoscaling Support and Reliability Improvements" to "Ray TPU Webhook Reliability Improvements" Jul 15, 2024
@ryanaoleary ryanaoleary requested a review from andrewsykim July 17, 2024 07:00
ray-on-gke/tpu/kuberay-tpu-webhook/main.go: 5 review threads (4 outdated, all resolved)
@ryanaoleary ryanaoleary requested a review from andrewsykim July 17, 2024 20:45
ray-on-gke/tpu/kuberay-tpu-webhook/main.go: 5 review threads (all outdated, all resolved)
@ryanaoleary ryanaoleary requested a review from andrewsykim July 18, 2024 20:40
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary requested a review from andrewsykim July 23, 2024 21:36
ray-on-gke/tpu/kuberay-tpu-webhook/main.go: 3 review threads (all outdated, all resolved)
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary force-pushed the autoscaling-changes branch from 67cc91c to f9e1b4f July 24, 2024 19:55
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@ryanaoleary ryanaoleary force-pushed the autoscaling-changes branch from f9e1b4f to 54c97f1 July 24, 2024 19:57
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
ryanaoleary and others added 2 commits July 25, 2024 00:56
Collaborator

@andrewsykim andrewsykim left a comment

LGTM!

@ryanaoleary ryanaoleary dismissed spencer-p’s stale review July 25, 2024 01:20

Changes completed

@ryanaoleary ryanaoleary merged commit 0ae82b1 into GoogleCloudPlatform:main Jul 25, 2024
6 checks passed
3 participants