Ray TPU Webhook Reliability Improvements #723

Merged
merged 61 commits into from
Jul 25, 2024
Changes from 59 commits
Commits
f22a4a6
Generate hostnames at Pod creation
ryanaoleary Jul 1, 2024
274ac21
Update tests and add error checking
ryanaoleary Jul 1, 2024
f20e6f3
Make webhook stateless in between mutate calls
ryanaoleary Jul 3, 2024
02986e7
Formatting changes
ryanaoleary Jul 3, 2024
c628ca8
Fix bug causing incorrect IDs
ryanaoleary Jul 4, 2024
d9880eb
Add cluster role and log formatting changes
ryanaoleary Jul 4, 2024
1cb96ce
Filter pods by Ray worker group label
ryanaoleary Jul 15, 2024
c2ce875
Vulnerability fixes
ryanaoleary Jul 16, 2024
df09e9d
Better names and add ServiceAccount
ryanaoleary Jul 16, 2024
7829c6f
Change version back to v1.1
ryanaoleary Jul 16, 2024
0f128f6
Change implementation to use PodInformer
ryanaoleary Jul 17, 2024
42b5c39
Use PodLister
ryanaoleary Jul 17, 2024
e9b2205
updateSliceToWorkerIDs returns error
ryanaoleary Jul 17, 2024
1a8311f
Use mutex lock in updateSliceToWorkerIDs
ryanaoleary Jul 17, 2024
cffc57f
Update unit tests and fix comments
ryanaoleary Jul 18, 2024
7bba1e1
Remove global client var
ryanaoleary Jul 18, 2024
c98f757
Just return err instead of logging
ryanaoleary Jul 18, 2024
241d04d
TODO comment
ryanaoleary Jul 18, 2024
39e7c58
Lock when reading from shared sliceToWorkerIDs mapping
ryanaoleary Jul 18, 2024
c9c6371
Switch to using RWMutex
ryanaoleary Jul 19, 2024
ec7207a
Ray TPU Webhook Autoscaling Support (#740)
ryanaoleary Jul 19, 2024
25a5640
Generate hostnames at Pod creation
ryanaoleary Jul 1, 2024
a552118
Update tests and add error checking
ryanaoleary Jul 1, 2024
f038d21
Close stop channel on webhook termination
ryanaoleary Jul 19, 2024
8efed67
Refactor webhook to avoid using global vars
ryanaoleary Jul 19, 2024
dce1bbe
Fix comments
ryanaoleary Jul 20, 2024
2122992
Change service account name
ryanaoleary Jul 20, 2024
daff1ed
Return BadRequest if invalid kind
ryanaoleary Jul 20, 2024
12614a9
Fix comments
ryanaoleary Jul 22, 2024
869f275
Change error messages
ryanaoleary Jul 22, 2024
7895a44
Fatal log in main
ryanaoleary Jul 22, 2024
16607eb
Update function comments
ryanaoleary Jul 22, 2024
221d595
Refactor to minimize indentations
ryanaoleary Jul 22, 2024
8106aad
Change sliceToWorkerIDs nil check to use len
ryanaoleary Jul 22, 2024
ce531da
Write http.Error to header
ryanaoleary Jul 22, 2024
25bcdb8
Don't fatal log in validateRayCluster
ryanaoleary Jul 22, 2024
07d384d
Check for nil admission request
ryanaoleary Jul 22, 2024
7cf1237
Add doc comment
ryanaoleary Jul 22, 2024
355de76
Update expected errors
ryanaoleary Jul 22, 2024
df2ed78
Better getNextWorkerID logic
ryanaoleary Jul 23, 2024
c49e914
Update replicaIndex and nextWorkerID tests
ryanaoleary Jul 23, 2024
b70535d
Refactor webhook unit tests
ryanaoleary Jul 23, 2024
0fe181a
Create numOfHosts pods for Pod List
ryanaoleary Jul 23, 2024
c15e3ab
Log admission request object name
ryanaoleary Jul 23, 2024
af1993f
Fix nits and go vet output
ryanaoleary Jul 24, 2024
b639368
Initial cloudbuil commit
ryanaoleary Jul 24, 2024
843e7b7
Fix vet command
ryanaoleary Jul 24, 2024
0d4d0c0
Update cloudbuild
ryanaoleary Jul 24, 2024
54c97f1
Fix cloudbuild errors
ryanaoleary Jul 24, 2024
4149cc5
Add dir
ryanaoleary Jul 24, 2024
7c69b01
Remove arg
ryanaoleary Jul 24, 2024
351051a
Change to bash command
ryanaoleary Jul 24, 2024
adcd5f6
increase timeout time
ryanaoleary Jul 24, 2024
6614355
Fix validateRayCluster test
ryanaoleary Jul 24, 2024
746f898
Fix nits for cloudbuild
ryanaoleary Jul 24, 2024
4164563
Break early in validateRayCluster
ryanaoleary Jul 24, 2024
54b8a71
Remove unnecessary args from validateRayCluster test
ryanaoleary Jul 24, 2024
fdf4e6c
Change break to continue
ryanaoleary Jul 24, 2024
78e15a8
Remove unused vars from webhook tests and add edge cases
ryanaoleary Jul 25, 2024
d6a2eab
Merge branch 'GoogleCloudPlatform:main' into autoscaling-changes
ryanaoleary Jul 25, 2024
c0a0dae
Update helm chart
ryanaoleary Jul 25, 2024
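Several of the commits above ("Use mutex lock in updateSliceToWorkerIDs", "Lock when reading from shared sliceToWorkerIDs mapping", "Switch to using RWMutex", "Better getNextWorkerID logic") concern guarding a shared slice-to-worker-ID mapping. The webhook's Go source is not shown in this view, so the following is only a minimal sketch of that pattern. The type and method names (`sliceKey`, `workerIDTracker`, `assignNextWorkerID`, `releaseWorkerID`) and the "smallest unused ID" rule are assumptions for illustration, not the PR's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// sliceKey is a hypothetical identifier for one TPU slice; the field names
// are assumptions for this sketch, not taken from the PR.
type sliceKey struct {
	clusterName  string
	groupName    string
	replicaIndex int
}

// workerIDTracker guards a slice-to-worker-ID mapping with a sync.RWMutex so
// concurrent readers do not block each other while writers stay exclusive.
type workerIDTracker struct {
	mu               sync.RWMutex
	sliceToWorkerIDs map[sliceKey][]int
}

func newWorkerIDTracker() *workerIDTracker {
	return &workerIDTracker{sliceToWorkerIDs: make(map[sliceKey][]int)}
}

// workerIDs returns a copy of the IDs recorded for a slice, under a read lock.
func (t *workerIDTracker) workerIDs(k sliceKey) []int {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return append([]int(nil), t.sliceToWorkerIDs[k]...)
}

// assignNextWorkerID records and returns the smallest non-negative ID that is
// not yet in use for the slice. This selection rule is an assumption.
func (t *workerIDTracker) assignNextWorkerID(k sliceKey) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	used := make(map[int]bool, len(t.sliceToWorkerIDs[k]))
	for _, id := range t.sliceToWorkerIDs[k] {
		used[id] = true
	}
	next := 0
	for used[next] {
		next++
	}
	t.sliceToWorkerIDs[k] = append(t.sliceToWorkerIDs[k], next)
	return next
}

// releaseWorkerID forgets an ID, for example after its Pod has been deleted.
func (t *workerIDTracker) releaseWorkerID(k sliceKey, id int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	ids := t.sliceToWorkerIDs[k]
	for i, v := range ids {
		if v == id {
			t.sliceToWorkerIDs[k] = append(ids[:i], ids[i+1:]...)
			return
		}
	}
}

func main() {
	tracker := newWorkerIDTracker()
	k := sliceKey{clusterName: "demo", groupName: "tpu-group", replicaIndex: 0}

	// Four concurrent admissions for the same slice each get a distinct ID.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tracker.assignNextWorkerID(k)
		}()
	}
	wg.Wait()
	fmt.Println(len(tracker.workerIDs(k)))

	// Releasing ID 1 makes it the next one handed out.
	tracker.releaseWorkerID(k, 1)
	fmt.Println(tracker.assignNextWorkerID(k))
}
```

Taking the write lock for the whole read-modify-write in `assignNextWorkerID` is what keeps two simultaneous admission requests from receiving the same ID. The read lock in `workerIDs` only needs to cover the copy.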
14 changes: 13 additions & 1 deletion ray-on-gke/tpu/kuberay-tpu-webhook/Makefile
@@ -16,7 +16,19 @@ webhook:

# Run against the configured Kubernetes cluster in ~/.kube/config
run: webhook
go run ./main.go
go run ./main.go

# Run go fmt against code.
fmt:
go fmt ./...

# Run go vet against code.
vet:
go vet ./...

# Run go test against code.
test:
go test ./...

uninstall:
kubectl delete -f deployments/
45 changes: 45 additions & 0 deletions ray-on-gke/tpu/kuberay-tpu-webhook/cloudbuild.yaml
@@ -0,0 +1,45 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

steps:
- id: 'kuberay tpu webhook fmt'
  name: 'golang:1.21.11'
  dir: /workspace/ray-on-gke/tpu/kuberay-tpu-webhook
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    set -e
    make fmt
- id: 'kuberay tpu webhook vet'
  name: 'golang:1.21.11'
  dir: /workspace/ray-on-gke/tpu/kuberay-tpu-webhook
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    set -e
    make vet
- id: 'kuberay tpu webhook test'
  name: 'golang:1.21.11'
  dir: /workspace/ray-on-gke/tpu/kuberay-tpu-webhook
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    set -e
    make test
options:
  machineType: 'E2_HIGHCPU_8'
timeout: 600s
31 changes: 30 additions & 1 deletion ray-on-gke/tpu/kuberay-tpu-webhook/deployments/deployment.yaml
@@ -3,6 +3,34 @@ kind: Namespace
metadata:
name: ray-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kuberay-tpu-webhook
namespace: ray-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: kuberay-tpu-webhook-pod-reader
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kuberay-tpu-webhook-pod-reader
subjects:
- kind: ServiceAccount
name: kuberay-tpu-webhook
namespace: ray-system
roleRef:
kind: ClusterRole
name: kuberay-tpu-webhook-pod-reader
apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
@@ -20,12 +48,13 @@ spec:
labels:
app: kuberay-tpu-webhook
spec:
serviceAccountName: kuberay-tpu-webhook
containers:
- image: us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook/kuberay-tpu-webhook:v1.1
imagePullPolicy: Always
name: kuberay-tpu-webhook
args:
- --v=0 # change this value to 1 for verbose logging
- --v=0 # change this value to 1 for verbose logging
ports:
- name: https
containerPort: 443
@@ -5,30 +5,6 @@ metadata:
annotations:
cert-manager.io/inject-ca-from: ray-system/kuberay-tpu-webhook-certs
webhooks:
- name: pods-kuberay-tpu-webhook.ray-system.svc
admissionReviewVersions: [v1]
sideEffects: NoneOnDryRun
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values:
- kube-system
- kube-node-lease
clientConfig:
service:
name: kuberay-tpu-webhook
namespace: ray-system
path: /validate
rules:
- operations: ["DELETE"]
apiGroups: [""]
apiVersions: ["v1"]
resources: ["pods"]
scope: Namespaced
objectSelector:
matchLabels:
app.kubernetes.io/name: kuberay
- name: raycluster-kuberay-tpu-webhook.ray-system.svc
admissionReviewVersions: [v1]
sideEffects: NoneOnDryRun
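The webhook entries in this configuration route admission reviews to the webhook's HTTP handlers, and the commits "Check for nil admission request" and "Return BadRequest if invalid kind" describe input checks in those handlers. The handler code is not part of this view, so the following is a rough Go sketch of that kind of guard using only the standard library. The trimmed-down `admissionReview` type, the `admissionHandler` and `statusFor` helpers, and the choice of `RayCluster` as the accepted kind are assumptions for illustration; a real webhook would use the `k8s.io/api/admission/v1` types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// admissionReview is a hypothetical, trimmed-down stand-in for the Kubernetes
// AdmissionReview object; only the fields this sketch inspects are modeled.
type admissionReview struct {
	Request *admissionRequest `json:"request"`
}

type admissionRequest struct {
	Name string `json:"name"`
	Kind struct {
		Kind string `json:"kind"`
	} `json:"kind"`
}

// admissionHandler returns a handler that rejects undecodable reviews, nil
// requests, and unexpected kinds with 400 before any validation would run.
func admissionHandler(wantKind string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var review admissionReview
		if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
			http.Error(w, "could not decode admission review", http.StatusBadRequest)
			return
		}
		if review.Request == nil {
			http.Error(w, "admission request is nil", http.StatusBadRequest)
			return
		}
		if got := review.Request.Kind.Kind; got != wantKind {
			http.Error(w, fmt.Sprintf("invalid kind %q, expected %q", got, wantKind), http.StatusBadRequest)
			return
		}
		// A real webhook would build and write an AdmissionReview response here.
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor sends body to the handler in-process and returns the HTTP status.
func statusFor(body string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/validate", strings.NewReader(body))
	admissionHandler("RayCluster")(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(`{"request":{"name":"demo","kind":{"kind":"RayCluster"}}}`))
	fmt.Println(statusFor(`{}`))
	fmt.Println(statusFor(`{"request":{"name":"svc","kind":{"kind":"Service"}}}`))
}
```

Checking for a nil request before touching its fields avoids a nil-pointer panic on an empty or partial review body. Returning 400 tells the API server the request itself was malformed rather than denied by policy.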
12 changes: 6 additions & 6 deletions ray-on-gke/tpu/kuberay-tpu-webhook/go.mod
@@ -1,12 +1,13 @@
module github.com/GoogleCloudPlatform/kuberay-tpu-webhook

go 1.21
go 1.21.11

require (
github.com/ray-project/kuberay/ray-operator v1.1.0-rc.0
github.com/ray-project/kuberay/ray-operator v1.1.1
github.com/stretchr/testify v1.8.4
k8s.io/api v0.29.1
k8s.io/apimachinery v0.29.1
k8s.io/client-go v0.29.0
k8s.io/klog/v2 v2.120.1
k8s.io/utils v0.0.0-20240102154912-e7106e64919e
)
@@ -16,6 +17,7 @@ require (
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/emicklei/go-restful/v3 v3.11.0 // indirect
github.com/evanphx/json-patch v5.6.0+incompatible // indirect
github.com/evanphx/json-patch/v5 v5.8.0 // indirect
github.com/fsnotify/fsnotify v1.7.0 // indirect
github.com/go-logr/logr v1.4.1 // indirect
@@ -24,7 +26,7 @@ require (
github.com/go-openapi/swag v0.22.3 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
github.com/golang/protobuf v1.5.3 // indirect
github.com/golang/protobuf v1.5.4 // indirect
github.com/google/gnostic-models v0.6.8 // indirect
github.com/google/go-cmp v0.6.0 // indirect
github.com/google/gofuzz v1.2.0 // indirect
@@ -53,17 +55,15 @@ require (
golang.org/x/term v0.18.0 // indirect
golang.org/x/text v0.14.0 // indirect
golang.org/x/time v0.3.0 // indirect
golang.org/x/tools v0.17.0 // indirect
gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/protobuf v1.33.0 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
k8s.io/apiextensions-apiserver v0.29.0 // indirect
k8s.io/client-go v0.29.0 // indirect
k8s.io/component-base v0.29.0 // indirect
k8s.io/kube-openapi v0.0.0-20231010175941-2dd684a91f00 // indirect
k8s.io/kube-openapi v0.0.0-20240228011516-70dd3763d340 // indirect
sigs.k8s.io/controller-runtime v0.17.0 // indirect
sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect
20 changes: 8 additions & 12 deletions ray-on-gke/tpu/kuberay-tpu-webhook/go.sum
@@ -34,12 +34,10 @@ github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da h1:oI5xCqsCo564l8iNU+DwB5epxmsaqB+rhGL0m5jtYqE=
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
github.com/golang/protobuf v1.3.1/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk=
github.com/golang/protobuf v1.5.3 h1:KhyjKVUg7Usr/dYsdSqoFveMYd5ko72D+zANwlG1mmg=
github.com/golang/protobuf v1.5.3/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY=
github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=
github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=
github.com/google/gnostic-models v0.6.8 h1:yo/ABAfM5IMRsS1VnXjTBvUb61tFIHozhlYvRgGre9I=
github.com/google/gnostic-models v0.6.8/go.mod h1:5n7qKqH0f5wFt+aWF8CW6pZLLNOfYuF5OpfBSENuI8U=
github.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.9/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
@@ -95,8 +93,8 @@ github.com/prometheus/common v0.45.0 h1:2BGz0eBc2hdMDLnO/8n0jeB3oPrt2D08CekT0lne
github.com/prometheus/common v0.45.0/go.mod h1:YJmSTw9BoKxJplESWWxlbyttQR4uaEcGyv9MZjVOJsY=
github.com/prometheus/procfs v0.12.0 h1:jluTpSng7V9hY0O2R9DzzJHYb2xULk9VTR1V1R/k6Bo=
github.com/prometheus/procfs v0.12.0/go.mod h1:pcuDEFsWDnvcgNzo4EEweacyhjeA9Zk3cnaOZAZEfOo=
github.com/ray-project/kuberay/ray-operator v1.1.0-rc.0 h1:QZIFpSxxoTFyC64Z6NK+TUCbQqs6PKLj2lCKHsZpLOc=
github.com/ray-project/kuberay/ray-operator v1.1.0-rc.0/go.mod h1:ZqyKKvMP5nKDldQoKmur+Wcx7wVlV9Q98phFqHzr+KY=
github.com/ray-project/kuberay/ray-operator v1.1.1 h1:mVOA1ddS9aAsPvhhHrpf0ZXgTzccIAyTbeYeDqtcfAk=
github.com/ray-project/kuberay/ray-operator v1.1.1/go.mod h1:ZqyKKvMP5nKDldQoKmur+Wcx7wVlV9Q98phFqHzr+KY=
github.com/rogpeppe/go-internal v1.11.0 h1:cWPaGQEPrBb5/AsnsZesgZZ9yb1OQ+GOISoDNXVBh4M=
github.com/rogpeppe/go-internal v1.11.0/go.mod h1:ddIwULY96R17DhadqLgMfk9H9tvdUzkipdSkR5nkCZA=
github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
@@ -156,8 +154,8 @@ golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGm
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE=
golang.org/x/tools v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA=
golang.org/x/tools v0.17.0 h1:FvmRgNOcs3kOa+T20R1uhfP9F6HgG2mfxDv1vrx1Htc=
golang.org/x/tools v0.17.0/go.mod h1:xsh6VxdV005rRVaS6SSAf9oiAqljS7UZUacMZ8Bnsps=
golang.org/x/tools v0.18.0 h1:k8NLag8AGHnn+PHbl7g43CtqZAwG60vZkLqgyZgIHgQ=
golang.org/x/tools v0.18.0/go.mod h1:GL7B4CwcLLeo59yx/9UWWuNOW1n3VZ4f5axWfML7Lcg=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
@@ -166,8 +164,6 @@ gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw
gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY=
google.golang.org/appengine v1.6.7 h1:FZR1q0exgwxzPzp/aF+VccGrSfxfPpkBqjIIEq3ru6c=
google.golang.org/appengine v1.6.7/go.mod h1:8WjMMxjGQR8xUklV/ARdw2HLXBOI7O7uCIDZVag1xfc=
google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=
google.golang.org/protobuf v1.26.0/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc=
google.golang.org/protobuf v1.33.0 h1:uNO2rsAINq/JlFpSdYEKIZ0uKD/R9cpdv0T+yoGwGmI=
google.golang.org/protobuf v1.33.0/go.mod h1:c6P6GXX6sHbq/GpV6MGZEdwhWPcYBgnhAHhKbcUYpos=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
@@ -194,8 +190,8 @@ k8s.io/component-base v0.29.0 h1:T7rjd5wvLnPBV1vC4zWd/iWRbV8Mdxs+nGaoaFzGw3s=
k8s.io/component-base v0.29.0/go.mod h1:sADonFTQ9Zc9yFLghpDpmNXEdHyQmFIGbiuZbqAXQ1M=
k8s.io/klog/v2 v2.120.1 h1:QXU6cPEOIslTGvZaXvFWiP9VKyeet3sawzTOvdXb4Vw=
k8s.io/klog/v2 v2.120.1/go.mod h1:3Jpz1GvMt720eyJH1ckRHK1EDfpxISzJ7I9OYgaDtPE=
k8s.io/kube-openapi v0.0.0-20231010175941-2dd684a91f00 h1:aVUu9fTY98ivBPKR9Y5w/AuzbMm96cd3YHRTU83I780=
k8s.io/kube-openapi v0.0.0-20231010175941-2dd684a91f00/go.mod h1:AsvuZPBlUDVuCdzJ87iajxtXuR9oktsTctW/R9wwouA=
k8s.io/kube-openapi v0.0.0-20240228011516-70dd3763d340 h1:BZqlfIlq5YbRMFko6/PM7FjZpUb45WallggurYhKGag=
k8s.io/kube-openapi v0.0.0-20240228011516-70dd3763d340/go.mod h1:yD4MZYeKMBwQKVht279WycxKyM84kkAx2DPrTXaeb98=
k8s.io/utils v0.0.0-20240102154912-e7106e64919e h1:eQ/4ljkx21sObifjzXwlPKpdGLrCfRziVtos3ofG/sQ=
k8s.io/utils v0.0.0-20240102154912-e7106e64919e/go.mod h1:OLgZIPagt7ERELqWJFomSt595RzquPNLL48iOWgYOg0=
sigs.k8s.io/controller-runtime v0.17.0 h1:fjJQf8Ukya+VjogLO6/bNX9HE6Y2xpsO5+fyS26ur/s=