Commit
Ray TPU Webhook Reliability Improvements (#723)
* Generate hostnames at Pod creation
* Update tests and add error checking
* Make webhook stateless in between mutate calls
* Formatting changes
* Fix bug causing incorrect IDs
* Add cluster role and log formatting changes
* Filter pods by Ray worker group label
* Vulnerability fixes
* Better names and add ServiceAccount
* Change version back to v1.1
* Change implementation to use PodInformer
* Use PodLister
* updateSliceToWorkerIDs returns error
* Use mutex lock in updateSliceToWorkerIDs
* Update unit tests and fix comments
* Remove global client var
* Just return err instead of logging
* TODO comment
* Lock when reading from shared sliceToWorkerIDs mapping
* Switch to using RWMutex
* Ray TPU Webhook Autoscaling Support (#740)
  * Generate hostnames at Pod creation
  * Should not fatal log in deletePod
  * deletePod admission always succeeds
  * Remove unused tests make command
  * Update tests and add error checking
  * Just return an error instead of logging
* Generate hostnames at Pod creation
* Update tests and add error checking
* Close stop channel on webhook termination
* Refactor webhook to avoid using global vars
* Fix comments
* Change service account name
* Return BadRequest if invalid kind
* Fix comments
* Change error messages
* Fatal log in main
* Update function comments
* Refactor to minimize indentations
* Change sliceToWorkerIDs nil check to use len
* Write http.Error to header
* Don't fatal log in validateRayCluster
* Check for nil admission request
* Add doc comment
* Update expected errors
* Better getNextWorkerID logic
* Update replicaIndex and nextWorkerID tests
* Refactor webhook unit tests
* Create numOfHosts pods for Pod List
* Log admission request object name
* Fix nits and go vet output
* Initial cloudbuild commit
* Fix vet command
* Update cloudbuild
* Fix cloudbuild errors
* Add dir
* Remove arg
* Change to bash command
* increase timeout time
* Fix validateRayCluster test
* Fix nits for cloudbuild
* Break early in validateRayCluster
* Remove unnecessary args from validateRayCluster test
* Change break to continue
* Remove unused vars from webhook tests and add edge cases
* Update helm chart

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>