Add config option to allow use with HPAs #329
Comments
One requirement is to scale based on some metric from a GPU (or set of GPUs) that indicates how busy the GPU(s) are.
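For illustration only, a GPU-utilization target could be expressed through the HPA's pods/custom metrics support. The sketch below assumes a metrics pipeline outside modelmesh (e.g. the DCGM exporter plus prometheus-adapter) exposes per-pod GPU utilization, and the metric and object names are placeholders:

```yaml
# Illustrative sketch: assumes GPU utilization is already exposed per pod
# through the custom metrics API (e.g. DCGM exporter + prometheus-adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: modelmesh-serving-gpu-runtime    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-gpu-runtime  # hypothetical runtime deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL       # assumed metric name from the DCGM exporter
      target:
        type: AverageValue
        averageValue: "70"               # scale out when average GPU utilization exceeds ~70%
```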
@njhill one possible fallback/alternative I was thinking of while this gets implemented in modelmesh: we run a separate pod in the background that continuously watches certain metrics exposed by mm and re-applies the mm CR with the right number of replicas. This is a pretty crude fallback, since it means granting that pod additional permissions to interact with the k8s API server.
Unlike ModelMesh, KServe creates a new pod for each InferenceService, so an HPA can be created per InferenceService. In ModelMesh, however, several InferenceServices share pods, so an autoscaler cannot be configured via InferenceService annotations the way it is in KServe.

First, if there is an autoscaler annotation on the ServingRuntime and there are no predictors (i.e. no InferenceServices), only the Deployment is created as before, with replicas set to 0. Later, when a new InferenceService is created, an HPA is created and it controls the number of replicas. Conversely, when all InferenceServices are deleted, the HPA is deleted and the replicas are scaled back to zero using the existing logic.

This is a 100% CPU-based scenario, no GPUs. Regarding HPA metrics, I haven't looked at them deeply, but in a first phase we can simply start with resource metrics such as CPU/memory; after that we can research better metrics for the HPA. If you like this idea, I want to prepare a PoC for this HPA feature next week. Please let me know your opinion.
@atinsood yeah I think what you described would work, not sure whether it's worth it though... @Jooho that sounds good to me. It doesn't need to be CPU-only though, right? Since HPA can presumably watch any metrics. In general I think we would want the scale-down to happen very slowly if possible (a decent delay after load drops before starting to remove pods gradually).
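As a rough sketch of the slow-scale-down point, the `autoscaling/v2` API lets the HPA's scale-down be dampened with a stabilization window and rate policies. The runtime name, metric choice, and numbers below are placeholders, not recommended values:

```yaml
# Sketch of a conservative scale-down policy for a runtime deployment (names/values illustrative).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: modelmesh-serving-mlserver      # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-mlserver    # hypothetical runtime deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes after load drops
      policies:
      - type: Pods
        value: 1                        # then remove at most one pod per minute
        periodSeconds: 60
```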
…vingRuntime (#342)

Enable Horizontal Pod Autoscaling for ServingRuntime/ClusterServingRuntime by adding the annotation `serving.kserve.io/autoscalerClass: hpa`

- Add auto-scaling, HPA controller
- Add ServingRuntime webhook
- Update deployment manifests
- Add script to generate a self-signed certificate
- Add option to enable a self-signed certificate to the install script
- Add deploy-release-dev-mode-fvt target to Makefile
- Add FVT and unit tests
- Upgrade FVT minikube version from 1.25 to 1.27
- Enable FVT deployment on OpenShift (etcd --data-dir)
- Update docs

Resolves #329

Signed-off-by: Jooho Lee <ljhiyh@gmail.com>
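The opt-in from the PR above looks roughly like this on a ServingRuntime. Only the `serving.kserve.io/autoscalerClass: hpa` annotation comes from the change itself; the rest of the spec is a trimmed-down illustrative example:

```yaml
# Sketch of opting a runtime into HPA-managed replicas; names, model format,
# and image are example values, not taken from the PR.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime               # hypothetical
  annotations:
    serving.kserve.io/autoscalerClass: hpa
spec:
  supportedModelFormats:
  - name: sklearn
    version: "1"
    autoSelect: true
  multiModel: true
  containers:
  - name: mlserver
    image: seldonio/mlserver:1.3.2    # illustrative image/tag
```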
Currently the modelmesh-serving controller will start a fixed number of pod replicas per runtime if there are any predictors created that would use the runtime. This number is configurable globally and per-runtime. You can disable the scale-to-zero behaviour so that it will keep this number of replicas even when there are no predictors.
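For context, the global replica count and scale-to-zero behaviour referred to here are set in the user ConfigMap; the field names below follow the modelmesh-serving configuration docs, and the values are examples only:

```yaml
# Sketch of the relevant knobs in the model-serving-config ConfigMap
# (field names as documented for modelmesh-serving; values are examples).
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    podsPerRuntime: 2        # global default replica count per runtime
    scaleToZero:
      enabled: true          # scale a runtime to 0 when it has no predictors
      gracePeriodSeconds: 60
```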
To allow for HPA auto-scaling at the pod level, it would be useful to be able to tell the controller not to manage the number of replicas and leave this to another controller.
Note that there could be problems using this in the over-commit scenarios that model-mesh was designed for: really the HPA should take into account both load and some cache-related metrics (e.g. the global LRU value). But if the number of models in use always fits into a single pod, it could still be useful (even though non-modelmesh KServe is likely a better choice in these situations).
I haven't looked into what metrics HPA can be linked to, but another consideration is that the load across modelmesh pods can be very uneven by design (for example, if a heavily used model is only loaded in a subset of them). Ideally the HPA would take this into account too, in particular when choosing which pod to stop when scaling down. For this reason it might ultimately be better for the mm-serving controller to own the scaling.