
Add config option to allow use with HPAs #329

Closed
njhill opened this issue Feb 16, 2023 · 4 comments · Fixed by #342
Labels: enhancement (New feature or request)
Milestone: v0.11.0

njhill commented Feb 16, 2023

Currently the modelmesh-serving controller will start a fixed number of pod replicas per runtime if there are any predictors created that would use the runtime. This number is configurable globally and per-runtime. You can disable the scale-to-zero behaviour so that it will keep this number of replicas even when there are no predictors.
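
For context, a minimal sketch of where these knobs live, assuming the `model-serving-config` ConfigMap format with `podsPerRuntime` and `scaleToZero` keys (these names are from my reading of the project docs, not from this thread):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    # Global number of pods started per runtime (overridable per-runtime)
    podsPerRuntime: 2
    # Disable to keep podsPerRuntime replicas even when no predictors exist
    scaleToZero:
      enabled: false
```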

To allow for HPA auto-scaling at the pod level, it would be useful to be able to tell the controller not to manage the number of replicas and leave this to another controller.

Note that there could be problems using this in the over-commit scenarios that model-mesh was designed for; really the HPA should take into account both load and some cache-related metrics (e.g. the global LRU value). But if the set of models in use always fits into a single pod, it could still be useful (even though non-modelmesh KServe is likely a better choice in those situations).

I haven't looked into what metrics HPA can be linked to, but another consideration is that the load across modelmesh pods can be very uneven by design (for example, if a heavily used model is loaded in only a subset of them). Ideally the HPA should take this into account too, in particular when choosing which pod to stop when scaling down. Later on, it might be better for the mm-serving controller to own the scaling for this reason.

@ashishkamra

One requirement is to scale based on some metric from a GPU (or set of GPUs) that indicates how busy they are.
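
For illustration, one way this could look if GPU utilization were exposed through a custom-metrics pipeline (e.g. DCGM exporter plus a Prometheus adapter); the metric name, target, and deployment name are assumptions, not something the project provides out of the box:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: modelmesh-runtime-gpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-triton-2.x  # illustrative runtime Deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL      # hypothetical; requires a metrics adapter
      target:
        type: AverageValue
        averageValue: "70"
```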

@atinsood

@njhill one possible fallback/alternative I was thinking of while this gets implemented in modelmesh:

We run a separate pod in the background that continuously watches certain metrics exposed by mm and re-applies the mm CR with the right number of replicas. This is a pretty crude fallback, since it means giving this pod additional permissions to interact with the k8s master.
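
A rough sketch of that fallback, assuming a CronJob with RBAC to patch ServingRuntimes (all names are hypothetical, and the actual metric-driven decision is elided):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mm-replica-tuner             # hypothetical name
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          # Needs a Role allowing get/patch on servingruntimes.serving.kserve.io
          serviceAccountName: mm-replica-tuner
          restartPolicy: OnFailure
          containers:
          - name: tuner
            image: bitnami/kubectl:1.27
            command: ["/bin/sh", "-c"]
            args:
            - |
              # Placeholder: DESIRED would be derived from modelmesh metrics
              DESIRED=3
              kubectl patch servingruntime my-runtime --type merge \
                -p "{\"spec\":{\"replicas\":${DESIRED}}}"
```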

Jooho commented Feb 24, 2023

@njhill

Unlike ModelMesh, KServe creates a new pod for each InferenceService, so an HPA can be created per InferenceService. In ModelMesh, however, several InferenceServices share pods, so it is not possible to set up an autoscaler using InferenceService annotations the way KServe does.
Therefore, in my opinion, ModelMesh should use an annotation on the ServingRuntime. The logic would probably be something like this:

First of all, if there is an autoscaler annotation on the ServingRuntime and there is no predictor (that is, no InferenceService), only the Deployment is created as before, with replicas set to 0. Later, when a new InferenceService is created, an HPA is created, and the HPA controls the number of replicas.

Conversely, when all isvcs are deleted, the HPA is deleted and replicas are zeroed out using the existing logic.

This is a 100% CPU-based scenario, no GPUs.

Regarding HPA metrics, I didn't look at them deeply, but in the first phase we can simply start with resource metrics such as CPU/memory. After that, we can research better metrics for the HPA.
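
A sketch of what such a generated HPA might look like in that first phase, targeting the runtime's Deployment on CPU (names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: modelmesh-serving-mlserver-1.x    # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-mlserver-1.x  # the runtime's Deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
```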

If you like this idea, I want to prepare a PoC for this HPA feature next week. Please let me know your opinion.

njhill commented Feb 24, 2023

@atinsood yeah I think what you described would work, not sure whether or not it's worth it though...

@Jooho that sounds good to me. It doesn't need to be CPU-only though, right? Since HPA can presumably watch any metrics.

In general I think we would want the scale-down to happen very slowly if possible (a decent delay after load drops before starting to remove pods gradually).
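
For what it's worth, the stock HPA can at least approximate this with scale-down behavior settings (`autoscaling/v2`), though it still can't choose which pod to remove; a sketch of the relevant spec fragment:

```yaml
# Fragment of an HPA spec slowing down scale-in:
behavior:
  scaleDown:
    # Wait 10 minutes after load drops before considering scale-down
    stabilizationWindowSeconds: 600
    policies:
    # Remove at most one pod per 5 minutes
    - type: Pods
      value: 1
      periodSeconds: 300
```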

@ckadner ckadner added this to the v0.11.0 milestone Apr 14, 2023
ckadner pushed a commit that referenced this issue Apr 27, 2023
…vingRuntime (#342)

Enable Horizontal Pod Autoscaling for ServingRuntime/ClusterServingRuntime
by adding annotation `serving.kserve.io/autoscalerClass: hpa`

- Add auto-scaling, HPA controller 
- Add ServingRuntime Webhook
- Update deployment manifests
- Add script to generate self-signed certificate
- Add option to enable self-signed certificate to install script
- Add deploy-release-dev-mode-fvt target to Makefile
- Add FVT and unit tests
- Upgrade FVT minikube version from 1.25 to 1.27
- Enable FVT deployment on OpenShift (etcd --data-dir)
- Update Docs

Resolves #329

Signed-off-by: Jooho Lee <ljhiyh@gmail.com>
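
For reference, opting a runtime into HPA after this change looks roughly like the following (the annotation comes from the PR description above; the runtime name and container details are illustrative):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime             # illustrative
  annotations:
    serving.kserve.io/autoscalerClass: hpa
spec:
  supportedModelFormats:
  - name: sklearn
    version: "1"
  containers:
  - name: mlserver
    image: seldonio/mlserver:1.2.3  # illustrative
```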