Add config option to allow use with HPAs #329
Comments
One requirement is to scale based on some metric from a GPU (or set of GPUs) that indicates how busy the GPU(s) are.
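For illustration only, a GPU-utilization target could be expressed through the HPA's pods/custom metrics support. The sketch below assumes a metrics pipeline outside modelmesh (e.g. the DCGM exporter plus prometheus-adapter) exposes per-pod GPU utilization, and the metric and object names are placeholders:

```yaml
# Illustrative sketch: assumes GPU utilization is already exposed per pod
# through the custom metrics API (e.g. DCGM exporter + prometheus-adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: modelmesh-serving-gpu-runtime    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-gpu-runtime  # hypothetical runtime deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL       # assumed metric name from the DCGM exporter
      target:
        type: AverageValue
        averageValue: "70"               # scale out when average GPU utilization exceeds ~70%
```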
@njhill one possible fallback/alternative I was thinking of while this gets implemented in modelmesh: we run a separate pod in the background that continuously watches certain metrics exposed by mm and re-applies the mm CR with the right number of replicas. This is a pretty crude fallback, since it means granting that pod additional permissions to interact with the k8s API server.
Unlike ModelMesh, KServe creates a new pod for each InferenceService, so an HPA can be created per InferenceService. In ModelMesh, however, several InferenceServices share pods, so an autoscaler cannot be configured via InferenceService annotations the way it is in KServe.

First, if there is an autoscaler annotation on the ServingRuntime and there are no predictors (i.e. no InferenceServices), only the Deployment is created as before, with replicas set to 0. Later, when a new InferenceService is created, an HPA is created and it controls the number of replicas. Conversely, when all InferenceServices are deleted, the HPA is deleted and the replicas are scaled back to zero using the existing logic.

This is a 100% CPU-based scenario, no GPUs. Regarding HPA metrics, I haven't looked at them deeply, but in a first phase we can simply start with resource metrics such as CPU/memory; after that we can research better metrics for the HPA. If you like this idea, I want to prepare a PoC for this HPA feature next week. Please let me know your opinion.
@atinsood yeah I think what you described would work, not sure whether it's worth it though... @Jooho that sounds good to me. It doesn't need to be CPU-only though, right? Since HPA can presumably watch any metrics. In general I think we would want the scale-down to happen very slowly if possible (a decent delay after load drops before starting to remove pods gradually).
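As a rough sketch of the slow-scale-down point, the `autoscaling/v2` API lets the HPA's scale-down be dampened with a stabilization window and rate policies. The runtime name, metric choice, and numbers below are placeholders, not recommended values:

```yaml
# Sketch of a conservative scale-down policy for a runtime deployment (names/values illustrative).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: modelmesh-serving-mlserver      # hypothetical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-mlserver    # hypothetical runtime deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes after load drops
      policies:
      - type: Pods
        value: 1                        # then remove at most one pod per minute
        periodSeconds: 60
```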
…vingRuntime (#342)

Enable Horizontal Pod Autoscaling for ServingRuntime/ClusterServingRuntime by adding the annotation `serving.kserve.io/autoscalerClass: hpa`

- Add auto-scaling, HPA controller
- Add ServingRuntime webhook
- Update deployment manifests
- Add script to generate a self-signed certificate
- Add option to enable a self-signed certificate to the install script
- Add deploy-release-dev-mode-fvt target to Makefile
- Add FVT and unit tests
- Upgrade FVT minikube version from 1.25 to 1.27
- Enable FVT deployment on OpenShift (etcd --data-dir)
- Update docs

Resolves #329

Signed-off-by: Jooho Lee <ljhiyh@gmail.com>
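The opt-in from the PR above looks roughly like this on a ServingRuntime. Only the `serving.kserve.io/autoscalerClass: hpa` annotation comes from the change itself; the rest of the spec is a trimmed-down illustrative example:

```yaml
# Sketch of opting a runtime into HPA-managed replicas; names, model format,
# and image are example values, not taken from the PR.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime               # hypothetical
  annotations:
    serving.kserve.io/autoscalerClass: hpa
spec:
  supportedModelFormats:
  - name: sklearn
    version: "1"
    autoSelect: true
  multiModel: true
  containers:
  - name: mlserver
    image: seldonio/mlserver:1.3.2    # illustrative image/tag
```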
Currently the modelmesh-serving controller will start a fixed number of pod replicas per runtime if there are any predictors created that would use the runtime. This number is configurable globally and per-runtime. You can disable the scale-to-zero behaviour so that it will keep this number of replicas even when there are no predictors.
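For context, the global replica count and scale-to-zero behaviour referred to here are set in the user ConfigMap; the field names below follow the modelmesh-serving configuration docs, and the values are examples only:

```yaml
# Sketch of the relevant knobs in the model-serving-config ConfigMap
# (field names as documented for modelmesh-serving; values are examples).
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    podsPerRuntime: 2        # global default replica count per runtime
    scaleToZero:
      enabled: true          # scale a runtime to 0 when it has no predictors
      gracePeriodSeconds: 60
```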
To allow for HPA auto-scaling at the pod level, it would be useful to be able to tell the controller not to manage the number of replicas and leave this to another controller.
Note that there could be problems using this in the over-commit scenarios that model-mesh was designed for: really the HPA should take into account both load and some cache-related metrics (e.g. the global LRU value). But if the number of models in use always fits into a single pod, it could still be useful (even though non-modelmesh KServe is likely a better choice in these situations).
I haven't looked into what metrics HPA can be linked to, but another consideration is that the load across modelmesh pods can be very uneven by design (for example, if a heavily used model is only loaded in a subset of them). Ideally the HPA would take this into account too, in particular when choosing which pod to stop when scaling down. For this reason it might ultimately be better for the mm-serving controller to own the scaling.