
Load Balancing not happening in Multiple Replicas !!! #330

Closed
MLHafizur opened this issue Feb 16, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@MLHafizur

1. We deployed a PyTorch model using the MLServer serving runtime. Our goal is to get faster predictions under higher load, so we created 5 replicas for each runtime (see the configuration sketch below):

[Screenshot: MicrosoftTeams-image]

But when we send multiple parallel or sequential inference requests, all of the requests are handled by a single pod out of the 5 replicas, which delays the results. We also noticed that after a few successful requests, the requests start failing.

2. Keeping 5 replicas up and running is expensive. Is there any way to scale them down to zero or one when there are no inference requests?
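
For context, a minimal sketch of the kind of configuration that controls how many pods each runtime gets in ModelMesh Serving, assuming the user ConfigMap is named model-serving-config and lives in the namespace where ModelMesh Serving is installed (the name, namespace, and value below are illustrative; check the configuration docs for your install):

```yaml
# Sketch only: set the number of pods started for each ServingRuntime.
# The ConfigMap name and namespace are assumptions and may differ per install.
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
  namespace: modelmesh-serving
data:
  config.yaml: |
    # Global default number of pods per ServingRuntime deployment.
    podsPerRuntime: 5
```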
@MLHafizur MLHafizur added the bug Something isn't working label Feb 16, 2023
@njhill
Member

njhill commented Feb 18, 2023

1. We deployed a PyTorch model using the MLServer serving runtime. Our goal is to get faster predictions under higher load, so we created 5 replicas for each runtime:

@MLHafizur if there's sufficient load, the model should get loaded in additional replicas up to the total number you have for that runtime. There will be a bit of a delay while this scale-up happens, depending on how long your models take to load.

We also noticed that after a few successful requests, the requests start failing.

You'll have to provide more detail about the kind of failure: what you see from the client side, and also logs from the containers.

2. Keeping 5 replicas up and running is expensive. Is there any way to scale them down to zero or one when there are no inference requests?

There is a plan to allow it to work with HPA, see #329.

@rafvasq
Member

rafvasq commented Feb 16, 2024

Closing due to inactivity. HPA support was introduced in #342. Please feel free to reopen with more information about the failing requests, or open a new issue.
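
For anyone landing here later, a rough sketch of what opting a runtime into the HPA-based autoscaling could look like, written as a merge patch against an existing ServingRuntime; the runtime name, namespace, and annotation names/values here are assumptions and should be verified against the ModelMesh Serving autoscaling documentation:

```yaml
# runtime-hpa-patch.yaml -- sketch of a merge patch that hands pod scaling to an HPA.
# Apply with (runtime name and namespace are illustrative):
#   kubectl patch servingruntime mlserver-1.x -n modelmesh-serving \
#     --type merge --patch-file runtime-hpa-patch.yaml
metadata:
  annotations:
    serving.kserve.io/autoscalerClass: hpa              # delegate replica count to an HPA
    serving.kserve.io/targetUtilizationPercentage: "75" # target average CPU utilization
```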

@rafvasq rafvasq closed this as completed Feb 16, 2024