Loadbalance based on running requests #16
Comments
The load balancing requirement for sidekick was to always be purely random; that was the original intention, with no heuristics-based requirement.
It should be up to the target's health-path to decide whether more load can be put on the target or whether it is too busy.
@harshavardhana That assumes that all requests are equal and that all servers behave the same. At least the first assumption is always false, and the second can be.
@klauspost Your approach assumes that there's only a single Sidekick instance, which is also not always true. Look at the Splunk use case, for example.
@aweisser I may be overlooking something, but how do multiple instances affect this? If all sidekicks are trying to keep the number of running requests equal across all servers, that would be good load balancing in my book, and not just "load distribution".
A Sidekick instance can only work with heuristics it can measure. Without getting heuristics from the S3 server, and without sharing heuristics with other Sidekick instances, a local Sidekick process can only count its own requests; it does not know what other requesting clients are doing. As you said, not all requests are the same, and only the S3 server instance knows about the real load. IMHO it's all about the smartness of the health check. Maybe the /minio/health/ready endpoint can be even smarter than counting its goroutines before responding with an HTTP 503 "I'm too busy, go away!" It could take the server's system load (RAM, CPU usage) or the saturation of its NICs into account. This way a "naive" (let's better call it "simple and bulletproof") round robin over "ready" S3 targets should do the job quite well.
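For illustration, a readiness handler that refuses traffic above a goroutine threshold could look roughly like the following Go sketch; the threshold, port, and handler wiring are assumptions for the example, not MinIO's actual implementation:

```go
package main

import (
	"net/http"
	"runtime"
)

// readyHandler is a hypothetical readiness probe: it answers 503 when the
// process has more than maxGoroutines goroutines, signalling the load
// balancer to send traffic elsewhere, and 200 otherwise.
func readyHandler(maxGoroutines int) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if runtime.NumGoroutine() > maxGoroutines {
			http.Error(w, "too busy", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	http.Handle("/minio/health/ready", readyHandler(500))
	http.ListenAndServe(":9000", nil)
}
```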
@aweisser So you are saying that because it doesn't know anything more than up/down, we should stick to an algorithm that keeps piling requests onto an overloaded or underperforming server? That doesn't make sense to me. The number of requests is a perfectly valid balancing function. Instead of relying on collected metrics that may or may not indicate load (you mention some, but they are no real indication of load), keeping track of active requests is completely passive, doesn't have to rely on any metrics, and also takes sidekick->minio network issues into account.
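As a sketch of what such passive tracking could look like on the proxy side (hypothetical names, not sidekick's actual code), each backend only needs an atomic in-flight counter that is bumped around every proxied request:

```go
package balance

import (
	"net/http"
	"sync/atomic"
)

// countingBackend wraps a backend handler and tracks how many requests are
// currently in flight; no metrics from the backend itself are required.
type countingBackend struct {
	inflight int64
	handler  http.Handler
}

func (b *countingBackend) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	atomic.AddInt64(&b.inflight, 1)
	defer atomic.AddInt64(&b.inflight, -1)
	b.handler.ServeHTTP(w, r)
}

// Running reports the current number of in-flight requests to this backend.
func (b *countingBackend) Running() int64 {
	return atomic.LoadInt64(&b.inflight)
}
```

Because the counter is only decremented when the response completes, slow backends and sidekick->minio network problems automatically show up as a higher in-flight count.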
I'm sure you can find examples that speak for one approach or the other, because there is no single source of truth in a distributed system, and the number of requests is not the only metric that reflects "load". I also just noticed that the /minio/health/ready probe currently doesn't count goroutines anyway. I was confused by the following gist https://gist.github.com/nitisht/0c11d8c670f565b58d930b526ba0f2ed, which states that the readiness probe returns HTTP 503 if more than 500 goroutines are open. Maybe you at MinIO already had reasons not to do it this way, or to change the readiness probe to be equivalent to the liveness probe ("always return HTTP 200 as long as the service is running"). My opinion is still that a server-side "readiness" check is relevant for qualitative load balancing (in contrast to dumb load distribution). Surely a smarter-than-round-robin approach on the client side is also nice to have. IMHO the question should be: is it worth breaking KISS?
Fixed in #98 and released.
Current load balancing is purely round-robin.
However, different requests create different loads, which means that servers processing complex requests may be slower while other servers may be mostly idle.
As an alternative, simply choose between the servers with the fewest running requests.
warp uses an alternative host selection scheme along these lines (a rough sketch of the idea follows below).
This both gives a good distribution and will take individual server load into consideration.
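A minimal sketch of such a least-running-requests selector, assuming each host exposes an in-flight counter like the one sketched above (names are hypothetical, not warp's or sidekick's actual code):

```go
package balance

import (
	"math/rand"
	"sync/atomic"
)

// host is a hypothetical backend with a counter of in-flight requests.
type host struct {
	addr    string
	running int64
}

// pickHost returns a host with the fewest running requests, breaking ties
// randomly so that equally loaded hosts still get an even spread of traffic.
func pickHost(hosts []*host) *host {
	if len(hosts) == 0 {
		return nil
	}
	var (
		least int64 = -1
		best  []*host
	)
	for _, h := range hosts {
		n := atomic.LoadInt64(&h.running)
		switch {
		case least < 0 || n < least:
			least, best = n, []*host{h}
		case n == least:
			best = append(best, h)
		}
	}
	return best[rand.Intn(len(best))]
}
```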