
Loadbalance based on running requests #16

Closed
klauspost opened this issue Apr 15, 2020 · 9 comments
@klauspost

Current loadbalancing is purely round-robin.

However different requests create different loads which means that servers processing complex requests may be slower and some servers may be mostly idle.

As an alternative, simply choose among the servers with the fewest running requests.

warp uses alternative host selection with the following scheme:

  • Select the host with the fewest running requests.
  • If tied, select the host that has the longest time since last request finished.

This both gives a good distribution and will take individual server load into consideration.
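For illustration, here is a minimal Go sketch of that selection scheme. All names are hypothetical; this is not warp's actual implementation, just the two rules above made concrete:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// host tracks the per-backend state the scheme needs.
type host struct {
	name         string
	running      int       // requests currently in flight
	lastFinished time.Time // when the last request on this host completed
}

type balancer struct {
	mu    sync.Mutex
	hosts []*host
}

// pick returns the host with the fewest running requests,
// breaking ties by the longest time since a request finished.
func (b *balancer) pick() *host {
	b.mu.Lock()
	defer b.mu.Unlock()
	best := b.hosts[0]
	for _, h := range b.hosts[1:] {
		if h.running < best.running ||
			(h.running == best.running && h.lastFinished.Before(best.lastFinished)) {
			best = h
		}
	}
	best.running++
	return best
}

// done must be called when a request handed out by pick completes.
func (b *balancer) done(h *host) {
	b.mu.Lock()
	defer b.mu.Unlock()
	h.running--
	h.lastFinished = time.Now()
}

func main() {
	b := &balancer{hosts: []*host{{name: "server-1"}, {name: "server-2"}}}
	h := b.pick()
	fmt.Println(h.name) // host with the fewest in-flight requests wins
	b.done(h)
}
```

Note that this is entirely passive: the balancer only counts its own in-flight requests, with no metrics collection from the backends.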

@harshavardhana
Member

> Current loadbalancing is purely round-robin.

The load balancing requirement for sidekick was to always be purely random; that was the original intention, with no heuristics-based requirement.

@aweisser

It should be up to the target's health path to decide whether more load can be put on the target or whether it's too busy.
Isn't this the reason why the minio/health/ready endpoint is a good choice for Sidekick?

@klauspost
Author

@harshavardhana That assumes that all requests are equal and that all servers behave the same; at least the first assumption is always false, and the second can be.

@aweisser

@klauspost Your approach assumes that there's only a single Sidekick instance, which is also not always true. Look at the Splunk use case, for example.
Imho only the Minio server can decide whether it can take any more, with respect to its internal heuristics.

@klauspost
Author

@aweisser I may be overlooking something, but how does multiple instances affect this?

If all sidekicks are trying to keep the number of running requests equal across all servers, that would be good load balancing in my book, not just "load distribution".

@aweisser

aweisser commented Feb 10, 2021

A sidekick instance can only work on heuristics that it can measure. Without getting heuristics from the S3 server, and without sharing heuristics with other Sidekick instances, a local Sidekick process can only count its local requests, without knowing what other requesting clients do.

As you said, not all requests are the same, and only the S3 server instance knows about the real load.

Imho it's all about the smartness of the health check. Maybe the minio/health/ready endpoint can be even smarter than counting its goroutines before responding with an HTTP 503 "I'm too busy, go away!" It may take the server's system load (RAM, CPU usage) or the saturation of its NICs into account.

This way a "naive" (let's better call it "simple and bulletproof") round robin over "ready" S3 targets should do the job quite well.
Together with smart health checks it becomes qualitative load balancing, not just load distribution.
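As a sketch of that idea: round robin restricted to targets whose last health probe reported ready. All names here are hypothetical, and readiness is assumed to be updated elsewhere by a background poller (e.g. against minio/health/ready):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// target pairs a backend address with its last observed readiness,
// as updated by a separate health-check poller.
type target struct {
	addr  string
	ready atomic.Bool
}

type roundRobin struct {
	targets []*target
	next    atomic.Uint64
}

// pick round-robins over the targets, skipping any that last
// reported not-ready. It returns nil if no target is ready.
func (rr *roundRobin) pick() *target {
	n := uint64(len(rr.targets))
	for i := uint64(0); i < n; i++ {
		t := rr.targets[rr.next.Add(1)%n]
		if t.ready.Load() {
			return t
		}
	}
	return nil
}

func main() {
	rr := &roundRobin{targets: []*target{
		{addr: "minio-1:9000"}, {addr: "minio-2:9000"}, {addr: "minio-3:9000"},
	}}
	for _, t := range rr.targets {
		t.ready.Store(true)
	}
	rr.targets[1].ready.Store(false) // pretend minio-2 answered its probe with 503
	fmt.Println(rr.pick().addr)      // next ready target in the rotation
}
```

The balancer itself stays dumb; all the "smartness" lives in whatever flips the ready flag, which is the division of labour argued for above.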

@klauspost
Author

@aweisser So you are saying that because it doesn't know anything more than up/down, we should stick to an algorithm that keeps piling requests onto an overloaded or subpar-performing server? That doesn't make sense to me.

The number of active requests is a perfectly valid balancing function. Instead of relying on collected metrics, which may or may not indicate load (you mention some, but they are no real indication of load), keeping track of active requests is completely passive, doesn't have to rely on any metrics, and also takes sidekick-to-minio network issues into account.

@aweisser

aweisser commented Feb 22, 2021

I'm sure you can find examples that speak for one approach or the other, because there is no single source of truth in a distributed system, and the number of requests is not the only metric that reflects "load".

Also, I just noticed that the /minio/health/ready probe currently doesn't count goroutines anyway.

I was confused by the following gist https://gist.github.com/nitisht/0c11d8c670f565b58d930b526ba0f2ed, which states that the readiness probe returns HTTP 503 if more than 500 goroutines are open.

Maybe you already had reasons at Minio not to do it this way, or to change the readiness probe to be equivalent to the liveness probe ("always return HTTP 200 as long as the service is running").

My opinion is still that a server-side "readiness" check is relevant for qualitative load balancing (in contrast to dumb load distribution). Surely a smarter client-side approach than round robin is also nice to have.

Imho the question should be: Is it worth it to break KISS?

@harshavardhana
Member

Fixed in #98 and released.
