Execute health checks in ring client pool in parallel #237
Conversation
Force-pushed from 373e882 to 54c422e
LGTM, with a comment
I'll approve, but I think we need a maintainer's approval before merging.
ring/client/pool.go
Outdated
if p.cfg.MaxConcurrentHealthChecks == 0 {
	maxConcurrent = len(addresses)
}
_ = concurrency.ForEachJob(context.Background(), len(addresses), maxConcurrent, func(_ context.Context, idx int) error {
If we're executing these checks concurrently, I wonder if we need to cancel the checks if they overrun the interval at which cleanUnhealthy
is invoked - otherwise they'll start to run on divergent intervals. Is that even an issue?
Good point. Now that the iteration function finishes even when the requests are not finished, we need to cancel existing health checks when the iteration is invoked again.
Create a context with timeout instead of passing context.Background()?
Which timeout should we pass?
Maybe CheckInterval?
So, when I pass a cancellable context or a context with timeout to the healthCheck() function and the context gets cancelled, it could be that not all health checks have been executed - which is ok. However, there is a race condition where the context is cancelled while health check requests are still in flight. In that case, the context cancellation would cause the request to fail and the client to be removed, even though it may have been healthy, right?
@bboreham @pracucci Since all child contexts have a timeout, do I really need to set a timeout on the parent context as well? The only case I could think of is when the health check timeout is greater than the check interval - which should never be the case and we could also add an assertion for that.
Forget what I said. The function concurrency.ForEachJob() waits for all health checks to finish 🤦♂️ We don't need to set a timeout on the parent context.
Broadly OK; I find "client" in the description confusing, since logically we are removing servers. (The code talks about both; it holds a client for some server address.)
ring/client/pool.go
Outdated
if p.cfg.MaxConcurrentHealthChecks == 0 {
	maxConcurrent = len(addresses)
}
_ = concurrency.ForEachJob(context.Background(), len(addresses), maxConcurrent, func(_ context.Context, idx int) error {
Create a context with timeout instead of passing context.Background()?
LGTM, thanks!
-func healthCheck(client PoolClient, timeout time.Duration) error {
-	ctx, cancel := context.WithTimeout(context.Background(), timeout)
+func healthCheck(ctx context.Context, client PoolClient, timeout time.Duration) error {
+	ctx, cancel := context.WithTimeout(ctx, timeout)
Is there now a possibility that the context passed in is cancelled elsewhere (e.g. by concurrency.ForEachJob), which will be treated as an error and a health-check failure?
The context could not be cancelled elsewhere. But when looking at the concurrency.ForEachJob function again, I realised that it would stop after the first error. This would prevent it from executing all health checks, and that's definitely not what we want.
@bboreham I implemented the concurrency directly in the cleanUnhealthy() function.
The context could not be cancelled elsewhere. But when looking at the concurrency.ForEachJob function again, I realised, that it would stop after the first error. This would prevent it from executing all health checks, and that's definitely not what we want.
Not sure reimplementing it from scratch (as done in the last commit) is a good approach. Why don't we keep using concurrency.ForEachJob() but never return an error from the function? That's how we use it in Mimir when we don't want to stop on the first error.
Reverted the last commit.
LGTM
Force-pushed from 1820c95 to 6251548
LGTM (modulo a nit)
		if err != nil {
			level.Warn(p.logger).Log("msg", fmt.Sprintf("removing %s failing healthcheck", p.clientName), "addr", addr, "reason", err)
			p.RemoveClientFor(addr)
		}
	}
}
return nil
[nit] Add a comment to explain why we never return an error.
Force-pushed from 6251548 to c620fe8
What this PR does:
The ring client pool executes the health checks for its servers sequentially, which can lead to problems when there are a lot of servers to check, especially when the targets do not respond fast enough.
This PR changes the execution from sequential to parallel. If the new MaxConcurrentHealthChecks config setting is not set (0 value), then health checks are executed with a parallelism of 16; otherwise the parallelism from the setting is used.
Signed-off-by: Christian Haudum christian.haudum@gmail.com
Which issue(s) this PR fixes:
Fixes #236
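The defaulting described above can be sketched as follows (the function name is hypothetical; note the earlier, outdated diff snippet defaulted to len(addresses) rather than 16):

```go
package main

import "fmt"

// Config mirrors the new setting described in the PR.
type Config struct {
	MaxConcurrentHealthChecks int
}

// effectiveConcurrency returns the parallelism to use: the configured
// value, or 16 when the setting is unset (zero value).
func effectiveConcurrency(cfg Config) int {
	if cfg.MaxConcurrentHealthChecks == 0 {
		return 16 // default parallelism per the PR description
	}
	return cfg.MaxConcurrentHealthChecks
}

func main() {
	fmt.Println(effectiveConcurrency(Config{}))                             // 16
	fmt.Println(effectiveConcurrency(Config{MaxConcurrentHealthChecks: 8})) // 8
}
```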
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]