
Execute health checks in ring client pool in parallel #237

Merged: chaudum merged 1 commit into main from chaudum/concurrent-healthchecks on Dec 12, 2022

Conversation

@chaudum (Contributor) commented on Dec 1, 2022:

What this PR does:

The ring client pool executes the health checks for its servers sequentially, which can lead to problems when there are many servers to check, especially when the targets respond slowly.

This PR changes the execution from sequential to parallel. If the new MaxConcurrentHealthChecks config setting is left at its zero value, health checks run with a default parallelism of 16; otherwise the configured value is used.

Signed-off-by: Christian Haudum christian.haudum@gmail.com

Which issue(s) this PR fixes:

Fixes #236

Checklist

  • Tests updated
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
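To make the described behaviour concrete, here is a minimal sketch of a bounded-parallel health-check loop built on dskit's concurrency.ForEachJob, where a zero MaxConcurrentHealthChecks falls back to a default of 16. The names checkFn, checkAll, remove and defaultMaxConcurrentHealthChecks are illustrative stand-ins for the real pool internals (client cache, removal, logging), not the exact code in this PR.

```go
package example

import (
	"context"
	"log"
	"time"

	"github.com/grafana/dskit/concurrency"
)

const defaultMaxConcurrentHealthChecks = 16

// checkFn stands in for the real per-address gRPC health check.
type checkFn func(ctx context.Context, addr string) error

// checkAll health-checks every address with at most maxConcurrent checks in
// flight, mirroring the parallel execution this PR introduces.
func checkAll(addresses []string, maxConcurrent int, timeout time.Duration, check checkFn, remove func(addr string)) {
	if maxConcurrent == 0 {
		// Zero means "not configured": fall back to the default parallelism.
		maxConcurrent = defaultMaxConcurrentHealthChecks
	}
	_ = concurrency.ForEachJob(context.Background(), len(addresses), maxConcurrent, func(ctx context.Context, idx int) error {
		addr := addresses[idx]

		// Each check bounds its own runtime with a child timeout, so the
		// parent context does not need a deadline of its own.
		ctx, cancel := context.WithTimeout(ctx, timeout)
		defer cancel()

		if err := check(ctx, addr); err != nil {
			log.Printf("removing %s: failing healthcheck: %v", addr, err)
			remove(addr)
		}
		// Never return an error: one failing check must not stop the others
		// (see the review discussion below).
		return nil
	})
}
```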

@chaudum chaudum force-pushed the chaudum/concurrent-healthchecks branch from 373e882 to 54c422e Compare December 1, 2022 13:12
@chaudum chaudum marked this pull request as ready for review December 1, 2022 13:14
ring/client/pool.go (outdated review thread, resolved)
@dannykopping (Contributor) left a comment:

LGTM, with a comment

I'll approve, but I think we need a maintainer's approval before merging.

	if p.cfg.MaxConcurrentHealthChecks == 0 {
		maxConcurrent = len(addresses)
	}
	_ = concurrency.ForEachJob(context.Background(), len(addresses), maxConcurrent, func(_ context.Context, idx int) error {
Contributor:

If we're executing these checks concurrently, I wonder if we need to cancel the checks if they overrun the interval at which cleanUnhealthy is invoked - otherwise they'll start to run on divergent intervals. Is that even an issue?

chaudum (PR author):

Good point. Now that the iteration function can return even when the requests have not finished, we need to cancel existing health checks when the iteration is invoked again.

Contributor:

Create a context with timeout instead of passing context.Background()?

Contributor:

Which timeout should we pass?

chaudum (PR author):

Maybe CheckInterval?

chaudum (PR author):

So, when I pass a cancellable context or a context with a timeout to the healthCheck() function and the context gets cancelled, it could be that not all health checks have been executed, which is OK. However, there is a race condition where the context is cancelled while health check requests are still in flight. In that case, the context cancellation would cause the request to fail and the client to be removed, even though it may have been healthy, right?

chaudum (PR author):

@bboreham @pracucci Since all child contexts have a timeout, do I really need to set a timeout on the parent context as well? The only case I can think of is when the health check timeout is greater than the check interval, which should never be the case; we could also add an assertion for that.

chaudum (PR author):

Forget what I said. The function concurrency.ForEachJob() waits for all health checks to finish 🤦‍♂️ We don't need to set a timeout on the parent context.

@bboreham (Contributor) left a comment:

Broadly OK; I find "client" in the description confusing; logically we are removing servers.
(The code talks about both; it is holding a client to some server address).

	if p.cfg.MaxConcurrentHealthChecks == 0 {
		maxConcurrent = len(addresses)
	}
	_ = concurrency.ForEachJob(context.Background(), len(addresses), maxConcurrent, func(_ context.Context, idx int) error {
Contributor:

Create a context with timeout instead of passing context.Background()?

ring/client/pool.go (4 outdated review threads, resolved)
@chaudum chaudum changed the title Execute health checks for ring clients in parallel Execute health checks in ring client pool in parallel Dec 2, 2022
@pracucci (Contributor) left a comment:

LGTM, thanks!

ring/client/pool_test.go (outdated review thread, resolved)
-func healthCheck(client PoolClient, timeout time.Duration) error {
-	ctx, cancel := context.WithTimeout(context.Background(), timeout)
+func healthCheck(ctx context.Context, client PoolClient, timeout time.Duration) error {
+	ctx, cancel := context.WithTimeout(ctx, timeout)
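For orientation, here is a self-contained sketch of what the updated healthCheck could look like as a whole, assuming PoolClient is (for this illustration) reduced to the standard gRPC health-checking client. The body is an approximation of the diff above, not a verbatim copy of the committed code.

```go
package example

import (
	"context"
	"fmt"
	"time"

	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
)

// PoolClient is trimmed here to the gRPC health client the check needs.
type PoolClient interface {
	grpc_health_v1.HealthClient
}

// healthCheck performs a single health check against client. The RPC is
// bounded by a timeout derived from the caller's context, so each check
// cleans up after itself even though the parent context has no deadline.
func healthCheck(ctx context.Context, client PoolClient, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	resp, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
	if err != nil {
		return err
	}
	if resp.Status != grpc_health_v1.HealthCheckResponse_SERVING {
		return fmt.Errorf("failing healthcheck status: %s", resp.Status)
	}
	return nil
}
```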
Contributor:

Is there now a possibility that the context passed in is cancelled elsewhere (e.g. by concurrency.ForEachJob), which will be treated as an error and a health-check failure?

chaudum (PR author):

The context cannot be cancelled elsewhere. But looking at the concurrency.ForEachJob function again, I realised that it would stop after the first error. This would prevent it from executing all health checks, and that's definitely not what we want.

chaudum (PR author):

@bboreham I implemented the concurrency directly in the cleanUnhealthy() function.

Contributor:

> The context cannot be cancelled elsewhere. But looking at the concurrency.ForEachJob function again, I realised that it would stop after the first error. This would prevent it from executing all health checks, and that's definitely not what we want.

Not sure reimplementing it from scratch (as done in the last commit) is a good approach. Why don't we keep using concurrency.ForEachJob() but never return an error from the function? That's how we use it in Mimir when we don't want to stop on the first error.

chaudum (PR author):

Reverted the last commit.
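As a small illustration of the pattern suggested above (keep concurrency.ForEachJob, but handle failures inside the job and always return nil), here is a standalone toy program. It is not pool code; the job count, worker count, and printouts are made up for the demonstration, and the "skipped jobs" behaviour is the one described in this review thread.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"github.com/grafana/dskit/concurrency"
)

func main() {
	// Propagating the error: ForEachJob reports it, and the jobs that have
	// not started yet are skipped.
	err := concurrency.ForEachJob(context.Background(), 4, 1, func(_ context.Context, idx int) error {
		if idx == 1 {
			return errors.New("boom")
		}
		fmt.Println("checked", idx)
		return nil
	})
	fmt.Println("with error propagation:", err)

	// Handling the failure inside the job and returning nil: every job runs.
	_ = concurrency.ForEachJob(context.Background(), 4, 1, func(_ context.Context, idx int) error {
		if idx == 1 {
			fmt.Println("check 1 failed, handled in place")
			return nil
		}
		fmt.Println("checked", idx)
		return nil
	})
}
```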

ring/client/pool.go (outdated review thread, resolved)
@dannykopping (Contributor) left a comment:

LGTM

@chaudum chaudum force-pushed the chaudum/concurrent-healthchecks branch from 1820c95 to 6251548 Compare December 7, 2022 08:34
@pracucci (Contributor) left a comment:

LGTM (modulo a nit)

if err != nil {
level.Warn(p.logger).Log("msg", fmt.Sprintf("removing %s failing healthcheck", p.clientName), "addr", addr, "reason", err)
p.RemoveClientFor(addr)
}
}
}
return nil
Contributor:

[nit] Add a comment to explain why we never return an error.

The ring client pool executes the health checks for its servers
sequentially, which can lead to problems when there are a lot of servers
to check, especially when the targets do not respond fast enough.

This PR changes the execution from sequential to parallel. If the new
`MaxConcurrentHealthChecks` config setting is not set (`0` value), then
health checks are executed with a parallelism of `16`, otherwise the
parallelism from the setting is used.

Fixes #236

Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
@chaudum chaudum force-pushed the chaudum/concurrent-healthchecks branch from 6251548 to c620fe8 Compare December 12, 2022 11:56
@chaudum chaudum enabled auto-merge (squash) December 12, 2022 11:59
@chaudum chaudum merged commit 3e308a4 into main Dec 12, 2022
@chaudum chaudum deleted the chaudum/concurrent-healthchecks branch December 12, 2022 12:03
charleskorn pushed a commit that referenced this pull request Aug 3, 2023
Add user and org labels to observed exemplars
Successfully merging this pull request may close the linked issue "Ring client pool should not execute health checks sequentially" (#236).

4 participants