Execute health checks in ring client pool in parallel #237
Conversation
Force-pushed from 373e882 to 54c422e
LGTM, with a comment
I'll approve, but I think we need a maintainer's approval before merging.
ring/client/pool.go
Outdated
if p.cfg.MaxConcurrentHealthChecks == 0 {
	maxConcurrent = len(addresses)
}
_ = concurrency.ForEachJob(context.Background(), len(addresses), maxConcurrent, func(_ context.Context, idx int) error {
If we're executing these checks concurrently, I wonder if we need to cancel the checks if they overrun the interval at which cleanUnhealthy
is invoked - otherwise they'll start to run on divergent intervals. Is that even an issue?
Good point. Now that the iteration function finishes even when the requests are not finished, we need to cancel existing health checks when the iteration is invoked again.
Create a context with timeout instead of passing context.Background()?
Which timeout should we pass?
Maybe CheckInterval?
So, when I pass a cancellable context or a context with timeout to the healthCheck() function and the context gets cancelled, it could be that not all health checks have been executed - which is ok. However, there is a race condition where the context is cancelled while health check requests are still in flight. In that case, the context cancellation would cause the request to fail and the client to be removed, even though it may have been healthy, right?
@bboreham @pracucci Since all child contexts have a timeout, do I really need to set a timeout on the parent context as well? The only case I could think of is when the health check timeout is greater than the check interval - which should never be the case and we could also add an assertion for that.
Forget what I said. The function concurrency.ForEachJob() waits for all health checks to finish 🤦♂️ We don't need to set a timeout on the parent context.
Broadly OK; I find "client" in the description confusing, since logically we are removing servers. (The code talks about both; it holds a client for some server address.)
ring/client/pool.go
Outdated
if p.cfg.MaxConcurrentHealthChecks == 0 {
	maxConcurrent = len(addresses)
}
_ = concurrency.ForEachJob(context.Background(), len(addresses), maxConcurrent, func(_ context.Context, idx int) error {
Create a context with timeout instead of passing context.Background()?
LGTM, thanks!
-func healthCheck(client PoolClient, timeout time.Duration) error {
-	ctx, cancel := context.WithTimeout(context.Background(), timeout)
+func healthCheck(ctx context.Context, client PoolClient, timeout time.Duration) error {
+	ctx, cancel := context.WithTimeout(ctx, timeout)
Is there now a possibility that the context passed in is cancelled elsewhere (e.g. by concurrency.ForEachJob), which will be treated as an error and a health-check failure?
The context could not be cancelled elsewhere. But when looking at the concurrency.ForEachJob function again, I realised that it would stop after the first error. This would prevent it from executing all health checks, and that's definitely not what we want.
@bboreham I implemented the concurrency directly in the cleanUnhealthy() function.
The context could not be cancelled elsewhere. But when looking at the concurrency.ForEachJob function again, I realised, that it would stop after the first error. This would prevent it from executing all health checks, and that's definitely not what we want.
Not sure reimplementing it from scratch (as done in the last commit) is a good approach. Why don't we keep using concurrency.ForEachJob() but never return an error from the function? That's how we use it in Mimir when we don't want to stop on the first error.
Reverted the last commit.
LGTM
Force-pushed from 1820c95 to 6251548
LGTM (modulo a nit)
		if err != nil {
			level.Warn(p.logger).Log("msg", fmt.Sprintf("removing %s failing healthcheck", p.clientName), "addr", addr, "reason", err)
			p.RemoveClientFor(addr)
		}
	}
}
return nil
[nit] Add a comment to explain why we never return an error.
Force-pushed from 6251548 to c620fe8
What this PR does:
The ring client pool executes the health checks for its servers sequentially, which can lead to problems when there are a lot of servers to check, especially when the targets do not respond fast enough.
This PR changes the execution from sequential to parallel. If the new MaxConcurrentHealthChecks config setting is not set (0 value), then health checks are executed with a parallelism of 16; otherwise the parallelism from the setting is used.
Signed-off-by: Christian Haudum christian.haudum@gmail.com
Which issue(s) this PR fixes:
Fixes #236
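The defaulting described above can be sketched as follows (the function name is hypothetical; note the earlier, outdated diff snippet defaulted to len(addresses) rather than 16):

```go
package main

import "fmt"

// Config mirrors the new setting described in the PR.
type Config struct {
	MaxConcurrentHealthChecks int
}

// effectiveConcurrency returns the parallelism to use: the configured
// value, or 16 when the setting is unset (zero value).
func effectiveConcurrency(cfg Config) int {
	if cfg.MaxConcurrentHealthChecks == 0 {
		return 16 // default parallelism per the PR description
	}
	return cfg.MaxConcurrentHealthChecks
}

func main() {
	fmt.Println(effectiveConcurrency(Config{}))                             // 16
	fmt.Println(effectiveConcurrency(Config{MaxConcurrentHealthChecks: 8})) // 8
}
```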
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]