
Ruler, Querier, AM: Increase frequency of gRPC healthchecks #3168

Merged

Conversation

@dimitarvdimitrov (Contributor) commented Oct 7, 2022

This increases the frequency of gRPC client healthchecks. These
healthchecks run asynchronously from gRPC requests. If the healthchecks
fail, all requests using this connection are interrupted.

This will help in cases when a store-gateway becomes unresponsive for a prolonged period of time. The querier will more quickly remove the connection to that store-gateway and retry on another store-gateway.

Does the same for the ruler-ruler and alertmanager-alertmanager clients
for good measure.
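For context, the pool healthcheck is conceptually a periodic gRPC health-check RPC per connection, and closing the connection is what interrupts in-flight requests. A rough illustrative sketch (not the actual dskit/Mimir code; `checkInterval` and `healthCheckTimeout` are placeholder names):

```go
package pool

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"
)

// checkLoop sketches an asynchronous healthcheck on a pooled connection.
func checkLoop(conn *grpc.ClientConn, checkInterval, healthCheckTimeout time.Duration) {
	health := grpc_health_v1.NewHealthClient(conn)
	ticker := time.NewTicker(checkInterval)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), healthCheckTimeout)
		resp, err := health.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
		cancel()

		if err != nil || resp.GetStatus() != grpc_health_v1.HealthCheckResponse_SERVING {
			// Closing the ClientConn aborts every in-flight RPC using it,
			// so requests on a bad connection fail fast instead of hanging.
			_ = conn.Close()
			return
		}
	}
}
```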

Checklist

  • NA Tests updated
  • NA Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany (Member) commented Oct 10, 2022

> The querier will more quickly remove the connection to that store-gateway. So fewer new queries will go to it and existing queries will fail faster.

The querier uses the store-gateway ring to find which store-gateways should be queried. If a gRPC connection doesn't exist in the pool, the querier will try to create one (all of that happens in this function). I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.
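For illustration, the pool lookup described here is a get-or-dial pattern along these lines (a simplified sketch, not the actual dskit GetClientFor implementation):

```go
import (
	"sync"

	"google.golang.org/grpc"
)

type connPool struct {
	mtx   sync.Mutex
	conns map[string]*grpc.ClientConn
	opts  []grpc.DialOption
}

// getClientFor returns the pooled connection for addr, dialing a new one if needed.
func (p *connPool) getClientFor(addr string) (*grpc.ClientConn, error) {
	p.mtx.Lock()
	defer p.mtx.Unlock()

	if conn, ok := p.conns[addr]; ok {
		return conn, nil
	}

	// Non-blocking dial: this returns before the connection is actually ready.
	conn, err := grpc.Dial(addr, p.opts...)
	if err != nil {
		return nil, err
	}
	p.conns[addr] = conn
	return conn, nil
}
```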

@osg-grafana (Contributor) commented Oct 10, 2022

Leaving technical approval to an engineer. Approved from a docs perspective to get this off my review list. :)

@dimitarvdimitrov (Contributor, Author)

> The querier will more quickly remove the connection to that store-gateway. So fewer new queries will go to it and existing queries will fail faster.
>
> The querier uses the store-gateway ring to find which store-gateways should be queried. If a gRPC connection doesn't exist in the pool, the querier will try to create one (all of that happens in this function). I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.

Yeah, that's a good point. It will only be forced to re-establish the connection. Unfortunately, new connection dialing is not bounded by a timeout. We can use DialContext with a context with an expiration and also use WithBlock() here:

```go
conn, err := grpc.Dial(addr, opts...)
```
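For illustration, the bounded, blocking dial could look roughly like this (a sketch; `dialTimeout` is a placeholder, not the exact change in the PR):

```go
ctx, cancel := context.WithTimeout(context.Background(), dialTimeout)
defer cancel()

// WithBlock makes DialContext wait until the connection is ready (or the
// context expires), so the dial itself is bounded by dialTimeout.
conn, err := grpc.DialContext(ctx, addr, append(opts, grpc.WithBlock())...)
```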

WDYT @pstibrany ?

@pstibrany (Member)

> We can use DialContext with a context with an expiration and also use WithBlock() here:

This sounds like a good idea, but will require some changes to the client pool. We already get a similar benefit from using a time-bound context for individual calls (if the connection doesn't exist yet, it is created first during the call to e.g. Series or other methods on the StoreGatewayClient interface).
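In other words, a time-bound call context already covers connection setup for a non-blocking client; roughly (a sketch; `queryTimeout` and the request are illustrative):

```go
ctx, cancel := context.WithTimeout(ctx, queryTimeout)
defer cancel()

// With a non-blocking dial, the connection is established lazily: the first
// RPC (e.g. Series) triggers it, so this deadline also bounds connection setup.
stream, err := client.Series(ctx, req)
```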

@dimitarvdimitrov (Contributor, Author)

> This sounds like a good idea, but will require some changes to the client pool.

The client factory is within this package, so there's no need to change dskit if that's what you meant. I pushed a change in 9847f9d

@pstibrany (Member)

Previously the blocksStoreReplicationSet.GetClientsFor function would return fast, returning a map of clients to ULIDs. But not all store-gateways will necessarily be called by the client of this function -- store-gateways are queried randomly, so a user may be lucky and not hit the "bad" store-gateway.

Now blocksStoreReplicationSet.GetClientsFor will try to establish connections to store-gateways first. By doing that, all users will suffer if a connection to a store-gateway cannot be established in time, while previously some users might have been lucky and still get a response.

@dimitarvdimitrov (Contributor, Author)

> Previously the blocksStoreReplicationSet.GetClientsFor function would return fast, returning a map of clients to ULIDs. But not all store-gateways will necessarily be called by the client of this function -- store-gateways are queried randomly, so a user may be lucky and not hit the "bad" store-gateway.
>
> Now blocksStoreReplicationSet.GetClientsFor will try to establish connections to store-gateways first. By doing that, all users will suffer if a connection to a store-gateway cannot be established in time, while previously some users might have been lucky and still get a response.

This sounds like we sacrifice the redundancy of store-gateways.

So we added WithBlock and DialContext with a timeout context so that we don't try to re-establish new connections all the time after the health check fails. How about we continue using DialContext but remove WithBlock? This way we will effectively be retrying to establish a connection to a failed store-gateway every $timeout seconds¹ until the store-gateway also fails its ring healthcheck.

Footnotes

  1. Assuming a continuous flow of queries

@pstibrany (Member)

> How about we continue using DialContext but remove WithBlock?

This combination doesn't have the desired effect. From DialContext documentation:

```go
// In the non-blocking case, the ctx does not act against the connection. It
// only controls the setup steps.
```

We need to configure the timeout on the context that will be used to create the connection, which for non-blocking clients is the first call to client methods.

@dimitarvdimitrov (Contributor, Author) commented Oct 11, 2022

I couldn't find any other way to bound the connection time with the grpc libs.

We can override the behaviour on storeGatewayClient to take care of dialing with block and with context on the first call to the remote instead of in dialStoreGatewayClient.
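A rough idea of what that could look like (a hypothetical wrapper, names invented for illustration; not code from this PR):

```go
import (
	"context"
	"sync"
	"time"

	"google.golang.org/grpc"
)

// lazyConn dials with WithBlock and a timeout on first use instead of at
// client construction time.
type lazyConn struct {
	addr string
	opts []grpc.DialOption

	once sync.Once
	conn *grpc.ClientConn
	err  error
}

func (l *lazyConn) get(dialTimeout time.Duration) (*grpc.ClientConn, error) {
	l.once.Do(func() {
		ctx, cancel := context.WithTimeout(context.Background(), dialTimeout)
		defer cancel()
		l.conn, l.err = grpc.DialContext(ctx, l.addr, append(l.opts, grpc.WithBlock())...)
	})
	return l.conn, l.err
}
```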


I'm trying to think about what we're trying to solve with these dial timeouts. I went back to your previous comment @pstibrany

> I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.

It will fail requests to the store-gateways faster, which means they will be retried faster - the querier excludes store-gateways it has already tried from retries. So instead of waiting 30s and then trying again, it will wait 5s and then try again. Three tries then take a total of 15s, not 1m30s (on average).

If all of this is correct, then why do we need to bound the dial time? The client pool closes and removes a connection after it exceeds its healthcheck timeout. It will do the same for a yet-uninitialized connection. So this will also break any outstanding calls using the connection (like a hung Series call).

@dimitarvdimitrov (Contributor, Author)

@pstibrany @pracucci can you take another look please?

@osg-grafana osg-grafana requested review from a team and removed request for osg-grafana November 16, 2022 14:46
@osg-grafana (Contributor) left a comment

Doc-related content is fine.

@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/increase-frequency-grpc-client-pool-helathcheck branch from 58c88e8 to 01c13b0 Compare November 21, 2022 09:19
@dimitarvdimitrov (Contributor, Author)

@pstibrany, @pracucci can you take another look? I changed the PR to only include the gRPC client timeouts, as we discussed IRL.

We also spoke about reducing the heartbeat timeout in the ring. I will try this out internally first before changing the defaults.

@pstibrany (Member) left a comment

Thank you!

@pracucci (Collaborator) left a comment

LGTM, but can you do it in the alertmanager too (pkg/alertmanager/alertmanager_client.go), please?

@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/increase-frequency-grpc-client-pool-helathcheck branch from 01c13b0 to 4c0db15 Compare November 21, 2022 11:11
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/increase-frequency-grpc-client-pool-helathcheck branch from 4c0db15 to 1521d01 Compare November 21, 2022 11:15
@dimitarvdimitrov (Contributor, Author)

Done, also added a changelog entry

@pracucci pracucci enabled auto-merge (squash) November 21, 2022 11:16
@dimitarvdimitrov dimitarvdimitrov changed the title Ruler, Querier: Increase frequency of gRPC healthchecks Ruler, Querier, AM: Increase frequency of gRPC healthchecks Nov 21, 2022
@pracucci pracucci merged commit e4c7d3c into main Nov 21, 2022
@pracucci pracucci deleted the dimitar/increase-frequency-grpc-client-pool-helathcheck branch November 21, 2022 11:29
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022