Ruler, Querier, AM: Increase frequency of gRPC healthchecks #3168
Conversation
The querier uses the store-gateway ring to find which store-gateways should be queried. If a gRPC connection doesn't exist in the pool, the querier will try to create one (all of that happens in this function). I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.
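A minimal sketch (my own illustration, not the actual Mimir code) of the mechanism described above: the querier asks the client pool for a client keyed by store-gateway address, and the pool's factory dials a new gRPC connection only if one isn't cached yet. The `queryStoreGateway` helper and the way the pool is built are hypothetical.

```go
package querysketch

import (
	"fmt"

	"github.com/grafana/dskit/ring/client"
)

// queryStoreGateway is a hypothetical helper showing the pool interaction:
// GetClientFor returns a cached client for addr, or asks the pool's factory
// to dial a new gRPC connection if none exists yet.
func queryStoreGateway(pool *client.Pool, addr string) error {
	c, err := pool.GetClientFor(addr)
	if err != nil {
		return fmt.Errorf("failed to get client for store-gateway %s: %w", addr, err)
	}
	_ = c // the concrete store-gateway client would be used here to issue the query RPCs
	return nil
}
```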
Leaving technical approval to an engineer. Approved from a docs perspective to get this off my review list. :)
Yeah, that's a good point. It will only be forced to re-establish a new connection. New connection dialing is not bounded by a timeout, unfortunately. We can use `mimir/pkg/querier/store_gateway_client.go` line 45 (at da636a6). WDYT @pstibrany?
This sounds like a good idea, but it will require some changes to the client pool. We already get a similar benefit from using a time-bound context for individual calls (if the connection doesn't exist yet, it is created during the first call to a client method).
The client factory is within this package, so there's no need to change dskit if that's what you meant. I pushed a change in 9847f9d
This sounds like we sacrifice the redundancy of store-gateways, so we added the …
This combination doesn't have the desired effect. From DialContext documentation:
We need to configure a timeout on the context that will be used to create the connection, which for non-blocking clients is the first call to a client method.
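A self-contained sketch of the point above, under the assumption of a plain non-blocking grpc-go client (the address and the 5s timeout are made up): because the connection is established lazily, the deadline on the per-call context also bounds the time spent dialing.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Non-blocking dial: returns immediately, the TCP handshake happens lazily.
	conn, err := grpc.Dial("store-gateway-1:9095",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The deadline on the per-call context therefore also bounds connection
	// creation: if dialing plus the RPC take longer than 5s, the call fails
	// with context.DeadlineExceeded.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	_, err = grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
	if err != nil {
		log.Printf("call failed: %v", err)
	}
}
```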
I couldn't find any other way to bound the connection time with the grpc libs. We can override the behaviour on … I'm trying to think about what we're trying to solve with these dial timeouts. I went back to your previous comment, @pstibrany:
It will fail requests to the store-gateways faster, which means that they will be retried faster - the querier excludes store-gateways it has already tried to query from retries. So instead of waiting for 30s and then trying again, it will wait for 5s and then try again; 3 tries take a total of 15s, not 1m30s (on average). If all of this is correct, then why do we need to bound the dial time? The client pool closes and removes a connection after it exceeds its healthcheck timeout. It will do the same on a yet-uninitialized connection. So this will also break any outstanding calls using the connection (like a hung …).
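To make the "close and remove on failed healthcheck" behaviour concrete, here is a rough paraphrase of what such a pool loop conceptually does. This is my own simplification, not the dskit implementation; the `checkLoop` function and `pooledConns` map are illustrative.

```go
package poolsketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"
)

// checkLoop health-checks every pooled connection on each tick and
// closes/removes any that fail within healthCheckTimeout. Closing the
// *grpc.ClientConn also aborts in-flight RPCs using it, and a connection
// that never finished dialing fails the check just the same.
func checkLoop(pooledConns map[string]*grpc.ClientConn, checkInterval, healthCheckTimeout time.Duration) {
	for range time.Tick(checkInterval) {
		for addr, conn := range pooledConns {
			ctx, cancel := context.WithTimeout(context.Background(), healthCheckTimeout)
			_, err := grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
			cancel()
			if err != nil {
				conn.Close() // breaks any outstanding calls on this connection
				delete(pooledConns, addr)
			}
		}
	}
}
```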
@pstibrany @pracucci can you take another look please?
Doc-related content is fine.
Force-pushed from 58c88e8 to 01c13b0
@pstibrany, @pracucci can you take another look? I changed the PR to only have the gRPC client timeouts, as we spoke IRL. We also spoke about reducing the heartbeat timeout in the ring; I will first try this out internally before changing the defaults.
Thank you!
LGTM, but can you do it in the alertmanager too (pkg/alertmanager/alertmanager_client.go), please?
Force-pushed from 01c13b0 to 4c0db15
Force-pushed from 4c0db15 to 1521d01
Done, also added a changelog entry
This increases the frequency of gRPC client healthchecks. These healthchecks run asynchronously from gRPC requests. If the healthchecks fail, all requests using this connection are interrupted.

This will help in cases when the store-gateways become unresponsive for prolonged periods of time. The querier will more quickly remove the connection to that store-gateway and retry on another store-gateway.

Does the same for the ruler-ruler and alertmanager-alertmanager clients for good measure.
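For context, a hedged sketch of the knobs involved: the client pool exposes a check interval and a healthcheck timeout, and the effect of this change is that checks run more often and give up sooner. The field names follow dskit's `ring/client.PoolConfig` as I understand it; the concrete values below are illustrative, not the defaults chosen in this PR.

```go
package poolsketch

import (
	"time"

	"github.com/grafana/dskit/ring/client"
)

// Illustrative values only - not the defaults changed by this PR.
var poolCfg = client.PoolConfig{
	CheckInterval:      10 * time.Second, // how often pooled connections are health-checked
	HealthCheckEnabled: true,
	HealthCheckTimeout: 5 * time.Second, // checks slower than this mark the connection unhealthy
}
```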
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]