All queries fail with "all SubConns are in TransientFailure" after upgrade #882
Deleting/recreating all the queriers caused it to recover, temporarily.
Adding
Which corresponds with the ingester that was exiting as part of the rollout:
We should be able to tolerate a single down ingester for queries; it looks like there might be a bug there. And this, people, is why you don't deploy on Fridays...
We have tests to show we can tolerate a single dead ingester; I manually tested this by deleting an ingester in dev, and the queries worked fine. I then rolled out a new ingester, and reproduced it quite nicely in dev. All the IPs
So it looks like (a) the ingester shuts down its gRPC server too early, (b) the querier code to tolerate one ingester outage is broken, and (c) the unit tests are wrong.
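For context on (b), here is a minimal sketch of the kind of read-path tolerance being described, assuming a replication factor of 3; the client, request, and response types are placeholders, not the actual Cortex querier code:

```go
// IngesterClient is a stand-in for the real per-ingester gRPC client.
type IngesterClient interface {
	Query(ctx context.Context, req *QueryRequest) (*QueryResponse, error)
}

// queryReplicationSet illustrates tolerating a single dead ingester:
// query every ingester in the replication set and only fail the request
// if more than one of them errors.
func queryReplicationSet(ctx context.Context, clients []IngesterClient, req *QueryRequest) ([]*QueryResponse, error) {
	var (
		responses []*QueryResponse
		errs      []error
	)
	for _, c := range clients {
		resp, err := c.Query(ctx, req)
		if err != nil {
			errs = append(errs, err)
			continue
		}
		responses = append(responses, resp)
	}
	// With replication factor 3, one failure is acceptable; two or more
	// mean we can no longer guarantee a complete result.
	if len(errs) > 1 {
		return nil, fmt.Errorf("too many failed ingesters: %v", errs)
	}
	return responses, nil
}
```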
Starting to get to the bottom of this now
(so the only real bug is a.)
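To illustrate what (a) implies for shutdown ordering (the actual fix is in the PR below), a rough sketch with hypothetical field and method names, not the real ingester API:

```go
// Hypothetical shutdown sequence: leave the ring first, give queriers
// time to notice, and only stop the gRPC server once no more queries
// should be routed here.
func (i *Ingester) shutdown(ctx context.Context) error {
	// 1. Deregister from the ring so distributors/queriers stop picking us.
	if err := i.ring.Unregister(ctx); err != nil {
		return err
	}
	// 2. Wait for other components to refresh their view of the ring.
	time.Sleep(i.cfg.LeaveGracePeriod)

	// 3. Flush in-memory data while the server is still up.
	i.flushChunks(ctx)

	// 4. Only now drain and stop the gRPC server; stopping it earlier is
	//    what leaves queriers seeing "all SubConns are in TransientFailure".
	i.grpcServer.GracefulStop()
	return nil
}
```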
Fixes in weaveworks/common#99, which was included in #870
level=error ts=2018-07-13T19:09:18.433330722Z caller=engine.go:494 msg="error selecting series set" err="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
I suspect gRPC connections from querier -> ingester got upset after an update...
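For what it's worth, "all SubConns are in TransientFailure" surfaces to the caller as codes.Unavailable. A rough sketch of how the querier side could recognise (and optionally wait out) that state; the dial options here are illustrative assumptions, not the actual querier configuration:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// dialIngester is an illustrative sketch, not the real querier code.
func dialIngester(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(
		addr,
		grpc.WithInsecure(),
		// WaitForReady makes RPCs block until the connection leaves
		// TransientFailure instead of failing fast with Unavailable.
		grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
	)
}

// isTransientFailure reports whether an RPC error is the fail-fast
// "all SubConns are in TransientFailure" case.
func isTransientFailure(err error) bool {
	return status.Code(err) == codes.Unavailable
}
```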