
All queries fail with "all SubConns are in TransientFailure" after upgrade #882

Closed

tomwilkie opened this issue Jul 13, 2018 · 6 comments

@tomwilkie (Contributor)

level=error ts=2018-07-13T19:09:18.433330722Z caller=engine.go:494 msg="error selecting series set" err="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"

I suspect gRPC connections from querier -> ingester got upset after an update...

@tomwilkie (Contributor, Author) commented Jul 13, 2018

Deleting/recreating all the queriers caused it to recover, temporarily.

@tomwilkie (Contributor, Author)

Adding GRPC_GO_LOG_SEVERITY_LEVEL=INFO to the pod got me:

WARNING: 2018/07/13 19:46:48 grpc: addrConn.createTransport failed to connect to {10.52.10.50:9095 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.52.10.50:9095: connect: connection refused". Reconnecting...

Which corresponds with the ingester that was exiting as part of the rollout:

ingester-f7fcbc9fc-2dk4d 1/1 Terminating 0 4d 10.52.10.50

We should be able to tolerate a single down ingester for queries; it looks like there might be a bug there.

And this, people, is why you don't deploy on Fridays...
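(Side note, for anyone reproducing this: the same verbosity can also be enabled from code instead of via the environment variable. A minimal sketch using gRPC-Go's grpclog package; this is not what Cortex does, just an equivalent way to surface the same warnings.)

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Roughly equivalent to GRPC_GO_LOG_SEVERITY_LEVEL=INFO: send gRPC's
	// internal info/warning/error logs to stderr so warnings like the
	// addrConn.createTransport message above become visible.
	// Must run before any gRPC calls are made.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))
}

func main() { /* normal service startup */ }
```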

@tomwilkie (Contributor, Author)

We have tests to show we can tolerate a single dead ingester; I manually tested this by deleting an ingester in dev, and the queries worked fine.

I then rolled out a new ingester and reproduced it quite nicely in dev. All the IPs addrConn.createTransport reported were for the old ingesters, not the new ones. All the errors were "connection refused". The ordering matched the order in which the ingesters were updated, one by one. And I checked the queriers' view of the ring; it was consistent with reality.

So it looks like (a) the ingester shuts down its gRPC server too early, (b) the querier code to tolerate one ingester outage is broken, and (c) the unit tests are wrong.
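For (a), the shutdown ordering we want looks roughly like the sketch below: keep the gRPC server running until the ingester has left the ring and flushed, and only then stop it gracefully. This is an illustration of the ordering, not the actual fix; leaveRingAndFlush is a placeholder.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

// leaveRingAndFlush is a placeholder for the ingester's existing shutdown
// work: marking itself LEAVING in the ring and flushing in-memory chunks.
func leaveRingAndFlush() { log.Println("left ring, chunks flushed") }

func main() {
	lis, err := net.Listen("tcp", ":9095")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	go func() {
		if err := srv.Serve(lis); err != nil {
			log.Println("gRPC server stopped:", err)
		}
	}()

	// ... on SIGTERM ...

	// Keep the gRPC server running while we leave the ring and flush, so
	// queriers that still see this ingester in the ring can reach it.
	leaveRingAndFlush()

	// Only then stop the server. GracefulStop drains in-flight RPCs rather
	// than refusing new connections outright, which is what produced the
	// "connection refused" errors above.
	srv.GracefulStop()
}
```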

@tomwilkie (Contributor, Author) commented Jul 13, 2018

Starting to get to the bottom of this now:

  • since we changed the sharding, queries now read all ingesters
  • the ring doesn’t read from joining ingesters, consuming our error budget of one
  • therefore we’re left with the two healthy ingesters and the leaving ingester (for RF=3; arithmetic sketched below)
  • the leaving ingester has closed its gRPC server for some reason
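Rough sketch of the arithmetic (illustrative only, not the real ring code; the state names and errorBudget mirror the bullets above):

```go
package main

import "fmt"

func main() {
	const replicationFactor = 3
	// A query can tolerate a minority of its replica set failing.
	errorBudget := replicationFactor / 2 // = 1 for RF=3

	// Ingesters touched by a query during the rollout described above.
	ingesters := []string{"ACTIVE", "ACTIVE", "JOINING", "LEAVING (gRPC server closed)"}

	unavailable := 0
	for _, s := range ingesters {
		switch s {
		case "JOINING":
			// The ring does not read from JOINING ingesters, which
			// already consumes the error budget of one.
			unavailable++
		case "LEAVING (gRPC server closed)":
			// The LEAVING ingester's gRPC server has gone away, so the
			// call fails with "connection refused".
			unavailable++
		}
	}

	if unavailable > errorBudget {
		fmt.Printf("query fails: %d ingesters unavailable > error budget of %d\n",
			unavailable, errorBudget)
	}
}
```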

@tomwilkie (Contributor, Author)

(So the only real bug is (a).)

@tomwilkie (Contributor, Author)

Fix is in weaveworks/common#99, which was included in #870.
