
All queries fail with "all SubConns are in TransientFailure" after upgrade #882

Closed

tomwilkie opened this issue Jul 13, 2018 · 6 comments

@tomwilkie (Contributor)

level=error ts=2018-07-13T19:09:18.433330722Z caller=engine.go:494 msg="error selecting series set" err="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"

I suspect gRPC connections from querier -> ingester got upset after an update...

@tomwilkie (Contributor, Author) commented Jul 13, 2018

Deleting/recreating all the queriers caused it to recover, temporarily.

@tomwilkie (Contributor, Author)

Adding GRPC_GO_LOG_SEVERITY_LEVEL=INFO to the pod got me:

WARNING: 2018/07/13 19:46:48 grpc: addrConn.createTransport failed to connect to {10.52.10.50:9095 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.52.10.50:9095: connect: connection refused". Reconnecting...

Which corresponds with the ingester that was exiting as part of the rollout:

ingester-f7fcbc9fc-2dk4d 1/1 Terminating 0 4d 10.52.10.50

We should be able to tolerate a single down ingester for queries; it looks like there might be a bug there.

And this, people, is why you don't deploy on Fridays...
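(Side note, for anyone reproducing this: the same verbosity can also be enabled from code instead of via the environment variable. A minimal sketch using gRPC-Go's grpclog package; this is not what Cortex does, just an equivalent way to surface the same warnings.)

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Roughly equivalent to GRPC_GO_LOG_SEVERITY_LEVEL=INFO: send gRPC's
	// internal info/warning/error logs to stderr so warnings like the
	// addrConn.createTransport message above become visible.
	// Must run before any gRPC calls are made.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))
}

func main() { /* normal service startup */ }
```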

@tomwilkie (Contributor, Author)

We have tests to show we can tolerate a single dead ingester; I manually tested this by deleting an ingester in dev, and the queries worked fine.

I then rolled out a new ingester and reproduced it quite nicely in dev. All the IPs addrConn.createTransport reported were for the old ingesters, not the new ones. All the errors were "connection refused". The ordering matched the order in which the ingesters were updated, one by one. And I checked the queriers' view of the ring; it was consistent with reality.

So it looks like (a) the ingester shuts down its gRPC server too early, (b) the querier code to tolerate one ingester outage is broken, and (c) the unit tests are wrong.
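For (a), the shutdown ordering we want looks roughly like the sketch below: keep the gRPC server running until the ingester has left the ring and flushed, and only then stop it gracefully. This is an illustration of the ordering, not the actual fix; leaveRingAndFlush is a placeholder.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

// leaveRingAndFlush is a placeholder for the ingester's existing shutdown
// work: marking itself LEAVING in the ring and flushing in-memory chunks.
func leaveRingAndFlush() { log.Println("left ring, chunks flushed") }

func main() {
	lis, err := net.Listen("tcp", ":9095")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	go func() {
		if err := srv.Serve(lis); err != nil {
			log.Println("gRPC server stopped:", err)
		}
	}()

	// ... on SIGTERM ...

	// Keep the gRPC server running while we leave the ring and flush, so
	// queriers that still see this ingester in the ring can reach it.
	leaveRingAndFlush()

	// Only then stop the server. GracefulStop drains in-flight RPCs rather
	// than refusing new connections outright, which is what produced the
	// "connection refused" errors above.
	srv.GracefulStop()
}
```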

@tomwilkie (Contributor, Author) commented Jul 13, 2018

Starting to get to the bottom of this now:

  • since we changed the sharding, queries now read all ingesters
  • the ring doesn’t read from joining ingesters, consuming our error budget of one
  • therefore we’re left with the two healthy ingesters and the leaving ingester (for RF=3; arithmetic sketched below)
  • the leaving ingester has closed its gRPC server for some reason
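Rough sketch of the arithmetic (illustrative only, not the real ring code; the state names and errorBudget mirror the bullets above):

```go
package main

import "fmt"

func main() {
	const replicationFactor = 3
	// A query can tolerate a minority of its replica set failing.
	errorBudget := replicationFactor / 2 // = 1 for RF=3

	// Ingesters touched by a query during the rollout described above.
	ingesters := []string{"ACTIVE", "ACTIVE", "JOINING", "LEAVING (gRPC server closed)"}

	unavailable := 0
	for _, s := range ingesters {
		switch s {
		case "JOINING":
			// The ring does not read from JOINING ingesters, which
			// already consumes the error budget of one.
			unavailable++
		case "LEAVING (gRPC server closed)":
			// The LEAVING ingester's gRPC server has gone away, so the
			// call fails with "connection refused".
			unavailable++
		}
	}

	if unavailable > errorBudget {
		fmt.Printf("query fails: %d ingesters unavailable > error budget of %d\n",
			unavailable, errorBudget)
	}
}
```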

@tomwilkie (Contributor, Author)

(So the only real bug is (a).)

@tomwilkie (Contributor, Author)

Fix is in weaveworks/common#99, which was included in #870.
