WithReturnConnectionError does not report errors related to health check #5831

atollena · 2022-12-01T17:00:44Z

What version of gRPC are you using?

v1.32.0, v1.49. master is affected.

What version of Go are you using (`go version`)?

go 1.19.1

What operating system (Linux, Windows, …) and version?

Linux, MacOS

What did you do?

The client is configured as follow:
- with health checks enabled via Service Config
- with the round robin balancer via Service Config
- with the WithReturnConnectionError() option enabled
- the client calls DialContext with a reasonable timeout context (e.g. 10 seconds)
The server is configured with a custom health checker that fails health checks

What did you expect to see?

DialContext report a more precise error such as:

did not connect: context deadline exceeded: connection active but health check failed. status=NOT_SERVING

What did you see instead?

DialContext reports a generic timeout error:

 did not connect: context deadline exceeded

This behavior makes troubleshooting problems with health checks more difficult than it needs to be.

The text was updated successfully, but these errors were encountered:

atollena · 2022-12-01T18:06:31Z

Bug or feature, however you see it.

I took a look at how to fix this. Two approaches I could think of:

simply capture the last error in setConnectivityState (https://github.com/grpc/grpc-go/blob/master/clientconn.go#L1371), via ac.cc.updateConnectionError(lastErr). I'm not sure about side effects of having an additional call site for ac.cc.updateConnectionError -- it makes the code a bit more difficult to understand. It's also a bit weird to call ac.cc.updateConnectionError outside the startup sequence (currently it's only called in tryAllAddrs).
make the first health check synchronous in startHealthCheck (https://github.com/grpc/grpc-go/blob/master/clientconn.go#L1374) and ensure createTransport returns an error when the first health check fails. This would make health check errors handling more in line with other errors that can happen on transport creation. But it has the drawback that we only get the chance to make 1 health check, and I'm not yet sure the last error would never be wrong if the health check flips from e.g. NOT_SERVING to SERVING while the connection context is not yet cancelled.

This bug is just a UX thing, but it ends up being quite confusing -- we're using health checks in many services, and we had cases where an interceptor or PerRPCCredentials failed, causing the health check to fail, which in turn fails blocking DialContext. There is no hint in the error message.

arvindbr8 · 2022-12-05T22:53:47Z

Swapped the description sections around which seems to be right

easwars · 2022-12-08T01:45:57Z

I'd like to understand why you are using WithReturnConnectionError() (which was added because one user really wanted that functionality and was ready to send a PR for it). WithReturnConnectionError() implies WithBlock() and unless there is a strong reason to perform a blocking dial you should not be doing so.

Try not to think of grpc.Dial() like a net.Dial() (which creates a connection). grpc.Dial() is not intended to create a connection to the backend(s). That should ideally happen in the background, controlled by the name resolver and the load balancer. The most common and recommended way to go about would be to perform a non-blocking grpc.Dial() and start performing RPCs on the returned ClientConn and rely on the RPC errors.

Coming to your suggestions:

simply capture the last error in setConnectivityState

This is actually the error returned when trying to create one transport for one subChannel. This cannot possibly act as a proxy for the connection error of the whole Channel.

make the first health check synchronous in startHealthCheck

We cannot do that because that would affect callers of the non-blocking grpc.Dial().

We definitely are not happy with how tightly coupled our health check implementation is with the channel. We ideally want to make health checking to be part of the LB policy which will then be able to set a picker on the channel with an appropriate error message when health checking fails for all subChannels. We recently added a generic framework to support arbitrary data to be produced on a per-subChannel basis. We want to switch the health checking implementation to use this framework and produce data that can then be consumed by LB policies on the Channel and act appropriately.

atollena · 2022-12-12T11:47:25Z

I'd like to understand why you are using WithReturnConnectionError() (which was added because one user really wanted that functionality and was ready to send a PR for it). WithReturnConnectionError() implies WithBlock() and unless there is a strong reason to perform a blocking dial you should not be doing so.

Try not to think of grpc.Dial() like a net.Dial() (which creates a connection). grpc.Dial() is not intended to create a connection to the backend(s). That should ideally happen in the background, controlled by the name resolver and the load balancer. The most common and recommended way to go about would be to perform a non-blocking grpc.Dial() and start performing RPCs on the returned ClientConn and rely on the RPC errors.

Thanks for the background. I understand that net.Dial and grpc.Dial are doing very different things. For what it's worth, many users use DialContext and WithBlock with a timeout as a way to fail during the application startup in case something is so wrong with either the configuration or the target service that RPCs would definitely fail, and that it warrants failing the application startup. This usually has the side effect of blocking rolling restarts of the service hosting the client, as is at least perceived as a useful, basic reliability feature. WithReturnConnectionError sounds like a nice addition to uncover what's wrong more explicitly than resorting to classic debugging in those cases (e.g. looking at detailed grpc logs of individual subchannels/rpc failures). I interpreted (wrongly, apparently) the presence of WithReturnConnectionError and WithBlock as an implicit endorsement of this usage.

We definitely are not happy with how tightly coupled our health check implementation is with the channel. We ideally want to make health checking to be part of the LB policy which will then be able to set a picker on the channel with an appropriate error message when health checking fails for all subChannels. We recently added a generic framework to support arbitrary data to be produced on a per-subChannel basis. We want to switch the health checking implementation to use this framework and produce data that can then be consumed by LB policies on the Channel and act appropriately.

OK, that makes sense. I got a bit ahead of myself with suggesting code changes, looks like you disagree both with the intent and suggested implementation (which as you pointed out may not work anyway).

Taking a step back, what I wanted to point out with this issue is that it is quite difficult to understand what's going from the client point of view when servers are listening but their health check is in NOT_SERVING state. There is no indication of the problem in the logs. Arguably, WithReturnConnectionError, if it exists, should include the relevant information, just like it does when e.g. connections were refused or certificate issues. Would you agree with that?

make the first health check synchronous in startHealthCheck

We cannot do that because that would affect callers of the non-blocking grpc.Dial().

Would it though? The transport is always created in a separate goroutine so createTransport would block but this would not block Dial. Anyways, I'm not attached to any particular way of surfacing the problem.

dfawley · 2022-12-22T18:47:11Z

For what it's worth, many users use DialContext and WithBlock with a timeout as a way to fail during the application startup in case something is so wrong with either the configuration or the target service that RPCs would definitely fail, and that it warrants failing the application startup.

This is not unreasonable, but it's important to note that the implementations of gRPC in the other languages we support don't have any such facility at client creation time. Instead, they direct their users to create the client and either monitor the raw connectivity state or rely on actual RPCs to determine healthiness. One reason for this is that even if the channel reports READY, many or even all RPCs can still fail due to downstream dependencies. A better practice would be to discover whether your dependencies are healthy by sending actual RPCs to them.

What I really would like to avoid here is anything that ties our core gRPC layer more tightly with health checking. In the other languages, health checking is an LB policy wrapper that alters the state of connections, as reported to the LB policy, to make them appear TRANSIENT_FAILURE instead of READY when they are unhealthy. We would like to adopt this approach in Go as well, and this feature request wouldn't be possible when we do that, as the connection (internally to gRPC) is actually successful in this case.

What could be possible would be to return a status back to the client along with the connectivity state when an LB policy enters TRANSIENT_FAILURE. If we do that and also add an API to surface that status to the user, then health check information would still be available. This would need some cross-language agreement, however, as we don't want to go any further down the path of adding significant features in one language that other languages may not be able to incorporate. For that, please feel free to file a feature request issue in the github.com/grpc/grpc repo, where cross-language issues are captured. (Note that I did bring this topic up and got some push back, but real user feedback and use cases could be helpful for the discussion.)

atollena · 2023-03-17T08:18:13Z

This PR proposes to add documentation with some clarifications related to this issue: https://github.com/grpc/grpc-go/pull/6034/files.

atollena added the Type: Bug label Dec 1, 2022

arvindbr8 self-assigned this Dec 5, 2022

arvindbr8 assigned easwars and unassigned arvindbr8 Dec 6, 2022

easwars assigned atollena and unassigned easwars Dec 8, 2022

easwars added the Status: Requires Reporter Clarification label Dec 8, 2022

easwars removed the Status: Requires Reporter Clarification label Dec 13, 2022

easwars assigned easwars and unassigned atollena Dec 13, 2022

easwars assigned dfawley and unassigned easwars Dec 20, 2022

dfawley closed this as completed Dec 22, 2022

s-matyukevich mentioned this issue Mar 9, 2023

client: log last error on subchannel connectivity change #6109

Merged

atollena mentioned this issue Mar 17, 2023

Add documentation on some anti-patterns #6034

Merged

github-actions bot locked as resolved and limited conversation to collaborators Sep 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WithReturnConnectionError does not report errors related to health check #5831

WithReturnConnectionError does not report errors related to health check #5831

atollena commented Dec 1, 2022 •

edited by arvindbr8

Loading

atollena commented Dec 1, 2022 •

edited

Loading

arvindbr8 commented Dec 5, 2022 •

edited

Loading

easwars commented Dec 8, 2022

atollena commented Dec 12, 2022

dfawley commented Dec 22, 2022

atollena commented Mar 17, 2023

WithReturnConnectionError does not report errors related to health check #5831

WithReturnConnectionError does not report errors related to health check #5831

Comments

atollena commented Dec 1, 2022 • edited by arvindbr8 Loading

What version of gRPC are you using?

What version of Go are you using (go version)?

What operating system (Linux, Windows, …) and version?

What did you do?

What did you expect to see?

What did you see instead?

atollena commented Dec 1, 2022 • edited Loading

arvindbr8 commented Dec 5, 2022 • edited Loading

easwars commented Dec 8, 2022

atollena commented Dec 12, 2022

dfawley commented Dec 22, 2022

atollena commented Mar 17, 2023

atollena commented Dec 1, 2022 •

edited by arvindbr8

Loading

What version of Go are you using (`go version`)?

atollena commented Dec 1, 2022 •

edited

Loading

arvindbr8 commented Dec 5, 2022 •

edited

Loading