
WithReturnConnectionError does not report errors related to health check #5831

Closed
atollena opened this issue Dec 1, 2022 · 6 comments · Fixed by #6109
Comments

@atollena
Collaborator

atollena commented Dec 1, 2022

What version of gRPC are you using?

v1.32.0 and v1.49; master is also affected.

What version of Go are you using (go version)?

go 1.19.1

What operating system (Linux, Windows, …) and version?

Linux, MacOS

What did you do?

  • The client is configured as follows (see the sketch after this list):
    • with health checks enabled via Service Config
    • with the round robin balancer via Service Config
    • with the WithReturnConnectionError() option enabled
    • the client calls DialContext with a reasonable timeout context (e.g. 10 seconds)
  • The server is configured with a custom health checker that fails health checks
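
For reference, a minimal sketch of the client side of this setup, assuming a placeholder target and an insecure connection (the server side would pair this with a grpc_health_v1 health server whose status is set to NOT_SERVING):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/health" // registers the client-side health check function
)

func main() {
	// Enable client-side health checks and round robin via the default service config.
	const serviceConfig = `{
		"loadBalancingConfig": [{"round_robin":{}}],
		"healthCheckConfig": {"serviceName": ""}
	}`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Blocking dial that is expected to surface the connection error on failure.
	conn, err := grpc.DialContext(ctx, "localhost:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
		grpc.WithReturnConnectionError(), // implies WithBlock()
	)
	if err != nil {
		log.Fatalf("did not connect: %v", err) // currently just "context deadline exceeded", no health check detail
	}
	defer conn.Close()
}
```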

What did you expect to see?

DialContext reports a more precise error, such as:

did not connect: context deadline exceeded: connection active but health check failed. status=NOT_SERVING

What did you see instead?

DialContext reports a generic timeout error:

did not connect: context deadline exceeded

This behavior makes troubleshooting problems with health checks more difficult than it needs to be.

@atollena
Collaborator Author

atollena commented Dec 1, 2022

Bug or feature, however you see it.

I took a look at how to fix this. Two approaches I could think of:

  1. simply capture the last error in setConnectivityState (https://github.com/grpc/grpc-go/blob/master/clientconn.go#L1371), via ac.cc.updateConnectionError(lastErr). I'm not sure about side effects of having an additional call site for ac.cc.updateConnectionError -- it makes the code a bit more difficult to understand. It's also a bit weird to call ac.cc.updateConnectionError outside the startup sequence (currently it's only called in tryAllAddrs).
  2. make the first health check synchronous in startHealthCheck (https://github.com/grpc/grpc-go/blob/master/clientconn.go#L1374) and ensure createTransport returns an error when the first health check fails. This would make health check error handling more in line with other errors that can happen on transport creation. But it has the drawback that we only get the chance to make one health check, and I'm not yet sure the last error would never be wrong if the health check flips from e.g. NOT_SERVING to SERVING while the connection context is not yet cancelled.

This bug is just a UX thing, but it ends up being quite confusing -- we're using health checks in many services, and we had cases where an interceptor or PerRPCCredentials failed, causing the health check to fail, which in turn caused the blocking DialContext to fail. There is no hint about the root cause in the error message.
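
To illustrate that failure mode, here is a hedged sketch (the type and target are hypothetical) of a PerRPCCredentials implementation that fails every RPC, including the health check RPC the channel issues internally once health checking is enabled:

```go
package main

import (
	"context"
	"errors"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/insecure"
)

// brokenCreds is a hypothetical PerRPCCredentials whose token source is
// misconfigured: every RPC fails before it is sent, including the health
// check RPC, so subchannels never become healthy.
type brokenCreds struct{}

func (brokenCreds) GetRequestMetadata(ctx context.Context, uri ...string) (map[string]string, error) {
	return nil, errors.New("token source not configured")
}

func (brokenCreds) RequireTransportSecurity() bool { return false }

var _ credentials.PerRPCCredentials = brokenCreds{}

func main() {
	// Attaching the broken call credentials to a channel that also has
	// health checks enabled (see the earlier sketch) makes every health
	// check RPC fail; nothing about that shows up in the error returned
	// by a blocking DialContext.
	conn, err := grpc.Dial("localhost:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithPerRPCCredentials(brokenCreds{}),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```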

@arvindbr8 arvindbr8 self-assigned this Dec 5, 2022
@arvindbr8
Member

arvindbr8 commented Dec 5, 2022

Swapped the description sections around, which seems right.

@arvindbr8 arvindbr8 assigned easwars and unassigned arvindbr8 Dec 6, 2022
@easwars
Contributor

easwars commented Dec 8, 2022

I'd like to understand why you are using WithReturnConnectionError() (which was added because one user really wanted that functionality and was ready to send a PR for it). WithReturnConnectionError() implies WithBlock() and unless there is a strong reason to perform a blocking dial you should not be doing so.

Try not to think of grpc.Dial() like a net.Dial() (which creates a connection). grpc.Dial() is not intended to create a connection to the backend(s). That should ideally happen in the background, controlled by the name resolver and the load balancer. The most common and recommended approach is to perform a non-blocking grpc.Dial(), start performing RPCs on the returned ClientConn, and rely on the RPC errors.
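
A minimal sketch of that recommended pattern, assuming a placeholder target and using the standard health service purely as an example RPC (any application RPC behaves the same way):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Non-blocking dial: returns immediately; connections are established
	// in the background by the name resolver and the load balancer.
	conn, err := grpc.Dial("localhost:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial: %v", err) // only fails for invalid targets/options
	}
	defer conn.Close()

	// Rely on RPC errors instead of a blocking dial.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("RPC failed: %v", err) // carries the actual failure reason
	}
	log.Printf("serving status: %v", resp.GetStatus())
}
```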

Coming to your suggestions:

> 1. simply capture the last error in setConnectivityState

This is actually the error returned when trying to create one transport for one subChannel. This cannot possibly act as a proxy for the connection error of the whole Channel.

> 2. make the first health check synchronous in startHealthCheck

We cannot do that because that would affect callers of the non-blocking grpc.Dial().

We definitely are not happy with how tightly coupled our health check implementation is with the channel. We ideally want health checking to be part of the LB policy, which will then be able to set a picker on the channel with an appropriate error message when health checking fails for all subChannels. We recently added a generic framework to support arbitrary data being produced on a per-subChannel basis. We want to switch the health checking implementation to use this framework and produce data that can be consumed by LB policies on the Channel, which can then act appropriately.

@atollena
Collaborator Author

> I'd like to understand why you are using WithReturnConnectionError() (which was added because one user really wanted that functionality and was ready to send a PR for it). WithReturnConnectionError() implies WithBlock() and unless there is a strong reason to perform a blocking dial you should not be doing so.
>
> Try not to think of grpc.Dial() like a net.Dial() (which creates a connection). grpc.Dial() is not intended to create a connection to the backend(s). That should ideally happen in the background, controlled by the name resolver and the load balancer. The most common and recommended approach is to perform a non-blocking grpc.Dial(), start performing RPCs on the returned ClientConn, and rely on the RPC errors.

Thanks for the background. I understand that net.Dial and grpc.Dial do very different things. For what it's worth, many users use DialContext and WithBlock with a timeout as a way to fail during application startup when something is so wrong with either the configuration or the target service that RPCs would definitely fail and failing startup is warranted. This usually has the side effect of blocking rolling restarts of the service hosting the client, which is at least perceived as a useful, basic reliability feature. WithReturnConnectionError sounds like a nice addition to uncover what's wrong more explicitly than resorting to classic debugging in those cases (e.g. looking at detailed grpc logs of individual subchannels/RPC failures). I interpreted (wrongly, apparently) the presence of WithReturnConnectionError and WithBlock as an implicit endorsement of this usage.

> We definitely are not happy with how tightly coupled our health check implementation is with the channel. We ideally want health checking to be part of the LB policy, which will then be able to set a picker on the channel with an appropriate error message when health checking fails for all subChannels. We recently added a generic framework to support arbitrary data being produced on a per-subChannel basis. We want to switch the health checking implementation to use this framework and produce data that can be consumed by LB policies on the Channel, which can then act appropriately.

OK, that makes sense. I got a bit ahead of myself by suggesting code changes; it looks like you disagree with both the intent and the suggested implementation (which, as you pointed out, may not work anyway).

Taking a step back, what I wanted to point out with this issue is that it is quite difficult to understand what's going on from the client's point of view when servers are listening but their health check is in NOT_SERVING state. There is no indication of the problem in the logs. Arguably, WithReturnConnectionError, if it exists, should include the relevant information, just like it does when e.g. connections are refused or there are certificate issues. Would you agree with that?

> > make the first health check synchronous in startHealthCheck
>
> We cannot do that because that would affect callers of the non-blocking grpc.Dial().

Would it, though? The transport is always created in a separate goroutine, so createTransport would block, but this would not block Dial. Anyway, I'm not attached to any particular way of surfacing the problem.

@dfawley
Member

dfawley commented Dec 22, 2022

> For what it's worth, many users use DialContext and WithBlock with a timeout as a way to fail during application startup when something is so wrong with either the configuration or the target service that RPCs would definitely fail and failing startup is warranted.

This is not unreasonable, but it's important to note that the implementations of gRPC in the other languages we support don't have any such facility at client creation time. Instead, they direct their users to create the client and either monitor the raw connectivity state or rely on actual RPCs to determine healthiness. One reason for this is that even if the channel reports READY, many or even all RPCs can still fail due to downstream dependencies. A better practice would be to discover whether your dependencies are healthy by sending actual RPCs to them.
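
As a hedged sketch of the "monitor the raw connectivity state" alternative (placeholder target, arbitrary deadline):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial("localhost:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Kick off connection attempts and watch the channel state instead of
	// blocking in Dial, logging each transition along the way.
	conn.Connect()
	for {
		state := conn.GetState()
		log.Printf("channel state: %v", state)
		if state == connectivity.Ready {
			break
		}
		if !conn.WaitForStateChange(ctx, state) {
			log.Fatalf("channel never became READY: %v", ctx.Err())
		}
	}
}
```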

What I really would like to avoid here is anything that ties our core gRPC layer more tightly with health checking. In the other languages, health checking is an LB policy wrapper that alters the state of connections, as reported to the LB policy, to make them appear TRANSIENT_FAILURE instead of READY when they are unhealthy. We would like to adopt this approach in Go as well, and this feature request wouldn't be possible when we do that, as the connection (internally to gRPC) is actually successful in this case.

What could be possible would be to return a status back to the client along with the connectivity state when an LB policy enters TRANSIENT_FAILURE. If we do that and also add an API to surface that status to the user, then health check information would still be available. This would need some cross-language agreement, however, as we don't want to go any further down the path of adding significant features in one language that other languages may not be able to incorporate. For that, please feel free to file a feature request issue in the github.com/grpc/grpc repo, where cross-language issues are captured. (Note that I did bring this topic up and got some push back, but real user feedback and use cases could be helpful for the discussion.)

@atollena
Collaborator Author

This PR proposes to add documentation with some clarifications related to this issue: https://github.com/grpc/grpc-go/pull/6034/files.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 14, 2023