WithReturnConnectionError does not report errors related to health check #5831
Comments
Bug or feature, however you see it. I took a look at how to fix this. Two approaches I could think of:
This bug is just a UX thing, but it ends up being quite confusing -- we're using health checks in many services, and we had cases where an interceptor or
Swapped the description sections around, which seems to be right.
I'd like to understand why you are using Try not to think of Coming to your suggestions:
This is actually the error returned when trying to create one transport for one subChannel. This cannot possibly act as a proxy for the connection error of the whole Channel.
We cannot do that because that would affect callers of the non-blocking

We definitely are not happy with how tightly coupled our health check implementation is with the channel. Ideally, we want to make health checking part of the LB policy, which would then be able to set a picker on the channel with an appropriate error message when health checking fails for all subChannels. We recently added a generic framework that allows arbitrary data to be produced on a per-subChannel basis. We want to switch the health checking implementation to this framework so that it produces data that LB policies on the Channel can consume and act on appropriately.
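To make the idea above more concrete (an LB policy setting a picker whose error message explains that health checking failed for all subChannels), here is a minimal sketch in plain Go. It deliberately does not use grpc-go types: `subConnHealth` and `pickError` are hypothetical names invented for illustration, and the real design may look quite different.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// subConnHealth is a hypothetical per-subChannel record (not a grpc-go type).
type subConnHealth struct {
	Addr    string // backend address
	Serving bool   // true if the last health check reported the backend as serving
	Reason  string // text of the last health check result, e.g. "NOT_SERVING"
}

// pickError returns nil if at least one subChannel is serving. Otherwise it
// returns an error listing why each subChannel is considered unhealthy, which
// is the kind of message a picker could surface to failing RPCs.
func pickError(subConns []subConnHealth) error {
	if len(subConns) == 0 {
		return errors.New("no subchannels available")
	}
	var reasons []string
	for _, sc := range subConns {
		if sc.Serving {
			return nil
		}
		reasons = append(reasons, fmt.Sprintf("%s: %s", sc.Addr, sc.Reason))
	}
	return fmt.Errorf("all subchannels failed health checking: %s", strings.Join(reasons, "; "))
}

func main() {
	err := pickError([]subConnHealth{
		{Addr: "10.0.0.1:50051", Reason: "NOT_SERVING"},
		{Addr: "10.0.0.2:50051", Reason: "NOT_SERVING"},
	})
	fmt.Println(err)
}
```

The point of the sketch is only that the error is assembled from per-subChannel data at the LB policy level, rather than from a single transport creation error.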
Thanks for the background. I understand that
OK, that makes sense. I got a bit ahead of myself with suggesting code changes; it looks like you disagree both with the intent and with the suggested implementation (which, as you pointed out, may not work anyway). Taking a step back, what I wanted to point out with this issue is that it is quite difficult to understand what's going on from the client's point of view when servers are listening but their health check is in
Would it, though? The transport is always created in a separate goroutine so
This is not unreasonable, but it's important to note that the implementations of gRPC in the other languages we support don't have any such facility at client creation time. Instead, they direct their users to create the client and either monitor the raw connectivity state or rely on actual RPCs to determine healthiness. One reason for this is that even if the channel reports READY, many or even all RPCs can still fail due to downstream dependencies. A better practice would be to discover whether your dependencies are healthy by sending actual RPCs to them.

What I really would like to avoid here is anything that ties our core gRPC layer more tightly with health checking. In the other languages, health checking is an LB policy wrapper that alters the state of connections, as reported to the LB policy, to make them appear TRANSIENT_FAILURE instead of READY when they are unhealthy. We would like to adopt this approach in Go as well, and this feature request wouldn't be possible when we do that, as the connection (internally to gRPC) is actually successful in this case.

What could be possible would be to return a status back to the client along with the connectivity state when an LB policy enters TRANSIENT_FAILURE. If we do that and also add an API to surface that status to the user, then health check information would still be available. This would need some cross-language agreement, however, as we don't want to go any further down the path of adding significant features in one language that other languages may not be able to incorporate. For that, please feel free to file a feature request issue in the github.com/grpc/grpc repo, where cross-language issues are captured. (Note that I did bring this topic up and got some push back, but real user feedback and use cases could be helpful for the discussion.)
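The wrapper approach described above (an unhealthy but connected backend is reported as TRANSIENT_FAILURE instead of READY, possibly with a status attached) can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation in any gRPC language: `State`, `Report`, and `wrapHealth` are hypothetical local names that only mirror the connectivity state names, and the status text format is made up.

```go
package main

import "fmt"

// State is a local stand-in that mirrors the names of the gRPC connectivity
// states. It is defined here for illustration and is not the grpc-go type.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
	Shutdown
)

// String returns the conventional upper-case name of the state.
func (s State) String() string {
	switch s {
	case Idle:
		return "IDLE"
	case Connecting:
		return "CONNECTING"
	case Ready:
		return "READY"
	case TransientFailure:
		return "TRANSIENT_FAILURE"
	case Shutdown:
		return "SHUTDOWN"
	}
	return "UNKNOWN"
}

// Report is a hypothetical pairing of a connectivity state with a status
// message, along the lines of "return a status along with the state".
type Report struct {
	State  State
	Status string
}

// wrapHealth adjusts the state reported to the LB policy: a connection that is
// READY at the transport level but failing its health check is reported as
// TRANSIENT_FAILURE, with the health result kept in the status message.
// All other states pass through unchanged.
func wrapHealth(conn State, healthy bool, healthStatus string) Report {
	if conn == Ready && !healthy {
		return Report{State: TransientFailure, Status: "health check failed: " + healthStatus}
	}
	return Report{State: conn}
}

func main() {
	r := wrapHealth(Ready, false, "NOT_SERVING")
	fmt.Println(r.State, r.Status)
}
```

Note that the underlying connection in this sketch is still READY; only the reported state changes, which is why a connection-level error would have nothing to report in this case.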
This PR proposes to add documentation with some clarifications related to this issue: https://github.com/grpc/grpc-go/pull/6034/files.
What version of gRPC are you using?
v1.32.0, v1.49. master is affected.
What version of Go are you using (go version)?
go 1.19.1
What operating system (Linux, Windows, …) and version?
Linux, macOS
What did you do?
WithReturnConnectionError() option enabled
What did you expect to see?
DialContext should report a more precise error, such as:
What did you see instead?
DialContext reports a generic timeout error:
This behavior makes troubleshooting problems with health checks more difficult than it needs to be.