
TLS failures in blocking Dial calls don't provide useful error messages #2031

Closed
jmillikin-stripe opened this issue Apr 27, 2018 · 16 comments
Labels
P2, Type: Feature (new features or improvements in behavior)

Comments

@jmillikin-stripe

jmillikin-stripe commented Apr 27, 2018

What version of gRPC are you using?

v1.9.2, but I've verified the same issue exists at the latest release (v1.11.3).

What version of Go are you using (go version)?

1.10

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I'm using a blocking Dial (grpc.DialContext(..., grpc.WithBlock())) to better attribute errors to the dial phase vs. RPC calls. One of the backends had an unexpected CN in its TLS certificate, so the connection was failing. The error returned from DialContext was only "context deadline exceeded"; the useful errors were printed to the info log.

I expected a blocking dial to return some indication of why it had failed, instead of only timing out. In particular, including the most recent transport error in the returned error value would be very useful for reporting connection problems.

Based on the conversation in #1855 it sounds like this behavior is intentional (or, at least, expected). Is there any way to either get access to the transport-layer errors, or somehow propagate them up into the returned error value?

The very useful error printed to the logs:
W0427 21:43:33.738993 35711 clientconn.go:1167] grpc: addrConn.createTransport failed to connect to {badname.example.com 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for goodname.example.com, not badname.example.com". Reconnecting...

The not very useful error returned from DialContext:
W0427 21:43:33.773061 35711 handlers.go:180] Error dialing backend "badname.example.com": context deadline exceeded

@jmillikin-stripe
Author

Also, #1917 documents that RPCs with fail-fast behavior will somehow get access to the underlying connection error. My system isn't using fail-fast because we want the gRPC library to retry past transient network issues.

@dfawley
Member

dfawley commented Apr 30, 2018

I expected a blocking dial to return some indication of why it had failed, instead of only timing out. In particular, including the most recent transport error in the returned error value would be very useful for reporting connection problems.

The trouble is that this isn't really a "Dial" function, per se, despite the name. grpc.Dial creates a ClientConn containing a managed pool of connections. The blocking behavior of Dial is essentially to start up this pool and wait until a connection is established to one, or the deadline is met. Authentication problems could be a misconfiguration of a server or a client, and they could resolve themselves transparently to the client (e.g. if the server restarts and receives new certs). For this reason, we do not give up when authentication problems are encountered; we keep trying until the deadline is reached.

Changing the returned error may be possible, but it would be a behavior change. Anyone checking for Dial returning context.DeadlineExceeded could be broken by that change, because context.DeadlineExceeded is a value and not an interface, so it cannot be extended.

The longer-term plan is to deprecate Dial and replace it with a NewClient that is both clearer in its behavior, and can have the semantics we prefer.

My system isn't using fail-fast because we want the gRPC library to retry past transient network issues.

The difference between fail-fast (the default) and wait-for-ready RPCs is that WFR will remain pending through connection establishment errors. Once they are assigned to a connection, RPCs that encounter transient network errors will still result in failures. Auth failures are considered connection errors, so all your WFR RPCs will fail with deadline exceeded when this happens and you can't connect to any backend.

Note that our implementation of fail-fast used to be wrong ~5 months ago: it would not block when the client was in a "connecting" state, it would just fail. Now that that is resolved, it may be suitable for your use case.

dfawley added the Type: Feature and P2 labels on Apr 30, 2018
@dfawley
Member

dfawley commented May 8, 2018

Update: I don't think we should change the behavior of Dial to return anything besides context.DeadlineExceeded in this case. That would be a behavior change and not a true bug fix, and it could break some users.

However, I do think we can and should modify the text of the error that comes from a wait-for-ready RPC when it exceeds its deadline while waiting for a subchannel to include the details from the last connection error, like we do with non-WFR RPCs when we encounter transient failure.

How does that sound, @jmillikin-stripe?

@jmillikin-stripe
Author

With that solution, would it be possible to obtain the underlying error from a failed blocking dial? As I said in the first post, we'd like to be able to have error accounting such that failure to connect to the remote will be reported before attempting to run an RPC.

@dfawley
Member

dfawley commented May 8, 2018

No. In that case, you'd want to do a non-blocking dial, and then poll connectivity state* until you get either TransientFailure or Connected (or some deadline). On TransientFailure, we could then add a method to our connectivity state API similar to what Java has** to provide the relevant connection error.

*: https://godoc.org/google.golang.org/grpc#ClientConn.GetState and https://godoc.org/google.golang.org/grpc#ClientConn.WaitForStateChange
**: https://grpc.io/grpc-java/javadoc/io/grpc/ConnectivityStateInfo.html#getStatus--

@dfawley
Member

dfawley commented Jun 14, 2018

I created a PR (#2055) as a proof of concept for this. In light of some offline discussions, I still plan on reworking it a bit.

@dfawley
Member

dfawley commented Aug 29, 2018

In light of #2266, if the TLS error is non-temporary, then using WithBlock in combination with FailOnNonTempDialError should get you the behavior you want.

@skipor

skipor commented Jul 4, 2019

Is there any progress? The problem greatly complicates the analysis of connection problems.

@dfawley
Member

dfawley commented Mar 6, 2020

Unfortunately, the TLS library doesn't make it readily apparent when a definite TLS error occurs vs. a transient connection error. One idea: we could attempt to detect temporary network errors and call any other errors permanent. (This would be done by the TLS wrapper in the credentials package.)

@sethp-nr
Contributor

sethp-nr commented Mar 6, 2020

I'm not sure the FailOnNonTempDialError interface is going to help us too much here. Without modifying the defer, it's racy: the context deadline error will always overwrite any returned error, if one exists. It also presumes that the definition of Temporary is universal – and that any errors that are Temporary are therefore uninteresting to callers.

re:

In that case, you'd want to do a non-blocking dial, and then poll connectivity state* until you get either TransientFailure or Connected (or some deadline). On TransientFailure, we could then add a method to our connectivity state API similar to what Java has** to provide the relevant connection error.

I'm not sure I understand how what you're describing is different from grpc.Block(). What is the benefit to asking each user of the library to independently re-implement the same polling loop?

I understand that #3412 is an interface change. However, given the number of issues linked to this one from other go projects, it sounds like an interface that's desirable – would you consider something like that functionality behind a dial option?

@sethp-nr
Contributor

sethp-nr commented Mar 6, 2020

Also, per #3406 (comment) I believe this applies to more than just TLS misconfigurations – in Cluster API's case, we're providing a custom dialer that might exhibit a number of different types of non-TLS-related failures, and we'd want to surface those to the caller (ourselves).

@dfawley
Member

dfawley commented Mar 6, 2020

I'm not sure I understand how what you're describing is different from grpc.Block(). What is the benefit to asking each user of the library to independently re-implement the same polling loop?

Each user that wants to block until a connection is ready after client creation would need to do this, yes. (We could make a utility function that does the polling, if that helps, but it should be pretty simple.) This is not necessary, however. Many applications will do a non-blocking Dial and just start attempting RPCs, seeing the connection errors at that time. Note that being connected, while an important step in startup, is not necessarily an indication that the backend you've connected to is healthy and able to serve RPCs for the client. And the connection may be lost in the future, so it's not really necessary to do a blocking dial in gRPC in general. Clients will always need to be tolerant of RPC errors which may be connection errors.

would you consider something like that functionality behind a dial option?

Potentially. Longer-term I would like to get rid of Dial entirely (#1786), and not provide a blocking option at all for the replacement (NewClient).

I believe this applies to more than just TLS misconfigurations – in Cluster API's case, we're providing a custom dialer that might exhibit a number of different types of non-TLS-related failures, and we'd want to surface those to the caller (ourselves).

I think our guidance here would be the same. If you consider the error a permanent one, then implement Temporary() bool { return false } and a blocking Dial will stop and return the error to the caller immediately. Or don't do a blocking dial and inspect the errors returned when making the RPCs.

@sethp-nr
Contributor

sethp-nr commented Mar 6, 2020

At least in my testing, I could not get an RPC error to behave differently than the blocking dial – if I included a context.WithTimeout then all I'd get back from various RPCs was a context.DeadlineExceeded error. A complicating factor is that I don't actually hold a reference to the client; the etcd client interface I'm using does. I have no easy way to ask it about the state of the connection without a PR to their client library to expose that. As it stands, the shortest path I have to information about the health of the connection is the err value from either the RPC or Dial calls, which is always "context deadline exceeded".

Note that being connected, while an important step in startup, is not necessarily an indication that the backend you've connected to is healthy and able to serve RPCs for the client.

Agreed. However, it is a strong indication that my connection configuration embeds the potential for successful communication: in the cases where that's not true, what I desire from the system is to help me understand what action I need to take, and context deadline exceeded is fairly sparse in that sense.

That's why I'm not sure I grok the Temporary axis – almost all connection failures are temporary, in my experience, but some of them require me to do something before my next connection attempt succeeds. Connection refused sometimes implies I should go start that service back up. Read timeout sometimes suggests I misconfigured part of the network between my client and my server. DNS resolution failures are frequently because the name of the server I'm connecting to has a typo in it.

What I (&, it seems, others) think I want out of a networking API is exactly something like grpc.Block() – I want to make a call that, upon success, tells me that there is at least the possibility that my future RPCs will succeed & gives me good context on failure as to why. Returning a non-Temporary error immediately is a nice optimization, but even the Temporary ones are usually informative.

@sethp-nr
Contributor

sethp-nr commented Mar 6, 2020

I should add that I'm coming from a very short-lived context, in the sense that it's not a client with an indefinite lifespan like a long-running application. It's more like a CLI tool in that it's going to attempt to do one or two things with the RPC client before shutting down – in the long-lived application case, I would strongly agree that grpc.Block() is not what I want.

Tools like etcdctl & crictl don't really have a way to respond to failure, temporary or not, other than "well, here's what happened, goodbye."

@sethp-nr
Contributor

sethp-nr commented Mar 6, 2020

would you consider something like that functionality behind a dial option?

Potentially. Longer-term I would like to get rid of Dial entirely (#1786), and not provide a blocking option at all for the replacement (NewClient).

While we discuss the longer-term piece, I did put together a draft of what a dial option for this could look like: #3430

@dfawley
Member

dfawley commented Apr 23, 2020

I believe this issue can be closed:

Blocking dial calls can now use WithReturnConnectionError to see what happened after #3430, and RPCs should return the last connection error with #2777.

Let us know if there's anything else not covered by these changes.
