Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invoke hangs in client after server presents an incorrect certificate #1656

Closed
rra opened this issue Nov 7, 2017 · 5 comments · Fixed by #1855
Closed

Invoke hangs in client after server presents an incorrect certificate #1656

rra opened this issue Nov 7, 2017 · 5 comments · Fixed by #1855
Assignees
Labels
P2 Type: Feature New features or improvements in behavior

Comments

@rra
Copy link

rra commented Nov 7, 2017

Please answer these questions before submitting your issue.

What version of gRPC are you using?

a5986a5

What version of Go are you using (go version)?

1.6 and 1.8 (same behavior in both)

What operating system (Linux, Windows, …) and version?

Linux (Ubuntu 16.04)

What did you do?

When attempting to update our internal grpc-go at Dropbox to the above version, one of our tests started hanging until killed by a timeout. The test tries to connect to a gRPC server that is presenting a different certificate identity than the client is expecting. This is a regression from the previous version of grpc-go we were using (66c9ed8).

What did you expect to see?

The client should return from Invoke with an error.

What did you see instead?

The client hung forever unless a timeout was set. This is true even if FastFail is set.

Tracing into the grpc-go code, the call sequence appears to be as follows:

  1. DialContext succeeds.
  2. WaitForStateChange eventually fails due to a timeout (as expected because a successful connection cannot be opened to the server presenting the incorrect certificate)
  3. Client calls Invoke
  4. Invoke -> invoke (in call.go) calls getTransport
  5. getTransport in clientconn.go calls pick
  6. pick in picker_wrapper.go sets ch to bp.blockingCh (not the original value on entry)
  7. pick calls Pick, which returns the error balancer.ErrNoSubConnAvailable, causing continue to restart the loop.
  8. ch is now bp.blockingCh, so pick enters the select on ch and at that point blocks forever.

At this point, there is no background activity or attempt to re-establish the connection (as determined by annotations inside ClientHandshake in the TransportCredentials), so I see no sign that the Invoke call will ever recover barring a timeout.

@rra rra changed the title Invoke hangs in client after server presents an incorrect certificate Invoke hangs in client after server presents an incorrect certificate Nov 7, 2017
@dzbarsky
Copy link
Contributor

dzbarsky commented Nov 7, 2017

@dfawley can you please take a look? thanks!

@menghanl
Copy link
Contributor

menghanl commented Nov 7, 2017

The failfast RPCs should fail immediately with unavailable errors. Fix for this bug is in #1657.

The error will be a general error "all SubConns are in TransientFailure" now. We will try to include the reason (for example creds handshake errors) in the error message.

Non-failfast RPCs will keep waiting for connections to be ready, and will fail with DeadlineExceeded (in theory, the connection could recover, e.g. the resolver might return a new address).

Let us know what you think of this behavior.

@rra
Copy link
Author

rra commented Nov 7, 2017

Thanks for the fix! We set failfast by default in our internal code, so I think this works for us, but @dzbarsky to confirm.

@dzbarsky
Copy link
Contributor

dzbarsky commented Nov 8, 2017

Yep, yyour fix makes sense. Thank you.

@menghanl
Copy link
Contributor

Sounds good. Thanks for confirming.

We will still work on including the error details in the RPC errors, but this issue should not be a blocker now.
Moving this to P2.

@menghanl menghanl added P2 and removed P1 labels Nov 17, 2017
@dfawley dfawley added Type: Feature New features or improvements in behavior and removed Type: Bug labels Dec 14, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P2 Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants