
etcdserver/api/v3rpc: debug user cancellation and log warning for rest #9105

Merged · 1 commit · Jan 8, 2018

Conversation

@spzala (Member) commented Jan 5, 2018

The context error with cancel code is typically for user cancellation which
should be at debug level. For other error codes we should display a warning.

Fixes #9085


@spzala (Member, Author) commented Jan 5, 2018

@xiang90 @gyuho do these changes make sense? Do we need to cover any more errors? Thanks!

@codecov-io commented Jan 5, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@6f76e46).
The diff coverage is 65.62%.


@@            Coverage Diff            @@
##             master    #9105   +/-   ##
=========================================
  Coverage          ?   75.93%           
=========================================
  Files             ?      359           
  Lines             ?    29987           
  Branches          ?        0           
=========================================
  Hits              ?    22771           
  Misses            ?     5626           
  Partials          ?     1590
Impacted Files Coverage Δ
etcdserver/api/v3rpc/lease.go 86.48% <50%> (ø)
etcdserver/api/v3rpc/watch.go 84.12% <68.75%> (ø)
etcdserver/api/v3rpc/util.go 86.66% <75%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f76e46...c2185e1. Read the comment docs.

@@ -140,7 +140,11 @@ func (ws *watchServer) Watch(stream pb.Watch_WatchServer) (err error) {
 		// deadlock when calling sws.close().
 		go func() {
 			if rerr := sws.recvLoop(); rerr != nil {
-				plog.Debugf("failed to receive watch request from gRPC stream (%q)", rerr.Error())
+				if stream.Context().Err() == context.Canceled {
+					plog.Debugf("failed to receive watch request from gRPC stream 11 (%q)", rerr.Error())
Contributor:

remove 11?

Member Author:

ooops lol .. :( yes

@@ -140,7 +140,11 @@ func (ws *watchServer) Watch(stream pb.Watch_WatchServer) (err error) {
 		// deadlock when calling sws.close().
 		go func() {
 			if rerr := sws.recvLoop(); rerr != nil {
-				plog.Debugf("failed to receive watch request from gRPC stream (%q)", rerr.Error())
+				if stream.Context().Err() == context.Canceled {
Contributor:

Both context.Canceled and context.DeadlineExceeded should be ignored, I think.

Contributor:

Also we should have a shared function for all streaming API logs #8939 (e.g. lease keepalive).

Member Author:

@gyuho thanks! I just learned how to use lease :-) and could see the related log messages. So you're also suggesting modifying the message on LOC 110 (for a user-cancelled lease) in #8939, right? I know you suggested referring to the toErr function in etcd/clientv3/client.go in the issue discussion, but I believe that was just to learn about the different error codes, not to make any changes there. Thanks!

Contributor:

Yes, just like we do in clientv3, there should be an error-parsing function applied to this rerr error. Convert it to a gRPC *status.Status and match the error code. If the error is canceled or deadline exceeded, we log at debug level.

@xiang90 (Contributor) commented Jan 5, 2018

Can we manually test this PR?

Scenario 1:

1. set up a one-member etcd cluster with debug-level logging
2. create a watcher without a timeout against it
3. cancel the watcher
4. make sure the log is printed at debug level, not at warning level

Scenario 2:

1. set up a one-member etcd cluster with debug-level logging
2. create a watcher with a short timeout against it
3. wait for the timeout to be exceeded
4. make sure the log is printed at debug level, not at warning level

@spzala (Member, Author) commented Jan 5, 2018

@xiang90 +1 thanks, I actually already tested the first scenario before submitting the commit - i.e. manually cancelling the watcher and making sure that the warning is printed but not the debug message. Testing more per your suggestion.

@xiang90 (Contributor) commented Jan 5, 2018

manually cancelling watcher and making sure that warning is printed but not debug message.

I think if you manually cancel the watcher, the log should be at debug level, not warning level?

@spzala (Member, Author) commented Jan 5, 2018

Hmm... so by default the debug level is off, right? For example, when I run etcd as ./bin/etcd and then cancel the watcher, I don't see any debug message (something I tried earlier). If I run with debug on explicitly (just tried it now) as ./bin/etcd --debug, then I see the debug message as below:

$./bin/etcdctl watch foo
^C
$
$./bin/etcd --debug
....
....
2018-01-05 13:42:22.660676 D | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream 11 ("rpc error: code = Canceled desc = context canceled")

@xiang90 (Contributor) commented Jan 5, 2018

Correct, and what you saw is the expected behavior.

@spzala (Member, Author) commented Jan 5, 2018

:-) OK, perfect. Thought so too :-).

@spzala (Member, Author) commented Jan 6, 2018

I have tested the following, manual and automatic interruption for watch and lease keep-alive, with the new commit:

$ ./bin/etcdctl watch foo
^C
$

$ timeout 3 ./bin/etcdctl watch foo
$ ./bin/etcdctl lease keep-alive 694d60c8e16b2003
lease 694d60c8e16b2003 keepalived with TTL(30)
^C
$

$ timeout 3 ./bin/etcdctl lease keep-alive 694d60c8e16b2003
lease 694d60c8e16b2003 keepalived with TTL(30)
$

etcd was run with and without --debug.

@@ -107,7 +110,13 @@ func (ls *LeaseServer) leaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) erro
 		return nil
 	}
 	if err != nil {
-		plog.Debugf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
+		ev, _ := status.FromError(err)
Contributor:

@spzala Let's wrap this as a separate function inside etcdserver/api/v3rpc.

func isClientCtxErr(err error) bool {
  ev, _ := status.FromError(err)
  ...
  return true // when code == codes.Canceled || code == codes.DeadlineExceeded
}

Member Author:

OK, yup, makes sense, especially since we are doing this for all of the messages. Thanks!

@gyuho (Contributor) commented Jan 6, 2018

@spzala We should do this for all streaming error handling (watch and lease keepalive, receive/send). There are a total of 6 errors we need to handle (see https://github.com/coreos/etcd/pull/8939/files for the others).

@@ -26,6 +26,9 @@ import (
 	pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
 	"github.com/coreos/etcd/mvcc"
 	"github.com/coreos/etcd/mvcc/mvccpb"
+
+	"google.golang.org/grpc/codes"
Contributor:

indentation.

@spzala (Member, Author) commented Jan 6, 2018

@gyuho ah, I noticed that in your previous PR but thought user interruption/cancellation only impacts 'receive', something that I could test by cancelling watch or lease. But sounds good, I will cover all of them. Is there any way I can quickly test 'send' as well, or should we not worry about testing send? Thanks!

@spzala (Member, Author) commented Jan 6, 2018

WIP.

@gyuho (Contributor) commented Jan 6, 2018

Does this silence the client watch canceling logs when --debug is on? Also squash 3 commits into one? Then LGTM. Thanks!

@spzala (Member, Author) commented Jan 6, 2018

@gyuho thanks! So when --debug is on (./bin/etcd --debug) the following command will produce the message; otherwise it won't. I believe that's the expected behavior?
e.g.

 $./bin/etcdctl watch foo
^C
$

And sure I will squash the commits.

@@ -81,3 +81,9 @@ func togRPCError(err error) error {
 	}
 	return grpcErr
 }
+
+func isClientCtxErr(err error) bool {
+	ev, _ := status.FromError(err)
Contributor:

Just to be more defensive, we should check the error against context.DeadlineExceeded and context.Canceled when FromError fails. I am not sure if gRPC send and recv will always return the statusError type.

Member Author:

Sure. Does it make sense to pass stream.Context().Err() (or sws.gRPCStream.Context().Err() when needed) to this func and modify the func as below?

func isClientCtxErr(err error) bool {
	if err == nil {
		return false
	}
	return err == context.Canceled || err == context.DeadlineExceeded
}

Contributor:

I am not sure how exactly the client-side context error gets propagated to the server side. Maybe stream.Context().Err() only returns the server-side context error, not the one propagated from the client.

Member Author:

Thanks @xiang90, I see. I was able to run manual tests using stream.Context().Err(), but I will dig more. @gyuho, if you have any suggestions please let me know. Thanks!!

Contributor:

Yeah, maybe add another parameter to the isClientCtxErr function to parse stream.Context() errors as well?

Member Author:

Thanks @gyuho !!

@@ -81,3 +81,12 @@ func togRPCError(err error) error {
 	}
 	return grpcErr
 }
+
+func isClientCtxErr(ctxErr error, err error) bool {
+	if ctxErr != nil {
Contributor (@xiang90) commented Jan 7, 2018:

if ctxErr != nil {
    return true
}

ev, ok := status.FromError(err)
if !ok {
    return false
}
code := ev.Code()
return code == codes.Canceled || code == codes.DeadlineExceeded

Member Author:

On ctxErr != nil, returning true (instead of ctxErr == context.Canceled || ctxErr == context.DeadlineExceeded) makes sense, since a context error seems to always be either context.Canceled or context.DeadlineExceeded:

if ctxErr != nil {
return ctxErr == context.Canceled || ctxErr == context.DeadlineExceeded
}
ev, _ := status.FromError(err)
Contributor:

When there is an error, ev will be nil. If we try to call ev.Code() on nil, it might panic. See the code above for how to improve that.

Member Author:

@xiang90 thanks!! Sure, agreed. As you mentioned, and just looking at the status doc, since FromError can return nil we should handle it. I didn't think of it, seeing that in a few other places we aren't checking for nil on status.FromError (e.g. in client.go and elsewhere). Now it seems like we should, but that's something I will do in a separate PR and see if that makes sense. Thanks!

Member Author:

(I guess we can check for nil or use the boolean.)

@xiang90 (Contributor) commented Jan 8, 2018

LGTM. Defer to @gyuho.

@gyuho (Contributor) commented Jan 8, 2018

@spzala Let's squash commits into one with commit title etcdserver/api/v3rpc: .... Then LGTM. Thanks!

@gyuho gyuho changed the title Log: debug user cancellation and log warning for rest etcdserver/api/v3rpc: debug user cancellation and log warning for rest Jan 8, 2018
@spzala (Member, Author) commented Jan 8, 2018

Sounds good. Kids' bedtime, so stepping out, but I will be squashing soon. Thanks much @xiang90 @gyuho

The context error with cancel code is typically for user cancellation which
should be at debug level. For other error codes we should display a warning.

Fixes etcd-io#9085