Clarify/define clientv3 retry logic (fix broken retries) #8691

gyuho · 2017-10-12T22:59:14Z

Now that retry logic is such a critical part of our balancer logic, we need to document clearly when the Go client retries its RPCs; currently we have no such documentation, other than some comments in clientv3/retry.go. This should also be helpful for other client language bindings.

release-3.2, as of f1d7dd8

Mutable/immutable RPCs share the same error handling logic:

if no error is returned, then no retry
if the error is rpctypes.EtcdError type, then no retry
- e.g. rpctypes.ErrEmptyKey, rpctypes.ErrNoSpace, rpctypes.ErrTimeout
if the error is grpc/status.(*statusError) type and its error code is codes.Unavailable (which means the service is currently not available), then retry

rpctypes.EtcdError and grpc/status.(*statusError) errors are mutually exclusive.
This only works with grpc/grpc-go v1.2.1.

master branch, as of 764a0f7

grpc/grpc-go upgraded to >v1.6.x, which changed the behavior of error handling.

During network disconnections, the error clientv3.ErrNoAddrAvilable can occur; its error code is codes.Unavailable, so the RPC should be retried (the error is the same as grpc.Errorf(codes.Unavailable, "there is no address available")).

However, due to the change in grpc-go, transport.ErrStreamDrain can also be returned. etcd mutable RPCs should not be retried on transport.ErrStreamDrain; they should be retried only on clientv3.ErrNoAddrAvilable. This is fixed via #8335 (fix "put at most once", not in 3.2).

In addition, with the health checking balancer (#8545), the retry error handling logic is now:

no error is returned, then no retry (no matter what RPCs are)
for immutable requests, if the error is grpc/status.(*statusError) type and its error code is codes.Unavailable, then mark unhealthy, endpoint-switch, wait for connection notify, and retry
for mutable requests, if the error is grpc/status.(*statusError) type and its error code is codes.Unavailable and the error message is "there is no address available", then mark unhealthy, endpoint-switch, wait for connection notify, and retry
for immutable requests, if the error is rpctypes.EtcdError type, then mark unhealthy, endpoint-switch, and exit
- e.g. rpctypes.ErrEmptyKey, rpctypes.ErrNoSpace, rpctypes.ErrTimeout
for mutable requests, if the error is rpctypes.EtcdError type, then mark unhealthy, endpoint-switch, and exit (proper handling is missing though)
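
The master-branch rules above can be sketched as a small decision function. This is a hedged illustration only: `etcdError`, `statusError`, `action`, and `decide` are hypothetical names, the types are local stand-ins for `rpctypes.EtcdError` and the gRPC status error, and what happens to a mutable RPC that fails with some other Unavailable error (for example a draining stream) is modeled here simply as "no retry, nothing marked", which the list above does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

// etcdError stands in for rpctypes.EtcdError.
type etcdError struct{ desc string }

func (e etcdError) Error() string { return e.desc }

// statusError stands in for the gRPC status error; unavailable models
// "error code is codes.Unavailable".
type statusError struct {
	unavailable bool
	msg         string
}

func (e *statusError) Error() string { return e.msg }

// noAddrMsg is the message carried by clientv3.ErrNoAddrAvilable.
const noAddrMsg = "there is no address available"

// action describes what the client does after a failed RPC.
type action struct {
	MarkUnhealthy bool // mark the endpoint unhealthy and switch endpoints
	Retry         bool // wait for connection notify, then retry
}

func decide(err error, mutable bool) action {
	if err == nil {
		return action{} // no error, no retry, whatever the RPC is
	}
	var ee etcdError
	if errors.As(err, &ee) {
		// Same for mutable and immutable: mark unhealthy, switch, and exit.
		return action{MarkUnhealthy: true}
	}
	var se *statusError
	if errors.As(err, &se) && se.unavailable {
		// Immutable RPCs retry on any Unavailable error; mutable RPCs
		// retry only when no address is available.
		if !mutable || se.msg == noAddrMsg {
			return action{MarkUnhealthy: true, Retry: true}
		}
	}
	return action{} // assumption: everything else is returned to the caller
}

func main() {
	drain := &statusError{unavailable: true, msg: "stream is draining"}
	fmt.Printf("immutable: %+v\n", decide(drain, false))
	fmt.Printf("mutable:   %+v\n", decide(drain, true))
}
```

Keeping the mutable branch narrower is what preserves the "put at most once" behavior: a write that may already have reached the server is never sent a second time.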

TODOs

handle stale endpoints health status (done via clientv3: handle stale endpoints, clean up logging #8714)
distinguish mutable/immutable RPCs in Maintenance API, and others (done via clientv3: clean up retry wrapper, remove all FailFast=false #8717)
clean up retry wrapper (done via clientv3/retry: clean up retryRPCFunc #8724)
make the auth wrapper consistent in all RPCs (done via clientv3/retry: clean up retryRPCFunc #8724)
handle status.FromError: are all errors returned by gRPC wrapped by status.Status? (grpc/grpc-go#1581)
rpctypes.EtcdError type error handling should be consistent across mutable/immutable RPCs
make retrial logic consistent with our functional-tester/stresser
do not mark as unhealthy on rpctypes.ErrEmptyKey or rpctypes.ErrNoSpace
- only do so on rpctypes.ErrTimeout
do not trigger endpoint switch on rpctypes.ErrEmptyKey or rpctypes.ErrNoSpace
- only do so on rpctypes.ErrTimeout
consolidate retry logic across clientv3 (e.g. clientv3/leasing acquire has its own retrial logic)


xiang90 · 2017-10-24T07:01:24Z

is this fixed?

gyuho · 2017-10-24T14:01:16Z

Yes.

devnev · 2017-12-05T16:08:52Z

Hi. The Watch client still contains a failFast=False with a TODO to switch it to failFast=True. Was that meant to be part of this issue? Or is there an issue tracking that elsewhere?

gyuho · 2017-12-05T17:30:51Z

@devnev We've disabled FailFast=true to all RPCs, replaced by internal, manual retry in clientv3 side.

We've covered some network fault cases in #8711, if that helps.

For watch, clientv3 (>=v3.2.10) should be able to switch and retry when a server fails.

devnev · 2017-12-05T17:53:29Z

We've disabled FailFast=true to all RPCs

@gyuho that is incorrect, see https://github.com/coreos/etcd/blob/master/clientv3/watch.go#L778

devnev · 2017-12-05T17:57:20Z

A search of the codebase also reveals a FailFast=false in the grpcproxy

gyuho · 2017-12-05T17:58:06Z

@devnev Indeed.

I meant to say

We've disabled FailFast=false

Somehow got confused.

So we do FailFast=false for watch, but FailFast=true for the others, thus gRPC-side retry is disabled. But note that watch has a for-loop for manual retry, since we assume the gRPC retry logic is not stable enough.

devnev · 2017-12-05T18:11:26Z

For watches in particular, FailFast=False is problematic, as they usually do not have RPC timeouts. In our case the connections are failing for extended periods of time, but we cannot set a timeout on the request as it is a watch. However, because FailFast=False, we're also not given any indication that the watch is in fact broken.

gyuho · 2017-12-05T18:16:41Z

The etcd watch API is not meant for detecting connection issues. Disconnects are handled in the client balancer layer. We've added HTTP/2 keepalive and client balancer health checking (only available in >= v3.2.10).

Please try the HTTP/2 keepalive ping. If it still doesn't work, please open a new issue.
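
For readers trying this out, the keepalive ping is enabled through the client configuration. The fragment below is a sketch, assuming the `DialKeepAliveTime` and `DialKeepAliveTimeout` fields of `clientv3.Config`; the endpoint and the durations are placeholder values, not recommendations, and the program needs the etcd client module to build.

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
		// Ping the server after this much idle time to check that the
		// transport is still alive (illustrative value).
		DialKeepAliveTime: 10 * time.Second,
		// Consider the connection dead if the ping is not answered
		// within this time (illustrative value).
		DialKeepAliveTimeout: 3 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```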

gyuho changed the title ~~Clarify/define clientv3 retry logic (possibly fix broken retries)~~ Clarify/define clientv3 retry logic (fix broken retries) Oct 13, 2017

gyuho closed this as completed Oct 24, 2017

lujiajing1126 mentioned this issue Apr 7, 2022

[Feature] [BanyanDB] Return rich error with canonical status codes apache/skywalking#8830
