Backoff on reestablishing watches when Unavailable errors are encountered #9840
Conversation
@jpbetz has implemented a new retry in the interceptor layer (still in a development branch for testing), but not for this watch routine. We will see if we can incorporate those. But if we need this to be backported for Kubernetes, something simple like this would suffice as a transitional solution. Defer to @jpbetz
lgtm. Thanks @liggitt! Having the for loop spin without any wait was clearly no good. This retry logic is reasonable. Within about 20 iterations and 500ms it's fully backed off.
@@ -849,6 +852,17 @@ func (w *watchGrpcStream) openWatchClient() (ws pb.Watch_WatchClient, err error)
	if isHaltErr(w.ctx, err) {
		return nil, v3rpc.Error(err)
	}
	if isUnavailableErr(w.ctx, err) {
		// retry, but backoff
		if backoff < maxBackoff {
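For readers following along, here is a minimal, self-contained sketch of the retry-with-capped-backoff shape this hunk introduces. The helper name, the error sentinel, and the constants are illustrative assumptions (chosen so that roughly 20 iterations reach a 500 ms cap, matching the review comment above); the actual patch lives inside `openWatchClient` and checks the gRPC Unavailable status code rather than a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable stands in for a gRPC Unavailable error; the real client
// inspects the gRPC status code instead.
var errUnavailable = errors.New("etcdserver: unavailable")

// openWithBackoff is a hypothetical helper mirroring the shape of the fix:
// keep retrying the open, but sleep an increasing amount between attempts,
// capped so a long outage costs at most maxBackoff per iteration.
func openWithBackoff(open func() error) error {
	const (
		backoffStep = 25 * time.Millisecond  // illustrative, not the patch's value
		maxBackoff  = 500 * time.Millisecond // illustrative, not the patch's value
	)
	backoff := time.Duration(0)
	for {
		err := open()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errUnavailable) {
			// halt errors (cancelled context, auth failure, ...) return immediately
			return err
		}
		// retry, but back off so the loop no longer spins at full CPU
		if backoff < maxBackoff {
			backoff += backoffStep
		}
		time.Sleep(backoff)
	}
}

func main() {
	attempts := 0
	err := openWithBackoff(func() error {
		attempts++
		if attempts < 5 {
			return errUnavailable // simulate a temporarily down server
		}
		return nil
	})
	fmt.Println("reconnected after", attempts, "attempts, err:", err)
}
```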
abstract this out as a retry func with test?
I'd probably hold off on abstracting it until we want to use this elsewhere... it sounded like there were more systemic retry improvements in progress, so it's possible this will get dropped in the future. I was mostly looking for something minimal we could pick to the 3.2.x and 3.3.x streams to alleviate this specific hot loop.
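For illustration only, this is roughly the kind of "retry func with test" the reviewer floated (the author kept the fix inline instead). It assumes the hypothetical `openWithBackoff` helper and `errUnavailable` sentinel from the sketch above and verifies that Unavailable errors are retried with real sleeps rather than a hot loop.

```go
package main

import (
	"testing"
	"time"
)

// TestOpenWithBackoff checks that repeated Unavailable errors are retried,
// that the helper eventually succeeds, and that it actually slept between attempts.
func TestOpenWithBackoff(t *testing.T) {
	attempts := 0
	start := time.Now()
	err := openWithBackoff(func() error {
		attempts++
		if attempts < 4 {
			return errUnavailable
		}
		return nil
	})
	if err != nil {
		t.Fatalf("expected success after retries, got %v", err)
	}
	if attempts != 4 {
		t.Fatalf("expected 4 attempts, got %d", attempts)
	}
	// with a 25ms step, the 3 backoffs should take roughly 25+50+75 = 150ms
	if elapsed := time.Since(start); elapsed < 100*time.Millisecond {
		t.Fatalf("backoff did not appear to sleep: %v", elapsed)
	}
}
```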
This is a targeted fix to the predominant CPU consumer when an etcd server becomes unavailable to a client with established watches. It picks cleanly back to the 3.3.x and 3.2.x streams.
This took our apiserver from 700% CPU while etcd was temporarily down to 10% CPU.
xref #9578 #9740