
Backoff on reestablishing watches when Unavailable errors are encountered #9840

Merged (1 commit) on Jun 14, 2018

Conversation

liggitt (Contributor) commented on Jun 13, 2018

This is a targeted fix for the predominant CPU consumer when an etcd server becomes unavailable to a client with established watches. It picks cleanly back to the 3.3.x and 3.2.x branches.

With etcd temporarily down, this took our apiserver from 700% CPU to 10% CPU.

xref #9578 #9740

xiang90 (Contributor) commented on Jun 13, 2018

/cc @gyuho @jpbetz If we use the new balancer, will we have a built-in retry mechanism?

gyuho (Contributor) commented on Jun 13, 2018

@xiang90 @liggitt

@jpbetz has implemented new retry logic in the interceptor layer (still in a development branch for testing), but not for this watch routine. We will see if we can incorporate that work. But if this needs to be backported for Kubernetes, something simple like this would suffice as a transitional solution.

Defer to @jpbetz

jpbetz (Contributor) left a comment


lgtm. Thanks @liggitt! Having the for loop spin without any wait was clearly no good. This retry logic is reasonable. Within about 20 iterations and 500ms it's fully backed off.
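
To make that estimate concrete, here is a minimal standalone Go sketch of the backoff arithmetic. The starting value (1ms), growth factor (+25% per retry), and 100ms cap are assumptions chosen to match the numbers above, not constants quoted from the diff:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed constants: 1ms initial backoff, +25% per retry, 100ms cap.
	backoff := time.Millisecond
	maxBackoff := 100 * time.Millisecond

	total := time.Duration(0)
	iterations := 0
	for backoff < maxBackoff {
		iterations++
		total += backoff
		// Same shape as the change under review: grow by 25%, clamp at the cap.
		backoff = backoff + backoff/4
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	fmt.Printf("reaches the %v cap after %d retries, ~%v of cumulative sleep\n",
		maxBackoff, iterations, total.Round(time.Millisecond))
}

Running this shows the backoff pinned at the cap after roughly 20 growth steps and a few hundred milliseconds of cumulative sleep, in line with the estimate above.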

@@ -849,6 +852,17 @@ func (w *watchGrpcStream) openWatchClient() (ws pb.Watch_WatchClient, err error)
		if isHaltErr(w.ctx, err) {
			return nil, v3rpc.Error(err)
		}
		if isUnavailableErr(w.ctx, err) {
			// retry, but backoff
			if backoff < maxBackoff {
Contributor commented on this hunk:

abstract this out as a retry func with test?

liggitt (Contributor, Author) replied:

I'd probably save abstracting it for when we want to use this elsewhere... it sounded like there were more systemic retry improvements in progress, so it's possible this will get dropped in the future. I was mostly looking for something minimal we could pick to the 3.2.x and 3.3.x streams to alleviate this specific hot loop.
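
For illustration only, here is one hypothetical shape such a helper could take if it were ever extracted. The package name, function name, signature, and the shouldRetry callback are invented for this sketch; none of this is part of the PR or of clientv3:

package retryutil

import (
	"context"
	"time"
)

// retryWithBackoff is a hypothetical helper, not part of this PR.
// It retries op while shouldRetry reports the error as retryable,
// growing the wait by 25% per attempt up to maxBackoff.
func retryWithBackoff(ctx context.Context, maxBackoff time.Duration, shouldRetry func(error) bool, op func() error) error {
	backoff := time.Millisecond
	for {
		err := op()
		if err == nil {
			return nil
		}
		if !shouldRetry(err) {
			return err
		}
		// Grow the backoff by 25% per attempt, clamped at maxBackoff.
		if backoff < maxBackoff {
			backoff = backoff + backoff/4
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}

A small table-driven test could then exercise the backoff growth and the context-cancellation path without a live gRPC stream, which is roughly what the review comment above asks for.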

gyuho merged commit 410d28c into etcd-io:master on Jun 14, 2018
humingcheng added a commit to humingcheng/servicecomb-service-center that referenced this pull request Mar 4, 2020
tianxiaoliang pushed a commit to apache/servicecomb-service-center that referenced this pull request Mar 4, 2020
Labels: none yet
4 participants