Backoff on reestablishing watches when Unavailable errors are encountered #9840
Conversation
@jpbetz has implemented a new retry in the interceptor layer (still in a development branch for testing), but not for this watch routine. We will see if we can incorporate those. But if we need this to be backported for Kubernetes, something simple like this would suffice as a transitional solution. Defer to @jpbetz
lgtm. Thanks @liggitt! Having the for loop spin without any wait was clearly no good. This retry logic is reasonable. Within about 20 iterations and 500ms it's fully backed off.
@@ -849,6 +852,17 @@ func (w *watchGrpcStream) openWatchClient() (ws pb.Watch_WatchClient, err error)
	if isHaltErr(w.ctx, err) {
		return nil, v3rpc.Error(err)
	}
	if isUnavailableErr(w.ctx, err) {
		// retry, but backoff
		if backoff < maxBackoff {
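For readers following along, here is a minimal, self-contained sketch of the retry-with-capped-backoff shape this hunk introduces. The helper name, the error sentinel, and the constants are illustrative assumptions (chosen so that roughly 20 iterations reach a 500 ms cap, matching the review comment above); the actual patch lives inside `openWatchClient` and checks the gRPC Unavailable status code rather than a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable stands in for a gRPC Unavailable error; the real client
// inspects the gRPC status code instead.
var errUnavailable = errors.New("etcdserver: unavailable")

// openWithBackoff is a hypothetical helper mirroring the shape of the fix:
// keep retrying the open, but sleep an increasing amount between attempts,
// capped so a long outage costs at most maxBackoff per iteration.
func openWithBackoff(open func() error) error {
	const (
		backoffStep = 25 * time.Millisecond  // illustrative, not the patch's value
		maxBackoff  = 500 * time.Millisecond // illustrative, not the patch's value
	)
	backoff := time.Duration(0)
	for {
		err := open()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errUnavailable) {
			// halt errors (cancelled context, auth failure, ...) return immediately
			return err
		}
		// retry, but back off so the loop no longer spins at full CPU
		if backoff < maxBackoff {
			backoff += backoffStep
		}
		time.Sleep(backoff)
	}
}

func main() {
	attempts := 0
	err := openWithBackoff(func() error {
		attempts++
		if attempts < 5 {
			return errUnavailable // simulate a temporarily down server
		}
		return nil
	})
	fmt.Println("reconnected after", attempts, "attempts, err:", err)
}
```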
abstract this out as a retry func with test?
I'd probably hold off on abstracting it until we want to use this elsewhere... it sounded like there were more systemic retry improvements in progress, so it's possible this will get dropped in the future. I was mostly looking for something minimal we could pick to the 3.2.x and 3.3.x streams to alleviate this specific hot loop.
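For illustration only, this is roughly the kind of "retry func with test" the reviewer floated (the author kept the fix inline instead). It assumes the hypothetical `openWithBackoff` helper and `errUnavailable` sentinel from the sketch above and verifies that Unavailable errors are retried with real sleeps rather than a hot loop.

```go
package main

import (
	"testing"
	"time"
)

// TestOpenWithBackoff checks that repeated Unavailable errors are retried,
// that the helper eventually succeeds, and that it actually slept between attempts.
func TestOpenWithBackoff(t *testing.T) {
	attempts := 0
	start := time.Now()
	err := openWithBackoff(func() error {
		attempts++
		if attempts < 4 {
			return errUnavailable
		}
		return nil
	})
	if err != nil {
		t.Fatalf("expected success after retries, got %v", err)
	}
	if attempts != 4 {
		t.Fatalf("expected 4 attempts, got %d", attempts)
	}
	// with a 25ms step, the 3 backoffs should take roughly 25+50+75 = 150ms
	if elapsed := time.Since(start); elapsed < 100*time.Millisecond {
		t.Fatalf("backoff did not appear to sleep: %v", elapsed)
	}
}
```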
This is a targeted fix to the predominant CPU consumer when an etcd server becomes unavailable to a client with established watches. It picks cleanly back to the 3.3.x and 3.2.x streams.
This took our apiserver from 700% CPU while etcd was temporarily down to 10% CPU.
xref #9578 #9740