-
Notifications
You must be signed in to change notification settings - Fork 459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[target-allocator] restart pod watcher when no event is found #1237
[target-allocator] restart pod watcher when no event is found #1237
Conversation
24da7ae
to
158a13d
Compare
286ae82
to
9c84235
Compare
@@ -87,6 +86,9 @@ func (k *Client) Watch(ctx context.Context, labelMap map[string]string, fn func( | |||
|
|||
go func() { | |||
for { | |||
// add timeout to the context before calling Watch | |||
ctx, cancel := context.WithTimeout(ctx, watcherTimeout) | |||
defer cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to accumulate deferred executions for every iteration through this infinite loop? Do we need to defer execution here? Can it simply be not executed or can it be executed without deferral at the end of each loop iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! Hmm I think if the operation returns before the timeout (no timeout triggered) then a call to cancel() is needed for cleanup. I agree I shouldn't pile up a bunch of defer statements and should call cancel()
as soon as possible. If I add cancel() to the end of for loop and before the returns, I think there's still a chance if a panic occurs cancel()
won't be called without defer
.
To solve this I wrapped this whole block into its own method called restartWatch
and that way defer
can be called properly. Let me know if you have any thoughts on this!
5577c7c
to
d09c5ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the logging change (which can also be done in a separate PR if we want), this looks awesome!
2279a6d
to
b29d850
Compare
b29d850
to
45d668b
Compare
…elemetry#1237) * naive fix * unit test for close channel * update unit tests, timeout option still not working as expected * gofmt and removed unused block * fix more lint errors * more lint * add timeout to context instead * gofmt * move logic for starting watch to own function * gofmt * add timoutSeconds to test struct * remove repeated logger declarations * add chloggen
Resolves #1028
Before fix:
This meant pod watcher is no longer running and the TA assigns remaining targets to pods that were already terminated
After fix:
pod watcher gets restarted until stabilization occurs -> confirmed TA is assigning targets to correct remaining pods