[target-allocator] restart pod watcher when no event is found #1237

Conversation

moh-osman3 (Contributor):

Resolves #1028

Before fix:

{"level":"info","ts":1668017421.9970064,"msg":"Starting the Target Allocator"}
{"level":"info","ts":1668017421.9972935,"logger":"allocator","msg":"Unrecognized filter strategy; filtering disabled"}
{"level":"info","ts":1668017422.0179584,"logger":"setup","msg":"Starting server..."}
{"level":"info","ts":1668017422.0202508,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127338.1837273,"logger":"allocator","msg":"false","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127338.1838017,"logger":"allocator","msg":"Collector pod watch event stopped no event","component":"opentelemetry-targetallocator"}

This meant the pod watcher was no longer running, so the TA kept assigning the remaining targets to pods that had already been terminated.
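
For context, this is roughly the failure mode, sketched with illustrative names (k8sClient, namespace, selector, and handlePodEvent are stand-ins, not the actual target-allocator code): when the API server ends the long-running watch, the result channel closes, the goroutine returns, and nothing re-establishes the watch.

```go
// Illustrative sketch of the pre-fix behavior; identifiers are stand-ins,
// not the actual cmd/otel-allocator code.
package collector

import (
	"context"

	"github.com/go-logr/logr"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

func watchCollectorPods(ctx context.Context, log logr.Logger, k8sClient kubernetes.Interface,
	namespace, selector string, handlePodEvent func(watch.Event)) {
	go func() {
		watcher, err := k8sClient.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
			LabelSelector: selector,
		})
		if err != nil {
			log.Error(err, "unable to watch collector pods")
			return
		}
		for {
			event, ok := <-watcher.ResultChan()
			if !ok {
				// The API server eventually ends long-running watches; without a
				// restart, the goroutine exits here and later pod churn (e.g. an
				// HPA scale-down) is never observed.
				log.Info("Collector pod watch event stopped", "reason", "no event")
				return
			}
			handlePodEvent(event)
		}
	}()
}
```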

After fix:

{"level":"info","ts":1668017421.9970064,"msg":"Starting the Target Allocator"}
{"level":"info","ts":1668017421.9972935,"logger":"allocator","msg":"Unrecognized filter strategy; filtering disabled"}
{"level":"info","ts":1668017422.0179584,"logger":"setup","msg":"Starting server..."}
{"level":"info","ts":1668017422.0202508,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668020290.9531963,"logger":"allocator","msg":"No event found. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668020290.9575274,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668023632.289222,"logger":"allocator","msg":"No event found. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668023632.292879,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}

The pod watcher now gets restarted until the collector pods stabilize; confirmed that the TA assigns the remaining targets to the correct, still-running pods.

@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from 24da7ae to 158a13d on November 9, 2022 20:22
@moh-osman3 marked this pull request as ready for review on November 9, 2022 20:23
@moh-osman3 requested a review from a team on November 9, 2022 20:23
@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch 2 times, most recently from 286ae82 to 9c84235, on December 7, 2022 00:51
@@ -87,6 +86,9 @@ func (k *Client) Watch(ctx context.Context, labelMap map[string]string, fn func(

	go func() {
		for {
			// add timeout to the context before calling Watch
			ctx, cancel := context.WithTimeout(ctx, watcherTimeout)
			defer cancel()
A reviewer (Member) commented:

Is this going to accumulate deferred executions for every iteration through this infinite loop? Do we need to defer execution here? Can it simply be not executed or can it be executed without deferral at the end of each loop iteration?
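
As a standalone illustration of the concern (not code from this PR): defer runs when the enclosing function returns, so a defer inside a long-running loop keeps stacking up instead of running at the end of each iteration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx := context.Background()
	for i := 0; i < 3; i++ {
		// Each iteration registers another deferred cancel, but none of them
		// run until main returns -- in an infinite loop they never run, so the
		// timer and resources behind each child context are never released.
		childCtx, cancel := context.WithTimeout(ctx, time.Minute)
		defer cancel()
		deadline, _ := childCtx.Deadline()
		fmt.Println("iteration", i, "deadline:", deadline)
	}
	// Calling cancel() explicitly per iteration, or moving the body into its
	// own function, releases each context promptly instead.
}
```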

moh-osman3 (Contributor, author) replied:

Thanks for the review! Hmm, I think if the operation returns before the timeout (no timeout triggered), then a call to cancel() is still needed for cleanup. I agree I shouldn't pile up a bunch of defer statements and should call cancel() as soon as possible. If I add cancel() at the end of the for loop and before the returns, I think there's still a chance that, if a panic occurs, cancel() won't be called without defer.

To solve this I wrapped this whole block into its own method called restartWatch and that way defer can be called properly. Let me know if you have any thoughts on this!
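
A minimal sketch of that shape, with assumed names and signatures (watcherTimeout's value, the string label selector, and the onEvent callback are illustrative, not the exact code merged here): each watch session lives in its own function call, so defer cancel() and watcher.Stop() fire as soon as that call returns rather than accumulating in the caller's loop.

```go
package collector

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

const watcherTimeout = 15 * time.Minute // assumed value, for the sketch only

// restartWatch runs a single watch session. Because it is its own function,
// the deferred cancel and Stop run as soon as the session ends. It returns
// true when the caller should start a new session.
func restartWatch(parent context.Context, k8sClient kubernetes.Interface,
	namespace, selector string, onEvent func(watch.Event)) bool {
	ctx, cancel := context.WithTimeout(parent, watcherTimeout)
	defer cancel()

	watcher, err := k8sClient.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: selector,
	})
	if err != nil {
		return false
	}
	defer watcher.Stop()

	for {
		select {
		case <-parent.Done():
			// The allocator itself is shutting down: stop for good.
			return false
		case event, ok := <-watcher.ResultChan():
			if !ok {
				// Channel closed with no event (e.g. the per-session timeout
				// expired): ask the caller to restart the watch routine.
				return true
			}
			onEvent(event)
		}
	}
}

// The caller simply loops, re-establishing the watch whenever a session ends
// without a terminal error.
func runPodWatcher(ctx context.Context, k8sClient kubernetes.Interface,
	namespace, selector string, onEvent func(watch.Event)) {
	for restartWatch(ctx, k8sClient, namespace, selector, onEvent) {
	}
}
```

With this shape, the "No event found. Restarting watch routine" / "Successfully started a collector pod watcher" pair in the post-fix logs corresponds to one trip around the caller's loop.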

@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from 5577c7c to d09c5ee on December 7, 2022 21:38
@moh-osman3 requested a review from Aneurysm9 on December 8, 2022 18:19
@kristinapathak (Contributor) left a comment:

Other than the logging change (which can also be done in a separate PR if we want), this looks awesome!

Review thread on cmd/otel-allocator/collector/collector.go (outdated; resolved)
@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from 2279a6d to b29d850 on December 12, 2022 18:14
@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from b29d850 to 45d668b on December 18, 2022 19:57
@moh-osman3 requested a review from a team on December 18, 2022 20:15
@pavolloffay merged commit a89d0d6 into open-telemetry:main on Dec 19, 2022
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…elemetry#1237)

* naive fix

* unit test for close channel

* update unit tests, timeout option still not working as expected

* gofmt and removed unused block

* fix more lint errors

* more lint

* add timeout to context instead

* gofmt

* move logic for starting watch to own function

* gofmt

* add timeoutSeconds to test struct

* remove repeated logger declarations

* add chloggen

Successfully merging this pull request may close these issues.

[target-allocator] targets assigned to old pod after HPA scaled down