[target-allocator] restart pod watcher when no event is found #1237

Conversation

moh-osman3 (Contributor):

Resolves #1028

Before fix:

{"level":"info","ts":1668017421.9970064,"msg":"Starting the Target Allocator"}
{"level":"info","ts":1668017421.9972935,"logger":"allocator","msg":"Unrecognized filter strategy; filtering disabled"}
{"level":"info","ts":1668017422.0179584,"logger":"setup","msg":"Starting server..."}
{"level":"info","ts":1668017422.0202508,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127338.1837273,"logger":"allocator","msg":"false","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1665127338.1838017,"logger":"allocator","msg":"Collector pod watch event stopped no event","component":"opentelemetry-targetallocator"}

This meant the pod watcher was no longer running, so the TA kept assigning the remaining targets to pods that had already been terminated.
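
For context, this is roughly the failure mode, sketched with illustrative names (k8sClient, namespace, selector, and handlePodEvent are stand-ins, not the actual target-allocator code): when the API server ends the long-running watch, the result channel closes, the goroutine returns, and nothing re-establishes the watch.

```go
// Illustrative sketch of the pre-fix behavior; identifiers are stand-ins,
// not the actual cmd/otel-allocator code.
package collector

import (
	"context"

	"github.com/go-logr/logr"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

func watchCollectorPods(ctx context.Context, log logr.Logger, k8sClient kubernetes.Interface,
	namespace, selector string, handlePodEvent func(watch.Event)) {
	go func() {
		watcher, err := k8sClient.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
			LabelSelector: selector,
		})
		if err != nil {
			log.Error(err, "unable to watch collector pods")
			return
		}
		for {
			event, ok := <-watcher.ResultChan()
			if !ok {
				// The API server eventually ends long-running watches; without a
				// restart, the goroutine exits here and later pod churn (e.g. an
				// HPA scale-down) is never observed.
				log.Info("Collector pod watch event stopped", "reason", "no event")
				return
			}
			handlePodEvent(event)
		}
	}()
}
```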

After fix:

{"level":"info","ts":1668017421.9970064,"msg":"Starting the Target Allocator"}
{"level":"info","ts":1668017421.9972935,"logger":"allocator","msg":"Unrecognized filter strategy; filtering disabled"}
{"level":"info","ts":1668017422.0179584,"logger":"setup","msg":"Starting server..."}
{"level":"info","ts":1668017422.0202508,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668020290.9531963,"logger":"allocator","msg":"No event found. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668020290.9575274,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668023632.289222,"logger":"allocator","msg":"No event found. Restarting watch routine","component":"opentelemetry-targetallocator"}
{"level":"info","ts":1668023632.292879,"logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}

The pod watcher now gets restarted until the collector pods stabilize; confirmed that the TA assigns the remaining targets to the correct, still-running pods.

@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from 24da7ae to 158a13d on November 9, 2022 20:22
@moh-osman3 marked this pull request as ready for review on November 9, 2022 20:23
@moh-osman3 requested a review from a team on November 9, 2022 20:23
@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch 2 times, most recently from 286ae82 to 9c84235, on December 7, 2022 00:51
@@ -87,6 +86,9 @@ func (k *Client) Watch(ctx context.Context, labelMap map[string]string, fn func(

	go func() {
		for {
			// add timeout to the context before calling Watch
			ctx, cancel := context.WithTimeout(ctx, watcherTimeout)
			defer cancel()
A reviewer (Member) commented:

Is this going to accumulate deferred executions for every iteration through this infinite loop? Do we need to defer execution here? Can it simply be not executed or can it be executed without deferral at the end of each loop iteration?
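
As a standalone illustration of the concern (not code from this PR): defer runs when the enclosing function returns, so a defer inside a long-running loop keeps stacking up instead of running at the end of each iteration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx := context.Background()
	for i := 0; i < 3; i++ {
		// Each iteration registers another deferred cancel, but none of them
		// run until main returns -- in an infinite loop they never run, so the
		// timer and resources behind each child context are never released.
		childCtx, cancel := context.WithTimeout(ctx, time.Minute)
		defer cancel()
		deadline, _ := childCtx.Deadline()
		fmt.Println("iteration", i, "deadline:", deadline)
	}
	// Calling cancel() explicitly per iteration, or moving the body into its
	// own function, releases each context promptly instead.
}
```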

moh-osman3 (Contributor, author) replied:

Thanks for the review! Hmm, I think if the operation returns before the timeout (no timeout triggered), then a call to cancel() is still needed for cleanup. I agree I shouldn't pile up a bunch of defer statements and should call cancel() as soon as possible. If I add cancel() at the end of the for loop and before the returns, I think there's still a chance that, if a panic occurs, cancel() won't be called without defer.

To solve this I wrapped this whole block into its own method called restartWatch and that way defer can be called properly. Let me know if you have any thoughts on this!
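
A minimal sketch of that shape, with assumed names and signatures (watcherTimeout's value, the string label selector, and the onEvent callback are illustrative, not the exact code merged here): each watch session lives in its own function call, so defer cancel() and watcher.Stop() fire as soon as that call returns rather than accumulating in the caller's loop.

```go
package collector

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

const watcherTimeout = 15 * time.Minute // assumed value, for the sketch only

// restartWatch runs a single watch session. Because it is its own function,
// the deferred cancel and Stop run as soon as the session ends. It returns
// true when the caller should start a new session.
func restartWatch(parent context.Context, k8sClient kubernetes.Interface,
	namespace, selector string, onEvent func(watch.Event)) bool {
	ctx, cancel := context.WithTimeout(parent, watcherTimeout)
	defer cancel()

	watcher, err := k8sClient.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: selector,
	})
	if err != nil {
		return false
	}
	defer watcher.Stop()

	for {
		select {
		case <-parent.Done():
			// The allocator itself is shutting down: stop for good.
			return false
		case event, ok := <-watcher.ResultChan():
			if !ok {
				// Channel closed with no event (e.g. the per-session timeout
				// expired): ask the caller to restart the watch routine.
				return true
			}
			onEvent(event)
		}
	}
}

// The caller simply loops, re-establishing the watch whenever a session ends
// without a terminal error.
func runPodWatcher(ctx context.Context, k8sClient kubernetes.Interface,
	namespace, selector string, onEvent func(watch.Event)) {
	for restartWatch(ctx, k8sClient, namespace, selector, onEvent) {
	}
}
```

With this shape, the "No event found. Restarting watch routine" / "Successfully started a collector pod watcher" pair in the post-fix logs corresponds to one trip around the caller's loop.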

@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from 5577c7c to d09c5ee on December 7, 2022 21:38
@moh-osman3 requested a review from Aneurysm9 on December 8, 2022 18:19
@kristinapathak (Contributor) left a comment:

Other than the logging change (which can also be done in a separate PR if we want), this looks awesome!

Review thread on cmd/otel-allocator/collector/collector.go (outdated; resolved)
@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from 2279a6d to b29d850 on December 12, 2022 18:14
@moh-osman3 force-pushed the mohosman/issue-1028/restart-podwatcher-on-noevent branch from b29d850 to 45d668b on December 18, 2022 19:57
@moh-osman3 requested a review from a team on December 18, 2022 20:15
@pavolloffay merged commit a89d0d6 into open-telemetry:main on Dec 19, 2022
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…elemetry#1237)

* naive fix

* unit test for close channel

* update unit tests, timeout option still not working as expected

* gofmt and removed unused block

* fix more lint errors

* more lint

* add timeout to context instead

* gofmt

* move logic for starting watch to own function

* gofmt

* add timeoutSeconds to test struct

* remove repeated logger declarations

* add chloggen

Successfully merging this pull request may close these issues.

[target-allocator] targets assigned to old pod after HPA scaled down