Added cleanup if service errors on Start #6351

cpheps · 2022-10-19T00:48:42Z

Description: Same as #6239. Call service.Shutdown if the service.Start call errors in the collector. A failed service.Start would leave the telemetryInitializer up, and the next attempt at starting the collector would fail because the telemetryInitializer had not cleaned up its global state.

I'm not sure if this interferes with work on #5564. I moved the service.telemetryInitializer.Shutdown call into service.Shutdown. Currently the telemetryInitializer needs to be shut down separately from the service. This change makes it so that calling service.Shutdown ensures all of its state is cleaned up.

Fixes: #6352

codecov · 2022-10-19T00:54:44Z

Codecov Report

Base: 91.41% // Head: 91.41% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (34c6954) compared to base (36d142f).
Patch coverage: 81.81% of modified lines in pull request are covered.

❗ Current head 34c6954 differs from pull request most recent head af2d6e0. Consider uploading reports for the commit af2d6e0 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #6351   +/-   ##
=======================================
  Coverage   91.41%   91.41%           
=======================================
  Files         235      235           
  Lines       13466    13473    +7     
=======================================
+ Hits        12310    12317    +7     
  Misses        933      933           
  Partials      223      223

Impacted Files	Coverage Δ
service/service.go	`69.44% <ø> (ø)`
service/collector.go	`78.26% <81.81%> (+1.16%)`	⬆️

bogdandrutu · 2022-10-19T01:44:02Z

This looks OK, but I don't remember why we did not move this. I think it is because we cannot restart everything, so we cannot shut down everything unless we know for sure we have to turn off completely.

cpheps · 2022-10-19T12:05:33Z

> I think it is because we cannot restart everything, so we cannot shut down everything unless we know for sure we have to turn off completely

@bogdandrutu I'm not sure I completely follow this. My understanding is that within a single process it's OK to stop an instance of the collector and start a new one as many times as you'd like.

If an instance of the collector fails to start because the service it wraps fails to init or start, it can leave state around that would prevent another instance from starting.

I also just noticed that in the loop in Run, if the ConfigProvider.Watch case is triggered, the telemetryInitializer currently isn't shut down here, so a new service can't be started. It seems like putting the telemetryInitializer shutdown in the service shutdown avoids having to know that both need to be shut down together.

I can remove it and call it separately if you prefer. Let me know your thoughts.

cpheps · 2022-10-21T12:24:38Z

@bogdandrutu Would you mind reviewing this now that it's out of draft since you took a look earlier?

bogdandrutu

Please add a test that, with the same "collector" instance, starts the collector, sees the metrics, then "reloads" the config (which stops/starts the service) and sees the metrics again.

Update: If simpler, use just one "telemetryInitializer" and start/stop two times, then check that metrics initialization works properly the second time.

cpheps · 2022-10-24T17:37:36Z

> If simpler, use just one "telemetryInitializer" and start/stop two times, then check that metrics initialization works properly the second time.

@bogdandrutu I think the test I wrote for #6239 here covers this. I wasn't sure how to trigger a service start failure in one run and not in another to target this fix. I figured the tests in #6239 covered the specific issue of the telemetryInitializer colliding with a previous version.

Let me know if you think that test doesn't cover what you're looking for.

bogdandrutu · 2022-10-24T19:28:10Z

> Let me know if you think that test doesn't cover what you're looking for.

Not really. I want to start / check it works / stop / start / check it works / stop on the same instance, not on different instances, and without an invalid one.

cpheps · 2022-10-25T12:03:45Z

> Not really. I want to start / check it works / stop / start / check it works / stop on the same instance, not on different instances, and without an invalid one.

@bogdandrutu I just want to make sure I understand. Do you want to test a single instance starting, stopping, restarting? I didn't think restarting an instance was valid from your comment in issue #5084.

> I think we are going in the direction of following the language pattern, which is that after Shutdown you cannot call Run on the same instance; you have to create a new instance.

I can do the test; I just want to make sure I'm not writing a test that runs the collector in a way that should not be supported.

Maybe it's splitting hairs since we'd be stopping the service rather than the collector. I'm not sure that makes sense though either since the pattern right now is one service per one collector.

I started writing up a test for starting/stopping the service multiple times, and I don't think that's valid with the current logic flow. Telemetry views are only registered in newService, which means either service.Shutdown should not unregister the telemetryInitializer views if we want the service to be reusable, or the service cannot be reusable, like the collector. I think having the service be one-time use makes more sense, as this is how we want the Collector to work.

Sorry, that's a lot of rambling. I'm curious about your thoughts, though, as I'm a bit confused about what the intended behavior should be as far as reusing the service.

bogdandrutu

> I think we are going in the direction of following the language pattern, which is that after Shutdown you cannot call Run on the same instance; you have to create a new instance.

For sure I stand by my words; these were in the context of the Collector type. The Collector has a capability, which is to reload the configuration, and this is implemented by shutting down the Service instance and creating a new one, everything as planned for the moment. The only thing is that the telemetryInitializer is shared between these Service instances and not re-created. Because of this, we either change the design of the telemetryInitializer to recreate it every time we re-create the Service, or we test that it is OK to start/stop/start/stop.

Let me know if this is not clear for you.

cpheps · 2022-10-25T15:41:38Z

> The only thing is that the telemetryInitializer is shared between these Service instances and not re-created. Because of this, we either change the design of the telemetryInitializer to recreate it every time we re-create the Service, or we test that it is OK to start/stop/start/stop.

@bogdandrutu Sorry, I think I get what you want. I just want to lay out my thoughts to make sure they align with your expectations.

In the current design, one telemetryInitializer is created per instance of the Collector. A Collector has a single instance of a service, which wraps the telemetryInitializer lifecycle. The service instance can be stopped and replaced in a Collector instance during a configuration reload.

The test it seems like you're looking for should target the use case of passing the same telemetryInitializer to multiple services and starting/stopping them, as happens during a configuration reload. This would verify that a single telemetryInitializer is reusable across services.

I'll go ahead and make that test as it's not currently covered and I think is valuable.

bogdandrutu · 2022-10-25T18:28:07Z

> The test it seems like you're looking for should target the use case of passing the same telemetryInitializer to multiple services and starting/stopping them, as happens during a configuration reload. This would verify that a single telemetryInitializer is reusable across services.

Correct

bogdandrutu · 2022-10-25T18:30:37Z

Before, it was a hack, because the telemetryInitializer was initialized only the first time (see the once variable inside the telemetryInitializer) and was shut down only once, when the collector was shut down :)) So that is why I am worried (I know) that your change does not work, and to fix this you need way more things :)
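
The concern about the once variable can be reproduced in isolation. In the sketch below, a simplified `telemetryInitializer` guards its setup with `sync.Once`; the `active` field and the method names are assumptions for illustration only. Once shutdown has run, a later init call does nothing, because `sync.Once` never runs its function a second time.

```go
package main

import (
	"fmt"
	"sync"
)

// telemetryInitializer is a simplified stand-in guarded by sync.Once.
type telemetryInitializer struct {
	once   sync.Once
	active bool
}

// init runs the setup only on the first call for this instance.
func (t *telemetryInitializer) init() {
	t.once.Do(func() { t.active = true })
}

// shutdown tears telemetry down but cannot reset the Once.
func (t *telemetryInitializer) shutdown() { t.active = false }

func main() {
	tel := &telemetryInitializer{}

	tel.init()
	fmt.Println("after first init:", tel.active)

	tel.shutdown()
	fmt.Println("after shutdown:", tel.active)

	// The Once has already fired, so this call is a no-op.
	tel.init()
	fmt.Println("after second init:", tel.active)
}
```

In this sketch, shutting telemetry down on every service Shutdown would leave the next service without telemetry, which is the behavior the comment warns about.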

bogdandrutu · 2022-10-25T18:31:56Z

service/service_test.go

+	// Start the service
+	require.NoError(t, srvOne.Start(context.Background()))
+
+	// Shutdown the service
+	require.NoError(t, srvOne.Shutdown(context.Background()))


Check in between that metrics are available at the "/metrics" endpoint/port :) And do the same in the next section. You will understand why I am asking :)

cpheps · 2022-10-25

I see now why the telemetryInitializer shutdown is outside of the service shutdown. Since we only want to shut down everything in the error case, I think the way to fix this bug is to move the telemetryInitializer Shutdown back out to the collector level. That way it keeps the same contract as before (being reusable) but also cleans up in error cases.

cpheps · 2022-10-25T19:04:58Z

@bogdandrutu I updated the test and have this working now. I will admit the current solution is not the most elegant but it does fix the immediate bug.

Over the past few PRs I've done for this it seems like there is some work to be done on the telemetryInitializer pattern in the case of a single process starting and stopping new instances of the collector. It works fine if there is only ever one instance of the collector in a process.

I would like your opinion on this, but since the bug this PR fixes can cause a process to stop executing correctly (while still being active), and since there is a release scheduled for tomorrow, it might be worth merging this PR if there are no concerns about unintended behavior. I'd be happy to file an issue (if there isn't one already) to further discuss the telemetryInitializer lifecycle within a single process as an action item.

bogdandrutu · 2022-10-25T23:42:24Z

> Over the past few PRs I've done for this it seems like there is some work to be done on the telemetryInitializer pattern in the case of a single process starting and stopping new instances of the collector. It works fine if there is only ever one instance of the collector in a process.
>
> I would like your opinion on this, but since the bug this PR fixes can cause a process to stop executing correctly (while still being active), and since there is a release scheduled for tomorrow, it might be worth merging this PR if there are no concerns about unintended behavior. I'd be happy to file an issue (if there isn't one already) to further discuss the telemetryInitializer lifecycle within a single process as an action item.

Let's open an issue to discuss how to fix this. Unfortunately there is a limitation coming from the telemetry solution we use (opencensus), which uses global state, so we cannot initiate multiple instances (and possibly cannot initiate multiple times).

cpheps · 2022-10-26T12:06:42Z

@bogdandrutu pushed up suggested changes. I'll write up the issue today and link it here.

cpheps · 2022-10-26T12:25:24Z

Created #6407 to discuss possible solutions for the global state kept for the internal telemetry solution.

paivagustavo mentioned this pull request Oct 19, 2022

[test] fix flaky telemetry zpages test #6324

Merged

cpheps marked this pull request as ready for review October 20, 2022 12:03

cpheps requested review from a team and bogdandrutu October 20, 2022 12:03

cpheps force-pushed the service-lifecycle-fix branch 2 times, most recently from e409017 to 58ffdab Compare October 21, 2022 12:23

cpheps force-pushed the service-lifecycle-fix branch from 58ffdab to d40859f Compare October 24, 2022 12:16

bogdandrutu reviewed Oct 24, 2022

cpheps force-pushed the service-lifecycle-fix branch from d40859f to 7eadbad Compare October 25, 2022 12:38

bogdandrutu requested changes Oct 25, 2022

bogdandrutu reviewed Oct 25, 2022

cpheps force-pushed the service-lifecycle-fix branch from 212548c to b7435d5 Compare October 25, 2022 18:56

bogdandrutu approved these changes Oct 25, 2022

Corbin Phelps added 5 commits October 26, 2022 07:58

Added cleanup if service errors on Start

3318bf0

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>

Added issue to changelog

e187b69

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>

Added a test to verify the reusablility of telemetryInitializer in services

5a0aeca

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>

Moved telemetryInitializer shutdown out of service shutdown

768d8a3

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>

Fixed comment

0e5db51

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>

Corbin Phelps added 2 commits October 26, 2022 07:58

Bundled service shutdown and telemetry into a helper function in the collector

8fbe447

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>

PR Fixups

34c6954

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>

cpheps force-pushed the service-lifecycle-fix branch from db052c5 to 34c6954 Compare October 26, 2022 12:02

bogdandrutu approved these changes Oct 26, 2022

Update service/collector.go

af2d6e0

Co-authored-by: Bogdan Drutu <lazy@splunk.com>

bogdandrutu merged commit 67bdf67 into open-telemetry:main Oct 26, 2022
