Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added cleanup if service errors on Start #6351

Merged
merged 8 commits into from
Oct 26, 2022

Conversation

cpheps
Copy link
Contributor

@cpheps cpheps commented Oct 19, 2022

Description: Same as #6239. Call service.Shutdown if service.Start call errors in collector. A failed service.Start would leave up the telemetryInitializer and the next attempt at starting the collector would cause an error as the telemetryInitializer didn't clean up global state.

I'm not sure if this interferes with work on #5564. I moved the service.telemetryInitializer.Shutdown into service.Shutdown. Currently it the telemetryInitializer needs to be shutdown separately from the service. This changes makes it so calling service.Shutdown ensures all its state is cleaned up.

Fixes: #6352

@codecov
Copy link

codecov bot commented Oct 19, 2022

Codecov Report

Base: 91.41% // Head: 91.41% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (34c6954) compared to base (36d142f).
Patch coverage: 81.81% of modified lines in pull request are covered.

❗ Current head 34c6954 differs from pull request most recent head af2d6e0. Consider uploading reports for the commit af2d6e0 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6351   +/-   ##
=======================================
  Coverage   91.41%   91.41%           
=======================================
  Files         235      235           
  Lines       13466    13473    +7     
=======================================
+ Hits        12310    12317    +7     
  Misses        933      933           
  Partials      223      223           
Impacted Files Coverage Δ
service/service.go 69.44% <ø> (ø)
service/collector.go 78.26% <81.81%> (+1.16%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@bogdandrutu
Copy link
Member

this looks ok, but I don't remember why we did not move this. I think it is because we cannot restart everything, so hence we cannot shutdown everything unless we know for sure we have to turn off completely

@cpheps
Copy link
Contributor Author

cpheps commented Oct 19, 2022

I think it is because we cannot restart everything, so hence we cannot shutdown everything unless we know for sure we have to turn off completely

@bogdandrutu I'm not sure I completely follow this. My understanding is within a single process it's ok to stop an instance of the collector and start a new one as many times as you'd like.

If for some reason an instance of the collector fails to start due to the service it's wrapping failing to init or start it can leave state around that would prevent another instance from starting.

I actually just noticed too in the loop in Run if the ConfigProvider.Watch case is triggered currently the telemetryInitializer isn't shut down here so a new service can't be started. It seems like putting the telemetryInitializer shutdown in the service shutdown avoids having to know they both need to be shutdown together.

I can remove it and call it separately if you prefer. Let me know your thoughts.

@cpheps cpheps marked this pull request as ready for review October 20, 2022 12:03
@cpheps cpheps requested review from a team and bogdandrutu October 20, 2022 12:03
@cpheps cpheps force-pushed the service-lifecycle-fix branch 2 times, most recently from e409017 to 58ffdab Compare October 21, 2022 12:23
@cpheps
Copy link
Contributor Author

cpheps commented Oct 21, 2022

@bogdandrutu Would you mind reviewing this now that it's out of draft since you took a look earlier?

@cpheps cpheps force-pushed the service-lifecycle-fix branch from 58ffdab to d40859f Compare October 24, 2022 12:16
Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test, that with same "collector" instance, you can start the collector, see the metrics, then "reload" config (which stops/starts service) and you can see again metrics.

Update: If simpler just one "telemetryInitializer" and start/stop two times, and check that metrics initialization works properly second time.

@cpheps
Copy link
Contributor Author

cpheps commented Oct 24, 2022

If simpler just one "telemetryInitializer" and start/stop two times, and check that metrics initialization works properly second time.

@bogdandrutu I think the test I wrote for #6239 here covers this. I wasn't sure how to trigger a service start failure in one run and not in another to target this fix. I figured the tests in #6239 covered the specific issue of the telemetryInitializer colliding with a previous version.

Let me know if you think that test doesn't cover what you're looking for.

@bogdandrutu
Copy link
Member

Let me know if you think that test doesn't cover what you're looking for.

Not really, since I want to start/check works/stop/start/check works/stop same instance, not different instance, also not one invalid.

@cpheps
Copy link
Contributor Author

cpheps commented Oct 25, 2022

Not really, since I want to start/check works/stop/start/check works/stop same instance, not different instance, also not one invalid.

@bogdandrutu I just want to make sure I understand. Do you want to test a single instance starting, stopping, restarting? I didn't think restarting an instance was valid from your comment in issue #5084.

I think we are going to the direction to follow the language pattern which is after Shutdown you cannot call Run on the same instance, you have to create a new instance.

I can do the test I just want to make sure I'm not writing a test that runs the collector in a way that should not be supported.

Maybe it's splitting hairs since we'd be stopping the service rather than the collector. I'm not sure that makes sense though either since the pattern right now is one service per one collector.

I started writing up a test for the service start/stop multiple times and I don't think that's valid with the current logic flow. Telemetry views are only registered in newService. Which means either service.Shutdown should not unregister telemetryInitializer views if we want the service to be reusable or the service cannot be reusable like the collector. I think having the service be a one time use makes more sense as this is how we want the Collector to work.

Sorry that's a lot of rambling curious your thoughts though as I'm a bit confused on what the intended behavior should be as far as reusing the service.

@cpheps cpheps force-pushed the service-lifecycle-fix branch from d40859f to 7eadbad Compare October 25, 2022 12:38
Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are going to the direction to follow the language pattern which is after Shutdown you cannot call Run on the same instance, you have to create a new instance.

For sure I stand by my words, these were in the context of Collector type. The Collector has a capability, which is to reload the configuration, and this is implemented by shutting down the Service instance and creating a new one, everything as planned for the moment. The only thing is that the telemetryInitializer is shared between these Service instances and not re-created, because of these we either change the design of the telemetryInitializer to recreate it every time we re-create the Service, or we test that it is ok to start/stop/start/stop.

Let me know if this is not clear for you.

@cpheps
Copy link
Contributor Author

cpheps commented Oct 25, 2022

The only thing is that the telemetryInitializer is shared between these Service instances and not re-created, because of these we either change the design of the telemetryInitializer to recreate it every time we re-create the Service, or we test that it is ok to start/stop/start/stop.

@bogdandrutu Sorry I think I get what you want I just want to lay out my thoughts make sure they align with your expectation.

The current design is the telemetryInitializer is instanced per instance of the Collector. A Collector has a single instance of a service which wraps the telemetryInitializer lifecycle. The service instance can be stopped and replaced in a Collector instance during a configuration reload.

The test it seems like you're looking for should target the use case of passing the same telemetryInitializer to multiple services and starting/stopping them, as what happens during a configuration reload. This would verify a single telemetryInitializer is reusable in services.

I'll go ahead and make that test as it's not currently covered and I think is valuable.

@bogdandrutu
Copy link
Member

The test it seems like you're looking for should target the use case of passing the same telemetryInitializer to multiple services and starting/stopping them, as what happens during a configuration reload. This would verify a single telemetryInitializer is reusable in services.

Correct

@bogdandrutu
Copy link
Member

Before it was a hack, because telemetryInitializer was initialize only first time (see the once variable inside the telemetryInitializer) and was shutdown only once when the collector was shutdown :)) So that is why I am worried (I know) that your change does not work, so to fix this you need way more things :)

Comment on lines 163 to 183
// Start the service
require.NoError(t, srvOne.Start(context.Background()))

// Shutdown the service
require.NoError(t, srvOne.Shutdown(context.Background()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check in between that metrics are available at that endpoint/port "/metrics" :) And in the next section. You will understand why I am asking :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why now the telemetryInitializer shutdown is outside of the service shutdown. So I think to fix this bug as we only want to shut down everything in the error case I move the telemetryInitializer Shutdown back out to the collector level. That was it keeps the same contract before as being reusable but also cleans up in error cases.

@cpheps cpheps force-pushed the service-lifecycle-fix branch from 212548c to b7435d5 Compare October 25, 2022 18:56
@cpheps
Copy link
Contributor Author

cpheps commented Oct 25, 2022

@bogdandrutu I updated the test and have this working now. I will admit the current solution is not the most elegant but it does fix the immediate bug.

Over the past few PRs I've done for this it seems like there is some work to be done on the telemetryInitializer pattern in the case of a single process starting and stopping new instances of the collector. It works fine if there is only ever one instance of the collector in a process.

I would like your opinion on this but due to the fact that the bug this PR fixes can cause a process to stop executing correctly (but still be active), and that there is a release schedule for tomorrow, that it's might be worth merging this PR if there's not concerns with unintended behavior. I'd be happy to file an issue (if there isn't one already) to further discuss the telemetryInitializer lifecycle within a single process as an action item.

service/service_test.go Outdated Show resolved Hide resolved
service/collector.go Outdated Show resolved Hide resolved
service/collector.go Outdated Show resolved Hide resolved
@bogdandrutu
Copy link
Member

Over the past few PRs I've done for this it seems like there is some work to be done on the telemetryInitializer pattern in the case of a single process starting and stopping new instances of the collector. It works fine if there is only ever one instance of the collector in a process.
I would like your opinion on this but due to the fact that the bug this PR fixes can cause a process to stop executing correctly (but still be active), and that there is a release schedule for tomorrow, that it's might be worth merging this PR if there's not concerns with unintended behavior. I'd be happy to file an issue (if there isn't one already) to further discuss the telemetryInitializer lifecycle within a single process as an action item.

Let's open an issue to discuss how to fix this. Unfortunately there is a limitation coming from the telemetry solution we use (opencensus) which uses a global state, so cannot initiate multiple instances (and possible cannot initiate multiple times).

Corbin Phelps added 5 commits October 26, 2022 07:58
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
…rvices

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Corbin Phelps added 2 commits October 26, 2022 07:58
…collector

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
@cpheps cpheps force-pushed the service-lifecycle-fix branch from db052c5 to 34c6954 Compare October 26, 2022 12:02
@cpheps
Copy link
Contributor Author

cpheps commented Oct 26, 2022

@bogdandrutu pushed up suggested changes. I'll write up the issue today and link it here.

@cpheps
Copy link
Contributor Author

cpheps commented Oct 26, 2022

Created #6407 to discuss possible solutions for the global state kept for the internal telemetry solution.

service/collector.go Outdated Show resolved Hide resolved
Co-authored-by: Bogdan Drutu <lazy@splunk.com>
@bogdandrutu bogdandrutu merged commit 67bdf67 into open-telemetry:main Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Service state is not cleaned up when service errors on Start
2 participants