Added cleanup if service errors on Start #6351
Conversation
Codecov Report
Base: 91.41% // Head: 91.41%
Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #6351   +/-  ##
=======================================
  Coverage   91.41%   91.41%
=======================================
  Files         235      235
  Lines       13466    13473    +7
=======================================
+ Hits        12310    12317    +7
  Misses        933      933
  Partials      223      223
```
☔ View full report at Codecov.
This looks OK, but I don't remember why we did not move this. I think it is because we cannot restart everything, hence we cannot shut down everything unless we know for sure we have to turn off completely.
@bogdandrutu I'm not sure I completely follow this. My understanding is that within a single process it's OK to stop an instance of the collector and start a new one as many times as you'd like. If an instance of the collector fails to start, because the service it wraps fails to init or start, it can leave state around that would prevent another instance from starting. I also just noticed something in the loop in …; I can remove it and call it separately if you prefer. Let me know your thoughts.
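A minimal sketch of the cleanup being proposed, assuming a service-like type with `Start`/`Shutdown` methods; `starter` and `startOrCleanup` are illustrative names, not the collector's actual API:

```go
package collectorsketch

import (
	"context"

	"go.uber.org/multierr"
)

// starter stands in for the collector's internal service type.
type starter interface {
	Start(ctx context.Context) error
	Shutdown(ctx context.Context) error
}

// startOrCleanup starts srv and, if Start fails partway through, shuts it
// back down so no leftover state (such as the telemetry initializer's
// global registrations) blocks the next instance from starting.
func startOrCleanup(ctx context.Context, srv starter) error {
	if err := srv.Start(ctx); err != nil {
		// Best-effort cleanup; report both errors to the caller.
		return multierr.Append(err, srv.Shutdown(ctx))
	}
	return nil
}
```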
Force-pushed from e409017 to 58ffdab
@bogdandrutu Would you mind reviewing this now that it's out of draft, since you took a look earlier?
Force-pushed from 58ffdab to d40859f
Please add a test that, with the same "collector" instance, you can start the collector, see the metrics, then "reload" the config (which stops/starts the service), and see the metrics again.
Update: If it's simpler, just use one "telemetryInitializer", start/stop it two times, and check that metrics initialization works properly the second time. A sketch of this flow follows below.
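A sketch of the requested flow, assuming testify and the collector's default internal metrics port (8888); `newTestService` is a hypothetical constructor that builds a service on a shared `telemetryInitializer`:

```go
package service

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// assertMetricsServed checks that the internal telemetry endpoint is up.
// Port 8888 is an assumption; a real test would read it from the config.
func assertMetricsServed(t *testing.T, port int) {
	resp, err := http.Get(fmt.Sprintf("http://localhost:%d/metrics", port))
	require.NoError(t, err)
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	require.NoError(t, err)
	assert.NotEmpty(t, body)
}

// Start / check metrics / stop, twice, against services that share one
// telemetryInitializer.
func TestServiceTelemetryRestart(t *testing.T) {
	for i := 0; i < 2; i++ {
		srv := newTestService(t) // hypothetical constructor
		require.NoError(t, srv.Start(context.Background()))
		assertMetricsServed(t, 8888)
		require.NoError(t, srv.Shutdown(context.Background()))
	}
}
```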
@bogdandrutu I think the test I wrote for #6239 here covers this. I wasn't sure how to trigger a service start failure in one run and not in another to target this fix, so I figured the tests in #6239 covered the specific issue of the telemetryInitializer colliding with a previous version. Let me know if you think that test doesn't cover what you're looking for.
Not really, since I want to start / check it works / stop / start / check it works / stop the same instance, not a different instance, and also not an invalid one.
@bogdandrutu I just want to make sure I understand: do you want to test a single instance starting, stopping, and restarting? I didn't think restarting an instance was valid, based on your comment in issue #5084.
I can write the test; I just want to make sure I'm not writing a test that runs the collector in a way that shouldn't be supported. Maybe it's splitting hairs, since we'd be stopping the service rather than the collector, but I'm not sure that makes sense either, since the pattern right now is one service per collector. I started writing a test for starting/stopping the service multiple times, and I don't think that's valid with the current logic flow. Telemetry views are only registered in … Sorry, that's a lot of rambling; I'm curious about your thoughts, though, as I'm a bit confused about what the intended behavior should be as far as reusing the service.
Force-pushed from d40859f to 7eadbad
I think we are going in the direction of following the language pattern where, after Shutdown, you cannot call Run on the same instance; you have to create a new instance.
I stand by my words, but they were in the context of the `Collector` type. The `Collector` has a capability, which is to reload the configuration, and this is implemented by shutting down the `Service` instance and creating a new one; that part is as planned for the moment. The only issue is that the `telemetryInitializer` is shared between these `Service` instances and not re-created. Because of this, we either change the design of the `telemetryInitializer` to recreate it every time we re-create the `Service`, or we test that it is OK to start/stop/start/stop.
Let me know if this is not clear for you.
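A sketch of the lifecycle described above, with illustrative stub types standing in for the real `Collector`, `Service`, and `telemetryInitializer`:

```go
package collectorsketch

import "context"

// Illustrative stubs; the real types live in the service package.
type telemetryInitializer struct{}

func (ti *telemetryInitializer) shutdown() error { return nil } // hypothetical teardown

type serviceCfg struct{}

type svc struct{ telemetry *telemetryInitializer }

func newSvc(cfg *serviceCfg, ti *telemetryInitializer) (*svc, error) {
	return &svc{telemetry: ti}, nil
}
func (s *svc) Start(ctx context.Context) error    { return nil }
func (s *svc) Shutdown(ctx context.Context) error { return nil }

// Collector owns one telemetryInitializer for its whole lifetime; a config
// reload replaces only the service built on top of it.
type Collector struct {
	telemetry *telemetryInitializer // created once, shared across reloads
	current   *svc                  // replaced on every config reload
}

// reload stops the old service and starts a new one. It must NOT tear down
// c.telemetry, or the replacement service would start against a
// half-shutdown initializer.
func (c *Collector) reload(ctx context.Context, cfg *serviceCfg) error {
	if err := c.current.Shutdown(ctx); err != nil {
		return err
	}
	next, err := newSvc(cfg, c.telemetry)
	if err != nil {
		return err
	}
	c.current = next
	return next.Start(ctx)
}
```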
@bogdandrutu Sorry, I think I get what you want; I just want to lay out my thoughts to make sure they align with your expectation. The current design is that the `telemetryInitializer` is instanced per instance of the `Collector`. A `Collector` has a single instance of a `Service`, which wraps the `telemetryInitializer` lifecycle. The `Service` instance can be stopped and replaced in a `Collector` instance during a configuration reload. The test it seems like you're looking for should target the use case of passing the same `telemetryInitializer` to multiple services and starting/stopping them, as happens during a configuration reload. This would verify that a single `telemetryInitializer` is reusable across services. I'll go ahead and write that test, as it's not currently covered and I think it's valuable.
Correct.
Before, it was a hack, because …
service/service_test.go
Outdated

```go
// Start the service
require.NoError(t, srvOne.Start(context.Background()))

// Shutdown the service
require.NoError(t, srvOne.Shutdown(context.Background()))
```
Check in between that metrics are available at that endpoint/port ("/metrics") :) And in the next section. You will understand why I am asking :)
I see now why the telemetryInitializer shutdown is outside of the service shutdown. So, to fix this bug, since we only want to shut down everything in the error case, I moved the telemetryInitializer Shutdown back out to the collector level. That way it keeps the same contract as before, being reusable, but also cleans up in error cases.
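Continuing the illustrative types from the reload sketch above, roughly what that split looks like: a normal service shutdown leaves the shared initializer alive for reloads, while a failed start tears everything down.

```go
// startService is a sketch, not the collector's actual code. On a failed
// Start it shuts the service down AND tears down the shared initializer,
// so a later collector instance in the same process can start cleanly.
func (c *Collector) startService(ctx context.Context) error {
	if err := c.current.Start(ctx); err != nil {
		err = multierr.Append(err, c.current.Shutdown(ctx)) // go.uber.org/multierr
		return multierr.Append(err, c.telemetry.shutdown()) // hypothetical collector-level teardown
	}
	return nil
}
```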
Force-pushed from 212548c to b7435d5
@bogdandrutu I updated the test and have this working now. I will admit the current solution is not the most elegant, but it does fix the immediate bug. Over the past few PRs I've done for this, it seems like there is some work to be done on the telemetryInitializer pattern in the case of a single process starting and stopping new instances of the collector; it works fine if there is only ever one instance of the collector in a process. I would like your opinion on this, but given that the bug this PR fixes can cause a process to stop executing correctly (while still being active), and that there is a release scheduled for tomorrow, it might be worth merging this PR if there are no concerns about unintended behavior. I'd be happy to file an issue (if there isn't one already) to further discuss the telemetryInitializer lifecycle within a single process as an action item.
Let's open an issue to discuss how to fix this. Unfortunately there is a limitation coming from the telemetry solution we use (opencensus), which uses global state, so we cannot create multiple instances (and possibly cannot initialize multiple times).
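A minimal illustration of the kind of global-state collision being described, using generic opencensus-go view registration (not the collector's actual views; the view name and measure here are made up):

```go
package main

import (
	"fmt"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

func main() {
	m := stats.Int64("uptime", "illustrative measure", stats.UnitDimensionless)

	// First "instance" registers a view in opencensus's package-level registry.
	v1 := &view.View{Name: "process/uptime", Measure: m, Aggregation: view.Count()}
	fmt.Println("first register:", view.Register(v1)) // nil

	// A second "instance" building its own view under the same name collides,
	// because the registry is global to the process.
	v2 := &view.View{Name: "process/uptime", Measure: m, Aggregation: view.Sum()}
	fmt.Println("second register:", view.Register(v2)) // error: name already registered

	// Only an explicit Unregister on shutdown frees the name for reuse.
	view.Unregister(v1)
	fmt.Println("after unregister:", view.Register(v2)) // nil
}
```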
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
…rvices Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
…collector Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Force-pushed from db052c5 to 34c6954
@bogdandrutu Pushed up the suggested changes. I'll write up the issue today and link it here.
Created #6407 to discuss possible solutions for the global state kept by the internal telemetry solution.
Co-authored-by: Bogdan Drutu <lazy@splunk.com>
Description: Same as #6239. Call `service.Shutdown` if the `service.Start` call errors in the collector. A failed `service.Start` would leave the `telemetryInitializer` up, and the next attempt at starting the collector would cause an error because the `telemetryInitializer` didn't clean up global state.

I'm not sure if this interferes with work on #5564. I moved `service.telemetryInitializer.Shutdown` into `service.Shutdown`. Currently the `telemetryInitializer` needs to be shut down separately from the service; this change makes it so calling `service.Shutdown` ensures all of its state is cleaned up.

Fixes: #6352
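For illustration, a sketch of what the description outlines, reusing the hypothetical stub types from the earlier sketches (field and method names are guesses, not the real struct):

```go
// ShutdownAll is a sketch: one call releases all of the service's state,
// including the shared telemetry registrations, so a failed Start followed
// by Shutdown leaves nothing behind.
func (s *svc) ShutdownAll(ctx context.Context) error {
	var errs error
	errs = multierr.Append(errs, s.shutdownPipelines(ctx)) // hypothetical pipeline teardown
	errs = multierr.Append(errs, s.telemetry.shutdown())   // hypothetical initializer teardown
	return errs
}
```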