Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otel collector telemetry metrics server won't spawn again after restarting #5084

Closed
splunkericl opened this issue Mar 24, 2022 · 12 comments
Closed
Labels
bug Something isn't working help wanted Good issue for contributors to OpenTelemetry Service to pick up priority:p1 High

Comments

@splunkericl
Copy link
Contributor

Describe the bug
When the collector is shutdown and run again(or another collector is run again), the telemetry metrics server won't spawn again.

  • collector telemetry is shut down here when shutdown is called
  • when collector is run telemetry is init here
  • the telemetry init implementation only initialize this once

Steps to reproduce

  • create a otel collector
  • call Run with settings
  • call shutdown
  • create another otel collector
  • call Run with new settings
  • metrics server won't spawn again

What did you expect to see?
When a new collector is run again. metrics server should still be spawned

What did you see instead?
metrics server no longer spawns due to initOnce call.

What version did you use?
Version: v0.46.0

What config did you use?
Any valid config

Environment
OS: macOS catalina
Compiler(if manually compiled): go 1.17.3

Additional context
Add any other context about the problem here.

@splunkericl splunkericl added the bug Something isn't working label Mar 24, 2022
@tigrannajaryan
Copy link
Member

call shutdown
Are you calling Shutdown? As far as I remember that puts the Service in a terminal state. It is not expected that Run will work again after that.

More importantly, why are you doing this? If you want to reload the Service with new config you need to use the ConfigProvider.Watch capability.

@splunkericl
Copy link
Contributor Author

Even if we use the watch method, the internal implementation for watch is similar. See:

if err = col.service.Shutdown(ctx); err != nil {
return fmt.Errorf("failed to shutdown the retiring config: %w", err)
}
if err = col.setupConfigurationComponents(ctx); err != nil {

It would invoke the shutdown function and then Run with the new settings.

@tigrannajaryan
Copy link
Member

It would invoke the shutdown function and then Run with the new settings.

OK, this looks like a legitimate bug in that case.

In any case I do not think we support calling Run after Shutdown right now. If you do that it may break in the future. I would advise to either use the watch approach or if you need Run to work after Shutdown we need a justification for that and we need to prescribe it in the contract of Service API and have tests that verify it.

Either way we need to fix this bug. We need to make sure telemetry works after a legitimate restart of the Service.

@dmitryax dmitryax added priority:p1 High help wanted Good issue for contributors to OpenTelemetry Service to pick up labels Mar 25, 2022
@TylerHelmuth
Copy link
Member

@tigrannajaryan do you know why Telemetry is only initialized once? I found this TODO where the initialize is called

// TODO: This should be part of the service initialization, which should be responsible to create TelemetrySettings.
// For the moment happens here, since it needs service.Config and Logger.
// It is called once because that is how it is implemented using sync.Once.
if err = collectorTelemetry.init(col); err != nil {
return err
}

But it doesn't say why Once is used.

@tigrannajaryan
Copy link
Member

@tigrannajaryan do you know why Telemetry is only initialized once? I found this TODO where the initialize is called

No, I don't know, I don't remember this part of the codebase. @bogdandrutu may know.

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Apr 5, 2022

I see that the metrics server is started in a goroutine in Telemetry's initOnce, could be related to that

@Aneurysm9
Copy link
Member

It may have related to the use of global telemetry providers as well, which is being (has been?) refactored. Unfortunately it was this way before I started working in that area of the code and have felt a bit like it's been dragged around blindly mimicking the previous state through a few changes. I'm wary of Chesterton's Fence-type issues, but we should be able to reason through this.

@TylerHelmuth
Copy link
Member

I'll dive deep into it and see if I can find things that would break if it was called multiple times.

With my initial testing today I didn't see any indication that the http server goroutine is shutting down gracefully when the collector shuts down, but I need to keep testing.

@jpkrohling
Copy link
Member

I'll dive deep into it and see if I can find things that would break if it was called multiple times.

Perhaps I'm way off here, but OpenCensus required views to be registered only once. Is that maybe the background for this?

@bogdandrutu
Copy link
Member

There are some miss understandings here:

  1. This theory is wrong otel collector telemetry metrics server won't spawn again after restarting #5084 (comment) because the shutdown is the "Service" shutdown, which does not stops the metrics server. So in summary you cannot reproduce this problem with watch.
  2. The reason why this is global (happens once) is because we use OpenCensus (because otel metrics is not ready, we can ask one more time in that repo why and when), which uses a global that you cannot initialize twice.

There are some discussions (see #4947) and issues to actually make it more clear that the Collector will follow the language patter for Shutdown, see http.Server.Shutdown: When Shutdown is called, Serve, ListenAndServe, and ListenAndServeTLS immediately return ErrServerClosed. which means you cannot call Run again.

I think we are going to the direction to follow the language pattern which is after Shutdown you cannot call Run on the same instance, you have to create a new instance.

@TylerHelmuth
Copy link
Member

Looking into this issue I agree with @bogdandrutu. Assuming calling Run after Shutdown is not allowed, the existing outcome is acceptable. Digging into the Watch scenario, the Shutdown that is called doesn't do anything with the running metrics server. When the collector "restarts" the original telemetry instance will still be running. I think this issue could be closed.

One interesting thing is that any new Telemetry config settings will not be honored. If we want to be able to load new telemetry settings without a new instance of a collector I think we would need changes. I would say thats a different issue/feature request though.

@bogdandrutu
Copy link
Member

This should be fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Good issue for contributors to OpenTelemetry Service to pick up priority:p1 High
Projects
None yet
Development

No branches or pull requests

7 participants