Fix deadlock when Telegraf is aligning aggregators #5612
Merged
If Telegraf is asked to stop (SIGTERM, SIGINT) while it is aligning an aggregator, internal.SleepContext will fail and the goroutine will return without closing the aggregations channel, so the consumer gets stuck forever in for metric := range aggregations.
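The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual Telegraf code or the diff in this PR: sleepContext is a stand-in written for this example that only mimics the behaviour of internal.SleepContext described above, and runAggregator, consumerFinishes and the closeOnExit flag are hypothetical names. It assumes that one way to avoid the hang is to close the channel with defer so that every return path closes it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepContext is a stand-in for internal.SleepContext (assumed behaviour):
// it waits for d, or returns the context error if ctx is cancelled first.
func sleepContext(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runAggregator is a hypothetical producer. With closeOnExit=false it
// reproduces the bug: the early return leaves the channel open.
func runAggregator(ctx context.Context, align time.Duration, out chan<- string, closeOnExit bool) {
	if closeOnExit {
		defer close(out) // runs on every return path, including the early one
	}
	if err := sleepContext(ctx, align); err != nil {
		return // shutdown requested while aligning
	}
	out <- "aggregated metric"
}

// consumerFinishes reports whether a consumer ranging over the channel
// exits shortly after the context is cancelled during alignment.
func consumerFinishes(closeOnExit bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	aggregations := make(chan string)
	done := make(chan struct{})

	go runAggregator(ctx, time.Minute, aggregations, closeOnExit)
	go func() {
		for range aggregations {
			// drain metrics until the channel is closed
		}
		close(done)
	}()

	cancel() // simulate SIGTERM/SIGINT arriving during alignment
	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		return false // consumer is still blocked in the range loop
	}
}

func main() {
	fmt.Println("without close:", consumerFinishes(false))
	fmt.Println("with deferred close:", consumerFinishes(true))
}
```

In the first case the consumer goroutine stays blocked in its range loop because nobody closes the channel; in the second the deferred close lets the loop end as soon as the producer returns.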
Required for all PRs:
Sample config file to test the problem:
Run telegraf:
Kill it before 60s:
It won't finish.
Here telegraf.log is the log of the execution, mixed with the output of kill -ABRT, which was used to check the status of the goroutines:
goroutine 58 (agent.runAggregators) is the problematic one. It is stuck, so it is not closing the channel used by the outputs.
goroutine 59 (agent.runOutputs) is waiting until the aggregator closes the input channel.
goroutine 60 (agent.flush) should be stopped by runOutputs.
goroutine 36 (ticker) is going to stop when agent.flushOnce finishes. Meanwhile it keeps one goroutine running, so Go cannot detect the deadlock.
goroutine 51 (opencensus) is strange. I have not configured anything related to opencensus. Looking at the code, that library has an init() function that starts its own goroutine with a time Ticker. At first glance, starting a goroutine in init() does not look like a good pattern; in fact, using init() at all does not look like a good pattern. Removing this import does not solve the undetected-deadlock problem.
It might be a good idea to check that the vendored libraries have no init() functions, but that may be too much.