Fix leaks in filebeat log harvester #6809
Conversation
An internal goroutine wasn't stopped when Close() was called.
Harvester relied on Stop() being called to free resources allocated during construction. If Setup() fails for some reason, the Start() / Stop() sequence is never invoked and these resources are never released.
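For illustration, here is a minimal sketch of the lifecycle bug described above. The names are hypothetical, not the actual filebeat types: a constructor starts a goroutine that only Stop() releases, so an early Setup() failure leaks it.

```go
package main

import "fmt"

type worker struct {
	done chan struct{}
}

// newWorker allocates resources during construction: it starts a
// goroutine that exits only when done is closed.
func newWorker() *worker {
	w := &worker{done: make(chan struct{})}
	go func() {
		<-w.done // blocks until Stop() is called
		fmt.Println("goroutine released")
	}()
	return w
}

func (w *worker) Setup() error {
	// If this fails, callers skip the Start()/Stop() sequence.
	return fmt.Errorf("setup failed")
}

func (w *worker) Stop() { close(w.done) }

func main() {
	w := newWorker()
	if err := w.Setup(); err != nil {
		// Bug pattern: returning here without w.Stop() leaks the
		// goroutine (and everything it references) forever.
		return
	}
	w.Stop()
}
```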
@urso I'd like to have your opinion on the Harvester issue. To me, it would be cleaner to refactor the harvester and deal with the outlet only inside Run().
```go
	case event := <-o.ch:
		o.res <- out.OnEvent(event)
	}
}
```
This change is dangerous, as returning early on o.done might prevent the queue from being drained. Not draining o.ch can lead to a deadlock on shutdown. To avoid leaking the goroutine, o.ch must be closed properly.
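A toy reproduction of that hazard (simplified stand-in types, not the real subOutlet): when done wins the race, the loop returns while a producer is still blocked sending on o.ch.

```go
package main

import (
	"fmt"
	"time"
)

type subOutlet struct {
	ch   chan int
	res  chan bool
	done chan struct{}
}

// loop returns as soon as done is closed. Any producer blocked on
// o.ch <- event at that instant never receives a result on o.res.
func (o *subOutlet) loop() {
	for {
		select {
		case <-o.done:
			return // o.ch is abandoned, not drained
		case ev := <-o.ch:
			o.res <- (ev > 0)
		}
	}
}

func main() {
	o := &subOutlet{
		ch:   make(chan int),
		res:  make(chan bool),
		done: make(chan struct{}),
	}
	go o.loop()
	go close(o.done) // shutdown races with the send below

	select {
	case o.ch <- 1:
		fmt.Println("event accepted:", <-o.res)
	case <-time.After(100 * time.Millisecond):
		// In a real pipeline this producer would block forever; the
		// timeout only makes the sketch terminate.
		fmt.Println("loop exited without draining o.ch")
	}
}
```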
```go
	return s
}

func (o *subOutlet) drainLoop(out Outleter) {
```
maybe name it workerLoop? This loop forwards all received events to out.OnEvent.
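For illustration, a minimal sketch of the suggested workerLoop, with simplified stand-in types rather than the real filebeat interfaces: ranging over the channel means the goroutine exits only once the channel is closed, after every queued event has been forwarded.

```go
package main

import "fmt"

// Simplified stand-ins for the real filebeat types; names and shapes
// are illustrative only.
type Event struct{ msg string }

type Outleter interface {
	OnEvent(Event) bool
}

type printOutlet struct{}

func (printOutlet) OnEvent(e Event) bool {
	fmt.Println("forwarded:", e.msg)
	return true
}

type subOutlet struct {
	ch  chan Event
	res chan bool
}

// workerLoop forwards every event received on o.ch to out.OnEvent and
// publishes the result on o.res.
func (o *subOutlet) workerLoop(out Outleter) {
	for event := range o.ch {
		o.res <- out.OnEvent(event)
	}
}

func main() {
	o := &subOutlet{ch: make(chan Event), res: make(chan bool)}
	go o.workerLoop(printOutlet{})

	o.ch <- Event{msg: "hello"}
	fmt.Println("ack:", <-o.res)

	close(o.ch) // releases the workerLoop goroutine
}
```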
The SubOutlet is still a leftover to be removed in the future. Since the underlying channel is shared with the prospector to guarantee order on state updates, it cannot be refactored that easily. Plus, shutdown is potentially riddled with race conditions. Cleaning up the prospector/harvester and state handling depends on #6810.
@urso Once I get a bit of time, I will do a cleanup of the log / harvester.
Some more refactoring will be needed, but we should not block this change on it, as it's not a small undertaking.
Yep, not blocking this; it's just related, to keep note of what can be improved in longer-term work.
@ph @ruflin @adriansr The problem with this change is that I have a hard time figuring out its actual implications. We still pass loads of channels around in filebeat, and shutdown is still potentially riddled with a number of race conditions. This change indirectly affects the inner workings of prospectors, harvesters, the registry, the event publishing pipeline, and global counters on filebeat shutdown. It took us a while to get shutdown working somewhat stable (somewhat, as in stdin is still not working correctly). That is, by fixing one issue, we might re-introduce other issues from the past. Trying to fix whatever issue we have with shutdown/startup will be a never-ending story (it still is) without the required refactorings of the registry and the shutdown waiting logic (waiting has been supported by the publisher pipeline since 6.0, but can't be used by filebeat yet).

tl;dr: here be dragons
I share the concern from @urso that we are not 100% sure about the side effects of this change. But I think we should find a way forward in small steps instead of one big one, which would also introduce other issues again. Best would be if we can have tests for every step. For example, the new pipeline in 6.x in Beats is great and a required step we had to take, but it also reintroduced some issues, for example with shutdown on stdin. The same will be true for any small or major changes we make here, so it is a trade-off we have to choose. There will be dragons everywhere ;-)

@adriansr I wonder if we could have a test that failed before this change but goes green with it.

BTW: As @urso predicted, it broke the shutdown test on Windows: https://beats-ci.elastic.co/job/elastic+beats+pull-request+multijob-windows/3895/beat=filebeat,label=windows/testReport/junit/test_shutdown/Test/test_shutdown_wait_timeout/

If we get our tests green, I'm relatively confident in any changes we make, as the test suite is rather extensive.
I've submitted an alternative fix that has a lower chance of interfering with shutdown: #6829
Two independent but related causes of goroutine leaks have been found in filebeat:

- channel.SubOutlet fails to terminate an internal goroutine, which keeps references to the SubOutlet itself and the Outlet being wrapped, preventing them from being freed.
- log.Harvester's constructor receives an Outlet which is closed by Harvester.Stop() or internally after Harvester.Run() is invoked. When Harvester.Setup() fails, Stop() is not called, causing a leak. A CloseOnSignal() utility is also leaked for the same reason, which adds another goroutine leak (a sketch follows below).
Fixes #6797
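For context, here is a hedged sketch of how a CloseOnSignal-style wrapper can leak; the real helper's signature may differ, and the types are simplified stand-ins.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for filebeat's Outleter interface; the real one
// has more methods.
type Outleter interface {
	Close() error
}

type outlet struct{}

func (outlet) Close() error {
	fmt.Println("outlet closed")
	return nil
}

// closeOnSignal is a hedged sketch of a CloseOnSignal-style wrapper:
// it spawns a goroutine that closes the outlet once sig fires.
func closeOnSignal(out Outleter, sig <-chan struct{}) Outleter {
	go func() {
		// If sig never fires -- e.g. Harvester.Setup() fails and the
		// Stop() that would trigger it is never called -- this
		// goroutine blocks forever and pins out in memory: the leak
		// described above.
		<-sig
		out.Close()
	}()
	return out
}

func main() {
	sig := make(chan struct{})
	_ = closeOnSignal(outlet{}, sig)

	close(sig)                        // releases the goroutine
	time.Sleep(10 * time.Millisecond) // crude wait for this toy example
}
```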