
Fix leaks in filebeat log harvester #6809

Closed
wants to merge 2 commits

Conversation

@adriansr (Contributor) commented Apr 9, 2018

Two independent but related causes of goroutine leaks have been found in filebeat:

  • channel.SubOutlet fails to terminate an internal goroutine, which keeps references to the SubOutlet itself and to the wrapped Outlet, preventing them from being freed (see the first sketch below).

  • The log.Harvester constructor receives an Outlet that is closed either by Harvester.Stop() or internally once Harvester.Run() has been invoked. When Harvester.Setup() fails, Stop() is never called, so the Outlet leaks. A CloseOnSignal() helper leaks for the same reason, adding another goroutine leak (see the sketch after the commit notes below).

Fixes #6797
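For context, here is a minimal, hypothetical sketch of the first leak: a wrapper whose forwarding goroutine only exits when its channel is closed. The names (Event, Outleter, subOutlet) are simplified stand-ins, not the actual libbeat/filebeat types.

```go
package main

import "fmt"

// Simplified stand-ins for the libbeat types; not the actual filebeat code.
type Event struct{ Msg string }

type Outleter interface {
	OnEvent(Event) bool
}

// subOutlet forwards events to a wrapped Outleter via an internal goroutine.
type subOutlet struct {
	ch  chan Event
	res chan bool
}

func newSubOutlet(out Outleter) *subOutlet {
	s := &subOutlet{ch: make(chan Event), res: make(chan bool)}
	go func() {
		// This goroutine only exits when s.ch is closed. If Close() never
		// closes the channel, it blocks here forever and keeps both s and
		// the wrapped out reachable: the leak described in the first bullet.
		for evt := range s.ch {
			s.res <- out.OnEvent(evt)
		}
	}()
	return s
}

func (s *subOutlet) OnEvent(evt Event) bool {
	s.ch <- evt
	return <-s.res
}

// Close unblocks the forwarding goroutine by closing the channel it ranges
// over; callers must not call OnEvent concurrently with or after Close.
func (s *subOutlet) Close() { close(s.ch) }

type printOutlet struct{}

func (printOutlet) OnEvent(evt Event) bool { fmt.Println("got:", evt.Msg); return true }

func main() {
	s := newSubOutlet(printOutlet{})
	s.OnEvent(Event{Msg: "hello"})
	s.Close()
}
```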

adriansr added 2 commits April 9, 2018 18:46
An internal goroutine wasn't stopped when Close() was called.
Harvester relied on Stop() being called to free resources allocated
during construction.

If Setup() fails for any reason, the Start() / Stop() sequence is never invoked and these resources are never released.
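To make the commit notes concrete, here is a hypothetical sketch of that failure mode: resources acquired at construction are only released by Stop(), so a Setup() error path that never calls Stop() leaks them. The names (Outlet, Harvester, Setup, Stop) mirror the discussion, but this is not the actual filebeat implementation.

```go
package main

import (
	"errors"
	"fmt"
)

type Outlet struct{ closed bool }

func (o *Outlet) Close() { o.closed = true }

type Harvester struct{ out *Outlet }

func NewHarvester(out *Outlet) *Harvester { return &Harvester{out: out} }

// Setup may fail, e.g. when the file to harvest cannot be opened.
func (h *Harvester) Setup() error { return errors.New("open failed") }

// Stop releases resources allocated at construction time.
func (h *Harvester) Stop() { h.out.Close() }

func startHarvester(out *Outlet) error {
	h := NewHarvester(out)
	if err := h.Setup(); err != nil {
		// Without this Stop() (or an equivalent cleanup), the outlet and the
		// goroutines behind it are never released: the leak this PR targets.
		h.Stop()
		return err
	}
	// From here on, Run() / Stop() own the outlet's lifetime.
	return nil
}

func main() {
	out := &Outlet{}
	if err := startHarvester(out); err != nil {
		fmt.Println("setup failed:", err, "- outlet closed:", out.closed)
	}
}
```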
@adriansr added the bug, Filebeat, and needs_backport (PR is waiting to be backported to other branches) labels on Apr 9, 2018
@adriansr (Contributor, Author) commented Apr 9, 2018

@urso I'd like to have your opinion on the Harvester issue.

To me, it would be cleaner to refactor the harvester and deal with the outlet only inside Run().
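A minimal sketch of one possible reading of that suggestion: the harvester only acquires and releases the outlet inside Run(), so a failed Setup() never leaves an outlet (or its goroutines) dangling. The names, including the Connector type, are illustrative assumptions, not the actual filebeat API.

```go
package main

import "fmt"

type Event struct{ Msg string }

type Outlet struct{}

func (o *Outlet) OnEvent(e Event) bool { fmt.Println("publish:", e.Msg); return true }
func (o *Outlet) Close()               { fmt.Println("outlet closed") }

// Connector creates the outlet on demand, only when Run() needs it.
type Connector func() (*Outlet, error)

type Harvester struct{ lines []string }

// Setup opens the file and allocates readers; no outlet is involved yet.
func (h *Harvester) Setup() error { return nil }

func (h *Harvester) Run(connect Connector) error {
	out, err := connect() // the outlet exists only for the lifetime of Run()
	if err != nil {
		return err
	}
	defer out.Close() // released on every exit path of Run()

	for _, l := range h.lines {
		out.OnEvent(Event{Msg: l})
	}
	return nil
}

func main() {
	h := &Harvester{lines: []string{"a", "b"}}
	if err := h.Setup(); err != nil {
		return // nothing to clean up: the outlet was never created
	}
	_ = h.Run(func() (*Outlet, error) { return &Outlet{}, nil })
}
```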

		case event := <-o.ch:
			o.res <- out.OnEvent(event)
		}
	}

This change is dangerous: returning early on o.done might prevent the queue from being drained, and not draining o.ch can lead to a deadlock on shutdown. To avoid leaking the goroutine, o.ch must be closed properly.
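A small, self-contained illustration (not the beats code) of the hazard described here: once the forwarding goroutine has returned on done, any pending or later send on the unbuffered channel blocks forever.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int) // unbuffered, like the sub-outlet channel
	done := make(chan struct{})

	// Forwarding goroutine that bails out on done without draining ch.
	go func() {
		for {
			select {
			case <-done:
				return // leaves any remaining sender stuck on ch
			case v := <-ch:
				fmt.Println("forwarded", v)
			}
		}
	}()

	close(done)                       // shutdown signal
	time.Sleep(10 * time.Millisecond) // let the goroutine observe done and return

	// This send now blocks forever: nobody drains ch any more. In the real
	// code, a harvester publishing a final update would hang here, which is
	// the shutdown deadlock warned about in the comment above.
	go func() { ch <- 1 }()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("main exits; the sender goroutine above is still blocked")

	// The safer shutdown order is the reverse: stop all producers first,
	// close(ch), and let the consumer range over ch until it is drained,
	// which both empties the queue and lets the goroutine exit.
}
```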


	return s
}

func (o *subOutlet) drainLoop(out Outleter) {

Maybe name it workerLoop? This loop forwards all received events to out.OnEvent.

@urso commented Apr 9, 2018

The SubOutlet is a left-over that is to be removed in the future. Because the underlying channel is shared with the prospector in order to guarantee ordering of state updates, it cannot be refactored that easily. Plus, shutdown is potentially riddled with race conditions.

Cleaning up prospector/harvester and state handling depends on #6810.

@ph (Contributor) commented Apr 10, 2018

@urso Once I get a bit of time, I will do a cleanup of the log input / harvester. The stdin+once fix is really hard to do without some refactoring; after doing a spike on the problem, that is the conclusion I've reached. :(

@ruflin (Contributor) commented Apr 10, 2018

Some more refactoring will be needed, but we should not block this change on it, as it's not a small undertaking.

@ph (Contributor) commented Apr 10, 2018

> Some more refactoring will be needed, but we should not block this change on it, as it's not a small undertaking.

Yep, not to block this; it's just related, to keep note of what can be improved in longer-term work.

@urso commented Apr 10, 2018

@ph @ruflin @adriansr The problem with this change is that I have a hard time figuring out its actual implications. We still pass loads of channels around in filebeat, and shutdown is still potentially riddled with a number of race conditions. This change indirectly affects the inner workings of prospectors, harvesters, the registry, the event publishing pipeline, and the global counters on filebeat shutdown. It took us a while to get shutdown working somewhat stably (somewhat, as in stdin is still not working correctly). That is, by fixing one issue, we might re-introduce other issues from the past.

Trying to fix whatever issues we have with shutdown/startup will be a never-ending story (it still is) without the required refactorings of the registry and the shutdown waiting logic (waiting has been supported by the publisher pipeline since 6.0, but can't be used by filebeat yet).

tl;dr here be dragons

@ruflin (Contributor) commented Apr 11, 2018

I share @urso's concern that we are not 100% sure about the side effects of this change. But I think we should find a way forward in small steps instead of one big one, which would also introduce other issues again. Best would be if we could have tests for every step. For example, the new pipeline in 6.x in Beats is great and a required step we had to take, but it also reintroduced some issues, for example with the once shutdown in stdin. The same will be true for any small or major change we do here, so it is a trade-off we have to choose. There will be dragons everywhere ;-)

@adriansr I wonder if we could have a test that failed before this change but goes green with it. BTW: as @urso predicted, it broke the shutdown test on Windows: https://beats-ci.elastic.co/job/elastic+beats+pull-request+multijob-windows/3895/beat=filebeat,label=windows/testReport/junit/test_shutdown/Test/test_shutdown_wait_timeout/ If we get our tests green, I'm relatively confident in any changes we make, as the test suite is rather extensive.
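A rough sketch of the kind of regression test being asked for here: exercise the leaky path repeatedly and check that the goroutine count returns to its baseline. The tiny subOutlet below is a stand-in so the snippet compiles on its own; a real test would target the actual channel.SubOutlet and the harvester setup-failure path instead.

```go
package channel_test

import (
	"runtime"
	"testing"
	"time"
)

// Minimal stand-in so the sketch is self-contained.
type subOutlet struct{ ch chan struct{} }

func newSubOutlet() *subOutlet {
	s := &subOutlet{ch: make(chan struct{})}
	go func() {
		for range s.ch { // exits once Close() closes the channel
		}
	}()
	return s
}

func (s *subOutlet) Close() { close(s.ch) }

func TestCloseReleasesGoroutines(t *testing.T) {
	before := runtime.NumGoroutine()

	for i := 0; i < 100; i++ {
		newSubOutlet().Close()
	}

	// Poll briefly: goroutines need a moment to observe the close and exit.
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= before+1 { // small slack for runtime goroutines
			return
		}
		time.Sleep(50 * time.Millisecond)
	}
	t.Fatalf("goroutine leak: before=%d after=%d", before, runtime.NumGoroutine())
}
```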

@adriansr (Contributor, Author) commented:

I've submitted an alternative fix that has a lower chance of interfering with shutdown: #6829

@adriansr closed this on Apr 11, 2018
@adriansr removed the needs_backport (PR is waiting to be backported to other branches) label on Jun 15, 2018

Successfully merging this pull request may close these issues.

[Filebeat] Memory leak associated with failure to setup harvesters
4 participants