Fix leaks in filebeat log harvester #6809
Conversation
An internal goroutine wasn't stopped when Close() was called.
Harvester relied on Stop() being called to free resources allocated during construction. If Setup() fails for some reason, the Start() / Stop() sequence is never invoked and these resources are never released.
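For illustration, here is a minimal sketch of the lifecycle bug described above. The names are hypothetical, not the actual filebeat types: a constructor starts a goroutine that only Stop() releases, so an early Setup() failure leaks it.

```go
package main

import "fmt"

type worker struct {
	done chan struct{}
}

// newWorker allocates resources during construction: it starts a
// goroutine that exits only when done is closed.
func newWorker() *worker {
	w := &worker{done: make(chan struct{})}
	go func() {
		<-w.done // blocks until Stop() is called
		fmt.Println("goroutine released")
	}()
	return w
}

func (w *worker) Setup() error {
	// If this fails, callers skip the Start()/Stop() sequence.
	return fmt.Errorf("setup failed")
}

func (w *worker) Stop() { close(w.done) }

func main() {
	w := newWorker()
	if err := w.Setup(); err != nil {
		// Bug pattern: returning here without w.Stop() leaks the
		// goroutine (and everything it references) forever.
		return
	}
	w.Stop()
}
```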
@urso I'd like to have your opinion on the Harvester issue. To me, it would be cleaner to refactor the harvester and deal with the outlet only inside Run().
```go
	case event := <-o.ch:
		o.res <- out.OnEvent(event)
	}
}
```
This change is dangerous, as returning early on o.done might prevent the queue from being drained. Not draining o.ch can lead to a deadlock on shutdown. To avoid leaking the goroutine, o.ch must be closed properly.
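A toy reproduction of that hazard (simplified stand-in types, not the real subOutlet): when done wins the race, the loop returns while a producer is still blocked sending on o.ch.

```go
package main

import (
	"fmt"
	"time"
)

type subOutlet struct {
	ch   chan int
	res  chan bool
	done chan struct{}
}

// loop returns as soon as done is closed. Any producer blocked on
// o.ch <- event at that instant never receives a result on o.res.
func (o *subOutlet) loop() {
	for {
		select {
		case <-o.done:
			return // o.ch is abandoned, not drained
		case ev := <-o.ch:
			o.res <- (ev > 0)
		}
	}
}

func main() {
	o := &subOutlet{
		ch:   make(chan int),
		res:  make(chan bool),
		done: make(chan struct{}),
	}
	go o.loop()
	go close(o.done) // shutdown races with the send below

	select {
	case o.ch <- 1:
		fmt.Println("event accepted:", <-o.res)
	case <-time.After(100 * time.Millisecond):
		// In a real pipeline this producer would block forever; the
		// timeout only makes the sketch terminate.
		fmt.Println("loop exited without draining o.ch")
	}
}
```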
```go
	return s
}

func (o *subOutlet) drainLoop(out Outleter) {
```
maybe name it workerLoop? This loop forwards all received events to out.OnEvent.
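For illustration, a minimal sketch of the suggested workerLoop, with simplified stand-in types rather than the real filebeat interfaces: ranging over the channel means the goroutine exits only once the channel is closed, after every queued event has been forwarded.

```go
package main

import "fmt"

// Simplified stand-ins for the real filebeat types; names and shapes
// are illustrative only.
type Event struct{ msg string }

type Outleter interface {
	OnEvent(Event) bool
}

type printOutlet struct{}

func (printOutlet) OnEvent(e Event) bool {
	fmt.Println("forwarded:", e.msg)
	return true
}

type subOutlet struct {
	ch  chan Event
	res chan bool
}

// workerLoop forwards every event received on o.ch to out.OnEvent and
// publishes the result on o.res.
func (o *subOutlet) workerLoop(out Outleter) {
	for event := range o.ch {
		o.res <- out.OnEvent(event)
	}
}

func main() {
	o := &subOutlet{ch: make(chan Event), res: make(chan bool)}
	go o.workerLoop(printOutlet{})

	o.ch <- Event{msg: "hello"}
	fmt.Println("ack:", <-o.res)

	close(o.ch) // releases the workerLoop goroutine
}
```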
The SubOutlet is still a leftover to be removed in the future. Since the underlying channel is shared with the prospector to guarantee order on state updates, it cannot be refactored that easily. Plus, shutdown is potentially riddled with race conditions. Cleaning up the prospector/harvester and state handling depends on #6810.
@urso Once I get a bit of time, I will do a cleanup of the log / harvester.
Some more refactoring will be needed, but we should not block this change on it, as it's not a small undertaking.
Yep, not blocking this; it's just related, to keep note of what can be improved in longer-term work.
@ph @ruflin @adriansr The problem with this change is that I have a hard time figuring out its actual implications. We still pass loads of channels around in filebeat, and shutdown is still potentially riddled with a number of race conditions. This change indirectly affects the inner workings of prospectors, harvesters, the registry, the event publishing pipeline, and global counters on filebeat shutdown. It took us a while to get shutdown working somewhat stable (somewhat, as in stdin is still not working correctly). That is, by fixing one issue, we might re-introduce other issues from the past. Trying to fix whatever issue we have with shutdown/startup will be a never-ending story (it still is) without the required refactorings of the registry and the shutdown waiting logic (waiting has been supported by the publisher pipeline since 6.0, but can't be used by filebeat yet).

tl;dr: here be dragons
I share the concern from @urso that we are not 100% sure about the side effects of this change. But I think we should find a way forward in small steps instead of one big one, which would also introduce other issues again. Best would be if we can have tests for every step. For example, the new pipeline in 6.x in Beats is great and a required step we had to take, but it also reintroduced some issues, for example with shutdown on stdin. The same will be true for any small or major changes we make here, so it is a trade-off we have to choose. There will be dragons everywhere ;-)

@adriansr I wonder if we could have a test that failed before this change but goes green with it.

BTW: As @urso predicted, it broke the shutdown test on Windows: https://beats-ci.elastic.co/job/elastic+beats+pull-request+multijob-windows/3895/beat=filebeat,label=windows/testReport/junit/test_shutdown/Test/test_shutdown_wait_timeout/

If we get our tests green, I'm relatively confident in any changes we make, as the test suite is rather extensive.
I've submitted an alternative fix that has a lower chance of interfering with shutdown: #6829
Two independent but related causes of goroutine leaks have been found in filebeat:

- channel.SubOutlet fails to terminate an internal goroutine, which keeps references to the SubOutlet itself and the Outlet being wrapped, preventing them from being freed.
- log.Harvester's constructor receives an Outlet which is closed by Harvester.Stop() or internally after Harvester.Run() is invoked. When Harvester.Setup() fails, Stop() is not called, causing a leak. A CloseOnSignal() utility is also leaked for the same reason, which adds another goroutine leak (a sketch follows below).
Fixes #6797
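For context, here is a hedged sketch of how a CloseOnSignal-style wrapper can leak; the real helper's signature may differ, and the types are simplified stand-ins.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for filebeat's Outleter interface; the real one
// has more methods.
type Outleter interface {
	Close() error
}

type outlet struct{}

func (outlet) Close() error {
	fmt.Println("outlet closed")
	return nil
}

// closeOnSignal is a hedged sketch of a CloseOnSignal-style wrapper:
// it spawns a goroutine that closes the outlet once sig fires.
func closeOnSignal(out Outleter, sig <-chan struct{}) Outleter {
	go func() {
		// If sig never fires -- e.g. Harvester.Setup() fails and the
		// Stop() that would trigger it is never called -- this
		// goroutine blocks forever and pins out in memory: the leak
		// described above.
		<-sig
		out.Close()
	}()
	return out
}

func main() {
	sig := make(chan struct{})
	_ = closeOnSignal(outlet{}, sig)

	close(sig)                        // releases the goroutine
	time.Sleep(10 * time.Millisecond) // crude wait for this toy example
}
```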