Add close_timeout option #1926

Merged (1 commit) on Aug 31, 2016
Conversation

ruflin (Member) commented Jun 29, 2016

close_timeout will end the harvester after the predefined time. In case the output is blocked, close_timeout will only apply on the next event sent. This is identical to the other close_* options.

@@ -105,6 +106,11 @@ func (h *Harvester) Harvest() {
	if !h.sendEvent(event) {
		return
	}

	if h.Config.CloseTTL > 0 && time.Since(h.startTime) > h.Config.CloseTTL {
Reviewer:

Config validation mandates that CloseTTL cannot be 0 (due to min=0,nonzero). Is this on purpose?

ruflin (Member Author):

Yes, because I don't see any use case for close_ttl = 0, as this would close the harvester again right away.

@ruflin ruflin added the "in progress" label Jul 4, 2016
@ruflin ruflin changed the title from "Add close_ttl option" to "Add close_timeout option" Jul 6, 2016
@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from 2e5b57f to ff5df38 on July 12, 2016 08:05
@ruflin ruflin mentioned this pull request Jul 12, 2016
@ruflin ruflin removed the "in progress" label Jul 12, 2016
if h.config.CloseTimeout > 0 && time.Since(startTime) > h.config.CloseTimeout {
	logp.Info("Closing harvester because ttl was reached: %s", h.path)
	return
}
Reviewer:

Checking for CloseTimeout at this place requires an event to be published for closeTimeout to take effect.

If the reader waits a long time for a new line, or sendEvent blocks because the spooler is blocked on the publisher, close_timeout might fire very, very late.

Reviewer:

The file reader has some 'inactive timeout'? The 'inactive timeout' should be <= close_timeout.

What about the multiline timeout? Hm...

ruflin (Member Author):

An alternative implementation could be to implement the timeout in a separate goroutine that closes h.done?
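
For illustration, a minimal runnable sketch of that idea in plain Go (not the actual Filebeat code; the closeTimeout value and channel names are assumptions): a separate goroutine closes a done channel once the timeout elapses, and the read/publish loop selects on that channel:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed value for the sketch; in Filebeat this would come from close_timeout.
	closeTimeout := 2 * time.Second
	done := make(chan struct{})

	// Separate goroutine: close the done channel once the timeout elapses.
	go func() {
		<-time.After(closeTimeout)
		close(done)
	}()

	// Stand-in for the harvester's read/publish loop.
	for i := 0; ; i++ {
		select {
		case <-done:
			fmt.Println("closing harvester: close_timeout reached")
			return
		default:
			time.Sleep(500 * time.Millisecond) // simulate reading and sending one event
			fmt.Println("sent event", i)
		}
	}
}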

ruflin (Member Author) commented Jul 18, 2016

@urso I pushed an alternative implementation. Can you have a look?

ruflin (Member Author) commented Jul 18, 2016

jenkins, retest it

@@ -45,6 +46,7 @@ type harvesterConfig struct {
	CloseRemoved bool          `config:"close_removed"`
	CloseRenamed bool          `config:"close_renamed"`
	CloseEOF     bool          `config:"close_eof"`
	CloseTimeout time.Duration `config:"close_timeout" validate:"min=0,nonzero"`
Reviewer:

validate enforces CloseTimeout >= 1ns, which means CloseTimeout is always enabled. Why disallow 0 or -1?

ruflin (Member Author):

I agree that nonzero should be removed. But -1 doesn't make sense from my point of view?

Reviewer:

yeah, -1 is kinda similar to 0. CloseTimeout should only be enabled if >0

ruflin (Member Author):

I think that is currently the case?

Reviewer:

It's not. Using nonzero in validate requires close_timeout > 0 in config files.
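
For illustration, one way the field could be declared so that 0 (or omitting the option) disables it, assuming go-ucfg-style validate tags as in the diff above; this is a sketch, not the final code of this PR:

// Dropping "nonzero" allows close_timeout to be 0 (disabled); "min=0" still rejects negative durations.
CloseTimeout time.Duration `config:"close_timeout" validate:"min=0"`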

@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from b2f8c91 to 913c3f2 on July 20, 2016 08:35
ruflin (Member Author) commented Jul 20, 2016

@urso PR updated: I added a note to the docs about the potential multiline issue and some notes about the channels to the code. We should definitely revise how processors are started / stopped at a later stage.

@ruflin ruflin added the "in progress" label and removed the "review" label Jul 20, 2016
ruflin (Member Author) commented Jul 20, 2016

In a discussion with @urso we realised there is a potential issue in the reader / processor which could lead to an endless loop. Before merging this PR, the reader / processor part should be cleaned up to guarantee proper stopping.

ruflin added a commit to ruflin/beats that referenced this pull request Aug 22, 2016
Even though the implementation for close_timeout is not finished, it already showed up in the docs and config file. As close_timeout still needs more work, these were removed to prevent any confusion. All changes for close_timeout go into elastic#1926
tsg pushed a commit that referenced this pull request Aug 22, 2016
Even though the implementation for close_timeout is not finished, it already showed up in the docs and config file. As close_timeout still needs more work, these were removed to prevent any confusion. All changes for close_timeout go into #1926
@ruflin ruflin force-pushed the close_ttl branch 3 times, most recently from e58fef6 to 9574e3d on August 29, 2016 14:21
ruflin (Member Author) commented Aug 29, 2016

I redid the implementation and added a Close function to the log_file reader. This implementation has the following issues:

  • In case the output applies back pressure and not even one more event can be sent, the harvester will not be closed, which means the file handler will stay open while the output service is down.
  • An alternative implementation was done where the harvester was killed directly, without waiting for the output to complete. This had the side effect that the final state of the harvester (Finished) was not updated properly. The issue is that state updates and data updates currently go through the same channel. Even though state updates are not sent by the spooler, they still go through the spooler, and if the queue is full no further states can be sent either. One potential option would be to decouple state updates from event updates, but this would bring lots of new issues with race conditions between state and event updates. In addition, state updates for events should only be applied if the event was completely sent.

The above implementation is quite simple and works in cases where the output is behaving normally. This is in line with the other close_* options: they all only apply when the output works as expected; otherwise file handlers stay open.

To have some kind of circuit breaker which really stops the harvester independent of the output, I think we would also need to make changes to the output to provide some feedback loop, or not even start new harvesters if there is congestion (limit the number of harvesters, ...). That would still not solve the problem, but it would avoid opening more and more files when the output is stuck.

func (r *LogFile) Close() {
	// Make sure reader is only closed once
	r.singleClose.Do(func() {
		close(r.done)
Reviewer:

As the underlying reader might block on a syscall, the file must be closed right after closing the channel.

ruflin (Member Author):

done
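
A minimal runnable sketch of that pattern (illustrative names, not the actual libbeat types): the Close body runs exactly once, closing the done channel and then the file handle right after, so a Read blocked in a syscall is unblocked as well:

package main

import (
	"os"
	"sync"
)

// logFile is an illustrative stand-in for the log file reader.
type logFile struct {
	file        *os.File
	done        chan struct{}
	singleClose sync.Once
}

// Close may be called from multiple goroutines; the body runs only once.
func (r *logFile) Close() {
	r.singleClose.Do(func() {
		close(r.done)  // signal goroutines waiting on the done channel
		r.file.Close() // close the file right after, to unblock a pending Read
	})
}

func main() {
	f, _ := os.Open(os.DevNull)
	r := &logFile{file: f, done: make(chan struct{})}
	r.Close()
	r.Close() // safe: sync.Once guarantees the body runs only once
}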

urso commented Aug 30, 2016

Which error is passed up the readers if the file is closed by the timeout (or the done channel)?

What happens to buffered (but incomplete) multiline events if the file is closed early?

ruflin (Member Author) commented Aug 30, 2016

If the channel is closed, the error will be ErrClosed. What the error will be after closing the file handler probably depends on which state the file handler was in at that moment.

Incomplete multiline events will be sent on ErrClosed (see tests). This is handled the same way as when the multiline timeout is reached.

@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from 290ea69 to 632cd65 on August 30, 2016 12:05
	h.fileReader.Close()
} else {
	h.file.Close()
}
Reviewer:

Who is calling (*Harvester).close()? Like, from which goroutine? Is there a chance of a race on h.fileReader being set?

ruflin (Member Author):

The same goroutine that calls close is also setting the fileReader, so there should not be any race condition.

Reviewer:

Hm... why not call defer fileReader.Close() right after creating it in said goroutine? fileReader.Close() already ensures it does its magic only once, and it is thread-safe.
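
A sketch of that suggestion (again with illustrative names, not the actual Filebeat code): the goroutine that creates the reader defers Close() itself, and because Close() is idempotent and thread-safe, a close_timeout goroutine may call it concurrently:

package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

type reader struct {
	file *os.File
	done chan struct{}
	once sync.Once
}

// Close is idempotent and safe for concurrent use.
func (r *reader) Close() {
	r.once.Do(func() {
		close(r.done)
		r.file.Close()
	})
}

// harvest stands in for the harvester goroutine: it creates the reader and
// defers its Close(), so no other code path has to decide which handle to close.
func harvest(path string, stop <-chan struct{}) {
	f, err := os.Open(path)
	if err != nil {
		return
	}
	r := &reader{file: f, done: make(chan struct{})}
	defer r.Close()

	// A close_timeout goroutine may also close the reader; that is safe.
	go func() {
		<-stop
		r.Close()
	}()

	<-r.done // stand-in for the read/publish loop, which ends when the reader is closed
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(stop) // simulate close_timeout firing
	}()
	harvest(os.DevNull, stop)
	fmt.Println("harvester stopped, reader closed exactly once")
}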

@ruflin ruflin force-pushed the close_ttl branch 2 times, most recently from a6f9980 to ba6934c on August 30, 2016 13:30
@ruflin ruflin added the "review" label and removed the "blocked" and "in progress" labels Aug 30, 2016

In case close_timeout is used in combination with multiline events, it can happen that the harvester is stopped in the middle of a multiline event, which means only part of the event will be sent. If the harvester is continued at a later stage and the file still exists, only the second part of the event will be sent.

Close timeout will not apply in case your output is stuck and no further events can be sent. At least one further event must be sent to make close_timeout apply.
Reviewer:

"At least one further event must be sent to make close_timeout apply."

Can you clarify?

ruflin (Member Author):

fixed

urso commented Aug 30, 2016

@ruflin needs rebase

ruflin (Member Author) commented Aug 31, 2016

@urso Rebase done

ruflin (Member Author) commented Aug 31, 2016

jenkins, retest it

@urso urso merged commit 001b29b into elastic:master Aug 31, 2016
@monicasarbu monicasarbu deleted the close_ttl branch September 5, 2016 10:45