
Init support for filebeat multiline #570

Merged 1 commit into elastic:master on Dec 29, 2015

Conversation

urso

@urso urso commented Dec 19, 2015

start with multiline for filebeat (#461)

@urso urso added enhancement in progress Pull request is currently in progress. Filebeat Filebeat labels Dec 19, 2015
ForceCloseFiles bool `yaml:"force_close_files"`
ExcludeLines []string `yaml:"exclude_lines"`
IncludeLines []string `yaml:"include_lines"`
Multiline *MultilineConfig `yaml:"multi"`
Contributor

I'd call the option multiline, to make it clearer what it is about.

Negate bool `yaml:"negate"`
Match string `yaml:"match"`
MaxLines *int `yaml:"max_lines"`
MaxBytes *int `yaml:"max_bytes"`
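Pulled together, the struct tags above imply a YAML shape roughly like the following sketch (using the key multiline suggested in the review rather than multi; the values shown are purely illustrative, not defaults from this PR):

```yaml
multiline:
  pattern: '^\['       # regexp matched against each line
  negate: false
  match: after         # "after" or "before"
  max_lines: 500       # illustrative value
  max_bytes: 10485760  # illustrative value
```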
Member

I would suggest having max_bytes outside multiline, as a general configuration option, since it not only applies to multiline but could also apply to a very large single line.

Author

Good point. Will do. Same for max_bytes.

Member

What do you mean by "same for max_bytes"?

Author

I meant to say max_lines... but max_lines makes no sense there. I will just do it for max_bytes. Sorry for the noise.

@ruflin
Member

ruflin commented Dec 27, 2015

@urso Thanks a lot for getting this started. I had a closer look, but didn't test it yet locally. Some notes:

  • As it is quite a big change, it is not easy to follow everything. One thing I'm very cautious about changing is the error handling functionality, as it has (too) much magic inside. Based on a brief comparison, I assume it stayed mostly the same: https://github.com/elastic/beats/pull/570/files#diff-0df022979a0f5b03f3e4f709bd9c9c4dR98
  • We should definitely add system tests. I can help you here.
  • Interestingly, I would probably have followed a different approach for the implementation. Instead of moving the functionality and error handling into the line readers, I would have kept it at a higher level, probably by adding functionality to the Harvest function. The main reason is that I would argue the line readers should not have to deal with timeouts, backoffs, or error handling. Looking at the implementation, it could be that this is more of a "naming" discussion, as the basic line-by-line reader still exists. This is probably easier to discuss via chat with a small diagram :-)

@urso
Author

urso commented Dec 27, 2015

Error handling did not change. But I found the error handling and retry functionality to be too intertwined. I tried to build "small" composable units, each responsible for just one piece of functionality when reading lines. I found there is no 'higher' level without an abstraction leak, or without making all the multiline handling much more intertwined with the error handling.
So I pushed the retry logic + backoff to an earlier stage, so as not to confuse the multiline reader state machine. I didn't want to handle multiline in the main loop, which is already hard to follow.

Processing stages (pipe-line is setup as follows):

file input (io.Reader) ->
logFileReader (io.Reader with retry and backoff) ->
encoding.LineReader (read lines and apply codec) ->
[optional] timeoutReader (force multiline Reader to publish current state) ->
[optional] multilineReader (state-machine combining multiple lines) ->
noEOLLineReader (remove trailing newline) ->
harvester loop (publish line event and quit loop on error)

The logFileReader implements error handling at a lower level, so the state machines in LineReader and multilineReader get sane error handling, and the timeout reader can be safely shut down if the error is non-nil. Having the logFileReader also removes the requirement that readers stay valid after EOF.

@urso
Author

urso commented Dec 28, 2015

In this PR I've implemented a layered approach:

  1. raw input layer
  2. processing layer (codec, line reader, multiline, remove trailing newline)
  3. event publisher loop

Layers 1 and 2 use the io.Reader interface, and layers 2 and 3 use the lineReader interface. By swapping the input layer, all processing layers can be reused for different inputs (e.g. TCP/unix socket or stdin). I think layering to deal with complexity, plus composing processing elements via interfaces, is quite a powerful and easy-to-use mechanism that fits well with the I/O interfaces in the Go stdlib. Some more refactoring and cleanup in the harvester is still required in general.

It is the input layer that must behave differently for different source types. For log files we want backoff + retry on EOF, but not for sockets or stdin. It's up to the input layer to push bytes until some well-defined EOF is reached. For log files, the input layer is implemented by logFileReader. Sockets and the stdin file handle can be used as-is (we don't want to ignore EOF there). The error handling as it was implemented in log.go was a good fit for log files, but not so much for stdin, for example (which it was also used for).

There is no real error handling in the line readers, but I did have to move some error handling to the input layer, so that stateful processors in layer 2 are not affected by error handling/recovery in layer 3.

I think another advantage of having the logFileReader handling the reading from files is, we're more flexible in adding file reading features. For example:

  • on Windows, close the file after EOF and retry (by re-opening the file) after the backoff interval; no state in upper layers will be affected
  • have a central authority watching the number of open files and scheduling temporary closing/re-opening of files in logFileReader, in order to limit the total number of file descriptors used by filebeat.

codec encoding.Encoding,
bufferSize int,
maxBytes int,
readerConfig logFileReaderConfig,
Member

Should maxBytes be part of logFileReaderConfig? Same for bufferSize and codec?

Author

logFileReaderConfig is the input-layer config (file reading only). The other options, codec, bufferSize and maxBytes, are used to configure the line-processing layer.

@urso urso added review and removed in progress Pull request is currently in progress. labels Dec 28, 2015

====== pattern

The regexp Pattern that hast to be matched. The example pattern matches all lines starting with [
Member

Typo: s/hast/has/

Member

I think Pattern can be lower-cased here. Do you think we should say that it is the same syntax as Perl, or link to https://github.com/google/re2/wiki/Syntax?

Author

We're using CompilePOSIX to compile the regular expressions. These are 'mostly' compatible with egrep, plus some quirks (see the documentation).

TBH I find it quite difficult to understand the impact on syntax when given only this line in the documentation:

restricts the regular expression to POSIX ERE (egrep) syntax

Author

Even worse, you have to keep YAML string escaping in mind when writing the regular expression.
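A quick illustration of that pitfall (standard YAML quoting rules, nothing filebeat-specific):

```yaml
# Double quotes make YAML interpret backslash escapes itself, so
# regex backslashes must be doubled:
pattern: "^\\[\\d{4}"
# Single quotes pass backslashes through untouched, which is usually
# the safer choice for regular expressions:
pattern: '^\[\d{4}'
```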

Member

That also relates to the regexp expressions for exclude and include, right? If yes, I suggest creating a separate docs page on regexps that we can link to.

Author

Yes, we use the same compile function.

Member

Should we open an issue for that, so we don't forget about it, and discuss it with @dedemorton?

@ruflin
Member

ruflin commented Dec 28, 2015

LGTM


====== match

Match can be set to "after" or "before". It is used to define if lines should be append to a pattern
Member

s/append/appended/
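As an illustration of how match and negate combine for the stack-trace use case described above (a sketch of the intended semantics; option names were still being settled in this PR):

```yaml
# Java stack traces: the first line of an event starts with a
# timestamp in brackets; continuation lines do not. With negate: true
# and match: after, every line NOT matching the pattern is appended
# after the last line that DID match.
multiline:
  pattern: '^\['
  negate: true
  match: after
```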

# for Java Stack Traces or C-Line Continuation
#multiline:

# The regexp Pattern that hast to be matched. The example pattern matches all lines starting with [
Member

s/hast/has

@ruflin ruflin mentioned this pull request Dec 29, 2015
split processing into 3 layers:
- input layer
- line processing layer
- event publisher layer (for loop driver)

Input layer is responsible for reading files and
forwarding errors if appropriate.

new multiline system tests:
- elasticsearch log with java exception
- c-style log tests
- max_lines test
- max_bytes test
- timeout test

added asciidocs for multiline
ruflin added a commit that referenced this pull request Dec 29, 2015
Init support for filebeat multiline
@ruflin ruflin merged commit 6971d15 into elastic:master Dec 29, 2015
@urso urso deleted the enh/461-filebeat-multiline branch January 12, 2016 09:22