
Init support for filebeat multiline #570

Merged 1 commit into elastic:master on Dec 29, 2015

Conversation

urso

@urso urso commented Dec 19, 2015

start with multiline for filebeat (#461)

@urso urso added enhancement in progress Pull request is currently in progress. Filebeat Filebeat labels Dec 19, 2015
ForceCloseFiles bool `yaml:"force_close_files"`
ExcludeLines []string `yaml:"exclude_lines"`
IncludeLines []string `yaml:"include_lines"`
Multiline *MultilineConfig `yaml:"multi"`
Contributor

I'd call the option multiline, to make it clearer what it is about.

Negate bool `yaml:"negate"`
Match string `yaml:"match"`
MaxLines *int `yaml:"max_lines"`
MaxBytes *int `yaml:"max_bytes"`
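Pulled together, the struct tags above imply a YAML shape roughly like the following sketch (using the key multiline suggested in the review rather than multi; the values shown are purely illustrative, not defaults from this PR):

```yaml
multiline:
  pattern: '^\['       # regexp matched against each line
  negate: false
  match: after         # "after" or "before"
  max_lines: 500       # illustrative value
  max_bytes: 10485760  # illustrative value
```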
Member

I would suggest having max_bytes outside multiline, as a general configuration option, since it not only applies to multiline but could also apply to a very large single line.

Author

Good point. Will do. Same for max_bytes.

Member

What do you mean by "same for max_bytes"?

Author

I meant to say max_lines... but max_lines makes no sense there. I will just do it for max_bytes. Sorry for the noise.

@ruflin
Member

ruflin commented Dec 27, 2015

@urso Thanks a lot for getting this started. I had a closer look, but didn't test it yet locally. Some notes:

  • As it is quite a big change, it is not easy to follow everything. One thing I'm very cautious about changing is the error handling functionality, as it has (too) much magic inside. Based on a brief comparison, I assume it stayed mostly the same: https://github.com/elastic/beats/pull/570/files#diff-0df022979a0f5b03f3e4f709bd9c9c4dR98
  • We should definitely add system tests. I can help you here.
  • Interestingly, I would probably have followed a different approach for the implementation. Instead of moving the functionality and error handling into the line readers, I would have kept it at a higher level, probably by adding functionality to the Harvest function. The main reason is that I would argue the line readers should not have to deal with timeouts, backoffs, or error handling. Looking at the implementation, it could be that this is more of a "naming" discussion, as the basic line-by-line reader still exists. This is probably easier to discuss via chat with a small diagram :-)

@urso
Author

urso commented Dec 27, 2015

Error handling did not change. But I found the error handling and retry functionality to be too intertwined. I tried to build "small" composable units, each responsible for just one piece of functionality when reading lines. I found there is no 'higher' level without an abstraction leak, or without making all the multiline handling much more intertwined with the error handling.
So I pushed the retry logic + backoff to an earlier stage, so as not to confuse the multiline reader state machine. I didn't want to handle multiline in the main loop, which is already hard to follow.

Processing stages (pipe-line is setup as follows):

file input (io.Reader) ->
logFileReader (io.Reader with retry and backoff) ->
encoding.LineReader (read lines and apply codec) ->
[optional] timeoutReader (force multiline Reader to publish current state) ->
[optional] multilineReader (state-machine combining multiple lines) ->
noEOLLineReader (remove trailing newline) ->
harvester loop (publish line event and quit loop on error)

The logFileReader implements error handling at a lower level, so the state machines in LineReader and multilineReader get sane error handling, and the timeout reader can be safely shut down if the error is non-nil. Having the logFileReader also removes the requirement that readers stay valid after EOF.

@urso
Author

urso commented Dec 28, 2015

In this PR I've implemented a layered approach:

  1. raw input layer
  2. processing layer (codec, line reader, multiline, remove trailing newline)
  3. event publisher loop

Layers 1 and 2 use the io.Reader interface, and layers 2 and 3 use the lineReader interface. By swapping the input layer, all processing layers can be reused for different inputs (e.g. TCP/unix socket or stdin). I think layering to deal with complexity, plus composing processing elements via interfaces, is quite a powerful and easy-to-use mechanism that fits well with the I/O interfaces in the Go stdlib. Some more refactoring and cleanup in the harvester is still required in general.

It is the input layer that must behave differently for different source types. For log files we want backoff + retry on EOF, but not for sockets or stdin. It's up to the input layer to push bytes until some well-defined EOF is reached. For log files, the input layer is implemented by logFileReader. Sockets and the stdin file handle can be used as-is (we don't want to ignore EOF there). The error handling as it was implemented in log.go was a good fit for log files, but not so much for stdin, for example (which it was also used for).

There is no real error handling in the line readers, but I did have to move some error handling to the input layer, so that stateful processors in layer 2 are not affected by error handling/recovery in layer 3.

I think another advantage of having the logFileReader handling the reading from files is, we're more flexible in adding file reading features. For example:

  • on Windows, close the file after EOF and retry (by re-opening the file) after the backoff interval; no state in upper layers will be affected
  • have a central authority watching the number of open files and scheduling temporary closing/re-opening of files in logFileReader, in order to limit the total number of file descriptors used by filebeat.

codec encoding.Encoding,
bufferSize int,
maxBytes int,
readerConfig logFileReaderConfig,
Member

Should maxBytes be part of logFileReaderConfig? Same for bufferSize and codec?

Author

logFileReaderConfig is the input-layer config (file reading only). The other options, codec, bufferSize and maxBytes, are used to configure the line-processing layer.

@urso urso added review and removed in progress Pull request is currently in progress. labels Dec 28, 2015

====== pattern

The regexp Pattern that hast to be matched. The example pattern matches all lines starting with [
Member

Typo: s/hast/has/

Member

I think Pattern can be lower-cased here. Do you think we should say that it is the same syntax as Perl, or link to https://github.com/google/re2/wiki/Syntax?

Author

We're using CompilePOSIX to compile the regular expressions. These are 'mostly' compatible with egrep, plus some quirks (see the documentation).

TBH I find it quite difficult to understand the impact on syntax when given only this line in the documentation:

restricts the regular expression to POSIX ERE (egrep) syntax

Author

Even worse, you have to keep YAML string escaping in mind when writing the regular expression.
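A quick illustration of that pitfall (standard YAML quoting rules, nothing filebeat-specific):

```yaml
# Double quotes make YAML interpret backslash escapes itself, so
# regex backslashes must be doubled:
pattern: "^\\[\\d{4}"
# Single quotes pass backslashes through untouched, which is usually
# the safer choice for regular expressions:
pattern: '^\[\d{4}'
```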

Member

That also relates to the regexp expressions for exclude and include, right? If yes, I suggest creating a separate docs page on regexps that we can link to.

Author

Yes, we use the same compile function.

Member

Should we open an issue for that, so we don't forget about it, and discuss it with @dedemorton?

@ruflin
Member

ruflin commented Dec 28, 2015

LGTM


====== match

Match can be set to "after" or "before". It is used to define if lines should be append to a pattern
Member

s/append/appended/
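As an illustration of how match and negate combine for the stack-trace use case described above (a sketch of the intended semantics; option names were still being settled in this PR):

```yaml
# Java stack traces: the first line of an event starts with a
# timestamp in brackets; continuation lines do not. With negate: true
# and match: after, every line NOT matching the pattern is appended
# after the last line that DID match.
multiline:
  pattern: '^\['
  negate: true
  match: after
```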

# for Java Stack Traces or C-Line Continuation
#multiline:

# The regexp Pattern that hast to be matched. The example pattern matches all lines starting with [
Member

s/hast/has

@ruflin ruflin mentioned this pull request Dec 29, 2015
split processing into 3 layers:
- input layer
- line processing layer
- event publisher layer (for loop driver)

Input layer is responsible for reading files and
forwarding errors if appropriate.

new multiline system tests:
- elasticsearch log with java exception
- c-style log tests
- max_lines test
- max_bytes test
- timeout test

added asciidocs for multiline
ruflin added a commit that referenced this pull request Dec 29, 2015
Init support for filebeat multiline
@ruflin ruflin merged commit 6971d15 into elastic:master Dec 29, 2015
@urso urso deleted the enh/461-filebeat-multiline branch January 12, 2016 09:22