
Support for Grok expressions in Beats #5790

Closed
wants to merge 12 commits

Conversation

@ramon-garcia

Hello, I have developed Grok expression support in Beats.

I know this has already been discussed, but I need this feature.
First of all, to save a lot of network traffic when transmitting events to Elasticsearch. Many events are frequent and redundant, such as web hits from a monitoring system or firewall logs produced by frequent broadcast traffic. To drop these events, one needs to classify them.

It was argued that this would make development more complex. But the grok.go code is just 232 lines.

Please consider accepting this request.

@elasticmachine
Collaborator

Can one of the admins verify this patch?

@ramon-garcia
Author

Will look into Travis build.

@karmi

karmi commented Dec 1, 2017

Hi @ramon-garcia, we have found your signature in our records, but it seems like you signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails to your GitHub profile (they can be hidden), so we can match your e-mails to your GitHub profile?

@ramon-garcia
Author

I think that all the e-mail addresses are in my profile. Am I missing something?

@ramon-garcia
Author

I am trying to fix the Travis build issues. It seems to complain about formatting issues. Unfortunately I am unable to run make check on my Windows machine.

@ramon-garcia
Author

ramon-garcia commented Dec 2, 2017

Finally, I managed to run "make check" on a Linux machine. make check fails on beat/info.go:

$ make check
warning: "github.com/elastic/beats/libbeat/..." matched no packages
./testing/console.go
./processors/actions/grok_test.go
./cmd/export/config.go
./cmd/instance/beat.go
./cmd/instance/beat_test.go
./beat/info.go
Code differs from goimports' style ^

"goimports -d" shows no difference in files grok.go and grok_test.go.

This branch is merged with the master branch.

If there is anything that I can help with, here I am.

@karmi

karmi commented Dec 3, 2017

@ramon-garcia, some of the commits have an invalid address, and the CLA check cannot process those, but you have probably force pushed this branch with amended commits, since the CLA check is green now?

@karmi

karmi commented Dec 3, 2017

Hmm, something is weird here. When I manually check the PR, it's red, but the status is green here. Can you please make sure you have configured Git with your email, amend the commits with git commit --amend --verbose --reset-author and force push?

@ruflin ruflin added discuss Issue needs further discussion. Filebeat Filebeat labels Dec 4, 2017
@ramon-garcia
Author

I have tried to rebase, but perhaps I screwed up my branch. I will continue tomorrow.

@ramon-garcia
Author

Rebased with the author changed. I hope it is OK now.

@tsg
Contributor

tsg commented Dec 6, 2017

Hi @ramon-garcia, apologies for the delayed answer. There are two reasons for which we hesitate to add Grok support in Beats:

  • We've got Grok implementations in Logstash and Ingest Node, and we're worried that by adding Grok in Beats as well we're creating confusion around the roles of each component. Also, since the implementations are different, there can be subtle differences between them that create more confusion.

  • Grok, being compiled to regular expressions, tends to require significant CPU time, which is not ideal for a shipper that usually shares hardware with the application. The Go regexp implementation is not the fastest, and it is going to dominate the CPU profile, so I'd expect CPU consumption in Filebeat similar to what we see in Logstash, or even larger.
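To illustrate the "compiled to regular expressions" part of the second point, a toy expansion might look like the sketch below. The pattern library and `compileGrok` helper are invented for illustration and are not the PR's grok.go; literal parts of the expression are not regex-escaped in this sketch. The compile happens once, but the resulting regexp is matched against every event, which is where the CPU goes.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical pattern library mapping grok names to regexp fragments.
var library = map[string]string{
	"IP":   `\d{1,3}(?:\.\d{1,3}){3}`,
	"WORD": `\w+`,
}

// ref matches %{PATTERN:field} references inside a grok expression.
var ref = regexp.MustCompile(`%\{(\w+):(\w+)\}`)

// compileGrok expands the references into named capture groups and
// compiles the result into one Go regexp.
func compileGrok(expr string) (*regexp.Regexp, error) {
	expanded := ref.ReplaceAllStringFunc(expr, func(m string) string {
		parts := ref.FindStringSubmatch(m)
		return fmt.Sprintf(`(?P<%s>%s)`, parts[2], library[parts[1]])
	})
	return regexp.Compile(expanded)
}

func main() {
	re, err := compileGrok(`%{WORD:method} from %{IP:client}`)
	if err != nil {
		panic(err)
	}
	m := re.FindStringSubmatch("GET from 10.0.0.1")
	for i, name := range re.SubexpNames() {
		if name != "" {
			fmt.Printf("%s=%s\n", name, m[i]) // method=GET, client=10.0.0.1
		}
	}
}
```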

For the two reasons above, we were thinking of adding a dissect filter, like Logstash has, before Grok, and seeing if that is enough for what people usually need. Dissect doesn't use regular expressions and is faster on most workloads. Would you be interested in contributing a Dissect implementation?

Apologies again for not giving you timely feedback, we wanted to discuss it internally first.

@ramon-garcia
Author

ramon-garcia commented Dec 8, 2017

My primary reason for developing this patch is that I want to store clean events, without noise. Storing millions of useless events that reflect repetitive activity wastes CPU and storage and obscures the important ones.

With regard to CPU time, in my case, it is usually more scarce in the Elasticsearch server than in other servers. Perhaps this is not the case for everyone, but offering this option helps a lot of use cases.

My experience with pipelines is that they are difficult to debug (there is no way to send a test event to the pipeline and see the output), and one must watch log files in Elasticsearch from time to time. By contrast, any test with Filebeat can be done without touching an Elasticsearch server. For similar reasons, my experience with Painless scripts was not good.

Finally, there are legal and ethical reasons for avoiding shipping irrelevant logs. This is a general comment on the philosophy of having Beats do as little processing as possible. Shipping logs from a coworker's workstation that show personal activity, and that are not relevant for objective security reasons or other demonstrable reasons, is unethical and probably illegal. Or consider, for instance, the logs of a proxy server, which show the web sites visited by coworkers; one could learn their political affiliation from them. Shipping or storing more than what can be objectively justified must be avoided.

Anyway, in case of disagreement, anyone can take this version of Beats and use it themselves.

@ramon-garcia
Author

I have been quite surprised by the very bad performance of regular expressions in Go.
I hope to contribute a fix. It seems that the RE2 regular expression library (on which the Go regular expression implementation is based) is much faster: I got a 10x performance factor with a simple benchmark.

Porting optimizations from RE2 to the Go regexp library cannot be that complicated.

@ruflin
Contributor

ruflin commented Dec 19, 2017

@ramon-garcia Also have a look at the optimizations to regexp that @urso made, like #2433 (there are more PRs related to Matcher).

What do you think of introducing dissect instead of going with regexp?

@ramon-garcia
Author

ramon-garcia commented Dec 19, 2017

@ruflin Regular expressions are difficult to replace. For instance, a log file contains many uninteresting things, and also ssh login successes and failures. One wants to extract the user, the IP address and the result of the authentication, and nothing else.
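A hedged sketch of that ssh case with Go's standard regexp package; the log line and the pattern are illustrative, not taken from the PR. One regexp covers both sshd outcomes and extracts only the result, the user and the source IP, ignoring everything else on the line.

```go
package main

import (
	"fmt"
	"regexp"
)

// sshRe matches both accepted and failed password attempts; the
// "invalid user " prefix appears only on some failed attempts.
var sshRe = regexp.MustCompile(
	`(Accepted|Failed) password for (?:invalid user )?(\S+) from (\S+)`)

func main() {
	line := "sshd[123]: Failed password for invalid user admin from 10.0.0.5 port 51122 ssh2"
	if m := sshRe.FindStringSubmatch(line); m != nil {
		fmt.Printf("result=%s user=%s ip=%s\n", m[1], m[2], m[3])
	}
}
```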

Exchange message tracking: since it is a CSV file, it looks like one can parse it by splitting on commas ... until one sees that the subject lines of messages can also contain commas. So commas must be allowed in fields, provided they are surrounded by double quotes. For instance, the line
sender@msg.com,"Hello, how are you"
contains two fields, not three.

@ruflin
Contributor

ruflin commented Dec 20, 2017

regexp is definitely more powerful than dissect. I'm kind of hoping dissect can already address a large portion of the problem; it will definitely not be able to solve all of it.

There are also 2 problems from my perspective here:

  • taking a message apart to do filtering on certain "entries"
  • taking a string, structure it to put it into json format

The two problems have a big overlap, but I think the first one does not necessarily need all the features that the second one needs: with option 2 an exact match of fields is required, while in option 1 it is often enough to split the message into chunks and then apply a regexp to the specific field.

I think dissect can do more than just tokenize a string, but your problem above is interesting. @ph Do you know if dissect can handle the above "challenge"?

@ph
Contributor

ph commented Dec 20, 2017

@ramon-garcia @ruflin Regular expressions are indeed more powerful than dissect, at the expense of speed. If you look at the dissect test suite, we still support quite a few things though.

Concerning Logstash, we were also suggesting using conditions with regexps to decide whether a dissect filter should be applied, or chaining multiple dissects together.

I think %{sender},"%{subject}" would be valid dissect syntax for the example above.

Dissect is more like a tokenizer.
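A minimal sketch of such a tokenizer (hypothetical code, not Beats' later dissect implementation): scan for the literal separators between %{key} references, with no regular expressions involved. Adjacent keys with no separator between them are not supported in this sketch. It does handle the quoted-comma example above, because the value of %{sender} runs up to the first occurrence of the literal separator `,"`.

```go
package main

import (
	"fmt"
	"strings"
)

// dissect tokenizes msg according to pattern, a string of %{key}
// references separated by literal text.
func dissect(pattern, msg string) (map[string]string, bool) {
	out := map[string]string{}
	for {
		start := strings.Index(pattern, "%{")
		if start < 0 {
			// the remaining pattern is a trailing literal and must
			// consume the rest of the message exactly
			if pattern == msg {
				return out, true
			}
			return nil, false
		}
		// the literal before the key must match the head of the message
		if !strings.HasPrefix(msg, pattern[:start]) {
			return nil, false
		}
		msg, pattern = msg[start:], pattern[start:]
		end := strings.Index(pattern, "}")
		key := pattern[2:end]
		pattern = pattern[end+1:]
		// the value runs up to the next literal separator
		sep := pattern
		if next := strings.Index(pattern, "%{"); next >= 0 {
			sep = pattern[:next]
		}
		if sep == "" {
			out[key] = msg // last key with no trailing literal
			return out, pattern == ""
		}
		i := strings.Index(msg, sep)
		if i < 0 {
			return nil, false
		}
		out[key] = msg[:i]
		msg = msg[i:]
	}
}

func main() {
	fields, ok := dissect(`%{sender},"%{subject}"`,
		`sender@msg.com,"Hello, how are you"`)
	fmt.Println(ok, fields["sender"], "/", fields["subject"])
}
```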

@ph
Contributor

ph commented Jan 17, 2020

Sorry for taking a long time to reply.

We have discussed this internally and we have decided to close this issue, even though being able to use grok expressions in a Beats processor would be a nice enhancement. We think at the moment that the Grok processors in both Logstash and the Elasticsearch ingest pipeline are good enough for most cases. Using regular expressions on every event will slow things down quite a bit, which goes against the idea of getting events out of the machine as fast as possible. We also recommend using dissect instead if possible.

There is also the possibility of creating a plugin that Beats will load to add this feature; ping me if you want to go that route.

@ph ph closed this Jan 17, 2020
Labels
discuss Issue needs further discussion. Filebeat Filebeat
6 participants