-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for Grok expressions in Beats #5790
Conversation
Can one of the admins verify this patch? |
Will look into Travis build. |
Hi @ramon-garcia, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile? |
I think that all the e-mail addresses are in my profile. Am I missing something? |
I am trying to fix the travis build issues. It seems to complain about formatting issues. Unfortunately I am unable to do a make check in my Windows machine. |
Finally, I managed to do "make check" under a Linux machine. make check fails with beat/info.go
"goimports -d" shows no difference in files grok.go and grok_test.go. This branch is merged with the master branch. If there is anything that I can help with, here I am. |
@ramon-garcia, some of the commits have an invalid address, and the CLA check cannot process those, but you have probably force pushed this branch with amended commits, since the CLA check is green now? |
Hmm, something is weird here. When I manually check the PR, it's red, but the status is green here. Can you please make sure you have configured Git with your email, amend the commits with |
ec13d95
to
5a7c873
Compare
I have tried to rebase but prehaps I screwed up my branch. I will continue tomorrow. |
5a7c873
to
ec13d95
Compare
ec13d95
to
44a5923
Compare
Rebased with author changed. I hope now it is ok. |
Hi @ramon-garcia, apologies for the delayed answer. There are two reasons for which we hesitate to add Grok support in Beats:
For the two reasons above, we were thinking of adding a dissect filter, like Logstash has, before Grok, and see if that is enough for what people usually need. Dissect doesn't use regular expressions, and is faster on most work loads. Would you be interested in contributing a Dissect implementation? Apologies again for not giving you timely feedback, we wanted to discuss it internally first. |
My primary reason for developing this patch is that I want to store clean events, without noise. Storing millions of useless events that reflect repetitive activity is a waste of CPU, storage and obscures the important ones. With regard to CPU time, in my case, it is usually more scarce in the Elasticsearch server than in other servers. Perhaps this is not the case for everyone, but offering this option helps a lot of use cases. My experience with pipelines is that they are difficult to debug (there is no way to send a test event to the pipeline and see the output) and one must watch log files in Elasticsearch from time to time. By contrast, any test with filebeat can be done without touching an Elasticsearch server. For similar reasons, my experience with painless scripts was not good. Finally, there are legal and ethical reasons for avoiding shipping irrelevant logs. This is a general comment with the philosophy of making beats as little processing as possible. Shipping logs from a coworker's workstation that are not relevant for objective security reasons or other demonstrable reasons, that show personal activity, is unethical and probably illegal. Or, for instance, consider the case of the logs of a proxy server that shows web sites visited by coworkers. One could learn, for instance, political affiliation from them. Shipping or storing more that what can be objectively justified must be avoided. Anyway, in case of disagreement, anyone can take this version of beats and use themselves. |
I have been quite surprised by the very bad performance of regular expressions under Go. Porting optimizations from RE2 to Go regexp library cannot be that complicated. |
@ramon-garcia Also have a look at the optimizations to regexp that @urso made like #2433 (there are more PR's related to Matcher). What do you think of introducing dissect instead of going with regexp? |
@ruflin Regular expressions are difficult to replace. For instance, a log file contains many uninteresting things, and also ssh login successes and failures. One wants to extract the user, the IP address and the result of the authentication, and nothing else. Exchange message tracking: since it is a CSV file, it looks like one can parse it by splitting by commas ... until one sees that the subject lines of message can also contain commas. So commas must be allowed in fields, if they are surrounded by double quotes. For instance, the line |
There are also 2 problems from my perspective here:
The two problems have a big overlap, but I think for the first one not necessarly all the features are needed that are needed for 2, as with option 2 an exact match of fields is required and in 1 often splitting up in junks is enough and then do on the specific field a regexp. I think dissect can do more then just tokenize a string, but your problem above is interesting. @ph Do you know if dissect can handle the above "challenge"? |
@ramon-garcia @ruflin regular expression are indeed more powerful than dissect at the expanse of speed. If we look at the dissect test suite we still support quite a few things though. Concerning logstash we were also suggesting to use conditions with regexp to decided if a dissect filter could be applied or chain multiple dissect together.
Dissect is more like a tokenizer. |
Sorry for taking a long time to reply. We have discussed this internally and we have decided to close this issue, even if been able to use grok expression in a beats processor would be a nice enhancement. We do think at the moment that the Grok processors in both the Logstash the Elasticsearch ingest pipeline are good enough for most cases. Using regular expressions on every event will slow thing down quite a bit which is a bit against the idea of getting event out of the machine as fast as possible. We also recommend using dissect instead if possible. There is also the possibility to create a plugin that beats will load to add this feature, ping me if you want to go that route. |
Hello, I have developed Grok expression support in Beats.
I think that has already been discussed. But I need this feature.
First of all, to save a lot of network traffic in transmission of events to Elasticsearch. Many events are frequent and redundant, like web hits from monitoring system or firewall logs produced by frequent broadcast traffic. To drop these events, one needs from to classify them.
It was argued that it would make the development complex. But the grok.go code is just 232 lines .
Please consider accepting this request