Proposal Generic Filtering (Phase 2) #451
Comments
For reference: The previous discussion and comments can be found here: https://github.com/elastic/libbeat/issues/336
Should we also take the 'ingest node' syntax into account?
Hi, I am an outsider and a newbie, but as a customer with log-filtering experience I'll take the opportunity to add my two cents to the discussion. For filtering you need include and exclude expressions, but it is very important in which sequence these expressions are executed, and that only the really relevant ones are evaluated. Look at the following code (it would have to be converted to YAML later):
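A minimal sketch of such a numbered, nested pseudo-config (the field names, conditions and actions are purely illustrative):

```
1: if type == "performance"
2:   if cpu.user_p > 0.9
3:     if proc.name != "idle"
4:       include
5:   if swap.used_p > 0.8
6:     include
7:   exclude
8: if type == "filesystem"
9:   include
```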
If the condition in line 1 is true, the next line 2 is processed; otherwise it goes to the next line on the same level (line 8). In line 2 you have a further condition, which works like an 'AND' condition with line 1. By having nested conditions, the overall filter performance is improved: in our example, for type = filesystem we only process 3 of 9 lines (1, 8 and 9).

Another topic: only include/exclude? Include/exclude are very radical actions. Many tools use something like a trace level and trace modules to control filtering. So assume you have a persistent score value with boosted tags stored somewhere in a registry file, in the form:
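(illustrative values, chosen to match the numbers used below)

```
score: 70
tags: perf^1.05 net^1.1 io^0.9
```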
and assume that, instead of coding include/exclude, you use a command like line 4:
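(the score value and tag boosts are made up)

```
4:       score 85 perf^1.0 net^0.2
```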
which contains the score value and a list of 'tags' with boost factors, representing how relevant this value is for the named tag. When processing line 4, you can now calculate a score based on the line score multiplied with the persistent score as:
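(one possible reading of this, using the illustrative numbers above)

```
total_score = line_score * line_boost(perf) * persistent_boost(perf)
            = 85 * 1.0 * 1.05
            ≈ 89
```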
Because the total score 89 is greater than 70, we would output that document (= include), otherwise not (= exclude). Moreover, if we lower the persistent score of 70 (say, to 50 for production systems), we will get more output lines from that system. Additionally, if we detect network problems, we could dynamically increase the persistent net^1.1 boost factor to net^2.0 and get more net messages without being flooded by additional performance entries. (Think about the possibilities!) Maybe we can add the beat.score (before multiplying with persistent values) to the document, e.g. to select only messages with high performance relevance. This should be enough for today. I hope I have made my ideas understandable, and I hope we will design a fantastic solution.
Previously I wrote:

> Additionally, if we detect network problems, we could dynamically increase the persistent net^1.1 boost factor to net^2.0 and get more net messages without being flooded by additional performance entries.
These increases will normally be triggered by other log messages, so we should introduce another command to realize this. Assume the following:
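(the trigger pattern, duration and values are made up)

```
if message contains "network unreachable"
  scoretimer 600 score:50 net^2.0
```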
If a trigger is found, the persistent score is changed to the new values for 600 seconds.
Executing further scoretimer commands will overwrite existing ones. This way,
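(the exact syntax is assumed)

```
scoretimer 1
```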
will clear an existing scoretimer and establish the default again (after 1 second). Using this feature, dynamic changes can be configured by a customer, based on the log messages found.
monicasarbu wrote:
Moreover, most fields needed for filtering are evaluated at the earliest in Logstash using grok, which is much too late, and for me it doesn't make sense to implement grok-like functionality here on the client side. To realize field-based filtering nevertheless, we could use regexes with FindStringSubmatch.
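A sketch of the idea (the patterns, field names and syntax are invented):

```
1: if type == "auth"
2:   if message regex "user=(\w+)" == "unknown"
3:     exclude
4:   include
5: if type == "access"
6:   if message regex "response_time=(\d+)" > 200
7:     include
8:   exclude
```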
The values to test are extracted using parentheses. The comparison value (line 2: "unknown", line 6: 200) determines the kind of comparison (relevant because '1000' > '200' is not the same as 1000 > 200).
@kaem2111 Thank you for so many great ideas. We are considering having a "language" similar to the one from Ingest Node: elastic/elasticsearch#14647, so our users don't have to learn two different languages, even if their purposes are a bit different.
Let's consider:

Option1:
```
drop_event:
  condition
```

Option2:
```
if condition
  drop_event
```

After being processed, both options are represented the same way in memory. In both options, we first check the condition, and if it matches, we evaluate the action. I think adding nested conditionals complicates the syntax of the language a bit, as the user needs to write more:

Option1:
```
if expr1
  if expr2
```

Option2:
```
if expr1 and expr2
```

If we have …
@monicasarbu: I agree that at a low level they may just be different notations expressing the same thing, but with larger examples there are significant differences between the two options regarding performance. If we transform my approach, e.g.
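(an illustrative nested version, with made-up fields)

```
if type == "process"
  if cpu.user_p > 0.9
    include
  if mem.rss > 1000000000
    include
  if proc.name == "idle"
    exclude
if type == "filesystem"
  exclude
```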
into your equivalent, we would get:
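(again illustrative; the flat rules are written with inline conditions for readability, since the concrete condition syntax was still under discussion)

```yaml
filter:
  - include_event:
      when: type == "process" and cpu.user_p > 0.9
  - include_event:
      when: type == "process" and mem.rss > 1000000000
  - drop_event:
      when: type == "process" and proc.name == "idle"
  - drop_event:
      when: type == "filesystem"
```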
As you can see, I need to introduce include_event, since there was no equivalent yet. If we say we do not need include_event, because at the end everything is automatically included, then you always have to evaluate all drop_event conditions to the end to find out that nothing matches for that include. If we assume we can filter out 10%, then you have to do that for 90% of your logs, without any hits. So an (early matching) include option is essential for performance.

Nested ifs were invented to check an expression only once. As you can see in the second sample, without nested ifs expr1 will be evaluated up to 3 times (even if it can be optimized within a single condition line). Especially for regexes this costs performance, and caching expr1 would be more complicated than using nested ifs. Since the first matching include/exclude returns from the filter routine, all following ifs are automatically handled as 'else if', so we can skip the 'else' keyword and only use if. If you want, I could provide some more complex concrete examples to illustrate the performance differences.
There is a bit of a difference between the conditionals suggested in this discussion and the ones from Ingest Node (elastic/elasticsearch#14647). So, we need to decide what format to choose for conditions. I would suggest keeping the current format, as it seems to be "shorter" in many cases. In Ingest Node the condition has the following format:

```
"field_name":
  "OP": value
```

which brings the necessity to group the condition under an additional section, maybe `when`.

Let's take an example:

```yaml
- drop_event:
    range:
      "cpu.user_p":
        "gt": 0
        "lt": 0.4
```

That would be translated to the following using the Ingest Node conditionals:

```yaml
- drop_event:
    when:
      and:
        - "cpu.user_p":
            "gt": 0
        - "cpu.user_p":
            "lt": 0.4
```
Another example:

```yaml
- drop_event:
    equals:
      "cpu.user_p": 0
```

That translates to the following using the Ingest Node conditionals:

```yaml
- drop_event:
    when:
      "cpu.user_p":
        "eq": 0
```
An option would be to change

```yaml
- drop_fields:
    fields: ['load', 'swap']
    when:
      type: system
```

to

```yaml
- drop_fields:
    equals:
      type: "system"
    fields: ['load', 'swap']
```

So, all filters would have the following format:

```
- action_name:
    condition
    argument 1
    argument 2
    ....
    argument n
```

and we can identify …
Since this is about …
@cleesmith It's not so much about building a processing pipeline; that is mainly the domain of Logstash and Ingest Node in Elasticsearch. Filtering in libbeat is about reducing the number of events and event sizes to reduce the required bandwidth and disk storage. Being part of libbeat, filtering should be used by operators/users only; developers should not be affected by it.
I'm not sure if this is the best issue to chime in on. If there is going to be something DSL-like, I recommend it have the syntax and semantics of common programming idioms (think Chef, not Puppet). Something like this might be a good fit here: …
Closing it as the status is tracked under #1447.
This is a follow-up from the discussion: https://github.com/elastic/libbeat/issues/336
The goal here is to reduce the number of exported events and fields before they are sent over the network. It is not to add or change fields; that can be achieved later by Logstash.
This is part of the Filtering Phase 2.
Use cases to solve
Requirements
Disadvantages
If we want to use generic conditions in the filtering, then we need to apply the filtering on the created event object. In other words, we need to build the event object before running the filtering condition. Building the event object might be expensive, as we need to calculate all the fields, and if we then decide to drop the event, the time spent is wasted.
To improve this, I propose deciding on a minimum set of fields to be exported by each Beat. Additional fields can be added by enabling options in the configuration file. These options have to be implemented in each Beat, but have the advantage of maximum performance. For example, Filebeat can implement an option like `exclude_files` a lot more efficiently than it can be done in the generic filtering.
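For illustration, such a Beat-specific option lives next to the input configuration rather than in the generic filter section; a Filebeat 1.x-style sketch (the paths and pattern are examples) could look like:

```yaml
filebeat:
  prospectors:
    - paths:
        - /var/log/*.log
      # skip compressed, rotated logs before any event is created
      exclude_files: ['\.gz$']
```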
Proposal
The filtering rules are executed in libbeat before publishing the event. A list of actions is defined under the `filter` section. Supported actions:
drop event
The syntax for dropping an event:
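Presumably along these lines, reusing the condition style from the comments above (the condition itself is illustrative):

```yaml
filter:
  - drop_event:
      equals:
        type: "process"
```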
drop fields
The syntax for dropping fields:
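Judging from the drop_fields example in the comments above, roughly:

```yaml
filter:
  - drop_fields:
      fields: ['load', 'swap']
      when:
        type: system
```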
include fields
The syntax for including fields could be something like this:
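Assuming an action name of include_fields, mirroring drop_fields (the field names are illustrative):

```yaml
filter:
  - include_fields:
      fields: ["beat.hostname", "cpu.user_p"]
```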
Conditions: (Phase 2)
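The conditions discussed in the comments above are field/operator mappings such as equals and range, for example:

```yaml
equals:
  type: "process"

range:
  "cpu.user_p":
    "gt": 0
    "lt": 0.4
```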
Examples
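For instance, combining an action with a range condition (values illustrative):

```yaml
filter:
  - drop_event:
      range:
        "cpu.user_p":
          "gt": 0
          "lt": 0.4
```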
or
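an action with an equals condition, as in the comments above:

```yaml
filter:
  - drop_event:
      equals:
        "cpu.user_p": 0
```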
Note: You can follow the status of this feature in PR #830.