Proposal Generic Filtering (Phase 2) #451

Closed
monicasarbu opened this issue Dec 4, 2015 · 15 comments

@monicasarbu
Contributor

This is a follow up from the discussion: https://github.com/elastic/libbeat/issues/336

The goal here is to reduce the number of exported events and fields before they are sent over the network. The goal is not to add or change fields; that can be done later by Logstash.

This is part of the Filtering Phase 2.

Use cases to solve

  • Packetbeat: Drop all 200 OK transactions
  • Choose a set of fields to be exported
  • Remove a set of fields that are not of interest to the user

Requirements

  • define a generic filtering that is implemented in libbeat and used by all the Beats, instead of having a specific filtering implementation for each Beat.
  • be able to use event fields (e.g. http.response.code) in the condition
  • be able to drop the event or remove a certain field if the condition is fulfilled

Disadvantages

If we want to use generic conditions in the filtering, then we need to apply the filtering to the created event object. In other words, we need to build the event object before evaluating the filtering condition. Building the event object might be expensive, as we need to compute all the fields, and if we then decide to drop the event, that time is wasted.

To mitigate this, I propose deciding on a minimum set of fields to be exported by each Beat. Additional fields can be added by enabling options in the configuration file. These options have to be implemented in each Beat, but they offer the best performance. For example, Filebeat can implement an option like “exclude_files” much more efficiently than the generic filtering can.

Proposal

The filtering rules are executed in libbeat before publishing the event. A list of actions is defined under the filter section.

Supported actions:

  • drop_event
  • drop_fields
  • include_fields

drop event

The syntax for dropping an event:

drop_event:
    condition

drop fields

The syntax for dropping fields:

drop_fields:
    condition
    fields: ['load', 'swap']

include fields

The syntax for including fields could be something like this:

include_fields:
   condition
   fields: ["mem.used_p", "swap.used_p"]

Conditions: (Phase 2)

  • equals
  • contains
  • regexp
  • range
  • and
  • or
  • not
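
As an illustration of how one of these conditions might be checked, here is a minimal Go sketch, assuming a flattened event map with dotted keys; the Event type and matchEquals helper are hypothetical, not the actual libbeat API:

package main

import "fmt"

// Hypothetical flattened event: dotted field names used directly as map keys.
type Event map[string]interface{}

// matchEquals checks a condition like `equals: "http.response.code": 200`.
// Illustration of the proposed semantics only, not the actual libbeat code.
func matchEquals(event Event, want map[string]interface{}) bool {
    for field, expected := range want {
        actual, found := event[field]
        if !found || actual != expected {
            return false
        }
    }
    return true
}

func main() {
    event := Event{"type": "http", "http.response.code": 200}
    cond := map[string]interface{}{"http.response.code": 200}
    fmt.Println(matchEquals(event, cond)) // true -> drop_event would apply
}

The same pattern would extend to contains, regexp and range by swapping the comparison.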

Examples

  • Drop all 200 OKs in Packetbeat

filter:
  - drop_event:
      equals:
        "http.response.code": 200

  • Drop all events where cpu.user_p = 0 or mem.used_p = 0

filter:
  - drop_event:
      equals:
        "cpu.user_p": 0

  - drop_event:
      equals:
        "mem.used_p": 0

or

filter:
  - drop_event:
      or:
        - equals:
            "cpu.user_p": 0
        - equals:
            "mem.used_p": 0

Note: You can follow the status of this feature under the PR: #830

@monicasarbu monicasarbu added the "discuss (Issue needs further discussion)" label Dec 4, 2015
@ruflin
Contributor

ruflin commented Dec 5, 2015

For reference: The previous discussion and comments can be found here: https://github.com/elastic/libbeat/issues/336

@tsg tsg added the libbeat label Dec 7, 2015
@monicasarbu monicasarbu changed the title from "Proposal Filtering Implementation" to "Proposal Generic Filtering (Phase 2)" Dec 7, 2015
@urso

urso commented Dec 10, 2015

Should we also take the 'ingest node' syntax into account?

e.g. elastic/elasticsearch#14647

@kaem2111

Hi, I am an outsider and a newbie, but as a customer with log-filtering experience I'll take the opportunity to give my two cents on the discussion here:

For filtering you need include and exclude expressions. But it is very important in which order these expressions are executed, and that only the really relevant ones are evaluated.
If you have a linear list of 1000 filters, it will never perform well.
Therefore the filters have to be structured using nested ‘if’ clauses.

Look at the following code (it would have to be converted to YAML later):

1: if type = proc
2:     if proc.cpu.total > 100
3:         if /myhost/
4:             include
5:         if proc.cpu.total > 100000
6:             exclude
7:         exclude
8:  if type = filesystem
9:     include dropping: fs.device_name fs.mount_point

If the condition in line 1 is true, the next line (2) is processed; otherwise processing jumps to the next line on the same level (line 8). Line 2 contains a further condition; this works like an ‘AND’ with line 1.
If conditions are on the same level (like lines 1 and 8), they work like an ‘OR’.
In line 3 there is a condition in short form (same as: if inputline =~ /myhost/). If all conditions (1-3) are true, the document should be created (line 4). Processing ends when the first matching include or exclude is found (lines 4, 6, 7). If line 3 and line 5 are both false, line 7 is executed. If you comment out line 7, no decision is made within the branch 2-7 and processing simply continues with the next line (line 8). Lines 5-6 show how to exclude irrationally high values for total CPU. Line 9 shows a use case for dropping fields; it is only useful for include statements, of course.

By having nested conditions, the overall filter performance is improved. In our example, for type = filesystem we only process 3 of 9 lines (1, 8 and 9), as illustrated in the sketch below.
Moreover, it is easier to find the place where to add additional includes/excludes.
Not least, customers can improve filter performance by arranging the if-conditions according to the cardinality and frequency of their data!
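
For illustration, a minimal Go sketch of evaluating such a nested rule tree; the types are hypothetical and simply assume the semantics described above (siblings act as OR, nesting acts as AND, the first include/exclude wins):

package main

import "fmt"

type Event map[string]string

// verdict of a rule: include, exclude, or no decision yet.
type verdict int

const (
    noDecision verdict = iota
    include
    exclude
)

// rule is a hypothetical nested filter node mirroring the indented
// pseudo-config above.
type rule struct {
    cond     func(Event) bool // nil means "always true"
    decision verdict          // include/exclude for leaf actions, else noDecision
    children []rule
}

func evaluate(rules []rule, e Event) verdict {
    for _, r := range rules {
        if r.cond != nil && !r.cond(e) {
            continue // condition failed: try the next sibling (OR)
        }
        if r.decision != noDecision {
            return r.decision // first matching include/exclude wins
        }
        if v := evaluate(r.children, e); v != noDecision {
            return v
        }
    }
    return noDecision // no rule decided: the caller applies its default
}

func main() {
    // Equivalent of: if type = filesystem -> include
    rules := []rule{
        {
            cond:     func(e Event) bool { return e["type"] == "filesystem" },
            children: []rule{{decision: include}},
        },
    }
    fmt.Println(evaluate(rules, Event{"type": "filesystem"}) == include) // true
}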

Another topic: only include/exclude?

Include/exclude are very radical actions. Many tools use something like trace levels and trace modules to control filtering. So assume you have a persistent score value with boosted tags stored somewhere in a registry file, in the form:

score 70 net^1.1 appl^1.2 perf^1.0

and assume that, instead of coding include/exclude, you use a command like line 4:

4:      score 25 net^0.5 appl^1.3 perf^1.5

which contains the score value and a list of ‘tags’ with boost factors, representing how relevant this line is for the named tag. When processing line 4, you can now calculate a score from the line score multiplied by the persistent boost factors:

net:      13   integer(25 * 0.5 * 1.1)
appl:     39   integer(25 * 1.3 * 1.2)
perf:     37   integer(25 * 1.5 * 1.0)
total:    89

Because the total score of 89 is greater than 70, we would output that document (= include); otherwise not (= exclude). Moreover, if we lower the threshold of 70 (e.g. to 50 for production systems), we will get more output lines from that system. Additionally, if we detect network problems, we could dynamically increase the persistent net^1.1 boost factor to net^2.0 and get more net messages without being flooded by additional performance entries. (Think about the possibilities!) A sketch of this arithmetic follows below.
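
A minimal Go sketch of the scoring arithmetic described above; the rule format, registry values and threshold are the hypothetical ones from the example:

package main

import "fmt"

// Hypothetical persistent boost factors, as read from a registry file of the
// form: "score 70 net^1.1 appl^1.2 perf^1.0".
var persistentBoost = map[string]float64{"net": 1.1, "appl": 1.2, "perf": 1.0}

const threshold = 70

// scoreLine evaluates a rule like "score 25 net^0.5 appl^1.3 perf^1.5":
// per tag, line score * line boost * persistent boost, truncated to int.
func scoreLine(base int, lineBoost map[string]float64) (map[string]int, int) {
    perTag := map[string]int{}
    total := 0
    for tag, b := range lineBoost {
        // small epsilon so float rounding does not change the truncated value
        s := int(float64(base)*b*persistentBoost[tag] + 1e-9)
        perTag[tag] = s
        total += s
    }
    return perTag, total
}

func main() {
    perTag, total := scoreLine(25, map[string]float64{"net": 0.5, "appl": 1.3, "perf": 1.5})
    // net: 13, appl: 39, perf: 37, total: 89 > 70 -> include
    fmt.Println(perTag, total, "include:", total > threshold)
}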

Maybe we could add the beat.score (before multiplying with the persistent values) to the document, e.g. to select only messages with high performance relevance.

That should be enough for today. I hope I could share my ideas and make them understandable, and I hope we will design a fantastic solution.

@kaem2111

Previously I wrote:

Additionaly if we determine network problems, we could dynamically 
increase the persistent score net^1.1 boost factor to net^2.0 and get more net-messages without
being flooded by additional performance entries

These increases would normally be triggered by other log messages, so we should introduce another command to implement this. Assume the following:

6: if /system crashed/
7:     scoretimer 600 50 net^1.6 appl^1.7 perf^0

If a trigger is found, the persistent score is changed to the new values for 600 seconds.
The time must be based on the document timestamp, not on the actual time.
After the 600 seconds have passed, the default values are used again.
This default could be set up (by using 0 seconds) in:

1:     scoretimer 0 70 net^1.1 appl^1.2 perf^1.0

Executing further scoretimer commands will overwrite existing ones. This way,

8: if /system up and running/
9:     scoretimer 1 50 

will clear an existing scoretimer and establish the default again (after 1 second).

Using this feature, dynamic changes can be configured by a customer based on the log messages found.
By the way, those log messages should always be 'included' in the output (tag: newscore) to document the score changes.

@kaem2111

monicasarbu wrote:

Building the event object might be expensive as we need to calculate all the fields

Moreover, most fields needed for filtering are extracted at the earliest in Logstash using grok, which is much too late. And for me it doesn't make sense to implement grok-like functionality here on the client side.

To realize field-based filtering nevertheless, we could use regexes with FindStringSubmatch.
Assume:

1: if /access for user (\w+)/
2:     if $1 = "unknown"
3:         score 10 access^0.5
4:     score 40 access^2.0
5: if /(httpcode (\d+))/
6:     if $2 > 200
7:         score 40 http^2.0

The values to test are extracted using parentheses. The comparison value (line 2: "unknown", line 6: 200) determines the kind of comparison (relevant because '1000' > '200' is not the same as 1000 > 200).
The cost of this solution is low if we have to use regexps anyway. A rough sketch follows below.
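
A rough Go sketch of the capture-group comparison described above; the rule syntax and field names are made up, only regexp.FindStringSubmatch and strconv are real APIs:

package main

import (
    "fmt"
    "regexp"
    "strconv"
)

// Regex with a capture group, as in: if /httpcode (\d+)/
var httpRe = regexp.MustCompile(`httpcode (\d+)`)

// matchHTTPCode reports whether the line contains an httpcode whose numeric
// value is greater than the threshold (numeric, not string, comparison).
func matchHTTPCode(line string, threshold int) bool {
    m := httpRe.FindStringSubmatch(line)
    if m == nil {
        return false
    }
    code, err := strconv.Atoi(m[1]) // m[1] is the first capture group
    if err != nil {
        return false
    }
    return code > threshold // 1000 > 200, unlike "1000" > "200"
}

func main() {
    fmt.Println(matchHTTPCode("GET /index httpcode 404", 200)) // true
    fmt.Println(matchHTTPCode("GET /index httpcode 200", 200)) // false
}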

@monicasarbu
Contributor Author

@kaem2111 Thank you for so many great ideas. We are considering having a "language" similar to the one from Ingest Node (elastic/elasticsearch#14647), so that our users don't have to learn two different languages, even if their purposes are a bit different.

@monicasarbu
Contributor Author

@kaem2111

Let's consider:
Option1:

drop_event:
  condition

Option2:

if condition
  drop_event

After being processed, both options are represented the same way in memory. In both options, we check the condition first and, if it matches, we evaluate the action.

I think adding nested conditionals complicates the syntax of the language a bit, as the user needs to write more.

Option1:

  if expr1
   if expr2

Option2:

if expr1 and expr2

If we have expr1 and expr2, then the check stops anyway if expr1 fails and continues if it's true, so using a single if doesn't affect performance.

@kaem2111

@monicasarbu:
I looked at elastic/elasticsearch#14647 and I feel they are still searching for an ideal solution; no agreement yet. Moreover, I believe we do not need a whole new programming language to learn, but only some fast, selective filter configuration.

I agree that at a low level they may just be different notations expressing the same thing, but with larger examples there are significant differences between the two options regarding performance.

If we transform my approach, e.g.

if expr1
    if expr2
        if expr3
            exclude
        if expr4
            include
    if expr5
        exclude
if expr6
    if expr7
        exclude

into your equivalent, we would get:

drop_event:
   expr1 and expr2 and expr3
include_event:
   expr1 and expr2 and expr4
drop_event:
   expr1 and expr5
drop_event:
   expr6 and expr7 

As you see, I need to introduce include_event, since there is no equivalent yet. If we say we do not need include_event because in the end everything is automatically included, then you always have to evaluate all drop_event conditions to the very end to find out that nothing matches for that event. If we assume we can filter out 10%, then you have to do that for 90% of your logs without any hits. So an (early-matching) include option is essential for performance.

Nested ifs were invented to check an expression only once. As you can see in the second sample without nested ifs, expr1 will be evaluated up to 3 times (even if it can be optimized within a single condition line). Especially for regexps this costs performance. Caching expr1 would be more complicated than using nested ifs.

Since the first matching include/exclude returns from the filter routine, all following ifs are automatically handled as 'else if', so we can skip the 'else' keyword and use only if.

If you want, I could provide some more complex, concrete examples to illustrate the performance differences.

@monicasarbu
Contributor Author

There is a bit of a difference between the conditionals suggested in this discussion and the ones from Ingest Node (elastic/elasticsearch#14647), so we need to decide which format to choose for the conditions. I would suggest keeping the current format, as it seems to be "shorter" in many cases.

In Ingest Node the condition has the following format:

"field_name": "OP": value

which makes it necessary to group the condition under an additional section, maybe when?

Let's take an example:

- drop_event:
    range:
      "cpu.user_p":
        "gt": 0
        "lt": 0.4

that would be translated to the following using the Ingest Node conditionals:

- drop_event:
    when:
      and:
        "cpu.user_p":
          "gt": 0
        "cpu.user_p":
          "lt": 0.4

@monicasarbu
Contributor Author

Another example:

- drop_event:
    equals:
      "cpu.user_p": 0

that translates to the following using the Ingest Node conditionals:

- drop_event:
    when:
      "cpu.user_p":
        "eq": 0

@monicasarbu
Contributor Author

An option would be to change

- drop_fields:
    fields: ['load', 'swap']
    when:
      type: system

to

- drop_fields:
    equals:
      type: "system"
    fields: ['load', 'swap']

So, all filters would have the following format:

- action_name:
   condition
   argument 1
   argument 2
    ....
   argument n

and we can identify the condition because it starts with one of: "range", "equals", "regexp", "contains", "or", "and", etc. A rough sketch of this split follows below.
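
A minimal Go sketch of how the condition could be separated from the arguments based on these reserved keys; splitActionConfig is a hypothetical helper, not the actual implementation:

package main

import "fmt"

// Reserved keys that mark the condition part of an action's settings.
var conditionKeys = map[string]bool{
    "range": true, "equals": true, "regexp": true, "contains": true,
    "or": true, "and": true, "not": true,
}

// splitActionConfig separates the condition from the remaining arguments of a
// filter action (hypothetical helper, not the actual libbeat implementation).
func splitActionConfig(settings map[string]interface{}) (condition, args map[string]interface{}) {
    condition = map[string]interface{}{}
    args = map[string]interface{}{}
    for key, value := range settings {
        if conditionKeys[key] {
            condition[key] = value
        } else {
            args[key] = value
        }
    }
    return condition, args
}

func main() {
    cond, args := splitActionConfig(map[string]interface{}{
        "equals": map[string]interface{}{"type": "system"},
        "fields": []string{"load", "swap"},
    })
    fmt.Println("condition:", cond)
    fmt.Println("arguments:", args)
}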

@cleesmith

Since this is about filtering data, would it be more useful to create another Beat (say, Filterbeat or Tapbeat) that can be chained with other Beats to alter the data in flight? I know it's not an original idea, and maybe not a good one. It just seems like libbeat should remain unbloated, with the single purpose of helping to build a Beat. Tapping into data seems like a common need. I'm just spitballing here.

@urso

urso commented Jan 27, 2016

@cleesmith It's not so much about building a processing pipeline; that is mainly the domain of Logstash and ingest nodes in Elasticsearch. Filtering in libbeat is about reducing the number and size of events in order to reduce the required bandwidth and disk storage. Being part of libbeat, filtering should be used by operators/users only; developers should not be affected by it.

@erik-stephens

I'm not sure if this is the best issue to chime in on. If there is going to be something DSL-like, I recommend that it have the syntax and semantics of common programming idioms (think Chef, not Puppet). Something like this might be a good fit here:

https://github.com/glycerine/zygomys

@monicasarbu monicasarbu added the :Processors label and removed the "discuss (Issue needs further discussion)" label Apr 21, 2016
@monicasarbu
Contributor Author

Closing it as the status is tracked under #1447.
