Proposal Generic Filtering (Phase 2) #451

Closed
monicasarbu opened this issue Dec 4, 2015 · 15 comments

@monicasarbu
Contributor

This is a follow up from the discussion: https://github.com/elastic/libbeat/issues/336

The goal here is to reduce the number of exported events and fields before they are sent over the network. The goal is not to add or change fields; that can be done later by Logstash.

This is part of the Filtering Phase 2.

Use cases to solve

  • Packetbeat: Drop all 200 OK transactions
  • Choose a set of fields to be exported
  • Remove a set of fields that are not of interest to the user

Requirements

  • define a generic filtering that is implemented in libbeat and used by all the Beats, instead of having a specific filtering implementation for each Beat.
  • be able to use event fields (e.g. http.response.code) in the condition
  • be able to drop the event or remove a certain field if the condition is fulfilled

Disadvantages

If we want to use generic conditions in the filtering, then we need to apply the filtering to the created event object. In other words, we need to build the event object before evaluating the filtering condition. Building the event object might be expensive, as we need to compute all the fields, and if we then decide to drop the event, that time is wasted.

To mitigate this, I propose deciding on a minimum set of fields to be exported by each Beat. Additional fields can be added by enabling options in the configuration file. These options have to be implemented in each Beat, but they offer the best performance. For example, Filebeat can implement an option like “exclude_files” much more efficiently than the generic filtering can.

Proposal

The filtering rules are executed in libbeat before publishing the event. A list of actions is defined under the filter section.

Supported actions:

  • drop_event
  • drop_fields
  • include_fields

drop event

The syntax for dropping an event:

drop_event:
    condition

drop fields

The syntax for dropping fields:

drop_fields:
    condition
    fields: ['load', 'swap']

include fields

The syntax for including fields could be something like this:

include_fields:
   condition
   fields: ["mem.used_p", "swap.used_p"]

Conditions: (Phase 2)

  • equals
  • contains
  • regexp
  • range
  • and
  • or
  • not
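
As an illustration of how one of these conditions might be checked, here is a minimal Go sketch, assuming a flattened event map with dotted keys; the Event type and matchEquals helper are hypothetical, not the actual libbeat API:

package main

import "fmt"

// Hypothetical flattened event: dotted field names used directly as map keys.
type Event map[string]interface{}

// matchEquals checks a condition like `equals: "http.response.code": 200`.
// Illustration of the proposed semantics only, not the actual libbeat code.
func matchEquals(event Event, want map[string]interface{}) bool {
    for field, expected := range want {
        actual, found := event[field]
        if !found || actual != expected {
            return false
        }
    }
    return true
}

func main() {
    event := Event{"type": "http", "http.response.code": 200}
    cond := map[string]interface{}{"http.response.code": 200}
    fmt.Println(matchEquals(event, cond)) // true -> drop_event would apply
}

The same pattern would extend to contains, regexp and range by swapping the comparison.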

Examples

  • Drop all 200 OKs in Packetbeat

filter:
  - drop_event:
      equals:
        "http.response.code": 200

  • Drop all events where cpu.user_p = 0 or mem.used_p = 0

filter:
  - drop_event:
      equals:
        "cpu.user_p": 0

  - drop_event:
      equals:
        "mem.used_p": 0

or

filter:
  - drop_event:
      or:
        - equals:
            "cpu.user_p": 0
        - equals:
            "mem.used_p": 0

Note: You can follow the status of this feature under the PR: #830

@monicasarbu monicasarbu added the "discuss (Issue needs further discussion)" label Dec 4, 2015
@ruflin
Contributor

ruflin commented Dec 5, 2015

For reference: The previous discussion and comments can be found here: https://github.com/elastic/libbeat/issues/336

@tsg tsg added the libbeat label Dec 7, 2015
@monicasarbu monicasarbu changed the title from "Proposal Filtering Implementation" to "Proposal Generic Filtering (Phase 2)" Dec 7, 2015
@urso

urso commented Dec 10, 2015

Should we also take the 'ingest node' syntax into account?

e.g. elastic/elasticsearch#14647

@kaem2111

Hi, I am an outsider and a newbie, but as a customer with log-filtering experience I'll take the opportunity to give my two cents on the discussion here:

For filtering you need include and exclude expressions. But it is very important in which order these expressions are executed, and that only the really relevant ones are evaluated.
If you have a linear list of 1000 filters, it will never perform well.
Therefore the filters have to be structured using nested ‘if’ clauses.

Look at the following code (it would have to be converted to YAML later):

1: if type = proc
2:     if proc.cpu.total > 100
3:         if /myhost/
4:             include
5:         if proc.cpu.total > 100000
6:             exclude
7:         exclude
8:  if type = filesystem
9:     include dropping: fs.device_name fs.mount_point

If the condition in line 1 is true, the next line (2) is processed; otherwise processing jumps to the next line on the same level (line 8). Line 2 contains a further condition; this works like an ‘AND’ with line 1.
If conditions are on the same level (like lines 1 and 8), they work like an ‘OR’.
In line 3 there is a condition in short form (same as: if inputline =~ /myhost/). If all conditions (1-3) are true, the document should be created (line 4). Processing ends when the first matching include or exclude is found (lines 4, 6, 7). If line 3 and line 5 are both false, line 7 is executed. If you comment out line 7, no decision is made within the branch 2-7 and processing simply continues with the next line (line 8). Lines 5-6 show how to exclude irrationally high values for total CPU. Line 9 shows a use case for dropping fields; it is only useful for include statements, of course.

By having nested conditions, the overall filter performance is improved. In our example, for type = filesystem we only process 3 of 9 lines (1, 8 and 9), as illustrated in the sketch below.
Moreover, it is easier to find the place where to add additional includes/excludes.
Not least, customers can improve filter performance by arranging the if-conditions according to the cardinality and frequency of their data!
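
For illustration, a minimal Go sketch of evaluating such a nested rule tree; the types are hypothetical and simply assume the semantics described above (siblings act as OR, nesting acts as AND, the first include/exclude wins):

package main

import "fmt"

type Event map[string]string

// verdict of a rule: include, exclude, or no decision yet.
type verdict int

const (
    noDecision verdict = iota
    include
    exclude
)

// rule is a hypothetical nested filter node mirroring the indented
// pseudo-config above.
type rule struct {
    cond     func(Event) bool // nil means "always true"
    decision verdict          // include/exclude for leaf actions, else noDecision
    children []rule
}

func evaluate(rules []rule, e Event) verdict {
    for _, r := range rules {
        if r.cond != nil && !r.cond(e) {
            continue // condition failed: try the next sibling (OR)
        }
        if r.decision != noDecision {
            return r.decision // first matching include/exclude wins
        }
        if v := evaluate(r.children, e); v != noDecision {
            return v
        }
    }
    return noDecision // no rule decided: the caller applies its default
}

func main() {
    // Equivalent of: if type = filesystem -> include
    rules := []rule{
        {
            cond:     func(e Event) bool { return e["type"] == "filesystem" },
            children: []rule{{decision: include}},
        },
    }
    fmt.Println(evaluate(rules, Event{"type": "filesystem"}) == include) // true
}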

Another topic: only include/exclude?

Include/exclude are very radical actions. Many tools use something like trace levels and trace modules to control filtering. So assume you have a persistent score value with boosted tags stored somewhere in a registry file, in the form:

score 70 net^1.1 appl^1.2 perf^1.0

and assume that, instead of coding include/exclude, you use a command like line 4:

4:      score 25 net^0.5 appl^1.3 perf^1.5

which contains the score value and a list of ‘tags’ with boost factors, representing how relevant this line is for the named tag. When processing line 4, you can now calculate a score from the line score multiplied by the persistent boost factors:

net:      13   integer(25 * 0.5 * 1.1)
appl:     39   integer(25 * 1.3 * 1.2)
perf:     37   integer(25 * 1.5 * 1.0)
total:    89

Because the total score of 89 is greater than 70, we would output that document (= include); otherwise not (= exclude). Moreover, if we lower the threshold of 70 (e.g. to 50 for production systems), we will get more output lines from that system. Additionally, if we detect network problems, we could dynamically increase the persistent net^1.1 boost factor to net^2.0 and get more net messages without being flooded by additional performance entries. (Think about the possibilities!) A sketch of this arithmetic follows below.
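
A minimal Go sketch of the scoring arithmetic described above; the rule format, registry values and threshold are the hypothetical ones from the example:

package main

import "fmt"

// Hypothetical persistent boost factors, as read from a registry file of the
// form: "score 70 net^1.1 appl^1.2 perf^1.0".
var persistentBoost = map[string]float64{"net": 1.1, "appl": 1.2, "perf": 1.0}

const threshold = 70

// scoreLine evaluates a rule like "score 25 net^0.5 appl^1.3 perf^1.5":
// per tag, line score * line boost * persistent boost, truncated to int.
func scoreLine(base int, lineBoost map[string]float64) (map[string]int, int) {
    perTag := map[string]int{}
    total := 0
    for tag, b := range lineBoost {
        // small epsilon so float rounding does not change the truncated value
        s := int(float64(base)*b*persistentBoost[tag] + 1e-9)
        perTag[tag] = s
        total += s
    }
    return perTag, total
}

func main() {
    perTag, total := scoreLine(25, map[string]float64{"net": 0.5, "appl": 1.3, "perf": 1.5})
    // net: 13, appl: 39, perf: 37, total: 89 > 70 -> include
    fmt.Println(perTag, total, "include:", total > threshold)
}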

Maybe we could add the beat.score (before multiplying with the persistent values) to the document, e.g. to select only messages with high performance relevance.

That should be enough for today. I hope I could share my ideas and make them understandable, and I hope we will design a fantastic solution.

@kaem2111

Previously I wrote:

Additionaly if we determine network problems, we could dynamically 
increase the persistent score net^1.1 boost factor to net^2.0 and get more net-messages without
being flooded by additional performance entries

These increases would normally be triggered by other log messages, so we should introduce another command to implement this. Assume the following:

6: if /system crashed/
7:     scoretimer 600 50 net^1.6 appl^1.7 perf^0

If a trigger is found, the persistent score is changed to the new values for 600 seconds.
The time must be based on the document timestamp, not on the actual time.
After the 600 seconds have passed, the default values are used again.
This default could be set up (by using 0 seconds) in:

1:     scoretimer 0 70 net^1.1 appl^1.2 perf^1.0

Executing further scoretimer commands will overwrite existing ones. This way,

8: if /system up and running/
9:     scoretimer 1 50 

will clear an existing scoretimer and establish the default again (after 1 second).

Using this feature, dynamic changes can be configured by a customer based on the log messages found.
By the way, those log messages should always be 'included' in the output (tag: newscore) to document the score changes.

@kaem2111

monicasarbu wrote:

Building the event object might be expensive as we need to calculate all the fields

Moreover, most fields needed for filtering are extracted at the earliest in Logstash using grok, which is much too late. And for me it doesn't make sense to implement grok-like functionality here on the client side.

To realize field-based filtering nevertheless, we could use regexes with FindStringSubmatch.
Assume:

1: if /access for user (\w+)/
2:     if $1 = "unknown"
3:         score 10 access^0.5
4:     score 40 access^2.0
5: if /(httpcode (\d+))/
6:     if $2 > 200
7:         score 40 http^2.0

The values to test are extracted using parentheses. The comparison value (line 2: "unknown", line 6: 200) determines the kind of comparison (relevant because '1000' > '200' is not the same as 1000 > 200).
The cost of this solution is low if we have to use regexps anyway. A rough sketch follows below.
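
A rough Go sketch of the capture-group comparison described above; the rule syntax and field names are made up, only regexp.FindStringSubmatch and strconv are real APIs:

package main

import (
    "fmt"
    "regexp"
    "strconv"
)

// Regex with a capture group, as in: if /httpcode (\d+)/
var httpRe = regexp.MustCompile(`httpcode (\d+)`)

// matchHTTPCode reports whether the line contains an httpcode whose numeric
// value is greater than the threshold (numeric, not string, comparison).
func matchHTTPCode(line string, threshold int) bool {
    m := httpRe.FindStringSubmatch(line)
    if m == nil {
        return false
    }
    code, err := strconv.Atoi(m[1]) // m[1] is the first capture group
    if err != nil {
        return false
    }
    return code > threshold // 1000 > 200, unlike "1000" > "200"
}

func main() {
    fmt.Println(matchHTTPCode("GET /index httpcode 404", 200)) // true
    fmt.Println(matchHTTPCode("GET /index httpcode 200", 200)) // false
}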

@monicasarbu
Contributor Author

@kaem2111 Thank you for so many great ideas. We are considering having a "language" similar to the one from Ingest Node (elastic/elasticsearch#14647), so that our users don't have to learn two different languages, even if their purposes are a bit different.

@monicasarbu
Contributor Author

@kaem2111

Let's consider:
Option1:

drop_event:
  condition

Option2:

if condition
  drop_event

After being processed, both options are represented the same way in memory. In both options, we check the condition first and, if it matches, we evaluate the action.

I think adding nested conditionals complicates the syntax of the language a bit, as the user needs to write more.

Option1:

  if expr1
   if expr2

Option2:

if expr1 and expr2

If we have expr1 and expr2, then the check stops anyway if expr1 fails and continues if it's true, so using a single if doesn't affect performance.

@kaem2111

@monicasarbu:
I looked at elastic/elasticsearch#14647 and I feel they are still searching for an ideal solution; no agreement yet. Moreover, I believe we do not need a whole new programming language to learn, but only some fast, selective filter configuration.

I agree that at a low level they may just be different notations expressing the same thing, but with larger examples there are significant differences between the two options regarding performance.

If we transform my approach, e.g.

if expr1
    if expr2
        if expr3
            exclude
        if expr4
            include
    if expr5
        exclude
if expr6
    if expr7
        exclude

into your equivalent, we would get:

drop_event:
   expr1 and expr2 and expr3
include_event:
   expr1 and expr2 and expr4
drop_event:
   expr1 and expr5
drop_event:
   expr6 and expr7 

As you see, I need to introduce include_event, since there is no equivalent yet. If we say we do not need include_event because in the end everything is automatically included, then you always have to evaluate all drop_event conditions to the very end to find out that nothing matches for that event. If we assume we can filter out 10%, then you have to do that for 90% of your logs without any hits. So an (early-matching) include option is essential for performance.

Nested ifs were invented to check an expression only once. As you can see in the second sample without nested ifs, expr1 will be evaluated up to 3 times (even if it can be optimized within a single condition line). Especially for regexps this costs performance. Caching expr1 would be more complicated than using nested ifs.

Since the first matching include/exclude returns from the filter routine, all following ifs are automatically handled as 'else if', so we can skip the 'else' keyword and use only if.

If you want, I could provide some more complex, concrete examples to illustrate the performance differences.

@monicasarbu
Contributor Author

There is a bit of a difference between the conditionals suggested in this discussion and the ones from Ingest Node (elastic/elasticsearch#14647), so we need to decide which format to choose for the conditions. I would suggest keeping the current format, as it seems to be "shorter" in many cases.

In Ingest Node the condition has the following format:

"field_name": "OP": value

which makes it necessary to group the condition under an additional section, maybe when?

Let's take an example:

- drop_event:
    range:
      "cpu.user_p":
        "gt": 0
        "lt": 0.4

that would be translated to the following using the Ingest Node conditionals:

- drop_event:
    when:
      and:
        "cpu.user_p":
          "gt": 0
        "cpu.user_p":
          "lt": 0.4

@monicasarbu
Contributor Author

Another example:

- drop_event:
    equals:
      "cpu.user_p": 0

that translates to the following using the Ingest Node conditionals:

- drop_event:
    when:
      "cpu.user_p":
        "eq": 0

@monicasarbu
Contributor Author

An option would be to change

- drop_fields:
    fields: ['load', 'swap']
    when:
      type: system

to

- drop_fields:
    equals:
      type: "system"
    fields: ['load', 'swap']

So, all filters would have the following format:

- action_name:
   condition
   argument 1
   argument 2
    ....
   argument n

and we can identify the condition because it starts with one of: "range", "equals", "regexp", "contains", "or", "and", etc. A rough sketch of this split follows below.
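
A minimal Go sketch of how the condition could be separated from the arguments based on these reserved keys; splitActionConfig is a hypothetical helper, not the actual implementation:

package main

import "fmt"

// Reserved keys that mark the condition part of an action's settings.
var conditionKeys = map[string]bool{
    "range": true, "equals": true, "regexp": true, "contains": true,
    "or": true, "and": true, "not": true,
}

// splitActionConfig separates the condition from the remaining arguments of a
// filter action (hypothetical helper, not the actual libbeat implementation).
func splitActionConfig(settings map[string]interface{}) (condition, args map[string]interface{}) {
    condition = map[string]interface{}{}
    args = map[string]interface{}{}
    for key, value := range settings {
        if conditionKeys[key] {
            condition[key] = value
        } else {
            args[key] = value
        }
    }
    return condition, args
}

func main() {
    cond, args := splitActionConfig(map[string]interface{}{
        "equals": map[string]interface{}{"type": "system"},
        "fields": []string{"load", "swap"},
    })
    fmt.Println("condition:", cond)
    fmt.Println("arguments:", args)
}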

@cleesmith

Since this is about filtering data, would it be more useful to create another Beat (say, Filterbeat or Tapbeat) that can be chained with other Beats to alter the data in flight? I know it's not an original idea, and maybe not a good one. It just seems like libbeat should remain unbloated, with the single purpose of helping to build a Beat. Tapping into data seems like a common need. I'm just spitballing here.

@urso

urso commented Jan 27, 2016

@cleesmith It's not so much about building a processing pipeline; that is mainly the domain of Logstash and ingest nodes in Elasticsearch. Filtering in libbeat is about reducing the number and size of events in order to reduce the required bandwidth and disk storage. Being part of libbeat, filtering should be used by operators/users only; developers should not be affected by it.

@erik-stephens

I'm not sure if this is the best issue to chime in on. If there is going to be something DSL-like, I recommend that it have the syntax and semantics of common programming idioms (think Chef, not Puppet). Something like this might be a good fit here:

https://github.com/glycerine/zygomys

@monicasarbu monicasarbu added the :Processors label and removed the "discuss (Issue needs further discussion)" label Apr 21, 2016
@monicasarbu
Contributor Author

Closing it as the status is tracked under #1447.
