Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ingest] Add comparison processor for supporting conditionals in pipeline definition #14647

Closed
talevy opened this issue Nov 10, 2015 · 17 comments
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement

Comments

@talevy
Copy link
Contributor

talevy commented Nov 10, 2015

The Ingest Pipeline should support the ability for conditional branching of processor execution.

The proposal here is to define a CompareProcessor which will accept a list of conditions that must all be met for specific processors to be executed. This processor would have a conditions parameter that would accept various boolean operators, such as gt, lt, gte, lte, eq, neq. If all of the provided conditions evaluate to true, then a set of processors defined in a field potentially called if_processors would be executed. If any conditions in this set evaluate to false, then processors in the else_processors list option would be executed. This type of conditional can be though of as an "if/else" processor, where the conditions field is a compounded boolean expression joined with an AND. Continuing with this analogy -- the if_processors block would be the if block, and the else_processors would be the else block.

Whether both if_processors and else_processors configuration options would be necessary is still up in the air, at least one must be defined.

Since the CompareProcessor is just another processor, nested comparisons would be possible.

Here is a potential processor definition for the CompareProcessor

{
  "compare": {
    "conditions": [
      "<field_name1>" : { "<OP1>": x },
      "<field_name2>" : { "<OP2>": y },
      "<field_name3>" : { "<OP3>": z }
    ],
    "if_processors": [...],
    "else_processors": [...]
  }
}

Here is an example pipeline definition with a Compare Processor

{
  "description": "_description",
  "processors": [
    {
      "compare": {
        "conditions": [
          "ip_address": { "eq": "127.0.0.1" },
          "user_age": { "gt": 13 }
        ],
        "if_processors": [
          "compare" : {
            "conditions" : [
               "country_name" : { "neq": "USA" }
               "if_processors" : [
                 {
                   "geoip" : {
                     "source_field" : "ip"
                   }
                 }
               ]
            ]
          }
        ],
        "else_processors" : [
          {
            "mutate" : {
              "remove" : ["country_name"]
            }
          }
        ]
      },
      "mutate" : {
        "update" : {
          "field2" : "_value"
        }
      }
    }
  ]
}
@talevy talevy added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Nov 10, 2015
@javanna
Copy link
Member

javanna commented Nov 10, 2015

Should we try and expand a bit more on what can be a condition and what syntax should be supported there? For instance can there be field values on the right side or only constants?

I would probably change naming, if it's an if why call it compare? and I think true_processors and false_processors might get confusing, why not use simply if and else and maybe call the processor "condition" or something along those lines? I think we should have @clintongormley have a look at the api and validate it also :)

@talevy
Copy link
Contributor Author

talevy commented Nov 10, 2015

Should we try and expand a bit more on what can be a condition and what syntax should be supported there? For instance can there be field values on the right side or only constants?

I was thinking of enabling the same mustache templating that is introduced here.

an example templated field value: {{ctx._source.department}}

This would enable fields to be compared to other fields.

"conditions" [
  "first_name": {"neq": "{{ctx._source.last_name}}"}
]

I would probably change naming, if it's an if why call it compare? and I think true_processors and false_processors might get confusing, why not use simply if and else and maybe call the processor "condition" or something along those lines?

do you propose something more similar to this syntax?

{
  "condition": {
    "conditions": [
      "field.hello" : { "neq":  "world" },
      "foo" : { "gt": 10 }
    ],
    "if": [...],
    "else": [...]
  }
}

@talevy talevy self-assigned this Nov 17, 2015
@talevy
Copy link
Contributor Author

talevy commented Nov 18, 2015

OK, @javanna @martijnvg, let me know what you think!

here is an in-progress branch of what I've come up with so far: link

a bit about what has been done:

Conditionals would be described as a separate processor within the pipeline. Branches of execution would be delegated within the ConditionalProcessor.

here is an example ConditionalProcessor config definition:

{
  "field_conditions": {
    "message" : { "neq": "hello world" },
    "age": { "gte": 13, "lt": 18 }
  },
  "match": [
    ... processor definitions to be executed in order (if all field_conditions match) ...
  ],
  "otherwise": [
    ... processor definitions to be executed in order (if any field condition in field_conditions fails to match) ...
  ]
}

To not create a dependency on the metadata/mustache-templating work, this Processor will start off only supporting hardcoded values provided within the config.

One major limitation of what has been done so far is that there is no "OR" logic anywhere... I will keep exploring to see how I could get that in. Also, all comparisons except for neq and eq require that the field values being compared are numbers.

** all naming is still up in the air, just trying out other names.

@martijnvg
Copy link
Member

To not create a dependency on the metadata/mustache-templating work, this Processor will start off only supporting hardcoded values provided within the config.

I think this is ok for the first version of the condition processor, At some point we also need to evaluate where else tempting is required and then we can get back to this.

One major limitation of what has been done so far is that there is no "OR" logic anywhere...

Maybe we should then add an extra level to the field_comparison? Like this:

...
"field_conditions": {
   "and" : {
       "message" : { "neq": "hello world" },
       "age": { "gte": 13, "lt": 18 }   
    },
    "or" :  {
      ...
   }
  }
...

If any comparisons are defined in the and element, then all must match and if there are comparisons defined in the or element then at least one must match.

@talevy
Copy link
Contributor Author

talevy commented Nov 19, 2015

I realize there is a huge limitation with this format...

a check to see if a field is either all lowercase or all uppercase for example...

if we place the boolean operator outside the field map, this would not be possible since the field name is a key.

"field_conditions": {
  "or": {
    "message" : { "eq": "hello world" },
    "message" : { "eq": "HELLO WORLD" }
  }
}

if we placed the boolean operator within each field, it would still not work since eq would appear as a key twice.

"field_conditions": {
  "message" : { "or" : { "eq": "hello world", "eq": "HELLO WORLD"} }
}

How about this?

"field_conditions": {
  ("and"|"or") : [
    { "message" : { "neq" : "hello world" } },
    { "message" : { "neq" : "HELLO WORLD" } },
    { "age" : { "gte" : 13 } },
    { "age" : { "lt" : 18 } }
  ]
}

the biggest beef I have with the suggestion above is that is can look rather
verbose when lots of relational comparisons are done against the same field...

@talevy
Copy link
Contributor Author

talevy commented Nov 19, 2015

since it is starting to grow in complexity, I thought it would be a good thought experiment to see what
a version of conditionals with nesting would look like:

"field_conditions": {
  "or" : [
      {
        "and" : [
          { "teenager" : { "eq" : true } },
          { "age" : { "gte" : 13 } },
          { "age" : { "lte" : 19 } }
        ]
      },
      {
        "and" : [
          { "teenager" : { "eq" : false } },
          { "not" : { "and" : [
                          { "age" : {"gte" : 13} },
                          { "age" : {"lte" : 19} }
                    ] }
          }
        ]
      }
  ]  
}

hmmm... looks like too much for me.

what do you think?

@martijnvg
Copy link
Member

This looks complex :) My idea is that the conditions don't support nesting. (all conditions need to be on the same level, so there is a list of 'and' conditions and a list of 'or' conditions) Nesting could be achieved via adding another condition processor in the match or otherwise part of the condition.

@talevy
Copy link
Contributor Author

talevy commented Nov 19, 2015

so, I think I have implemented most of what you meant. I'll go ahead and make a PR for it now so that it is easier to view the branch

@clintongormley
Copy link
Contributor

How about this:

{
  "if": {
    "all": [
      {"message" : { "==": "Hello"}},
      { "txt": { "in": ["one", "ONE"]}},
      { "date": { "lt": "2015", "gt": "2010"}},
      { "any": [
        {"status": { "==": "featured"}},
        { "adult": { "!=": false}}
      ]}
    ]
  },
  "then": [...],
  "else": [...]
}

So the if processor takes exactly one of all| any|none, plus then and optionally else

@clintongormley
Copy link
Contributor

oh i mixed == and lt etc... could settle on symbol or word operators, don't mind

@talevy
Copy link
Contributor Author

talevy commented Nov 20, 2015

thanks @clintongormley!

Offline, we were possibly discussing a simpler processor that would only operate on a single field.

here is what that looks like: https://github.com/talevy/elasticsearch/blob/if_conditions_else_processor/docs/plugins/ingest.asciidoc#field-conditional-processor

...
{ "txt": { "in": ["one", "ONE"]}}
...

The statement above demonstrates something that is similar, but still lacking in the way this processor
handles operands. It is quite common for people do have the following conditionals in their
logstash configurations:

if ("cdn_log" in [tags]) {
   ...
}

Since the subject here is not a field, and instead it is a simple string value, this relationship will probably need to be supported. Since we do not yet have string templates (mustache templates) for field referencing, I think only supporting field references instead of raw string values is ok for now.

does that make sense?

@uboness
Copy link
Contributor

uboness commented Nov 23, 2015

why are we trying to write a scripting language in json? this all looks off to me. We should probably just use real scripts for conditions... otherwise, there's no end to this.

@uboness
Copy link
Contributor

uboness commented Nov 27, 2015

another option is to have the following structure:

{
  "condition" : {
    "if" : {
      // a conditional structure
    },
    "then" : [
      // a list of processor that will run if the condition is not met
    ],
    "else" : [
      // a list of processor that will run if the condition is not met
    ]
  }
}

where there can be different conditional structures:

{
  "if" : {
    "script" : "_doc.key == 2"
  }
}

or

{
  "if" : {
    "compare_fields" : {
      { "eq" : { "field" : "key", "value" : "2"} }
    }
  }
}

@byronvoorbach
Copy link
Contributor

byronvoorbach commented May 31, 2016

From reading this issue, I understand that there are multiple ways of implementing such a feature. Are conditionals still on the roadmap, or have you dropped this for now?

@talevy
Copy link
Contributor Author

talevy commented May 31, 2016

@byronvoorbach. Conditionals have been pushed off the roadmap for now. The plan is to assess its use-cases after the initial release of the Ingest API.

For now, any error-related branching can be achieved via the on_failure option

@eskibars
Copy link
Contributor

I still agree with @uboness on

why are we trying to write a scripting language in json? this all looks off to me. We should probably just use real scripts for conditions... otherwise, there's no end to this.

We have a real scripting language in Elasticsearch that we've now embedded in the Ingest Node processors. If we do this, I think all the manual gt, lt, gte, etc operators would create more pain than just allowing a painless script that just returns true/false

@talevy
Copy link
Contributor Author

talevy commented Jul 20, 2016

@eskibars indeed, now that we have the flexibility of a real language and the boolean operators that come with it... there is no need for this.

in the future, if someone wants to make a designated bool processor that executes another processor if the bool_script evaluates to true, then that is easily done with scripts.

Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement
Projects
None yet
Development

No branches or pull requests

7 participants