Grok Processor does not support non-(a-zA-Z_) field characters for field names #21745

talevy · 2016-11-23T08:41:27Z

example Ingest pipeline that fails: @ field name.

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "grok" : {
          "field" : "foo",
          "patterns" : [
            "%{WORD:@}"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    }
  ]
}

exception:

{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "exception",
            "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [bar]",
            "header": {
              "processor_type": "grok"
            }
          }
        ],
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [bar]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [bar]",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Provided Grok expressions do not match field value: [bar]"
          }
        },
        "header": {
          "processor_type": "grok"
        }
      }
    }
  ]
}

The Grok Parser in Ingest requires that field names match a-zA-Z_, this should be expanded to support all unicode characters.

Must update the regex here to do so: https://github.com/talevy/elasticsearch/blob/82f7bfad98253e94305136df481cd1c7dc4e8ca8/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/Grok.java#L47-L47

might be a relevant Issue in Joni: jruby/joni#13

The text was updated successfully, but these errors were encountered:

talevy · 2016-11-23T09:19:09Z

Maybe something like this would do the job

"(?::(?<subname>[^%:{}]+))?" +

accept all non %, :, }, { characters

cc/ @ruflin

clintongormley · 2016-11-23T13:27:04Z

@talevy ALL unicode characters? Including vertical tab, non-breaking space, newline, combining characters? Maybe you want to use unicode properties instead, and include allowed punctuation?

talevy · 2016-11-23T13:29:24Z

@clintongormley maybe not all :) does Elasticsearch have a specific allowed set of field name character? ideally it should follow that as much as it can, right?

mainly used as an example to remind me what to change. I am still in the process of investigating how to properly declare these specific unicode character types within Joni

clintongormley · 2016-11-23T13:45:44Z

No we don't :( You can quite happily create a field named {. We really need to get back to defining allowed characters in named entities (#9059)

talevy · 2016-11-23T13:54:18Z

OK, then I guess we should allow that here as well.

looking at Joni, I think I found a solution. called the [:graph:] unicode character match. They have quite extensive tests here: https://github.com/jruby/jruby/blob/f7746d0be94f5d9f7cd27af610f1bc5dd2622794/spec/ruby/language/regexp/character_classes_spec.rb

and

  it "matches Unicode letter characters with [[:graph:]]" do
  it "matches Unicode digits with [[:graph:]]" do
  it "matches Unicode marks with [[:graph:]]" do
  it "matches Unicode punctuation characters with [[:graph:]]" do
  it "match Unicode format characters with [[:graph:]]" do
  it "match Unicode private-use characters with [[:graph:]]" do
  it "doesn't match Unicode control characters with [[:graph:]]" do

another that is similar is called [:word:] which matches the below:

  it "matches Unicode lowercase characters with [[:word:]]" do
  it "matches Unicode uppercase characters with [[:word:]]" do
  it "matches Unicode title-case characters with [[:word:]]" do
  it "matches Unicode decimal digits with [[:word:]]" do
  it "matches Unicode marks with [[:word:]]" do
  it "match Unicode Nl characters with [[:word:]]" do
  it "doesn't match Unicode No characters with [[:word:]]" do
  it "doesn't match Unicode punctuation characters with [[:word:]]" do
  it "doesn't match Unicode control characters with [[:word:]]" do
  it "doesn't match Unicode format characters with [[:word:]]" do
  it "doesn't match Unicode private-use characters with [[:word:]]" do

so the change could be:

"(?::(?<subname>[[[:graph:]]]+))?" +

running locally, and it looks pretty good, not sure we want to match all of the types of unicode characters as is described by the test handles above ^^

I'll start a PR and incorporate these different character tests on our end as well

elasticmachine · 2018-03-15T15:27:11Z

Pinging @elastic/es-core-infra

* master: Do not check for object existence when deleting repository index files (#31680) Remove extra check for object existence in repository-gcs read object (#31661) Support multiple system store types (#31650) [Test] Clean up some repository-s3 tests (#31601) [Docs] Use capital letters in section headings (#31678) [DOCS] Add PQL language Plugin (#31237) Merge AzureStorageService and AzureStorageServiceImpl and clean up tests (#31607) TEST: Fix test task invocation (#31657) Revert "[TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg" Fix RealmInteg test failures Extend allowed characters for grok field names (#21745) (#31653) [DOCS] Fix licensing API details (#31667) [TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg Fix CreateSnapshotRequestTests Failure (#31630) Configurable password hashing algorithm/cost (#31234) [TEST] Mute failing NamingConventionsTaskIT tests [DOCS] Replace CONFIG_DIR with ES_PATH_CONF (#31635) Core: Require all actions have a Task (#31627)

)

* 6.x: Fix not waiting for Netty ThreadDeathWatcher in IT (#31758) (#31789) [Docs] Correct default window_size (#31582) S3 fixture should report 404 on unknown bucket (#31782) [ML] Limit ML filter items to 10K (#31731) Fixture for Minio testing (#31688) [ML] Return statistics about forecasts as part of the jobsstats and usage API (#31647) [DOCS] Add missing get mappings docs to HLRC (#31765) [DOCS] Starting Elasticsearch (#31701) Fix coerce validation_method in GeoBoundingBoxQueryBuilder (#31747) Painless: Complete Removal of Painless Type (#31699) Consolidate watcher setting update registration (#31762) [DOCS] Adds empty 6.3.1 release notes page ingest: Introduction of a bytes processor (#31733) [test] don't run bats tests for suse boxes (#31749) Add analyze API to high-level rest client (#31577) Implemented XContent serialisation for GetIndexResponse (#31675) [DOCS] Typos DOC: Add examples to the SQL docs (#31633) Add support for AWS session tokens (#30414) Watcher: Reenable start/stop yaml tests (#31754) JDBC: Fix stackoverflow on getObject and timestamp conversion (#31735) Support multiple system store types (#31650) Add write*Blob option to replace existing blob (#31729) Split CircuitBreaker-related tests (#31659) Painless: Add Context Docs (#31190) Docs: Remove missing reference Migrate scripted metric aggregation scripts to ScriptContext design (#30111) Watcher: Fix chain input toXcontent serialization (#31721) Remove _all example (#31711) rest-high-level: added get cluster settings (#31706) Docs: Match the examples in the description (#31710) [Docs] Correct typos (#31720) Extend allowed characters for grok field names (#21745) (#31653) (#31722) [DOCS] Check for Windows and *nix file paths (#31648) [ML] Validate ML filter_id (#31535) Fix gradle4.8 deprecation warnings (#31654) Update numbers to reflect 4-byte UTF-8-encoded characters (#27083)

With this commit we implement a workaround in the mapping of the http_logs track when we use the grok processor. Due to elastic/elasticsearch#21745 the grok processor fails when a field name contains a non-alpha character. Therefore, we replace `@timestamp` with `timestamp` in the mapping and the grok pattern for these cases. Relates elastic/elasticsearch#21745

talevy added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >bug v5.0.1 v6.0.0-alpha1 labels Nov 23, 2016

talevy changed the title ~~Grok Processor does not support non-ascii field names in extraction~~ Grok Processor does not support non-(a-zA-Z_) field characters for field names Nov 23, 2016

clintongormley added v6.0.0 and removed v6.0.0-alpha1 labels May 3, 2017

lcawl added v6.0.1 and removed v6.0.0 labels Nov 13, 2017

lcawl added v6.0.2 and removed v6.0.1 labels Dec 6, 2017

jaymode added v6.0.3 and removed v6.0.2 labels Dec 13, 2017

talevy added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >bug v5.0.1 v6.0.3 labels Mar 15, 2018

talevy self-assigned this Mar 15, 2018

elasticmachine mentioned this issue Apr 24, 2018

[Grok Debugger] field names with special characters not possible elastic/kibana#18077

Closed

talevy assigned original-brownbear and unassigned talevy Jun 27, 2018

original-brownbear mentioned this issue Jun 28, 2018

Extend allowed characters for grok field names (#21745) #31653

Merged

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 29, 2018

Be more lenient with fieldnames in Grok Processor (elastic#21745)

5318e38

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 29, 2018

Be more lenient with fieldnames in Grok Processor (elastic#21745)

125933b

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 29, 2018

Be more lenient with fieldnames in Grok Processor (elastic#21745)

0ef98d6

original-brownbear closed this as completed in #31653 Jun 29, 2018

original-brownbear added a commit that referenced this issue Jun 29, 2018

Extend allowed characters for grok field names (#21745) (#31653)

b7b413e

original-brownbear added a commit that referenced this issue Jul 2, 2018

Extend allowed characters for grok field names (#21745) (#31653) (#31722

db3d4e6

)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grok Processor does not support non-(a-zA-Z_) field characters for field names #21745

Grok Processor does not support non-(a-zA-Z_) field characters for field names #21745

talevy commented Nov 23, 2016 •

edited

Loading

talevy commented Nov 23, 2016 •

edited

Loading

clintongormley commented Nov 23, 2016

talevy commented Nov 23, 2016 •

edited

Loading

clintongormley commented Nov 23, 2016

talevy commented Nov 23, 2016 •

edited

Loading

elasticmachine commented Mar 15, 2018

Grok Processor does not support non-(a-zA-Z_) field characters for field names #21745

Grok Processor does not support non-(a-zA-Z_) field characters for field names #21745

Comments

talevy commented Nov 23, 2016 • edited Loading

talevy commented Nov 23, 2016 • edited Loading

clintongormley commented Nov 23, 2016

talevy commented Nov 23, 2016 • edited Loading

clintongormley commented Nov 23, 2016

talevy commented Nov 23, 2016 • edited Loading

elasticmachine commented Mar 15, 2018

talevy commented Nov 23, 2016 •

edited

Loading

talevy commented Nov 23, 2016 •

edited

Loading

talevy commented Nov 23, 2016 •

edited

Loading

talevy commented Nov 23, 2016 •

edited

Loading