Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grok Processor does not support non-(a-zA-Z_) field characters for field names #21745

Closed
talevy opened this issue Nov 23, 2016 · 6 comments
Closed
Assignees
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP

Comments

@talevy
Copy link
Contributor

talevy commented Nov 23, 2016

example Ingest pipeline that fails: @ field name.

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "grok" : {
          "field" : "foo",
          "patterns" : [
            "%{WORD:@}"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    }
  ]
}

exception:

{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "exception",
            "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [bar]",
            "header": {
              "processor_type": "grok"
            }
          }
        ],
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [bar]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [bar]",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Provided Grok expressions do not match field value: [bar]"
          }
        },
        "header": {
          "processor_type": "grok"
        }
      }
    }
  ]
}

The Grok Parser in Ingest requires that field names match a-zA-Z_, this should be expanded to support all unicode characters.

Must update the regex here to do so: https://github.com/talevy/elasticsearch/blob/82f7bfad98253e94305136df481cd1c7dc4e8ca8/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/Grok.java#L47-L47

might be a relevant Issue in Joni: jruby/joni#13

@talevy talevy added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >bug v5.0.1 v6.0.0-alpha1 labels Nov 23, 2016
@talevy
Copy link
Contributor Author

talevy commented Nov 23, 2016

Maybe something like this would do the job

"(?::(?<subname>[^%:{}]+))?" +

accept all non %, :, }, { characters

cc/ @ruflin

@clintongormley
Copy link
Contributor

@talevy ALL unicode characters? Including vertical tab, non-breaking space, newline, combining characters? Maybe you want to use unicode properties instead, and include allowed punctuation?

@talevy
Copy link
Contributor Author

talevy commented Nov 23, 2016

@clintongormley maybe not all :) does Elasticsearch have a specific allowed set of field name character? ideally it should follow that as much as it can, right?

mainly used as an example to remind me what to change. I am still in the process of investigating how to properly declare these specific unicode character types within Joni

@talevy talevy changed the title Grok Processor does not support non-ascii field names in extraction Grok Processor does not support non-(a-zA-Z_) field characters for field names Nov 23, 2016
@clintongormley
Copy link
Contributor

No we don't :( You can quite happily create a field named {. We really need to get back to defining allowed characters in named entities (#9059)

@talevy
Copy link
Contributor Author

talevy commented Nov 23, 2016

OK, then I guess we should allow that here as well.

looking at Joni, I think I found a solution. called the [:graph:] unicode character match. They have quite extensive tests here: https://github.com/jruby/jruby/blob/f7746d0be94f5d9f7cd27af610f1bc5dd2622794/spec/ruby/language/regexp/character_classes_spec.rb

and

  it "matches Unicode letter characters with [[:graph:]]" do
  it "matches Unicode digits with [[:graph:]]" do
  it "matches Unicode marks with [[:graph:]]" do
  it "matches Unicode punctuation characters with [[:graph:]]" do
  it "match Unicode format characters with [[:graph:]]" do
  it "match Unicode private-use characters with [[:graph:]]" do
  it "doesn't match Unicode control characters with [[:graph:]]" do

another that is similar is called [:word:] which matches the below:

  it "matches Unicode lowercase characters with [[:word:]]" do
  it "matches Unicode uppercase characters with [[:word:]]" do
  it "matches Unicode title-case characters with [[:word:]]" do
  it "matches Unicode decimal digits with [[:word:]]" do
  it "matches Unicode marks with [[:word:]]" do
  it "match Unicode Nl characters with [[:word:]]" do
  it "doesn't match Unicode No characters with [[:word:]]" do
  it "doesn't match Unicode punctuation characters with [[:word:]]" do
  it "doesn't match Unicode control characters with [[:word:]]" do
  it "doesn't match Unicode format characters with [[:word:]]" do
  it "doesn't match Unicode private-use characters with [[:word:]]" do

so the change could be:

"(?::(?<subname>[[[:graph:]]]+))?" +

running locally, and it looks pretty good, not sure we want to match all of the types of unicode characters as is described by the test handles above ^^

I'll start a PR and incorporate these different character tests on our end as well

@lcawl lcawl added v6.0.1 and removed v6.0.0 labels Nov 13, 2017
@lcawl lcawl added v6.0.2 and removed v6.0.1 labels Dec 6, 2017
@jaymode jaymode added v6.0.3 and removed v6.0.2 labels Dec 13, 2017
@talevy talevy added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >bug v5.0.1 v6.0.3 labels Mar 15, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 29, 2018
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 29, 2018
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 29, 2018
dnhatn added a commit that referenced this issue Jun 29, 2018
* master:
  Do not check for object existence when deleting repository index files (#31680)
  Remove extra check for object existence in repository-gcs read object (#31661)
  Support multiple system store types (#31650)
  [Test] Clean up some repository-s3 tests (#31601)
  [Docs] Use capital letters in section headings (#31678)
  [DOCS] Add PQL language Plugin (#31237)
  Merge AzureStorageService and AzureStorageServiceImpl and clean up tests (#31607)
  TEST: Fix test task invocation (#31657)
  Revert "[TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg"
  Fix RealmInteg test failures
  Extend allowed characters for grok field names (#21745) (#31653)
  [DOCS] Fix licensing API details (#31667)
  [TEST] Mute failing tests in NativeRealmInteg and ReservedRealmInteg
  Fix CreateSnapshotRequestTests Failure (#31630)
  Configurable password hashing algorithm/cost (#31234)
  [TEST] Mute failing NamingConventionsTaskIT tests
  [DOCS] Replace CONFIG_DIR with ES_PATH_CONF (#31635)
  Core: Require all actions have a Task (#31627)
dnhatn added a commit that referenced this issue Jul 4, 2018
* 6.x:
  Fix not waiting for Netty ThreadDeathWatcher in IT (#31758) (#31789)
  [Docs] Correct default window_size (#31582)
  S3 fixture should report 404 on unknown bucket (#31782)
  [ML] Limit ML filter items to 10K (#31731)
  Fixture for Minio testing (#31688)
  [ML] Return statistics about forecasts as part of the jobsstats and usage API (#31647)
  [DOCS] Add missing get mappings docs to HLRC (#31765)
  [DOCS] Starting Elasticsearch (#31701)
  Fix coerce validation_method in GeoBoundingBoxQueryBuilder (#31747)
  Painless: Complete Removal of Painless Type (#31699)
  Consolidate watcher setting update registration (#31762)
  [DOCS] Adds empty 6.3.1 release notes page
  ingest: Introduction of a bytes processor (#31733)
  [test] don't run bats tests for suse boxes (#31749)
  Add analyze API to high-level rest client (#31577)
  Implemented XContent serialisation for GetIndexResponse (#31675)
  [DOCS] Typos
  DOC: Add examples to the SQL docs (#31633)
  Add support for AWS session tokens (#30414)
  Watcher: Reenable start/stop yaml tests (#31754)
  JDBC: Fix stackoverflow on getObject and timestamp conversion (#31735)
  Support multiple system store types (#31650)
  Add write*Blob option to replace existing blob (#31729)
  Split CircuitBreaker-related tests (#31659)
  Painless: Add Context Docs (#31190)
  Docs: Remove missing reference
  Migrate scripted metric aggregation scripts to ScriptContext design (#30111)
  Watcher: Fix chain input toXcontent serialization (#31721)
  Remove _all example (#31711)
  rest-high-level: added get cluster settings (#31706)
  Docs: Match the examples in the description (#31710)
  [Docs] Correct typos (#31720)
  Extend allowed characters for grok field names (#21745) (#31653) (#31722)
  [DOCS] Check for Windows and *nix file paths (#31648)
  [ML] Validate ML filter_id (#31535)
  Fix gradle4.8 deprecation warnings (#31654)
  Update numbers to reflect 4-byte UTF-8-encoded characters (#27083)
danielmitterdorfer added a commit to elastic/rally-tracks that referenced this issue Aug 27, 2018
With this commit we implement a workaround in the mapping of the
http_logs track when we use the grok processor. Due to
elastic/elasticsearch#21745 the grok processor
fails when a field name contains a non-alpha character. Therefore, we
replace `@timestamp` with `timestamp` in the mapping and the grok
pattern for these cases.

Relates elastic/elasticsearch#21745
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP
Projects
None yet
Development

No branches or pull requests

6 participants