
Ingest Node's grok can't set the same field from two patterns #22117

Closed
tsg opened this issue Dec 12, 2016 · 9 comments · Fixed by #22131
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP discuss

Comments


tsg commented Dec 12, 2016

Elasticsearch version: 5.0.1

Plugins installed: ingest-node-geoip, ingest-node-ua

JVM version: 1.8

OS version: macOS sierra

Description of the problem including expected versus actual behavior:

See the following Ingest node simulate API call:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Pipeline for parsing MySQL slow logs.",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "%{DATA:mysql.error.timestamp} %{NUMBER:mysql.error.id} \\[%{DATA:mysql.error.level}\\] %{GREEDYDATA:mysql.error.message}",
            "%{LOCALDATETIME:mysql.error.timestamp} %{DATA:mysql.error.name} %{GREEDYDATA:mysql.error.message}"
          ],
          "ignore_missing": true,
          "pattern_definitions": {
            "LOCALDATETIME": "[0-9]+ %{TIME}"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "161209 13:08:33 mysqld_safe Starting mysqld daemon with databases from /usr/local/var/mysql"
      }
    }
  ]
}

There are two Grok patterns, and the provided doc should match the second one. The match itself works fine, but the mysql.error.message field is not created. If I rename the field to mysql.error.message1 in either of the two Grok patterns, it works.

A workaround I found is to define another grok pattern definition for GREEDYDATA, like this:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Pipeline for parsing MySQL slow logs.",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "%{DATA:mysql.error.timestamp} %{NUMBER:mysql.error.id} \\[%{DATA:mysql.error.level}\\] %{GREEDYDATA:mysql.error.message}",
            "%{LOCALDATETIME:mysql.error.timestamp} %{DATA:mysql.error.name} %{GREEDYDATA1:mysql.error.message}"
          ],
          "ignore_missing": true,
          "pattern_definitions": {
            "LOCALDATETIME": "[0-9]+ %{TIME}",
            "GREEDYDATA1": ".*"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "161209 13:08:33 mysqld_safe Starting mysqld daemon with databases from /usr/local/var/mysql"
      }
    }
  ]
}
@clintongormley clintongormley added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP discuss labels Dec 12, 2016
@clintongormley (Contributor)

@talevy any thoughts?


talevy commented Dec 12, 2016

I'll take a look at this


talevy commented Dec 13, 2016

Did some digging; this was definitely a bug on my end. I opened the PR above and tested it with your example in Console, and it now picks up the field.

[Screenshot: the example pipeline run in Console, dated 2016-12-12]

talevy added a commit to talevy/elasticsearch that referenced this issue Dec 13, 2016
Grok was originally ignoring potential matches when the same named-capture
group appeared more than once. For example, if you had two patterns
containing the same named field but only the second pattern matched, Grok
would fail to pick it up.

This PR fixes the problem by exploring all potential places where a
named capture was used and choosing the first one that matched.

Fixes elastic#22117.
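The behavior described in the commit message can be sketched in a few lines. This is a hypothetical Python illustration, not the actual Elasticsearch source: the group-name suffixes (`msg_0`, `msg_1`) and the `extract_field` helper are invented for the example, assuming the multiple patterns are combined into one regex by alternation.

```python
import re

# Sketch of the fix: when several grok patterns bind the same logical
# field, the combined regex ends up with several capture groups for it.
# Scan every candidate group and keep the first one that actually
# matched, instead of only ever looking at the first group.
def extract_field(candidate_groups, match):
    for name in candidate_groups:
        value = match.group(name)
        if value is not None:
            return value
    return None

# Two alternatives both bind "msg"; only the second alternative matches
# here, so the buggy behavior (checking msg_0 only) would return nothing.
combined = re.compile(r"(?:ERR (?P<msg_0>\w+))|(?:WARN (?P<msg_1>\w+))")
m = combined.match("WARN disk")
print(extract_field(["msg_0", "msg_1"], m))  # -> disk
```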
talevy added a commit that referenced this issue Dec 13, 2016
…2131)

Grok was originally ignoring potential matches when the same named-capture
group appeared more than once. For example, if you had two patterns
containing the same named field but only the second pattern matched, Grok
would fail to pick it up.

This PR fixes the problem by exploring all potential places where a
named capture was used and choosing the first one that matched.

Fixes #22117.

y0299 commented Jan 17, 2018

I want to use the grok processor in Elasticsearch to parse messages that contain double quotes, like this: "message":"192.168.1.2 "GET" 168".
I use:

curl -XPOST 'localhost:9200/_ingest/pipeline/_simulate?pretty' -H 'Content-Type: application/json' -d'
{
  "pipeline": {
  "description" : "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:ip}\\s"%{REQUEST:request}"\\s%{NUM:num}"],
		"pattern_definitions" : {
          "IP" : "\\S+",
		  "REQUEST" : "\\S+",
		  "NUM" : "\\d+"
        }
      }
    }
  ]
},
"docs":[
  {
    "_source": {
      "message": "192.168.1.2 "GET" 168"
    }
  }
  ]
}
'

but I get the error below:

"error" : {
    "root_cause" : [
      {
        "type" : "parse_exception",
        "reason" : "Failed to parse content to map"
      }
    ],
    "type" : "parse_exception",
    "reason" : "Failed to parse content to map",
    "caused_by" : {
      "type" : "json_parse_exception",
      "reason" : "Unexpected character ('%' (code 37)): was expecting comma to separate Array entries\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@290fc36d; line: 9, column: 36]"
    }
  },
  "status" : 400
}

I think the reason is that the " (double quote) character is not supported in grok processor patterns. Can someone help me resolve this issue?


talevy commented Jan 17, 2018

Hi @y0299!

Indeed, you need to escape the " characters in the JSON.

the offending lines are

        "patterns": ["%{IP:ip}\\s"%{REQUEST:request}"\\s%{NUM:num}"],

and

      "message": "192.168.1.2 "GET" 168"

In both, each " should be escaped with a preceding \.

like so:

        "patterns": ["%{IP:ip}\\s\"%{REQUEST:request}\"\\s%{NUM:num}"],

and

      "message": "192.168.1.2 \"GET\" 168"

Hope that helps!

In the future, it is best to ask for this type of help on the Elastic Discuss forum, since you may find your question has already been answered there.
GitHub is still the right place for reporting bugs you find in our features.


y0299 commented Jan 17, 2018

Hi @talevy, thanks for the advice!
But when I use

curl -XPOST 'localhost:9200/_ingest/pipeline/_simulate?pretty' -H 'Content-Type: application/json' -d'
{
  "pipeline": {
  "description" : "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:ip}\\s\"%{REQUEST:request}\"\\s%{NUM:num}"],
		"pattern_definitions" : {
          "IP" : "\\S+",
		  "REQUEST" : "\\S+",
		  "NUM" : "\\d+"
        }
      }
    }
  ]
},
"docs":[
  {
    "_source": {
      "message": "192.168.1.2 "GET" 168"
    }
  }
  ]
}
'

an error still occurs, as shown below:

"error" : {
    "root_cause" : [
      {
        "type" : "parse_exception",
        "reason" : "Failed to parse content to map"
      }
    ],
    "type" : "parse_exception",
    "reason" : "Failed to parse content to map",
    "caused_by" : {
      "type" : "json_parse_exception",
      "reason" : "Unexpected character ('G' (code 71)): was expecting comma to separate Object entries\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6bcebab4; line: 22, column: 33]"
    }
  },
  "status" : 400
}

Do I have to change the log to 192.168.1.2 \"GET\" 168?


talevy commented Jan 17, 2018

You forgot to escape the offending " in the message.

this should work:

curl -XPOST "http://localhost:9200/_ingest/pipeline/_simulate?pretty" -H 'Content-Type: application/json' -d'
{
  "pipeline": {
  "description" : "parse multiple patterns",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:ip}\\s\"%{REQUEST:request}\"\\s%{NUM:num}"],
		"pattern_definitions" : {
          "IP" : "\\S+",
		  "REQUEST" : "\\S+",
		  "NUM" : "\\d+"
        }
      }
    }
  ]
},
"docs":[
  {
    "_source": {
      "message": "192.168.1.2 \"GET\" 168"
    }
  }
  ]
}'
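One way to sidestep the manual escaping entirely is to build the request body programmatically, since a JSON serializer inserts the backslashes for embedded quotes automatically. A minimal Python sketch (the body is abbreviated to the docs section; the pipeline part would be built the same way):

```python
import json

# The log line exactly as it appears on disk, with literal double quotes.
raw_log = '192.168.1.2 "GET" 168'

# json.dumps escapes the embedded " characters for us, producing a valid
# request body without any hand-written backslashes.
body = json.dumps({
    "docs": [{"_source": {"message": raw_log}}]
})
print(body)  # the inner " characters come out as \"
```

Note that the log file itself never changes: the backslashes exist only inside the JSON request body and disappear again when Elasticsearch parses it, so the grok patterns still see a plain ".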


y0299 commented Jan 17, 2018

This way works, but I must change the original log 192.168.1.2 "GET" 168 to 192.168.1.2 \"GET\" 168. In reality, some logs contain " without a backslash; how can I process those?


talevy commented Jan 18, 2018

@y0299 I'm not sure I completely understand
