
json_query not parsing through all entries in the JSON array #6138

Closed
harnitsignalfx opened this issue Jul 18, 2019 · 5 comments · Fixed by #6289
Labels: area/tail, bug (unexpected problem or unintended behavior)
Milestone: 1.11.3

@harnitsignalfx

Hi, I think there is a bug in how JSON array entries are being parsed by JSONQuery. Would really appreciate guidance on whether I've made a configuration error or this is indeed reproducible by other folks.

Relevant telegraf.conf:

[[inputs.tail]]
  files = ["/test.log"]
  watch_method = "poll"
  data_format = "json"
  name_override = "test-key"
  tag_keys = ["first"]
  json_query = "friends"

System info:

Telegraf v1.11.2
OS: Debian 9

Steps to reproduce:

  1. Follow the example for JSON Query:
     https://github.com/influxdata/telegraf/tree/master/plugins/parsers/json#query
     A sample entry in the file would be:

{"friends": [{"first": "Dale", "last": "Murphy", "age": 44},{"first": "Roger", "last": "Craig", "age": 68},{"first": "Jane", "last": "Murphy", "age": 47}]}

Expected behavior:

All entries in the array are parsed (very similar to the posted example):

{"fields":{"age":44},"name":"test-key","tags":{"first":"Dale","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}
{"fields":{"age":68},"name":"test-key","tags":{"first":"Roger","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}
{"fields":{"age":47},"name":"test-key","tags":{"first":"Jane","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}
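For reference, a minimal stdlib sketch of the fan-out the query should produce. Telegraf's json parser actually uses gjson for json_query; queryFriends here is a hypothetical stand-in that only illustrates why three metrics are expected from one line.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryFriends decodes one log line and returns the objects selected by a
// top-level key, mimicking what json_query = "friends" should yield.
// Illustrative only; the real parser resolves the query with gjson.
func queryFriends(line, key string) ([]map[string]interface{}, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(line), &doc); err != nil {
		return nil, err
	}
	raw, ok := doc[key].([]interface{})
	if !ok {
		return nil, fmt.Errorf("key %q is not an array", key)
	}
	out := make([]map[string]interface{}, 0, len(raw))
	for _, e := range raw {
		if obj, ok := e.(map[string]interface{}); ok {
			out = append(out, obj)
		}
	}
	return out, nil
}

func main() {
	line := `{"friends": [{"first": "Dale", "last": "Murphy", "age": 44},{"first": "Roger", "last": "Craig", "age": 68},{"first": "Jane", "last": "Murphy", "age": 47}]}`
	entries, err := queryFriends(line, "friends")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(entries), "metrics expected") // 3 metrics expected
}
```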

Actual behavior:

Only the first array entry is being parsed.
{"fields":{"age":44},"name":"test-key","tags":{"first":"Dale","host":"test-us1-instance","path":"/test.log"},"timestamp":1563466236}

@danielnelson danielnelson self-assigned this Jul 18, 2019
@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Jul 18, 2019
@danielnelson danielnelson added this to the 1.11.3 milestone Jul 18, 2019
@danielnelson (Contributor) commented Jul 18, 2019

This is a bug in the tail plugin caused by some special logic for the csv parser when reading the first line of the file. We'll put together a fix for the next release.

@danielnelson danielnelson removed their assignment Jul 18, 2019
@harnitsignalfx (Author)

Appreciate your looking into it. Thanks Daniel!

@GeorgeMac GeorgeMac self-assigned this Jul 19, 2019
@GeorgeMac (Contributor)

@danielnelson I've been playing about with tests on this one and have come to realise that there is an interface design discrepancy between the tail plugin, the parser interface, and the json implementation.

The parser interface assumes one metric per line:

// Parser is an interface defining functions that a parser plugin must satisfy.
type Parser interface {
	// Parse takes a byte buffer separated by newlines
	// ie, `cpu.usage.idle 90\ncpu.usage.busy 10`
	// and parses it into telegraf metrics
	//
	// Must be thread-safe.
	Parse(buf []byte) ([]telegraf.Metric, error)

	// ParseLine takes a single string metric
	// ie, "cpu.usage.idle 90"
	// and parses it into a telegraf metric.
	//
	// Must be thread-safe.
	ParseLine(line string) (telegraf.Metric, error)

	// SetDefaultTags tells the parser to add all of the given tags
	// to each parsed metric.
	// NOTE: do _not_ modify the map after you've passed it here!!
	SetDefaultTags(tags map[string]string)
}

Parse can take multiple lines and return multiple metrics, which could potentially be used to circumvent this design constraint. However, as it stands, the tail plugin parses the first line with Parse, keeps only the first returned metric, and uses ParseLine from there on out. So a single line like the one in the example above, which produces more than one metric, will always have the rest discarded.

if firstLine {
	metrics, err = parser.Parse([]byte(text))
	if err == nil {
		if len(metrics) == 0 {
			firstLine = false
			continue
		} else {
			m = metrics[0]
		}
	}
	firstLine = false
} else {
	m, err = parser.ParseLine(text)
}

This leads me to think that in the long run the interface may need to be redesigned to acknowledge that a single newline-delimited line can produce more than one metric.
However, in the near term we could perhaps fall back to using parser.Parse over parser.ParseLine and return all metrics?
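The near-term suggestion above can be sketched as a toy, with pared-down stand-ins for the telegraf types (Metric, Parser, fanOutParser, and tailAll are all hypothetical names, not the real plugin code): every line goes through Parse, and every returned metric is kept.

```go
package main

import (
	"fmt"
	"strings"
)

// Metric and Parser are minimal stand-ins for the telegraf types.
type Metric struct {
	Name   string
	Fields map[string]int
}

type Parser interface {
	Parse(buf []byte) ([]Metric, error)
}

// fanOutParser mimics the json parser with json_query pointing at an array:
// one input line can produce several metrics.
type fanOutParser struct{}

func (fanOutParser) Parse(buf []byte) ([]Metric, error) {
	var out []Metric
	for _, tok := range strings.Fields(string(buf)) {
		out = append(out, Metric{Name: tok, Fields: map[string]int{"n": len(tok)}})
	}
	return out, nil
}

// tailAll is the proposed loop shape: call Parse for every line and collect
// all returned metrics, instead of switching to ParseLine after line one.
func tailAll(p Parser, lines []string) []Metric {
	var all []Metric
	for _, line := range lines {
		ms, err := p.Parse([]byte(line))
		if err != nil {
			continue // a real plugin would report the error
		}
		all = append(all, ms...)
	}
	return all
}

func main() {
	got := tailAll(fanOutParser{}, []string{"dale roger jane"})
	fmt.Println(len(got)) // all three metrics survive, not just the first
}
```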

@danielnelson (Contributor)

Yes, I'm hoping to remove the ParseLine function from this interface. Since we want to support arbitrary binary formats we shouldn't have a line based method; the containing plugin should handle transport level framing and the parser should understand inter-metric level framing.

First though we need a solution for parser state. The tail plugin is using the ParseLine function to check headers only once per file, and it would be easy to remove this, but we also need to consider when the parser is used by other plugins such as the file plugin.

In the tail case we want to call Parse multiple times but treat the input as a single document; in the file case we want each call to Parse to be a new document. I think we need to switch to the SetParserFunc method in all plugins so that the plugin can decide when to reset by creating a new instance. Then we can push the state tracking (are we on the first line or not) into the parsers.
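The SetParserFunc idea can be sketched like this (all types and names here are illustrative stand-ins, not the telegraf API): the plugin holds a factory and creates a fresh parser, with clean per-document state, each time it opens a new document.

```go
package main

import "fmt"

// parser is a toy stateful parser: the first line it sees is treated as a
// header, which is exactly the kind of state that must not leak between files.
type parser struct {
	headerSeen bool
}

func (p *parser) Parse(line string) string {
	if !p.headerSeen {
		p.headerSeen = true
		return "header:" + line
	}
	return "row:" + line
}

// ParserFunc is the factory shape: the plugin calls it per document, so state
// always starts clean.
type ParserFunc func() *parser

func processFile(newParser ParserFunc, lines []string) []string {
	p := newParser() // fresh parser per document
	var out []string
	for _, l := range lines {
		out = append(out, p.Parse(l))
	}
	return out
}

func main() {
	factory := func() *parser { return &parser{} }
	fmt.Println(processFile(factory, []string{"a,b", "1,2"}))
	fmt.Println(processFile(factory, []string{"x,y", "3,4"})) // header detected again
}
```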

@GeorgeMac (Contributor) commented Jul 23, 2019

I get you. The tail plugin processes line by line, while the file plugin processes entire document contents in one go. However, both depend upon the parser interface.

The parser needs to be able to handle being called either line by line or with entire documents, and it must support a parser changing state when called in a line-by-line fashion (i.e. for csv, the first line means the header is parsed, and that then affects all the following metrics).

Currently, the file input plugin uses SetParser, which means a single parser is used for multiple documents, whereas tail uses SetParserFunc and instantiates a parser per document. We need the file plugin to do the same, so that it creates a parser per document (file) and a stateful parser isn't shared between different documents. Sharing would break in situations like the csv parser, where each document has different headings at the start of each file.

Once we have this we can move the acknowledgement of a header being parsed into the state of the csv parser, meaning the first call to Parse behaves differently to the subsequent calls.
Then the tail plugin can do away with calling Parse and then ParseLine, and it can collect all the returned metrics from repeated calls to Parse().
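The stateful csv behaviour described above might look roughly like this (statefulCSV is a hypothetical stand-in, not the real csv parser): the first call records the header and emits nothing, and subsequent calls interpret lines as rows against that header.

```go
package main

import (
	"fmt"
	"strings"
)

// statefulCSV sketches moving header tracking into the parser itself: the
// first call to Parse records the header, later calls map row values onto it.
type statefulCSV struct {
	header []string
}

func (p *statefulCSV) Parse(line string) map[string]string {
	cols := strings.Split(line, ",")
	if p.header == nil {
		p.header = cols // first call: remember the header, emit nothing
		return nil
	}
	row := make(map[string]string, len(cols))
	for i, c := range cols {
		if i < len(p.header) {
			row[p.header[i]] = c
		}
	}
	return row
}

func main() {
	p := &statefulCSV{}
	p.Parse("first,age")
	fmt.Println(p.Parse("Dale,44")) // map[age:44 first:Dale]
}
```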

Correct me if I am misunderstanding anything.
