
K8s filelog collection: put the log record in the Body as a string #2851

Closed
rockb1017 opened this issue Mar 25, 2021 · 12 comments

@rockb1017 (Contributor)

Describe the bug
#2266 (comment)
The log record is wrapped in a JSON object with the key "log"; it should be just a plain string.

Steps to reproduce
Read container log files using the filelog receiver.
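
A minimal repro config (a sketch; the include path assumes the standard k8s container log location, and the full k8s example adds the parsing operators discussed below):

receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log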

What did you expect to see?
"EPS: 98"
What did you see instead?
{"log":"EPS: 98\n"}

What version did you use?
0.22.0

@rockb1017 rockb1017 added the bug Something isn't working label Mar 25, 2021
@tigrannajaryan tigrannajaryan added this to the Basic Logs Support milestone Mar 25, 2021
@tigrannajaryan tigrannajaryan changed the title FileLog - put the log record in the Body as a string K8s filelog collection: put the log record in the Body as a string Mar 25, 2021
@tigrannajaryan (Member)

@pmm-sumo @djaglowski this is the bug I mentioned when we were looking at k8s collection. Can one of you take this?

@djaglowski (Member)

@sumo-drosiek would you be able to validate open-telemetry/opentelemetry-collector#2873?

@pmm-sumo (Contributor)

@djaglowski what do you think: should we be fine with just using open-telemetry/opentelemetry-collector#2873, or would it be better to make a deeper change in the source code?

@djaglowski (Member) commented Mar 25, 2021

@pmm-sumo There are probably some things we can do to improve the usability of the parsing operations, but I think you may be referring to whether we should need the "log" field in the first place? There is a specific reason it exists in this case: other values need to be parsed, and those values temporarily exist alongside "log" until they are parsed.

Detailed Version

Note that this regex operation is capturing multiple values, "time", "stream", and "log" (and also "logtag", but this doesn't appear to be used, so I am omitting from this example):

- type: regex_parser
  id: parser-containerd
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
  ...

At this point, the entry looks something like this (showing only the parts relevant to this example):

{
  "record": {  
    "time": "...",
    "stream": "...",
    "log": "The rest of the log line",
  },
  "attributes": { },
  "timestamp": "default,
  ...
}

Then some additional processing is done based on the other values:

...
  timestamp:
    parse_from: time # parses "time" to timestamp
    ...
- type: metadata
  attributes:
    stream: 'EXPR($.stream)' # moves "stream" to attributes

Now we have:

{
  "record": {  
    "log": "The rest of the log line",
  },
  "attributes": { 
    "stream": "stderr",
  },
  "timestamp": "2021-03-25T...",
  ...
}

Now we can clean up the record (this ops fragment lives inside a restructure operator):

- type: restructure
  ops:
    - move:
        from: log
        to: $

This leaves us with:

{
  "record": "The rest of the log line",
  "attributes": { 
    "stream": "stderr",
  },
  "timestamp": "2021-03-25T...",
  ...
}

There are a few minor things that I would like to improve here though, such as this and this, but I'm not sure what specifically we should do to improve this parsing pipeline.

@tigrannajaryan (Member)

@djaglowski a slightly different approach that could possibly work is for the regex operator to extract values directly into "attributes", and also to have an option to specify which extracted value becomes the new value of the "record" (or "Body" in OTel terminology). For example:

- type: regex_parser
  id: parser-containerd
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
  set_body_from: log # this refers to an extracted field name in the line above

Now once this is applied the data will look like this:

{
  "record": "The rest of the log line",
  "attributes": {
    "time": "...",
    "stream": "...",
  },
  "timestamp": "default,
  ...
}

I don't know how feasible this is, just an idea.

@djaglowski (Member) commented Mar 26, 2021

I think when you consider a wider set of log parsing cases, you end up needing the ability to parse into an arbitrary object that you can then manipulate as needed. You might need to parse further. You might keep or discard any field at all.

This isn't really a regex-specific problem either; it's a general question of how parsing operations should behave. To me, the primary action you are taking with any parsing operation is to give structure to the log entry. Once the values are individually accessible, you can parse them further or move them into the appropriate special fields (resource, attributes, timestamp, severity).

All that said, there are many scenarios such as this that are much more constrained. Taking inspiration from @tigrannajaryan's suggestion, I think there is potentially a way in which we can support cases such as this one, while still allowing arbitrary field manipulations if needed.

We already support a concept of "nested parsers" for timestamp and severity. These are just subsequent operations that can be specified within any parser:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp: # this is a timestamp_parser embedded within the regex_parser
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity: # this is a severity_parser embedded within the regex_parser
    parse_from: sev

What is really happening here is essentially the same thing as if you specified:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
- type: timestamp_parser # separate operator
  parse_from: time
  layout: '%Y-%m-%dT%H:%M:%S.%LZ'
- type: severity_parser # separate operator
  parse_from: sev

So working with the same idea in mind, would it be helpful to embed additional operators into all parsers?

The metadata operator (possibly with some modifications) could also be embedded in the same way. This would allow roughly:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid

Tigran's suggested set_body_from: log would fit the pattern just as well. Maybe this is a special field on parsers, or maybe it's just a simplified version of restructure.move, but this would allow roughly:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid
  record: log

If instead, log needed to be parsed further, a user could do:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid
- type: regex_parser
  id: more_parsing
  parse_from: log
  regex: '^(?P<some_id>[^ ^Z]+Z) (?P<the_rest>.*)$'
  resource:
    - some_id
  record: the_rest

There are some nuances here as well, such as how ordering is applied, but these seem manageable.

Thoughts on this?

@pmm-sumo (Contributor)

I concur with @tigrannajaryan's note on using attributes. Looking at the data model, I feel we might have too many nesting levels right now. My understanding is that this is an outcome of mapping Stanza's original model onto the OpenTelemetry model, but I think it could evolve to be more in line with the specification.

There is the technical part which, as described above, requires a common framework for making all the transformations, which are based on transforming fields present in the body. I think that for many use cases it would be more natural to deal with attributes (or even resource) instead, rather than keeping additional fields in the body.

My understanding (which might be overly simplified) is that we are in principle dealing with three types of log sources:

  • Free text - probably the most common case; this is where we extract timestamp, severity, stream, etc. using regexes, as in the examples above
  • Structured (e.g. JSON) - this is when the body contains a sort of map on the input; if transformations are applied, then only to specific fields/attributes of it
  • Structure wrapped in text - a mix of both, e.g. 2021-03-29 09:10:15 Something something {"severity":"info", "foo":"bar"} - to some extent Zap output can be considered here (see the sketch after this list)
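
A rough sketch of handling that third case, assuming a regex_parser that splits off the JSON fragment and a json_parser that consumes it (the regex and field names are illustrative, not from any shipped config):

- type: regex_parser
  regex: '^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<msg>[^{]*)(?P<json>\{.*\})$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%d %H:%M:%S'
- type: json_parser # parses the captured JSON fragment into structured fields
  parse_from: json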

I think that in principle, only the Structured case has strong arguments for keeping the map in the body (as this was the original model). Even then, the distinction between the body's attributes and attributes is not clear.

Additionally, I think there are fields that should always be moved to attributes (or even resource) for consistency reasons (such as severity).

I think I understand the reasons behind it, but I also believe that eventually the operators should be dealing with attributes rather than body content.

@djaglowski (Member)

@pmm-sumo You make some really good points here. Differentiating between the structured and other types of log sources is very useful. Following what you've laid out, I think there are some additional design points to clarify:

  • opentelemetry-log-collection needs to be enhanced such that resource and attributes support arbitrary data. Both are currently map[string]string. In the current form, any parser that emits a flat set of fields (e.g. regex, csv, syslog) could parse to these. I believe this already covers most of what we would want, so perhaps adding depth to resource and attributes is not urgent.
  • json parser would parse to body.
  • We have a uri parser which makes use of depth, so this would also need to parse to body until depth is added to resource/attributes.
  • Since at least one parser (json) would parse to a different destination than the others, we need to be very clear about this - probably just a docs consideration.
  • Regarding whether we should allow parsing directly to resource, as opposed to attributes: all parsers currently support a parse_to field. I think we would choose a default of attributes, and users could override it by setting it to resource (see the sketch after this list).
  • I'm not clear on what it would mean to move severity to attributes. severity is a top-level field in the spec, so I think the entry structure reflects that appropriately. Can you clarify this point?
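
For illustration, the parse_to override might look roughly like this (hypothetical: resource as a destination is the proposal here, not current behavior, and the regex is illustrative):

- type: regex_parser
  regex: '^(?P<uuid>[^ ]+) (?P<log>.*)$'
  parse_to: resource # proposed override of the attributes default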

@pmm-sumo (Contributor)

Thank you @djaglowski

One quick note:

(...) resource and attributes support arbitrary data. Both are currently map[string]string

Technically we can use a collection of key-value pairs, where the values are of the AnyValue type.

  • I'm not clear on what it would mean to move severity to attributes. severity is a top-level field in the spec, so I think the entry structure reflects that appropriately. Can you clarify this point?

You are right, I forgot it became part of the model. I was trying to illustrate that there are some attributes that should be present at the attributes level. A better example could be e.g. http.user_agent (which could be extracted from an Apache log, as in the sketch below).
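
A rough sketch of that extraction, reusing the embedded attributes syntax proposed above (the regex is a simplified Apache combined-log pattern; Go regex group names cannot contain dots, so user_agent would still need a rename to http.user_agent downstream):

- type: regex_parser
  regex: '^(?P<client>[^ ]+) [^ ]+ [^ ]+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d+) (?P<bytes>[^ ]+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
  timestamp:
    parse_from: time
    layout: '%d/%b/%Y:%H:%M:%S %z' # Apache timestamp, e.g. 10/Oct/2000:13:55:36 -0700
  attributes:
    - user_agent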

@djaglowski (Member) commented Mar 31, 2021

Technically we can use a collection of key values, where the latter are of AnyValue type

I think we may be talking about OTel vs "stanza" Entry format. My note on this point was basically just acknowledging that we need to update the old stanza format to support AnyValue.

@gillg (Contributor) commented Apr 6, 2021

Related discussion here: #14718
