
K8s filelog collection: put the log record in the Body as a string #2851

Closed
rockb1017 opened this issue Mar 25, 2021 · 12 comments

@rockb1017 (Contributor)

Describe the bug
#2266 (comment)
The log record is wrapped in a JSON object with the key "log"; it should be just a plain string.

Steps to reproduce
Read container log files using the filelog receiver.
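
A minimal repro config (a sketch; the include path assumes the standard k8s container log location, and the full k8s example adds the parsing operators discussed below):

receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log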

What did you expect to see?
"EPS: 98"
What did you see instead?
{"log":"EPS: 98\n"}

What version did you use?
0.22.0

@rockb1017 rockb1017 added the bug Something isn't working label Mar 25, 2021
@tigrannajaryan tigrannajaryan added this to the Basic Logs Support milestone Mar 25, 2021
@tigrannajaryan tigrannajaryan changed the title FileLog - put the log record in the Body as a string K8s filelog collection: put the log record in the Body as a string Mar 25, 2021
@tigrannajaryan (Member)

@pmm-sumo @djaglowski this is the bug I mentioned when we were looking at k8s collection. Can one of you take this?

@djaglowski (Member)

@sumo-drosiek would you be able to validate open-telemetry/opentelemetry-collector#2873?

@pmm-sumo (Contributor)

@djaglowski what do you think: should we be fine with just using open-telemetry/opentelemetry-collector#2873, or would it be better to make a deeper change in the source code?

@djaglowski (Member) commented Mar 25, 2021

@pmm-sumo There are probably some things we can do to improve the usability of the parsing operations, but I think you may be referring to whether we should need the "log" field in the first place? There is a specific reason it exists in this case: other values need to be parsed, and those values temporarily exist alongside "log" until they are parsed.

Detailed Version

Note that this regex operation is capturing multiple values, "time", "stream", and "log" (and also "logtag", but this doesn't appear to be used, so I am omitting from this example):

- type: regex_parser
  id: parser-containerd
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
  ...

At this point, the entry looks something like this (showing only the parts relevant to this example):

{
  "record": {  
    "time": "...",
    "stream": "...",
    "log": "The rest of the log line",
  },
  "attributes": { },
  "timestamp": "default,
  ...
}

Then some additional processing is done based on the other values:

...
  timestamp:
    parse_from: time # parses "time" to timestamp
    ...
- type: metadata
  attributes:
    stream: 'EXPR($.stream)' # moves "stream" to attributes

Now we have:

{
  "record": {  
    "log": "The rest of the log line",
  },
  "attributes": { 
    "stream": "stderr",
  },
  "timestamp": "2021-03-25T...",
  ...
}

Now we can clean up the record (this ops fragment lives inside a restructure operator):

- type: restructure
  ops:
    - move:
        from: log
        to: $

This leaves us with:

{
  "record": "The rest of the log line",
  "attributes": { 
    "stream": "stderr",
  },
  "timestamp": "2021-03-25T...",
  ...
}

There are a few minor things that I would like to improve here though, such as this and this, but I'm not sure what specifically we should do to improve this parsing pipeline.

@tigrannajaryan (Member)

@djaglowski a slightly different approach that could possibly work is for the regex operator to extract values directly into "attributes", and also to have an option to specify which extracted value becomes the new value of the "record" (or "Body" in OTel terminology). For example:

- type: regex_parser
  id: parser-containerd
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$'
  set_body_from: log # this refers to an extracted field name in the line above

Now once this is applied the data will look like this:

{
  "record": "The rest of the log line",
  "attributes": {
    "time": "...",
    "stream": "...",
  },
  "timestamp": "default,
  ...
}

I don't know how feasible this is, just an idea.

@djaglowski (Member) commented Mar 26, 2021

I think when you consider a wider set of log parsing cases, you end up needing the ability to parse into an arbitrary object that you can then manipulate as needed. You might need to parse further. You might keep or discard any field at all.

This isn't really a regex-specific problem either; it's a general question of how parsing operations should behave. To me, the primary action you are taking with any parsing operation is to give structure to the log entry. Once the values are individually accessible, you can parse them further or move them into the appropriate special fields (resource, attributes, timestamp, severity).

All that said, there are many scenarios such as this that are much more constrained. Taking inspiration from @tigrannajaryan's suggestion, I think there is potentially a way in which we can support cases such as this one, while still allowing arbitrary field manipulations if needed.

We already support a concept of "nested parsers" for timestamp and severity. These are just subsequent operations that can be specified within any parser:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp: # this is a timestamp_parser embedded within the regex_parser
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity: # this is a severity_parser embedded within the regex_parser
    parse_from: sev

What is really happening here is essentially the same thing as if you specified:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
- type: timestamp_parser # separate operator
  parse_from: time
  layout: '%Y-%m-%dT%H:%M:%S.%LZ'
- type: severity_parser # separate operator
  parse_from: sev

So working with the same idea in mind, would it be helpful to embed additional operators into all parsers?

The metadata operator (possibly with some modifications) could also be embedded in the same way. This would allow roughly:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid

Tigran's suggested set_body_from: log would fit the pattern just as well. Maybe this is a special field on parsers, or maybe it's just a simplified version of restructure.move, but this would allow roughly:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid
  record: log

If instead, log needed to be parsed further, a user could do:

- type: regex_parser
  id: my_regex
  regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<sev>[^ ]*) (?P<uuid>[^ ]*) (?P<log>.*)$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  severity:
    parse_from: sev
  attributes:
    - stream
  resource:
    - uuid
- type: regex_parser
  id: more_parsing
  parse_from: log
  regex: '^(?P<some_id>[^ ^Z]+Z) (?P<the_rest>.*)$'
  resource:
    - some_id
  record: the_rest

There are some nuances here as well, such as how ordering is applied, but these seem manageable.

Thoughts on this?

@pmm-sumo (Contributor)

I concur with @tigrannajaryan's note on using attributes. Looking at the data model, I feel we might have too many nesting levels right now. My understanding is that this is an outcome of mapping Stanza's original model onto the OpenTelemetry model, but I think it could evolve to be more in line with the specification.

There is the technical part which, as described above, requires a common framework for making all the transformations, which are based on transforming fields present in the body. I think that for many use cases it would be more natural to deal with attributes (or even resource) instead, rather than keeping additional fields in the body.

My understanding (which might be overly simplified) is that we are in principle dealing with three types of log sources:

  • Free text - probably the most common case; this is where we extract timestamp, severity, stream, etc. using regexes, as in the examples above
  • Structured (e.g. JSON) - this is when the body contains a sort of map on the input; if transformations are applied, then only to specific fields/attributes of it
  • Structure wrapped in text - a mix of both, e.g. 2021-03-29 09:10:15 Something something {"severity":"info", "foo":"bar"} - to some extent Zap output can be considered here (see the sketch after this list)
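
A rough sketch of handling that third case, assuming a regex_parser that splits off the JSON fragment and a json_parser that consumes it (the regex and field names are illustrative, not from any shipped config):

- type: regex_parser
  regex: '^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<msg>[^{]*)(?P<json>\{.*\})$'
  timestamp:
    parse_from: time
    layout: '%Y-%m-%d %H:%M:%S'
- type: json_parser # parses the captured JSON fragment into structured fields
  parse_from: json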

I think that in principle, only the Structured case has strong arguments for keeping the map in the body (as this was the original model). Even then, the distinction between the body's attributes and attributes is not clear.

Additionally, I think there are fields that should always be moved to attributes (or even resource) for consistency reasons (such as severity).

I think I understand the reasons behind it, but I also believe that eventually the operators should be dealing with attributes rather than body content.

@djaglowski (Member)

@pmm-sumo You make some really good points here. Differentiating between the structured and other types of log sources is very useful. Following what you've laid out, I think there are some additional design points to clarify:

  • opentelemetry-log-collection needs to be enhanced such that resource and attributes support arbitrary data. Both are currently map[string]string. In the current form, any parser that emits a flat set of fields (e.g. regex, csv, syslog) could parse to these. I believe this already covers most of what we would want, so perhaps adding depth to resource and attributes is not urgent.
  • json parser would parse to body.
  • We have a uri parser which makes use of depth, so this would also need to parse to body until depth is added to resource/attributes.
  • Since at least one parser (json) would parse to a different destination than the others, we need to be very clear about this - probably just a docs consideration.
  • Regarding whether we should allow parsing directly to resource, as opposed to attributes: all parsers currently support a parse_to field. I think we would choose a default of attributes, and users could override it by setting it to resource (see the sketch after this list).
  • I'm not clear on what it would mean to move severity to attributes. severity is a top-level field in the spec, so I think the entry structure reflects that appropriately. Can you clarify this point?
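
For illustration, the parse_to override might look roughly like this (hypothetical: resource as a destination is the proposal here, not current behavior, and the regex is illustrative):

- type: regex_parser
  regex: '^(?P<uuid>[^ ]+) (?P<log>.*)$'
  parse_to: resource # proposed override of the attributes default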

@pmm-sumo (Contributor)

Thank you @djaglowski

One quick note:

(...) resource and attributes support arbitrary data. Both are currently map[string]string

Technically we can use a collection of key-value pairs, where the values are of the AnyValue type.

  • I'm not clear on what it would mean to move severity to attributes. severity is a top-level field in the spec, so I think the entry structure reflects that appropriately. Can you clarify this point?

You are right, I forgot it became part of the model. I was trying to illustrate that there are some attributes that should be present at the attributes level. A better example could be e.g. http.user_agent (which could be extracted from an Apache log, as in the sketch below).
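
A rough sketch of that extraction, reusing the embedded attributes syntax proposed above (the regex is a simplified Apache combined-log pattern; Go regex group names cannot contain dots, so user_agent would still need a rename to http.user_agent downstream):

- type: regex_parser
  regex: '^(?P<client>[^ ]+) [^ ]+ [^ ]+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d+) (?P<bytes>[^ ]+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
  timestamp:
    parse_from: time
    layout: '%d/%b/%Y:%H:%M:%S %z' # Apache timestamp, e.g. 10/Oct/2000:13:55:36 -0700
  attributes:
    - user_agent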

@djaglowski (Member) commented Mar 31, 2021

Technically we can use a collection of key values, where the latter are of AnyValue type

I think we may be talking about OTel vs "stanza" Entry format. My note on this point was basically just acknowledging that we need to update the old stanza format to support AnyValue.

@gillg (Contributor) commented Apr 6, 2021

Related discussion here: #14718
