Skip to content
This repository has been archived by the owner on May 25, 2022. It is now read-only.

Ideal parser settings #431

Closed
djaglowski opened this issue Mar 10, 2022 · 11 comments · Fixed by #464
Closed

Ideal parser settings #431

djaglowski opened this issue Mar 10, 2022 · 11 comments · Fixed by #464

Comments

@djaglowski
Copy link
Member

djaglowski commented Mar 10, 2022

Parsing behavior in this library was originally designed for a different data model than the one we now have. The expected usage of body and attributes have been redefined. As such, the behavior of parsers should be reevaluated.

This is not my first attempt to propose changes to the way parsers behave. I believe my previous attempts have been too focused on proposing changes, which is difficult to do without lengthy explanations of the current behavior. The following is instead meant to be read as a standalone proposal for how parsers should behave. Please read accordingly.

I suggest the following would be ideal behavior:

  1. Parsing should be non-destructive by default.
  2. Parsing should be destructive only when the same value is specified for both parse_from and parse_to.
  3. The parse_from setting should default to body.
  4. The parse_to setting should default to attributes.
  5. The preserve_to setting should not exist.

Rationale

  1. Parsing should be non-destructive by default.

The primary purpose of parsing operations is to isolate information. Where the extracted information should be placed is a secondary, but obvious, concern. However, what happens to the original value is easily overlooked. As a general principle, destruction of the original value should require intentional configuration.

  1. Parsing should be destructive only when the same value is specified for both parse_from and parse_to.

A somewhat common use case involves applying structure to an unstructured log. For example, when a user extracts information from a line of text, they may wish keep only the key value pairs they have isolated. When this is the case, an intuitive choice is to overwrite the original value. This should be supported but should require explicit configuration.

  1. The parse_from setting should default to body.

The body typically contains an unstructured log. Parsing this value is the most common parsing use case. Therefore, parsers should default to parsing the body.

  1. The parse_to setting should default to attributes.

The result of a parsing operation is a structured object. This is true of all parsers that have a parse_to field.

The data model strongly encourages that structured data belong in attributes, so this is a natural default value for parse_to.

Additionally, the default value for parse_to should be different than that of parse_from, else we have a situation where parsing operations are destructive by default.

  1. The preserve_to setting should not exist.

If parsing is non-destructive by default, there is no need to provide a special setting for preservation. In the case where a user wishes to "back up" a value and destroy the original, they can do so by copying or moving the value before parsing.

@pmm-sumo
Copy link

@djaglowski FYI, I run the survey across several filelog receiver users and they support the changes

@djaglowski
Copy link
Member Author

Thank you @pmm-sumo, I'm glad to have many eyes on this!

@jsirianni
Copy link
Member

Just want to make sure I'm keeping up. With body being preserved and everything defaulting to attributes, would this be the appropriate workflow?

  1. Parse body
  2. Move attributes.message field to body (overwriting the raw message)

Input message

"{\"msg\":\"user failed to login\", \"id\": 01, \"env\":\"dev\"}"

Json parser pseudo output

attributes
- msg: "user failed to login"
- id: 01
- env: "dev"
body: "{\"msg\":\"user failed to login\", \"id\": 01, \"env\":\"dev\"}"

Move operator output

attributes
- id: 01
- env: "dev"
body: "user failed to login"

@djaglowski
Copy link
Member Author

That is the idea, though there's nothing requiring that msg be moved back to body. Some may wish to preserve the original body.

As a possible future feature, we could add a body block onto parsers, since this will be such a common workflow.

type: json_parser
timestamp:
  layout: ...
  parse_from: time
severity:
  parse_from: sev
body: msg

@jsirianni
Copy link
Member

I would be a fan of the body option. Not destructive by default, but allows us to skip the move operator. My example is a common workflow for myself, where I put everything into attributes and replace body with the message field's value.

Popular loggers usually have a message field, severity, timestamp. Generally I would want to promote timestamp and severity and only preserve the message field under body.

Input

{"level":"info","timestamp":"2022-03-10T13:10:35.521-0500","message":"server stopped cleanly, exiting"}

Output

severity: info
timestamp: 2022-03-10T13:10:35.521-0500
attributes:
resources:
body: server stopped cleanly, exiting

For something like nginx, this would not make sense because there is technically no message field, preserving body as is would make a lot of sense

Input

10.33.104.40 - - [11/Jan/2021:11:25:01 -0500] "GET / HTTP/1.1" 200 612 "-" "curl/7.58.0"

Output

timestamp: 11/Jan/2021:11:25:01 -0500
attributes
- remote_addr: 10.33.104.40
- method: GET
- path: /
- proto: HTTP/1.1
- status: 200
body: 10.33.104.40 - - [11/Jan/2021:11:25:01 -0500] "GET / HTTP/1.1" 200 612 "-" "curl/7.58.0"

@djaglowski
Copy link
Member Author

I created open-telemetry/opentelemetry-collector-contrib#10274 to capture the idea of adding a body block to parsers. This can be a follow up issue to this one.

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Apr 5, 2022

Parsing should be non-destructive by default.

I agree with the principle. Question: if the parser takes and input and replaces it with a different representation that is reversible do we consider it non-destructive? So for example a JSON parser detects that Body is a well formed JSON string and replaces it with a structured Body (writes to the same field) that represents the same values but in structured form. Is this an acceptable behavior or we consider this a destruction?

@tigrannajaryan
Copy link
Member

The preserve_to setting should not exist.

My position on this probably depends on the answer to my previous question.

@tigrannajaryan
Copy link
Member

I agree with behaviors 2-4.

@djaglowski
Copy link
Member Author

if the parser takes and input and replaces it with a different representation that is reversible do we consider it non-destructive? So for example a JSON parser detects that Body is a well formed JSON string and replaces it with a structured Body (writes to the same field) that represents the same values but in structured form. Is this an acceptable behavior or we consider this a destruction?

I see your point, and realistically I would describe this as non-destructive.

From a practical standpoint though, I'm not sure the distinction really matters. Assuming behaviors 2-4 stand, we have nearly guaranteed already that parsing is non-destructive by default. The only other way in which the parsing operation could still be destructive by default is if the original value is removed, which is where behavior 5 suggests that it not be.

That said, I think behavior 5 is less important than 2-4, and may be worth reconsidering. Timestamps and severity values would typically not be missed, once parsed. Perhaps the same can be said of encoded json. On the other hand, I think regex should probably not consume the original value. CSV and key/value are less clear because in some sense "nothing is lost", but the original value may be useful for troubleshooting or just because a string representation is expected. Determining this on a case-by-case is too nuanced in my opinion, so I'd prefer we go one way or the other across the board.

So I think the other option is to reject behavior 5, consume the parse_from field by default, and keep preserve_to behavior as is.

@tigrannajaryan
Copy link
Member

Actually I was imagining a hypothetical scenario that is impossible, so ignore my question/objection. I am good with your entire proposal.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants