Ideal parser settings #431

djaglowski · 2022-03-10T18:16:23Z

Parsing behavior in this library was originally designed for a different data model than the one we now have. The expected usage of body and attributes have been redefined. As such, the behavior of parsers should be reevaluated.

This is not my first attempt to propose changes to the way parsers behave. I believe my previous attempts have been too focused on proposing changes, which is difficult to do without lengthy explanations of the current behavior. The following is instead meant to be read as a standalone proposal for how parsers should behave. Please read accordingly.

I suggest the following would be ideal behavior:

Parsing should be non-destructive by default.
Parsing should be destructive only when the same value is specified for both parse_from and parse_to.
The parse_from setting should default to body.
The parse_to setting should default to attributes.
The preserve_to setting should not exist.

Rationale

Parsing should be non-destructive by default.

The primary purpose of parsing operations is to isolate information. Where the extracted information should be placed is a secondary, but obvious, concern. However, what happens to the original value is easily overlooked. As a general principle, destruction of the original value should require intentional configuration.

Parsing should be destructive only when the same value is specified for both parse_from and parse_to.

A somewhat common use case involves applying structure to an unstructured log. For example, when a user extracts information from a line of text, they may wish keep only the key value pairs they have isolated. When this is the case, an intuitive choice is to overwrite the original value. This should be supported but should require explicit configuration.

The parse_from setting should default to body.

The body typically contains an unstructured log. Parsing this value is the most common parsing use case. Therefore, parsers should default to parsing the body.

The parse_to setting should default to attributes.

The result of a parsing operation is a structured object. This is true of all parsers that have a parse_to field.

The data model strongly encourages that structured data belong in attributes, so this is a natural default value for parse_to.

Additionally, the default value for parse_to should be different than that of parse_from, else we have a situation where parsing operations are destructive by default.

The preserve_to setting should not exist.

If parsing is non-destructive by default, there is no need to provide a special setting for preservation. In the case where a user wishes to "back up" a value and destroy the original, they can do so by copying or moving the value before parsing.

The text was updated successfully, but these errors were encountered:

pmm-sumo · 2022-03-11T17:44:19Z

@djaglowski FYI, I run the survey across several filelog receiver users and they support the changes

djaglowski · 2022-03-11T18:26:55Z

Thank you @pmm-sumo, I'm glad to have many eyes on this!

jsirianni · 2022-03-11T19:20:59Z

Just want to make sure I'm keeping up. With body being preserved and everything defaulting to attributes, would this be the appropriate workflow?

Parse body
Move attributes.message field to body (overwriting the raw message)

Input message

"{\"msg\":\"user failed to login\", \"id\": 01, \"env\":\"dev\"}"

Json parser pseudo output

attributes
- msg: "user failed to login"
- id: 01
- env: "dev"
body: "{\"msg\":\"user failed to login\", \"id\": 01, \"env\":\"dev\"}"

Move operator output

attributes
- id: 01
- env: "dev"
body: "user failed to login"

djaglowski · 2022-03-11T19:32:54Z

That is the idea, though there's nothing requiring that msg be moved back to body. Some may wish to preserve the original body.

As a possible future feature, we could add a body block onto parsers, since this will be such a common workflow.

type: json_parser
timestamp:
  layout: ...
  parse_from: time
severity:
  parse_from: sev
body: msg

jsirianni · 2022-03-11T20:03:23Z

I would be a fan of the body option. Not destructive by default, but allows us to skip the move operator. My example is a common workflow for myself, where I put everything into attributes and replace body with the message field's value.

Popular loggers usually have a message field, severity, timestamp. Generally I would want to promote timestamp and severity and only preserve the message field under body.

Input

{"level":"info","timestamp":"2022-03-10T13:10:35.521-0500","message":"server stopped cleanly, exiting"}

Output

severity: info
timestamp: 2022-03-10T13:10:35.521-0500
attributes:
resources:
body: server stopped cleanly, exiting

For something like nginx, this would not make sense because there is technically no message field, preserving body as is would make a lot of sense

Input

10.33.104.40 - - [11/Jan/2021:11:25:01 -0500] "GET / HTTP/1.1" 200 612 "-" "curl/7.58.0"

Output

timestamp: 11/Jan/2021:11:25:01 -0500
attributes
- remote_addr: 10.33.104.40
- method: GET
- path: /
- proto: HTTP/1.1
- status: 200
body: 10.33.104.40 - - [11/Jan/2021:11:25:01 -0500] "GET / HTTP/1.1" 200 612 "-" "curl/7.58.0"

djaglowski · 2022-03-30T13:08:50Z

I created open-telemetry/opentelemetry-collector-contrib#10274 to capture the idea of adding a body block to parsers. This can be a follow up issue to this one.

tigrannajaryan · 2022-04-05T17:39:08Z

Parsing should be non-destructive by default.

I agree with the principle. Question: if the parser takes and input and replaces it with a different representation that is reversible do we consider it non-destructive? So for example a JSON parser detects that Body is a well formed JSON string and replaces it with a structured Body (writes to the same field) that represents the same values but in structured form. Is this an acceptable behavior or we consider this a destruction?

tigrannajaryan · 2022-04-05T17:42:35Z

The preserve_to setting should not exist.

My position on this probably depends on the answer to my previous question.

tigrannajaryan · 2022-04-05T17:42:59Z

I agree with behaviors 2-4.

djaglowski · 2022-04-05T19:27:54Z

if the parser takes and input and replaces it with a different representation that is reversible do we consider it non-destructive? So for example a JSON parser detects that Body is a well formed JSON string and replaces it with a structured Body (writes to the same field) that represents the same values but in structured form. Is this an acceptable behavior or we consider this a destruction?

I see your point, and realistically I would describe this as non-destructive.

From a practical standpoint though, I'm not sure the distinction really matters. Assuming behaviors 2-4 stand, we have nearly guaranteed already that parsing is non-destructive by default. The only other way in which the parsing operation could still be destructive by default is if the original value is removed, which is where behavior 5 suggests that it not be.

That said, I think behavior 5 is less important than 2-4, and may be worth reconsidering. Timestamps and severity values would typically not be missed, once parsed. Perhaps the same can be said of encoded json. On the other hand, I think regex should probably not consume the original value. CSV and key/value are less clear because in some sense "nothing is lost", but the original value may be useful for troubleshooting or just because a string representation is expected. Determining this on a case-by-case is too nuanced in my opinion, so I'd prefer we go one way or the other across the board.

So I think the other option is to reject behavior 5, consume the parse_from field by default, and keep preserve_to behavior as is.

tigrannajaryan · 2022-04-05T21:07:52Z

Actually I was imagining a hypothetical scenario that is impossible, so ignore my question/objection. I am good with your entire proposal.

djaglowski mentioned this issue Mar 10, 2022

Update Changelog ahead of major set of breaking changes #430

Merged

djaglowski mentioned this issue May 24, 2022

[pkg/stanza] Parser "body" block open-telemetry/opentelemetry-collector-contrib#10274

Closed

djaglowski mentioned this issue Apr 4, 2022

Change default value of 'parse_to' to 'attributes' #463

Merged

djaglowski mentioned this issue Apr 7, 2022

Remove preserve_to #464

Merged

djaglowski closed this as completed in #464 Apr 8, 2022

djaglowski mentioned this issue Apr 20, 2022

Clarify the input/output expectations of parsers open-telemetry/opentelemetry-collector-contrib#3129

Closed

damemi mentioned this issue Nov 30, 2022

[exporter/googlecloud] Exporter for Logs should follow Data Model open-telemetry/opentelemetry-collector-contrib#16495

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ideal parser settings #431

Ideal parser settings #431

djaglowski commented Mar 10, 2022 •

edited

Loading

pmm-sumo commented Mar 11, 2022

djaglowski commented Mar 11, 2022

jsirianni commented Mar 11, 2022

djaglowski commented Mar 11, 2022

jsirianni commented Mar 11, 2022

djaglowski commented Mar 30, 2022

tigrannajaryan commented Apr 5, 2022 •

edited

Loading

tigrannajaryan commented Apr 5, 2022

tigrannajaryan commented Apr 5, 2022

djaglowski commented Apr 5, 2022

tigrannajaryan commented Apr 5, 2022

Ideal parser settings #431

Ideal parser settings #431

Comments

djaglowski commented Mar 10, 2022 • edited Loading

Rationale

pmm-sumo commented Mar 11, 2022

djaglowski commented Mar 11, 2022

jsirianni commented Mar 11, 2022

djaglowski commented Mar 11, 2022

jsirianni commented Mar 11, 2022

djaglowski commented Mar 30, 2022

tigrannajaryan commented Apr 5, 2022 • edited Loading

tigrannajaryan commented Apr 5, 2022

tigrannajaryan commented Apr 5, 2022

djaglowski commented Apr 5, 2022

tigrannajaryan commented Apr 5, 2022

djaglowski commented Mar 10, 2022 •

edited

Loading

tigrannajaryan commented Apr 5, 2022 •

edited

Loading