Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] RFC - Direct XML manipulation functions #35281

Closed
djaglowski opened this issue Sep 18, 2024 · 11 comments
Closed

[pkg/ottl] RFC - Direct XML manipulation functions #35281

djaglowski opened this issue Sep 18, 2024 · 11 comments
Labels
enhancement New feature or request pkg/ottl

Comments

@djaglowski
Copy link
Member

djaglowski commented Sep 18, 2024

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

XML is frequently used in traditional logging frameworks, but within the collector and downstream tools it is often difficult to manipulate.

Before going further, I believe it would be helpful to define a term: "JSON-equivalent". Basically, a plog.LogRecord's body or attributes can be losslessly converted to or from JSON (or YAML, or some other formats).

Notably, XML is not JSON-equivalent, at least not generally. However, it is possible to define a subset of XML which is JSON-equivalent, which we could call "JSON-equivalent XML". (More on this below.)

We currently have a ParseXML function, but in order to deal with the fact that XML is not generally JSON-equivalent, we are producing an encoding of XML. The encoding is necessarily JSON-equivalent, but ultimately it is an overly verbose representation that OTTL is not well suited to manipulate in ways that respect the encoding. That means that our current strategy for parsing XML has very limited value because users find it difficult to work with in OTTL and at least in some backends.

Describe the solution you'd like

In order to better support XML, I believe we should provide the following:

  1. Some functions which work directly with XML documents, without forcing the user to work with a JSON-equivalent encoding of XML. These could lean on XML native technologies such as XPath to navigate and manipulate the documents directly.
  2. A recommended (or maybe automated) pathway for using these functions to migrate XML documents into a JSON-equivalent state.
  3. A new "JSON-equivalent XML" parser, which produces a much more useful output.

Example

Suppose we have the following XML document:

<Data foo="bar" hello="world">
    Some text
    <One>With text</One>
    <Two>
        <Three>3</Three>
        <Four>4</Four>
        <Three note="again">3</Three>
    </Two>
</Data>

In order to make this JSON-equivalent, we can't have both attributes and child elements. We also can't have raw values at the same level as child elements. A JSON-equivalent version might look something like this:

<Data>
    <foo>bar</foo>
    <hello>world</hello>
    <value>Some text</value>
    <One>With text</One>
    <Two>
        <Three>3</Three>
        <Four>4</Four>
        <Three>
            <note>again</note>
            <value>3</value>
        </Three>
    </Two>
</Data>

This can then be converted directly into a useful object:

Data:
  foo: bar
  hello: world
  value: Some text
  One: With text
  Two:
    - Three: 3
    - Four: 4
    - Three:
        note: again
        value: 3

In order to accomplish this migration, we need some functionality:

  • Convert attributes into elements
  • Detect raw values which are at the same level as elements, and wrap the value in an element

Notably, there is a reasonable amount of subjectivity here. In the example there are two instances of the Three tag, but they end up in different formats because of the presence of an attribute on one of them. This may be problematic for the user and there are likely many similar situations. I believe a general solution will require offering a set of composable functions that allow the user to make their own decisions about how to manipulate the representation into a JSON-equivalent format that meets their needs.

Describe alternatives you've considered

No response

Additional context

No response

@djaglowski djaglowski added enhancement New feature or request needs triage New item requiring triage labels Sep 18, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth
Copy link
Member

A function that can manipulate the xml string in place seems useful. That feels simpler than doing:

- set(cache["xmlMap"], ParseXML(body))
... # manipulate cache[xmlMap"]
- set(body, MarshalXML(cache["xmlMap"]))

I am not experienced enough with XML to propose what kind of functions we'd need for that. Some OTTL guidelines that may be helpful when brainstorming ideas:

  • OTTL doesn't have the concept of variables within the scope of a function. Functions could implement a caching mechanism to remember things between invocations, but our design principles currently prohibit sharing information that way for general ottlfuncs functions.
  • All our contexts contain a cache field that exists as a place for users to store information between statements. It is currently limited by pdata (see [pkg/ottl] Express context cache as a map[string]any. #26108)
  • OTTL can easily chain functions together, but the more chaining the worse the statement looks. Something like set(body, FunctionC(FunctionB(FunctionA(Parser(body)))) isn't great. The work around for this today is to use cache to store the information between statements. We're aware of this chaining annoyance, but haven't solved it yet.
  • Using Editors to encapsulate complex transformations is ok. While we generally like using set as the primary solution for updating telemetry, we have other methods, like merge_maps and flatten the work directly on the target field. Editors end up being a good solution if the transformation in question would result in bunch of chaining if Converters were used.

@djaglowski
Copy link
Member Author

Thanks for your thoughts on this @TylerHelmuth. I'm thinking we could mostly rely on Converters here. They would take a target parameter, which would need to be an xml formatted string. Otherwise parameters would be things like strings which are XPaths, or names of tags to create, etc. The cache could be useful if someone wants to work on a backup of the original value, but I think they could also just incrementally overwrite the target. It might help to add more detail to the above example.

Starting from the same xml document (and assuming this is the body):

set(body, DeleteXML(body, "*//@note")) takes an XPath parameter and deletes any "note" attributes

 <Data foo="bar" hello="world">
     Some text
     <One>With text</One>
     <Two>
         <Three>3</Three>
         <Four>4</Four>
-        <Three note="again">3</Three>
+        <Three>3</Three>
     </Two>
 </Data>

set(body, ConvertXMLAttributes(body)) converts any remaining attributes into child elements.

- <Data foo="bar" hello="world">
+ <Data>
+   <foo>bar</foo>
+   <hello>world</hello>
     Some text
     <One>With text</One>
     <Two>
         <Three>3</Three>
         <Four>4</Four>
         <Three>3</Three>
     </Two>
 </Data>

set(body, WrapFloatingXMLValues(body, "value")) finds instances where values exist at the same level as elements, and wraps them in a tag with the specified name

 <Data>
     <foo>bar</foo>
     <hello>world</hello>
-     Some text
+     <value>Some text</value>
     <One>With text</One>
     <Two>
         <Three>3</Three>
         <Four>4</Four>
         <Three>3</Three>
     </Two>
 </Data>

Then finally set(body, ParseSimplifiedXML(body)) just converts the simplified (JSON-equivalent) xml string into an attributes map.

If I'm not mistaken, the could compose these inline, but it's not clear to me if there's much benefit to this. Personally I would just use separate statements:

set(body, ParseSimplifiedXML(WrapFloatingXMLValues(ConvertXMLAttributes(DeleteXML(body, "*//@note")), "value")))

Either way, I'm not necessarily proposing the exact Converters in this example, but I think these are pretty close to what we'd need in the short term. Just wanted to articulate better how I imagine the user would incrementally convert their xml into a JSON-equivalent format, and ultimately to a clean attributes map.

@crobert-1
Copy link
Member

Removing needs triage as a code owner has responded approving the idea.

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Sep 19, 2024
@djaglowski
Copy link
Member Author

I've opened #35301 with the first concrete direct-xml manipulation converter as described above. If this looks good, I'll add a few more in the coming days and start work on the JSON-equivalent XML parser.

@TylerHelmuth
Copy link
Member

@djaglowski would we gain efficiencies if we could pass around a parsed xml doc between functions? As I look at the proposed converters I am a little worried about having to parse and then convert back to a string each time.

@djaglowski
Copy link
Member Author

djaglowski commented Sep 27, 2024

would we gain efficiencies if we could pass around a parsed xml doc between functions

In theory, yes, but there are two requirements:

  1. A representation of the parsed doc. If this just a pcommon.Map, then we have the same problem this issue proposes to solve. In other words, the representation needs to be truly XML equivalent but pcommon.Map is insufficient for this.
  2. The best way to navigate XML docs is XPath, but it's pretty complicated. If we choose a representation which meets the first requirement, we still need a way to apply XPath to it.

If we pass around xmlquery's representation of a parsed doc, then in theory this could work, but it seems a little janky to me.

Another way we could try to go is to define a general "XML Converter" which takes a list of XML-specific statements (which follow a different contract than OTTL Converters). Maybe something like:
ParseXMLSteps(target, remove("//x"), insert("<b/><b/>, "//y"), elementizeAttributes(), elementizeValues())

Maybe it's worthwhile but unless I'm missing something it seems like any of the above would require a decent amount of new machinery.

@TylerHelmuth
Copy link
Member

Ya I was thinking of passing around xmlquery's parsed doc, but that feels leaky. Sticking with strings for now is ok. Can we add some xml "e2e" benchmarks? Something that benchmarks a reasonable use of the xml functions so we can see what users will experience?

@djaglowski
Copy link
Member Author

Can we add some xml "e2e" benchmarks? Something that benchmarks a reasonable use of the xml functions so we can see what users will experience?

That's a good idea. I created #35471 to track this and I'll take care of it once we have some more converters to chain together.

@djaglowski
Copy link
Member Author

I opened #35493 with a benchmark that uses GetXML, InsertXML and RemoveXML in a round trip. It depends on #35436.

djaglowski added a commit that referenced this issue Oct 10, 2024
This adds a converter called `ElementizeAttributesXML`. This serves as
one of the granular transformations described in #35281 which will allow
users to migrate any arbitrary XML document into a JSON-equivalent
state.

Also see #35364
djaglowski added a commit that referenced this issue Oct 10, 2024
This adds a converter called `ConvertTextToElementsXML `. This serves as
one of the granular transformations described in
#35281
which will allow users to migrate any arbitrary XML document into a
JSON-equivalent state.

Also see #35328
djaglowski added a commit that referenced this issue Oct 15, 2024
This adds a converter called `ParseSimplifiedXML`. This serves as the
final step described in
#35281,
which will allow users to parse any arbitrary XML document into
user-friendly result, by first transforming the document in place with
other functions (e.g. #35328 and #35364) and then calling this function.

---------

Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
@djaglowski
Copy link
Member Author

I think this issue can be closed now that the linked PRs have been merged. It's possible that there are other functions we will need, but now that we've established the basic pattern, we should consider them as needed.

sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024
…ry#35328)

This adds a converter called `ElementizeAttributesXML`. This serves as
one of the granular transformations described in open-telemetry#35281 which will allow
users to migrate any arbitrary XML document into a JSON-equivalent
state.

Also see open-telemetry#35364
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024
This adds a converter called `ConvertTextToElementsXML `. This serves as
one of the granular transformations described in
open-telemetry#35281
which will allow users to migrate any arbitrary XML document into a
JSON-equivalent state.

Also see open-telemetry#35328
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024
This adds a converter called `ParseSimplifiedXML`. This serves as the
final step described in
open-telemetry#35281,
which will allow users to parse any arbitrary XML document into
user-friendly result, by first transforming the document in place with
other functions (e.g. open-telemetry#35328 and open-telemetry#35364) and then calling this function.

---------

Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pkg/ottl
Projects
None yet
Development

No branches or pull requests

3 participants