Better multiline handling: s-expressions #1559

Closed
StephenWakely opened this issue Jan 21, 2020 · 10 comments
Labels
• domain: parsing (Anything related to parsing within Vector)
• sink: file (Anything `file` sink related)
• type: enhancement (A value-adding code change that enhances its existing functionality.)
• type: help (User support and help.)

Comments

@StephenWakely (Contributor)

We delimit our log messages using s-expressions:

("2020-01-21 14:51:39" 2 :MESSAGE (:ACTION SUBMIT-QUOTE) "Error occurred causing transaction to roll back : Database error 23503: insert or update on table \"sales_invoice\" violates foreign key constraint \"sales_invoice_sales_account_fkey\"
DETAIL: Key (sales_account)=(9000) is not present in table \"sales_account\".
QUERY: INSERT INTO \"sales_invoice\" (\"debtor_debit_transaction\", \"sales_account\") VALUES ($$2$$, $$9000$$) RETURNING \"id\";")

The file source is currently unable to parse this message because it spans multiple lines, and there is no set character that determines the start of the next log message. Using `(` won't work, as it is sometimes used within the log message itself.

It would be handy if there was a way to incorporate a more advanced parser that can keep track of all the quotes and parens (including taking into account when they are escaped) to determine when a message has been fully loaded.

It would probably be handy to share this logic with the socket source as well.

@binarylogic added the `sink: file`, `type: enhancement`, and `type: help` labels on Jan 21, 2020
@binarylogic (Contributor)

@MOZGIII this seems related to the work you're doing. I'm curious whether what you're working on will solve this outright, or whether it's only closely related?

@MOZGIII (Contributor) commented Jan 21, 2020

Yes, I guess this should be possible to implement using the merge transform I'm currently working on, plus a custom lua transform for the partial message marker.

It doesn't look like it'd be possible to do the partial message marking with a regexp, but with custom logic it's doable.

@binarylogic (Contributor)

Thanks for this @FungusHumungus. We have a discussion brewing about how to best solve this and will follow up as we get more clarity on our solution.

@binarylogic (Contributor)

@MOZGIII I'm assigning this to you as part of your overall merge work. It probably makes sense to see how we can modify the existing file source behavior to cleanly support this. Again, my hope is that after we have a few sources with this functionality we can start to extract common patterns.

@binarylogic added this to the "Improve log continuation strategy" milestone on Jan 25, 2020
@MOZGIII (Contributor) commented Feb 17, 2020

> It would be handy if there was a way to incorporate a more advanced parser that can keep track of all the quotes and parens (including taking into account when they are escaped) to determine when a message has been fully loaded.

On second thought, this case is very tricky.

First of all, s-expressions are a context-free grammar, so we can't parse them with regexps.

Then, our lua implementation is currently stateless: it resets the lua context each time an event is processed. This means there is nowhere to keep the state we'd need in order to mark events as partial or non-partial (and thus to leverage the merge transform).

The good news is that this is about to change in a way that will (presumably) support this use case.

That said, to solve this issue 100% correctly with a lua transform plus a merge transform, you'd need to implement a tokenizer in lua for the s-expression variant you're using: rather than just counting `(` and `)`, you'd also want to detect embedded string values (`"`), ignore `(`/`)` inside them, and handle the escapes `\"` and `\\`.

There is a simpler way, of course: just count the `(` and `)` characters in the string without any tokenization, and hope that the inner log messages never contain an unmatched bracket.

In both scenarios (with and without the tokenizer), if the count of opening brackets minus the count of closing brackets is non-zero, the message is partial, so you add the partial event marker field (`_partial: true`) to the event. If it is zero, the message is complete and should be passed through as-is (no partial event marker). This properly marks events for the merge transform, which handles the rest of the heavy lifting: merging huge partial messages together may be a non-trivial task for a lua transform, but the merge transform, being implemented in rust, can do it optimally, with less overhead than lua would have.
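For illustration, here is a minimal sketch of that tokenizing approach in lua, written as a plain function rather than wired into vector's lua transform (the hook-up details, the `message` field name, and the `_partial` marker field are assumptions for the sketch, not a confirmed API). It keeps the bracket depth and in-string state across lines, which is exactly the cross-event state the current stateless lua transform cannot hold yet:

```lua
-- Sketch: mark a log line as partial unless it closes the outermost "(".
-- State must survive between invocations (one call per incoming line).

local depth = 0          -- running balance of "(" minus ")"
local in_string = false  -- true while inside a "..." literal
local escaped = false    -- true right after a "\" inside a string

local function mark_partial(event)
  local line = event["message"] or ""
  for i = 1, #line do
    local c = line:sub(i, i)
    if in_string then
      if escaped then
        escaped = false
      elseif c == "\\" then
        escaped = true
      elseif c == '"' then
        in_string = false
      end
    else
      if c == '"' then
        in_string = true
      elseif c == "(" then
        depth = depth + 1
      elseif c == ")" then
        depth = depth - 1
      end
    end
  end

  -- Still inside an s-expression or a string literal: the message continues.
  if depth > 0 or in_string then
    event["_partial"] = true
  end
  return event
end

return mark_partial
```

Run against the example message above, the first two lines end inside an open string (with a depth of 1), so they get `_partial = true`; the third line closes both, stays unmarked, and the merge transform would then fold all three into a single event.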

@MOZGIII (Contributor) commented Feb 17, 2020

This case generalizes to the problem vector currently has with parsing multiline messages. For instance, if the input were multi-line JSON rather than s-expressions, the problem would be the same.

The merge transform we implemented does not solve this general case, though it does help. For the particular case where messages are properly marked, it merges them independently of the inner grammar of the contents, and it does so very efficiently - for some scenarios this is a huge win compared to the alternatives.

However, parsing multi-line messages is, in fact, a different problem.
We can approach it from two angles.
Suppose there is some text that spans multiple messages (or any other kind of "chunk", if you will).

  • Scenario 1
    When all we want from the parsing process is to concatenate the text - like when there's a stacktrace or a segfault dump.
  • Scenario 2
    When we want to extract values from the message in structured form - for example, a JSON document arrives split across multiple lines and we don't care about its concatenated body, just the fields. This can be done in a single step if we tokenize and parse the message in a streaming fashion. We can still achieve the same result by concatenating the messages first and then running them through a non-streaming parser (i.e. going through Scenario 1), but that may be far less efficient, especially for huge payloads.

The merge transform helps with the first scenario, but it's useless for the second. The first scenario is also a special case of the second - the one where the value we extract is a single top-level string.

To sum up, the most flexible way of implementing this whole thing would be a streaming tokenizer/parser with pluggable grammars: JSON / top-level string / user-supplied pattern / user-supplied grammar. That way we'd be able to properly support incoming data streams without resorting to workarounds for "read framing".
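To make the "pluggable grammar" idea concrete, here is a rough sketch in lua (hypothetical names throughout; this is not an existing vector API) of a streaming framer that buffers incoming chunks and delegates the "where does a message end?" question to a grammar-specific callback:

```lua
-- Sketch of a streaming framer with a pluggable grammar.
-- `is_complete(buffer)` is the grammar plugin: it returns the byte length of
-- the first complete message in `buffer`, or nil if more input is needed.

local function new_framer(is_complete)
  local buffer = ""
  return function(chunk)
    buffer = buffer .. chunk
    local messages = {}
    while true do
      local len = is_complete(buffer)
      if not len then break end
      table.insert(messages, buffer:sub(1, len))
      buffer = buffer:sub(len + 1)
    end
    return messages -- zero or more complete messages framed so far
  end
end

-- Example grammar plugin: frame on a blank line ("top-level string" style).
-- An s-expression or JSON grammar would instead track nesting depth and
-- string/escape state, as in the earlier sketch.
local function blank_line_complete(buffer)
  local pos = buffer:find("\n\n", 1, true)
  return pos and (pos + 1) or nil
end

local frame = new_framer(blank_line_complete)
```

One could imagine a source shipping a few built-in grammars (JSON, delimiter-based, balanced brackets) and accepting a user-supplied callback as the escape hatch.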

@StephenWakely (Contributor, Author)

Makes sense. Could there be a case for calling into Lua to do this parsing?

@MOZGIII (Contributor) commented Feb 25, 2020

We've improved multi-line handling support, please check out #1852.
Since we didn't fully address this issue, I'm not going to close it, but I'll rename it to better represent its contents.
After an internal discussion, we concluded that it's unlikely we'll provide explicit support for s-expressions in the near future. Our new multi-line parsing capabilities might help with your problem, though!

@MOZGIII changed the title from "Better multiline handling" to "Better multiline handling: s-expressions" on Feb 25, 2020
@MOZGIII (Contributor) commented Feb 25, 2020

> Makes sense. Could there be a case for calling into Lua to do this parsing?

Once our lua transform is upgraded to persist data across invocations, this should definitely be doable!

@binarylogic removed this from the "Best-In-Class Log Merging" milestone on Jul 26, 2020
@binarylogic added the `domain: parsing` label on Aug 7, 2020
@binarylogic (Contributor)

Here's a tutorial on merging multi-line logs with lua: https://vector.dev/guides/advanced/merge-multiline-logs-with-lua/
