
Speed up parsing -- delay string and number parsing? #1857

Open
nicowilliams opened this issue Mar 7, 2019 · 12 comments

Comments

@nicowilliams
Contributor

We've talked about this before, especially in the context of printing numbers exactly as they were input.

@vdukhovni was telling me about a Haskell JSON parser that does this, so if you have large JSON texts and you're interested in some tiny portion, you don't need to pay the full cost of string and number parsing for all of the bits you don't need.

Realistically, we'd either have to keep checking for invalid strings and numbers at parse time, or else add a new option to enable lazy parsing (delayed parse errors would be surprising). Either way, we'd still need to do enough work at parse time to find the start and end positions of every number and string.
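For example, today a malformed number anywhere in the input fails the whole parse even when the program never touches it; with lazy parsing and no up-front validation, this error would instead surface whenever (if ever) .b got forced. (Output below is jq 1.6's; the exact message and column may vary.)

$ echo '{"a": 1, "b": 01}' | jq '.a'
parse error: Invalid numeric literal at line 1, column 17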

(Another option too is to hold on to a reference on the input JSON text and have the delayed-parse values refer to substrings of that, thus limiting memory pressure. But this wouldn't save that much allocation (arrays and objects would still need to be allocated), and wouldn't help the streaming parser (which wants to hold on to nothing in order to have a very small memory footprint). So I'd say no to this option.)

Speaking of the streaming parser, it is actually slower than the non-streaming parser! We should disentangle the two parsers and make the streaming parser reuse all of the path and output-value arrays it uses.

@leonid-s-usov
Contributor

Well, as I mentioned on the corresponding PR #1752, parsing a string into a decimal representation requires an order of magnitude fewer computations than parsing it into binary floating point.
As an additional speedup, I'm reusing a scratch dtoa context when the decimal actually does need to be converted to binary, and with some valuable help from @wtlangford there is now just one instance of the context per thread, initialized only once (please see this commit for the latest changes).

Early in my implementation, I actually pushed a literal-number PR that caches the original number string on the jv_number (hence the name of the branch), but we concluded that this technique was too restrictive in terms of what could further be accomplished with that string representation, and I have since reimplemented the feature on top of the decNumber library.
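For illustration, the practical upshot of keeping the decimal around should be something like this: a long integer literal survives a round-trip through this branch, while stock jq 1.6 rounds it through a binary double:

$ echo '100000000000000000001' | jq .     # with the decNumber branch
100000000000000000001
$ echo '100000000000000000001' | jq .     # stock jq 1.6
100000000000000000000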

I think that @wtlangford was going to run some performance tests on this branch. But I'm not sure if he has gotten to actually doing that.

@nicowilliams
Contributor Author

@leonid-s-usov BTW, thank you for your patience! So sorry this is taking so long. And, yes, I had that work in mind.

@leonid-s-usov
Contributor

> (Another option too is to hold on to a reference on the input JSON text and have the delayed-parse values refer to substrings of that, thus limiting memory pressure. But this wouldn't save that much allocation (arrays and objects would still need to be allocated), and wouldn't help the streaming parser (which wants to hold on to nothing in order to have a very small memory footprint). So I'd say no to this option.)

These aren't the only options. If we utilized jv_string for caching, we would get automatic ownership transfer from the origin to the last consumer.

@nicowilliams
Contributor Author

nicowilliams commented Mar 7, 2019

Yeah. Another thing I've been wanting to do is to add support for binary values, which would appear to jq programs to be no different from arrays [of numbers, all integers between 0 and 255 inclusive] but actually would be backed by a plain buffer. The cache could be of that sort.

(On output, binary values could print either as arrays of small integers or as base64-encoded strings; that could be a global option. A -b / --binary option would be needed to read binary from stdin, and it might need a block-size to go with it, which, if zero, might mean slurp.)
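Purely as a sketch (neither the flag nor the printing behaviour exists today), reading three raw bytes with the default array-of-small-integers output could look like:

$ printf '\x68\x69\xff' | jq -c --binary-input '.'    # hypothetical flag
[104,105,255]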

@leonid-s-usov
Contributor

leonid-s-usov commented Mar 7, 2019

Wait, but binary values would need to be embedded into the input JSON somehow, wouldn't they? How would a binary input be referenced from the input JSON?

@nicowilliams
Contributor Author

nicowilliams commented Mar 7, 2019

@leonid-s-usov the idea is that you could:

  • apply base64tobinary to any string containing base64-encoded data, get binary out
  • apply binarytobase64 to any binary, get a base64-encoded string
  • apply tobinary to any string, get the same UTF-8, but now marked as binary
  • apply tobinary to any array of small integers to get the same value, but as a binary
  • apply isbinary to any array, string, or binary to get a boolean indicating whether the value is binary-like (all strings would be, and all non-binary arrays of small integers would be as well, but arrays of other things wouldn't be, and neither would be true, false, null, numbers, or objects)
  • apply tostring to any binary (or array of small integers) and get back a string, with invalid codepoint substitution, naturally
  • use --binary-input to read raw binary input from stdin
  • use binaryinputs to read raw binary from stdin?
  • use --binary-output to write raw binary to stdout
  • use binaryoutput to write raw binary to stdout?
  • use .[index] to get any given byte of the binary
  • use the array slice operator .[start:end] to get substrings of a binary
  • use .[] to iterate the numeric byte values in a binary value
  • use + and assignment operators to construct binary values
    • a binary + [small_int], or a [small_int] + a binary should yield a binary
    • a binary + a string, or a string + a binary should yield a binary
  • use @binary "..." to get a binary version of that string
  • @base64 should apply to binary inputs too, naturally
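To be clear, none of these builtins exist yet; a hypothetical session sketching the intended semantics:

$ jq -n '"aGk=" | base64tobinary | .[0]'    # hypothetical builtins throughout
104
$ jq -n '[104, 105] | tobinary | binarytobase64'
"aGk="
$ jq -n '"plain text" | tobinary | isbinary'
true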

@jaslawinMs

What would be the consequence of such lazy parsing for an expression like "(.a)" with an input like {"a" : 0.0101010101101001001010101011010101111110101010}? In general, is it required to always convert the value? I feel like values could be passed through with no interpretation.

@vdukhovni

Speaking of binary data, the restriction that @base64d must produce valid UTF-8 output pretty much kills any usefulness of that function. I was hoping to be able to look at the resulting octets, but no go. It should either treat the result as a stream of 8-bit code points and, if desired, encode those to UTF-8, or be changed to produce an array of small integers. The current behaviour is useless IMHO.
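For example, as I understand jq 1.6's behaviour, the single octet 0xFF (base64 "/w==") is not valid UTF-8, so it comes back as the three-byte U+FFFD replacement character instead of the original byte:

$ jq -jn '"/w==" | @base64d' | xxd -p
efbfbd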

@pkoppstein
Contributor

pkoppstein commented Nov 21, 2021

@vdukhovni - You might be interested to know that gojq's @base64d does not have an "undefined behavior" caveat. (gojq is the Go implementation of jq.) I believe it decodes the base64-encoded string and then encodes the result as UTF-8. In other words, if you start with an arbitrary file, B, and base64-encode it to produce a file E, then

gojq -rRj '@base64d' E

should recover B.

I used this approach successfully to test gojq against https://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html:

$ diff <(gojq -Rrj '@base64d' B.base64) B
$

@jaslawinMs

Sorry, I think I may have given misleading input by providing just 0s and 1s. I see jq as a tool to transform input JSON into output JSON. For example, the input {"a" : 0.120, "b" : 100.0} converted by the expression "." yields {"a" : 0.12, "b" : 100}. I can see why these conversions are done; still, this is another view of it.

So when I saw this thread about delayed parsing, I was thinking that maybe with delayed parsing, in this particular case, no excessive conversions would be done. There is no transformation of the values whatsoever, so why couldn't jq just pass them through and emit all values exactly as they appear in the input? Similarly, if I used a value in a string interpolation, I would accept (or even expect?) "0.120" and "100.0", because why not? The input is pure text, string interpolation requires text, and the text is already there; no conversion would be needed.
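Concretely, this is what I observe with jq 1.6:

$ echo '{"a": 0.120, "b": 100.0}' | jq -c .
{"a":0.12,"b":100}
$ echo '{"a": 0.120, "b": 100.0}' | jq -r '"\(.a) \(.b)"'
0.12 100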

@vdukhovni

No, delayed parsing is simply lazy parsing that avoids doing too much work on values that don't ultimately get used. The output must still be canonical.

So numbers that are syntactically valid may remain as raw strings until their numeric value is needed, but note that outputting a value counts as needing it.

Similarly, strings could remain as escaped raw input until first use... But null, true, and false may as well be converted strictly. One could go further and skip materialising uninteresting object keys or array slots, but that's harder, because the compiler would have to figure out which fields the program will and won't ignore.

Since keys can be computed on the fly, that's ETOOHARD, I think. Perhaps with streaming (if fast enough) we could find a way to filter the stream for the wanted keys quickly enough, and then reassemble just the desired object fragments: some predicate on the stream, piped to fromstream.
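Something like that last idea already works with today's streaming machinery, just not quickly. For example, to reassemble only .a from a large document: the select predicate keeps the events whose paths start with "a", and truncate_stream drops that leading path component so fromstream can put the value back together:

$ echo '{"a": {"x": 1}, "b": "...megabytes we never look at..."}' |
    jq -cn --stream 'fromstream(1 | truncate_stream(inputs | select(.[0][0] == "a")))'
{"x":1}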

@nicowilliams
Contributor Author

> One could go further and skip materialising uninteresting object keys or array slots, but that's harder, because the compiler would have to figure out which fields the program will and won't ignore.

We have to validate the input, so we could avoid allocations for objects/arrays until they're needed, but then when we do those allocations we'd be wasting cycles on re-validation unless... unless we use a SIMD parser to validate and retain the offsets to all the things, and then allocate them only when they are "observed" or altered with any of .[<key>], .[], delpaths, setpath, or similar.

All the SIMD JSON parsers I've seen are non-incremental, though, and the jq processor absolutely needs an incremental JSON parser. There are cases where we could use a non-incremental JSON parser because we know the size of the text a priori and already have it in memory (e.g., in fromjson), but in most cases, when handling JSON texts from stdin, we need an incremental parser.
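The need is easy to see from the shell: jq emits each result as soon as the corresponding text has been parsed, well before stdin reaches EOF, which a parser that requires the whole text up front could not do. Here the first line appears immediately, five seconds before the second input even arrives:

$ { echo '{"a": 1}'; sleep 5; echo '{"b": 2}'; } | jq -c .
{"a":1}
{"b":2}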

I suspect that trying to be lazy about parsing in general will mainly just add too many branches in jv code.

@liquidaty mentioned this issue Jul 10, 2023