simdjson #1892

ekg · 2019-04-19T21:29:10Z

Would it be possible to leverage simdjson to improve parsing speed?

https://github.com/lemire/simdjson

Apologies if this is a duplicate.

nicowilliams · 2019-04-22T19:27:28Z

It'd be nice to do this, yes, but there are issues.

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

Second, jq supports multiple JSON texts in sequence. This is less tricky.

pkoppstein · 2019-04-22T22:29:39Z

... but there are issues.

Third, jq's parser is currently quite lenient -- perhaps too much so in some cases (e.g. allowing 00 for 0), but well within its rights in others (e.g. allowing duplicate keys within a JSON object).

Fourth, there is a plan to allow jq to preserve arbitrarily big integers (in the sense that jq . will not modify them), and it would be a shame to lose that functionality.

(If jq is to support stricter parsing, then for the sake of utility as well as backward compatibility, it should be possible to specify which mode is wanted.)

lemire · 2019-10-18T20:15:24Z

well within its rights in others (e.g. allowing duplicate keys within a JSON object).

simdjson allows duplicate keys, the RFC does not require uniqueness: https://tools.ietf.org/html/rfc7159

there is a plan to allow jq to preserve arbitrarily big integers

Though simdjson is limited to 64-bit numbers at this time, there are plans to extend the support...
simdjson/simdjson#167

lemire · 2019-10-18T20:18:29Z

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

This issue is being worked on in simdjson...

simdjson/simdjson#128

Second, jq supports multiple JSON texts in sequence. This is less tricky.

This issue is also being worked on in simdjson...

simdjson/simdjson#188

cc @piotte13

lemire · 2019-11-25T23:58:36Z

simdjson now supports JSON documents in sequence. The JSON can be either line-separated or just in sequence with arbitrary white space between them. The input can be nearly infinite...

The performance is quite good... (gigabytes per second)

https://github.com/lemire/simdjson/blob/master/doc/JsonStream.md

cc @piotte13

hitorilabs · 2023-06-19T20:08:46Z

Is anyone working on this already? (or finished some form of it)

liquidaty · 2023-07-10T23:26:24Z

A few thoughts I would put forth for consideration:

can we come up with one or two particular clear "painful" use case(s) we can target-- including sample filters as well as sample inputs?
- presumably, the vast majority of jq use cases are using a filter that is not simply '.', so this isn't the ideal filter use case to optimize for
- optimizing for strict JSON might be very different from optimizing for non-strict JSON (or other e.g. YAML / XML), and input size might also impact
- ultimately, the initial goal would be to devise the specific benchmark tests to target for optimization
we may find that the bottleneck in the most important use cases is not the JSON input parsing, but rather the filter calculations, or maybe the way that the parser hands off data to the filter calculator. That's not to say that optimizing the JSON input parsing is not valuable in its own right no matter what, but rather to recognize that we all have limited time and resources to contribute, and informing how we spend that can make a drastic difference in how much impact that contribution makes-- and deciding how this issue request should be prioritized relative to other performance-related ones such as Speed up parsing -- delay string and number parsing? #1857
there are several mentions of multiple contemplated parser modes in terms of strictness and input format (e.g. XML/yaml) (ER: add --strict command-line option for strict parsing of JSON input #2643, simdjson #1892 (comment), Allow, and strip, comments (#, // or /*) #2548 (comment), Maybe add fromyaml to parse yaml strings? #2530, support yaml input/output in addition to json #467). Given this, the most expeditious approach might be to first evaluate these as a whole, identify commonalities, and then refactor the current jq parser to separate those commonalities, before then broadening the supported input formats
- not sure if this would turn out to be the case, but as an illustrative example, perhaps that collective analysis would then suggest refactoring the parser to an event-driven structure, where the event handlers could be reused by all of the different input parsers. Knowing that and performing that refactoring ahead of time might improve performance and almost certainly will make the subsequent work of supporting other input formats much easier, and lower maintenance learning curves and technical debt in the process

nicowilliams · 2023-07-10T23:45:14Z

First, jq's JSON parser is incremental, and so can consume JSON texts in chunks, whereas simdjson seems to need the entire text (and does two passes on the text). That doesn't mean that we couldn't use simdjson's SIMD techniques when we have the full text, but generally those won't be large texts. I'm not sure how easily we could adapt SIMD techniques to work incrementally, though at first glance it seems feasible.

This issue is being worked on in simdjson...

simdjson/simdjson#128

That's still open. Is it likely we can get that?

Second, jq supports multiple JSON texts in sequence. This is less tricky.

This issue is also being worked on in simdjson...

simdjson/simdjson#188

Sweet!

Maybe we could use simdjson for jv_parse() and jv_parse_sized(), which is to say "for fromjson". Now, we'll still have to allocate a bunch of objects, so that's not terribly fun, and I wonder what we could do about that. Now in the streaming JSON parser we only ever need to allocate an array for the paths to scalars and for the scalars, and we can reuse the path array, so we should be able to get more bang for this effort there.

nicowilliams · 2023-07-10T23:46:45Z

A few thoughts I would put forth for consideration:

Yes, this will be a lot of profiling and playing with options to find something that rocks perf-wise and isn't too hard to use.

itchyny added the performance label Jun 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

simdjson #1892

simdjson #1892

ekg commented Apr 19, 2019

nicowilliams commented Apr 22, 2019

pkoppstein commented Apr 22, 2019

lemire commented Oct 18, 2019

lemire commented Oct 18, 2019

lemire commented Nov 25, 2019

hitorilabs commented Jun 19, 2023 •

edited

Loading

liquidaty commented Jul 10, 2023

nicowilliams commented Jul 10, 2023

nicowilliams commented Jul 10, 2023

simdjson #1892

simdjson #1892

Comments

ekg commented Apr 19, 2019

nicowilliams commented Apr 22, 2019

pkoppstein commented Apr 22, 2019

lemire commented Oct 18, 2019

lemire commented Oct 18, 2019

lemire commented Nov 25, 2019

hitorilabs commented Jun 19, 2023 • edited Loading

liquidaty commented Jul 10, 2023

nicowilliams commented Jul 10, 2023

nicowilliams commented Jul 10, 2023

hitorilabs commented Jun 19, 2023 •

edited

Loading