Improve performance of writing raw UTF-8 encoded byte arrays #1349

Open
wants to merge 3 commits into
base: 2.19

Conversation

@JoostK JoostK commented Oct 20, 2024

The output escape table covers just 7 bits, meaning that a raw UTF-8 byte cannot be used to index into the table without a branch test for negative bytes (i.e. bytes whose unsigned value is larger than 0x7F). This extra check occurs in a tight loop and can be avoided if the lookup table covers all 8-bit indices.

This commit introduces ad-hoc logic in UTF8JsonGenerator#writeUTF8String to create an extended copy of _outputEscapes if necessary, writing the copy back into the field to avoid having to compute it again within the same generator instance (unless it is changed). This ad-hoc strategy was chosen as it is the least disruptive to existing code; a larger-scale change around CharacterEscapes would impact the public API or otherwise introduce subtle chances of breakage.
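To make the difference between the two lookups concrete, here is a minimal sketch rather than the actual UTF8JsonGenerator code. The class name, method names, and the simplified table contents (any non-zero entry means "needs escaping") are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical illustration of 7-bit versus 8-bit escape-table lookups.
final class EscapeLookupSketch {
    // Simplified 7-bit table (128 entries): non-zero marks a byte to escape.
    static int[] sevenBitTable() {
        int[] table = new int[128];
        for (int i = 0; i < 0x20; i++) {
            table[i] = -1; // control characters
        }
        table['"'] = '"';
        table['\\'] = '\\';
        return table;
    }

    // Same entries, zero-padded to 256 so every unsigned byte value is a valid index.
    static int[] eightBitTable() {
        return Arrays.copyOf(sevenBitTable(), 256);
    }

    // 7-bit lookup: Java bytes are signed, so values 0x80..0xFF are negative
    // and must be filtered out by a branch before indexing.
    static int firstEscape7Bit(byte[] utf8, int[] table) {
        for (int i = 0; i < utf8.length; i++) {
            byte b = utf8[i];
            if (b >= 0 && table[b] != 0) {
                return i;
            }
        }
        return -1;
    }

    // 8-bit lookup: mask to an unsigned index; no sign check is needed.
    static int firstEscape8Bit(byte[] utf8, int[] table) {
        for (int i = 0; i < utf8.length; i++) {
            if (table[utf8[i] & 0xFF] != 0) {
                return i;
            }
        }
        return -1;
    }
}
```

Both methods return the index of the first byte that needs escaping, or -1 if there is none; only the per-byte work inside the loop differs.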


Some quick and dirty JMH tests on an M1 Max (arm64 ⚠) with Azul Zulu JDK 21 show the following numbers:

Benchmark                (length)  (needEscape)  (optimized)   Mode  Cnt   Score   Error   Units
JmhTest.writeUtf8String        32         first         true  thrpt   40  32,156 ± 0,084  ops/us
JmhTest.writeUtf8String        32         first        false  thrpt   40  27,936 ± 0,106  ops/us
JmhTest.writeUtf8String        32          last         true  thrpt   40  33,049 ± 0,091  ops/us
JmhTest.writeUtf8String        32          last        false  thrpt   40  29,605 ± 0,102  ops/us
JmhTest.writeUtf8String        32          none         true  thrpt   40  32,922 ± 0,192  ops/us
JmhTest.writeUtf8String        32          none        false  thrpt   40  29,654 ± 0,074  ops/us
JmhTest.writeUtf8String       256         first         true  thrpt   40   6,350 ± 0,023  ops/us
JmhTest.writeUtf8String       256         first        false  thrpt   40   4,734 ± 0,012  ops/us
JmhTest.writeUtf8String       256          last         true  thrpt   40   6,399 ± 0,018  ops/us
JmhTest.writeUtf8String       256          last        false  thrpt   40   4,759 ± 0,017  ops/us
JmhTest.writeUtf8String       256          none         true  thrpt   40   6,402 ± 0,021  ops/us
JmhTest.writeUtf8String       256          none        false  thrpt   40   4,751 ± 0,025  ops/us
JmhTest.writeUtf8String       512         first         true  thrpt   40   3,215 ± 0,030  ops/us
JmhTest.writeUtf8String       512         first        false  thrpt   40   2,478 ± 0,008  ops/us
JmhTest.writeUtf8String       512          last         true  thrpt   40   3,259 ± 0,012  ops/us
JmhTest.writeUtf8String       512          last        false  thrpt   40   2,480 ± 0,026  ops/us
JmhTest.writeUtf8String       512          none         true  thrpt   40   3,262 ± 0,013  ops/us
JmhTest.writeUtf8String       512          none        false  thrpt   40   2,486 ± 0,007  ops/us

This writes buffers of length (length) that contain 'a' in all positions. With (needEscape) set to 'first', the first byte is overwritten with "; with 'last', the last byte is overwritten; with 'none', the buffer remains a sequence in which no escapes need to be inserted.
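The input buffers described here could be built roughly as follows. This is a sketch based only on the description above; the class, enum, and method names are hypothetical and not taken from the actual benchmark source.

```java
import java.util.Arrays;

// Hypothetical helper that builds the benchmark inputs as described.
final class BenchmarkInputSketch {
    enum NeedEscape { NONE, FIRST, LAST }

    // Returns `length` bytes of 'a', optionally with a double quote
    // placed in the first or last position. Assumes length > 0.
    static byte[] buildInput(int length, NeedEscape needEscape) {
        byte[] buf = new byte[length];
        Arrays.fill(buf, (byte) 'a');
        switch (needEscape) {
            case FIRST:
                buf[0] = (byte) '"';
                break;
            case LAST:
                buf[length - 1] = (byte) '"';
                break;
            default:
                break; // NONE: nothing to escape
        }
        return buf;
    }
}
```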

Overall the numbers show improvements in the range of 11%–33%. I wonder if this extends to other CPU architectures, so I am opening this PR to gauge interest in such a change. Note that this only affects UTF8JsonGenerator#writeUTF8String, which isn't typically used, as it's more common to process from char[] or String buffers. In my use case I already have a UTF-8 encoded byte[], which prompted me to look into this.


This logic can probably be vectorized quite nicely; that is also being done in dotnet's JSON writer infrastructure.

@pjfanning
Member

Thanks @JoostK. Would you be able to add the benchmark to https://github.com/FasterXML/jackson-core/tree/2.19/src/test/java/perf ?

@JoostK
Author

JoostK commented Oct 21, 2024

Sure, I can add some; while looking at the existing ones I wonder what the desired testing strategy is:

  1. extract both write loops to be able to compare the prior state (7-bit LUT) versus the new state (8-bit LUT), or
  2. call into JsonGenerator#writeUTF8String and then run the test with and without the change applied, possibly adding char[] writing as a comparative benchmark.

What is the most valuable thing to have here? 1. is meaningful to compare this particular change across machines/JVMs, but 2. is more valuable to measure and compare write perf of JsonGenerator going forward.

@pjfanning
Member

Both sound useful - could you add both benchmarks?

@JoostK
Author

JoostK commented Oct 21, 2024

Both sound useful - could you add both benchmarks?

I'll come up with something, probably over the coming days.

@JoostK JoostK force-pushed the utf8-generator-write-encoded-bytes-perf branch from 2493a9f to 60fc4a0 Compare October 27, 2024 12:02
@JoostK JoostK closed this Oct 27, 2024
@JoostK JoostK force-pushed the utf8-generator-write-encoded-bytes-perf branch from 60fc4a0 to 5117042 Compare October 27, 2024 12:04
@JoostK JoostK reopened this Oct 27, 2024
@JoostK
Author

JoostK commented Oct 27, 2024

Accidentally rebased onto master, unaware that this PR was targeting 2.19. Reverted back to 2.19.

Here are the results on my MBP w/ M1 Max:

after:

Length    8,  none escape: (7-bit, 8-bit, JsonGenerator):  57,4 /  47,8 / 196,7 msecs
Length    8, start escape: (7-bit, 8-bit, JsonGenerator):  84,3 /  76,2 / 198,5 msecs
Length    8,   end escape: (7-bit, 8-bit, JsonGenerator):  73,8 /  70,9 / 231,6 msecs
Length   16,  none escape: (7-bit, 8-bit, JsonGenerator):  64,0 /  56,3 / 120,0 msecs
Length   16, start escape: (7-bit, 8-bit, JsonGenerator):  73,5 /  62,4 / 129,5 msecs
Length   16,   end escape: (7-bit, 8-bit, JsonGenerator):  67,7 /  59,4 / 172,5 msecs
Length   32,  none escape: (7-bit, 8-bit, JsonGenerator):  63,0 /  52,9 /  85,8 msecs
Length   32, start escape: (7-bit, 8-bit, JsonGenerator):  67,0 /  56,4 / 103,6 msecs
Length   32,   end escape: (7-bit, 8-bit, JsonGenerator):  63,3 /  54,3 / 146,5 msecs
Length  256,  none escape: (7-bit, 8-bit, JsonGenerator):  60,1 /  52,1 /  60,7 msecs
Length  256, start escape: (7-bit, 8-bit, JsonGenerator):  60,8 /  54,7 /  83,5 msecs
Length  256,   end escape: (7-bit, 8-bit, JsonGenerator):  61,9 /  51,7 / 138,7 msecs
Length  512,  none escape: (7-bit, 8-bit, JsonGenerator):  59,5 /  50,9 /  56,5 msecs
Length  512, start escape: (7-bit, 8-bit, JsonGenerator):  61,6 /  53,3 /  79,3 msecs
Length  512,   end escape: (7-bit, 8-bit, JsonGenerator):  60,8 /  50,1 / 132,9 msecs
Length 1024,  none escape: (7-bit, 8-bit, JsonGenerator):  60,2 /  50,7 /  95,5 msecs
Length 1024, start escape: (7-bit, 8-bit, JsonGenerator):  60,3 /  52,5 /  78,7 msecs
Length 1024,   end escape: (7-bit, 8-bit, JsonGenerator):  60,6 /  50,1 /  97,0 msecs
Length 8192,  none escape: (7-bit, 8-bit, JsonGenerator):  58,0 /  49,0 /  32,4 msecs
Length 8192, start escape: (7-bit, 8-bit, JsonGenerator):  59,1 /  49,0 /  38,1 msecs
Length 8192,   end escape: (7-bit, 8-bit, JsonGenerator):  58,9 /  48,9 /  34,6 msecs



before:

Length    8,  none escape: (7-bit, 8-bit, JsonGenerator):  58,8 /  45,4 / 196,7 msecs
Length    8, start escape: (7-bit, 8-bit, JsonGenerator):  84,9 /  76,1 / 201,4 msecs
Length    8,   end escape: (7-bit, 8-bit, JsonGenerator):  74,2 /  70,7 / 230,7 msecs
Length   16,  none escape: (7-bit, 8-bit, JsonGenerator):  65,2 /  56,1 / 121,2 msecs
Length   16, start escape: (7-bit, 8-bit, JsonGenerator):  74,0 /  62,3 / 133,6 msecs
Length   16,   end escape: (7-bit, 8-bit, JsonGenerator):  67,8 /  59,3 / 178,4 msecs
Length   32,  none escape: (7-bit, 8-bit, JsonGenerator):  62,9 /  52,7 /  87,1 msecs
Length   32, start escape: (7-bit, 8-bit, JsonGenerator):  67,0 /  56,4 / 106,5 msecs
Length   32,   end escape: (7-bit, 8-bit, JsonGenerator):  63,4 /  54,2 / 155,5 msecs
Length  256,  none escape: (7-bit, 8-bit, JsonGenerator):  60,2 /  52,2 /  61,7 msecs
Length  256, start escape: (7-bit, 8-bit, JsonGenerator):  63,2 /  54,1 /  86,0 msecs
Length  256,   end escape: (7-bit, 8-bit, JsonGenerator):  62,1 /  52,0 / 142,1 msecs
Length  512,  none escape: (7-bit, 8-bit, JsonGenerator):  59,1 /  51,1 /  56,9 msecs
Length  512, start escape: (7-bit, 8-bit, JsonGenerator):  62,2 /  53,1 /  82,6 msecs
Length  512,   end escape: (7-bit, 8-bit, JsonGenerator):  60,6 /  50,2 / 136,0 msecs
Length 1024,  none escape: (7-bit, 8-bit, JsonGenerator):  59,9 /  50,7 /  96,4 msecs
Length 1024, start escape: (7-bit, 8-bit, JsonGenerator):  60,6 /  52,8 /  81,3 msecs
Length 1024,   end escape: (7-bit, 8-bit, JsonGenerator):  60,5 /  49,7 /  97,0 msecs
Length 8192,  none escape: (7-bit, 8-bit, JsonGenerator):  59,1 /  48,9 /  45,1 msecs
Length 8192, start escape: (7-bit, 8-bit, JsonGenerator):  59,7 /  49,1 /  49,5 msecs
Length 8192,   end escape: (7-bit, 8-bit, JsonGenerator):  58,6 /  49,0 /  47,3 msecs


combined JsonGenerator results:

Length    8,  none escape: (before, after): 196,7  / 196,7 msecs
Length    8, start escape: (before, after): 201,4  / 198,5 msecs
Length    8,   end escape: (before, after): 230,7  / 231,6 msecs
Length   16,  none escape: (before, after): 121,2  / 120,0 msecs
Length   16, start escape: (before, after): 133,6  / 129,5 msecs
Length   16,   end escape: (before, after): 178,4  / 172,5 msecs
Length   32,  none escape: (before, after):  87,1  /  85,8 msecs
Length   32, start escape: (before, after): 106,5  / 103,6 msecs
Length   32,   end escape: (before, after): 155,5  / 146,5 msecs
Length  256,  none escape: (before, after):  61,7  /  60,7 msecs
Length  256, start escape: (before, after):  86,0  /  83,5 msecs
Length  256,   end escape: (before, after): 142,1  / 138,7 msecs
Length  512,  none escape: (before, after):  56,9  /  56,5 msecs
Length  512, start escape: (before, after):  82,6  /  79,3 msecs
Length  512,   end escape: (before, after): 136,0  / 132,9 msecs
Length 1024,  none escape: (before, after):  96,4  /  95,5 msecs
Length 1024, start escape: (before, after):  81,3  /  78,7 msecs
Length 1024,   end escape: (before, after):  97,0  /  97,0 msecs
Length 8192,  none escape: (before, after):  45,1  /  32,4 msecs
Length 8192, start escape: (before, after):  49,5  /  38,1 msecs
Length 8192,   end escape: (before, after):  47,3  /  34,6 msecs

It's not as big a difference as I was seeing in my original benchmarks, although it's noticeable how larger strings appear to benefit quite a bit.


There is potentially another question of whether UTF8JsonGenerator#_extendOutputEscapesTo8Bits should overwrite JsonGeneratorImpl#sOutputEscapes, so as to avoid having to recreate the 8-bit wide LUT for each individual JsonGenerator instance. I opted not to change this since that crosses UTF8JsonGenerator's boundary into the parent class and would demote JsonGeneratorImpl#sOutputEscapes to a non-final field, which feels iffy.

@cowtowncoder cowtowncoder added 2.19 Issues planned at earliest for 2.19 cla-needed PR looks good (although may also require code review), but CLA needed from submitter labels Oct 27, 2024
@cowtowncoder
Member

@JoostK First of all: thank you for contributing this! At a high level this makes sense, but I do need to dig a bit deeper into this when I have time, and right now I am a bit overloaded/overspread, so apologies for any delay there may be.

Having said that: one thing we will eventually need (if not done; apologies if it has) is to get a CLA, from here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

(needs to be sent just once for all Jackson contributions)

The usual way is to print, fill & sign, scan/photo, and email to "cla" at fasterxml dot com.
Once I receive it we are good to go wrt merging (obv pending code review).

Thank you again; looking forward to getting this merged!

}

final int[] extended = new int[0x100]; // 256 entries, so every unsigned byte value (0x00..0xFF) is a valid index
System.arraycopy(escapes, 0, extended, 0, escapes.length);
Member

I think there is Arrays.copyOf() that combines these 2 operations
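As a sketch of that suggestion: `java.util.Arrays.copyOf(int[], int)` allocates a new array of the requested length, copies the existing elements, and zero-pads the rest. The helper name below is hypothetical, and the target length is 256 (0x100) so that index 0xFF is in bounds.

```java
import java.util.Arrays;

// Hypothetical helper; not the code under review.
final class ExtendTableSketch {
    // Returns a table with at least 256 entries. Entries beyond the
    // original length are zero, i.e. "no escaping" in this sketch.
    static int[] extendTo8Bits(int[] escapes) {
        if (escapes.length >= 256) {
            return escapes; // already wide enough, reuse as-is
        }
        return Arrays.copyOf(escapes, 256);
    }
}
```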


// When writing raw UTF-8 encoded bytes, it is beneficial if the escaping table can directly be indexed into
// using the byte value.
final int[] extendedOutputEscapes = _extendOutputEscapesTo8Bits();
Member

Ok... this is the (only) part I find problematic. Having to dynamically change _outputEscapes seems problematic, although I understand why it is being done. Since it is something that may be changed by a call to JsonGenerator.setCharacterEscapes(), the modification cannot be done in the constructor.

But: I have an idea for slightly bigger changes that would make it possible to eagerly ensure _outputEscapes is 256 elements long. Will add a separate comment.

Member

Actually, scratch that. Only now realized this is limited to writeUTF8String, not all escaping.

So dynamically copying + changing is actually reasonable since it's not always needed etc.

@@ -1914,6 +1916,18 @@ private final void _writeUTF8Segment2(final byte[] utf8, int offset, int len)
_outputTail = outputPtr;
}

private int[] _extendOutputEscapesTo8Bits() {
final int[] escapes = _outputEscapes;
Member

@cowtowncoder cowtowncoder Nov 4, 2024

Ok: I am fine with the idea, but will propose one change: instead of overwriting _outputEscapes, let's introduce a secondary field (_outputEscapes8Bit or whatever), left as null and constructed the first time it is needed.
That just changes it to a null check.

The reason is just that ideally _outputEscapes is left as-is for other use cases, to minimize any risk of regression.
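A minimal sketch of the secondary-field idea might look like the following. The enclosing class is a hypothetical stand-in for the generator, and only the field names follow the comment above.

```java
import java.util.Arrays;

// Hypothetical stand-in for a generator holding both tables.
class LazyEscapesSketch {
    // Existing 7-bit table; never modified by the lazy extension.
    protected int[] _outputEscapes = new int[128];

    // Secondary 8-bit table; stays null until first needed.
    private int[] _outputEscapes8Bit;

    int[] outputEscapes8Bit() {
        int[] extended = _outputEscapes8Bit;
        if (extended == null) {
            // Built once per instance from the current 7-bit table.
            extended = Arrays.copyOf(_outputEscapes, 256);
            _outputEscapes8Bit = extended;
        }
        return extended;
    }
}
```

In this sketch the 8-bit table is built once and then reused; it is not rebuilt if _outputEscapes is reassigned later.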

Author

The benefit of overwriting the field is that it will automatically be recreated if it is reset to a 7-bit wide LUT. Storing the 8-bit table in a separate field might get out of date, especially since _outputEscapes is a protected field that can be assigned through subclasses (that is evil, of course, but still a possibility).
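The overwrite approach described here can be sketched as below. This is an illustration under assumptions, not the PR's code: the class and the setter are hypothetical, and the length check stands in for whatever condition the real method uses.

```java
import java.util.Arrays;

// Hypothetical stand-in for a generator that overwrites its own table.
class OverwriteEscapesSketch {
    protected int[] _outputEscapes = new int[128];

    // Stand-in for any code path (or subclass) that assigns a new table.
    void setOutputEscapes(int[] escapes) {
        _outputEscapes = escapes;
    }

    int[] extendOutputEscapesTo8Bits() {
        final int[] escapes = _outputEscapes;
        if (escapes.length >= 256) {
            return escapes; // already extended
        }
        // Zero-padded copy, written back so later calls skip the copy.
        final int[] extended = Arrays.copyOf(escapes, 256);
        _outputEscapes = extended;
        return extended;
    }
}
```

Because the check is made against the field itself, assigning a fresh 7-bit table causes the next call to rebuild the 8-bit copy from it.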

@cowtowncoder
Member

@JoostK Ok, I now understand the scope and added some suggestions. Since we are not using the new encoding table for all output, I think it's better to avoid overriding _outputEscapes.

But one thing I am wondering is whether this optimization came about from actual observations of usage; that is, if this was merged, would it help with the use case you have? This is because, as you say, this method seems to be less commonly used, and if so, benefits might be limited. But if it addresses something that at least you use, it is more reasonable to merge it.

@JoostK
Author

JoostK commented Nov 7, 2024

Thanks for the comments. I haven't gotten around to addressing them yet, nor to signing the CLA; that won't be a problem, but I don't typically have a printer at hand 😄.

My use case is for UTF-8 encoded bytes that are being read from raw blobs, which are to be sent as a JSON-encoded string (the data is known to be UTF-8 encoded) in a web response; this can be on the order of ~100MB of data across ~30k strings, so writeUTF8String is ideal to avoid the need to allocate and decode into a String.

Since the app will likely be deployed on x86_64, I intend to run the performance test on amd64 to gauge what the impact is there, as I suspect this may depend on the ISA (and possibly on how effective the branch predictor is, so it may differ across CPU vendors/generations). If the meaningful improvements only apply to arm64/M1 then this may not be as beneficial as I observed on macOS.

@cowtowncoder
Member

No worries @JoostK . FWTW, printing is optional; modifying PDF with info & fake signature works perfectly fine too.

And thx for sharing use case: sounds legit.

@JoostK
Author

JoostK commented Nov 22, 2024

Finally found some time to run this on Intel x64 (specifically a Core i7 1270P), but the results are all over the place, so I can't draw conclusions from them for now. Interestingly, the microbenchmark consistently shows that the 7-bit approach performs better on this CPU, which is the opposite of what I was measuring on arm64 (M1 Max). On the actual JsonGenerator test, however, I do see improvements, but those results are wildly unstable. I am a bit puzzled 🤷‍♂️

@JoostK
Author

JoostK commented Nov 22, 2024

Not sure what this means for this PR, really. I'd really like to get stable results to make informed decisions on whether this is worth it. Maybe somebody else is able to run the test suite on amd64 CPUs?

@cowtowncoder
Member

@JoostK Ok that is... interesting, given that the code seems like it should outperform the existing implementation. I assume you tried longer test run times but still did not see more stable results?
