ATN serialization improvements (Java only for demo) #3505
Conversation
Refactor ATN serializer and deserializer; use ATNDataWriter, ATNDataReader. Remove excess data cloning in deserializer. Fixes antlr#1863, fixes antlr#2732, fixes antlr#3338.
… to Integer.MAX_VALUE). Simplify serializer and deserializer API.
    public abstract class ATNDataReader {
        protected abstract byte readByte();

        public int read() {
Is there a standard base64 decoder somewhere?
Oh, I see base64 below now. Hmm... adding a general comment outside.
It's the binary reader that uses abstract readByte calls. It mostly reads integers in a compact format (described below). readByte is implemented in ATNDataReaderBase64 (which reads from a base64 string) and in ATNDataReaderByteBuffer.
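A minimal sketch of the shape being described (the compact-int layout below is an assumption for illustration, not the PR's actual format):

    public abstract class ATNDataReader {
        // Subclasses supply the raw byte source (base64 string or ByteBuffer).
        protected abstract byte readByte();

        // Illustrative compact read: small values fit in one byte; a set high
        // bit signals that the value continues into the following byte.
        public int read() {
            int b = readByte() & 0xFF;
            if (b < 0x80) {
                return b;  // 1-byte value
            }
            return ((b & 0x7F) << 8) | (readByte() & 0xFF);
        }
    }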
Sounds amazing. But I'm trying to remember how this works. Let's see: we go from a list of int (32 bits) to strings (a list of 16-bit shorts) because Java only statically initializes strings (whereas arrays are initialized one element at a time in code, which explodes code size). Ok, can you explain how the base64 helps here? Is it that we are now going to 32-bit ATN data, represented in a few base64 chars?
    import java.nio.ByteBuffer;
    import java.util.UUID;

    public class ATNDataWriter {
If we go the base64 route, shouldn't we use existing encoders/decoders to avoid bugs? https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html
Sure we can. But I tried to get rid of excess big-array allocations in the deserializer (there are 2 additional allocations if we use the standard methods). It may affect application startup time.
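For reference, the standard decoder being weighed here is a one-liner; the extra allocations mentioned come from intermediate buffers it creates before returning the result (serializedATN is a placeholder variable name):

    import java.util.Base64;

    // Standard JDK decoder: simple and well-tested, at the cost of
    // intermediate allocations on top of the returned array.
    byte[] data = Base64.getDecoder().decode(serializedATN);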
Well, if it's a trade between an alloc that gets collected vs. homegrown base64, it seems safer to take the extra GC hit.
Because it is smaller than the plain char[] representation, at least in source code. Also, it's a more standard format that does not require weird increments and decrements as a storage optimization (no need to think about how binary data is encoded into the string). Anyway, the output format can be either plain chars or base64 chars, since it does not depend on the binary representation (the conversion happens in SerializedATN). Different targets may use different output formats.
Now binary data and the string representation are split.
Verrrrry interesting. Thanks for the explanation. I'll have to think hard on this, but we should keep the big picture in mind. What we have works, and we're suggesting replacing it with a more general solution, with unknown issues/bugs, to solve an uncommon case of > 64k states. Questions:
Currently we do handle 32-bit unicode chars, but with two 16-bit code units:

    /** Convert the list to a UTF-16 encoded char array. If all values are less
     * than the 0xFFFF 16-bit code point limit then this is just a char array
     * of 16-bit char as usual. For values in the supplementary range, encode
     * them as two UTF-16 code units.
     */
    public final char[] toCharArray() {...}

This forces 2 chars (32 bits) for all elements of the list if any element needs > 16 bits. Why not just always do that? For the existing case, literally nothing changes. For the edge case, we use 2x the string length in the generated Java file and class file, and in the allocated serialized string during class loading. I wonder if this simple solution is the least intrusive and least risky.
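A sketch of the simplest reading of that proposal, splitting every int into two 16-bit chars when any value needs more than 16 bits (helper name is hypothetical; the quoted toCharArray instead uses UTF-16 surrogate pairs for supplementary values):

    // Hypothetical: encode every element as a high/low pair of 16-bit chars.
    static char[] toTwoCharsEach(int[] data) {
        char[] out = new char[data.length * 2];
        for (int i = 0; i < data.length; i++) {
            out[i * 2]     = (char) (data[i] >>> 16);  // high 16 bits
            out[i * 2 + 1] = (char) data[i];           // low 16 bits
        }
        return out;
    }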
My 2 cents:
I would want to see metrics to help reach a decision: load time and memory footprint after forced GC.
Great points, @ericvergnaud. Yes, I can see fast startup/small class file size being important on mobile. The existing mechanism is smallest/fastest. If we adjust so the max ATN state can be > 16 bits, then the existing func will simply encode all ints as two 16-bit chars. This provides a new capability for the small group that needs a huge ATN w/o breaking anything. Most mobile code won't need this. Re GC of the ATN string: I think we should therefore proceed with the smallest tweak that allows 32-bit ATN state numbers and preserves the existing static string mechanism.
Sounds great indeed!
Sorry, I'm a bit busy; I'll be able to answer next week. Also, I have to check one thing.
Yes, it has. Base64 string encoding does not differ from the current string encoding using plain or escaped chars (at least for Java).
Unfortunately, it's unclear how the deserializer would find out how many bits are used for integer encoding. There are two ways to resolve this problem:
In the previous closed PR I suggested the following encoding scheme:
We can quite easily change the format of the internal serialization since parsers are not back-compatible (ANTLR reports a version mismatch).
Actually, the smallest/fastest (and clearest) mechanism is loading the data from a resource file (for instance via java.util.Properties). But it complicates compilation and it's not back-compatible.
Maybe most, but I'm not sure about all. Current and future mobile devices may use ANTLR for applications related to natural language processing or AI (?) that require a lot of tokens and ATN states.
I like the idea of removing info that is required only for deserialization. Unfortunately, there are some problems:
Consider the following code:

    public class Main {
        public static void main(String[] args) {
            String string = "test";
            char[] chars = new char[] { 't', 'e', 's', 't' };
        }

        public static void test() {
            String s2 = "test";
            String s3 = "test2";
        }
    }

And its bytecode:
As we see, both "test" literals refer to the same constant pool entry.
Removing final will free the reference, but not the string constant, so it's probably not good enough. We'd have to either load it from a resource, or from a class which we unload.
Yes, and an array of chars probably won't help either, because all the data would be stored in a method body (which is unlikely to be collected) instead of the string pool. I vote for the first option, loading from a resource, because dynamic class unloading is too heavyweight for such a task.
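A sketch of the resource-loading option under discussion (class and file names are hypothetical):

    import java.io.IOException;
    import java.io.InputStream;

    // Load the serialized ATN from a resource packaged next to the generated
    // class, so the data never enters the class constant pool.
    static byte[] loadAtn() throws IOException {
        try (InputStream in = MyParser.class.getResourceAsStream("MyParser.atn")) {
            return in.readAllBytes();  // Java 9+
        }
    }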
I suggest getting rid of the field altogether and moving the initialization code to a static initializer (that calls submethods), because setting it to null looks like an antipattern.
Ok, I'll roll back the base64 encoding since it's not a very optimal solution. Moreover, it looks like I've found a way to decrease the size of the source code for ATN data (it is not necessary to escape most of the symbols) without affecting the compiled files. But the encoding format for big ATNs is still unclear and I'm waiting for an answer.
Ok, a lot to take in here. Will need time to think. Any idea how many grammars generate more than 2^14 states? Just curious about your encoding mechanism. Might work. I'm opposed to a data file with the ATN (since the first construction of 4.0) as it means users need a Java file and a resource file, which is a mess to deal with. They must be kept in sync with each other, etc.
I don't know exactly, but I've checked the encoding on our runtime tests and only one test with a large lexer was failing. On the other hand, almost all tests are quite small; real grammars are bigger.
Ok. Yes, it requires build workflow changes for all runtimes and a lot of effort. Also, I have an idea of how to decrease the size of the serialized string/array of chars.
BTW, Swift already keeps the ATN in a separate file but does it very inefficiently since it uses JSON instead of binary encoding. It probably makes sense to use binary files for new targets, or for targets that do not require compilation (JavaScript, Python). They would just read the ATN from an adjacent file.
Regarding having separate files: I definitely prefer not having two files which must be kept in sync. I think some of the target developers simply used the existing serialization mechanisms as an expedient, but it's suboptimal from a parser user's build point of view. I've just spent the last hour going through the serialization code for Java because it is a special case... the obvious thing for other targets is simply to store a static integer array in the generated code. I built a little thing to track the size of numbers in the generated ATNs... let me write something up and report it here.
An integer array takes more bytes in source code compared to raw char arrays or string arrays, especially considering that almost all chars may be in raw format, not escaped (I'm experimenting with that). Why is Java a special case?
BTW, Go and C++ use int arrays for ATN data. Maybe that's ok for compiled languages, but it's not very optimal for JavaScript, where the size of the source is more critical.
Only because int/char arrays are initialized with code, which blows out the size of the init method easily and is slow. Strings, in contrast, live in the .class file constant pool. Other languages won't suffer from this.
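To illustrate the difference (a hedged sketch, not code from this PR):

    class InitCost {
        // One constant-pool entry, loaded with a single ldc at class init:
        // cost is essentially the string's bytes in the .class file.
        static final String SERIALIZED = "\u0003\u0001\u0002";

        // Compiled to newarray plus per-element store instructions in <clinit>:
        // bytecode grows with array length, and a method is capped at 64KB.
        static final char[] DATA = { 3, 1, 2 };
    }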
Yeah, the size of the generated ATN is something to pay attention to for JS as it's loaded in source form. Not huge though. Java grammar ATN in number of integers:

Lexer len getSerialized = 9754

Still working on other numbers. Standby: I need a sandwich haha
At least C# (.NET) works in a similar way. It also has the concept of a string pool and string interning. Not sure about other languages.
It's all relative. Also, there can be other, much bigger grammars. If it's possible to decrease the size without affecting performance, why not do that? BTW, I'm changing the JSON -> binary (string array) serialization for Swift and the size reduction numbers are exciting. I'll publish the result a bit later.
It's always fun to improve performance or reduce size, but in a project like this I would like to leave everything alone that isn't broken, at least for now.
Are you saying that static short arrays (vs strings) are initialized using code?
I'm not worried about size, as we are only talking about 32K without UTF-8 compression for the base Java parser grammar. I'm totally willing to accept that size given that it has worked for over a decade.
Ok, if I'm doing this correctly, it looks to me like one UTF-8 byte (0..127) holds about 47% of all values in the Java parser's serialization. A full 0..255 byte holds about 75% of all values. It looks like we are getting pretty decent compression from UTF-8. Out of the ~16,000 integers, here are the first few values we need to encode, with their counts:
Interestingly, the maximum value is not very large... like 15k. The big numbers on the end are the UUID encoding that Sam put in; btw, I'm not sure we need this and we could remove it. Not sure what function it serves or what error it prevents, given that we already encode the serialization version number.
So, in the end, I don't think we have a problem with the existing mechanism except for the original issue we are trying to solve: what happens when we get a really big grammar where the number of ATN states exceeds 65535? We have code that handles this by manually encoding 32-bit numbers as two unicode short chars. Take a look at:

    serialized = new ArrayList<String>(data.size());
    for (int c : data.toArray()) {
        String encoded = factory.getGenerator().getTarget().encodeIntAsCharEscape(c == -1 ? Character.MAX_VALUE : c);
        serialized.add(encoded);
    }

This code generation bit would have to be updated to switch between "ints as 16-bit chars" and "ints as 2 x 16-bit chars" depending on the maximum value found in the serialized data.
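A sketch of that switch (hedged: the max-value scan and variable names are assumptions; encodeIntAsCharEscape is from the snippet above):

    // Hypothetical: decide the width once from the maximum serialized value.
    boolean needs32 = maxValue(data) > 0xFFFF;
    for (int c : data.toArray()) {
        int v = (c == -1) ? Character.MAX_VALUE : c;
        if (needs32) {
            serialized.add(target.encodeIntAsCharEscape(v >>> 16));    // high 16 bits
            serialized.add(target.encodeIntAsCharEscape(v & 0xFFFF));  // low 16 bits
        } else {
            serialized.add(target.encodeIntAsCharEscape(v));
        }
    }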
I meant string literals and objects; they work in Java and in C# in a similar way.
Please take a look at my suggestion in #3494. I suggested using 1 byte as the minimal piece of information instead of the current 2 bytes. Values within 0..127 can be encoded as 1 byte. It can also encode any 32-bit integer and does not require a 32-bit int to be within the 0..65535 range (which looks quite inconsistent).
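A hedged sketch of that idea as described here (the exact byte layout in #3494 may differ): a LEB128-style writer where 0..127 fits in one byte and larger values spill into continuation bytes.

    import java.io.ByteArrayOutputStream;

    // Write any 32-bit int in 1..5 bytes; the high bit of each byte marks
    // "more bytes follow".
    static void writeVarInt(int value, ByteArrayOutputStream out) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);  // payload + continuation bit
            value >>>= 7;
        }
        out.write(value);
    }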
I've checked: we need only 0xFFFF for -1. Other negative numbers are not used.
You suggest putting a "switch flag" into the serialized data, don't you? The deserializer should not have to know about that. Also, it would significantly increase the size of the output data.
It's quite a weird solution; I don't completely understand how it helps to optimize the size. Most values are small.
More data. I'm looking at the serialized ATN strings for lexers and parsers for the java grammar:
When we look at the unshifted versions I don't see any difference in the parser, and it's actually smaller in the lexer! I could be making a mistake in my computations here, but the generated strings definitely look to be different by 2:
This is very similar to what UTF-8 does, which is the format used in the class file. I guess once it's loaded into Java, however, it will be two bytes per character, but it would be the same even in your encoding once it got back into memory. As you can see from the numbers I just posted, we're getting very good compression from simple UTF-8.
The code I'm using to examine UTF-8 size looks like this:

    System.out.println("_serializedATN_parser_not_shifted");
    // test size
    try {
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        OutputStreamWriter out = new OutputStreamWriter(bytesOut, "UTF8");
        out.write(_serializedATN_parser_not_shifted);
        out.flush();
        byte[] tstBytes = bytesOut.toByteArray();
        int size = tstBytes.length;
        System.out.println(_serializedATN_parser_not_shifted.length() + " chars used to store the String");
        System.out.println(_serializedATN_parser_not_shifted.length()*2 + " bytes used to store the String");
        System.out.println(size + " bytes used to store the String in UTF8");
        out.close();
    }
    catch (IOException ioe) {
        System.err.println(ioe);
    }
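The same measurement can be taken without streams (assuming the same string variable as above):

    int utf8Size = _serializedATN_parser_not_shifted
            .getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
    System.out.println(utf8Size + " bytes used to store the String in UTF8");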
Would you be willing to make a PR that gets rid of this shifting by 2, @KvanTTT? It's easy, but there are about 10 places to change it. There does not seem to be a big problem, as all of the Java tests seem to run. @ericvergnaud do you have a problem with us getting rid of this weird premature optimization?
So, just to finish this off, I think we have a very acceptable solution for all but the biggest grammars. I believe that @ericvergnaud also agreed that we are mostly okay. For the special case of really big grammars, I'm willing to simply support it at this point, and later possibly we can optimize. In order to support it, all we have to do is generate two 16-bit chars for each integer in the serialization. It potentially (more than) doubles the number of bytes, but I'm okay with that as it's still fairly small. Later, we can figure out how to deal with an encoding that is not messed up by the UTF-8 encoding of the class files. Other targets will have to be examined to figure out if they use short char arrays and, if so, switch that to int arrays for this edge case. In other words, we begin the process by serializing an ATN into a list of integers. Then, we figure out the maximum value and see if everything fits in 16 bits. If so, we leave everything as is; otherwise we convert ALL int values to two 16-bit chars. I just created a tiny PR that is a small bit of cleanup; it would be useful if you could take a quick look at that.
Maybe it was premature, but that doesn’t necessarily make it wrong...
If I understand the history, it was targeted at reducing the serialized size, so we’d have to measure the impact of removing it whilst at the same time trying to reduce the size through other means…
Personally, I don't really care about the literal size; the JavaScript typically gets gzipped, so a few K don't affect the network transfer time.
We should focus on deserialization time, and memory size after deserialization.
Yet another point against such a solution: we have to traverse the ATN twice to detect the maximum integer value (because int is used for everything). It duplicates the code, may decrease performance, and looks more complicated. Also, integer width info (16 or 32 bits) would have to be added to the beginning of the data for the deserializer. Why can't we just use dynamic integers, since back-compatibility will be broken anyway after #3516 (at least 2 and 4 bytes)? MessagePack and other binary serializers use dynamic integers; it's a working solution. We can encapsulate the integer-writing code in a separate write method and use it everywhere in the serializer (and the same for the deserializer).
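The read side of such dynamic integers would mirror the writer (a sketch of the MessagePack-style approach named above, not the actual PR code; readByte is the abstract byte source from earlier in the thread):

    // Decode a LEB128-style dynamic int written with continuation bits.
    int readVarInt() {
        int result = 0, shift = 0, b;
        do {
            b = readByte() & 0xFF;
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }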
It looks like it's relevant only for big grammars.
Actually, I was a bit wrong: it's up to 2^15 states (32768) without breaking compatibility. That's two times less than the current max limit (2^16).
Looking at the ATN for your MySql grammar, it seems nowhere near the 16-bit limit. Here's the tail end of the histogram of state numbers and counts for the lexer and parser:
It seems extremely rare that we'd have a bigger grammar than MySql, and these are only 20% of the way to 65,535, right?
I think it depends on the application. For programming languages, 65536 should be enough (but I'm also not sure). But I suspect ANTLR can be used for natural language processing, where thousands of tokens are okay. Or other applications I cannot even imagine. Also, there are several issues related to this limit; some users require the full range. I like the idea of attracting more users to the great ANTLR tool.
What concrete use cases have people submitted? I think we should carefully evaluate whether this is really needed.
Take a look at @sharwell's comment in the latest issue:
And others.
I'm closing in favor of #3546
The encodeIntAsCharEscape call doesn't need the encoding optimization hack (since all values are always in single-byte form).