Figure out if the diacritic table can be smaller #1940
Comments
The main questions are:
The answer to the first question is: Yes, if U+034F COMBINING GRAPHEME JOINER and the following characters are excluded from the table instead of having the table cover the whole block. The answer to the second question is: For the root and Ewe, yes, but for Vietnamese, no, without establishing a larger tailoring gap in the root after the grave accent. Tweaking the root to establish such a tailoring gap seems like a bad idea in the ICU4X 1.0 scope. (But it's also a bit sad to store 16-bit weights instead of 8-bit weights knowing that 8-bit weights would suffice if the weights were allocated differently.) My thinking is that it makes sense to
The space saving isn't much, but this would eliminate a couple of branches from the handling of common diacritics, which could take back a tiny bit of the perf left on the table by omitting the canonical closure and Latin mini expansions. While eliminating a couple of branches seems a bit silly, the whole point of this table is to regain perf not obtained via the canonical closure and Latin mini expansions. I'm thinking I'm going to proceed with storing 16-bit secondary weights. @markusicu, what do you think? Does it seem realistic that in the future CLDR could end up with characters between U+0300 and U+034E (inclusive) having a non-zero primary weight, or tertiary or quaternary weights other than the common tertiary weight? (It seems virtually certain that these characters cannot gain case bits.)
One way to provide an escape hatch for guessing wrong about the future would not involve adding branches but would convert a number that's currently a constant into one loaded from data: if the diacritic table were a slice instead of an array whose length is known at compile time, a tailoring could bypass the whole thing by supplying a zero-length diacritic table. The next iteration of this idea would be to store only 8-bit secondaries for root and Ewe and have Vietnamese supply a zero-length table until such time that the tailoring gap after the grave accent is made larger.
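To make the slice idea concrete, here is a minimal sketch, not the actual ICU4X code. The function name `secondary_from_table`, the constant `DIACRITIC_START`, and the `Option` return type are hypothetical names chosen for illustration; the only facts taken from the discussion are that the table starts at U+0300 and holds 16-bit secondary weights.

```rust
/// First code point covered by the table (U+0300 COMBINING GRAVE ACCENT).
/// The name is hypothetical; only the starting code point comes from the thread.
const DIACRITIC_START: u32 = 0x0300;

/// Looks up the 16-bit secondary weight for `c` in `table`.
///
/// Returns `None` when `c` is outside the table, in which case the caller
/// is assumed to fall back to the general lookup path. Because `table` is a
/// slice, its length is loaded from data: a zero-length table makes every
/// lookup return `None` without adding a separate "is enabled" branch.
fn secondary_from_table(table: &[u16], c: char) -> Option<u16> {
    // Code points below U+0300 underflow and are rejected here.
    let offset = (c as u32).checked_sub(DIACRITIC_START)?;
    // The bounds check against the slice length doubles as the escape hatch.
    table.get(offset as usize).copied()
}
```

With a table covering U+0300 through U+034E, the slice would have 79 entries; a tailoring that needs to opt out would ship an empty slice and take the fallback path for every character.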
What's the magnitude of the data savings we're talking about here?
The data saving is tiny, between 150 and 250 bytes. The data saving doesn't make this worthwhile. Perhaps the branch avoidance isn't worthwhile, either, but it also seems silly to keep branches that are known to be useless. So storing 16-bit secondaries with the escape hatch of making it a slice rather than an array (i.e. making the slice zero-length in a tailoring turns off the whole thing) seems both less silly and more future-compatible than what's there now, but of course it's a small bit of work compared to not changing anything.
Oops. The data savings are closer to 3 times the numbers in the previous comments, but still small. Less than a kilobyte across the tailorings.
#1978 closed this but lacked the appropriate notation.
Chances are that all the CE32s in the diacritic table always have meaningful bits only in the secondary weight position, which in CE32 means one byte. There might be an opportunity to save space, and some time in the form of eliminated branches, by storing a byte array of secondary weights only.
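The check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the ICU4X implementation: it assumes a simple CE32 layout with the primary in the upper 16 bits, the secondary in bits 8..16, and the tertiary in the low byte, and it assumes the common tertiary byte is 0x05. The names `compact_secondaries` and `COMMON_TERTIARY_BYTE` are hypothetical.

```rust
/// Assumed value of the common tertiary weight's byte in a simple CE32.
const COMMON_TERTIARY_BYTE: u32 = 0x05;

/// Tries to reduce a table of simple CE32s to one secondary byte per entry.
///
/// Returns `None` if any entry has a non-zero primary or a tertiary byte
/// other than the assumed common one, since such a table cannot be
/// represented by secondary bytes alone.
fn compact_secondaries(ce32s: &[u32]) -> Option<Vec<u8>> {
    ce32s
        .iter()
        .map(|&ce32| {
            let primary = ce32 >> 16; // assumed: upper 16 bits
            let tertiary = ce32 & 0xFF; // assumed: low byte
            if primary == 0 && tertiary == COMMON_TERTIARY_BYTE {
                // With a zero primary, bits 8..16 are all that remain.
                Some((ce32 >> 8) as u8)
            } else {
                None
            }
        })
        // Collecting into Option<Vec<_>> yields None on the first failure.
        .collect()
}
```

A build-time tool could run a check like this over the table and fall back to the wider representation whenever it returns `None`.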