Figure out if the diacritic table can be smaller #1940

hsivonen · 2022-05-25T05:30:30Z

Chances are that all the CE32s in the diacritic table have meaningful bits only in the secondary weight position, which in CE32 means one byte. There might be an opportunity to save space, and some time in the form of eliminated branches, by storing a byte array of secondary weights only.

hsivonen · 2022-05-25T12:29:43Z

The main questions are:

Can the diacritic table store only secondary weights?
Can the diacritic table store only short secondary weights?

The answer to the first question is: Yes, if U+034F COMBINING GRAPHEME JOINER and the following characters are excluded from the table instead of having the table cover the whole block.

The answer to the second question is: Yes for the root and Ewe, but no for Vietnamese, unless a larger tailoring gap is established in the root after the grave accent.

Tweaking the root to establish such a tailoring gap seems like a bad idea in the ICU4X 1.0 scope. (But it's also a bit sad to store 16-bit weights instead of 8-bit weights knowing that 8-bit weights would suffice if the weights were allocated differently.)

My thinking is that it makes sense to

Shorten the table so that U+034F COMBINING GRAPHEME JOINER is the first character not included.
Make the table consist of 16-bit secondary weights.
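As a rough illustration of the two points above, here is a minimal sketch of a table of 16-bit secondary weights that stops just before U+034F. The names `TABLE_START`, `TABLE_END_EXCLUSIVE`, `TABLE_LEN`, and `diacritic_secondary` are hypothetical and are not taken from the ICU4X code; the sketch only shows the indexing, not real weight values.

```rust
/// First code point covered by the table (U+0300 COMBINING GRAVE ACCENT).
const TABLE_START: u32 = 0x0300;
/// First code point NOT covered (U+034F COMBINING GRAPHEME JOINER).
const TABLE_END_EXCLUSIVE: u32 = 0x034F;
/// Number of entries needed for U+0300..=U+034E.
const TABLE_LEN: usize = (TABLE_END_EXCLUSIVE - TABLE_START) as usize;

/// Returns the 16-bit secondary weight for `c` if `c` is within the
/// table's range, and `None` otherwise (hypothetical helper).
fn diacritic_secondary(table: &[u16; TABLE_LEN], c: char) -> Option<u16> {
    let c = c as u32;
    if (TABLE_START..TABLE_END_EXCLUSIVE).contains(&c) {
        Some(table[(c - TABLE_START) as usize])
    } else {
        None
    }
}
```

With these bounds the table has 0x4F entries, so at two bytes per entry it occupies 158 bytes per tailoring.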

The space saving isn't much, but this would eliminate a couple of branches from the handling of common diacritics, which could take back a tiny bit of the perf left on the table by omitting the canonical closure and Latin mini expansions. While eliminating a couple of branches seems a bit silly, the whole point of this table is to regain perf not obtained via the canonical closure and Latin mini expansions.

I'm thinking I'm going to proceed with storing 16-bit secondary weights.

@markusicu, what do you think? Does it seem realistic that in the future CLDR could end up with characters between U+0300 and U+034E (inclusive) having a non-zero primary weight, or tertiary or quaternary weights other than the common tertiary weight? (It seems virtually certain that these characters cannot gain case bits.)

CC @echeran, @sffc

hsivonen · 2022-05-25T12:43:31Z

One way to provide an escape hatch for guessing wrong about the future would not involve adding branches but would convert a number that's currently a constant to be loaded from data: If the diacritic table were a slice instead of an array whose length is known at compile time, a tailoring could bypass the whole thing by supplying a zero-length diacritic table.
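A minimal sketch of that escape hatch, assuming the table is passed as a `&[u16]` slice: the function name `secondary_from_slice` and the constant are hypothetical, not the actual ICU4X API. The only change from a fixed-size array is that the upper bound comes from the slice length, so the lookup still has a single bounds check.

```rust
/// First code point covered by the table (U+0300 COMBINING GRAVE ACCENT).
const TABLE_START: u32 = 0x0300;

/// Looks up the secondary weight for `c` in a table whose length is
/// loaded from data. Code points below `TABLE_START` wrap around to a
/// huge offset, so they fail the same bounds check as code points past
/// the end of the table.
fn secondary_from_slice(table: &[u16], c: char) -> Option<u16> {
    let offset = (c as u32).wrapping_sub(TABLE_START) as usize;
    table.get(offset).copied()
}
```

A tailoring that supplies an empty slice makes every lookup return `None`, which sends all characters to whatever slower general path the caller falls back on.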

The next iteration of this idea would be to store only 8-bit secondaries for root and Ewe and have Vietnamese supply a zero-length table until such time that the tailoring gap after the grave accent is made larger.

sffc · 2022-05-25T16:08:13Z

What's the magnitude of the data savings we're talking about here?

hsivonen · 2022-05-25T17:26:50Z

The data saving is tiny, between 150 and 250 bytes. The data saving doesn't make this worthwhile. Perhaps the branch avoidance isn't worthwhile, either, but it also seems silly to keep branches that are known to be useless.

So storing 16-bit secondaries with the escape hatch of making it a slice rather than an array (i.e. making the slice zero-length in tailoring turns off the whole thing) seems both less silly and more future-compatible than what's there now, but of course it's a small bit of work compared to not changing anything.

hsivonen · 2022-05-25T17:28:36Z

Oops. The data savings are closer to 3 times the numbers in the previous comment, but still small. Less than a kilobyte across the tailorings.

hsivonen · 2022-06-07T10:19:50Z

#1978 closed this but lacked the appropriate notation.

hsivonen added the S-small (Size: One afternoon (small bug fix or enhancement)) and C-collator (Component: Collation, normalization) labels May 25, 2022

hsivonen added this to the ICU4X 1.0 milestone May 25, 2022

hsivonen self-assigned this May 25, 2022

sffc modified the milestones: ICU4X 1.0 Untriaged, ICU4X 1.0 (Features) May 25, 2022

hsivonen closed this as completed Jun 7, 2022
