Figure out if the diacritic table can be smaller #1940

Closed
hsivonen opened this issue May 25, 2022 · 6 comments
Labels
C-collator Component: Collation, normalization S-small Size: One afternoon (small bug fix or enhancement)

Comments

@hsivonen
Member

Chances are that all the CE32s in the diacritic table have meaningful bits only in the secondary weight position, which in a CE32 is a single byte. There might be an opportunity to save space, and some time in the form of eliminated branches, by storing a byte array of secondary weights only.
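
A minimal sketch of how this hypothesis could be checked against a table. The bit layout used here (16 bits of primary, 8 bits of secondary, 8 bits of tertiary), the `COMMON_TERTIARY` value, and all names are assumptions for illustration, not the actual ICU4X definitions.

```rust
/// Assumed layout of a non-special CE32 (illustrative only):
/// bits 31..16 primary, bits 15..8 secondary, bits 7..0 tertiary.
const PRIMARY_MASK: u32 = 0xFFFF_0000;
const SECONDARY_MASK: u32 = 0x0000_FF00;
const TERTIARY_MASK: u32 = 0x0000_00FF;
/// Hypothetical placeholder for the common tertiary weight.
const COMMON_TERTIARY: u32 = 0x05;

/// Returns the secondary byte if the CE32 carries information only in
/// the secondary position: zero primary and the common tertiary.
fn secondary_only(ce32: u32) -> Option<u8> {
    let primary_is_zero = (ce32 & PRIMARY_MASK) == 0;
    let tertiary_is_common = (ce32 & TERTIARY_MASK) == COMMON_TERTIARY;
    if primary_is_zero && tertiary_is_common {
        Some(((ce32 & SECONDARY_MASK) >> 8) as u8)
    } else {
        None
    }
}

/// True if every entry of the table could be replaced by a single byte.
fn table_fits_in_bytes(table: &[u32]) -> bool {
    table.iter().all(|&ce32| secondary_only(ce32).is_some())
}
```

If `table_fits_in_bytes` held for every tailoring, the table could shrink from four bytes per entry to one under these assumptions.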

@hsivonen hsivonen added S-small Size: One afternoon (small bug fix or enhancement) C-collator Component: Collation, normalization labels May 25, 2022
@hsivonen hsivonen added this to the ICU4X 1.0 milestone May 25, 2022
@hsivonen hsivonen self-assigned this May 25, 2022
@hsivonen
Member Author

The main questions are:

  1. Can the diacritic table store only secondary weights?
  2. Can the diacritic table store only short secondary weights?

The answer to the first question is: Yes, if U+034F COMBINING GRAPHEME JOINER and the following characters are excluded from the table instead of having the table cover the whole block.

The answer to the second question is: For the root and Ewe, yes. For Vietnamese, no, not without establishing a larger tailoring gap in the root after the grave accent.

Tweaking the root to establish such a tailoring gap seems like a bad idea in the ICU4X 1.0 scope. (But it's also a bit sad to store 16-bit weights instead of 8-bit weights knowing that 8-bit weights would suffice if the weights were allocated differently.)

My thinking is that it makes sense to

  1. Shorten the table so that U+034F COMBINING GRAPHEME JOINER is the first character not included.
  2. Make the table consist of 16-bit secondary weights.
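
As a rough sketch of the two steps above: a fixed-length table of 16-bit secondary weights that covers U+0300 through U+034E and stops before U+034F. The struct, method, and constant names are hypothetical and do not reflect the ICU4X data structures.

```rust
/// First character covered by the table.
const FIRST_DIACRITIC: u32 = 0x0300;
/// U+034F COMBINING GRAPHEME JOINER is the first character NOT covered.
const FIRST_EXCLUDED: u32 = 0x034F;
/// Number of entries: U+0300..=U+034E.
const DIACRITIC_COUNT: usize = (FIRST_EXCLUDED - FIRST_DIACRITIC) as usize;

/// Hypothetical table type holding one 16-bit secondary weight per character.
struct DiacriticTable {
    secondaries: [u16; DIACRITIC_COUNT],
}

impl DiacriticTable {
    /// Returns the secondary weight for `c` if it is inside the table,
    /// or `None` so the caller can fall back to the general lookup.
    fn secondary_for(&self, c: char) -> Option<u16> {
        // Characters below U+0300 wrap around to a huge offset, so a
        // single bounds check handles both ends of the range.
        let offset = (c as u32).wrapping_sub(FIRST_DIACRITIC) as usize;
        self.secondaries.get(offset).copied()
    }
}
```

Because the entries are plain secondary weights, the fast path in this sketch does not need to test for special CE32 forms; that is the kind of branch the proposal aims to drop.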

The space saving isn't much, but this would eliminate a couple of branches from the handling of common diacritics, which could take back a tiny bit of the perf left on the table by omitting the canonical closure and Latin mini expansions. While eliminating a couple of branches seems a bit silly, the whole point of this table is to regain perf not obtained via the canonical closure and Latin mini expansions.

I'm thinking I'm going to proceed with storing 16-bit secondary weights.

@markusicu, what do you think? Does it seem realistic that in the future CLDR could end up with characters between U+0300 and U+034E (inclusive) having a non-zero primary weight, or a tertiary or quaternary weight other than the common tertiary weight? (It seems virtually certain that these characters cannot gain case bits.)

CC @echeran, @sffc

@hsivonen
Member Author

One way to provide an escape hatch for guessing wrong about the future would not add any branches; it would only turn a number that is currently a compile-time constant into one loaded from data. If the diacritic table were a slice instead of an array whose length is known at compile time, a tailoring could bypass the whole thing by supplying a zero-length diacritic table.

The next iteration of this idea would be to store only 8-bit secondaries for root and Ewe and have Vietnamese supply a zero-length table until such time that the tailoring gap after the grave accent is made larger.
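
A minimal sketch of the escape hatch, assuming a hypothetical `Tailoring` struct that borrows its diacritic table from data. The bounds check is the same branch as with a fixed-size array; only the length now comes from the data, so an empty slice sends every character to the general path.

```rust
/// First character covered by the table.
const FIRST_DIACRITIC: u32 = 0x0300;

/// Hypothetical tailoring data. The table is a slice, so its length is
/// loaded from data rather than fixed at compile time.
struct Tailoring<'data> {
    diacritics: &'data [u16],
}

impl<'data> Tailoring<'data> {
    /// Fast-path lookup. Returns `None` when `c` is outside the table,
    /// which is always the case when the table has zero length.
    fn fast_secondary(&self, c: char) -> Option<u16> {
        let offset = (c as u32).wrapping_sub(FIRST_DIACRITIC) as usize;
        self.diacritics.get(offset).copied()
    }
}
```

Under the follow-up idea, the element type would become `u8` for the root and Ewe, and Vietnamese would supply an empty slice until the tailoring gap is enlarged.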

@sffc
Member

sffc commented May 25, 2022

What's the magnitude of the data savings we're talking about here?

@hsivonen
Member Author

The data saving is tiny, between 150 and 250 bytes, so it doesn't make this worthwhile on its own. Perhaps the branch avoidance isn't worthwhile either, but it also seems silly to keep branches that are known to be useless.

So storing 16-bit secondaries with the escape hatch of making it a slice rather than an array (i.e. making the slice zero-length in tailoring turns off the whole thing) seems both less silly and more future-compatible than what's there now, but of course it's a small bit of work compared to not changing anything.

@hsivonen
Member Author

Oops. The data savings are closer to 3 times the numbers in the previous comments, but still small. Less than a kilobyte across the tailorings.

@hsivonen
Member Author

hsivonen commented Jun 7, 2022

#1978 closed this but lacked the appropriate notation.

@hsivonen hsivonen closed this as completed Jun 7, 2022