Add support for NFKD and the decomposed counterpart of UTS 46 without ignored and disallowed #1967
…without ignoring default ignorables
Default ignorables are not ignored, because doing so would violate the fundamental assumption of the normalizer that every input character produces non-empty output. The expectation is that real NFKC_CaseFold will be implemented by first filtering out default ignorables and then plugging the NFKD_CaseFold data into the upcoming `ComposingNormalizer` code that will turn NFD into NFC and NFKD into NFKC.
Copying the extended commit message here: Default ignorables are not ignored, because doing so would violate the fundamental assumption of the normalizer that every input character produces non-empty output. The expectation is that real NFKC_CaseFold will be implemented by first filtering out default ignorables and then plugging the NFKD_CaseFold data into the upcoming |
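To make the intended layering concrete, here is a minimal sketch of "filter first, then normalize". It is not the code in this PR: `is_default_ignorable_subset`, `toy_stage`, and `nfkc_casefold_sketch` are hypothetical names, the ignorable set is a small hand-picked subset of Default_Ignorable_Code_Point, and `char::to_lowercase` merely stands in for a real NFKD_CaseFold-plus-composition stage. The point is only that the per-character stage never yields empty output, because ignorables are removed before it runs.

```rust
/// Hypothetical helper: a small, partial subset of Default_Ignorable_Code_Point.
/// A real implementation would use the full Unicode property data.
fn is_default_ignorable_subset(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}'                  // SOFT HYPHEN
        | '\u{200B}'..='\u{200F}'   // ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
        | '\u{2060}'                // WORD JOINER
        | '\u{FE00}'..='\u{FE0F}'   // VARIATION SELECTOR-1..16
        | '\u{FEFF}'                // ZERO WIDTH NO-BREAK SPACE
    )
}

/// Stand-in for the normalizer stage (assumption: NOT real case folding or
/// decomposition). Like the normalizer described above, it yields at least
/// one output character for every input character.
fn toy_stage(c: char) -> impl Iterator<Item = char> {
    c.to_lowercase()
}

/// Filter default ignorables first, then feed the rest to the stage that
/// assumes non-empty output per input character.
fn nfkc_casefold_sketch(input: &str) -> String {
    input
        .chars()
        .filter(|&c| !is_default_ignorable_subset(c))
        .flat_map(toy_stage)
        .collect()
}
```

With this split, the soft hyphen in `"So\u{00AD}ft"` is dropped by the filter and never reaches the stage that requires non-empty output.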
Saves 7332 bytes in data size.
The second changeset saves 7332 bytes in data size compared to the first changeset. The second changeset also reduces pointer chasing in the iterators by making the fields hold things more directly.
Praise: Looks good from a data perspective. @echeran will review from an algorithmic perspective.
Since I needed to make another push to merge main, I also pushed another commit that documents further data size optimizations at the expense of run-time branches. After writing that, avoiding those run-time branches starts feeling a bit silly, and I feel I should proceed with further data size optimizations. @echeran, @sffc, what do you think? (In any case, let's land this first.)
Thinking about this more, the supplementary set idea doesn't make sense compared to hard-coding the exceptions: when a character decomposes to a non-starter in a way that's not the character itself, that decomposition is hard-coded anyway, so it doesn't make sense for the set extension to be more generic.
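As an illustration of why a separate set adds little, here is a rough sketch in which the hard-coded decompositions double as the membership test. The function names are hypothetical and this is not the PR's implementation. The four mappings are the canonical decompositions of U+0340, U+0341, U+0343, and U+0344 from the Unicode Character Database; whether the PR hard-codes exactly these characters is an assumption.

```rust
// Canonical decompositions of non-starters that decompose to other
// non-starters (values from UnicodeData; the selection is illustrative).
const GRAVE: &[char] = &['\u{0300}'];
const ACUTE: &[char] = &['\u{0301}'];
const COMMA_ABOVE: &[char] = &['\u{0313}'];
const DIAERESIS_ACUTE: &[char] = &['\u{0308}', '\u{0301}'];

/// Hypothetical lookup: returns the hard-coded decomposition for the
/// exceptional characters, or `None` for everything else.
fn hard_coded_non_starter_decomposition(c: char) -> Option<&'static [char]> {
    match c {
        '\u{0340}' => Some(GRAVE),           // COMBINING GRAVE TONE MARK
        '\u{0341}' => Some(ACUTE),           // COMBINING ACUTE TONE MARK
        '\u{0343}' => Some(COMMA_ABOVE),     // COMBINING GREEK KORONIS
        '\u{0344}' => Some(DIAERESIS_ACUTE), // COMBINING GREEK DIALYTIKA TONOS
        _ => None,
    }
}

/// The "set" of exceptions falls out of the same match, so no separate,
/// more generic set data is needed in this sketch.
fn is_hard_coded_exception(c: char) -> bool {
    hard_coded_non_starter_decomposition(c).is_some()
}
```

Because the match already enumerates every exceptional character, a generic set would only duplicate information the code carries anyway.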
The "binsize (wasm)" failure looks like an infra failure: Extracting a tar file fails.
LGTM
Squash-merged on behalf of @hsivonen at his request.
I plan to consolidate the complex decomposition expansion tables between the normalization forms.