Add support for NFKD and the decomposed counterpart of UTS 46 without ignored and disallowed #1967

hsivonen · 2022-05-30T12:36:41Z

I plan to consolidate the complex decomposition expansion tables between the normalization forms.

…without ignoring default ignorables

Default ignorables are not ignored, because doing so would violate the fundamental assumption of the normalizer that every input character produces non-empty output. The expectation is that real NFKC_CaseFold will be implemented by first filtering out default ignorables and then plugging the NFKD_CaseFold data into the upcoming `ComposingNormalizer` code that will turn NFD into NFC and NFKD into NFKC.

hsivonen · 2022-05-30T12:40:34Z

Copying the extended commit message here:

Default ignorables are not ignored, because doing so would violate the fundamental assumption of the normalizer that every input character produces non-empty output.

The expectation is that real NFKC_CaseFold will be implemented by first filtering out default ignorables and then plugging the NFKD_CaseFold data into the upcoming ComposingNormalizer code that will turn NFD into NFC and NFKD into NFKC.

Saves 7332 bytes in data size.
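To illustrate the "filter first, then normalize" layering described in the commit message, here is a minimal sketch. It does not use the ICU4X API: `is_default_ignorable_subset`, `toy_fold`, and `filter_then_fold` are hypothetical names, the ignorable set is a small, incomplete subset of Default_Ignorable_Code_Point, and `char::to_lowercase` merely stands in for a mapping stage that always yields at least one character per input character.

```rust
/// Hypothetical helper: a small, incomplete subset of
/// Default_Ignorable_Code_Point, for illustration only.
fn is_default_ignorable_subset(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}'
            | '\u{200B}'..='\u{200F}'
            | '\u{2060}'..='\u{206F}'
            | '\u{FE00}'..='\u{FE0F}'
            | '\u{FEFF}'
    )
}

/// Hypothetical stand-in for the mapping stage. Like the normalizer's
/// assumption, it yields non-empty output for every input character.
fn toy_fold(c: char) -> impl Iterator<Item = char> {
    c.to_lowercase()
}

/// Drop ignorables before the mapping stage, so the mapping stage
/// itself never has to produce empty output.
fn filter_then_fold(input: &str) -> String {
    input
        .chars()
        .filter(|&c| !is_default_ignorable_subset(c))
        .flat_map(toy_fold)
        .collect()
}

fn main() {
    let out = filter_then_fold("So\u{00AD}ft\u{200B}HYPHEN");
    println!("{}", out);
}
```

Keeping the filter outside the mapping stage means the inner iterator can rely on "one character in, at least one character out" without a special case for removed characters.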

hsivonen · 2022-05-31T10:55:16Z

The second changeset saves 7332 bytes in data size compared to the first changeset. The second changeset also reduces pointer chasing in the iterators by making the fields hold things more directly.

sffc

Praise: Looks good from a data perspective. @echeran will review from an algorithmic perspective.

hsivonen · 2022-06-01T06:53:34Z

Since I needed to make another push to merge main, I also pushed another commit that documents further data size optimizations at the expense of run-time branches. After writing that, avoiding those run-time branches starts feeling a bit silly, and I feel I should proceed with further data size optimizations. @echeran, @sffc, what do you think? (In any case, let's land this first.)

hsivonen · 2022-06-01T17:50:48Z

> After writing that, avoiding those run-time branches starts feeling a bit silly, and I feel I should proceed with further data size optimizations.

Thinking about this more, the supplementary set idea doesn't make sense compared to hard-coding the exceptions: when a character decomposes to a non-starter in a way that's not the character itself, that decomposition is hard-coded anyway, so it doesn't make sense for the set extension to be more generic.
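A rough sketch of what hard-coding such exceptions can look like follows. The function name `hard_coded_decomposition` is hypothetical, and the PR text does not say which characters are handled this way. The four entries are simply well-known characters whose canonical decomposition begins with a non-starter other than the character itself.

```rust
/// Hypothetical sketch: characters whose canonical decomposition starts
/// with a non-starter that is not the character itself, handled by a
/// hard-coded match instead of a generic data-driven set.
fn hard_coded_decomposition(c: char) -> Option<&'static [char]> {
    match c {
        // COMBINING GRAVE TONE MARK -> COMBINING GRAVE ACCENT
        '\u{0340}' => Some(&['\u{0300}']),
        // COMBINING ACUTE TONE MARK -> COMBINING ACUTE ACCENT
        '\u{0341}' => Some(&['\u{0301}']),
        // COMBINING GREEK KORONIS -> COMBINING COMMA ABOVE
        '\u{0343}' => Some(&['\u{0313}']),
        // COMBINING GREEK DIALYTIKA TONOS -> DIAERESIS + ACUTE
        '\u{0344}' => Some(&['\u{0308}', '\u{0301}']),
        // Everything else would go through the regular data lookup.
        _ => None,
    }
}

fn main() {
    if let Some(d) = hard_coded_decomposition('\u{0344}') {
        for c in d {
            println!("U+{:04X}", *c as u32);
        }
    }
}
```

Because the `match` already has to carry the replacement characters, the membership test comes for free, which is the reason a separate generic set adds little here.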

hsivonen · 2022-06-02T08:36:19Z

The "binsize (wasm)" failure looks like an infra failure: Extracting a tar file fails.

echeran

LGTM

echeran · 2022-06-02T21:47:47Z

Squash-merged on behalf of @hsivonen at his request.

hsivonen requested review from a team, sffc, robertbastian, Manishearth and echeran as code owners May 30, 2022 12:36

hsivonen added the C-collator Component: Collation, normalization label May 30, 2022

hsivonen self-assigned this May 30, 2022

hsivonen added this to the ICU4X 1.0 (Features) milestone May 30, 2022

hsivonen added the S-medium Size: Less than a week (larger bug fix or enhancement) label May 30, 2022

Consolidate complex decomposition data

f1ad72d

Saves 7332 bytes in data size.

Avoid a clone that allocates instead of just copying a pointer

b08ad82

Manishearth removed request for a team and Manishearth May 31, 2022 15:15

sffc previously approved these changes Jun 1, 2022

hsivonen added 2 commits June 1, 2022 09:46

Merge branch 'main' into normalizerdata

e281e9f

Document normalizer data trade-offs

dd1d56b

hsivonen dismissed sffc’s stale review via dd1d56b June 1, 2022 06:47

sffc previously approved these changes Jun 1, 2022

Sync normalizer README with lib.rs

2b7d67a

hsivonen dismissed sffc’s stale review via 2b7d67a June 1, 2022 18:19

markusicu reviewed Jun 1, 2022

experimental/normalizer/src/lib.rs (outdated, resolved)

hsivonen added 4 commits June 2, 2022 10:12

Extract the right ICU4C data for UTS 46

30b4b4a

Flatten out a level of reference indirection on decomposition_starts_with_non_starter

97129a9

Merge branch 'main' into normalizerdata

eae8a1d

Use 3 bytes per supplementary-plane character in normalization data

676133b
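As a sketch of why three bytes suffice for a supplementary-plane character: Unicode scalar values top out at U+10FFFF, which fits in 21 bits, so a 24-bit slot loses nothing. The functions `pack` and `unpack` and the little-endian byte order are assumptions for illustration. The commit's actual storage layout is not shown in this conversation.

```rust
/// Hypothetical sketch: store a scalar value in 3 little-endian bytes.
fn pack(c: char) -> [u8; 3] {
    let v = c as u32; // at most 0x10FFFF, so the top byte of the u32 is zero
    [v as u8, (v >> 8) as u8, (v >> 16) as u8]
}

/// Reassemble the 24-bit value. Returns None for values that are not
/// Unicode scalar values (surrogates or anything above U+10FFFF).
fn unpack(b: [u8; 3]) -> Option<char> {
    let v = u32::from(b[0]) | (u32::from(b[1]) << 8) | (u32::from(b[2]) << 16);
    char::from_u32(v)
}

fn main() {
    let c = '\u{1D15E}'; // MUSICAL SYMBOL HALF NOTE, a supplementary-plane character
    let bytes = pack(c);
    println!("{:02X?} -> {:?}", bytes, unpack(bytes));
}
```

Compared with storing each such character as a 4-byte `u32`, this saves one byte per character at the cost of a few shifts when reading.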

hsivonen changed the title ~~Add support for NFKD and the decomposed counterpart of NFKC_CaseFold without ignoring default ignorables~~ Add support for NFKD and the decomposed counterpart of UTS 46 without ignored and disallowed Jun 2, 2022

echeran approved these changes Jun 2, 2022

echeran merged commit a3ba544 into unicode-org:main Jun 2, 2022
