Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make NFKD and UTS 46 data store only the difference form NFD #1984

Merged
merged 18 commits into from
Jun 9, 2022

Conversation

hsivonen
Copy link
Member

@hsivonen hsivonen commented Jun 3, 2022

Makes the postcard form of the data 25.8 KB smaller. Draft only because changesets continue on top of #1978.

hsivonen added 15 commits May 30, 2022 15:30
…without ignoring default ignorables

Default ignorables are not ignored, because doing so would violate the fundamental assumption
of the normalizes that every input character produces non-empty output.

The expectation is that real NFKC_CaseFold will be implemented by first filtering out default ignorables
and then plugging the NFKD_CaseFold data into the upcoming `ComposingNormalizer` code that will turn
NFD into NFC and NFKD into NFKC.
Saves 7332 bytes in data size.
…and allow dynamic further shortening in tailorings

This makes the action of turning a value read from the table into a `CollationElement` super-simple (and branchless).
Makes testdata.postcard 25.8 KB smaller.
@hsivonen hsivonen added C-collator Component: Collation, normalization S-medium Size: Less than a week (larger bug fix or enhancement) labels Jun 3, 2022
@hsivonen hsivonen self-assigned this Jun 3, 2022
@hsivonen hsivonen requested a review from echeran June 3, 2022 11:24
@hsivonen hsivonen marked this pull request as ready for review June 6, 2022 17:59
@hsivonen hsivonen requested review from a team, sffc, robertbastian and Manishearth as code owners June 6, 2022 17:59
@Manishearth Manishearth removed their request for review June 6, 2022 18:00
@codecov-commenter
Copy link

Codecov Report

Merging #1984 (b1f8431) into main (02b9a3f) will decrease coverage by 3.63%.
The diff coverage is 85.07%.

@@            Coverage Diff             @@
##             main    #1984      +/-   ##
==========================================
- Coverage   78.47%   74.84%   -3.64%     
==========================================
  Files         375      474      +99     
  Lines       27643    29089    +1446     
==========================================
+ Hits        21694    21772      +78     
- Misses       5949     7317    +1368     
Impacted Files Coverage Δ
experimental/collator/src/comparison.rs 75.56% <ø> (+0.38%) ⬆️
provider/datagen/src/registry.rs 68.42% <ø> (ø)
...n/src/transform/uprops/canonical_decompositions.rs 33.33% <0.00%> (-1.67%) ⬇️
...tagen/src/transform/uprops/decompositions_serde.rs 0.00% <0.00%> (ø)
provider/testdata/data/baked/normalizer/nfkd_v1.rs 0.00% <ø> (ø)
...ovider/testdata/data/baked/normalizer/uts46d_v1.rs 0.00% <ø> (ø)
experimental/normalizer/src/provider.rs 51.85% <71.42%> (+11.85%) ⬆️
experimental/normalizer/src/lib.rs 78.04% <95.91%> (+2.53%) ⬆️
provider/testdata/data/baked/props/ideo_v1.rs 0.00% <0.00%> (ø)
... and 99 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02b9a3f...b1f8431. Read the comment docs.

Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hsivonen hsivonen merged commit 50bedd7 into unicode-org:main Jun 9, 2022
@hsivonen hsivonen deleted the compactnorm branch June 9, 2022 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-collator Component: Collation, normalization S-medium Size: Less than a week (larger bug fix or enhancement)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants