Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22368 Reduce ~200K langInfo.res size by encode LSR into 32bits int. #2458

Merged
merged 1 commit into from
Jun 22, 2023

Conversation

FrankYFTang
Copy link
Contributor

@FrankYFTang FrankYFTang commented May 11, 2023

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22368
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@FrankYFTang FrankYFTang added the incomplete Needs work; do not approve/merge as is. label May 11, 2023
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/loclikelysubtags.cpp is different
  • icu4j/main/shared/data/icudata.jar is different
  • icu4j/main/shared/data/icutzdata.jar is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@FrankYFTang
Copy link
Contributor Author

This change reduce the size of icudt73l.dat from 32206240 to 31999424 and reduced 206816 bytes

@FrankYFTang FrankYFTang changed the title ICU-22368 !!!DO NOT LOOK YET!!! Reduce langInfo.res size by encode LSR into 32bits int. Reduce ~200K langInfo.res size by encode LSR into 32bits int. Jun 1, 2023
@FrankYFTang FrankYFTang changed the title Reduce ~200K langInfo.res size by encode LSR into 32bits int. ICU-22368 Reduce ~200K langInfo.res size by encode LSR into 32bits int. Jun 1, 2023
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4j/main/shared/data/icudata.jar is different
  • icu4j/main/shared/data/icutzdata.jar is different
  • icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/icudt73b/langInfo.res is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@FrankYFTang FrankYFTang added waiting-on-reviewer and removed incomplete Needs work; do not approve/merge as is. labels Jun 3, 2023
@FrankYFTang
Copy link
Contributor Author

Please review. thanks

@FrankYFTang
Copy link
Contributor Author

@pedberg-icu @markusicu could you take a look at this or should I ask someone else?

// Do not have enough bits to store the all 1000 possible combination of \d{3}
// Only support what is needed in resource- "001", "143" and "419".
if (region.length() == 3) {
assert region.equals("001") || region.equals("143") || region.equals("419");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments about this approach in LocaleDistanceMapper.java. It is too fragile and inflexible. I think LocaleDistanceMapper.java should write an array of supported numeric region codes that this method can use for encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well do. Changing the code now. Stay tune.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not seem like this code got updated to use the m49 array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opps. you are right. need to change it.

Copy link
Contributor

@pedberg-icu pedberg-icu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a better way of handling numeric region codes, see comments (does not even support all currently used codes)

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/loclikelysubtags.cpp is different
  • icu4c/source/data/misc/langInfo.txt is different
  • icu4j/main/classes/core/src/com/ibm/icu/impl/locale/LocaleDistance.java is different
  • icu4j/main/classes/core/src/com/ibm/icu/impl/locale/LSR.java is different
  • icu4j/main/classes/core/src/com/ibm/icu/impl/locale/XLikelySubtags.java is different
  • icu4j/main/shared/data/icudata.jar is different
  • icu4j/main/shared/data/icutzdata.jar is different
  • icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/icudt73b/langInfo.res is different
  • tools/cldr/cldr-to-icu/src/main/java/org/unicode/icu/tool/cldrtoicu/localedistance/LocaleDistanceMapper.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@FrankYFTang
Copy link
Contributor Author

Peter - I modified the PR to output the M49 codes into the resources from the CLDR to ICU tool and use that during the decoding now.

@pedberg-icu
Copy link
Contributor

Peter - I modified the PR to output the M49 codes into the resources from the CLDR to ICU tool and use that during the decoding now.

Oh, I like this much better, thanks! But it looks like LSR.java is still not using the new M49 array...

@FrankYFTang
Copy link
Contributor Author

Needs a better way of handling numeric region codes, see comments (does not even support all currently used codes)

It is not intend to support ALL used codes in everywhere, but only for those encoded in the value of the likelySubtag.xml

@FrankYFTang
Copy link
Contributor Author

Peter - I modified the PR to output the M49 codes into the resources from the CLDR to ICU tool and use that during the decoding now.

Oh, I like this much better, thanks! But it looks like LSR.java is still not using the new M49 array...

yes, sorry, I missed that. Will fix

Copy link
Contributor

@pedberg-icu pedberg-icu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, thanks for this!

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@FrankYFTang FrankYFTang merged commit e83b071 into unicode-org:main Jun 22, 2023
101 checks passed
@FrankYFTang FrankYFTang deleted the ICU-22368-second branch June 22, 2023 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants