Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further optimize storage of icu locale data #45643

Merged
merged 8 commits into from
Jan 4, 2021
Merged

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Dec 5, 2020

Shrink disk usage from 35kB to 16kB; skip .cctor initialization by switching to ReadOnlySpan<byte> rather than ushort[] and int[]

  • Fix LcidToLocaleName lookup for Lcids that have sort order (was overflowing and truncating the sort bits)
  • Optimize disk size of icu ThreeLetterWindowsLanguageName from 5184 bytes to 2592 bytes (string to ReadOnlySpan<byte>)
  • Optimize disk size of NameIndexToNumericData 30kb to 13.5kb; by not using an int for every column and only storing the needed bytes.

Previous startup use:

image

No .cctor allocations in new path

Contributes to #45129

/cc @marek-safar @eerhardt

@ghost
Copy link

ghost commented Dec 5, 2020

Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.

Issue Details

Shrink disk usage from 35kB to 16kB; skip .cctor initialization by switching to ReadOnlySpan<byte> rather than ushort[] and int[]

  • Fix LcidToLocaleName lookup for Lcids that have sort order (was overflowing and truncating the sort bits)
  • Optimize disk size of icu ThreeLetterWindowsLanguageName from 5184 bytes to 2592 bytes (string to ReadOnlySpan<byte>)
  • Optimize disk size of NameIndexToNumericData 30kb to 13.5kb

image

Author: benaadams
Assignees: -
Labels:

area-System.Globalization

Milestone: -

Comment on lines 3389 to 3713
0x4a << 16 | 2926 << 4 | 2, // te
0x4b << 16 | 1824 << 4 | 2, // kn
0x4c << 16 | 2061 << 4 | 2, // ml
0x4d << 16 | 183 << 4 | 2, // as
0x4e << 16 | 2110 << 4 | 2, // mr
0x4f << 16 | 2540 << 4 | 2, // sa
0x50 << 16 | 2066 << 4 | 2, // mn
0x51 << 16 | 279 << 4 | 2, // bo
0x52 << 16 | 392 << 4 | 2, // cy
0x53 << 16 | 1819 << 4 | 2, // km
0x54 << 16 | 1966 << 4 | 2, // lo
0x55 << 16 | 2141 << 4 | 2, // my
0x56 << 16 | 1511 << 4 | 2, // gl
0x57 << 16 | 1839 << 4 | 3, // kok
0x58 << 16 | 2098 << 4 | 3, // mni
0x59 << 16 | 2563 << 4 | 2, // sd
0x5a << 16 | 2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to deal with the sort order prefix; they are grouped into sections (same order as before); however the sort is used for the lookup e.g. LCID lookup of de-de vs de-de_phoneb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LCIDToLocaleName method; GitHub doesn't seem to be highlighting the correct lines 😢

@benaadams
Copy link
Member Author

Might be a bug :)

@jkotas jkotas added tenet-performance Performance related issue linkable-framework Issues associated with delivering a linker friendly framework labels Dec 6, 2020
@benaadams benaadams marked this pull request as ready for review December 6, 2020 04:53
@danmoseley
Copy link
Member

cc @lewing

Console.WriteLine("};");
}
*/

private static ReadOnlySpan<byte> CultureNames => new byte[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment where to find the generator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If #45699 is some form is merged; could change it to an actual source generator from input files? 🤔

@@ -3814,8 +3781,8 @@ internal static int GetLocaleDataNumericPart(string cultureName, IcuLocaleDataPa
return null;
}

Debug.Assert(CulturesCount == (c_threeLetterWindowsLanguageName.Length / 3));
return c_threeLetterWindowsLanguageName.Substring(index * 3, 3);
Debug.Assert(CulturesCount == (ThreeLetterWindowsLanguageName.Length / 3));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant, compiler does the check in array initializer


// Format of the data is LCDI | index to CultureNames | culture name length
private static readonly int[] s_lcidToCultureNameIndices = new int[]
private static ReadOnlySpan<byte> NameIndexToNumericData => new byte[CulturesCount * NumericLocaleDataBytesPerRow]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be better to have more than just 1 huge table but grouped the data into logical sections so illinker can remove unused data

// every 3-characters entry is matching locale name entry in CultureNames
private static ReadOnlySpan<byte> ThreeLetterWindowsLanguageName => new byte[CulturesCount * 3]
{
(byte)'Z', (byte)'Z', (byte)'Z', // aa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a lot of duplicates it might be worth trying to index them as well

@marek-safar marek-safar added size-reduction Issues impacting final app size primary for size sensitive workloads and removed linkable-framework Issues associated with delivering a linker friendly framework labels Dec 9, 2020
@ericstj ericstj requested review from tarekgh and safern January 4, 2021 17:30
@tarekgh
Copy link
Member

tarekgh commented Jan 4, 2021

@benaadams is there any changes you need to do here? or @marek-safar comments can be addressed in other PR and we can merge this one?

@benaadams
Copy link
Member Author

Is good from me; cna do it in follow up

@tarekgh
Copy link
Member

tarekgh commented Jan 4, 2021

ok, merging this one and any further comments can be addressed in follow up PR's.

@tarekgh tarekgh merged commit 5ac16c4 into dotnet:master Jan 4, 2021
@benaadams benaadams deleted the Icu branch January 5, 2021 01:36
@ghost ghost locked as resolved and limited conversation to collaborators Feb 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Globalization size-reduction Issues impacting final app size primary for size sensitive workloads tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants