Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[browser][non-icu] HybridGlobalization checking for prefix/suffix #84920

Closed
wants to merge 9 commits into from

Conversation

ilonatommy
Copy link
Member

@ilonatommy ilonatommy commented Apr 17, 2023

Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.

Old, icu-based private API: GlobalizationNative_StartsWith, GlobalizationNative_EndsWith

New, non-icu private API: Interop.JsGlobalization.StartsWith, Interop.JsGlobalization.EndsWith

Affected public API (see: tests in CompareInfoTests.IsPrefix.cs, CompareInfoTests.IsSuffix):

  • CompareInfo.IsPrefix
  • CompareInfo.IsSuffix
  • String.StartsWith
  • String.EndsWith

This implementation hovers with 64kB strings. Average processing times for 16kB is ~200ms.

measurement (64kB strings) time ICU time Hybrid time CoreCLR
String, CompareInfo IsPrefix 5.3665ms ? ms 4.300ms
String, CompareInfo IsSuffix 10.1670ms ? ms 6.700ms
String, String StartsWith 5.4105ms ? ms 4.900ms
String, String EndsWith 10.4519ms ? ms 6.700ms

All changes in behavior are listed in docs\design\features\hybrid-globalization.md.

cc @SamMonoRT

// unless we have < 2 empty chars at the beginning
if (segment.length === 1 || (segment.length > 1 && "".localeCompare(segment[1].segment, undefined) !== 0))
{
segment.shift();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pass currentIndex: number and increment it rather than mutate the array with .shift() ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could, I will try to edit the logic.


export function segment_string_locale_sensitive(string: string, locale: string | undefined) : Intl.SegmentData[]
{
const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would appreciate comment about why we are segmenting the string into "user perceived letters" instead of comparing the whole string ? Is that because of the need to know the match index ?

Would it be possible it to assume that most calls return false (no match) and optimize for it by comparing unsegmented string first ?

Is creation of Segmenter object expensive ?

Copy link
Member Author

@ilonatommy ilonatommy Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because we need locale-sesitive comparison. We cannot guess how many chars does n letters consist of. Consider the case:

s1 = "A\u0300" // 2 chars, 1 letter
s2 = "\u00C0" // 1 char, 1 letter

segmenter recognises both as 1 letter and we can easily compare them. If we were to compare s1[0] === s2[0] then we would get false.

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
return Array.from(segmenter.segment(string));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locale.split on each call to compare_strings is probably expensive.

Copy link
Member Author

@ilonatommy ilonatommy Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added performance tests (for the whole operation, not only segmentation) and it seems it's not that bad.

measurement time ICU time Hybrid
String, CompareInfo IsPrefix 5.3330ms 6.0471ms
String, CompareInfo IsSuffix 0.0091ms 0.0093ms
String, String StartsWith 5.5313ms 6.0296ms
String, String EndsWith 0.0094ms 0.0102ms

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have to re-investigate it, the numbers are a bit too similar in both cases. I expected bigger differences, something might be wrong with this PR.

Copy link
Member Author

@ilonatommy ilonatommy Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked. There was an error connected with prefix testing but it does not change the range of ICU vs HG difference. Updated numbers:

measurement time ICU time Hybrid time CoreCLR
String, CompareInfo IsPrefix 5.3665ms 5.6897ms 4.300ms
String, CompareInfo IsSuffix 10.1670ms 10.7293ms 6.700ms
String, String StartsWith 5.4105ms 6.4558ms 4.900ms
String, String EndsWith 10.4519ms 12.8267ms 6.700ms

@ilonatommy
Copy link
Member Author

Closing. The PR will get re-designed to improve the performance and re-posted. We will need to drop support for matchLength calculation for this reason but based on the usage check, it is more beneficial to not support it for the cost of higher performance.

@ilonatommy ilonatommy closed this Apr 19, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants