
Intl.getParentLocales #87

Open
zbraniecki opened this issue Mar 31, 2016 · 11 comments
Labels
c: locale Component: locale identifiers Proposal Larger change requiring a proposal s: comment Status: more info is needed to move forward

Comments

@zbraniecki
Member

During the TC39 meeting yesterday we decided to separate Intl.getParentLocales out of #46.

Basically, we already have Intl.getCanonicalLocales in the spec, and the next step to help with language negotiation is to expose Intl.getParentLocales.

This should also address the concerns raised by @rxaviers in https://github.com/rxaviers/ecma402-fix-lookup-matcher, since we would fix the internal operation and then expose it.

I'll prototype the polyfill and spec soon.

@zbraniecki
Member Author

Oh, and since Abstract Locale Operations are already in Stage 2, I expect to be able to present this, get it into Stage 2, and ask for reviewers at the next TC39 meeting.

@zbraniecki
Member Author

The way we would like to tackle it is that the algorithm will produce a list of locales from the most specific to the most generic fallback.
In the generic case it will truncate at hyphens, so:

Intl.getParentLocales('en-us'); // ['en-US', 'en']
Intl.getParentLocales('es-mx'); // ['es-MX', 'es']

On top of that, the algorithm will scan through an implementation-specific exception list to produce a different fallback list for certain locales:

Intl.getParentLocales('az-ir'); // ['az-IR', 'az-Arab-IR']

And then we'll use the resulting abstract operation both for the public API and internally for all formatters.

@rxaviers - will this work for you?
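A minimal sketch of what such an algorithm might look like, assuming a tiny hard-coded exception table (the function name getParentLocales comes from this proposal, but the table contents and the canonicalization logic here are illustrative, not spec text):

```javascript
// Illustrative sketch only: the exception table would really come from
// implementation-specific CLDR data.
const EXCEPTIONS = {
  'az-IR': ['az-IR', 'az-Arab-IR'],
};

function getParentLocales(locale) {
  // Canonicalize casing: language subtag lowercase, 2-letter region uppercase.
  const canonical = locale
    .split('-')
    .map((part, i) =>
      i === 0 ? part.toLowerCase()
        : part.length === 2 ? part.toUpperCase()
        : part)
    .join('-');

  if (EXCEPTIONS[canonical]) return EXCEPTIONS[canonical];

  // Generic case: truncate at hyphens, most specific first.
  const chain = [canonical];
  let current = canonical;
  let cut;
  while ((cut = current.lastIndexOf('-')) !== -1) {
    current = current.slice(0, cut);
    chain.push(current);
  }
  return chain;
}

getParentLocales('en-us'); // ['en-US', 'en']
getParentLocales('az-ir'); // ['az-IR', 'az-Arab-IR']
```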

@srl295
Member

srl295 commented Mar 31, 2016

Is the appspot data actually up to date? I would use the XML or JSON from the official CLDR release, http://cldr.unicode.org

@zbraniecki
Member Author

Ok, so it seems that likelySubtags is more like what we're looking for. It gives us the ability to turn az-IR into az-Arab-IR, or ha-CM into ha-Arab-CM.

Now, the question is, what should we do from there? Should it be:

ha-CM -> ha-Arab-CM -> ha-Arab
or
ha-CM -> ha-Arab-CM -> ha-Arab -> ha

If not the latter, how do we recognize where to stop? One way would be to write the algorithm so that it cuts subtags until it reaches one that is already in the exception list (in this case ha matches ha_Latn_NG).

Would that work?
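That stopping rule might look like the following sketch, using a two-entry excerpt of likelySubtags-style data (the function name fallbackChain and the data excerpt are illustrative, not part of the proposal):

```javascript
// Illustrative excerpt of likelySubtags-style exception data.
const LIKELY_SUBTAGS = {
  'ha': 'ha-Latn-NG',
  'ha-CM': 'ha-Arab-CM',
};

function fallbackChain(locale) {
  const chain = [locale];
  // Expand via the exception table when an entry exists.
  if (LIKELY_SUBTAGS[locale]) chain.push(LIKELY_SUBTAGS[locale]);
  // Truncate subtag by subtag, stopping once a truncation itself appears
  // in the exception table (because it has its own, different expansion).
  let current = chain[chain.length - 1];
  let cut;
  while ((cut = current.lastIndexOf('-')) !== -1) {
    current = current.slice(0, cut);
    if (LIKELY_SUBTAGS[current]) break; // e.g. bare 'ha' maps to ha-Latn-NG
    chain.push(current);
  }
  return chain;
}

fallbackChain('ha-CM'); // ['ha-CM', 'ha-Arab-CM', 'ha-Arab']
```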

@caridy caridy added this to the 4th Edition milestone Apr 1, 2016
@rxaviers
Member

rxaviers commented Jun 13, 2016

Hey @zbraniecki, sorry for the long delay in my answer. The issues pointed out (and addressed) by the proposal https://github.com/rxaviers/ecma402-fix-lookup-matcher are a little trickier than the ones you exemplified above (and than those exemplified by Abstract Locale Operations - Nov 2015).

Please correct me if I'm wrong, but I understood the decision was to expose getParentLocales so that third-party libraries can implement polyfills (or any other user-land functionality) by leveraging it. Also, the spec will take into consideration the proper algo step, so the various implementations behave consistently.

I believe https://github.com/rxaviers/ecma402-fix-lookup-matcher is the "the spec will take into consideration the proper algo step" part, specifically these implementation details.

Answering your last question, "how do we recognize where to stop?": it should stop at root. Here are more details about how this algorithm works as specified by CLDR/UTS#35 (as I understand it):

A very basic step of the algorithm for finding the parent locale is to truncate it, e.g., the parent locale of pt-PT is pt.

Another very basic step of the algorithm for finding the parent locale is to preferably use the parent locale data instead of truncating it, e.g., the parent locale of en-GB is en-001 (not en). Similarly, the parent locale of en-IN is en-001; the parent locale of es-MX is es-419; the parent locale of az-Arab is root (straight to root, without passing through az).

Code: https://github.com/rxaviers/cldrjs/blob/master/src/bundle/parent_lookup.js#L7-L26
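The two steps above can be sketched as follows, assuming a small excerpt of CLDR supplemental parentLocales data (function names and the data excerpt are illustrative, not the proposal's actual text):

```javascript
// Small excerpt of CLDR supplemental parentLocales data; the real table
// is much larger.
const PARENT_LOCALES = {
  'en-GB': 'en-001',
  'en-IN': 'en-001',
  'es-MX': 'es-419',
  'az-Arab': 'root',
};

// Prefer an explicit parentLocales entry; otherwise truncate; stop at root.
function parentLocale(locale) {
  if (locale === 'root') return undefined;
  if (PARENT_LOCALES[locale]) return PARENT_LOCALES[locale];
  const cut = locale.lastIndexOf('-');
  return cut === -1 ? 'root' : locale.slice(0, cut);
}

function parentChain(locale) {
  const chain = [];
  for (let p = parentLocale(locale); p !== undefined; p = parentLocale(p)) {
    chain.push(p);
  }
  return chain;
}

parentChain('en-GB');   // ['en-001', 'en', 'root']
parentChain('az-Arab'); // ['root']
```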

The trickier part is that the above steps work (and only work) if you start at the right place (the right locale / the right bundle). For example, starting the parent locale chain of en-GB at en-Latn-GB will wrongly produce en-Latn-GB -> en-Latn -> en -> root, bypassing en-001. This is currently implemented incorrectly in Chrome and Firefox, as pointed out by the proposal https://github.com/rxaviers/ecma402-fix-lookup-matcher. It also demonstrates that you simply cannot start your chain at the maxLanguageId created by augmenting a locale with its likelySubtags info (that info is very useful, but not sufficient on its own). The correct chain for en-Latn-GB (or en-GB), assuming such data is available, is en-GB -> en-001 -> en -> root.

Another important point is that "where to start" depends on your available data. For example, as of today, CLDR has the following bundles ('where to start's) for Chinese: zh, zh-Hans, zh-Hant, zh-Hans-{HK,MO,SG}, and zh-Hant-{HK,MO}. Note that zh-Hans-CN and zh-Hant-TW are empty because they are the default locales for their respective parent bundles. Therefore, the correct parent locale chain for zh (or zh-CN, or zh-Hans-CN) is zh-Hans -> zh -> root. The parent locale chain for zh-TW (or zh-Hant, or zh-Hant-TW) is zh-Hant -> root. The parent locale chain for zh-HK (or zh-Hant-HK) is zh-Hant-HK -> zh-Hant -> root.

CLDR/UTS#35 specifies that you should use Language Matching in order to find the 'where to start's: "The table Lookup Differences uses the naïve resource bundle lookup for illustration. More sophisticated systems will get far better results for resource bundle lookup if they use the algorithm described in Section 4.4 Language Matching. That algorithm takes into account both the user's desired locale(s) and the application's supported locales, in order to get the best match".

As you may have noticed, Language Matching works for more than simply Lookup Matching: it is a Best Fit Matcher, which is more generic than a Lookup Matcher, given that you can assign weights to the desired languages and so on. My proposed algorithm (the one linked above) is derived from it, but is specific to Lookup Matching, so it's simpler and requires no data beyond likelySubtags. I suggest we use that.
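A toy version of the "find where to start" step might look like this, using a hand-copied excerpt of likelySubtags data (the names lookupStart and maximize are illustrative, and the maximization and candidate order are simplified relative to real CLDR matching):

```javascript
// Toy excerpt of CLDR likelySubtags data; real maximization handles many
// more shapes of input than bare-language and exact keys.
const LIKELY = { 'en': 'en-Latn-US', 'zh': 'zh-Hans-CN', 'zh-TW': 'zh-Hant-TW' };

function maximize(locale) {
  return LIKELY[locale] || locale;
}

// Find the bundle ("where to start") by trying successively more general
// forms of the maximized locale against the available bundles.
function lookupStart(requested, available) {
  const [lang, script, region] = maximize(requested).split('-');
  const candidates = [
    [lang, script, region], [lang, script], [lang, region], [lang],
  ].map(parts => parts.filter(Boolean).join('-'));
  return candidates.find(c => available.includes(c));
}

lookupStart('zh-TW', ['zh', 'zh-Hans', 'zh-Hant']); // 'zh-Hant'
lookupStart('zh',    ['zh', 'zh-Hans', 'zh-Hant']); // 'zh-Hans'
```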

Just let me know if you have any additional questions or ideas.

@caridy
Contributor

caridy commented Aug 10, 2017

@zbraniecki do you plan to formalize this any time soon? Do you need help?

@brettz9

brettz9 commented Dec 20, 2017

If an application has only an "en-GB" locale (and no "en" or "en-US" locale), that "en-GB" locale ought to be suitable for presentation to "en-US" users.

It would therefore seem to me that, in considering getParentLocales, some thought should also be given to something like getMutuallyComprehensibleSiblingLocales. Is this a feasible possibility?

(Pardon me if this is jumping the gun...)

@zbraniecki
Member Author

@brettz9 what you're looking for is a language negotiation strategy. Such a strategy may take the requested locale en-US and accept en-GB as a suitable match.

There are many ways to negotiate languages. One is described in RFC4647 - https://www.ietf.org/rfc/rfc4647.txt - but it is far from the only one. ECMA402 uses a different one, and, for example, Fluent (a library I am working on) uses yet another, described here: https://github.com/projectfluent/fluent.js/blob/master/fluent-langneg/src/matches.js#L6

The nature of such negotiation depends on your needs, and I doubt it can be unified. For example, Fluent always negotiates between a list of requested locales and a list of available locales, and offers three different strategies for doing so:

  • match all possible available locales, ordered by the requested list
  • match a single best available locale for each requested locale
  • match a single best locale for the requested locales list as a whole

RFC4647 and the HTTP Accept-Language header recommend a different strategy that involves calculating proximity and assigning weights.
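As a rough illustration of the first Fluent strategy listed above (matching all available locales in requested order), here is a toy sketch; it is not fluent-langneg's actual code, and the function name negotiate is illustrative:

```javascript
// Toy implementation of "match all available locales in requested order":
// collect every available locale that matches some requested locale,
// with exact matches ranked before same-language fallbacks.
function negotiate(requested, available) {
  const result = [];
  const push = (loc) => { if (!result.includes(loc)) result.push(loc); };
  for (const req of requested) {
    const lang = req.split('-')[0];
    for (const avail of available) {
      if (avail === req) push(avail); // exact match first
    }
    for (const avail of available) {
      if (avail.split('-')[0] === lang) push(avail); // same-language fallback
    }
  }
  return result;
}

negotiate(['fr-CA', 'en-US'], ['en-GB', 'fr', 'de']); // ['fr', 'en-GB']
```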

The shared bottom line is that there's a trap in every way of thinking about negotiating BCP47 tags: the naive approach of "just cut at each hyphen and you'll get a more general locale". That thinking comes from the assumption that if you have en-Latn-US-macos, then cutting out the variant gives you a more general locale, then cutting out the region, then the script, and so on.

Unfortunately, that thinking is wrong for a number of locales. An example is sr, which has the likely subtags sr-Cyrl-RS. That means that if your requested locale is sr-Latn, you cannot just cut out the script portion and match against sr - that would give you the wrong script, which is very rarely what the user wants. A similar issue exists with zh-Hans and zh-Hant.

The whole list of likely subtags is here: https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/likelySubtags.json

My current thinking is that instead of this API, what we really want is a getLikelySubtags API that expands sr to sr-Cyrl-RS and so on.

That also matches what ICU is exposing - http://www.icu-project.org/apiref/icu4c/uloc_8h.html#a0cb2dcd65f745e7a966a729395499770

I haven't had time to formalize it, but based on my work on the fluent-langneg and fluent-locale packages, I believe this is the API we should expose to enable language negotiation algorithms (and what we should use internally in ECMA402).
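A toy sketch of such a getLikelySubtags API, using a hand-copied excerpt of CLDR likelySubtags data (the fallback logic is simplified relative to the real CLDR "Add Likely Subtags" algorithm, which also consults language-script and language-region keys):

```javascript
// Hand-copied excerpt of CLDR likelySubtags; the real table is much larger.
const LIKELY_SUBTAGS = {
  'sr': 'sr-Cyrl-RS',
  'zh': 'zh-Hans-CN',
  'zh-TW': 'zh-Hant-TW',
  'en': 'en-Latn-US',
};

function getLikelySubtags(locale) {
  if (LIKELY_SUBTAGS[locale]) return LIKELY_SUBTAGS[locale];
  // Otherwise expand from the bare language's entry, keeping any explicitly
  // given script (4-letter) and region (2-letter) subtags.
  const parts = locale.split('-');
  const base = LIKELY_SUBTAGS[parts[0]];
  if (!base) return locale;
  const [lang, script, region] = base.split('-');
  const explicitScript = parts.find((p, i) => i > 0 && p.length === 4);
  const explicitRegion = parts.find((p, i) => i > 0 && p.length === 2);
  return [lang, explicitScript || script, explicitRegion || region].join('-');
}

getLikelySubtags('sr');      // 'sr-Cyrl-RS'
getLikelySubtags('sr-Latn'); // 'sr-Latn-RS'
getLikelySubtags('en-GB');   // 'en-Latn-GB'
```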

@brettz9

brettz9 commented Dec 25, 2017

While you make a good point, @zbraniecki, that there are a number of possible strategies, the lack of suitability of mandating a single high-level strategy does not obviate the desirability of more comprehensive lower-level options.

So yes, while I am interested in building a language negotiation strategy, I'm really looking for fundamentals that can help developers compose their own complete strategies without needing to drag a lot of supplementary data into their apps. I think we need more than getLikelySubtags to do this, as hopeful as I am that we can indeed also get getLikelySubtags.

getLikelySubtags could, as you intimate, not only allow detection of the default language direction (for locales without an explicit script) but also support the aspect of language negotiation that you use in fluent-langneg, namely expanding a generic requested locale like "en" to "en-Latn-US" so that it could match a more precisely expressed available locale like "en-Latn-US". (And going the other direction, if someone requested "en-Latn-US" and only "en" was available, one could also (without naivete) apply getLikelySubtags to the available locale and see if it matched the more precise requested locale.)

But these techniques don't cover the use case raised by @rxaviers at #87 (comment) of determining, for example, that "en-GB" matches "en-001" more closely than "en-US" or "en". (Technically, I think "world English" may in some cases be less helpful for "en-GB" readers than "en-US", as I would think many used to British English would prefer U.S. English over an overly simplified international English; but if the nebulous "world English" concept is instead taken to mean merely avoiding country-specific regionalisms, then it could indeed be useful to fall back to the more generic "en-001" for "en-GB" readers, as appears to be the intent of this hierarchy.)

getLikelySubtags would not help with this use case, and, as discussed, neither would lopping off the region. But an API for getParentLocales could return "en-001" as the parent of "en-GB" and thus meet the needs of some language negotiation strategies. I don't think this is too strategy-specific to expose, as it merely makes use of the helpful CLDR data and could be part of any number of higher-level strategies.

Another consideration in all of this is that locale APIs need not require specifying available locales ahead of time--an implementation could, for example, lazily check for the existence of an "en-001" file if "en-GB" was not found, and an "en" file if that was not found (though it would admittedly be more optimal in general to require specification of the available locales). This lazy checking could have some appeal when working with simple client-side apps where one doesn't wish to add a build step to specify the available locales (nor track them manually) but where one might be caching the result anyway.

As for my suggestion of getMutuallyComprehensibleSiblingLocales: this would also help avoid needing to know the available locales, as one could lazily check for other siblings. But given that it would probably be more optimal to specify the locales ahead of time, one could probably get by with getParentLocales: checking whether a requested locale and an available locale share a non-root ancestor (which "en-US" and "en-GB" do ("en"), while "zh-Hant" and "zh-Hans" do not).

(Incidentally, here too there may be Chinese-language readers who would prefer having characters (simplified or traditional) contrary to their wont over having none at all, given that the character sets overlap and one can often be guessed at by those familiar with the other. But the fact remains that it would be useful to be able to compose such a strategy if desired.)

I worry, though, that my workaround of looking at shared ancestors, even when accounting for the likes of en-GB -> en-001, may still be too naive in certain edge cases. E.g., if "yue" (Cantonese) gets "zh" as an ancestor (IANA lists "zh" as a macrolanguage for "yue" at least), Cantonese is not mutually comprehensible, in auditory form, with "cmn" (Mandarin), despite both sharing the same macrolanguage. I believe some applications use language info for non-script purposes such as text-to-speech, in which case getParentLocales wouldn't be enough either.
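That shared-ancestor check can be sketched as follows, assuming a parentChain helper that behaves like the getParentLocales discussed in this thread (preferring CLDR parentLocales entries and ending at "root"); the data excerpt and function names are illustrative:

```javascript
// Excerpt of CLDR parentLocales data; note zh-Hant's parent is root,
// which is what keeps zh-Hant and zh-Hans from sharing a non-root ancestor.
const PARENT_LOCALES = { 'en-GB': 'en-001', 'zh-Hant': 'root' };

function parentChain(locale) {
  const chain = [];
  let current = locale;
  while (current !== 'root') {
    const cut = current.lastIndexOf('-');
    current = PARENT_LOCALES[current] ||
      (cut === -1 ? 'root' : current.slice(0, cut));
    chain.push(current);
  }
  return chain;
}

// Return the first non-root ancestor (or the locale itself) that the two
// locales have in common, or undefined if there is none.
function sharedNonRootAncestor(a, b) {
  const ancestorsA = new Set([a, ...parentChain(a)]);
  ancestorsA.delete('root');
  return [b, ...parentChain(b)].find(x => x !== 'root' && ancestorsA.has(x));
}

sharedNonRootAncestor('en-US', 'en-GB');     // 'en'
sharedNonRootAncestor('zh-Hant', 'zh-Hans'); // undefined
```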

So my personal preference would be to see getLikelySubtags and getParentLocales. getMutuallyComprehensibleSiblingLocales would be a bonus, but not as fundamentally critical for APIs that require provision of a list of all available locales--unless my approach of obtaining common ancestors is too naive, in which case I think we could benefit from it as well.

@sffc sffc added s: help wanted Status: help wanted; needs proposal champion c: locale Component: locale identifiers and removed enhancement labels Mar 19, 2019
@sffc sffc added Proposal Larger change requiring a proposal s: comment Status: more info is needed to move forward and removed s: help wanted Status: help wanted; needs proposal champion labels Jun 5, 2020
@sffc sffc removed this from the 4th Edition milestone Jun 5, 2020