Fallback behavior for extension keywords and auxiliary keys #3867

Closed
sffc opened this issue Aug 15, 2023 · 6 comments · Fixed by #5743
Labels
C-data-infra Component: provider, datagen, fallback, adapters

Comments

@sffc
Member

sffc commented Aug 15, 2023

tl;dr: which of the following is the correct fallback order, assuming "short" is the fallback for "long" in the aux key?

  1. Option 1
    • ar-EG-u-nu-latn/long
    • ar-EG/long
    • ar-u-nu-latn/long
    • ar/long
    • und-u-nu-latn/long
    • und/long
    • ar-EG-u-nu-latn/short
    • ar-EG/short
    • ar-u-nu-latn/short
    • ar/short
    • und-u-nu-latn/short
    • und/short
  2. Option 2
    • ar-EG-u-nu-latn/long
    • ar-EG-u-nu-latn/short
    • ar-EG/long
    • ar-EG/short
    • ar-u-nu-latn/long
    • ar-u-nu-latn/short
    • ar/long
    • ar/short
    • und-u-nu-latn/long
    • und-u-nu-latn/short
    • und/long
    • und/short
  3. Option 3
    • ar-EG-u-nu-latn/long
    • ar-u-nu-latn/long
    • und-u-nu-latn/long
    • ar-EG/long
    • ar/long
    • und/long
    • ar-EG-u-nu-latn/short
    • ar-u-nu-latn/short
    • und-u-nu-latn/short
    • ar-EG/short
    • ar/short
    • und/short

There are probably more orderings.
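
For comparison, the three orderings above can be written as three different nestings of the same loops. This is only an illustrative sketch: `FallbackOrder`, `candidates`, and the hard-coded axes are hypothetical names for this example, not ICU4X APIs.

```rust
// The three axes that fallback can walk, hard-coded for the example in this issue.
const LANGS: [&str; 3] = ["ar-EG", "ar", "und"];
const KEYWORDS: [&str; 2] = ["-u-nu-latn", ""];
const AUX: [&str; 2] = ["long", "short"];

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FallbackOrder {
    Option1,
    Option2,
    Option3,
}

/// Lists the candidates in the order they would be tried.
/// The outermost loop is the axis that falls back last.
pub fn candidates(order: FallbackOrder) -> Vec<String> {
    let mut out = Vec::new();
    match order {
        // Aux key last, then language/region, keyword first.
        FallbackOrder::Option1 => {
            for aux in AUX {
                for lang in LANGS {
                    for kw in KEYWORDS {
                        out.push(format!("{lang}{kw}/{aux}"));
                    }
                }
            }
        }
        // Language/region last, then keyword, aux key first.
        FallbackOrder::Option2 => {
            for lang in LANGS {
                for kw in KEYWORDS {
                    for aux in AUX {
                        out.push(format!("{lang}{kw}/{aux}"));
                    }
                }
            }
        }
        // Aux key last, then keyword, language/region first.
        FallbackOrder::Option3 => {
            for aux in AUX {
                for kw in KEYWORDS {
                    for lang in LANGS {
                        out.push(format!("{lang}{kw}/{aux}"));
                    }
                }
            }
        }
    }
    out
}
```

Each call returns twelve candidates; other nestings of the same three loops give the "more orderings" mentioned above.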

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Aug 15, 2023
@sffc
Member Author

sffc commented Aug 15, 2023

My commentary:

  1. The aux key is the most important piece of the locale and therefore is the last thing that should undergo fallback.
  2. Extension keywords are also important because they are an explicit preference.
  3. We can control what goes into the root locale, so we should only put things in there that are sensible as a last-resort fallback. For example, we should not have any "long" data in root, only "short" data.

I therefore think I prefer option 3.

@sffc
Member Author

sffc commented Oct 19, 2023

Discuss with:

@sffc sffc added discuss-priority Discuss at the next ICU4X meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Oct 19, 2023
@sffc
Member Author

sffc commented Oct 19, 2023

  • @robertbastian - Aux key should go last because it is usecase defined. Transliterator and datetime will use this differently. It's not considered fallback; aux keys should be ignored in everything we call fallback in icu4x.
  • @Manishearth - A few observations. (1) For datetime, either a locale has none of the data or some of the data. Let's say we have ar and ar-EG, and we want ar long, and ar-EG only has short, I'd prefer to go to ar-EG short first. That's logic that can be written in datetime. The thing that's missing in response metadata is number of fallback steps performed. The algorithm for datetime symbols I'd like to run is: (a) look for the aux key you're actually looking for; (b) if fallbacking was performed, perform fallback yourself, with another request; (c) if you found a solution that uses fewer steps of fallback, use it. A different way would be to request, tell me which aux keys are available for a certain key and locale, returning the first locale that has any data. This is similar in behavior to what we do currently.
  • @sffc - What I think is a good outcome: no matter which option we choose here, we can make sure that the outcome of data resolution is correct by ensuring the datagen provider always outputs the correct data for a particular locale with an aux key. The case we're worried about is where ar/long has different data from ar-EG/short, and if you request ar-EG/long you get different data than what you want (which is ar-EG/short). We do have a lot of cases where we have absences, like standalone vs format. The way we resolve this is that the datagen provider always outputs the full set of aux subtags, post dedup. Then we have very powerful ways of doing this fallback in datagen, where we can still strip things from datagen where the behavior is idempotent.
  • @sffc - Why this proposal: We have a lot of evidence that complex fallbacking incurs a lot of runtime cost, and we are wary of making fallback too complicated in, e.g., datetime's constructor. That's slow and also error prone. It is nice to resolve this at datagen time. The second reason is that we don't use the DataResponseMetadata much and we aren't sure we're populating it correctly. We may want to remove it.
  • @Manishearth - Not convinced of runtime cost but I believe ICU4C has this problem. Because we store locales, this still balloons data size overall. Another way to solve the problem (not sure I like it yet) is adding an API which lets you resolve a locale without aux keys and get an iterator because currently all data providers store them adjacently.
  • @robertbastian - An iterator over the aux keys?
  • @Manishearth - Yeah.... you say you want a locale for all aux keys and it either tells you what the aux keys are or it gives you an iterator over them. There is at least one way we can do this without changing data provider APIs.
    • Design one: New API: DataProvider::load_all_aux(locale) -> Iterator<Response>
    • Design two: AuxKeyQueryMarker<DateSymbolsMarker>, always returns value of type AuxKeyList or AuxKeyIterator. Buffer/etc providers are tweaked to recognize the key.
  • @robertbastian - The way I understand you is that you want to deduplicate across aux keys at datagen time? Datagen should not know how to fall back between aux keys.
  • @sffc - Only impl DataProvider<MonthSymbolsV1> for DatagenProvider is aware of long/short aux keys. It will emit ar-EG/long even if there is no explicit CLDR data for that combination.
  • @echeran - Can you clarify the desired behavior?
  • @Manishearth - The behavior we wish to attain for datetimeformat is when it attempts to load e.g. monthsymbols, it will get them from the first locale in the fallback chain with any aux keys on monthsymbols whatsoever (so if we are requesting ar-EG/long we get ar-EG/short before we get ar/long)
  • Possible solutions:
  1. Make the standard fallback adapter have some specific magic behavior around aux. (solve problem in icu_locid_transform)
  2. In datagen, always generate data such that regular fallbacking will always produce the desired behavior. Types like DateTimeFormat do nothing fancy! (solve problem in icu_datagen)
  3. Perform the fallbacking in types like DateTimeFormat using additional APIs like "fallback iteration count" or "load all aux keys". (solve problem in component crate like icu_datetime)
  • @robertbastian - We're focused a lot on this ar-EG/long issue. I'm not convinced option 2 is the best solution for all use cases. I'm not in favor of option 1 either.
  • @Manishearth - I think solution 2 also could work for currencies.
  • @sffc - Option 1 is infectious, impacting all components, even ones that don't use aux keys. Option 2 and 3 are component by component. We can make that call later. For currencies we may want that iterator.
  • @echeran - This sounds like a space/time tradeoff.
  • @sffc - Good point; and actually option 2, though it increases postcard size, may also help reduce code size since less special logic is required in the constructor.
  • @sffc - The horizontal fallback that we're worried about is a CLDR optimization. We should try to get rid of CLDR optimizations in datagen in general so that we only have ICU4X optimizations applied. It does not make sense to put CLDR horizontal fallback algorithms into runtime constructors in general.
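
The behavior @Manishearth describes for datetime (take the first locale in the chain that has any usable aux key, so ar-EG/short wins over ar/long) could be sketched at runtime roughly as follows. This is a toy model under stated assumptions: `Store` and `resolve` are hypothetical names, the store is a plain map rather than a data provider, and a locale that has data only for aux keys outside the preference list is skipped.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy stand-in for a data provider: locale -> aux keys that have data.
pub type Store = BTreeMap<&'static str, BTreeSet<&'static str>>;

/// Walks the locale chain first and the aux preference list second, so a
/// closer locale with a less-preferred aux key beats a more distant locale
/// with the preferred one.
pub fn resolve<'a>(
    store: &Store,
    locale_chain: &[&'a str],
    aux_preference: &[&'a str],
) -> Option<(&'a str, &'a str)> {
    for &locale in locale_chain {
        if let Some(available) = store.get(locale) {
            for &aux in aux_preference {
                if available.contains(aux) {
                    return Some((locale, aux));
                }
            }
        }
    }
    None
}
```

With ar-EG holding only "short" and ar holding both, a request for the chain ar-EG, ar, und with the preference long, short resolves to ar-EG/short.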

Conclusion: Use either 2 or 3 on a component-by-component basis. Different components have different needs.

LGTM: @sffc @Manishearth @echeran
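
A minimal sketch of what solution 2 could look like at datagen time, assuming a simple map from (locale, aux key) to payload: every locale that has the fallback aux key also gets an entry for the wanted aux key, so plain locale fallback at runtime finds ar-EG/long before ar/long. `Data` and `fill_aux` are hypothetical names, and the deduplication step mentioned in the notes is left out.

```rust
use std::collections::BTreeMap;

/// (locale, aux key) -> payload. A toy stand-in for generated data.
pub type Data = BTreeMap<(String, String), String>;

/// `horizontal` lists (wanted_aux, fallback_aux) pairs, e.g. ("long", "short").
/// For each locale that has `fallback_aux` but lacks `wanted_aux`, emit
/// `wanted_aux` with the fallback payload. Existing entries are never overwritten.
pub fn fill_aux(source: &Data, horizontal: &[(&str, &str)]) -> Data {
    let mut out = source.clone();
    for (wanted, fallback) in horizontal {
        for ((locale, aux), payload) in source {
            if aux.as_str() == *fallback {
                out.entry((locale.clone(), (*wanted).to_string()))
                    .or_insert_with(|| payload.clone());
            }
        }
    }
    out
}
```

The trade-off raised in the discussion applies here: the output has more entries than the source, in exchange for a constructor that needs no aux-key logic.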

@sffc sffc added discuss Discuss at a future ICU4X-SC meeting and removed discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Oct 19, 2023
@sffc
Member Author

sffc commented Oct 19, 2023

We still need to discuss the part about Unicode extension keyword fallback priority.

Discuss with:

Optional:

@sffc sffc added discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band C-data-infra Component: provider, datagen, fallback, adapters labels Oct 19, 2023
@sffc sffc added this to icu4x 2.0 Mar 11, 2024
@sffc sffc moved this to Investigate in icu4x 2.0 Mar 11, 2024
@robertbastian
Member

How do DataKeyAttributes behave in fallback?

// exhaustive
struct DataRequest<'a> {
    pub langid: &'a LanguageIdentifier,
    pub attributes: &'a DataKeyAttributes,
    pub metadata: DataRequestMetadata,
}

pub struct DataKeyAttributes(DataKeyAttributesInner);

// Bump this if there's a need for more space. 
// 8 is currently needed by components that use BCP subtags as attributes
// (collator, transliterator).
const DATA_KEY_ATTRIBUTES_RUNTIME_SIZE: usize = 8;

enum DataKeyAttributesInner {
    Static(&'static [&'static str]),
    Runtime(ShortVec<TinyAsciiStr<DATA_KEY_ATTRIBUTES_RUNTIME_SIZE>>),
}
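
To make the two-variant layout concrete, here is a std-only sketch that compiles on its own. It is not the proposed implementation: `Vec<String>` stands in for `ShortVec<TinyAsciiStr<8>>`, and `from_static`, `try_from_runtime`, and `to_vec` are hypothetical helper names introduced for illustration.

```rust
const DATA_KEY_ATTRIBUTES_RUNTIME_SIZE: usize = 8;

enum DataKeyAttributesInner {
    // Attributes known at compile time; no allocation.
    Static(&'static [&'static str]),
    // Stand-in for ShortVec<TinyAsciiStr<8>>: attributes built at runtime.
    Runtime(Vec<String>),
}

pub struct DataKeyAttributes(DataKeyAttributesInner);

impl DataKeyAttributes {
    pub const fn from_static(attrs: &'static [&'static str]) -> Self {
        Self(DataKeyAttributesInner::Static(attrs))
    }

    /// Returns None if any attribute is empty, non-ASCII, or longer than
    /// DATA_KEY_ATTRIBUTES_RUNTIME_SIZE bytes.
    pub fn try_from_runtime(attrs: &[&str]) -> Option<Self> {
        let ok = attrs.iter().all(|a| {
            !a.is_empty() && a.is_ascii() && a.len() <= DATA_KEY_ATTRIBUTES_RUNTIME_SIZE
        });
        ok.then(|| {
            Self(DataKeyAttributesInner::Runtime(
                attrs.iter().map(|a| a.to_string()).collect(),
            ))
        })
    }

    /// Uniform view over both variants.
    pub fn to_vec(&self) -> Vec<&str> {
        match &self.0 {
            DataKeyAttributesInner::Static(s) => s.to_vec(),
            DataKeyAttributesInner::Runtime(v) => v.iter().map(String::as_str).collect(),
        }
    }
}
```

The size check mirrors the 8-byte limit in the snippet above; a real `TinyAsciiStr` would enforce it in the type rather than at construction.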

The data key attributes need not participate in fallback. They can be resolved in datagen. The constructor is allowed to fall back from one attribute to another, such as when the langid reaches und. This is compatible with preresolved fallback since it always occurs.

Segmentation model fallback can be data-driven in the segmenter constructor.

Notes for collation fallback order:

  • yue = yue-Hant > und-Hant > und
  • zh-TW = zh-Hant-TW > zh-Hant > und-Hant > und
  • yue-CN = yue-Hans-CN > yue-Hans > und-Hans > und
  • zh = zh-Hans > und-Hans > und
  • zh-u-co-stroke ~> zh = zh-Hans > und-Hans > und, with -x-stroke data key attribute

The locales that are populated with data:

  • und-Hant (contains stroke data)
  • und-Hans (contains pinyin data)
  • und-Hani-x-pinyin
  • und-Hani-x-stroke
  • und-Hani-x-zhuyin

This uses a new script fallback mode:

  • always normalizes by adding the script using likely subtags
  • then chops off region, followed by language
  • contains extra parents to allow defining stuff for "generalized chinese things"
    • und-Hant > und-Hani
    • und-Hans > und-Hani
  • eventually reaches und

The same mode will be usable for transliterator.
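
The script fallback mode described above can be sketched as a chain builder. Everything here is an assumption made for illustration: `likely_script` is a tiny hard-coded stand-in for likely-subtags data covering only the locales in these notes, `extra_parent` encodes only the und-Hant and und-Hans parents listed above, and the sketch visits und-Hani on every chain that passes through und-Hant or und-Hans, which the chains in the collation notes do not spell out.

```rust
/// Hypothetical stand-in for likely-subtags data (only the cases from the notes).
fn likely_script(lang: &str, region: Option<&str>) -> Option<&'static str> {
    match (lang, region) {
        ("zh", Some("TW")) => Some("Hant"),
        ("zh", _) => Some("Hans"),
        ("yue", Some("CN")) => Some("Hans"),
        ("yue", _) => Some("Hant"),
        _ => None,
    }
}

/// Extra parents for "generalized Chinese" data.
fn extra_parent(script: &str) -> Option<&'static str> {
    match script {
        "Hant" | "Hans" => Some("Hani"),
        _ => None,
    }
}

pub fn script_fallback_chain(
    lang: &str,
    script: Option<&str>,
    region: Option<&str>,
) -> Vec<String> {
    let mut chain = Vec::new();
    // Step 1: normalize by adding the script if it is missing.
    let script = script.or_else(|| likely_script(lang, region));
    if let Some(script) = script {
        if let Some(region) = region {
            chain.push(format!("{lang}-{script}-{region}"));
        }
        // Step 2: chop off the region, then the language.
        chain.push(format!("{lang}-{script}"));
        chain.push(format!("und-{script}"));
        // Step 3: follow the extra script parent, if any.
        if let Some(parent) = extra_parent(script) {
            chain.push(format!("und-{parent}"));
        }
    } else {
        // No script known: fall back by language and region only.
        if let Some(region) = region {
            chain.push(format!("{lang}-{region}"));
        }
        chain.push(lang.to_string());
    }
    // Step 4: always end at und.
    chain.push("und".to_string());
    chain.dedup(); // avoids repeated entries when lang is already "und"
    chain
}
```

For example, zh-TW expands to zh-Hant-TW, zh-Hant, und-Hant, und-Hani, und under these assumptions.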

LGTM: @robertbastian @sffc

@sffc
Member Author

sffc commented Apr 22, 2024

The rewrite of this code should incorporate the new CLDR 45 fallback rules (#4782).

@robertbastian robertbastian removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jun 27, 2024
@robertbastian robertbastian added this to the ICU4X 2.0 ⟨P1⟩ milestone Jun 27, 2024
Projects
Status: Done
2 participants