ECMA-402 compatibility for the components::Bag #645
Labels: C-datetime (Component: datetime, calendars, time zones), S-epic (Size: Major project, create smaller child issues), T-task (Type: Tracking thread for a non-code task)
Edit:
This issue originally hosted discussion about the approach for the components bag; it now serves as the tracking issue for ECMA-402 compatibility for the components::Bag.
Current scope and status of specified work
Original outdated discussion:
For ICU4X 0.2, we are shipping the first support for the components::Bag. This API is what will back our ECMA-402 compatibility. Mozilla will need this support in order to validate ICU4X as a good architectural fit. The current partial implementation is probably good enough for some initial validation, but it cannot be used in production without completing more of the scope of work.
Balancing those needs, we have also iterated through various ideas for improving the components bag API that would require broader changes to the CLDR data.
Remaining questions on ECMA-402 support
I'm somewhat unclear on what other scope of work needs to be completed for full ECMA-402 support. There are a few fields that are not yet supported. Are non-Gregorian calendar systems required for compatibility here? It would be good to make this explicit; I would assume test262 would reveal any gaps.
Improvements to the translations
One of the areas for improvement identified through the work above is a change to the model of how the components bag gets built. Currently, both ECMA-402 and ICU4X's components bag work with a per-component toggle and a per-component "length" designation. The proposed change to improve the translations is to keep the per-component toggle, but apply a single "length" per grouping, e.g. date length, time length, time zone length.
The proposed advantages are that translations can provide higher fidelity for the overall grouping, and the total combinatorial logic for the components bag is reduced. #605 outlines some thoughts on how to group those components, and arrives at a total of 3075 component bag combinations.
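To make the shape of the proposal concrete, here is a minimal sketch of the grouped-length model: per-component booleans, plus one length knob per group rather than per component. All names here (ComponentsBag, GroupLength, the field names) are hypothetical illustrations, not the real ICU4X API.

```rust
// Hypothetical sketch of the proposed model: individual component
// toggles, plus one length per *group* instead of per component.
#[derive(Clone, Copy, Debug, PartialEq)]
enum GroupLength {
    Short,
    Medium,
    Long,
}

#[derive(Default, Debug)]
struct ComponentsBag {
    // Per-component toggles: which fields appear at all.
    year: bool,
    month: bool,
    day: bool,
    hour: bool,
    minute: bool,
    time_zone: bool,
    // Grouped lengths: one knob per group, not one per component.
    date_length: Option<GroupLength>,
    time_length: Option<GroupLength>,
    time_zone_length: Option<GroupLength>,
}

fn main() {
    // "Medium date with a short time", expressed with group lengths.
    let bag = ComponentsBag {
        year: true,
        month: true,
        day: true,
        hour: true,
        minute: true,
        date_length: Some(GroupLength::Medium),
        time_length: Some(GroupLength::Short),
        ..Default::default()
    };
    assert!(bag.year && bag.hour && !bag.time_zone);
    assert_eq!(bag.date_length, Some(GroupLength::Medium));
}
```

The design point this illustrates: translators provide one translation per (toggled components, group lengths) tuple, instead of every component carrying its own independent length.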
Currently, the CLDR includes, at minimum, somewhere around 50 combinations per locale. The following mechanisms act as a "compression algorithm" to reduce the total number of combinations, and to allow algorithmic expansion of those combinations into a final pattern. The steps are:
a. Glue together date matches and time matches into a single pattern. (completed #617)
b. Expand the length of individual components
c. Use append items for missing components
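Step (a) is the simplest to show. UTS 35 defines a per-locale "glue" pattern (dateTimeFormat) in which {1} is replaced by the matched date pattern and {0} by the matched time pattern. The glue and skeleton patterns below are illustrative, not loaded from real CLDR data.

```rust
// Step (a) sketched: combine a matched date pattern and a matched
// time pattern with a locale glue pattern, per UTS 35 dateTimeFormat,
// where {1} is the date pattern and {0} is the time pattern.
fn glue(glue_pattern: &str, date: &str, time: &str) -> String {
    glue_pattern.replace("{1}", date).replace("{0}", time)
}

fn main() {
    let result = glue("{1}, {0}", "MMM d, y", "h:mm a");
    assert_eq!(result, "MMM d, y, h:mm a");

    // A longer style might carry a literal word in the glue pattern:
    let long = glue("{1} 'at' {0}", "MMMM d, y", "h:mm:ss a z");
    assert_eq!(long, "MMMM d, y 'at' h:mm:ss a z");
}
```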
Of these steps, b and c are somewhat controversial. It turns out that the length expansions have some guard rails, in that skeletons can provide overrides where the expansions would produce nonsensical translations. In fact, there is an implicit assumption that the skeletons will be expanded, and that only the bare minimum of patterns needs to be provided, as a form of compression in the CLDR data itself.
Append items are where things get tricky. There are some clearly orthogonal components where append items are non-controversial (e.g. time zones). However, the existing UTS 35 specification only guarantees that a result will be computed, not that the result will be high quality. That said, I don't know of specific low-quality translations in this area.
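Step (c) can be sketched the same way. In UTS 35, an appendItem pattern takes {0} = the pattern so far, {1} = the pattern for the appended field, and {2} = the field's display name. The concrete patterns below are illustrative placeholders, not real CLDR data.

```rust
// Step (c) sketched: appending a component that the best-match
// skeleton did not cover, using a UTS 35 appendItem pattern.
// {0} = existing pattern, {1} = appended field, {2} = field name.
fn append_item(append_pattern: &str, pattern: &str, field: &str, name: &str) -> String {
    append_pattern
        .replace("{0}", pattern)
        .replace("{1}", field)
        .replace("{2}", name)
}

fn main() {
    // The time zone was requested but missing from the matched
    // pattern, so it is appended mechanically. This always produces
    // *a* result, but nothing guarantees it reads naturally.
    let appended = append_item("{0} {1}", "h:mm a", "v", "");
    assert_eq!(appended, "h:mm a v");
}
```

This mechanical concatenation is exactly why the text above calls the result "computed" but not guaranteed high quality.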
Bounding potential component combinations
The CLDR data and ICU4C operate on the assumption that the data can be combined in any way, with an algorithmic result generated from the combination. However, there are ideas for getting around this by skipping steps a, b, and c entirely. This is outlined in #605, and works by restricting the inputs to the components bag and enumerating every combination; as stated in #605, this is tentatively 3075 combinations. With this approach the algorithmic steps (a, b, c) are not needed, since the CLDR data would be completely specified.
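The bounding idea can be sketched as a cross product over restricted per-group option sets. The option counts below are made up purely for illustration; #605 derives the actual tentative bound of 3075.

```rust
// Illustrative only: restrict each group to a small, enumerable set
// of allowed options (including "absent") and take the cross product,
// so every resulting pattern can be fully specified in the data
// instead of being derived algorithmically at runtime.
fn main() {
    let date_options = 5; // e.g. none, "M/d", "MMM d", "MMM d, y", "MMMM d, y"
    let time_options = 4; // e.g. none, "h:mm", "h:mm:ss", "h:mm:ss a"
    let zone_options = 3; // e.g. none, short zone, long zone

    let total = date_options * time_options * zone_options;
    assert_eq!(total, 60); // hypothetical counts, not the real 3075
}
```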
In my mind, here are the steps to accomplish this, presented as an unordered list.
Incremental approach
The CLDR proposal deadline is fast approaching on May 19, 2021. The remaining time is not enough to flesh out a high-quality proposal for an improved model, and to validate it with stakeholders and against ECMA-402. There has been a lot of theoretical talk of improvements, without data-driven proofs of concept. The work that landed in the last week, together with Mozilla's integration of ICU4X in Gecko, will provide helpful validation for the changes here.
There's also the principle that you bring design proposals, not design problems, to the CLDR group. At this point, I don't feel confident that we have a complete solution for the generated combinations and the compression that can be applied to that data. This still seems like an active design problem.
I don't think missing this year's deadline for the CLDR change will be a failure in pursuing these improvements. Working on the infrastructure incrementally will give us a good set of data-driven validations to run through. In addition, with the Gecko integration, we gain access to the full test262 suite to ensure spec compatibility.
I believe finishing the UTS 35 skeleton matching algorithm (steps a, b, c above) will unblock the current work, and is a minimal investment for higher quality results in the short term. This buys us time to figure out exactly how to define the CLDR data and to validate that it produces good results. It also mitigates the risk of proposing a large change for the CLDR translators that only theoretically provides an improvement. It would be good to validate these changes with our current system, and with our partners and users of ICU4X.
Proposal for 1.1
In my mind, this work is well positioned to align with the 1.1 or 1.2 release of ICU4X, as we will have a good set of features completed by then. The design can be an ongoing process that does not block other priority work. These changes will be a good quality boost and are worth pursuing. They can be made internally while maintaining ECMA-402 support; once completed, the new API can be swapped in.