Finalize input data model for DateTimeFormat #355
Comments
Notes from design discussion on this subject with @mihnita @macchiati @markusicu @younies and others:

Agreement on high-level concept (separation of concerns)
Mark: Did you cross-check these against the LDML spec? And against the ICU calendar?
Shane: I referenced LDML.
Mark: We always assume hour/minute/second for all countries. We could stick with milliseconds_in_day.
Mark: I would go with one field per format symbol.
Mihai: On the time, the milliseconds_in_day field is a misnomer.
Shane: Yeah; I got that name from LDML. But "millisecond_of_day" would be better.
Mihai: I would comment that in general, these fields are all convertible with one another.

Decide on approach for era name variants (AD vs CE)
Markus: It makes more sense to me for variant selection to go elsewhere. Don't overload the structure.
Mark: We were thinking of putting this in the display context. I think that works better than having the calendar system pick it. It's a formatting issue, not a calendaring issue.
Markus: In Hebrew, we have a choice between printing digits, or having Hebrew do a spellout. That is a display option. I would expect to have a day of the month as a number in the struct, and have the display option live elsewhere.
Mihai: I think AD versus CE belongs in the same bucket as numbering system choice.

Finalize data model for months: merge month ID with month number or keep them separate
Markus: How does this work in ICU?
Mark: CLDR mixes the numbers and the identifiers and does resolution separately.
Markus: Shane is proposing to put the decision in the calendar layer, and keep the display layer as a lookup.
Mark: Yeah, I would go with a month number and an identifier.
Shane: It sounds like we have agreement on two separate fields for month number and month name.

String identifier vs. numeric identifier for month display names
Mark: This seems excessive. For weekdays, we just look it up according to the language. For month names, we're expanding it to cover every calendar system, when you could cover it with 13 digits plus a separate field for the calendar system.
Markus: It could work. It doesn't feel quite right. When we build up our structures, we have 12 or 13 month names. For month names, we always have a complete array.
Mihai: You could have an array with indices. I feel uneasy seeing calendar-specific keys in the data structure.
Mark: It seems simpler to have number indices.
Shane: You say 12 or 13 months, but Hebrew has 14 months, even though only 13 of them can occur in each year. Chinese has 24 months, 12 normal and 12 leap, even though a year has no more than 13 of them at a time. My proposal is to put the display names of months into a global namespace.
Markus: It feels weird, but I think it could work. I can't put my thumb on why it wouldn't work. Maybe we can try it and see how it works.
Markus: For eras, in the Japanese calendar, going back in time is murky. You should pick an era as era 0 and count forward and backward from there.
Mihai: For Japanese, if you use numbers, for new eras, you just add another number, which is easier than inventing a new string. Especially since you don't know the name of the era until a few weeks in advance.
Mark: The advantage of an index is that for most calendar systems, the name and the month number correspond to each other. The problem is that in certain calendar systems, the two are not correlated.
Mark: You could do "m01", "m02", … having a string "foobar" doesn't help me know that it's the 7th month in the calendar system.

Investigate day period and decide whether it should be computed or inputted
Mark: I think the day period should be decided by the calendar system. But, it tends to be a language/region computation rather than a calendar system calculation. So you could put it either place.
Mihai: To me it feels like a locale-specific thing.
Mark: The eras are closely linked to the calendar system. But the day periods are dissociated. If I were to do anything on the display side, it would be the day period stuff.
Mihai: When you go from the calendar to month names, you go from something locale-independent and then you make it locale-dependent. Going from month 6 to a string is a matter of translation. But deciding what is "afternoon" is locale-specific.
Shane: What data is required to figure out the day period?
Mark: We make an approximation. Seconds in day is sufficient for the calculation. In theory, we should go beyond that, and look not only at your locale, but also your location, because in a lot of places, evening starts at sunset. In China, sunset is at a very different time depending on your latitude and longitude.
Mihai: There are two layers: the calendar layer, and the locale layer. Some decisions are made on the calendar layer, and other decisions are made on the locale layer.
Shane: I would go further and say there are 3 layers: the calendar layer, the localization layer, and the rendering layer. The localization layer could include both day period resolution and era name selection. It could be swapped out for a more sophisticated day period selector that takes lat/lon into account. |
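To make the "swappable localization layer" idea from these notes concrete, here is a minimal sketch of a day period resolver that takes only the seconds in the day. The names `DayPeriod`, `DayPeriodResolver`, `FixedBoundaries`, and `example_resolver` are hypothetical, and the hour boundaries are illustrative values, not CLDR data.

```rust
/// Coarse day periods. A real implementation would take its set from locale data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DayPeriod {
    Morning,
    Afternoon,
    Evening,
    Night,
}

/// Localization-layer hook: maps a time of day to a day period.
/// A more sophisticated implementation could also consult latitude/longitude.
pub trait DayPeriodResolver {
    fn resolve(&self, seconds_in_day: u32) -> DayPeriod;
}

/// Hypothetical resolver driven by fixed hour boundaries (each start is inclusive).
pub struct FixedBoundaries {
    pub morning_start: u32,
    pub afternoon_start: u32,
    pub evening_start: u32,
    pub night_start: u32,
}

impl DayPeriodResolver for FixedBoundaries {
    fn resolve(&self, seconds_in_day: u32) -> DayPeriod {
        // Reduce to an hour in 0..24; sub-hour precision is not needed here.
        let hour = (seconds_in_day / 3600) % 24;
        if hour >= self.morning_start && hour < self.afternoon_start {
            DayPeriod::Morning
        } else if hour >= self.afternoon_start && hour < self.evening_start {
            DayPeriod::Afternoon
        } else if hour >= self.evening_start && hour < self.night_start {
            DayPeriod::Evening
        } else {
            DayPeriod::Night
        }
    }
}

/// Illustrative boundaries only: 06-12 morning, 12-18 afternoon,
/// 18-21 evening, everything else night.
pub fn example_resolver() -> FixedBoundaries {
    FixedBoundaries {
        morning_start: 6,
        afternoon_start: 12,
        evening_start: 18,
        night_start: 21,
    }
}
```

Because the rendering code would depend only on the trait, a location-aware resolver (for example one that treats sunset as the start of evening) could be substituted without touching the calendar layer.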
Another advantage of a global namespace for month names: several calendar systems like Japanese and Buddhist use the Gregorian month names, but they have their own system for years and eras. With a global namespace, the Buddhist calendar could request month name "jan" and pull from the same data as Gregorian. With a nested namespace, we'd need to either duplicate the month names, or implement a fallback mechanism. |
A pseudo-global namespace would be something like:

month-names:
gregory-m001: January
gregory-m002: February
# ...
hebrew-m011: Shevat
hebrew-m012: Adar
hebrew-m013: Adar I
hebrew-m014: Adar II
# ...
indian-m001: Chaitra
indian-m002: Vaisākha
# ...

The Buddhist calendar could request month name "gregory-m001". And for eras:

era-names:
# calendar-era-variant
gregory-e00: Before Christ
gregory-e00-v00: Before Common Era
gregory-e01: Anno Domini
gregory-e01-v00: Common Era
# Modern Japan: start from era ID 1000
japanese-e1000: Meiji
japanese-e1001: Taishō
japanese-e1002: Shōwa
japanese-e1003: Heisei
japanese-e1004: Reiwa
# Pre-1868: count down from 1000 with space to add missing eras
japanese-e0990: Keiō
japanese-e0980: Genji
# ... |
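A minimal sketch of how a lookup against such a flat namespace might work, assuming string keys of the form shown above. The helper names (`month_names`, `month_namespace`, `month_key`, `lookup_month`) are hypothetical, and the aliasing of Buddhist and Japanese to the Gregorian entries reflects only the proposal as stated in this comment.

```rust
use std::collections::HashMap;

/// Flat ("pseudo-global") table of month display names, keyed by string identifier.
/// Only a handful of entries from the example are included.
pub fn month_names() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("gregory-m001", "January"),
        ("gregory-m002", "February"),
        ("hebrew-m012", "Adar"),
        ("hebrew-m013", "Adar I"),
        ("hebrew-m014", "Adar II"),
    ])
}

/// Hypothetical calendar-layer helper: which key prefix a calendar uses for
/// its month names. Under the proposal, Buddhist and Japanese reuse the
/// Gregorian entries instead of duplicating them.
pub fn month_namespace(calendar: &str) -> &str {
    match calendar {
        "buddhist" | "japanese" => "gregory",
        other => other,
    }
}

/// Builds the lookup key, e.g. ("buddhist", 1) -> "gregory-m001".
pub fn month_key(calendar: &str, month_id: u16) -> String {
    format!("{}-m{:03}", month_namespace(calendar), month_id)
}

/// The display layer is then a plain lookup with no calendar-specific logic.
pub fn lookup_month(calendar: &str, month_id: u16) -> Option<&'static str> {
    month_names().get(month_key(calendar, month_id).as_str()).copied()
}
```

With this shape, `lookup_month("buddhist", 1)` and `lookup_month("gregory", 1)` resolve to the same entry, so no data is duplicated and no fallback mechanism is needed in the data bundle.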
As per discussion, it requires one more piece of information to be communicated, which is the calendar system.
I think it would be simpler to have:
calendar-system: japanese // tiny string
rather than:
month-names-index: japanese-m012 // string |
I'm sorry, I don't understand. I am proposing a model specifically designed to avoid the need to give the calendar system to the data bundle as a separate argument. Since Gregorian, Japanese, and Buddhist all share the same month names, my proposal is to have a single global namespace for month display names, and all three of those calendars can access the exact same resources.
Your example still duplicates data in a nested calendar system structure. This is not what I'm proposing. My proposal is more along the lines of:
month-names-index: gregory-m012 // tinystr pair |
Separately, I'm not a very big fan of indexing the month names with numbers, because it is misleading when dealing with calendars with leap months. In the Hebrew calendar, one does not simply request "display name for the 12th month", because there are 2 possibilities for that display name. I think it is more clear to index month names by a string identifier to clearly indicate that there is not necessarily a correlation between the month number and the month name index. |
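The Hebrew case can be sketched as follows. This is only an illustration: the function name `hebrew_month_name_id` is hypothetical, the identifiers follow the example table earlier in the thread (m012 = Adar, m013 = Adar I, m014 = Adar II), and it assumes months are counted from Nisan so that Shevat is the 11th.

```rust
/// Maps an ordinal month position within a Hebrew year to a month-name
/// identifier. The 12th month has two possible names depending on whether
/// the year is a leap year, so the ordinal alone is not a valid name index.
pub fn hebrew_month_name_id(ordinal: u8, leap_year: bool) -> Option<String> {
    let id = match (ordinal, leap_year) {
        (1..=11, _) => ordinal, // identical in leap and non-leap years
        (12, false) => 12,      // Adar
        (12, true) => 13,       // Adar I
        (13, true) => 14,       // Adar II
        _ => return None,       // no such month in this kind of year
    };
    Some(format!("hebrew-m{:03}", id))
}
```

The calendar layer makes the leap-year decision once and hands the display layer an opaque identifier, which is the separation argued for above.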
@pedberg-icu pointed out that the data model for months needs to account for the CLDR month name patterns in the Chinese calendar: https://unicode.org/reports/tr35/tr35-dates.html#monthPatterns_cyclicNameSets In particular, the numeric form "M" in the Chinese calendar is not simply a number; it needs to have the month pattern applied to it. For day periods, there seems to be agreement that |
@pedberg-icu also noted that the Hebrew calendar no longer has any numeric patterns for the month. The month in that calendar is currently always represented by MMM. I will follow up with an updated proposal. |
On Oct 28, 2020, at 1:26 PM, Shane F. Carr wrote:
> @pedberg-icu also noted that the Hebrew calendar no longer has any numeric patterns for the month. The month in that calendar is currently always represented by MMM.
Actually MMMM
- Peter
|
I put together a spreadsheet of calendar systems and items (eras & months)
https://docs.google.com/spreadsheets/d/1iIaJ-j-EQRyo0jPLp6rdStTwwBQYawoeRoRVJaR_Ytc/edit#gid=735277865
It coalesces items where CLDR aliases them in root. So, for example,
because the buddhist calendar months alias to gregorian, the buddhist
months don't need a separate enum. We would want to review those aliases to
make sure they are correct and complete: that they are intentional, and
there are no others that can be coalesced (eg maybe generic and gregorian).
Mark
|
The spreadsheet is very helpful; thanks! My main takeaway is that we need only 133 strings (which need wide/short/narrow) to support all month names and all eras in all calendars*. That might be small enough that we can ship it in the ICU4X data provider by default, such that we support formatting in all calendar systems out of the box. If it's not small enough, we can give users options to remove data for calendars they don't use. * As @yumaoka suggested, I omitted the pre-modern Japanese eras. |
Okay, here's a proposal for how we can encode the data for month names, including leap months:

month_names:
  gregory-m001:
    long: January
    short: Jan
    narrow: J
    numeric: {0}
  # ...
  chinese-m001:
    long: First Month
    short: M01
    narrow: {0}
    numeric: {0}
  chinese-m001-leap:
    long: First Monthbis
    short: M01bis
    narrow: {0}b
    numeric: {0}bis

This trades a little extra data for less complicated code. The specification of the data would be:
Note: "Monthbis" is the language from the current CLDR specification. That's probably not right.

new Date(2001, 5, 1).toLocaleDateString("en-u-ca-chinese", { dateStyle: "long" })
// "Fourth Monthbis 10, 2001(xin-si)" |
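A minimal sketch of how the formatter might consume entries of this shape, assuming each field is either a literal name or a pattern containing a `{0}` placeholder for the month number. The type and function names (`MonthEntry`, `Width`, `format_month`) are hypothetical; the data values are copied from the proposal above.

```rust
/// One entry of the proposed month_names table.
pub struct MonthEntry {
    pub long: &'static str,
    pub short: &'static str,
    pub narrow: &'static str,
    pub numeric: &'static str,
}

#[derive(Clone, Copy)]
pub enum Width {
    Long,
    Short,
    Narrow,
    Numeric,
}

pub const GREGORY_M001: MonthEntry = MonthEntry {
    long: "January",
    short: "Jan",
    narrow: "J",
    numeric: "{0}",
};

pub const CHINESE_M001_LEAP: MonthEntry = MonthEntry {
    long: "First Monthbis",
    short: "M01bis",
    narrow: "{0}b",
    numeric: "{0}bis",
};

/// Selects the field for the requested width and substitutes the month
/// number for "{0}". Literal names contain no placeholder and pass through.
pub fn format_month(entry: &MonthEntry, width: Width, month_number: u8) -> String {
    let template = match width {
        Width::Long => entry.long,
        Width::Short => entry.short,
        Width::Narrow => entry.narrow,
        Width::Numeric => entry.numeric,
    };
    template.replace("{0}", &month_number.to_string())
}
```

The formatting code has a single path for regular and leap months; the leap marker lives entirely in the data, which is the trade of extra data for simpler code described above.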
I think that there is a "hidden assumption" here that is not necessarily true. We have calendars that are "Gregorian-like" (12 months, maybe even extending Gregorian in implementation, the way BuddhistCalendar, Japanese, Taiwan are). The calculations work, it's all good... But it does not mean at all that the month names will be translated the same in all the languages. In other words, something like |
I did a quick check. So for zh-SG "gregory-m002" != "buddhist-m002", right now. |
Okay, I'm convinced. I think we can change the data model like this:
If a language-calendar pair wants to fall back to a different calendar, we can use #259 to perform that fallback. However, if it wants to override the data, it can add an additional entry in the data structure above. Q: Shane, why did you start month numbering at 101 instead of 1? A: Because I really, really don't want people to get used to the idea of a month number being equivalent to a month name identifier. We already know this isn't the case in multiple calendar systems, like Chinese. |
Shane to follow up with a concrete PR. |
Okay, I started something in #445. I'm trying something a bit different than what I proposed above. Here's my trait:

pub trait NewDateTimeType {
    fn julian_day(&self) -> JulianDay;
    fn year(&self) -> Year;
    fn year_week(&self) -> Year;
    fn quarter(&self) -> Quarter;
    fn month(&self) -> Month;
    fn time(&self) -> Time;
}

Note: the Julian day is the number of days since the Julian epoch (Wikipedia).

Subtypes:

pub struct Era(pub TinyStr8);
pub struct CyclicYear(pub TinyStr8);
pub struct Quarter(pub u8);
pub struct MonthCode(pub TinyStr8);
pub struct JulianDay(pub i64);
pub struct Year {
    pub start: JulianDay,
    pub era: Era,
    pub number: usize, // FIXME: i64
    pub extended: usize, // FIXME: i64
    pub cyclic: CyclicYear,
}
pub struct Month {
    pub start: JulianDay,
    pub number: usize, // FIXME: i64
    pub code: MonthCode,
}
pub enum FractionalSecond {
    Whole,
    Millisecond(u16),
    Microsecond(u32),
    Nanosecond(u32),
}
pub struct Time {
    pub hour: u8,
    pub minute: u8,
    pub second: u8,
    pub fractional: FractionalSecond,
}

I think that we can compute all of the UTS 35 fields from this information, except for
I was considering two levels of traits: this shortcut trait, and another trait with full field coverage. However, I would like to keep as much of this part of the algorithm as possible inside the library. |
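As an illustration of how further fields could be derived from this input, here is a minimal sketch using only the Julian day of the date and the `start` values carried by `Year` and `Month`. The helper functions are hypothetical, and the weekday formula assumes `JulianDay` holds the integer Julian Day Number, for which day 0 fell on a Monday.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct JulianDay(pub i64);

/// Day of the month (1-based), given the Julian day on which the month starts.
pub fn day_of_month(date: JulianDay, month_start: JulianDay) -> i64 {
    date.0 - month_start.0 + 1
}

/// Day of the year (1-based), given the Julian day on which the year starts.
pub fn day_of_year(date: JulianDay, year_start: JulianDay) -> i64 {
    date.0 - year_start.0 + 1
}

/// Weekday as 0 = Sunday .. 6 = Saturday.
/// Assumes an integer Julian Day Number; day 0 was a Monday, hence the +1.
/// rem_euclid keeps the result non-negative for dates before day 0.
pub fn weekday(date: JulianDay) -> u8 {
    (date.0 + 1).rem_euclid(7) as u8
}
```

For example, Julian Day Number 2451545 corresponds to 2000-01-01 in the Gregorian calendar, a Saturday, and `weekday(JulianDay(2451545))` evaluates to 6. None of these helpers need to know which calendar system produced the `start` values.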
- 24:00 will not be a valid time once unicode-org#355 is fixed. - Removes logic that handles 24:00 for now.
New discovery regarding weeks: the choice of when to set the year cutoff on the week-of-year calendars (capital To test: go to the icu4j MessageFormat demo and enter the following:
Switch the locale between en-US and en-GB to observe the difference. Is the ICU4J behavior correct? (If so, it will affect the design of when we need to ingest locale information.) |
Yes it is. We had to implement |
Cool. Is the algorithm for determining the Week of Year cutoff deterministic across calendar systems? Like, say that the first day of the year is a Wednesday. With a combination of the locale-specific data in mozIntl, you can figure out whether that Wednesday should be considered 2020 or 2021. Does that work in systems other than Gregorian? Or is "week of year" just not used anywhere other than Gregorian? EDIT: I think we can structure the trait in a way that avoids the need to answer this question. |
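To illustrate why the week-year cutoff depends on the locale, here is a minimal sketch that decides whether the first day of a calendar year belongs to week 1 of that year or to the last week of the previous one. The names `WeekInfo` and `first_day_in_new_week_year` are hypothetical, and the en-US and en-GB parameters are assumed values believed to match CLDR week data (first day of week and minimum days in the first week); they are not taken from this thread.

```rust
/// Locale-dependent week parameters. Weekdays are 0 = Sunday .. 6 = Saturday.
pub struct WeekInfo {
    pub first_weekday: u8,
    pub min_days_in_first_week: u8,
}

/// Assumed values for illustration.
pub const EN_US: WeekInfo = WeekInfo { first_weekday: 0, min_days_in_first_week: 1 };
pub const EN_GB: WeekInfo = WeekInfo { first_weekday: 1, min_days_in_first_week: 4 };

/// Returns true if the first day of the calendar year falls in week 1 of
/// that year, false if it still belongs to the last week of the previous year.
pub fn first_day_in_new_week_year(first_day_weekday: u8, info: &WeekInfo) -> bool {
    // Days of the week containing the first day that fall in the previous year.
    let days_before = (first_day_weekday + 7 - info.first_weekday) % 7;
    // Days of that same week that fall in the new year.
    let days_in_first_week = 7 - days_before;
    days_in_first_week >= info.min_days_in_first_week
}
```

With these parameters, a year starting on a Friday (weekday 5) yields `true` for `EN_US` and `false` for `EN_GB`, which is the kind of en-US versus en-GB difference described above. The function takes only a weekday and the locale parameters, so nothing in it is specific to the Gregorian calendar.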
Here are my latest traits:

pub trait NewDateTimeType {
    fn year(&self) -> Year;
    fn prev_year(&self) -> Year;
    fn next_year(&self) -> Year;
    fn quarter(&self) -> Quarter;
    fn month(&self) -> Month;
    fn day_of_year(&self) -> DayOfYear;
    fn day_of_month(&self) -> DayOfMonth;
    fn weekday(&self) -> Weekday;
    fn time(&self) -> Time;
}
pub trait FullDateTime: NewDateTimeType {
    fn year_week(&self) -> Year;
    fn week_of_month(&self) -> WeekOfMonth;
    fn week_of_year(&self) -> WeekOfYear;
    fn flexible_day_period(&self) -> FlexibleDayPeriod;
}

The first, NewDateTimeType, is the one expected to be implemented by external date libraries. The second, FullDateTime, combines NewDateTimeType with a Locale to fill in additional information. The |
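One way the "basic input plus locale" combination could look is a wrapper that borrows the locale-independent input and adds the locale-dependent fields. This is only a sketch: `BasicDate`, `Localized`, and `SimpleDate` are hypothetical stand-ins for the traits above, only `week_of_month` is shown, and it uses the simplifying assumption that any partial week at the start of a month counts as week 1.

```rust
/// Locale-independent input, as an external date library might provide it.
pub trait BasicDate {
    /// 1-based day of the month.
    fn day_of_month(&self) -> u8;
    /// 0 = Sunday .. 6 = Saturday.
    fn weekday(&self) -> u8;
}

/// Adds locale-dependent fields on top of a `BasicDate`.
pub struct Localized<'a, T: BasicDate> {
    pub input: &'a T,
    /// First day of the week for the locale, 0 = Sunday .. 6 = Saturday.
    pub first_weekday: u8,
}

impl<'a, T: BasicDate> Localized<'a, T> {
    /// Week of the month, 1-based. Simplification: a partial first week
    /// always counts as week 1.
    pub fn week_of_month(&self) -> u8 {
        let dom = self.input.day_of_month();
        // Weekday on which the 1st of this month fell.
        let first_of_month_weekday = (self.input.weekday() + 7 - (dom - 1) % 7) % 7;
        // How far into the locale's week the 1st of the month sits.
        let offset = (first_of_month_weekday + 7 - self.first_weekday) % 7;
        (dom - 1 + offset) / 7 + 1
    }
}

/// Minimal example input type.
pub struct SimpleDate {
    pub day_of_month: u8,
    pub weekday: u8,
}

impl BasicDate for SimpleDate {
    fn day_of_month(&self) -> u8 {
        self.day_of_month
    }
    fn weekday(&self) -> u8 {
        self.weekday
    }
}
```

For a Sunday that is the 3rd of a month starting on a Friday, the wrapper reports week 2 when weeks start on Sunday and week 1 when they start on Monday, so the same input yields different output purely from the locale parameter.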
2021-01-15: What I proposed above looks OK. |
…444)
* Add support for NoonMidnight dayPeriods in test data.
  - Updates all of the JSON test data to contain fields for noon and midnight dayperiods.
* Add DayPeriod symbols for NoonMidnight
  - Adds optional `DayPeriod` symbol for `noon`.
  - Adds optional `DayPeriod` symbol for `midnight`.
* Add tests for DayPeriod AmPm and NoonMidnight patterns
  - Restructures `format` module to use directory structure instead of single file.
  - Adds a `format/tests` directory.
  - Moves existing local tests from `format.rs` to `format/tests/mod.rs`
  - Adds JSON test data for `DayPeriod` patterns.
  - Adds JSON-serializable structs for testing formatting patterns.
  - Adds test cases for `DayPeriod` patterns.
  - Adds parsing test cases for the `b` `DayPeriod` pattern.
* Remove logic that handles 24:00
  - 24:00 will not be a valid time once #355 is fixed.
  - Removes logic that handles 24:00 for now.
* Refactor DayPeriod patterns tests to integration tests
  - Converts `DayPeriod` pattern tests to be integration tests.
  - Tests no longer directly use the private `write_pattern()`.
  - Tests now mutate the `DatesV1` struct to the desired pattern, using `DateTimeFormat` to format the custom patterns.
* Rewrite symbols!() macro to support serde_none serialization for Options
  - Rewrites the `symbols!()` macro as a token tree muncher.
  - For `Option` members, adds serde attributes to skip serializing if none.
  - Otherwise, includes them in the serialization.
* Regenerate test data with optional serialization for dayperiods
  - Regenerates the test data now that `noon` and `midnight` are skipped if not present.
  - Previously `noon` and `midnight` would show up as `null`.
* Minor Test Cleanup
  - Moves a few expressions in the dayperiod patterns test to outer loops.
* Make NoonMidnight dependent on granularity of pattern's time.
  - Adds capability for a pattern to compute its most granular time.
    - e.g. `h:mm:ss` is `Seconds`.
    - e.g. `h` is `Hours`.
    - e.g. `E, dd/MM/y` is `None`.
  - Patterns containing `b` the `NoonMidnight` pattern item will now display noon or midnight only if the displayed time falls on the hour.
    - This means that `12:00:00` is always noon-compatible.
    - However, `12:05:15` is noon-compatible only if the display pattern does not contain the minutes or seconds.
* Move time granularity functions to format-local helper functions.
  - Time granularity functionality is no longer associated with Pattern or PatternItem. It is now local to the format module alone as standalone functions.
* Move format/mod.rs to format.rs
  - Format no longer needs to be a directory.
* Fix access specifiers
  - Makes TimeGranularity private instead of public.
* Add minor DayPeriod formatting optimization.
  - Only calculates the time granularity if the `DayPeriod` is `NoonMidnight`.
* Cache time granularity on Pattern
  - Converts `Pattern` from a tuple struct to a traditional struct.
  - Adds a new data member `time_granularity` to `Pattern`.
  - `time_granularity` is a lazily initialized, interior-mutable cached value.
  - Makes `Pattern`'s data members private.
    - The cached `time_granularity` is dependent on the `Pattern`'s `items`. It is no longer safe to allow `items` to be publicly accessible, because mutating `items` must invalidate the cached granularity.
  - Adds new method `items()` to `Pattern` to return a slice of its items.
  - Implement `From<Vec<PatternItem>>` for `Pattern`
    - This is out of convenience in many places where tuple-struct syntax was used previously.
* Clean up Pattern::from_iter
  - Pattern::from_iter now uses Pattern::from::<Vec<_>>
* Eagerly evaluate Pattern time granularity
  - `Pattern`'s time granularity is no longer lazily evaluated.
  - It is instead evaluated on construction.
* Use filter_map instead of flat_map
  - filter_map is more specialized, and arguably more readable.
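The noon-compatibility rule described in this commit message can be sketched as follows. This is an illustration only, not the code from the commit: the function names are hypothetical, and treating a pattern with no time fields (`None`) as never noon-compatible is an assumption.

```rust
/// Most granular time field that a pattern displays.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeGranularity {
    None,
    Hours,
    Minutes,
    Seconds,
}

/// True if the time, as displayed at the given granularity, falls exactly on
/// the hour. Fields that the pattern does not display are ignored.
pub fn is_top_of_hour(granularity: TimeGranularity, minute: u8, second: u8) -> bool {
    match granularity {
        // Assumption: with no time fields displayed, never report noon/midnight.
        TimeGranularity::None => false,
        TimeGranularity::Hours => true,
        TimeGranularity::Minutes => minute == 0,
        TimeGranularity::Seconds => minute == 0 && second == 0,
    }
}

/// Whether the `b` pattern item may display "noon" for this time.
pub fn shows_noon(granularity: TimeGranularity, hour: u8, minute: u8, second: u8) -> bool {
    hour == 12 && is_top_of_hour(granularity, minute, second)
}
```

So 12:00:00 is noon-compatible at every displayed granularity, while 12:05:15 is noon-compatible only when the pattern shows hours alone.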
In datetime-input.md, I put forth a proposal for the trait used as input into DateTimeFormat. This issue is to track follow-up discussions and check in the result.
Task list: