-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for the Gregorian Calendar availableFormats #480
Conversation
fbf52ad
to
6521dce
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
"medium": "{1}, {0}", | ||
"short": "{1}, {0}" | ||
}, | ||
"available_formats": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to decide on the best representation for this data, as it's a key pair value. Making it a simple tuple seemed like the best choice here to me.
Ultimately, the key will be converted from Cow<'static, str>
to a struct Skeleton
that can then be used for skeleton matching. I'm not 100% sure if it's useful to provide these as Cow
s, or what the ultimate design of the memory management for providers is. However, this is following the existing examples on how this data is being read in, so I'm assuming it's compatible.
My understanding of the data flow is
- The serialized JSON is read in.
- This gets deserialized into the
DateTimeFormatsV1
structure. Skeleton
s are generated from theDateTimeFormatsV1
- Skeleton matching happens with the
Skeleton
struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that for now we can keep the key as Cow<str>
, and separately once we have more complete data structures we can experiment with shifting from stringified representations to serializing/deserializing structs and see what it does to perf/mem/size.
6521dce
to
6398371
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see the code using these resources before we add the resources to the data model.
pub style_patterns: StylePatternsV1, | ||
|
||
#[cfg_attr(feature = "provider_serde", serde(with = "tuple_vec_map"))] | ||
pub available_formats: Vec<(Cow<'static, str>, Cow<'static, str>)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you put cows in a data struct, you should add a lifetime parameter to the data struct instead of using 'static
. Alternatively, you can use TinyStr (for skeletons) or SmallStr (for patterns) and obviate the need for a lifetime parameter.
It looks like the skeletons are always, or almost always, 8 ASCII characters or shorter, in which case TinyStr8 would work very well. If you want to allow skeletons to be longer than that, consider TinyStrAuto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you put cows in a data struct, you should add a lifetime parameter to the data struct instead of using 'static.
This is where a lot of my confusion over the memory model for data providers really stems. All of the other strings are using the 'static
lifetime parameter, so I'm not sure if there is any difference between that usage, and this usage. I was honestly a bit surprised this code compiled.
symbols!(months, [Cow<'static, str>; 12]);
symbols!(weekdays, [Cow<'static, str>; 7]);
symbols!(
day_periods {
am: Cow<'static, str>,
pm: Cow<'static, str>,
noon: Option<Cow<'static, str>>,
midnight: Option<Cow<'static, str>>,
}
);
I investigated the source of the Cows, which appears to be #256. It seems like this was initially landed to get the data in tree, with the idea that we would optimize in the future. Similarly I'm a bit nervous about spending a lot of time optimizing at the beginning, but let's start with a decent choice. I think there are different trade-offs in this data provider format, but I'd be happy to try the the non-heap/inline solutions first, and then measure the memory afterwards.
Should I file an issue to add lifetime parameters to the Cow
structures that are already here? If we're not adding new ones, it seems like the existing ones should be addressed at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit: This graph was wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the style guide:
- If lifetime parameters are allowed, use
&'a str
.- If lifetime parameters are not allowed, use one of:
- TinyStr if the string is ASCII-only.
- SmallString for shorter strings, with a stack size ∈ {8, 12, 16, 20} that fits at least 99% of cases.
Cow<'static, str>
for longer strings.
The largest suggested stack size of 20 for this data set is: (aside: I'm not sure if stack size in the style guide is including the data required by by the SmallString struct itself beyond just the data)
20 bytes: 96.5%
29 bytes: 98.8%
30 bytes: 99.5%
Which, according to the style guide suggests we use Cow<'static, str>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to update the style guide. As of my recent changes to the data provider trait, cows can and should have real lifetime parameters. I thought I updated all existing structs when I did that PR but I guess I missed this one hiding in a macro. You don't have to worry about migrating in this PR; I'll follow up in #257.
About data types: I'm more talking about skeletons than patterns. I believe that patterns are going to be long enough that you need a cow. However, I think the keys (skeletons) might be a good just case for TinyStr.
medium: other.medium.get_pattern().clone(), | ||
short: other.short.get_pattern().clone(), | ||
}, | ||
available_formats: other.available_formats.0.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't have to copy the available_formats literally from CLDR into ICU4X. Is there any pre-processing you can do to optimize them?
"MMMMW-count-zero": "الأسبوع W من MMMM", | ||
"MMMMW-count-one": "الأسبوع W من MMMM", | ||
"MMMMW-count-two": "الأسبوع W من MMMM", | ||
"MMMMW-count-few": "الأسبوع W من MMMM", | ||
"MMMMW-count-many": "الأسبوع W من MMMM", | ||
"MMMMW-count-other": "الأسبوع W من MMMM", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have the plural rule selection implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I don't, and I haven't really thought through this problem. Do you have recommendations here? Currently I discard them in the Skeleton
struct construction. I don't really have a mental model of how these will be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'm going to try to change the representation here to handle these. I see what's going on with them now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't want to feature-creep your work, open an issue to figure these out later, and in the mean time, copy only the "other" form into the data.
Thanks for the review!
That seems fair enough. I would like to work incrementally, so I may investigate ways to separate out different sections still. |
It's good to keep this as a separate PR; I'd just like to review the code PR first so I can get an idea of how this data is being used. I imagine that there may be additional changes you may want to make once we see how the data is used. |
Codecov Report
@@ Coverage Diff @@
## main #480 +/- ##
==========================================
- Coverage 74.22% 74.08% -0.14%
==========================================
Files 128 129 +1
Lines 7840 8294 +454
==========================================
+ Hits 5819 6145 +326
- Misses 2021 2149 +128
Continue to review full report at Codecov.
|
I updated the PR to address the review feedback so far. I still need to rebase my bigger work on top of these changes, so I may tweak the design a bit more, however this incorporates the review feedback so far. |
@@ -132,6 +132,9 @@ impl From<&cldr_json::StylePatterns> for gregory::patterns::StylePatternsV1 { | |||
|
|||
impl From<&cldr_json::DateTimeFormats> for gregory::patterns::DateTimeFormatsV1 { | |||
fn from(other: &cldr_json::DateTimeFormats) -> Self { | |||
use gregory::patterns::{CountV1, SkeletonPatternV1, SkeletonsV1}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of CountV1, it would be nice if we could use an actual standard plural enum from the plurals crate that has a well-defined string representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few thoughts here:
A: I was wondering about dependencies when I wrote it. I looked into using PluralCategory
originally, but didn't understand the strategy for inter-dependencies between the various component crates. I could use the PluralCategory
directly, but this would add icu_plurals
as a dependency to icu_datetime
. Is this OK?
B: I'm a bit concerned with sharing serializable representations with code outside of the provider. The V1
signals a social contract to maintain the serialized represented. If I naively pulled in PluralCategory
, then this gives no signal that it's used as a serializable source, and care should be taken when changing its representation. Perhaps this is fine here since this is a fairly standardized representation. However, is there still a risk of this assumption changing? This type of thing was a source of bugs when I maintained the Firefox Profiler.
C: I read through data-pipeline.md again, but I'm still not 100% sure on the guarantees here. Here in the provider, every struct has an associated version.
- How granular is this version number?
- Does this mean that individual pieces of the serialization can change? For example, what happens if I make a breaking change to
CountV1
, does that mean every struct that contains aCountV1
needs to be bumped versions? - Do we not care yet to define this breakdown, and label everything as V1 until something else is needed?
- Can we have something with a V2 key, but contain V1 structs?
- What happens when we have a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A: I was wondering about dependencies when I wrote it. I looked into using PluralCategory originally, but didn't understand the strategy for inter-dependencies between the various component crates. I could use the PluralCategory directly, but this would add icu_plurals as a dependency to icu_datetime. Is this OK?
We could have it be an optional dependency and have skeletons be behind a feature (on by default, but this would allow you to build datetime with just styles for smaller footprint!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, icu_datetime is going to have to depend on icu_plurals anyway in order to operate on this data, right?
In any case, I don't have any problem with adding this dependency because icu_datetime is higher-level than icu_plurals. It will also need to eventually gain a dependency on icu_numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On your other questions:
B: This would assume PluralCategory
is serializable as an enum. If it's not, then I would implement serde on that enum over in icu_plurals behind that crate's serde
feature. Then the stability guarantee comes from PluralCategory's serde form.
C: Let's not worry about data struct versioning until ICU4X 1.0. These are going to be interesting questions to discuss and answer when we actually have to start worrying about our stability guarantees. Until that time, let's not worry about it much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, icu_datetime is going to have to depend on icu_plurals anyway in order to operate on this data, right?
What I'm saying is that icu_datetime
that is only working with dateStyle/timeStyle is a usable date/time formatting API that may not pull in icu_plurals
, am I correct?
icu_plurals
become important when skeletons
come into play.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just re-reading the style guide on crate features:
When adding enhancements to an ICU4X component, introduce features as a way for the end user to control the code size of their compilation as follows:
- If the enhancement adds a new crate dependency, it should be behind a feature.
- If the enhancement contains code that is not considered best practice, for example if it is mainly used for debugging diagnostics or development, then it should be behind a feature.
So it looks like this may fall under case 1. My hope is that dead code elimination would take care of this one, but I have no problem with putting the skeleton handling stuff behind a feature flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the conversation here, I'm doubting the use of storing the string data in memory at all. It would be fairly easy to implement Serialize
and Deserialize
for Field
and PatternItem
. Then the data could use something like the following shape:
pub type SkeletonTupleV1 = (
SmallVec<[fields::Field; 5]>,
SmallVec<[Vec<pattern::PatternItem>; 1]>,
);
Then all of the pattern matching machinery can take a lifetime parameter, and do all of its work via reference, without worrying about owning or copying any of this data. The data could be serialized back into the standard UTS 35 representations for strings, which is convenient and well-documented, but the in-memory representation would be exactly how we want to use it for the actual data processing.
The processing step can handle pre-sorting the information to make sure it's in the most efficient and direct form for skeleton matching. The skeleton matching machinery could then be a wrapper with a lifetime over this data representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very much in favor of pre-processing CLDR strings such that the data stored in ICU4X DataProvider needs minimal parsing and processing at runtime. I think it's okay if it incurs a small data size penalty as well.
// 9 16 0.1% 100.0% | ||
// 10 8 0.0% 100.0% | ||
|
||
// TODO - This could have better memory locality with TinyStr, however it does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully we can get that solved before you need to merge this PR.
pub pattern: Cow<'static, str>, | ||
#[cfg_attr( | ||
feature = "provider_serde", | ||
serde(skip_serializing_if = "Option::is_none") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to also make this dependent on the serialize_none feature, which is necessary for bincode. Find some boilerplate from elsewhere and copy it in here.
We should probably add a CI check that fails if bincode doesn't round-trip. #491
// currently implement Serialize or Deserialize traits. | ||
// | ||
// pub type SkeletonsV1 = Vec<(TinyStr16, SmallVec<[SkeletonPatternV1; 1]>)>, | ||
pub type SkeletonsV1 = Vec<(Cow<'static, str>, SmallVec<[SkeletonPatternV1; 1]>)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: instead of using SmallVec<[SkeletonPatternV1; 1]>
, consider an enum
with two variants: a single pattern, and a TupleVecMap of multiple patterns keyed by plural form.
It turns out the skeleton data in the CLDR has identical patterns for all of the different plural variants (except "other"). I'm thinking it will be better and simpler to land this code only with the assumption that there is a 1:1 relationship between skeleton to pattern. The only skeletons that have multiple patterns have at most one variant. This includes for plural rules, where the count-one, count-two, etc variants are identical. The only plural rule that has anything different is the Here is the list of patterns that have variants for a given a skeleton.
This list is small enough, that I don't think it's worth handling. In addition, we are not accepting free-form pattern input to match against, but rather specific options bags for the components. I don't think there is an easy way to prefer one variant over the other. This will simplify the initial implementation, and then we can re-visit the decision if we actually do need a way to access these variants. I determined this with this script. I also did a spot check of the CLDR XML data to ensure that the JSON data was accurate, and it appears so. (I checked this locally more thoroughly.) |
The list being small shouldn't by itself justify low-level design decisions like this one if they reduce i18n quality. For example, it looks like the plural forms produce actually different results for "pcm".
I'm also not sure what to do about variants that aren't tied to a Unicode extension keyword. I know you want to check this in, but I'd like to discuss this question at the meeting, because I think it's an important decision that could serve as precedent for future scenarios. So can we talk about it in Thursday? |
First off, I agree that just because a list is small, that it shouldn't be reason to reduce the quality of the overall output. I'm focusing on delivering a correct solution, but also keeping an eye towards limiting scope of this initial implementation. My understanding was that the
This leads me to realize that effectively for "pcm" the
I think for this first check-in, we should ignore them, and then file a follow-up to discuss them and come to a decision on them.
I'd be happy to discuss further if you feel the need, but I think we may be in agreement now that I realize my misunderstanding of the
|
That sounds good. Alternatively, if you want to reduce the scope, I see that the plural forms are only used in the "week of year" and "month of year" patterns; you could just omit those patterns, since there is no components bag that triggers them, and make a note to follow up in #488. |
I've updated this PR with the existing feedback. There are 4 new commits listed here. Locally, I rewrote my history while I was working, but used a merge strategy to update this PR. You can look at the total code diff, or look at it on a per-commit basis. This would be Part 1, and Part 2 of the new commits:
|
I found a new issue. I didn't enforce sorting of the skeleton fields from the CLDR. This was fixed in: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! A few nitpicks and suggestions.
components/datetime/src/skeleton.rs
Outdated
while let Some(item) = seq.next_element()? { | ||
items.push(item) | ||
} | ||
Ok(Skeleton(items)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: We shouldn't assume that the fields are in order, because the bincode could come from an untrusted source. We should just validate that they are indeed in order (when reading a new item, check that it is greater than the previous item), and return a Serde error if they are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Question: Do we know what kind of validation we need to do for binary sources and what kind of guarantees serde bincode provides?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, if we don't trust the source, then I'd also like to check for duplicate fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is done and with tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. IMO, our responsibility for validation is only on the level of semantic correctness. Bincode's own deserializer code should be able to handle correctness of types and such. We need to perform semantic correctness validation whenever we have our own Deserialize impl, or when deserializing into a struct that has certain invariants to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree here. I messed around with the binary representation, and serde seemed to do the correct thing when providing invalid binary data.
if prev_item > &item { | ||
return Err(de::Error::invalid_value( | ||
de::Unexpected::Other(&format!("field item out of order: {:?}", item)), | ||
&"ordered field symbols representing a skeleton", | ||
)); | ||
} | ||
if prev_item == &item { | ||
return Err(de::Error::invalid_value( | ||
de::Unexpected::Other(&format!("duplicate field: {:?}", item)), | ||
&"ordered field symbols representing a skeleton", | ||
)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick (optional): combine error handling via prev_item >= &item
. Less code to read and compile for an error that shouldn't happen in normal situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since everything is green, I may fast follow with this one.
components/datetime/src/skeleton.rs
Outdated
while let Some(item) = seq.next_element()? { | ||
items.push(item) | ||
} | ||
Ok(Skeleton(items)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. IMO, our responsibility for validation is only on the level of semantic correctness. Bincode's own deserializer code should be able to handle correctness of types and such. We need to perform semantic correctness validation whenever we have our own Deserialize impl, or when deserializing into a struct that has certain invariants to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching my review status to Approval with one more optional nitpick
This PR adds support for the
availableFormats
key to theDateTimeFormats
, in order to provide support for skeleton matching. It is part of the dependencies for Issue #481. I'm splitting out this piece from the rest of the work, as it is fairly self contained, and it's what the rest of the work will be based upon. It has a fairly large code diff, as it's generating lots of test data, so it should be helpful to get it in place first.This is ready for review.
(If anyone is curious, my draft PR for all of the components bag work is available in #479, which includes things like Skeleton.)