Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for the Gregorian Calendar availableFormats #480

Merged
merged 23 commits into from
Mar 29, 2021

Conversation

gregtatum
Copy link
Member

@gregtatum gregtatum commented Feb 8, 2021

This PR adds support for the availableFormats key to the DateTimeFormats, in order to provide support for skeleton matching. It is part of the dependencies for Issue #481. I'm splitting out this piece from the rest of the work, as it is fairly self contained, and it's what the rest of the work will be based upon. It has a fairly large code diff, as it's generating lots of test data, so it should be helpful to get it in place first.

This is ready for review.

(If anyone is curious, my draft PR for all of the components bag work is available in #479, which includes things like Skeleton.)

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • components/datetime/src/options/components.rs is no longer changed in the branch
  • components/datetime/src/options/preferences.rs is no longer changed in the branch
  • components/datetime/src/options/style.rs is no longer changed in the branch
  • components/datetime/tests/fixtures/mod.rs is no longer changed in the branch
  • components/datetime/tests/fixtures/structs.rs is no longer changed in the branch
  • components/datetime/tests/fixtures/tests/components.json is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@coveralls
Copy link

coveralls commented Feb 8, 2021

Pull Request Test Coverage Report for Build 2e341c4b05f71c613135ee01a632d7bf3a49039e-PR-480

  • 290 of 423 (68.56%) changed or added relevant lines in 9 files are covered.
  • 30 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.07%) to 73.142%

Changes Missing Coverage Covered Lines Changed/Added Lines %
components/datetime/src/fields/length.rs 2 3 66.67%
components/datetime/src/provider/helpers.rs 3 4 75.0%
components/datetime/src/fields/mod.rs 2 5 40.0%
components/datetime/src/pattern/error.rs 4 7 57.14%
components/datetime/src/provider/mod.rs 15 28 53.57%
components/provider_cldr/src/transform/dates.rs 29 42 69.05%
components/datetime/src/fields/symbols.rs 56 78 71.79%
components/datetime/src/pattern/mod.rs 36 67 53.73%
components/datetime/src/skeleton.rs 143 189 75.66%
Files with Coverage Reduction New Missed Lines %
components/datetime/src/fields/length.rs 1 86.67%
components/datetime/src/fields/mod.rs 1 40.91%
components/datetime/src/provider/mod.rs 1 36.0%
components/datetime/src/fields/symbols.rs 7 82.32%
utils/litemap/src/map.rs 20 71.6%
Totals Coverage Status
Change from base Build 30a3909542955e156a11a979f57d8c38c1dbeac5: -0.07%
Covered Lines: 6269
Relevant Lines: 8571

💛 - Coveralls

"medium": "{1}, {0}",
"short": "{1}, {0}"
},
"available_formats": [
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to decide on the best representation for this data, as it's a key pair value. Making it a simple tuple seemed like the best choice here to me.

Ultimately, the key will be converted from Cow<'static, str> to a struct Skeleton that can then be used for skeleton matching. I'm not 100% sure if it's useful to provide these as Cows, or what the ultimate design of the memory management for providers is. However, this is following the existing examples on how this data is being read in, so I'm assuming it's compatible.

My understanding of the data flow is

  • The serialized JSON is read in.
  • This gets deserialized into the DateTimeFormatsV1 structure.
  • Skeletons are generated from the DateTimeFormatsV1
  • Skeleton matching happens with the Skeleton struct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that for now we can keep the key as Cow<str>, and separately once we have more complete data structures we can experiment with shifting from stringified representations to serializing/deserializing structs and see what it does to perf/mem/size.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is now changed in the branch
  • components/datetime/Cargo.toml is now changed in the branch
  • components/datetime/src/provider/mod.rs is different
  • resources/testdata/data/json/dates/gregory@1/ar-EG.json is different
  • resources/testdata/data/json/dates/gregory@1/ar.json is different
  • resources/testdata/data/json/dates/gregory@1/bn.json is different
  • resources/testdata/data/json/dates/gregory@1/ccp.json is different
  • resources/testdata/data/json/dates/gregory@1/en-US-posix.json is different
  • resources/testdata/data/json/dates/gregory@1/en-ZA.json is different
  • resources/testdata/data/json/dates/gregory@1/en.json is different
  • resources/testdata/data/json/dates/gregory@1/es-AR.json is different
  • resources/testdata/data/json/dates/gregory@1/es.json is different
  • resources/testdata/data/json/dates/gregory@1/fr.json is different
  • resources/testdata/data/json/dates/gregory@1/ja.json is different
  • resources/testdata/data/json/dates/gregory@1/ru.json is different
  • resources/testdata/data/json/dates/gregory@1/sr-Cyrl.json is different
  • resources/testdata/data/json/dates/gregory@1/sr-Latn.json is different
  • resources/testdata/data/json/dates/gregory@1/sr.json is different
  • resources/testdata/data/json/dates/gregory@1/th.json is different
  • resources/testdata/data/json/dates/gregory@1/tr.json is different
  • resources/testdata/data/json/dates/gregory@1/und.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@gregtatum gregtatum marked this pull request as ready for review February 8, 2021 19:59
@gregtatum gregtatum requested review from sffc and a team as code owners February 8, 2021 19:59
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see the code using these resources before we add the resources to the data model.

pub style_patterns: StylePatternsV1,

#[cfg_attr(feature = "provider_serde", serde(with = "tuple_vec_map"))]
pub available_formats: Vec<(Cow<'static, str>, Cow<'static, str>)>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put cows in a data struct, you should add a lifetime parameter to the data struct instead of using 'static. Alternatively, you can use TinyStr (for skeletons) or SmallStr (for patterns) and obviate the need for a lifetime parameter.

It looks like the skeletons are always, or almost always, 8 ASCII characters or shorter, in which case TinyStr8 would work very well. If you want to allow skeletons to be longer than that, consider TinyStrAuto.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put cows in a data struct, you should add a lifetime parameter to the data struct instead of using 'static.

This is where a lot of my confusion over the memory model for data providers really stems. All of the other strings are using the 'static lifetime parameter, so I'm not sure if there is any difference between that usage, and this usage. I was honestly a bit surprised this code compiled.

    symbols!(months, [Cow<'static, str>; 12]);

    symbols!(weekdays, [Cow<'static, str>; 7]);

    symbols!(
        day_periods {
            am: Cow<'static, str>,
            pm: Cow<'static, str>,
            noon: Option<Cow<'static, str>>,
            midnight: Option<Cow<'static, str>>,
        }
    );

I investigated the source of the Cows, which appears to be #256. It seems like this was initially landed to get the data in tree, with the idea that we would optimize in the future. Similarly I'm a bit nervous about spending a lot of time optimizing at the beginning, but let's start with a decent choice. I think there are different trade-offs in this data provider format, but I'd be happy to try the the non-heap/inline solutions first, and then measure the memory afterwards.

Should I file an issue to add lifetime parameters to the Cow structures that are already here? If we're not adding new ones, it seems like the existing ones should be addressed at some point.

Copy link
Member Author

@gregtatum gregtatum Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: This graph was wrong.

Copy link
Member Author

@gregtatum gregtatum Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-posting this with correct utf8 byte size, instead of character count:

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the style guide:

  • If lifetime parameters are allowed, use &'a str.
  • If lifetime parameters are not allowed, use one of:
    • TinyStr if the string is ASCII-only.
    • SmallString for shorter strings, with a stack size ∈ {8, 12, 16, 20} that fits at least 99% of cases.
    • Cow<'static, str> for longer strings.

The largest suggested stack size of 20 for this data set is: (aside: I'm not sure if stack size in the style guide is including the data required by by the SmallString struct itself beyond just the data)

20 bytes: 96.5%
29 bytes: 98.8%
30 bytes: 99.5%

Which, according to the style guide suggests we use Cow<'static, str>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to update the style guide. As of my recent changes to the data provider trait, cows can and should have real lifetime parameters. I thought I updated all existing structs when I did that PR but I guess I missed this one hiding in a macro. You don't have to worry about migrating in this PR; I'll follow up in #257.

About data types: I'm more talking about skeletons than patterns. I believe that patterns are going to be long enough that you need a cow. However, I think the keys (skeletons) might be a good just case for TinyStr.

medium: other.medium.get_pattern().clone(),
short: other.short.get_pattern().clone(),
},
available_formats: other.available_formats.0.clone(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to copy the available_formats literally from CLDR into ICU4X. Is there any pre-processing you can do to optimize them?

Comment on lines 168 to 173
"MMMMW-count-zero": "الأسبوع W من MMMM",
"MMMMW-count-one": "الأسبوع W من MMMM",
"MMMMW-count-two": "الأسبوع W من MMMM",
"MMMMW-count-few": "الأسبوع W من MMMM",
"MMMMW-count-many": "الأسبوع W من MMMM",
"MMMMW-count-other": "الأسبوع W من MMMM",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have the plural rule selection implemented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't, and I haven't really thought through this problem. Do you have recommendations here? Currently I discard them in the Skeleton struct construction. I don't really have a mental model of how these will be used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm going to try to change the representation here to handle these. I see what's going on with them now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to feature-creep your work, open an issue to figure these out later, and in the mean time, copy only the "other" form into the data.

@gregtatum
Copy link
Member Author

Thanks for the review!

I would like to see the code using these resources before we add the resources to the data model.

That seems fair enough. I would like to work incrementally, so I may investigate ways to separate out different sections still.

@sffc
Copy link
Member

sffc commented Feb 9, 2021

Thanks for the review!

I would like to see the code using these resources before we add the resources to the data model.

That seems fair enough. I would like to work incrementally, so I may investigate ways to separate out different sections still.

It's good to keep this as a separate PR; I'd just like to review the code PR first so I can get an idea of how this data is being used. I imagine that there may be additional changes you may want to make once we see how the data is used.

@codecov-io
Copy link

codecov-io commented Feb 10, 2021

Codecov Report

Merging #480 (9ce6021) into main (30a3909) will decrease coverage by 0.13%.
The diff coverage is 68.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #480      +/-   ##
==========================================
- Coverage   74.22%   74.08%   -0.14%     
==========================================
  Files         128      129       +1     
  Lines        7840     8294     +454     
==========================================
+ Hits         5819     6145     +326     
- Misses       2021     2149     +128     
Impacted Files Coverage Δ
components/datetime/src/lib.rs 100.00% <ø> (ø)
components/datetime/src/fields/mod.rs 40.90% <40.00%> (+40.90%) ⬆️
components/datetime/src/provider/mod.rs 36.00% <53.57%> (+11.00%) ⬆️
components/datetime/src/pattern/mod.rs 66.11% <53.73%> (-15.02%) ⬇️
components/datetime/src/pattern/error.rs 30.76% <57.14%> (+30.76%) ⬆️
components/datetime/src/fields/length.rs 86.66% <66.66%> (+3.33%) ⬆️
components/provider_cldr/src/transform/dates.rs 72.79% <69.04%> (-0.39%) ⬇️
components/datetime/src/fields/symbols.rs 82.32% <71.79%> (+5.62%) ⬆️
components/datetime/src/provider/helpers.rs 58.94% <75.00%> (ø)
components/datetime/src/skeleton.rs 75.66% <75.66%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30a3909...9ce6021. Read the comment docs.

@gregtatum
Copy link
Member Author

I updated the PR to address the review feedback so far. I still need to rebase my bigger work on top of these changes, so I may tweak the design a bit more, however this incorporates the review feedback so far.

@@ -132,6 +132,9 @@ impl From<&cldr_json::StylePatterns> for gregory::patterns::StylePatternsV1 {

impl From<&cldr_json::DateTimeFormats> for gregory::patterns::DateTimeFormatsV1 {
fn from(other: &cldr_json::DateTimeFormats) -> Self {
use gregory::patterns::{CountV1, SkeletonPatternV1, SkeletonsV1};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of CountV1, it would be nice if we could use an actual standard plural enum from the plurals crate that has a well-defined string representation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few thoughts here:

A: I was wondering about dependencies when I wrote it. I looked into using PluralCategory originally, but didn't understand the strategy for inter-dependencies between the various component crates. I could use the PluralCategory directly, but this would add icu_plurals as a dependency to icu_datetime. Is this OK?

B: I'm a bit concerned with sharing serializable representations with code outside of the provider. The V1 signals a social contract to maintain the serialized represented. If I naively pulled in PluralCategory, then this gives no signal that it's used as a serializable source, and care should be taken when changing its representation. Perhaps this is fine here since this is a fairly standardized representation. However, is there still a risk of this assumption changing? This type of thing was a source of bugs when I maintained the Firefox Profiler.

C: I read through data-pipeline.md again, but I'm still not 100% sure on the guarantees here. Here in the provider, every struct has an associated version.

  • How granular is this version number?
  • Does this mean that individual pieces of the serialization can change? For example, what happens if I make a breaking change to CountV1, does that mean every struct that contains a CountV1 needs to be bumped versions?
  • Do we not care yet to define this breakdown, and label everything as V1 until something else is needed?
  • Can we have something with a V2 key, but contain V1 structs?
  • What happens when we have a breaking change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A: I was wondering about dependencies when I wrote it. I looked into using PluralCategory originally, but didn't understand the strategy for inter-dependencies between the various component crates. I could use the PluralCategory directly, but this would add icu_plurals as a dependency to icu_datetime. Is this OK?

We could have it be an optional dependency and have skeletons be behind a feature (on by default, but this would allow you to build datetime with just styles for smaller footprint!)

Copy link
Member

@sffc sffc Feb 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, icu_datetime is going to have to depend on icu_plurals anyway in order to operate on this data, right?

In any case, I don't have any problem with adding this dependency because icu_datetime is higher-level than icu_plurals. It will also need to eventually gain a dependency on icu_numbers.

Copy link
Member

@sffc sffc Feb 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On your other questions:

B: This would assume PluralCategory is serializable as an enum. If it's not, then I would implement serde on that enum over in icu_plurals behind that crate's serde feature. Then the stability guarantee comes from PluralCategory's serde form.

C: Let's not worry about data struct versioning until ICU4X 1.0. These are going to be interesting questions to discuss and answer when we actually have to start worrying about our stability guarantees. Until that time, let's not worry about it much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, icu_datetime is going to have to depend on icu_plurals anyway in order to operate on this data, right?

What I'm saying is that icu_datetime that is only working with dateStyle/timeStyle is a usable date/time formatting API that may not pull in icu_plurals, am I correct?
icu_plurals become important when skeletons come into play.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just re-reading the style guide on crate features:

https://github.com/unicode-org/icu4x/blob/master/docs/process/style_guide.md#when-to-add-crate-features--suggested

When adding enhancements to an ICU4X component, introduce features as a way for the end user to control the code size of their compilation as follows:

  1. If the enhancement adds a new crate dependency, it should be behind a feature.
  2. If the enhancement contains code that is not considered best practice, for example if it is mainly used for debugging diagnostics or development, then it should be behind a feature.

So it looks like this may fall under case 1. My hope is that dead code elimination would take care of this one, but I have no problem with putting the skeleton handling stuff behind a feature flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the conversation here, I'm doubting the use of storing the string data in memory at all. It would be fairly easy to implement Serialize and Deserialize for Field and PatternItem. Then the data could use something like the following shape:

pub type SkeletonTupleV1 = (
    SmallVec<[fields::Field; 5]>,
    SmallVec<[Vec<pattern::PatternItem>; 1]>,
);

Then all of the pattern matching machinery can take a lifetime parameter, and do all of its work via reference, without worrying about owning or copying any of this data. The data could be serialized back into the standard UTS 35 representations for strings, which is convenient and well-documented, but the in-memory representation would be exactly how we want to use it for the actual data processing.

The processing step can handle pre-sorting the information to make sure it's in the most efficient and direct form for skeleton matching. The skeleton matching machinery could then be a wrapper with a lifetime over this data representation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very much in favor of pre-processing CLDR strings such that the data stored in ICU4X DataProvider needs minimal parsing and processing at runtime. I think it's okay if it incurs a small data size penalty as well.

// 9 16 0.1% 100.0%
// 10 8 0.0% 100.0%

// TODO - This could have better memory locality with TinyStr, however it does not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zbraniecki/tinystr#32

Hopefully we can get that solved before you need to merge this PR.

components/provider_cldr/src/transform/dates.rs Outdated Show resolved Hide resolved
pub pattern: Cow<'static, str>,
#[cfg_attr(
feature = "provider_serde",
serde(skip_serializing_if = "Option::is_none")
Copy link
Member

@sffc sffc Feb 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to also make this dependent on the serialize_none feature, which is necessary for bincode. Find some boilerplate from elsewhere and copy it in here.

We should probably add a CI check that fails if bincode doesn't round-trip. #491

// currently implement Serialize or Deserialize traits.
//
// pub type SkeletonsV1 = Vec<(TinyStr16, SmallVec<[SkeletonPatternV1; 1]>)>,
pub type SkeletonsV1 = Vec<(Cow<'static, str>, SmallVec<[SkeletonPatternV1; 1]>)>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: instead of using SmallVec<[SkeletonPatternV1; 1]>, consider an enum with two variants: a single pattern, and a TupleVecMap of multiple patterns keyed by plural form.

@sffc sffc added the waiting-on-author PRs waiting for action from the author for >7 days label Feb 18, 2021
@gregtatum
Copy link
Member Author

It turns out the skeleton data in the CLDR has identical patterns for all of the different plural variants (except "other"). I'm thinking it will be better and simpler to land this code only with the assumption that there is a 1:1 relationship between skeleton to pattern.

The only skeletons that have multiple patterns have at most one variant. This includes for plural rules, where the count-one, count-two, etc variants are identical. The only plural rule that has anything different is the count-other field.

Here is the list of patterns that have variants for a given a skeleton.

locale skeleton pattern
en-CA Md MM-dd
en-CA Md-alt-variant d/M
en-CA MEd E, MM-dd
en-CA MEd-alt-variant E, d/M
en-CA MMdd MM-dd
en-CA MMdd-alt-variant dd/MM
en-CA yM y-MM
en-CA yM-alt-variant M/y
en-CA yMd y-MM-dd
en-CA yMd-alt-variant d/M/y
en-CA yMEd E, y-MM-dd
en-CA yMEd-alt-variant E, d/M/y
fil MMMMW-count-one 'ika'-W 'linggo' 'ng' MMMM
fil MMMMW-count-other 'linggo' W 'ng' MMMM
fil yw-count-one 'ika'-w 'linggo' 'ng' Y
fil yw-count-other 'linggo' w 'ng' Y
hy MMMMW-count-one MMMM W-ին շաբաթ
hy MMMMW-count-other MMMM W-րդ շաբաթ
hy yw-count-one Y թ․ w-ին շաբաթ
hy yw-count-other Y թ․ w-րդ շաբաթ
pcm yw-count-one 'Wik' w 'fọ' Y
pcm yw-count-other 'Wiik' w 'fọ' Y
ps MMMMW-count-one اونۍ‘ W د MMMM‘
ps MMMMW-count-other اونۍ W د MMMM
ps-PK MMMMW-count-one اونۍ‘ W د MMMM‘
ps-PK MMMMW-count-other اونۍ W د MMMM

This list is small enough, that I don't think it's worth handling. In addition, we are not accepting free-form pattern input to match against, but rather specific options bags for the components. I don't think there is an easy way to prefer one variant over the other. This will simplify the initial implementation, and then we can re-visit the decision if we actually do need a way to access these variants.

I determined this with this script.

I also did a spot check of the CLDR XML data to ensure that the JSON data was accurate, and it appears so. (I checked this locally more thoroughly.)

@sffc
Copy link
Member

sffc commented Feb 22, 2021

This list is small enough, that I don't think it's worth handling.

The list being small shouldn't by itself justify low-level design decisions like this one if they reduce i18n quality. For example, it looks like the plural forms produce actually different results for "pcm".

In addition, we are not accepting free-form pattern input to match against, but rather specific options bags for the components. I don't think there is an easy way to prefer one variant over the other.

I'm also not sure what to do about variants that aren't tied to a Unicode extension keyword.

I know you want to check this in, but I'd like to discuss this question at the meeting, because I think it's an important decision that could serve as precedent for future scenarios. So can we talk about it in Thursday?

@gregtatum
Copy link
Member Author

gregtatum commented Feb 23, 2021

The list being small shouldn't by itself justify low-level design decisions like this one if they reduce i18n quality.

First off, I agree that just because a list is small, that it shouldn't be reason to reduce the quality of the overall output. I'm focusing on delivering a correct solution, but also keeping an eye towards limiting scope of this initial implementation.

My understanding was that the count-other was equivalent to alt-variant, but digging further, I realize that this is not the case. Looking into the plurals.json test data I see the following definition:

      "pcm": {
        "pluralRule-count-one": "i = 0 or n = 1 @integer 0, 1 @decimal 0.0~1.0, 0.00~0.04",
        "pluralRule-count-other": " @integer 2~17, 100, 1000, 10000, 100000, 1000000, … @decimal 1.1~2.6, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, …"
      },

This leads me to realize that effectively for "pcm" the count-other is actually the plural variant. So I agree, this should be handled here, otherwise it would be a low quality result.

I'm also not sure what to do about variants that aren't tied to a Unicode extension keyword.

I think for this first check-in, we should ignore them, and then file a follow-up to discuss them and come to a decision on them.

I know you want to check this in, but I'd like to discuss this question at the meeting, because I think it's an important decision that could serve as precedent for future scenarios. So can we talk about it in Thursday?

I'd be happy to discuss further if you feel the need, but I think we may be in agreement now that I realize my misunderstanding of the count-other case. In summary:

  • I will handle the -count-* cases
  • I will not handle the -alt-variant case, and file a follow-up issue to discuss.

@sffc
Copy link
Member

sffc commented Feb 23, 2021

That sounds good. Alternatively, if you want to reduce the scope, I see that the plural forms are only used in the "week of year" and "month of year" patterns; you could just omit those patterns, since there is no components bag that triggers them, and make a note to follow up in #488.

@gregtatum
Copy link
Member Author

I've updated this PR with the existing feedback. There are 4 new commits listed here. Locally, I rewrote my history while I was working, but used a merge strategy to update this PR. You can look at the total code diff, or look at it on a per-commit basis. This would be Part 1, and Part 2 of the new commits:

  • Merge in master
  • Part 1: Directly serialize patterns and skeletons in the CLDR and prov…
  • Part 2: Re-compute the testdata for skeletons
  • Merge branch 'components-bag' (early part) into available-formats

components/datetime/src/pattern/mod.rs Outdated Show resolved Hide resolved
components/datetime/src/pattern/mod.rs Show resolved Hide resolved
@gregtatum gregtatum requested a review from sffc March 18, 2021 16:36
@gregtatum
Copy link
Member Author

I found a new issue. I didn't enforce sorting of the skeleton fields from the CLDR. This was fixed in:

1584020

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! A few nitpicks and suggestions.

components/datetime/src/skeleton.rs Outdated Show resolved Hide resolved
components/datetime/src/skeleton.rs Show resolved Hide resolved
components/datetime/src/skeleton.rs Outdated Show resolved Hide resolved
components/datetime/src/skeleton.rs Outdated Show resolved Hide resolved
components/datetime/src/provider/mod.rs Outdated Show resolved Hide resolved
Comment on lines 93 to 96
while let Some(item) = seq.next_element()? {
items.push(item)
}
Ok(Skeleton(items))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: We shouldn't assume that the fields are in order, because the bincode could come from an untrusted source. We should just validate that they are indeed in order (when reading a new item, check that it is greater than the previous item), and return a Serde error if they are not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Question: Do we know what kind of validation we need to do for binary sources and what kind of guarantees serde bincode provides?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if we don't trust the source, then I'd also like to check for duplicate fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is done and with tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. IMO, our responsibility for validation is only on the level of semantic correctness. Bincode's own deserializer code should be able to handle correctness of types and such. We need to perform semantic correctness validation whenever we have our own Deserialize impl, or when deserializing into a struct that has certain invariants to maintain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree here. I messed around with the binary representation, and serde seemed to do the correct thing when providing invalid binary data.

components/datetime/src/skeleton.rs Outdated Show resolved Hide resolved
components/datetime/src/fields/symbols.rs Show resolved Hide resolved
components/provider_cldr/src/transform/dates.rs Outdated Show resolved Hide resolved
Comment on lines +95 to +106
if prev_item > &item {
return Err(de::Error::invalid_value(
de::Unexpected::Other(&format!("field item out of order: {:?}", item)),
&"ordered field symbols representing a skeleton",
));
}
if prev_item == &item {
return Err(de::Error::invalid_value(
de::Unexpected::Other(&format!("duplicate field: {:?}", item)),
&"ordered field symbols representing a skeleton",
));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick (optional): combine error handling via prev_item >= &item. Less code to read and compile for an error that shouldn't happen in normal situations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since everything is green, I may fast follow with this one.

Comment on lines 93 to 96
while let Some(item) = seq.next_element()? {
items.push(item)
}
Ok(Skeleton(items))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. IMO, our responsibility for validation is only on the level of semantic correctness. Bincode's own deserializer code should be able to handle correctness of types and such. We need to perform semantic correctness validation whenever we have our own Deserialize impl, or when deserializing into a struct that has certain invariants to maintain.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching my review status to Approval with one more optional nitpick

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants