Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLDR-17938 Add test data generator for date time formatting #4011

Merged
merged 1 commit into from
Sep 26, 2024

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Sep 3, 2024

CLDR-17938

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true


import java.io.IOException;
import java.util.Arrays;
import org.pcollections.HashPMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try not to introduce too many dependencies on on other libraries. It seems like this could just be done with regular collections or guava collections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support that concern generally. The reason I thought to introduce a persistent collection library is that there are multiple dimensions that need to be iterated over (generate combinations for all calendar systems, for all selected dates, for all selected times, for all ways of specifying style/length (skeleton vs. style options), etc.). In particular, the fields for skeleton and the style options in a test case should be mutually exclusive -- if one has data, then the other should be empty.

Using immutable maps across the nested loops eliminates chances for unintended consequences of stateful mutation. I strongly prefer immutable maps for this reason, and thus do not want to use regular collections. But that also means I need to append to immutable collections.

I didn't immediately see it in the Guava APIs, but I guess it is possible, although doing so is clunky. Going this route would compel me to write helper methods to reduce that boilerplate, which would feel like I'm partially recreating the persistent collections library that point.

On the other hand, persistent collections are designed for supporting immutability + append/remove/etc., and have the added benefit of code readability b/c of concision when you do perform those operations. The PCollections library has no dependencies and 75 KB of jar size, FWIW.

I'm happy to do what you suggest, here. I would prefer that it entail immutable data structures -- either via Guava or PCollections.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really have time to do an analysis of a new package to see if it introduces any new security / performance issue If you are not modifying these structures once built, it is pretty easy to just build them and convert to immutable structures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, removed PCollections as a dep, using the existing Guava dep's collections instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@srl295
Copy link
Member

srl295 commented Sep 3, 2024

PR title needs to start with CLDR-…  as should the commit


import java.io.IOException;
import java.util.Arrays;
import org.pcollections.HashPMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really have time to do an analysis of a new package to see if it introduces any new security / performance issue If you are not modifying these structures once built, it is pretty easy to just build them and convert to immutable structures.

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • i don't see the argument for the pcollections
  • as to iterating - that data should be pulled from the DTD anyway instead of hard coded.

emptyMapStringToObject.plus("timeZoneName", "longGeneric")
));

private static final PVector<String> CALENDARS =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate of other supplemental data

Comment on lines 82 to 91
private static final PVector<String> TIME_ZONES =
TreePVector.from(Arrays.asList(
"America/Los_Angeles",
"Africa/Luanda",
"Asia/Tehran",
"Europe/Kiev",
"Australia/Brisbane",
"Pacific/Palau",
"America/Montevideo"
));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other code uses ImmutableSet.of("…", "…", "…") - does this provide value beyond that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched over to using Guava according to Mark's guidance, so it's a moot point now.

(There's a bit of cloning during runtime and verbosity of code that could be avoided if one were to use persistent data structures.)


private static final HashPMap<String,Object> emptyMapStringToObject = HashTreePMap.empty();

private static final PVector<PMap<String, Object>> SPEC_OPTIONS =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this just be read from the DTD?

@echeran
Copy link
Contributor Author

echeran commented Sep 4, 2024

@macchiati @srl295 Do you know what the proper way to format a date within CLDR code is? @sffc and I found a few separate instances of date time formatting, such as in cldr-code -> org.unicode.cldr.draft.Misc#getFormatted that uses ICU4J's DateTimePatternGenerator and SimpleDateFormat. Is that the way to do it?

@macchiati
Copy link
Member

There is a class called ICUServiceBuilder, that patches ICU calls with CLDR data. (If you are doing something new we might need to add APIs to that.)

@srl295
Copy link
Member

srl295 commented Sep 4, 2024

a good source of example code would be the ExampleGenerator.java

@echeran
Copy link
Contributor Author

echeran commented Sep 9, 2024

Thanks @macchiati and @srl295 . @sffc and I took a look at ExampleGenerator.java and found in handleDateFormatItem that finds the right xpaths and the values at that those xpaths for the date/time patterns, and then calls icuServiceBuilder.getDateFormat(...) using those values in the call site.

There's a lot of important and necessary code in that method that would be better off refactored into a reusable place rather than taking a copy-and-paste shortcut. The method itself has a lot of copy and-pasted code 3x over for full/long/short date format patterns. We think it would be good if we also cleaned up & refactored the duplicated code as a part of creating the test data generator. WDYT?

@macchiati
Copy link
Member

macchiati commented Sep 9, 2024 via email

@srl295
Copy link
Member

srl295 commented Sep 10, 2024

@echeran dont' forget to open a ticket and update the PR title and commit messages.

@echeran echeran changed the title Add test data generator for date time formatting CLDR-17938 Add test data generator for date time formatting Sep 10, 2024
@sffc
Copy link
Member

sffc commented Sep 11, 2024

Sounds good; pulling useful methods out of ExampleGenerator (etc). There is a pending PR (from one of the interns) that refactors ExampleGenerator, so you should wait until that is merged to avoid conflicts.

Is it PR #3895?

It looks like it is just stuck on a linter. @echeran pushed the lint fix to the branch.

@macchiati
Copy link
Member

Still failing spotless, build — and is draft. Elango, do you want to try to fix this for v46, or punt to 47?

@srl295
Copy link
Member

srl295 commented Sep 12, 2024

yes thanks for merging #3895

@echeran
Copy link
Contributor Author

echeran commented Sep 16, 2024

Still failing spotless, build — and is draft. Elango, do you want to try to fix this for v46, or punt to 47?

Still trying to get this done ASAP in case it can fit for v46. When is the last call for v46?

@macchiati
Copy link
Member

macchiati commented Sep 17, 2024 via email

@echeran echeran marked this pull request as ready for review September 18, 2024 22:42
@echeran
Copy link
Contributor Author

echeran commented Sep 18, 2024

Any test data that is targeted at ICU should be in before their freeze date (Thursday??).

Things are looking good and complete now. The only thing is the Jira ticket isn't accepted yet. The key files are GenerateDateTimeTestData.java and its output, datetime.json.

@sffc
Copy link
Member

sffc commented Sep 19, 2024

The data currently looks like

  {
    "dateStyle": "short",
    "timeStyle": "short",
    "calendar": "gregorian",
    "formatType": "standard",
    "locale": "en-US",
    "dateLength": "short",
    "timeLength": "short",
    "timeZone": "Pacific/Palau",
    "input_string": "1969-07-16T16:00+09:00[Pacific/Palau]",
    "verify": "7/16/69, 4:00 PM"
  },

A few suggestions for improvement:

  1. Remove formatType since we aren't using it and it is a bad name (unclear what it means)
  2. Keep dateStyle/timeStyle or dateLength/timeLength but not both
  3. Remove timeZone since it is already in the input string
  4. Rename the field "input_string" to "input" or "datetime"
  5. Rename the field "verify" to "output" or "expected"

@echeran
Copy link
Contributor Author

echeran commented Sep 19, 2024

A few suggestions for improvement:

Done. All suggestions were good. PTAL.

@@ -513,11 +513,11 @@ private static ImmutableMap<Object, Object> getTestCaseForZonedDateTime(
dateTimeGluePatternFormatType,
icuServiceBuilder);

// "input_string" = the ISO 18601 UTC time zone formatted string of the zoned date time
optionsBuilder.put("inputString", zdt.toString());
// "input" = the ISO 18601 UTC time zone formatted string of the zoned date time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ISO - International Organization for Standardization

ISO 18601:2013 - Packaging and the environment — General requirements for the use of ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// "input" = the ISO 18601 UTC time zone formatted string of the zoned date time
// "input" = the ISO 8601 UTC time zone formatted string of the zoned date time

@macchiati
Copy link
Member

I can't review the Jason file right now, because it's too big for the browser.

@echeran
Copy link
Contributor Author

echeran commented Sep 20, 2024

I can't review the Jason file right now, because it's too big for the browser.

For the purposes of just quick reviewing, you can view the PR files including the large JSON file by replacing the .com with .dev in the PR URL, and it spins up VS Code in the browser to make viewing the files manageable:

https://github.dev/unicode-org/cldr/pull/4011

(If you had to do it at the command line, you would have to check out the branch via gh pr checkout 4011 and navigate yourself to the file, which takes a few more steps.)

Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip, Elango. Data looks ok to me.

(I was on a tablet, and so constained in how I could review.)

(longer term, for each testdata file I think we want to have a unittest in CLDR; and in the testdata file point to the generator and the unit test. That will make it easier to manage.)

@echeran
Copy link
Contributor Author

echeran commented Sep 23, 2024

Blocked on the Jira ticket being accepted. Everything else is ready to go.

private static final ImmutableSet<String> NUMBERING_SYSTEMS =
ImmutableSet.of("latn", "arab", "beng");

// Use underscores for locale id b/c of CLDR historical reasons, even though dash is preferable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be in bcp47 and then convert them to CLDR using ULocale.forLanguageTag

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to move forward (minor comments)

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@echeran echeran merged commit ed98270 into unicode-org:main Sep 26, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants