ICU-22898 MF2: fix various parser bugs and add more tests #3092

catamorphism · 2024-08-08T21:47:23Z

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22898
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

mihnita · 2024-08-09T17:47:54Z

Sorry, the sharing of the test files PR started while I was on vacation and somehow didn't get on my radar when I got back.

But I just found that the change broke Eclipse, and that now the json test files are packaged in the release jar files of ICU.

In general, having Maven consume files from outside its own folder tree is hacky and error-prone.
And when it finally works, we discover that it breaks the import to various IDEs.
I think that we would want to support at least Eclipse, IntelliJ, VS Code.

We currently have that problem with the LICENSE file (I can't import icu4j in Eclipse anymore)
I'm working on that...

The other problem with sharing is that it forces the c/c++ and Java implementations to be 100% in sync at all times.
Having them work the same is in general a good thing, of course.

But it becomes a PITA when something changes and we need to update the code,
because (often) the C++ and Java devs are not the same people.

This PR being an example.

TLDR: I am tempted to keep two copies of the test data.
There is an ant task in tools/cldr/ (copy-cldr-testdata) that copies some test data from CLDR.

These json files are probably in the same bucket: they live in CLDR, but must be tested against in ICU.

But the idea is that the ant task can copy the files to two places, icu4c and icu4j.
If code breaks, but one has the time / expertise for C/C++ only, they can open a ticket against icu4j, priority zero,
and revert the test files in icu4j until that issue is fixed.

I don't know if that is a good idea or not.
But this is what we already do for other tests.

Here is a fragment of the ant script:

        <fileset id="cldrTestData" dir="${cldrDir}/common/testData">
            <!-- Add directories here to control which test data is installed. -->
            <include name="localeIdentifiers/**"/> <!-- ... -->
            <include name="personNameTest/**"/> <!-- Used in ExhaustivePersonNameTest -->
            <include name="units/**"/> <!-- Used in UnitsTest tests -->
        </fileset>

        <copy todir="${testDataDir4C}">
            <fileset refid="cldrTestData"/>
        </copy>
        <copy todir="${testDataDir4J}">
            <fileset refid="cldrTestData"/>
        </copy>

We copy the same test files to icu4c/source/test/testdata/cldr and icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/cldr.

mihnita · 2024-08-09T17:50:24Z

Try:

diff -r icu4c/source/test/testdata/cldr icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/cldr

There are 3 sub-folders there: localeIdentifiers, personNameTest, units
I would propose we add another one (messageformat2) and do the same thing.

mihnita · 2024-08-09T18:28:28Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java

@@ -62,10 +62,22 @@ int readCodePoint() {
        return c;
    }

-    // Backup a number of characters.
+    // Backup a number of code points.
    void backup(int amount) {


This method works on code units (Java char), not code points.

That's because int readCodePoint() also works on code units.
It returns a code point, but the current offset in the input (cursor) is expressed in code units.
So when it finds a code point above the BMP it advances by 2.

Reasons:

When one wants to parse something from an offset, it is faster to express the offset in the same kind of units we use for storage. Our input buffer uses char, so we express offsets in code units. Otherwise we must iterate all the way to the offset in code points.

It is easier to deal with improperly paired surrogates. The spec does not try to enforce surrogate correctness, and I don't think it is the job of MF2 to do that. If I pass ".....\uDC00...{$user}!" there is no reason not to format it to ".....\uDC00...John!"

Even if there is disagreement about the reasons, and we decide to express offsets in code points, this fix is not sufficient.

We would also have to change getPosition(), skip(int amount), gotoPosition(int position)
Because we either work on code points, or on code units, we should not mix and match.
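The difference between code-unit offsets and code-point offsets discussed above can be illustrated with a minimal, standalone sketch. It uses only standard `java.lang` APIs; the class and method names (`OffsetUnits`, `widthAt`, `codeUnitOffset`) are hypothetical and are not part of ICU4J.

```java
// Hypothetical illustration; not ICU4J code.
class OffsetUnits {
    // Number of UTF-16 code units (Java chars) occupied by the code point at `offset`.
    static int widthAt(String s, int offset) {
        return Character.charCount(s.codePointAt(offset));
    }

    // Converting a code point index into a code unit offset requires a scan from the start.
    static int codeUnitOffset(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        String paired = "a\uD83D\uDE00b"; // 'a', U+1F600 (stored as two chars), 'b'
        System.out.println(paired.length());                           // 4 code units
        System.out.println(paired.codePointCount(0, paired.length())); // 3 code points
        System.out.println(widthAt(paired, 1));                        // 2
        System.out.println(codeUnitOffset(paired, 2));                 // 3

        String lone = "a\uDC00b"; // unpaired low surrogate in the middle
        System.out.println(widthAt(lone, 1));                          // 1
        System.out.println(Integer.toHexString(lone.codePointAt(1)));  // dc00
    }
}
```

With code-unit offsets, jumping to a position is a direct index into the buffer, and an unpaired surrogate is simply read back as a one-unit code point.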

Thanks for the explanation. I see that the issue isn't really with backup(), but how it's used in conjunction with readCodePoint() (reading a code point that happens to be a wide char, and then calling backup(1) to "push it back", is a bug).

In eee547c I reverted the change to backup() and fixed the three places I could find where readCodePoint() and backup() were being used in that way.
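A minimal sketch of the kind of bug described here, assuming a simplified cursor class. `PushBackSketch` and its `unread` helper are hypothetical; they only mimic the shape of `readCodePoint()` and `backup(int)` and are not the actual `InputSource` code or the fix made in the PR.

```java
// Hypothetical illustration; not the ICU4J InputSource class.
class PushBackSketch {
    private final String buffer;
    private int cursor = 0; // offset in code units (Java chars)

    PushBackSketch(String input) {
        this.buffer = input;
    }

    // Returns the next code point, or -1 at end of input.
    // Advances the cursor by 1 or 2 code units.
    int readCodePoint() {
        if (cursor >= buffer.length()) {
            return -1;
        }
        int cp = buffer.codePointAt(cursor);
        cursor += Character.charCount(cp);
        return cp;
    }

    // Moves the cursor back by a number of code units.
    void backup(int amount) {
        cursor -= amount;
    }

    int getPosition() {
        return cursor;
    }

    // Pushes back a code point just returned by readCodePoint(),
    // using its width in code units instead of a hard-coded 1.
    void unread(int cp) {
        if (cp != -1) {
            backup(Character.charCount(cp));
        }
    }

    public static void main(String[] args) {
        PushBackSketch bad = new PushBackSketch("\uD83D\uDE00x");
        bad.readCodePoint();   // reads U+1F600, cursor is now 2
        bad.backup(1);         // cursor is 1: the middle of the surrogate pair
        System.out.println(Integer.toHexString(bad.readCodePoint())); // de00

        PushBackSketch good = new PushBackSketch("\uD83D\uDE00x");
        good.unread(good.readCodePoint());
        System.out.println(good.getPosition()); // 0
    }
}
```

Keeping `backup()` in code units and making the caller responsible for the width keeps all offsets in one kind of unit.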

mihnita · 2024-08-09T18:30:59Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java

+        }
+    }
+
+    void backupOneCodePoint() {


What happens if it is an unpaired low surrogate?
There is nothing to prevent such a thing from showing up in a message.

True, it is incorrect.
But we are not trying to reject that, we accept it.
And this code would misbehave.

Sounds like we need a test for this! (I'll add one.)
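One way to step back by a code point while tolerating unpaired surrogates is to rely on `String.codePointBefore`, which returns a lone surrogate as a one-unit code point. The sketch below is an assumption-laden illustration: `BackupSketch`, `backupOneCodePoint`, and `naiveBackup` are hypothetical names, and `naiveBackup` is a made-up example of a fragile approach, not the code under review.

```java
// Hypothetical illustration; not the ICU4J implementation.
class BackupSketch {
    // Returns the code unit offset of the code point that ends just before `pos`.
    // A lone surrogate is treated as a single one-unit code point.
    static int backupOneCodePoint(String s, int pos) {
        if (pos <= 0) {
            return 0;
        }
        return pos - Character.charCount(s.codePointBefore(pos));
    }

    // Fragile variant: assumes every low surrogate is preceded by a high surrogate.
    static int naiveBackup(String s, int pos) {
        return Character.isLowSurrogate(s.charAt(pos - 1)) ? pos - 2 : pos - 1;
    }

    public static void main(String[] args) {
        String lone = "a\uDC00"; // 'a' followed by an unpaired low surrogate
        System.out.println(backupOneCodePoint(lone, 2)); // 1
        System.out.println(naiveBackup(lone, 2));        // 0, which also skips 'a'

        String paired = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(backupOneCodePoint(paired, 3)); // 1
    }
}
```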

mihnita

First, thank you very much for taking a stab at updating the Java implementation to the latest spec.
I was planning to do that, but it looks like trying to share the test files precipitated things somewhat :-)

I added some comments.

But if there is a decision to treat the mf2 test files the same way we treat other CLDR test files (units, people names, locale ids) and separate the C++ / Java tests, I can take over the Java part. And in a different PR.

Or you can continue it, as you already did a big chunk of it (all?)
Your call.

Thanks again,
M

mihnita · 2024-08-09T18:45:38Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java

-            if (nr != null) {
-                return nr;
-            }
+            // "The resolution of a text or literal MUST resolve to a string."


I think that this change breaks selection as it is currently implemented.
Because it means that "1.0" does not match "1.00"

Might be desired, to not match, I don't know.
But any change here should be accompanied by a change in the plural matcher.

One might say that this is desired.
But the format is locale sensitive.
For example, formatting 1 dollar I might get "1.00 dollars", which does not match |1|.
But formatting 1 Japanese yen will result in "1 ...", and that matches |1|.

So the developer using |1| in their message creates something that behaves differently on different locales.

I proposed interpreting |1| as a string and 1 as a number; that issue is not resolved yet:
unicode-org/message-format-wg#712

And Eemeli argued for it recently:
unicode-org/message-format-wg#842
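The concern about string matching versus numeric matching can be sketched with standard JDK classes only. This is an illustration under assumptions: `LiteralMatchSketch`, `matchesAsString`, and `matchesAsNumber` are hypothetical names and do not reflect how the ICU4J plural matcher is implemented.

```java
import java.math.BigDecimal;
import java.util.Currency;

// Hypothetical illustration; not the ICU4J selection code.
class LiteralMatchSketch {
    // Compares a formatted value with a variant key as plain strings.
    static boolean matchesAsString(String formatted, String key) {
        return formatted.equals(key);
    }

    // Compares the two as numeric values, ignoring trailing zeros.
    static boolean matchesAsNumber(String value, String key) {
        return new BigDecimal(value).compareTo(new BigDecimal(key)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(matchesAsString("1.00", "1")); // false
        System.out.println(matchesAsNumber("1.00", "1")); // true

        // The default number of fraction digits depends on the currency,
        // so the formatted string for "one unit" differs as well.
        System.out.println(Currency.getInstance("USD").getDefaultFractionDigits()); // 2
        System.out.println(Currency.getInstance("JPY").getDefaultFractionDigits()); // 0
    }
}
```

If the key is compared against formatted output, the same message can select different variants depending on how many fraction digits the formatter emits.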

I don't think the plural matcher has to be changed? This change is to address situations where a number literal isn't annotated, like:

{ "src": "Format {123456789.9876} number", "locale": "en-IN", "exp": "Format 123456789.9876 number", "comment": "Number literals are not formatted as numbers by default" }

This change doesn't change the behavior for .match on something with a :number annotation, or at least, there aren't any tests that show that.

mihnita · 2024-08-09T18:51:57Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java

@@ -712,7 +721,7 @@ private MFDataModel.Declaration getDeclaration() throws MFParseException {
        MFDataModel.Expression expression;
        switch (declName) {
            case "input":
-                skipMandatoryWhitespaces();
+                skipOptionalWhitespaces();


The rule is input-declaration = input [s] variable-expression
So the space is mandatory.

I don't think so? If it was mandatory, it would say input-declaration = input s variable-expression (without the square brackets.)
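In ABNF (RFC 5234), square brackets mark an optional element, so `[s]` permits zero whitespace. A minimal sketch of what that means for a parser, assuming a simplified check of only the start of the declaration: `InputDeclSketch` and `startsInputDeclaration` are hypothetical, and the whitespace set is a simplifying assumption rather than a copy of the spec's `s` rule.

```java
// Hypothetical illustration; not the ICU4J MFParser.
class InputDeclSketch {
    // Simplified whitespace test (assumed set: space, tab, CR, LF, U+3000).
    static boolean isWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\u3000';
    }

    // Checks that `src` starts with ".input", then whitespace, then '{'.
    // With whitespaceMandatory == false this models `input [s] variable-expression`;
    // with true it models `input s variable-expression`.
    static boolean startsInputDeclaration(String src, boolean whitespaceMandatory) {
        String keyword = ".input";
        if (!src.startsWith(keyword)) {
            return false;
        }
        int start = keyword.length();
        int pos = start;
        while (pos < src.length() && isWhitespace(src.charAt(pos))) {
            pos++;
        }
        if (whitespaceMandatory && pos == start) {
            return false; // no whitespace found, but the rule requires some
        }
        return pos < src.length() && src.charAt(pos) == '{';
    }

    public static void main(String[] args) {
        System.out.println(startsInputDeclaration(".input{$x}", false));  // true
        System.out.println(startsInputDeclaration(".input{$x}", true));   // false
        System.out.println(startsInputDeclaration(".input {$x}", true));  // true
    }
}
```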

mihnita · 2024-08-09T19:00:46Z

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FirstReleaseTests.java

I would rather not delete these extra tests.
They cover functionality not covered by the tests in the spec.
And that can't be covered by the spec.

They test that the implementation works end-to-end.

Some might be moved to the standard (CLDR).

But some can't, because they test that custom functions work, or that icu:skeletons work (namespaced),
or that various types work (Date, Calendar, java.time, or even long with :dateformat).

And that the type "magically selects" the proper function (for example, "...{$exp}..." formats as a date if exp is a date-like type: Date / Calendar / java.time).

This comment also (or mostly?) applies to the other deleted files (FunctionsTest, IcuFunctionsTest).

It might be less confusing to look at #3063 first. This PR includes that PR, plus additional changes.

In the description for #3063 I explained that I didn't delete any tests (unless they were duplicates), but rather, moved all the JSON filenames being read into a single file, because now the schema is the same for all the tests, so the same code can be re-used to read all of the files.

mihnita · 2024-08-09T19:04:23Z

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/ParserSmokeTest.java

@@ -29,16 +29,5 @@ public void testNullInput() throws Exception {
        MFParser.parse(null);
    }

-    @Test


There is not much / anything left if we remove this.

So the same comment that I had for deleted files applies: do we really want to delete extra tests?
I think that the more tests we have the better.
Even if some don't come from the official CLDR spec.

We can separate these tests in a different folder for cleanliness.
But not delete them, unless they are redundant.

See #3092 (comment)

mihnita · 2024-08-09T19:04:46Z

...main/core/src/test/java/com/ibm/icu/dev/test/message2/SelectorsWithVariousArgumentsTest.java

Same comment about deleting tests.

catamorphism · 2024-08-09T19:14:11Z

Sorry, the sharing of the test files PR started while I was on vacation and somehow didn't get on my radar when I got back.

But I just found that the change broke Eclipse, and that now the json test files are packaged in the release jar files of ICU.

In general, having Maven consume files from outside its own folder tree is hacky and error-prone. And when it finally works, we discover that it breaks the import to various IDEs. I think that we would want to support at least Eclipse, IntelliJ, VS Code.

Of course; is there a way to add automated testing that changes don't break IDEs?

We currently have that problem with the LICENSE file (I can't import icu4j in Eclipse anymore) I'm working on that...

The other problem with sharing is that it forces the c/c++ and Java implementations to be 100% in sync at all times. Having them work the same is in general a good thing, of course.

But it becomes a PITA when something changes and we need to update the code, because (often) the C++ and Java devs are not the same people.

This PR being an example.

TLDR: I am tempted to keep two copies of the test data. There is an ant task in tools/cldr/ (copy-cldr-testdata) that copies some test data from CLDR.

I don't want to unilaterally say yes or no to this; is it something to discuss in the TC meeting, maybe? (Note: I'll be on vacation from early next week until September 2, so I won't be at the next few meetings, but I can go back and read minutes later.)

catamorphism · 2024-08-09T20:54:12Z

First, thank you very much for taking a stab at updating the Java implementation to the latest spec. I was planning to do that, but it looks like trying to share the test files precipitated things somewhat :-)

Right, first it was precipitated by sharing test files; then I wrote a random test generator and decided it was only fair to run ICU4J on some of the generated tests as well as ICU4C, so that's where some of the other changes came from.

I added some comments.

But if there is a decision to treat the mf2 test files the same way we treat other CLDR test files (units, people names, locale ids) and separate the C++ / Java tests, I can take over the Java part. And in a different PR.

Certainly (like I said in another comment, I think that has to be discussed with the TC).

I think I've addressed all your comments, but let me know if I missed anything!

jira-pull-request-webhook · 2024-09-18T18:42:08Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_data_model.cpp is no longer changed in the branch
icu4c/source/i18n/messageformat2_function_registry.cpp is no longer changed in the branch
icu4c/source/i18n/messageformat2_macros.h is no longer changed in the branch
icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_parser.h is different
icu4c/source/i18n/messageformat2_serializer.cpp is different
icu4c/source/i18n/unicode/messageformat2_data_model.h is no longer changed in the branch
icu4c/source/test/intltest/messageformat2test_read_json.cpp is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/DataModelErrorsTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/DefaultTestProperties.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FirstReleaseTests.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FunctionsTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/IcuFunctionsTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/MF2Test.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Param.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/ParserSmokeTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SelectorsWithVariousArgumentsTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Sources.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/StringToListAdapter.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/SyntaxErrorsTest.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/TestUtils.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/Unit.java is no longer changed in the branch
testdata/message2/alias-selector-annotations.json is no longer changed in the branch
testdata/message2/duplicate-declarations.json is no longer changed in the branch
testdata/message2/icu-parser-tests.json is no longer changed in the branch
testdata/message2/icu-test-functions.json is no longer changed in the branch
testdata/message2/icu-test-previous-release.json is no longer changed in the branch
testdata/message2/icu-test-selectors.json is no longer changed in the branch
testdata/message2/invalid-number-literals-diagnostics.json is no longer changed in the branch
testdata/message2/invalid-options.json is no longer changed in the branch
testdata/message2/markup.json is no longer changed in the branch
testdata/message2/matches-whitespace.json is no longer changed in the branch
testdata/message2/more-data-model-errors.json is no longer changed in the branch
testdata/message2/more-functions.json is no longer changed in the branch
testdata/message2/more-syntax-errors.json is no longer changed in the branch
testdata/message2/README.txt is no longer changed in the branch
testdata/message2/reserved-syntax.json is no longer changed in the branch
testdata/message2/resolution-errors.json is no longer changed in the branch
testdata/message2/runtime-errors.json is no longer changed in the branch
testdata/message2/spec/data-model-errors.json is no longer changed in the branch
testdata/message2/spec/functions/date.json is no longer changed in the branch
testdata/message2/spec/functions/datetime.json is no longer changed in the branch
testdata/message2/spec/functions/integer.json is no longer changed in the branch
testdata/message2/spec/functions/number.json is no longer changed in the branch
testdata/message2/spec/functions/string.json is no longer changed in the branch
testdata/message2/spec/functions/time.json is no longer changed in the branch
testdata/message2/spec/syntax-errors.json is no longer changed in the branch
testdata/message2/spec/test-core.json is no longer changed in the branch
testdata/message2/spec/test-functions.json is no longer changed in the branch
testdata/message2/syntax-errors-diagnostics-multiline.json is no longer changed in the branch
testdata/message2/syntax-errors-end-of-input.json is no longer changed in the branch
testdata/message2/tricky-declarations.json is no longer changed in the branch
testdata/message2/valid-tests.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-09-18T18:55:05Z

I've rebased this PR, so it's now ready for review.

Note that this includes several fixes related to unsupported expression/statement (reserved) parsing. Recently these features were removed from the spec. However, since the recent spec changes haven't been integrated yet, I'd prefer to land these changes. I'm open to suggestions, though.

Changed my mind; since reserved syntax was removed from ICU4J already, I've removed reserved-related changes from this PR.

catamorphism · 2024-09-18T20:13:39Z

/cc @echeran

jira-pull-request-webhook · 2024-09-19T02:23:00Z

Notice: the branch changed across the force-push!

icu4c/source/test/intltest/messageformat2test_read_json.cpp is now changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
icu4j/main/core/src/main/java/com/ibm/icu/message2/StringUtils.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
testdata/message2/reserved-syntax-2.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-09-19T21:06:29Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_serializer.cpp is different
icu4c/source/test/intltest/messageformat2test_read_json.cpp is different
icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/CoreTest.java is different
testdata/message2/reserved-syntax-2.json is no longer changed in the branch
testdata/message2/syntax-errors-reserved.json is now changed in the branch
testdata/message2/valid-tests.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

mihnita

Thank you,
Mihai

ICU4C: Escape curly braces when serializing and normalizing
ICU4C: Escape '|' in patterns
ICU4C: When normalizing input, escape optionally-escaped characters in patterns
ICU4C/ICU4J: Allow trailing whitespace after a match
ICU4C: Fix parser to iterate over code points, not code units
Add tests with old reserved syntax as syntax-error tests

jira-pull-request-webhook · 2024-09-20T22:32:30Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

mihnita reviewed Aug 9, 2024

catamorphism mentioned this pull request Aug 13, 2024

ICU-22834 MF2: make tests compliant with schema and update spec tests #3063

Merged

This was referenced Sep 14, 2024

ICU-22890 Add test to show lone surrogate cause infinity loop #3166

Closed

ICU-22890 MF2: Add lone surrogate test to parser #3167

Merged

catamorphism force-pushed the mf2-test-schema-and-parser-fixes branch from eee547c to 2b5e263 Compare September 18, 2024 18:42

catamorphism changed the title ~~DRAFT: MF2: fix various parser bugs and add more tests~~ DRAFT: ICU-22898: MF2: fix various parser bugs and add more tests Sep 18, 2024

catamorphism marked this pull request as ready for review September 18, 2024 18:53

catamorphism changed the title ~~DRAFT: ICU-22898: MF2: fix various parser bugs and add more tests~~ ICU-22898: MF2: fix various parser bugs and add more tests Sep 18, 2024

catamorphism force-pushed the mf2-test-schema-and-parser-fixes branch from 2b5e263 to bf9d4c5 Compare September 19, 2024 02:22

catamorphism requested review from echeran and mihnita September 19, 2024 02:24

markusicu assigned mihnita Sep 19, 2024

catamorphism force-pushed the mf2-test-schema-and-parser-fixes branch from bf9d4c5 to 8c51816 Compare September 19, 2024 21:06

mihnita approved these changes Sep 20, 2024

echeran approved these changes Sep 20, 2024

catamorphism force-pushed the mf2-test-schema-and-parser-fixes branch from 8c51816 to 4b376f1 Compare September 20, 2024 22:32

catamorphism changed the title ~~ICU-22898: MF2: fix various parser bugs and add more tests~~ ICU-22898 MF2: fix various parser bugs and add more tests Sep 20, 2024

catamorphism merged commit 8f82fac into unicode-org:main Sep 20, 2024
100 checks passed

catamorphism mentioned this pull request Sep 20, 2024

ICU-22902 Remove support for Unsupported, Private & Reserved constructs #3193

Merged
