Normative: List new Unicode v14 Script/Script_Extensions values #2515

Merged (2 commits) on Jan 5, 2022

Conversation

mathiasbynens
Member

The following new values for the already-supported properties Script and Script_Extensions are added:

  • Cypro_Minoan (Cpmn)
  • Old_Uyghur (Ougr)
  • Tangsa (Tnsa)
  • Toto (Toto)
  • Vithkuqi (Vith)

Issue: #2514
Tests: tc39/test262#3199
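
Whether a given engine already accepts the five new values depends on the Unicode version it ships with. The following is a minimal feature-detection sketch, not part of the PR; `supportsScriptValue` and `reportUnicode14Support` are hypothetical helper names. It assumes only that an unrecognized property value makes a `u`-flag pattern a `SyntaxError`.

```javascript
// Feature-detection sketch: does this engine accept a given Script value?
// `supportsScriptValue` is a hypothetical helper, not part of any standard API.
function supportsScriptValue(value) {
  try {
    // Under the `u` flag, an unknown property value is an early SyntaxError.
    new RegExp(`\\p{Script=${value}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// The five values (and their short aliases) listed in this PR.
const unicode14Scripts = {
  Cypro_Minoan: 'Cpmn',
  Old_Uyghur: 'Ougr',
  Tangsa: 'Tnsa',
  Toto: 'Toto',
  Vithkuqi: 'Vith',
};

// Maps each new value to true only if both the long name and the alias parse.
function reportUnicode14Support() {
  const report = {};
  for (const [name, alias] of Object.entries(unicode14Scripts)) {
    report[name] = supportsScriptValue(name) && supportsScriptValue(alias);
  }
  return report;
}

console.log(reportUnicode14Support());
```

The output varies by engine: an engine built against Unicode 13 data would be expected to report `false` for every entry.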

@mathiasbynens added the unicode label Sep 14, 2021
@ljharb added the normative change and needs test262 tests labels Sep 14, 2021
@michaelficarra requested a review from a team September 16, 2021 16:56
@mathiasbynens added the has test262 tests label and removed the needs test262 tests label Sep 30, 2021
@ljharb requested review from syg and bakkot October 6, 2021 21:47
@bakkot added the editor call label Oct 16, 2021
@michaelficarra added the editor call and needs consensus labels and removed the editor call label Nov 10, 2021
@syg removed the editor call label Nov 10, 2021
@bakkot
Contributor

bakkot commented Nov 10, 2021

@mathiasbynens I believe last time we said we were going to ask for consensus for these. Do you want to present this or would you like the editors to take care of it? I imagine it will be an extremely short agenda item either way.

@mathiasbynens
Member Author

mathiasbynens commented Nov 11, 2021

My memory is that we would indeed ask for consensus for any new properties being added, but not for any new values to already-supported properties. Otherwise we’d be in this weird state where we claim to support the latest Unicode, and we claim to support Script_Extensions in property escapes, yet we somehow don’t support one of the Script_Extensions values in the latest Unicode release.

IMHO, asking for consensus on new values for already-supported properties would be equivalent to asking for TC39 consensus on the set of new ID_Start characters in every new Unicode release. I can’t imagine anyone objecting to a specific ID_Start character but not others, and similarly I cannot imagine anyone objecting to a specific Script_Extensions value but not another. We either support Script_Extensions as a whole, or we don’t — and that decision was already made as part of the \p{…} proposal.

The meeting notes from when this was last discussed are at https://github.com/tc39/notes/blob/master/meetings/2020-06/june-2.md#introducing-unicode-support , but I agree they don’t clearly capture a decision on this specific case. My memory and intuition go back to when I first proposed \p{…}.

@bakkot added the editor call label Nov 15, 2021
@bakkot
Contributor

bakkot commented Nov 17, 2021

@mathiasbynens That sounds reasonable, but I don't think we explicitly got consensus for that at the last meeting. I want to bring this PR to the next meeting and more explicitly get consensus on

  • new properties will require plenary approval
  • new values for existing properties will be marked normative but merged by editors without plenary approval

On the other hand, it's not totally clear to the editors what purpose the lists of values for General_Category and Script/Script_Extensions serve if the editors are always going to fast-track updates to them without plenary approval, as opposed to replacing those two tables with a normative reference to Unicode.

@mathiasbynens
Member Author

> @mathiasbynens That sounds reasonable, but I don't think we explicitly got consensus for that at the last meeting. I want to bring this PR to the next meeting and more explicitly get consensus on
>
>   • new properties will require plenary approval
>   • new values for existing properties will be marked normative but merged by editors without plenary approval

This sounds like re-establishing consensus on something that was already part of the approved \p{…} proposal, but I understand that we might be disagreeing on that.

As part of the \p{…} proposal we decided to support all Script, Script_Extensions, and General_Category character property values, and (at the very least) the intention was for that list of values to be kept in sync with upstream Unicode over time. I thought this intention was part of what was approved at the time.

If this were relitigated and failed to regain consensus, that would be problematic for the reasons I described.

@michaelficarra
Member

@mathiasbynens any reason not to remove the tables that enumerate those values then?

@mathiasbynens
Member Author

> @mathiasbynens any reason not to remove the tables that enumerate those values then?

We discussed that, no? https://github.com/tc39/notes/blob/master/meetings/2020-06/june-2.md#introducing-unicode-support They were originally added to reduce the risk of interoperability issues, and IMHO we shouldn’t change that.

@michaelficarra
Member

@mathiasbynens It was clear that we didn't want to remove the table that lists properties. It was not clear that we didn't want to remove the tables that list values.

@mathiasbynens
Member Author

Besides interop, listing the values + aliases explicitly is useful since we decided not to support loose matching in ECMAScript’s \p{…}. If we defer to the Unicode Standard, it becomes ambiguous what the “canonical” way to write each value / value alias really is, since Unicode doesn’t define it. (We generally use the casing that’s found in Unicode’s data files, but as far as Unicode is concerned, these labels are case-insensitive.)
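
To illustrate the strict matching described above, here is a small sketch (the helper name `isValidPropertyEscape` is hypothetical) that checks which spellings an engine accepts inside `\p{…}`. It assumes a `u`-flag `RegExp` constructor that throws `SyntaxError` for unrecognized names, and is meant as an illustration rather than a conformance test.

```javascript
// Returns true if `\p{<body>}` parses as a property escape in this engine.
// `isValidPropertyEscape` is a hypothetical helper used only for illustration.
function isValidPropertyEscape(body) {
  try {
    new RegExp(`\\p{${body}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const spellings = [
  'Script=Latin',            // canonical property name and value
  'sc=Latn',                 // listed aliases for both
  'Script_Extensions=Latin', // canonical name of the related property
  'script=latin',            // lowercased: rejected, since case is significant
  'Script=latin',            // lowercased value only: rejected
  'Script-Extensions=Latin', // hyphen instead of underscore: rejected
  'Script = Latin',          // white space is not ignored: rejected
];

for (const body of spellings) {
  console.log(`\\p{${body}}`, isValidPropertyEscape(body) ? 'accepted' : 'SyntaxError');
}
```

Under Unicode's loose matching rules, all of these spellings would name the same property and value. ECMAScript accepts only the ones that appear verbatim in the spec tables.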

@michaelficarra removed the editor call label Nov 24, 2021
@michaelficarra
Member

Added a topic for next plenary: tc39/agendas#1084

@mathiasbynens removed the needs consensus label Dec 15, 2021
@mathiasbynens
Member Author

Per the 2021-12-14 TC39 meeting, patches that add new upstream Unicode values & value aliases to already-supported-in-ECMAScript properties do not require explicit consensus. Removing the label. Thanks for presenting this, @michaelficarra!

@bakkot
Contributor

bakkot commented Dec 15, 2021

@mathiasbynens One of the things we said we'd do as part of this is to document exactly how the names of the entries in this table are chosen, either in the spec text itself or in some metadata somewhere. Can you help us figure out the right way of saying it? Would "these names correspond to the first spelling (including casing) used for each value in Scripts.txt" be accurate?

@michaelficarra
Member

It was also suggested that we include that editorial process in the spec itself, so I plan to discuss the merits of that in the next editor call.

@michaelficarra added the editor call label Dec 15, 2021
@mathiasbynens
Member Author

mathiasbynens commented Dec 15, 2021

> @mathiasbynens One of the things we said we'd do as part of this is to document exactly how the names of the entries in this table are chosen, either in the spec text itself or in some metadata somewhere. Can you help us figure out the right way of saying it? Would "these names correspond to the first spelling (including casing) used for each value in Scripts.txt" be accurate?

What you were suggesting might lead to the same results for the Script/Script_Extensions names (I haven’t checked) but wouldn’t explain where we get the canonical value aliases from. The source I’ve been using is PropertyValueAliases.txt (since we need the values + any value aliases as well). This applies to both the Script/Script_Extensions table as well as the General_Category table.

Note that this logic unfortunately doesn’t cover ALL properties listed in the spec — in particular Any, ASCII, and Assigned don’t appear in PropertyValueAliases.txt nor in PropertyAliases.txt since they’re technically not properties (at the Unicode level). For those, we just settled on a casing that felt consistent with the other properties/values (as part of the \p{…} proposal). Then again, that is a one-off scenario that wouldn’t occur in these annual PRs — we’d absolutely still want to get consensus on adding brand new properties to ECMAScript (just not for any new values for existing properties).
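
As a rough sketch of how values and aliases could be pulled from PropertyValueAliases.txt, the snippet below parses a short inline excerpt. It assumes the file's semicolon-separated layout is property alias, short value alias, long value name, then optional further aliases, with `#` starting a comment. The excerpt, `sampleExcerpt`, and `parseValueAliases` are all illustrative and hypothetical; this is not the tooling actually used for the PR.

```javascript
// Illustrative excerpt in the assumed PropertyValueAliases.txt layout:
//   property alias ; short value alias ; long value name [; more aliases]  # comment
const sampleExcerpt = `
# Illustrative excerpt only, not the full file.
sc ; Cpmn ; Cypro_Minoan
sc ; Ougr ; Old_Uyghur
sc ; Tnsa ; Tangsa
sc ; Toto ; Toto
sc ; Vith ; Vithkuqi
gc ; Lu   ; Uppercase_Letter
`;

// Collects { name, aliases } entries for one property alias (e.g. 'sc' or 'gc'),
// preserving the spelling and casing exactly as written in the input.
function parseValueAliases(text, property) {
  const entries = [];
  for (const rawLine of text.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    if (line === '') continue;
    const fields = line.split(';').map((field) => field.trim());
    if (fields.length < 3 || fields[0] !== property) continue;
    const [, shortAlias, longName, ...otherAliases] = fields;
    entries.push({ name: longName, aliases: [shortAlias, ...otherAliases] });
  }
  return entries;
}

console.log(parseValueAliases(sampleExcerpt, 'sc'));
```

Run against the excerpt, this yields one entry per new Script value, such as `Cypro_Minoan` with alias `Cpmn`, matching the pairs listed in the PR description.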

@markusicu
Contributor

Right, these are the files to look at:

Formally speaking, the particular use of case/space/underscore is unlikely to change but not (I think) guaranteed to not change.

@michaelficarra removed the editor call label Dec 22, 2021
@bakkot force-pushed the unicode-14-property-escapes branch 2 times, most recently from f13634e to 6f6c232, December 23, 2021 18:11
@bakkot
Contributor

bakkot commented Dec 23, 2021

I've pushed up a commit with the following paragraph just before the table-unicode-general-category-values.html and table-unicode-script-values.html tables:

The spellings of entries in these tables (including casing) were chosen to match the first occurrence of each property in the files PropertyAliases.txt and PropertyValueAliases.txt in the Unicode Character Database at the time each entry was added to this specification. However, because the precise spellings in those files are not guaranteed to be stable, implementations are required to follow this table rather than those files.

@mathiasbynens PTAL. If this does not seem correct to you, can you suggest an alternative?

spec.html Outdated
@@ -35141,6 +35141,9 @@ <h1>
<emu-note>
<p>This algorithm differs from <a href="https://unicode.org/reports/tr44/#Matching_Symbolic">the matching rules for symbolic values listed in UAX44</a>: case, <emu-xref href="#sec-white-space">white space</emu-xref>, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the `Is` prefix is not supported.</p>
</emu-note>
<emu-note>
<p>The spellings of entries in these tables (including casing) were chosen to match the first occurrence of each property in the files PropertyAliases.txt and PropertyValueAliases.txt in the Unicode Character Database at the time each entry was added to this specification. However, because the precise spellings in those files are not guaranteed to be stable, implementations are required to follow this table rather than those files.</p>
Member Author

LGTM 👍 Some nits/thoughts:

Member

So like this?

Suggested change
- <p>The spellings of entries in these tables (including casing) were chosen to match the first occurrence of each property in the files PropertyAliases.txt and PropertyValueAliases.txt in the Unicode Character Database at the time each entry was added to this specification. However, because the precise spellings in those files are not guaranteed to be stable, implementations are required to follow this table rather than those files.</p>
+ <p>The spellings of entries in these tables (including casing) were chosen to match the first occurrence of each property in the files <a href="https://unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt"><code>PropertyAliases.txt</code></a> and <a href="https://unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt"><code>PropertyValueAliases.txt</code></a> in the Unicode Character Database at the time each entry was added to this specification. However, because the precise spellings in those files are not guaranteed to be stable, implementations are required to follow this table rather than those files.</p>

Member Author

:shipit:

Contributor

Would that also apply to the other UCD file references (cf. #2594)?

Member

@gibson042 Yes, please do.

Contributor

Done.

@michaelficarra added the editor call label Dec 30, 2021
@bakkot force-pushed the unicode-14-property-escapes branch from 6f6c232 to 63bef6d, January 5, 2022 22:58
@bakkot removed the editor call label Jan 5, 2022
@bakkot added the ready to merge label Jan 5, 2022
mathiasbynens and others added 2 commits January 5, 2022 15:05
…c39#2515)

The following new values for the already-supported properties `Script` and `Script_Extensions` are added:

- Cypro_Minoan (Cpmn)
- Old_Uyghur (Ougr)
- Tangsa (Tnsa)
- Toto (Toto)
- Vithkuqi (Vith)

Issue: tc39#2514
Tests: tc39/test262#3199