Unicode 15.1 support #253

Merged
merged 7 commits into from
Oct 20, 2023

stevengj
Member

@stevengj stevengj commented Oct 18, 2023

Support for Unicode 15.1, which means updating the tables but also adding a new rule to the grapheme-break algorithm to account for the new Indic_Conjunct_Break property. Fixes #252.

Currently a work-in-progress. To do:

  • get the grapheme test passing

Update: should be ready now.
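
For readers unfamiliar with the new rule mentioned above: in UAX #29 it is rule GB9c, which forbids a grapheme break before a consonant when it is preceded by a consonant followed by a run of Extend/Linker characters containing at least one Linker (all in terms of Indic_Conjunct_Break values). The following is only a minimal Python sketch of that one rule, not the utf8proc implementation. The `INCB` table is a tiny hand-picked subset and the names `incb` and `gb9c_forbids_break` are hypothetical; real property values come from the Unicode Character Database.

```python
# Hypothetical, hand-picked subset of Indic_Conjunct_Break (InCB) values.
# A real implementation would load these from the Unicode Character Database.
INCB = {
    0x0915: "Consonant",  # DEVANAGARI LETTER KA
    0x0937: "Consonant",  # DEVANAGARI LETTER SSA
    0x094D: "Linker",     # DEVANAGARI SIGN VIRAMA
    0x200D: "Extend",     # ZERO WIDTH JOINER
}


def incb(ch):
    """Return the InCB value of a character, or "None" if it is not in the table."""
    return INCB.get(ord(ch), "None")


def gb9c_forbids_break(text, i):
    """Return True if rule GB9c forbids a break between text[i-1] and text[i].

    GB9c: Consonant [Extend Linker]* Linker [Extend Linker]* x Consonant
    """
    if i <= 0 or i >= len(text) or incb(text[i]) != "Consonant":
        return False
    seen_linker = False
    j = i - 1
    # Walk backwards over the run of Extend/Linker characters,
    # remembering whether at least one Linker was present.
    while j >= 0 and incb(text[j]) in ("Extend", "Linker"):
        if incb(text[j]) == "Linker":
            seen_linker = True
        j -= 1
    # The run must contain a Linker and be preceded by a Consonant.
    return seen_linker and j >= 0 and incb(text[j]) == "Consonant"


conjunct = "\u0915\u094D\u0937"   # KA + VIRAMA + SSA
plain = "\u0915\u0937"            # KA + SSA, no linker in between
print(gb9c_forbids_break(conjunct, 2))
print(gb9c_forbids_break(plain, 1))
```

The sketch only checks GB9c in isolation; a full segmenter also has to apply the other UAX #29 rules (for example GB9, which already forbids a break before the virama itself).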

@stevengj stevengj merged commit 46a442b into master Oct 20, 2023
11 checks passed
@stevengj stevengj deleted the unicode15.1 branch October 20, 2023 20:25
KristofferC pushed a commit to JuliaLang/julia that referenced this pull request Oct 23, 2023
Similar to #47392, support [Unicode
15.1](https://www.unicode.org/versions/Unicode15.1.0/) by bumping
utf8proc to 2.9.0 (JuliaStrings/utf8proc#253).

This allows us to use [118 exciting new emoji
characters](https://blog.emojipedia.org/whats-new-in-unicode-15-1-and-emoji-15-1/)
as identifiers, including "edible mushroom" `"\U1f344\u200d\U1f7eb"`
(but still no superscript "q").

Interestingly, they also updated the [Unicode recommendations on
programming-language identifiers
(UAX#31)](https://www.unicode.org/reports/tr31/tr31-39.html#Mathematical_Compatibility_Notation_Profile)
to finally "bless" identifiers beginning with `∂` and `∇` and/or ending
with numeric sub/superscripts. They still don't recommend nearly the
range of identifiers accepted by Julia, however.
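
As a side note on the string literal quoted in the commit message above: it is a ZWJ sequence, i.e. two emoji code points joined by U+200D ZERO WIDTH JOINER. A small Python sketch (the variable names are illustrative only) shows its structure; note that Python requires exactly eight hex digits after `\U`, unlike the shorter Julia literal.

```python
# The sequence from the commit message, written as a Python literal.
mushroom = "\U0001F344\u200D\U0001F7EB"

# Three code points: MUSHROOM, ZERO WIDTH JOINER, BROWN SQUARE.
code_points = [f"U+{ord(c):04X}" for c in mushroom]
print(code_points)

# UTF-8 length: 4 bytes + 3 bytes + 4 bytes.
utf8_len = len(mushroom.encode("utf-8"))
print(utf8_len)
```

Whether this displays as a single glyph depends on font and renderer support for Emoji 15.1.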
@PallHaraldsson
PallHaraldsson commented Oct 24, 2023

Do you think this change, i.e. the JuliaLang PR JuliaLang/julia#51799,
should be backported to 1.10, in case 1.10 becomes the LTS release, or at least by the time it does?

I reviewed the code here, which is rather small (and the Julia PR is trivial), except for the Ruby generator, which I don't think I need to scrutinize. It seems safe, and preferable, to backport, though I did not look at utf8proc_data.c since it's quite large (and generated?).

I don't think your change is a breaking change, but I'm not sure.

The repertoire addition consists almost entirely of urgently needed CJK ideographs, synchronized with planned additions to the Chinese national standard, GB 18030. The remaining additions to the repertoire extend the set of ideographic description characters, to better enable description of unusual CJK ideographs.

This matters because of the following, from Wikipedia:

The updated standard, GB18030-2022, is incompatible, and it had an enforcement date of 1 August 2023. It has been implemented in ICU 73.2 and in Java 21, and backported to older Java 8, 11, 17 (LTS releases) and 20.0.2.

Also:

Major updates were made to UAX #9, Unicode Bidirectional Algorithm, UAX #31, Unicode Identifiers and Syntax, and UTS #39, Unicode Security Mechanisms, to coordinate with the publication of an important new Unicode Technical Standard: UTS #55, Unicode Source Code Handling.

  • Segmentation rule changes, most notably:
      a. Support was added to line breaking (UAX #14, Unicode Line Breaking Algorithm) for orthographic syllables in a number of South and Southeast Asian writing systems.
      b. Grapheme cluster breaking (UAX #29, Unicode Text Segmentation) has adopted the aksara cluster behavior for six scripts. That cluster breaking behavior had previously been widely available via CLDR and ICU.
      c. These changes involved significant character property updates.

What I find likely to be breaking regarding GB18030-2022, and thus, I think, Unicode 15.1 (but not at the level of utf8proc?):

CJK/Unihan Changes

  • Seven old provisional properties have been removed.
  • Six new provisional properties have been added.

@stevengj
Member Author

It's not a breaking change, I think (mainly just adding new characters, and tweaking some grapheme-break rules), but it's a new feature and thus probably not eligible for backport.

Development

Successfully merging this pull request may close these issues.

utf8proc 2.8.0 does not support new grapheme-break rules in Unicode 15.1.0