
update for Unicode 16.0.0 #271

Open · wants to merge 1 commit into master
Conversation

stevengj (Member)

Draft PR to update our data tables to the upcoming Unicode 16.0.0 standard.

The data_generator.jl script is currently failing with:

```
julia --project=. data_generator.jl > utf8proc_data.c.new
ERROR: LoadError: AssertionError: !(haskey(comb_indices, dm1))
Stacktrace:
 [1] top-level scope
   @ ~/Documents/Code/utf8proc/data/data_generator.jl:325
in expression starting at /Users/stevenj/Documents/Code/utf8proc/data/data_generator.jl:293
```

@c42f, since you wrote/ported this script in #258, can you help?

@inkydragon (Contributor) commented Oct 9, 2024

An error occurred while processing the Character Decomposition Mapping data.
Here, dm0 and dm1 are the two characters produced by a canonical decomposition.

The assert fails when dm0 == dm1, i.e. when the same code point occurs as both the first and the second character of a decomposition. I tried the old Ruby script, and it fails at the same assert.

```julia
    @assert !haskey(comb_indices, dm0)
    comb_indices[dm0] = cumoffset
    cumoffset += last - first + 1 + 2
end

offset = 0
for dm1 in comb2nd_indices_sorted_keys
    @assert !haskey(comb_indices, dm1)  # <- this is the assert that fails
```

In other words, the script assumes that a character never decomposes into two identical characters.

But Unicode 16 introduces a new character, KIRAT RAI VOWEL SIGN AI (U+16D68), which decomposes into two copies of KIRAT RAI VOWEL SIGN E (U+16D67).

[Figure 13-16, p. 683, The Unicode Standard, Version 16.0 – Core Specification]

I'm not sure whether the current compressed table (xref: #68) can represent this type of mapping.
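
To make the broken invariant concrete, here is a toy illustration (hypothetical code, not taken from data_generator.jl):

```julia
# Toy illustration (not from data_generator.jl): collect the first and second
# code points of every two-character canonical decomposition.
decomps = Dict{UInt32,Tuple{UInt32,UInt32}}(
    0x00C5  => (0x0041, 0x030A),    # Å -> A + combining ring above
    0x16D68 => (0x16D67, 0x16D67),  # KIRAT RAI VOWEL SIGN AI -> E + E (new in 16.0)
)

firsts  = Set(dm0 for (dm0, _) in values(decomps))
seconds = Set(dm1 for (_, dm1) in values(decomps))

# The script assumes these sets are disjoint, so both can share a single
# index table (comb_indices); U+16D67 now appears in both.
@show intersect(firsts, seconds)   # Set(UInt32[0x00016d67])
```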

@stevengj (Member, Author) commented Oct 9, 2024

Naively commenting out the assert doesn't work; it fails the normalization test for U+113C5, another such character introduced in Unicode 16.0, which decomposes into two identical characters, U+113C2 + U+113C2.

It looks like we'll have to special-case the tables somehow for this. It would be unfortunate to have to add an extra table just for this, but I'm not sure I see a way around it yet.

@eschnett (Collaborator)

If I understand the combining table correctly, then bit 15 (0x8000) indicates whether a character can be the left character (bit is zero) or the right character (bit is one) in a combining pair. This would make it impossible to combine two identical characters.

If I understand correctly, then bit 14 (0x4000) indicates whether the result can be stored in 16 bits or requires 32 bits.

If that is correct, then the following approach would work:

  • Don't use bit 14. Instead, use bit 15 of the combined character to indicate whether there is a continuation character. This essentially stores the result as UTF-16.
  • Split bit 15 into two bits, "can be the left character" and "can be the right character".

This would not change the logic much, and there would be no need for a new table.
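
Roughly, the reading sketched above looks like this (flag names are illustrative, invented for this sketch, not taken from the utf8proc source):

```julia
# Illustrative encoding of a combining-table entry, as described above
# (constant names invented here; not the actual utf8proc definitions):
const RIGHT_CHAR_FLAG   = 0x8000  # bit 15: set => character is the *right* half of a pair
const RESULT_32BIT_FLAG = 0x4000  # bit 14: set => combined result needs 32 bits

can_be_left(entry::UInt16)  = (entry & RIGHT_CHAR_FLAG) == 0
can_be_right(entry::UInt16) = (entry & RIGHT_CHAR_FLAG) != 0

# With a single left/right bit, no character can play both roles, so a
# pair like U+16D67 + U+16D67 is unrepresentable:
for entry in (0x0123, 0x8123)
    @assert !(can_be_left(entry) && can_be_right(entry))
end
```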

@eschnett (Collaborator)

Nope, this wouldn't work. The combining table is used for two different purposes: for the first combining character it stores an index pointing to a second table, and for the second combining character it stores an index into that second table.

See #277 for a redesign. The combining table is quite small (fewer than 1000 characters overall), and for each first character the list of second characters is even smaller (16 (?) at most). I'm storing the secondary table as an array and iterating through it.
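
The lookup then reduces to something like this (illustrative sketch only; the actual data layout in #277 differs):

```julia
# Illustrative sketch of the #277 idea (data layout invented for this sketch):
# each first character maps to a short run of (second, combined) pairs, small
# enough that a linear scan is effectively free.
const COMBINATIONS = Dict{UInt32,Vector{Tuple{UInt32,UInt32}}}(
    0x0041  => [(0x0300, 0x00C0), (0x030A, 0x00C5)],  # A+grave -> À, A+ring -> Å
    0x16D67 => [(0x16D67, 0x16D68)],                  # E+E -> AI (Unicode 16.0)
)

function try_compose(first::UInt32, second::UInt32)
    pairs = get(COMBINATIONS, first, nothing)
    pairs === nothing && return nothing
    for (s, combined) in pairs
        s == second && return combined  # found the pair; return the composition
    end
    return nothing
end

@assert try_compose(0x16D67, 0x16D67) == 0x16D68
```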

@inkydragon (Contributor)

Another thought: the Rust community uses a minimal perfect hash to store these tables, which seems to be near-optimal for both performance and size.

https://github.com/unicode-rs/unicode-normalization/blob/1598cfaff2d5e0661e2f506d77c931c01e1f23ea/scripts/unicode.py#L375-L413
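
For context, the lookup side of such a scheme is tiny (illustrative sketch; the real salts and hash function are produced offline by the generator in the linked unicode.py):

```julia
# Illustrative minimal-perfect-hash lookup (constants and table layout are
# invented for this sketch): level 1 hashes the key to pick a per-bucket
# salt; level 2 hashes (key, salt) to a unique slot in the value table.
mix(key::UInt32, salt::UInt32) = (key ⊻ salt) * 0x01000193  # FNV-style mixing

function mph_lookup(key::UInt32, salts::Vector{UInt32},
                    keys::Vector{UInt32}, values::Vector{UInt16})
    n = length(keys)                                  # assumes length(salts) == n
    salt = salts[mod1(Int(mix(key, UInt32(0))), n)]   # level 1: pick a salt
    slot = mod1(Int(mix(key, salt)), n)               # level 2: unique slot
    keys[slot] == key || return nothing               # reject keys not in the table
    return values[slot]
end
```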

@stevengj (Member, Author)

@inkydragon, let's stick with @eschnett's approach for now, just to get support for Unicode 16 out the door.

In the future, it would be nice to benchmark alternative approaches, though.

@eschnett (Collaborator)

I thought about benchmarking... The question is, what cases do you want to benchmark? Long strings or short ones? Mostly ASCII or other characters? Where is this used in Julia? And if we're optimizing memory allocations and loops and table lookups: Why stick with C? So, no benchmarking now...
