
update for Unicode 16.0.0 #271

Open · wants to merge 1 commit into master
Conversation

stevengj (Member)

Draft PR to update our data tables to the upcoming Unicode 16.0.0 standard.

The data_generator.jl script is currently failing with:

```
julia --project=. data_generator.jl > utf8proc_data.c.new
ERROR: LoadError: AssertionError: !(haskey(comb_indices, dm1))
Stacktrace:
 [1] top-level scope
   @ ~/Documents/Code/utf8proc/data/data_generator.jl:325
in expression starting at /Users/stevenj/Documents/Code/utf8proc/data/data_generator.jl:293
```

@c42f, since you wrote/ported this script in #258, can you help?

@inkydragon (Contributor) commented Oct 9, 2024

An error occurred while processing the Character Decomposition Mapping data.
Here, dm0 and dm1 are the two characters produced by a canonical decomposition.

The assert fails when dm0 == dm1, i.e. when the same code point occurs as both the first and the second character of a decomposition. I tried the old Ruby script, and it fails at the same assert.

```julia
    @assert !haskey(comb_indices, dm0)
    comb_indices[dm0] = cumoffset
    cumoffset += last - first + 1 + 2
end

offset = 0
for dm1 in comb2nd_indices_sorted_keys
    @assert !haskey(comb_indices, dm1)  # <- this is the assert that fails
```

In other words, the script assumes that a character never decomposes into two identical characters.

But Unicode 16 introduces a new character, KIRAT RAI VOWEL SIGN AI (U+16D68), which decomposes into two copies of KIRAT RAI VOWEL SIGN E (U+16D67).

[Figure 13-16, p. 683, The Unicode Standard, Version 16.0 – Core Specification]

I'm not sure whether the current compressed table (xref: #68) can represent this type of mapping.
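
To make the broken invariant concrete, here is a toy illustration (hypothetical code, not taken from data_generator.jl):

```julia
# Toy illustration (not from data_generator.jl): collect the first and second
# code points of every two-character canonical decomposition.
decomps = Dict{UInt32,Tuple{UInt32,UInt32}}(
    0x00C5  => (0x0041, 0x030A),    # Å -> A + combining ring above
    0x16D68 => (0x16D67, 0x16D67),  # KIRAT RAI VOWEL SIGN AI -> E + E (new in 16.0)
)

firsts  = Set(dm0 for (dm0, _) in values(decomps))
seconds = Set(dm1 for (_, dm1) in values(decomps))

# The script assumes these sets are disjoint, so both can share a single
# index table (comb_indices); U+16D67 now appears in both.
@show intersect(firsts, seconds)   # Set(UInt32[0x00016d67])
```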

@stevengj (Member, Author) commented Oct 9, 2024

Naively commenting out the assert doesn't work; it fails the normalization test for U+113C5, another such character introduced in Unicode 16.0, which decomposes into two identical characters, U+113C2 + U+113C2.

It looks like we'll have to special-case the tables somehow for this. It would be unfortunate to have to add an extra table just for this, but I'm not sure I see a way around it yet.

@eschnett (Collaborator)

If I understand the combining table correctly, then bit 15 (0x8000) indicates whether a character can be the left character (bit is zero) or the right character (bit is one) in a combining pair. This would make it impossible to combine two identical characters.

If I understand correctly, then bit 14 (0x4000) indicates whether the result can be stored in 16 bits or requires 32 bits.

If that is correct, then the following approach would work:

  • Don't use bit 14. Instead, use bit 15 of the combined character to indicate whether there is a continuation character. This essentially stores the result as UTF-16.
  • Split bit 15 into two bits, "can be the left character" and "can be the right character".

This would not change the logic much, and there would be no need for a new table.
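
Roughly, the reading sketched above looks like this (flag names are illustrative, invented for this sketch, not taken from the utf8proc source):

```julia
# Illustrative encoding of a combining-table entry, as described above
# (constant names invented here; not the actual utf8proc definitions):
const RIGHT_CHAR_FLAG   = 0x8000  # bit 15: set => character is the *right* half of a pair
const RESULT_32BIT_FLAG = 0x4000  # bit 14: set => combined result needs 32 bits

can_be_left(entry::UInt16)  = (entry & RIGHT_CHAR_FLAG) == 0
can_be_right(entry::UInt16) = (entry & RIGHT_CHAR_FLAG) != 0

# With a single left/right bit, no character can play both roles, so a
# pair like U+16D67 + U+16D67 is unrepresentable:
for entry in (0x0123, 0x8123)
    @assert !(can_be_left(entry) && can_be_right(entry))
end
```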

@eschnett (Collaborator)

Nope, this wouldn't work. The combining table is used for two different purposes: for the first combining character it stores an index pointing to a second table, and for the second combining character it stores an index into that second table.

See #277 for a redesign. The combining table is quite small (fewer than 1000 characters overall), and for each first character the list of second characters is even smaller (16 (?) at most). I'm storing the secondary table as an array and iterating through it.
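
The lookup then reduces to something like this (illustrative sketch only; the actual data layout in #277 differs):

```julia
# Illustrative sketch of the #277 idea (data layout invented for this sketch):
# each first character maps to a short run of (second, combined) pairs, small
# enough that a linear scan is effectively free.
const COMBINATIONS = Dict{UInt32,Vector{Tuple{UInt32,UInt32}}}(
    0x0041  => [(0x0300, 0x00C0), (0x030A, 0x00C5)],  # A+grave -> À, A+ring -> Å
    0x16D67 => [(0x16D67, 0x16D68)],                  # E+E -> AI (Unicode 16.0)
)

function try_compose(first::UInt32, second::UInt32)
    pairs = get(COMBINATIONS, first, nothing)
    pairs === nothing && return nothing
    for (s, combined) in pairs
        s == second && return combined  # found the pair; return the composition
    end
    return nothing
end

@assert try_compose(0x16D67, 0x16D67) == 0x16D68
```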

@inkydragon (Contributor)

Another thought: the Rust community uses a minimal perfect hash to store these tables, which seems to be near-optimal for both performance and size.

https://github.com/unicode-rs/unicode-normalization/blob/1598cfaff2d5e0661e2f506d77c931c01e1f23ea/scripts/unicode.py#L375-L413
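
For context, the lookup side of such a scheme is tiny (illustrative sketch; the real salts and hash function are produced offline by the generator in the linked unicode.py):

```julia
# Illustrative minimal-perfect-hash lookup (constants and table layout are
# invented for this sketch): level 1 hashes the key to pick a per-bucket
# salt; level 2 hashes (key, salt) to a unique slot in the value table.
mix(key::UInt32, salt::UInt32) = (key ⊻ salt) * 0x01000193  # FNV-style mixing

function mph_lookup(key::UInt32, salts::Vector{UInt32},
                    keys::Vector{UInt32}, values::Vector{UInt16})
    n = length(keys)                                  # assumes length(salts) == n
    salt = salts[mod1(Int(mix(key, UInt32(0))), n)]   # level 1: pick a salt
    slot = mod1(Int(mix(key, salt)), n)               # level 2: unique slot
    keys[slot] == key || return nothing               # reject keys not in the table
    return values[slot]
end
```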

@stevengj (Member, Author)

@inkydragon, let's stick with @eschnett's approach for now, just to get support for Unicode 16 out the door.

In the future, it would be nice to benchmark alternative approaches, though.

@eschnett (Collaborator)

I thought about benchmarking... The question is, what cases do you want to benchmark? Long strings or short ones? Mostly ASCII or other characters? Where is this used in Julia? And if we're optimizing memory allocations and loops and table lookups: Why stick with C? So, no benchmarking now...
