update for Unicode 16.0.0 #271
Conversation
An error occurred while processing the Character Decomposition Mapping data: an assert fails in utf8proc/data/data_generator.jl, lines 318 to 325 (commit 3de4596).
In other words, we assume that a character will not be split into two identical characters. But Unicode 16 introduces a new character that violates this assumption.
I'm not sure whether the current compressed table (xref: #68) can represent this type of mapping.
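As a rough illustration, here is a hypothetical Julia sketch of that assumption (invented names; the actual check lives in data_generator.jl lines 318 to 325):

```julia
# Hypothetical sketch of the assumption baked into the decomposition-table
# compression (invented names, not the actual data_generator.jl code):
# a two-character canonical decomposition is assumed to consist of two
# *distinct* code points.
function register_decomposition!(table, ch::UInt32, decomp::Vector{UInt32})
    if length(decomp) == 2
        # This held for every character up to Unicode 15.1; Unicode 16 adds
        # characters (e.g. U+113C5 -> U+113C2 U+113C2) that violate it.
        @assert decomp[1] != decomp[2] "duplicate code point in decomposition of U+$(string(ch, base=16))"
    end
    table[ch] = decomp
    return table
end
```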
Naively commenting out the assert doesn't work; it fails the normalization test for U+113C5, another character introduced in Unicode 16.0 that decomposes into two identical characters, U+113C2 + U+113C2. It looks like we'll have to special-case the tables somehow. It would be unfortunate to add an extra table just for this, but I'm not sure I see a way around it yet.
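For reference, the new mapping can be observed like this (assuming a Julia runtime whose Unicode tables already include Unicode 16; an older runtime would return the string unchanged):

```julia
using Unicode

# U+113C5 canonically decomposes to U+113C2 U+113C2 in Unicode 16:
# two identical code points.
nfd = Unicode.normalize(string(Char(0x113C5)), :NFD)
@show [codepoint(c) for c in nfd]   # expected: [0x000113c2, 0x000113c2]
```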
If I understand the combining table correctly, then bit 15 flags a character that can occur as the first character of a combining pair, and bit 14 flags one that can occur as the second. If that is correct, then an approach based on those flag bits would work (sketched below).
This change would not alter the logic much, and no new table would be needed.
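A hypothetical illustration of such a flag-based entry (the bit assignments and names here are invented for the sketch; as the next comment explains, the real combining table is organized differently):

```julia
# Invented encoding for illustration only: two flag bits plus an index payload.
const CAN_BE_FIRST  = 0x8000  # bit 15: may start a combining pair
const CAN_BE_SECOND = 0x4000  # bit 14: may be the second of a pair
const INDEX_MASK    = 0x3fff  # remaining bits: index payload

entry(index; first=false, second=false) =
    UInt16(index & INDEX_MASK) |
    (first  ? CAN_BE_FIRST  : UInt16(0)) |
    (second ? CAN_BE_SECOND : UInt16(0))

e = entry(42; first=true)
@assert (e & CAN_BE_FIRST) != 0 && (e & INDEX_MASK) == 42
```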
Nope, this wouldn't work. The combining table is used for two different purposes: for the first character of a combining pair it stores an index pointing into a second table, and for the second character it stores an index into that second table. See #277 for a redesign. The combining table is quite small (fewer than 1000 characters overall), and for each first character even smaller (at most 16 (?) second characters). I'm storing the secondary table as an array and iterating through it.
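A minimal sketch of that two-level lookup, with invented names (this is not utf8proc's actual table layout): the first character of a pair selects a small slice of a secondary array, which is then scanned linearly for the second character.

```julia
# Invented names for illustration; the real tables are generated C arrays.
const COMB1_RANGE = Dict{UInt32,UnitRange{Int}}()  # first char -> slice of COMB_PAIRS
const COMB_PAIRS  = Tuple{UInt32,UInt32}[]         # (second char, composed char)

function compose_pair(first::UInt32, second::UInt32)
    r = get(COMB1_RANGE, first, nothing)
    r === nothing && return nothing
    for i in r                      # at most ~16 candidates per first char
        snd, composed = COMB_PAIRS[i]
        snd == second && return composed
    end
    return nothing
end
```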
Another thought: the Rust community uses a minimal perfect hash to store these tables, which seems close to optimal for both performance and size.
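For the record, here is a toy Julia sketch of the "hash and displace" construction behind such minimal perfect hashes (illustrative only, with invented function names; not proposed utf8proc code):

```julia
# Toy "hash and displace" minimal perfect hash over a static key set.
function build_mph(keys::Vector{UInt32})
    n = length(keys)
    nbuckets = max(1, cld(n, 4))
    buckets = [UInt32[] for _ in 1:nbuckets]
    for k in keys
        push!(buckets[Int(hash(k) % nbuckets) + 1], k)
    end
    slots = falses(n)
    disp  = zeros(UInt, nbuckets)
    table = zeros(UInt32, n)
    # Place the largest buckets first; for each bucket, search for a
    # displacement that maps every key to a free, distinct slot.
    for b in sort(1:nbuckets; by = i -> -length(buckets[i]))
        isempty(buckets[b]) && continue
        d = UInt(1)
        while true
            trial = [Int(hash(k, d) % n) + 1 for k in buckets[b]]
            if allunique(trial) && !any(@view slots[trial])
                slots[trial] .= true
                table[trial] .= buckets[b]
                disp[b] = d
                break
            end
            d += UInt(1)
        end
    end
    return disp, table
end

# O(1) lookup: two hashes, no probing.
function mph_contains(k::UInt32, disp, table)
    d = disp[Int(hash(k) % length(disp)) + 1]
    return table[Int(hash(k, d) % length(table)) + 1] == k
end
```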
@inkydragon, let's stick with @eschnett's approach for now, just to get support for Unicode 16 out the door. In the future, it would be nice to benchmark alternative approaches, though.
I thought about benchmarking... The question is what to benchmark: long strings or short ones? Mostly ASCII, or other characters? Where is this used in Julia? And if we're optimizing memory allocations, loops, and table lookups, why stick with C? So, no benchmarking for now...
Draft PR to update our data tables to the upcoming Unicode 16.0.0 standard.
The data_generator.jl script is currently failing. @c42f, since you wrote/ported this script in #258, can you help?