Figure out what we're doing with ICU tokenisation and locales #15124
Comments
For reference I'm getting the following error on
I think the change in colon behaviour was actually added by Apple for ICU 72.1: https://unicode-org.atlassian.net/browse/ICU-22112
So possibly that will break in future Ubuntu versions too :)
I'm somewhat confused here: I tried segmenting this snippet on three versions of Ubuntu (and so three versions of ICU) and they all gave the same result, which suggests that the ICU version is not what affects it. But I also tried playing around with locales and could not provoke a change that way either.
print(_parse_words_with_icu("lazy'dog jumped:over the.fox 授業は八時三十分から始まるから。"))
-----
Ubuntu 20.04
ICU 66.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 22.04
ICU 70.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 23.04
ICU 72.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
The above were containerised (Docker) installs. My laptop has Kubuntu 22.10 and says
In an Ubuntu
Even if I change the
Huh, interesting, because I did see that change on their GitHub, but possibly that is a different library? I probably mixed something up then, sorry!
The ICU tokenisation rules seem to vary on different platforms.
Is it the ICU version? The locale? (How does ICU even get a default locale? I had a quick spelunk in the source code and couldn't find it!)
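To narrow down which of these is responsible, it may help to have each affected machine report the same set of facts. The following is only a sketch: `describe_icu_environment` is a hypothetical helper, not something in Synapse, and it assumes the PyICU bindings (imported as `icu`) are what is in use. The environment variables listed are the ones POSIX locale lookup commonly consults; whether ICU actually derives its default locale from them on a given platform is exactly the open question.

```python
import os

try:
    import icu  # PyICU; treated as optional so the script still runs without it
except ImportError:
    icu = None


def describe_icu_environment() -> dict:
    """Collect values that could plausibly influence ICU word-breaking.

    Hypothetical diagnostic helper: it only reads state, it changes nothing.
    """
    info = {
        # Locale-related environment variables (may be unset, giving None).
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_MESSAGES": os.environ.get("LC_MESSAGES"),
        "LANG": os.environ.get("LANG"),
        "icu_version": None,
        "icu_default_locale": None,
    }
    if icu is not None:
        # Version of the ICU library that PyICU is running against.
        info["icu_version"] = icu.ICU_VERSION
        # The locale ICU falls back to when the caller does not pass one.
        info["icu_default_locale"] = icu.Locale.getDefault().getName()
    return info


if __name__ == "__main__":
    for key, value in describe_icu_environment().items():
        print(f"{key}: {value}")
```

Comparing this output between two machines that tokenise differently should at least show whether the ICU version or the default locale differs between them.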
We need to figure out:
It feels like we want a 'universal' locale independent of the host's settings, so that Synapse works well with all languages. (This may be a pie in the sky goal!)
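One way to experiment with a host-independent locale is to pass an explicit `icu.Locale` to the break iterator instead of relying on the default. This is a minimal sketch rather than Synapse's actual `_parse_words_with_icu`: `parse_words_pinned` is a hypothetical name, the choice of `"en"` is an arbitrary assumption, and whether pinning the locale is enough to make results stable across ICU versions is not established here.

```python
import re
from typing import List, Optional

try:
    import icu  # PyICU; treated as optional so the sketch degrades gracefully
except ImportError:
    icu = None


def parse_words_pinned(text: str, locale_name: str = "en") -> Optional[List[str]]:
    """Split text into words using ICU with an explicitly chosen locale.

    Returns None when PyICU is not installed. The default locale name is an
    assumption made for illustration only.
    """
    if icu is None:
        return None

    # Pin the locale rather than using icu.Locale.getDefault().
    breaker = icu.BreakIterator.createWordInstance(icu.Locale(locale_name))
    breaker.setText(text)

    words: List[str] = []
    start = 0
    while True:
        end = breaker.nextBoundary()
        if end < 0:  # a negative value signals that there are no more boundaries
            break
        # Assumes boundary offsets line up with Python string indices, which
        # holds for text within the Basic Multilingual Plane.
        segment = text[start:end]
        # ICU also reports whitespace and punctuation runs as segments;
        # keep only segments containing at least one word character.
        if re.search(r"\w", segment):
            words.append(segment)
        start = end
    return words


if __name__ == "__main__":
    print(parse_words_pinned("lazy'dog jumped:over the.fox"))
```

Running this with a few different `locale_name` values on the machines that currently disagree would show whether the locale argument changes the segmentation at all.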
What's the best we can do?
This issue was originally dug up in #15079, but e.g. Patrick's machine generates yet another tokenisation. I'm not satisfied with the current solution.