Cannot process Supplementary Character in Java / 无法在Java中处理Supplementary字符 #1564
Comments
Thanks for the feedback. 1.x indeed did not take 32-bit Supplementary Characters into account: 1.x is built on the double array trie data structure, whose transition function uses the 16-bit char as its basic unit. To support them, every char would need an extra call to … specifically, …
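For context (my own illustration, not part of the original comment), the sketch below shows why a 16-bit transition unit is the problem: a supplementary character occupies two chars in a Java String, so a char-by-char transition function sees two units where there is logically one code point.

// Minimal illustration: a supplementary character such as U+2FA1B is stored
// as a surrogate pair, i.e. two 16-bit chars.
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD87E\uDE1B";                            // U+2FA1B, one logical character
        System.out.println(s.length());                       // 2 -> two UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 -> one code point
        // A trie whose transition consumes one char at a time therefore
        // performs two transitions for this single character.
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D87E DE1B
    }
}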
If the portion of CJK Compatibility Ideographs Supplement that overlaps with CJK Unified Ideographs were converted to the latter, and anything beyond that were replaced with whitespace, Chinese text would essentially be fine. Such a conversion could be built as a mapping from the two code charts and applied on demand; perhaps that would be a better approach?
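As a rough sketch of the mapping this comment proposes (my own, not from the thread, and assuming that NFC normalization canonically folds CJK Compatibility Ideographs, including the Supplement block, onto CJK Unified Ideographs), the JDK's java.text.Normalizer could perform the conversion, with any remaining supplementary code point blanked out:

import java.text.Normalizer;

// Sketch: fold compatibility ideographs onto unified ones, blank out the rest.
public class CompatFold {
    static String fold(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(nfc.length());
        nfc.codePoints().forEach(cp ->
                sb.appendCodePoint(cp > 0xFFFF ? ' ' : cp)); // replace leftovers with a space
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\uD87E\uDE1B")); // expected: 鼖 (U+9F16)
    }
}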
I also ran into these Supplementary characters only tonight, while parsing IDS glyph description files. The first half of my project uses HanLP to process the Chinese Wikipedia dump into a sentence-level corpus, mainly relying on simplified-traditional conversion, perceptron segmentation, perceptron named entity recognition, and pinyin annotation. After hitting this problem today I went back to check whether not handling this case earlier might have caused issues, and tested how HanLP reacts. Overall it does not seem like a big problem, because characters in the extension blocks are unlikely to form meaningful words or entity names. Such characters matter little for semantics, named entity recognition, or other Chinese NLP tasks, so treating them as foreign text or unknown is entirely defensible.
Also, looking more carefully just now: although the Unicode values differ, the glyph really is the same character. In the IDS glyph data, the only difference between the descriptions of U+9F16 and U+2FA1B is …
Finally, thank you for taking the time to reply despite your busy schedule. That said, the IDS project does document that when two characters share the same description the smaller Unicode value should be kept, so this may be a bug on their side; I will go open an issue there. (runs)
It just occurred to me that code points might be the way to solve this. As mentioned above, the basic unit of the transition is the char type; the earlier link about Supplementary characters notes that the JSR 204 expert group ultimately decided to keep char as it was, and instead to add, on String, … So it should only be necessary to change the internal basic type from char to int and to add a String-to-code-point conversion at the entry points; I think the change would be small and most likely feasible.
The problem of duplicated characters would still have to be handled by the user, as discussed before. Using code points as the internal unit merely adds better support for rare characters, and judging from yesterday's filtering of the Wikipedia corpus, those rare characters amount to little more than ancient personal names, place names, or particularly hard-to-write element names. If HanLP intends to support them in the future, or to let users train something like named entity recognition for ancient figures, it might be worth a try.
Also, I never expected …
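A small sketch of the conversion step this comment suggests (my own illustration, not HanLP code): a String is turned into an int[] of code points at the API boundary, the internal logic works on ints, and the result is turned back into a String on the way out.

// Illustration only: code points as the internal unit instead of chars.
public class CodePointBoundary {
    public static void main(String[] args) {
        String text = "鼖\uD87E\uDE1B湖";             // mixes BMP and supplementary characters
        int[] cps = text.codePoints().toArray();      // String -> code points on the way in
        System.out.println(cps.length);               // 3, one slot per logical character
        // ... internal processing would index cps[i] instead of text.charAt(i) ...
        String back = new String(cps, 0, cps.length); // code points -> String on the way out
        System.out.println(back.equals(text));        // true
    }
}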
This issue has been mentioned on 蝴蝶效应. There might be relevant details there:
String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
HanLP.Config.ShowTermNature = false;
System.out.println(SpeedTokenizer.segment(text));
long start = System.currentTimeMillis();
int pressure = 1000000;
for (int i = 0; i < pressure; ++i)
{
    text.codePoints().toArray();   // the extra String -> code point conversion under discussion
    SpeedTokenizer.segment(text);
}
double costTime = (System.currentTimeMillis() - start) / (double) 1000;
System.out.printf("SpeedTokenizer分词速度:%.2f字每秒\n", text.length() * pressure / costTime);
It is indeed feasible, but the following costs have to be considered:
As for whether to adopt it, I have started a poll on the forum to gather opinions from the wider user base.
Good news: I realized that, in principle, HanLP's underlying data structures (the double-array trie and so on) do support multi-byte characters. As long as a word composed of multi-byte characters is in the dictionary, it is matched normally; the business logic just has to treat the length of a multi-byte character as 1. See the patch above: the bug is now fully resolved, with no loss of speed and no loss of Java 6 support.
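For readers following along, here is a small JDK-only sketch (my own, not the actual patch) of what "treating the length of a multi-byte character as 1" means at the business-logic level: lengths and offsets are counted in code points rather than chars.

// Illustration: counting term length in code points so a surrogate pair counts as 1.
public class CodePointLength {
    static int logicalLength(String word) {
        return word.codePointCount(0, word.length());
    }

    public static void main(String[] args) {
        System.out.println(logicalLength("淡水湖"));          // 3
        System.out.println(logicalLength("\uD87E\uDE1B湖"));  // 2, not 3
        // Advancing an offset by one logical character:
        String s = "\uD87E\uDE1B湖";
        int next = s.offsetByCodePoints(0, 1);                 // 2: the surrogate pair is one step
        System.out.println(next);
    }
}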
Describe the bug
A clear and concise description of what the bug is.
When handling supplementary characters, HanLP (I tested pinyin annotation and word segmentation) cannot process them properly. In short, Java represents a Unicode character above 0xFFFF as two separate chars (a surrogate pair), so HanLP treats it as two separate Chinese characters when computing its pinyin. Since neither surrogate is assigned to any valid character, the pinyin result is two 'none' entries rather than one.
Word segmentation cannot recognize such a character either, but it always keeps the pair together as one word, so the output contains no broken chars.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
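Since the template's repro section is empty, below is a minimal sketch of what the report describes (my own reconstruction, assuming HanLP 1.x's static convenience methods HanLP.convertToPinyinList and HanLP.segment are on the classpath):

import com.hankcs.hanlp.HanLP;

public class SupplementaryRepro {
    public static void main(String[] args) {
        String supplementary = "\uD87E\uDE1B"; // U+2FA1B, same glyph as 鼖 (U+9F16)
        // Pinyin: reported to yield [none, none], since each surrogate is looked up separately.
        System.out.println(HanLP.convertToPinyinList(supplementary));
        // Pinyin for the BMP code point, reported to yield [fén].
        System.out.println(HanLP.convertToPinyinList("\u9F16"));
        // Segmentation: reported to keep the surrogate pair together as one (unknown) word.
        System.out.println(HanLP.segment(supplementary + "湖"));
    }
}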
Describe the current behavior
A clear and concise description of what happened.
Get:
[none, none]
鼖 is represented in Java as \uD87E\uDE1B; neither char on its own is a valid Chinese character, so the result is two none entries.
Expected behavior
A clear and concise description of what you expected to happen.
Should get:
[fén], or at least a single [none].
System information
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.