bugfix: fix BinTrie exiting the loop early during full segmentation #1775

Merged
merged 1 commit into from
Aug 12, 2022

carl10086

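BinTrie does not produce a full segmentation

Description

In some cases, code that uses BinTrie does not return every dictionary word contained in the text (full segmentation).

The following test reproduces the bug: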

package com.hankcs.hanlp.collection.trie.bintrie;

import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import org.junit.Before;
import org.junit.Test;

public class BinTrieParseTextTest {

    private BinTrie<Integer> trie;

    @Before
    public void setup() {
        this.trie = new BinTrie<Integer>();
        String[] words = new String[]{"溜", "儿", "溜儿", "一溜儿", "一溜"};
        /* Build a small dictionary; these entries are excerpted from the core dictionary file. */
        for (int i = 0; i < words.length; i++) {
            this.trie.put(words[i], i);
        }
    }

    @Test
    public void justForShowBugs() {
        showParseText("一溜儿");

        /* Appending any character to "一溜儿" (a space here) gives a completely different result. */
        showParseText("一溜儿" + " ");
    }

    /* Prints every dictionary word that parseText reports for the given text. */
    private void showParseText(final String text) {
        System.out.printf("======== Full segmentation demo: %s ========\n", text);
        this.trie.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<Integer>() {
            @Override
            public void hit(int begin, int end, Integer value) {
                System.out.println(text.substring(begin, end));
            }
        });

        System.out.println("===========================");
    }
}

The output is:

======== Full segmentation demo: 一溜儿 ========
一溜
一溜儿
===========================
======== Full segmentation demo: 一溜儿  ========
一溜
一溜儿
溜
溜儿
儿
===========================
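  • Symptom: for the input "一溜儿" the segmentation is incomplete, whereas replacing BinTrie with DoubleArrayTrie gives the full result.
  • Cause: debugging shows that BinTrie exits the loop early when a match ends on the last character of the text.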
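
The cause described above can be reproduced outside HanLP with a small stand-alone sketch. The class below is a hypothetical, simplified trie (`FullSegmentationSketch`, `Node`, `build`, `parse` and the `restartAtEnd` flag are all names invented for illustration); it is not HanLP's BinTrie code and not necessarily how the merged commit is written. It only shows the general pattern: if the scan restarts from the next start position only when it hits a dead end, a match that ends exactly on the last character lets the loop finish without trying the remaining start positions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FullSegmentationSketch {

    /** Hypothetical minimal trie node; NOT HanLP's BinTrie. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    /** Builds a trie from the given words. */
    static Node build(String... words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }
        return root;
    }

    /**
     * Collects every dictionary word found in text.
     * restartAtEnd = false mimics the early exit described in this PR;
     * restartAtEnd = true also restarts when the last character is reached.
     */
    static List<String> parse(Node root, String text, boolean restartAtEnd) {
        List<String> hits = new ArrayList<>();
        int length = text.length();
        int begin = 0;
        Node state = root;
        for (int i = begin; i < length; ++i) {
            state = state.children.get(text.charAt(i));
            if (state != null && state.isWord) {
                hits.add(text.substring(begin, i + 1));
            }
            boolean deadEnd = state == null;
            boolean atLastChar = i == length - 1;
            if (deadEnd || (restartAtEnd && atLastChar)) {
                // Restart from the next start position; the loop's ++i
                // then moves i to the new value of begin.
                i = begin;
                ++begin;
                state = root;
            }
        }
        return hits;
    }
}
```

With the five words from the test above, `parse(root, "一溜儿", false)` yields only 一溜 and 一溜儿, while `parse(root, "一溜儿", true)` yields 一溜, 一溜儿, 溜, 溜儿, 儿. The input with a trailing space yields all five words in either mode, because the space forces a dead end and therefore a restart.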

How Has This Been Tested?

See the test code in com.hankcs.hanlp.collection.trie.bintrie.BinTrieParseTextTest.java

hankcs merged commit b216b24 into hankcs:1.x on Aug 12, 2022

hankcs commented Aug 12, 2022

Thanks for pointing this out!
