DIET classifier _predict_entities function clean_up_entities for Chinese language issue #5972

johnson7788 · 2020-06-09T07:59:33Z

Rasa version:
1.10.1
Rasa SDK version (if used & relevant):
1.10.1
Rasa X version (if used & relevant):

Python version:
3.7.3
Operating system (windows, osx, ...):
osx
Issue:
The DIET model predicts Chinese entities correctly, but the call entities = self.clean_up_entities(message, entities) then changes them to wrong ones.

Error (including full traceback):

In rasa/nlu/extractors/extractor.py, the function _token_clusters treats the whole Chinese sentence as a single word, so the correctly predicted entities become wrong:
    def _token_clusters(tokens: List[Token]) -> List[List[Token]]:
        """Build clusters of tokens that belong to one word.

        Args:
            tokens: list of tokens

        Returns:
            Token clusters.

        """
        # token cluster = list of token indices that belong to one word
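
A minimal sketch of why this can happen, assuming clusters are built by merging adjacent tokens that have no character gap between them. That rule is an assumption for illustration only; the real body of `_token_clusters` is not reproduced in this issue. `Token`, `char_tokens`, and `cluster_by_adjacency` are hypothetical names, not Rasa's classes or functions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Minimal stand-in for a tokenizer token (hypothetical, not Rasa's Token)."""

    text: str
    start: int

    @property
    def end(self) -> int:
        return self.start + len(self.text)


def char_tokens(text: str) -> List[Token]:
    """Emit one token per non-space character, roughly what a
    character-level Chinese tokenizer would produce."""
    return [Token(ch, i) for i, ch in enumerate(text) if not ch.isspace()]


def cluster_by_adjacency(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens into 'words': a token joins the previous cluster
    when it starts exactly where the previous token ended."""
    clusters: List[List[Token]] = []
    for token in tokens:
        if clusters and clusters[-1][-1].end == token.start:
            clusters[-1].append(token)
        else:
            clusters.append([token])
    return clusters


# Chinese text has no spaces, so every token touches its neighbour.
chinese = cluster_by_adjacency(char_tokens("我想换成顺丰快递"))
# A space-delimited string splits into one cluster per word.
english = cluster_by_adjacency(char_tokens("to SF"))

print(len(chinese))  # 1
print(len(english))  # 2
```

Under this assumption, any cleanup step that expands an entity to cover its whole word cluster would stretch a correct Chinese entity over the entire sentence, which matches the behaviour reported here.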


In the debug output, the whole sentence '我想换成顺丰快递' is turned into a single entity:
2020-06-09 15:57:27 DEBUG    rasa.core.processor  - Received user message '我想换成顺丰快递' with intent '{'name': 'inform_choose_delivery', 'confide4753126800060272}' and entities '[{'entity': 'delivery', 'start': 0, 'end': 8, 'value': '我想换成顺丰快递', 'extractor': 'DIETClassifier'}]'

After commenting out line 806 of nlu/classifiers/diet_classifier.py:
entities = self.clean_up_entities(message, entities)
the output is correct.

Command or request that led to error:

Content of configuration file (config.yml) (if relevant):

language: zh

pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-chinese"
    cache_dir: null
  - name: customrasa.printer.Printer
    alias: after HFTransformersNLP
#  - name: "JiebaTokenizer"
#    # Flag to check whether to split intents
#    "intent_tokenization_flag": False
#    # Symbol on which intent should be split
#    "intent_split_symbol": "_"
  - name: EntitySynonymMapper
  - name: "LanguageModelTokenizer"
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
  - name: customrasa.printer.Printer
    alias: after LanguageModelTokenizer
  - name: LanguageModelFeaturizer
#  - name: DucklingHTTPExtractor
#    url: http://localhost:8000
#    dimensions:
#      - number
  - name: customrasa.printer.Printer
    alias: after LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: customrasa.printer.Printer
    alias: after DIETClassifier

policies:
  - name: FormPolicy
  - name: FallbackPolicy
  - name: MemoizationPolicy
  - name: MappingPolicy
  - name: TEDPolicy

Content of domain file (domain.yml) (if relevant):


sara-tagger · 2020-06-09T12:00:06Z

Thanks for the issue, @Ghostvv will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

tabergma · 2020-06-09T14:50:49Z

@johnson7788 Thanks for submitting the issue. It was already fixed in #5756, and the fix will ship in the next minor release. It is not yet clear when that will happen, so please be patient. If you want to use the DIETClassifier with Chinese in the meantime, the best option is probably to use Rasa 1.9.7.

johnson7788 · 2020-06-09T15:16:57Z

Thank you very much, that was a very quick response.

hei-my · 2021-03-31T10:28:13Z

Hello, could you share the config.yml you used for training with Chinese BERT, so I can use it as a reference? With my HFTransformersNLP configuration the entities are not recognized. Could you also provide the custom component used in "- name: customrasa.printer.Printer"? Thank you.

johnson7788 added the area:rasa-oss and type:bug labels Jun 9, 2020

tabergma closed this as completed Jun 9, 2020

tabergma mentioned this issue Jun 11, 2020

Remove 'clean_up_entities' from DIETClassifier and CRFEntityExtractor #6000 (Merged)
