
DIET classifier _predict_entities function clean_up_entities for Chinese language issue #5972

Closed
johnson7788 opened this issue Jun 9, 2020 · 4 comments · Fixed by #6000
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@johnson7788

Rasa version:
1.10.1
Rasa SDK version (if used & relevant):
1.10.1
Rasa X version (if used & relevant):

Python version:
3.7.3
Operating system (windows, osx, ...):
osx
Issue:
The DIET model predicts Chinese entities correctly, but the call entities = self.clean_up_entities(message, entities) then changes them to incorrect ones.

Error (including full traceback):

In rasa/nlu/extractors/extractor.py, the _token_clusters function treats an entire Chinese sentence as a single word, so the correct entities become wrong:
    def _token_clusters(tokens: List[Token]) -> List[List[Token]]:
        """Build clusters of tokens that belong to one word.

        Args:
            tokens: list of tokens

        Returns:
            Token clusters.

        """
        # token cluster = list of token indices that belong to one word
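
To illustrate the behavior described above, here is a minimal sketch, not Rasa's actual implementation. It assumes tokens are clustered into one word whenever a token starts exactly where the previous one ended. `Token`, `cluster_adjacent_tokens`, and `char_tokens` are hypothetical stand-ins written for this example.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a tokenizer token: text plus character offsets."""

    text: str
    start: int
    end: int


def cluster_adjacent_tokens(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens into one cluster while each starts where the previous ended."""
    clusters: List[List[Token]] = []
    for token in tokens:
        if clusters and clusters[-1][-1].end == token.start:
            clusters[-1].append(token)  # no gap: treated as part of the same word
        else:
            clusters.append([token])  # gap (e.g. a space): a new word starts
    return clusters


def char_tokens(text: str) -> List[Token]:
    """One token per non-space character (assumed tokenization for Chinese text)."""
    return [Token(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


english = [Token("change", 0, 6), Token("to", 7, 9), Token("SF", 10, 12)]
chinese = char_tokens("我想换成顺丰快递")

print(len(cluster_adjacent_tokens(english)))  # 3 clusters, one per word
print(len(cluster_adjacent_tokens(chinese)))  # 1 cluster covering the sentence
```

Under this assumption, whitespace is the only thing that separates words, so a sentence written without spaces collapses into a single cluster.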


Debug output: the whole sentence "我想换成顺丰快递" becomes a single entity:
2020-06-09 15:57:27 DEBUG    rasa.core.processor  - Received user message '我想换成顺丰快递' with intent '{'name': 'inform_choose_delivery', 'confide4753126800060272}' and entities '[{'entity': 'delivery', 'start': 0, 'end': 8, 'value': '我想换成顺丰快递', 'extractor': 'DIETClassifier'}]'
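
The entity in the log above runs from start 0 to end 8, which is the full sentence. The sketch below shows how such a span could arise if a cleanup step widens each entity to the boundaries of the word cluster it overlaps. It is a simplified stand-in, not Rasa's clean_up_entities. The function `expand_to_clusters` and the predicted span (4, 8) for "顺丰快递" are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def expand_to_clusters(entity: Span, clusters: List[Span]) -> Span:
    """Widen an entity span to cover every word cluster it overlaps."""
    start, end = entity
    for c_start, c_end in clusters:
        if c_start < entity[1] and entity[0] < c_end:  # cluster overlaps entity
            start, end = min(start, c_start), max(end, c_end)
    return start, end


text = "我想换成顺丰快递"
predicted = (4, 8)  # assumed model prediction: "顺丰快递"

# Without whitespace all characters are adjacent, so there is a single cluster.
single_cluster = [(0, len(text))]
start, end = expand_to_clusters(predicted, single_cluster)
print(text[start:end], start, end)  # 我想换成顺丰快递 0 8

# With whitespace-separated words the same step leaves the entity unchanged.
word_clusters = [(0, 6), (7, 9), (10, 12)]
print(expand_to_clusters((10, 12), word_clusters))  # (10, 12)
```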

After commenting out line 806 of nlu/classifiers/diet_classifier.py:
entities = self.clean_up_entities(message, entities)
the output is correct.

Command or request that led to error:


Content of configuration file (config.yml) (if relevant):

language: zh

pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-chinese"
    cache_dir: null
  - name: customrasa.printer.Printer
    alias: after HFTransformersNLP
#  - name: "JiebaTokenizer"
#    # Flag to check whether to split intents
#    "intent_tokenization_flag": False
#    # Symbol on which intent should be split
#    "intent_split_symbol": "_"
  - name: EntitySynonymMapper
  - name: "LanguageModelTokenizer"
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
  - name: customrasa.printer.Printer
    alias: after LanguageModelTokenizer
  - name: LanguageModelFeaturizer
#  - name: DucklingHTTPExtractor
#    url: http://localhost:8000
#    dimensions:
#      - number
  - name: customrasa.printer.Printer
    alias: after LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: customrasa.printer.Printer
    alias: after DIETClassifier

policies:
  - name: FormPolicy
  - name: FallbackPolicy
  - name: MemoizationPolicy
  - name: MappingPolicy
  - name: TEDPolicy

Content of domain file (domain.yml) (if relevant):

@johnson7788 johnson7788 added area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. labels Jun 9, 2020
@sara-tagger
Collaborator

Thanks for the issue, @Ghostvv will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

@tabergma
Contributor

tabergma commented Jun 9, 2020

@johnson7788 Thanks for submitting the issue. It was already solved in #5756, and the fix will be released in the next minor release. It is not yet clear when that will happen, so please be patient. If you want to use the DIETClassifier with Chinese in the meantime, I guess the best option would be to use Rasa 1.9.7.

@tabergma tabergma closed this as completed Jun 9, 2020
@johnson7788
Author

Thank you very much, high efficiency

@hei-my

hei-my commented Mar 31, 2021


Hello, could you share the config.yml you used for training with Chinese BERT so I can use it as a reference? With my HFTransformersNLP configuration, no entities are recognized. Could you also share the customrasa.printer.Printer component (- name: customrasa.printer.Printer)? Thank you.
