Bugfixes countvectorizer #5038
Conversation
Update Fork
- try to load vocabulary if it is not loaded yet
- apply token_pattern in _process_tokens to identify OOV tokens correctly
Thanks for the PR! We'll give it a review as soon as possible.
I moved the compilation of the regex to the init method so that we do not need to compile it every time a message gets processed.
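A minimal sketch of that change (the class and method names here are illustrative, not the actual Rasa code): compiling the pattern once in `__init__` avoids recompiling it for every processed message.

```python
import re

class SketchFeaturizer:
    """Illustrative stand-in for the featurizer; not the real Rasa class."""

    def __init__(self, token_pattern=r"(?u)\b\w\w+\b"):
        # Compile once here instead of on every call to tokenize().
        self._compiled_pattern = re.compile(token_pattern)

    def tokenize(self, text):
        return self._compiled_pattern.findall(text)

featurizer = SketchFeaturizer()
print(featurizer.tokenize("role-based access"))  # ['role', 'based', 'access']
```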
Thanks for spotting these! Left a small suggestion. Also please add two tests to validate the corresponding fixes. 🚀
@@ -141,6 +141,8 @@ def _load_OOV_params(self) -> None:
    def _check_attribute_vocabulary(self, attribute: Text) -> bool:
        """Check if trained vocabulary exists in attribute's count vectorizer"""
        try:
            if not hasattr(self.vectorizers[attribute], "vocabulary_"):
                self.vectorizers[attribute]._validate_vocabulary()
I would recommend moving this line to the exact place where vectorizers are created, i.e. inside _create_independent_vocab_vectorizers and _create_shared_vocab_vectorizers.
Hey @RolandJAAI, would love to get this bug fix in if you're able to come back and finish the suggested changes 🙂
Hi, sorry, I was very busy with client projects. I will try to finish this over the weekend.
@dakshvar22 can you please finish this PR?
@tmbo I did not finish this because someone from Rasa commented questioning whether this PR is still needed, and then there was no response, so I waited for a notification on that. That comment has now been deleted. I can finish it if it is still needed; just let me know.
Hi @RolandJAAI, your bugfix is still very relevant. Can you please complete it? Thanks.
Update Fork
added tests
@dakshvar22 I made the suggested changes and created the tests, which themselves run fine. But the handling of sequences has changed since I fixed this, and there is now a new conflict: actually applying the token_pattern of the CV featurizer may lead to a different number of tokens than the message had before, so the number of features no longer matches the shape of the message's earlier features. I think it's rather a strategic decision how to handle this. Maybe it is not a great idea to apply different token patterns in different places of the pipeline given your new sequence concept, so we could simply remove the token_pattern from the CV featurizer and let the user handle this in the tokenizer they choose. Alternatively, you could replace the original tokens, although that could lead to problems in some pipelines. Please let me know your thoughts.
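The shape conflict described above can be illustrated with a small sketch (the token lists are made up for the example; only the regex is scikit-learn's default token_pattern):

```python
import re

# Default scikit-learn token pattern: runs of two or more word characters.
pattern = re.compile(r"(?u)\b\w\w+\b")

tokenizer_tokens = ["role-based", "access"]  # 2 tokens from the pipeline tokenizer

# Re-tokenizing with the featurizer's token_pattern yields a different count,
# so per-token (sequence) features no longer line up one-to-one.
featurizer_tokens = [p for t in tokenizer_tokens for p in pattern.findall(t)]

print(len(tokenizer_tokens))   # 2
print(len(featurizer_tokens))  # 3
```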
@RolandJAAI Thanks for bringing this up. Moving the parameter to the tokenizers makes sense, but we need to discuss this internally first. Could you separate out the vocabulary-validation fix into a separate PR so that it can be merged ASAP?
…th and probably will have to be removed from the featurizer
@RolandJAAI Any particular reason why you replaced the call to the private method with the corresponding code? If the code is an exact copy, can you please keep the call to the private method to avoid maintaining an extra piece of code? It's a known issue with scikit-learn, and we can fix it if they refactor it someday (the unit test would break).
rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py
Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>
…r.py Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>
@dakshvar22 DeepSource throws an error when calling private methods. This is why I had to move the code over.
@RolandJAAI That's ok, DeepSource is an optional test.
Looks good! Thanks for finishing it up 🚀
Proposed changes:
This PR fixes two bugs in the CountVectorsFeaturizer:
The token_pattern parameter is used correctly during training and establishes a corresponding vocabulary, but during prediction the token pattern is only applied after checking for OOV tokens. As a result, many tokens are mapped to the OOV_token even though they would have been in the vocabulary after applying the token_pattern. Example: the token ["role-based"] is stored in the vocabulary as two tokens ["role", "based"] during training. But if a user message containing the token ["role-based"] is received, the component currently checks whether this exact token is in the vocabulary (it is not) and replaces it with the OOV_token. This has a heavy impact on all bots which use OOV_token. Therefore we need to apply self.token_pattern before the OOV check.
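A minimal sketch of the ordering bug (the helper functions and the tiny vocabulary are illustrative, not the Rasa implementation; the regex is scikit-learn's default token_pattern):

```python
import re

token_pattern = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default pattern
vocabulary = {"role", "based"}                # built during training
OOV_TOKEN = "oov"

def oov_check_buggy(tokens):
    # Old order: raw tokens are compared against the already-split vocabulary.
    return [t if t in vocabulary else OOV_TOKEN for t in tokens]

def oov_check_fixed(tokens):
    # Fixed order: apply token_pattern first, then check for OOV tokens.
    split = [p for t in tokens for p in token_pattern.findall(t)]
    return [t if t in vocabulary else OOV_TOKEN for t in split]

print(oov_check_buggy(["role-based"]))  # ['oov']  (false OOV hit)
print(oov_check_fixed(["role-based"]))  # ['role', 'based']
```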
After loading a model, the very first message is not processed correctly, because the vocabulary_ attribute is only established after the first call to the .transform method during processing. This skips the OOV_token replacement routine for the first message: tokens which should have been replaced with the OOV_token are passed directly to the CountVectorizer and result in an OOV prediction. This is now fixed by calling the original CountVectorizer._validate_vocabulary() method when checking for the existence of the vocabulary, which sets the attribute if a vocabulary was provided.
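The lazy initialization can be observed directly on scikit-learn's CountVectorizer: when a vocabulary is passed to the constructor, the vocabulary_ attribute only appears after the first fit/transform, or after an explicit call to _validate_vocabulary (a private, version-dependent scikit-learn method):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=["role", "based"])
print(hasattr(cv, "vocabulary_"))  # False: attribute not set yet

# _validate_vocabulary is a private scikit-learn method, so this relies on
# an implementation detail that could change between versions.
cv._validate_vocabulary()
print(hasattr(cv, "vocabulary_"))  # True: the provided vocabulary is materialized
```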
Status (please check what you already did):
- reformatted code with black (please check Readme for instructions)