Bugfixes countvectorizer #5038

Merged: 17 commits merged into RasaHQ:master on Jun 3, 2020

Conversation

@RolandJAAI (Contributor) commented Jan 4, 2020

Proposed changes:
This PR fixes two bugs in the CountVectorizer Featurizer

  • the token_pattern parameter is used correctly during training and builds a corresponding vocabulary, but during prediction the token pattern is only applied after checking for OOV tokens. As a result, many tokens are mapped to the OOV_token even though they would have been in the vocabulary after applying the token_pattern. Example: the token ["role-based"] is stored in the vocabulary as two tokens ["role", "based"] during training, but if a user message containing the token ["role-based"] is received, the component currently checks whether this exact token is in the vocabulary (it is not) and then replaces it with the OOV_token. This has a heavy impact on all bots that use OOV_token! Therefore we need to apply self.token_pattern before the OOV check (see the sketch below).

  • after loading a model, the very first message is not processed correctly, because the vocabulary attribute is only set after the first call to the .transform method during processing. This results in the OOV_token replacement routine being skipped for the first message: tokens which should have been replaced with the OOV_token are passed directly to the CountVectorizer and result in an OOV prediction. This is now fixed by calling the original CountVectorizer's _validate_vocabulary() method when checking for the existence of the vocabulary, which sets the attribute if a vocabulary was provided.
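
For illustration, here is a minimal sketch of the first fix, assuming the default scikit-learn token_pattern. The function and variable names below are simplified stand-ins for this write-up, not the actual featurizer code:

import re

def replace_oov_tokens(tokens, vocabulary, oov_token, token_pattern=r"(?u)\b\w\w+\b"):
    # apply the token_pattern *before* the vocabulary lookup, so that
    # "role-based" is checked as ["role", "based"] instead of being
    # replaced wholesale by the OOV token
    compiled_pattern = re.compile(token_pattern)
    processed = []
    for token in tokens:
        # split each token the same way the CountVectorizer will during transform
        for sub_token in compiled_pattern.findall(token):
            processed.append(sub_token if sub_token in vocabulary else oov_token)
    return processed

vocabulary = {"role", "based", "access"}
print(replace_oov_tokens(["role-based", "access", "frobnicate"], vocabulary, "OOV_token"))
# -> ['role', 'based', 'access', 'OOV_token']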

Status (please check what you already did):

  • added some tests for the functionality (na)
  • updated the documentation (na)
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

RolandJAAI and others added 4 commits January 4, 2020 16:03
- try to load vocabulary if it is not loaded yet
- apply token_pattern to _process_tokens to identify OOV_tokens correctly
@erohmensing (Contributor)

Thanks for the PR! We'll give it a review as soon as possible.

@RolandJAAI (Contributor, Author)

I moved the compilation of the regex to the init method so that we do not need to compile it every time a message gets processed.
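
As a rough sketch of that change (the class below is invented for illustration, not the real component), compiling once in __init__ and reusing the compiled pattern for every message:

import re
from typing import List, Optional, Text

class PatternSplitter:
    """Toy example: compile token_pattern once instead of once per message."""

    def __init__(self, token_pattern: Optional[Text] = r"(?u)\b\w\w+\b") -> None:
        # compiled once per component instance
        self._token_pattern = re.compile(token_pattern) if token_pattern else None

    def split(self, token: Text) -> List[Text]:
        # reuse the precompiled pattern on every call
        if self._token_pattern is None:
            return [token]
        return self._token_pattern.findall(token)

print(PatternSplitter().split("role-based"))  # -> ['role', 'based']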

@dakshvar22 (Contributor) left a comment

Thanks for spotting these! Left a small suggestion. Also please add two tests to validate the corresponding fixes. 🚀

@@ -141,6 +141,8 @@ def _load_OOV_params(self) -> None:
     def _check_attribute_vocabulary(self, attribute: Text) -> bool:
         """Check if trained vocabulary exists in attribute's count vectorizer"""
         try:
+            if not hasattr(self.vectorizers[attribute], "vocabulary_"):
+                self.vectorizers[attribute]._validate_vocabulary()

I would recommend moving this line to the exact place where vectorizers are created, i.e. - inside _create_independent_vocab_vectorizers and _create_shared_vocab_vectorizers
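
For context, a rough sketch of what that suggestion amounts to (the helper name below is hypothetical, not an actual Rasa method): validate the vocabulary right where a vectorizer is created from a stored vocabulary, so that vocabulary_ is already set before the first call to transform().

from sklearn.feature_extraction.text import CountVectorizer

def create_vectorizer_from_vocabulary(vocabulary, token_pattern=r"(?u)\b\w\w+\b"):
    vectorizer = CountVectorizer(vocabulary=vocabulary, token_pattern=token_pattern)
    # sets vocabulary_ (and fixed_vocabulary_) from the provided vocabulary
    vectorizer._validate_vocabulary()
    return vectorizer

vectorizer = create_vectorizer_from_vocabulary(["role", "based", "access"])
print(hasattr(vectorizer, "vocabulary_"))  # True, even before transform() is called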

@erohmensing (Contributor)

Hey @RolandJAAI, would love to get this bug fix in if you're able to come back to finishing the suggested changes 🙂

@RolandJAAI (Contributor, Author)

Hi, sorry, I was very busy with client projects. I will try to finish this over the weekend.

@tmbo (Member) commented May 15, 2020

@dakshvar22 can you please finish this PR?

@RolandJAAI (Contributor, Author)

@tmbo I did not finish this because someone from Rasa commented questioning whether this PR was still needed, and then there was no further response, so I waited for a notification on that. That comment has since been deleted. I can finish it if it is still needed; just let me know.

@dakshvar22 (Contributor)

Hi @RolandJAAI, your bugfix is still very relevant. Can you please complete it? Thanks!

@RolandJAAI (Contributor, Author)

@dakshvar22 I made the suggested changes and created the tests, which run fine on their own. But the handling of sequences has changed since I wrote this fix, and there is now a new conflict: actually applying the token_pattern of the CV featurizer can produce a different number of tokens than the message had before, so the number of features no longer matches the shape of the message's existing features. I think it is a rather strategic decision how to handle this. Maybe it is not a great idea to apply different token patterns in different places of the pipeline given your new sequence concept, so we could simply remove the token_pattern from the CV featurizer and let users handle this in the tokenizer they choose. Alternatively, you could replace the original tokens, although that could lead to problems in some pipelines. Please let me know your thoughts; a small illustration of the shape mismatch follows below.
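
To make the shape conflict concrete, a small hypothetical illustration (token values chosen only for demonstration): the tokenizer produced three tokens, but applying the featurizer's token_pattern yields four, so sequence features no longer line up one-to-one with the original tokens.

import re

tokens_from_tokenizer = ["role-based", "access", "control"]  # 3 tokens from the pipeline's tokenizer
token_pattern = re.compile(r"(?u)\b\w\w+\b")

tokens_after_pattern = [
    sub_token
    for token in tokens_from_tokenizer
    for sub_token in token_pattern.findall(token)
]

print(len(tokens_from_tokenizer))  # 3
print(tokens_after_pattern)        # ['role', 'based', 'access', 'control'] -> 4 feature rows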

@dakshvar22 (Contributor) commented May 25, 2020

@RolandJAAI Thanks for bringing this up. Moving the parameter to the tokenizers makes sense, but we need to discuss this internally first. Could you split the fix for validating the vocabulary into a separate PR so that it can be merged ASAP?

@dakshvar22 (Contributor)

@RolandJAAI Any particular reason why you replaced the call to the private method with the corresponding code? If the code is an exact copy, can you please keep the call to the private method to avoid maintaining an extra piece of code? It's a known issue with scikit-learn, and we can fix it if they refactor it someday (the unit test would break).

changelog/5038.bugfix.rst (outdated, resolved)
RolandJAAI and others added 2 commits May 27, 2020 10:18
Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>
…r.py

Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>
@RolandJAAI (Contributor, Author) commented May 27, 2020

@dakshvar22 DeepSource throws an error when calling private methods. This is why I had to move the code over.

@dakshvar22 (Contributor)

@RolandJAAI That's ok, DeepSource is an optional test.

@dakshvar22 (Contributor) left a comment

Looks good! Thanks for finishing it up 🚀

changelog/5038.bugfix.rst (outdated, resolved)
@tmbo merged commit 35388b7 into RasaHQ:master on Jun 3, 2020