Bugfixes countvectorizer #5038
Conversation
Update Fork
- try to load vocabulary if it is not loaded yet
- apply token_pattern in _process_tokens to identify OOV tokens correctly
Thanks for the PR! We'll give it a review as soon as possible.
I moved the compilation of the regex to the init method so that we do not need to compile it every time a message gets processed.
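A minimal sketch of that change (the class and method names here are illustrative, not the actual Rasa code): compiling the pattern once in `__init__` avoids recompiling it for every processed message.

```python
import re

class SketchFeaturizer:
    """Illustrative stand-in for the featurizer; not the real Rasa class."""

    def __init__(self, token_pattern=r"(?u)\b\w\w+\b"):
        # Compile once here instead of on every call to tokenize().
        self._compiled_pattern = re.compile(token_pattern)

    def tokenize(self, text):
        return self._compiled_pattern.findall(text)

featurizer = SketchFeaturizer()
print(featurizer.tokenize("role-based access"))  # ['role', 'based', 'access']
```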
Thanks for spotting these! Left a small suggestion. Also please add two tests to validate the corresponding fixes. 🚀
@@ -141,6 +141,8 @@ def _load_OOV_params(self) -> None:
    def _check_attribute_vocabulary(self, attribute: Text) -> bool:
        """Check if trained vocabulary exists in attribute's count vectorizer"""
        try:
            if not hasattr(self.vectorizers[attribute], "vocabulary_"):
                self.vectorizers[attribute]._validate_vocabulary()
I would recommend moving this line to the exact place where vectorizers are created, i.e. inside _create_independent_vocab_vectorizers and _create_shared_vocab_vectorizers.
Hey @RolandJAAI, would love to get this bug fix in if you're able to come back and finish the suggested changes 🙂
Hi, sorry, I was very busy with client projects. I will try to finish this over the weekend.
@dakshvar22 can you please finish this PR?
@tmbo I did not finish this because someone from Rasa commented questioning whether this PR is still needed, and then there was no response, so I waited for a notification on that. That comment has now been deleted. I can finish it if it is still needed; just let me know.
Hi @RolandJAAI, your bugfix is still very relevant. Can you please complete it? Thanks.
Update Fork
added tests
@dakshvar22 I made the suggested changes and created the tests, which themselves run fine. But the handling of sequences has changed since I fixed this, and there is now a new conflict: actually applying the token_pattern of the CV featurizer may lead to a different number of tokens than the message had before, so the number of features no longer matches the shape of the message's earlier features. I think it's rather a strategic decision how to handle this. Maybe it is not a great idea to apply different token patterns in different places of the pipeline given your new sequence concept, so we could simply remove the token_pattern from the CV featurizer and let the user handle this in the tokenizer they choose. Alternatively, you could replace the original tokens, although that could lead to problems in some pipelines. Please let me know your thoughts.
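The shape conflict described above can be illustrated with a small sketch (the token lists are made up for the example; only the regex is scikit-learn's default token_pattern):

```python
import re

# Default scikit-learn token pattern: runs of two or more word characters.
pattern = re.compile(r"(?u)\b\w\w+\b")

tokenizer_tokens = ["role-based", "access"]  # 2 tokens from the pipeline tokenizer

# Re-tokenizing with the featurizer's token_pattern yields a different count,
# so per-token (sequence) features no longer line up one-to-one.
featurizer_tokens = [p for t in tokenizer_tokens for p in pattern.findall(t)]

print(len(tokenizer_tokens))   # 2
print(len(featurizer_tokens))  # 3
```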
@RolandJAAI Thanks for bringing this up. Moving the parameter to the tokenizers makes sense, but we need to discuss this internally first. Could you separate out the vocabulary-validation fix into a separate PR so that it can be merged ASAP?
…th and probably will have to be removed from the featurizer
@RolandJAAI Any particular reason why you replaced the call to the private method with the corresponding code? If the code is an exact copy, can you please keep the call to the private method to avoid maintaining an extra piece of code? It's a known issue with scikit-learn, and we can fix it if they refactor it someday (the unit test would break).
rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py
Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>
…r.py Co-authored-by: Daksh Varshneya <dakshvar22@gmail.com>
@dakshvar22 DeepSource throws an error when calling private methods. This is why I had to move the code over.
@RolandJAAI That's ok, DeepSource is an optional test.
Looks good! Thanks for finishing it up 🚀
Proposed changes:
This PR fixes two bugs in the CountVectorsFeaturizer:
The token_pattern parameter is used correctly during training and establishes a corresponding vocabulary, but during prediction the token pattern is only applied after checking for OOV tokens. As a result, many tokens are mapped to the OOV_token even though they would have been in the vocabulary after applying the token_pattern. Example: the token ["role-based"] is stored in the vocabulary as two tokens ["role", "based"] during training. But if a user message containing the token ["role-based"] is received, the component currently checks whether this exact token is in the vocabulary (it is not) and replaces it with the OOV_token. This has a heavy impact on all bots which use OOV_token. Therefore we need to apply self.token_pattern before the OOV check.
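A minimal sketch of the ordering bug (the helper functions and the tiny vocabulary are illustrative, not the Rasa implementation; the regex is scikit-learn's default token_pattern):

```python
import re

token_pattern = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default pattern
vocabulary = {"role", "based"}                # built during training
OOV_TOKEN = "oov"

def oov_check_buggy(tokens):
    # Old order: raw tokens are compared against the already-split vocabulary.
    return [t if t in vocabulary else OOV_TOKEN for t in tokens]

def oov_check_fixed(tokens):
    # Fixed order: apply token_pattern first, then check for OOV tokens.
    split = [p for t in tokens for p in token_pattern.findall(t)]
    return [t if t in vocabulary else OOV_TOKEN for t in split]

print(oov_check_buggy(["role-based"]))  # ['oov']  (false OOV hit)
print(oov_check_fixed(["role-based"]))  # ['role', 'based']
```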
After loading a model, the very first message is not processed correctly, because the vocabulary_ attribute is only established after the first call to the .transform method during processing. This skips the OOV_token replacement routine for the first message: tokens which should have been replaced with the OOV_token are passed directly to the CountVectorizer and result in an OOV prediction. This is now fixed by calling the original CountVectorizer._validate_vocabulary() method when checking for the existence of the vocabulary, which sets the attribute if a vocabulary was provided.
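The lazy initialization can be observed directly on scikit-learn's CountVectorizer: when a vocabulary is passed to the constructor, the vocabulary_ attribute only appears after the first fit/transform, or after an explicit call to _validate_vocabulary (a private, version-dependent scikit-learn method):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=["role", "based"])
print(hasattr(cv, "vocabulary_"))  # False: attribute not set yet

# _validate_vocabulary is a private scikit-learn method, so this relies on
# an implementation detail that could change between versions.
cv._validate_vocabulary()
print(hasattr(cv, "vocabulary_"))  # True: the provided vocabulary is materialized
```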
Status (please check what you already did):
- reformatted code with black (please check Readme for instructions)