Lemmatization and CountVectorFeaturizers #6536
Comments
A related discussion needs to take place here too. Is this how we want to deal with lemmatisation in Rasa? My impression is that we've always used the spaCy @Ghostvv might be in favour of decoupling lemmatisation and the CountVectorFeaturizer.
As soon as we add an option to the
@tabergma is the
I'm currently leaning towards implementing #2, but since there's an option to do something "more general" here I figured I'd at least mention it.
I think CVF should only featurize tokens/words. The rest is the job of
Rasa version:
1.8 onwards
Python version:
3.6 or 3.7
Operating system (windows, osx, ...):
All
Issue:
Currently, if you use spaCy components in your pipeline, the behaviour of the CountVectorFeaturizer changes: we use the lemma that spaCy provides as an alternative for the text when we use the "word" analyser. There's potentially merit to this idea, but currently lemma tokens are used at all times the user wants to use spaCy.
I think the best way forward is to implement a configuration in the CountVectorFeaturizer that makes it possible to not use the lemma. Once this is implemented we can update the documentation.
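To make the proposal concrete, here is a minimal sketch of what such a switch could look like. This is not Rasa's actual code: the Token class, the helper functions, and the option name use_lemma are all hypothetical and only illustrate the idea of choosing between the raw text and the lemma before counting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a pipeline token (not Rasa's Token class)."""
    text: str
    lemma: Optional[str] = None  # only set when a lemmatizer such as spaCy ran


def tokens_to_words(tokens: List[Token], use_lemma: bool = True) -> List[str]:
    """Pick the string each token contributes to the bag of words.

    `use_lemma` is an assumed option name: when it is False, or when a token
    has no lemma, the raw token text is used instead.
    """
    return [t.lemma if use_lemma and t.lemma else t.text for t in tokens]


def bag_of_words(tokens: List[Token], use_lemma: bool = True) -> Counter:
    """Count words the way a "word" analyser would, after the lemma choice."""
    return Counter(tokens_to_words(tokens, use_lemma))


tokens = [
    Token("dogs", "dog"),
    Token("were", "be"),
    Token("running", "run"),
    Token("dogs", "dog"),
]

with_lemma = bag_of_words(tokens, use_lemma=True)
without_lemma = bag_of_words(tokens, use_lemma=False)
print(with_lemma)
print(without_lemma)
```

With the flag on, the counts are over lemmas; with it off, they are over the surface forms, even though spaCy-style lemmas are present on the tokens. Tokens without a lemma fall back to their text in both cases, so pipelines without spaCy would behave the same either way.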