Lemmatization and CountVectorFeaturizers #6536

Closed
koaning opened this issue Sep 1, 2020 · 4 comments · Fixed by #6589
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR

Comments

@koaning
Contributor

koaning commented Sep 1, 2020

Rasa version:

1.8 onwards

Python version:

3.6 or 3.7

Operating system (windows, osx, ...):

All

Issue:

If you use spaCy components in your pipeline, the behaviour of the CountVectorFeaturizer changes: with the "word" analyser, it uses the lemma that spaCy provides in place of the token text. There's potentially merit to this idea, but currently:

  1. The user does not know this is happening because the behaviour is undocumented. At the very least we need to update the docs to reflect it.
  2. The user cannot configure this: whenever spaCy is in the pipeline, the lemma is always used instead of the token text.

I think the best way forward is to add a configuration option to the CountVectorFeaturizer that makes it possible to not use the lemma. Once this is implemented we can update the documentation.
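
A minimal sketch of what such a switch could look like, written in plain Python rather than against Rasa's actual code. The `Token` class, the `use_lemma` flag and the `bag_of_words` helper are hypothetical names used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a pipeline token."""

    text: str
    lemma: Optional[str] = None  # filled in by a lemmatizing component, if any


def bag_of_words(tokens: List[Token], use_lemma: bool = True) -> Counter:
    """Count words, using the lemma only when requested and available."""
    words = [
        (token.lemma if use_lemma and token.lemma else token.text).lower()
        for token in tokens
    ]
    return Counter(words)


tokens = [
    Token("The", "the"),
    Token("dogs", "dog"),
    Token("were", "be"),
    Token("barking", "bark"),
]
with_lemma = bag_of_words(tokens, use_lemma=True)
without_lemma = bag_of_words(tokens, use_lemma=False)
```

With the flag off, the counts are built from the surface forms ("dogs", "were", "barking") even though lemmas are present on the tokens.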

@koaning koaning added type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Sep 1, 2020
@koaning koaning added type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR and removed type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. labels Sep 1, 2020
@koaning
Contributor Author

koaning commented Sep 1, 2020

A related discussion needs to take place here too. Is this how we want to deal with lemmatisation in Rasa?

My impression is that we've always used the spaCy lemma features inside of the CountVectorFeaturizer and never really considered how we think about lemmatisation in general. The reason I want to bring it up is related to a feature request in rasa-nlu-examples. There are other tools for lemmatization/tokenization that offer support for languages that spaCy currently does not cover. I'd like to add support for them but it might be good to formalise how we want these components to behave.

@Ghostvv might be in favour of decoupling lemmatisation and the CountVectorFeaturizer.

@tabergma
Contributor

tabergma commented Sep 2, 2020

As soon as we add an option to the CountVectorFeaturizer to use the lemma or not, we are decoupling the lemmatisation from the CountVectorFeaturizer, aren't we? I think if users want to add another component to the pipeline that does lemmatisation, that is fine as they would need to reuse our interface, e.g. updating the lemma attribute of the tokens.
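
To illustrate the idea of a separate component that only updates the lemma attribute, here is a rough sketch in plain Python. It does not use Rasa's component interface; `Token`, `LookupLemmatizer`, `process` and the lookup table are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a pipeline token."""

    text: str
    lemma: Optional[str] = None


class LookupLemmatizer:
    """Hypothetical component that sets `lemma` from a lookup table."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table

    def process(self, tokens: List[Token]) -> List[Token]:
        for token in tokens:
            word = token.text.lower()
            # Fall back to the lowercased text when no lemma is known.
            token.lemma = self.table.get(word, word)
        return tokens


lemmatizer = LookupLemmatizer({"mice": "mouse", "ran": "run"})
tokens = lemmatizer.process([Token("Mice"), Token("ran"), Token("away")])
lemmas = [token.lemma for token in tokens]
```

A downstream featurizer would then only need to read `lemma`, without knowing which tool produced it.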

@koaning
Contributor Author

koaning commented Sep 3, 2020

@tabergma is the lemma attribute the only attribute that we want to "countvectorize"? The reason I'm asking is related to (yet another) feature request. In spaCy there are many attributes that could be interesting to featurize. There are flags for stopwords, sentiment and out-of-vocabulary terms. These are all interesting, but there are two paths towards an implementation.

  1. We can have a tokenizer that adds all of these attributes to the token and then we can have the CountVectorFeaturizer turn them all into sparse vectors for DIET.
  2. We can have a separate featurizer that handles this directly without the need to attach anything to the tokens.

I'm currently leaning towards implementing #2 but since there's an option to do something "more general" here I figured I'd at least mention it.
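
As a rough sketch of the second path, a standalone featurizer could compute the flags itself and emit one small binary vector per word, without touching the tokens. Everything here (`FlagFeaturizer`, its `stopwords` and `vocabulary` arguments, and the two-flag layout) is hypothetical and not tied to spaCy's or Rasa's APIs.

```python
from typing import List, Set


class FlagFeaturizer:
    """Hypothetical featurizer producing [is_stopword, is_out_of_vocabulary]."""

    def __init__(self, stopwords: Set[str], vocabulary: Set[str]) -> None:
        self.stopwords = stopwords
        self.vocabulary = vocabulary

    def featurize(self, words: List[str]) -> List[List[int]]:
        features = []
        for word in words:
            lowered = word.lower()
            features.append(
                [
                    int(lowered in self.stopwords),
                    int(lowered not in self.vocabulary),
                ]
            )
        return features


featurizer = FlagFeaturizer(stopwords={"the"}, vocabulary={"the", "cat"})
features = featurizer.featurize(["The", "cat", "flurbo"])
```

The flags never leave the featurizer, so no other component has to agree on extra token attributes.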

@Ghostvv
Contributor

Ghostvv commented Sep 3, 2020

I think CVF should only featurize tokens/words. The rest is the job of LexicalSyntacticFeaturizer.

@Ghostvv Ghostvv removed their assignment Sep 3, 2020
@tabergma tabergma assigned tabergma and unassigned koaning and tabergma Sep 7, 2020