Lemmatization and CountVectorFeaturizers #6536

koaning · 2020-09-01T14:17:20Z

Rasa version:

1.8 onwards

Python version:

3.6 or 3.7

Operating system (windows, osx, ...):

All

Issue:

If you use spaCy components in your pipeline, the behaviour of the CountVectorFeaturizer changes: with the "word" analyser, it featurizes the lemma that spaCy provides instead of the token text. There's potentially merit to this idea, but currently:

- The user does not know this is happening because the behaviour is undocumented. At the very least we need to update the docs to reflect this.
- The user cannot configure this: anyone who wants to use spaCy must use the lemma tokens at all times.

I think the best way forward is to implement a configuration in the CountVectorFeaturizer that makes it possible to not use the lemma. Once this is implemented we can update the documentation.
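A minimal sketch of what such a switch could look like, using only the standard library. The `Token` class and `count_features` function are hypothetical stand-ins for illustration, not Rasa's implementation; the flag name `use_lemma` is borrowed from the title of #6589.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    # Hypothetical stand-in for a pipeline token; not Rasa's actual class.
    text: str
    lemma: Optional[str] = None


def count_features(tokens: List[Token], use_lemma: bool = True) -> Counter:
    """Count lower-cased tokens, using the lemma instead of the text if requested."""
    words = [
        (t.lemma if use_lemma and t.lemma else t.text).lower() for t in tokens
    ]
    return Counter(words)


tokens = [
    Token("Dogs", "dog"),
    Token("were", "be"),
    Token("barking", "bark"),
    Token("dogs", "dog"),
]
with_lemma = count_features(tokens, use_lemma=True)
without_lemma = count_features(tokens, use_lemma=False)
```

With the flag on, "Dogs" and "dogs" are both counted under their lemma; with it off, the surface forms are counted and the lemmas are ignored.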


koaning · 2020-09-01T14:21:53Z

A related discussion needs to take place here too. Is this how we want to deal with lemmatisation in Rasa?

My impression is that we've always used the spaCy lemma features inside of the CountVectorFeaturizer and never really considered how we think about lemmatisation in general. The reason I want to bring it up is related to a feature request in rasa-nlu-examples. There are other tools for lemmatization/tokenization that offer support for languages that spaCy currently does not cover. I'd like to add support for them but it might be good to formalise how we want these components to behave.

@Ghostvv might be in favour of decoupling lemmatisation and the CountVectorFeaturizer.

tabergma · 2020-09-02T08:58:44Z

As soon as we add an option to the CountVectorFeaturizer to use the lemma or not, we are decoupling the lemmatisation from the CountVectorFeaturizer, aren't we? I think if users want to add another component to the pipeline that does lemmatisation, that is fine as they would need to reuse our interface, e.g. updating the lemma attribute of the tokens.
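To illustrate the idea of a separate component that only fills in the lemma attribute, here is a rough sketch. `Token` and `DictionaryLemmatizer` are hypothetical names made up for this example and do not reflect Rasa's component interface; the lookup table stands in for whatever external lemmatizer a user would plug in.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    # Hypothetical stand-in for a pipeline token; not Rasa's actual class.
    text: str
    lemma: Optional[str] = None


class DictionaryLemmatizer:
    """Hypothetical component that sets each token's lemma from a lookup table."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table

    def process(self, tokens: List[Token]) -> List[Token]:
        for token in tokens:
            key = token.text.lower()
            # Fall back to the lower-cased text when the word is not in the table.
            token.lemma = self.table.get(key, key)
        return tokens


lemmatizer = DictionaryLemmatizer({"ran": "run", "houses": "house"})
tokens = lemmatizer.process([Token("Ran"), Token("houses"), Token("quickly")])
lemmas = [t.lemma for t in tokens]
```

A downstream featurizer would then only need to read the lemma attribute, without knowing which component produced it.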

koaning · 2020-09-03T07:46:50Z

@tabergma is the lemma attribute the only attribute that we want to "countvectorize"? The reason I'm asking is related to (yet another) feature request. In spaCy there are many attributes that could be interesting to featurize. There are flags for stopwords, sentiment and out-of-vocabulary terms. These are all interesting, but there are two paths towards an implementation.

1. We can have a tokenizer that adds all of these attributes to the token and then we can have the CountVectorFeaturizer turn them all into sparse vectors for DIET.
2. We can have a separate featurizer that handles this directly without the need to attach anything to the tokens.

I'm currently leaning towards implementing #2 but since there's an option to do something "more general" here I figured I'd at least mention it.
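A rough sketch of the second path, assuming a small stand-in `Token` class with boolean flags (the names `is_stop` and `is_oov` mirror the spaCy token attributes, but nothing here calls spaCy or Rasa). The `featurize_flags` function is hypothetical and only shows the shape of the output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    # Hypothetical stand-in for a token carrying spaCy-style flags.
    text: str
    is_stop: bool = False
    is_oov: bool = False


# Order of the feature columns.
FLAGS = ("is_stop", "is_oov")


def featurize_flags(tokens: List[Token]) -> List[List[int]]:
    """Return one row per token with a 0/1 column for each flag in FLAGS."""
    return [[int(getattr(t, flag)) for flag in FLAGS] for t in tokens]


rows = featurize_flags(
    [Token("the", is_stop=True), Token("zxqv", is_oov=True), Token("dog")]
)
```

The featurizer reads the flags and emits its own features, so the CountVectorFeaturizer would not need to know about them.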

Ghostvv · 2020-09-03T07:56:28Z

I think CVF should only featurize tokens/words. The rest is the job of LexicalSyntacticFeaturizer.

koaning added the type:bug and area:rasa-oss labels Sep 1, 2020

koaning assigned koaning, Ghostvv and tabergma Sep 1, 2020

koaning added the type:enhancement label and removed the type:bug label Sep 1, 2020

Ghostvv removed their assignment Sep 3, 2020

tabergma assigned tabergma and unassigned koaning and tabergma Sep 7, 2020

tabergma mentioned this issue Sep 7, 2020

Add option 'use_lemma' to CountVectorsFeaturizer #6589

Merged

4 tasks

rasabot closed this as completed in #6589 Sep 9, 2020
