[FEA] TfidfVectorizer to accept Pandas Series as input #3403

abhipn · 2021-01-23T23:51:30Z

I am trying to use cuML's TfidfVectorizer and keep getting the error below.

I am using the cuML 0.17 stable release.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-871f536610e2> in <module>
----> 1 train_tfidf_data = tfidf.fit_transform(x_train.comment)

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_tfidf_vectorizer.py in fit_transform(self, raw_documents)
    234             Tf-idf-weighted document-term matrix.
    235         """
--> 236         X = super().fit_transform(raw_documents)
    237         self._tfidf.fit(X)
    238         # X is already a transformed view of raw_documents so

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in fit_transform(self, raw_documents)
    546         self._fixed_vocabulary = self.vocabulary is not None
    547 
--> 548         docs = self._preprocess(raw_documents)
    549         n_doc = len(docs)
    550 

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in _preprocess(self, raw_documents)
    504     def _preprocess(self, raw_documents):
    505         preprocess = self.build_preprocessor()
--> 506         return preprocess(raw_documents)
    507 
    508     def fit(self, raw_documents):

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in <lambda>(doc)
    108                                  remove_non_alphanumeric=remove_non_alpha,
    109                                  delimiter=self.delimiter)
--> 110         return lambda doc: self._remove_stop_words(preprocess(doc))
    111 
    112     def _get_stop_words(self):

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in _preprocess(doc, lower, remove_non_alphanumeric, delimiter, keep_underscore_char, remove_single_token_len)
     60             temp_string = 'cumlSt'
     61             doc = doc.str.replace('_', temp_string, regex=False)
---> 62             doc = doc.str.filter_alphanum(' ', keep=True)
     63             doc = doc.str.replace(temp_string, '_', regex=False)
     64         else:

AttributeError: 'StringMethods' object has no attribute 'filter_alphanum'

dantegd · 2021-01-24T16:10:49Z

Thanks for the issue, @abhipn. Could you provide the script and data that caused this to happen? That would be very helpful for triaging the issue. Thanks!

abhipn · 2021-01-28T12:20:10Z

@dantegd I don't have permission to share the data. It consists of text sentences, and I want to convert them to a TF-IDF matrix.

from cuml.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(x_train.comment)

This is all I used, and it returned that error.

beckernick · 2021-01-28T13:37:15Z

Based on the error message, it looks like you may be passing a pandas Series to fit_transform. This functionality currently only accepts cuDF Series inputs, which is noted in the docstring. Does using cuDF resolve your issue?

@dantegd, perhaps this should be converted into a feature request for input type conversion in the TF-IDF vectorizer, and in the other vectorizers if need be.
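
A minimal sketch of the cuDF workaround suggested above, assuming `x_train.comment` from the earlier snippet is a pandas Series and that cuDF and cuML are installed in a GPU environment. The helpers `to_cudf_series` and `fit_tfidf` are hypothetical names introduced for illustration; `cudf.from_pandas` does the actual conversion.

```python
def to_cudf_series(docs):
    """Convert a pandas Series to a cuDF Series; return anything else unchanged.

    Hypothetical helper, not part of cuML.
    """
    # Detect pandas objects by module name so this helper does not itself
    # need pandas to be importable.
    if type(docs).__module__.split(".")[0] == "pandas":
        import cudf  # lazy import: requires a RAPIDS install with a GPU

        return cudf.from_pandas(docs)
    return docs


def fit_tfidf(docs):
    """Fit cuML's TfidfVectorizer on documents held in pandas or cuDF.

    Hypothetical wrapper: it only moves pandas input to the GPU before
    calling fit_transform.
    """
    from cuml.feature_extraction.text import TfidfVectorizer

    return TfidfVectorizer().fit_transform(to_cudf_series(docs))
```

With these in place, `fit_tfidf(x_train.comment)` would replace the direct `tfidf.fit_transform(x_train.comment)` call. The conversion copies the text to GPU memory once, so the vectorizer's string preprocessing runs on a cuDF Series rather than a pandas one.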

github-actions · 2021-03-07T16:29:54Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Resolves #3403

This PR adds support for using `pandas.Series` as an input to `TfidfVectorizer`, `HashingVectorizer` and `CountVectorizer`.

Authors:
- Shaswat Anand (https://github.com/shaswat-indian)
- Ray Douglass (https://github.com/raydouglass)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4811

abhipn added the "? - Needs Triage" and "bug" labels Jan 23, 2021

cjnolet changed the title from "CUML TfidfVectorizer - AttributeError: 'StringMethods' object has no attribute 'filter_alphanum' ?" to "[FEA] TfidfVectorizer to accept Pandas Series as input" Feb 4, 2021

cjnolet removed the "? - Needs Triage" label Feb 4, 2021

cjnolet added the "feature request" label and removed the "bug" label Feb 5, 2021

github-actions bot added the inactive-30d label Mar 7, 2021

shaswat-indian mentioned this issue Jul 13, 2022

Vectorizers to accept Pandas Series as input #4811

Merged

rapids-bot bot closed this as completed in #4811 Jul 29, 2022
