[FEA] TfidfVectorizer to accept Pandas Series as input #3403

Closed
abhipn opened this issue Jan 23, 2021 · 4 comments · Fixed by #4811
Labels: feature request, inactive-30d

Comments

@abhipn

abhipn commented Jan 23, 2021

I am trying to use TfidfVectorizer and I keep getting the error below.

I am using the cuML 0.17 stable release.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-871f536610e2> in <module>
----> 1 train_tfidf_data = tfidf.fit_transform(x_train.comment)

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_tfidf_vectorizer.py in fit_transform(self, raw_documents)
    234             Tf-idf-weighted document-term matrix.
    235         """
--> 236         X = super().fit_transform(raw_documents)
    237         self._tfidf.fit(X)
    238         # X is already a transformed view of raw_documents so

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in fit_transform(self, raw_documents)
    546         self._fixed_vocabulary = self.vocabulary is not None
    547 
--> 548         docs = self._preprocess(raw_documents)
    549         n_doc = len(docs)
    550 

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in _preprocess(self, raw_documents)
    504     def _preprocess(self, raw_documents):
    505         preprocess = self.build_preprocessor()
--> 506         return preprocess(raw_documents)
    507 
    508     def fit(self, raw_documents):

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in <lambda>(doc)
    108                                  remove_non_alphanumeric=remove_non_alpha,
    109                                  delimiter=self.delimiter)
--> 110         return lambda doc: self._remove_stop_words(preprocess(doc))
    111 
    112     def _get_stop_words(self):

/opt/conda/envs/rapids-0.17/lib/python3.8/site-packages/cuml/feature_extraction/_vectorizers.py in _preprocess(doc, lower, remove_non_alphanumeric, delimiter, keep_underscore_char, remove_single_token_len)
     60             temp_string = 'cumlSt'
     61             doc = doc.str.replace('_', temp_string, regex=False)
---> 62             doc = doc.str.filter_alphanum(' ', keep=True)
     63             doc = doc.str.replace(temp_string, '_', regex=False)
     64         else:

AttributeError: 'StringMethods' object has no attribute 'filter_alphanum'
abhipn added the "? - Needs Triage" and "bug" labels Jan 23, 2021
@dantegd
Member

dantegd commented Jan 24, 2021

Thanks for the issue @abhipn, I was wondering if you could provide the script and data that caused this to happen? This would be very helpful to triage the issue. Thanks!

@abhipn
Author

abhipn commented Jan 28, 2021

@dantegd I don't have permission to share the data. It's text sentences, and I want to convert them to a TF-IDF vector matrix.

from cuml.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(x_train.comment)

This is all I have used, and it returned that error.

@beckernick
Member

Based on the error message, it looks like you may be passing a pandas Series to fit_transform. This functionality currently only accepts cuDF Series inputs, which is noted in the docstring. Does using cuDF resolve your issue?

@dantegd, perhaps this should be updated to be a feature request for input type conversion on the TF-IDF vectorizer, and on the other vectorizers if need be.
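
The workaround suggested above, passing a cuDF Series instead of a pandas Series, might look roughly like the following. This is a minimal sketch, not the conversion cuML later adopted: the helper names `is_pandas_object`, `to_cudf_series` and `fit_tfidf` are hypothetical, and the module-name check is just one way to detect pandas input without importing pandas. `cudf.from_pandas` and `cuml.feature_extraction.text.TfidfVectorizer` are the real calls, and both need a RAPIDS install with a GPU.

```python
def is_pandas_object(obj):
    """Return True if obj's class is defined inside the pandas package.

    The check uses the class's module name, so pandas is never imported here.
    (Hypothetical helper, for illustration only.)
    """
    return type(obj).__module__.split(".")[0] == "pandas"


def to_cudf_series(docs):
    """Convert pandas input to its cuDF equivalent; return anything else unchanged."""
    if is_pandas_object(docs):
        import cudf  # lazy import: requires a RAPIDS install and a GPU
        return cudf.from_pandas(docs)
    return docs


def fit_tfidf(docs):
    """Fit cuML's TfidfVectorizer on docs, converting pandas input first."""
    # Lazy import so the helpers above can be used without cuML installed.
    from cuml.feature_extraction.text import TfidfVectorizer
    return TfidfVectorizer().fit_transform(to_cudf_series(docs))
```

With the data from the report, the call would be `fit_tfidf(x_train.comment)`, assuming `x_train.comment` is a pandas Series of strings.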

cjnolet changed the title from "CUML TfidfVectorizer - AttributeError: 'StringMethods' object has no attribute 'filter_alphanum' ?" to "[FEA] TfidfVectorizer to accept Pandas Series as input" Feb 4, 2021
cjnolet removed the "? - Needs Triage" label Feb 4, 2021
cjnolet added the "feature request" label and removed the "bug" label Feb 5, 2021
@github-actions

github-actions bot commented Mar 7, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Jul 29, 2022
Resolves #3403

This PR adds support for using `pandas.Series` as an input to `TfidfVectorizer`, `HashingVectorizer` and `CountVectorizer`.

Authors:
  - Shaswat Anand (https://github.com/shaswat-indian)
  - Ray Douglass (https://github.com/raydouglass)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4811
jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023