-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix D2VTransformer.fit_transform
. Fix #1834
#1845
Conversation
Thanks @Utkarsh-Mishra-CIC, @chinmayapancholi13 can you comment this? P/S @Utkarsh-Mishra-CIC please merge fresh |
D2VTransformer.fit_transform
. Fix #1834
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gensim/sklearn_api/d2vmodel.py
Outdated
@@ -63,8 +64,9 @@ def fit(self, X, y=None): | |||
Fit the model according to the given training data. | |||
Calls gensim.models.Doc2Vec | |||
""" | |||
d2v_sentences = [doc2vec.TaggedDocument(words[0], [i]) for i, words in enumerate(X)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you check X, this can be already in
TaggedDocument
format (no need to convert it directly) - Why
words[0]
? IfX
is iterable of the list of tokens,words[0]
will be one token only. - Please add tests for
fit_transform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I can check for X like following:
all(isinstance(word, doc2vec.TaggedDocument) for word in X)
but that isn't useful forfit_transform
as thetransform
method ofD2VTransformer
requires only a list of list, so the input format should only be list of documents liketrain_input
not inTaggedDocument
format. - Thanks for pointing that out, i'll add that in next commit.
- Added in the latest commit.
Hi @menshikh-iv! |
Hey @chinmayapancholi13, can you review current approach? |
ping @Utkarsh-Mishra-CIC, can you answer my questions #1845 (comment)? |
Hey @menshikh-iv I went through the discussion on issue #1834 as well as the comments on this PR. When I had worked on sklearn API, I remember that at the time we had decided to not implement |
gensim/sklearn_api/d2vmodel.py
Outdated
@@ -63,8 +64,12 @@ def fit(self, X, y=None): | |||
Fit the model according to the given training data. | |||
Calls gensim.models.Doc2Vec | |||
""" | |||
if all(isinstance(word, doc2vec.TaggedDocument) for word in X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check only first element (that's enough).
gensim/test/test_sklearn_api.py
Outdated
@@ -831,6 +831,22 @@ def testTransform(self): | |||
self.assertEqual(matrix.shape[0], 1) | |||
self.assertEqual(matrix.shape[1], self.model.size) | |||
|
|||
def testFitTransform(self): | |||
numpy.random.seed(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is global seeding (affect on interpreter state, not only for this test), please don't use it.
gensim/test/test_sklearn_api.py
Outdated
numpy.random.seed(0) | ||
model = D2VTransformer(min_count=1) | ||
|
||
#fit and transform multiple documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: should be #
(space between #
and text), here and later.
Made the requested changes. |
Thanks @Utkarsh-Mishra-CIC, congratz with the first contribution: 1st_place_medal:! |
* Fix D2VTransformer.fit_transform\(piskvorky#1834\) * Add check for TaggedDocument * Add test for D2VTransformer fit_transform * Add test and check d2vtransformer
No description provided.