Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix D2VTransformer.fit_transform. Fix #1834 #1845

Merged
merged 5 commits into from
Feb 16, 2018

Conversation

karshd3v
Copy link
Contributor

No description provided.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jan 19, 2018

Thanks @Utkarsh-Mishra-CIC,
this isn't very good fix, but possible for this situation. I see same conversion in tests (before training model).

@chinmayapancholi13 can you comment this?

P/S @Utkarsh-Mishra-CIC please merge fresh develop to current PR

@menshikh-iv menshikh-iv changed the title Fix D2VTransformer.fit_transform(#1834) Fix D2VTransformer.fit_transform. Fix #1834 Feb 1, 2018
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -63,8 +64,9 @@ def fit(self, X, y=None):
Fit the model according to the given training data.
Calls gensim.models.Doc2Vec
"""
d2v_sentences = [doc2vec.TaggedDocument(words[0], [i]) for i, words in enumerate(X)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you check X, this can be already in TaggedDocument format (no need to convert it directly)
  2. Why words[0]? If X is iterable of the list of tokens, words[0] will be one token only.
  3. Please add tests for fit_transform

Copy link
Contributor Author

@karshd3v karshd3v Feb 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I can check for X like following: all(isinstance(word, doc2vec.TaggedDocument) for word in X) but that isn't useful for fit_transform as the transform method of D2VTransformer requires only a list of list, so the input format should only be list of documents like train_input not in TaggedDocument format.
  2. Thanks for pointing that out, i'll add that in next commit.
  3. Added in the latest commit.

@chinmayapancholi13
Copy link
Contributor

Hi @menshikh-iv!
Apologies for not being able to devote time for this till now. I'll try to resolve this and get back to you by this weekend. I hope that's ok. :)

@menshikh-iv
Copy link
Contributor

Hey @chinmayapancholi13, can you review current approach?

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Feb 8, 2018

ping @Utkarsh-Mishra-CIC, can you answer my questions #1845 (comment)?

@chinmayapancholi13
Copy link
Contributor

Hey @menshikh-iv

I went through the discussion on issue #1834 as well as the comments on this PR. When I had worked on sklearn API, I remember that at the time we had decided to not implement fit_transform(). But if we are planning to have it now, the approach used in this PR (of modifying the fit() method to have a consistent format) looks good to me.

@@ -63,8 +64,12 @@ def fit(self, X, y=None):
Fit the model according to the given training data.
Calls gensim.models.Doc2Vec
"""
if all(isinstance(word, doc2vec.TaggedDocument) for word in X):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check only first element (that's enough).

@@ -831,6 +831,22 @@ def testTransform(self):
self.assertEqual(matrix.shape[0], 1)
self.assertEqual(matrix.shape[1], self.model.size)

def testFitTransform(self):
numpy.random.seed(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is global seeding (affect on interpreter state, not only for this test), please don't use it.

numpy.random.seed(0)
model = D2VTransformer(min_count=1)

#fit and transform multiple documents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: should be # (space between # and text), here and later.

@karshd3v
Copy link
Contributor Author

Made the requested changes.

@menshikh-iv
Copy link
Contributor

Thanks @Utkarsh-Mishra-CIC, congratz with the first contribution: 1st_place_medal:!

@menshikh-iv menshikh-iv merged commit 8759282 into piskvorky:develop Feb 16, 2018
@karshd3v karshd3v deleted the fix-1834 branch February 16, 2018 06:51
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
* Fix D2VTransformer.fit_transform\(piskvorky#1834\)

* Add check for TaggedDocument

* Add test for D2VTransformer fit_transform

* Add test and check d2vtransformer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants