Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] sklearn API for Gensim models #1462

Merged
merged 36 commits into from
Aug 18, 2017

Conversation

chinmayapancholi13
Copy link
Contributor

@chinmayapancholi13 chinmayapancholi13 commented Jul 5, 2017

This PR creates scikit-learn API for the following Gensim models:

  • Doc2Vec Model
  • Text to Bag-of-words Model
  • Term Frequency-Inverse Document Frequency Model
  • Hierarchical Dirichlet Process Model
  • Phrases

The implementation for the following models is still left:

  • Normalization Model
  • LogEntropy Model
  • Dynamic Topic Model
  • Topic Coherence Model

@tmylk
Copy link
Contributor

tmylk commented Jul 12, 2017

Are you sure there is no memory duplication for doc2vec numpy arrays? can you please run it through a memory profiler.

if self.gensim_model is None:
raise NotFittedError("This model has not been fitted yet. Call 'fit' with appropriate arguments before using this method.")

# The input as array of array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please call them python lists

@tmylk
Copy link
Contributor

tmylk commented Jul 27, 2017

Let's aim to merge it this week. The missing things are ipynb and transform tests.
Please add ipynb examples in a separate notebook for now. We will copy-paste the notebooks together when these models are ready to merge.

self.assertEqual(matrix.shape[0], 1)
self.assertEqual(matrix.shape[1], self.model.size)

def testSetGetParams(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add checking with the original model too for each "getset" test (same as previous PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Aug 18, 2017

Great @chinmayapancholi13💯
You gave a new interface for gensim:+1:

@menshikh-iv menshikh-iv merged commit e27605f into piskvorky:develop Aug 18, 2017
@menshikh-iv menshikh-iv mentioned this pull request Aug 18, 2017
fabriciorsf pushed a commit to LINE-PESC/gensim that referenced this pull request Aug 23, 2017
* created sklearn wrapper for Doc2Vec

* PEP8 fix

* added 'transform' function and refactored code

* updated d2v skl api code

* added unittests for sklearn api for d2v model

* fixed flake8 errors

* added skl api class for Text2Bow model

* updated docstring for d2vmodel api

* updated text2bow skl api code

* added unittests for text2bow skl api class

* updated 'testPipeline' and 'testTransform' for text2bow

* added 'tokenizer' param to text2bow skl api

* updated unittests for text2bow

* removed get_params and set_params functions from existing classes

* added tfidf api class

* added unittests for tfidf api class

* flake8 fixes

* added skl api for hdpmodel

* added unittests for hdp model api class

* flake8 fixes

* updated hdp api class

* added 'testPartialFit' and 'testPipeline' tests for hdp api class

* flake8 fixes

* added skl API class for phrases

* added unit tests for phrases API class

* flake8 fixes

* added 'testPartialFit' function for 'TestPhrasesTransformer'

* updated 'testPipeline' function for 'TestText2BowTransformer'

* updated code for transform function for HDP transformer

* updated tests as discussed in PR 1473

* added examples for new models in ipynb

* unpinned sklearn version for running unit-tests

* updated 'Pipeline' initialization format

* updated 'Pipeline' initialization format in ipynb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants