Support pretrained word2vec model when train doc2vec #2703

maohbao · 2019-12-17T10:00:15Z

The default doc2vec model in gensim doesn't support initializing from a pretrained word2vec model. However, according to Jey Han Lau and Timothy Baldwin's paper, An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation (2016), using a pretrained word2vec model usually gives better results in NLP tasks. The authors also released a forked gensim version that supports pretrained embeddings, but it is based on a very old gensim version and can't be used with gensim 3.8.

I have edited two files to support pretrained word embeddings for doc2vec.

mpenkov

Thank you for your contribution. Left you some comments.

Please also add unit tests for your new functionality.

Let me know when you're ready for another review.

mpenkov · 2019-12-21T02:40:01Z

README.md

@@ -1,177 +1,61 @@
-gensim – Topic Modelling in Python
+doc2vec in gensim – support pretrained word2vec


I think instead of replacing the top-level README.md file, you should put this documentation somewhere else. Ideally, it should be in a tutorial or a howto.

See https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.html#sphx-glr-auto-examples-howtos-run-doc-py for more info.

maohbao · 2019-12-24

This is my first PR on GitHub. I have already followed your advice on word2vec.py and doc2vec.py. I also understand that I should not change the top-level README.md file, but I really don't know how or where to write the documentation. Thank you in advance for any further advice!

mpenkov · 2019-12-21T02:41:07Z

README.md

-  [OpenBLAS]: http://xianyi.github.io/OpenBLAS/
-  [source tar.gz]: http://pypi.python.org/pypi/gensim
-  [documentation]: http://radimrehurek.com/gensim/install.html
+This is a forked gensim version, which edits the default doc2vec model to support pretrained word2vec during training doc2vec. It forked from gensim 3.8.


Write your documentation so that it's useful from the point of view of the reader.

"This is a forked gensim version" is not relevant to the user. Furthermore, it becomes misleading the moment we actually merge this PR.

mpenkov · 2019-12-21T02:41:38Z

README.md

-  [documentation]: http://radimrehurek.com/gensim/install.html
+This is a forked gensim version, which edits the default doc2vec model to support pretrained word2vec during training doc2vec. It forked from gensim 3.8.
+
+The default doc2vec model in gensim does't support pretrained word2vec model. But according to Jey Han Lau and Timothy Baldwin's paper, [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation(2016)](https://arxiv.org/abs/1607.05368), using pretrained word2vec model usually gets better results in NLP tasks. The author also released a [forked gensim verstion](https://github.com/jhlau/gensim) to perform pretrained embeddings, but it is from a very old gensim version, which can't be used in gensim 3.8(the latest gensim version when I release this fork).


This is also irrelevant. This kind of information is good inside the PR, as motivation and background (it may already be there).

mpenkov · 2019-12-21T02:42:03Z

README.md

+> pretrained_emb = "word2vec_pretrained.txt"  # This is a pretrained word2vec model of C text format
+> 
+> model = gensim.models.doc2vec.Doc2Vec(  
+                                       corpus_train,  # This is the documents corpus to be trained which should meet gensim's format  


Hanging indent please.
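For reference, a minimal sketch of what a hanging indent looks like for a call of this shape. `train_model` is a hypothetical stand-in, not a gensim function, and its parameters are assumptions chosen only to mirror the snippet under review; the point is the layout, with nothing after the opening parenthesis and each argument on its own line at one extra indentation level.

```python
def train_model(corpus, pretrained_path, vector_size=100, epochs=10):
    """Hypothetical stand-in for a model constructor; it only echoes its arguments."""
    return {
        "corpus": corpus,
        "pretrained_path": pretrained_path,
        "vector_size": vector_size,
        "epochs": epochs,
    }


# Hanging indent: the opening parenthesis ends the first line, every argument
# gets its own line indented one level, and the closing parenthesis lines up
# with the start of the statement.
model = train_model(
    ["first document", "second document"],  # training corpus
    "word2vec_pretrained.txt",  # pretrained word2vec model in C text format
    vector_size=100,
    epochs=10,
)
```

This layout keeps the arguments readable regardless of how long the callee's name is, and the inline comments stay aligned with the arguments they describe.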

gensim/models/doc2vec.py

gensim/models/word2vec.py

mpenkov · 2019-12-21T02:52:45Z

@gojomo Just saw your comments here: #1338

What do you think about this PR?

gojomo · 2019-12-24T22:52:16Z

I don't think an added __init__() option is the best way to achieve this.

I also think people who've read the Lau & Baldwin paper will have unrealistic expectations about the benefits of this approach, because the paper is a bit confused & contradictory in its analysis. (My comments on a prior issue outline some of the reasons.)

Unfortunately, the deeply misguided #1777 refactoring made experiments in this direction more difficult, by breaking two previous ways users could hook-into and modify a model's vocabulary-initialization. (The intersect_word2vec_format() stopped working in some models, and the prior decomposition of build_vocab() into 3 substeps – scan, prepare, finalize – was muddled.)

I also hope that a side-effect of re-factoring (& #1777-rollback) work I'm exploring in #2698 will eventually be to serve these kinds of needs in a more flexible & logical way. So, I'd not want this integrated until that work is further along. (Some of the classes this PR modifies, like the '_Trainables', should be going away.)
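For readers unfamiliar with the idea behind intersect_word2vec_format(), here is a minimal pure-Python sketch of the general technique: after a model's vocabulary has been built, vectors for words that also appear in a pretrained file replace the model's own initial values, and no new words are added. This is an illustration under stated assumptions, not gensim's implementation. The helper names (`load_word2vec_text`, `init_vectors`, `intersect`) are hypothetical, the random initialization is a placeholder, and the input is assumed to be in the word2vec C text format (a `count dim` header line followed by one `word v1 ... vdim` line per word).

```python
import io
import random


def load_word2vec_text(stream):
    """Parse word2vec C text format: a 'count dim' header, then 'word v1 ... vdim' lines."""
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(count):
        parts = stream.readline().split()
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors, dim


def init_vectors(vocab, dim, seed=0):
    """Placeholder initialization: small random values for every vocabulary word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for word in vocab}


def intersect(model_vectors, pretrained):
    """Overwrite vectors of words present in both mappings; never add new words.

    Returns the number of vectors that were replaced.
    """
    replaced = 0
    for word in model_vectors:
        if word in pretrained:
            model_vectors[word] = list(pretrained[word])
            replaced += 1
    return replaced


# A tiny in-memory "file" standing in for a pretrained embedding file.
pretrained_file = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
pretrained, dim = load_word2vec_text(pretrained_file)

# The model's own vocabulary: 'cat' overlaps with the file, 'fish' does not.
model_vectors = init_vectors(["cat", "fish"], dim)
n_replaced = intersect(model_vectors, pretrained)
```

In this sketch only `cat` takes its pretrained vector; `fish` keeps its initial values and `dog` is ignored because it is not in the model's vocabulary. Training would then proceed as usual from this starting point.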

mpenkov · 2019-12-29T13:39:16Z

OK, so it sounds like the best action at the moment is to close this PR and wait for a more appropriate time to work on it. @gojomo Do you agree?

maohbao · 2020-01-08T08:23:27Z

OK, you can close this PR.

maohbao added 25 commits December 16, 2019 17:35

0e4c786 Update word2vec.py
cdd440e Update doc2vec.py
4ef0fa8 Update doc2vec.py
7e2e6ca Update word2vec.py
6b5d882 Update word2vec.py
57def4c Update README.md
49bf922 Update README.md
ec5b268 Update README.md
4f6c514 Update README.md
d982a81 Update README.md
791764b Update README.md
0a99acf Update README.md
58870b6 Update README.md
b9acd0c Update README.md
597d893 Update README.md
0e1a937 Update README.md
c217879 Update README.md
603b3f0 Update README.md
b44ea69 Update README.md
515e04d Update README.md
b20605b Update README.md
69fbdf2 Update README.md
a22d02f Update README.md
20dc004 Update README.md
924ff7a Update README.md

mpenkov requested changes Dec 21, 2019


maohbao added 2 commits December 24, 2019 14:52

08f0c34 Update doc2vec.py
b5420f7 Update word2vec.py

maohbao closed this Jan 8, 2020
