Doc2vec to wikipedia #654

isomap · 2016-04-01T06:21:21Z

Related to Issue #629.
I ran an experiment similar to the one in Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998) and wrote documentation for it.
However, I ran into some problems. Could you help me with the following?

Problems

I don't have enough computational resources, so the Doc2Vec size is 500 for now. It should be larger.
I use only DM, which differs from the paper.
I have not run the triplet evaluation yet.

Todo

[x] Increase size of Doc2Vec.
[x] Use DBOW like the paper.
~~[ ] Evaluate Doc2Vec using triplet datasets.~~

Questions

Should I remove some kinds of articles such as "List of XXXX"?
I think I got reasonably good results, but they are completely different from the paper's results. How should I interpret these results?
What information should I add to the documentation?

Please feel free to comment if you have any ideas beyond the above.
Thanks.

gojomo · 2016-04-01T08:23:46Z

Thanks, I've hoped for a notebook like this for the project for a while!

I doubt the inclusion or exclusion of the "List of…" etc articles will make a big difference either way, and as far as I can tell, the 'Document Embeddings with Paragraph Vectors' paper didn't mention such article filtering. So I'd keep things simple, and maybe test different article subsets later.

Parameter thoughts:

As you've noted, to truly match the paper's most interesting experiments with mixed doc- and word-vectors, you'd want to use DBOW with concurrent word-training (dm=0, dbow_words=1).
While the paper shows their evaluations peaking at 10000 dimensions, that's rather large, and they still get reasonable results at their 100-1000d trials. (So: you may get plenty-good results at your existing dimensionality, or not far from it.)
The paper doesn't mention the window size they used. Your value, 8, seems common in other Word2Vec/Doc2Vec trials, along with 5 or 10, but for some purposes smaller values work as well or better. Also, in DBOW+words mode, larger windows very definitely increase training time, and mean more effort is spent training word-vecs relative to doc-vecs. So tinkering with this value may prove important to both runtime and vector-quality in eventual evaluations, and I wouldn't overlook values as small as 2 or 3 as potential best-performers.
The paper doesn't mention the min_count they used, only that their vocabulary was 916K words. I think that'd need a much larger min_count than 5, and working with a smaller vocab will help memory, runtime, and perhaps even vector-quality for the remaining words. You can iteratively test different min_count values by not supplying your corpus in the constructor, and instead explicitly calling scan_vocab(), scale_vocab() multiple times with different candidate min_count values and dry_run=True, then finally when you like the dry-run reported sizings, scale_vocab() with your intended min_count (and not dry_run), finalize_vocab(), and train().
For iterations, the paper only mentions they used "at least 10 epochs". By not specifying iter, you're currently using the default inherited from Word2Vec of 5 iterations.
The paper mentions using hierarchical softmax; by not specifying hs=1, negative=0, you're currently using the default inherited from Word2Vec of no-hierarchical-softmax, and negative-sampling with 5 negative examples. (I'm not sure which of HS or neg-sampling might be better; lots of larger-corpus projects seem to prefer negative-sampling.)
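As a rough sketch, the suggestions above could be collected into one set of keyword arguments. The argument names follow the gensim release discussed in this thread (`size`, `iter`); gensim 4.x renamed those two to `vector_size` and `epochs`. The values for `size` and `window` are simply the ones currently used in this PR, not settings confirmed by the paper, and `paper_like_params` and `to_gensim4_names` are hypothetical names introduced only for this illustration. `min_count` is left out on purpose, since it is meant to be chosen through dry runs.

```python
# Keyword arguments for a paper-like DBOW run, using the parameter names
# from this thread. This is an illustrative sketch, not a tested recipe.
paper_like_params = dict(
    dm=0,           # DBOW rather than DM
    dbow_words=1,   # train word-vectors alongside doc-vectors
    size=500,       # dimensionality currently used in this PR
    window=8,       # current value; smaller windows (2 or 3) are worth testing
    iter=10,        # the paper mentions "at least 10 epochs"
    hs=1,           # hierarchical softmax, as mentioned in the paper
    negative=0,     # switch off negative sampling when hs=1
)


def to_gensim4_names(params):
    """Rename the two arguments that gensim 4.x calls vector_size and epochs."""
    renames = {"size": "vector_size", "iter": "epochs"}
    return {renames.get(key, key): value for key, value in params.items()}


# With gensim installed, the model would then be built along the lines of:
#   from gensim.models.doc2vec import Doc2Vec
#   model = Doc2Vec(**paper_like_params)                     # 2016-era names
#   model = Doc2Vec(**to_gensim4_names(paper_like_params))   # gensim 4.x names
```

Keeping the settings in one dict makes it easier to train DM and DBOW variants that differ in only one or two values.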

Regarding results:

Your first similarity results do indicate meaningful doc-vector learning has happened, so things seem on the right track. Going to at least 10 iterations might help.
The paper isn't clear as to whether they've unit-normed all vectors before their vector-arithmetic operations; they probably have. (The usual analogy-solving as in Word2Vec.most_similar(positive=['king','woman'], negative=['man']) does in fact operate on the normed vectors for each of the words.) You're currently using the raw vectors; you could access the normed vectors via syn0norm or use d2v.init_sims(True) to discard the raw vectors, and then only get back normed vectors from future bracket-indexing.
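To illustrate why unit-norming can matter for vector arithmetic, here is a minimal pure-Python sketch. It does not use gensim; the helper names (`unit_norm`, `cosine`, `combine`) and the toy 2-d vectors are made up for this example.

```python
import math


def unit_norm(vec):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = unit_norm(a), unit_norm(b)
    return sum(x * y for x, y in zip(a, b))


def combine(positive, negative, normed=True):
    """Add the positive vectors and subtract the negative ones.

    With normed=True each input is unit-normed first, so no single
    vector dominates the result just because its raw magnitude is large.
    """
    prep = unit_norm if normed else list
    total = [0.0] * len(positive[0])
    for sign, vecs in ((1.0, positive), (-1.0, negative)):
        for vec in vecs:
            for i, x in enumerate(prep(vec)):
                total[i] += sign * x
    return total


# Toy vectors (made up): "big" has a much larger raw magnitude than "small".
big = [10.0, 0.0]
small = [0.0, 1.0]

raw_mix = combine([big, small], [], normed=False)    # [10.0, 1.0]
normed_mix = combine([big, small], [], normed=True)  # [1.0, 1.0]

print(cosine(raw_mix, small))     # small contributes very little
print(cosine(normed_mix, small))  # both inputs contribute equally
```

With raw vectors the high-magnitude input swamps the other one, while with normed vectors both contribute equally to the direction of the result.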

Hope this helps!

piskvorky · 2016-04-01T12:21:39Z

Regarding computation resources, what exactly do you need @isohyt ?

We could provide access to our dev servers, if that helps (and if @tmylk greenlights the need).

isomap · 2016-04-02T07:00:54Z

@gojomo, thank you for your helpful and insightful comments. I will try everything you proposed.
That said, I should keep the DM results for comparison with the DBOW results, shouldn't I?

@piskvorky, I am very happy about your offer. I would like to run an ipython notebook on your dev servers. However, I am a little worried, because it would be my first time running a program on a remote server...

piskvorky · 2016-04-02T11:56:45Z

We can do that, using SSH port tunnelling (@tmylk would help you set this up).
But please first describe what kind of server resources you need.

gojomo · 2016-04-02T18:19:28Z

@isohyt – the original paper didn't report DM results for comparison, so I wouldn't say that's strictly necessary for showing how to reproduce the paper's experiment. But it would be interesting, along with how other parameter variations (window size, HS-vs-negative, frequent-word-downsampling, etc) might affect the results!

isomap · 2016-06-13T08:21:49Z

Sorry for the late update to this tutorial.
I'm now training several models in the ipynb.

tmylk · 2016-06-16T09:06:37Z

FYI a beginner doc2vec tutorial is in https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

isomap · 2016-07-08T08:36:46Z

I've finished training two d2v models, DBOW and DM, on Wikipedia.
The results are fascinating!

gojomo · 2016-07-11T15:32:51Z

This is great stuff!

I notice in your notebook, your DM model keeps the default number of iterations (5), while the DBOW uses a full 10 like the paper.

Also, the max_vocab_size is a crude mechanism that personally I'd recommend against using unless absolutely necessary. Its mechanism – discarding a lot of words mid-scan, whenever the count hits the max – actually results in a smaller (perhaps much smaller) final vocabulary, and even of those words remaining, they won't exactly be the most-frequent words (though they should be close).
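The effect described here can be seen in a toy simulation. The sketch below is a simplified model of a capped vocabulary scan and is not gensim's actual code: the pruning rule (drop every word whose running count is at or below a threshold that rises after each prune) is an assumption made for illustration, and `scan_with_cap` is a hypothetical name.

```python
from collections import Counter


def scan_with_cap(tokens, max_vocab_size):
    """Count tokens, pruning low-count words whenever the cap is exceeded."""
    vocab = Counter()
    min_reduce = 1  # assumed pruning threshold, raised after every prune
    for token in tokens:
        vocab[token] += 1
        if len(vocab) > max_vocab_size:
            for word in [w for w, c in vocab.items() if c <= min_reduce]:
                del vocab[word]
            min_reduce += 1
    return vocab


tokens = "b a a c d b b".split()

exact = Counter(tokens)                           # b: 3, a: 2, c: 1, d: 1
capped = scan_with_cap(tokens, max_vocab_size=3)  # a: 2, b: 2

print(len(capped), dict(capped))
```

In this toy run the capped scan ends with only two words even though the cap is three, and the count for `b` is lower than its true frequency because its first occurrence was discarded mid-scan.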

If possible, I'd suggest instead iteratively discovering the min_count that results in a desired just-about-1M-words vocabulary size. The way to do this without repeating the expensive word-discovery scan is to break the build_vocab() call up into its constituent scan_vocab(), scale_vocab(), finalize_vocab() calls. At the scale_vocab() step, scale_vocab() can be called with test values, and a dry_run=True parameter, to just print the resulting sizes without making any permanent changes. When you find the right tradeoffs, it can be called one last time with dry_run=False, then finalize_vocab(), to finish the usual build_vocab() process.
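The idea of scanning once and then trying several thresholds can be sketched without gensim. The code below is a pure-Python stand-in, not the library's implementation: `retained_size` and `smallest_min_count` are hypothetical helpers, and the tiny corpus is made up.

```python
from collections import Counter


def retained_size(raw_counts, min_count):
    """Dry run: how many distinct words have a count of at least min_count."""
    return sum(1 for count in raw_counts.values() if count >= min_count)


def smallest_min_count(raw_counts, target_size, max_candidate=1000):
    """Return the smallest min_count whose retained vocabulary fits target_size."""
    for candidate in range(1, max_candidate + 1):
        if retained_size(raw_counts, candidate) <= target_size:
            return candidate
    return None  # no candidate up to max_candidate was small enough


# One (expensive) pass over the corpus builds the raw counts once ...
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
raw_counts = Counter(word for doc in corpus for word in doc)

# ... and every candidate threshold is then checked against those counts.
best = smallest_min_count(raw_counts, target_size=3)
print(best, retained_size(raw_counts, best))
```

Because the counts are never modified during the search, any number of candidate thresholds can be tried before committing to one.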

isomap · 2016-07-12T10:54:35Z

Thanks, @gojomo
I forgot to set the number of iterations for the DM model, so I will re-train these models.

At the same time, I will look for the optimal min_count.
However, I don't know how to extract vocab_size directly.
I found an optimal min_count=19, which gives vocab_size=898,725. That is close to the original paper's 915,715. However, the way I extracted this information is not very clean.
This is the vocab_size extraction code (using the hs=1 model):
model.scale_vocab(min_count=num, dry_run=True)['memory']['vocab']/700

If you want to keep the preprocessing code in the ipynb, I would like to write clearer code.
Do you know how to get the vocab_size directly from the scale_vocab method, @gojomo ?

gojomo · 2016-07-12T14:35:26Z

In the past, I've watched the log output and adjusted interactively, though I see why that's not ideal for robust code or demo notebooks.

It looks like scale_vocab() doesn't include the key value, retain_words, in its returned report_values. It could, but the number it does return – drop_unique, a count of the words dropped with the trial min_count – can be subtracted from len(model.raw_vocab) to know the end vocab size, if that min_count were to be destructively applied.
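The subtraction can be written out as a small sketch. The `dry_run_report` function below is a hypothetical stand-in that only mimics the `drop_unique` entry of a dry-run report; it is not gensim's `scale_vocab()`, and the word counts are made up.

```python
def dry_run_report(raw_vocab, min_count):
    """Stand-in for a dry-run report: count the distinct words that would be dropped."""
    dropped = sum(1 for count in raw_vocab.values() if count < min_count)
    return {"drop_unique": dropped}


def retained_vocab_size(raw_vocab, report):
    """End vocabulary size = all scanned words minus those the trial would drop."""
    return len(raw_vocab) - report["drop_unique"]


# Made-up raw counts standing in for model.raw_vocab (word -> frequency).
raw_vocab = {"the": 50, "paragraph": 7, "vector": 7, "isomap": 2, "typo": 1}

report = dry_run_report(raw_vocab, min_count=5)
size_after = retained_vocab_size(raw_vocab, report)
print(report["drop_unique"], size_after)
```

With a real model, the same arithmetic would use the report returned by the dry run and `len(model.raw_vocab)`, avoiding the division-by-700 workaround on the memory estimate.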

rtanzifi · 2016-07-13T05:11:51Z

Thank you all, especially @isohyt, for this thread.
I have a problem running the implementation of the http://arxiv.org/abs/1507.07998 work. The error raised looks like this:

model.build_vocab(documents)
AttributeError: 'tuple' object has no attribute 'build_vocab'

You probably already know this, but the implementation can be found here: https://github.com/isohyt/gensim/blob/cb22f47f371457061b98f9390042f12b108587cf/docs/notebooks/doc2vec-wikipedia.ipynb

I would really appreciate any help ;)

gojomo · 2016-07-13T05:42:03Z

@rtanzifi The string "model.build_vocab(documents)" does not appear in the notebook. If you're getting such an error in your modified code, you've somehow made model into a tuple rather than a Doc2Vec instance. While this work-in-progress may be a good example to learn from, if you have questions specific to your own customizations, the list may be a better place to discuss, and you'll have to provide enough context to understand what you've changed.
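One common way a variable ends up as a tuple is a stray trailing comma after the constructor call. Whether that is what happened in the modified code here is only a guess; the sketch below uses a hypothetical `FakeDoc2Vec` class so that it runs without gensim.

```python
class FakeDoc2Vec:
    """Hypothetical stand-in for a model class, so the example needs no gensim."""

    def build_vocab(self, documents):
        self.vocab = {word for doc in documents for word in doc}


documents = [["paragraph", "vector"], ["wikipedia", "dump"]]

# The trailing comma makes this a one-element tuple, not a model instance.
model = FakeDoc2Vec(),

try:
    model.build_vocab(documents)
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(type(model).__name__, error_message)

# Without the comma the variable holds the model itself and the call works.
fixed_model = FakeDoc2Vec()
fixed_model.build_vocab(documents)
```

Printing `type(model)` just before the failing call is a quick way to confirm what the variable actually holds.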

isomap · 2016-09-04T05:03:15Z

Sorry, I forgot to update this tutorial.
I have updated the vocab size checking code in the ipynb.

tmylk · 2016-09-06T08:24:24Z

Please make it one file and add a note to the changelog.md, and I will merge.

isomap · 2016-09-10T07:20:13Z

@tmylk It's ready to be merged :)

tmylk · 2016-10-03T07:12:12Z

CHANGELOG.md

@@ -1,9 +1,23 @@
 Changes
 =======

+* Add doc2vec tutorial using wikipedia dump. (@isohyt, #654)


Please fix merge errors. It should be just 1 line added in Changelog.

tmylk · 2016-10-03T08:26:34Z

Please merge in develop to resolve the merge conflicts

tmylk and others added 8 commits November 5, 2015 19:07

Merge branch 'release-0.12.3rc1'

1c63c9a

Merge branch 'release-0.12.3'

280a488

Merge branch 'release-0.12.3'

ddeb002

Update CHANGELOG.txt

f2ac3a9

Update CHANGELOG.txt

cf09e8c

resolve merge conflict in Changelog

b61287a

Merge branch 'release-0.12.4' with piskvorky#596

3ade404

[WIP] doc2vec tutorial to wikipedia

ad7765f

isomap changed the title ~~Doc2vec to wikipedia~~ [WIP]Doc2vec to wikipedia Apr 1, 2016

isomap changed the title ~~[WIP]Doc2vec to wikipedia~~ [WIP] Doc2vec to wikipedia Apr 1, 2016

tmylk and others added 5 commits June 9, 2016 22:30

Merge branch 'release-0.13.0'

9e6522e

Merge branch 'release-0.13.0'

87c4e9c

Release version typo fix

9c74b40

Merge branch 'release-0.13.0rc1'

7b30025

add different method

428c326

tmylk and others added 3 commits June 21, 2016 23:25

Merge branch 'release-0.13.0'

de79c8e

Merge branch 'release-0.13.1'

d4f9cc5

add calc result of two models and doc

cb22f47

tmylk and others added 3 commits August 26, 2016 17:22

Merge branch 'release-0.13.2'

d8e9c0f

Merge branch 'release-0.13.2'

7c118fc

check vocab size

fedcc58

isomap added 2 commits September 9, 2016 01:51

delete old version file

e4c8622

Merge branch 'master' of https://github.com/piskvorky/gensim into doc…

9ec9175

…2vec-wikipedia

isomap changed the title ~~[WIP] Doc2vec to wikipedia~~ Doc2vec to wikipedia Sep 9, 2016

tmylk reviewed Oct 3, 2016

isomap force-pushed the doc2vec-wikipedia branch from f961f35 to 9ec9175 Compare October 3, 2016 09:51

tmylk merged commit 4f00b77 into piskvorky:develop Oct 3, 2016

isomap deleted the doc2vec-wikipedia branch October 10, 2016 09:22

isomap restored the doc2vec-wikipedia branch October 10, 2016 09:22

ERijck mentioned this pull request Jan 17, 2023

Add white line #3433

Closed
