
Doc2vec to wikipedia #654

Merged
merged 21 commits into from
Oct 3, 2016

Conversation

isomap
Contributor

@isomap isomap commented Apr 1, 2016

Related to Issue #629.
I conducted an experiment similar to Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998) and wrote documentation.
However, I ran into some problems. Could you help me with the following?

Problems

  • I don't have enough computational resources, so the Doc2Vec size is 500 for now; it should be larger.
  • I used only DM, unlike the paper.
  • I haven't conducted the triplet evaluation yet.

Todo

  • [x] Increase the size of Doc2Vec.
  • [x] Use DBOW as in the paper.
  • [ ] Evaluate Doc2Vec using triplet datasets.

Questions

  • Should I remove some kinds of articles, such as "List of XXXX"?
  • I think I got reasonably good results, but they are quite different from the paper's results. How should I interpret them?
  • What information should I add to the documentation?

Please feel free to comment if you have any ideas beyond the above.
Thanks.

@isomap isomap changed the title Doc2vec to wikipedia [WIP]Doc2vec to wikipedia Apr 1, 2016
@isomap isomap changed the title [WIP]Doc2vec to wikipedia [WIP] Doc2vec to wikipedia Apr 1, 2016
@gojomo
Collaborator

gojomo commented Apr 1, 2016

Thanks, I've hoped for a notebook like this for the project for a while!

I doubt the inclusion or exclusion of the "List of…" etc articles will make a big difference either way, and as far as I can tell, the 'Document Embeddings with Paragraph Vectors' paper didn't mention such article filtering. So I'd keep things simple, and maybe test different article subsets later.

Parameter thoughts:

  • As you've noted, to truly match the paper's most interesting experiments with mixed doc- and word-vectors, you'd want to use DBOW with concurrent word-training (dm=0, dbow_words=1).
  • While the paper shows their evaluations peaking at 10000 dimensions, that's rather large, and they still get reasonable results at their 100-1000d trials. (So: you may get plenty-good results at your existing dimensionality, or not far from it.)
  • The paper doesn't mention the window size they used. Your value, 8, seems common in other Word2Vec/Doc2Vec trials, along with 5 or 10, but for some purposes smaller values work as well or better. Also, in DBOW+words mode, larger windows very definitely increase training time, and mean more effort is spent training word-vecs relative to doc-vecs. So tinkering with this value may prove important to both runtime and vector-quality in eventual evaluations, and I wouldn't overlook values as small as 2 or 3 as potential best-performers.
  • The paper doesn't mention the min_count they used, only that their vocabulary was 916K words. I think that'd need a much larger min_count than 5, and working with a smaller vocab will help memory, runtime, and perhaps even vector-quality for the remaining words. You can iteratively test different min_count values by not supplying your corpus in the constructor, and instead explicitly calling scan_vocab(), scale_vocab() multiple times with different candidate min_count values and dry_run=True, then finally when you like the dry-run reported sizings, scale_vocab() with your intended min_count (and not dry_run), finalize_vocab(), and train().
  • For iterations, the paper only mentions they used "at least 10 epochs". By not specifying iter, you're currently using the default inherited from Word2Vec of 5 iterations.
  • The paper mentions using hierarchical softmax; by not specifying hs=1, negative=0, you're currently using the default inherited from Word2Vec of no-hierarchical-softmax, and negative-sampling with 5 negative examples. (I'm not sure which of HS or neg-sampling might be better; lots of larger-corpus projects seem to prefer negative-sampling.)
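
Putting the parameter thoughts above together, a construction call might look like the sketch below (gensim's circa-2016 `Doc2Vec` API; the `size`, `window`, and `min_count` values here are placeholders to be tuned, not recommendations):

```python
from gensim.models.doc2vec import Doc2Vec

# A sketch only: dm=0 selects DBOW, dbow_words=1 adds concurrent
# skip-gram word training, hs=1/negative=0 switches to hierarchical
# softmax, and iter=10 matches the paper's "at least 10 epochs".
model = Doc2Vec(
    dm=0, dbow_words=1,   # DBOW with mixed doc- and word-vector training
    size=500, window=8,   # placeholder values; both worth tuning
    min_count=19,         # tune via scale_vocab() dry runs (see below)
    hs=1, negative=0,     # hierarchical softmax, no negative sampling
    iter=10,              # don't rely on the inherited default of 5
)
```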

Regarding results:

  • Your first similarity results do indicate meaningful doc-vector learning has happened, so things seem on the right track. Going to at least 10 iterations might help.
  • The paper isn't clear as to whether they unit-normed all vectors before their vector-arithmetic operations; they probably did. (The usual analogy-solving, as in Word2Vec.most_similar(positive=['king','woman'], negative=['man']), does in fact operate on the normed vectors for each of the words.) You're currently using the raw vectors; you could access the normed vectors via syn0norm, or use d2v.init_sims(replace=True) to discard the raw vectors and then only get back normed vectors from future bracket-indexing.
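
For intuition, unit-norming is just dividing each vector by its L2 length; a tiny pure-Python sketch (not gensim code) of what the normed syn0norm copies contain:

```python
from math import sqrt

def unit_norm(vec):
    """Scale a vector to unit length, as the normed syn0norm copies are."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# On unit-length vectors, cosine similarity reduces to a plain dot product.
unit_norm([3.0, 4.0])  # → [0.6, 0.8]
```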

Hope this helps!

@piskvorky
Owner

Regarding computation resources, what exactly do you need @isohyt ?

We could provide access to our dev servers, if that helps (and if @tmylk greenlights the need).

@isomap
Contributor Author

isomap commented Apr 2, 2016

@gojomo, thank you for your helpful and insightful comments; I will try everything you proposed.
On the other hand, I should keep the DM results to compare with the DBOW results, shouldn't I?

@piskvorky, I'm very happy about your proposal. I'd like to run the ipython notebook on your dev servers. However, I'm a little worried, because it's my first time using a remote server to run a program...

@piskvorky
Owner

We can do that, using SSH port tunnelling (@tmylk would help you set this up).
But please first describe what kind of server resources you need.

@gojomo
Collaborator

gojomo commented Apr 2, 2016

@isohyt – the original paper didn't report DM results for comparison, so I wouldn't say that's strictly necessary for showing how to reproduce the paper's experiment. But it would be interesting, along with how other parameter variations (window size, HS-vs-negative, frequent-word-downsampling, etc) might affect the results!

@isomap
Contributor Author

isomap commented Jun 13, 2016

Sorry for the delay in updating this tutorial.
I'm now training several models in the ipynb.

@tmylk
Contributor

tmylk commented Jun 16, 2016

@isomap
Contributor Author

isomap commented Jul 8, 2016

I've finished training two d2v models, DBOW and DM, on Wikipedia.
The results are fascinating!

@gojomo
Collaborator

gojomo commented Jul 11, 2016

This is great stuff!

I notice in your notebook, your DM model keeps the default number of iterations (5), while the DBOW uses a full 10 like the paper.

Also, the max_vocab_size is a crude mechanism that personally I'd recommend against using unless absolutely necessary. Its mechanism – discarding a lot of words mid-scan, whenever the count hits the max – actually results in a smaller (perhaps much smaller) final vocabulary, and even of those words remaining, they won't exactly be the most-frequent words (though they should be close).

If possible, I'd suggest instead iteratively discovering the min_count that results in a desired just-about-1M-word vocabulary size. The way to do this without repeating the expensive word-discovery scan is to break the build_vocab() call up into its constituent scan_vocab(), scale_vocab(), finalize_vocab() calls. At the scale_vocab() step, it can be called with test values and a dry_run=True parameter, to just print the resulting sizes without making any permanent changes. When you find the right tradeoff, call it one last time with dry_run=False, then finalize_vocab(), to finish the usual build_vocab() process.
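
The search described above can be mimicked in plain Python — the sketch below stands in for scale_vocab(..., dry_run=True) by filtering a raw word-count dict directly (a toy illustration of the idea, not gensim's API):

```python
def retained_vocab_size(raw_vocab, min_count):
    """How many words would survive a given min_count threshold
    (the number the scale_vocab dry run is used to discover)."""
    return sum(1 for count in raw_vocab.values() if count >= min_count)

def find_min_count(raw_vocab, target_size):
    """Smallest min_count whose retained vocabulary fits target_size."""
    min_count = 1
    while retained_vocab_size(raw_vocab, min_count) > target_size:
        min_count += 1
    return min_count

# Toy counts: with a target of 3 words, min_count=2 is enough.
counts = {'the': 10, 'of': 5, 'wiki': 5, 'rare': 1}
find_min_count(counts, 3)  # → 2
```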

@isomap
Contributor Author

isomap commented Jul 12, 2016

Thanks, @gojomo
I forgot to set the number of DM iterations, so I will re-train those models.

At the same time, I will check for the optimal min_count.
However, I don't know how to extract vocab_size directly.
I found an optimal min_count=19, which gives vocab_size=898,725. That's close to the original paper's 915,715. However, the method I used to extract this number is not elegant.
This is the vocab_size extraction code (it uses an hs=1 model, because the division by 700 relies on scale_vocab's memory estimate of 700 bytes per vocab word in hs mode):
model.scale_vocab(min_count=num, dry_run=True)['memory']['vocab'] / 700

If the preprocessing code is to remain in the ipynb, I'd like to write clearer code.
Do you know how to extract the vocab_size information directly from the scale_vocab method, @gojomo?

@gojomo
Collaborator

gojomo commented Jul 12, 2016

In the past, I've watched the log output and adjusted interactively, though I see why that's not ideal for robust code or demo notebooks.

It looks like scale_vocab() doesn't include the key value, retain_words, in its returned report_values. It could, but the number it does return – drop_unique, a count of the words dropped at the trial min_count – can be subtracted from len(model.raw_vocab) to get the end vocab-size, if that min_count were to be destructively applied.
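
In code, that subtraction is a one-liner; the numbers below are hypothetical, standing in for len(model.raw_vocab) and the report_values dict a dry-run scale_vocab() call would return:

```python
def end_vocab_size(raw_vocab_len, report_values):
    """Vocab size if this trial min_count were destructively applied:
    unique words seen in the scan, minus the words the trial drops."""
    return raw_vocab_len - report_values['drop_unique']

# Hypothetical numbers, for illustration only:
end_vocab_size(50, {'drop_unique': 8})  # → 42
```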

@rtanzifi

rtanzifi commented Jul 13, 2016

Thank you guys especially @isohyt for this thread.
I have a problem running the implementation of the http://arxiv.org/abs/1507.07998 work. The error raised is:

model.build_vocab(documents)
AttributeError: 'tuple' object has no attribute 'build_vocab'

I think you already knew that, but the implementation is found here: https://github.com/isohyt/gensim/blob/cb22f47f371457061b98f9390042f12b108587cf/docs/notebooks/doc2vec-wikipedia.ipynb

I would really appreciate any help ;)

@gojomo
Collaborator

gojomo commented Jul 13, 2016

@rtanzifi The string "model.build_vocab(documents)" does not appear in the notebook. If you're getting such an error in your modified code, you've somehow made model into a tuple rather than a Doc2Vec instance. While this work-in-progress may be a good example to learn from, if you have questions specific to your own customizations, the list may be a better place to discuss, and you'll have to provide enough context to understand what you've changed.

@isomap
Contributor Author

isomap commented Sep 4, 2016

Sorry, I forgot to update this tutorial.
I've updated the vocab-size checking code in the ipynb.

@tmylk
Contributor

tmylk commented Sep 6, 2016

Please make it one file, add a note to the changelog.md, and we will merge.

@isomap isomap changed the title [WIP] Doc2vec to wikipedia Doc2vec to wikipedia Sep 9, 2016
@isomap
Contributor Author

isomap commented Sep 10, 2016

@tmylk It's ready to be merged :)

@@ -1,9 +1,23 @@
Changes
=======

* Add doc2vec tutorial using wikipedia dump. (@isohyt, #654)
Contributor


Please fix merge errors. It should be just 1 line added in Changelog.

@tmylk
Contributor

tmylk commented Oct 3, 2016

Please merge in develop to resolve the merge conflicts

@tmylk tmylk merged commit 4f00b77 into piskvorky:develop Oct 3, 2016
@isomap isomap deleted the doc2vec-wikipedia branch October 10, 2016 09:22
@isomap isomap restored the doc2vec-wikipedia branch October 10, 2016 09:22
@ERijck ERijck mentioned this pull request Jan 17, 2023