doc2vec-lee.ipynb results ... not even close #1088

Closed
johncleveland opened this issue Jan 12, 2017 · 5 comments
Labels
difficulty easy (Easy issue: required small fix), documentation (Current issue related to documentation)

Comments

@johncleveland

For this GitHub tutorial: gensim/docs/notebooks/doc2vec-lee.ipynb
I have copied the code verbatim and have been unable to reproduce anything near the 95% rate.
collections.Counter(ranks) #96% accuracy
Counter({0: 292, 1: 8})

I have used python 2.7.12, 2.7.13, 3.5 on both Windows 10 and Ubuntu 16.10.
I have also had a friend try it on his Windows system. My results are all over the place.
What could possibly be the problem? I am just copy-pasting.
Thanks

@gojomo
Collaborator

gojomo commented Jan 12, 2017

Open-ended questions/discussion that are not bug-reports or feature-requests should go to the project discussion list at https://groups.google.com/forum/#!forum/gensim rather than this issues-tracker.

So please post your question there. (When you do so, it'd be helpful to make clear whether you've tried running the code in a Jupyter notebook itself and had the same problem, and what gensim version you're using, and what exact results or logged output you are seeing rather than what you expect.)
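One way to gather the details requested above is a small script that prints the interpreter, platform, and installed package versions, ready to paste into a report. This is only a sketch: the helper names `package_version` and `environment_report` are made up for illustration, and it assumes Python 3.8 or later (for `importlib.metadata`), so it would not run on the 2.7 interpreters mentioned in the original post.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("gensim", "numpy", "scipy")):
    """Build a short, paste-ready description of the current environment."""
    lines = [
        "python: " + platform.python_version(),
        "platform: " + platform.platform(),
    ]
    lines += ["{}: {}".format(name, package_version(name)) for name in packages]
    return "\n".join(lines)


report = environment_report()
print(report)
```

The logged training output and the actual `Counter` result would still need to be copied in by hand.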

@gojomo gojomo closed this as completed Jan 12, 2017
@piskvorky
Owner

piskvorky commented Jan 13, 2017

Looks like a (slightly incomplete) bug report to me.

@gojomo
Collaborator

gojomo commented Jan 16, 2017

Reopening, as it does seem that our updating of Doc2Vec defaults made the examples in this notebook less effective and stable - see discussion thread at https://groups.google.com/d/msg/gensim/bs77ke1Zun0/9lrMo_w0CAAJ

I believe upping the iter to 50 restores the intent of the example, without changing other defaults. Some text could be added alongside the related cells to the effect of: (1) small datasets with short documents can benefit from more training passes; (2) the checking of an inferred-vector against a training-vector is a sort of 'sanity check' as to whether the model is behaving in a usefully consistent manner, though not a real 'accuracy' value.
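To make point (2) concrete, here is a minimal, library-free sketch of the self-similarity check. It is not the notebook's code: the toy vectors are invented, and the helper names `cosine` and `self_similarity_ranks` are hypothetical. In the notebook, the trained vectors come from the model and the inferred vectors from re-inferring each training document; here plain lists stand in for both.

```python
import collections
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity_ranks(trained, inferred):
    """For each document, find where its own trained vector ranks among all
    trained vectors when sorted by similarity to its inferred vector.
    Rank 0 means the document is most similar to itself."""
    ranks = []
    for doc_id, vec in enumerate(inferred):
        order = sorted(
            range(len(trained)),
            key=lambda j: cosine(vec, trained[j]),
            reverse=True,
        )
        ranks.append(order.index(doc_id))
    return ranks


# Toy stand-ins (assumed values, not real model output).
trained = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
inferred = [[0.9, 0.1], [0.1, 0.9], [0.95, 0.3]]

ranks = self_similarity_ranks(trained, inferred)
rank_counts = collections.Counter(ranks)
print(rank_counts)  # Counter({0: 2, 1: 1})
```

The share of documents at rank 0 (here 2 of 3) indicates how consistently inference lands near the trained vectors. As noted above, it is a consistency check rather than an accuracy figure, since no held-out data is involved.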

Thanks, @johncleveland, for catching and reporting this!

@gojomo gojomo reopened this Jan 16, 2017
@tmylk tmylk added the documentation and difficulty easy labels Jan 25, 2017
@ELind77
Contributor

ELind77 commented Jan 28, 2017

This may not be the right place for this, but if this is the original paragraph vectors paper, I believe there have been some serious problems with the reproducibility of those findings. In "Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews", Mikolov even has a footnote explaining that the results were not reproducible.

@gojomo
Collaborator

gojomo commented Jan 28, 2017

Yes, you can find posts across the net in a bunch of places from people who've been frustrated trying to reproduce the PV paper's error rates on the same original datasets, and a few comments by Mikolov (like that footnote) implying Le made a mistake in result-reporting.

Here, it's just a matter of our demo, on a different, much smaller dataset, not behaving the same across some other code changes.
