Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing Doc2Vec defaults so that it won't override Word2Vec defaults. fix #795 #929

Merged
merged 12 commits into from
Nov 22, 2016

Conversation

markroxor
Copy link
Contributor

@markroxor markroxor commented Oct 9, 2016

Removing Doc2Vec defaults so that it won't override Word2Vec defaults.

@markroxor
Copy link
Contributor Author

@gojomo please review.

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2016

Some thoughts:

@markroxor
Copy link
Contributor Author

@gojomo
I've added a relevant PR description, sorry about that.
I just need a little help, the Travis CI traceback suggets that -
Traceback (most recent call last): File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_doc2vec.py", line 190, in test_dmm_hs alpha=0.05, min_count=2, iter=20) File "/home/travis/build/RaRe-Technologies/gensim/gensim/models/doc2vec.py", line 603, in __init__ null_word=dm_concat, **kwargs) TypeError: __init__() got an unexpected keyword argument 'dm_mean'

I don't get the error. There is no "dm_mean" variable in either word2vec.py or doc2vec.py. Am I missing something?
thanks

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2016

There was a dm_mean before your edits, and the tests passed then – so look at the effects of that change!

@markroxor
Copy link
Contributor Author

markroxor commented Oct 9, 2016

In #795 it was said that

ideally by not specifying them at all (to avoid duplication). This probably means removing those params from the init() signature

but according to the CI test files, in this line we have to initialize params like dm_mean, window etc. So to implement this either we can change the test files or hard-code the default params from word2vec.py file in the __init__ method of doc2vec.py.

What are your views on this @gojomo ?

EDIT: About the PR #927, no were not on the same sprint.

@gojomo
Copy link
Collaborator

gojomo commented Oct 9, 2016

The added description is helpful – but a descriptive title (as edited atop the PR page) is even better, as it then appears in the Github list/notification views.

dm_mean is the trickiest parameter, as it is essentially a synonym for cbow_mean but still (like the others) should only override the default if explicitly specified by the caller.

We don't want to change the tests – holding them stable through any changes helps assure us that we're not totally breaking user code. (We're still altering the effective default, which may affect user results.)

Hard-coding dm_mean to have the same value as Word2Vec.__init__(cbow_mean) kind-of works, but means the same default is redundantly declared in two places, and any future changes would require two edits. (The best fix would keep it in a single place.)

@markroxor markroxor changed the title fix #795 Removing Doc2Vec defaults so that it won't override Word2Vec defaults. fix #795 Oct 9, 2016
@markroxor
Copy link
Contributor Author

markroxor commented Oct 9, 2016

My approach is to leave the params of __init__ of doc2vec.py as it is and remove only the overriding params in this method. Such that all the common params are now inherited from word2vec.py but now checks are failing with this new error
RuntimeError: you must first build vocabulary before training the model.
Strange.
I guess there is something else wrong with code.

EDIT: Fixed the runtime error and greeted by Travis CI error No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.

sample=sample, seed=seed, workers=workers, min_alpha=min_alpha,
sg=(1+dm) % 2, hs=hs, negative=negative, cbow_mean=dm_mean,
size=size, window=window, min_count=min_count,
sample=sample, workers=workers,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If size, window, min_count, sample, and workers all use the same defaults as Word2Vec, there's no need to re-specify them in this method signature nor pass to the call of superclass __init__().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For doc2vec the default params are size=300, window=8, min_count=5, sample=0, workers=1 and for wordvec size=100, window=5, min_count=5, sample=1e-3, workers=3. Removing these none common default params from doc2vec raise a Travis build error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of issue #795 is to harmonize these defaults, unless there is a specific reason for the Doc2Vec defaults to differ. (Is there a reason to prefer size=300', 'window=8',sample=0', or 'workers=1as being more-sensible defaults in Doc2Vec?) And something likemin_count=5` already matches Word2Vec – so no need to redundantly specify it again.

Copy link
Contributor Author

@markroxor markroxor Oct 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that is definitely not the case, in the first commit that I submitted I removed these variables too but the Travis build seem to fail. Maybe we need to change the tests as well. What do you advice?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would depend specifically on what tests fail and why. (Looking at the failures at that time, it looks the problem was you retained a reference to size that was no longer valid: you'd taken our too little, not too much.) If at all possible, the tests shouldn't change: they help confirm that the refactoring didn't break anything else. But it might be the case that a test was dependent on the old defaults, and if that's the case, it could be appropriate to adjust the test as a last resort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For initialising the default params of doc2vec if I remove all the params passed on to super(Doc2Vec, self).init(....). In that case even if you pass some parameter to doc2vec. Suppose "workers", even in that case self.workers for both word2vec and doc2vec will be defaulted to 3(the default param of word2vec ). which implies that not passing the parameters at all to super(Doc2Vec, self).init(....) results in taking the default params of word2vec irrespective of whatever params you pass.
In the case when you pass all the params to super(Doc2Vec, self).init(size=size,alpha=alpha...) the default params of doc2vec will always be considered.
and defaults of word2vec will never be used.
I did some digging, but could not find a way (and possibly no way exist) to know whether the param is the default one or passed explicitly by the user.
I want to know your opinion about this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of **kwargs handles this. If you pass workers=5 to Doc2Vec, if it's not explicitly declared in the __init__() signature, it stays in the kwargs dict. That's then passed to the superclass __init__() – overriding the default of 3. So the desired effect is achieved: shared defaults exist in one place (Word2Vec), but calls to either can override with an explicit valiue.

(As noted in a prior comment, only dm_mean is especially tricky, needing a different approach, because it's a de facto synonym for cbow_mean. The goal is to have it so that if dm_mean is unspecified, the default Word2Vec cbow_mean value is operative, but if specified, it sets cbow_mean to the specified value. It's definitely also possible, and a solution might rely on the fact kwargs is a normal dict.)

@@ -459,6 +459,7 @@ def __init__(
self.total_train_time = 0
self.sorted_vocab = sorted_vocab
self.batch_words = batch_words
self.size = size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new instance variable seems unnecessary; better to consult existing layer1_size.

Copy link
Contributor Author

@markroxor markroxor Oct 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay I will also remove this line.

@markroxor markroxor force-pushed the patch-4 branch 2 times, most recently from 9b1ac77 to c144bf3 Compare October 20, 2016 07:40
@markroxor
Copy link
Contributor Author

markroxor commented Oct 20, 2016

@gojomo there is an issue with the travis build. My builds pass perfectly here. Please have a look at #957 and travis issue #6751

@tmylk facing the stalled build issue here too.

self.cbow_mean = int(kwargs["dm_mean"])
else:
self.cbow_mean = int(cbow_mean)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the right track, but since 'dm_mean' is a Doc2Vec-specific parameter, its definition and handling would be best localized there (where it is also documented), rather than in Word2Vec.

And while the open-ended kwargs is helpful in Doc2Vec.__init__ for shuttling parameters to a related method, to have it here, in the terminal method, would mean that stray extra arguments offered in error (including misspelled parameters) would not generate any error. (Before the change, if neither Doc2Vec.__init__ nor Word2Vec.__init__ handle them, they would generate an error – which is helpful behavior.)

Try leaving dm_mean as a part of the Doc2Vec.__init__ signature, but with a default value of None. Within the __init__ body, check for a user-supplied value that should then be used as self.cbow_mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

@@ -30,10 +30,11 @@ def setUp(self):
wc = WikiCorpus(datapath(FILENAME))

def test_get_texts_returns_generator_of_lists(self):

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real reason to have these whitespace changes in this unrelated patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was suggested by Lev to leave a blank line to pass the Travis build. I am not sure why.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be removed. it was just a travis ping.

@@ -168,15 +171,15 @@ def model_sanity(self, model):
def test_training(self):
"""Test doc2vec training."""
corpus = DocsLeeCorpus()
model = doc2vec.Doc2Vec(size=100, min_count=2, iter=20)
model = doc2vec.Doc2Vec(size=100, min_count=2, iter=20, window=8, sample=0.01, workers=1)
Copy link
Collaborator

@gojomo gojomo Oct 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working with a tiny corpus, workers=1 is a good idea (and I believe what most of the Word2Vec tests do). Were the other two parameter changes also necessary for the test to still pass? (It's good to have checked it still did, but if holding them same-as-before isn't strictly necessary, better to from-here-on-out test the new defaults.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests work fine without them so removed the other two params.

model.build_vocab(corpus)
self.assertEqual(model.docvecs.doctag_syn0.shape, (300, 100))
model.train(corpus)

self.model_sanity(model)

# build vocab and train in one step; must be the same as above
model2 = doc2vec.Doc2Vec(corpus, size=100, min_count=2, iter=20)
model2 = doc2vec.Doc2Vec(corpus, size=100, min_count=2, iter=20, window=8, sample=0.01, workers=1)
Copy link
Collaborator

@gojomo gojomo Oct 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, workers=1 is good but for the long run the other parameters should only be locked to historic values if definitely necessary.

@@ -84,10 +84,11 @@ def test_int_doctags(self):
"""Test doc2vec doctag alternatives"""
corpus = DocsLeeCorpus()

model = doc2vec.Doc2Vec(min_count=1)
size = 300
model = doc2vec.Doc2Vec(min_count=1, size=size)
Copy link
Collaborator

@gojomo gojomo Oct 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these tests would succeed with the new (inherited) default size of 100, I think it'd be better to now run them that way – and then test the internal array sizes for either matching '100' or (perhaps simply to verify internal self-consistency) 'model.vector_size'. (Though, it's good that they've been run at least once with parameters matching and passing the old values – as a one-time check against regression in that config.)

Copy link
Contributor Author

@markroxor markroxor Oct 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem to fail with the size = 100.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's both a 300 (count of text examples, length of syn0) that needs to be tested, and a 100 (dimensionality of individual vectors) - you've combined them both into size. These tests will pass at default (100) dimensionality, as long as you're not asserting the example-count is also 100.

@@ -106,11 +107,13 @@ def test_string_doctags(self):
# force duplicated tags
corpus = corpus[0:10] + corpus

model = doc2vec.Doc2Vec(min_count=1)
size = 300
model = doc2vec.Doc2Vec(size=size, min_count=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, though holding test params constant for at least one cycle was useful, if new-default value would also work as suitable test, best to have it be the value going forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too, seem to fail with the size = 100.

@markroxor markroxor force-pushed the patch-4 branch 2 times, most recently from 49163a0 to 9bf5cae Compare October 20, 2016 10:13
@markroxor
Copy link
Contributor Author

markroxor commented Oct 20, 2016

The build pass perfectly in my forked repo's Travis and VM.

@gojomo
Copy link
Collaborator

gojomo commented Oct 20, 2016

Yes, there's a test that's often hanging on Python 2.7 that's unrelated to your changes (TestWikiCorpus.test_first_element). I've made a separate issue for that (#971). (It may be that forcing re-tries will cause it to eventually succeed; it may be that something unique on the test machine is now causing it to reliably fail; in either case, when you see a failure there, it's not the fault of your changes.)

Regarding the handling of dm_mean, your latest changes still have the Word2Vec class interpreting a dm_mean parameter, and accepting incorrect/misspelled parameters without error. The best fix would keep all knowledge of dm_mean inside Doc2Vec, and still error for completely-bad parameters.

@markroxor markroxor force-pushed the patch-4 branch 2 times, most recently from 718fbc6 to 0a353bb Compare October 21, 2016 05:16
@markroxor
Copy link
Contributor Author

Removed kwargs argument and any trace of dm_mean from word2vec.

@markroxor
Copy link
Contributor Author

Ping @gojomo

@tmylk
Copy link
Contributor

tmylk commented Oct 27, 2016

Looks good to me. Waiting for @gojomo review.

@piskvorky
Copy link
Owner

@tmylk: @gojomo may be overloaded with github notifications (I know I am!), so probably best to ping him at Slack #opensource for anything urgent.

Same if my input is urgently needed anyway... though I'll munch through the notifications on my own, eventually.

@gojomo
Copy link
Collaborator

gojomo commented Nov 3, 2016

Chnages all look good! The unrelated test timeout (wiki_corpus-related) means I'm not sure all Word2Vec/Doc2Vec tests have passed for the change, though.

@markroxor
Copy link
Contributor Author

markroxor commented Nov 3, 2016

Can you try restarting the build manually. The tests passed on my forked repo though. @gojomo check this.

@tmylk
Copy link
Contributor

tmylk commented Nov 8, 2016

@markroxor Please rebase to resolve merge conflicts.
Then ready to merge.

@markroxor markroxor force-pushed the patch-4 branch 2 times, most recently from 213b4ec to 98d0dc3 Compare November 8, 2016 20:14
@markroxor
Copy link
Contributor Author

ping @tmylk

@tmylk tmylk merged commit 14f12f4 into piskvorky:develop Nov 22, 2016
@tmylk
Copy link
Contributor

tmylk commented Nov 22, 2016

Thanks for the PR!

@markroxor markroxor deleted the patch-4 branch December 22, 2017 05:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants