-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removing Doc2Vec defaults so that it won't override Word2Vec defaults. fix #795 #929
Conversation
@gojomo please review. |
Some thoughts:
|
@gojomo I don't get the error. There is no "dm_mean" variable in either word2vec.py or doc2vec.py. Am I missing something? |
There was a |
In #795 it was said that
but according to the CI test files, in this line we have to initialize params like dm_mean, window etc. So to implement this either we can change the test files or hard-code the default params from What are your views on this @gojomo ? EDIT: About the PR #927, no were not on the same sprint. |
The added description is helpful – but a descriptive title (as edited atop the PR page) is even better, as it then appears in the Github list/notification views.
We don't want to change the tests – holding them stable through any changes helps assure us that we're not totally breaking user code. (We're still altering the effective default, which may affect user results.) Hard-coding |
|
sample=sample, seed=seed, workers=workers, min_alpha=min_alpha, | ||
sg=(1+dm) % 2, hs=hs, negative=negative, cbow_mean=dm_mean, | ||
size=size, window=window, min_count=min_count, | ||
sample=sample, workers=workers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If size
, window
, min_count
, sample
, and workers
all use the same defaults as Word2Vec, there's no need to re-specify them in this method signature nor pass to the call of superclass __init__()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For doc2vec
the default params are size=300, window=8, min_count=5, sample=0, workers=1
and for wordvec
size=100, window=5, min_count=5, sample=1e-3, workers=3
. Removing these none common default params from doc2vec raise a Travis build error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of issue #795 is to harmonize these defaults, unless there is a specific reason for the Doc2Vec defaults to differ. (Is there a reason to prefer size=300', 'window=8',
sample=0', or 'workers=1as being more-sensible defaults in Doc2Vec?) And something like
min_count=5` already matches Word2Vec – so no need to redundantly specify it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that is definitely not the case, in the first commit that I submitted I removed these variables too but the Travis build seem to fail. Maybe we need to change the tests as well. What do you advice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would depend specifically on what tests fail and why. (Looking at the failures at that time, it looks the problem was you retained a reference to size
that was no longer valid: you'd taken our too little, not too much.) If at all possible, the tests shouldn't change: they help confirm that the refactoring didn't break anything else. But it might be the case that a test was dependent on the old defaults, and if that's the case, it could be appropriate to adjust the test as a last resort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For initialising the default params of doc2vec if I remove all the params passed on to super(Doc2Vec, self).init(....). In that case even if you pass some parameter to doc2vec. Suppose "workers", even in that case self.workers for both word2vec and doc2vec will be defaulted to 3(the default param of word2vec ). which implies that not passing the parameters at all to super(Doc2Vec, self).init(....) results in taking the default params of word2vec irrespective of whatever params you pass.
In the case when you pass all the params to super(Doc2Vec, self).init(size=size,alpha=alpha...) the default params of doc2vec will always be considered.
and defaults of word2vec will never be used.
I did some digging, but could not find a way (and possibly no way exist) to know whether the param is the default one or passed explicitly by the user.
I want to know your opinion about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of **kwargs
handles this. If you pass workers=5
to Doc2Vec, if it's not explicitly declared in the __init__()
signature, it stays in the kwargs
dict. That's then passed to the superclass __init__()
– overriding the default of 3. So the desired effect is achieved: shared defaults exist in one place (Word2Vec), but calls to either can override with an explicit valiue.
(As noted in a prior comment, only dm_mean
is especially tricky, needing a different approach, because it's a de facto synonym for cbow_mean
. The goal is to have it so that if dm_mean
is unspecified, the default Word2Vec cbow_mean
value is operative, but if specified, it sets cbow_mean
to the specified value. It's definitely also possible, and a solution might rely on the fact kwargs
is a normal dict.)
@@ -459,6 +459,7 @@ def __init__( | |||
self.total_train_time = 0 | |||
self.sorted_vocab = sorted_vocab | |||
self.batch_words = batch_words | |||
self.size = size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new instance variable seems unnecessary; better to consult existing layer1_size
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay I will also remove this line.
9b1ac77
to
c144bf3
Compare
self.cbow_mean = int(kwargs["dm_mean"]) | ||
else: | ||
self.cbow_mean = int(cbow_mean) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the right track, but since 'dm_mean' is a Doc2Vec-specific parameter, its definition and handling would be best localized there (where it is also documented), rather than in Word2Vec.
And while the open-ended kwargs
is helpful in Doc2Vec.__init__
for shuttling parameters to a related method, to have it here, in the terminal method, would mean that stray extra arguments offered in error (including misspelled parameters) would not generate any error. (Before the change, if neither Doc2Vec.__init__
nor Word2Vec.__init__
handle them, they would generate an error – which is helpful behavior.)
Try leaving dm_mean
as a part of the Doc2Vec.__init__
signature, but with a default value of None
. Within the __init__
body, check for a user-supplied value that should then be used as self.cbow_mean
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay
@@ -30,10 +30,11 @@ def setUp(self): | |||
wc = WikiCorpus(datapath(FILENAME)) | |||
|
|||
def test_get_texts_returns_generator_of_lists(self): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real reason to have these whitespace changes in this unrelated patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was suggested by Lev to leave a blank line to pass the Travis build. I am not sure why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be removed. it was just a travis ping.
@@ -168,15 +171,15 @@ def model_sanity(self, model): | |||
def test_training(self): | |||
"""Test doc2vec training.""" | |||
corpus = DocsLeeCorpus() | |||
model = doc2vec.Doc2Vec(size=100, min_count=2, iter=20) | |||
model = doc2vec.Doc2Vec(size=100, min_count=2, iter=20, window=8, sample=0.01, workers=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working with a tiny corpus, workers=1
is a good idea (and I believe what most of the Word2Vec tests do). Were the other two parameter changes also necessary for the test to still pass? (It's good to have checked it still did, but if holding them same-as-before isn't strictly necessary, better to from-here-on-out test the new defaults.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests work fine without them so removed the other two params.
model.build_vocab(corpus) | ||
self.assertEqual(model.docvecs.doctag_syn0.shape, (300, 100)) | ||
model.train(corpus) | ||
|
||
self.model_sanity(model) | ||
|
||
# build vocab and train in one step; must be the same as above | ||
model2 = doc2vec.Doc2Vec(corpus, size=100, min_count=2, iter=20) | ||
model2 = doc2vec.Doc2Vec(corpus, size=100, min_count=2, iter=20, window=8, sample=0.01, workers=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, workers=1
is good but for the long run the other parameters should only be locked to historic values if definitely necessary.
@@ -84,10 +84,11 @@ def test_int_doctags(self): | |||
"""Test doc2vec doctag alternatives""" | |||
corpus = DocsLeeCorpus() | |||
|
|||
model = doc2vec.Doc2Vec(min_count=1) | |||
size = 300 | |||
model = doc2vec.Doc2Vec(min_count=1, size=size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these tests would succeed with the new (inherited) default size of 100, I think it'd be better to now run them that way – and then test the internal array sizes for either matching '100' or (perhaps simply to verify internal self-consistency) 'model.vector_size'. (Though, it's good that they've been run at least once with parameters matching and passing the old values – as a one-time check against regression in that config.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seem to fail with the size = 100
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's both a 300 (count of text examples, length of syn0
) that needs to be tested, and a 100 (dimensionality of individual vectors) - you've combined them both into size
. These tests will pass at default (100) dimensionality, as long as you're not asserting the example-count is also 100.
@@ -106,11 +107,13 @@ def test_string_doctags(self): | |||
# force duplicated tags | |||
corpus = corpus[0:10] + corpus | |||
|
|||
model = doc2vec.Doc2Vec(min_count=1) | |||
size = 300 | |||
model = doc2vec.Doc2Vec(size=size, min_count=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, though holding test params constant for at least one cycle was useful, if new-default value would also work as suitable test, best to have it be the value going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too, seem to fail with the size = 100.
49163a0
to
9bf5cae
Compare
The build pass perfectly in my forked repo's Travis and VM. |
Yes, there's a test that's often hanging on Python 2.7 that's unrelated to your changes (TestWikiCorpus.test_first_element). I've made a separate issue for that (#971). (It may be that forcing re-tries will cause it to eventually succeed; it may be that something unique on the test machine is now causing it to reliably fail; in either case, when you see a failure there, it's not the fault of your changes.) Regarding the handling of |
718fbc6
to
0a353bb
Compare
Removed |
Ping @gojomo |
Looks good to me. Waiting for @gojomo review. |
Chnages all look good! The unrelated test timeout (wiki_corpus-related) means I'm not sure all Word2Vec/Doc2Vec tests have passed for the change, though. |
@markroxor Please rebase to resolve merge conflicts. |
more parameters excluded
213b4ec
to
98d0dc3
Compare
ping @tmylk |
Thanks for the PR! |
Removing Doc2Vec defaults so that it won't override Word2Vec defaults.