Word2Vec/Doc2Vec offer model-minimization method Fix issue #446 #987
Conversation
add finished_training method
""" | ||
Discard parametrs that are used in training and score. Use if you're sure you're done training a model, | ||
""" | ||
self.training_finished = True |
Please call the super method in word2vec explicitly
for j in [0, 1]:
    model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
    model.finished_training()
    self.assertTrue(len(model.vocab), 12)
Please test that the necessary attributes are indeed deleted.
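A hedged sketch of what that could look like, assuming finished_training unconditionally drops syn1, syn1neg and syn0_lockf at this stage of the PR (attribute names come from this PR's diffs; the test method name and the surrounding sentences fixture are assumed from the existing test module):

def test_finished_training_trims_word2vec(self):
    for i in [0, 1]:
        for j in [0, 1]:
            model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
            model.finished_training()
            # the word vectors themselves must survive the trimming
            self.assertEqual(len(model.vocab), 12)
            # the arrays used only during training should be gone
            self.assertFalse(hasattr(model, 'syn1'))
            self.assertFalse(hasattr(model, 'syn1neg'))
            self.assertFalse(hasattr(model, 'syn0_lockf'))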
for j in [0, 1]:
    model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, negative=i, hs=j)
    model.finished_training()
    self.assertTrue(len(model.infer_vector(['graph'])), 5)
Please test that the necessary attributes are indeed deleted.
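For the Doc2Vec case, a similar hedged sketch. It assumes the trimming delegates to _minimize_model(self.hs, self.negative > 0, True) as shown later in this thread, so only the attributes that call asks to save should survive, and infer_vector must still work; names follow the PR conversation, not a released API:

def test_finished_training_trims_doc2vec(self):
    for i in [0, 1]:
        for j in [0, 1]:
            if i == 0 and j == 0:
                continue  # a model with neither hs nor negative sampling has nothing to infer with
            model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, negative=i, hs=j)
            model.finished_training()
            # inference must still work on the trimmed model
            self.assertEqual(len(model.infer_vector(['graph'])), 5)
            # only the attributes _minimize_model was told to save should survive:
            # hs controls syn1, negative controls syn1neg, syn0_lockf is always kept
            self.assertEqual(hasattr(model, 'syn1'), bool(j))
            self.assertEqual(hasattr(model, 'syn1neg'), bool(i))
            self.assertTrue(hasattr(model, 'syn0_lockf'))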
We can't just call «the super method in word2vec explicitly» without adding a flag to save syn0_lockf, which is necessary to keep in d2v.
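This is roughly where the PR ends up, as shown in the diff hunks quoted below: a shared helper on Word2Vec with per-attribute save flags, so Doc2Vec can reuse it and still keep syn0_lockf. A condensed sketch; the signature and the Doc2Vec call mirror the hunks below, but the conditional deletes inside the helper are a paraphrase, not the exact committed body:

class Word2Vec(object):
    def _minimize_model(self, save_syn1=False, save_syn1neg=False, save_syn0_lockf=False):
        # drop only what the caller did not ask to keep
        if not save_syn1 and hasattr(self, 'syn1'):
            del self.syn1
        if not save_syn1neg and hasattr(self, 'syn1neg'):
            del self.syn1neg
        if not save_syn0_lockf and hasattr(self, 'syn0_lockf'):
            del self.syn0_lockf
        self.model_trimmed_post_training = True

class Doc2Vec(Word2Vec):
    def finished_training(self):
        # keep whatever infer_vector still needs: the hidden layer for the
        # active training mode, plus syn0_lockf
        self._minimize_model(self.hs, self.negative > 0, True)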
@@ -392,6 +392,7 @@ def init_sims(self, replace=False):
etc., but not `train` or `infer_vector`.

"""
print ('HELLO DOC!!!')
deleted in next commit
def finished_training(self):
    """
    Discard parametrs that are used in training and score. Use if you're sure you're done training a model.
Typo: parameters.
""" | ||
for i in xrange(self.syn0.shape[0]): | ||
self.syn0[i, :] /= sqrt((self.syn0[i, :] ** 2).sum(-1)) | ||
self.syn0norm = self.syn0 |
Not all post-training applications want the unit-normalized vectors!
""" | ||
self._minimize_model(self.hs, self.negative > 0, True) | ||
if hasattr(self, 'doctag_syn0'): | ||
del self.doctag_syn0 |
Many will consider the bulk-trained doctag-vectors a part of the model they want to retain.
Your changes closely match the motivating issue, #446 - but even though I originally wrote that, what I've learned since then makes me think the minimization needs to be finer-grained, because much of this state is still relevant for downstream applications even without continued training. So I've added revisions to #446 that echo my line-by-line comments here.
@@ -465,7 +465,7 @@ def __init__( 
self.total_train_time = 0
self.sorted_vocab = sorted_vocab
self.batch_words = batch_words

self.training_finished = False
A better name would be model_trimmed_post_training = False
@@ -1750,6 +1752,27 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, c
def __str__(self):
    return "%s(vocab=%s, size=%s, alpha=%s)" % (self.__class__.__name__, len(self.index2word), self.vector_size, self.alpha)

def _minimize_model(self, save_syn1 = False, save_syn1neg = False, save_syn0_lockf = False):
    self.training_finished = True
The flag is best set at the end of the method.
""" | ||
if replace: | ||
for i in xrange(self.syn0.shape[0]): | ||
self.syn0[i, :] /= sqrt((self.syn0[i, :] ** 2).sum(-1)) |
why duplicate code and not just call init_sims?
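A hedged sketch of that suggestion - delegate the replace case to the existing init_sims instead of repeating the normalization loop. The method name here is the one used at this stage of the PR (discard_model_parameters, later renamed delete_temporary_training_data), and the exact committed body may differ:

def discard_model_parameters(self, replace=False):
    """
    Discard parameters that are used only in training and scoring.
    """
    if replace:
        # reuse the existing normalization path instead of duplicating the loop
        self.init_sims(replace=True)
    self._minimize_model(save_syn1=False, save_syn1neg=False, save_syn0_lockf=False)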
@@ -757,6 +757,8 @@ def train(self, sentences, total_words=None, word_count=0,
sentences are the same as those that were used to initially build the vocabulary.

"""
if (self.model_trimmed_post_training):
    raise RuntimeError("parameters for training were discarded")
Let's make a better message, starting with a capital letter: "Parameters for training were discarded using model_trimmed_post_training method".
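That is, something along these lines (guard placement per the train diff above, message wording per the reviewer's suggestion):

if self.model_trimmed_post_training:
    raise RuntimeError(
        "Parameters for training were discarded using model_trimmed_post_training method")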
self.assertTrue(len(model['human']), 10)
self.assertTrue(model.vocab['graph'].count, 5)
if (i == 1):
    self.assertTrue(hasattr(model, 'syn1'))
should we assert here that syn1 is deleted by _minimize_model? Same for the other attributes.
del self.syn0_lockf
self.model_trimmed_post_training = True

def discard_model_parameters(self, replace=False):
Maybe delete_temporary_training_data is a better name. What do you think?
My English language skills allow me only to agree with you.
But on this question, obviously yes.
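With that name, the intended end-user workflow would look roughly like this (a sketch; the exact keyword arguments were still being debated in this thread and may not match the final merged signature):

from gensim.models import word2vec

sentences = [['human', 'interface', 'computer'], ['graph', 'minors', 'trees']]
model = word2vec.Word2Vec(sentences, size=10, min_count=1)
# done training: free the memory held by training-only arrays
model.delete_temporary_training_data(replace=False)
print(model.most_similar('graph'))  # querying still works on the trimmed model
# model.train(sentences)  # would now raise RuntimeError, since training state was discarded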
self.assertTrue(not hasattr(model, 'syn1'))
self.assertTrue(not hasattr(model, 'syn1neg'))
self.assertTrue(not hasattr(model, 'syn0_lockf'))
model = word2vec.Word2Vec(sentences, min_count=1)
this is a separate test.
else:
    self.assertTrue(not hasattr(model, 'syn1neg'))
    self.assertTrue(hasattr(model, 'syn0_lockf'))
Seems I "synced" in git without "committing" when I added the self.docvecs 'doctag_syn0' checks :) Will fix it.
model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, hs=i, negative=j)
model.discard_model_parameters(remove_doctags_vectors=True)
if i == 0 and j == 0:
    continue
we can actually do hs and negative sampling...
model.discard_model_parameters(remove_doctags_vectors=True)
if i == 0 and j == 0:
    continue
model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, window=4, hs=i, negative=j)
add asserts that it has all the attributes that are about to be deleted
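A hedged sketch of those pre-deletion asserts, inside the same i/j loop as the quoted test. Attribute names come from this PR's diffs; whether the doctag vectors live on the model or on model.docvecs follows the discussion above:

model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, window=4, hs=i, negative=j)
# everything we are about to delete must exist on the freshly trained model
self.assertTrue(hasattr(model, 'syn0_lockf'))
self.assertTrue(hasattr(model.docvecs, 'doctag_syn0'))
if i:
    self.assertTrue(hasattr(model, 'syn1'))
if j:
    self.assertTrue(hasattr(model, 'syn1neg'))
model.discard_model_parameters(remove_doctags_vectors=True)
# ... and the doctag vectors must be gone afterwards, since their removal was requested
self.assertFalse(hasattr(model.docvecs, 'doctag_syn0'))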
for i in [0, 1]:
    for j in [0, 1]:
        model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
        model.discard_model_parameters(replace=True)
add assert that it has the attributes that are about to get deleted
Please add a note in CHANGELOG.md describing the change.
del self.syn0_lockf
self.model_trimmed_post_training = True

def delete_temporary_training_data(self, replace=False):
Can we rename the replace parameter to replace_word_vectors_with_normalized?
I called the parameter this way because we have init_sims(replace=False), with a parameter of the same idea. Should we rename the parameter of init_sims too?
In the init_sims context it is self-explanatory, but in delete_temporary_training_data it looks strange.
Thanks for the PR!
add delete_temporary_training_data method