
Word2Vec/Doc2Vec offer model-minimization method Fix issue #446 #987

Merged: 18 commits into piskvorky:develop on Nov 13, 2016

Conversation

@pum-purum-pum-pum (Contributor) commented Oct 31, 2016

add delete_temporary_training_data method

add finished_training method
@tmylk changed the title from "issue #446" to "Word2Vec/Doc2Vec offer model-minimization method Fix issue #446" on Oct 31, 2016
"""
Discard parametrs that are used in training and score. Use if you're sure you're done training a model,
"""
self.training_finished = True
Contributor:

Please call the super method in word2vec explicitly.

for j in [0, 1]:
    model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
    model.finished_training()
    self.assertTrue(len(model.vocab), 12)
Contributor:

Please test that the necessary attributes are indeed deleted.

for j in [0, 1]:
    model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, negative=i, hs=j)
    model.finished_training()
    self.assertTrue(len(model.infer_vector(['graph'])), 5)
Contributor:

Please test that the necessary attributes are indeed deleted.
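A sketch of what such a check could look like (the toy corpus, parameter values, and the finished_training name as it stood at this point in the review are assumptions for illustration, not the merged test):

# Illustrative test sketch: verify the training-only attributes are gone after trimming.
import unittest
from gensim.models import word2vec

sentences = [['human', 'interface', 'computer'], ['graph', 'trees', 'minors']]

class TestModelTrimming(unittest.TestCase):
    def test_training_attributes_deleted(self):
        model = word2vec.Word2Vec(sentences, size=10, min_count=1, seed=42, hs=1, negative=5)
        model.finished_training()
        for attr in ('syn1', 'syn1neg', 'syn0_lockf'):
            self.assertFalse(hasattr(model, attr), "%s should have been deleted" % attr)

if __name__ == '__main__':
    unittest.main()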

We can't just call «the super method in word2vec explicitly» without adding a flag to save syn0_lockf, which is necessary to keep in d2v.
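For illustration, a minimal sketch of how an explicit super call could coexist with keeping syn0_lockf. The class skeletons and the keep_syn0_lockf flag are hypothetical stand-ins; the actual PR routes this through _minimize_model with save flags.

# Hypothetical sketch of "call the super method explicitly" while preserving
# syn0_lockf for Doc2Vec; attribute names mirror those discussed in this PR.
class Word2Vec(object):
    def finished_training(self, keep_syn0_lockf=False):
        # drop word2vec-only training state
        for attr in ('syn1', 'syn1neg'):
            if hasattr(self, attr):
                delattr(self, attr)
        if not keep_syn0_lockf and hasattr(self, 'syn0_lockf'):
            del self.syn0_lockf


class Doc2Vec(Word2Vec):
    def finished_training(self):
        # delegate the shared cleanup to the superclass explicitly,
        # asking it to keep syn0_lockf, which Doc2Vec still needs
        Word2Vec.finished_training(self, keep_syn0_lockf=True)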
@@ -392,6 +392,7 @@ def init_sims(self, replace=False):
etc., but not `train` or `infer_vector`.

"""
print ('HELLO DOC!!!')
Contributor Author:

Deleted in next commit.


def finished_training(self):
    """
    Discard parametrs that are used in training and score. Use if you're sure you're done training a model.
Collaborator:

Typo: parameters.

"""
for i in xrange(self.syn0.shape[0]):
self.syn0[i, :] /= sqrt((self.syn0[i, :] ** 2).sum(-1))
self.syn0norm = self.syn0
Collaborator:

Not all post-training applications want the unit-normalized vectors!
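The objection, restated: overwriting syn0 with unit-normalized rows is lossy, since some downstream uses still want the raw vectors. A small standalone numpy sketch of the distinction (not gensim code):

import numpy as np

def normalized_rows(vectors):
    """Return a unit-length copy of each row, leaving the input untouched."""
    norms = np.sqrt((vectors ** 2).sum(axis=1, keepdims=True))
    return vectors / norms

raw = np.random.rand(12, 10).astype(np.float32)   # stand-in for syn0
syn0norm = normalized_rows(raw)                    # normalized copy for cosine similarity
# 'raw' stays available for applications that need the unnormalized vectors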

"""
self._minimize_model(self.hs, self.negative > 0, True)
if hasattr(self, 'doctag_syn0'):
del self.doctag_syn0
Collaborator:

Many will consider the bulk-trained doctag-vectors a part of the model they want to retain.
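One way to accommodate that, as a method-level sketch; the keep_doctags_vectors flag and the exact attribute holding the document vectors are illustrative assumptions, not the merged API.

# Sketch of a Doc2Vec method that only discards the doctag vectors on request.
def delete_temporary_training_data(self, keep_doctags_vectors=True):
    # shared word2vec cleanup (flags: save_syn1, save_syn1neg, save_syn0_lockf)
    self._minimize_model(self.hs, self.negative > 0, True)
    if not keep_doctags_vectors and hasattr(self, 'doctag_syn0'):
        del self.doctag_syn0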

@gojomo (Collaborator) commented Nov 2, 2016

Your changes closely match the motivating issue, #446 - but even though I originally wrote that issue, what I've learned since then makes me think minimization needs to be finer-grained, because much of this state is still relevant for downstream applications even without continued training. So I've added revisions to #446 that echo my line-by-line comments here.

@@ -465,7 +465,7 @@ def __init__(
self.total_train_time = 0
self.sorted_vocab = sorted_vocab
self.batch_words = batch_words

self.training_finished = False
Contributor:

A better name would be model_trimmed_post_training = False.

@@ -1750,6 +1752,27 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, c
def __str__(self):
    return "%s(vocab=%s, size=%s, alpha=%s)" % (self.__class__.__name__, len(self.index2word), self.vector_size, self.alpha)

def _minimize_model(self, save_syn1 = False, save_syn1neg = False, save_syn0_lockf = False):
    self.training_finished = True
Contributor:

The flag is best set at the end of the method.
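That is, something along these lines (a sketch of the ordering only, with the attribute handling abbreviated):

def _minimize_model(self, save_syn1=False, save_syn1neg=False, save_syn0_lockf=False):
    # delete whatever the flags allow...
    if not save_syn1 and hasattr(self, 'syn1'):
        del self.syn1
    if not save_syn1neg and hasattr(self, 'syn1neg'):
        del self.syn1neg
    if not save_syn0_lockf and hasattr(self, 'syn0_lockf'):
        del self.syn0_lockf
    # ...and only set the flag once the cleanup has actually happened
    self.training_finished = True  # later renamed to model_trimmed_post_training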

"""
if replace:
for i in xrange(self.syn0.shape[0]):
self.syn0[i, :] /= sqrt((self.syn0[i, :] ** 2).sum(-1))
Contributor:

Why duplicate the code instead of just calling init_sims?
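That is, the duplicated normalization loop could be replaced with a call to the existing helper, roughly as sketched below (method name per the PR's later naming):

def delete_temporary_training_data(self, replace=False):
    # reuse the existing normalization helper instead of duplicating its loop
    if replace:
        self.init_sims(replace=True)
    self._minimize_model(self.hs, self.negative > 0, True)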

@@ -757,6 +757,8 @@ def train(self, sentences, total_words=None, word_count=0,
sentences are the same as those that were used to initially build the vocabulary.

"""
if (self.model_trimmed_post_training):
    raise RuntimeError("parameters for training were discarded")
Contributor:

Let's make a better message that starts with a capital letter: "Parameters for training were discarded using model_trimmed_post_training method".
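Roughly (a sketch of the suggested wording only):

if self.model_trimmed_post_training:
    raise RuntimeError("Parameters for training were discarded using model_trimmed_post_training method")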

self.assertTrue(len(model['human']), 10)
self.assertTrue(model.vocab['graph'].count, 5)
if (i == 1):
    self.assertTrue(hasattr(model, 'syn1'))
@tmylk (Contributor) commented Nov 10, 2016

Should we assert here that syn1 is deleted by _minimize_model? Same for the other attributes.

del self.syn0_lockf
self.model_trimmed_post_training = True

def discard_model_parameters(self, replace=False):
Contributor:

Maybe delete_temporary_training_data is a better name. What do you think?

Contributor Author:

My English language skills allow me only to agree with you.

Contributor Author:

But on this question, obviously yes.

self.assertTrue(not hasattr(model, 'syn1'))
self.assertTrue(not hasattr(model, 'syn1neg'))
self.assertTrue(not hasattr(model, 'syn0_lockf'))
model = word2vec.Word2Vec(sentences, min_count=1)
Contributor:

This is a separate test.

else:
    self.assertTrue(not hasattr(model, 'syn1neg'))
    self.assertTrue(hasattr(model, 'syn0_lockf'))

Contributor Author:

Seems I did a "sync" in git without a "commit" when I added the self.docvecs / 'doctag_syn0' checks :) Will fix it.

model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, hs=i, negative=j)
model.discard_model_parameters(remove_doctags_vectors=True)
if i == 0 and j == 0:
    continue
Contributor:

We can actually do hs and negative sampling...

model.discard_model_parameters(remove_doctags_vectors=True)
if i == 0 and j == 0:
    continue
model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, window=4, hs=i, negative=j)
Contributor:

Add asserts that it has all the attributes that are about to be deleted.
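For example, the loop body could check presence before trimming and absence after (a sketch, using the discard_model_parameters name as it stood in this revision and skipping the hs=0, negative=0 combination as above):

model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, window=4, hs=i, negative=j)
# the temporary training attributes should exist before trimming...
if i == 1:
    self.assertTrue(hasattr(model, 'syn1'))
if j == 1:
    self.assertTrue(hasattr(model, 'syn1neg'))
model.discard_model_parameters(remove_doctags_vectors=True)
# ...and be gone afterwards
self.assertFalse(hasattr(model, 'syn1'))
self.assertFalse(hasattr(model, 'syn1neg'))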

for i in [0, 1]:
    for j in [0, 1]:
        model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
        model.discard_model_parameters(replace=True)
Contributor:

Add an assert that it has the attributes that are about to get deleted.

@tmylk (Contributor) commented Nov 11, 2016

Please add a note to CHANGELOG.md describing the change. It should go under the next release.

del self.syn0_lockf
self.model_trimmed_post_training = True

def delete_temporary_training_data(self, replace=False):
Contributor:

Can we rename the replace parameter to replace_word_vectors_with_normalized?

Contributor Author:

I called the parameter this way because we have init_sims(replace=False), whose parameter carries the same idea. Should we rename the parameter of init_sims too?

Contributor:

In the init_sims context it is self-explanatory, but in delete_temporary_training_data it looks strange.

@tmylk merged commit 284a9f7 into piskvorky:develop on Nov 13, 2016
@tmylk (Contributor) commented Nov 13, 2016

Thanks for the PR!
