Word2Vec/Doc2Vec offer model-minimization method Fix issue #446 #987
Conversation
add finished_training method
""" | ||
Discard parametrs that are used in training and score. Use if you're sure you're done training a model, | ||
""" | ||
self.training_finished = True |
Please call the super method in word2vec explicitly
for j in [0, 1]:
    model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
    model.finished_training()
    self.assertTrue(len(model.vocab), 12)
Please test that the necessary attributes are indeed deleted.
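A hedged sketch of what that could look like, assuming finished_training unconditionally drops syn1, syn1neg and syn0_lockf at this stage of the PR (attribute names come from this PR's diffs; the test method name and the surrounding sentences fixture are assumed from the existing test module):

def test_finished_training_trims_word2vec(self):
    for i in [0, 1]:
        for j in [0, 1]:
            model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
            model.finished_training()
            # the word vectors themselves must survive the trimming
            self.assertEqual(len(model.vocab), 12)
            # the arrays used only during training should be gone
            self.assertFalse(hasattr(model, 'syn1'))
            self.assertFalse(hasattr(model, 'syn1neg'))
            self.assertFalse(hasattr(model, 'syn0_lockf'))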
for j in [0, 1]:
    model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, negative=i, hs=j)
    model.finished_training()
    self.assertTrue(len(model.infer_vector(['graph'])), 5)
Please test that the necessary attributes are indeed deleted.
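For the Doc2Vec case, a similar hedged sketch. It assumes the trimming delegates to _minimize_model(self.hs, self.negative > 0, True) as shown later in this thread, so only the attributes that call asks to save should survive, and infer_vector must still work; names follow the PR conversation, not a released API:

def test_finished_training_trims_doc2vec(self):
    for i in [0, 1]:
        for j in [0, 1]:
            if i == 0 and j == 0:
                continue  # a model with neither hs nor negative sampling has nothing to infer with
            model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, negative=i, hs=j)
            model.finished_training()
            # inference must still work on the trimmed model
            self.assertEqual(len(model.infer_vector(['graph'])), 5)
            # only the attributes _minimize_model was told to save should survive:
            # hs controls syn1, negative controls syn1neg, syn0_lockf is always kept
            self.assertEqual(hasattr(model, 'syn1'), bool(j))
            self.assertEqual(hasattr(model, 'syn1neg'), bool(i))
            self.assertTrue(hasattr(model, 'syn0_lockf'))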
We can't just call «the super method in word2vec explicitly» without adding a flag to save syn0_lockf, which is necessary to keep in d2v.
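This is roughly where the PR ends up, as shown in the diff hunks quoted below: a shared helper on Word2Vec with per-attribute save flags, so Doc2Vec can reuse it and still keep syn0_lockf. A condensed sketch; the signature and the Doc2Vec call mirror the hunks below, but the conditional deletes inside the helper are a paraphrase, not the exact committed body:

class Word2Vec(object):
    def _minimize_model(self, save_syn1=False, save_syn1neg=False, save_syn0_lockf=False):
        # drop only what the caller did not ask to keep
        if not save_syn1 and hasattr(self, 'syn1'):
            del self.syn1
        if not save_syn1neg and hasattr(self, 'syn1neg'):
            del self.syn1neg
        if not save_syn0_lockf and hasattr(self, 'syn0_lockf'):
            del self.syn0_lockf
        self.model_trimmed_post_training = True

class Doc2Vec(Word2Vec):
    def finished_training(self):
        # keep whatever infer_vector still needs: the hidden layer for the
        # active training mode, plus syn0_lockf
        self._minimize_model(self.hs, self.negative > 0, True)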
@@ -392,6 +392,7 @@ def init_sims(self, replace=False):
etc., but not `train` or `infer_vector`.

"""
print ('HELLO DOC!!!')
deleted in next commit
def finished_training(self):
    """
    Discard parametrs that are used in training and score. Use if you're sure you're done training a model.
Typo: parameters.
""" | ||
for i in xrange(self.syn0.shape[0]): | ||
self.syn0[i, :] /= sqrt((self.syn0[i, :] ** 2).sum(-1)) | ||
self.syn0norm = self.syn0 |
Not all post-training applications want the unit-normalized vectors!
""" | ||
self._minimize_model(self.hs, self.negative > 0, True) | ||
if hasattr(self, 'doctag_syn0'): | ||
del self.doctag_syn0 |
Many will consider the bulk-trained doctag-vectors a part of the model they want to retain.
Your changes closely match the motivating issue, #446 - but even though I originally wrote that, what I've learned since then makes me think the minimization needs to be finer-grained, because much of this state is still relevant for downstream applications even without continued training. So I've added revisions to #446 that echo my line-by-line comments here.
@@ -465,7 +465,7 @@ def __init__( 
self.total_train_time = 0
self.sorted_vocab = sorted_vocab
self.batch_words = batch_words

self.training_finished = False
A better name would be model_trimmed_post_training = False
@@ -1750,6 +1752,27 @@ def accuracy(self, questions, restrict_vocab=30000, most_similar=most_similar, c
def __str__(self):
    return "%s(vocab=%s, size=%s, alpha=%s)" % (self.__class__.__name__, len(self.index2word), self.vector_size, self.alpha)

def _minimize_model(self, save_syn1 = False, save_syn1neg = False, save_syn0_lockf = False):
    self.training_finished = True
The flag is best set at the end of the method.
""" | ||
if replace: | ||
for i in xrange(self.syn0.shape[0]): | ||
self.syn0[i, :] /= sqrt((self.syn0[i, :] ** 2).sum(-1)) |
why duplicate code and not just call init_sims?
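A hedged sketch of that suggestion - delegate the replace case to the existing init_sims instead of repeating the normalization loop. The method name here is the one used at this stage of the PR (discard_model_parameters, later renamed delete_temporary_training_data), and the exact committed body may differ:

def discard_model_parameters(self, replace=False):
    """
    Discard parameters that are used only in training and scoring.
    """
    if replace:
        # reuse the existing normalization path instead of duplicating the loop
        self.init_sims(replace=True)
    self._minimize_model(save_syn1=False, save_syn1neg=False, save_syn0_lockf=False)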
@@ -757,6 +757,8 @@ def train(self, sentences, total_words=None, word_count=0,
sentences are the same as those that were used to initially build the vocabulary.

"""
if (self.model_trimmed_post_training):
    raise RuntimeError("parameters for training were discarded")
Let's make a better message, starting with a capital letter: "Parameters for training were discarded using model_trimmed_post_training method".
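That is, something along these lines (guard placement per the train diff above, message wording per the reviewer's suggestion):

if self.model_trimmed_post_training:
    raise RuntimeError(
        "Parameters for training were discarded using model_trimmed_post_training method")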
self.assertTrue(len(model['human']), 10)
self.assertTrue(model.vocab['graph'].count, 5)
if (i == 1):
    self.assertTrue(hasattr(model, 'syn1'))
should we assert here that syn1 is deleted by _minimize_model? Same for the other attributes.
del self.syn0_lockf
self.model_trimmed_post_training = True

def discard_model_parameters(self, replace=False):
Maybe delete_temporary_training_data is a better name. What do you think?
My English language skills allow me only to agree with you.
But on this question, obviously yes.
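With that name, the intended end-user workflow would look roughly like this (a sketch; the exact keyword arguments were still being debated in this thread and may not match the final merged signature):

from gensim.models import word2vec

sentences = [['human', 'interface', 'computer'], ['graph', 'minors', 'trees']]
model = word2vec.Word2Vec(sentences, size=10, min_count=1)
# done training: free the memory held by training-only arrays
model.delete_temporary_training_data(replace=False)
print(model.most_similar('graph'))  # querying still works on the trimmed model
# model.train(sentences)  # would now raise RuntimeError, since training state was discarded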
self.assertTrue(not hasattr(model, 'syn1'))
self.assertTrue(not hasattr(model, 'syn1neg'))
self.assertTrue(not hasattr(model, 'syn0_lockf'))
model = word2vec.Word2Vec(sentences, min_count=1)
this is a separate test.
else:
    self.assertTrue(not hasattr(model, 'syn1neg'))
    self.assertTrue(hasattr(model, 'syn0_lockf'))
Seems I "synced" in git without "committing" when I added the self.docvecs 'doctag_syn0' checks :) Will fix it.
model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, hs=i, negative=j)
model.discard_model_parameters(remove_doctags_vectors=True)
if i == 0 and j == 0:
    continue
we can actually do hs and negative sampling...
model.discard_model_parameters(remove_doctags_vectors=True)
if i == 0 and j == 0:
    continue
model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, window=4, hs=i, negative=j)
add asserts that it has all the attributes that are about to be deleted
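A hedged sketch of those pre-deletion asserts, inside the same i/j loop as the quoted test. Attribute names come from this PR's diffs; whether the doctag vectors live on the model or on model.docvecs follows the discussion above:

model = doc2vec.Doc2Vec(sentences, size=5, min_count=1, window=4, hs=i, negative=j)
# everything we are about to delete must exist on the freshly trained model
self.assertTrue(hasattr(model, 'syn0_lockf'))
self.assertTrue(hasattr(model.docvecs, 'doctag_syn0'))
if i:
    self.assertTrue(hasattr(model, 'syn1'))
if j:
    self.assertTrue(hasattr(model, 'syn1neg'))
model.discard_model_parameters(remove_doctags_vectors=True)
# ... and the doctag vectors must be gone afterwards, since their removal was requested
self.assertFalse(hasattr(model.docvecs, 'doctag_syn0'))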
for i in [0, 1]:
    for j in [0, 1]:
        model = word2vec.Word2Vec(sentences, size=10, min_count=0, seed=42, hs=i, negative=j)
        model.discard_model_parameters(replace=True)
add assert that it has the attributes that are about to get deleted
Please add a note in CHANGELOG.md describing the change.
del self.syn0_lockf
self.model_trimmed_post_training = True

def delete_temporary_training_data(self, replace=False):
Can we rename the replace parameter to replace_word_vectors_with_normalized?
I called the parameter this way because we have init_sims(replace=False), with a parameter of the same idea. Should we rename the parameter of init_sims too?
In the init_sims context it is self-explanatory, but in delete_temporary_training_data it looks strange.
Thanks for the PR!
add delete_temporary_training_data method