Added save method for doc2vec #1256
Conversation
gensim/models/doc2vec.py
Outdated
```python
fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
# store as in input order
for i in range(len(self.docvecs)):
    doctag = self.docvecs.index_to_doctag(i)
```
This will work in the case where the user's `model.docvecs.doctags` is empty, and will assign the vector index as the doctag.
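For context, a simplified sketch of the fallback being described (not gensim's exact code; the `max_rawint` and `offset2doctag` attribute names are assumptions here):

```python
# Assume the docvecs object tracks plain-int tags via max_rawint and
# string tags via offset2doctag.
def index_to_doctag(docvecs, i_index):
    candidate = i_index - docvecs.max_rawint - 1
    if 0 <= candidate < len(docvecs.offset2doctag):
        return docvecs.offset2doctag[candidate]
    # no string doctag registered: the raw int index itself serves as the tag
    return i_index
```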
Cool! Some thoughts:
@gojomo made changes according to the 1st and 3rd points, and agree on the 4th point.
Could you please add an ipynb with images showing how to visualise doc2vec in TensorBoard?
@tmylk Sure, I'll prepare a notebook tutorial for that
Suggestions on naming, reuse.
gensim/models/doc2vec.py
Outdated
```diff
@@ -808,6 +809,50 @@ def delete_temporary_training_data(self, keep_doctags_vectors=True, keep_inferen
         if self.docvecs and hasattr(self.docvecs, 'doctag_syn0_lockf'):
             del self.docvecs.doctag_syn0_lockf
 
+    def save_word2vec_format(self, fname, doc_vec=True, word_vec=False, prefix='dt_', fvocab=None, binary=False):
```
To be consistent with behavior before this change, the default should be to just write word-vectors (as when the implementation was just inherited from Word2Vec). Let the new capability require explicit activation with a new parameter.
Thinking more about names:

- To acknowledge that the doc-vectors aren't necessarily one-per-doc, but one-per-doctag, and to follow the convention elsewhere, let's enable with `doctag_vec=True` rather than `doc_vec=True`.
- Let's make the default prefix even weirder and less at risk of collision with any real tokens. In Mikolov's example sentence-vectors scripts, he prefixed those vector-keys with `*_`. So let's use `*dt_` as the default prefix.
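For illustration, a minimal sketch of the collision the prefix guards against (the vocabulary and tags below are made up):

```python
# A doctag that repeats a word from the vocab stays distinguishable once prefixed.
prefix = '*dt_'
word_keys = ['human', 'interface', 'computer']
doctags = ['SENT_0', 'human']  # 'human' collides with a word key
all_keys = word_keys + [prefix + tag for tag in doctags]
assert len(all_keys) == len(set(all_keys))  # '*dt_human' no longer collides
```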
Done
gensim/models/doc2vec.py
Outdated
```python
# save document vectors
if doc_vec:
    logger.info("storing %sx%s projection weights into %s" % (total_vec, self.vector_size, fname))
    with utils.smart_open(fname, 'wb') as fout:
```
If all writing is done in this method, it seems it should be possible, and would be cleaner, to open the file only once, rather than once and then a second time in append mode. BUT, see a later comment for a way this method might offload some of the writing, in which case re-opening for append would be the right choice.
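For reference, a minimal sketch of the single-open layout this comment suggests; `write_word_rows` and `write_doctag_rows` are hypothetical helpers standing in for the actual writing loops:

```python
# Hypothetical single-open layout: header first, then word rows, then doctag rows.
with utils.smart_open(fname, 'wb') as fout:
    fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
    if word_vec:
        write_word_rows(fout, self.wv, binary=binary)         # hypothetical helper
    if doctag_vec:
        write_doctag_rows(fout, self.docvecs, binary=binary)  # hypothetical helper
```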
gensim/models/doc2vec.py
Outdated
```python
`doc_vec` is an optional boolean indicating whether to store document vectors
`word_vec` is an optional boolean indicating whether to store word vectors
(if both doc_vec and word_vec are True, then both vectors are stored in the same file)
`prefix` to uniquely indentify doctags from word vocab, and avoid collision
```
Typo: 'indentify'
Done
gensim/models/doc2vec.py
Outdated
```python
(if both doc_vec and word_vec are True, then both vectors are stored in the same file)
`prefix` to uniquely indentify doctags from word vocab, and avoid collision
in case of repeated string in doctag and word vocab
`fvocab` is an optional file used to save the vocabulary
```
The potential to save the vocabulary, with particular index-positions that correspond to the word-vectors only, makes me think that when both word+doc vectors are stored, the word-vectors should go first. Then, at least, any vocab written aligns one-for-one with the word-vectors portion of the save file. (Also: does `fvocab` currently do anything in the save-both case?)
Done, word-vectors go first now.

> (Also: does fvocab currently do anything in the save-both case?)

It didn't, earlier. But now that only `KeyedVectors.save_word2vec_format` is used for save-only-wv and save-both, the vocab is saved in both cases.
gensim/models/doc2vec.py
Outdated
```python
    else:
        fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join("%f" % val for val in row))))
else:
    KeyedVectors.save_word2vec_format(self.wv, fname, fvocab=fvocab, binary=binary)
```
A thought, in combination with my other comments: perhaps code duplication can be reduced by, if word-vecs are enabled, just calling KeyedVectors on the word-vectors first, then appending doc-vecs if necessary. This would reuse the word-writing (and vocab-writing) code from KeyedVectors, but then re-open the file for append to add doc-vectors (if enabled). To make sure the front-of-file vector-count was correct, `KeyedVectors.save_word2vec_format()` would need a new parameter telling it to boost the count by some factor the caller knows.
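A sketch of the reuse pattern being proposed, assuming a hypothetical count-boosting parameter (called `total_vec` below) on `KeyedVectors.save_word2vec_format`; not the PR's final code:

```python
def save_word2vec_format(self, fname, doctag_vec=False, word_vec=True,
                         prefix='*dt_', fvocab=None, binary=False):
    total_vec = len(self.wv.vocab) + len(self.docvecs)
    if word_vec:
        # reuse KeyedVectors' word- and vocab-writing code; the header count is
        # boosted (hypothetical total_vec parameter) to cover appended doc-vectors
        self.wv.save_word2vec_format(fname, fvocab=fvocab, binary=binary,
                                     total_vec=total_vec if doctag_vec else None)
    if doctag_vec:
        # re-open for append (or write fresh if word-vectors were skipped)
        with utils.smart_open(fname, 'ab' if word_vec else 'wb') as fout:
            if not word_vec:
                fout.write(utils.to_utf8("%s %s\n" % (len(self.docvecs), self.vector_size)))
            # ... write one prefix-ed row per doctag here ...
```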
Done
Looks good! Maybe it just needs a note in CHANGELOG.md about the new `doctag_vec`, `word_vec` options on `Doc2Vec.save_word2vec_format()`.
Sure, and a test too.
Please expand test coverage
gensim/test/test_doc2vec.py
Outdated
```python
model = doc2vec.Doc2Vec(DocsLeeCorpus(), min_count=1)
model.save_word2vec_format(testfile(), doctag_vec=True, binary=True)
binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.wv.vocab) + len(model.docvecs), len(binary_model_dv.vocab))
```
Please add tests for more combinations of `word_vec`/`doctag_vec` `True`/`False`.
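For instance, the extra cases might look roughly like this (a sketch meant for the same test class; `testfile()`, `DocsLeeCorpus`, and the imports come from the surrounding test module):

```python
model = doc2vec.Doc2Vec(DocsLeeCorpus(), min_count=1)

# doctag vectors only
model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=False, binary=True)
only_dv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.docvecs), len(only_dv.vocab))

# word vectors only (the pre-existing default behavior)
model.save_word2vec_format(testfile(), binary=True)
only_wv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.wv.vocab), len(only_wv.vocab))
```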
This PR adds a method `save_doc2vec_format` to save document vectors in a similar format as `save_word2vec_format` saves word vectors. Another PR #699 was opened with this issue, and to address @gojomo's comment there, I've added a flag parameter to indicate whether to save word vectors along with doc vectors. For now, I append word vectors after doc vectors; let me know if some other arrangement is preferable.

And if someone would like to keep doc and word vectors disentangled, as pointed out by @gojomo, they can now save them separately to different files using `save_doc2vec_format` (with flag `word_vec=False`) and then `save_word2vec_format`.

@tmylk @gojomo If this seems ok, I'll add the corresponding load method.