Added save method for doc2vec #1256
Conversation
gensim/models/doc2vec.py
Outdated
```python
fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
# store as in input order
for i in range(len(self.docvecs)):
    doctag = self.docvecs.index_to_doctag(i)
```
This will work in the case where the user's `model.docvecs.doctags` is empty, and will assign the vector index as the doctag.
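For context, a simplified sketch of the fallback being described (not gensim's exact code; the `max_rawint` and `offset2doctag` attribute names are assumptions here):

```python
# Assume the docvecs object tracks plain-int tags via max_rawint and
# string tags via offset2doctag.
def index_to_doctag(docvecs, i_index):
    candidate = i_index - docvecs.max_rawint - 1
    if 0 <= candidate < len(docvecs.offset2doctag):
        return docvecs.offset2doctag[candidate]
    # no string doctag registered: the raw int index itself serves as the tag
    return i_index
```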
Cool! Some thoughts:
@gojomo made changes according to the 1st and 3rd points, and agree on the 4th point.
Could you please add an ipynb with images showing how to visualise doc2vec in TensorBoard?
@tmylk Sure, I'll prepare a notebook tutorial for that
Suggestions on naming, reuse.
gensim/models/doc2vec.py
Outdated
```diff
@@ -808,6 +809,50 @@ def delete_temporary_training_data(self, keep_doctags_vectors=True, keep_inferen
         if self.docvecs and hasattr(self.docvecs, 'doctag_syn0_lockf'):
             del self.docvecs.doctag_syn0_lockf
 
+    def save_word2vec_format(self, fname, doc_vec=True, word_vec=False, prefix='dt_', fvocab=None, binary=False):
```
To be consistent with behavior before this change, the default should be to just write word-vectors (as when the implementation was just inherited from Word2Vec). Let the new capability require explicit activation with a new parameter.
Thinking more about names:

- To acknowledge that the doc-vectors aren't necessarily one-per-doc, but one-per-doctag, and to follow the convention elsewhere, let's enable with `doctag_vec=True` rather than `doc_vec=True`.
- Let's make the default prefix even weirder and less at risk of collision with any real tokens. In Mikolov's example sentence-vectors scripts, he prefixed those vector-keys with `*_`. So let's use `*dt_` as the default prefix.
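For illustration, a minimal sketch of the collision the prefix guards against (the vocabulary and tags below are made up):

```python
# A doctag that repeats a word from the vocab stays distinguishable once prefixed.
prefix = '*dt_'
word_keys = ['human', 'interface', 'computer']
doctags = ['SENT_0', 'human']  # 'human' collides with a word key
all_keys = word_keys + [prefix + tag for tag in doctags]
assert len(all_keys) == len(set(all_keys))  # '*dt_human' no longer collides
```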
Done
gensim/models/doc2vec.py
Outdated
```python
# save document vectors
if doc_vec:
    logger.info("storing %sx%s projection weights into %s" % (total_vec, self.vector_size, fname))
    with utils.smart_open(fname, 'wb') as fout:
```
If all writing is done in this method, it seems it should be possible, and would be cleaner, to open the file only once, rather than once and then a second time in append mode. BUT, see a later comment for a way this method might offload some of the writing, in which case re-opening for append would be the right choice.
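For reference, a minimal sketch of the single-open layout this comment suggests; `write_word_rows` and `write_doctag_rows` are hypothetical helpers standing in for the actual writing loops:

```python
# Hypothetical single-open layout: header first, then word rows, then doctag rows.
with utils.smart_open(fname, 'wb') as fout:
    fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
    if word_vec:
        write_word_rows(fout, self.wv, binary=binary)         # hypothetical helper
    if doctag_vec:
        write_doctag_rows(fout, self.docvecs, binary=binary)  # hypothetical helper
```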
gensim/models/doc2vec.py
Outdated
```python
`doc_vec` is an optional boolean indicating whether to store document vectors
`word_vec` is an optional boolean indicating whether to store word vectors
(if both doc_vec and word_vec are True, then both vectors are stored in the same file)
`prefix` to uniquely indentify doctags from word vocab, and avoid collision
```
Typo: 'indentify'
Done
gensim/models/doc2vec.py
Outdated
```python
(if both doc_vec and word_vec are True, then both vectors are stored in the same file)
`prefix` to uniquely indentify doctags from word vocab, and avoid collision
in case of repeated string in doctag and word vocab
`fvocab` is an optional file used to save the vocabulary
```
The potential to save the vocabulary, with particular index-positions that correspond to the word-vectors only, makes me think that when both word+doc vectors are stored, the word-vectors should go first. Then, at least, any vocab written aligns one-for-one with the word-vectors portion of the save file. (Also: does `fvocab` currently do anything in the save-both case?)
Done, word-vectors go first now.

> (Also: does fvocab currently do anything in the save-both case?)

It didn't, earlier. But now that only `KeyedVectors.save_word2vec_format` is used for save-only-wv and save-both, the vocab is saved in both cases.
gensim/models/doc2vec.py
Outdated
```python
    else:
        fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join("%f" % val for val in row))))
else:
    KeyedVectors.save_word2vec_format(self.wv, fname, fvocab=fvocab, binary=binary)
```
A thought, in combination with my other comments: perhaps code duplication can be reduced by, if word-vecs are enabled, just calling KeyedVectors on the word-vectors first, then appending doc-vecs if necessary. This would reuse the word-writing (and vocab-writing) code from KeyedVectors, but then re-open the file for append to add doc-vectors (if enabled). To make sure the front-of-file vector-count was correct, `KeyedVectors.save_word2vec_format()` would need a new parameter telling it to boost the count by some factor the caller knows.
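A sketch of the reuse pattern being proposed, assuming a hypothetical count-boosting parameter (called `total_vec` below) on `KeyedVectors.save_word2vec_format`; not the PR's final code:

```python
def save_word2vec_format(self, fname, doctag_vec=False, word_vec=True,
                         prefix='*dt_', fvocab=None, binary=False):
    total_vec = len(self.wv.vocab) + len(self.docvecs)
    if word_vec:
        # reuse KeyedVectors' word- and vocab-writing code; the header count is
        # boosted (hypothetical total_vec parameter) to cover appended doc-vectors
        self.wv.save_word2vec_format(fname, fvocab=fvocab, binary=binary,
                                     total_vec=total_vec if doctag_vec else None)
    if doctag_vec:
        # re-open for append (or write fresh if word-vectors were skipped)
        with utils.smart_open(fname, 'ab' if word_vec else 'wb') as fout:
            if not word_vec:
                fout.write(utils.to_utf8("%s %s\n" % (len(self.docvecs), self.vector_size)))
            # ... write one prefix-ed row per doctag here ...
```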
Done
Looks good! Maybe it just needs a note in CHANGELOG.md about the new `doctag_vec`, `word_vec` options on `Doc2Vec.save_word2vec_format()`.
Sure, and a test too.
Please expand test coverage
gensim/test/test_doc2vec.py
Outdated
```python
model = doc2vec.Doc2Vec(DocsLeeCorpus(), min_count=1)
model.save_word2vec_format(testfile(), doctag_vec=True, binary=True)
binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.wv.vocab) + len(model.docvecs), len(binary_model_dv.vocab))
```
Please add tests for more combinations of `word_vec`/`doctag_vec` `True`/`False`.
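For instance, the extra cases might look roughly like this (a sketch meant for the same test class; `testfile()`, `DocsLeeCorpus`, and the imports come from the surrounding test module):

```python
model = doc2vec.Doc2Vec(DocsLeeCorpus(), min_count=1)

# doctag vectors only
model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=False, binary=True)
only_dv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.docvecs), len(only_dv.vocab))

# word vectors only (the pre-existing default behavior)
model.save_word2vec_format(testfile(), binary=True)
only_wv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.wv.vocab), len(only_wv.vocab))
```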
This PR adds a method `save_doc2vec_format` to save document vectors in a similar format as `save_word2vec_format` saves word vectors. Another PR #699 was opened with this issue, and to address @gojomo's comment there, I've added a flag parameter to indicate whether to save word vectors along with doc vectors. For now, I append word vectors after doc vectors; let me know if some other arrangement is preferable.

And if someone would like to keep doc and word vectors disentangled, as pointed out by @gojomo, they can now save them separately to different files using `save_doc2vec_format` (with flag `word_vec=False`) and then `save_word2vec_format`.

@tmylk @gojomo If this seems ok, I'll add the corresponding load method.