added Word2vec to Tensorflow 2D tensor file #1051

loretoparisi · 2016-12-20T19:05:07Z

This script converts the word2vec format to TensorFlow 2D tensor and metadata formats for Embedding Projector visualization.
For more information about the TensorBoard format, see: https://www.tensorflow.org/versions/master/how_tos/embedding_viz/

tmylk

Requested minor changes like comments and logging.

tmylk · 2016-12-20T20:14:18Z

gensim/scripts/word2vec2tensor.py

+#
+# Copyright (C) 2016 Loreto Parisi <loretoparisi@gmail.com>
+# Copyright (C) 2016 Silvio Ogliastri <silvio.olivastri@gmail.com>
+# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html


May I ask for joint copyright as in https://github.com/RaRe-Technologies/gensim/wiki/Developer-page#legal ?

tmylk · 2016-12-20T20:16:14Z

gensim/scripts/word2vec2tensor.py

+    The script will create two TSV files. A 2d tensor format file, and a Word Embedding metadata file. Both files will
+    use the --output file name as prefix
+This script is used to convert the word2vec format to Tensorflow 2D tensor and metadata formats for Embedding Visualization
+For more information about TensorBoard format see: https://www.tensorflow.org/versions/master/how_tos/embedding_viz/


Please add instructions on how to view the visualization in TensorBoard, e.g. "Launch `tensorboard --logdir=dir_with_tsv`".

tmylk · 2016-12-20T20:17:12Z

gensim/scripts/word2vec2tensor.py

+
+logger = logging.getLogger(__name__)
+
+'''


Please move the docstring to just after the `def word2vec2tensor` line.

tmylk · 2016-12-20T20:19:56Z

gensim/scripts/word2vec2tensor.py

+            for word in model.index2word:
+                file_metadata.write(word.encode('utf-8') + '\n')
+                vector_row = '\t'.join(map(str, model[word]))
+                file_vector.write(vector_row + '\n')


Please log the location and name of the written files.

Thanks, I have added further instructions and logging.

Done thanks.

I had an error occurring at line 54. TypeError: can't concat str to bytes
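The `TypeError` reported above is what Python 3 raises when `bytes` and `str` are concatenated, as happens in `word.encode('utf-8') + '\n'`. The following is a minimal sketch of one way to avoid it, not the fix that was applied to the script: it assumes the output files are opened in binary mode and encodes each complete line, newline included. The function name `write_projector_tsv` and the plain lists standing in for the model's vocabulary and vectors are hypothetical.

```python
import io


def write_projector_tsv(words, vectors, tensor_file, metadata_file):
    """Write one tab-separated vector row and one word per line.

    `words` and `vectors` are hypothetical stand-ins for the model's
    vocabulary list and the matching embedding rows. Both file objects
    are assumed to be opened in binary mode.
    """
    for word, vector in zip(words, vectors):
        # Build the full line as str first, then encode once, so that
        # bytes and str are never concatenated.
        metadata_file.write((word + "\n").encode("utf-8"))
        vector_row = "\t".join(str(value) for value in vector)
        tensor_file.write((vector_row + "\n").encode("utf-8"))


# In-memory buffers stand in for the two output files.
tensor_buf, meta_buf = io.BytesIO(), io.BytesIO()
write_projector_tsv(["café", "tea"], [[0.5, 1.0], [2.0, -3.0]], tensor_buf, meta_buf)
```

Opening the files in text mode with an explicit `encoding="utf-8"` and writing `str` throughout would work equally well; the point is to keep both operands of `+` the same type.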

tmylk · 2016-12-20T20:22:55Z

@anmol01gulati @parulsethi may I ask you to alpha test this PR?

parulsethi · 2016-12-21T10:46:19Z

gensim/scripts/word2vec2tensor.py

+    @word2vec_model_path word2vec model
+    @tensor_filename tensor filename prefix
+    '''    
+    model = gensim.models.Word2Vec.load_word2vec_format(word2vec_model_path, binary=True)


Better to use the text format here, or at least make it optional.

You can make the format optional, like
parser.add_argument("-b", "--binary", required=False, help="If word2vec model in binary format, set True, else False")
and pass the argument to the word2vec2tensor function:
def word2vec2tensor(word2vec_model_path,tensor_filename, binary=False):
    model = gensim.models.Word2Vec.load_word2vec_format(word2vec_model_path, binary=binary)

keeping the text format as the default.
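One caveat with the `--binary` suggestion above: without a `type` or `action`, argparse stores the value as a string, so `--binary False` yields the non-empty string `"False"`, which is truthy. A possible alternative, sketched here as an assumption rather than as what the script ended up doing, is a boolean flag with `action="store_true"`. The helper name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with an optional boolean --binary flag (sketch)."""
    parser = argparse.ArgumentParser(description="word2vec to TSV converter (sketch)")
    parser.add_argument(
        "-b", "--binary", action="store_true",
        help="Set this flag if the word2vec model is in binary format"
    )
    return parser


# Without the flag, the text format stays the default.
args_text = build_parser().parse_args([])
# With the flag, binary mode is switched on.
args_binary = build_parser().parse_args(["--binary"])
```

The resulting `args.binary` is a real `bool` and can be passed straight through as the `binary=` keyword argument.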

All done thanks.

Sorry, I forgot this earlier: add a space before `tensor_filename` in `def word2vec2tensor()`.

parulsethi · 2016-12-21T11:00:14Z

gensim/scripts/word2vec2tensor.py

+
+    with open(outfiletsv, 'w+') as file_vector:
+        with open(outfiletsvmeta, 'w+') as file_metadata:
+            for word in model.index2word:


use model.wv.index2word

Did this change get overwritten?

loretoparisi

Refactored docstrings in Python style, changed the index2word API call.

loretoparisi

Fixed docs, added optional binary model format argument

loretoparisi · 2016-12-21T20:23:18Z

Made all requested changes: logging, comments, and an optional argument for binary mode.

parulsethi · 2016-12-21T20:34:48Z

@tmylk does it need a separate test file?

tmylk · 2016-12-22T01:36:07Z

@parulsethi The problem with testing this is that we can't load it into TensorBoard on Travis. So no tests needed.

tmylk · 2016-12-22T01:36:54Z

@loretoparisi Thanks a lot for the PR! Visualisation is the top priority on our roadmap.

piskvorky

@tmylk @loretoparisi sorry, I only got to review now -- a few minor changes needed. Thanks for the cool new feature!

piskvorky · 2016-12-26T03:58:27Z

gensim/scripts/word2vec2tensor.py

@@ -0,0 +1,86 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#


Missing license (LGPL, like the rest of gensim). @tmylk @loretoparisi

piskvorky · 2016-12-26T03:59:31Z

gensim/scripts/word2vec2tensor.py

+    outfiletsv = tensor_filename + '_tensor.tsv'
+    outfiletsvmeta = tensor_filename + '_metadata.tsv'
+
+    with open(outfiletsv, 'w+') as file_vector:


Use smart_open instead.

piskvorky · 2016-12-26T03:59:43Z

gensim/scripts/word2vec2tensor.py

+    outfiletsvmeta = tensor_filename + '_metadata.tsv'
+
+    with open(outfiletsv, 'w+') as file_vector:
+        with open(outfiletsvmeta, 'w+') as file_metadata:


piskvorky · 2016-12-26T04:00:59Z

gensim/scripts/word2vec2tensor.py

+                vector_row = '\t'.join(map(str, model[word]))
+                file_vector.write(vector_row + '\n')
+
+    logger.info("2D tensor file saved to %s" % outfiletsv)


@tmylk little nitpick, but for the future, prefer logger.xyz("log %s", something), not logger.xyz("log %s" % something) (use lazy argument formatting).
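To illustrate the difference the comment above describes, here is a small sketch. With `%` the message string is built before the logging call, whether or not the record is emitted; with arguments passed separately, the standard `logging` module formats the message only if the record passes the level check. The `CountingPath` class and the logger name are hypothetical, introduced only to make the formatting observable.

```python
import logging


class CountingPath:
    """Hypothetical stand-in for a file name that counts how often it is formatted."""

    def __init__(self, path):
        self.path = path
        self.format_calls = 0

    def __str__(self):
        self.format_calls += 1
        return self.path


logger = logging.getLogger("word2vec2tensor_sketch")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.WARNING)  # INFO records are filtered out

eager = CountingPath("out_tensor.tsv")
lazy = CountingPath("out_tensor.tsv")

# Eager: the % operator builds the string before logger.info is called.
logger.info("2D tensor file saved to %s" % eager)
# Lazy: the message is formatted only if the record is actually emitted.
logger.info("2D tensor file saved to %s", lazy)
```

Here `eager` is formatted once even though nothing is logged, while `lazy` is never formatted.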

piskvorky · 2016-12-26T04:01:27Z

gensim/scripts/word2vec2tensor.py

+    parser.add_argument(
+        "-o", "--output", required=True,
+        help="Output tensor file name prefix")
+    parser.add_argument( "-b", "--binary", 


No vertical indent in gensim -- use normal hanging indent.

tmylk · 2016-12-28T12:58:48Z

FYI, an alternative way to achieve this goal with TensorFlow: https://gist.github.com/lampts/026a4d6400b1efac9a13a3296f16e655

added Word2vec to Tensorflow 2D tensor file

3215ea4

loretoparisi mentioned this pull request Dec 20, 2016

Gensim Word2Vec model Export in TSV for Google's Embedding Projector Visualizer #1035

Closed

tmylk suggested changes Dec 20, 2016

loretoparisi added 5 commits December 21, 2016 09:43

Update word2vec2tensor.py

23fac40

Update word2vec2tensor.py

d24406a

instructions how to load TSV files in projector

77a96c0

Update word2vec2tensor.py

103d0d6

Update word2vec2tensor.py

fa7dbad

parulsethi reviewed Dec 21, 2016

loretoparisi added 2 commits December 21, 2016 15:40

model.wv.index2word to generate tensor

0bf3b85

Update word2vec2tensor.py

96d3a58

loretoparisi commented Dec 21, 2016

loretoparisi added 3 commits December 21, 2016 21:09

Update word2vec2tensor.py

15ab0e5

updated docs for new command line arguments

6007c29

Update word2vec2tensor.py

cc9fb70

loretoparisi commented Dec 21, 2016

tmylk merged commit f4c04a7 into piskvorky:develop Dec 22, 2016

piskvorky reviewed Dec 26, 2016

MengZhang0904 Feb 11, 2019
