added Word2vec to Tensorflow 2D tensor file #1051

Merged
merged 11 commits into from
Dec 22, 2016

Conversation

loretoparisi
Contributor

This script is used to convert the word2vec format to Tensorflow 2D tensor and metadata formats for Embedding Projector Visualization
For more information about TensorBoard format see: https://www.tensorflow.org/versions/master/how_tos/embedding_viz/

Contributor

@tmylk tmylk left a comment

Requested minor changes like comments and logging.

#
# Copyright (C) 2016 Loreto Parisi <loretoparisi@gmail.com>
# Copyright (C) 2016 Silvio Ogliastri <silvio.olivastri@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

The script will create two TSV files: a 2D tensor file and a word embedding metadata file. Both files will
use the --output file name as prefix.
This script is used to convert the word2vec format to Tensorflow 2D tensor and metadata formats for Embedding Visualization
For more information about TensorBoard format see: https://www.tensorflow.org/versions/master/how_tos/embedding_viz/
Contributor

Please add instructions on how to see the viz in TensorBoard, e.g. "Launch `tensorboard --logdir=dir_with_tsv`".


logger = logging.getLogger(__name__)

'''
Contributor

Please move docstring to after def word2vec2tensor line

for word in model.index2word:
    file_metadata.write(word.encode('utf-8') + '\n')
    vector_row = '\t'.join(map(str, model[word]))
    file_vector.write(vector_row + '\n')
Contributor

Please log the location and name of the written files.

Contributor Author

Thanks, I have added further instructions and logging.

Contributor Author

Done thanks.


I got an error at line 54: `TypeError: can't concat str to bytes`.
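The error is consistent with the write loop quoted earlier in the review: under Python 3, `word.encode('utf-8')` returns `bytes`, which cannot be concatenated with the `str` `'\n'`. A minimal sketch of one possible fix, assuming the files are opened in text mode with an explicit encoding so that only `str` is written. The `write_tsv` helper and the plain `vectors` mapping are hypothetical stand-ins for the script's function and the loaded gensim model; they are not taken from the PR.

```python
import io


def write_tsv(vectors, prefix):
    """Write <prefix>_tensor.tsv and <prefix>_metadata.tsv.

    `vectors` is a hypothetical word -> list-of-floats mapping standing in
    for the loaded word2vec model.
    """
    tensor_path = prefix + '_tensor.tsv'
    metadata_path = prefix + '_metadata.tsv'
    # Text mode with an explicit encoding: we pass str and io.open encodes it,
    # so no bytes/str concatenation takes place.
    with io.open(tensor_path, 'w', encoding='utf-8') as file_vector, \
            io.open(metadata_path, 'w', encoding='utf-8') as file_metadata:
        for word, vector in vectors.items():
            file_metadata.write(word + u'\n')
            file_vector.write(u'\t'.join(str(x) for x in vector) + u'\n')
    return tensor_path, metadata_path
```

Calling `write_tsv({u'tea': [0.3, 0.4]}, 'demo')` would produce `demo_tensor.tsv` and `demo_metadata.tsv` in the current directory.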

@tmylk
Contributor

tmylk commented Dec 20, 2016

@anmol01gulati @parulsethi may I ask you to alpha test this PR?

@word2vec_model_path word2vec model
@tensor_filename tensor filename prefix
'''
model = gensim.models.Word2Vec.load_word2vec_format(word2vec_model_path, binary=True)
Contributor

Better to use the text format here, or at least make it optional.

Contributor

You can make the format an optional argument, for example:
parser.add_argument("-b", "--binary", required=False, help="If word2vec model in binary format, set True, else False")
and pass the argument to the word2vec2tensor function:
def word2vec2tensor(word2vec_model_path, tensor_filename, binary=False):
    model = gensim.models.Word2Vec.load_word2vec_format(word2vec_model_path, binary=binary)

keeping text format as the default.
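One caveat with the suggested argument: without a `type` or `action`, argparse passes the value through as a string, so `--binary False` yields the non-empty string `"False"`, which is truthy. A minimal sketch of an alternative, assuming a boolean flag declared with `action="store_true"`. This is a swapped-in variant for illustration, not necessarily what the PR merged, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser with the --output option and a boolean --binary flag."""
    parser = argparse.ArgumentParser(
        description="Convert word2vec vectors to TSV files (sketch)")
    parser.add_argument(
        "-o", "--output", required=True,
        help="Output tensor file name prefix")
    # store_true: the flag is False when absent and True when present,
    # so no string-to-bool conversion is needed.
    parser.add_argument(
        "-b", "--binary", action="store_true",
        help="Set if the word2vec model is in binary format (default: text)")
    return parser
```

With this variant, `build_parser().parse_args(["-o", "out"]).binary` is `False`, and adding `-b` makes it `True`; the value can then be passed straight through as the `binary` keyword of `word2vec2tensor`.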

Contributor Author

All done thanks.

Contributor

@parulsethi parulsethi Dec 21, 2016

Sorry, forgot this earlier: add a space before tensor_filename in def word2vec2tensor().


with open(outfiletsv, 'w+') as file_vector:
    with open(outfiletsvmeta, 'w+') as file_metadata:
        for word in model.index2word:
Contributor

use model.wv.index2word

Contributor Author

fixed.

Contributor

Did this change get overwritten?

Contributor Author

@loretoparisi loretoparisi left a comment

Refactoring of docstrings in python style, changed index2word api.

Contributor Author

@loretoparisi loretoparisi left a comment

Fixed docs, added optional binary model format argument

@loretoparisi
Contributor Author

Did all requested changes for logging, comments, optional arguments for binary mode.

@parulsethi
Contributor

@tmylk does it need the separate test file?

@tmylk
Contributor

tmylk commented Dec 22, 2016

@parulsethi The problem with testing this is that we can't load it into TensorBoard on Travis. So no tests needed.

@tmylk tmylk merged commit f4c04a7 into piskvorky:develop Dec 22, 2016
@tmylk
Contributor

tmylk commented Dec 22, 2016

@loretoparisi Thanks a lot for the PR! Visualisation is the top priority on our roadmap.

Owner

@piskvorky piskvorky left a comment

@tmylk @loretoparisi sorry, I only got to review now -- a few minor changes needed. Thanks for the cool new feature!

@@ -0,0 +1,86 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
Owner

Missing license (LGPL, like the rest of gensim). @tmylk @loretoparisi

outfiletsv = tensor_filename + '_tensor.tsv'
outfiletsvmeta = tensor_filename + '_metadata.tsv'

with open(outfiletsv, 'w+') as file_vector:
Owner

Use smart_open instead.

outfiletsvmeta = tensor_filename + '_metadata.tsv'

with open(outfiletsv, 'w+') as file_vector:
    with open(outfiletsvmeta, 'w+') as file_metadata:
Owner

Ditto.

vector_row = '\t'.join(map(str, model[word]))
file_vector.write(vector_row + '\n')

logger.info("2D tensor file saved to %s" % outfiletsv)
Owner

@tmylk little nitpick, but for the future, prefer logger.xyz("log %s", something), not logger.xyz("log %s" % something) (use lazy argument formatting).
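The difference between the two logging styles can be sketched as follows. The message text comes from the diff above; the value assigned to `outfiletsv` is a hypothetical placeholder for illustration.

```python
import logging

logger = logging.getLogger(__name__)

outfiletsv = "example_tensor.tsv"  # hypothetical value, for illustration only

# Eager: the % operator builds the final string before logger.info is called,
# even if INFO records end up being discarded.
logger.info("2D tensor file saved to %s" % outfiletsv)

# Lazy: the format string and arguments are passed separately; the logging
# module interpolates them only when the record is actually emitted.
logger.info("2D tensor file saved to %s", outfiletsv)
```

Both calls produce the same message when INFO is enabled; the lazy form skips the string formatting when it is not.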

parser.add_argument(
    "-o", "--output", required=True,
    help="Output tensor file name prefix")
parser.add_argument( "-b", "--binary",
Owner

@piskvorky piskvorky Dec 26, 2016

No vertical indent in gensim -- use normal hanging indent.
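Reading "vertical indent" here as aligning continuation lines under the opening parenthesis (an interpretation of the comment, not something the thread spells out), the two styles compare roughly as follows. The argument names and help strings are taken from the diff and the earlier review suggestion; the layout is only a sketch.

```python
import argparse

parser = argparse.ArgumentParser()

# Vertical (visual) alignment: continuation lines line up under the first
# argument after the opening parenthesis.
parser.add_argument("-o", "--output", required=True,
                    help="Output tensor file name prefix")

# Hanging indent: break right after the opening parenthesis and indent the
# continuation lines by one level.
parser.add_argument(
    "-b", "--binary", required=False,
    help="If word2vec model in binary format, set True, else False")
```

Both calls behave identically at runtime; only the source layout differs.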

@tmylk
Contributor

tmylk commented Dec 28, 2016

FYI An alternative way to achieve this goal with tensorflow https://gist.github.com/lampts/026a4d6400b1efac9a13a3296f16e655


5 participants