
New feature: wordrank wrapper #1066

Merged · 20 commits · Jan 23, 2017

Conversation

@parulsethi (Contributor) commented Dec 30, 2016:

This PR adds a Python wrapper for Wordrank.
utils.check_output is modified here to accept an open file as stdout, which is required in the wrapper's train method.
Todo:

  • tutorial for using the wrapper
  • blog post and code comparing it with word2vec and fasttext
  • run the dtm/mallet wrapper tests to make sure check_output is not broken

@@ -118,7 +118,7 @@ def readfile(fname):

python_2_6_backports = ''
if sys.version_info[:2] < (2, 7):
-python_2_6_backports = ['argparse', 'subprocess32']
+python_2_6_backports = ['argparse', 'subprocess32', 'backport_collections']
Contributor:

Should subprocess32 be removed now?

Contributor Author:

yes

@@ -1154,7 +1154,7 @@ def check_output(*popenargs, **kwargs):
Added extra KeyboardInterrupt handling
"""
try:
process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
Contributor:

The stdout default has to be specified for the other wrappers to work.

Contributor:

Can you confirm that the ldamallet wrapper works even without the default stdout specified? What prevents you from keeping it as the default?

Contributor Author:

The ldamallet wrapper test passes without the default stdout.

@tmylk (Contributor) commented Jan 10, 2017:

Please add the new class to gensim/docs/src/apiref.rst and create an RST file as in #961.

@tmylk (Contributor) left a comment:

Minor changes

@@ -0,0 +1,286 @@
{
"cells": [
Contributor:

let's call it "WordRank_wrapper_quickstart.ipynb"

],
"source": [
"word_similarity_file = 'datasets/ws-353.txt'\n",
"model.wv.evaluate_word_pairs(word_similarity_file)"
Contributor:

Merge in the latest develop to get the correct output of this cell.

"source": [
"# Comparison of WordRank, Word2Vec and FastText\n",
"\n",
"Wordrank is a fresh new approach to word embeddings, which formulates the task as a ranking problem. That is, given a word w, it aims to output an ordered list (c1, c2, · · ·) of context words such that words that co-occur with w appear at the top of the list. This formulation fits naturally with popular word embedding tasks such as word similarity/analogy, since instead of the likelihood of each word, we are interested in finding the most relevant words\n",
Contributor:

Link to Wordrank from here; use the same link as in the references.

"metadata": {},
"source": [
"# Comparison of WordRank, Word2Vec and FastText\n",
"\n",
Contributor:

Add a link to your blog; say "this ipynb accompanies a more theoretical blog post [link]".

Contributor Author:

Adding a dummy URL; will change it if the final URL changes.

>>> print model[word] # prints vector for given words

.. [1] https://bitbucket.org/shihaoji/wordrank/
.. [2] https://arxiv.org/pdf/1506.02761v3.pdf
Contributor:

Please add

# Copyright (C) 2017 Parul Sethi <email>
# Copyright (C) 2017 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

copyfile(corpus_file, os.path.join(meta_dir, corpus_file.split('/')[-1]))
os.chdir(meta_dir)

cmd0 = ['../../glove/vocab_count', '-min-count', str(min_count), '-max-vocab', str(max_vocab_size)]
Contributor:

Please give meaningful names to variables, like 'cmd_vocab_count'.

cmd3 = ['cut', '-d', " ", '-f', '1', temp_vocab_file]
cmds = [cmd0, cmd1, cmd2, cmd3]
logger.info("Preparing training data using glove code '%s'", cmds)
o0 = smart_open(temp_vocab_file, 'w')
Contributor:

Meaningful names here too, please.

Owner:

It's safer to open files in binary mode (wb), and explicitly encode all strings written there.

Has this been tested on unicode (non-ASCII) data?

Contributor Author (@parulsethi), Jan 12, 2017:

Tried it on Hindi characters; it works.
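The binary-mode point above can be sketched minimally (the file path and the sample Devanagari string are hypothetical): open the file as wb, encode explicitly on write, and decode explicitly on read, so non-ASCII data round-trips intact.

```python
import os
import tempfile

# hypothetical sample: Hindi (Devanagari) text mixed with ASCII
text = u"नमस्ते world"
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")

# write in binary mode ('wb') and encode explicitly, as suggested above
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# read back in binary mode and decode explicitly
with open(path, "rb") as f:
    roundtrip = f.read().decode("utf-8")
```

This sidesteps the platform default encoding; on Python 2 in particular, writing a unicode string with non-ASCII characters to a text-mode file without encoding raises UnicodeEncodeError.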

self.wv.vocab[word].count = counts[word]

def ensemble_embedding(self, word_embedding, context_embedding):
"""Addition of two embeddings."""
Contributor:

Better docstring: "Replace syn0 with the sum of context and word embeddings."

glove2word2vec(context_embedding, context_embedding+'.w2vformat')
w_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % word_embedding)
c_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % context_embedding)
assert Counter(w_emb.wv.index2word) == Counter(c_emb.wv.index2word), 'Vocabs are not same for both embeddings'
Contributor:

Just a vocab comparison would do; no need for Counter.
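The suggestion can be sketched with toy data (the two dicts below are hypothetical stand-ins for the word and context models loaded via load_word2vec_format):

```python
# hypothetical toy embeddings standing in for the loaded word and
# context models
word_emb = {"cat": [1.0, 2.0], "dog": [0.5, 0.5]}
context_emb = {"cat": [0.0, 1.0], "dog": [1.0, 0.0]}

# a plain set comparison of the vocabularies suffices; Counter adds
# nothing here because index2word contains no duplicate words
assert set(word_emb) == set(context_emb), 'Vocabs are not same for both embeddings'

# the ensemble embedding is the element-wise sum of the two vectors
ensemble = {w: [a + b for a, b in zip(word_emb[w], context_emb[w])]
            for w in word_emb}
```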

outputs = [o0, o1, o2, o3]
inputs = [i0, i1, i2, i3]
prepare_train_data = [utils.check_output(cmd, stdin=inp, stdout=out) for cmd, inp, out in zip(cmds, inputs, outputs)]
o0.close()
Owner (@piskvorky), Jan 12, 2017:

Best practice is to use context managers for opening files for writing (with smart_open() as input0: ...).

This whole code section would read better if rewritten as a loop (for command, input_fname, output_fname in zip(commands, input_fnames, output_fnames): with smart_open(...): ...).
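As a sketch of that loop (plain open stands in for smart_open, and the sort/uniq commands are placeholders for the glove preprocessing binaries; assumes a POSIX environment):

```python
import os
import subprocess
import tempfile

workdir = tempfile.mkdtemp()
os.chdir(workdir)
with open("corpus.txt", "w") as f:
    f.write("b\na\nb\n")

# placeholder pipeline: each step reads one file and writes the next,
# mirroring the vocab_count/cooccur/shuffle/cut sequence in the wrapper
commands = [["sort"], ["uniq"]]
input_fnames = ["corpus.txt", "sorted.txt"]
output_fnames = ["sorted.txt", "vocab.txt"]

for command, input_fname, output_fname in zip(commands, input_fnames, output_fnames):
    # context managers guarantee the files are closed even if a step fails
    with open(input_fname, "rb") as inp, open(output_fname, "wb") as out:
        subprocess.check_call(command, stdin=inp, stdout=out)

with open("vocab.txt") as f:
    vocab = f.read()
```

Compared to keeping o0…o3/i0…i3 open manually, the loop form also removes the need for the trailing close() calls.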

glove2word2vec(context_embedding, context_embedding+'.w2vformat')
w_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % word_embedding)
c_emb = Word2Vec.load_word2vec_format('%s.w2vformat' % context_embedding)
-assert Counter(w_emb.wv.index2word) == Counter(c_emb.wv.index2word), 'Vocabs are not same for both embeddings'
+assert set(w_emb.wv.index2word) == set(c_emb.wv.index2word), 'Vocabs are not same for both embeddings'
Contributor:

Is it possible to compare wv.vocab?

Contributor Author:

Oh yes, similarly using set(wv.vocab). Correcting it in the next commit.

@@ -0,0 +1,353 @@
love sex 6.77
Contributor:

This dataset is already in gensim at https://github.com/parulsethi/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv. Please use it from there.

Contributor Author:

done

@@ -0,0 +1,999 @@
old new 1.58
Contributor:

Please move it to test/test_data so it can be used in other code, similar to https://github.com/parulsethi/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv.

Contributor Author:

done

"# WordRank wrapper tutorial on Lee Corpus\n",
"\n",
"WordRank is a new word embedding algorithm which captures the semantic similarities in text data well. See this [notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Wordrank_comparisons.ipynb) for its comparisons to other popular embedding models. This tutorial will serve as a guide to using the WordRank wrapper in gensim. You need to install [WordRank](https://bitbucket.org/shihaoji/wordrank) before proceeding with this tutorial.\n",
"\n",

Contributor Author:

fixed

@@ -234,7 +234,7 @@ def show_topics(self, num_topics=10, num_words=10, log=False, formatted=True):
if formatted:
topic = self.print_topic(i, topn=num_words)
else:
-                topic = self.show_topic(i, topn=num_words)
+                topic = self.show_topic(i, num_words=num_words)
Contributor:

This shouldn't be in this PR.

@tmylk (Contributor), Jan 15, 2017:
Contributor:

Agree that num_words is correct

Contributor Author (@parulsethi), Jan 16, 2017:

(Just to mention it in the review thread:) show_topic() doesn't have a topn keyword argument but num_words; this fixes it.

@@ -1146,15 +1146,16 @@ def keep_vocab_item(word, count, min_count, trim_rule=None):
else:
return default_res

-def check_output(*popenargs, **kwargs):
+def check_output(*popenargs, stdout=subprocess.PIPE, **kwargs):
Contributor Author:

This keeps the previous default stdout=subprocess.PIPE for the other wrappers, while a different stdout can be passed for the wordrank wrapper.

Contributor Author:

I'm a bit doubtful about a workaround for this; it gave a syntax error for Python 2 in the check above.
If I specify stdout=subprocess.PIPE as the first argument to make it Python 2 compatible, it won't work for the other wrappers, as they pass cmd (the basic command) as their first positional argument.

@@ -103,7 +103,7 @@ def train(cls, wr_path, corpus_file, out_path, size=100, window=15, symmetric=1,
for command, input_fname, output_fname in zip(commands, input_fnames, output_fnames):
with smart_open(input_fname, 'rb') as r:
with smart_open(output_fname, 'wb') as w:
-                utils.check_output(command, stdin=r, stdout=w)
+                utils.check_output(command, stdout=w, stdin=r)
Contributor Author:

Changed as per the current utils.check_output update in this PR.

@tmylk tmylk changed the title added wordrank wrapper New feature: wordrank wrapper Jan 22, 2017
@@ -1146,15 +1146,15 @@ def keep_vocab_item(word, count, min_count, trim_rule=None):
else:
return default_res

-def check_output(*popenargs, stdout=subprocess.PIPE, **kwargs):
+def check_output(stdout=subprocess.PIPE, *popenargs, **kwargs):
Contributor Author:

This keeps the default stdout=subprocess.PIPE, and the check_output calls in all wrappers are updated to pass cmd as a keyword argument rather than a positional one, to work with this change.
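A minimal sketch of the resulting signature (the body abbreviates gensim's utils.check_output; error handling and the KeyboardInterrupt cleanup are omitted):

```python
import subprocess

def check_output(stdout=subprocess.PIPE, *popenargs, **kwargs):
    # stdout sits before *popenargs because a keyword-only argument
    # (def f(*args, stdout=...)) is a syntax error on Python 2; the
    # price is that callers must pass the command as a keyword argument
    process = subprocess.Popen(stdout=stdout, *popenargs, **kwargs)
    output, _ = process.communicate()
    retcode = process.poll()
    if retcode:
        raise subprocess.CalledProcessError(retcode, kwargs.get("args"))
    return output

# command passed by keyword; the default stdout=subprocess.PIPE captures it
captured = check_output(args=["echo", "hello"])
```

With this shape, check_output(args=cmd) keeps the old capturing behavior for the other wrappers, while the wordrank wrapper can pass an open file as check_output(stdout=w, args=cmd).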
