
[WIP] Potential Word Movers Distance performance improvement: WCD and RWMD #800

Closed
wants to merge 13 commits

Conversation

RishabGoel
Contributor

The class calculates the K nearest documents based on the heuristics described in Section 4 of the paper.

@RishabGoel
Contributor Author

Hi,
I have added the RWMD metric required for the Fast KNN algorithm.

@tmylk
Contributor

tmylk commented Jul 25, 2016

Thanks.
Please add tests and add the code that actually uses WMD.

@piskvorky
Owner

How does this compare to the WMD stuff that @olavurmortensen did?

@RishabGoel
Contributor Author

RishabGoel commented Jul 26, 2016

Hi,
I'm not sure I understood the question, but I will try to explain what I understand:

  • @olavurmortensen implemented the WMD measure as per PRs [WIP] Word Movers Distance #482, #521, #619 and #659. So, if I were to get, say, the 20 nearest neighbors of a test document out of a set of thousands of documents based on WMD, I would need to do O(number of documents) WMD computations and then sort the results. This is computationally expensive, as WMD involves a minimization with respect to two constraints (as mentioned in Section 4 of the paper linked in the next point).
  • According to the heuristic described in the paper under the "prefetch and prune" heading, one can avoid much of this computation and instead use WCD (Word Centroid Distance) and RWMD (Relaxed Word Mover's Distance) to indicate the documents for which we should compute the WMD. This cuts the computation considerably, since WCD is cheap and RWMD involves a minimization with respect to only one constraint (see the WCD sketch after this list). So it is related to the WMD work in the sense that it helps find the candidate documents for which WMD should actually be calculated.
  • Also, @olavurmortensen implemented WCD and RWMD in PR [WIP] Implement Word Mover's Distance (WMD) in Gensim #521, but they were apparently removed afterwards. I am developing test cases that compare the performance of my implementation with the existing ones.
  • I am also thinking of test cases to check the quality of the calculated distances. I will probably compare my results with vene's results.
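
For concreteness, here is a minimal sketch of WCD under the paper's definitions (illustrative names, not from this PR; embeddings is the word vector matrix, d1 and d2 are the documents' normalized bag-of-words vectors):

import numpy as np

def wcd(embeddings, d1, d2):
    # Word Centroid Distance: Euclidean distance between the
    # embedding-weighted centroids of two nBOW document vectors.
    # embeddings: (vocab_size, dim) word vector matrix
    # d1, d2: normalized bag-of-words vectors, shape (vocab_size,)
    return np.linalg.norm(embeddings.T.dot(d1) - embeddings.T.dot(d2))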

Please comment if I answered your question.
Thanks

@piskvorky
Owner

Thanks @RishabGoel !

That was a question mostly for @tmylk . If we are to merge this PR, we'll have to integrate it with what we already have (and also drop the sklearn dependencies).

@RishabGoel
Contributor Author

RishabGoel commented Aug 1, 2016

Hi,
I have added the basic functionality required. Now I will be working on checking its accuracy, taking time measurements, and writing other test cases.
One thing I have observed is that this metric is only as good as the Word2Vec vectors are at representing the proper context of a word in the given data.
That is good food for thought.

@tmylk tmylk changed the title Approximate KNN Potential Word Movers Distance performance improvement: WCD and RWMD Aug 5, 2016
doc : Normalised BOW representation of stored doc.
doc_id : id of the stored doc
"""
return (doc_id, distance.euclidean(np.dot(np.transpose(self.word_embedding), np.transpose(test_doc)), np.dot(np.transpose(self.word_embedding), doc)))
Contributor

why do you transpose test_doc and not doc?

Contributor Author

This is because doc is an array containing the stored documents, and doc_id is the index of a document in that array. I should have named doc docs_arr; I will correct it in the next commit.
So, continuing: when I access doc[doc_id] I get an array of shape (1, x), while test_doc, being a 1-D array, has shape (x,). The dimensions of the word embedding are (x, 1).
So, to make the shapes compatible, the transpose of test_doc is taken, making it (1, x).

@tmylk
Contributor

tmylk commented Aug 5, 2016

@RishabGoel Let's go over the performance (time) comparison in our next meeting.

@RishabGoel
Contributor Author

Sure, I will do that testing. I have accepted the invitation; let's talk about it in our next meeting.

@RishabGoel
Contributor Author

I have updated the documentation of the prune function. Is it understandable now?

@RishabGoel
Contributor Author

@mkusner I am implementing your paper and I am facing some issues. Kindly check your email; I have mailed them to you in detail. I would be extremely grateful if you could answer those queries!

@piskvorky
Owner

piskvorky commented Sep 6, 2016

@RishabGoel @mkusner I'd suggest discussing here rather than privately via email. That way, other people can chime in, help out, and there's a track record of what changes were made and why and how.

@RishabGoel
Contributor Author

@piskvorky Sure thing. I am pasting the contents of the mail below:

  • I am implementing the RWMD and WCD distances in Python, unlike your Matlab implementation. The time for RWMD is coming out higher than for WMD, by as much as twice the magnitude. This might be OK, but I wanted to check whether this was the case for you as well. Since RWMD is a fairly relaxed problem, it shouldn't take this much time, even if WMD is calculated in C.
  • Is the following graph (distance vs. training input indices) obtained for all the vectors in the test Twitter data, or only in some cases? I am getting a very different graph for a particular Twitter review.
    [figure: WCD, RWMD and WMD distances plotted against training input indices]

@mkusner

mkusner commented Sep 6, 2016

Thanks Rishab for your interest in the project! To your questions:

  • It's a bit unexpected that WMD is faster than RWMD, because RWMD just requires two nearest-neighbor searches, as opposed to solving a linear program (see the sketch after this list). It's actually a strictly less-constrained optimization problem. RWMD was still faster than WMD for us even though the linear program was implemented in C and the distance computation in Matlab. Figure 7 in the paper shows this comparison (see the pink bars (RWMD) versus the dark red bars (WMD): http://mkusner.github.io/publications/WMD.pdf). Maybe the distance computation in Python can be sped up?
  • In this graph we randomly sampled a single test point and computed all distances between it and the training points. We then sorted the distances in order of increasing WMD distance, and plotted WCD and RWMD. So the graph could look a bit different for a different test point. But what you should see is that RWMD matches WMD very closely, while WCD is a looser approximation.
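
To make the two nearest-neighbor relaxations concrete, here is a minimal sketch (illustrative names, not from the paper's code; dist_matrix holds the pairwise word distances between the two documents, w1 and w2 their nBOW weights):

import numpy as np

def rwmd(dist_matrix, w1, w2):
    # Relaxed WMD: drop one of WMD's two flow constraints at a time.
    # Each relaxation sends each word's weight to the nearest word of the
    # other document; the max of the two costs is a lower bound on WMD.
    # dist_matrix: (n1, n2) distances between doc1 words and doc2 words
    # w1: (n1,) nBOW weights of doc1; w2: (n2,) nBOW weights of doc2
    cost1 = np.sum(w1 * dist_matrix.min(axis=1))  # doc1 -> nearest doc2 word
    cost2 = np.sum(w2 * dist_matrix.min(axis=0))  # doc2 -> nearest doc1 word
    return max(cost1, cost2)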

@RishabGoel
Contributor Author

@dselivanov and @tmylk, thanks for the suggestions and links.
I will look into them.
I am also done with the Matlab vs. Python test of RWMD. Will post the results soon.

@RishabGoel
Contributor Author

The code looks similar to @mkusner's implementation, except that there is an alternate distance metric, i.e. cosine distance. Will add this feature as well...

@tmylk tmylk changed the title Potential Word Movers Distance performance improvement: WCD and RWMD [WIP] Potential Word Movers Distance performance improvement: WCD and RWMD Oct 4, 2016
@tmylk tmylk added difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature labels Oct 4, 2016
@marco-c

marco-c commented Nov 14, 2016

In my case, using RWMD+WMD (I'm not using WCD for now) is much faster than just WMD. My implementation is in Python and not so different from @RishabGoel's (except that I'm using cosine distance and maybe doing a few fewer operations).

Here's what I currently have, perhaps it can be useful:

import bisect

import numpy as np

# Keep only the query words that are in the word2vec vocabulary
# (assumed definition; words_to_test_clean was implied in the original snippet).
words_to_test_clean = [word for word in words_to_test if word in model]

# Cosine distances between every word in the vocabulary and every query word
# (the vectors in syn0norm are L2-normalized, so 1 - dot product is the cosine distance).
all_distances = 1 - np.dot(model.syn0norm, model.syn0norm[[model.vocab[word].index for word in words_to_test_clean]].transpose())

distances = []
for doc_id in range(0, len(corpus)):
    doc_words = [model.vocab[word].index for word in corpus[doc_id].words if word in model]
    if len(doc_words) != 0:
        word_dists = all_distances[doc_words]
        # RWMD lower bound: the tighter (larger) of the two one-constraint relaxations.
        rwmd = max(np.sum(np.min(word_dists, axis=0)), np.sum(np.min(word_dists, axis=1)))
    else:
        rwmd = float('inf')
    distances.append((doc_id, rwmd))

distances.sort(key=lambda v: v[1])

confirmed_distances_ids = []
confirmed_distances = []

for i, (doc_id, rwmd_distance) in enumerate(distances):
    # Stop once we have 'top' confirmed distances and all the remaining RWMD
    # lower bounds exceed the largest of the 'top' smallest confirmed distances.
    if len(confirmed_distances) >= top and rwmd_distance > confirmed_distances[top - 1]:
        break

    # TODO: directly use pyemd, so we don't recalculate distances we have already calculated for RWMD.
    wmd = model.wmdistance(words_to_test, corpus[doc_id].words)
    # Insert while keeping confirmed_distances sorted by distance.
    j = bisect.bisect(confirmed_distances, wmd)
    confirmed_distances.insert(j, wmd)
    confirmed_distances_ids.insert(j, doc_id)

similarities = zip(confirmed_distances_ids, confirmed_distances)

@RishabGoel
Contributor Author

RishabGoel commented Nov 15, 2016

Hi @marco-c,
What about performance?
Also, is the runtime of RWMD smaller than that of WMD?
If that is the case, then we can replace the Euclidean distance with cosine.

@tmylk
Contributor

tmylk commented Nov 15, 2016

@marco-c Thanks for looking into this. WMD really needs a speed-up to become useful. Do you have some performance metrics comparing WMD vs. RWMD+WMD?

@marco-c

marco-c commented Nov 15, 2016

If I don't compute all the distances between words beforehand (all_distances = 1 - ...) but in the loop, RWMD+WMD is ~5x faster than WMD.
If I compute them beforehand, it's ~60x faster than WMD.
@mkusner's and text2vec's code calculates the distances one at a time (I guess because they use WCD too).
These are approximate figures, as I haven't really measured the performance in a systematic way.

Clearly, the difference might also depend on the corpus (in my case, I currently have a ~30000-word vocabulary and documents containing ~10-20 words). Maybe with a different corpus RWMD wouldn't be such a good lower bound.

I'm pretty sure the code can still be optimized (e.g. by using Cython and/or multiple threads/processes).

@RishabGoel
Contributor Author

@marco-c Thanks for the update. You are right that the goodness of RWMD depends on the corpus, so please use one of the corpora in @mkusner's repo. Also, the WMD you are using for comparison with WMD+RWMD: is it the cosine-distance one or the one in gensim?

@marco-c

marco-c commented Nov 15, 2016

You are right that the goodness of RWMD depends on the corpus, so please use one of the corpora in @mkusner's repo.

Not sure when I can get to it, maybe this weekend. BTW, the code is pretty much self-contained, so you can also test it yourself if you want (model is the model, corpus is a list of TaggedDocuments, words_to_test is the array of words of the document you're considering).

Also, the WMD you are using for comparison with WMD+RWMD: is it the cosine-distance one or the one in gensim?

It's the one in gensim (yes, I'm using cosine distance for RWMD and Euclidean for WMD; I know I have to fix it, but I was just trying to see if I could get it fast enough to be usable).

@dselivanov

In my opinion, computing pairwise distances between words in advance is not an option for any non-toy corpus.

@RishabGoel
Contributor Author

@dselivanov Yes, perhaps because we might not need them all and they might take up a lot of memory. I will add a flag to decide whether to calculate the distances beforehand.

@marco-c

marco-c commented Nov 15, 2016

In my opinion, computing pairwise distances between words in advance is not an option for any non-toy corpus.

It certainly depends on the corpus. In my case I currently have ~30000 words in the vocabulary, and the matrix all_distances takes ~900 kB.
If you have a much larger vocabulary that makes it impossible to calculate the entire matrix at once, you could also do it in batches (e.g. the distances with the words of the first n documents), which would still be faster than calculating the distances one at a time for each document; a sketch of this batching follows below.
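
A minimal sketch of that batching idea (illustrative, not from this PR; assumes model.syn0norm as above and a list of tokenized documents, processing the words of n documents per batch):

import numpy as np

def distance_batches(model, documents, n=100):
    # Yield (word_indices, distance_block) pairs: cosine distances between
    # the whole vocabulary and the words of n documents at a time,
    # bounding peak memory instead of building one giant matrix.
    for start in range(0, len(documents), n):
        batch_words = set()
        for doc in documents[start:start + n]:
            batch_words.update(word for word in doc if word in model)
        indices = [model.vocab[word].index for word in batch_words]
        # (vocab_size, len(indices)) block of cosine distances
        yield indices, 1 - np.dot(model.syn0norm, model.syn0norm[indices].T)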

@dselivanov Yes, perhaps because we might not need them all and they might take up a lot of memory. I will add a flag to decide whether to calculate the distances beforehand.

Actually, I think you always need them. I'm not calculating the distances between all the words in the corpus and themselves; I'm calculating the distances between all the words in the corpus and the words of the document I'm currently considering:

all_distances = 1 - np.dot(model.syn0norm, model.syn0norm[[model.vocab[word].index for word in document if word in model]].transpose())

and not:

all_distances = 1 - np.dot(model.syn0norm, model.syn0norm.transpose())

In theory, if you implemented the second, you would have a single matrix shared by all queries, so it might be even faster (but yes, its feasibility depends on the size of your vocabulary).

@dselivanov

@marco-c ah, I got it. Good point!

@olavurmortensen
Contributor

Hey @RishabGoel, I just have a few comments on this PR.

On the topic of pre-computing word distances, you could use memoization. Lev (@tmylk) suggested this to me back when I was working on WMD. Simply store the distance in a dictionary the first time it is computed.
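
A minimal sketch of that memoization idea (illustrative helper, not from this PR; assumes the old-style API where model[word] returns the word's vector):

import numpy as np

_distance_cache = {}

def word_distance(model, w1, w2):
    # Euclidean word distance, computed once per pair and then reused.
    key = (w1, w2) if w1 <= w2 else (w2, w1)  # the distance is symmetric
    if key not in _distance_cache:
        diff = model[w1] - model[w2]
        _distance_cache[key] = float(np.sqrt(np.dot(diff, diff)))
    return _distance_cache[key]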

I don't think the name "FastKNN" is fitting. While this is technically KNN, it is specialized to a specific distance metric. Maybe "FastWMD" would be better.

It has been mentioned in this conversation that the amount of speed-up gained from using RWMD and WCD depends on the corpus. Have you tried applying it to some different corpora, to see if there is a difference in speed-up? If there is a difference in speed-up, then what seems to be the cause? For example, plot WMD, RWMD and WCD of all documents in sorted order, and compare the plots for each of the corpora.

I hope you succeed in scaling up this algorithm; it would be very nice to see :)

@menshikh-iv
Contributor

Ping @RishabGoel, what is the status of this PR? Will you finish it soon?

@menshikh-iv
Contributor

I am closing this PR because it is abandoned.

@HobbsB

HobbsB commented Jan 11, 2018

Hey @RishabGoel ,

Mind if I ask why this PR was abandoned?

I'm working on a project which needs this and I want to pick it up, or at least move it forward. I'm just starting to look at it, but it looks nearly finished, minus a few optimizations.

Am I missing something? What else needs to be done? Were there issues not mentioned here?

Any help would be greatly appreciated.

@krinkere

krinkere commented Mar 8, 2018

Same here... I am using a very large corpus and WMDSimilarity is virtually unusable.

@krinkere

krinkere commented Mar 8, 2018

@marco-c Do you mind sharing the complete code that you have?

@marco-c

marco-c commented Mar 8, 2018

There's some code at #800 (comment), which is almost self-contained (see my comment at #800 (comment)).

The full source code is in this repository https://github.com/marco-c/crashsimilarity, in particular at https://github.com/marco-c/crashsimilarity/blob/2e4e4a0b67cf2dfe36a8cad6be147df7bd4bb5de/crashsimilarity/models/base.py#L76. I think it's easier to hack on the example I posted in this PR than to look at the code in that repo.

@menshikh-iv
Contributor

@krinkere @HobbsB try soft-cosine similarity: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb (this is significantly faster than WMD and works well).
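
For anyone trying that route, a minimal sketch along the lines of the linked tutorial (assuming the gensim 3.x-era API; documents and query_tokens are illustrative names for lists of tokens):

from gensim.corpora import Dictionary
from gensim.similarities import SoftCosineSimilarity

# documents: list of tokenized texts; model: a trained Word2Vec model.
dictionary = Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Term similarity matrix derived from the word embeddings.
similarity_matrix = model.wv.similarity_matrix(dictionary)

index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
sims = index[dictionary.doc2bow(query_tokens)]  # top-10 (doc_id, similarity) pairs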

@menshikh-iv
Contributor

One more thing @krinkere @marco-c @HobbsB: if anyone is interested in this PR, you can continue/finish this one ("fork" it or simply copy-paste it to your new branch).

@krinkere

krinkere commented Mar 9, 2018

Thank you @menshikh-iv and @marco-c
I will give both a try!
