
[WIP] Potential Word Movers Distance performance improvement: WCD and RWMD #800

Closed
wants to merge 13 commits

Conversation

RishabGoel
Contributor

The class calculates the K nearest documents based on the heuristics described in Section 4 of the paper.

@RishabGoel
Contributor Author

Hi,
I have added the RWMD metric required for the Fast KNN algorithm.

@tmylk
Contributor

tmylk commented Jul 25, 2016

Thanks.
Please add tests and add the code that actually uses WMD.

@piskvorky
Owner

How does this compare to the WMD stuff that @olavurmortensen did?

@RishabGoel
Contributor Author

RishabGoel commented Jul 26, 2016

Hi,
I'm not sure I understood the question, but I will try to explain what I understand:

  • @olavurmortensen implemented the WMD measure as per PRs [WIP] Word Movers Distance #482, #521, #619 and #659. So, if I were to get, say, the 20 nearest neighbors of a test document out of a set of thousands of documents based on WMD, I would need to do O(number of documents) WMD computations and then sort the results. This is computationally expensive, as WMD involves a minimization with respect to two constraints (as mentioned in Section 4 of the paper linked in the next point).
  • According to the heuristic described in the paper under the "prefetch and prune" heading, one can avoid much of this computation and instead use WCD (Word Centroid Distance) and RWMD (Relaxed Word Mover's Distance) to indicate the documents for which we should compute the WMD. This cuts the computation considerably, since WCD is cheap and RWMD involves a minimization with respect to only one constraint (see the WCD sketch after this list). So it is related to the WMD work in the sense that it helps find the candidate documents for which WMD should actually be calculated.
  • Also, @olavurmortensen implemented WCD and RWMD in PR [WIP] Implement Word Mover's Distance (WMD) in Gensim #521, but they were apparently removed afterwards. I am developing test cases that compare the performance of my implementation with the existing ones.
  • I am also thinking of test cases to check the quality of the calculated distances. I will probably compare my results with vene's results.
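
For concreteness, here is a minimal sketch of WCD under the paper's definitions (illustrative names, not from this PR; embeddings is the word vector matrix, d1 and d2 are the documents' normalized bag-of-words vectors):

import numpy as np

def wcd(embeddings, d1, d2):
    # Word Centroid Distance: Euclidean distance between the
    # embedding-weighted centroids of two nBOW document vectors.
    # embeddings: (vocab_size, dim) word vector matrix
    # d1, d2: normalized bag-of-words vectors, shape (vocab_size,)
    return np.linalg.norm(embeddings.T.dot(d1) - embeddings.T.dot(d2))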

Please comment if I answered your question.
Thanks

@piskvorky
Owner

Thanks @RishabGoel !

That was a question mostly for @tmylk . If we are to merge this PR, we'll have to integrate it with what we already have (and also drop the sklearn dependencies).

@RishabGoel
Contributor Author

RishabGoel commented Aug 1, 2016

Hi,
I have added the basic functionality required. Now I will be working on checking its accuracy, taking time measurements, and writing other test cases.
One thing I have observed is that this metric is only as good as the Word2Vec vectors are at representing the proper context of a word in the given data.
That is good food for thought.

@tmylk tmylk changed the title Approximate KNN Potential Word Movers Distance performance improvement: WCD and RWMD Aug 5, 2016
doc : Normalised BOW representation of stored doc.
doc_id : id of the stored doc
"""
return (doc_id, distance.euclidean(np.dot(np.transpose(self.word_embedding), np.transpose(test_doc)), np.dot(np.transpose(self.word_embedding), doc)))
Contributor

why do you transpose test_doc and not doc?

Contributor Author

This is because doc is an array containing the stored documents, and doc_id is the index of a document in that array. I should have named doc docs_arr; I will correct it in the next commit.
So, continuing: when I access doc[doc_id] I get an array of shape (1, x), while test_doc, being a 1-D array, has shape (x,). The dimensions of the word embedding are (x, 1).
So, to make the shapes compatible, the transpose of test_doc is taken, making it (1, x).

@tmylk
Contributor

tmylk commented Aug 5, 2016

@RishabGoel Let's go over the performance (time) comparison in our next meeting.

@RishabGoel
Contributor Author

Sure, I will do that testing. I have accepted the invitation; let's talk about it in our next meeting.

@RishabGoel
Contributor Author

I have updated the documentation of the prune function. Is it understandable now?

@RishabGoel
Contributor Author

@mkusner I am implementing your paper and I am facing some issues. Kindly check your email; I have mailed them to you in detail. I would be extremely grateful if you could answer those queries!

@piskvorky
Owner

piskvorky commented Sep 6, 2016

@RishabGoel @mkusner I'd suggest discussing here rather than privately via email. That way, other people can chime in, help out, and there's a track record of what changes were made and why and how.

@RishabGoel
Contributor Author

@piskvorky Sure thing. I am pasting the contents of the mail below:

  • I am implementing the RWMD and WCD distances in Python, unlike your Matlab implementation. The time for RWMD is coming out higher than for WMD, by as much as twice the magnitude. This might be OK, but I wanted to check whether this was the case for you as well. Since RWMD is a fairly relaxed problem, it shouldn't take this much time, even if WMD is calculated in C.
  • Is the following graph (distance vs. training input indices) obtained for all the vectors in the test Twitter data, or only in some cases? I am getting a very different graph for a particular Twitter review.
    [figure: WCD, RWMD and WMD distances plotted against training input indices]

@mkusner

mkusner commented Sep 6, 2016

Thanks Rishab for your interest in the project! To your questions:

  • It's a bit unexpected that WMD is faster than RWMD, because RWMD just requires two nearest-neighbor searches, as opposed to solving a linear program (see the sketch after this list). It's actually a strictly less-constrained optimization problem. RWMD was still faster than WMD for us even though the linear program was implemented in C and the distance computation in Matlab. Figure 7 in the paper shows this comparison (see the pink bars (RWMD) versus the dark red bars (WMD): http://mkusner.github.io/publications/WMD.pdf). Maybe the distance computation in Python can be sped up?
  • In this graph we randomly sampled a single test point and computed all distances between it and the training points. We then sorted the distances in order of increasing WMD distance, and plotted WCD and RWMD. So the graph could look a bit different for a different test point. But what you should see is that RWMD matches WMD very closely, while WCD is a looser approximation.
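
To make the two nearest-neighbor relaxations concrete, here is a minimal sketch (illustrative names, not from the paper's code; dist_matrix holds the pairwise word distances between the two documents, w1 and w2 their nBOW weights):

import numpy as np

def rwmd(dist_matrix, w1, w2):
    # Relaxed WMD: drop one of WMD's two flow constraints at a time.
    # Each relaxation sends each word's weight to the nearest word of the
    # other document; the max of the two costs is a lower bound on WMD.
    # dist_matrix: (n1, n2) distances between doc1 words and doc2 words
    # w1: (n1,) nBOW weights of doc1; w2: (n2,) nBOW weights of doc2
    cost1 = np.sum(w1 * dist_matrix.min(axis=1))  # doc1 -> nearest doc2 word
    cost2 = np.sum(w2 * dist_matrix.min(axis=0))  # doc2 -> nearest doc1 word
    return max(cost1, cost2)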

@RishabGoel
Contributor Author

@dselivanov and @tmylk, thanks for the suggestions and links.
I will look into them.
I am also done with the Matlab vs. Python test of RWMD. Will post the results soon.

@RishabGoel
Contributor Author

The code looks similar to @mkusner's implementation, except that there is an alternate distance metric, i.e. cosine distance. Will add this feature as well...

@tmylk tmylk changed the title Potential Word Movers Distance performance improvement: WCD and RWMD [WIP] Potential Word Movers Distance performance improvement: WCD and RWMD Oct 4, 2016
@tmylk tmylk added difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature labels Oct 4, 2016
@marco-c

marco-c commented Nov 14, 2016

In my case, using RWMD+WMD (I'm not using WCD for now) is much faster than just WMD. My implementation is in Python and not so different from @RishabGoel's (except that I'm using cosine distance and maybe doing a few fewer operations).

Here's what I currently have, perhaps it can be useful:

import bisect

import numpy as np

# Keep only the query words that are in the word2vec vocabulary
# (assumed definition; words_to_test_clean was implied in the original snippet).
words_to_test_clean = [word for word in words_to_test if word in model]

# Cosine distances between every word in the vocabulary and every query word
# (the vectors in syn0norm are L2-normalized, so 1 - dot product is the cosine distance).
all_distances = 1 - np.dot(model.syn0norm, model.syn0norm[[model.vocab[word].index for word in words_to_test_clean]].transpose())

distances = []
for doc_id in range(0, len(corpus)):
    doc_words = [model.vocab[word].index for word in corpus[doc_id].words if word in model]
    if len(doc_words) != 0:
        word_dists = all_distances[doc_words]
        # RWMD lower bound: the tighter (larger) of the two one-constraint relaxations.
        rwmd = max(np.sum(np.min(word_dists, axis=0)), np.sum(np.min(word_dists, axis=1)))
    else:
        rwmd = float('inf')
    distances.append((doc_id, rwmd))

distances.sort(key=lambda v: v[1])

confirmed_distances_ids = []
confirmed_distances = []

for i, (doc_id, rwmd_distance) in enumerate(distances):
    # Stop once we have 'top' confirmed distances and all the remaining RWMD
    # lower bounds exceed the largest of the 'top' smallest confirmed distances.
    if len(confirmed_distances) >= top and rwmd_distance > confirmed_distances[top - 1]:
        break

    # TODO: directly use pyemd, so we don't recalculate distances we have already calculated for RWMD.
    wmd = model.wmdistance(words_to_test, corpus[doc_id].words)
    # Insert while keeping confirmed_distances sorted by distance.
    j = bisect.bisect(confirmed_distances, wmd)
    confirmed_distances.insert(j, wmd)
    confirmed_distances_ids.insert(j, doc_id)

similarities = zip(confirmed_distances_ids, confirmed_distances)

@RishabGoel
Contributor Author

RishabGoel commented Nov 15, 2016

Hi @marco-c,
What about performance?
Also, is the runtime of RWMD smaller than that of WMD?
If that is the case, then we can replace the Euclidean distance with cosine.

@tmylk
Contributor

tmylk commented Nov 15, 2016

@marco-c Thanks for looking into this. WMD really needs a speed-up to become useful. Do you have some performance metrics comparing WMD vs. RWMD+WMD?

@marco-c

marco-c commented Nov 15, 2016

If I don't compute all the distances between words beforehand (all_distances = 1 - ...) but in the loop, RWMD+WMD is ~5x faster than WMD.
If I compute them beforehand, it's ~60x faster than WMD.
@mkusner's and text2vec's code calculates the distances one at a time (I guess because they use WCD too).
These are approximate figures, as I haven't really measured the performance in a systematic way.

Clearly, the difference might also depend on the corpus (in my case, I currently have a ~30000-word vocabulary and documents containing ~10-20 words). Maybe with a different corpus RWMD wouldn't be such a good lower bound.

I'm pretty sure the code can still be optimized (e.g. by using Cython and/or multiple threads/processes).

@RishabGoel
Contributor Author

@marco-c Thanks for the update. You are right that the goodness of RWMD depends on the corpus, so please use one of the corpora in @mkusner's repo. Also, the WMD you are using for comparison with WMD+RWMD: is it the cosine-distance one or the one in gensim?

@marco-c

marco-c commented Nov 15, 2016

You are right that the goodness of RWMD depends on the corpus, so please use one of the corpora in @mkusner's repo.

Not sure when I can get to it, maybe this weekend. BTW, the code is pretty much self-contained, so you can also test it yourself if you want (model is the model, corpus is a list of TaggedDocuments, words_to_test is the array of words of the document you're considering).

Also, the WMD you are using for comparison with WMD+RWMD: is it the cosine-distance one or the one in gensim?

It's the one in gensim (yes, I'm using cosine distance for RWMD and Euclidean for WMD; I know I have to fix it, but I was just trying to see if I could get it fast enough to be usable).

@dselivanov

In my opinion, computing pairwise distances between words in advance is not an option for any non-toy corpus.

@RishabGoel
Contributor Author

@dselivanov Yes, perhaps because we might not need them all and they might take up a lot of memory. I will add a flag to decide whether to calculate the distances beforehand.

@marco-c

marco-c commented Nov 15, 2016

In my opinion, computing pairwise distances between words in advance is not an option for any non-toy corpus.

It certainly depends on the corpus. In my case I currently have ~30000 words in the vocabulary, and the matrix all_distances takes ~900 kB.
If you have a much larger vocabulary that makes it impossible to calculate the entire matrix at once, you could also do it in batches (e.g. the distances with the words of the first n documents), which would still be faster than calculating the distances one at a time for each document; a sketch of this batching follows below.
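
A minimal sketch of that batching idea (illustrative, not from this PR; assumes model.syn0norm as above and a list of tokenized documents, processing the words of n documents per batch):

import numpy as np

def distance_batches(model, documents, n=100):
    # Yield (word_indices, distance_block) pairs: cosine distances between
    # the whole vocabulary and the words of n documents at a time,
    # bounding peak memory instead of building one giant matrix.
    for start in range(0, len(documents), n):
        batch_words = set()
        for doc in documents[start:start + n]:
            batch_words.update(word for word in doc if word in model)
        indices = [model.vocab[word].index for word in batch_words]
        # (vocab_size, len(indices)) block of cosine distances
        yield indices, 1 - np.dot(model.syn0norm, model.syn0norm[indices].T)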

@dselivanov Yes, perhaps because we might not need them all and they might take up a lot of memory. I will add a flag to decide whether to calculate the distances beforehand.

Actually, I think you always need them. I'm not calculating the distances between all the words in the corpus and themselves; I'm calculating the distances between all the words in the corpus and the words of the document I'm currently considering:

all_distances = 1 - np.dot(model.syn0norm, model.syn0norm[[model.vocab[word].index for word in document if word in model]].transpose())

and not:

all_distances = 1 - np.dot(model.syn0norm, model.syn0norm.transpose())

In theory, if you implemented the second, you would have a single matrix shared by all queries, so it might be even faster (but yes, its feasibility depends on the size of your vocabulary).

@dselivanov

@marco-c ah, I got it. Good point!

@olavurmortensen
Contributor

Hey @RishabGoel, I just have a few comments on this PR.

On the topic of pre-computing word distances, you could use memoization. Lev (@tmylk) suggested this to me back when I was working on WMD. Simply store the distance in a dictionary the first time it is computed.
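
A minimal sketch of that memoization idea (illustrative helper, not from this PR; assumes the old-style API where model[word] returns the word's vector):

import numpy as np

_distance_cache = {}

def word_distance(model, w1, w2):
    # Euclidean word distance, computed once per pair and then reused.
    key = (w1, w2) if w1 <= w2 else (w2, w1)  # the distance is symmetric
    if key not in _distance_cache:
        diff = model[w1] - model[w2]
        _distance_cache[key] = float(np.sqrt(np.dot(diff, diff)))
    return _distance_cache[key]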

I don't think the name "FastKNN" is fitting. While this is technically KNN, it is specialized to a specific distance metric. Maybe "FastWMD" would be better.

It has been mentioned in this conversation that the amount of speed-up gained from using RWMD and WCD depends on the corpus. Have you tried applying it to some different corpora, to see if there is a difference in speed-up? If there is a difference in speed-up, then what seems to be the cause? For example, plot WMD, RWMD and WCD of all documents in sorted order, and compare the plots for each of the corpora.

I hope you succeed in scaling up this algorithm; it would be very nice to see :)

@menshikh-iv
Contributor

Ping @RishabGoel, what is the status of this PR? Will you finish it soon?

@menshikh-iv
Contributor

I am closing this PR because it is abandoned.

@HobbsB

HobbsB commented Jan 11, 2018

Hey @RishabGoel ,

Mind if I ask why this PR was abandoned?

I'm working on a project which needs this and I want to pick it up, or at least move it forward. I'm just starting to look at it, but it looks nearly finished, minus a few optimizations.

Am I missing something? What else needs to be done? Were there issues not mentioned here?

Any help would be greatly appreciated.

@krinkere

krinkere commented Mar 8, 2018

Same here... I am using a very large corpus and WMDSimilarity is virtually unusable.

@krinkere

krinkere commented Mar 8, 2018

@marco-c Do you mind sharing the complete code that you have?

@marco-c

marco-c commented Mar 8, 2018

There's some code at #800 (comment), which is almost self-contained (see my comment at #800 (comment)).

The full source code is in this repository https://github.com/marco-c/crashsimilarity, in particular at https://github.com/marco-c/crashsimilarity/blob/2e4e4a0b67cf2dfe36a8cad6be147df7bd4bb5de/crashsimilarity/models/base.py#L76. I think it's easier to hack on the example I posted in this PR than to look at the code in that repo.

@menshikh-iv
Contributor

@krinkere @HobbsB try soft-cosine similarity: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb (this is significantly faster than WMD and works well).
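
For anyone trying that route, a minimal sketch along the lines of the linked tutorial (assuming the gensim 3.x-era API; documents and query_tokens are illustrative names for lists of tokens):

from gensim.corpora import Dictionary
from gensim.similarities import SoftCosineSimilarity

# documents: list of tokenized texts; model: a trained Word2Vec model.
dictionary = Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Term similarity matrix derived from the word embeddings.
similarity_matrix = model.wv.similarity_matrix(dictionary)

index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
sims = index[dictionary.doc2bow(query_tokens)]  # top-10 (doc_id, similarity) pairs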

@menshikh-iv
Contributor

One more thing @krinkere @marco-c @HobbsB: if anyone is interested in this PR, you can continue/finish this one ("fork" it or simply copy-paste it to your new branch).

@krinkere

krinkere commented Mar 9, 2018

Thank you @menshikh-iv and @marco-c
I will give both a try!
