[WIP] Potential Word Movers Distance performance improvement: WCD and RWMD #800
Conversation
Hi,
Thanks.
How does this compare to the WMD stuff that @olavurmortensen did?
Hi,
Please comment if I answered your question.
Thanks @RishabGoel! That was a question mostly for @tmylk. If we are to merge this PR, we'll have to integrate it with what we already have (and also drop the sklearn dependencies).
Hi,
    doc : Normalised BOW representation of stored doc.
    doc_id : id of the stored doc
    """
    return (doc_id, distance.euclidean(np.dot(np.transpose(self.word_embedding), np.transpose(test_doc)), np.dot(np.transpose(self.word_embedding), doc)))
Why do you transpose test_doc and not doc?
This is because doc is an array containing the stored documents, and doc_id is the index of a document in that array. I should have named doc docs_arr; I will correct it in the next commit.
So, continuing: when I access doc[doc_id] I get an array of shape (1, x), while test_doc, being an array, has shape (x,). The dimensions of the word embedding are (x, 1). So, to make the shapes compatible, the transpose of test_doc is taken to make it (1, x).
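For readers following along: the returned value is the distance between the two documents' embedding-weighted centroids (the word centroid distance, WCD). Below is a minimal, self-contained sketch of that computation, assuming word_embedding has shape (vocab_size, embed_dim) and the documents are normalised bag-of-words vectors; the shapes and names are illustrative, not the PR's exact internals.

    import numpy as np
    from scipy.spatial import distance

    # Illustrative shapes only; the PR's class stores these differently.
    vocab_size, embed_dim = 5, 3
    rng = np.random.default_rng(0)
    word_embedding = rng.normal(size=(vocab_size, embed_dim))

    def nbow(counts):
        """Normalised bag-of-words vector of shape (vocab_size,)."""
        counts = np.asarray(counts, dtype=float)
        return counts / counts.sum()

    test_doc = nbow([1, 0, 2, 0, 1])    # query document
    stored_doc = nbow([0, 1, 1, 1, 0])  # one stored document

    # Each document maps to the weighted centroid of its word embeddings;
    # WCD is the Euclidean distance between the two centroids.
    wcd = distance.euclidean(word_embedding.T.dot(test_doc), word_embedding.T.dot(stored_doc))
    print(wcd)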
@RishabGoel Let's go over a performance (time) comparison in our next meeting.
Sure, I will do that testing. I have accepted the invitation. Let's talk about it in our next meeting.
I have updated the documentation of the prune function. Is it understandable now?
@RishabGoel @mkusner I'd suggest discussing here rather than privately via email. That way, other people can chime in and help out, and there's a track record of what changes were made, why, and how.
@piskvorky Sure thing. I am pasting the contents of the mail below:

Thanks Rishab for your interest in the project! To your questions:
@dselivanov and @tmylk thanks for the suggestions and links...
The code looks similar to @mkusner's implementation, except there is an alternate distance metric, i.e. cosine distance. Will add this feature as well...
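As an illustration of switching between the two word-distance metrics, here is a short sketch using scipy's cdist with made-up arrays; the metric argument is illustrative and not the PR's actual API.

    import numpy as np
    from scipy.spatial.distance import cdist

    def word_distances(query_vecs, doc_vecs, metric="euclidean"):
        # cdist supports both metrics; "cosine" returns 1 - cosine similarity.
        return cdist(query_vecs, doc_vecs, metric=metric)

    rng = np.random.default_rng(0)
    query_vecs = rng.normal(size=(4, 50))  # embeddings of the query document's words
    doc_vecs = rng.normal(size=(6, 50))    # embeddings of a stored document's words
    print(word_distances(query_vecs, doc_vecs, metric="cosine").shape)  # (4, 6)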
In my case, using RWMD+WMD (I'm not using WCD for now) is much faster than just WMD. My implementation is in Python and not so different from @RishabGoel's (except that I'm using cosine distance and maybe doing a few fewer operations). Here's what I currently have, perhaps it can be useful:

    all_distances = 1 - np.dot(model.syn0norm, model.syn0norm[[model.vocab[word].index for word in words_to_test_clean]].transpose())

    distances = []
    for doc_id in range(0, len(corpus)):
        doc_words = [model.vocab[word].index for word in corpus[doc_id].words if word in model]
        if len(doc_words) != 0:
            word_dists = all_distances[doc_words]
            rwmd = max(np.sum(np.min(word_dists, axis=0)), np.sum(np.min(word_dists, axis=1)))
        else:
            rwmd = float('inf')
        distances.append((doc_id, rwmd))

    distances.sort(key=lambda v: v[1])

    confirmed_distances_ids = []
    confirmed_distances = []
    for i, (doc_id, rwmd_distance) in enumerate(distances):
        # Stop once we have 'top' confirmed distances and all the rwmd lower bounds are higher
        # than the smallest top confirmed distance.
        if len(confirmed_distances) >= top and rwmd_distance > confirmed_distances[top-1]:
            break
        # TODO: directly use pyemd, so we don't recalculate distances we have already calculated for RWMD.
        wmd = model.wmdistance(words_to_test, corpus[doc_id].words)
        j = bisect.bisect(confirmed_distances, wmd)
        confirmed_distances.insert(j, wmd)
        confirmed_distances_ids.insert(j, doc_id)

    similarities = zip(confirmed_distances_ids, confirmed_distances)
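The snippet above is not fully self-contained. Roughly the following names are assumed to exist already; this list is inferred from the code, not part of the original comment.

    import bisect        # used for the sorted insertions above
    import numpy as np   # used as np above

    # Assumed (inferred) context:
    # model               - a trained gensim Word2Vec model with syn0norm built (e.g. after model.init_sims())
    # corpus              - a sequence of objects with a .words list of tokens
    # words_to_test       - the tokens of the query document
    # words_to_test_clean - the query tokens filtered to those present in model.vocab
    # top                 - how many nearest documents to keep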
Hi @marco-c,
@marco-c Thanks for looking into this. The WMD really needs a speed-up to become useful. Do you have some performance metrics comparing WMD vs RWMD+WMD?
If I don't compute all the distances between words beforehand (…). Clearly, the difference might also be dependent on the corpus (in my case, I currently have a ~30,000-word vocabulary and documents contain ~10-20 words). Maybe with a different corpus RWMD wouldn't be such a good lower bound. I'm pretty sure the code can still be optimized (e.g. by using Cython and/or multiple threads/processes).
@marco-c Thanks for the update. You are right that the goodness of RWMD depends on the corpus, so please use one from mkusner's repo. Also, the WMD you are using for comparison with WMD+RWMD: is it the cosine-distance one or the one in gensim?
Not sure when I can get to it, maybe this weekend. BTW, the code is pretty much self-contained, so you can also test it yourself if you want (…).
It's the one in gensim (yes, I'm using cosine distance for RWMD and Euclidean for WMD; I know I have to fix it, but I was just trying to see if I could get it fast enough to be usable).
In my opinion, computing pairwise distances between words in advance is not an option for any non-toy corpus.
@dselivanov Yes, perhaps because we might not need them and they might take up a lot of memory. I will create a flag to decide whether to calculate distances beforehand.
It certainly depends on the corpus. In my case I currently have ~30,000 words in the vocabulary, and the matrix is only (vocabulary size) × (number of words in the query document).
Actually, I think you always need them. I'm not calculating the distances between all words in the corpus and themselves; I'm calculating the distances between all words in the corpus and the words in the document I'm currently considering. So the matrix is (vocabulary size) × (words in the query document), and not (vocabulary size) × (vocabulary size).
In theory, if you implemented the second, you would have a single matrix shared by all queries and so it might be even faster (but yes, its feasibility depends on the size of your vocabulary).
@marco-c ah, I got it. Good point!
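To make the contrast above concrete, here is a rough back-of-the-envelope sketch of the two options, assuming a ~30,000-word vocabulary, ~15-word queries, and float64 entries; the numbers are purely illustrative.

    vocab_size, query_len = 30_000, 15

    # Option 1 (what the snippet above does): one (vocab_size, query_len) matrix
    # per query, recomputed for every query document.
    per_query_mb = vocab_size * query_len * 8 / 1e6
    print(per_query_mb, "MB per query")          # ~3.6 MB

    # Option 2: a single (vocab_size, vocab_size) matrix shared by all queries.
    full_pairwise_gb = vocab_size * vocab_size * 8 / 1e9
    print(full_pairwise_gb, "GB, computed once")  # ~7.2 GB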
Hey @RishabGoel, I just have a few comments on this PR.

On the topic of pre-computing word distances, you could use memoization. Lev (@tmylk) suggested this to me back when I was working on WMD. Simply store the distance in a dictionary the first time it is computed.

I don't think the name "FastKNN" is fitting. While this is technically KNN, it is specialized to a specific distance metric. Maybe "FastWMD" would be better.

It has been mentioned in this conversation that the amount of speed-up gained from using RWMD and WCD depends on the corpus. Have you tried applying it to some different corpora, to see if there is a difference in speed-up? If there is a difference in speed-up, then what seems to be the cause? For example, plot WMD, RWMD and WCD of all documents in sorted order, and compare the plots for each of the corpora.

I hope you succeed in scaling up this algorithm, would be very nice to see :)
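A minimal sketch of the memoization idea described above, caching each pairwise word distance in a dictionary the first time it is computed; the get_vector helper and the cache layout are illustrative assumptions, not gensim API.

    import numpy as np

    _distance_cache = {}

    def cached_distance(word_a, word_b, get_vector):
        """Euclidean distance between two words' embeddings, computed at most once per pair."""
        key = (word_a, word_b) if word_a <= word_b else (word_b, word_a)
        if key not in _distance_cache:
            _distance_cache[key] = float(np.linalg.norm(get_vector(word_a) - get_vector(word_b)))
        return _distance_cache[key]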
Ping @RishabGoel, what is the status of this PR? Will you finish it soon?
I am closing this PR because it is abandoned.
Hey @RishabGoel, mind if I ask why this PR was abandoned? I'm working on a project which needs this and want to pick it up, or at least move it forward. I'm just starting to look at it, but it looks nearly finished minus a few optimizations. Am I missing something? What else needs to be done? Were there issues not mentioned here? Any help would be greatly appreciated.
Same here... I am using a very large corpus and WMDSimilarity is virtually unusable.
@marco-c Do you mind sharing the complete code that you have?
There's some code at #800 (comment), which is almost self-contained (see my comment at #800 (comment)). The full source code is in this repository https://github.com/marco-c/crashsimilarity, in particular at https://github.com/marco-c/crashsimilarity/blob/2e4e4a0b67cf2dfe36a8cad6be147df7bd4bb5de/crashsimilarity/models/base.py#L76. I think it's easier to hack on the example I posted in this PR rather than looking at the code in that repo.
@krinkere @HobbsB try to use soft-cosine similarity https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb (this is significantly faster than WMD and works well)
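For reference, the soft-cosine setup from that notebook looks roughly like the sketch below. The exact imports and keyword names vary across gensim versions (e.g. size vs vector_size), so treat this as an approximation and check the tutorial for your version.

    from gensim.corpora import Dictionary
    from gensim.models import Word2Vec
    from gensim.similarities import (
        SoftCosineSimilarity,
        SparseTermSimilarityMatrix,
        WordEmbeddingSimilarityIndex,
    )

    # Toy corpus purely for illustration.
    documents = [["cat", "sat", "mat"], ["dog", "sat", "rug"], ["bird", "flew", "away"]]
    keyed_vectors = Word2Vec(documents, vector_size=20, min_count=1, seed=1).wv

    dictionary = Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    # Term-similarity matrix built from the word embeddings.
    termsim_index = WordEmbeddingSimilarityIndex(keyed_vectors)
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

    # Query the index; returns (doc_id, similarity) pairs for the best matches.
    index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=2)
    print(index[dictionary.doc2bow(["dog", "sat", "mat"])])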
Thank you @menshikh-iv and @marco-c
The class calculates the K nearest documents based on the heuristics mentioned in section 4 of the paper.
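For context, the "prefetch and prune" heuristics from section 4 of the paper (Kusner et al.) can be sketched as follows. The wcd, rwmd and wmd names are hypothetical distance callables used only for illustration, not the PR's actual methods.

    def k_nearest(query, docs, k, wcd, rwmd, wmd):
        """Sketch of prefetch-and-prune: wcd and rwmd are cheap lower bounds on wmd."""
        # 1. Prefetch: sort all documents by the cheap WCD lower bound.
        order = sorted(range(len(docs)), key=lambda i: wcd(query, docs[i]))

        # 2. Compute the exact WMD for the k WCD-nearest documents.
        best = sorted((wmd(query, docs[i]), i) for i in order[:k])

        # 3. Prune: for the remaining documents, only compute the expensive exact
        #    WMD when the tighter RWMD lower bound could still beat the k-th best.
        for i in order[k:]:
            if rwmd(query, docs[i]) >= best[-1][0]:
                continue  # cannot improve on the current k-th best
            d = wmd(query, docs[i])
            if d < best[-1][0]:
                best[-1] = (d, i)
                best.sort()
        return best  # list of (distance, doc_id), nearest first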