
Word Mover's Distance slow for fasttext #1292

Open

arashsa opened this issue Apr 26, 2017 · 11 comments
Labels: difficulty easy (Easy issue: required small fix) · feature (Issue described a new feature) · performance (Issue related to performance (in HW meaning))

arashsa commented Apr 26, 2017

I am testing the fasttext wrapper:

from gensim.models.wrappers import FastText
no_model_ft = FastText.load_fasttext_format('wiki.en/wiki.en')

Then I try the Word Mover's Distance function. Compared with the doc2vec format, it is very slow. Is there a way to make this function faster?
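
For reference, a minimal sketch of the kind of call being timed (the doc tokens are made up; depending on the gensim version, wmdistance is reached via the model itself or via its .wv attribute):

import time

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

start = time.time()
# requires the pyemd package; on newer gensim use no_model_ft.wv.wmdistance
distance = no_model_ft.wmdistance(doc1, doc2)
print("WMD = %.4f, computed in %.2fs" % (distance, time.time() - start))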

gojomo (Collaborator) commented Apr 26, 2017

Word Mover's Distance inherently requires a lot of computation. There may still be possible optimizations in the library gensim uses (PyEMD) or in what gensim does around it – but there are no easy switches to throw for extra speed.

My only suggestions are to make sure you're caching reusable values in your application, and to find ways to calculate pairwise WMD distances only among likely candidates.
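
A sketch of that second suggestion (the names are illustrative; it assumes a loaded KeyedVectors model kv and pre-tokenized documents): shortlist candidates with a cheap centroid-cosine pass, then run the expensive WMD only on the shortlist.

import numpy as np

def centroid(kv, doc):
    # mean of the in-vocabulary word vectors; very cheap compared to WMD
    return np.mean([kv[w] for w in doc if w in kv], axis=0)

def wmd_top_candidates(kv, query, docs, shortlist=10):
    q = centroid(kv, query)
    q /= np.linalg.norm(q)
    # O(n) cosine pass over document centroids
    scored = []
    for i, d in enumerate(docs):
        c = centroid(kv, d)
        scored.append((np.dot(q, c / np.linalg.norm(c)), i))
    scored.sort(reverse=True)
    # exact (slow) WMD only for the shortlisted candidates
    return [(kv.wmdistance(query, docs[i]), docs[i])
            for _, i in scored[:shortlist]]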

arashsa commented Apr 26, 2017

@gojomo

Yes, I am aware of the computational complexity. What I don't understand is why it is slower for fasttext than for word2vec. I assume it might have something to do with fasttext's n-grams, and that the word mover algorithm is computed for each n-gram. However, that alone should not account for the overall slowdown compared with the word2vec class. These are only assumptions.

Do both classes share the same underlying implementation?

gojomo (Collaborator) commented Apr 26, 2017

How much slower are you seeing with FastText vs non-FastText word-vector sources?

Yes, native FastText word-vectors need to be composed from the full-word token and all of its character n-grams – so each access will be a bit slower. Also, looking at the gensim WMD code (https://github.com/RaRe-Technologies/gensim/blob/7127eeef8304852da589465b338e2bba14e881f1/gensim/models/keyedvectors.py#L414), I see that each word-vector is re-accessed many times during the calculation – if that's a notable drag on FastText models, it could be optimized to use a local cache of the vectors.

If you try this optimization, please share the change & let us know what speedup you see.
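
A sketch of that idea (illustrative, not gensim's actual code): look each vector up once, then reuse the cached copies when filling WMD's pairwise distance matrix, so FastText's n-gram composition cost is paid once per word rather than once per access.

import numpy as np

def cached_distance_matrix(kv, words):
    # one (potentially expensive) lookup per word: for FastText this is
    # where the vector gets composed from its character n-grams
    vecs = {w: kv[w] for w in words}
    n = len(words)
    dist = np.zeros((n, n))
    for i, wi in enumerate(words):
        for j, wj in enumerate(words):
            # cheap array math inside the O(n^2) loop
            dist[i, j] = np.sqrt(np.sum((vecs[wi] - vecs[wj]) ** 2))
    return dist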

arashsa commented Apr 26, 2017

@gojomo Unfortunately I did not time the functions. There was a substantial slowdown, though. I will time them later and report back.

I will examine the code and see if a speedup is possible.

tmylk (Contributor) commented Apr 27, 2017

RWMD (Relaxed Word Mover's Distance) should be much faster than WMD while giving similar results; for example, it is used in the R text2vec package.
However, previous attempts at RWMD in gensim have not been performant. It's on the roadmap.
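
For reference, RWMD drops one of WMD's flow constraints so each word's weight simply moves to its nearest neighbour in the other document; the max of the two one-sided relaxations lower-bounds the true WMD. A minimal sketch with uniform word weights (real implementations use normalized bag-of-words frequencies):

import numpy as np

def rwmd(kv, doc1, doc2):
    v1 = np.array([kv[w] for w in doc1 if w in kv])
    v2 = np.array([kv[w] for w in doc2 if w in kv])
    # pairwise Euclidean distances between the two documents' words
    d = np.linalg.norm(v1[:, None, :] - v2[None, :, :], axis=-1)
    # each word's weight flows entirely to its nearest counterpart
    l1 = d.min(axis=1).mean()  # doc1 -> doc2
    l2 = d.min(axis=0).mean()  # doc2 -> doc1
    return max(l1, l2)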

marco-c commented Apr 29, 2017

Yes, I think this is unrelated to fasttext; WMD is slow no matter the embedding algorithm.
There's an open PR about using RWMD and WCD, but it looks like it has stalled: #800.

tmylk (Contributor) commented May 2, 2017

Confirmed, that is the current status.
Closing as resolved.

@tmylk closed this as completed May 2, 2017
gojomo (Collaborator) commented May 3, 2017

As implied by the original reporter, and noted in my comment, the access pattern used by gensim's WMD code is particularly costly for fasttext vectors, which need to be repeatedly re-composed from character n-grams.

While the reporter hasn't quantified the magnitude of the fasttext-specific slowdown, measuring it shouldn't be hard, and the fix is easy. (This would be a good student/first-timer test & fix.)

@gojomo reopened this May 3, 2017
@gojomo added the "difficulty easy" label Jul 9, 2017
@menshikh-iv added the "feature" and "performance" labels Oct 2, 2017
Bachstelze commented Jul 11, 2018

If the speed of WMD is a big issue for you, then you can try this out: https://github.com/src-d/wmd-relax
I didn't compare the speed results myself, but I would be interested if someone does for general settings.

menshikh-iv (Contributor) commented

Also, you can try soft-cosine similarity; it is a pretty fast and nice alternative to WMD (see the benchmark results).
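
For reference, a minimal sketch of the soft-cosine idea (illustrative, not gensim's API): unlike plain cosine over bag-of-words counts, a term-similarity matrix S lets related-but-different words contribute to the score.

import numpy as np

def soft_cosine(a, b, S):
    # a, b: bag-of-words count vectors over a shared vocabulary;
    # S[i, j]: similarity of terms i and j (S = identity gives plain cosine).
    # S is typically built from word embeddings, e.g.
    # S[i, j] = max(0, cosine(v_i, v_j)) ** 2
    num = a @ S @ b
    return num / np.sqrt((a @ S @ a) * (b @ S @ b))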

gojomo (Collaborator) commented Jul 11, 2018

@Bachstelze Thanks for the pointer to wmd-relax!

Do you happen to know if it has options to return the "flows" chosen to minimize the WMD value, or "remainder" (if ever doing a partial-match of unequal 'weights')? I have a hunch WMD might work well to check when one text conveys a superset of another (shorter) text's meaning – and then also find third texts that best add exactly what's missing to the shorter text.
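
(For reference: PyEMD, the backend gensim's wmdistance uses, can return the chosen flows via emd_with_flow; a toy sketch with made-up histograms and distances:)

import numpy as np
from pyemd import emd_with_flow

first = np.array([0.5, 0.5, 0.0])
second = np.array([0.0, 0.5, 0.5])
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])

emd, flow = emd_with_flow(first, second, dist)
print(emd)   # minimal transport cost
print(flow)  # flow[i][j]: mass moved from bin i to bin j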
