
Word Mover's Distance slow for fasttext #1292

Open

arashsa opened this issue Apr 26, 2017 · 11 comments
Labels: difficulty easy (Easy issue: required small fix) · feature (Issue described a new feature) · performance (Issue related to performance (in HW meaning))

arashsa commented Apr 26, 2017

I am testing the fasttext wrapper:

from gensim.models.wrappers import FastText
no_model_ft = FastText.load_fasttext_format('wiki.en/wiki.en')

Then I try the Word Mover's Distance function. Compared with the doc2vec format, it is very slow. Is there a way to make this function faster?
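
For reference, a minimal sketch of the kind of call being timed (the doc tokens are made up; depending on the gensim version, wmdistance is reached via the model itself or via its .wv attribute):

import time

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

start = time.time()
# requires the pyemd package; on newer gensim use no_model_ft.wv.wmdistance
distance = no_model_ft.wmdistance(doc1, doc2)
print("WMD = %.4f, computed in %.2fs" % (distance, time.time() - start))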

gojomo (Collaborator) commented Apr 26, 2017

Word Mover's Distance inherently requires a lot of computation. There may still be possible optimizations in the library gensim uses (PyEMD) or in what gensim does around it – but there are no easy switches to throw for extra speed.

My only suggestions are to make sure you're caching reusable values in your application, and to find ways to calculate pairwise WMD distances only among likely candidates.
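
A sketch of that second suggestion (the names are illustrative; it assumes a loaded KeyedVectors model kv and pre-tokenized documents): shortlist candidates with a cheap centroid-cosine pass, then run the expensive WMD only on the shortlist.

import numpy as np

def centroid(kv, doc):
    # mean of the in-vocabulary word vectors; very cheap compared to WMD
    return np.mean([kv[w] for w in doc if w in kv], axis=0)

def wmd_top_candidates(kv, query, docs, shortlist=10):
    q = centroid(kv, query)
    q /= np.linalg.norm(q)
    # O(n) cosine pass over document centroids
    scored = []
    for i, d in enumerate(docs):
        c = centroid(kv, d)
        scored.append((np.dot(q, c / np.linalg.norm(c)), i))
    scored.sort(reverse=True)
    # exact (slow) WMD only for the shortlisted candidates
    return [(kv.wmdistance(query, docs[i]), docs[i])
            for _, i in scored[:shortlist]]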

arashsa commented Apr 26, 2017

@gojomo

Yes, I am aware of the computational complexity. What I don't understand is why it is slower for fasttext than for word2vec. I assume it might have something to do with fasttext's n-grams, and that the word mover algorithm is computed for each n-gram. However, that alone should not account for the overall slowdown compared with the word2vec class. These are only assumptions.

Do both classes share the same underlying implementation?

gojomo (Collaborator) commented Apr 26, 2017

How much slower are you seeing with FastText vs non-FastText word-vector sources?

Yes, native FastText word-vectors need to be composed from the full-word token and all of its character n-grams – so each access will be a bit slower. Also, looking at the gensim WMD code (https://github.com/RaRe-Technologies/gensim/blob/7127eeef8304852da589465b338e2bba14e881f1/gensim/models/keyedvectors.py#L414), I see that each word-vector is re-accessed many times during the calculation – if that's a notable drag on FastText models, it could be optimized to use a local cache of the vectors.

If you try this optimization, please share the change & let us know what speedup you see.
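
A sketch of that idea (illustrative, not gensim's actual code): look each vector up once, then reuse the cached copies when filling WMD's pairwise distance matrix, so FastText's n-gram composition cost is paid once per word rather than once per access.

import numpy as np

def cached_distance_matrix(kv, words):
    # one (potentially expensive) lookup per word: for FastText this is
    # where the vector gets composed from its character n-grams
    vecs = {w: kv[w] for w in words}
    n = len(words)
    dist = np.zeros((n, n))
    for i, wi in enumerate(words):
        for j, wj in enumerate(words):
            # cheap array math inside the O(n^2) loop
            dist[i, j] = np.sqrt(np.sum((vecs[wi] - vecs[wj]) ** 2))
    return dist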

arashsa commented Apr 26, 2017

@gojomo Unfortunately I did not time the functions. There was a substantial slowdown, though. I will time them later and report back.

I will examine the code and see if a speedup is possible.

tmylk (Contributor) commented Apr 27, 2017

RWMD (Relaxed Word Mover's Distance) should be much faster than WMD while giving similar results; for example, it is used in the R text2vec package.
However, previous attempts at RWMD in gensim have not been performant. It's on the roadmap.
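
For reference, RWMD drops one of WMD's flow constraints so each word's weight simply moves to its nearest neighbour in the other document; the max of the two one-sided relaxations lower-bounds the true WMD. A minimal sketch with uniform word weights (real implementations use normalized bag-of-words frequencies):

import numpy as np

def rwmd(kv, doc1, doc2):
    v1 = np.array([kv[w] for w in doc1 if w in kv])
    v2 = np.array([kv[w] for w in doc2 if w in kv])
    # pairwise Euclidean distances between the two documents' words
    d = np.linalg.norm(v1[:, None, :] - v2[None, :, :], axis=-1)
    # each word's weight flows entirely to its nearest counterpart
    l1 = d.min(axis=1).mean()  # doc1 -> doc2
    l2 = d.min(axis=0).mean()  # doc2 -> doc1
    return max(l1, l2)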

marco-c commented Apr 29, 2017

Yes, I think this is unrelated to fasttext; WMD is slow no matter the embedding algorithm.
There's an open PR about using RWMD and WCD, but it looks like it has stalled: #800.

tmylk (Contributor) commented May 2, 2017

Confirmed, that is the current status.
Closing as resolved.

@tmylk closed this as completed May 2, 2017
gojomo (Collaborator) commented May 3, 2017

As implied by the original reporter, and noted in my comment, the access pattern used by gensim's WMD code is particularly costly for fasttext vectors, which need to be repeatedly re-composed from character n-grams.

While the reporter hasn't quantified the magnitude of the fasttext-specific slowdown, measuring it shouldn't be hard, and the fix is easy. (This would be a good student/first-timer test & fix.)

@gojomo reopened this May 3, 2017
@gojomo added the "difficulty easy" label Jul 9, 2017
@menshikh-iv added the "feature" and "performance" labels Oct 2, 2017
Bachstelze commented Jul 11, 2018

If the speed of WMD is a big issue for you, then you can try this out: https://github.com/src-d/wmd-relax
I didn't compare the speed results myself, but I would be interested if someone does for general settings.

menshikh-iv (Contributor) commented

Also, you can try soft-cosine similarity; it is a pretty fast and nice alternative to WMD (see the benchmark results).
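
For reference, a minimal sketch of the soft-cosine idea (illustrative, not gensim's API): unlike plain cosine over bag-of-words counts, a term-similarity matrix S lets related-but-different words contribute to the score.

import numpy as np

def soft_cosine(a, b, S):
    # a, b: bag-of-words count vectors over a shared vocabulary;
    # S[i, j]: similarity of terms i and j (S = identity gives plain cosine).
    # S is typically built from word embeddings, e.g.
    # S[i, j] = max(0, cosine(v_i, v_j)) ** 2
    num = a @ S @ b
    return num / np.sqrt((a @ S @ a) * (b @ S @ b))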

gojomo (Collaborator) commented Jul 11, 2018

@Bachstelze Thanks for the pointer to wmd-relax!

Do you happen to know if it has options to return the "flows" chosen to minimize the WMD value, or "remainder" (if ever doing a partial-match of unequal 'weights')? I have a hunch WMD might work well to check when one text conveys a superset of another (shorter) text's meaning – and then also find third texts that best add exactly what's missing to the shorter text.
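
(For reference: PyEMD, the backend gensim's wmdistance uses, can return the chosen flows via emd_with_flow; a toy sketch with made-up histograms and distances:)

import numpy as np
from pyemd import emd_with_flow

first = np.array([0.5, 0.5, 0.0])
second = np.array([0.0, 0.5, 0.5])
dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])

emd, flow = emd_with_flow(first, second, dist)
print(emd)   # minimal transport cost
print(flow)  # flow[i][j]: mass moved from bin i to bin j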
