Improve speed of FastTextKeyedVectors __contains__ #1499
Conversation
The current implementation of `__contains__` in FastTextKeyedVectors is `O(n*m)`, where `n` is the number of character ngrams in the query word and `m` is the size of the vocabulary. This is very slow for large corpora. The new implementation is `O(n)`.

I was working with the Wikipedia-trained vectors provided by Facebook, which have a vocabulary of a few million words, and it was extremely slow to create averaged vectors from sentences. I looked into the code and found that `__contains__` was implemented inefficiently, so I replaced it with an implementation that is algorithmically faster and short-circuits to give good average-case complexity.

Testing: tests in test_fasttext_wrapper.py pass when run from PyCharm.
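For orientation, here is a minimal, hypothetical sketch (not the actual gensim code) of where the complexity difference comes from: scanning the whole ngram vocabulary for each query ngram versus one average-constant-time dict/set lookup per query ngram.

    # Illustrative only -- `vocab_ngrams` stands in for FastTextKeyedVectors.ngrams
    # (a dict keyed by character ngram).

    def contains_slow(word_ngrams, vocab_ngrams):
        # O(n*m): for each of the n query ngrams, scan all m vocabulary ngrams.
        return any(any(ng == known for known in vocab_ngrams) for ng in word_ngrams)

    def contains_fast(word_ngrams, vocab_ngrams):
        # O(n): each `in` test against a dict is O(1) on average, and `any`
        # short-circuits on the first hit.
        return any(ng in vocab_ngrams for ng in word_ngrams)

    vocab_ngrams = {'<he': 0, 'hel': 1, 'ell': 2, 'llo': 3, 'lo>': 4}
    word_ngrams = ['<he', 'hel', 'ell']
    print(contains_slow(word_ngrams, vocab_ngrams), contains_fast(word_ngrams, vocab_ngrams))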
gensim/models/wrappers/fasttext.py (outdated diff)

        else:
            return False
        word_ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
        return any(ng in self.ngrams for ng in word_ngrams)
Thanks! The original was a bizarre construct indeed.

How often can we expect `compute_ngrams` to contain duplicates? If it's a non-trivial number, iterating over `set(word_ngrams)` would be better.
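(For illustration, a small standalone sketch, not code from the PR, of how duplicate character ngrams arise for words with repeated substrings; the helper mimics fastText-style `<word>` bracketing.)

    def char_ngrams(word, min_n, max_n):
        # Hypothetical helper, not gensim's compute_ngrams implementation.
        extended = '<%s>' % word
        return [extended[i:i + n]
                for n in range(min_n, max_n + 1)
                for i in range(len(extended) - n + 1)]

    ngrams = char_ngrams('banana', 3, 3)
    print(ngrams)                          # ['<ba', 'ban', 'ana', 'nan', 'ana', 'na>']
    print(len(ngrams), len(set(ngrams)))   # 6 5 -> 'ana' occurs twice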
It's difficult to say. As the character ngrams get smaller there is a greater chance of repeats, and inputs may not always be English. Do character ngrams follow a Zipfian distribution? Even if they do, is the standard-library loop in the `set` constructor really that much faster than iterating in pure Python? Since the list of character ngrams is bound to be relatively small and we're already looping over all of the ngrams, it doesn't seem like making them into a set would get us anything.

When I played around with this before submitting the PR, I actually memoized both `__contains__` and `compute_ngrams` in order to get more speed out of it, but I didn't think that would be in line with gensim's goal of memory independence.

Also, I think `word_ngrams` should become `character_ngrams`.
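(A rough micro-benchmark along these lines could settle the set-vs-list question empirically; this is only an illustrative sketch with made-up data, not a measurement from the PR.)

    import timeit

    setup = """
    vocab = {'ng%d' % i: i for i in range(100000)}              # stand-in for self.ngrams
    ngrams = ['<ba', 'ban', 'ana', 'nan', 'ana', 'na>', 'zzz']  # small list with one duplicate
    """

    direct = "any(ng in vocab for ng in ngrams)"
    via_set = "any(ng in vocab for ng in set(ngrams))"

    print(timeit.timeit(direct, setup=setup, number=100000))
    print(timeit.timeit(via_set, setup=setup, number=100000))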
Agreed. Since `dict.__contains__` is fast (relative to allocating and filling in a new `set` object), and the expected duplication rate is low (I think), the current approach is probably for the best.

IMO the `set` optimization should come from `compute_ngrams()` itself, to avoid creating the intermediate list. An iterator over (unique?) ngrams, e.g. `unique_ngrams()`, would be ideal -- no full explicit list or set needed (since we short-circuit anyway). An optimized inlining of `compute_ngrams` into `FastText.__contains__` seems the simplest & most efficient solution (starting the iteration from the smallest/most frequent ngrams, since these are the most likely to exist => early out). CC @jayantj
`compute_ngrams` previously used to return a set of unique ngrams. The reason this was changed is that the original FastText implementation sums up vectors for all character ngrams to obtain a word vector (so ngrams occurring more than once are actually added up multiple times), and we wanted to replicate FastText behaviour as closely as possible.

An iterator would probably be ideal, and the char ngrams are already returned in increasing order of length, so the `any(ng in self.ngrams for ng in word_ngrams)` check should be fairly efficient. Returning them in decreasing order of frequency would be non-trivial though (partly because we don't even store ngram counts), and probably unnecessary. It is possible to create a new method to iterate over unique ngrams only, but I don't think it would result in a significant gain in speed.

I'm not sure why the previous construct was the way it was -- I can't think of or remember any good reason for it. There is already a very similar snippet in the `word_vec` method:

    ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
    ngrams = [ng for ng in ngrams if ng in self.ngrams]

So I'm not sure why I didn't already stick to that. Thanks a lot for the fix.
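(For context, a simplified sketch of why the duplicates matter when building an out-of-vocabulary word vector: each occurrence of an ngram contributes to the sum again. This assumes a hypothetical `ngram_vectors` mapping from ngram to numpy vector; it is not gensim's actual `word_vec` code.)

    import numpy as np

    def oov_word_vector(word_ngrams, ngram_vectors, dim=100):
        # Sum the vectors of all character ngrams present in the model,
        # counting repeated ngrams multiple times, as fastText does.
        vec = np.zeros(dim)
        count = 0
        for ng in word_ngrams:        # deliberately not deduplicated
            if ng in ngram_vectors:
                vec += ngram_vectors[ng]
                count += 1
        if count == 0:
            raise KeyError('all ngrams for word absent from model')
        return vec / count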
I changed a variable name and the docstring to improve clarity but didn't change anything else.

This is what a hyper-optimized inlined ngrams function would look like:

    def __contains__(self, word):
        """
        Check if `word` or any character ngrams in `word` are present in the vocabulary.
        A vector for the word is guaranteed to exist if `__contains__` returns True.
        """
        if word in self.vocab:
            return True
        else:
            # Inline creation of character ngrams for speed
            extended_word = '<{}>'.format(word)
            n_chars = len(extended_word)
            min_n = self.min_n
            max_len = min(n_chars, self.max_n) + 1
            char_ngrams = {extended_word[i:i+ngl] for ngl in xrange(min_n, max_len)
                           for i in xrange(0, n_chars - ngl + 1)}
            return any(ng in self.ngrams for ng in char_ngrams)

Is this something we are comfortable having in the codebase? I could benchmark it over the Wikipedia fastText vocabulary, but before I do I'd like to get some feedback on whether this code is at a sufficiently high level of clarity for the project. Also, I used …
This could be even faster if "lazy set generation" were a thing in Python, but I think that might be overdoing it. At that point I think an LFU cache would actually be better. We could be sure it would see use, by virtue of Zipf's law.
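(A sketch of what such a cache could look like, using the third-party `cachetools` package -- an assumption, gensim does not depend on it. Frequency-based eviction means that, per Zipf's law, the most common query words tend to stay cached. This is illustrative only, not code proposed in the PR.)

    from cachetools import LFUCache

    from gensim.models.wrappers.fasttext import FastText

    class CachedContainsMixin(object):
        """Hypothetical mixin for FastTextKeyedVectors-like objects exposing
        self.vocab, self.ngrams, self.min_n and self.max_n."""

        def __init__(self):
            self._contains_cache = LFUCache(maxsize=100000)

        def __contains__(self, word):
            if word in self.vocab:      # exact vocabulary hit: no caching needed
                return True
            try:
                return self._contains_cache[word]
            except KeyError:
                ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
                result = any(ng in self.ngrams for ng in ngrams)
                self._contains_cache[word] = result   # least-frequently-used entries get evicted
                return result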
Thank you @ELind77