Improve speed of FastTextKeyedVectors __contains__ #1499
Conversation
The current implementation of `__contains__` in FastTextKeyedVectors is `O(n*m)`, where `n` is the number of character ngrams in the query word and `m` is the size of the vocabulary. This is very slow for large corpora. The new implementation is `O(n)`.

I was working with the Wikipedia-trained vectors provided by Facebook, which have a vocabulary of a few million words, and it was extremely slow to create averaged vectors from sentences. I looked into the code and found that `__contains__` was implemented inefficiently, so I replaced it with an implementation that is algorithmically faster and short-circuits to give good average-case complexity.

Testing: tests in test_fasttext_wrapper.py pass when run from PyCharm.
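For orientation, here is a minimal, hypothetical sketch (not the actual gensim code) of where the complexity difference comes from: scanning the whole ngram vocabulary for each query ngram versus one average-constant-time dict/set lookup per query ngram.

    # Illustrative only -- `vocab_ngrams` stands in for FastTextKeyedVectors.ngrams
    # (a dict keyed by character ngram).

    def contains_slow(word_ngrams, vocab_ngrams):
        # O(n*m): for each of the n query ngrams, scan all m vocabulary ngrams.
        return any(any(ng == known for known in vocab_ngrams) for ng in word_ngrams)

    def contains_fast(word_ngrams, vocab_ngrams):
        # O(n): each `in` test against a dict is O(1) on average, and `any`
        # short-circuits on the first hit.
        return any(ng in vocab_ngrams for ng in word_ngrams)

    vocab_ngrams = {'<he': 0, 'hel': 1, 'ell': 2, 'llo': 3, 'lo>': 4}
    word_ngrams = ['<he', 'hel', 'ell']
    print(contains_slow(word_ngrams, vocab_ngrams), contains_fast(word_ngrams, vocab_ngrams))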
gensim/models/wrappers/fasttext.py (outdated diff)

        else:
            return False
        word_ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
        return any(ng in self.ngrams for ng in word_ngrams)
Thanks! The original was a bizarre construct indeed.

How often can we expect `compute_ngrams` to contain duplicates? If it's a non-trivial number, iterating over `set(word_ngrams)` would be better.
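(For illustration, a small standalone sketch, not code from the PR, of how duplicate character ngrams arise for words with repeated substrings; the helper mimics fastText-style `<word>` bracketing.)

    def char_ngrams(word, min_n, max_n):
        # Hypothetical helper, not gensim's compute_ngrams implementation.
        extended = '<%s>' % word
        return [extended[i:i + n]
                for n in range(min_n, max_n + 1)
                for i in range(len(extended) - n + 1)]

    ngrams = char_ngrams('banana', 3, 3)
    print(ngrams)                          # ['<ba', 'ban', 'ana', 'nan', 'ana', 'na>']
    print(len(ngrams), len(set(ngrams)))   # 6 5 -> 'ana' occurs twice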
It's difficult to say. As the character ngrams get smaller there is a greater chance of repeats, and inputs may not always be English. Do character ngrams follow a Zipfian distribution? Even if they do, is the standard-library loop in the `set` constructor really that much faster than iterating in pure Python? Since the list of character ngrams is bound to be relatively small and we're already looping over all of the ngrams, it doesn't seem like making them into a set would get us anything.

When I played around with this before submitting the PR, I actually memoized both `__contains__` and `compute_ngrams` in order to get more speed out of it, but I didn't think that would be in line with gensim's goal of memory independence.

Also, I think `word_ngrams` should become `character_ngrams`.
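(A rough micro-benchmark along these lines could settle the set-vs-list question empirically; this is only an illustrative sketch with made-up data, not a measurement from the PR.)

    import timeit

    setup = """
    vocab = {'ng%d' % i: i for i in range(100000)}              # stand-in for self.ngrams
    ngrams = ['<ba', 'ban', 'ana', 'nan', 'ana', 'na>', 'zzz']  # small list with one duplicate
    """

    direct = "any(ng in vocab for ng in ngrams)"
    via_set = "any(ng in vocab for ng in set(ngrams))"

    print(timeit.timeit(direct, setup=setup, number=100000))
    print(timeit.timeit(via_set, setup=setup, number=100000))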
Agreed. Since `dict.__contains__` is fast (relative to allocating and filling in a new `set` object), and the expected duplication rate is low (I think), the current approach is probably for the best.

IMO the `set` optimization should come from `compute_ngrams()` itself, to avoid creating the intermediate list. An iterator over (unique?) ngrams, e.g. `unique_ngrams()`, would be ideal -- no full explicit list or set needed (since we short-circuit anyway). An optimized inlining of `compute_ngrams` into `FastText.__contains__` seems the simplest & most efficient solution (starting the iteration from the smallest/most frequent ngrams, since these are the most likely to exist => early out). CC @jayantj
`compute_ngrams` previously used to return a set of unique ngrams. The reason this was changed is that the original FastText implementation sums up vectors for all character ngrams to obtain a word vector (so ngrams occurring more than once are actually added up multiple times), and we wanted to replicate FastText behaviour as closely as possible.

An iterator would probably be ideal, and the char ngrams are already returned in increasing order of length, so the `any(ng in self.ngrams for ng in word_ngrams)` check should be fairly efficient. Returning them in decreasing order of frequency would be non-trivial though (partly because we don't even store ngram counts), and probably unnecessary. It is possible to create a new method to iterate over unique ngrams only, but I don't think it would result in a significant gain in speed.

I'm not sure why the previous construct was the way it was -- I can't think of or remember any good reason for it. There is already a very similar snippet in the `word_vec` method:

    ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
    ngrams = [ng for ng in ngrams if ng in self.ngrams]

So I'm not sure why I didn't already stick to that. Thanks a lot for the fix.
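(For context, a simplified sketch of why the duplicates matter when building an out-of-vocabulary word vector: each occurrence of an ngram contributes to the sum again. This assumes a hypothetical `ngram_vectors` mapping from ngram to numpy vector; it is not gensim's actual `word_vec` code.)

    import numpy as np

    def oov_word_vector(word_ngrams, ngram_vectors, dim=100):
        # Sum the vectors of all character ngrams present in the model,
        # counting repeated ngrams multiple times, as fastText does.
        vec = np.zeros(dim)
        count = 0
        for ng in word_ngrams:        # deliberately not deduplicated
            if ng in ngram_vectors:
                vec += ngram_vectors[ng]
                count += 1
        if count == 0:
            raise KeyError('all ngrams for word absent from model')
        return vec / count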
I changed a variable name and the docstring to improve clarity but didn't change anything else.

This is what a hyper-optimized inlined ngrams function would look like:

    def __contains__(self, word):
        """
        Check if `word` or any character ngrams in `word` are present in the vocabulary.
        A vector for the word is guaranteed to exist if `__contains__` returns True.
        """
        if word in self.vocab:
            return True
        else:
            # Inline creation of character ngrams for speed
            extended_word = '<{}>'.format(word)
            n_chars = len(extended_word)
            min_n = self.min_n
            max_len = min(n_chars, self.max_n) + 1
            char_ngrams = {extended_word[i:i+ngl] for ngl in xrange(min_n, max_len)
                           for i in xrange(0, n_chars - ngl + 1)}
            return any(ng in self.ngrams for ng in char_ngrams)

Is this something we are comfortable having in the codebase? I could benchmark it over the Wikipedia fastText vocabulary, but before I do I'd like to get some feedback on whether this code is at a sufficiently high level of clarity for the project. Also, I used …
This could be even faster if "lazy set generation" were a thing in Python, but I think that might be overdoing it. At that point I think an LFU cache would actually be better. We could be sure it would see use, by virtue of Zipf's law.
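(A sketch of what such a cache could look like, using the third-party `cachetools` package -- an assumption, gensim does not depend on it. Frequency-based eviction means that, per Zipf's law, the most common query words tend to stay cached. This is illustrative only, not code proposed in the PR.)

    from cachetools import LFUCache

    from gensim.models.wrappers.fasttext import FastText

    class CachedContainsMixin(object):
        """Hypothetical mixin for FastTextKeyedVectors-like objects exposing
        self.vocab, self.ngrams, self.min_n and self.max_n."""

        def __init__(self):
            self._contains_cache = LFUCache(maxsize=100000)

        def __contains__(self, word):
            if word in self.vocab:      # exact vocabulary hit: no caching needed
                return True
            try:
                return self._contains_cache[word]
            except KeyError:
                ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
                result = any(ng in self.ngrams for ng in ngrams)
                self._contains_cache[word] = result   # least-frequently-used entries get evicted
                return result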
Thank you @ELind77