Add optional argument `in_memory` to `FastSS.__init__` and `open` to store index in dictionary #1
Conversation
On a dictionary of 375 words, a query on an index opened with `in_memory` runs a third faster.
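A rough usage sketch, based on the issue title and TinyFastSS's README-style API (the exact signature lives in the PR, and `words.dat` plus the words are placeholders):

```python
import fastss

# Default behaviour: the index is backed by a dbm database on disk.
index = fastss.open('words.dat')

# Proposed option from the issue title: keep the index in a plain
# Python dict instead, skipping dbm lookups on every query.
index = fastss.open('words.dat', in_memory=True)
index.add('hello')
print(index.query('hallo'))  # maps edit distance to matching words
```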
Since this package is in the public domain, I cleaned it up a bit, got rid of the DB stuff (in-memory + pickle is both simpler and faster), optimized it, and added it to Gensim: piskvorky/gensim#3146. Thanks @fujimotos @rominf for your great work! It helped us with our indexing.
@piskvorky For my part, it's perfectly fine to embed TinyFastSS that way. I placed this program into the public domain mostly to make it easier to reuse in other projects.
FYI, an optimized version of in-memory TinyFastSS is now included in Gensim, with query benchmarks against an index of 180,000 English words.
@piskvorky For that matter, I recommend checking out polyleven:

```python
import polyleven

def editdist(s1, s2, max_dist):
    return polyleven.levenshtein(s1, s2, max_dist)
```
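For example (illustrative strings; polyleven's capped call returns `max_dist + 1` as soon as the true distance exceeds the cap):

```python
print(editdist('kitten', 'mitten', 1))   # 1, within the cap
print(editdist('kitten', 'sitting', 1))  # true distance is 3, so this prints 2 (max_dist + 1)
```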
That seems slower, actually:

```
%timeit fastss.editdist('uživatel', 'živočich')
310 ns ± 9.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit polyleven.levenshtein('uživatel', 'živočich')
444 ns ± 4.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

But for us, not having an external dependency was more important than a few percent here and there. That was the main goal, which is why your TinyFastSS helped so much ❤️

EDIT: I checked the polyleven source code.
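For reference, the same comparison can be reproduced without IPython's `%timeit` magic, using only the standard library (a sketch; `fastss.editdist` is the function timed above):

```python
import timeit

import fastss
import polyleven

# 7 runs of 1,000,000 loops each, mirroring the %timeit output above.
for stmt in ("fastss.editdist('uživatel', 'živočich')",
             "polyleven.levenshtein('uživatel', 'živočich')"):
    best = min(timeit.repeat(stmt, globals=globals(), number=1_000_000, repeat=7))
    print(f"{stmt}: {best / 1_000_000 * 1e9:.0f} ns per call")
```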
@piskvorky Interesting. I just posted here: piskvorky/gensim#3146, since I could not reproduce your differences. Using your example strings, I get different results on my machine.
As a comparison, I used the following empty function:
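A plausible reconstruction of that baseline (the name is hypothetical; any empty function with the same three parameters measures the same thing):

```python
# Does nothing, so timing it isolates the function-call and
# argument-parsing overhead discussed below.
def noop(s1, s2, max_dist):
    pass
```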
At least for these small strings, a big part of the runtime is the function-call overhead plus argument parsing.
Microbenchmarking like that doesn't make much sense IMO, except to get a rough order-of-magnitude idea. Check out the full benchmark in the PR you linked to; that should be more realistic, at least for the kind of application we're aiming at in Gensim: NLP, relatively short strings, definitely not random.
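To illustrate the difference, a more realistic benchmark loops over a real vocabulary instead of timing a single call (a sketch under assumptions: `words.txt` is any list of real words, one per line; this is not the actual benchmark from the PR):

```python
import time

import fastss  # assumes fastss.editdist as used above

# Assumption: words.txt holds real, relatively short, non-random words.
words = [line.strip() for line in open('words.txt', encoding='utf-8')]

start = time.perf_counter()
for word in words:
    fastss.editdist(word, 'example')
elapsed = time.perf_counter() - start
print(f"{len(words)} calls in {elapsed:.3f} s "
      f"({elapsed / len(words) * 1e9:.0f} ns per call)")
```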