Add optional argument `in_memory` to `FastSS.__init__` and `open` to store index in dictionary #1
Conversation
On a dictionary of 375 words, a query on an index opened with `in_memory` runs a third faster.
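A rough usage sketch, based on the issue title and TinyFastSS's README-style API (the exact signature lives in the PR, and `words.dat` plus the words are placeholders):

```python
import fastss

# Default behaviour: the index is backed by a dbm database on disk.
index = fastss.open('words.dat')

# Proposed option from the issue title: keep the index in a plain
# Python dict instead, skipping dbm lookups on every query.
index = fastss.open('words.dat', in_memory=True)
index.add('hello')
print(index.query('hallo'))  # maps edit distance to matching words
```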
Since this package is in the public domain, I cleaned it up a bit, got rid of the DB stuff (in-memory + pickle is both simpler and faster), optimized it, and added it to Gensim: piskvorky/gensim#3146. Thanks @fujimotos @rominf for your great work! It helped us with our indexing.
@piskvorky For my part, it's perfectly fine to embed TinyFastSS that way. I placed this program into the public domain mostly to make it easier to reuse in other projects.
FYI, an optimized version of in-memory TinyFastSS is now included in Gensim, with query benchmarks against an index of 180,000 English words.
@piskvorky For that matter, I recommend checking out polyleven:

```python
import polyleven

def editdist(s1, s2, max_dist):
    return polyleven.levenshtein(s1, s2, max_dist)
```
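For example (illustrative strings; polyleven's capped call returns `max_dist + 1` as soon as the true distance exceeds the cap):

```python
print(editdist('kitten', 'mitten', 1))   # 1, within the cap
print(editdist('kitten', 'sitting', 1))  # true distance is 3, so this prints 2 (max_dist + 1)
```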
That seems slower, actually:

```
%timeit fastss.editdist('uživatel', 'živočich')
310 ns ± 9.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit polyleven.levenshtein('uživatel', 'živočich')
444 ns ± 4.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

But for us, not having an external dependency was more important than a few percent here and there. That was the main goal, which is why your TinyFastSS helped so much ❤️

EDIT: I checked the polyleven source code.
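For reference, the same comparison can be reproduced without IPython's `%timeit` magic, using only the standard library (a sketch; `fastss.editdist` is the function timed above):

```python
import timeit

import fastss
import polyleven

# 7 runs of 1,000,000 loops each, mirroring the %timeit output above.
for stmt in ("fastss.editdist('uživatel', 'živočich')",
             "polyleven.levenshtein('uživatel', 'živočich')"):
    best = min(timeit.repeat(stmt, globals=globals(), number=1_000_000, repeat=7))
    print(f"{stmt}: {best / 1_000_000 * 1e9:.0f} ns per call")
```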
@piskvorky Interesting. I just posted here: piskvorky/gensim#3146, since I could not reproduce your differences. Using your example strings, I get different results on my machine.
As a comparison, I used the following empty function:
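A plausible reconstruction of that baseline (the name is hypothetical; any empty function with the same three parameters measures the same thing):

```python
# Does nothing, so timing it isolates the function-call and
# argument-parsing overhead discussed below.
def noop(s1, s2, max_dist):
    pass
```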
At least for these small strings, a big part of the runtime is the function-call overhead plus argument parsing.
Microbenchmarking like that doesn't make much sense IMO, except to get a rough order-of-magnitude idea. Check out the full benchmark in the PR you linked to; that should be more realistic, at least for the kind of application we're aiming at in Gensim: NLP, relatively short strings, definitely not random.
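To illustrate the difference, a more realistic benchmark loops over a real vocabulary instead of timing a single call (a sketch under assumptions: `words.txt` is any list of real words, one per line; this is not the actual benchmark from the PR):

```python
import time

import fastss  # assumes fastss.editdist as used above

# Assumption: words.txt holds real, relatively short, non-random words.
words = [line.strip() for line in open('words.txt', encoding='utf-8')]

start = time.perf_counter()
for word in words:
    fastss.editdist(word, 'example')
elapsed = time.perf_counter() - start
print(f"{len(words)} calls in {elapsed:.3f} s "
      f"({elapsed / len(words) * 1e9:.0f} ns per call)")
```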