
I made a fork with wheels #61

Closed
polm opened this issue Jan 11, 2021 · 29 comments

@polm

polm commented Jan 11, 2021

I forked this package and put wheels (OSX, Windows, Linux) on PyPI as levenshtein.

https://pypi.org/project/levenshtein/

This should fix the compiler-related installation issues that people seem to be having.

The last release of this project is from 2014, and I can't find any comments from the maintainer since 2019, so I assume they've moved on. At the moment the fork contains just the minimal changes needed to build wheels, but if there's interest I would be up for maintaining it.

@ztane
Owner

ztane commented Jan 18, 2021

Howdy, there is no need to fork; I can pass the maintainership forward. I have not used python-Levenshtein for ages. Back in the day it was the only fast alternative when I needed lots of string matching, but its license has been a problem lately.

@Lucas-C

Lucas-C commented Jan 29, 2021

Just FYI, there is another published version with wheels: https://pypi.org/project/python-Levenshtein-wheels/ by @Tobotimus, who already opened a PR about wheel support on this repo two years ago: #36

Maybe the people who worked on forks & alternative PyPI releases could team up and pool their efforts on this repo,
given your proposal to hand over maintainership, @ztane?

The next step could be to migrate this project into https://github.com/python-Levenshtein/python-Levenshtein and give GitHub maintainer permissions to a few volunteers who have already contributed to this package? Or submit it to https://jazzband.co?

@ztane
Owner

ztane commented Feb 1, 2021

@Lucas-C I looked into creating an organization. To create an organization on GitHub, one needs to set up a single communications email for the organization, which presumably will be used for recovery and such.

@polm
Author

polm commented Feb 1, 2021

Is having a single communications email a problem? I think there still needs to be a primary maintainer if it's an organization, but presumably that email wouldn't have to be used regularly once other admin users are added.

@maxbachmann

@ztane I am working on a project with a similar feature set: https://github.com/maxbachmann/RapidFuzz (in fact created mostly because of the license), and I am willing to help maintain this project.

@polm
Author

polm commented May 12, 2021

This project is MIT-licensed and implements Levenshtein distance, so it may be a good alternative.

https://github.com/Martinsos/edlib

@maxbachmann

maxbachmann commented May 19, 2021

edlib provides some useful information about the alignment. However, at least for short inputs (< 128 characters) it is a lot slower than python-Levenshtein. I performed a benchmark of multiple Levenshtein implementations a while ago.

@ztane What is your current plan on the maintenance? A couple of things that I think would be useful (I would be willing to work on them):

  • Adding wheels using cibuildwheel
  • Releasing new versions as GitHub Releases + tags
  • Adding unit tests + fuzz tests
  • Updating the usage of the Python C API so the library keeps working properly in future versions of Python (many of the APIs used have been deprecated for years)

@woctezuma

woctezuma commented May 20, 2021

If I understand this plot [1] correctly, then among these libraries, which i) provide support for Levenshtein distance ii) in Python, rapidfuzz [2] and polyleven [3] are a lot faster than the others iii) for every input string length.

If true, then these two libraries look more enticing anyway... so I guess I am missing something about the support for python-Levenshtein. Does python-Levenshtein provide some utility that these two other libraries lack?

On a side-note, I don't understand why the runtime seems quadratic w.r.t. the input length N for python-Levenshtein and linear for other libraries, while you mention in [1] that the time complexity:

  • for python-Levenshtein is O(N*M), so O(N^2) if N=M,
  • in the worst case for rapidfuzz is O(N*M) as well. You mention O([N/64]*M), but that is O(N*M) as well, right?

Shouldn't all of these plots show a quadratic curve in the case N=M?

Or is it quadratic, but due to the division by 64, the x-axis is effectively scaled by a factor of 8 when N=M?
In this case, shouldn't the performance with rapidfuzz and N=512 be on par with or better than python-Levenshtein and N=64?
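As a reference point for the complexity discussion: the quadratic baseline is the textbook dynamic-programming algorithm, which can be sketched in Python as follows (an editorial illustration, not code from any of the libraries mentioned):

```python
def levenshtein_dp(a: str, b: str) -> int:
    """Textbook O(N*M) dynamic-programming Levenshtein distance."""
    # prev[j] holds the distance between the previous prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = cur
    return prev[len(b)]
```

For N = M this performs N*M inner-loop steps, which is the quadratic growth visible in the python-Levenshtein curve.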

[Plot: Performance comparison]

Sorry if I look inquisitive, but I am curious now. :)

[1] https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein
[2] https://github.com/maxbachmann/RapidFuzz
[3] https://github.com/fujimotos/polyleven

@maxbachmann

maxbachmann commented May 20, 2021

So I guess I am missing something about the support for python-Levenshtein. Does python-Levenshtein provide some utility that these two other libraries lack?

For most use cases these libraries would be a better fit, since they are faster and more permissively licensed. However, there are many people using python-Levenshtein, probably for multiple reasons. Some I can think of:

  1. It has existed for a long time
  2. It is easier to find a library called Levenshtein than one called RapidFuzz or Polyleven

It makes sense to fix some of the long-standing bugs and make it compatible with future Python versions. It would even be possible to bring the faster Levenshtein implementation to python-Levenshtein as well. AFAIK this should not take a lot of time, since I already work on very similar libraries.

in the worst case for rapidfuzz is O(N*M) as well. You mention O([N/64]*M), but that is O(N*M) as well, right?

All the other libraries use an implementation based on Myers or Hyyrö. They convert one of the two strings into bitvectors of computer-word length. Afterwards they iterate over the second string and update the Levenshtein distance for a whole bitvector at once using bitwise operations. All of these implementations use 64-bit computer words, though it would be possible to extend this further to 128/256/512 bits using SIMD. You can clearly see this behavior in the graph:

  1. There are steps every 64 characters, since beyond that another bitvector is required
  2. There is linear growth in between, since each character of the second string is compared against each bitvector

Especially in the graph of polyleven you can see the start of the quadratic growth.
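The bit-parallel scheme described above can be sketched in Python. This is an illustrative re-derivation of the single-word Myers/Hyyrö algorithm for patterns of at most 64 characters, not code from any of the libraries discussed (the masking emulates fixed-width word arithmetic):

```python
def levenshtein_bitparallel(a: str, b: str) -> int:
    """Single-word bit-parallel Levenshtein distance (Myers/Hyyrö), len(a) <= 64."""
    m = len(a)
    if m == 0:
        return len(b)
    if m > 64:
        raise ValueError("single-word sketch supports patterns up to 64 chars")
    mask = (1 << m) - 1        # emulate a fixed-width computer word
    high = 1 << (m - 1)
    # bitmask of positions where each character occurs in the pattern
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    vp, vn = mask, 0           # vertical positive/negative delta vectors
    score = m
    for ch in b:               # one pass of word-wide bit operations per char
        eq = peq.get(ch, 0)
        d0 = ((((eq & vp) + vp) & mask) ^ vp) | eq | vn
        hp = vn | (~(d0 | vp) & mask)
        hn = vp & d0
        if hp & high:
            score += 1
        if hn & high:
            score -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = hn | (~(d0 | hp) & mask)
        vn = hp & d0
    return score
```

Each character of the second string costs a constant number of word operations, which is why the curves grow only linearly until the pattern exceeds one 64-bit word.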

In this case, shouldn't the performance with rapidfuzz and N=512 be on par with or better than python-Levenshtein and N=64?

You cannot really compare them this way, since they use completely different algorithms, so each step already takes a different amount of time. This can be seen for editdistance and edlib: both use the bit-parallel approach, but implement it very inefficiently (e.g. one of them uses a hash map, which is very slow).

In my opinion RapidFuzz is the best suited of these libraries, for the following reasons (as the author I am super biased):

  • MIT License
  • fastest Levenshtein implementation in many cases. Polyleven uses a very similar implementation, but RapidFuzz has special handling for more cases, e.g. for similar strings or for strings containing only extended ASCII
  • It allows the use of custom weights for Insert/Deletion/Substitution in the Levenshtein distance
  • It provides all the metrics of FuzzyWuzzy with efficient implementations
  • It provides special functions to calculate the similarity for multiple strings at once. This can make a huge difference in performance, since it can pre-calculate parts of the algorithm that are reused and does not have to convert between Python and C types all the time (Python function calls are very slow)
  • It provides wheels for all platforms supported on PyPI
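The one-to-many speedup mentioned above can be illustrated with a toy sketch (a hypothetical class, not RapidFuzz's actual API): the part that depends only on the query — here a character-position bitmask table for the single-word bit-parallel algorithm — is computed once and reused for every candidate string.

```python
class BitmaskPattern:
    """Toy one-to-many matcher: precompute the query's bitmask table once,
    then score many candidates. Hypothetical illustration, not RapidFuzz code."""

    def __init__(self, pattern: str):
        if len(pattern) > 64:
            raise ValueError("single-word sketch supports patterns up to 64 chars")
        self.m = len(pattern)
        self.peq = {}  # char -> bitmask of its positions in the pattern
        for i, ch in enumerate(pattern):
            self.peq[ch] = self.peq.get(ch, 0) | (1 << i)

    def distance(self, text: str) -> int:
        """Levenshtein distance to `text`, reusing the precomputed table."""
        if self.m == 0:
            return len(text)
        mask = (1 << self.m) - 1
        high = 1 << (self.m - 1)
        vp, vn, score = mask, 0, self.m
        for ch in text:
            eq = self.peq.get(ch, 0)
            d0 = ((((eq & vp) + vp) & mask) ^ vp) | eq | vn
            hp = vn | (~(d0 | vp) & mask)
            hn = vp & d0
            if hp & high:
                score += 1
            if hn & high:
                score -= 1
            hp = ((hp << 1) | 1) & mask
            hn = (hn << 1) & mask
            vp = hn | (~(d0 | hp) & mask)
            vn = hp & d0
        return score

# Build the table once, then reuse it for every candidate:
pattern = BitmaskPattern("levenshtein")
best = min(["levenstein", "leuvenstein", "hamming"], key=pattern.distance)
```

Amortizing the setup work (and, in a C extension, the Python-to-C string conversion) over many candidates is what makes such batch functions noticeably faster than calling a two-string function in a loop.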

@maxbachmann

@ztane any updates?

@ztane
Owner

ztane commented Aug 10, 2021

@maxbachmann hey, yeah, I do not really have much to update. Unfortunately, I've not needed approximate string matching at scale for ages and therefore would have only limited time for python-Levenshtein. And since the library is still so widely used, I'd hesitate to make any changes to it in my limited time. It would perhaps make the most sense to have a library that's drop-in API-compatible with python-Levenshtein (users just need to change the import) and point people to it.

@maxbachmann

maxbachmann commented Aug 15, 2021

@ztane As far as I understood you, you prefer a fork with a new package name over passing on ownership of python-Levenshtein. @polm does not have the time to take on maintenance of the project, so he handed over ownership of the PyPI project Levenshtein to me. I created a fork of the project, which can be found here: https://github.com/maxbachmann/Levenshtein. It will remain API-compatible with python-Levenshtein (same import + same API), so all that changes for users is that they need to install Levenshtein instead of python-Levenshtein.

I just released version 0.13.0 which:

  • adds wheels for all Python/OS/arch combinations starting with Python 3.5 (Python 2.7 is still supported, but building wheels for it is probably not worth the effort anymore)
  • fixes a couple of minor bugs of python-Levenshtein
  • uses the bit-parallel distance and ratio implementations from https://github.com/maxbachmann/rapidfuzz-cpp, so they are now significantly faster
  • reduces the usage of deprecated Python C APIs
  • uses git tags for releases to simplify building from source
  • uses Sphinx for the docs instead of parsing sources and generating a custom HTML doc page: https://maxbachmann.github.io/Levenshtein/levenshtein.html#

The full release notes can be found here: https://github.com/maxbachmann/Levenshtein/releases/tag/v0.13.0

@ztane
Owner

ztane commented Aug 16, 2021

@maxbachmann well, I can pass on the maintainership of the python-Levenshtein PyPI project and would gladly do so. The reason why I've been stuck as the maintainer is that back in the day there was no fast enough library for different distance calculations for Python 3; there were PRs already and miohtama didn't have time to merge and test the Python 3 changes, so I did it to scratch an itch.

Back then, GPL was an acceptable licensing option for that project. But right now I am constantly doing projects where relicensing the entire code base is an impossibility, so what I do is simply not use python-Levenshtein. In my opinion, having a utility library licensed under the GPL is a mistake. I would rather see people not fork the codebase but do a complete rewrite under some sensible license, and I could find time to help with that.

Right now it is rather silly for me to spend time "maintaining" (i.e. adding new features, writing documentation that did not exist in the first place) a project that I cannot use myself, a project based on algorithms whose papers I've not read, answering questions about algorithms whose correctness I cannot vouch for.

@graingert

@maxbachmann well, I can pass on the maintainership of the python-Levenshtein PyPI project and would gladly do so. The reason why I've been stuck as the maintainer is that back in the day there was no fast enough library for different distance calculations for Python 3; there were PRs already and miohtama didn't have time to merge and test the Python 3 changes, so I did it to scratch an itch.

Back then, GPL was an acceptable licensing option for that project. But right now I am constantly doing projects where relicensing the entire code base is an impossibility, so what I do is simply not use python-Levenshtein. In my opinion, having a utility library licensed under the GPL is a mistake. I would rather see people not fork the codebase but do a complete rewrite under some sensible license, and I could find time to help with that.

Right now it is rather silly for me to spend time "maintaining" (i.e. adding new features, writing documentation that did not exist in the first place) a project that I cannot use myself, a project based on algorithms whose papers I've not read, answering questions about algorithms whose correctness I cannot vouch for.

Why not transfer the project to jazzband?

@polm
Author

polm commented Aug 16, 2021

Hey, just wanted to clarify that, like @maxbachmann said, my situation has changed since I opened this issue and I no longer have time to maintain this. Anyone who's using the code is welcome to the wheel code I added, and I've transferred the levenshtein name on PyPI to Max.

I think @ztane's points about the license are correct. Since it sounds like Max is rewriting things anyway, maybe it makes sense to have a v2 branch that preserves the API but replaces the entire implementation from scratch, and re-license that. The point that most people using the library aren't following development, and that this gives merit to continuing to use the name, is absolutely true.

@maxbachmann

maxbachmann commented Aug 16, 2021

Back then, GPL was an acceptable licensing option for that project. But right now I am constantly doing projects where relicensing the entire code base is an impossibility, so what I do is simply not use python-Levenshtein. In my opinion, having a utility library licensed under the GPL is a mistake. I would rather see people not fork the codebase but do a complete rewrite under some sensible license, and I could find time to help with that.

Since it sounds like Max is rewriting things anyway, maybe it makes sense to have a v2 branch that preserves the API but replaces the entire implementation from scratch, and re-license that. The point that most people using the library aren't following development, and that this gives merit to continuing to use the name, is absolutely true.

Yes, I am actually already rewriting most things in C++ under the MIT license, since:

  1. I agree that a utility library like this should not use the GPL (in fact a big share of the projects using python-Levenshtein violate this license)
  2. C++ support nowadays is good enough that it is really not worth the pain to write the algorithms in C (happy to hear your opinion on C vs C++ usage)

Current state of things

Python 2.7 support

Do we still need Python 2.7 support for a rewrite? python-Levenshtein still has around 10% of its downloads on Python 2.7, so even though it requires some effort (string handling worked differently in Python 2.7), it might be worth maintaining a bit longer.

C Library

@ztane are there many people using this as a C library? C++ is a lot simpler to use for these algorithms, since we often match many different string types, so templates are very helpful. And in my experience it is simple enough to write a C wrapper for something like this when needed.

Rewritten algorithms

All the algorithms I rewrite are currently placed in https://github.com/maxbachmann/rapidfuzz-cpp, which is then used in both https://github.com/maxbachmann/Levenshtein and https://github.com/maxbachmann/RapidFuzz, and they are all MIT-licensed.

distance / ratio / hamming

These algorithms are all already implemented in rapidfuzz, so I simply reuse its optimized implementations.

jaro / jaro-winkler

The implementation in python-Levenshtein has a few bugs. I developed a much faster implementation of the algorithm; however, it still requires some more development time. It will afterwards be placed in rapidfuzz-cpp.

median, median_improve, quickmedian, setmedian, seqratio, setratio

I have no clue about these algorithms, and since I am not able to copy them from the current implementation (license), I am currently unable to re-implement them.

editops / opcodes / inverse / apply_edit / matching_blocks / subtract_edit

They have really weird APIs, which do not use keyword arguments and instead overload on the argument count.
It has also led to many bug reports that matching_blocks does not actually return the same result as difflib. For rapidfuzz I created a fast C++ rewrite of the algorithm from difflib; however, I am still unsure how to properly replace the current mess.
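For reference, the difflib behavior that matching_blocks is often expected to mirror looks like this (example adapted from the standard library's documented SequenceMatcher usage):

```python
import difflib

# Matching blocks between two strings; the final zero-length block is a sentinel.
sm = difflib.SequenceMatcher(None, "abxcd", "abcd")
blocks = sm.get_matching_blocks()
# Each block is an (a, b, size) triple meaning a[a:a+size] == b[b:b+size].
for block in blocks:
    print(block)
```

Any replacement API that claims difflib compatibility would need to reproduce these triples, including the trailing sentinel.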

Licensing

I would generally love to use a more open license for the project. However, I would probably need help from someone with the median algorithms I do not know.

Future of python-Levenshtein

For users it would probably help to keep publishing to python-Levenshtein, so their software keeps working in upcoming Python versions.

Collaboration

I personally do not like the idea of transferring the project to jazzband, since we are not really just a Python project; e.g. there are definitely people using rapidfuzz-cpp as a standalone C++ library (someone recently packaged it for Conan as well). However, especially if someone else like @ztane is interested in helping with more openly licensed implementations, it could be worth placing:

  • RapidFuzz
  • rapidfuzz-cpp
  • Levenshtein/python-Levenshtein

in a common organisation.

@ztane
Owner

ztane commented Aug 16, 2021

I'd say one good reason for the rewrite is dropping Python 2 support entirely. It would make the code cleaner too.

As for C++, it is probably fine to use a C++ compiler, but C++ can be tricky otherwise in that one needs to recompile for different stdlib implementations, more so than for C, so it would be nice if the C++ code didn't use much of the standard library. The C++ templates and such would still be fine.

@maxbachmann

maxbachmann commented Aug 16, 2021

I'd say one good reason for the rewrite is dropping Python 2 support entirely. It would make the code cleaner too.

Generally I would split it into a Python wrapper and the implementation of the algorithms. The current macro-based selection is broken in the latest version, e.g. safe_malloc is only defined when Python is used.

As for C++, it is probably fine to use a C++ compiler, but C++ can be tricky otherwise in that one needs to recompile for different stdlib implementations, more so than for C, so it would be nice if the C++ code didn't use much of the standard library. The C++ templates and such would still be fine.

Which parts of the stdlib/C++ do you consider problematic? I currently use:

  • C++11
  • type_traits -> mostly for SFINAE
  • a few algorithm/numeric functions
  • an std::string_view backport
    -> mostly to combine const CharT* + size_t into a container which can internally be used with things like iterators
  • std::vector
    -> mostly so I do not have to manually allocate + free
  • std::basic_string
    -> used in the interface for functions which preprocess strings, and internally when modifications of the string_view are required (e.g. to sort words in the text)
  • std::pair/std::tuple when I was too lazy to create a struct for the return value
  • exceptions (mostly used for memory errors)

Note that the implementation so far is mostly header-only, since the functions are either templates or very short. Most things could probably be implemented using fewer of these features, but it should be documented which features can and can't be used.

@maxbachmann

@maxbachmann well, I can pass on the maintainership of the python-Levenshtein PyPI project and would gladly do so. The reason why I've been stuck as the maintainer is that back in the day there was no fast enough library for different distance calculations for Python 3; there were PRs already and miohtama didn't have time to merge and test the Python 3 changes, so I did it to scratch an itch.

@ztane I think this would be a good first step, so I can upload new releases for python-Levenshtein as well (https://pypi.org/user/maxbachmann/)

@maxbachmann

@ztane ping regarding the maintainership of the python-Levenshtein PyPI project

@maxbachmann

@ztane it has been 2 months. Are there any updates on this?

@maxbachmann

I reached similar performance improvements for Levenshtein.jaro_winkler, which will be added to the project once the JaroWinkler package is released.

@ztane is there any update on this? I would love to take up maintainership of the project.

@maxbachmann

maxbachmann commented Jan 11, 2022

In addition, I significantly improved the editops implementation in RapidFuzz. It is not only faster but also requires 16x less memory. This will be added to the Levenshtein project once RapidFuzz v2.0.0 is released.
[Plot: editops benchmark]

@maxbachmann

Released v0.18.0, which includes the performance improvements and fixes multiple bugs from the original implementation. E.g. the following code previously allocated a string using an uninitialized variable as its size:

>>> import Levenshtein
>>> Levenshtein.quickmedian(["tes", "teste"], [0, 0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+91594b00 is not in range [U+0000; U+10ffff]

@ztane Do you still plan to hand over maintainership of this project? It has now been around a year since my original offer to maintain this project. I can completely understand if you want to continue maintaining the project. However, I would appreciate it if you could give an update on this so that I don't just waste my time here.

@woctezuma

I don't think you are wasting your time, as some people have found your project thanks to this issue and your plots. ❤️

@maxbachmann

I finally had the time to continue working on the library. The latest release finalizes the work required to keep the Levenshtein library working in Python 3.12 when the deprecated Unicode APIs get removed (see #54).

@maxbachmann

maxbachmann commented Sep 27, 2022

I now have maintainer privileges for the python-Levenshtein project on PyPI and will continue to upload subsequent releases to both the Levenshtein and python-Levenshtein projects.

Since after installation the project was always used as Levenshtein anyway, I converted python-Levenshtein into a metapackage which simply installs the corresponding version of Levenshtein (e.g. python-Levenshtein==0.20.4 will install Levenshtein==0.20.4 in the background). This avoids clashes when people install both Levenshtein and python-Levenshtein, which would previously install into the same directory.
The code for this metapackage can be found here: https://github.com/maxbachmann/python-Levenshtein
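A metapackage of this kind can be sketched with a minimal setup.py. This is a hypothetical illustration of the approach described above, not the actual contents of the linked repository:

```python
# Hypothetical setup.py for a metapackage: it ships no code of its own and
# only depends on the matching version of the real package.
from setuptools import setup

VERSION = "0.20.4"  # released in lockstep with the Levenshtein package

setup(
    name="python-Levenshtein",
    version=VERSION,
    description="Metapackage that installs the Levenshtein package",
    install_requires=[f"Levenshtein=={VERSION}"],
    packages=[],  # no importable modules of its own
)
```

Because the metapackage contains no modules, installing both names no longer writes into the same site-packages directory.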

@woctezuma

Congratulations! 🎉

@polm
Author

polm commented Mar 8, 2023

Closing since @maxbachmann has taken care of this. Thanks Max!
