Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new DamerauLevenshtein... classes #84

Merged
merged 6 commits into from
Sep 18, 2022

Conversation

juliangilbey
Copy link
Contributor

There are two versions of the Damerau-Levenshtein distance, as described in this Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1018933 Some of the external libraries implement one of them, others the other.

This PR splits introduces two different classes: DamerauLevenshteinRestricted and DamerauLevenshteinUnrestricted, with DamerauLevenshtein being the unrestricted version, so that it is clear what is intended.

@orsinium
Copy link
Member

orsinium commented Sep 6, 2022

Great work, thank you! How different are implementations for the two algorithms? Can the same be achieved with the same implementation and restricted flag? Similarly to how JaroWinkler has winklerize flag.

@juliangilbey
Copy link
Contributor Author

I don't know of a way of easily merging the two, but I haven't thought about it much. The unresticted one is also much more complicated (double loop) than the restricted one (single loop), so I would not see any benefit in merging them either.

@juliangilbey
Copy link
Contributor Author

Oh, now I think I see what you mean; two different algorithms but accessed by a single parameterised class. Yes, one could certainly do that.

@orsinium
Copy link
Member

orsinium commented Sep 6, 2022

I can take it from here if you want. I haven't touched textdistance for a while, but that shouldn't be hard.

@juliangilbey
Copy link
Contributor Author

Yes, that would be great, thanks!

A thing that's coming up soon, though it may be somewhat harder to handle, is Python 3.10 and the dependency on abydos. It turns out that the current release of abydos is incompatible with Python 3.10, and though a patch was applied to the release branch a couple of years ago, the developer has not touched the package in about 9 months and has not released a patched version. And using the patched abydos, textdistance.benchmark fails (though the test suite still passes).

Just letting you know...

@orsinium
Copy link
Member

orsinium commented Sep 8, 2022

Yep, I also noticed it. Well, abydos is pretty low in benchmarks, we can drop it.

@orsinium orsinium self-assigned this Sep 8, 2022
)
da[cs1] = i

return d[len1, len2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really correct? we write into d[i, j] which should be at most d[len1-1, len2-1] from my understanding.

Copy link
Member

@orsinium orsinium Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember much about the algorithms, I can only rely on tests.

Copy link
Member

@orsinium orsinium Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the code. What a mess it is, oh my. We first set these values by doing iteration over range(len(s1) + 1) when initializing the matrix, And then when we do enumeration for i, cs1 in enumerate(s1):, we on the next line shift the index: i += 1 🧠 I'll try to clean it up a tiny bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is deliberate! The boundary values are needed, but are never modified later, as the algorithm looks at the preceding values in the array when calculating the new values. I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up, so I decided to be somewhat non-Pythonic and stick to the algorithm as presented on Wikipedia!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up

Tip of the day: enumerate accepts an argument start, where you can specify the initial value. I've changed the code to use it. It became a bit slower, for some reason, but IMHO a bit more readable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I completely missed the += 1 before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up

Tip of the day: enumerate accepts an argument start, where you can specify the initial value. I've changed the code to use it. It became a bit slower, for some reason, but IMHO a bit more readable.

Oh, how neat! Thanks for that tip. I agree, more readable is better.

@orsinium orsinium merged commit 19b7238 into life4:master Sep 18, 2022
@orsinium
Copy link
Member

Thank you all :)

@juliangilbey juliangilbey deleted the split-damerau-levenshtein branch December 19, 2023 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants