-
Notifications
You must be signed in to change notification settings - Fork 629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(text): handle code points > U+FFFF in levenshteinDistance
#6014
Conversation
Benchmarks: import { levenshteinDistance as current } from "jsr:@std/text@1.0.6/levenshtein-distance";
import { levenshteinDistance as next } from "https://raw.githubusercontent.com/lionel-rowe/std/levenshtein-unicode/text/levenshtein_distance.ts";
for (const [name, fn] of Object.entries({ current, next })) {
Deno.bench(`${name} (ASCII)`, () => {
fn("a".repeat(10), "b".repeat(10));
fn("a".repeat(100), "b".repeat(100));
fn("a".repeat(10), "");
fn("a".repeat(100), "");
fn("", "b".repeat(10));
fn("", "b".repeat(100));
fn(
"a".repeat(100) + "b".repeat(100) + "a".repeat(100),
"b".repeat(100) + "a".repeat(100) + "b".repeat(100),
);
});
}
for (const [name, fn] of Object.entries({ current, next })) {
// will give wrong result with `current`, but just testing perf here
Deno.bench(`${name} (with emoji)`, () => {
fn(
"a".repeat(100) + "💩".repeat(100) + "a".repeat(100),
"💩".repeat(100) + "a".repeat(100) + "💩".repeat(100),
);
});
}
for (const len of [1e0, 1e1, 1e2, 1e3, 1e4]) {
for (const [name, fn] of Object.entries({ current, next })) {
Deno.bench(`${name} (string length ${len.toLocaleString("en-US")})`, () => {
fn("a".repeat(len), "b".repeat(len));
});
}
} Performance is almost identical on my machine, typical run:
|
Co-authored-by: ud2 <sjx233@qq.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #6014 +/- ##
=======================================
Coverage ? 96.29%
=======================================
Files ? 494
Lines ? 39541
Branches ? 5837
=======================================
Hits ? 38076
Misses ? 1423
Partials ? 42 ☔ View full report in Codecov by Sentry. |
3da5829
to
cabc32e
Compare
Previous fix was still buggy as it's tough to keep the various increments and length measurements in sync, given that sometimes you want the code point length and sometimes the code unit length. As a result I just switched to using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
Fixes #6013