fix(text): handle code points > U+FFFF in levenshteinDistance #6014

Merged
merged 4 commits into from
Sep 20, 2024

Conversation

lionel-rowe
Contributor

Fixes #6013
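
For context, here is a minimal sketch of why code points above U+FFFF are awkward for string algorithms in JavaScript and TypeScript. Strings are indexed by UTF-16 code unit, so one such code point takes up two positions. The helper names `codeUnitLength` and `codePointLength` are hypothetical and exist only for this illustration; they are not part of `@std/text`.

```typescript
// Illustration only: these helpers are hypothetical, not part of @std/text.
const codeUnitLength = (s: string): number => s.length; // counts UTF-16 code units
const codePointLength = (s: string): number => [...s].length; // string iteration is per code point

const emoji = "💩"; // U+1F4A9, stored as the surrogate pair 0xD83D 0xDCA9

console.log(codeUnitLength(emoji)); // 2
console.log(codePointLength(emoji)); // 1
console.log(emoji.charCodeAt(0).toString(16)); // "d83d" (high surrogate only)
console.log(emoji.codePointAt(0)?.toString(16)); // "1f4a9" (the full code point)
```

Any edit-distance routine that walks a string by index therefore sees two "characters" where a reader sees one, unless it accounts for surrogate pairs.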

@lionel-rowe lionel-rowe requested a review from kt3k as a code owner September 18, 2024 11:32
@github-actions github-actions bot added the text label Sep 18, 2024
@lionel-rowe
Contributor Author

lionel-rowe commented Sep 18, 2024

Benchmarks:

import { levenshteinDistance as current } from "jsr:@std/text@1.0.6/levenshtein-distance";
import { levenshteinDistance as next } from "https://raw.githubusercontent.com/lionel-rowe/std/levenshtein-unicode/text/levenshtein_distance.ts";

for (const [name, fn] of Object.entries({ current, next })) {
  Deno.bench(`${name} (ASCII)`, () => {
    fn("a".repeat(10), "b".repeat(10));
    fn("a".repeat(100), "b".repeat(100));

    fn("a".repeat(10), "");
    fn("a".repeat(100), "");
    fn("", "b".repeat(10));
    fn("", "b".repeat(100));

    fn(
      "a".repeat(100) + "b".repeat(100) + "a".repeat(100),
      "b".repeat(100) + "a".repeat(100) + "b".repeat(100),
    );
  });
}

for (const [name, fn] of Object.entries({ current, next })) {
  // `current` returns a wrong distance for these inputs; only performance is measured here
  Deno.bench(`${name} (with emoji)`, () => {
    fn(
      "a".repeat(100) + "💩".repeat(100) + "a".repeat(100),
      "💩".repeat(100) + "a".repeat(100) + "💩".repeat(100),
    );
  });
}

for (const len of [1e0, 1e1, 1e2, 1e3, 1e4]) {
  for (const [name, fn] of Object.entries({ current, next })) {
    Deno.bench(`${name} (string length ${len.toLocaleString("en-US")})`, () => {
      fn("a".repeat(len), "b".repeat(len));
    });
  }
}

Performance is almost identical on my machine. A typical run:

benchmark                        time/iter (avg)        iter/s      (min … max)           p75      p99     p995
-------------------------------- ----------------------------- --------------------- --------------------------
current (ASCII)                          34.4 µs        29,030 ( 24.3 µs …   1.4 ms)  31.1 µs 128.5 µs 184.8 µs
next (ASCII)                             35.8 µs        27,930 ( 24.4 µs … 893.6 µs)  31.8 µs 134.5 µs 197.2 µs
current (with emoji)                     33.0 µs        30,280 ( 22.5 µs … 682.6 µs)  29.4 µs 129.3 µs 183.6 µs
next (with emoji)                        36.8 µs        27,200 ( 24.2 µs … 656.6 µs)  39.8 µs 110.0 µs 183.1 µs
current (string length 1)                67.9 ns    14,720,000 ( 49.7 ns … 194.9 ns)  75.8 ns 142.3 ns 155.8 ns
next (string length 1)                   69.1 ns    14,470,000 ( 52.6 ns … 206.7 ns)  73.9 ns 139.5 ns 161.2 ns
current (string length 10)              391.7 ns     2,553,000 (310.8 ns … 587.3 ns) 421.5 ns 580.7 ns 587.3 ns
next (string length 10)                 386.4 ns     2,588,000 (318.8 ns … 641.8 ns) 404.6 ns 610.2 ns 641.8 ns
current (string length 100)               6.0 µs       167,000 (  5.7 µs …   6.5 µs)   6.1 µs   6.5 µs   6.5 µs
next (string length 100)                  6.1 µs       163,500 (  5.6 µs …   7.2 µs)   6.3 µs   7.2 µs   7.2 µs
current (string length 1,000)           222.4 µs         4,496 (185.2 µs …   1.4 ms) 222.7 µs 397.6 µs 569.9 µs
next (string length 1,000)              212.3 µs         4,711 (174.1 µs …   1.1 ms) 214.1 µs 413.4 µs 525.8 µs
current (string length 10,000)           18.1 ms          55.3 ( 16.8 ms …  19.9 ms)  18.6 ms  19.9 ms  19.9 ms
next (string length 10,000)              17.4 ms          57.5 ( 16.1 ms …  19.0 ms)  17.7 ms  19.0 ms  19.0 ms

Co-authored-by: ud2 <sjx233@qq.com>

codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@a550998). Learn more about missing BASE report.
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6014   +/-   ##
=======================================
  Coverage        ?   96.29%           
=======================================
  Files           ?      494           
  Lines           ?    39541           
  Branches        ?     5837           
=======================================
  Hits            ?    38076           
  Misses          ?     1423           
  Partials        ?       42           


@lionel-rowe
Contributor Author

The previous fix was still buggy: it is tough to keep the various increments and length measurements in sync when some of them need the code point length and others the code unit length. As a result, I switched to `[...str]` arrays of code points, which, surprisingly, still has no measurable impact on performance (updated benchmarks above).
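
To illustrate the `[...str]` approach, here is a plain dynamic-programming sketch of Levenshtein distance computed over code point arrays. It is an assumption-laden illustration, not the code merged in `text/levenshtein_distance.ts`; the function name `codePointLevenshtein` is hypothetical.

```typescript
// Hypothetical sketch, not the merged implementation.
function codePointLevenshtein(a: string, b: string): number {
  // Spreading a string iterates by code point, so a surrogate pair becomes one element
  // and every later index or length refers to code points only.
  const s = [...a];
  const t = [...b];
  if (s.length === 0) return t.length;
  if (t.length === 0) return s.length;

  // prev[j] is the distance between the first i - 1 elements of s and the first j of t.
  let prev: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(codePointLevenshtein("💩", "x")); // 1
console.log(codePointLevenshtein("a💩b", "ab")); // 1
console.log(codePointLevenshtein("kitten", "sitting")); // 3
```

Because the conversion happens once up front, there is no mixing of code unit and code point offsets inside the loops, which is the source of bugs described above.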

Member

@kt3k kt3k left a comment

LGTM. Thanks!

@kt3k kt3k merged commit 4830d4d into denoland:main Sep 20, 2024
14 checks passed
Successfully merging this pull request may close these issues.

levenshteinDistance doesn't correctly handle code points over U+FFFF
3 participants