rustdoc: use a trie for name-based search #133005

notriddle · 2024-11-13T19:49:59Z

Potentially #131156 — need to try reproducing the problem with windows

Preview and profiler results

Here's some quick profiling in Firefox done on the rust compiler docs:

Before: https://share.firefox.dev/3UPm3M8
After: https://share.firefox.dev/40LXvYb

Here are the results from the node.js profiler:

https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with the nightly (https://doc.rust-lang.org/nightly/nightly-rustc/). Try typing typecheckercontext one character at a time, slowly.

https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

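The fuzzy match algorithm is based on Fast String Correction with Levenshtein-Automata and the corresponding implementation code in moman and Lucene. The bit-packing representation comes from Lucene, but the actual matcher is based more on fsc.py. As suggested in the paper, a trie is used to represent the FSA dictionary.

Fast String Correction with Levenshtein-Automata: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
Lucene: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
moman: https://gitlab.com/notriddle/moman-rustdoc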
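To make the idea of fuzzy matching against a trie dictionary concrete, here is a minimal sketch. It is not the PR's matcher: instead of a precomputed Levenshtein automaton it carries one row of the classic edit-distance table down each trie branch, which is a simpler technique that accepts the same names for plain Levenshtein distance. All names (`TrieNode`, `insert`, `fuzzySearch`) are hypothetical and do not come from search.js.

```javascript
// Hypothetical sketch: a trie walked with a dynamic-programming row,
// swapped in for the Levenshtein automaton that the PR actually uses.
class TrieNode {
    constructor() {
        this.children = new Map(); // UTF-16 code unit -> TrieNode
        this.isWord = false;
    }
}

function insert(root, word) {
    let node = root;
    for (let i = 0; i < word.length; i += 1) {
        const c = word.charCodeAt(i);
        if (!node.children.has(c)) {
            node.children.set(c, new TrieNode());
        }
        node = node.children.get(c);
    }
    node.isWord = true;
}

function fuzzySearch(root, query, maxDist) {
    const results = [];
    // Row for the empty prefix: reaching query[0..j] costs j insertions.
    const firstRow = [];
    for (let j = 0; j <= query.length; j += 1) {
        firstRow.push(j);
    }

    function walk(node, prefix, prevRow) {
        if (node.isWord && prevRow[query.length] <= maxDist) {
            results.push(prefix);
        }
        for (const [c, child] of node.children) {
            const row = [prevRow[0] + 1];
            for (let j = 1; j <= query.length; j += 1) {
                const cost = query.charCodeAt(j - 1) === c ? 0 : 1;
                row.push(Math.min(
                    row[j - 1] + 1,        // insertion
                    prevRow[j] + 1,        // deletion
                    prevRow[j - 1] + cost  // match or substitution
                ));
            }
            // Prune: once every cell exceeds the limit, no descendant can match.
            if (Math.min(...row) <= maxDist) {
                walk(child, prefix + String.fromCharCode(c), row);
            }
        }
    }

    walk(root, "", firstRow);
    return results.sort();
}

const root = new TrieNode();
for (const name of ["typeck", "typechecker", "typecheckercontext", "tracker"]) {
    insert(root, name);
}
// "typechekcer" swaps two letters, which counts as two substitutions.
const fuzzyHits = fuzzySearch(root, "typechekcer", 2);
```

The pruning step is what the trie buys: a whole subtree is skipped as soon as its shared prefix is already too far from the query.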

The same trie is used for prefix matching. Substring matching is done with a side table of three-character¹ windows that point into the trie.
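One way to picture the prefix lookup and the three-character side table is sketched below. This is only an illustration under assumptions: the class name `NameIndex`, its methods, and the choice to store windows as string keys pointing at the trie node where the window starts are all hypothetical, not taken from search.js.

```javascript
// Hypothetical sketch of a trie plus a side table of 3-code-unit windows.
class Node {
    constructor() {
        this.children = new Map(); // one UTF-16 code unit -> Node
        this.word = null;          // full name if a name ends here
    }
}

class NameIndex {
    constructor() {
        this.root = new Node();
        // window (3 UTF-16 code units) -> Set of nodes where that window starts
        this.windows = new Map();
    }

    add(name) {
        let node = this.root;
        for (let i = 0; i < name.length; i += 1) {
            // Windows at offset 0 are skipped: prefix lookup already covers them.
            if (i > 0 && i + 3 <= name.length) {
                const key = name.slice(i, i + 3);
                if (!this.windows.has(key)) {
                    this.windows.set(key, new Set());
                }
                this.windows.get(key).add(node);
            }
            const c = name[i];
            if (!node.children.has(c)) {
                node.children.set(c, new Node());
            }
            node = node.children.get(c);
        }
        node.word = name;
    }

    // Follow `text` downward from `node`; null if the path does not exist.
    static descend(node, text) {
        for (let i = 0; i < text.length && node; i += 1) {
            node = node.children.get(text[i]);
        }
        return node || null;
    }

    // Gather every name stored at or below `node`.
    static collect(node, out) {
        if (node.word !== null) {
            out.add(node.word);
        }
        for (const child of node.children.values()) {
            NameIndex.collect(child, out);
        }
    }

    prefixMatches(query) {
        const out = new Set();
        const node = NameIndex.descend(this.root, query);
        if (node) {
            NameIndex.collect(node, out);
        }
        return [...out].sort();
    }

    substringMatches(query) {
        const out = new Set();
        if (query.length < 3) {
            return []; // too short to have a window
        }
        for (const start of this.windows.get(query.slice(0, 3)) || []) {
            const node = NameIndex.descend(start, query);
            if (node) {
                NameIndex.collect(node, out);
            }
        }
        return [...out].sort();
    }
}

const index = new NameIndex();
for (const name of ["typechecker", "checker", "borrowck", "check_expr"]) {
    index.add(name);
}
const prefixHits = index.prefixMatches("check");
const substringHits = index.substringMatches("check");
const tooShort = index.substringMatches("ck");
```

In this sketch a query shorter than three code units has no window to look up, which lines up with the three-character minimum for mid-name matches.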

User-visible changes

I don't expect anybody to notice anything, but it does cause four changes:

- Substring matches, in the middle of a name, only apply if there are three or more characters in the search query.
- The Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, where w is the query length, so you could get looser matches for queries with 9 or more characters¹ in them.
- It uses more RAM.
- It's faster (assuming you don't swap thrash).
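The effect of the distance cap can be worked through with a little arithmetic. The sketch below rests on two assumptions that the description does not spell out: that the old limit rounded w/3 down, and that the new limit is the same formula capped at two. The function names are hypothetical.

```javascript
// Assumption: the old limit rounded w / 3 down; the source only says "w/3".
function oldLimit(queryLength) {
    return Math.floor(queryLength / 3);
}

// Assumption: the new limit is the old formula capped at two.
function newLimit(queryLength) {
    return Math.min(oldLimit(queryLength), 2);
}

// Compare both limits for a few query lengths (in UTF-16 code units).
const limitRows = [3, 6, 9, "typecheckercontext".length].map((w) => ({
    w,
    before: oldLimit(w),
    after: newLimit(w),
}));
```

Under these assumptions the two limits agree up to 8 code units and diverge from 9 onward. The 18-unit query typecheckercontext would have allowed a distance of 6 before and allows 2 now.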

¹ Technically UTF-16 code units.


rustbot · 2024-11-13T19:50:09Z

r? @fmease

rustbot has assigned @fmease.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

notriddle · 2024-11-13T19:50:21Z

r? @GuillaumeGomez

GuillaumeGomez · 2024-11-13T21:47:03Z

src/librustdoc/html/static/js/search.js

+        } else {
+            const sb = name.charCodeAt(substart);
+            let child;
+            if (this.children[sb] !== undefined) {


GuillaumeGomez: Wouldn't it be better to check this.children.length < sb?

notriddle: This is a sparse array.

GuillaumeGomez: Add this explanation and link on the field definition please. :)

notriddle: Okay, it's added.
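The point about sparse arrays can be illustrated with a small standalone sketch. It assumes a plain JavaScript array indexed by UTF-16 code unit, as the quoted diff suggests; the variable names and the stored object are hypothetical.

```javascript
// Hypothetical stand-in for a trie node's child table, indexed by code unit.
const children = [];
children["z".charCodeAt(0)] = { label: "z" }; // populates index 122 only

const a = "a".charCodeAt(0); // 97
const z = "z".charCodeAt(0); // 122

// length is one past the highest populated index, not a count of entries.
const lengthSeen = children.length;           // 123
// A length comparison claims that "a" might be present...
const lengthSaysMaybe = a < children.length;  // true
// ...but the slot is a hole, so only the undefined check is reliable.
const aPresent = children[a] !== undefined;   // false
const zPresent = children[z] !== undefined;   // true
```

A length comparison can rule out indices past the end, but it cannot confirm that a slot inside the range is populated, so the undefined check is needed either way.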

src/librustdoc/html/static/js/search.js

GuillaumeGomez · 2024-11-13T21:51:52Z

Apart from small nits, looks good to me. Performance improvement is really impressive!

rustbot · 2024-11-13T21:55:10Z

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @jsha

GuillaumeGomez · 2024-11-14T13:50:45Z

Thanks!

@bors r+

bors · 2024-11-14T13:50:47Z

📌 Commit e534f47 has been approved by GuillaumeGomez

It is now in the queue for this repository.

…llaumeGomez

Rollup of 5 pull requests

Successful merges:

- rust-lang#132172 (borrowck diagnostics: suggest borrowing function inputs in generic positions)
- rust-lang#132649 (add ./x clippy ci)
- rust-lang#133005 (rustdoc: use a trie for name-based search)
- rust-lang#133034 (update download-rustc comments and default)
- rust-lang#133036 (add myself into `users_on_vacation` on triagebot)

r? `@ghost`
`@rustbot` modify labels: rollup

Rollup merge of rust-lang#133005 - notriddle:notriddle/trie-search, r=GuillaumeGomez

rustdoc: use a trie for name-based search

rustbot assigned fmease Nov 13, 2024

rustbot added the S-waiting-on-review and T-rustdoc labels Nov 13, 2024

rustbot assigned GuillaumeGomez and unassigned fmease Nov 13, 2024

notriddle added the A-rustdoc-search and T-rustdoc-frontend labels Nov 13, 2024

GuillaumeGomez reviewed Nov 13, 2024

GuillaumeGomez reviewed Nov 13, 2024

Remove console.log

1d13399

Add descriptive comment for NameTrie

e534f47

GuillaumeGomez approved these changes Nov 14, 2024

bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Nov 14, 2024

GuillaumeGomez mentioned this pull request Nov 14, 2024

Rollup of 5 pull requests #133039

Merged

bors merged commit fc7ca70 into rust-lang:master Nov 14, 2024
6 checks passed

rustbot added this to the 1.84.0 milestone Nov 14, 2024

notriddle deleted the notriddle/trie-search branch November 14, 2024 21:55

notriddle mentioned this pull request Nov 15, 2024

rustdoc search: allow queries to end in an empty path segment #132569

Merged
