Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustdoc: use a trie for name-based search #133005

Merged
merged 3 commits into from
Nov 14, 2024

Conversation

notriddle
Copy link
Contributor

@notriddle notriddle commented Nov 13, 2024

Potentially #131156 — need to try reproducing the problem with windows

Preview and profiler results

Here's some quick profiling in Firefox done on the rust compiler docs:

Here's the results for the node.js profiler:

Here's a copy that you can use to try it out. Compare it with the nightly. Try typing typecheckercontext one character at a time, slowly.

The fuzzy match algo is based on Fast String Correction with Levenshtein-Automata and the corresponding implementation code in moman and Lucene; the bit-packing representation comes from Lucene, but the actual matcher is more based on fsc.py. As suggested in the paper, a trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a side table of three-character1 windows that point into the trie.

User-visible changes

I don't expect anybody to notice anything, but it does cause two changes:

  • Substring matches, in the middle of a name, only apply if there's three or more characters in the search query.
  • Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters1 in them.
  • It uses more RAM.
  • It's faster (assuming you don't swap thrash).

Footnotes

  1. technically utf-16 code units 2

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly].
Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with
Levenshtein-Automata] and the corresponding implementation code in [moman]
and [Lucene]; the bit-packing representation comes from Lucene, but the
actual matcher is more based on `fsc.py`. As suggested in the paper, a
trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a
side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three
  or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version,
  the limit was w/3, so you could get looser matches for queries with
  9 or more characters[^1] in them.

[^1]: technically utf-16 code units
@rustbot
Copy link
Collaborator

rustbot commented Nov 13, 2024

r? @fmease

rustbot has assigned @fmease.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Nov 13, 2024
@notriddle
Copy link
Contributor Author

r? @GuillaumeGomez

@rustbot rustbot assigned GuillaumeGomez and unassigned fmease Nov 13, 2024
@notriddle notriddle added A-rustdoc-search Area: Rustdoc's search feature T-rustdoc-frontend Relevant to the rustdoc-frontend team, which will review and decide on the web UI/UX output. labels Nov 13, 2024
} else {
const sb = name.charCodeAt(substart);
let child;
if (this.children[sb] !== undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to check this.children.length < sb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a sparse array.

Copy link
Member

@GuillaumeGomez GuillaumeGomez Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this explanation and link on the field definition please. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, it's added.

@GuillaumeGomez
Copy link
Member

Apart from small nits, looks good to me. Performance improvement is really impressive!

@rustbot
Copy link
Collaborator

rustbot commented Nov 13, 2024

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @jsha

@GuillaumeGomez
Copy link
Member

Thanks!

@bors r+

@bors
Copy link
Contributor

bors commented Nov 14, 2024

📌 Commit e534f47 has been approved by GuillaumeGomez

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 14, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 14, 2024
…llaumeGomez

Rollup of 5 pull requests

Successful merges:

 - rust-lang#132172 (borrowck diagnostics: suggest borrowing function inputs in generic positions)
 - rust-lang#132649 (add ./x clippy ci)
 - rust-lang#133005 (rustdoc: use a trie for name-based search)
 - rust-lang#133034 (update download-rustc comments and default)
 - rust-lang#133036 (add myself into `users_on_vacation` on triagebot)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit fc7ca70 into rust-lang:master Nov 14, 2024
6 checks passed
@rustbot rustbot added this to the 1.84.0 milestone Nov 14, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Nov 14, 2024
Rollup merge of rust-lang#133005 - notriddle:notriddle/trie-search, r=GuillaumeGomez

rustdoc: use a trie for name-based search

Potentially rust-lang#131156 — need to try reproducing the problem with `windows`

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`. As suggested in the paper, a trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them.
- It uses more RAM.
- It's faster (assuming you don't swap thrash).

[^1]: technically utf-16 code units
@notriddle notriddle deleted the notriddle/trie-search branch November 14, 2024 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rustdoc-search Area: Rustdoc's search feature S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. T-rustdoc-frontend Relevant to the rustdoc-frontend team, which will review and decide on the web UI/UX output.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants