From 5606d30fcc5c925da20b7450f0ccf3b249e1c8f9 Mon Sep 17 00:00:00 2001
From: Michael Howell
Date: Sat, 6 Jan 2024 10:47:01 -0700
Subject: [PATCH] Add guide for rustdoc search implementation (#1846)

---
 src/SUMMARY.md                  |   1 +
 src/rustdoc-internals/search.md | 244 ++++++++++++++++++++++++++++++++
 2 files changed, 245 insertions(+)
 create mode 100644 src/rustdoc-internals/search.md

diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index e484b6af6..a41417f48 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -74,6 +74,7 @@
 - [Serialization in Rustc](./serialization.md)
 - [Parallel Compilation](./parallel-rustc.md)
 - [Rustdoc internals](./rustdoc-internals.md)
+    - [Search](./rustdoc-internals/search.md)

 # Source Code Representation

diff --git a/src/rustdoc-internals/search.md b/src/rustdoc-internals/search.md
new file mode 100644
index 000000000..cba7c5cfd
--- /dev/null
+++ b/src/rustdoc-internals/search.md
@@ -0,0 +1,244 @@
+# Rustdoc search
+
+Rustdoc Search is two programs: `search_index.rs`
+and `search.js`. The first generates a nasty JSON
+file with a full list of items and function signatures
+in the crates in the doc bundle, and the second reads
+it, turns it into some in-memory structures, and
+scans them linearly to search.
+
+## Search index format
+
+`search.js` calls this Raw, because it turns it into
+a more normal object tree after loading it.
+Naturally, it's also written without newlines or spaces.
+
+```json
+[
+  [ "crate_name", {
+    "doc": "Documentation",
+    "n": ["function_name", "Data"],
+    "t": "HF",
+    "d": ["This function gets the name of an integer with Data", "The data struct"],
+    "q": [[0, "crate_name"]],
+    "i": [2, 0],
+    "p": [[1, "i32"], [1, "str"], [5, "crate_name::Data"]],
+    "f": "{{gb}{d}}`",
+    "b": [],
+    "c": [],
+    "a": [["get_name", 0]]
+  }]
+]
+```
+
+[`src/librustdoc/html/static/js/externs.js`]
+defines an actual schema in a Closure `@typedef`.
+
+The above index defines a crate called `crate_name`
+with a free function called `function_name` and a struct called `Data`,
+with the type signature `Data, i32 -> str`,
+and an alias, `get_name`, that equivalently refers to `function_name`.
+
+[`src/librustdoc/html/static/js/externs.js`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/externs.js#L204-L258
+
+The search index needs to fit the needs of the `rustdoc` compiler,
+the `search.js` frontend,
+and also be compact and fast to decode.
+It makes a lot of compromises:
+
+* The `rustdoc` compiler runs on one crate at a time,
+  so each crate has an essentially separate search index.
+  It [merges] them by having each crate on one line
+  and looking at the first quoted string.
+* Names in the search index are given
+  in their original case and with underscores.
+  When the search index is loaded,
+  `search.js` stores the original names for display,
+  but also folds them to lowercase and strips underscores for search.
+  You'll see them called `normalized`.
+* The `f` array stores types as offsets into the `p` array.
+  These types might actually be from another crate,
+  so `search.js` has to turn the numbers into names and then
+  back into numbers to deduplicate them if multiple crates in the
+  same index mention the same types.
+* It's a JSON file, but not designed to be human-readable.
+  Browsers already include an optimized JSON decoder,
+  so this saves on `search.js` code and performs better for small crates,
+  but instead of using objects like normal JSON formats do,
+  it tries to put data of the same type next to each other
+  so that the sliding window used by [DEFLATE] can find redundancies.
+  Where `search.js` does its own compression,
+  it's designed to save memory when the file is finally loaded,
+  not just size on disk or network transfer.
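The name folding described in the list above can be sketched in a few lines of Rust. This is only an illustration of the stated behavior (lowercase, underscores stripped): the function name `normalize` is hypothetical, and the real folding is done in JavaScript by `search.js`.

```rust
/// Hypothetical helper: folds a display name into the form used for matching.
/// Underscores are dropped and every remaining character is lowercased.
fn normalize(name: &str) -> String {
    name.chars()
        .filter(|&c| c != '_')
        .flat_map(char::to_lowercase)
        .collect()
}
```

With this folding, `function_name` becomes `functionname` and `HashMap` becomes `hashmap`, so a query does not need to match the original case or underscores.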
+
+[merges]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/render/write_shared.rs#L151-L164
+[DEFLATE]: https://en.wikipedia.org/wiki/Deflate
+
+### Parallel arrays and indexed maps
+
+Most data in the index
+(other than `doc`, which is a single string for the whole crate,
+`p`, which is a separate structure
+and `a`, which is also a separate structure)
+is a set of parallel arrays defining each searchable item.
+
+For example,
+the above search index can be turned into this table:
+
+| n | t | d | q | i | f | b | c |
+|---|---|---|---|---|---|---|---|
+| `function_name` | `H` | This function gets the name of an integer with Data | `crate_name` | 2 | `{{gb}{d}}` | NULL | NULL |
+| `Data` | `F` | The data struct | `crate_name` | 0 | `` ` `` | NULL | NULL |
+
+The above code doesn't use `c`, which holds deprecated indices,
+or `b`, which maps indices to strings.
+If `crate_name::function_name` used both, it would look like this.
+
+```json
+    "b": [[0, "impl-Foo-for-Bar"]],
+    "c": [0],
+```
+
+This attaches a disambiguator to index 0 and marks it deprecated.
+
+The advantage of this layout is that these APIs often have implicit structure
+that DEFLATE can take advantage of,
+but that rustdoc can't assume.
+Like how names are usually CamelCase or snake_case,
+but descriptions aren't.
+
+`q` is a Map from *the first applicable* ID to a parent module path.
+This is a weird trick, but it makes more sense in code:
+
+```rust
+use std::collections::HashMap;
+
+/// Resolves the parent module path of every item from the sparse `q` map.
+fn parent_modules<'a>(item_count: usize, q: &HashMap<usize, &'a str>) -> Vec<&'a str> {
+    let mut parent_module = "";
+    let mut parents = Vec::with_capacity(item_count);
+    for i in 0..item_count {
+        // Only the first item under a new path has an entry in `q`.
+        if let Some(&path) = q.get(&i) {
+            parent_module = path;
+        }
+        // Later items inherit the most recently seen path.
+        parents.push(parent_module);
+    }
+    parents
+}
+```
+
+This is valid because everything has a parent module
+(even if it's just the crate itself),
+and is easy to assemble because the rustdoc generator sorts by path
+before serializing.
+Doing this allows rustdoc to not only make the search index smaller,
+but reuse the same string representing the parent path across multiple in-memory items.
+
+### `i`, `f`, and `p`
+
+`i` and `f` both index into `p`, the array of parent items.
+
+`i` is just a one-indexed number
+(not zero-indexed because `0` is used for items that have no parent item).
+It's different from `q` because `q` represents the parent *module or crate*,
+which everything has,
+while `i`/`p` are used for *type and trait-associated items* like methods.
+
+`f`, the function signatures, use their own encoding.
+
+```ebnf
+f = { FItem | FBackref }
+FItem = FNumber | ( '{', {FItem}, '}' )
+FNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' )
+FBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
+```
+
+An FNumber is a variable-length, self-terminating base16 number
+(terminated because the last hexit is lowercase while all others are uppercase).
+These are one-indexed references into `p`, because zero is used for nulls,
+and negative numbers represent generics.
+The sign bit is represented using [zig-zag encoding]
+(the internal object representation also uses negative numbers,
+even after decoding,
+to represent generics).
+This alphabet is chosen because the characters can be turned into hexits by
+masking the ASCII encoding down to its low four bits.
+
+For example, `{{gb}{d}}` is equivalent to the json `[[3, 1], [2]]`.
+Because of zigzag encoding, `` ` `` is +0, `a` is -0 (which is not used),
+`b` is +1, and `c` is -1.
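The grammar and the zig-zag rule above can be sketched as a small recursive decoder. This is a minimal illustration under stated assumptions, not the real implementation (which lives in `search.js`): the `FItem` enum and the `decode_item` function are hypothetical names, and `FBackref` handling is left out because its semantics are not covered here.

```rust
/// Hypothetical decoded form of one signature item.
#[derive(Debug, PartialEq)]
enum FItem {
    Number(i32),
    List(Vec<FItem>),
}

/// Decodes one `FItem` starting at `*pos` and advances `*pos` past it.
/// Returns `None` on malformed or truncated input.
fn decode_item(bytes: &[u8], pos: &mut usize) -> Option<FItem> {
    if *bytes.get(*pos)? == b'{' {
        *pos += 1; // consume '{'
        let mut items = Vec::new();
        while *bytes.get(*pos)? != b'}' {
            items.push(decode_item(bytes, pos)?);
        }
        *pos += 1; // consume '}'
        return Some(FItem::List(items));
    }
    // FNumber: uppercase-range hexits continue, a lowercase-range hexit ends.
    let mut n: u32 = 0;
    loop {
        let c = *bytes.get(*pos)?;
        *pos += 1;
        match c {
            b'@'..=b'O' => n = (n << 4) | u32::from(c & 0x0F),
            b'`'..=b'o' => {
                n = (n << 4) | u32::from(c & 0x0F);
                break;
            }
            _ => return None,
        }
    }
    // Zig-zag rule as stated above: the low bit is the sign, the rest is the magnitude.
    let magnitude = (n >> 1) as i32;
    Some(FItem::Number(if n & 1 == 1 { -magnitude } else { magnitude }))
}
```

Under these rules `b` decodes to `1`, `c` to `-1`, and the two-hexit `Aa` to `-8`: its hexits are 1 and 1, giving 17, whose low bit is the sign and whose remaining bits are 8.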
+
+[empirically]: https://github.com/rust-lang/rust/pull/83003
+[zig-zag encoding]: https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding
+
+## Searching by name
+
+Searching by name works by looping through the search index
+and running these functions on each:
+
+* [`editDistance`] is always used to determine a match
+  (unless quotes are specified, which would use simple equality instead).
+  It computes the number of swaps, inserts, and removes needed to turn
+  the query name into the entry name.
+  For example, `foo` has zero distance from itself,
+  but a distance of 1 from `ofo` (one swap) and `foob` (one insert).
+  It is checked against a heuristic threshold, and then,
+  if it is within that threshold, the distance is stored for ranking.
+* [`String.prototype.indexOf`] is always used to determine a match.
+  If it returns anything other than -1, the result is added,
+  even if `editDistance` exceeds its threshold,
+  and the index is stored for ranking.
+* [`checkPath`] is used if, and only if, a parent path is specified
+  in the query. For example, `vec` has no parent path, but `vec::vec` does.
+  Within `checkPath`, `editDistance` and `indexOf` are used,
+  and the path query has its own heuristic threshold, too.
+  If it's not within the threshold, the entry is rejected,
+  even if the first two pass.
+  If it's within the threshold, the path distance is stored
+  for ranking.
+* [`checkType`] is used only if there's a type filter,
+  like the struct in `struct:vec`. If it fails,
+  the entry is rejected.
+
+If all four criteria pass
+(plus the crate filter, which isn't technically part of the query),
+the results are sorted by [`sortResults`].
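The first two checks in the list above can be sketched in Rust as follows. This is an illustration under assumptions, not the code in `search.js`: the names `edit_distance`, `NameMatch`, and `match_name` are hypothetical, the `max_distance` parameter stands in for the heuristic threshold (whose actual value is not given here), and the distance is computed with the optimal string alignment variant, which also counts single-character substitutions.

```rust
/// Edit distance counting inserts, removes, substitutions, and swaps of
/// adjacent characters (optimal string alignment).
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // dist[i][j] is the distance between the first i chars of a and the first j of b.
    let mut dist = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() {
        dist[i][0] = i;
    }
    for j in 0..=b.len() {
        dist[0][j] = j;
    }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let cost = usize::from(a[i - 1] != b[j - 1]);
            let mut best = (dist[i - 1][j] + 1)
                .min(dist[i][j - 1] + 1)
                .min(dist[i - 1][j - 1] + cost);
            // A swap of two adjacent characters counts as one edit.
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(dist[i - 2][j - 2] + 1);
            }
            dist[i][j] = best;
        }
    }
    dist[a.len()][b.len()]
}

/// Ranking data kept for an entry that matched by name.
#[derive(Debug, PartialEq)]
struct NameMatch {
    distance: Option<usize>, // set when within the threshold
    index: Option<usize>,    // set when the query is a substring
}

/// An entry matches if it is within the distance threshold or contains the query.
fn match_name(query: &str, entry: &str, max_distance: usize) -> Option<NameMatch> {
    let d = edit_distance(query, entry);
    let distance = (d <= max_distance).then_some(d);
    let index = entry.find(query);
    if distance.is_none() && index.is_none() {
        return None;
    }
    Some(NameMatch { distance, index })
}
```

With a threshold of 1, the query `vec` still matches the entry `into_vec` through the substring check even though the edit distance is too large, which mirrors the "even if `editDistance` exceeds its threshold" rule.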
+
+[`editDistance`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L137
+[`String.prototype.indexOf`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf
+[`checkPath`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1814
+[`checkType`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1787
+[`sortResults`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1229
+
+## Searching by type
+
+Searching by type can be divided into two phases,
+and the second phase has two sub-phases.
+
+* Turn names in the query into numbers.
+* Loop over each entry in the search index:
+    * Quick rejection using a bloom filter.
+    * Slow rejection using a recursive type unification algorithm.
+
+In the names->numbers phase, if the query has only one name in it,
+the `editDistance` function is used to find a near match if the exact match fails,
+but if there are multiple items in the query,
+non-matching items are treated as generics instead.
+This means `hahsmap` will match `hashmap` on its own, but `hahsmap, u32`
+is going to match the same things `T, u32` matches
+(though rustdoc will detect this particular problem and warn about it).
+
+Then, when actually looping over each item,
+the bloom filter will probably reject entries that don't have every
+type mentioned in the query.
+For example, the bloom filter allows a query of `i32 -> u32` to match
+a function with the type `i32, u32 -> bool`,
+but unification will reject it later.
+
+The unification filter ensures that:
+
+* Bag semantics are respected. If your query says `i32, i32`,
+  then the function has to mention *two* i32s, not just one.
+* Nesting semantics are respected. If your query says `vec