rustdoc's search-index.js file is huge for large projects #31387

jandem · 2016-02-03T15:13:28Z

See https://bugzilla.mozilla.org/show_bug.cgi?id=1245213#c3

We should fix SpiderMonkey / the profiler to be smarter, but 15.2 MB is a lot of JS to load on every page load. Looks like this file will also create a ton of JS objects/strings.

Can we load this file only when we're using the search bar? Maybe we can split it up somehow and load only the relevant parts? Or come up with a more efficient format for it?

dirk · 2016-02-03T23:48:38Z

I'll take a look at the layout of the index file tonight. 😄 My hunch is that a more efficient format for it could largely alleviate this problem.

dirk · 2016-02-04T05:37:30Z

After some digging, it seems like there are two quick gains to be made:

Shortening the keys in function types

Right now function types get an object in the index similar to the following:

{
  "inputs":  [
    {"name": "producer"},
    {"name": "consumer"}
  ],
  "output":  {"name": "option"}
}

This JSON is meant to be read by a machine, not a human. It seems to me that continuing to use these English identifiers in the machine format doesn't really get us anything. Instead, how about we focus on efficiently representing these structures in JSON and thoroughly documenting how that representation is shaped, so that others can easily make use of it?
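A minimal sketch of one possible flattening, assuming every function type has at least an output name and that the last array element is always the output. The helper names `compact_fn_type` and `expand_fn_type` are hypothetical and not part of rustdoc.

```python
import json

# Verbose shape quoted in the comment above.
verbose = {
    "inputs": [{"name": "producer"}, {"name": "consumer"}],
    "output": {"name": "option"},
}

def compact_fn_type(fn_type):
    """Hypothetical encoding: input names first, output name last."""
    names = [arg["name"] for arg in fn_type["inputs"]]
    names.append(fn_type["output"]["name"])
    return names

def expand_fn_type(names):
    """Inverse of compact_fn_type; assumes the last element is the output."""
    *inputs, output = names
    return {
        "inputs": [{"name": name} for name in inputs],
        "output": {"name": output},
    }

compact = compact_fn_type(verbose)

# Compare serialized sizes without any whitespace.
verbose_size = len(json.dumps(verbose, separators=(",", ":")))
compact_size = len(json.dumps(compact, separators=(",", ":")))
print(compact, verbose_size, compact_size)
```

The round trip through `expand_fn_type` is what the documentation of such a format would need to spell out, since the array no longer describes itself.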

Sorting the array of items and paths

Having these arrays be presorted before they reach a client makes it much easier for that client to search them (sorting in Rust right before we emit these is going to be way faster than asking clients to sort them in JavaScript and such).
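As an illustration of why presorted arrays help a client, here is a sketch of a prefix lookup that uses binary search instead of scanning the whole array. The comment does not say how clients actually search, so the prefix-search use case, the function name `prefix_matches`, and the sample names are assumptions.

```python
from bisect import bisect_left

def prefix_matches(sorted_names, prefix):
    """Return all names starting with prefix; requires a sorted input list."""
    # In sorted order, every string that starts with the prefix is
    # contiguous and begins at the insertion point of the prefix itself.
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break
        matches.append(name)
    return matches

# The index producer would sort once, before emitting the file.
names = sorted(["vec", "option", "result", "vec_deque", "string", "str"])
print(prefix_matches(names, "str"))
```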

ranma42 · 2016-02-04T10:23:57Z

I think the format is currently documented here

It should be possible to completely get rid of the identifiers.

Some kind of delta compression is already employed on paths (a "" path means the previous path).
Additional savings in the file size (mostly for the parser, as hopefully the file is sent compressed over the Internet) might be possible by using shorter representations (example: omitting the element or using 0 instead of null or "" should be possible, as the type of each field is known).
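The delta scheme described above, where an empty string stands for "same as the previous path", can be sketched as follows. The function names and sample paths are hypothetical, and the sketch assumes that no real path is the empty string.

```python
def encode_paths(paths):
    """Replace a path with "" when it repeats the previous one."""
    encoded, previous = [], None
    for path in paths:
        encoded.append("" if path == previous else path)
        previous = path
    return encoded

def decode_paths(encoded):
    """Expand "" back into the most recent non-empty path."""
    decoded, previous = [], None
    for path in encoded:
        if path == "":
            path = previous
        decoded.append(path)
        previous = path
    return decoded

paths = ["std::vec", "std::vec", "std::vec", "std::option", "std::option"]
encoded = encode_paths(paths)
print(encoded)
```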

It might make sense to check how much is saved by transposing the items table (this is somewhat in conflict with omitting the elements).
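To make "transposing the items table" concrete, here is a small sketch that turns a list of per-item rows into one array per field. The row layout `[kind, name, path, description]` and the sample values are assumptions for illustration only.

```python
# Hypothetical row layout: [kind, name, path, description].
rows = [
    [5, "push", "std::vec", "Appends an element."],
    [5, "pop", "std::vec", "Removes the last element."],
    [3, "Vec", "std::vec", "A growable array."],
]

def transpose(table):
    """Swap rows and columns; every row must have the same length."""
    return [list(column) for column in zip(*table)]

columns = transpose(rows)
print(columns[1])
```

Note that `zip` stops at the shortest row, so a row with omitted trailing fields would silently truncate every column. That is one way to see the conflict with omitting elements mentioned above.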

Some strings (mostly the typenames) are repeated several times. Replacing them with an offset in an array of per-crate or per-search-index types (basically interning them) might provide significant savings.
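A minimal sketch of the interning idea: each distinct string is stored once in a table, and every occurrence is replaced by its offset in that table. The function name `intern_strings` and the sample type names are hypothetical.

```python
def intern_strings(values):
    """Return (table, indices) such that table[indices[k]] == values[k]."""
    table, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            # First occurrence: append to the table and remember its offset.
            index_of[value] = len(table)
            table.append(value)
        indices.append(index_of[value])
    return table, indices

type_names = ["usize", "str", "usize", "option", "str", "usize"]
table, indices = intern_strings(type_names)
print(table, indices)
```

Whether this pays off depends on how often and how long the repeated strings are, which is the question raised in the replies below.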

dirk · 2016-02-04T18:21:13Z

@ranma42: That documentation is helpful, but it could be improved. Also the documentation for how something is rendered should be where it is rendered, not where it is consumed. 😉

> It should be possible to completely get rid of the identifiers.

To which identifiers are you referring?

> Some strings (mostly the typenames) are repeated several times. Replacing them with an offset in an array of per-crate or per-search-index types (basically interning them) might provide significant savings.

A cursory review of some generated search-index.js files makes me think the benefits of this wouldn't be that great compared to some of the simpler optimizations that could be made.

These are all good suggestions, but I think we should try knocking out the low-hanging fruit first before attempting more complex optimizations.

ranma42 · 2016-02-04T19:11:17Z

I tried to evaluate the effect of the simpler changes (on the search index generated for the rust repo)

| Size | Change |
| --- | --- |
| 1131406 | Original |
| 981061 | Drop "name" keys `{"inputs":["producer","consumer"],"output":"option"}` |
| 847959 | Drop all keys, use array of inputs,output types `["producer","consumer","option"]` |
| 778439 | Also reorder fields and omit them when "default" (null/last path/...) |
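The relative savings implied by these sizes can be worked out directly. The short labels below are paraphrases of the table rows, and the sizes are copied from the table unchanged.

```python
# Sizes copied from the table above; labels are shortened paraphrases.
sizes = [
    ("original", 1131406),
    ("drop name keys", 981061),
    ("drop all keys", 847959),
    ("reorder and omit defaults", 778439),
]

baseline = sizes[0][1]
savings = {
    label: round(100 * (baseline - size) / baseline, 1)
    for label, size in sizes[1:]
}

for label, percent in savings.items():
    print(f"{label}: {percent}% smaller than the original")
```

So the three steps reduce the index by roughly 13%, 25%, and 31% respectively, measured against the original size.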

ranma42 · 2016-02-04T19:30:25Z

> To which identifiers are you referring?

I was referring to "inputs", "output", and "name". In the search index there are also "items" and "paths", but each of them appears only once per crate, so removing them is unlikely to provide significant improvements except in very degenerate indices (many crates with few elements each?).

dirk · 2016-02-04T19:36:26Z

@ranma42: Gotcha. Dropping/shortening those keys definitely seems worth doing! 😄

To be honest, 15 megabytes (see the issue description) isn't that much data. So I think we should also focus on improving the actual search JavaScript that runs on the front-end. I did a cursory review of that and it seems like there are some improvements to be made. Will take a stab at that tonight (and of course link my PR to this issue).

nrc · 2016-08-02T20:15:28Z

Could you zip it on the server and unzip it on the client? Failing that, we should move to a new format. Optimised JSON is a terrible idea: either use JSON because it is clear what is getting serialised to what and it's human-readable, or move to an efficient binary format to be efficient. Trying to optimise JSON gets you the worst of both worlds: a format which is easy to make mistakes with and isn't human-readable, but is still inefficient.
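To get a feel for what compression does to key-heavy JSON, here is a small experiment on synthetic data. The data is made up for illustration, so the ratio says nothing about a real search index. Compression reduces transfer size only: the client still has to parse the full decompressed text.

```python
import gzip
import json

# Synthetic entries that mimic the repeated-key shape discussed earlier.
items = [
    {"inputs": [{"name": f"arg{i % 7}"}], "output": {"name": "option"}}
    for i in range(1000)
]

raw = json.dumps(items, separators=(",", ":")).encode("utf-8")
packed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
```

In practice a web server can apply this transparently with `Content-Encoding: gzip`, and the browser decompresses the response before the script runs.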

nrc · 2016-08-02T20:17:07Z

Or move to an architecture where you don't need to download the whole search index to the client (which honestly seems like a terrible thing to do).

steveklabnik · 2018-10-31T17:14:09Z

Triage: no real changes here

jyn514 · 2021-01-22T18:49:05Z

> Or move to an architecture where you don't need to download the whole search index to the client (which honestly seems like a terrible thing to do).

@nrc Do you have suggestions for how this could work?

jyn514 · 2021-04-12T15:30:58Z

> Failing that, we should move to a new format. Optimised JSON is a terrible idea: either use JSON because it is clear what is getting serialised to what and it's human-readable, or move to an efficient binary format to be efficient.

Hmm, I'm not convinced this is true. Using JSON means we can use the browser's optimized JSON.parse, whereas using a binary format means we'd have to parse it in JavaScript. It's definitely worth looking into, but I don't think it should block other improvements to the JSON in the meantime.

notriddle · 2024-03-08T19:04:13Z

Recent state for the standard library can be found here: https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/search.20index.20size/near/411848235

New numbers for different parts of the search index. The benchmarking tool source is:

set -ex
curl https://doc.rust-lang.org/nightly/search-index1.77.0.js > search-index.js
awk 'FNR==NR {a++;next} FNR<a-3' search-index.js{,} | awk 'NR>1 {gsub(/\],\\$/,""); gsub(/^\["[^"]+",/,""); print} {next}' | sed -E "s:\\\\':':g" > search-index.json
jq -c '.t' search-index.json > t.json
jq -c '.n' search-index.json > n.json
jq -c '.q' search-index.json > q.json
jq -c '.d' search-index.json > d.json
jq -c '.i' search-index.json > i.json
jq -c '.f' search-index.json > f.json
jq -c '.c' search-index.json > c.json
jq -c '.p' search-index.json > p.json
jq -c '.a' search-index.json > a.json
du -hs t.json n.json q.json d.json i.json f.json c.json p.json a.json

And its output is:

$ bash test.sh 
+ curl https://doc.rust-lang.org/nightly/search-index1.77.0.js
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4296k  100 4296k    0     0  25.3M      0 --:--:-- --:--:-- --:--:-- 25.4M
+ awk 'FNR==NR {a++;next} FNR<a-3' search-index.js search-index.js
+ awk 'NR>1 {gsub(/\],\\$/,""); gsub(/^\["[^"]+",/,""); print} {next}'
+ sed -E 's:\\'\'':'\'':g'
+ jq -c .t search-index.json
+ jq -c .n search-index.json
+ jq -c .q search-index.json
+ jq -c .d search-index.json
+ jq -c .i search-index.json
+ jq -c .f search-index.json
+ jq -c .c search-index.json
+ jq -c .p search-index.json
+ jq -c .a search-index.json
+ du -hs t.json n.json q.json d.json i.json f.json c.json p.json a.json
68K	t.json
800K	n.json
16K	q.json
2.0M	d.json
200K	i.json
532K	f.json
12K	c.json
44K	p.json
24K	a.json

Breaking down each section:

t = types (specifically, the item kind, like "struct" or "function")
n = names
q = paths
d = descriptions
i = parent (indexes into p)
f = function signature
c = deprecation notice
p = parents
a = aliases
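The share of each section can be estimated from the `du` output above. This is approximate: `du -hs` rounds its figures, and the sketch treats `2.0M` as 2048K.

```python
# Sizes in K as printed by du above; 2.0M is taken as 2048K.
sizes_k = {
    "t": 68, "n": 800, "q": 16, "d": 2048, "i": 200,
    "f": 532, "c": 12, "p": 44, "a": 24,
}

total_k = sum(sizes_k.values())
shares = {key: round(100 * size / total_k, 1) for key, size in sizes_k.items()}

# Print the sections from largest to smallest.
for key, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{key}: {share}%")
```

By this estimate, descriptions (`d`) account for a little over half of the extracted data, followed by names (`n`) and function signatures (`f`).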

alexcrichton added the T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. label Feb 3, 2016

mitaa mentioned this issue Feb 25, 2016

Check the search-index generated by rustdoc tests #31483

Closed

ollie27 mentioned this issue Aug 2, 2016

rustdoc: simplify URLs #35020

Closed

steveklabnik added the T-tools label Mar 9, 2017

steveklabnik added T-dev-tools Relevant to the dev-tools subteam, which will review and decide on the PR/issue. and removed T-tools labels May 18, 2017

Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 24, 2017

jyn514 mentioned this issue Jan 22, 2021

Optimizing docs generation microsoft/windows-rs#420

Closed

ehuss removed the T-dev-tools Relevant to the dev-tools subteam, which will review and decide on the PR/issue. label Jan 18, 2022

aatifsyed mentioned this issue Nov 22, 2023

Don't deploy Rust docs to GH pages ChainSafe/forest#3731

Closed


fmease added I-heavy Issue: Problems and improvements with respect to binary size of generated code. A-rustdoc-search Area: Rustdoc's search feature labels Sep 3, 2024
