rustdoc's search-index.js file is huge for large projects #31387

Open
jandem opened this issue Feb 3, 2016 · 13 comments
Labels
A-rustdoc-search Area: Rustdoc's search feature C-enhancement Category: An issue proposing an enhancement or a PR with one. I-heavy Issue: Problems and improvements with respect to binary size of generated code. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

Comments

@jandem

jandem commented Feb 3, 2016

See https://bugzilla.mozilla.org/show_bug.cgi?id=1245213#c3

We should fix SpiderMonkey / the profiler to be smarter, but 15.2 MB is a lot of JS to load on every page load. Looks like this file will also create a ton of JS objects/strings.

Can we load this file only when we're using the search bar? Maybe we can split it up somehow and load only the relevant parts? Or come up with a more efficient format for it?
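One way to picture "load only the relevant parts" is to split the index into shards and consult just the shard a query needs. The following is a minimal sketch under assumptions: the sharding key (lowercased first character), the helper names, and prefix-only matching are all hypothetical and not taken from rustdoc.

```python
from collections import defaultdict


def shard_by_first_char(names):
    """Group item names into shards keyed by their lowercased first character."""
    shards = defaultdict(list)
    for name in names:
        key = name[0].lower() if name else ""
        shards[key].append(name)
    return dict(shards)


def search(shards, query):
    """Inspect only the shard that matches the query's first character."""
    if not query:
        return []
    q = query.lower()
    shard = shards.get(q[0], [])
    return [name for name in shard if name.lower().startswith(q)]


names = ["Vec", "VecDeque", "HashMap", "Option", "vec_from_elem"]
shards = shard_by_first_char(names)
results = search(shards, "vec")
print(results)
```

In a browser each shard would be a separate file fetched on demand. A scheme like this only helps prefix queries; matching in the middle of a name would still need more than one shard.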

@alexcrichton alexcrichton added the T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. label Feb 3, 2016
@dirk
Contributor

dirk commented Feb 3, 2016

I'll take a look at the layout of the index file tonight. 😄 My hunch is that a more efficient format for it could largely alleviate this problem.

@dirk
Contributor

dirk commented Feb 4, 2016

After some digging, it seems like there are two quick gains to be made:

Shortening the keys in function types

Right now function types get an object in the index similar to the following:

{
  "inputs":  [
    {"name": "producer"},
    {"name": "consumer"}
  ],
  "output":  {"name": "option"}
}

This JSON is meant to be read by a machine, not a human. It seems to me that continuing to use these English identifiers in the machine format doesn't really get us anything. Instead, how about we focus on efficiently representing these structures in JSON and thoroughly documenting how that representation is shaped, so that others can easily make use of it?
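As a rough illustration of the key-shortening idea, the sketch below renames the long keys of the example object and compares the serialized sizes. The short names "i", "o", and "n" are hypothetical; the thread does not settle on specific replacements.

```python
import json

# Hypothetical short names, chosen only for this illustration.
SHORT_KEYS = {"inputs": "i", "output": "o", "name": "n"}


def shorten(value):
    """Recursively rename known long keys to their short form."""
    if isinstance(value, dict):
        return {SHORT_KEYS.get(k, k): shorten(v) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten(v) for v in value]
    return value


def compact(value):
    """Serialize without optional whitespace, as an emitted index would be."""
    return json.dumps(value, separators=(",", ":"))


verbose = {
    "inputs": [{"name": "producer"}, {"name": "consumer"}],
    "output": {"name": "option"},
}
short = shorten(verbose)
print(len(compact(verbose)), len(compact(short)))
```

Whatever mapping is chosen has to be documented alongside the code that emits the index, since the shortened form is no longer self-describing.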

Sorting the array of items and paths

Having these arrays be presorted before they reach a client makes it much easier for that client to search them (sorting in Rust right before we emit these is going to be way faster than asking clients to sort them in JavaScript and such).
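To illustrate why a presorted array is easier to search, here is a minimal sketch of a prefix lookup that uses binary search to find the first candidate instead of scanning every entry. The sample names and the `prefix_matches` helper are made up for the example; the real front-end is JavaScript and may match differently.

```python
from bisect import bisect_left


def prefix_matches(sorted_names, prefix):
    """Return all names starting with `prefix` from an already-sorted list."""
    index = bisect_left(sorted_names, prefix)  # first entry >= prefix
    matches = []
    while index < len(sorted_names) and sorted_names[index].startswith(prefix):
        matches.append(sorted_names[index])
        index += 1
    return matches


# In the proposal the sort would happen once, in Rust, at generation time.
names = sorted(["option", "vec", "hashmap", "vecdeque", "ordering"])
found = prefix_matches(names, "ve")
print(found)  # ['vec', 'vecdeque']
```

The lookup only works if the client and the generator agree on the sort order, so the order itself would need to be part of the documented format.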

@ranma42
Contributor

ranma42 commented Feb 4, 2016

I think the format is currently documented here

It should be possible to completely get rid of the identifiers.

Some kind of delta compression is already employed on paths (a "" path means the previous path).
Additional savings in the file size (mostly for the parser, as hopefully the file is sent compressed over the Internet) might be possible by using shorter representations (example: omitting the element or using 0 instead of null or "" should be possible, as the type of each field is known).
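The path delta compression mentioned above (an empty string standing for "same as the previous path") can be sketched as follows. Only that rule comes from the thread; the function names and sample paths are illustrative.

```python
def compress_paths(paths):
    """Replace a path with "" when it repeats the one before it."""
    encoded = []
    previous = None
    for path in paths:
        encoded.append("" if path == previous else path)
        previous = path
    return encoded


def expand_paths(encoded):
    """Undo compress_paths: "" means 'reuse the most recent non-empty path'."""
    expanded = []
    previous = ""
    for path in encoded:
        if path != "":
            previous = path
        expanded.append(previous)
    return expanded


paths = ["std::vec", "std::vec", "std::vec", "std::option", "std::option"]
encoded = compress_paths(paths)
print(encoded)  # ['std::vec', '', '', 'std::option', '']
```

The scheme pays off most when items are grouped by path, which is one more reason the ordering of the arrays matters.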

It might make sense to check how much is saved by transposing the items table (this is somewhat in conflict with omitting the elements).
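For readers unfamiliar with the term, transposing the items table means storing one array per field instead of one array per item. A minimal sketch, assuming a hypothetical three-field row layout (kind, name, path) that is not the actual rustdoc layout:

```python
import json


def transpose(table):
    """Turn a list of equal-length rows into a list of columns (or back)."""
    return [list(column) for column in zip(*table)]


def compact(value):
    return json.dumps(value, separators=(",", ":"))


# Hypothetical row layout: [kind, name, path].
rows = [
    [3, "Vec", "std::vec"],
    [11, "new", "std::vec"],
    [11, "push", "std::vec"],
    [4, "Option", "std::option"],
    [11, "map", "std::option"],
]
columns = transpose(rows)
print(len(compact(rows)), len(compact(columns)))
```

In this sketch the byte saving comes only from having fewer brackets, so it is small; any larger benefit would have to come from fewer arrays for the parser to allocate. The conflict with omitting elements is also visible here: `zip` needs every row to have the same length, so per-row trailing defaults cannot simply be dropped.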

Some strings (mostly the typenames) are repeated several times. Replacing them with an offset in an array of per-crate or per-search-index types (basically interning them) might provide significant savings.
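The interning idea can be sketched as a string table plus a list of offsets into it. This is an illustration only; the sample type names and the `intern_strings` helper are assumptions, not part of the existing format.

```python
def intern_strings(values):
    """Return (table, indices) such that table[indices[k]] == values[k]."""
    table = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(table)
            table.append(value)
        indices.append(index_of[value])
    return table, indices


type_names = ["option", "vec", "option", "string", "vec", "option"]
table, indices = intern_strings(type_names)
print(table)    # ['option', 'vec', 'string']
print(indices)  # [0, 1, 0, 2, 1, 0]
```

Each repeated name then costs a small integer instead of a quoted string, at the price of one extra lookup on the client.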

@dirk
Contributor

dirk commented Feb 4, 2016

@ranma42: That documentation is helpful, but it could be improved. Also the documentation for how something is rendered should be where it is rendered, not where it is consumed. 😉

> It should be possible to completely get rid of the identifiers.

To which identifiers are you referring?

> Some strings (mostly the typenames) are repeated several times. Replacing them with an offset in an array of per-crate or per-search-index types (basically interning them) might provide significant savings.

A cursory review of some generated search-index.js files makes me think the benefits of this wouldn't be that great compared to some of the simpler optimizations that could be made.

These are all good suggestions, but I think we should try knocking out the low-hanging fruit first before attempting more complex optimizations.

@ranma42
Contributor

ranma42 commented Feb 4, 2016

I tried to evaluate the effect of the simpler changes (on the search index generated for the rust repo)

| Size | Change |
| --- | --- |
| 1131406 | Original |
| 981061 | Drop "name" keys: {"inputs":["producer","consumer"],"output":"option"} |
| 847959 | Drop all keys, use array of inputs,output types: ["producer","consumer","option"] |
| 778439 | Also reorder fields and omit them when "default" (null/last path/...) |
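To make the first three rows concrete, the sketch below builds the three encodings for the single example entry from earlier in the thread and measures each one. It assumes the positional form stores the output type last; the figures it prints are for this one entry, not for the whole index.

```python
import json


def compact(value):
    return json.dumps(value, separators=(",", ":"))


original = {
    "inputs": [{"name": "producer"}, {"name": "consumer"}],
    "output": {"name": "option"},
}

# Row 2: keep "inputs"/"output" but drop the inner "name" keys.
no_name_keys = {
    "inputs": [entry["name"] for entry in original["inputs"]],
    "output": original["output"]["name"],
}

# Row 3: drop all keys; inputs first, output last (an assumed convention).
positional = no_name_keys["inputs"] + [no_name_keys["output"]]

sizes = [len(compact(v)) for v in (original, no_name_keys, positional)]
print(sizes)
```

A purely positional array cannot by itself distinguish "no output" from "last input", so a real encoding would need a convention for that case.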

@ranma42
Contributor

ranma42 commented Feb 4, 2016

> To which identifiers are you referring?

I was referring to "inputs", "output", and "name". In the search index there are also "items" and "paths", but each of them appears only once per crate, so removing them is unlikely to provide significant improvements except in very degenerate indices (many crates with few elements each?).

@dirk
Contributor

dirk commented Feb 4, 2016

@ranma42: Gotcha. Dropping/shortening those keys definitely seems worth doing! 😄

To be honest, 15 megabytes (see the issue description) isn't that much data. So I think we should also focus on improving the actual search JavaScript that runs on the front-end. I did a cursory review of that and it seems like there are some improvements to be made. Will take a stab at that tonight (and of course link my PR to this issue).

@nrc
Member

nrc commented Aug 2, 2016

Could you zip it on the server and unzip on the client? Failing that, we should move to a new format. Optimised JSON is a terrible idea - either use JSON because it is clear what is getting serialised to what and it's human readable, or move to an efficient binary format to be efficient. Trying to optimise JSON gets you the worst of both worlds - a format which is easy to make mistakes with and isn't human readable, but is still inefficient.
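As a rough sketch of the "zip on the server, unzip on the client" suggestion, the snippet below gzips a synthetic, highly repetitive index and restores it. The 1000 identical entries are made-up data, so the ratio it shows says nothing about a real index.

```python
import gzip
import json

# Synthetic data: the same function-type entry repeated many times.
entry = {
    "inputs": [{"name": "producer"}, {"name": "consumer"}],
    "output": {"name": "option"},
}
entries = [entry] * 1000

raw = json.dumps(entries, separators=(",", ":")).encode("utf-8")
packed = gzip.compress(raw)  # what a server would send
restored = json.loads(gzip.decompress(packed).decode("utf-8"))  # client side

print(len(raw), len(packed))
```

Compression of this kind reduces bytes on the wire only. The client still has to parse the full JSON and allocate every object and string, which is the cost the original report is about.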

@nrc
Member

nrc commented Aug 2, 2016

Or move to an architecture where you don't need to download the whole search index to the client (which honestly seems like a terrible thing to do).

@steveklabnik steveklabnik added T-dev-tools Relevant to the dev-tools subteam, which will review and decide on the PR/issue. and removed T-tools labels May 18, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 24, 2017
@steveklabnik
Member

Triage: no real changes here

@jyn514
Member

jyn514 commented Jan 22, 2021

> Or move to an architecture where you don't need to download the whole search index to the client (which honestly seems like a terrible thing to do).

@nrc Do you have suggestions for how this could work?

@jyn514
Member

jyn514 commented Apr 12, 2021

> Failing that, we should move to a new format. Optimised JSON is a terrible idea - either use JSON because it is clear what is getting serialised to what and it's human readable, or move to an efficient binary format to be efficient.

Hmm, I'm not convinced this is true - using JSON means we can use the browser's optimized JSON.parse, using a binary format means we'd have to parse it in JavaScript. It's definitely worth looking into, but I don't think it should block other improvements to the JSON in the meantime.

@ehuss ehuss removed the T-dev-tools Relevant to the dev-tools subteam, which will review and decide on the PR/issue. label Jan 18, 2022
@notriddle
Contributor

Recent state for the standard library can be found here: https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/search.20index.20size/near/411848235

New numbers for different parts of the search index. The benchmarking tool source is:

set -ex
curl https://doc.rust-lang.org/nightly/search-index1.77.0.js > search-index.js
awk 'FNR==NR {a++;next} FNR<a-3' search-index.js{,} | awk 'NR>1 {gsub(/\],\\$/,""); gsub(/^\["[^"]+",/,""); print} {next}' | sed -E "s:\\\\':':g" > search-index.json
jq -c '.t' search-index.json > t.json
jq -c '.n' search-index.json > n.json
jq -c '.q' search-index.json > q.json
jq -c '.d' search-index.json > d.json
jq -c '.i' search-index.json > i.json
jq -c '.f' search-index.json > f.json
jq -c '.c' search-index.json > c.json
jq -c '.p' search-index.json > p.json
jq -c '.a' search-index.json > a.json
du -hs t.json n.json q.json d.json i.json f.json c.json p.json a.json

And its output is:

$ bash test.sh 
+ curl https://doc.rust-lang.org/nightly/search-index1.77.0.js
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4296k  100 4296k    0     0  25.3M      0 --:--:-- --:--:-- --:--:-- 25.4M
+ awk 'FNR==NR {a++;next} FNR<a-3' search-index.js search-index.js
+ awk 'NR>1 {gsub(/\],\\$/,""); gsub(/^\["[^"]+",/,""); print} {next}'
+ sed -E 's:\\'\'':'\'':g'
+ jq -c .t search-index.json
+ jq -c .n search-index.json
+ jq -c .q search-index.json
+ jq -c .d search-index.json
+ jq -c .i search-index.json
+ jq -c .f search-index.json
+ jq -c .c search-index.json
+ jq -c .p search-index.json
+ jq -c .a search-index.json
+ du -hs t.json n.json q.json d.json i.json f.json c.json p.json a.json
68K	t.json
800K	n.json
16K	q.json
2.0M	d.json
200K	i.json
532K	f.json
12K	c.json
44K	p.json
24K	a.json

Breaking down each section:

  • t = types (specifically, the item kind, like "struct" or "function")
  • n = names
  • q = paths
  • d = descriptions
  • i = parent (indexes into p)
  • f = function signature
  • c = deprecation notice
  • p = parents
  • a = aliases
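To put the `du` figures above in proportion, this small sketch computes each section's share of the total. It treats "2.0M" as 2048K, which is an approximation because `du -h` rounds its output.

```python
# Sizes in KiB as printed by `du -hs` above; "2.0M" is taken as 2048K,
# an approximation since du rounds its human-readable output.
sizes_kib = {
    "t": 68,
    "n": 800,
    "q": 16,
    "d": 2048,
    "i": 200,
    "f": 532,
    "c": 12,
    "p": 44,
    "a": 24,
}

total = sum(sizes_kib.values())
shares = {key: size / total for key, size in sizes_kib.items()}
largest = max(shares, key=shares.get)

# Print sections from largest to smallest share.
for key, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{key}: {share:.1%}")
```

Under these rounded figures, descriptions (d) account for a little over half of the total, with names (n) and function signatures (f) next.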

@fmease fmease added I-heavy Issue: Problems and improvements with respect to binary size of generated code. A-rustdoc-search Area: Rustdoc's search feature labels Sep 3, 2024