rustdoc's search-index.js file is huge for large projects #31387
Comments
I'll take a look at the layout of the index file tonight. 😄 My hunch is that a more efficient format for it could largely alleviate this problem. |
After some digging, it seems like there are two quick gains to be made:

Shortening the keys in function types. Right now function types get an object in the index similar to the following:

```json
{
  "inputs": [
    {"name": "producer"},
    {"name": "consumer"}
  ],
  "output": {"name": "option"}
}
```

This JSON is meant to be read by a machine, not a human. It seems to me that continuing to use these English identifiers in the machine format doesn't really get us anything. Instead, how about we focus on efficiently representing these structures in JSON and thoroughly documenting how that representation is shaped, so that others can easily make use of it.

Sorting the array of items and paths. Having these arrays be presorted before they reach a client makes it much easier for that client to search them (sorting in Rust right before we emit these is going to be way faster than asking clients to sort them in JavaScript and such). |
I think the format is currently documented here. It should be possible to completely get rid of the identifiers. Some kind of delta compression is already employed on paths. It might make sense to check how much is saved by transposing the items table (this is somewhat in conflict with omitting the elements). Some strings (mostly the typenames) are repeated several times; replacing them with an offset into an array of per-crate or per-search-index types (basically interning them) might provide significant savings. |
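The interning suggestion above can be sketched like this. This is a minimal illustration of the general technique, not rustdoc's actual scheme; `intern` is a hypothetical helper:

```javascript
// Replace repeated strings (e.g. type names) with offsets into a
// per-index table of unique strings.
function intern(names) {
  const table = [];        // unique strings, in first-seen order
  const index = new Map(); // string -> offset in `table`
  const refs = names.map((name) => {
    if (!index.has(name)) {
      index.set(name, table.length);
      table.push(name);
    }
    return index.get(name);
  });
  return { table, refs };
}

// Repeated names collapse to small integers:
const { table, refs } = intern(["Option", "Result", "Option", "Option"]);
// table -> ["Option", "Result"], refs -> [0, 1, 0, 0]
```

Each repeated occurrence then costs one or two digits in the JSON instead of the full quoted string.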
@ranma42: That documentation is helpful, but it could be improved. Also the documentation for how something is rendered should be where it is rendered, not where it is consumed. 😉
To which identifiers are you referring?
These are all good suggestions, but I think we should try knocking out the low-hanging fruit first before attempting more complex optimizations. |
I tried to evaluate the effect of the simpler changes (on the search index generated for the rust repo).
|
I was referring to |
@ranma42: Gotcha. Dropping/shortening those keys definitely seems worth doing! 😄 To be honest, 15 megabytes (see the issue description) isn't that much data. So I think we should also focus on improving the actual search JavaScript that runs on the front-end. I did a cursory review of that and it seems like there are some improvements to be made. Will take a stab at that tonight (and of course link my PR to this issue). |
Could you zip it on the server and unzip on the client? Failing that, we should move to a new format. Optimised JSON is a terrible idea: either use JSON because it is clear what is getting serialised to what and it's human-readable, or move to an efficient binary format to be efficient. Trying to optimise JSON gets you the worst of both worlds: a format which is easy to make mistakes with and isn't human-readable, but is still inefficient. |
Or move to an architecture where you don't need to download the whole search index to the client (which honestly seems like a terrible thing to do). |
Triage: no real changes here |
@nrc Do you have suggestions for how this could work? |
Hmm, I'm not convinced this is true - using JSON means we can use the browser's optimized JSON parser. |
Recent state for the standard library can be found here: https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/search.20index.20size/near/411848235

New numbers for different parts of the search index. The benchmarking tool source is:

```shell
set -ex
curl https://doc.rust-lang.org/nightly/search-index1.77.0.js > search-index.js
awk 'FNR==NR {a++;next} FNR<a-3' search-index.js{,} | awk 'NR>1 {gsub(/\],\\$/,""); gsub(/^\["[^"]+",/,""); print} {next}' | sed -E "s:\\\\':':g" > search-index.json
jq -c '.t' search-index.json > t.json
jq -c '.n' search-index.json > n.json
jq -c '.q' search-index.json > q.json
jq -c '.d' search-index.json > d.json
jq -c '.i' search-index.json > i.json
jq -c '.f' search-index.json > f.json
jq -c '.c' search-index.json > c.json
jq -c '.p' search-index.json > p.json
jq -c '.a' search-index.json > a.json
du -hs t.json n.json q.json d.json i.json f.json c.json p.json a.json
```

And its output is:
Breaking down each section:
|
See https://bugzilla.mozilla.org/show_bug.cgi?id=1245213#c3
We should fix SpiderMonkey / the profiler to be smarter, but 15.2 MB is a lot of JS to load on every page load. Looks like this file will also create a ton of JS objects/strings.
Can we load this file only when we're using the search bar? Maybe we can split it up somehow and load only the relevant parts? Or come up with a more efficient format for it?
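The "split it up and load only the relevant parts" idea above can be sketched as sharding the index (say, per crate) and fetching a shard only when it is first searched. Everything here is hypothetical: `makeLazyIndex` and `fetchShard` are illustrative names, and in a browser the loader would be `fetch()` or a dynamic `import()`:

```javascript
// Build a search function that pulls in index shards on demand and
// caches them, so each shard is downloaded at most once.
function makeLazyIndex(fetchShard) {
  const cache = new Map();
  return async function search(crate, query) {
    if (!cache.has(crate)) {
      cache.set(crate, await fetchShard(crate)); // loaded at most once
    }
    return cache.get(crate).filter((item) => item.name.includes(query));
  };
}

// Usage with an in-memory stand-in for the network:
const shards = {
  std: [{ name: "Vec" }, { name: "VecDeque" }],
  core: [{ name: "Option" }],
};
const search = makeLazyIndex(async (crate) => shards[crate]);
// await search("std", "Vec") matches both Vec items; the "core" shard
// is never fetched unless a query touches it.
```

Page load then costs nothing for the index; the price is paid only by users who actually open the search bar, and only for the shards their query needs.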