Proposal: Deterministic loading of data from path #4626

Closed
chmac opened this issue Mar 20, 2018 · 15 comments

@chmac
Contributor

chmac commented Mar 20, 2018

tl;dr Could we remove the map of path to data file in app-*.js and instead try to fetch data by converting the link path to a data filename, handling 404s if it doesn't exist, etc?

History

I'm experimenting with a Gatsby site that has ~3.5k pages. The bundle sizes are like so:

2.7M app-*.js
240K chunk-manifest.json
300K commons-*.js

I haven't fully understood Gatsby's data structure, but checking the network tools shows that app-*.js is loaded as soon as the page finishes loading.

It seems like the current architecture uses webpack to build a map from every page path to the relevant data file on disk. This means that as the number of pages grows, the site's bundle size grows with it. I presume this approach will not scale very well for sites with 10k or 100k pages.

Idea

Would it be possible to deterministically map path to data file? Further, if we could do that (which I guess we could), would it be possible to skip the "list of pages" and just fetch data by transforming the path variable into its data file?
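
A minimal sketch of what such a deterministic lookup might look like. The `/static/d/` prefix, the `index` name for the root path, and the helper names `dataFileForPath` and `loadPageData` are all assumptions for illustration, not Gatsby APIs. Note that a scheme like this has no content hash in the file name.

```typescript
// Hypothetical shape of a page's data file.
type PageData = Record<string, unknown>;

// Only the parts of a fetch response this sketch relies on.
interface MinimalResponse {
  status: number;
  json(): Promise<unknown>;
}

// Derive the data file name from the page path alone, with no lookup table.
// The "/static/d/" prefix and the "index" fallback are assumed conventions.
function dataFileForPath(pagePath: string): string {
  // Drop leading and trailing slashes so "/blog/post/" and "/blog/post" agree.
  const trimmed = pagePath.replace(/^\/+|\/+$/g, "");
  return `/static/d/${trimmed === "" ? "index" : trimmed}.json`;
}

// Fetch the data for a path, returning null when the file does not exist.
async function loadPageData(
  pagePath: string,
  fetchFn: (url: string) => Promise<MinimalResponse>
): Promise<PageData | null> {
  const response = await fetchFn(dataFileForPath(pagePath));
  if (response.status === 404) return null; // no such page
  if (response.status !== 200) {
    throw new Error(`Unexpected status ${response.status}`);
  }
  return (await response.json()) as PageData;
}
```

In a browser, `fetchFn` could simply be `(url) => fetch(url)`; passing it in keeps the sketch testable without a network.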

Extra thoughts

  • I'm super new to Gatsby core development, with zero experience with webpack, so my idea might be total nonsense in that context. If so, apologies, feel free to close.
  • This would remove the content based hash from the filename, which is a very useful technique for cache busting, etc.
@pieh
Contributor

pieh commented Mar 20, 2018

AFAIK the only reason hashes need to be there is exactly for cache busting. I am currently working on speeding up the build process and am touching this part of the code (but no real change there: the map to files is still dumped to a single file). I'd be interested to hear ideas on how we could handle that so it would scale more nicely (but not by removing hashes from files :) ).

@vinniejames

This would be a nice feature as well. In the docs, it looks like maybe it could override the $path variable in the template, using the filesystem path instead.

@chmac
Contributor Author

chmac commented Mar 20, 2018

@pieh Is your work in a branch somewhere that I could use as a starting point to try and dig into this?

I fear that this comes from deep in Gatsby's architecture and so could be difficult to change / refactor. I'm definitely willing to dig into it and see what I can figure out.

@pieh
Contributor

pieh commented Mar 20, 2018

@chmac My branch is here https://github.com/pieh/gatsby/tree/json-loader

More context about it - together with @m-allanson we are working on speeding up the build and develop process by moving the bundling and loading of json data (results of queries) out of webpack and doing that directly in gatsby. So this doesn't actually focus on reducing app bundle size, but I do make a small change to async-requires.js, which is really just one big map to resources (to page/layout components and to json files with results of queries). Here's WIP PR #4555 with code from @m-allanson (related to the develop part); my part (build speed-up) is mentioned in the first comment there along with a link to the branch/changes.

If you wish to dive into the code, here are some entry points you might want to check (links to the current master branch, as I don't change too much in this department and my branch is still WIP):

  • pages-writer.js which writes out async-requires.js
  • loader.js which handles loading data and components needed for current page from async-requires.js during runtime

Before making changes in the code we should probably figure out how to design it so it doesn't increase build time too much and produces more manageable bundles.

@chmac
Contributor Author

chmac commented Mar 22, 2018

@pieh Awesome, thanks for the tips, that's a huge help. I've spent a few days deep diving into this stuff.

Here's what I've understood (please correct me if I've misunderstood any of this stuff, it's a real possibility).

  • The json-loader branch moves data out of webpack
  • A new jsonName property is added to pages
  • A map of jsonName to .json file name is written to static-data-paths.json
  • This data is built into app-*.js and embedded in every page HTML body

Goals

I'd suggest the following goals:

  • Remove the list of pages from the HTML and the core javascript. These files should not grow with every new page.
  • Retain content hashes in all non-HTML for cache busting.

Idea

Here's one idea about how we could move towards those goals.

  • Move static-data-paths.json out of the built HTML
  • Move it out of app-*.js as well
  • Create an async function to take a jsonName and return a JSON filename
  • Refactor to use this new function

The async function would allow us to fetch static-data-paths.json on demand. It would also allow us to come up with more complicated schemes in the future. For example, we could shard the file, splitting it into chunks, and only fetch each chunk as it's required. However, switching from this being sync to async might make refactoring difficult.
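
A rough sketch of the async lookup described above, assuming the map is a flat object of jsonName to file name. The names `createJsonFileLookup` and `DataPaths` are hypothetical, and how the map is actually fetched is left to the caller.

```typescript
// Assumed shape of static-data-paths.json: jsonName -> hashed file name.
type DataPaths = Record<string, string>;

// Build a lookup that loads the map lazily, on the first call only.
function createJsonFileLookup(
  loadMap: () => Promise<DataPaths>
): (jsonName: string) => Promise<string | undefined> {
  let cached: Promise<DataPaths> | undefined;
  return async (jsonName: string) => {
    // Cache the promise itself so concurrent callers share one request.
    if (cached === undefined) {
      cached = loadMap();
    }
    const map = await cached;
    return map[jsonName]; // undefined when the page is unknown
  };
}

// Possible browser usage (the URL is an assumption):
//   const getJsonFile = createJsonFileLookup(async () => {
//     const response = await fetch("/static-data-paths.json");
//     return (await response.json()) as DataPaths;
//   });
//   const fileName = await getJsonFile("blog-my-post");
```

Because callers only see `jsonName -> Promise<file name>`, the loader behind it could later be swapped for a sharded one without touching call sites.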

Very open to any feedback.

@pieh
Contributor

pieh commented Mar 22, 2018

@chmac I was thinking about this a little and for initial load and mounting react components we don't need that map in app bundle or in html - we can delay loading that after initial component is mounted.

Not sure how we could approach chunking that map in the next step - how would we know what chunk we need to load to get path to data for given page?

@chmac
Contributor Author

chmac commented Mar 22, 2018

Yes, loading it later makes sense. That will make it async anyway, which paves the way for fancier stuff.

Sharding, I'm thinking back to my WordPress days and database sharding on MySQL. We used to use a remarkably simple scheme that looked something like this:

// md5 is assumed to be a helper that returns a hex digest string
const calculateShardNameForId = (shardLength: number, id: string): string => md5(id).slice(0, shardLength)

Any hashing algo would work, and the only thing we need to know is the shardLength. In an ideal world, we could decide that at build time. We could also create a map of shardName -> shardJsonFile so we could content hash the JSON files. Then inside the app-*.js file we only need to keep our map of shards.

A shardLength of 2 (two hex characters) would give us 16² = 256 shards, so 100k pages would be roughly 390 per shard. Even 500k pages would only be about 2k per shard.
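
A self-contained sketch of the scheme and its arithmetic. Since any hashing algorithm would work, this swaps md5 for a small FNV-1a variant (hashing UTF-16 code units) to avoid a dependency; the function names are hypothetical.

```typescript
// 32-bit FNV-1a over UTF-16 code units, rendered as 8 hex characters.
// This stands in for md5 purely to keep the sketch dependency-free.
function fnv1aHex(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// The shard name is the first shardLength hex characters of the hash.
function shardNameForId(shardLength: number, id: string): string {
  return fnv1aHex(id).slice(0, shardLength);
}

// Each hex character has 16 possible values.
function shardCount(shardLength: number): number {
  return Math.pow(16, shardLength);
}

// Average entries per shard, assuming the hash spreads ids evenly.
function averagePagesPerShard(pageCount: number, shardLength: number): number {
  return pageCount / shardCount(shardLength);
}

console.log(shardCount(2)); // 256
console.log(averagePagesPerShard(100000, 2)); // 390.625
console.log(averagePagesPerShard(500000, 2)); // 1953.125
```

The client only needs `shardLength` and the short map of shard name to hashed shard file; everything else is computed from the page id.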

There are probably lots of potential optimisations, but that was the general approach I was thinking about.

@KyleAMathews
Contributor

Lazy loading of the paths to page json files is the obvious next step. Sharding would be nice for really large sites. Ideally you'd shard by something like path names, e.g. one shard for /blog/*, as those pages are likely to be needed together. I don't think we need to worry about that right away though, as with v2 the amount of data needed per page is something like 10x smaller, so sharding would only be helpful for sites with 25k+ pages.
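
One way path-based sharding could look, as a sketch only: pages are grouped by their first path segment, and top-level pages share a catch-all shard. The `shardNameForPath` helper and the `root` shard name are assumptions for illustration.

```typescript
// Group pages by their first path segment so related pages share a shard.
function shardNameForPath(pagePath: string): string {
  const segments = pagePath.split("/").filter((segment) => segment.length > 0);
  // "/" and single-segment paths such as "/about" go to a shared "root" shard.
  return segments.length > 1 ? segments[0] : "root";
}

console.log(shardNameForPath("/blog/my-first-post/")); // "blog"
console.log(shardNameForPath("/about/")); // "root"
```

Unlike hash-based sharding, shard sizes here depend on how the site is organised, so one very large section would still produce one very large shard.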

@chmac
Contributor Author

chmac commented Mar 28, 2018

OK, sounds like we're reaching consensus around the plan:

  1. Lazy load the map of path to JSON file names
  2. Think about sharding (chunking) it at some point in the future

How do we move forward? There's currently work being done on switching from webpack to our own JSON pipeline in #4555 (described somewhat in #3575). Do we fold the lazy loading into one of those tickets? Create a new ticket for the lazy loading idea?

The original idea I proposed in this ticket doesn't make sense, we'd break the cache busting / content hashing.

@pieh
Contributor

pieh commented Mar 28, 2018

Lazy loading of the paths to json file names, and of the map that specifies which components (pages/layouts) are used for which paths, is pretty much done - #4715 (I should probably have pinged here when I posted it)

@pieh
Contributor

pieh commented Mar 28, 2018

It's for v2 and it's based on #4555

@chmac
Contributor Author

chmac commented Mar 28, 2018

@pieh Awesome! v2 is looking better and better!

In that case, I'll close this issue, and I'll create a new one about sharding data.json (linking to here for history). I assume sharding is a low priority upgrade to consider at some point in the future.

@chmac chmac closed this as completed Mar 28, 2018
@pieh
Contributor

pieh commented Mar 28, 2018

Just to give more info - when I run my tests against https://github.com/freeCodeCamp/guides (~2800 pages), the gzipped "webpackified" data.json is 141KB. This surely won't scale very nicely for sites with 100,000 pages, but up to 10,000 I think it shouldn't be that much of an issue.
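
A back-of-the-envelope extrapolation from that measurement, assuming the gzipped size grows linearly with page count (an assumption; real compression ratios will vary). The function name is hypothetical.

```typescript
// Measured data point from the comment above: ~2800 pages -> 141KB gzipped.
const MEASURED_PAGES = 2800;
const MEASURED_KB = 141;

// Linear estimate of the gzipped data.json size for a given page count.
function estimateDataJsonKb(pageCount: number): number {
  return (MEASURED_KB * pageCount) / MEASURED_PAGES;
}

console.log(Math.round(estimateDataJsonKb(10000))); // 504
console.log(Math.round(estimateDataJsonKb(100000))); // 5036
```

So under this assumption a 10k-page site would ship about 0.5MB for the map, and a 100k-page site about 5MB.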

@vinniejames

@pieh thanks for cracking this! I'm curious if there is a planned/estimated release date for v2?

@m-allanson
Contributor

@vinniejames There's no date but you can track progress over at https://github.com/gatsbyjs/gatsby/projects/2
