-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explore page build optimisations #20785
Explore page build optimisations #20785
Conversation
Is there anything in the way of just making |
Very cool! Nice work y'all making this happen and getting a PR up. One clarification question — you're only looking at the page context right now to see if an existing page has changed? |
@pieh our thinking is to not change the default behavior of Gatsby. It's not that this feature is experimental, it's more that the env var alone is not enough to utilise the incremental build as such. You have to do something after to use the incrementally built files (merge the changes files with a previous full build for example). That part would be project specific. We feel it's also useful (whichever way is the default) to be able to choose between a full or an incremental build. Happy to consider which behaviour is default though :) |
@KyleAMathews thank you, it's been a team effort, we're already seeing the benefits on our projects. @StuartRayson will be able to confirm, but we should be checking for new and for changes to existing data. |
@dominicfallows Nice work. From a brief look it seems that you're checking if the page context has changed. Am I missing a check to see if page queries have changed too? Most sites only pass an id or slug in context, so that wouldn't change even if the data does. Can it pick that sort of change up? |
Ah, I see, this is probably what @KyleAMathews was asking also. Yes, the intention is to pick up any change, we'll look into it and update commits |
yeah you need to check the following things for changes — page context, query, and the actual underlying nodes. It's the same logic as we use for running queries in development |
We can abstract out these checks as necessary so they're easier to reuse in multiple contexts. |
That'd be cool (to abstract existing checks) - looking around now, but any clue where these are currently? Something to do with gatsby/packages/gatsby/src/query/index.js Line 320 in 06f6688
or
gatsby/packages/gatsby/src/redux/nodes.js Line 63 in fa4ff69
? |
https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/query/query-watcher.js for queries the internals cover a lot of details (though it's a bit outdated sometimes) e.g. https://www.gatsbyjs.org/docs/page-node-dependencies/ |
On top of page queries that Kyle mentioned, there is also tricky case of Static Queries where data changes there can affect all pages, some subset etc. The tricky part is that it's not currently tracked in which pages static query are used. |
The thing is that if you keep You can check this happening by building default starter, deleting
Because we expect results to be there. That's why |
@pieh ok I understand your original discussion about making incremental the default. We think we have two options:
What does everyone think? |
Hi Dominic, Thanks again for all of the work you’ve put into this. A group of us on the core team have been going through it and benchmarking your changes to see if we can find similar performance improvements. As hoped, we managed to get significant reductions in build times with warm caches. However this came with some important drawbacks that mean this PR can’t be merged in its current state. As Kyle noted above, right now it only compares context to determine whether to rebuild a page, and not page data. This means that it won’t detect any changes to the data in page queries. We confirmed this by running our markdown_id benchmark, editing a markdown page, then re-running the build. It did not rebuild the changed page, even though the source file had changed. We found similar results with other source plugins, which is what we would expect as the context does not change. As it stands, this PR would break on any site that uses page queries or static queries, as it would fail to rebuild changed pages. Unfortunately this includes most existing Gatsby sites. We do not recommend passing large amounts of data via context because it can negatively affect performance by over-fetching, increase bundle size, and prevent hot reloading in develop. For this reason all of our documentation, starters and tutorials use page queries and only pass an id or slug in the context. It’s not all bad news though. As a team, we think an approach like this has potential. To generalise it, Gatsby needs to be able to map data changes to related pages. We already do something similar in
In addition, any incremental build solution should also detect changes in webpack bundles – both for that particular page and template – and other bundles, like app or commons bundles. If you are interested in taking this forward, we’d be happy to do a pair programming session with you to help get you started. It’s likely that this is a large and complex problem to solve. In addition, if you’re able to provide a team member with access to your site, we’d like to do a performance audit to see if there are any generalisable fixes that we can implement in Gatsby to improve build times. There are several PRs in the past few weeks that have resulted from audits like these which have helped deliver dramatic performance improvements to many larger sites like yours. #20729 We are hyper-focussed on build performance at the moment, and audits of real-life sites like yours are key to achieving these results. To anyone else reading this: if you haven’t updated Gatsby in the past few weeks it is worth doing so now, particularly if your site is large. We have real-life sites which had build times over four hours building in five minutes. You may experience similar dramatic improvements, without the need for incremental builds. Cheers, Matt (and the rest of @gatsbyjs/core) |
I'm going to close this for now, please see the previous comment for details. We're very happy to review a new PR when you are ready with any updates. |
This is great news!
We definitely are interested in looking into it further, the build time improvements and CI/CD network bandwidth/deployment improvements (by reducing the number of pages to publish) are huge wins for us, and I hope all other Gatsby sites. We're definitely going to take you up on your offer of help and also Pair Programming, we'll get in touch asap. Thank you for your feedback so far, already very useful! |
Note that I've changed the title to avoid potential confusion. There is still a good chunk of follow-up work required to make this mergeable. |
We scheduled a Pair Programming session to try and learn more about the scope of work, but host didn't turn up. We'll reschedule, is there any specific time or people (based on this PR) from Gatsby that would rather schedule it with us? Or shall we use the public calendar? |
Uh-oh. That shouldn't happen. I will follow up with you via email. |
An update for those who are following. We had a great conversation today with @freiksenet and @wardpeet from the Gatsby team. Outcomes at this stage are, the essence of this PR can be developed further to offer some great build time and CI/CD deployable bundle size improvements, so we are going to work on this and open a new PR. Also, we are discussing how we might go about contributing to the wider full incremental build solution over the coming months. |
New PR, focusing in on incremental data changes #21523 |
+1 I don't actually understand why this should not be the default behaviour to begin with and why there is not more attention put to it. |
Description
Gatsby sources data from multiple sources (CMS, static files like Markdown, databases, APIs etc) and creates an aggregated data set in GraphQL. Currently, each
gatsby build
uses the GraphQL data set and queries to do a complete rebuild of the whole app, including static assets, such as HTML, JavaScript, JSON, and media files etc) from the graphQL queries, ready for deployment. For projects that have a small (10s to 100s) to medium (100s to 1000s) amount of content, these full builds don't present a problem.Sites with large amounts of content (10,000s upwards 😱) start to see increased build times and increased demand on CPU and memory.
Also, one of the principals of modern Continous Integration/Continuous Deployment is to release change (and therefore risk) in small batches. A full app rebuild may not be in line with that principle for some projects.
One solution to these problems might be to use Gatsby Cloud's 'Build' features (currently in Beta).
For projects that require self-hosted environments, where Gatsby Cloud would not be an option, being able to incrementally build only the content that has changed (or new) would help reduce build times and demand on resources, whilst helping keep in line with CI/CD principles.
This PR is to introduce optional incremental builds into Gatsby core.
How to use
To enable optional incremental builds, use the environment variable
GATSBY_INCREMENTAL_BUILD=true
in yourgatsby build
command, for example:GATSBY_INCREMENTAL_BUILD=true gatsby build
This will run the Gatsby build process, but only build assets that have changed (or are new) since your last build.
Reporting what has been built
After an incremental build has completed, you might need to get a list of the assets that have been built, for example, if you want to perform a sync action in your CI/CD pipeline.
To list the paths in the build assets (
public
) folder, you can use one (or both) of the following arguments in yourbuild
command.--log-pages
outputs the updated paths to the console at the end of the build--write-to-file
creates two files in the.cache
folder, with lists of the changes paths in the build assets (public
) folder.newPages.txt
will contain a list of paths that have changed or are newdeletedPages.txt
will contain a list of paths that have been deletedIf there are no changed or deleted paths, then the relevant files will not be created in the
.cache
folder.Approach
An incremental build works by comparing the previous page node data from cache (returned by
readState()
) to the newly created page node data in GraphQL, that can be accessed bystore.getState()
. By comparing these two data sets, we can determine which pages have been updated, newly created or deleted.There are two new functions
getNewPageKeys
andremovePreviousPageData
inutils/page-data.js
:getNewPageKeys
loops through eachpage.context
comparing the previous data to the new. If there is a difference, or the key does not exist (new page), this key is added to this functions returned array.removePreviousPageData
loops through each key, if the key is not present in the new data, the page will be removed and a key added to this functions returned array.This array of path keys is assigned to
pageQueryIds
and used as thepagePaths
value for thebuildHTML.buildPages
process.Performance improvement
We have run various performance tests on our projects. For context, we use AWS CodePipeline to build and deploy our Gatsby projects, one of which is approaching 30k pages.
On our ~30k page project, when we run a full build versus an incremental build, we are seeing vastly improved build and deploy times, alongside reduced CPU and memory spikes.
For example, for a full build and deploy, we see an average of 10-11 minutes. For an incremental build, this is reduced down to an average 5-6 minutes 🚀
Further considerations
.cache/redux.state
file between builds, allowing for comparison on an incremental build, if there is noredux.state
file located in the.cache
the folder then a full build will be triggeredgatsby build
Related Issues
Addresses #5002