-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(gatsby): Page build optimisations for incremental data changes #21523
feat(gatsby): Page build optimisations for incremental data changes #21523
Conversation
…ctive-investor/gatsby into improve-page-build-on-data-change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work 👍
Some comments.
|
||
const pagePaths = [...store.getState().pages.keys()] | ||
const pagePaths = process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES | ||
? await pageDataUtil.getChangedPageDataKeys(store.getState(), readState()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already read the state when we initialize, maybe we can save reference to required old state (webpack hash, pages and pageData). I think if we do at the beginning of build (before bootstrap), it should have old page data. I'm kinda wary about re-reading cache multiple times, who knows when it can get changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that makes sense 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that's done 🙇
deletedPageKeys = await pageDataUtil.removePreviousPageData( | ||
program.directory, | ||
store.getState(), | ||
readState() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto about not re-reading the cache here.
packages/gatsby/src/query/index.js
Outdated
const { pageData } = readState() | ||
|
||
if (pageData) { | ||
readState().pageData.forEach((value, key) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can pass the cached pageData from the build function here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can eliminate the explicit if
branch by:
readState().pageData.forEach((value, key) => { | |
pageData?.forEach((value, key) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting a no-unused-expressions
lint error 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue to keep original code here (maybe just lose readState().pageData
to pageData
- the optional chaining here makes it more difficult to read what's going on ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StuartRayson That's because we're currently using an old version of ESLint that doesn't understand optional chaining
Excellent work!
Can we add some documentation for this? Cc @ascorbic We'll want to standardise on this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR :) I mostly reviewed superficial syntactical comments that stuck out to me.
)}`.replace(/,/g, ``) | ||
) | ||
} | ||
if (typeof deletedPageKeys !== `undefined` && deletedPageKeys.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming null
is not an important distinction, you can simplify this;
if (typeof deletedPageKeys !== `undefined` && deletedPageKeys.length) { | |
if (deletedPageKeys?.length) { |
packages/gatsby/src/query/index.js
Outdated
const { pageData } = readState() | ||
|
||
if (pageData) { | ||
readState().pageData.forEach((value, key) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can eliminate the explicit if
branch by:
readState().pageData.forEach((value, key) => { | |
pageData?.forEach((value, key) => { |
Another From code perspective this is ready to be merged (IMO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a documentation standpoint, this is very clear and well written. Well done!
I've made a few suggestions, particularly to the important last point. But other than that the text should be ready.
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
Co-Authored-By: LB <laurie@gatsbyjs.com>
Ops, changes from suggestions to docs broke linting. Please run |
docs/docs/page-build-optimizations-for-incremental-data-changes.md
Outdated
Show resolved
Hide resolved
resolve( | ||
fs.outputFile( | ||
getPageHtmlFilePath(join(process.cwd(), `public`), path), | ||
htmlString | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just interested :
everywhere there is publicDir
as param, but here is join(process.cwd(), `public`)
used with process.cwd()
i think all the publicDir
comes from the following snippet and use program.directory
?
module.exports = async function build(program: BuildArgs) {
const publicDir = path.join(program.directory, `public`)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where renderHTML
is used in commands/build-html.ts
, doBuildPages
is not passed program
as a parameter.
await doBuildPages(rendererPath, pagePaths, activity, workerPool) |
maybe this is to avoid program
being passed unnecessarily to other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually should be passing program
around (or at least build directory) instead of using process.cwd()
, but this doesn't change anything in practice for now (as process.cwd()
was already used before).
We do want to make it so we don't use process.cwd()
at all, as this make it hard for us to enable options like changing build directory in future, or even things like triggering builds programmatically (instead of using CLI), but this is not concern of this PR, so it's ok for it to stay like that for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have noted this in mind of future change of output dir
Yes - it is not the scope of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Great job! 👍
@pieh are you ok to review this? |
From my perspective this is good to go and I have no further comments/reviews. I do want to give some time for @sidharthachatterjee and/or @ascorbic to give either go-ahead or leave comments |
We'll be looking at this tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Let's get this in 🎉
@@ -0,0 +1,26 @@ | |||
const fs = require(`fs-extra`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably move these to gatsby-core-utils
later so they are reusable across gatsby packages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this in!
Thank you @dominicfallows and @StuartRayson for work put into this feature!
Published |
Brill, thanks all for the collab! |
🙌 I've really enjoyed working on this ! 🚀 |
Has anybody tested this with |
Hello everyone. Great work! I was trying this out on one of our existing CI/CD pipelines. When ever a page is removed from our CMS incremental build fails. Should all the contents of the public folder be there for the build to pass. |
@sharanrprasad |
Problem:
Environment:
Running the the same command on my local environment will do an incremental build. Are there any flags which will allow to debug the reason why the cache used is not valid? |
Sadly, no. Maybe you're using something along the lines of |
Description
Gatsby sources data from multiple sources (CMS, static files - like Markdown, databases, APIs, etc) and creates an aggregated dataset in GraphQL. Currently, each
gatsby build
uses the GraphQL dataset and queries to do a complete rebuild of the whole app - ready for deployment - including static assets like HTML, JavaScript, JSON, media files, etc.Projects that have a small (10s to 100s) to medium (100s to 1000s) amount of content, deploying these sites don't present a problem.
Building sites with large amounts of content (10,000s upwards) are already relatively fast with Gatsby. However, some projects might start to experience issues when adopting CI/CD principles - continuously building and deploying. Gatsby rebuilds the complete app which means the complete app also needs to be deployed. Doing this each time a small data change occurs unnecessarily increases demand on CPU, memory, and bandwidth.
One solution to these problems might be to use Gatsby Cloud's Build features.
For projects that require self-hosted environments, where Gatsby Cloud would not be an option, being able to only deploy the content that has changed or is new (incremental data changes, you might say) would help reduce build times, deployment times and demand on resources.
This PR is to introduce an experimental enhancement to only build pages with data changes.
How to use
To enable this enhancement, use the environment variable
GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true
in yourgatsby build
command, for example:GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true node ./node_modules/.bin/gatsby build
This will run the Gatsby build process, but only build pages that have data changes since your last build. If there are any changes to code (JS, CSS) the bundling process returns a new webpack compilation hash which causes all pages to be rebuilt.
Reporting what has been built
You might need to get a list of the pages that have been built for example, if you want to perform a sync action in your CI/CD pipeline.
To list the paths in the build assets (
public
) folder, you can use one (or both) of the following arguments in yourbuild
command.--log-pages
outputs the updated paths to the console at the end of the build--write-to-file
creates two files in the.cache
folder, with lists of the changes paths in the build assets (public
) folder.newPages.txt
will contain a list of paths that have changed or are newdeletedPages.txt
will contain a list of paths that have been deletedIf there are no changed or deleted paths, then the relevant files will not be created in the
.cache
folder.Approach
An enhancement works by comparing the previous page data from cache (returned by
readState()
) to the newly created page data in GraphQL, that can be accessed bystore.getState()
. By comparing these two data sets, we can determine which pages have been updated, newly created or deleted.There are two new functions
getChangedPageDataKeys
andremovePreviousPageData
inutils/page-data.js
:getChangedPageDataKeys
loops through each page's "content" this includes the data and context, comparing it to the previous content. If there is a difference, or the key does not exist (new page), this key is added to this functions returned array.removePreviousPageData
loops through each key, if the key is not present in the new data, the page will be removed and a key added to this functions returned array.This array of path keys used as the
pagePaths
value for thebuildHTML.buildPages
process.At the end of the build process, the
removePreviousPageData
function uses each deleted page key to remove a matching directory from the public folder. This is instead of deleting all HTML from the public directory at the beginning of the build process.Performance improvement
We have run various performance tests on our projects. For context, we use AWS CodePipeline to build and deploy our Gatsby projects, one of which is approaching 30k pages.
On our ~30k page project, when we run a full build versus a content change build, we are seeing vastly improved deploy times, alongside reduced CPU and memory spikes.
For example, for a full build and deploy, we see an average of 10-11 minutes. For a content change build, this is reduced down to an average 5-6 minutes 🚀
Further considerations
To enable this build option you will need to set an environment variable, so you will need access to set variables in your build environment.
You will need to persist the
.cache/redux.state
between builds, allowing for comparison, if there is noredux.state
file located in the/.cache
the folder then a full build will be triggered.Any code or static query changes (templates, components, source handling, new plugins etc) creates a new webpack compilation hash and triggers a full build.
Related Issues
Related to PR #20785
Related to Issue #5002