feat(gatsby): Page build optimisations for incremental data changes #21523

Merged
Changes from 30 commits
54 commits
8937307
store page data between builds
StuartRayson Feb 5, 2020
13b4071
Support removing of pages from public
StuartRayson Feb 5, 2020
6736205
build all pages if webpackCompilationHash has changed
StuartRayson Feb 5, 2020
bfd281e
Support deleting of data on processors that run once
StuartRayson Feb 5, 2020
30e041d
Add expirement flag to page performance
StuartRayson Feb 5, 2020
ad83b98
Add expirement flag to page performance
StuartRayson Feb 5, 2020
a47aa30
Add comments and types to actions
StuartRayson Feb 6, 2020
d9aaac4
Add missing page data reducer
StuartRayson Feb 6, 2020
635bbe4
Retain public between builds
StuartRayson Feb 6, 2020
3555004
Add docs to page build time enhancement
StuartRayson Feb 14, 2020
326d412
Update Page build optimisations docs
dominicfallows Feb 17, 2020
584ca37
Fix build.js conflict
StuartRayson Feb 17, 2020
3d595a4
Merge branch 'improve-page-build-on-data-change' of github.com:intera…
StuartRayson Feb 17, 2020
e9b8492
initial refactor from code review
StuartRayson Feb 17, 2020
d4d8a67
Use hash instead of whole page context
StuartRayson Feb 17, 2020
4af1936
Use hash instead of whole page context
StuartRayson Feb 17, 2020
ca51945
Remove page data in dev action deleteComponentsDependencies
StuartRayson Feb 17, 2020
967596d
Pass cache page data to processQueries function
StuartRayson Feb 17, 2020
46ff752
Remove added pageData check in component-data-dependencies.js
StuartRayson Feb 17, 2020
7060d8d
Remove pagedata if page removed in page-hot-reloader
StuartRayson Feb 17, 2020
a9b2b68
Update snapshots
StuartRayson Feb 18, 2020
697634e
Revert "Update snapshots"
StuartRayson Feb 18, 2020
01b8283
Update reducer context
StuartRayson Feb 18, 2020
83fd4fa
Update docs/docs/page-build-optimizations-for-incremental-data-change…
dominicfallows Feb 18, 2020
ce2202d
Update docs/docs/page-build-optimizations-for-incremental-data-change…
dominicfallows Feb 18, 2020
e78c9a6
Update docs/docs/page-build-optimizations-for-incremental-data-change…
dominicfallows Feb 18, 2020
ab4f93b
Merge branch 'improve-page-build-on-data-change' of github.com:intera…
dominicfallows Feb 18, 2020
eb9b9f8
Update docs/docs/page-build-optimizations-for-incremental-data-change…
dominicfallows Feb 18, 2020
623a3ad
Update docs/docs/page-build-optimizations-for-incremental-data-change…
dominicfallows Feb 18, 2020
6a73b97
Update docs/docs/page-build-optimizations-for-incremental-data-change…
dominicfallows Feb 18, 2020
d0ad8ee
Fix doc duplication error
dominicfallows Feb 18, 2020
d379cd8
Update docs
dominicfallows Feb 19, 2020
fdc65a3
Update docs/docs/page-build-optimizations-for-incremental-data-change…
dominicfallows Feb 20, 2020
6056bf4
Update www/src/data/sidebars/doc-links.yaml
dominicfallows Feb 20, 2020
637084f
refector improvements
StuartRayson Feb 20, 2020
cbdff6a
Merge branch 'improve-page-build-on-data-change' of github.com:intera…
StuartRayson Feb 20, 2020
6053297
improve delete to use promise all
StuartRayson Feb 20, 2020
4ce5c27
Block develop mode if experimental flag is used
StuartRayson Feb 20, 2020
0029ae1
Update docs with new flag name GATSBY_EXPERIMENTAL_PAGE_BUILD_ON_DATA…
StuartRayson Feb 20, 2020
6f08e53
Improvements remove pages logic
StuartRayson Feb 21, 2020
65efb64
Remove empty directory if no files
StuartRayson Feb 21, 2020
9c95671
Refactor pagePath reassign in build
StuartRayson Feb 22, 2020
0f57c47
initial attempt at removing nested folders in the correct order
StuartRayson Feb 23, 2020
300d49c
refactor delete public html and data function
StuartRayson Feb 23, 2020
0599b8f
use join in render-html.js
StuartRayson Feb 23, 2020
fa11ff6
Renaming functions and refactoring
StuartRayson Feb 24, 2020
e04bcd8
Add remove functions to page util
StuartRayson Feb 25, 2020
ded3ccc
Remove whitespace from page-data
StuartRayson Feb 25, 2020
c6f5d1c
Move new build functions to build-utils.js
StuartRayson Feb 27, 2020
2d60d5e
check html suffix function
StuartRayson Feb 28, 2020
ec61132
update docs
StuartRayson Feb 28, 2020
fef05f5
handle .html paths when removing empty directories
StuartRayson Feb 28, 2020
a31749c
Apply suggestions from code review
dominicfallows Feb 28, 2020
09cc305
fix formating issue on docs
StuartRayson Feb 29, 2020
2 changes: 2 additions & 0 deletions docs/docs/overview-of-the-gatsby-build-process.md
@@ -301,6 +301,8 @@ Page queries that were queued up earlier from query extraction are run so the da

With everything ready for the HTML pages in place, HTML is compiled and written out to files so it can be served up statically. Since HTML is being produced in a Node.js server context, [references to browser APIs like `window` can break the build](/docs/debugging-html-builds/) and must be conditionally applied.

By default, Gatsby rebuilds static HTML for all pages on each build. There is an experimental feature flag `GATSBY_PAGE_BUILD_ON_DATA_CHANGES` which enables [Page Build Optimizations for Incremental Data Changes](/docs/page-build-optimizations-for-incremental-data-changes/).

## What do you get from a successful build?

When a Gatsby build is successfully completed, everything you need to deploy your site ends up in the `public` folder at the root of the site. The build includes minified files, transformed images, JSON files with information and data for each page, static HTML for each page, and more.
80 changes: 80 additions & 0 deletions docs/docs/page-build-optimizations-for-incremental-data-changes.md
@@ -0,0 +1,80 @@
---
title: Page Build Optimizations for Incremental Data Changes
---

Gatsby sources data from multiple sources (CMSs, static files such as Markdown, databases, APIs, etc.) and creates an aggregated dataset in GraphQL. Currently, each Gatsby build uses the GraphQL dataset and queries to do a complete rebuild of the whole app, ready for deployment, including static assets like HTML, JavaScript, JSON, media files, etc.

For projects that have a small (10s to 100s) to medium (100s to 1000s) amount of content, deployment is not a challenge.

Even building sites with large amounts of content (10,000s upwards) is relatively fast with Gatsby. However, some projects might start to experience issues when adopting CI/CD principles, that is, continuously building and deploying. Gatsby rebuilds the complete app, which means the complete app also needs to be deployed. Doing this each time a small data change occurs unnecessarily increases demand on CPU, memory, and bandwidth.

One solution to these problems might be to use [Gatsby Cloud's Build features](https://www.gatsbyjs.com/cloud/).

For projects that require self-hosted environments, where Gatsby Cloud would not be an option, being able to only deploy the content that has changed or is new (incremental data changes, you might say) would help reduce build times, deployment times and demand on resources.

For more information on the standard build process, see [Overview of the Gatsby Build Process](/docs/overview-of-the-gatsby-build-process/).

## How to use

To enable this enhancement, use the environment variable `GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true` in your `gatsby build` command, for example:

`GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true node ./node_modules/.bin/gatsby build`

This will run the Gatsby build process, but only build pages that have data changes since your last build. If there are any changes to code (JS, CSS) the bundling process returns a new webpack compilation hash which causes all pages to be rebuilt.

### Reporting what has been built

You may want to retrieve a list of the pages that were built, for example to perform a sync action in your CI/CD pipeline.

To list the paths in the build assets (`public`) folder, you can use one (or both) of the following arguments in your `build` command.

- `--log-pages` outputs the updated paths to the console at the end of the build

```bash
success Building production JavaScript and CSS bundles - 82.198s
success run queries - 82.762s - 4/4 0.05/s
success Building static HTML for pages - 19.386s - 2/2 0.10/s
+ success Delete previous page data - 1.512s
info Done building in 152.084 sec
+ info Built pages:
+ Updated page: /about
+ Updated page: /accounts/example
+ info Deleted pages:
+ Deleted page: /test

Done in 154.501 sec
```

- `--write-to-file` creates two files in the `.cache` folder, with lists of the changed paths in the build assets (`public`) folder.

  - `newPages.txt` will contain a list of paths that have changed or are new
  - `deletedPages.txt` will contain a list of paths that have been deleted

If there are no changed or deleted paths, then the relevant files will not be created in the `.cache` folder.
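
As an illustration of how a CI/CD step might consume these files, here is a minimal sketch. The helper names `readPageList` and `readBuildChanges` are hypothetical and not part of Gatsby; only the file names and the `.cache` location come from the text above. A missing file is treated as an empty list, since the files are not created when there is nothing to report.

```javascript
const fs = require(`fs`)
const path = require(`path`)

// Hypothetical helper (not part of Gatsby): read one of the list files
// written by `--write-to-file`. Returns [] when the file does not exist.
function readPageList(siteDir, fileName) {
  const filePath = path.join(siteDir, `.cache`, fileName)
  if (!fs.existsSync(filePath)) {
    return []
  }
  return fs
    .readFileSync(filePath, `utf8`)
    .split(`\n`)
    .map(line => line.trim())
    .filter(Boolean)
}

// Hypothetical helper: collect both lists for a later sync step.
function readBuildChanges(siteDir) {
  return {
    updated: readPageList(siteDir, `newPages.txt`),
    deleted: readPageList(siteDir, `deletedPages.txt`),
  }
}
```

A deployment script could then upload only the `updated` paths and remove the `deleted` ones from the target host.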

## More information

- This enhancement works by comparing the page data from the previous build to the new page data. This creates a list of page directories that are passed to the static build process

- To enable this build option you will need to set an environment variable, so you will need access to set variables in your build environment

- At the end of each build, Gatsby creates a `redux.state` file in `/.cache` that contains all the previous build data. You will need to persist `.cache/redux.state` between builds so that the comparison can be made. If there is no `redux.state` file in the `/.cache` folder, a full build is triggered.

- Any code or static query change (templates, components, source handling, new plugins, etc.) creates a new webpack compilation hash and triggers a full build
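
The comparison described in the points above can be pictured with a small sketch. This is not the PR's implementation: the function name `diffPageData` and the data shapes (a `Map` from page path to result hash) are assumptions made for illustration. It falls back to a full build when there is no previous state or when the webpack compilation hash differs.

```javascript
// Sketch only: the name and the Map shapes are assumptions, not the PR's code.
// `previous` and `current` map a page path to the hash of its query result.
function diffPageData({
  previous,
  current,
  previousWebpackHash,
  currentWebpackHash,
}) {
  const allPaths = [...current.keys()]

  // No previous state, or the code bundle changed: rebuild every page.
  const fullBuild = !previous || previousWebpackHash !== currentWebpackHash

  const changed = fullBuild
    ? allPaths
    : allPaths.filter(
        pagePath => previous.get(pagePath) !== current.get(pagePath)
      )

  // Pages present in the previous build but missing from the current one.
  const deleted = previous
    ? [...previous.keys()].filter(pagePath => !current.has(pagePath))
    : []

  return { changed, deleted }
}
```

New pages show up in `changed` because they have no previous hash to match.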
5 changes: 4 additions & 1 deletion packages/gatsby/src/bootstrap/index.js
@@ -190,7 +190,10 @@ module.exports = async (args: BootstrapArgs) => {

// During builds, delete html and css files from the public directory as we don't want
// deleted pages and styles from previous builds to stick around.
- if (process.env.NODE_ENV === `production`) {
+ if (
+ !process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES &&
+ process.env.NODE_ENV === `production`
+ ) {
activity = report.activityTimer(
`delete html and css files from previous builds`,
{
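
Note on the condition in this hunk: environment variable values are strings, so a plain truthiness check also treats `GATSBY_PAGE_BUILD_ON_DATA_CHANGES=false` as enabled. If that distinction matters, a stricter check could look like the sketch below. The helpers `isFlagEnabled` and `shouldCleanPublicDir` are hypothetical names, and which strings count as enabled is an assumption.

```javascript
// Hypothetical helper, not in the PR: only explicit "true" or "1" enable a flag.
function isFlagEnabled(value) {
  return value === `true` || value === `1`
}

// Hypothetical helper: clean `public` on production builds unless the
// experimental flag is explicitly enabled.
function shouldCleanPublicDir(env) {
  return (
    env.NODE_ENV === `production` &&
    !isFlagEnabled(env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES)
  )
}
```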
7 changes: 6 additions & 1 deletion packages/gatsby/src/bootstrap/page-hot-reloader.js
@@ -1,7 +1,11 @@
const { emitter, store } = require(`../redux`)
const apiRunnerNode = require(`../utils/api-runner-node`)
const { boundActionCreators } = require(`../redux/actions`)
- const { deletePage, deleteComponentsDependencies } = boundActionCreators
+ const {
+ deletePage,
+ deleteComponentsDependencies,
+ removePageData,
+ } = boundActionCreators
const report = require(`gatsby-cli/lib/reporter`)

let pagesDirty = false
@@ -34,6 +38,7 @@ const runCreatePages = async () => {
page.path !== `/404.html`
) {
deleteComponentsDependencies([page.path])
removePageData(page.path)
deletePage(page)
}
})
86 changes: 79 additions & 7 deletions packages/gatsby/src/commands/build.js
@@ -2,6 +2,7 @@

const path = require(`path`)
const report = require(`gatsby-cli/lib/reporter`)
const fs = require(`fs-extra`)
import { buildHTML } from "./build-html"
const buildProductionBundle = require(`./build-javascript`)
const bootstrap = require(`../bootstrap`)
@@ -11,14 +12,17 @@ const { initTracer, stopTracer } = require(`../utils/tracer`)
const db = require(`../db`)
const signalExit = require(`signal-exit`)
const telemetry = require(`gatsby-telemetry`)
- const { store, emitter } = require(`../redux`)
+ const { store, emitter, readState } = require(`../redux`)
const queryUtil = require(`../query`)
const appDataUtil = require(`../utils/app-data`)
const WorkerPool = require(`../utils/worker/pool`)
const { structureWebpackErrors } = require(`../utils/webpack-error-utils`)
const {
waitUntilAllJobsComplete: waitUntilAllJobsV2Complete,
} = require(`../utils/jobs-manager`)
const pageDataUtil = require(`../utils/page-data`)

const cacheData = readState()

type BuildArgs = {
directory: string,
@@ -70,6 +74,7 @@ module.exports = async function build(program: BuildArgs) {
processStaticQueries,
} = queryUtil.getInitialQueryProcessors({
parentSpan: buildSpan,
cacheData,
})

await processStaticQueries()
@@ -137,14 +142,11 @@ module.exports = async function build(program: BuildArgs) {
`BOOTSTRAP_QUERY_RUNNING_FINISHED`
)

await db.saveState()

await waitUntilAllJobsComplete()

// we need to save it again to make sure our latest state has been saved
await db.saveState()

- const pagePaths = [...store.getState().pages.keys()]
+ const pagePaths = process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES
+ ? pageDataUtil.getChangedPageDataKeys(store.getState(), cacheData)
+ : [...store.getState().pages.keys()]
activity = report.createProgress(
`Building static HTML for pages`,
pagePaths.length,
@@ -184,6 +186,28 @@ module.exports = async function build(program: BuildArgs) {
}
activity.done()

let deletedPageKeys = []
if (process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES) {
activity = report.activityTimer(`Delete previous page data`)
activity.start()
deletedPageKeys = pageDataUtil.removePreviousPageData(
store.getState(),
cacheData
)

deletedPageKeys.forEach(value => {
if (value === `/`) {
fs.removeSync(`${program.directory}/public/index.html`)
Review comment (Contributor): Build is async, so we can `await Promise.all` on all the files we need to delete.

fs.removeSync(`${program.directory}/public/page-data/index`)
} else {
fs.removeSync(`${program.directory}/public${value}`)
fs.removeSync(`${program.directory}/public/page-data${value}`)
}
})

activity.end()
}
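
A possible shape for the asynchronous deletion suggested in the review comment on this hunk is sketched below. It uses Node's built-in `fs.promises.rm` (available in recent Node versions) instead of the `fs-extra` calls in the diff so that it runs without dependencies; with `fs-extra`, `fs.remove(target)` returns a promise when called without a callback and could be used the same way. The helper names are hypothetical.

```javascript
const fs = require(`fs`)
const path = require(`path`)

// Hypothetical helper: map a page path to its HTML and page-data locations
// under `public`, mirroring the `/` special case in the hunk above.
function getPublicPathsForPage(publicDir, pagePath) {
  if (pagePath === `/`) {
    return [
      path.join(publicDir, `index.html`),
      path.join(publicDir, `page-data`, `index`),
    ]
  }
  return [
    path.join(publicDir, pagePath),
    path.join(publicDir, `page-data`, pagePath),
  ]
}

// Hypothetical helper: remove all targets concurrently and wait for all of them.
async function removeDeletedPages(publicDir, deletedPageKeys) {
  const targets = deletedPageKeys.flatMap(key =>
    getPublicPathsForPage(publicDir, key)
  )
  // `force: true` ignores targets that are already gone.
  await Promise.all(
    targets.map(target =>
      fs.promises.rm(target, { recursive: true, force: true })
    )
  )
  return targets
}
```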

activity = report.activityTimer(`onPostBuild`, { parentSpan: buildSpan })
activity.start()
await apiRunnerNode(`onPostBuild`, {
@@ -201,4 +225,52 @@ module.exports = async function build(program: BuildArgs) {
await stopTracer()
workerPool.end()
buildActivity.end()

if (
process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES &&
process.argv.includes(`--log-pages`)
) {
if (pagePaths.length) {
report.info(
`Built pages:\n${pagePaths
.map(path => `Updated page: ${path}`)
.join(`\n`)}`
)
}

if (deletedPageKeys.length) {
report.info(
`Deleted pages:\n${deletedPageKeys
.map(path => `Deleted page: ${path}`)
.join(`\n`)}`
)
}
}

if (
process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES &&
process.argv.includes(`--write-to-file`)
) {
const createdFilesPath = path.resolve(
`${program.directory}/.cache`,
`newPages.txt`
)
const deletedFilesPath = path.resolve(
`${program.directory}/.cache`,
`deletedPages.txt`
)

if (pagePaths.length) {
fs.writeFileSync(createdFilesPath, `${pagePaths.join(`\n`)}\n`, `utf8`)
Review comment (Contributor): Let's use the async version here and `await` it.

report.info(`newPages.txt created`)
}
if (deletedPageKeys.length) {
fs.writeFileSync(
deletedFilesPath,
`${deletedPageKeys.join(`\n`)}\n`,
`utf8`
)
report.info(`deletedPages.txt created`)
}
}
}
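
The review comment on this hunk asks for asynchronous file writes. One way that could look is sketched below, using Node's built-in `fs.promises.writeFile`. The helper name `writePageLists` is hypothetical, and the sketch assumes the `.cache` directory already exists, as it does at the end of a build. Empty lists are skipped, matching the behaviour in the diff.

```javascript
const fs = require(`fs`)
const path = require(`path`)

// Hypothetical helper: write the non-empty lists concurrently and return
// the names of the files that were written.
async function writePageLists(siteDir, pagePaths, deletedPageKeys) {
  const cacheDir = path.join(siteDir, `.cache`)
  const lists = [
    [`newPages.txt`, pagePaths],
    [`deletedPages.txt`, deletedPageKeys],
  ].filter(([, paths]) => paths.length > 0)

  await Promise.all(
    lists.map(([fileName, paths]) =>
      fs.promises.writeFile(
        path.join(cacheDir, fileName),
        `${paths.join(`\n`)}\n`,
        `utf8`
      )
    )
  )

  return lists.map(([fileName]) => fileName)
}
```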
31 changes: 24 additions & 7 deletions packages/gatsby/src/query/index.js
@@ -158,9 +158,21 @@ const groupQueryIds = queryIds => {
}
}

- const processQueries = async (queryJobs, activity) => {
+ const processQueries = async (queryJobs, activity, cacheData) => {
const queue = queryQueue.createBuildQueue()
await queryQueue.processBatch(queue, queryJobs, activity)
if (process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES) {
const { pages } = store.getState()
if (cacheData && cacheData.pageData) {
cacheData.pageData.forEach((_value, key) => {
if (!pages.has(key)) {
boundActionCreators.removePageData({
id: key,
})
}
})
}
}
}
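
The loop added to `processQueries` dispatches `removePageData` for every cached key that no longer has a page. The lookup part can be sketched as a pure function, which keeps it easy to test apart from the store. The name `findStalePageDataKeys` is hypothetical, and both arguments are assumed to be `Map`s keyed by page path.

```javascript
// Hypothetical helper: return cached page-data keys with no matching page.
// `cachedPageData` may be undefined when there is no previous build state.
function findStalePageDataKeys(cachedPageData, pages) {
  if (!cachedPageData) {
    return []
  }
  return [...cachedPageData.keys()].filter(key => !pages.has(key))
}
```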

const createStaticQueryJob = (state, queryId) => {
@@ -206,7 +218,7 @@ const processStaticQueries = async (queryIds, { state, activity }) => {
)
}

- const processPageQueries = async (queryIds, { state, activity }) => {
+ const processPageQueries = async (queryIds, { state, activity }, cacheData) => {
state = state || store.getState()
// Make sure we filter out pages that don't exist. An example is
// /dev-404-page/, whose SitePage node is created via
@@ -215,11 +227,12 @@ const processPageQueries = async (queryIds, { state, activity }) => {
const pages = _.filter(queryIds.map(id => state.pages.get(id)))
await processQueries(
pages.map(page => createPageQueryJob(state, page)),
- activity
+ activity,
+ cacheData
)
}

- const getInitialQueryProcessors = ({ parentSpan } = {}) => {
+ const getInitialQueryProcessors = ({ parentSpan, cacheData } = {}) => {
const state = store.getState()
const queryIds = calcInitialDirtyQueryIds(state)
const { staticQueryIds, pageQueryIds } = groupQueryIds(queryIds)
@@ -230,12 +243,12 @@ const getInitialQueryProcessors = ({ parentSpan } = {}) => {

let activity = null
let processedQueuesCount = 0
- const createProcessor = (fn, queryIds) => async () => {
+ const createProcessor = (fn, queryIds, cacheData) => async () => {
if (!activity) {
activity = createQueryRunningActivity(queryjobsCount, parentSpan)
}

- await fn(queryIds, { state, activity })
+ await fn(queryIds, { state, activity }, cacheData)

processedQueuesCount++
// if both page and static queries are done, finish activity
@@ -246,7 +259,11 @@ const getInitialQueryProcessors = ({ parentSpan } = {}) => {

return {
processStaticQueries: createProcessor(processStaticQueries, staticQueryIds),
- processPageQueries: createProcessor(processPageQueries, pageQueryIds),
+ processPageQueries: createProcessor(
+ processPageQueries,
+ pageQueryIds,
+ cacheData
+ ),
pageQueryIds,
}
}
9 changes: 7 additions & 2 deletions packages/gatsby/src/query/query-runner.js
@@ -98,7 +98,6 @@ module.exports = async (graphqlRunner, queryJob: QueryJob) => {
.createHash(`sha1`)
.update(resultJSON)
.digest(`base64`)

if (resultHash !== resultHashes.get(queryJob.id)) {
resultHashes.set(queryJob.id, resultHash)

@@ -117,7 +116,6 @@ module.exports = async (graphqlRunner, queryJob: QueryJob) => {
`d`,
`${queryJob.hash}.json`
)

await fs.outputFile(resultPath, resultJSON)
}
}
@@ -128,5 +126,12 @@ module.exports = async (graphqlRunner, queryJob: QueryJob) => {
isPage: queryJob.isPage,
})

if (process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES && queryJob.isPage) {
Review comment (Contributor): Let's add a comment mentioning that this is here because of easy access to the already calculated `resultHash` (most of the other changes are, or will be, contained in the `build.js` file).

boundActionCreators.setPageData({
id: queryJob.id,
result: { resultHash },
Review comment (Contributor): Suggested change: replace `result: { resultHash },` with `resultHash,`. No need to create an extra object if we only store the hash. More code changes will be needed to adjust for the changed shape.

})
}

return result
}
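
For readers unfamiliar with the hashing step this file relies on, the sketch below reproduces the same `crypto` calls seen in the hunk (SHA-1, base64 digest) in isolation. It assumes `resultJSON` is the JSON-serialized query result, which the hunk does not show, and the helper names `hashResult` and `recordResultHash` are hypothetical.

```javascript
const crypto = require(`crypto`)

// SHA-1 of the serialized result, base64-encoded, as in the hunk above.
// Serializing with JSON.stringify is an assumption about `resultJSON`.
function hashResult(result) {
  return crypto
    .createHash(`sha1`)
    .update(JSON.stringify(result))
    .digest(`base64`)
}

// Hypothetical helper: store the hash for a query and report whether it
// differs from the previously stored one.
function recordResultHash(resultHashes, queryId, result) {
  const resultHash = hashResult(result)
  const changed = resultHashes.get(queryId) !== resultHash
  if (changed) {
    resultHashes.set(queryId, resultHash)
  }
  return { resultHash, changed }
}
```

Storing only the hash per page, rather than the full result, keeps the persisted state small while still allowing a changed/unchanged decision on the next build.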
1 change: 1 addition & 0 deletions packages/gatsby/src/query/queue.js
@@ -14,6 +14,7 @@ const createBaseOptions = () => {

const createBuildQueue = () => {
const graphqlRunner = new GraphQLRunner(store)

const handler = (queryJob, callback) =>
queryRunner(graphqlRunner, queryJob)
.then(result => callback(null, result))
@@ -20,6 +20,7 @@ Object {
"complete": Map {},
"incomplete": Map {},
},
"pageData": Map {},
"pageDataStats": Map {},
"staticQueryComponents": Map {},
"status": Object {