Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Per page manifest #14359

Merged
merged 16 commits into from
Jun 11, 2019
Merged

Per page manifest #14359

merged 16 commits into from
Jun 11, 2019

Conversation

Moocar
Copy link
Contributor

@Moocar Moocar commented May 27, 2019

The work for this PR was being tracked at #13004. But now that it's ready to be merged, I've opened this PR

For months, I've been merging work into the per-page-manifest branch, which changes Gatsby so that it has a manifest/query-result file per page, instead of a global list of all pages. see #13004 for details.

This PR finally merges that branch into master. All work on the per-page-manifest branch has already been reviewed, however one final look through wouldn't hurt.

@Moocar Moocar marked this pull request as ready for review May 29, 2019 03:28
@Moocar Moocar requested review from a team as code owners May 29, 2019 03:28
@Moocar
Copy link
Contributor Author

Moocar commented May 29, 2019

RE the 2 test failures:

  • e2e_tests_production_runtime: Not sure what the actual error is. Something to do with the last scroll behavior test taking too long. I'm fairly sure it's not related to this PR
  • unit_tests_windows. Something related to parsing page-data.json files in static-entry.js. This might be a real error, but I'm on a mac. Anyone on a windows machine able to verify it's failing?

@ArthurHwang
Copy link

ArthurHwang commented May 29, 2019

RE the 2 test failures:

  • e2e_tests_production_runtime: Not sure what the actual error is. Something to do with the last scroll behavior test taking too long. I'm fairly sure it's not related to this PR
  • unit_tests_windows. Something related to parsing page-data.json files in static-entry.js. This might be a real error, but I'm on a mac. Anyone on a windows machine able to verify it's failing?

I want to preface this that I don't know much about this issue and I stumbled here from googling around how to fix the performance of my 3000+ page gatsby project that produces a 200+kb pages-manifest.js file. Everything on my application feels laggy due to the sheer size of the pages-manifest.js file.

However, I tried to install your temp fix, the gatsby@per-page-manifest package. For myself on windows 10, it fails during build process.

More specifically I am using the packages gatsby-source-wordpress, gatsby-paginate. Here is my error
error

It seems to be failing at the pagination creation.

Building static HTML failed for path "/blog"
  37 | 
  38 | export default function BlogIndex({ data, pathContext }: Props) {
> 39 |   const { group, index, first, last, pageCount } = pathContext
     |           ^
  40 |   const previousUrl = index - 1 === 1 ? "" : (index - 1).toString()
  41 |   const nextUrl = (index + 1).toString()
  42 | 


  WebpackError: TypeError: Cannot read property 'group' of undefined

  - PostsTemplate.tsx:39 BlogIndex
    lib/src/components/elements/blog/PostsTemplate.tsx:39:11

  - bootstrap:22 c
    lib/webpack/bootstrap:22:1

  - bootstrap:25 Sa
    lib/webpack/bootstrap:25:1

  - bootstrap:30 a.render
    lib/webpack/bootstrap:30:1

  - bootstrap:30 a.read
    lib/webpack/bootstrap:30:1

  - bootstrap:42 renderToString
    lib/webpack/bootstrap:42:1

  - static-entry.js:213 Module../.cache/static-entry.js.__webpack_exports__.default
    lib/.cache/static-entry.js:213:18

  - bootstrap:24 Promise
    lib/webpack/bootstrap:24:1


  - gatsby-browser-entry.js:35 Promise._resolveFromExecutor
    lib/.cache/gatsby-browser-entry.js:35:12

  - bootstrap:68 new Promise
    lib/webpack/bootstrap:68:1


  - bootstrap:5 tryCatcher
    lib/webpack/bootstrap:5:1

  - bootstrap:50 MappingPromiseArray._promiseFulfilled
    lib/webpack/bootstrap:50:1

  - api-runner-ssr.js:6 MappingPromiseArray.PromiseArray._iterate
    lib/.cache/api-runner-ssr.js:6:27

  - bootstrap:67 MappingPromiseArray.init
    lib/webpack/bootstrap:67:1

here is my gatsby-node:

require("dotenv").config({
  path: `.env.${process.env.NODE_ENV}`
})

const path = require("path")
const createPaginatedPages = require("gatsby-paginate")

// Implement the Gatsby API “onCreatePage”. This is
// called after every page is created.
exports.onCreatePage = async ({ page, actions }) => {
  const { createPage } = actions

  // page.matchPath is a special key that's used for matching pages
  // only on the client.
  if (page.path.match(/^\/app/)) {
    page.matchPath = "/admin/*"

    // Update the page.
    createPage(page)
  }
}

exports.createPages = ({ graphql, actions }) => {
  const { createPage } = actions

  return new Promise((resolve, reject) => {
    graphql(
      `
        {
          allWordpressPost(limit: ${process.env.BLOG_LIMIT}) {
            edges {
              node {
                id
                slug
                title
                content
                excerpt
                date
                modified
                categories {
                  name
                }
                author {
                  name
                }
                featured_media {
                  localFile {
                    childImageSharp {
                      fluid {
                        src
                        srcSet
                        sizes
                        aspectRatio
                        tracedSVG
                      }
                    }
                  }
                }
              }
            }
          }
        }
      `
    ).then(result => {
      if (result.errors) {
        console.log(result.errors)
        reject(result.errors)
      }

      createPaginatedPages({
        edges: result.data.allWordpressPost.edges,
        createPage: createPage,
        pageTemplate: "./src/components/elements/blog/PostsTemplate.tsx",
        pageLength: 5,
        pathPrefix: "/blog"
      })

      result.data.allWordpressPost.edges.forEach(({ node }) => {
        createPage({
          path: "blog/" + node.slug,
          component: path.resolve(
            "./src/components/elements/blog/PostTemplate.tsx"
          ),
          context: {
            slug: node.slug,
            imageSharp: node.featured_media,
            title: node.title,
            author: node.author.name,
            categories: node.categories,
            modified: node.modified,
            content: node.content
          }
        })
      })
      resolve()
    })
  })
}

here is my posttemplate of where the error is occuring:

import { Link } from "gatsby"
import React from "react"
import { FaArrowAltCircleLeft, FaArrowAltCircleRight } from "react-icons/fa"
import styled from "styled-components"
import Layout from "../../layouts/Layout"
import BlogCard from "./BlogCard"

interface Props {
  data: {
    allWordpressPost: {
      edges: {
        node: Node
      }
    }
  }
}

interface Node {
  node: {
    id: string
    slug: string
    title: string
    content: string
    excerpt: string
    date: string
    modified: string
  }
}

const NavLink = props => {
  if (!props.test) {
    return <Link to={props.url}>{props.text}</Link>
  } else {
    return <span style={{ textDecoration: "line-through" }}>{props.text}</span>
  }
}

export default function BlogIndex({ data, pathContext }: Props) {
  const { group, index, first, last, pageCount } = pathContext
  const previousUrl = index - 1 === 1 ? "" : (index - 1).toString()
  const nextUrl = (index + 1).toString()

  return (
    <Layout sidebar={true}>
      {/* <h1 style={{ textAlign: "center" }}>My Blog</h1> */}
      {/* <h2>{pageCount} Posts</h2> */}
      <BlogPostsWrapper>
        <ul style={{ margin: "0", listStyleType: "none" }}>
          {group.map(({ node }: Node) => (
            <li key={node.id}>
              <Link className="card-wrapper" to={"blog/" + node.slug}>
                <BlogCard data={node} />
              </Link>
            </li>
          ))}
        </ul>
      </BlogPostsWrapper>
      <StyledPagination>
        <div className="previous-link">
          <FaArrowAltCircleLeft className="arrow-left" />
          <NavLink test={first} url={`blog/${previousUrl}`} text="Previous" />
        </div>
        <div className="next-link">
          <NavLink test={last} url={`blog/${nextUrl}`} text="Next" />
          <FaArrowAltCircleRight className="arrow-right" />
        </div>
      </StyledPagination>
    </Layout>
  )
}

const BlogPostsWrapper = styled.section`
  width: 100%;
  margin-bottom: 2rem;
`
const StyledPagination = styled.div`
  display: flex;
  justify-content: space-between;
  margin-bottom: 2rem;
  align-items: center;

  .previous-link,
  .next-link {
    display: flex;
    align-items: center;
  }
  .arrow-left,
  .arrow-right {
    font-size: 2rem;
    margin: 0.5rem;
  }
`

I am honestly not even sure why or how this is happening, but I thought maybe if there is a chance for my issue to help you then it would be worth for me to share with you.

If i revert the gatsby version to one of the regular versions everything builds correctly. If there is something I can do more to help you feel free to ask.

@Moocar
Copy link
Contributor Author

Moocar commented May 30, 2019

@ArthurHwang You found a bug! Thanks so much for the report. I'm working on a fix now. It's actually unrelated to Windows.

Also, just FYI, pathContext is deprecated. If you're on Gatsby v2, you should be using pageContext. https://www.gatsbyjs.org/docs/migrating-from-v1-to-v2/#rename-pathcontext-to-pagecontext

@DSchau
Copy link
Contributor

DSchau commented May 30, 2019

@Moocar nicely done!

I think others have already tested on www (gatsbyjs.org) but wanted to try it out for myself!

As such, I deployed it to Netlify here -> https://gatsby-per-page-manifest.netlify.com/ (note: some of the screenshots on showcase and probably starters are broken; I presume unrelated to this PR and some network issues on my machine)

It's fun to see the improvement with e.g. webpagetest because we're no longer loading a super large page manifest.

This PR Webpagetest vs. latest Gatsbyjs.org

Some important call outs:

Metric Improvement
Speed Index 194ms
Load Time 1042ms
First Interactive 5031ms

Super excited to see this land!

@pieh @freiksenet maybe we run the build runner tool (if we haven't already!) to identify any issues with sites "in the wild."

@Moocar
Copy link
Contributor Author

Moocar commented May 30, 2019

@DSchau Thanks for trying it out! I've already run develop-runner (which I presume you're referring to) and didn't find any issues, although that only covers the build step.

@KyleAMathews and I had a chat yesterday and decided that the next step should be deploying gatsby www to netlify using split-testing to deploy to a portion of users. Then we can use sentry to monitor for frontend issues. Once that's done, we will have tested both the build step and the running app fairly comprehensively.

@KyleAMathews
Copy link
Contributor

@DSchau very cool! Ran some more analysis on webpagetest using their comparison tool https://www.webpagetest.org/video/compare.php?tests=190530_AB_69f522b65151feb6c39a369a50ff7a14%2C190530_8Q_07506324ce40c1917f0d1b37b6c5bdfc&thumbSize=200&ival=100&end=visual

The new version starts rendering slightly earlier due I'm guessing to less resource contention from downloading less JS.

It's even more dramatic looking at the network graph as the current version is taking a lot longer to download everything due again to having more JS to download.

One maybe bug is that this PR is requesting the same resource multiple times which results in 304s — is that by design?

@KyleAMathews
Copy link
Contributor

Here's standalone versions as well for the simple 3g test config

new: https://www.webpagetest.org/result/190530_1R_3b04582c9a3d6632b3eb1b8b745a2269/2/details/#waterfall_view_step1
old: https://www.webpagetest.org/result/190530_AE_e271290dee60819571492d98062c105d/

@ChristopherBiscardi
Copy link
Contributor

It seems like with page-specific data we could leverage loadComponent to enable page-specific MDX components and higher-level work beyond that like returning components from GraphQL queries.

Does this sound like the right train of thought and if so is there any plan to expose page data modification capabilities to plugins?

@Moocar
Copy link
Contributor Author

Moocar commented Jun 7, 2019

@ChristopherBiscardi Could you elaborate some more? I don't have enough MDX knowledge to answer this. But here are some comments that may or may not be applicable:

like returning components from GraphQL queries

pageData does include the results of graphql queries, but those query results must be pure data, since they have to be serialized to json and sent to the browser. I guess components could be included in graphql query results, but they'd have to be in a pure data AST form that could be loaded by the gatsby app in the browser.

is there any plan to expose page data modification capabilities to plugins?

It's not something I've ever considered. At the moment, page-data.json is an implementation detail of Gatsby. Could you give an example where it would be useful?

@KyleAMathews
Copy link
Contributor

@ChristopherBiscardi adding support for prefetching other page resources is unrelated to this PR — we'd need a new action where plugins could add additional resources to be prefetched by Gatsby e.g. as discussed here: gatsbyjs/rfcs#35

@KyleAMathews
Copy link
Contributor

@Moocar it's feeling like this is ready to go as the only issues we've seen are long-standing issues unrelated to this PR (e.g. not handling statusCode 0 response when loading data).

It's been running like a champ on gatsbyjs.org and other site that have tested it so I suggest we get this in and then work on the other issues in other PRs.

@Moocar
Copy link
Contributor Author

Moocar commented Jun 11, 2019

@KyleAMathews Great! I have some follow up code to address the status 0 stuff. But it will need its own thorough review.

I just noticed some conflicts with master, I'll create a PR to address now.

Moocar added 9 commits June 11, 2019 13:50
**Note: merges to `per-page-manifest`, not master. See #13004 for more info**

## Description

This PR saves the `page-data.json` during query running. In order to minimize the size of PRs, my strategy is to save the page-data.json along with the normal query results, and then gradually shift functionality over to use `page-data.json` instead of `data.json`. Once all those PRs are merged, we'll be able to go back and delete the static query results, jsonNames, dataPaths, data.json etc.

It does mean that we'll be storing double the amount of query results on disk. Hopefully that's ok in the interim.

Compilation-hash will be added in future PRs.

## Related Issues

- Sub-PR of #13004
* add utils/page-data.read

* read websocket page data from utils/page-data
* page-data loader working for production-app

* get new loader.getPage stuff working with gatsby develop

* fix static-entry.js test

* remove loadPageDataSync (will be in next PR)

* use array of matchPaths

* Deprecate various loader methods

* remove console.log

* document why fetchPageData needs to check that response is JSON

* in offline, use prefetchedResources.push(...resourceUrls)

* root.js remove else block

* loader.js make* -> create*

* loader drop else block

* pass correct path/resourceUrls to onPostPrefetch

* s/err => null/() => null/

* Extract loadComponent from loadPage

* remove pageData from window object

* update jest snapshots for static-entry (to use window.pagePath)

* add loadPageOr404

* preload 404 page in background

* normalize page paths

* comment out resource-loading-resilience.js (will fix later)
* remove data.json from plugin-guess

* add test for plugin-guess
* use match-paths.json for gatbsy serve

* remove pages.json
* move query running into build and develop

* save build compilation hash

* write compilationHash to page data

* reload browser if rebuild occurs in background

* add test to ensure that browser is reloaded if rebuild occurs

* update page-datas when compilation hash changes

* use worker pool to udpate page data compilation hash

* update tests snapshot

* reset plugin offline whitelist if compilation hash changes

* prettier: remove global Cypress

* separate page for testing compilation-hash

* fix case typo

* mock out static entry test webpackCompilationHash field

* consolidate jest-worker calls into a central worker pool
* remove json-name and don't save /static/d/ page query results

* in loader.cleanAndFindPath, use __BASE_PATH__. Not __PATH_PREFIX__
@harrygreen
Copy link
Contributor

Ah, ok. Thanks v much for the explanation.

Yeah, we do have CloudFront in front of s3 - but we’re not invalidating the cache when we push changes. We just sync the files to s3 and let CloudFront respect the cache headers we set on the files. So we don’t set a Minimum TTL on it - but perhaps we should (as outlined in https://github.com/jariz/gatsby-plugin-s3/blob/master/recipes/with-cloudfront.md).

@leonfs
Copy link

leonfs commented Jun 12, 2019

Thank-you @KyleAMathews , for the explanation, and for all the hard work you and the rest in the community have been putting into it.

Unfortunately, I can't stop feeling that the Cloudfront caching strategy from the documentation looks a bit like a hack. Invalidating Cloudfront (or any CDN for the matter) has a cost and carries a propagation delay. Doing so on every build doesn't feel like a good engineering practice. Even less so when your software development practices allow you to push to production more than 50 times a day.

We've been following and waiting for this fix for a long time, eager to get it working on our site, but the caching solution for the page-data.json is a big problem for us.

So far, we've been caching the .html files for 60 seconds (even if not recommended by Gatsby) and we haven't had any problems with it. The 60s feel more deterministic than a CF invalidation; you know when those pages will get refreshed.

Is there a possibility to add a hash suffix to the page-data.json? Or some internal mechanisms in Gatsby (naming conventions?) would prevent this from happening? At least this way, people who use Gatsby have the option to decide how to handle their caching strategy.

Thanks again for all the hard and good work.

@wardpeet
Copy link
Contributor

We won't be able to use hashes inside the filename but we're thinking of adding a querystring parameters to these files. page-data.json?v=[webpack-hash] or something like that.

Would this help your use case?

@leonfs
Copy link

leonfs commented Jun 12, 2019

Thanks @wardpeet for the prompt reply. I would need to run some tests before answering that, try a few edge cases that come to the top of my mind.

Although, why not to add the webpack hash as a suffix of the file name? Is not the perfect solution as you still get a cache miss on each new build, but at least it's possible to cache the file forever, without having to worry about hammering the origin with requests, and benefiting from a good cache policy.

References in old .html files will still point to files that exist at the origin and most likely at the CDN too. Would that work?

@roylines
Copy link

Agree with @leonfs - adding the webpack-hash to the filename would be perfect. Better than the query string method as then multiple copies of the file could be statically served during an upgrade process. Would be a great work around for this issue if the md5 hash is tricky.

@KyleAMathews
Copy link
Contributor

Adding the webpack-hash doesn't help actually as you can update a page's data without re-running webpack.

Is it really that expensive for cloudfront to revalidate files against an s3 bucket? Our expectation is that revalidating a file whether it's changed or not is cheap. E.g. with fastly you can mark a single file as invalidated https://docs.fastly.com/guides/purging/single-purges

@KyleAMathews
Copy link
Contributor

(this is me not knowing very much about cloudfront)

@sebastienfi
Copy link
Contributor

CloudFront also allows you to purge a single file. This can be achieved in various ways, including on-demand and CI/CD style.

@roylines
Copy link

The issue here is timing. If it is important that the html and the json files are distributed in a lock-step fashion, clearing caches does not provide that. The distributed nature of CDNs by design means that a user could have an older (or newer) version of html than the distributed json. The only way to guarantee consistency is some form of versioning of the json file, either by hash or otherwise.

@leonfs
Copy link

leonfs commented Jun 17, 2019

@KyleAMathews - If the webpack-hash doesn't work, wouldn't a build hash (unique per build) work?

@thovden
Copy link

thovden commented Jun 19, 2019

This change also broke my S3 + CloudFront setup, resulting in the browser re-loading the page Site has changed on server. Reloading browser.

I'm using gatsby-plugin-s3 to deploy. Clearing the browser cache does not resolve this - the next request will have force a page reload.

Reverted to 2.8 for now.

mxxk pushed a commit to mxxk/gatsby that referenced this pull request Jun 21, 2019
* feat(gatsby): Page data without compilation hash (gatsbyjs#13139)

**Note: merges to `per-page-manifest`, not master. See gatsbyjs#13004 for more info**

## Description

This PR saves the `page-data.json` during query running. In order to minimize the size of PRs, my strategy is to save the page-data.json along with the normal query results, and then gradually shift functionality over to use `page-data.json` instead of `data.json`. Once all those PRs are merged, we'll be able to go back and delete the static query results, jsonNames, dataPaths, data.json etc.

It does mean that we'll be storing double the amount of query results on disk. Hopefully that's ok in the interim.

Compilation-hash will be added in future PRs.

## Related Issues

- Sub-PR of gatsbyjs#13004

* Websocket manager use page data (gatsbyjs#13389)

* add utils/page-data.read

* read websocket page data from utils/page-data

* Loader use page data (gatsbyjs#13409)

* page-data loader working for production-app

* get new loader.getPage stuff working with gatsby develop

* fix static-entry.js test

* remove loadPageDataSync (will be in next PR)

* use array of matchPaths

* Deprecate various loader methods

* remove console.log

* document why fetchPageData needs to check that response is JSON

* in offline, use prefetchedResources.push(...resourceUrls)

* root.js remove else block

* loader.js make* -> create*

* loader drop else block

* pass correct path/resourceUrls to onPostPrefetch

* s/err => null/() => null/

* Extract loadComponent from loadPage

* remove pageData from window object

* update jest snapshots for static-entry (to use window.pagePath)

* add loadPageOr404

* preload 404 page in background

* normalize page paths

* comment out resource-loading-resilience.js (will fix later)

* Remove data json from guess (gatsbyjs#13727)

* remove data.json from plugin-guess

* add test for plugin-guess

* Gatsby serve use page data (gatsbyjs#13728)

* use match-paths.json for gatbsy serve

* remove pages.json

* move query/pages-writer to bootstrap/requires-writer (gatsbyjs#13729)

And delete data.json

* fix(gatsby): refresh browser if webpack rebuild occurs (gatsbyjs#13871)

* move query running into build and develop

* save build compilation hash

* write compilationHash to page data

* reload browser if rebuild occurs in background

* add test to ensure that browser is reloaded if rebuild occurs

* update page-datas when compilation hash changes

* use worker pool to udpate page data compilation hash

* update tests snapshot

* reset plugin offline whitelist if compilation hash changes

* prettier: remove global Cypress

* separate page for testing compilation-hash

* fix case typo

* mock out static entry test webpackCompilationHash field

* consolidate jest-worker calls into a central worker pool

* page-data.json cleanup PR. Remove jsonName and dataPath (gatsbyjs#14167)

* remove json-name and don't save /static/d/ page query results

* in loader.cleanAndFindPath, use __BASE_PATH__. Not __PATH_PREFIX__

* loader getPage -> loadPageSync (gatsbyjs#14264)

* Page data loading resilience (gatsbyjs#14286)

* fetchPageHtml if page resources aren't found

* add page-data to production-runtime/resource-loading-resilience test

Also use cypress tasks for blocking resources instead of npm run chunks

* fetchPageHtml -> doesPageHtmlExist

* remove loadPageOr404Sync

* revert plugin-offline to master (gatsbyjs#14385)

* Misc per-page-manifest fixes (gatsbyjs#14413)

* use path.join for static-entry page-data path

* add pathContext back to static-entry props (fix regression)

* Remove jsonDataPaths from pre-existing redux state (gatsbyjs#14501)

* Add context to sentry xhr errors (gatsbyjs#14577)

* Add extra context to Sentry xhr errors

* Don't define Sentry immediately as it'll always be undefined then

* mv cpu-core-count.js test to utils/worker/__tests__ (fix bad merge)

* remove sentry reporting
mxxk pushed a commit to mxxk/gatsby that referenced this pull request Jun 21, 2019
@xavivars
Copy link
Contributor

xavivars commented Jun 21, 2019

This PR also "broke" a workaround to use gatsby as a static page generator instead of a static site generator: #4337 (comment)

By broke I mean that the workaround is no longer working, because before we could change window.path to avoid the canonical check, but now window.pagePath also drives where the page-data.json is downloaded from, so we can't hack it anymore. :(

@karolis-sh
Copy link

@ChristopherBiscardi Could you elaborate some more? I don't have enough MDX knowledge to answer this. But here are some comments that may or may not be applicable:

like returning components from GraphQL queries

pageData does include the results of graphql queries, but those query results must be pure data, since they have to be serialized to json and sent to the browser. I guess components could be included in graphql query results, but they'd have to be in a pure data AST form that could be loaded by the gatsby app in the browser.

is there any plan to expose page data modification capabilities to plugins?

It's not something I've ever considered. At the moment, page-data.json is an implementation detail of Gatsby. Could you give an example where it would be useful?

@Moocar I have an example for page data modification capabilities use case.

We have a couple hundred pages translated to 20+ languages. The translations are managed in separate web app and are consumed by our gatsby app as json data with react-intl. We pass translations via pageContext so naturally they end up in page-data.json's. As there are lots of pages with different content the translations json for single language is quite huge. Including translations of whole site in 1 page is not performant, so for now tackle this problem with runing gatsby build 2 times:

  • On first gatsby build we do some static code analysis with babel plugins (babel-plugin-react-intl etc) - this generates some stats on where the translations are used
  • On second gatsby build we check the previously generated translation stats and filter out unused translation values when passing them to the pageContext

If we could somehow hook into the page-data.json generation, maybe we could ditch the 2nd build step and reduce the build time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.