Why are page-data.json files not content hashed? #15080

Closed
mmv08 opened this issue Jun 24, 2019 · 60 comments
Labels
type: question or discussion Issue discussing or asking a question about Gatsby

Comments

@mmv08

mmv08 commented Jun 24, 2019

Hi Gatsby team :) We've just spent a few hours debugging an issue where content was not appearing on the website, and it turned out to be caused by a cached page-data.json.

Why doesn't it use a content hash? Should caching be handled another way?

@gatsbot gatsbot bot added the type: question or discussion Issue discussing or asking a question about Gatsby label Jun 24, 2019
@mmv08 mmv08 changed the title Why is page-data.json not content hashed? Why are page-data.json files not content hashed? Jun 24, 2019
@marco-th

marco-th commented Jun 26, 2019

@mikheevm The Gatsby Team describes in this blog post why they are not hashing the page-data.json files anymore.

I am also running into some issues because of this. Is there a way to enable the hashing of the page-data.json files again?

@gaplo917

gaplo917 commented Jun 29, 2019

There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton

If the new version of an .md file is incompatible with the previous version (e.g. the frontmatter schema changed)
OR
If the new React component is incompatible with the previous page-data.json

This leads to:

  • NEW HTML getting old page-data JSON will throw an error (e.g. XXX is undefined)
  • OLD HTML getting new page-data JSON will have the same issue

Gatsby should have a config option that lets users toggle page-data.json hashing for cache-friendly environments. Once development of the website is stable (only content updates, no React code changes), the user can switch the feature off for the performance gain.

Currently, I have to downgrade to 2.8.8 for rapid agile development.

@KyleAMathews
Contributor

@gaplo917 how do you know you're seeing this error? We include a build hash in each page-data.json so if a client loads a json file and sees it came from a newer build, it'll force a full refresh to get the new component code.

If that's not working that's a bug we need to fix. More details about what's happening and reproduction instructions would be great!
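The check described above can be sketched roughly as follows. This is an illustration, not Gatsby's actual loader code: it assumes the build hash sits in a page-data.json field named `webpackCompilationHash`, and the function names and the `reload` callback are hypothetical.

```javascript
// Sketch only: the field name and both helpers are assumptions for illustration.
function isFromDifferentBuild(pageData, bootedBuildHash) {
  const fetchedHash = pageData && pageData.webpackCompilationHash;
  // No hash in the payload: nothing to compare, so do not force a reload.
  if (!fetchedHash) return false;
  return fetchedHash !== bootedBuildHash;
}

function handlePageData(pageData, bootedBuildHash, reload) {
  if (isFromDifferentBuild(pageData, bootedBuildHash)) {
    // The JSON belongs to a different build than the running JS, so refresh
    // to let the browser pick up matching component code.
    reload();
    return null;
  }
  return pageData ? pageData.result : null;
}
```

In a browser, `reload` would typically be something like `() => window.location.reload()`. Note that this only helps when the stale side is the JavaScript; it does nothing if the JSON itself is served stale from a cache.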

@KyleAMathews
Contributor

Also in general, you should only be caching files that are in the static folder and .js files as described in the caching docs https://www.gatsbyjs.org/docs/caching/#static-files

The new json files shouldn't be cached. If you're running into troubles, check out your cache-control settings.

@gaplo917

Also in general, you should only be caching files that are in the static folder and .js files as described in the caching docs https://www.gatsbyjs.org/docs/caching/#static-files

The new json files shouldn't be cached. If you're running into troubles, check out your cache-control settings.

Thanks for your advice! I had assumed that everything could be cached.

That is because Gatsby 2.8.8 generates a hash for every page-data.json, which guarantees that new/old HTML will load the new/old/same JSON.

Will try your advice and see the result.

@leonfs

leonfs commented Jul 4, 2019

Thank you, everybody, for raising this problem.

It has also been debated in the original PR thread (#14359), with some great contributions on how the manual cache-invalidation technique is not really a great solution.

Our team has been working on a small script (that runs on Gatsby's onPostBuild hook) that appends a hash to all page-data.json files and updates the references. So far it has proven to work but we haven't pushed it to our production site because we've been experiencing other problems with gatsby-plugin-sharp and the newer versions of Gatsby (different issue).

If anyone is interested, we could share our approach.

@mmv08
Author

mmv08 commented Jul 4, 2019

@leonfs Would be really great if you share it :)

@harrygreen
Contributor

harrygreen commented Jul 4, 2019

@mikheevm something along the lines of..

const fs = require("fs").promises;
const glob = require("glob");
const md5 = require("md5");
const path = require("path");

exports.onPostBuild = async () => {
  const publicPath = path.join(__dirname, "public");
  const hash = md5("replace-with-your-own-hash");

  const jsonFiles = glob.sync(`${publicPath}/page-data/**/page-data.json`);
  console.log("[onPostBuild] Renaming the following files:");
  for (let file of jsonFiles) {
    console.log(file);
    const newFilename = file.replace(`page-data.json`, `page-data.${hash}.json`);
    await fs.rename(file, newFilename);
  }

  const htmlAndJSFiles = glob.sync(`${publicPath}/**/*.{html,js}`);
  console.log("[onPostBuild] Replacing page-data.json references in the following files:");
  for (let file of htmlAndJSFiles) {
    const stats = await fs.stat(file);
    if (!stats.isFile()) continue;
    console.log(file);
    const content = await fs.readFile(file, "utf8");
    // Escape the dot so only the literal file name is matched.
    const result = content.replace(/page-data\.json/g, `page-data.${hash}.json`);
    await fs.writeFile(file, result, "utf8");
  }
};

Disclaimer! As @leonfs pointed out, this seems to work for us, but it's obviously a hack to suit our caching implementation. No guarantee it'll work for anyone else.
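The script above hashes a fixed placeholder string, so the file names never change between builds. One way to fill in "your own hash" is to derive a single build-wide hash from the contents of every page-data.json file. The sketch below assumes that approach; `buildHash` and the `{ file, content }` entry shape are hypothetical names, not part of Gatsby.

```javascript
const crypto = require("crypto");

// Hypothetical helper: one hash for the whole build, derived from every
// page-data.json file, so it changes only when some page data changes.
function buildHash(entries) {
  const hash = crypto.createHash("md5");
  // Sort by path so the result does not depend on glob ordering.
  const sorted = [...entries].sort((a, b) =>
    a.file < b.file ? -1 : a.file > b.file ? 1 : 0
  );
  for (const { file, content } of sorted) {
    hash.update(file);
    hash.update("\0"); // separator keeps path/content boundaries unambiguous
    hash.update(content);
    hash.update("\0");
  }
  return hash.digest("hex");
}
```

A single build-wide hash is used here, rather than one hash per file, because the script rewrites one literal file name everywhere; per-file hashes would need a lookup from page path to hash.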

--
Edited code, as per #15080 (comment).

@leonfs

leonfs commented Jul 5, 2019

If anyone else tries this approach, it would be great if you added a comment with your results.

@city41

city41 commented Jul 23, 2019

I have been hitting this issue with a Gatsby site that is deployed to Github Pages. So I unfortunately have no control over how the server decides to set caching headers on page-data.json.

In my case I find after a new deploy, that stale page-data.json data gets used due to it being cached, and you see the page flicker from the new data back to the old once the page-data.json request returns.

@city41

city41 commented Jul 23, 2019

I added the onPostBuild that @harrygreen posted (thanks for that). But I am also finding that the root app-<sha>.js file does not update correctly.

If I:

  1. gatsby build
  2. note the name of app-sha.js inside public, currently it is app-560e4b2f43729239ce7d.js for me
  3. add some new data such that a new build would yield different html
  4. gatsby build again
  5. note the name of app-sha.js again. I expect it to be different, but it is still app-560e4b2f43729239ce7d.js

So since the file name does not update, my browser uses the old cached version, which loads the wrong page-data.json file.

@city41

city41 commented Jul 23, 2019

btw @harrygreen, I think it should be if (!stats.isFile()) continue;, as the return exits the function.

@wardpeet
Contributor

If this works and it's widely used, feel free to create a plugin that changes this behaviour. We're happy to take new plugins 🎉 https://www.gatsbyjs.org/contributing/submit-to-plugin-library/

@harrygreen
Contributor

harrygreen commented Jul 23, 2019

@city41 Glad it could help you somewhat. We're not using it in production yet because of a different but consistent failure inside gatsby-plugin-sharp (also introduced with the new updates).

As for this:

btw @harrygreen, I think it should be if (!stats.isFile()) continue;, as the return exits the function.

We do want the inverse of your suggestion though; we only want to rename files, not folders. We want to exit the function early if it's not a file.

Is the app-<sha>.js issue a result of my code, or unrelated?

@wardpeet Thanks for the tip. My code really is a hack against Gatsby, so I'm reluctant to publish anywhere.. but maybe one day. I'd much rather page-data.[hash].json :)

@city41

city41 commented Jul 23, 2019

@harrygreen using return instead of continue means that as soon as you hit a directory, you stop processing files. continue causes the for loop to move on to the next iteration, so the next file gets a chance to be processed.

app-<sha>.js is not a result of your code, but it is related. It's a file that ideally needs a hash in its name because otherwise it will get cached just like page-data.json files did. It used to not matter that it got cached, because it only had page-data.json embedded in it. Now with the hash fix, it has page-data.<hash>.json embedded in it, thus using a cached version is now bad.

I agree this shouldn't be a plugin. I honestly feel like this is a bug in Gatsby.

@harrygreen
Contributor

harrygreen commented Jul 23, 2019

@city41 good spot 🤜 (it was originally inside a function but I switched to for (let .. of)). Thanks - will update the snippet.

The issue is Gatsby's rule not to cache certain assets (e.g. HTML, and now this JSON, both of which require a lockstep relationship). Now that the JSON hash is gone, some flexibility over the possible caching strategies has been lost. This comment sums up the issue if cache-clearing is required on a CDN.

@wardpeet
Contributor

wardpeet commented Jul 23, 2019

@city41 it's not a Gatsby bug; page-data.json shouldn't be cached. Leaving it unhashed gives us the opportunity to build pages separately and only change what's needed.

I do agree that each new version of app-560e4b2f43729239ce7d.js should get a new hash if things changed. So that's definitely a bug. Do you mind opening a new one with this information?

see #15080 (comment)

@city41

city41 commented Jul 23, 2019

I think app-sha.js is built by webpack? From its perspective, it hasn't changed. So that one might be trickier.

Maybe bug is too strong of a word? Maybe this should be an option the user can opt into? It makes using Gatsby when you don't have access to the server difficult. Gatsby is a great choice when you only have static hosting (like gh-pages), but this caching issue makes it unusable in those scenarios without the onPostBuild workaround.

If Gatsby fingerprinted page-data.json files, then app-sha.js would probably naturally get a change, and webpack would re-fingerprint it I imagine.
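The reasoning above can be illustrated with a toy fingerprint function. This is a stand-in for a bundler's content hashing, not webpack's real algorithm, and the bundle strings and the `abc123` hash are made up for the example.

```javascript
const crypto = require("crypto");

// Hypothetical stand-in for content-based fingerprinting of a bundle.
function fingerprint(source) {
  return crypto.createHash("md5").update(source).digest("hex").slice(0, 20);
}

// Two builds with different page data but identical JavaScript.
const bundleBuild1 = 'fetch(path + "/page-data.json")';
const bundleBuild2 = 'fetch(path + "/page-data.json")';
// A build where the data file name, hash included, is embedded in the bundle.
const bundleWithHash = 'fetch(path + "/page-data.abc123.json")';

// Identical source gives an identical file name, so a cached copy is reused.
const sameName = fingerprint(bundleBuild1) === fingerprint(bundleBuild2);
// Embedding the hashed name changes the source, and with it the fingerprint.
const newName = fingerprint(bundleBuild1) !== fingerprint(bundleWithHash);
```

This also shows why rewriting the bundle after the build is fragile: the file name was computed before the rewrite, so it no longer reflects the contents.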

@jaredsilver
Contributor

jaredsilver commented Jul 23, 2019

I agree with @city41 and many of the other folks who have chimed in here. We had a long discussion internally, and our consensus is that without being able to host everything at the edge cache level, using Gatsby as opposed to something like Next.js serves very little purpose. The entire point of rendering at build time is that we don't need to have a server and can instead host the built files directly on the edge. If files like page-data.json are not hashed, it eliminates the ability to do that because of the cache invalidation issues referenced by @roylines and @leonfs in this thread: #14359.

For our purposes, we already were serving index pages from an nginx server and using assetPrefix for various other reasons for the time being, so we just modified @harrygreen's script to work for that use case (see below).

Going forward, however, we will need to reevaluate the value of a tool like Gatsby compared to a tool like Next, given that build-time rendering introduces a whole host of issues we need to work around and very few benefits over traditional server side rendering if the files have to be hosted on a server instead of on the edge.

Happy to chat more about this and alternative approaches -- we love Gatsby and are very grateful for the amazing work ya'll are doing!

Here's the gatsby-node.js script to enable hosting the page-data files on a server while using assetPrefix to host assets in edge cache:

const path = require('path');
const fs = require('fs').promises;
const glob = require('glob');

exports.onPostBuild = async () => {
  const publicPath = path.join(__dirname, 'public');

  const htmlAndJSFiles = glob.sync(`${publicPath}/**/*.{html,js}`);
  console.log(
    '[onPostBuild] Replacing page-data.json references in the following files:'
  );
  for (let file of htmlAndJSFiles) {
    const stats = await fs.stat(file);
    // Skip directories; `return` here would abort the whole hook.
    if (!stats.isFile()) continue;
    console.log(file);
    const content = await fs.readFile(file, 'utf8');
    // split/join replaces every occurrence, not just the first one.
    const result = content
      .split('https://static.datacamp.com/page-data')
      .join('/page-data');
    await fs.writeFile(file, result, 'utf8');
  }
};

@xavivars
Contributor

It seems that the number of people using Gatsby in a way that needs hashing (as a much simpler way to manage CDNs and new deployments) is actually pretty big.

Release 2.9 was a huge step forward for Gatsby, increasing performance for big sites. Thanks a lot for that amazing work!

But the side effects of that release (it is not yet possible to fully disable client-side navigation, and there are issues with easy deployments to S3 + CloudFront) clearly make 2.9 a non-backwards-compatible release, with quite a lot of people now locked to 2.8.x due to all those issues.

Would it make sense to introduce @jaredsilver's solution in core, or via a generic plugin (supported by the Gatsby project itself)? That would make Gatsby 2.9+ closer to a real 2.8-compatible version.

@lifehome

I propose introducing the solution by @jaredsilver into the core, as it would allow easy cache invalidation on the edge, and a really streamlined, rapid deployment of the site, without all the waiting on "that cache is still alive" somewhere else.

Also, this would allow us to literally cache everything, decreasing the impact to the origin and maximize the use of CDN.

@kimbaudi
Contributor

kimbaudi commented Aug 4, 2019

Hi, I noticed that my gatsbyjs website would show a blank screen and become really unresponsive (approximately 20-50 seconds) on some mobile browsers (Chrome and Opera) after performing a hard refresh. Inspecting the network using Chrome DevTools indicates that the delay comes from page-data.json

On page refresh, the request for page-data.json is pending

[Screenshot: issue1]

It took about 50 seconds for the blank screen to go away and the page to refresh.

[Screenshot: issue2]

I've already mentioned this in #11006 (comment).

I'm not sure if adding a hash to page-data.json filename would resolve this issue, but I wanted to share an issue I am facing that is causing page-data.json to load really slowly or not at all.

@KyleAMathews
Contributor

@jaredsilver where do you host your site? Perhaps there's a misunderstanding here? I'm not sure what you mean by "edge host". Any cdn can work with the changes in 2.9 -- you just need to set the cache-control settings correctly. Is there something we can document for you?

@KyleAMathews
Contributor

Perhaps the reason we overlooked this is that most CDNs designed to serve static sites handle this correctly already i.e. they serve a file from the edge until a new build invalidates the file. Netlify/Zeit/Firebase etc. work this way. Most/all CDNs have a purge or invalidation API that can be setup to do this as well.

What CDNs are y'all using? Let's put together some docs on setups.

This does complicate setup I agree and can be error prone. If you haven't had a chance to read about why we needed to remove the page manifest (and hence the hashed data files) please read the blog post https://www.gatsbyjs.org/blog/2019-06-12-performance-improvements-for-large-sites/

Happy to dive more into the rationale behind it.

@jaredsilver
Contributor

I was in the middle of writing out a long response detailing why you're wrong when I realized that I am wrong 😄

If we use a combination of must-revalidate and max-age=0, CloudFront's edge nodes will hold the asset and only download a new version from the origin if the file has changed. This is the behavior we want, whereas adding a no-cache directive would have prevented the file from being held in CloudFront at all. For everyone concerned about the out of sync invalidations, I believe this should eliminate the need for any invalidations at all -- whenever the files change, they will be fetched anew from the origin server and stored at the edge level.
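The header policy discussed in this thread can be sketched as a small path-to-header mapping. The function name, the path patterns, and the long-lived `max-age=31536000, immutable` value are assumptions for illustration; adapt them to your own host or CDN configuration.

```javascript
// Sketch of a path-to-header mapping; names and values are assumptions.
const REVALIDATE = "public, max-age=0, must-revalidate";
const IMMUTABLE = "public, max-age=31536000, immutable";

function cacheControlFor(urlPath) {
  // HTML and page data keep fixed names, so they must be revalidated.
  if (urlPath.endsWith(".html") || urlPath.endsWith("/")) return REVALIDATE;
  if (urlPath.startsWith("/page-data/")) return REVALIDATE;
  // The service worker also keeps a fixed name despite its .js extension.
  if (urlPath === "/sw.js") return REVALIDATE;
  // Files in the static folder and fingerprinted JS/CSS can be cached long-term.
  if (urlPath.startsWith("/static/")) return IMMUTABLE;
  if (/\.(js|css)$/.test(urlPath)) return IMMUTABLE;
  // Anything unrecognised: stay on the safe side.
  return REVALIDATE;
}
```

For example, `cacheControlFor("/page-data/index/page-data.json")` returns the revalidating policy, while `cacheControlFor("/app-560e4b2f43729239ce7d.js")` returns the long-lived one.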

I have opened a PR to update the caching page of the docs with this information so hopefully other folks won't run into this issue in the future. Feel free to check it out here: #16368

Thanks, @KyleAMathews!

And if anyone else in the thread has any questions/concerns, I'd be happy to explore this further.

@lifehome

lifehome commented Aug 5, 2019

Hey @jaredsilver, would you mind taking a look at Cloudflare? We're using a mix of CloudFront and Cloudflare for cached bandwidth and security, but it seems that, despite the change for CloudFront, the Cloudflare cache still needs to be cleared manually, IMO.

@jaredsilver
Contributor

@lifehome It looks like Cloudflare does not cache page-data.json or index.html by default, which is good. Unless you've manually configured them to be cached, it looks like it hopefully shouldn't be a problem. If you have manually configured them to be cached, it's possible that the headers are wrong or that Cloudflare is overwriting your headers. I would load the file in your browser and verify that the must-revalidate and max-age=0 directives are there for the index.html and page-data.json files.

Note: if you're using a service worker, it looks like Cloudflare does cache that by default since it ends in a .js extension. That could very likely be the issue. You can overwrite that behavior with page rules.

@antoinerousseau
Contributor

antoinerousseau commented Apr 4, 2020

This stale page data makes your website crash if, for example, its object structure has changed.
I think the files should be versioned again, as proposed (and not just in a plugin), and a try/catch mechanism should trigger a full page reload when the page data is no longer found.

@KyleAMathews
Contributor

Hey folks, we'd love to see reproductions of ways this can cause crashes. If someone could demonstrate exactly how this happens that'd be great e.g. a script which swaps in files in a certain order and causes the frontend to crash. There's a lot of tests to ensure the frontend is robust to different scenarios and so we'd love to see what we're missing.

@gaplo917

gaplo917 commented Apr 5, 2020

@KyleAMathews @antoinerousseau
I have suffered from this caching behaviour since the v2.9.0 release, so I decided to spend an hour recording a video that reproduces the issue. Check it out: https://youtu.be/pAyFLv6hGL0

The most frustrating part is that a site using GatsbyJs v2.8.8 (page-data-{hash}.json) would work in a Google Cloud bucket because

  • Getting the latest HTML guarantees fetching the latest page-data-{latest hash}.json
  • Getting the old HTML guarantees fetching the old page-data-{old hash}.json

I have mentioned in this thread before:

Cache invalidation is hard.

I would say that "GatsbyJs >= 2.9.0 is not cache friendly". The only way to fix it is to completely disable caching of JSON files, because page-data.json always uses the same name and GatsbyJs has NO way to control the browser cache.

As a result, GatsbyJs >= 2.9.0 fails on the same deployment config that previously worked for a long time (<= 2.8.8).

In the video demo, I use a query variable to force fetching the latest HTML. This is in fact normal behaviour: we cannot control which query variables appear in the browser (social platforms add them, for example). The only thing we can do is sacrifice the caching of all potentially useful JSON files, just like Netlify's settings do, and spend extra effort testing the configuration on GCP Bucket / CloudFront / CloudFlare.

Test Repo: https://github.com/gaplo917/gatsby-starter-blog
Reproduction Recording: https://youtu.be/pAyFLv6hGL0

@github-actions

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.
If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!
As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

@github-actions github-actions bot added the stale? Issue that may be closed soon due to the original author not responding any more. label Apr 25, 2020
@Undistraction
Contributor

Undistraction commented Apr 25, 2020 via email

@antoinerousseau antoinerousseau added not stale type: bug An issue or pull request relating to a bug in Gatsby and removed stale? Issue that may be closed soon due to the original author not responding any more. labels Apr 25, 2020
@LekoArts LekoArts added type: question or discussion Issue discussing or asking a question about Gatsby and removed type: question or discussion Issue discussing or asking a question about Gatsby type: bug An issue or pull request relating to a bug in Gatsby impact: med not stale labels May 26, 2020
@LekoArts
Contributor

Hey folks, the original question was answered. Continuing to discuss a bug 20 replies after the initial question makes it hard to find the relevant information quickly. We'd appreciate a new bug report with a filled-out template and a reproduction to track the bug if it's not the same as #19618. Otherwise the bug itself is tracked there already. Thanks!

@aychtang

@gaplo917 Do you want to create the new ticket?

@rori4

rori4 commented Aug 7, 2020

This is still an issue, especially when deploying to GitHub Pages. The default caching for files on GitHub Pages is cache-control: max-age=600, and there is no way to control that.

@martymclaugh

We came across the app-<sha>.js issue as well. We expanded on the postBuild script above to update app-<sha>.js and all references in addition to the page-data.json files:

const fs = require('fs').promises;
const glob = require('glob');
const path = require('path');

exports.onPostBuild = async () => {
  const publicPath = path.join(__dirname, 'public');
  const hash = Date.now();
  const jsonFiles = glob.sync(`${publicPath}/page-data/**/page-data.json`);

  console.log('[onPostBuild] Renaming the following files:');

  for (let file of jsonFiles) {
    console.log(file);
    const newFilename = file.replace(`page-data.json`, `page-data.${hash}.json`);

    await fs.rename(file, newFilename);
  }

  const appShaFiles = glob.sync(`${publicPath}/**/app-*.js`);
  const [appShaFile] = appShaFiles;
  const [appShaFilename] = appShaFile.split('/').slice(-1);
  const appShaFilenameReg = new RegExp(appShaFilename, 'g');
  const newAppShaFilename = `app-${hash}.js`;
  const newFilePath = appShaFile.replace(appShaFilename, newAppShaFilename);

  console.log(`[onPostBuild] Renaming: ${appShaFilename} to ${newAppShaFilename}`);

  await fs.rename(appShaFile, newFilePath);
  await fs.rename(`${appShaFile}.map`, `${newFilePath}.map`);

  const htmlJSAndJSONFiles = [
    `${newFilePath}.map`,
    ...glob.sync(`${publicPath}/**/*.{html,js,json}`),
  ];

  console.log(
    `[onPostBuild] Replacing page-data.json, ${appShaFilename}, and ${appShaFilename}.map references in the following files:`,
  );

  for (let file of htmlJSAndJSONFiles) {
    const stats = await fs.stat(file);

    if (!stats.isFile()) {
      continue;
    }

    const content = await fs.readFile(file, 'utf8');
    const result = content
      .replace(appShaFilenameReg, newAppShaFilename)
      // Escape the dot so only the literal file name is matched.
      .replace(/page-data\.json/g, `page-data.${hash}.json`);

    if (result !== content) {
      console.log(file);
      await fs.writeFile(file, result, 'utf8');
    }
  }
};

@rori4

rori4 commented Jan 22, 2021

Here is one option that kind of works for me: just add a unique query string at the end of page-data.json and app-data.json.

add to gatsby-node.js:

const path = require(`path`)
const glob = require('glob')
const md5 = require('md5')
const fs = require('fs-extra')

exports.onPostBuild = async () => {
	const publicPath = path.join(__dirname, 'public')
	const hash = md5(Math.random().toString(36).substring(7))

	const htmlAndJSFiles = glob.sync(`${publicPath}/**/*.{html,js}`)
	console.log(
		'[onPostBuild] Replacing page-data.json references in the following files:'
	)
	for (let file of htmlAndJSFiles) {
		const stats = await fs.stat(file)
		if (!stats.isFile()) continue
		console.log(file)
		const content = await fs.readFile(file, 'utf8')
		// Escape the dots so only the literal file names are matched
		const result = content
			.replace(/page-data\.json/g, `page-data.json?${hash}`)
			.replace(/app-data\.json/g, `app-data.json?${hash}`)
		await fs.writeFile(file, result, 'utf8')
	}
}

This seems to be needed because I am using GitHub Pages to host the Gatsby site, and GitHub Pages has a cache policy of cache-control: max-age=600 with no way to control it.

However, the Gatsby documentation (https://www.gatsbyjs.com/docs/caching/) clearly says that page-data and app-data should always be revalidated and never cached.

This causes weird behavior: pages that have been updated don't seem to change, and some requests return 404, because GitHub Pages caches 404 responses as well.

YES, 404s are cached by GitHub Pages...

So if you try to visit a page such as yourgatsbysite.com/test before it exists, GitHub Pages will cache the 404 for its page-data.json, and after you upload the test page you will keep getting a 404 page for a long time.
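The string rewrite at the heart of the script above can be isolated into a small pure function, which makes it easier to test. This is a sketch under assumptions: `addCacheBuster` is a hypothetical name, and the lookahead that skips references already carrying a query string is an addition not present in the original script.

```javascript
// Hypothetical pure helper: appends ?<hash> to page-data.json and
// app-data.json references found in a string.
function addCacheBuster(content, hash) {
  // \. matches a literal dot; (?!\?) skips references that already carry a
  // query string, so running the hook twice does not append twice.
  return content.replace(
    /(page-data|app-data)\.json(?!\?)/g,
    `$1.json?${hash}`
  );
}
```

Inside the loop, `content.replace(...).replace(...)` could then become `addCacheBuster(content, hash)`. The hash is assumed to be plain hex, as produced by md5, so it contains no `$` sequences that the replacement string would interpret.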

@satouriko

satouriko commented Dec 28, 2021

I added the onPostBuild that @harrygreen posted (thanks for that). But I am also finding that the root app-<sha>.js file does not update correctly.

If I:

1. `gatsby build`

2. note the name of app-sha.js inside public, currently it is `app-560e4b2f43729239ce7d.js` for me

3. add some new data such that a new build would yield different html

4. `gatsby build` again

5. note the name of app-sha.js again. I expect it to be different, but it is still `app-560e4b2f43729239ce7d.js`

So since the file name does not update, my browser uses the old cached version, which loads the wrong page-data.json file.

@city41 Exactly the same problem. With the solution from #15080 (comment), app-sha.js gets modified while Gatsby still regards it as a valid cache, so it ends up referring to the wrong page-data-sha.json.

Any solutions?
