Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if remote node has been downloaded already #4748

Closed

Conversation

tsimons
Copy link
Contributor

@tsimons tsimons commented Mar 28, 2018

This is branched off of my other PR: #4616
The only change is found on: https://github.com/gatsbyjs/gatsby/pull/4748/files#diff-1dde006d0131753bd3f750b01d2288f3R202

This sped up the build of a 1000 image blog by 30-60 seconds.

This checks to see if the remote node has been downloaded already by generating the file path, and running a fast pathExists check. If it was downloaded already and there are cachedHeaders, don't bother requesting the file again.

I also added support for an env variable: NO_CACHE in case the image updated on the server, and new ones should be downloaded. I have no strong ties to how it's named, and would like to hear if such a flag exists already.

TJ Simons added 9 commits March 19, 2018 15:15
The blog I work on has over 9,000 media objects and currently, it tries
to download them all. This PR chunks them in groups of 100, but that
setting can be increased.
Add Better Queue for more control over processing
update readme with new config option
number and we can look to change this in the future
@gatsbybot
Copy link
Collaborator

Deploy preview for using-drupal ready!

Built with commit 559fc0f

https://deploy-preview-4748--using-drupal.netlify.com

@gatsbybot
Copy link
Collaborator

gatsbybot commented Mar 28, 2018

Deploy preview for gatsbygram ready!

Built with commit 1ef4f40

https://deploy-preview-4748--gatsbygram.netlify.com

@gatsbybot
Copy link
Collaborator

gatsbybot commented Mar 28, 2018

Deploy preview for using-drupal ready!

Built with commit 1ef4f40

https://deploy-preview-4748--using-drupal.netlify.com

@pieh
Copy link
Contributor

pieh commented Mar 28, 2018

This checks to see if the remote node has been downloaded already by generating the file path, and running a fast pathExists check. If it was downloaded already and there are cachedHeaders, don't bother requesting the file again.

Problem here is we can't skip making request based just on filename - file could be changed on remote host but still have same name. That's why we store headers so we do request and server check headers and if nothing changed it will return "304 Not Modified" response with empty body.

Real solution for this problem would be to track this in wordpress plugin with modified or modified_gmt fields in reponse from media rest api endpoint and see if it changed since last time plugin ran. If it didn't change then we could use cached file instead of using createRemoteFileNode (we would have to track cached file too).

@tsimons
Copy link
Contributor Author

tsimons commented Mar 29, 2018

Interesting. It must be since our images are hosted on S3, but they all get downloaded fresh every build. I added the check for production environment, as well as the a new var to be added to make sure they get downloaded if it does change.

@KyleAMathews
Copy link
Contributor

S3 has etag support so it shouldn't be downloading anew every time. If there's something wrong with our algorithm there, would love your help debugging. But as @pieh explained, this PR will have false positives.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants