
Add gatsby-parallel-runner package #21733

Closed
wants to merge 11 commits into from

Conversation

biilmann

Description

gatsby-parallel-runner is a Gatsby build runtime that allows plugins and core parts of Gatsby to take advantage of the concept of external jobs with IPC introduced in #20835

When gatsby is executed with gatsby-parallel-runner instead of with gatsby build, it will be wrapped in a parent runtime that can process external jobs.
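As a rough sketch of the idea (the message names and shapes below are illustrative assumptions, not Gatsby's actual job IPC protocol), the parent runtime listens for job messages coming from the child build process and answers with results:

```javascript
// Illustrative sketch of a parent runtime servicing external jobs.
// Message names (JOB_CREATED / JOB_COMPLETED / JOB_FAILED) are assumptions,
// not Gatsby's actual IPC message format.
function handleMessage(msg, dispatchToCloud) {
  if (msg.type === "JOB_CREATED") {
    try {
      // In the real plugin this would invoke a cloud function and await its result.
      const result = dispatchToCloud(msg.payload)
      return { type: "JOB_COMPLETED", payload: { id: msg.payload.id, result } }
    } catch (err) {
      return { type: "JOB_FAILED", payload: { id: msg.payload.id, error: err.message } }
    }
  }
  return null // unrecognized messages are ignored
}

const reply = handleMessage(
  { type: "JOB_CREATED", payload: { id: "42", name: "IMAGE_PROCESSING", args: {} } },
  payload => `processed ${payload.name}`
)
console.log(reply.type) // JOB_COMPLETED
```

In the real runtime the child is the wrapped `gatsby build` process and the exchange happens over `process.send`/`process.on("message")` rather than direct function calls.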

The plugin currently includes support for the IMAGE_PROCESSING job emitted by gatsby-plugin-sharp, and includes a Google Cloud based runner that can parallelize this task across Google Cloud Functions, offering clear performance benefits for image-heavy sites. On the default image benchmark site, initial tests show more than a 4x speedup for the image-processing part of the build.

The plugin is built around a set of abstractions that make it viable to add additional parallel runtimes (an AWS Lambda based runtime would be a great example), and it is structured to make it easy for plugin authors to add cloud processors for new tasks.

I believe having a plugin like this in Gatsby core (rather than purely in an external plugin registry) will be important to allow plugin authors from the open source ecosystem to develop plugins that can take advantage of the external job system. Allowing plugin authors or developers working with site specific local plugins to build and test jobs that can be trivially parallelized via serverless functions will greatly benefit the whole community.

Documentation

Install in your gatsby project:

npm i gatsby-parallel-runner

To use with Google Cloud, set relevant env variables in your shell:

export GOOGLE_APPLICATION_CREDENTIALS=~/path/to/your/google-credentials.json
export TOPIC=parallel-runner-topic

Deploy the cloud function:

npx gatsby-parallel-runner deploy

Then run your Gatsby build with the parallel runner instead of the default gatsby build command.

npx gatsby-parallel-runner

Related Issues

#19831

This adds a new plugin called `gatsby-parallel-runner` that brings
support for running external jobs in parallel cloud functions.

Out of the box it adds support for the gatsby-plugin-sharp processor
and can significantly speed up image heavy Gatsby sites.

It comes with an extensible model that can be used to allow any Gatsby
plugin to take advantage of this type of cloud function based
parallelization.
@biilmann requested review from a team as code owners on February 25, 2020 at 15:07
```javascript
topic: () => {
  return {
    publish: async msg => {
      expect(msg).toBe(msg)
```
Tautology?

Good catch, fixed the shadowing that caused this.

@janosh
Contributor

janosh commented Feb 26, 2020

Just out of curiosity, what does the plugin do when you're running a build locally while offline? Does it fall back to the old behavior of running everything in a single thread or does it try and fail to access Google cloud?

@kdichev
Contributor

kdichev commented Feb 26, 2020

Great work!!! 😸
This is something that could benefit a lot of big projects.
I was wondering what kind of jobs are available to potentially parallelize?
For instance we are generating around 20k pages and the html gen step also takes a bunch of time, could be awesome if we can split this job into multiple!

```javascript
await file.download({ destination: `/tmp/result-${id}` })
const data = (await fs.readFile(`/tmp/result-${id}`)).toString()
const payload = JSON.parse(data)
await fs.remove(`/tmp/result-${id}`)
```

If I understand this comment googleapis/nodejs-storage#676 (comment) correctly, you can read the file content directly without saving the file to disk.


Yeah, I was expecting that to work and had `const data = await file.download({})` in an earlier version, but consistently got an empty response that way. It would be much cleaner though. /tmp is an in-memory volume, so performance-wise one approach or the other probably won't make a difference, but it does add some annoying boilerplate code.
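For reference, `File#download()` in @google-cloud/storage resolves with the contents as a `[Buffer]` tuple when no `destination` option is given, so the no-/tmp version would look roughly like this (sketched against a stub file object so the snippet is self-contained and runnable without GCS credentials):

```javascript
// Sketch: read a job result payload straight from a storage object, no /tmp round trip.
// `file` is assumed to behave like a @google-cloud/storage File, whose download()
// (with no destination option) resolves with a [Buffer] tuple.
async function readPayload(file) {
  const [contents] = await file.download()
  return JSON.parse(contents.toString())
}

// Stub standing in for a real File object so the sketch runs anywhere.
const fakeFile = { download: async () => [Buffer.from('{"ok":true}')] }
readPayload(fakeFile).then(payload => console.log(payload.ok)) // logs: true
```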

Apparently some Google accounts enforce only lowercase + dashes in function names.
This makes sure functions deployed via gatsby-parallel-runner follow that convention.
@KyleAMathews
Contributor

KyleAMathews commented Feb 27, 2020

Hey Matt!

This is super exciting! We were hoping people would jump in with more implementations of the coming Jobs API so very glad y'all are doing deep dives on this 🙏

So a little background for everyone on the status of the Jobs API. As you probably have guessed, it's super early (and undocumented haha so props on reverse engineering it!). We've switched one plugin (gatsby-plugin-sharp) over and did a focused implementation for that on Gatsby Cloud. The end goal is an API for offloading CPU/IO/Memory/Network intensive work to workers as obviously node.js suffers at this & to scale/speed builds — everything needs to move eventually beyond a single process/machine. We're inspired by tools like Spark, Dagster, Kafka, experimental projects like gg, etc.

We'll be moving more plugins / internal functions over to the API & testing how that goes. We'll also be adding to core a "local build" implementation (on by default) that'll distribute jobs to our local worker processes (which we use already for doing HTML SSR). We're pretty certain the API will change as we do this so are intentionally leaving it undocumented while we experiment & learn.

One of our TODOs is to create a job metadata spec so that our cloud job runner (and other implementations) have all the info needed to create the functions for running jobs. That way plugins don't have to directly do anything to get jobs running anywhere — just write to the API. This would enable an e.g. gatsby-plugin-google-cloud-workers that would be responsible for creating functions & pushing them to Google & then actually running jobs. Y'all could do a native implementation on Netlify like we are, etc.

So long story short — we're really excited to see your implementation but it seems a bit early as we're probably going to be moving APIs around as we learn things & do more implementations. Also the extra framework bits you add won't be necessary. We're happy to have you maintain it in the meantime & we'd love to share learnings as we go (there's probably going to be a fair bit of nuance around how and when to distribute & cache jobs) but adding it to the Gatsby monorepo wouldn't help us move quickly in the short term.

As things settle we'll be writing an RFC with more details about what will be the final API but pinning versions & syncing with changes should work for now.

Excited to be working with y'all on this!

@biilmann
Author

Excited about the job metadata spec and the RFC! Those will be great steps forward.

I’ve open sourced the gatsby-parallel-runner plugin in its own repo at:

https://github.com/netlify/gatsby-parallel-runner

It’s perhaps a shame that the parallel capabilities of Gatsby will be developed in private for now, outside of the open-source community, but I can obviously understand the business reasons behind it!

Since we’re already seeing great results from the parallel runner approach, we’ll keep building on this in the open from our side. Hopefully plugin authors outside of Gatsby, Inc can benefit a lot from having a framework that allows them to experiment with parallelization and contribute potential improvements and capabilities to the core parallel runtime, even if the final implementation will still be very much in flux.

Looking forward to continuing to work together on a faster web!
