Add gatsby-parallel-runner package #21733

biilmann · 2020-02-25T15:07:46Z

Description

gatsby-parallel-runner is a Gatsby build runtime that allows plugins and core parts of Gatsby to take advantage of the external jobs concept with IPC introduced in #20835.

When gatsby is executed with gatsby-parallel-runner instead of with gatsby build, it will be wrapped in a parent runtime that can process external jobs.
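
As a rough illustration of the wrapping described above, the sketch below shows how a parent runtime could route job messages from a child build process to registered processors. The message types (`JOB_CREATED`, `JOB_COMPLETED`, `JOB_FAILED`), the payload fields, and the `createJobRouter` helper are assumptions made for this example; they are not the confirmed protocol from #20835 or the package's actual code.

```javascript
// Hypothetical sketch: route external-job messages to registered processors.
// Message types and payload fields are assumed for illustration only.
function createJobRouter(processors) {
  return async function handleMessage(msg) {
    if (!msg || msg.type !== "JOB_CREATED") {
      return null // not a job message; the parent runtime would ignore it
    }
    const { id, name, args } = msg.payload
    const processor = processors[name]
    if (!processor) {
      return {
        type: "JOB_FAILED",
        payload: { id, error: `No processor for ${name}` },
      }
    }
    try {
      const result = await processor(args)
      return { type: "JOB_COMPLETED", payload: { id, result } }
    } catch (err) {
      return {
        type: "JOB_FAILED",
        payload: { id, error: String((err && err.message) || err) },
      }
    }
  }
}

// Example wiring with a stand-in processor. A real processor would hand the
// work to a cloud function instead of counting files locally.
const handleMessage = createJobRouter({
  IMAGE_PROCESSING: async args => ({ processed: args.files.length }),
})
```

In a parent runtime built on Node's `child_process.fork`, the reply could be passed back over the IPC channel with something like `child.on("message", async msg => { const reply = await handleMessage(msg); if (reply) child.send(reply) })`.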

The plugin currently includes support for the IMAGE_PROCESSING job emitted by gatsby-plugin-sharp, along with a Google Cloud based runner that parallelizes this task across Google Cloud Functions and offers clear performance benefits for image-heavy sites. On the default image benchmark site, initial tests show a more than 4x speedup for the image processing part of the build.

The plugin is built around a set of abstractions that make it viable to add additional parallel runtimes (an AWS Lambda based runtime would be a great example), and it is structured to make it easy for plugin authors to add additional cloud processors for new tasks.

I believe having a plugin like this in Gatsby core (rather than purely in an external plugin registry) will be important to allow plugin authors from the open source ecosystem to develop plugins that can take advantage of the external job system. Allowing plugin authors or developers working with site-specific local plugins to build and test jobs that can be trivially parallelized via serverless functions will greatly benefit the whole community.

Documentation

Install in your gatsby project:

npm i gatsby-parallel-runner

To use with Google Cloud, set relevant env variables in your shell:

export GOOGLE_APPLICATION_CREDENTIALS=~/path/to/your/google-credentials.json
export TOPIC=parallel-runner-topic

Deploy the cloud function:

npx gatsby-parallel-runner deploy

Then run your Gatsby build with the parallel runner instead of the default gatsby build command.

npx gatsby-parallel-runner

Related Issues

#19831

This adds a new plugin called `gatsby-parallel-runner` that brings support for running external jobs in parallel cloud functions. Out of the box it adds support for the gatsby-plugin-sharp processor and can significantly speed up image heavy Gatsby sites. It comes with an extensible model that can be used to allow any Gatsby plugin to take advantage of this type of cloud function based parallelization.

janosh · 2020-02-25T21:40:16Z

...tsby-parallel-runner/src/processor-queue/implementations/google-functions/__tests__/index.js

+    topic: () => {
+      return {
+        publish: async msg => {
+          expect(msg).toBe(msg)


Good catch, fixed the shadowing that caused this

janosh · 2020-02-26T07:15:16Z

Just out of curiosity, what does the plugin do when you're running a build locally while offline? Does it fall back to the old behavior of running everything in a single thread or does it try and fail to access Google cloud?

kdichev · 2020-02-26T09:47:22Z

Great work!!! 😸
This is something that could benefit a lot of big projects.
I was wondering what kind of jobs are available to potentially parallelize?
For instance we are generating around 20k pages and the html gen step also takes a bunch of time, could be awesome if we can split this job into multiple!

muescha · 2020-02-26T10:31:35Z

packages/gatsby-parallel-runner/src/processor-queue/implementations/google-functions/index.js

+    await file.download({ destination: `/tmp/result-${id}` })
+    const data = (await fs.readFile(`/tmp/result-${id}`)).toString()
+    const payload = JSON.parse(data)
+    await fs.remove(`/tmp/result-${id}`)


If I understand this comment (googleapis/nodejs-storage#676) correctly, you can read the file content directly without saving the file to disk.

Yeah, I was expecting that to work and had `const data = await file.download({})` in an earlier version, but consistently got an empty response that way. It would be much cleaner though. /tmp is an in-memory volume, so performance-wise one approach or the other probably won't make a difference, but it does add some annoying boilerplate code.
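
For reference, a minimal sketch of the in-memory variant discussed here. In `@google-cloud/storage`, calling `file.download()` without a `destination` resolves to a one-element array containing a `Buffer`, so the contents have to be destructured out of that array. Whether this avoids the empty response seen earlier is untested. The `readJsonResult` helper and the stub file object are hypothetical and exist only so the snippet runs on its own.

```javascript
// Hypothetical helper: read a JSON result straight into memory.
// Assumes `file` behaves like a @google-cloud/storage File, whose
// download() (called without a destination) resolves to [Buffer].
async function readJsonResult(file) {
  const [contents] = await file.download()
  return JSON.parse(contents.toString("utf8"))
}

// Stand-in for a real File object so the sketch runs without network access.
const fakeFile = {
  download: async () => [
    Buffer.from(JSON.stringify({ id: "job-1", ok: true })),
  ],
}

readJsonResult(fakeFile).then(payload => console.log(payload.id))
```

This drops the temporary file and the cleanup call, at the cost of holding the whole result in memory, which is also the case for the /tmp approach on an in-memory volume.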

packages/gatsby-parallel-runner/src/processor-queue/implementations/google-functions/index.js

Apparently some Google accounts enforce lowercase letters and dashes only in function names. This makes sure functions deployed via gatsby-parallel-runner follow that convention.
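
A naming rule like this could be enforced with a small normalising helper. The sketch below is an illustration only and not the code from this commit; the `toFunctionName` name is hypothetical, and keeping digits in the result is an assumption.

```javascript
// Hypothetical helper, not the package's actual code: normalise a name to
// lowercase letters, digits and dashes. Keeping digits is an assumption.
function toFunctionName(raw) {
  return String(raw)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // replace any other run of characters with one dash
    .replace(/^-+|-+$/g, "") // trim leading and trailing dashes
}

console.log(toFunctionName("IMAGE_PROCESSING")) // image-processing
console.log(toFunctionName("My Processor v2")) // my-processor-v2
```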

KyleAMathews · 2020-02-27T18:21:35Z

Hey Matt!

This is super exciting! We were hoping people would jump in with more implementations of the coming Jobs API so very glad y'all are doing deep dives on this 🙏

So a little background for everyone on the status of the Jobs API. As you probably have guessed, it's super early (and undocumented haha so props on reverse engineering it!). We've switched one plugin (gatsby-plugin-sharp) over and did a focused implementation for that on Gatsby Cloud. The end goal is an API for offloading CPU/IO/Memory/Network intensive work to workers as obviously node.js suffers at this & to scale/speed builds — everything needs to move eventually beyond a single process/machine. We're inspired by tools like Spark, Dagster, Kafka, experimental projects like gg, etc.

We'll be moving more plugins / internal functions over to the API & testing how that goes. We'll also be adding to core a "local build" implementation (on by default) that'll distribute jobs to our local worker processes (which we use already for doing HTML SSR). We're pretty certain the API will change as we do this so are intentionally leaving it undocumented while we experiment & learn.

One of our TODOs is to create a job metadata spec so that our cloud job runner (and other implementations) have all the info needed to create the functions for running jobs. That way plugins don't have to directly do anything to get jobs running anywhere — just write to the API. This would enable an e.g. gatsby-plugin-google-cloud-workers that would be responsible for creating functions & pushing them to Google & then actually running jobs. Y'all could do a native implementation on Netlify like we are, etc.

So long story short — we're really excited to see your implementation but it seems a bit early as we're probably going to be moving APIs around as we learn things & do more implementations. Also the extra framework bits you add won't be necessary. We're happy to have you maintain it in the meantime & we'd love to share learnings as we go (there's probably going to be a fair bit of nuance around how and when to distribute & cache jobs) but adding it to the Gatsby monorepo wouldn't help us move quickly in the short term.

As things settle we'll be writing an RFC with more details about what will be the final API but pinning versions & syncing with changes should work for now.

Excited to be working with y'all on this!

biilmann · 2020-02-27T23:32:52Z

Excited about the job metadata spec and the RFC! Those will be great steps forward.

I’ve open sourced the gatsby-parallel-runner plugin in its own repo at:

https://github.com/netlify/gatsby-parallel-runner

It’s perhaps a shame that the parallel capabilities of Gatsby will be developed in private for now, outside of the open-source community, but I can obviously understand the business reasons behind it!

Since we’re already seeing great results from the parallel runner approach, we’ll keep building on this in the open from our side. Hopefully plugin authors outside of Gatsby, Inc can benefit a lot from having a framework that allows them to experiment with parallelization and contribute potential improvements and capabilities to the core parallel runtime, even if the final implementation will still be very much in flux.

Looking forward to continuing to work together on a faster web!

biilmann requested review from a team as code owners February 25, 2020 15:07

biilmann added 4 commits February 25, 2020 07:11

Fix two typos in README

43edd6e

Fix linter issues in gatsby-parallel-runner

4d0a08c

Remove a trailing newline in README

ca9c432

Fix failing test case for gatsby-parallel-runner

5c9676a

janosh reviewed Feb 25, 2020

biilmann added 3 commits February 25, 2020 13:46

Update package.json

0d55fc1

Improve test case for google pubsub implementation

a6e225b

fix node 8 compatibility for gatsby-parallel-runner

92d73e1

muescha reviewed Feb 26, 2020

jamo reviewed Feb 26, 2020

packages/gatsby-parallel-runner/src/processor-queue/implementations/google-functions/index.js

biilmann added 2 commits February 26, 2020 13:29

Stricter function naming for google functions

f8934cb

Apparently some Google accounts enforce lowercase letters and dashes only in function names. This makes sure functions deployed via gatsby-parallel-runner follow that convention.

small code style fix and version bump

2ea23a1

t2ca reviewed Feb 26, 2020

packages/gatsby-parallel-runner/src/processor-queue/implementations/google-functions/index.js

Fix broken Date().now() call

4e1600e

KyleAMathews closed this Feb 27, 2020
