
[WIP] feat(gatsby): worker-pool for custom GraphQL field resolvers #10938

Closed · wants to merge 29 commits

Conversation

@Moocar (Contributor) commented Jan 9, 2019

WIP: Don't merge

I'd love some feedback on this PR before I finish it off. Read on for an overview of what's going on.

Summary

This PR adds the ability for GraphQL field resolvers to be executed in a pool of child Node.js processes, offloading work across more than one core.

This is a partial answer to #8400 (this PR parallelizes field resolvers rather than entire queries).

Motivation

For larger sites, executing GraphQL queries takes up the majority of build time. GraphQL first filters down to the set of nodes that satisfy the query, then calls the resolve function for each field in the query body.

Most of these resolve functions simply return a property of the node in question. Since this operation is so fast and relies on querying a large set of nodes held in memory, it doesn't make sense to run these resolvers in another process.

Resolvers declared by plugins via the setFieldsOnGraphQLNodeType node API, however, often perform CPU-intensive tasks like generating images or parsing markdown. By offloading these functions to other cores, we can potentially speed up the build.
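For context, here's the shape of resolver we're talking about — a minimal sketch in the style of gatsby-transformer-remark (the field and the exact remark pipeline are illustrative, not the plugin's real code):

```js
const { GraphQLString } = require(`graphql`)
const remark = require(`remark`)
const remarkHtml = require(`remark-html`)

exports.setFieldsOnGraphQLNodeType = ({ type }) => {
  if (type.name !== `MarkdownRemark`) return {}
  return {
    html: {
      type: GraphQLString,
      // CPU-intensive: parses and renders markdown for every queried node
      resolve: async markdownNode => {
        const file = await remark()
          .use(remarkHtml)
          .process(markdownNode.internal.content)
        return String(file)
      },
    },
  }
}
```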

How it works

  1. To specify that a field resolver should be run in parallel, plugin authors add a workerPlugin property to the field config (see the sketch after this list). This signifies that the field resolver is fully async and is safe to execute in a worker.
  2. During schema generation, in setFieldsOnGraphQLNodeType, we find all workerPlugin fields and store them.
  3. Before running queries, we create a pool of jest-workers. Each worker is supplied with the list of workerPlugin fields. Each worker will:
    1. load each field's plugin's gatsby-node.js and call setFieldsOnGraphQLNodeType() with an API matching what would normally be passed from api-runner-node (e.g. getNode), except that each API function is reimplemented as an RPC that communicates over IPC back to the master process, where it is fulfilled.
    2. expose a function execResolver that calls the resolver returned in the previous step.
  4. Redefine the field resolvers to call the worker pool's execResolver function.
  5. When Gatsby runs the query phase, GraphQL calls the resolver, which calls the worker, which executes the real resolver on its core and returns the response.
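A simplified sketch of steps 3–5. All names (makeRpc, setup, the message shapes) are illustrative rather than the PR's actual implementation, and it assumes the @moocar/jest-worker fork that lets the child use the IPC channel:

```js
// worker.js — runs in each jest-worker child process (illustrative sketch)
let nextRpcId = 0
const pendingRpcs = new Map()

// Reimplement a Gatsby API function (e.g. getNode) as an RPC: send the call
// over IPC to the master process, which fulfils it and replies with the result
const makeRpc = fnName => (...args) =>
  new Promise(resolve => {
    const id = nextRpcId++
    pendingRpcs.set(id, resolve)
    process.send({ type: `rpc`, id, fnName, args })
  })

process.on(`message`, msg => {
  if (msg && msg.type === `rpcResponse`) {
    pendingRpcs.get(msg.id)(msg.result)
    pendingRpcs.delete(msg.id)
  }
})

const resolvers = new Map()

// Step 3: called once per worker with the list of workerPlugin fields. Loads
// each plugin's gatsby-node.js and keeps the resolvers it returns.
async function setup(workerFields) {
  for (const { pluginName, typeName, fieldName } of workerFields) {
    const gatsbyNode = require(`${pluginName}/gatsby-node`)
    const fields = await gatsbyNode.setFieldsOnGraphQLNodeType({
      type: { name: typeName },
      getNode: makeRpc(`getNode`), // fulfilled back in the master process
      getNodesByType: makeRpc(`getNodesByType`),
    })
    resolvers.set(`${typeName}.${fieldName}`, fields[fieldName].resolve)
  }
}

// Steps 4–5: the master redefines each workerPlugin field's resolver to call
// this via the pool, roughly:
//   resolve: (node, args) => pool.execResolver(`${typeName}.${fieldName}`, node, args)
function execResolver(key, node, args) {
  return resolvers.get(key)(node, args)
}

module.exports = { setup, execResolver }
```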

Considerations/Things to Know

  • This works very effectively on resolvers that perform CPU-intensive tasks.
  • Inter-process communication (IPC) between processes can be expensive, since messages must be serialized to and from JSON before being sent over the pipe. So resolvers that take large nodes or return large responses benefit less from workers. In fact, if CPU usage is low and the node/response are large, offloading the work to another core can actually be slower.
  • sharp uses libvips, which can already use all cores. So even though sharp is highly CPU-intensive and writes to disk rather than returning a large response, there's no point in running it as a resolver worker.
  • I've only enabled worker resolvers on transformer-remark, since it's a good candidate. On the markdown benchmark, I see a 2x performance increase in the query phase on my 4-core laptop.
  • I've used jest-worker since it includes queueing, pooling and sharding. However, I had to make a change to enable IPC communication (you can see the changes here). Assuming we decide to go ahead with the above, I'll submit a PR to jest. In the meantime, I've published my own @moocar/jest-worker.
  • I looked briefly at memory-mapping (e.g. mmap-io) to share immutable data across workers, but it did my head in.

TODO/Questions/Specific Feedback requests

  • still need to write a docs guide
  • I'm not happy with the workerPlugin property. Other ideas on how to automatically figure out the plugin name would be appreciated.
  • Handle timeouts in RPCs. It shouldn't really happen, since these are only pipes rather than network boundaries, but we should still handle it.
  • Potentially make actions (e.g. createNode) into RPCs as well. For now, I've re-implemented them to throw an "unsupported" error.
  • Figure out whether we need to call initPool() again after updating the schema in bootstrap (since the schema is built twice).
  • the cache needs some thinking. I've used jest-worker's computeWorkerKey to ensure that nodes are always sent to the same worker, allowing workers to take advantage of local caches (sketched below). But I'm open to better solutions.
  • Tests! (yay)
  • please run this on your machine, especially if you have > 4 cores. A great test is the markdown benchmark (commands below); I see a 2x speed up on my 4-core machine.
  • gatsby-plugin-sharp cheats and loads redux actions directly, as opposed to having them passed in. This means it will use its own redux store per process. Need to fix.
```sh
markdown $ pwd
/Users/amarcar/dev/gatsby/gatsby/benchmarks/markdown
markdown $ rm -rf .cache && NUM_PAGES=1000 NUM_ROWS=1000 gatsby build
```
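On the caching point above, here's roughly how work can be pinned to a single worker with jest-worker's computeWorkerKey option; the method names match the sketches earlier, not the actual code:

```js
// Sketch: route every call for a given node to the same child process so
// that per-worker caches stay warm. Assumes the @moocar/jest-worker fork.
const Worker = require(`@moocar/jest-worker`).default

const pool = new Worker(require.resolve(`./worker`), {
  exposedMethods: [`setup`, `execResolver`],
  // execResolver is called as (key, node, args); keying on node.id shards
  // nodes deterministically across the pool
  computeWorkerKey: (method, key, node) =>
    method === `execResolver` ? String(node.id) : null,
})
```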

Future

I'm excited to see whether this work can enable other parts of Gatsby to run in parallel. It will require a lot more of Gatsby to become asynchronous. Fun times.

@Moocar requested a review from a team as a code owner on January 9, 2019 07:47
@Moocar (Contributor, Author) commented Jan 10, 2019

I had a great chat with @KyleAMathews yesterday about some alternative approaches to this. The core idea is to give plugin resolvers a "worker API" that they can explicitly call to offload work to another process, rather than running ALL resolvers on workers. I'm going to close this PR while I think about this idea.

@Moocar closed this on Jan 10, 2019
@Moocar (Contributor, Author) commented Jan 11, 2019

An interesting finding: when using workers, it's important to bump up the query-queue.js concurrency so that more jobs can be queued up on the workers. When concurrency == num cores, workers sit idle while the main thread processes each response before sending out the next job. Illustrated below (and sketched after the screenshots):

concurrency == 4:

[screenshot: query timeline, concurrency == 4]

concurrency == 32:

[screenshot: query timeline, concurrency == 32]
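A sketch of the tweak, assuming query-queue.js keeps using a better-queue instance; handleQuery and run-query-on-worker-pool below are hypothetical stand-ins:

```js
const Queue = require(`better-queue`)
const os = require(`os`)

// hypothetical module that ships a query job's resolvers to the worker pool
const runQueryOnWorkerPool = require(`./run-query-on-worker-pool`)

const handleQuery = (queryJob, cb) =>
  runQueryOnWorkerPool(queryJob).then(result => cb(null, result), cb)

const queue = new Queue(handleQuery, {
  // concurrent == num cores leaves workers idle while the main thread
  // processes each response; oversubscribing keeps their queues full
  concurrent: os.cpus().length * 8, // e.g. 32 on a 4-core machine
})
```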

@Moocar (Contributor, Author) commented Feb 7, 2019

Ok, I'm re-opening this. I've done some more research on parallel resolvers and looked at alternative implementations. I can break my findings into two areas:

1. We should feature flag it

The upside of parallel resolvers is clear: resolvers that operate on small nodes, perform lots of CPU work, and return small responses will see a near-linear increase in query performance per core.

The reality, however, is that with the exception of image processing (which already uses all cores), there are no resolvers that meet these criteria. The closest is transformer-remark. But if a markdown node needs lots of CPU processing, then the node and the result will both be large, so we're taxed by the to/from-JSON serialization that IPC requires.

Still, we do see a speed up. As mentioned in the PR description, there's a 2x speed up in query performance on benchmarks/markdown on a 4-core machine. I ran a test on a 32-core machine and saw an 8x speed up. That's a lot of CPU wasted on serialization, but it's still a decent speed up.

In comparison, though, when running the test on real websites like www or gatsbygram, I don't see any noticeable speed up on my 4-core machine.

So my recommendation is that we feature flag this. That way, if someone knows their site might benefit, or they're running on a 32-core rig, they can run builds with parallelism, but we won't accidentally make some sites slower.

2. We shouldn't use a "worker API"

The implementation in this PR runs entire resolvers on a worker farm; all the plugin author has to do is set the workerPlugin field on the resolver. After a chat with @KyleAMathews, we came up with an alternative approach where resolvers would explicitly offload high-CPU work to a worker farm via an API supplied by Gatsby (sketched below). I've created a PoC of this that you can see at Moocar/worker-api (comparison).
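The rough shape of that alternative, as I understand it — offloadToWorker is an illustrative name, not the PoC's actual API:

```js
const { GraphQLString } = require(`graphql`)

exports.setFieldsOnGraphQLNodeType = ({ type, offloadToWorker }) => {
  if (type.name !== `MarkdownRemark`) return {}
  return {
    html: {
      type: GraphQLString,
      // The plugin explicitly ships the heavy call to the farm; the heavy
      // function must live in its own file so jest-worker can load it
      resolve: markdownNode =>
        offloadToWorker(require.resolve(`./render-html.worker`), markdownNode),
    },
  }
}
```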

At a glance, this approach makes more sense, since plugin code might know that certain nodes would benefit from being run on a worker farm, while others may not. But there are a number of downsides:

  • In reality, it's hard for many plugins to know whether a node is better off being run on a worker. E.g. for transformer-remark, is it a node with lots of content? Or with lots of code blocks? Or does the list of sub-plugins (e.g. remark-code-titles) affect it? Writing ad hoc code to answer this is hard. It's really a job for machine learning to figure out at scale.
  • This PR uses jest-worker's computeWorkerKey to always send work for a particular node to the same worker. This won't be possible with a worker API, since we no longer have context about the node.
  • In setFieldsOnGraphQLNodeType, we close over Gatsby API functions (e.g. getNode). But with a worker API, we'd need to pass these in a context to each function instead, which would mean rewriting most plugin code (see remark/worker.js for an example, and the sketch after this list).
  • Plugin code has to be rewritten so that high-CPU functions live in a separate file (which will be loaded by jest-worker), whereas the solution in this PR simply requires a workerPlugin field on the graphql field config.
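To illustrate the closure problem from the list above — a sketch of the contrast, not actual plugin code:

```js
const { GraphQLString } = require(`graphql`)

// Today: the resolver closes over getNode from setFieldsOnGraphQLNodeType
exports.setFieldsOnGraphQLNodeType = ({ getNode }) => ({
  parentBody: {
    type: GraphQLString,
    resolve: node => getNode(node.parent).internal.content,
  },
})

// With a worker API, that closure can't cross the process boundary, so the
// same helper would have to be threaded through the resolver's context:
//   resolve: (node, args, context) =>
//     context.getNode(node.parent).internal.content
```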

Next Steps/Asks

See the original PR asks. This PR isn't ready to be merged, but hopefully I can get some feedback on the rough approach.

@Moocar reopened this on Feb 7, 2019
```diff
@@ -403,12 +408,14 @@ module.exports = (
   return resolve({
     html: {
       type: GraphQLString,
+      workerPlugin,
```
A reviewer (Contributor) commented:
Not sure I like that. This will break graphql-js/graphql-compose typings for people who use TypeScript. Could we do something like:

```js
const workerPlugin = new GatsbyWorkerPlugin(`gatsby-transformer-remark`)

{
  htmlAst: {
    type: GraphQLJSON,
    resolve: workerPlugin.wrapResolver(markdownNode => {
      // ...
    }),
  },
}
```

@Moocar (Contributor, Author) replied:
Love this idea. It's way more explicit, rather than flagging the field and picking it up later in the process. Thanks!

@Moocar (Contributor, Author) added:
Hmm, the implementation to get something like wrapResolver is gonna be pretty hacky, but it will work. Does GraphQL have any other way of specifying "tags" on Field objects? I essentially want to attach metadata to a field.

```js
  return o
}

/**
```
@freiksenet (Contributor) commented:
My branch makes it so that all resolvers use functions passed in via context instead of globals. So here you could modify context to include the worker-resolver machinery.

@Moocar (Contributor, Author) replied on Feb 7, 2019:
@freiksenet This isn't a breaking change, right? I.e., will field resolvers still be able to use the closed-over getNode etc. supplied by setFieldsOnGraphQLNodeType, in addition to those supplied in the context?

@Moocar (Contributor, Author) added:
I just had a read through of the branch. It looks like the only API functions injected into context are on nodeModel, which supplies getNode etc. Is there also a plan to include things like getCache, reporter, createContentDigest etc. as well?

@Moocar (Contributor, Author) commented Feb 17, 2019

Pausing this while I figure out how to make it work with the latest plugin-sharp changes: #10964
