Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(gatsby): Create index on the fly for non-id index #20729

Merged
merged 4 commits into from
Jan 28, 2020
Merged

Conversation

pvdz
Copy link
Contributor

@pvdz pvdz commented Jan 20, 2020

This takes a more generic approach to shortcircuiting eq filters one exactly one property, including a chain of properties.

To be clear; this can make a HUGE perf difference at scale. It is a followup of #20609 which only applied to index by id and dropped a site runtime from 4.5h to 5 minutes.

This PR seems to improve perf the same for that site, but now for slug or any other property!

This means a filter like allISomeSource(filter: { fields: { slug: { eq: $slug } } }) { can now be as fast as using id, which before was the only thing being indexed.

This PR is not finished (hence in a draft state). The graphql guru's need to take a closer look at this and then there are some (for me) obvious points that need to be addressed;

  • How should we deal with the mappedByKey cache?
    • I've now added it as a key to the nodeModel and pass it through to runQuery ---- runSift. See next point about verification.
  • How can I verify (where are these tests?) that this cache is properly torn down when any node gets changes/added/deleted? I don't see existing tests and I'm not sure how to run this e2e such to test this. I would want to start with some nodes, run a filter, verify the cache was updated, and then for each of the three mutations apply the mutation and verify the cache was reset.
  • Am I shortcutting too much? Too little? Since I'm not running sift, I'm a liiiitle worried that I'm missing something to the filtering.
  • Is the __gatsby_resolved bit properly handled?
    • it is now
  • Can we skip the __gatsby_resolved bit? My benchmarks complete without error if I omit them, but I'm worried that it just means somehting broke but didn't lead to a syntax error.
    • no, if nothing else a query can filter for that property too
  • Confirm the __gatsby_resolved property is not set "too early". Or is that state basically invariant after the bootstrap? (Before it would be set on each call of the filter, now we're just setting it once when creating the cache, technically there's time where the state could change, but I don't think that's the case here)
  • Abstraction / polish
  • If the cache is missed, can we just return undefined instead of going through sift anyways? I think it will result in the same result.
    • that seems to be correct
    • No, properties can be aliased in graphql schemas (slug: String @proxy(from: "slugInternal") and we don't know about that so a miss should still go through sift just in case.
  • For a different PR, shortcut early if filter is empty. Other parts are already kind of doing this, but after the heavy step.
  • I've moved the ./nodes requires to the top. Was there a particular reason they were inline to the function?
    • there used to be a circular ref but seeing how all tests are passing that seems to no longer be the case
  • Anyone know a faster/better way to assert that the filter is a singler property path to eq? The current (initial) appraoch seems to perform fine, but I wouldn't be surprised if there were faster ways.
    • I guess not really, if at all.
  • Fix tests (fairly easy by the looks of it, but who knows)
    • fixed
  • Verify the anomaly where the bench-md-id benchmark (186 before/after) now completes slower than the bench-md-slug benchmark (208->178)
    • Most likely caused by mega maps, which are larger than the number of pages since there are more nodes. By giving id the similar treatment (creating an index based on types+id) a smaller map is created and the lookup is faster. One shortcut we can still apply for id is not creating the Sets in the index, since each bucket can only have one node (since id must be unique). This now drops the bench-md-id benchmark from 186 to 173. Woohoo
  • Fix sorting
  • Special casing for elemMatch
  • Add (general) test case for elemMatch (--> @freiksenet will add a test)
  • Benchmark whether it makes a difference to loop through the sets by type instead of all nodes inside ensureIndexByTypedChain. Let's say there's three types and they have an equal share of nodes then it might only loop a third of the total number of nodes. And these sets are already present anyways. But it might not really make a difference so I need to check this.

While an improvement for many, this adds a little overhead to filters that are not flat (one leaf), filters that use elemMatch, and even more regression for schemas with @proxy where the filter misses (since that still first goes through the new index). But probably within acceptable bounds.

Copy link
Contributor

@freiksenet freiksenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should we deal with the mappedByKey cache? See TODO desc inline for details.

Added inline comment

Is the __gatsby_resolved bit properly handled?
Can we skip the __gatsby_resolved bit? My benchmarks complete without error if I omit them, but I'm worried that it just means somehting broke but didn't lead to a syntax error.

See inline comment for both. We can't skip it, but it needs to be moved.

If the cache is missed, can we just return undefined instead of going through sift anyways? I think it will result in the same result.

As we always ensure cache, I think yes.

I've moved the ./nodes requires to the top. Was there a particular reason they were inline to the function?

There used to be a circular dependency. Are tests passing?

Verify the anomaly where the bench-md-id benchmark (186 before/after) now completes slower than the bench-md-slug benchmark (208->178)

Could it just be log growth of map lookup? You do by-id lookup on map of all nodes vs by-index lookup of a much smaller subset.

packages/gatsby/src/redux/nodes.js Outdated Show resolved Hide resolved
packages/gatsby/src/redux/nodes.js Show resolved Hide resolved
packages/gatsby/src/redux/run-sift.js Outdated Show resolved Hide resolved
packages/gatsby/src/redux/nodes.js Outdated Show resolved Hide resolved
@pvdz
Copy link
Contributor Author

pvdz commented Jan 21, 2020

Could it just be log growth of map lookup?

At first I thought so as well but then I realized that slugs have it worse; they are unique (supposedly) but are also adding an extra Set object for each position in the Map. There is a catch here, though; we generate way more nodes than we have pages. So perhaps that's a reason. I will make a note to look into doing the getPage lookup through type specific maps instead, see if that makes a difference at scale.

@pvdz pvdz force-pushed the generic-anti-sift branch from 97f0a80 to a4ab520 Compare January 21, 2020 09:31
@pvdz
Copy link
Contributor Author

pvdz commented Jan 21, 2020

There used to be a circular dependency. Are tests passing?

Tests are passing and I quickly checked but do not spot a circle now. I guess it's resolved so let's keep it at the top.

@pvdz
Copy link
Contributor Author

pvdz commented Jan 21, 2020

Could it just be log growth of map lookup?

It was. See the updated code and how id gets special cased now. An indexes is still created for it (just sans the Sets) leading to smaller Maps leading to faster lookups.

@pvdz pvdz force-pushed the generic-anti-sift branch from a4ab520 to c04d5bf Compare January 21, 2020 12:35
@pvdz pvdz force-pushed the generic-anti-sift branch 6 times, most recently from 9727b7f to 1977ecf Compare January 22, 2020 19:22
Copy link
Contributor

@freiksenet freiksenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite brilliant! 👍 Some minor comments.

packages/gatsby/src/redux/run-sift.js Show resolved Hide resolved
packages/gatsby/src/schema/node-model.js Outdated Show resolved Hide resolved
packages/gatsby/src/schema/__tests__/node-model.js Outdated Show resolved Hide resolved
packages/gatsby/src/schema/node-model.js Show resolved Hide resolved
@freiksenet
Copy link
Contributor

How can I verify (where are these tests?) that this cache is properly torn down when any node gets changes/added/deleted? I don't see existing tests and I'm not sure how to run this e2e such to test this. I would want to start with some nodes, run a filter, verify the cache was updated, and then for each of the three mutations apply the mutation and verify the cache was reset.

There are currently no cache tests, the problem is that it also deconstructs cache at wrong level. Ideally graphql-runner should take care of monitoring changes to schema and nodes, however for some reason I implemented it above graphql-runner.

Am I shortcutting too much? Too little? Since I'm not running sift, I'm a liiiitle worried that I'm missing something to the filtering.

I'd say we can shortcut a bit more. Eg should we rerun with sift if we return undefined?

Confirm the __gatsby_resolved property is not set "too early". Or is that state basically invariant after the bootstrap? (Before it would be set on each call of the filter, now we're just setting it once when creating the cache, technically there's time where the state could change, but I don't think that's the case here)

__gatsby_resolve is guaranteed to be available once you run prepareNodes in node-model. Thus it should always be there if it is needed. If the test that proxies "slug" to "originalSlug" works correctly then it's all done correctly.

@pvdz pvdz force-pushed the generic-anti-sift branch 5 times, most recently from 6d9cd88 to 1d1bb2e Compare January 23, 2020 16:49
@pvdz
Copy link
Contributor Author

pvdz commented Jan 23, 2020

Apart from the one TODO about the nodes loop in ensureIndexByTypedChain, this is nearing completion. Comments are welcome. I'll remove the draft tags once I'm done with profiling.

@pvdz pvdz force-pushed the generic-anti-sift branch 2 times, most recently from 60793c7 to 8dd7989 Compare January 24, 2020 09:30
@pvdz pvdz removed the status: WIP label Jan 27, 2020
@pvdz pvdz marked this pull request as ready for review January 27, 2020 11:50
@pvdz pvdz requested a review from a team as a code owner January 27, 2020 11:50
@pvdz pvdz force-pushed the generic-anti-sift branch from 668bb23 to 95444b2 Compare January 27, 2020 15:12
@freiksenet freiksenet requested a review from vladar January 27, 2020 15:41
@pvdz pvdz force-pushed the generic-anti-sift branch from 95444b2 to cb83b92 Compare January 28, 2020 08:10
freiksenet
freiksenet previously approved these changes Jan 28, 2020
Copy link
Contributor

@freiksenet freiksenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Let's wait for @vladar 's review and ship it.

packages/gatsby/src/redux/nodes.js Show resolved Hide resolved
vladar
vladar previously approved these changes Jan 28, 2020
Copy link
Contributor

@vladar vladar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work! Looks good to me. Left one small nit that shouldn't block this PR in any way.

packages/gatsby/src/redux/run-sift.js Show resolved Hide resolved
@pvdz pvdz dismissed stale reviews from vladar and freiksenet via 658d07e January 28, 2020 11:56
@pvdz
Copy link
Contributor Author

pvdz commented Jan 28, 2020

Pushed a fix by @freiksenet to make sure the loki path also works (tests were failing because of that, thankfully). So now GATSBY_DB_NODES=loki yarn test also passes.

@pvdz pvdz merged commit 115d5c4 into master Jan 28, 2020
@delete-merged-branch delete-merged-branch bot deleted the generic-anti-sift branch January 28, 2020 13:20
* cached instead of a Set of Nodes.
*/
replaceTypeKeyValueCache(map = new Map()) {
this._typedKeyValueIndexes = new Map() // See redux/nodes.js for usage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a mistake not using the argument to assign?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ... Eh ... yes? :/

pvdz added a commit that referenced this pull request Feb 11, 2020
gatsbybot pushed a commit that referenced this pull request Feb 11, 2020
gatsbybot added a commit to gatsbyjs/gatsby-starter-blog that referenced this pull request Feb 11, 2020
leonhiat added a commit to leonhiat/gatsby-starter-blog that referenced this pull request Oct 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants