
[apollo-engine-reporting] : willResolveField called after stopTiming! #4472

Closed
loftykhanna opened this issue Aug 11, 2020 · 21 comments · Fixed by #6398

Comments

@loftykhanna

We are receiving the error Error: [internal apollo-server error] willResolveField called after stopTiming!, which is leading to unhandled rejections and frequent restarts of our Node server.

Any insight into how to debug this, or into the situations in which it usually happens, would be appreciated.

"apollo-server": "^2.14.3",
"apollo-engine-reporting": "2.1.0"

How do we pinpoint which resolver is causing this issue? The entire stack trace is from graphql and apollo-server.

Thanks

@codehimanshu

Facing the same error.
"apollo-server": "2.9.13"

Attaching complete stack trace:
You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection: Error: [internal apollo-server error] willResolveField called after stopTiming!
    at internalError (/usr/src/app/packages/giraffe/node_modules/apollo-engine-reporting/dist/treeBuilder.js:8:12)
    at EngineReportingTreeBuilder.willResolveField (/usr/src/app/packages/giraffe/node_modules/apollo-engine-reporting/dist/treeBuilder.js:49:19)
    at EngineReportingExtension.willResolveField (/usr/src/app/packages/giraffe/node_modules/apollo-engine-reporting/dist/extension.js:85:33)
    at handlers.extensions.map.extension (/usr/src/app/packages/giraffe/node_modules/graphql-extensions/dist/index.js:48:23)
    at Array.map (<anonymous>)
    at GraphQLExtensionStack.willResolveField (/usr/src/app/packages/giraffe/node_modules/graphql-extensions/dist/index.js:47:14)
    at field.resolve (/usr/src/app/packages/giraffe/node_modules/graphql-extensions/dist/index.js:104:28)
    at resolveFieldValueOrError (/usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:467:18)
    at resolveField (/usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:434:16)
    at executeFields (/usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:275:18)
    at collectAndExecuteSubfields (/usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:713:10)
    at completeObjectValue (/usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:703:10)
    at completeValue (/usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:591:12)
    at completeValue (/usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:557:21)
    at /usr/src/app/packages/giraffe/node_modules/graphql/execution/execute.js:492:16

@grrowl

grrowl commented Nov 25, 2020

This caused an alert for us this morning as every node spewed stack traces into STDERR, following apollo-engine 502s that did the same thing yesterday. Is this a client-facing error, or is it only related to engine reporting? Can we customise how these errors are logged by apollo-server?

@awinograd

This error started showing up for me after upgrading from apollo-server 3.4.1 to 3.6.1

@glasser
Member

glasser commented Jan 13, 2022

I wonder if this is related to #5372
Probably not since that's about the end hook being called too late.

@chrskrchr

chrskrchr commented Mar 22, 2022

My team also started seeing these unhandled promise rejections in our Fastify + apollo-server-core@3.6.4 + graphql@15 service after a recent change in client query patterns. The query being executed by the client is fairly complex, and we're still trying to track down the specific thing in the query/resolvers that's triggering the unhandled promise rejection, but we know for certain that the unhandled rejections are coming from the usage reporting plugin:

Error: [internal apollo-server error] willResolveField called after stopTiming!
    at internalError (/usr/src/app/node_modules/apollo-server-core/dist/plugin/traceTreeBuilder.js:7:12)
    at TraceTreeBuilder.willResolveField (/usr/src/app/node_modules/apollo-server-core/dist/plugin/traceTreeBuilder.js:51:19)
    at Object.willResolveField (/usr/src/app/node_modules/apollo-server-core/dist/plugin/usageReporting/plugin.js:288:52)
    at Dispatcher.invokeSyncDidStartHook (/usr/src/app/node_modules/apollo-server-core/dist/utils/dispatcher.js:48:43)
    at invokeWillResolveField (/usr/src/app/node_modules/apollo-server-core/dist/requestPipeline.js:138:77)
    at field.resolve (/usr/src/app/node_modules/apollo-server-core/dist/utils/schemaInstrumentation.js:31:13)
    at resolveField (/usr/src/app/node_modules/graphql/execution/execute.js:464:18)
    at executeFields (/usr/src/app/node_modules/graphql/execution/execute.js:292:18)
    at collectAndExecuteSubfields (/usr/src/app/node_modules/graphql/execution/execute.js:748:10)
    at completeObjectValue (/usr/src/app/node_modules/graphql/execution/execute.js:738:10)

For now, we've suppressed the unhandled rejections by building a wrapper plugin around the ApolloServerPluginUsageReporting plugin that adds a try/catch around the call to willResolveField(), swallowing the error thrown by this code:

https://github.com/apollographql/apollo-server/blob/main/packages/apollo-server-core/src/plugin/traceTreeBuilder.ts#L70-L72

  public willResolveField(info: GraphQLResolveInfo): () => void {
    ...
    if (this.stopped) {
      throw internalError('willResolveField called after stopTiming!');
    }

We haven't seen an unhandled rejection since deploying this wrapper plugin to our production environment. However, this isn't an ideal solution.
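
For anyone needing a similar stopgap, here is a minimal sketch of that kind of wrapper (illustrative only; the function name and structure are ours, not part of apollo-server): it delegates to ApolloServerPluginUsageReporting and swallows anything thrown from its willResolveField hook.

  import { ApolloServerPlugin } from 'apollo-server-plugin-base';
  import { ApolloServerPluginUsageReporting } from 'apollo-server-core';

  // Wraps the usage reporting plugin so that a "willResolveField called after
  // stopTiming!" error is swallowed instead of surfacing as an unhandled rejection.
  export function usageReportingWithGuard(): ApolloServerPlugin {
    const inner = ApolloServerPluginUsageReporting();
    return {
      ...inner,
      async requestDidStart(requestContext) {
        const listener = await inner.requestDidStart?.(requestContext);
        if (!listener) return listener;
        return {
          ...listener,
          async executionDidStart(executionContext) {
            const execListener = await listener.executionDidStart?.(executionContext);
            if (!execListener) return execListener;
            return {
              ...execListener,
              willResolveField(fieldParams) {
                try {
                  return execListener.willResolveField?.(fieldParams);
                } catch {
                  // Field started resolving after the trace was already stopped; ignore.
                  return undefined;
                }
              },
            };
          },
        };
      },
    };
  }

You would then pass usageReportingWithGuard() in the server's plugins array in place of ApolloServerPluginUsageReporting().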

I've put together a sample repo that shows how to reproduce the condition above where willResolveField() is called after the response has already been sent. That repo is here:

https://github.com/chrskrchr/apollo-server-race-condition

Although I've been able to reliably reproduce this willResolveField called after willSendResponse condition, I haven't been able to figure out under what circumstances it produces an unhandled promise rejection.

@glasser
Member

glasser commented Mar 30, 2022

@chrskrchr Thanks for the reproduction! I'm looking into this now.

FYI, your package-lock file tries to install all packages from some sort of private "artifactory". I'm just deleting the package-lock and reinstalling from scratch, so hopefully I'll get close enough versions to reproduce. In the future when making otherwise excellent reproductions, watch out for that :)

@glasser
Member

glasser commented Mar 30, 2022

Here's a PR against your repo that makes the reproduction more reproducible: chrskrchr/apollo-server-race-condition#1

I think this is related to how graphql-js execution handles the bubbling-up of errors/nulls when there is other stuff running in parallel below it. I want to investigate the graphql-js execution code a bit more to see if what's going on is obvious. This does raise some questions about the "right" way to handle this case.

@glasser
Member

glasser commented Mar 30, 2022

OK, it looks pretty clear that graphql-js doesn't proactively short-circuit executions that are guaranteed to end up ignored by null-bubbling, as long as the execution has actually started (which happens because Promises are involved).

I believe what's going on here is: let's say you have { x y { ARBITRARY_SELECTION_SET } }, where x has a non-null return type, and x and y both have resolvers that return Promises. And let's say that x returns a Promise that rejects (or resolves to null), while y resolves normally. What this means is that we're going to eventually end up with data: null (nothing under y will actually matter), but graphql-js execution will continue running whatever is under ARBITRARY_SELECTION_SET without any sort of short-circuiting. (Its result ends up in a Promise that is being captured by a Promise.all which has already rejected.) In fact, the Promise returned from execute itself can happily resolve while execution is still chugging away on an arbitrary number of fields under that ARBITRARY_SELECTION_SET. There's no way to detect from the outside that "all the execution related to this operation is done", nor to "short-circuit" execution so that it stops.
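
As a concrete illustration, here is a minimal sketch (made-up schema and field names) that can be run against graphql-js: the non-null sibling x rejects, execute resolves with data: null, and the resolver under y keeps running afterwards.

  import { graphql, buildSchema } from 'graphql';

  // Hypothetical schema: x is non-null, so a rejection there bubbles up to data: null.
  const schema = buildSchema(`
    type Query {
      x: Int!
      y: Y
    }
    type Y {
      slowChild: Int
    }
  `);

  const rootValue = {
    x: () => Promise.reject(new Error('boom')),
    y: () => ({
      slowChild: () =>
        new Promise<number>((resolve) =>
          setTimeout(() => {
            // This still runs even though the operation's result has already settled.
            console.log('slowChild resolver finished after the response was finalized');
            resolve(42);
          }, 100),
        ),
    }),
  };

  graphql({ schema, source: '{ x y { slowChild } }', rootValue }).then((result) => {
    // Logs "null" (plus a "boom" error) well before slowChild's timer fires.
    console.log(JSON.stringify(result.data));
  });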

This is a bit of a funky situation. I'm not sure that there's an easy way without hacking into the graphql-js executor code to actually wait until execution even of deep sub-tree stuff is done before finalizing the trace. But it also feels like traces/field execution counts/etc should in fact care about fields that are executed even if their output ends up not making it into the result! So I'm at least slightly hesitant about entirely ignoring reporting these fields. But the current logged error is not helpful, for sure.

@chrskrchr

FYI, your package-lock file tries to install all packages from some sort of private "artifactory".

Doh! The classic "works on my machine". Thanks for fixing that and making the repro deterministic.

The previous non-determinism was left over from when I was trying to fuzz test my way into the unhandled promise rejection scenario. It's easy enough to reproduce the willResolveField called after stopTiming! error, and I was trying all sorts of combinations of random delays, errors, returning null for non-nullable fields, etc. to see if I could trigger an unhandled promise rejection. I have yet to be successful in repro'ing the unhandled rejection. :(

It sounds like you have a firm grasp on what's triggering the internal willResolveField called after stopTiming! error. Any idea why this error generally appears to be caught and swallowed by graphql-js, but not always? It's this edge case (and the subsequent server restarts) that's keeping us up at night.

@glasser
Member

glasser commented Mar 31, 2022

Hmm, I don't think I saw the server restarts in my reproduction. Let me go back to your original reproduction (modulo the package-lock) and see if I can see that.

@glasser
Member

glasser commented Mar 31, 2022

Or hmm, I guess I wouldn't see the server restarts in your reproduction because there's no usage reporting plugin. I'd assume that the equivalent of the server restarts would be triggered if you add a throw right after the orphanedRequestPlugin logging, but I don't see that. Do you have a reproduction of the server restarts?

@chrskrchr

Or hmm, I guess I wouldn't see the server restarts in your reproduction because there's no usage reporting plugin.

I just realized I was seeing the usage reporting plugin error thrown when reproducing the issue locally, but only because I have an APOLLO_KEY env var that was triggering ApolloServer to register that plugin during startup. I've just pushed a commit to the repro repo that updates the ApolloServerPluginOrphanedRequestLogger plugin to throw the error after logging per your comment above. This should more accurately reflect what the usage reporting plugin is doing.
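
For reference, here is roughly what that plugin looks like, as a minimal sketch (an illustrative approximation of the ApolloServerPluginOrphanedRequestLogger in the repro repo, not a verbatim copy): it records when the response has been sent, then logs and throws if a field resolver starts afterwards, mimicking what the usage reporting plugin's trace tree builder does.

  import { ApolloServerPlugin } from 'apollo-server-plugin-base';

  export const orphanedRequestLoggerPlugin: ApolloServerPlugin = {
    async requestDidStart() {
      let responseSent = false;
      return {
        async willSendResponse() {
          responseSent = true;
        },
        async executionDidStart() {
          return {
            willResolveField({ info }) {
              if (responseSent) {
                // A field resolver started after willSendResponse, i.e. an "orphaned" request.
                console.error(
                  `orphaned field resolution: ${info.parentType.name}.${info.fieldName}`,
                );
                throw new Error('willResolveField called after willSendResponse!');
              }
            },
          };
        },
      };
    },
  };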

Do you have a reproduction of the server restarts?

I also just realized the repro repo was missing another key component: a process.on('unhandledRejection') handler that exits the process on unhandled promise rejections. We have a handler like this in our Node 14 apps to be forwards compatible with the new behavior in Node 16+ that automatically exits the process under similar conditions. I've just pushed a commit with an unhandledRejection handler to make the behavior consistent regardless of whether you're running Node 16+ or an older version.
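
That handler is only a few lines; a minimal sketch of the idea (the exact logging is ours):

  // Exit on unhandled promise rejections, matching the default behavior of newer Node
  // versions, so the failure mode is the same on Node 14 and Node 16+.
  process.on('unhandledRejection', (reason) => {
    console.error('unhandledRejection:', reason);
    process.exit(1);
  });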

Even still, we haven't been able to reliably reproduce the server restarts. We're seeing the willResolveField called after stopTiming! error logged thousands of times a day in our production env, but only a small handful of those errors (fewer than 10 per day) were triggering an unhandled promise rejection and server restart. And that number has dropped to 0 since we added a wrapper around the usage reporting plugin that catches and swallows these errors entirely.

I'm going to spend some time today re-visiting my fuzz testing approach to see if we can accidentally back our way into the condition that's resulting in an unhandled promise rejection.

@chrskrchr

chrskrchr commented Apr 8, 2022

@loftykhanna @codehimanshu @grrowl @awinograd - I'm curious, were any of you by chance using the @opentelemetry/instrumentation-graphql plugin when you were seeing these unhandledRejection errors in your services?

https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/node/opentelemetry-instrumentation-graphql

(I'm guessing the answer is "no" given that this issue appears to pre-date the creation of that plugin, but thought it was worth asking)

@glasser
Member

glasser commented May 3, 2022

@chrskrchr Hmm, seems like this might be responsible for an issue like this turning into an unhandledRejection. graphql/graphql-js#3529
That graphql-js PR won't resolve the fact that we log the weird thing (or the fact that execution can just keep running indefinitely after the response is sent) but should resolve the unhandled rejections at least. Currently blocked on CLA...

@chrskrchr

@glasser - great find! That's exactly the type of issue I went hunting for when we first started seeing this back in March (thinking that surely other folks would have run into this), but I wasn't able to find anything nor could I back my way into the correct order of events to reproduce it. But there it is, test case and all! I'll definitely keep an eye on that PR and take it for a spin when it's merged.

That graphql-js PR won't resolve the fact that we log the weird thing (or the fact that execution can just keep running indefinitely after the response is sent) but should resolve the unhandled rejections at least.

I can live with these other things as long as our process no longer exits unexpectedly. 😄

@glasser
Member

glasser commented May 6, 2022

I tried to write a test to validate that my fix (removing the internal error) works. I couldn't actually manage to reproduce the original error being visible — I think it ends up being (appropriately) swallowed since it's treated like an error thrown by a field which has already been ruled to not be part of the returned data. The reproduction I'd done before (based on yours) just involved explicitly logging when the issue occurred. And presumably the reason that you were able to observe it was graphql/graphql-js#3529. But I'll fix it anyway.

glasser added a commit that referenced this issue May 6, 2022
The comment explains this in detail. Basically, this "shouldn't happen"
error actually could happen. In theory, the times it could happen are
the exact times that the error itself would be swallowed rather than
becoming visible... but a graphql-js bug meant that sometimes they would
become visible anyway.

Fixes #4472.
@chrskrchr

chrskrchr commented May 6, 2022

@glasser - I was able to use the information provided in graphql/graphql-js#3529's unit test to successfully reproduce the case where a thrown willResolveField called after stopTiming! error results in an unhandled rejection that causes the process to exit. Those repro steps have been pushed to this repo:

https://github.com/chrskrchr/apollo-server-race-condition

NOTE: there are some additional fields in the schema that aren't required to reproduce the error, but I was testing a bunch of different error scenarios and decided to leave all those fields in there for now. Check out the request in curl.sh to see what's actually required to reproduce the issue.

I need to focus on some other work items today, but as soon as possible I'll test the proposed graphql-js fix in the context of our repro repo to verify it addresses the unhandledRejection we're seeing.

FWIW - I think your changes in #6398 are also worthwhile, but I'll sleep much better knowing that the root cause for our process exits has been addressed and isn't just waiting to bite us again. 😄

@chrskrchr

I'll test the proposed graphql-js fix in the context of our repro repo to verify it addresses the unhandledRejection we're seeing.

Good news! I patched a local copy of graphql-js and verified that it does address the unhandledRejection issue.

@glasser
Member

glasser commented May 6, 2022

Great to hear. I hope that the CLA issue is resolved soon; in the meantime perhaps https://www.npmjs.com/package/patch-package is your friend.

glasser added a commit that referenced this issue Jun 4, 2022
The comment explains this in detail. Basically, this "shouldn't happen"
error actually could happen. In theory, the times it could happen are
the exact times that the error itself would be swallowed rather than
becoming visible... but a graphql-js bug meant that sometimes they would
become visible anyway.

Fixes #4472.
glasser added a commit that referenced this issue Jun 4, 2022
…te" (#6398)

The comment explains this in detail. Basically, this "shouldn't happen"
error actually could happen. In theory, the times it could happen are
the exact times that the error itself would be swallowed rather than
becoming visible... but a graphql-js bug meant that sometimes they would
become visible anyway.

Fixes #4472.
@glasser
Member

glasser commented Jun 4, 2022

Fixed in v3.8.2.

@chrskrchr

Fixed in v3.8.2.

We pushed this change out to production earlier today and saw the number of tracing plugin errors drop to zero:

[Screenshot: Screen Shot 2022-06-07 at 4 27 52 PM, showing tracing plugin errors dropping to zero after the deploy]

Thanks for the fix, @glasser!

(and FWIW, we're still going to pursue that graphql-js fix, but this apollo-server-core fix is all we needed in order to remove our bandaid wrapper around the tracing plugin)
