[apollo-engine-reporting] : willResolveField called after stopTiming! #4472
Facing the same error. Attaching complete stack trace:
This caused an alert for us this morning as every node spewed stacktraces into STDERR. This follows apollo-engine 502s doing the same thing yesterday. Is this a client-facing error or just with engine reporting? Can we customise how these errors are logged by apollo-server?
This error started showing up for me after upgrading from apollo-server 3.4.1 to 3.6.1.
I wonder if this is related to #5372.
My team also started seeing these unhandled promise rejections in our Fastify + Apollo Server service.
For now, we've suppressed the unhandled rejections by building a wrapper plugin around the reporting plugin (a sketch of the idea is shown below).
We haven't seen an unhandled rejection since deploying this wrapper plugin to our production environment. However, this isn't an ideal solution. I've put together a sample repo that shows how to reproduce the condition described above: https://github.com/chrskrchr/apollo-server-race-condition. Although I've been able to reliably reproduce the condition, the repro isn't fully deterministic.
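For reference, here is a minimal sketch of what such a wrapper might look like. This is an illustration against the Apollo Server 3 plugin API, not the poster's actual code, and it assumes the plugin being wrapped is ApolloServerPluginUsageReporting:

```typescript
import { ApolloServerPluginUsageReporting } from 'apollo-server-core';
import type { ApolloServerPlugin } from 'apollo-server-plugin-base';

// Hypothetical wrapper: delegate everything to the real usage reporting
// plugin, but catch anything thrown from willResolveField so it can't
// escape as an unhandled rejection.
export function guardedUsageReportingPlugin(): ApolloServerPlugin {
  const inner = ApolloServerPluginUsageReporting();
  return {
    async requestDidStart(requestContext) {
      const listener = await inner.requestDidStart?.(requestContext);
      if (!listener) return;
      return {
        ...listener,
        async executionDidStart(executionContext) {
          const execListener = await listener.executionDidStart?.(executionContext);
          if (!execListener) return;
          return {
            ...execListener,
            willResolveField(fieldContext) {
              try {
                return execListener.willResolveField?.(fieldContext);
              } catch (error) {
                // e.g. "willResolveField called after stopTiming!"
                console.warn('Suppressed usage reporting error:', error);
                return undefined;
              }
            },
          };
        },
      };
    },
  };
}
```

If a wrapper like this is installed via the server's plugins array, you would likely also want ApolloServerPluginUsageReportingDisabled() so the default usage reporting plugin isn't installed alongside it.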
@chrskrchr Thanks for the reproduction! I'm looking into this now. FYI, your package-lock file tries to install all packages from some sort of private "artifactory". I'm just deleting the package-lock and reinstalling from scratch, so hopefully I'll get close enough versions to reproduce. In the future when making otherwise excellent reproductions, watch out for that :)
Here's a PR against your repo that makes the reproduction more reproducible: chrskrchr/apollo-server-race-condition#1. I think this is related to how graphql-js execution handles the bubbling-up of errors/nulls when there is other stuff running in parallel below it. I want to investigate the graphql-js execution code a bit more to see if what's going on is obvious. This does raise some questions about the "right" way to handle this case.
OK, it looks pretty clear that graphql-js doesn't proactively short-circuit execution that is guaranteed to end up ignored by null-bubbling, as long as that execution actually started (because Promises are involved). I believe what's going on here is roughly the shape sketched below: a non-nullable field errors and nulls out its parent, the response and trace are finalized, but a sibling async field (and its sub-selections) is still resolving in the background.
This is a bit of a funky situation. I'm not sure there's an easy way, without hacking into the graphql-js executor code, to actually wait until execution even of deep sub-tree stuff is done before finalizing the trace. But it also feels like traces/field execution counts/etc. should in fact care about fields that are executed even if their output ends up not making it into the result! So I'm at least slightly hesitant about entirely ignoring reporting of these fields. But the current logged error is not helpful, for sure.
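To make the scenario concrete, here's a minimal self-contained sketch of that shape, using plain graphql-js. It is an illustration, not code from the thread or the linked repro:

```typescript
import { graphql, buildSchema } from 'graphql';

// A non-nullable field that rejects nulls out `parent` and lets the overall
// result be produced while a slow sibling field is still resolving.
const schema = buildSchema(`
  type Query { parent: Parent }
  type Parent {
    fails: String!
    slow: Child
  }
  type Child { leaf: String }
`);

const rootValue = {
  parent: () => ({
    fails: async () => {
      throw new Error('boom'); // non-nullable, so `parent` bubbles up to null
    },
    slow: async () => {
      // Still in flight when `fails` rejects; graphql-js does not cancel it.
      await new Promise((resolve) => setTimeout(resolve, 100));
      return {
        leaf: () => {
          // Fires even though `parent` is already null in the response.
          // This is the point at which a tracing plugin would see
          // willResolveField after it has finalized the trace.
          console.log('leaf resolved after the result was finalized');
          return 'ignored';
        },
      };
    },
  }),
};

graphql({ schema, rootValue, source: '{ parent { fails slow { leaf } } }' }).then(
  (result) => console.log(JSON.stringify(result)), // { data: { parent: null }, errors: [...] }
);
```

Running this prints the completed result (with parent: null) before the leaf resolver logs, which is exactly the window in which the trace has been finalized but fields are still being resolved.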
Doh! The classic "works on my machine". Thanks for fixing that and making the repro deterministic. The previous non-determinism was left over from when I was trying to fuzz test my way into the unhandled promise rejection scenario. It's easy enough to reproduce the error itself, and it sounds like you have a firm grasp on what's triggering the internal error.
Hmm, I don't think I saw the server restarts in my reproduction. Let me go back to your original reproduction (modulo the package-lock) and see if I can reproduce that.
Or hmm, I guess I wouldn't see the server restarts in your reproduction because there's no usage reporting plugin. I'd assume the equivalent of the server restarts would be triggered if you add one.
I just realized I was seeing the usage reporting plugin error thrown when reproducing the issue locally, but only because I have an
I also just realized the repro repo was missing another key component - a usage reporting plugin. Even still, we haven't been able to reliably reproduce the server restarts; we're seeing the error itself but not the restarts. I'm going to spend some time today re-visiting my fuzz testing approach to see if we can accidentally back our way into the condition that's resulting in an unhandled promise rejection.
@loftykhanna @codehimanshu @grrowl @awinograd - I'm curious, were any of you by chance using that plugin? (I'm guessing the answer is "no" given that this issue appears to pre-date the creation of that plugin, but thought it was worth asking.)
@chrskrchr Hmm, seems like this might be responsible for an issue like this turning into an unhandledRejection: graphql/graphql-js#3529
@glasser - great find! That's exactly the type of issue I went hunting for when we first started seeing this back in March (thinking that surely other folks would have run into this), but I wasn't able to find anything nor could I back my way into the correct order of events to reproduce it. But there it is, test case and all! I'll definitely keep an eye on that PR and take it for a spin when it's merged.
I can live with these other things as long as our process no longer exits unexpectedly. 😄
I tried to write a test to validate that my fix (removing the internal error) works. I couldn't actually manage to reproduce the original error being visible — I think it ends up being (appropriately) swallowed since it's treated like an error thrown by a field which has already been ruled to not be part of the returned data. The reproduction I'd done before (based on yours) just involved explicitly logging when the issue occurred. And presumably the reason that you were able to observe it was graphql/graphql-js#3529. But I'll fix it anyway.
The comment explains this in detail. Basically, this "shouldn't happen" error actually could happen. In theory, the times it could happen are the exact times that the error itself would be swallowed rather than becoming visible... but a graphql-js bug meant that sometimes they would become visible anyway. Fixes #4472.
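For readers skimming the thread, here is a conceptual sketch of what removing the internal error amounts to. This is an illustration only, not the actual apollo-server source:

```typescript
// Conceptual sketch (not apollo-server's real code): once the trace has been
// finalized, a late willResolveField call records nothing and returns a no-op
// end hook instead of throwing
// "willResolveField called after stopTiming!".
type FieldEndHook = (error?: Error) => void;

class TraceBuilderSketch {
  private stopped = false;

  stopTiming(): void {
    this.stopped = true;
  }

  willResolveField(): FieldEndHook {
    if (this.stopped) {
      // Previously this threw; the field belongs to an abandoned subtree,
      // so it is simply not recorded.
      return () => {};
    }
    // ...normally: start timing the field and close it out in the end hook.
    return () => {};
  }
}
```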
@glasser - I was able to use the information provided in graphql/graphql-js#3529's unit test to successfully reproduce the case where a thrown error results in an unhandled promise rejection, and I've updated the repro repo accordingly: https://github.com/chrskrchr/apollo-server-race-condition
NOTE: there are some additional fields in the schema that aren't required to reproduce the error, but I was testing a bunch of different error scenarios and decided to leave all those fields in there for now. Check out the request in the repro repo. I need to focus on some other work items today, but as soon as possible I'll test the proposed fix against this scenario.
FWIW - I think your changes in #6398 are also worthwhile, but I'll sleep much better knowing that the root cause for our process exits has been addressed and isn't just waiting to bite us again. 😄
Good news! I patched a local copy of graphql-js with the fix from that PR, and it resolved the unhandled rejection for us.
Great to hear. I hope that the CLA issue is resolved soon; in the meantime perhaps https://www.npmjs.com/package/patch-package is your friend.
Fixed in v3.8.2.
We pushed this change out to production earlier today and saw the number of tracing plugin errors drop to zero. Thanks for the fix, @glasser! (And FWIW, we're still going to pursue that graphql-js fix as well.)
We are receiving this error:
Error: [internal apollo-server error] willResolveField called after stopTiming!
This is leading to unhandled rejections and frequent restarts of our node server. Any insight into how to debug this, or into the situations in which it usually happens?
"apollo-server": "^2.14.3",
"apollo-engine-reporting": "2.1.0"
How do we pinpoint which resolver is causing this issue, given that the entire stack trace is from graphql and apollo-server?
Thanks
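One generic way to keep the process alive long enough to see what is rejecting (not an apollo-server API, and assuming a Node version where unhandled rejections terminate the process by default) is a temporary unhandledRejection handler:

```typescript
// Temporary debugging aid: log unhandled rejections instead of letting the
// process exit while investigating which request or resolver triggers them.
// Remove once the underlying cause is fixed.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', reason);
});
```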