Production-only bug causing stale query output: internal results cache contains stale data that is erroneously merged into readFromStore output #9735

Closed
akallem opened this issue May 17, 2022 · 4 comments · Fixed by #9742

Comments

@akallem

akallem commented May 17, 2022

Intended outcome:
Apollo Client should return the same data in production as in development.

Actual outcome:
Under a specific configuration of InMemoryCache and usage of GraphQL fragments (details below), Apollo Client returns correct data in development but incorrect (stale) data in production.

Quick summary of the reason:
Objects cached in the internal results cache (the calls to wrap here and here) are erroneously modified after the fact to include fields that aren't in the cache key's selection set. The erroneous modification happens because a single DeepMerger instance is shared across the whole read. At an intermediate level of diffQueryAgainstStore's recursion, a deep copy produced by DeepMerger is stored in the results cache, but that object can subsequently be modified later in the recursion because it is still in DeepMerger's pastCopies set.
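Here's a minimal, self-contained sketch of the hazard (a simplified stand-in for DeepMerger, not the actual implementation; the class name is mine):

```ts
// Simplified stand-in for DeepMerger (not the actual Apollo Client code):
// a merger that remembers the copies it has made and assumes they are
// still private, and therefore safe to mutate in place.
class SketchMerger {
  private pastCopies = new Set<object>();

  private shallowCopyForMerge(value: Record<string, any>): Record<string, any> {
    // If we produced this object earlier, treat it as safe to mutate.
    if (this.pastCopies.has(value)) return value;
    const copy = { ...value };
    this.pastCopies.add(copy);
    return copy;
  }

  merge(target: Record<string, any>, source: Record<string, any>): Record<string, any> {
    const copy = this.shallowCopyForMerge(target);
    for (const key of Object.keys(source)) {
      copy[key] = source[key];
    }
    return copy;
  }
}

const merger = new SketchMerger();

// An intermediate result produced (and copied) during the read...
const intermediate = merger.merge({ id: 1 }, { name: "Task 400" });

// ...which the results cache now holds, keyed by a selection set that
// does NOT include `status`.
const cachedEntry = intermediate;

// Later in the same read, the merger sees `intermediate` again. It is in
// pastCopies (and, in production, not frozen), so it is mutated in place
// instead of being copied:
merger.merge(intermediate, { status: "DONE" });

console.log(cachedEntry);
// { id: 1, name: "Task 400", status: "DONE" }
// The cached entry now carries a field that was never in its selection set.
```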

How to reproduce the issue:
Here's a minimal example reproducing the issue.

Steps to reproduce the bug:

  • Run the app in development mode (npm run start).
  • Use the UI to delete tasks or swap them between lists. Observe that everything works as expected.
  • Run the app in production mode (npm run build then npm run serve).
  • Again use the UI to delete tasks or swap them between lists. Observe that it's broken. For example (refresh the page before each of these examples):
    • Delete Task 400. It should disappear, but it doesn't.
    • Delete Task 300. It disappears, but now there are two Task 400s.
    • Swap Task 400. It appears in the other list, but a stale version remains in the old position.

Versions
System:
OS: Linux 5.4 Ubuntu 18.04.6 LTS (Bionic Beaver)
Binaries:
Node: 16.15.0 - ~/.nvm/versions/node/v16.15.0/bin/node
npm: 8.5.5 - ~/.nvm/versions/node/v16.15.0/bin/npm
Browsers:
Chrome: 100.0.4896.75
Firefox: 100.0
npmPackages:
@apollo/client: ^3.6.0 => 3.6.0

Additional Info
Here are more details from my investigation to help you with the resolution.

The bug first appeared in Apollo Client v3.5.0. (v3.4.17 was fine.) In particular, the issue was introduced in this commit. Interestingly, this is a merge commit. Neither parent has the bug! The reason is that the bug only occurs if, among other things, canonizeResults is false. One parent to that merge includes commits that changed the default to false. The other parent includes this commit which modified readFromStore's merging logic. Independently they're fine, but together, they introduced the bug.

Why does this bug occur only in production? In development, Apollo Client freezes returned objects so that the app fails loudly if it tries to modify them. In production, this behavior is disabled. It sounds like that couldn't possibly cause an issue, but it does! The reason is that DeepMerger behaves differently if an object is frozen. The behavior was introduced in this commit. In development (object is frozen) it returns a fresh copy, while in production (object is not frozen) it returns what it thinks is a safe-to-modify past copy. It is indeed a past copy, but it is not safe to modify because it has been cached in the results cache.
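The pattern boils down to something like this (a simplified sketch, not the exact Apollo Client source):

```ts
// Simplified sketch of the check that makes development and production
// diverge (not the exact Apollo Client source).
function shallowCopyForMerge(
  value: Record<string, any>,
  pastCopies: Set<object>,
): Record<string, any> {
  if (!Object.isFrozen(value) && pastCopies.has(value)) {
    // Production path: results are never frozen, so a previously produced
    // copy is reused and mutated in place, even if the results cache
    // already holds a reference to it.
    return value;
  }
  // Development path: results are frozen, so a fresh, safe-to-modify copy
  // is made and remembered for later reuse.
  const copy = { ...value };
  pastCopies.add(copy);
  return copy;
}
```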

One of the easier ways to see that the results cache is getting corrupted is to add a console log here that logs both a snapshot of the object (e.g., via JSON.parse(JSON.stringify(returnValue))) and the object itself. You'll see fields appear on the live object that are in neither the snapshot nor the selection set.
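For example, a small helper along these lines (the name is hypothetical, just for illustration) makes the divergence visible:

```ts
// Hypothetical debugging helper (the name is mine, just for illustration).
// Log a point-in-time snapshot and the live reference at the moment the
// object is stored in the results cache; if the live object later shows
// fields missing from the snapshot, the cached entry was mutated after
// the fact.
function logResultsCacheEntry(returnValue: unknown): void {
  const snapshot = JSON.parse(JSON.stringify(returnValue));
  console.log("snapshot at cache time:", snapshot);
  console.log("live cached object:", returnValue);
}
```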

Once the results cache returns an object that includes fields that aren't in the selection set, all bets are off. Whether it actually results in erroneous data being returned by Apollo Client to the caller depends on how the GraphQL document was defined.

The bug only occurs when canonizeResults is false (the default since v3.4.14), resultCaching is true (the default), and the GraphQL document includes fragments in a particular way that causes the erroneously modified results cache entries to actually surface when reading back the data.
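For reference, that combination corresponds to InMemoryCache options like the following (both values shown are already the defaults in the affected versions, so no explicit configuration is needed to hit the bug):

```ts
import { InMemoryCache } from "@apollo/client";

// The affected combination of cache options. Both values shown are already
// the defaults in the affected versions; this just makes the relevant
// switches visible.
const cache = new InMemoryCache({
  canonizeResults: false, // default after the change described above
  resultCaching: true, // default
});
```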

Here's my summary of what I think happened:

  • canonizeResults was defaulted to false on one branch, and the new readFromStore behavior was introduced on a separate branch in this commit
  • Those branches were merged, introducing the bug. (This wasn't a released version of the library, of course.)
  • At this point, development apps would fail because DeepMerger was internally trying to modify frozen objects! That could have been a clue that the previous changes introduced a regression in production.
  • Unfortunately, rather than stopping DeepMerger from modifying these objects in both development and production, a "fix" was introduced that stopped the modification only in development (when the objects are frozen).
@benjamn
Member

benjamn commented May 17, 2022

Thanks for the detailed and deep analysis @akallem! Your diagnosis makes sense to me, and you have no doubt saved me a bunch of time/work here.

It definitely was/is my intention that no observable behavior should ever depend on the actual frozenness of the objects, to prevent any differences of behavior in production (where we'd like to avoid actually freezing objects, since that makes using them slower in various ways).

Besides hiding this bug (presumably), did you ever see benefits from canonization? We had to change the default canonizeResults value because it led to confusing/unexpected sharing of equivalent subtrees (like all empty array [] objects), but there may still be something worthwhile in that set of ideas.

I'm sure we can find a way to scope the DeepMerger usage to keep it from thinking previously generated result objects (from the same execution) are fair game for mutation.

@benjamn benjamn added this to the v3.6.x patch releases milestone May 17, 2022
@akallem
Author

akallem commented May 17, 2022

No problem @benjamn, I'm glad it's helpful. Thanks for the quick acknowledgement and response.

I did indeed reenable canonizeResults in my app as a quick way to avoid encountering this bug. I haven't looked into the effects of canonization closely enough to know whether I've seen any benefits. That said, for a while I had been using an early 3.4.x version where canonization was enabled by default, and I don't remember encountering any confusion or other ill effects, which is why I was comfortable reenabling it as a mitigation.
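For anyone else hitting this before a patch lands, re-enabling canonization is a one-line cache option (a minimal sketch, assuming a standard InMemoryCache setup):

```ts
import { InMemoryCache } from "@apollo/client";

// Workaround until a patched release is available: re-enable canonization,
// which avoids the code path that lets cached results be mutated.
const cache = new InMemoryCache({
  canonizeResults: true,
});
```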

In the meantime, as you think about the patch for this, it might be helpful to backport it to 3.5.x as well (if you don't already plan to), since I (and others) might not be ready or willing to move to 3.6.x just yet.

benjamn added a commit that referenced this issue May 19, 2022
Issue #9735 appears to be due to a bug I introduced in my commit
756ab87 in PR #8734.

Premature optimization has a bad reputation already, but you can add
this PR to the mountain of evidence in its disfavor.
benjamn added a commit that referenced this issue May 19, 2022
If you revert the previous commit and run `npm test`, you'll see all the
tests this dynamic Object.isFrozen check has been silently protecting
from failing, but only (and this is the important part) in development.

Since we only actually freeze objects with Object.freeze in development,
this Object.isFrozen check does not help in production, so an object
that would have been frozen in development gets reused as a mutable
copy, potentially acquiring properties it should not acquire (a bug
fixed by the previous commit, first reported in issue #9735).
@benjamn
Member

benjamn commented May 23, 2022

@akallem Thanks to your reproduction, I am confident this will be fixed in @apollo/client@3.6.5 (just published to npm)!

@akallem
Author

akallem commented May 23, 2022

@benjamn Great! And thanks for the quick resolution. Is it possible for the fix to be backported to v3.5.x as well?

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2023