Production-only bug causing stale query output: internal results cache contains stale data that is erroneously merged into readFromStore output #9735

Closed
akallem opened this issue May 17, 2022 · 4 comments · Fixed by #9742

Comments

@akallem

akallem commented May 17, 2022

Intended outcome:
Apollo Client should return the same data in production as in development.

Actual outcome:
Under a specific configuration of InMemoryCache and usage of GraphQL fragments (details below), Apollo Client returns correct data in development but incorrect (stale) data in production.

Quick summary of the reason:
Objects cached in the internal results cache (the calls to wrap here and here) are erroneously modified after the fact to include fields that aren't in the cache key's selection set. The erroneous modification happens because a single DeepMerger instance is shared across the whole read. At an intermediate level of diffQueryAgainstStore's recursion, a deep copy produced by DeepMerger is stored in the results cache, but that object can subsequently be modified later in the recursion because it is still in DeepMerger's pastCopies set.
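Here's a minimal, self-contained sketch of the hazard (a simplified stand-in for DeepMerger, not the actual implementation; the class name is mine):

```ts
// Simplified stand-in for DeepMerger (not the actual Apollo Client code):
// a merger that remembers the copies it has made and assumes they are
// still private, and therefore safe to mutate in place.
class SketchMerger {
  private pastCopies = new Set<object>();

  private shallowCopyForMerge(value: Record<string, any>): Record<string, any> {
    // If we produced this object earlier, treat it as safe to mutate.
    if (this.pastCopies.has(value)) return value;
    const copy = { ...value };
    this.pastCopies.add(copy);
    return copy;
  }

  merge(target: Record<string, any>, source: Record<string, any>): Record<string, any> {
    const copy = this.shallowCopyForMerge(target);
    for (const key of Object.keys(source)) {
      copy[key] = source[key];
    }
    return copy;
  }
}

const merger = new SketchMerger();

// An intermediate result produced (and copied) during the read...
const intermediate = merger.merge({ id: 1 }, { name: "Task 400" });

// ...which the results cache now holds, keyed by a selection set that
// does NOT include `status`.
const cachedEntry = intermediate;

// Later in the same read, the merger sees `intermediate` again. It is in
// pastCopies (and, in production, not frozen), so it is mutated in place
// instead of being copied:
merger.merge(intermediate, { status: "DONE" });

console.log(cachedEntry);
// { id: 1, name: "Task 400", status: "DONE" }
// The cached entry now carries a field that was never in its selection set.
```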

How to reproduce the issue:
Here's a minimal example reproducing the issue.

Steps to reproduce the bug:

  • Run the app in development mode (npm run start).
  • Use the UI to delete tasks or swap them between lists. Observe that everything works as expected.
  • Run the app in production mode (npm run build then npm run serve).
  • Again use the UI to delete tasks or swap them between lists. Observe that it's broken. For example (refresh the page before each of these examples):
    • Delete Task 400. It should disappear, but it doesn't.
    • Delete Task 300. It disappears, but now there are two Task 400s.
    • Swap Task 400. It appears in the other list, but a stale version remains in the old position.

Versions
System:
OS: Linux 5.4 Ubuntu 18.04.6 LTS (Bionic Beaver)
Binaries:
Node: 16.15.0 - ~/.nvm/versions/node/v16.15.0/bin/node
npm: 8.5.5 - ~/.nvm/versions/node/v16.15.0/bin/npm
Browsers:
Chrome: 100.0.4896.75
Firefox: 100.0
npmPackages:
@apollo/client: ^3.6.0 => 3.6.0

Additional Info
Here are more details from my investigation to help you with the resolution.

The bug first appeared in Apollo Client v3.5.0. (v3.4.17 was fine.) In particular, the issue was introduced in this commit. Interestingly, this is a merge commit. Neither parent has the bug! The reason is that the bug only occurs if, among other things, canonizeResults is false. One parent to that merge includes commits that changed the default to false. The other parent includes this commit which modified readFromStore's merging logic. Independently they're fine, but together, they introduced the bug.

Why does this bug occur only in production? In development, Apollo Client freezes returned objects so that the app fails loudly if it tries to modify them. In production, this behavior is disabled. It sounds like that couldn't possibly cause an issue, but it does! The reason is that DeepMerger behaves differently if an object is frozen. The behavior was introduced in this commit. In development (object is frozen) it returns a fresh copy, while in production (object is not frozen) it returns what it thinks is a safe-to-modify past copy. It is indeed a past copy, but it is not safe to modify because it has been cached in the results cache.
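The pattern boils down to something like this (a simplified sketch, not the exact Apollo Client source):

```ts
// Simplified sketch of the check that makes development and production
// diverge (not the exact Apollo Client source).
function shallowCopyForMerge(
  value: Record<string, any>,
  pastCopies: Set<object>,
): Record<string, any> {
  if (!Object.isFrozen(value) && pastCopies.has(value)) {
    // Production path: results are never frozen, so a previously produced
    // copy is reused and mutated in place, even if the results cache
    // already holds a reference to it.
    return value;
  }
  // Development path: results are frozen, so a fresh, safe-to-modify copy
  // is made and remembered for later reuse.
  const copy = { ...value };
  pastCopies.add(copy);
  return copy;
}
```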

One of the easier ways to see that the results cache is getting corrupted is to add a console log here that logs both a snapshot of the object (e.g., via JSON.parse(JSON.stringify(returnValue))) and the object itself. You'll see fields appear on the live object that are in neither the snapshot nor the selection set.
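For example, a small helper along these lines (the name is hypothetical, just for illustration) makes the divergence visible:

```ts
// Hypothetical debugging helper (the name is mine, just for illustration).
// Log a point-in-time snapshot and the live reference at the moment the
// object is stored in the results cache; if the live object later shows
// fields missing from the snapshot, the cached entry was mutated after
// the fact.
function logResultsCacheEntry(returnValue: unknown): void {
  const snapshot = JSON.parse(JSON.stringify(returnValue));
  console.log("snapshot at cache time:", snapshot);
  console.log("live cached object:", returnValue);
}
```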

Once the results cache returns an object that includes fields that aren't in the selection set, all bets are off. Whether it actually results in erroneous data being returned by Apollo Client to the caller depends on how the GraphQL document was defined.

The bug only occurs when canonizeResults is false (the default since v3.4.14), resultCaching is true (the default), and the GraphQL document includes fragments in a particular way that causes the erroneously modified results cache entries to actually surface when reading back the data.
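For reference, that combination corresponds to InMemoryCache options like the following (both values shown are already the defaults in the affected versions, so no explicit configuration is needed to hit the bug):

```ts
import { InMemoryCache } from "@apollo/client";

// The affected combination of cache options. Both values shown are already
// the defaults in the affected versions; this just makes the relevant
// switches visible.
const cache = new InMemoryCache({
  canonizeResults: false, // default after the change described above
  resultCaching: true, // default
});
```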

Here's my summary of what I think happened:

  • canonizeResults was defaulted to false on one branch, and the new readFromStore behavior was introduced on a separate branch in this commit
  • Those branches were merged, introducing the bug. (This wasn't a released version of the library, of course.)
  • At this point, development apps would fail because DeepMerger was internally trying to modify frozen objects! That could have been a clue that the previous changes introduced a regression in production.
  • Unfortunately, rather than stopping DeepMerger from modifying these objects in both development and production, a "fix" was introduced that stopped the modification only in development (when the objects are frozen).
@benjamn
Member

benjamn commented May 17, 2022

Thanks for the detailed and deep analysis @akallem! Your diagnosis makes sense to me, and you have no doubt saved me a bunch of time/work here.

It definitely was/is my intention that no observable behavior should ever depend on the actual frozenness of the objects, to prevent any differences of behavior in production (where we'd like to avoid actually freezing objects, since that makes using them slower in various ways).

Besides hiding this bug (presumably), did you ever see benefits from canonization? We had to change the default canonizeResults value because it led to confusing/unexpected sharing of equivalent subtrees (like all empty array [] objects), but there may still be something worthwhile in that set of ideas.

I'm sure we can find a way to scope the DeepMerger usage to keep it from thinking previously generated result objects (from the same execution) are fair game for mutation.

@benjamn benjamn added this to the v3.6.x patch releases milestone May 17, 2022
@akallem
Author

akallem commented May 17, 2022

No problem @benjamn, I'm glad it's helpful. Thanks for the quick acknowledgement and response.

I did indeed reenable canonizeResults in my app as a quick way to avoid encountering this bug. I haven't looked into the effects of canonization closely enough to know whether I've seen any benefits. That said, for a while I had been using an early 3.4.x version where canonization was enabled by default, and I don't remember encountering any confusion or other ill effects, which is why I was comfortable reenabling it as a mitigation.
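For anyone else hitting this before a patch lands, re-enabling canonization is a one-line cache option (a minimal sketch, assuming a standard InMemoryCache setup):

```ts
import { InMemoryCache } from "@apollo/client";

// Workaround until a patched release is available: re-enable canonization,
// which avoids the code path that lets cached results be mutated.
const cache = new InMemoryCache({
  canonizeResults: true,
});
```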

In the meantime, as you think about the patch for this, it might be helpful to backport it to 3.5.x as well (if you don't already plan to), since I (and others) might not be ready or willing to move to 3.6.x just yet.

benjamn added a commit that referenced this issue May 19, 2022
Issue #9735 appears to be due to a bug I introduced in my commit
756ab87 in PR #8734.

Premature optimization has a bad reputation already, but you can add
this PR to the mountain of evidence in its disfavor.
benjamn added a commit that referenced this issue May 19, 2022
If you revert the previous commit and run `npm test`, you'll see all the
tests this dynamic Object.isFrozen check has been silently protecting
from failing, but only (and this is the important part) in development.

Since we only actually freeze objects with Object.freeze in development,
this Object.isFrozen check does not help in production, so an object
that would have been frozen in development gets reused as a mutable
copy, potentially acquiring properties it should not acquire (a bug
fixed by the previous commit, first reported in issue #9735).
@benjamn
Member

benjamn commented May 23, 2022

@akallem Thanks to your reproduction, I am confident this will be fixed in @apollo/client@3.6.5 (just published to npm)!

@akallem
Author

akallem commented May 23, 2022

@benjamn Great! And thanks for the quick resolution. Is it possible for the fix to be backported to v3.5.x as well?

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2023