Discussion: different cache format #285

stubailo · 2016-06-20T18:41:37Z

Note that this isn't a proposal to actually change anything, just something I wrote up on a plane flight when I was thinking about my mental model for how the apollo cache works.

Idea about alternate cache format

This is based on thinking about how I would write a blog post about how caching works in Apollo client (AC). The main way it makes sense for me to talk about how the cache works, and my mental model for how I can understand what is and isn't cached, is through paths to data (inspired by falcor).

Current model: normalized objects

Let's say you have a query and result like this:

{
  user(id: 5) {
    firstName
    avatar {
      thumbnail
    }
  }
}

{
  user: {
    firstName: "Marie",
    avatar: {
      thumbnail: "http://asdasdfsdf.jpg",
    }
  }
}

Right now, AC understands this as three objects:

ROOT_QUERY:
  user: ROOT_QUERY.user

ROOT_QUERY.user:
  firstName: "Marie"
  avatar: ROOT_QUERY.user.avatar

ROOT_QUERY.user.avatar:
  thumbnail: "http://asdasdfsdf.jpg"

If one of the objects has an ID, then the paths are relative to the ID:

// alternate result with ID
{
  user: {
    id: 5,
    firstName: "Marie",
    avatar: {
      thumbnail: "http://asdasdfsdf.jpg",
    }
  }
}

Representation in cache:

ROOT_QUERY:
  user: 5

5:
  firstName: "Marie"
  avatar: 5.avatar

5.avatar:
  thumbnail: "http://asdasdfsdf.jpg"

This is somewhat nice because we maintain the identity of "objects" - these are things that have fields, and some of the fields are references and some are scalar values.
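To make the object model concrete, here is a minimal TypeScript sketch of that normalization. The names (`normalize`, `Store`, `Ref`) are hypothetical and this is not Apollo Client's actual code; it only assumes that an object with a scalar `id` field is keyed by that ID and everything else is keyed by its path from the root.

```typescript
// Hypothetical types for illustration; not Apollo Client's real data structures.
type Scalar = string | number | boolean | null;
interface ResultObject { [field: string]: Scalar | ResultObject }

// A reference to another store object, held as that object's key.
interface Ref { ref: string }
type StoreObject = { [field: string]: Scalar | Ref };
type Store = { [key: string]: StoreObject };

// Writes `obj` into `store` and returns the key it was stored under.
function normalize(obj: ResultObject, pathKey: string, store: Store): string {
  const id = obj["id"];
  const hasId = typeof id === "string" || typeof id === "number";
  // Assumption: use the object's own ID when present, otherwise its path.
  const key = hasId ? String(id) : pathKey;
  const record: StoreObject = store[key] || {};
  store[key] = record;
  for (const field of Object.keys(obj)) {
    const value = obj[field];
    if (value !== null && typeof value === "object") {
      // Nested objects become separate records; the parent keeps a reference.
      record[field] = { ref: normalize(value, `${key}.${field}`, store) };
    } else {
      record[field] = value;
    }
  }
  return key;
}

// Result without an ID: keys are paths from ROOT_QUERY.
const storeByPath: Store = {};
normalize(
  { user: { firstName: "Marie", avatar: { thumbnail: "http://asdasdfsdf.jpg" } } },
  "ROOT_QUERY",
  storeByPath
);

// Result with an ID: the user and its children are keyed relative to "5".
const store: Store = {};
normalize(
  { user: { id: 5, firstName: "Marie", avatar: { thumbnail: "http://asdasdfsdf.jpg" } } },
  "ROOT_QUERY",
  store
);
```

Unlike the listing above, this sketch also keeps the `id` field itself on the `5` record, which is a choice of the sketch rather than something stated in the write-up.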

Alternative model: paths

OK so what's the other model I'm proposing? It's to get rid of the idea of objects entirely, and store paths only instead. Let's look at the above in this light.

Let's look at our query from above again:

{
  user(id: 5) {
    firstName
    avatar {
      thumbnail
    }
  }
}

This query, which is a nested structure, can be rewritten as a series of paths, just by looking at the query:

user(id: 5).firstName
user(id: 5).avatar.thumbnail

This already gives us a bit of clarity: the query, which looks somewhat complex, is actually fetching only two scalar fields. Now, what if the cache just used these same paths to store the result?

user(id: 5).firstName: "Marie"
user(id: 5).avatar.thumbnail: "http://asdasdfsdf.jpg"

You can see right away that this is much easier to look at than the set of three objects at the top, because it completely removes the need for references or generated IDs - it just uses the same stuff you typed in the query. And what would this look like with IDs?

5.firstName: "Marie"
5.avatar.thumbnail: "http://asdasdfsdf.jpg"

Also quite clear. There is one question here, which is whether we should somehow remember that user 5 came from the user(id: 5) query - the current format remembers that but the new one doesn't yet. Perhaps we could add a second entry in the case of IDs, which maps paths to ID:

user(id: 5): 5
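The path format, including the extra path-to-ID entry, could be produced by something like the following TypeScript sketch. `flattenToPaths` and `PathCache` are hypothetical names, and the caller is assumed to supply the root field's path (including arguments), since the result JSON alone does not carry `(id: 5)`.

```typescript
// Hypothetical types for illustration only.
type Scalar = string | number | boolean | null;
interface ResultObject { [field: string]: Scalar | ResultObject }
type PathCache = { [path: string]: Scalar };

// Writes every scalar under `obj` into `cache`, keyed by its full path.
function flattenToPaths(obj: ResultObject, prefix: string, cache: PathCache): void {
  const id = obj["id"];
  let base = prefix;
  if (typeof id === "string" || typeof id === "number") {
    // Remember which path led to this ID, e.g. "user(id: 5)": 5,
    // then store the object's fields relative to the ID instead.
    cache[prefix] = id;
    base = String(id);
  }
  for (const field of Object.keys(obj)) {
    const value = obj[field];
    const path = `${base}.${field}`;
    if (value !== null && typeof value === "object") {
      flattenToPaths(value, path, cache);
    } else {
      cache[path] = value; // only scalars are ever stored as values
    }
  }
}

const withoutId: PathCache = {};
flattenToPaths(
  { firstName: "Marie", avatar: { thumbnail: "http://asdasdfsdf.jpg" } },
  "user(id: 5)",
  withoutId
);

const withId: PathCache = {};
flattenToPaths(
  { id: 5, firstName: "Marie", avatar: { thumbnail: "http://asdasdfsdf.jpg" } },
  "user(id: 5)",
  withId
);
```

In this sketch the `id` field is also written as `5.id`, which the listing above leaves out.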

Pros and cons

Advantages of current, object based approach
- It might be easier to inspect a cache composed of objects, and it's more similar perhaps to the mental model of the returned GraphQL result. You can easily imagine how a JSON object might be decomposed into smaller objects, and that's what the store does.
- Optimistic UI might be easier because you can just manipulate JS objects. For example, if you want to set a couple fields on an object, it could be easier to just set properties on an object rather than setting the values of some paths.
- It's more similar to what Relay does, so it could be easier to convince people it's the right approach. It's also probably closer to what most people would implement intuitively. And, it's more similar to the proposed "graph mode" which represents GraphQL as a graph of objects.
- If you ever need to iterate over all of the fields of an object, you can only do that efficiently with this approach. The path-based one requires a full "table scan" over every key in the cache.

Advantages of path-based approach
- The mental model is super easy to explain. When you want to know if a certain bit of data is in the cache, just think about the path to that bit of data and check if it's there. It's really easy to inspect a query and identify which paths it will fill in the store (barring array indices).
- The cache format becomes much simpler, since there are no references, just scalar fields. There will be no paths in the values, only in the keys.
- It could be easier to keep track of which data is referenced by which queries. Right now, we would either have to take a coarse-grained approach that tracks referenced objects, or we would have to record object and field names to keep track of which queries reference which objects. Also, since the object format has to overwrite an object every time a field is added, we can't use === to compare objects.
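Two of the tradeoffs above can be seen in a few lines of TypeScript. This is only a sketch over a hand-written path cache; `hasPath` and `fieldsOf` are hypothetical helpers, and splitting on "." is a simplification that would break if argument values contained dots.

```typescript
type Scalar = string | number | boolean | null;
type PathCache = { [path: string]: Scalar };

const cache: PathCache = {
  "user(id: 5)": 5,
  "5.firstName": "Marie",
  "5.avatar.thumbnail": "http://asdasdfsdf.jpg",
};

// Path-based advantage: "is this bit of data cached?" is a single key lookup.
function hasPath(c: PathCache, path: string): boolean {
  return Object.prototype.hasOwnProperty.call(c, path);
}

// Path-based disadvantage: listing the fields of one object has to visit
// every key in the cache (the "table scan").
function fieldsOf(c: PathCache, objectKey: string): string[] {
  const prefix = objectKey + ".";
  const fields: string[] = [];
  for (const path of Object.keys(c)) {
    if (path.indexOf(prefix) !== 0) continue;
    const rest = path.slice(prefix.length);
    const dot = rest.indexOf(".");
    const name = dot === -1 ? rest : rest.slice(0, dot);
    if (fields.indexOf(name) === -1) fields.push(name);
  }
  return fields;
}

const cachedThumbnail = hasPath(cache, "5.avatar.thumbnail");
const cachedLastName = hasPath(cache, "5.lastName");
const userFields = fieldsOf(cache, "5");
```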

Quick third approach

There's also the approach of storing an actual tree of JSON instead of the paths, and merging the data into the tree on result. So the above would be stored directly as a tree:

{
  user(id: 5): {
    firstName: "Marie",
    avatar: {
      thumbnail: "http://asdasdfsdf.jpg",
    }
  }
}

Unless there is an ID, in which case the tree is normalized:

{
  user(id: 5): 5,
  5: {
    id: 5,
    firstName: "Marie",
    avatar: {
      thumbnail: "http://asdasdfsdf.jpg",
    }
  }
}

I feel like this could be easier to understand for simple cases, but will end up with a disorganized store where it's not clear how deep objects actually go.
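For comparison, "merging the data into the tree on result" could be as small as the recursive merge below. This is a TypeScript sketch with hypothetical names (`mergeInto`, `Tree`), it ignores arrays, and it leaves out the ID-based normalization step entirely.

```typescript
type Scalar = string | number | boolean | null;
interface Tree { [key: string]: Scalar | Tree }

function isTree(value: Scalar | Tree | undefined): value is Tree {
  return value !== null && typeof value === "object";
}

// Merges `incoming` into `target` in place and returns `target`.
function mergeInto(target: Tree, incoming: Tree): Tree {
  for (const key of Object.keys(incoming)) {
    const next = incoming[key];
    if (isTree(next)) {
      // Reuse the existing subtree if there is one, so earlier fields survive.
      const existing = target[key];
      const child: Tree = isTree(existing) ? existing : {};
      target[key] = mergeInto(child, next);
    } else {
      target[key] = next;
    }
  }
  return target;
}

// Two results for the same root field, arriving at different times.
const tree: Tree = {};
mergeInto(tree, { "user(id: 5)": { firstName: "Marie" } });
mergeInto(tree, { "user(id: 5)": { avatar: { thumbnail: "http://asdasdfsdf.jpg" } } });
```

After both merges the tree holds `firstName` and `avatar.thumbnail` under the single `user(id: 5)` key, with nothing in the structure saying where one "object" ends and the next begins.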

Conclusion

I think this is something to think about, and perhaps as we implement new features we should decide if the path-based format would be a help or a hindrance.

Based on the tradeoffs, I don't think we should make this change immediately, but it can be a useful tool for explaining the mental model of how the cache works. If the mental model makes sense to people, it could be worth updating the code to match it.


smolinari · 2016-06-21T05:04:07Z

Hi,

I hope you don't mind me making some comments and asking some questions, even though I am not a team member. I am just a user (at some point). This subject intrigues me a lot and I am learning. I hope you will have patience with me.

I've looked at the Relay docs and they use a flattened record approach. Correct me if I am wrong, but the rule for data sent back from a GraphQL server is that the data always has ids and these must be unique among all records (i.e. like node ids in a graph) (as mentioned here towards the bottom)? So, with that in mind, the flattening to records makes sense to me, since every record is unique.

What I don't understand completely is the need to remember references to the actual queries. Maybe I missed you guys trying to come up with a different approach, but what I believe should be happening is the client should be asking for a full result of data from the cache, as if it were a GraphQL server itself. If the cache can fulfill the request fully, it does with a GraphQL response, with no request to the real GraphQL server. If it can't, the cache system would return a diff'ed query (usually smaller than the original), which would be sent to the real GraphQL server to get the missing data, to inevitably update the cache and fulfill the request.
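In terms of the path representation from the first post, the flow described here could be sketched as follows. This is TypeScript with hypothetical names (`readFromCache`, `CacheRead`); turning the missing paths back into a smaller GraphQL query is left out.

```typescript
type Scalar = string | number | boolean | null;
type PathCache = { [path: string]: Scalar };

interface CacheRead {
  data: PathCache;    // paths the cache could answer
  missing: string[];  // paths that still need to go to the server
}

// Splits the requested paths into cached values and a "diff" of missing paths.
function readFromCache(cache: PathCache, requested: string[]): CacheRead {
  const data: PathCache = {};
  const missing: string[] = [];
  for (const path of requested) {
    if (Object.prototype.hasOwnProperty.call(cache, path)) {
      data[path] = cache[path];
    } else {
      missing.push(path);
    }
  }
  return { data, missing };
}

const cache: PathCache = { "user(id: 5).firstName": "Marie" };
const read = readFromCache(cache, [
  "user(id: 5).firstName",
  "user(id: 5).avatar.thumbnail",
]);
// read.missing now holds only the thumbnail path; an empty list would mean
// the request can be fulfilled without contacting the server at all.
```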

I know the overall goal of Meteor is to be able to combine data sources in the backend, even other REST endpoints. But, to do this and conform to GraphQL, any external data source would need some sort of UUID in their data as the id. This is where things get murky for me.

I realize the Relay method to caching the data is complicated, but isn't that the ultimate goal for the client's cache with GraphQL? Fill the request, if it can't, give the optimized query for the missing data, fetch it, then update the cache?

In other words, and I might be overstepping my realm of influence here, the cache format should be optimized and structured so that the client can best serve as "middleware" to the real GraphQL server. It shouldn't be built or structured for any other reason.

😄

Scott

stubailo · 2016-06-21T05:56:13Z

Actually, our client does the exact same thing that Relay does - this question doesn't have anything to do with whether or not IDs are used to normalize.

Note that we use IDs when they are present but don't require them to exist to do caching.

stubailo · 2016-06-21T05:57:06Z

The suggested approach here with paths is just a different data structure for representing flattened records.

smolinari · 2016-06-21T06:45:03Z

Oh, OK. The having or not having IDs throws me for a loop, I guess. Theoretically, there won't be a record, without ID's or rather there can't be (in my mind). So, I guess that is my misunderstanding. I guess I am thinking at a higher level than this issue is. Sorry for the interruption, then.

Scott

stubailo · 2016-06-21T06:48:39Z

Theoretically, there won't be a record, without ID's or rather there can't be (in my mind).

Even with Relay, not all records need to have IDs. In that case, Relay generates an ID, and Apollo Client uses the query path. In fact, you can do a lot of nice caching stuff with no IDs at all, which can be useful for certain cases.

smolinari · 2016-06-21T08:41:10Z

Interesting. I'll have to learn some more, I guess.....

Thanks for your time.

Scott

deoqc · 2016-06-30T15:49:30Z

tl;dr: We need to be able to identify the same entity called from 2 different paths. Global IDs serve this nicely.

I use some Relay machinery (by means of graphql-relay) in my Apollo client, as some say it is the idiomatic graphql.

Connections and the Node interface cache nicely in Relay (at least I think so, since I don't have Redux DevTools to inspect it =), but, no wonder, this is not supported in Apollo.

For example (I'm stating the obvious with these examples, but for completeness...):

Users connection

query {
  users(first: 10) {
     id
     fieldA
     fieldB
  }
}

Node Interface

query {
  node(id: "MyGlobalUniqueOpaqueId") {
    fieldB
    fieldC
  }
}

Problem

Even if the first user in the connection (first query) is the same as the user in the node call (second query), they will be cached separately. All sorts of problems arise: inconsistent data, the only way to update the cache of the connection is to call the exact same connection again, etc.
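A small TypeScript sketch of the inconsistency, using a toy cache and a hypothetical `write` helper (the keys and field values are made up for illustration and are not what Apollo Client actually stores):

```typescript
type Fields = { [field: string]: string | number };
type Cache = { [key: string]: Fields };

// Merges `fields` into whatever is already stored under `key`.
function write(cache: Cache, key: string, fields: Fields): void {
  cache[key] = { ...(cache[key] || {}), ...fields };
}

// Keyed by query path: the same user ends up in two unrelated entries.
const byPath: Cache = {};
write(byPath, "users(first: 10).0", { id: "MyGlobalUniqueOpaqueId", fieldA: "a", fieldB: "old" });
write(byPath, 'node(id: "MyGlobalUniqueOpaqueId")', { fieldB: "new", fieldC: "c" });
const staleFieldB = byPath["users(first: 10).0"]["fieldB"]; // still the old value

// Keyed by global ID: both queries read and write the same entry.
const byId: Cache = {};
write(byId, "MyGlobalUniqueOpaqueId", { id: "MyGlobalUniqueOpaqueId", fieldA: "a", fieldB: "old" });
write(byId, "MyGlobalUniqueOpaqueId", { fieldB: "new", fieldC: "c" });
const freshFieldB = byId["MyGlobalUniqueOpaqueId"]["fieldB"]; // updated by the node query
```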

I've seen talk of Apollo supporting these idiomatic features, like here. For anything other than simple applications, these features are a must-have.

It could get much more complicated. A user has a friend, who is also a user... Identifying different paths, whatever they are, by the unique global ID is much better than the developer having to call queries and think too hard about ways to optimize the caching.

I know this feature wouldn't be the default caching mechanism for Apollo, but shouldn't it be not only supported but even encouraged? What is the roadmap for this right now?

ps: Sorry if this discussion doesn't belong here; I can move it somewhere else...

stubailo · 2016-06-30T16:44:58Z

Sorry if this discussion doesn't belong here; I can move it somewhere else...

Please do! We do have the concept of a global ID which you can use by passing dataIdFromObject, but currently that isn't used when querying the Node interface. It would be a pretty simple change to do.
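For reference, a `dataIdFromObject` function might look like the TypeScript sketch below. The function shape (take a result object, return a string key or `null`) and the `cacheKey` helper are assumptions made for illustration; check the client's documentation for the exact signature your version expects.

```typescript
interface ResultObject { [field: string]: unknown }

// Assumed shape: return a global ID when the object carries one, or null so
// that the cache can fall back to the query path.
function dataIdFromObject(result: ResultObject): string | null {
  const id = result["id"];
  if (typeof id === "string" || typeof id === "number") {
    return String(id);
  }
  return null;
}

// Hypothetical helper showing how the fallback to the query path would work.
function cacheKey(result: ResultObject, queryPath: string): string {
  const id = dataIdFromObject(result);
  return id !== null ? id : queryPath;
}

const keyWithId = cacheKey({ id: "MyGlobalUniqueOpaqueId", fieldB: "b" }, "users(first: 10).0");
const keyWithoutId = cacheKey({ fieldB: "b" }, "users(first: 10).0");
```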

I use some Relay machinery (by means of graphql-relay) in my Apollo client, as some say it is the idiomatic graphql.

I don't agree that all of what Relay specifies needs to be idiomatic GraphQL, especially because a lot of things like the mutation spec are simply driven by the idiosyncrasies of the Facebook internal GraphQL layer. But I think it should be possible to use Node queries in a useful way if you have them available!

stubailo · 2016-06-30T16:45:29Z

@deoqc btw, this is an issue about the same thing: #332 so let's talk there


stubailo added the idea label Jun 20, 2016

deoqc mentioned this issue Jun 30, 2016

Querying for data already in the store with a different query #332

Closed

stubailo closed this as completed Jan 5, 2017

renovate bot mentioned this issue Aug 4, 2017

chore(deps): update dependency ts-jest to version 20.0.9 #2000

Merged

jbaxleyiii pushed a commit that referenced this issue Oct 17, 2017

Merge pull request #285 from apollographql/revert-284-patch-1

7d741a6

Revert "Replace `createMeteorNetworkInterface` with `createNetworkInterface` in `meteor.md`"

github-actions bot locked as resolved and limited conversation to collaborators Feb 2, 2023
