RFC: Gossip the schema #1743
GitHub wiki or markdown. We should figure out the standard format for design proposals like this. I don't have strong feelings other than there being a standard.
# Drawbacks

This is slightly more complicated than the current implementation. Because the schema is eventually-consistent, how do we know when migrations are done? We'll have to count on the TTL, which feels a little dirty.
I think it would be worthwhile to detail out the expected latency for all the nodes to see an updated schema with this approach. This isn't just the gossip TTL, but the maximum hops times the TTL. Should also mention what the gossip TTL is.
This isn't just migrations. The namespace and table descriptors will probably be holding permissions (and possibly other settings, though I can't think of any right now). These changes will potentially be more frequent than schema changes, and will probably need to be applied faster.
Addressed this in detailed design.
Neither solution is great. The wiki doesn't have any commenting facilities so the discussion has to take place somewhere else. Checking the docs in is better because there can be comments on the PR, but it's still weird - the Three other options that come to mind:
# Motivation

In order to support performant SQL queries, each gateway node must be able to address the data requested by the query. Today this requires the node to read `TableDescriptor`s from the KV map, which we believe (though we haven't measured) will cause a substantial performance hit which we can eliminate by actively gossiping the necessary metadata.
s/eliminate/mitigate/
done
The non-design doc commits should be broken out into a separate PR.
# Detailed design

Distribution of the schema metadata will be modeled after the current configuration distribution mechanism. Whenever a new `TableDescriptor` is written to the KV map, the node holding the leader lease for that key's range will add the `TableDescriptor` to gossip under a key derived from the `TableDescriptor`'s `Name` with a TTL TBD.
s/current/deprecated/
done
I've updated this design doc to reflect everyone's comments, but I'm having second thoughts. In particular, I'm thinking we should consider the alternative of doing kv reads as we do now with a time-bounded cache. When a cache key is read, we can check its TTL and asynchronously refresh the value. This would be simple to implement and would allow us to substantially reduce the scope of gossip.
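For concreteness, the time-bounded cache alternative might look roughly like the sketch below. It is illustrative only: `ttlCache`, `newTTLCache`, `exampleTTL`, and the `fetch` callback (standing in for a consistent KV read of a descriptor) are hypothetical names, not existing code in this repository. A miss blocks on the read; a stale hit returns the old value right away and refreshes in the background.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// exampleTTL is a hypothetical TTL chosen only for illustration.
const exampleTTL = time.Hour

// entry is a cached value plus the time after which it counts as stale.
type entry struct {
	value   string
	expires time.Time
}

// ttlCache is a hypothetical time-bounded cache in front of a KV read.
type ttlCache struct {
	mu       sync.Mutex
	ttl      time.Duration
	fetch    func(key string) (string, error) // stands in for a consistent KV read
	entries  map[string]entry
	inflight map[string]bool // keys with a background refresh in progress
}

func newTTLCache(ttl time.Duration, fetch func(string) (string, error)) *ttlCache {
	return &ttlCache{
		ttl:      ttl,
		fetch:    fetch,
		entries:  make(map[string]entry),
		inflight: make(map[string]bool),
	}
}

// Get returns the cached value for key. A miss blocks on fetch; a stale hit
// returns the old value immediately and starts one asynchronous refresh.
func (c *ttlCache) Get(key string) (string, error) {
	c.mu.Lock()
	if e, ok := c.entries[key]; ok {
		if time.Now().After(e.expires) && !c.inflight[key] {
			c.inflight[key] = true
			go c.refresh(key)
		}
		c.mu.Unlock()
		return e.value, nil
	}
	c.mu.Unlock()

	v, err := c.fetch(key)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.entries[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}

// refresh re-reads key and replaces the cached entry if the read succeeds.
func (c *ttlCache) refresh(key string) {
	v, err := c.fetch(key)
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.inflight, key)
	if err == nil {
		c.entries[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
	}
}

func main() {
	c := newTTLCache(exampleTTL, func(key string) (string, error) {
		return "descriptor for " + key, nil
	})
	v, err := c.Get("users")
	fmt.Println(v, err)
}
```

In this sketch a reader can see a value that is stale by up to the TTL plus the duration of one refresh, which is the staleness trade-off being weighed against gossip here.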
Codifies what was done in cockroachdb#1743.
gossip: more proactive, but always sends all data to everybody, though with incremental updates. Almost no read pressure on the descriptor range. A shorter TTL is possible than with kv+cache. Under the assumption that, unless we get smart about replica locations, we'll have most schemas on most nodes, sending it all to everybody (and thus gossip) is better.
This is not strictly true - gossip will impose a lower bound on the TTL which we currently don't have the facility to determine. If we set this TTL lower than |
that's true. I was thinking have a timestamp in the gossip'ed info, so that this issue wouldn't be there (actual gossip TTL would be higher), but that's also somewhat awkward.
If you really want to keep a local cache and have every node do consistent reads, perhaps we should organize the schema data with a logical version number as a key prefix so only the most recent diffs for a schema need be queried. Otherwise, you're going to be dragging potentially many megabytes of data every few minutes to every node in the system. Given that wrinkle of complexity, I think you should reconsider the use of gossip.
Won't gossip do exactly this, though?
True, though it's on the to-fix list. Regardless, the many megabytes of data every few minutes get streamed out of the node holding schema information just once as opposed to N times.
Updated this with a slightly more detailed |
[ci skip]
Complete propagation of new metadata will take at most `numHops * gossipInterval` where `numHops` is the maximum number of hops between any node and the publishing node, and `gossipInterval` is the maximum interval between sequential writes to the gossip network on a given node.
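As a worked example of that bound, here is a minimal sketch. The function name `propagationBound` and the values (5 hops, a 2-second interval) are made up for illustration and are not measured cluster parameters.

```go
package main

import (
	"fmt"
	"time"
)

// exampleInterval is a hypothetical gossip interval, not a real default.
const exampleInterval = 2 * time.Second

// propagationBound returns the worst-case time for a gossiped value to reach
// every node, computed as numHops * gossipInterval.
func propagationBound(numHops int, gossipInterval time.Duration) time.Duration {
	return time.Duration(numHops) * gossipInterval
}

func main() {
	// With 5 hops and a 2s interval the bound is 10s.
	fmt.Println(propagationBound(5, exampleInterval))
}
```

The bound grows linearly in both terms, so the latency figure depends on the maximum hop count in the cluster as much as on the gossip interval.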
On the read side, metadata reads' behaviour will change such that they will read from gossip rather than the KV store. This will require plumbing a closure or a reference to the `Gossip` instance down to the `sql.Planner` instance.
not all of them. We definitely want anything that will mutate the metadata to first do a kv read.
Obviously high-qps requests (select, insert, delete, update) will use the cache.
Everything else (show *) could do either. If they use the gossiped version, they'll be consistent with high-qps ops (meaning you can tell when your changes have been propagated). If they don't, they'll be consistent with metadata mutators (meaning sequences of `GRANT|REVOKE ...` and `SHOW GRANTS` would be easier).
Sure. I'm okay with omitting that detail from this RFC since there's no controversy; either metadata-direct operations always do consistent things or we add flags.
LGTM.
Format blatantly stolen from https://github.com/rust-lang/rfcs.
rendered
@cockroachdb/owners @cockroachdb/developers PTAL