RFC: Gossip the schema #1743

Merged 1 commit into cockroachdb:master on Aug 28, 2015
Conversation

@tamird (Contributor) commented Jul 20, 2015

Format blatantly stolen from https://github.com/rust-lang/rfcs.

rendered

@cockroachdb/owners @cockroachdb/developers PTAL

@tbg added the PTAL label Jul 20, 2015
@petermattis (Collaborator)

GitHub wiki or checked-in markdown? We should figure out the standard format for design proposals like this. I don't have strong feelings beyond there being a standard.


# Drawbacks

This is slightly more complicated than the current implementation. Because the schema is eventually-consistent, how do we know when migrations are done? We'll have to count on the TTL, which feels a little dirty.
Collaborator

I think it would be worthwhile to detail out the expected latency for all the nodes to see an updated schema with this approach. This isn't just the gossip TTL, but the maximum hops times the TTL. Should also mention what the gossip TTL is.

Contributor

This isn't just migrations. The namespace and table descriptors will probably be holding permissions (and possibly other settings, though I can't think of any right now). These changes will potentially be more frequent than schema changes, and will probably need to be applied faster.

Contributor Author

Addressed this in detailed design.

@bdarnell (Contributor)

Neither solution is great. The wiki doesn't have any commenting facilities, so the discussion has to take place somewhere else. Checking the docs in is better because there can be comments on the PR, but it's still weird: the docs directory should eventually be a place for our user-facing docs, not docs that are primarily historical artifacts. And when do we merge them? Only once we have decided to implement the plan, or do we merge rejected design docs as well?

Three other options that come to mind:

  • Check them in as in this PR, but in a separate repo (this is what rust does)
  • Put the docs in a gist (which supports commenting, unlike a wiki, although they don't work well for multi-author docs), and make a wiki page linking to them for discoverability.
  • Use google docs (or quip?) instead of github for this (again, with wiki links for discoverability), similar to what the go team does.


# Motivation

In order to support performant SQL queries, each gateway node must be able to address the data requested by the query. Today this requires the node to read `TableDescriptor`s from the KV map, which we believe (though we haven't measured) will cause a substantial performance hit that we can eliminate by actively gossiping the necessary metadata.
Member

s/eliminate/mitigate/

Contributor Author

done

@petermattis (Collaborator)

The non-design doc commits should be broken out into a separate PR.


# Detailed design

Distribution of the schema metadata will be modeled after the current configuration distribution mechanism. Whenever a new `TableDescriptor` is written to the KV map, the node holding the leader lease for that key's range will add the `TableDescriptor` to gossip under a key derived from the `TableDescriptor`'s `Name` with a TTL TBD.
Member

s/current/deprecated/

Contributor Author

done
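To make the write path above concrete, here is a minimal Go sketch of what the detailed design describes: after a new `TableDescriptor` commits to the KV map, the leaseholder re-publishes it to gossip under a key derived from its `Name`. Everything here (`Gossip`, `AddInfo`, the `schema:` prefix, the 2-minute TTL) is an illustrative stand-in, not CockroachDB's actual API; the RFC leaves the TTL TBD.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TableDescriptor stands in for the real structured.TableDescriptor;
// only the Name field matters for this sketch.
type TableDescriptor struct {
	Name string
}

// Gossip is a stand-in for a node's gossip instance. AddInfo mirrors
// the general shape of a gossip write: key, value, TTL.
type Gossip struct {
	mu   sync.Mutex
	info map[string]TableDescriptor
}

func (g *Gossip) AddInfo(key string, desc TableDescriptor, ttl time.Duration) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.info[key] = desc
	// A real implementation would also record the TTL and send the
	// info to peers on the next gossip interval.
}

// onDescriptorWrite is what the node holding the leader lease for the
// descriptor's range would do after the TableDescriptor is committed:
// re-publish it under a key derived from the descriptor's Name.
func onDescriptorWrite(g *Gossip, desc TableDescriptor) {
	g.AddInfo("schema:"+desc.Name, desc, 2*time.Minute) // TTL is TBD in the RFC
}

func main() {
	g := &Gossip{info: map[string]TableDescriptor{}}
	onDescriptorWrite(g, TableDescriptor{Name: "users"})
	fmt.Println(g.info["schema:users"].Name) // prints: users
}
```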

@tamird (Contributor Author) commented Jul 29, 2015

I've updated this design doc to reflect everyone's comments, but I'm having second thoughts. In particular, I'm thinking we should consider the alternative of doing kv reads as we do now with a time-bounded cache. When a cache key is read, we can check its TTL and asynchronously refresh the value.

This would be simple to implement and would allow us to substantially reduce the scope of gossip.
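A minimal sketch of this kv+cache alternative, in Go with all names hypothetical: `Get` serves the cached descriptor immediately and, when the entry has outlived its TTL, kicks off an asynchronous KV read to refresh it.

```go
package schemacache

import (
	"sync"
	"time"
)

type TableDescriptor struct{ Name string }

// entry pairs a cached descriptor with the time it was last fetched.
type entry struct {
	desc      TableDescriptor
	fetchedAt time.Time
}

// Cache is a time-bounded descriptor cache. fetch is the consistent
// KV read (e.g. a lookup of the TableDescriptor key).
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
	fetch   func(name string) TableDescriptor
}

// Get returns the cached descriptor, falling back to a synchronous KV
// read on a miss. A stale hit is served as-is while a background
// refresh runs; a real implementation would also dedupe concurrent
// refreshes of the same key.
func (c *Cache) Get(name string) TableDescriptor {
	c.mu.Lock()
	e, ok := c.entries[name]
	c.mu.Unlock()
	if !ok {
		d := c.fetch(name)
		c.put(name, d)
		return d
	}
	if time.Since(e.fetchedAt) > c.ttl {
		go func() { c.put(name, c.fetch(name)) }()
	}
	return e.desc
}

func (c *Cache) put(name string, d TableDescriptor) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[name] = entry{desc: d, fetchedAt: time.Now()}
}
```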

bdarnell added a commit to bdarnell/cockroach that referenced this pull request Jul 29, 2015
@tbg (Member) commented Jul 30, 2015

> I've updated this design doc to reflect everyone's comments, but I'm having second thoughts. In particular, I'm thinking we should consider the alternative of doing kv reads as we do now with a time-bounded cache. When a cache key is read, we can check its TTL and asynchronously refresh the value.
>
> This would be simple to implement and would allow us to substantially reduce the scope of gossip.

gossip: more proactive; always sends all data to everybody, but with incremental updates. Almost no read pressure on the descriptor range. A shorter TTL is possible than with kv+cache.
kv+cache: all nodes need to poll descriptors, but only for tables which see traffic.

Assuming that, unless we get smart about replica locations, we'll have most schemas on most nodes, sending it all to everybody (and thus gossip) is better.
Whatever we use, the sql server doesn't need to know about it. It just needs an interface.

@tamird (Contributor Author) commented Jul 30, 2015

> Shorter TTL possible than with kv+cache.

This is not strictly true: gossip will impose a lower bound on the TTL, which we currently don't have the facility to determine. If we set this TTL lower than `numHops * gossipInterval`, we're gonna have a bad time. Measuring the read time in kv+cache is pretty trivial, so implementing continual freshness there doesn't have the same problem.

@tbg (Member) commented Jul 30, 2015

That's true. I was thinking of having a timestamp in the gossiped info, so that this issue wouldn't be there (the actual gossip TTL would be higher), but that's also somewhat awkward.
If we just use an interface that doesn't lock us into either, I'm happy to start with KV. I'm always thinking about online schema changes, but it's way too early for that.

@spencerkimball (Member)

If you really want to keep a local cache and have every node do consistent reads, perhaps we should organize the schema data with a logical version number as a key prefix so only the most recent diffs for a schema need be queried. Otherwise, you're going to be dragging potentially many megabytes of data every few minutes to every node in the system. Given that wrinkle of complexity, I think you should reconsider the use of gossip.
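For illustration, one way to read the version-prefix suggestion (the key layout here is entirely hypothetical): a monotonically increasing logical version prefixes each descriptor key, so a node that last saw version 41 scans only keys at version 42 and above instead of re-reading the whole schema.

```go
package main

import "fmt"

// schemaKey builds a hypothetical versioned key. The version is
// zero-padded so that lexicographic key order matches numeric version
// order, which makes "all diffs after version v" one contiguous scan.
func schemaKey(version uint64, tableID uint32) string {
	return fmt.Sprintf("/schema/%012d/%d", version, tableID)
}

func main() {
	fmt.Println(schemaKey(42, 51)) // /schema/000000000042/51
}
```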

@tamird (Contributor Author) commented Jul 30, 2015

> Otherwise, you're going to be dragging potentially many megabytes of data every few minutes to every node in the system.

Won't gossip do exactly this, though?

@spencerkimball (Member)

True, though it's on the to-fix list. Regardless, the many megabytes of data every few minutes get streamed out of the node holding schema information just once as opposed to N times.

@tbg mentioned this pull request Aug 11, 2015
@tamird removed the PTAL label Aug 18, 2015
@tamird (Contributor Author) commented Aug 28, 2015

Updated this with a slightly more detailed Detailed design. PTAL: I'd like to merge this today.


Complete propagation of new metadata will take at most `numHops * gossipInterval` where `numHops` is the maximum number of hops between any node and the publishing node, and `gossipInterval` is the maximum interval between sequential writes to the gossip network on a given node.
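For illustration only (neither value is pinned down in this RFC): with `numHops = 3` and `gossipInterval = 2s`, the worst case is `3 * 2s = 6s`, and per the discussion above the gossip TTL would have to be set above that bound.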

On the read side, metadata reads will change to read from gossip rather than the KV store. This will require plumbing a closure or a reference to the `Gossip` instance down to the `sql.Planner` instance.
Contributor

Not all of them. We definitely want anything that will mutate the metadata to first do a kv read.
Obviously high-qps requests (select, insert, delete, update) will use the cache.
Everything else (show *) could do either. If they use the gossiped version, they'll be consistent with high-qps ops (meaning you can tell when your changes have been propagated). If they don't, they'll be consistent with metadata mutators (meaning sequences of GRANT|REVOKE ... and SHOW GRANTS would be easier).

Contributor Author

Sure. I'm okay with omitting that detail from this RFC since there's no controversy; either metadata-direct operations always do consistent reads, or we add flags.
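Sketching the split this thread settles on, in Go (all names illustrative, not the real API): the planner programs against a small accessor interface, per tbg's earlier point; a gossip-backed implementation serves high-qps statements, while a kv-backed one with the same shape would serve metadata mutators.

```go
package schemaaccess

import "errors"

type TableDescriptor struct{ Name string }

// descriptorAccessor is the interface a sql.Planner-like component
// would hold, keeping the gossip-vs-kv choice swappable.
type descriptorAccessor interface {
	getTableDescriptor(name string) (TableDescriptor, error)
}

// gossipAccessor answers from gossiped, eventually consistent
// metadata; suitable for high-qps statements (SELECT, INSERT, ...).
type gossipAccessor struct {
	infos map[string]TableDescriptor // fed by the gossip network
}

// Compile-time check that gossipAccessor satisfies the interface.
var _ descriptorAccessor = gossipAccessor{}

func (a gossipAccessor) getTableDescriptor(name string) (TableDescriptor, error) {
	if d, ok := a.infos["schema:"+name]; ok {
		return d, nil
	}
	return TableDescriptor{}, errors.New("descriptor not gossiped yet")
}

// A kvAccessor implementing the same interface would do a consistent
// KV read; metadata mutators (CREATE, GRANT, REVOKE, ...) take that
// path so they never act on stale descriptors.
```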

@mberhault (Contributor)

LGTM.

tamird added a commit that referenced this pull request Aug 28, 2015
@tamird merged commit 8e5b132 into cockroachdb:master Aug 28, 2015
@tamird deleted the schema-gossip branch August 28, 2015 17:52
@jess-edwards mentioned this pull request Sep 9, 2015