RFC: Gossip the schema #1743

Merged 1 commit into cockroachdb:master on Aug 28, 2015
Conversation

@tamird (Contributor) commented Jul 20, 2015

Format blatantly stolen from https://github.com/rust-lang/rfcs.

rendered

@cockroachdb/owners @cockroachdb/developers PTAL

@tbg added the PTAL label Jul 20, 2015
@petermattis (Collaborator)

GitHub wiki or checked-in markdown? We should figure out the standard format for design proposals like this. I don't have strong feelings beyond there being a standard.


# Drawbacks

This is slightly more complicated than the current implementation. Because the schema is eventually-consistent, how do we know when migrations are done? We'll have to count on the TTL, which feels a little dirty.
Collaborator

I think it would be worthwhile to detail out the expected latency for all the nodes to see an updated schema with this approach. This isn't just the gossip TTL, but the maximum hops times the TTL. Should also mention what the gossip TTL is.

Contributor

This isn't just migrations. The namespace and table descriptors will probably be holding permissions (and possibly other settings, though I can't think of any right now). These changes will potentially be more frequent than schema changes, and will probably need to be applied faster.

Contributor Author

Addressed this in detailed design.

@bdarnell (Contributor)

Neither solution is great. The wiki doesn't have any commenting facilities, so the discussion has to take place somewhere else. Checking the docs in is better because there can be comments on the PR, but it's still weird: the docs directory should eventually be a place for our user-facing docs, not docs that are primarily historical artifacts. And when do we merge them? Only once we have decided to implement the plan, or do we merge rejected design docs as well?

Three other options that come to mind:

  • Check them in as in this PR, but in a separate repo (this is what rust does)
  • Put the docs in a gist (which supports commenting, unlike a wiki, although they don't work well for multi-author docs), and make a wiki page linking to them for discoverability.
  • Use google docs (or quip?) instead of github for this (again, with wiki links for discoverability), similar to what the go team does.


# Motivation

In order to support performant SQL queries, each gateway node must be able to address the data requested by the query. Today this requires the node to read `TableDescriptor`s from the KV map, which we believe (though we haven't measured) will cause a substantial performance hit that we can eliminate by actively gossiping the necessary metadata.
Member

s/eliminate/mitigate/

Contributor Author

done

@petermattis (Collaborator)

The non-design doc commits should be broken out into a separate PR.


# Detailed design

Distribution of the schema metadata will be modeled after the current configuration distribution mechanism. Whenever a new `TableDescriptor` is written to the KV map, the node holding the leader lease for that key's range will add the `TableDescriptor` to gossip under a key derived from the `TableDescriptor`'s `Name` with a TTL TBD.
Member

s/current/deprecated/

Contributor Author

done
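To make the write path above concrete, here is a minimal Go sketch of what the detailed design describes: after a new `TableDescriptor` commits to the KV map, the leaseholder re-publishes it to gossip under a key derived from its `Name`. Everything here (`Gossip`, `AddInfo`, the `schema:` prefix, the 2-minute TTL) is an illustrative stand-in, not CockroachDB's actual API; the RFC leaves the TTL TBD.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TableDescriptor stands in for the real structured.TableDescriptor;
// only the Name field matters for this sketch.
type TableDescriptor struct {
	Name string
}

// Gossip is a stand-in for a node's gossip instance. AddInfo mirrors
// the general shape of a gossip write: key, value, TTL.
type Gossip struct {
	mu   sync.Mutex
	info map[string]TableDescriptor
}

func (g *Gossip) AddInfo(key string, desc TableDescriptor, ttl time.Duration) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.info[key] = desc
	// A real implementation would also record the TTL and send the
	// info to peers on the next gossip interval.
}

// onDescriptorWrite is what the node holding the leader lease for the
// descriptor's range would do after the TableDescriptor is committed:
// re-publish it under a key derived from the descriptor's Name.
func onDescriptorWrite(g *Gossip, desc TableDescriptor) {
	g.AddInfo("schema:"+desc.Name, desc, 2*time.Minute) // TTL is TBD in the RFC
}

func main() {
	g := &Gossip{info: map[string]TableDescriptor{}}
	onDescriptorWrite(g, TableDescriptor{Name: "users"})
	fmt.Println(g.info["schema:users"].Name) // prints: users
}
```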

@tamird (Contributor Author) commented Jul 29, 2015

I've updated this design doc to reflect everyone's comments, but I'm having second thoughts. In particular, I'm thinking we should consider the alternative of doing kv reads as we do now with a time-bounded cache. When a cache key is read, we can check its TTL and asynchronously refresh the value.

This would be simple to implement and would allow us to substantially reduce the scope of gossip.
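A minimal sketch of this kv+cache alternative, in Go with all names hypothetical: `Get` serves the cached descriptor immediately and, when the entry has outlived its TTL, kicks off an asynchronous KV read to refresh it.

```go
package schemacache

import (
	"sync"
	"time"
)

type TableDescriptor struct{ Name string }

// entry pairs a cached descriptor with the time it was last fetched.
type entry struct {
	desc      TableDescriptor
	fetchedAt time.Time
}

// Cache is a time-bounded descriptor cache. fetch is the consistent
// KV read (e.g. a lookup of the TableDescriptor key).
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
	fetch   func(name string) TableDescriptor
}

// Get returns the cached descriptor, falling back to a synchronous KV
// read on a miss. A stale hit is served as-is while a background
// refresh runs; a real implementation would also dedupe concurrent
// refreshes of the same key.
func (c *Cache) Get(name string) TableDescriptor {
	c.mu.Lock()
	e, ok := c.entries[name]
	c.mu.Unlock()
	if !ok {
		d := c.fetch(name)
		c.put(name, d)
		return d
	}
	if time.Since(e.fetchedAt) > c.ttl {
		go func() { c.put(name, c.fetch(name)) }()
	}
	return e.desc
}

func (c *Cache) put(name string, d TableDescriptor) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[name] = entry{desc: d, fetchedAt: time.Now()}
}
```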

bdarnell added a commit to bdarnell/cockroach that referenced this pull request Jul 29, 2015
@tbg (Member) commented Jul 30, 2015

> I've updated this design doc to reflect everyone's comments, but I'm having second thoughts. In particular, I'm thinking we should consider the alternative of doing kv reads as we do now with a time-bounded cache. When a cache key is read, we can check its TTL and asynchronously refresh the value.
>
> This would be simple to implement and would allow us to substantially reduce the scope of gossip.

gossip: more proactive; always sends all data to everybody, but with incremental updates. Almost no read pressure on the descriptor range. A shorter TTL is possible than with kv+cache.
kv+cache: all nodes need to poll descriptors, but only for tables which see traffic.

Assuming that, unless we get smart about replica locations, we'll have most schemas on most nodes, sending it all to everybody (and thus gossip) is better.
Whatever we use, the sql server doesn't need to know about it. It just needs an interface.

@tamird (Contributor Author) commented Jul 30, 2015

> Shorter TTL possible than with kv+cache.

This is not strictly true: gossip will impose a lower bound on the TTL, which we currently don't have the facility to determine. If we set this TTL lower than `numHops * gossipInterval`, we're gonna have a bad time. Measuring the read time in kv+cache is pretty trivial, so implementing continual freshness there doesn't have the same problem.

@tbg (Member) commented Jul 30, 2015

That's true. I was thinking of having a timestamp in the gossiped info, so that this issue wouldn't be there (the actual gossip TTL would be higher), but that's also somewhat awkward.
If we just use an interface that doesn't lock us into either, I'm happy to start with KV. I'm always thinking about online schema changes, but it's way too early for that.

@spencerkimball (Member)

If you really want to keep a local cache and have every node do consistent reads, perhaps we should organize the schema data with a logical version number as a key prefix so only the most recent diffs for a schema need be queried. Otherwise, you're going to be dragging potentially many megabytes of data every few minutes to every node in the system. Given that wrinkle of complexity, I think you should reconsider the use of gossip.
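For illustration, one way to read the version-prefix suggestion (the key layout here is entirely hypothetical): a monotonically increasing logical version prefixes each descriptor key, so a node that last saw version 41 scans only keys at version 42 and above instead of re-reading the whole schema.

```go
package main

import "fmt"

// schemaKey builds a hypothetical versioned key. The version is
// zero-padded so that lexicographic key order matches numeric version
// order, which makes "all diffs after version v" one contiguous scan.
func schemaKey(version uint64, tableID uint32) string {
	return fmt.Sprintf("/schema/%012d/%d", version, tableID)
}

func main() {
	fmt.Println(schemaKey(42, 51)) // /schema/000000000042/51
}
```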

@tamird (Contributor Author) commented Jul 30, 2015

> Otherwise, you're going to be dragging potentially many megabytes of data every few minutes to every node in the system.

Won't gossip do exactly this, though?

@spencerkimball (Member)

True, though it's on the to-fix list. Regardless, the many megabytes of data every few minutes get streamed out of the node holding schema information just once as opposed to N times.

@tbg mentioned this pull request Aug 11, 2015
@tamird removed the PTAL label Aug 18, 2015
@tamird (Contributor Author) commented Aug 28, 2015

Updated this with a slightly more detailed Detailed design. PTAL: I'd like to merge this today.


Complete propagation of new metadata will take at most `numHops * gossipInterval` where `numHops` is the maximum number of hops between any node and the publishing node, and `gossipInterval` is the maximum interval between sequential writes to the gossip network on a given node.
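For illustration only (neither value is pinned down in this RFC): with `numHops = 3` and `gossipInterval = 2s`, the worst case is `3 * 2s = 6s`, and per the discussion above the gossip TTL would have to be set above that bound.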

On the read side, metadata reads will change to read from gossip rather than the KV store. This will require plumbing a closure or a reference to the `Gossip` instance down to the `sql.Planner` instance.
Contributor

Not all of them. We definitely want anything that will mutate the metadata to first do a kv read.
Obviously high-qps requests (select, insert, delete, update) will use the cache.
Everything else (show *) could do either. If they use the gossiped version, they'll be consistent with high-qps ops (meaning you can tell when your changes have been propagated). If they don't, they'll be consistent with metadata mutators (meaning sequences of GRANT|REVOKE ... and SHOW GRANTS would be easier).

Contributor Author

Sure. I'm okay with omitting that detail from this RFC since there's no controversy; either metadata-direct operations always do consistent reads, or we add flags.
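Sketching the split this thread settles on, in Go (all names illustrative, not the real API): the planner programs against a small accessor interface, per tbg's earlier point; a gossip-backed implementation serves high-qps statements, while a kv-backed one with the same shape would serve metadata mutators.

```go
package schemaaccess

import "errors"

type TableDescriptor struct{ Name string }

// descriptorAccessor is the interface a sql.Planner-like component
// would hold, keeping the gossip-vs-kv choice swappable.
type descriptorAccessor interface {
	getTableDescriptor(name string) (TableDescriptor, error)
}

// gossipAccessor answers from gossiped, eventually consistent
// metadata; suitable for high-qps statements (SELECT, INSERT, ...).
type gossipAccessor struct {
	infos map[string]TableDescriptor // fed by the gossip network
}

// Compile-time check that gossipAccessor satisfies the interface.
var _ descriptorAccessor = gossipAccessor{}

func (a gossipAccessor) getTableDescriptor(name string) (TableDescriptor, error) {
	if d, ok := a.infos["schema:"+name]; ok {
		return d, nil
	}
	return TableDescriptor{}, errors.New("descriptor not gossiped yet")
}

// A kvAccessor implementing the same interface would do a consistent
// KV read; metadata mutators (CREATE, GRANT, REVOKE, ...) take that
// path so they never act on stale descriptors.
```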

@mberhault (Contributor)

LGTM.

tamird added a commit that referenced this pull request Aug 28, 2015
@tamird merged commit 8e5b132 into cockroachdb:master Aug 28, 2015
@tamird deleted the schema-gossip branch August 28, 2015 17:52
@jess-edwards mentioned this pull request Sep 9, 2015