
doc: v3api rfc #2675

Merged
merged 1 commit on May 18, 2015
Conversation

xiang90
Contributor

@xiang90 xiang90 commented Apr 14, 2015

This is a real Request For Comments.

All feedback is welcome! Any questions are also welcome!

We have been working really hard on the internal raft. As the new raft is getting stable, we are starting to design the new API/store for etcd.

Here are the feature requests from the community that we plan to support:

Optimistic concurrency GoogleCloudPlatform/kubernetes#1957 (comment)
Access to past versions of objects (after history compaction) GoogleCloudPlatform/kubernetes#1957 (comment)
Efficient reliable and recursive watch GoogleCloudPlatform/kubernetes#1957 (comment)
Multi-object transactions #1282 #860 #652
The ability to somehow build multiple indices #860 #1548
Pagination support GoogleCloudPlatform/kubernetes#1957 (comment)
Low per-object memory footprint GoogleCloudPlatform/kubernetes#1957 (comment)
Object TTLs (for events, logically deleted objects) #1232
Allow clients to request watch heartbeats to refresh the clients window #2048
Binary key and value #1548
On-disk storage #732
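Several of these requests (optimistic concurrency, multi-object transactions) reduce to conditioning writes on a key's current version. A minimal in-memory sketch of that primitive, assuming per-key version counters; `versionedStore` and `putIfVersion` are illustrative names, not part of the proposed API:

```go
package main

import "fmt"

// versionedStore is a toy in-memory store where every key carries a
// monotonically increasing version, as in an MVCC keyspace.
type versionedStore struct {
	values   map[string][]byte
	versions map[string]int64
}

func newVersionedStore() *versionedStore {
	return &versionedStore{values: map[string][]byte{}, versions: map[string]int64{}}
}

// putIfVersion writes value only if the key's current version matches
// expect (0 meaning "key must not exist yet") and reports whether the
// write happened. This is the optimistic-concurrency building block.
func (s *versionedStore) putIfVersion(key string, value []byte, expect int64) bool {
	if s.versions[key] != expect {
		return false
	}
	s.values[key] = value
	s.versions[key]++
	return true
}

func main() {
	s := newVersionedStore()
	fmt.Println(s.putIfVersion("foo", []byte("bar"), 0))  // create succeeds
	fmt.Println(s.putIfVersion("foo", []byte("bar2"), 0)) // stale version fails
}
```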

For #2647

@xiang90 xiang90 added the v3api label Apr 14, 2015
@xiang90 xiang90 mentioned this pull request Apr 14, 2015
raft_term = 0x1,
kvs = {
{
key = foo,
Contributor

odd spacing

@xiang90 xiang90 force-pushed the v3rfc branch 4 times, most recently from 0e5783c to 44182a0 on April 14, 2015 03:44
optional uint64 raft_term = 5;
}

message RangeRequest {
Contributor

Is the plan for the store to implement MVCC based on index or the version of the key? If so, would a range request be able to filter on a maximum index? I didn't know whether the underlying store would include MVCC, or whether that would be the level above the store (imposed as the last key segment by the etcd simple api).

Contributor Author

Range queries are done atomically. The underlying store is MVCC, so this is low cost.

Contributor

So then adding an optional int64 max_index to this struct would not break the underlying model and also allow you to do range queries where you get the latest resource up to index X?

Contributor

The use case I'm asking about is being able to do consistent paged range queries:

  1. get the initial range min_key=start limit=100 max_index=0
  2. read the index Y and the max key Z from the initial response
  3. get the next range min_key=Z max_index=Y limit=100
  4. repeat
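The stepping scheme above can be sketched against a toy revisioned store; `pagedRange` and the `kv` type are hypothetical stand-ins, not the proposed protobuf API:

```go
package main

import (
	"fmt"
	"sort"
)

// kv is one key plus the index (logical time) at which it was last modified.
type kv struct {
	key, value string
	modIndex   int64
}

// pagedRange returns up to limit keys >= minKey whose modIndex <= maxIndex,
// mimicking a RangeRequest that carries a hypothetical max_index field.
func pagedRange(store []kv, minKey string, maxIndex int64, limit int) []kv {
	sort.Slice(store, func(i, j int) bool { return store[i].key < store[j].key })
	var page []kv
	for _, e := range store {
		if e.key >= minKey && e.modIndex <= maxIndex && len(page) < limit {
			page = append(page, e)
		}
	}
	return page
}

func main() {
	store := []kv{{"a", "1", 1}, {"b", "2", 2}, {"c", "3", 3}, {"d", "4", 5}}
	// Steps 1-2: the first page pins maxIndex to the store's current index (5 here).
	maxIndex := int64(5)
	first := pagedRange(store, "a", maxIndex, 2)
	// Steps 3-4: continue just past the last key of the previous page, same index.
	next := pagedRange(store, first[len(first)-1].key+"\x00", maxIndex, 2)
	fmt.Println(first, next)
}
```

Because every page is filtered at the same pinned index, concurrent writes at higher indexes never leak into the scan, which is the "consistent chunk" property under discussion.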

Contributor Author

@smarterclayton
Let me be more clear here.

foo = bar  [index 1]
foo = bar00 [index 2]
range query all -> foo = bar00; range query (all, max_index=1) -> nothing
foo1 = bar  [index 3]
range query all -> foo = bar00; foo1 = bar; range query (all, max_index=2) -> foo = bar

Is this what you want?

We are not going to keep all versions in the key value space.

Instead, you can get the old versions only by scanning the history.
You can range scan the history via watch.
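Reading the example above with max_index as exclusive (that is, "the state just before index N"), the history-scan behavior can be sketched as follows; `rangeAt` and `event` are illustrative, not etcd code:

```go
package main

import "fmt"

// event is one revision in the history: a key set to a value at an index.
type event struct {
	key, value string
	index      int64
}

// rangeAt replays the history and returns, per key, the latest value
// written strictly before maxIndex (maxIndex = 0 means "latest state").
func rangeAt(history []event, maxIndex int64) map[string]string {
	out := map[string]string{}
	for _, e := range history {
		if maxIndex == 0 || e.index < maxIndex {
			out[e.key] = e.value
		}
	}
	return out
}

func main() {
	history := []event{
		{"foo", "bar", 1},
		{"foo", "bar00", 2},
		{"foo1", "bar", 3},
	}
	fmt.Println(rangeAt(history, 0)) // latest: foo=bar00, foo1=bar
	fmt.Println(rangeAt(history, 1)) // nothing
	fmt.Println(rangeAt(history, 2)) // foo=bar
}
```

The three calls reproduce the thread's worked example exactly, which is why only the history (not the live keyspace) needs to retain old versions.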

Contributor Author

@smarterclayton All the old versions of keys can be retrieved from the history store, which is sorted in logical time order.

Contributor

On Apr 14, 2015, at 12:03 PM, Xiang Li wrote:

> The current API does not directly support consistent range over multiple requests. You can build your own via range + history.
>
> Do you have a real use case? What are you trying to solve?

For Kubernetes, there are advantages to being able to do ranged snapshot queries to process data in chunks (instead of loading 100k keys in memory at once, make 100 1K range requests). We can do that via buckets in the keyspace, but the larger appeal of range was the consistent chunk aspect.

That being said, it does not break our use cases if that's not available. I believe in order to emulate that we would need delete tombstones in the key space and would have to track the highest modified index in the scan. Do you anticipate/expect there to be delete tombstones in the keyspace?



Contributor Author

Do you anticipate/expect there to be delete tombstones in the keyspace?

The tombstones are in the event history. Do a range query with max_index, then fill in the gaps via another range query over the history from max_index to now.

Does that make sense to you?

We can do the consistent query too, but it would make the API a little bit complicated. For each range query, we would return a range ID and a timeout. Within the timeout, you can get a consistent snapshot.
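The gap-filling idea (a max_index range query plus a replay of history from max_index to now, with delete tombstones living in the history rather than the keyspace) can be sketched like this; `histEvent` and `replayFrom` are illustrative names, not RFC types:

```go
package main

import "fmt"

// histEvent is one history entry; tombstone marks a deletion rather than a put.
type histEvent struct {
	key, value string
	index      int64
	tombstone  bool
}

// replayFrom applies history events with index > from onto a snapshot,
// filling the gap between a max_index range query and "now". Tombstones
// remove keys, so deletions are visible without polluting the keyspace.
func replayFrom(snapshot map[string]string, history []histEvent, from int64) map[string]string {
	out := map[string]string{}
	for k, v := range snapshot {
		out[k] = v
	}
	for _, e := range history {
		if e.index <= from {
			continue
		}
		if e.tombstone {
			delete(out, e.key)
		} else {
			out[e.key] = e.value
		}
	}
	return out
}

func main() {
	snap := map[string]string{"foo": "bar"} // state as of index 1
	history := []histEvent{
		{"foo1", "baz", 2, false},
		{"foo", "", 3, true}, // tombstone: foo deleted at index 3
	}
	fmt.Println(replayFrom(snap, history, 1)) // map[foo1:baz]
}
```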

Contributor

On Apr 14, 2015, at 12:21 PM, Xiang Li wrote:

> Do you anticipate/expect there to be delete tombstones in the keyspace?
>
> The tombstones are in the event history. Do a range query with max_index, then fill in the gaps via another range query over the history from max_index to now.
>
> Does that make sense to you?

Yes, I think so.

> We can do the consistent query too. But it would make the API a little bit complicated. For each range query, we will return you a range ID and a timeout. Within the timeout, you can get a consistent snapshot.

I don't want to overcomplicate the store or API, certainly. I think consistent paged range queries are valuable on their own to a wide range of users, as long as it does not break the core consistency guarantees of the store. But I may also be bringing my own assumptions to the discussion. If easy to emulate above, or can lead to complex problems on failover, it doesn't need to be in the API.


Contributor Author

@smarterclayton Sure. We will keep that in mind. We will have another iteration on the API soon.

@xiang90
Contributor Author

xiang90 commented Apr 21, 2015

@barakmich Can you take a look? I want to know your opinion on this.

- easy for people to write simple etcd application


## Protobuf Defined API
Contributor

Split into a .proto file for further comment (and ultimately copy!)

// LeaseTnx is like Tnx. It does one additional thing: all the keys put into the
// store are attached to the given lease. This rpc is useful when you want to
// put a key and attach it to a lease atomically.
rpc LeaseTnx(LeaseTnxRequest) returns (LeaseTnxResponse) {}
Contributor

Should this be separate or be part of the generic Tnx?

Contributor Author

I will explore more. I am not a big fan of this API either.

Contributor Author

Hmm... after thinking about this for a while, I think we should still let the lease know about the tnx (a key space operation) instead of letting the key operation know about the lease.

rpc LeaseKeepAlive(stream LeaseKeepAliveRequest) returns (stream LeaseKeepAliveResponse) {}
}

message RequestHeader {
Contributor

Unused?

optional ResponseHeader header = 1;
}

message DeleteRangeRequest {
Contributor

Is this part of RequestUnion?

@xiang90
Contributor Author

xiang90 commented May 18, 2015

@barakmich I fixed up the proto. Now it compiles. I am going to squash and merge this.

xiang90 added a commit that referenced this pull request May 18, 2015
@xiang90 xiang90 merged commit 6ee5cd9 into etcd-io:master May 18, 2015
@xiang90 xiang90 deleted the v3rfc branch May 18, 2015 20:52
@schmichael
Contributor

Looks like I just missed the merge window for this, but I do think #2857 (non-recursive Create/PUT behavior) would be a useful feature.

@xiang90
Contributor Author

xiang90 commented May 22, 2015

@schmichael In v3, there is no dir concept. So there is no recursive Create/PUT actually...
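Without a dir concept, hierarchy is typically emulated with key prefixes plus a range scan over the flat sorted keyspace; a toy sketch of that pattern (`prefixRange` is illustrative, not the v3 API):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixRange returns all keys under prefix from a flat, sorted keyspace.
// "Recursive" reads fall out of an ordinary range query this way, which is
// why v3 can drop directories entirely.
func prefixRange(keys []string, prefix string) []string {
	sort.Strings(keys)
	var out []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"/jobs/1", "/jobs/2", "/nodes/a", "/jobs/1/attempts/0"}
	fmt.Println(prefixRange(keys, "/jobs/")) // every key under /jobs/, at any depth
}
```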

@deepakthipeswamy

Can I store binary objects as values? If yes, is there a C binding which supports this API?
