WIP, DO NOT MERGE: *: make Range() fine grained #9199
Conversation
related issue: #7719

I'm still working on benchmarking this feature with a realistic data set. I'd be glad to hear opinions about the design. @xiang90 @gyuho @hexfusion

Currently, this PR only cares about serializable reads, but making linearizable reads fine grained would be straightforward because recent read requests are processed in the RPC layer, not in the apply loop.

Also, having a similar mechanism on the client side would be valuable: it could solve the blocking issue and also reduce the peak memory usage of the server. The server-side mechanism is still useful for enforcing a policy of small txns, though.
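For readers skimming the thread, here is a minimal standalone sketch of the idea, not the PR's actual etcdserver code: a single lock-guarded store where one large range is served as many small read transactions so that writers can interleave between them. All names here (kvStore, rangeChunk, fineGrainedRange) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// kvStore stands in for the mvcc backend: one RWMutex guards reads and
// writes, so a single long-running read blocks writers for its whole duration.
type kvStore struct {
	mu   sync.RWMutex
	keys []string
}

// rangeChunk reads at most limit keys starting at index start while holding
// the read lock, then releases it, mimicking one small read transaction.
func (s *kvStore) rangeChunk(start, limit int) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if start >= len(s.keys) {
		return nil
	}
	end := start + limit
	if end > len(s.keys) {
		end = len(s.keys)
	}
	return s.keys[start:end]
}

// fineGrainedRange splits one large range into many small read transactions,
// so writers waiting on mu get a chance to run in between the chunks.
func (s *kvStore) fineGrainedRange(limit int) []string {
	var out []string
	for start := 0; ; start += limit {
		chunk := s.rangeChunk(start, limit)
		out = append(out, chunk...)
		if len(chunk) < limit {
			return out
		}
	}
}

func main() {
	s := &kvStore{}
	for i := 0; i < 25; i++ {
		s.keys = append(s.keys, fmt.Sprintf("key-%03d", i))
	}
	fmt.Println(len(s.fineGrainedRange(10)), "keys read in chunks of 10")
}
```

The trade-off discussed below follows directly from this shape: the total amount of work under the lock is unchanged, so throughput does not improve, but the longest single lock hold (and therefore the writer pause) shrinks.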
Force-pushed 2c0b061 to 750435f.
etcdserver/apply.go
Outdated
resp.Kvs = make([]*mvccpb.KeyValue, len(res.KVs))
for i := range res.KVs {
	if r.KeysOnly {
		res.KVs[i].Value = nil
These lines seem redundant with the later lines for `!r.Serializable`. Could they be shared between the two code paths?
I'm considering unifying the range implementation of the changed serializable path and the existing linearizable path, because it doesn't break semantics. What do you think about this idea?
Yeah, sounds good, as long as we don't copy the same code around.
There are two things we want to solve:

1) In this way, the large read request will not block other smaller requests. However, this will not increase the overall throughput, since the locked critical section remains unchanged and there will still be contention among cores.

2) If we can reduce the critical sections or make them smaller, we can achieve better throughput and utilize multiple cores better. @heyitsanthony already did some work on 2) in a previous release by caching, but it can be improved further. I suggest you read through the related issues before getting started on 2).
@xiang90 yes, increasing parallelism in large ranges will be effective, so I'll work on it. But the main problem in #7719 is reducing the pause time of write transactions. I'll share the benchmarking results of this PR, probably by tomorrow, which show that the change is effective for reducing the pause time (although the throughput of read txns will be degraded).
@mitake agreed. thanks!
Force-pushed 750435f to 26175e8.
@xiang90 I did a rough benchmark on the latest version of this PR.

environment
All nodes are on GCP: 1 client node and 3 server nodes. The instance type is n1-standard-4.

how to initialize etcd
For making Range() heavy, I put 1M keys with the benchmark command:

benchmarking
I ran:

default (equal to current etcd)
--max-range-keys-once 100000

Although it isn't ideal yet, it seems that this PR is successfully breaking … Note that we need to be a little careful when reproducing the results: we must specify the leader node as the endpoint of …
etcdserver/apply.go
Outdated
	lim = r.Limit
}
startKey := r.Key
noEnd := bytes.Compare(rangeEnd, []byte{0}) != 0
Can we add simple `bytes.Compare` documentation? e.g. `// rangeEnd == byte-0`. Also for `bytes.Compare(startKey, rangeEnd) == -1`.
What does `byte-0` mean?
I was saying we should document what this `bytes.Compare` does. `bytes.Compare(rangeEnd, []byte{0}) != 0` is clear since we define it as `noEnd`. Can we godoc `bytes.Compare(startKey, rangeEnd) == -1`?
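A small self-contained sketch of the kind of documentation being asked for. It assumes the etcd convention that a range end of "\x00" means "all keys >= the start key"; the variable names below are illustrative, not the PR's.

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	startKey, rangeEnd := []byte("foo"), []byte("foo9")

	// In etcd's RPC semantics, a range end equal to the single byte 0x00
	// means the request ranges over all keys >= startKey, i.e. there is
	// no explicit end key.
	hasExplicitEnd := bytes.Compare(rangeEnd, []byte{0}) != 0

	// With an explicit end, startKey < rangeEnd means the split loop has
	// not yet walked past the end of the requested range, so another
	// small read transaction is needed.
	moreToRead := hasExplicitEnd && bytes.Compare(startKey, rangeEnd) == -1

	fmt.Println(hasExplicitEnd, moreToRead)
}
```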
I also executed the benchmark with --max-range-keys-once 10000. The latencies of …
Have you tried to limit it to a lower number than 100000? Would p99 get improved? And how long does get take without put and the patch?
@xiang90 yes, #9199 (comment) shows the result of 10000. Both the latency and throughput scores are worse than the scores of 100000, which means shorter read txns don't guarantee low latencies of write txns. I'll share the baseline range performance later, but it seems that Range() performance degrades with the limited read txn size.
what I do not understand is why the p99 of 10,000 is worse than the p99 of 100,000. Can you try to explicitly yield the goroutine after each txn call? I expect the latency of 10,000 to be lower than 100,000.
You mean the goroutine for gathering the results?
i mean call … I am not sure how the Go mutex works internally. It might not be a fair lock, which means that a goroutine might re-enter the critical section right after it leaves if it does not yield to other goroutines. But I am not sure.
@xiang90 I understood what you mean. Probably I should call …
@mitake oh. yes. it is called `Gosched`.
@xiang90 it seems so :) https://golang.org/pkg/runtime/#Gosched
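A minimal sketch of the explicit-yield idea with runtime.Gosched(); the readChunk helper below is hypothetical and just stands in for one small read transaction over part of the requested range, not etcd code.

```go
package main

import (
	"fmt"
	"runtime"
)

// readChunk stands in for one small read transaction over part of the
// requested range (a hypothetical helper, not etcd code).
func readChunk(start, limit int) []int {
	out := make([]int, 0, limit)
	for i := start; i < start+limit; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	const total, limit = 100000, 10000
	var result []int
	for start := 0; start < total; start += limit {
		result = append(result, readChunk(start, limit)...)
		// Yield explicitly between the small read transactions so that
		// goroutines waiting for the backend lock (e.g. Put appliers)
		// get scheduled, instead of this goroutine immediately starting
		// the next chunk; as discussed above, the Go mutex may not be a
		// fair lock.
		runtime.Gosched()
	}
	fmt.Println(len(result), "keys read with a yield after each chunk")
}
```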
Force-pushed 26175e8 to 6c02cda.
@xiang90 I updated based on your explicit-yielding idea. The results (including the baseline range performance) are below. The latency score of …

baseline range performance
default
10,000
100,000

concurrent range and put
default
10,000
100,000
Yes. I still do not fully understand where the improvements come from, and why exactly the Puts are blocked. Let us dig into this a little bit more.
/cc @jpbetz
I did some profiling to understand the behaviour. To make the experiment simple, I made a single-node cluster on my machine. Clients (etcdctl and benchmark) were executed on the same machine. It seems that smaller read txns consume a lot of time in the commit process of the backend.

default
--max-range-keys-once 10000
--max-range-keys-once 100000
Force-pushed 6c02cda to 203ffe8.
I also implemented a client-side mechanism for the same purpose in the second commit. The major TODO on the client side is implementing the sorting feature: the Range() calls can now be split, so the sorting should be performed on the client side too.
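A rough sketch of what such a client-side split could look like against the public clientv3 API (import paths are the pre-module github.com/coreos/etcd ones of that era; getInChunks and its chunk handling are illustrative, not the PR's second commit). It deliberately ignores sort options, which is exactly why sorting becomes a client-side TODO.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/mvcc/mvccpb"
)

// getInChunks reads the key range [key, end) with several small serializable
// Get calls instead of one large Range, advancing the start key after each
// chunk. chunkSize plays the role of the proposed --max-range-keys-once.
func getInChunks(cli *clientv3.Client, key, end string, chunkSize int64) ([]*mvccpb.KeyValue, error) {
	var kvs []*mvccpb.KeyValue
	rev := int64(0) // pin all chunks to the revision of the first response
	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(chunkSize),
			clientv3.WithSerializable(),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev))
		}
		resp, err := cli.Get(context.TODO(), key, opts...)
		if err != nil {
			return nil, err
		}
		if rev == 0 {
			rev = resp.Header.Revision
		}
		kvs = append(kvs, resp.Kvs...)
		if !resp.More || len(resp.Kvs) == 0 {
			return kvs, nil
		}
		// The next chunk starts at the smallest key greater than the last
		// key received, i.e. the last key with a 0x00 byte appended.
		last := resp.Kvs[len(resp.Kvs)-1].Key
		next := make([]byte, len(last)+1)
		copy(next, last)
		key = string(next)
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	kvs, err := getInChunks(cli, "a", "z", 10000)
	if err != nil {
		panic(err)
	}
	fmt.Println("got", len(kvs), "keys")
}
```

Pinning every chunk to the first response's revision keeps the pieces mutually consistent even while concurrent Puts land between the calls.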
Current Range() tries to read the entire requested range of keys in a single read transaction. This can introduce long waits for writer transactions, which can be observed as high latency spikes. To solve this problem, this commit lets Range() split its read transaction in a fine-grained manner. In the intervals between the read transactions, concurrent write RPCs (e.g. Put(), DeleteRange()) get a chance to start their transactions. This commit also adds a new option `--range-max-keys-once` to etcd. With the option, users can specify the maximum number of keys read in a single transaction during Range().
…ingle range txn

This commit adds a similar mechanism for making a read txn smaller on the client side. This is good for reducing the peak memory usage of the server. This commit adds a new flag --max-range-keys-once to `etcdctl get`. By specifying a value for this option, users can use this feature during the execution of the command.
Force-pushed 203ffe8 to a1c7731.
@mitake it is strange that setting the range max size to 10000 increases the number of commits of the batched tx. Read-only requests should have nothing to do with backend commits.
I'm closing this because the problem of large range performance is solved in #10523
Current Range() tries to read the entire requested range of keys in a single read transaction. This can introduce long waits for writer transactions, which can be observed as high latency spikes. To solve this problem, this commit lets serializable Range() split its read transaction in a fine-grained manner. In the intervals between the read transactions, concurrent write RPCs (e.g. Put(), DeleteRange()) get a chance to start their transactions. Serializable read-only Txn() is also turned into the smaller transactions.

This commit also adds a new option `--range-max-keys-once` to etcd. With the option, users can specify the maximum number of keys read in a single transaction during Range().