-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mvcc: Clone for batch index compaction and shorten lock #9511
Conversation
The idea looks right. But as @heyitsanthony commented here: #9384 (comment), we probably should have a better abstraction for traversing the tree in a mvcc manner without holding long for a long time. |
@xiang90 We're not sure that we fully understand the concerns being discussed in your link to #9384 and aren't super confident that we could be the ones to address the need for a better abstraction here without further guidance. Are you looking for any particular changes from us or are you asking for us to wait until the problem being discussed in #9384 is settled? If you could provide us with a more concrete example or some psuedo-code we could certainly take a crack at it. |
Based on the general store lock here this doesn't seem to really alleviate our problem fully and we are still investigating how to maintain consistent throughput during index compactions. |
Basically, we want to iterate the tree index without locking the lock for a long time. We can provide a generic func say: t.fuzzyAscend just like t.tree.Ascend to do it rather than exposing the details of locking in our business logic func. |
5a286c2
to
eaee6fa
Compare
During more extensive testing, we discovered that the lock we previously mentioned was in fact preventing updates while the index was being compacted. Even batching in groups of ten thousand is enough to cause noticeable latency. We were finally able to achieve consistent throughput during compactions by pushing the lock on the tree index down into the |
@jcalvert We cannot really modify the vendored btree. We have to push the change to upstream btree pkg first. |
do you call yield to allow a reschedule of the go routine? i previously did some benchmark myself, and found batching of 10,000 should be good enough. |
@xiang90 That change is already in a more recent version of the btree package. Did I not do the vendoring process correctly? |
mvcc/index_test.go
Outdated
@@ -17,7 +17,7 @@ package mvcc | |||
import ( | |||
"reflect" | |||
"testing" | |||
|
|||
"time" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange. why this is added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. Removing.
No, we did not try that. You mean adding |
@jcalvert lazily clone is probably better if provided by the btree library already. |
@jcalvert We should already have the latest btree depedency. Can you try |
@jcalvert rebase with current master? |
eb74545
to
fd4e132
Compare
@xiang90 rebased. Thank you. |
mvcc/index.go
Outdated
|
||
clone.Ascend(func(item btree.Item) bool { | ||
keyi := item.(*keyIndex) | ||
ti.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment on why the lock is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment about why the lock is needed. Let us know if this is not sufficient. Thank you.
mvcc/index_bench_test.go
Outdated
func BenchmarkIndexCompact100000(b *testing.B) { benchmarkIndexCompact(b, 100000) } | ||
func BenchmarkIndexCompact1000000(b *testing.B) { benchmarkIndexCompact(b, 1000000) } | ||
|
||
func benchmarkIndexCompact(b *testing.B, size int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have a result for the benchmark with a comparison with the result before the patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On local development machine -
master:
go test github.com/coreos/etcd/mvcc -v -run=^$ -bench BenchmarkIndexCompact
goos: linux
goarch: amd64
pkg: github.com/coreos/etcd/mvcc
BenchmarkIndexCompact1-16 1000000 1655 ns/op
BenchmarkIndexCompact100-16 100000 19772 ns/op
BenchmarkIndexCompact10000-16 2000 880883 ns/op
BenchmarkIndexCompact100000-16 200 8725995 ns/op
BenchmarkIndexCompact1000000-16 100 386072323 ns/op
PASS
this branch:
go test github.com/coreos/etcd/mvcc -v -run=^$ -bench BenchmarkIndexCompact
goos: linux
goarch: amd64
pkg: github.com/coreos/etcd/mvcc
BenchmarkIndexCompact1-16 1000000 1572 ns/op
BenchmarkIndexCompact100-16 100000 22216 ns/op
BenchmarkIndexCompact10000-16 2000 1058001 ns/op
BenchmarkIndexCompact100000-16 100 10484775 ns/op
BenchmarkIndexCompact1000000-16 100 423129293 ns/op
Added time likely due to the additional lock/unlock calls.
mvcc/kvstore.go
Outdated
ch := make(chan struct{}) | ||
var j = func(ctx context.Context) { | ||
keep := s.kvindex.Compact(rev) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this needs to be changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The defer of s.mu.Unlock()
seems to mean that throughput still stalls during index compaction, although on audit of the code it isn't clear how this is the case. We can look to see if we can create a test to validate that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mean why do we move the compact func into the schedule j unit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemed cleaner to put it there rather than remove the deferred unlock in place of explicitly unlocking in the error cases. This could be mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Transactions do indeed touch this lock so by offloading it to the scheduled action it means that transactions won't time out waiting for the lock. If you'd prefer us to change it so that the code explicitly unlocks we can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i do not think this is safe since this func expects to be returned after the index is compacted (the old revisions are not reachable). we still need to keep the index compaction synchronously
mvcc/index.go
Outdated
@@ -17,7 +17,6 @@ package mvcc | |||
import ( | |||
"sort" | |||
"sync" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert this change.
fd4e132
to
48554b4
Compare
mvcc/index_bench_test.go
Outdated
import ( | ||
"testing" | ||
|
||
"go.uber.org/zap" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format? and why do we need this pkg?
lgtm. defer to @gyuho |
mvcc/index_bench_test.go
Outdated
|
||
func benchmarkIndexCompact(b *testing.B, size int) { | ||
log := zap.NewNop() | ||
kvindex := newTreeIndex(log) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you run gofmt -w *.go
on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. why the tree index takes log as its arg? strange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiang90 I recently added structured logger as an option (where the logger object is created at top level and passed downstream). Couldn't find an easier way to pass it around (will try to find a cleaner way). We can just pass nil
for testing.
@jcalvert Thanks again for hard work!
Could you phrase this into a short release note https://github.com/coreos/etcd/blob/master/CHANGELOG-3.4.md#improved with a link to this PR, so people know who worked on this? Also, do you have reproducible workloads that would benefit from this patch, so that I can cross-check on my side? |
It does not make sense to pass log object at func level in my opinion. But it has nothing to do with this PR though. |
@xiang90 Agree. I will find a cleaner way! |
|
CHANGELOG-3.4.md
Outdated
@@ -34,6 +34,7 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [ | |||
- e.g. a node is removed from cluster, or [`raftpb.MsgProp` arrives at current leader while there is an ongoing leadership transfer](https://github.com/coreos/etcd/issues/8975). | |||
- Add [`snapshot`](https://github.com/coreos/etcd/pull/9118) package for easier snapshot workflow (see [`godoc.org/github.com/etcd/snapshot`](https://godoc.org/github.com/coreos/etcd/snapshot) for more). | |||
- Improve [functional tester](https://github.com/coreos/etcd/tree/master/functional) coverage: [proxy layer to run network fault tests in CI](https://github.com/coreos/etcd/pull/9081), [TLS is enabled both for server and client](https://github.com/coreos/etcd/pull/9534), [liveness mode](https://github.com/coreos/etcd/issues/9230), [shuffle test sequence](https://github.com/coreos/etcd/issues/9381), [membership reconfiguration failure cases](https://github.com/coreos/etcd/pull/9564), [disastrous quorum loss and snapshot recover from a seed member](https://github.com/coreos/etcd/pull/9565), [embedded etcd](https://github.com/coreos/etcd/pull/9572). | |||
- Improve [index compaction throughput](https://github.com/coreos/etcd/pull/9511) by using a copy on write clone to avoid holding the lock for the traversal of the entire index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it wont improve the compaction throughput. it solves the blocking problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
4ffe407
to
9c451a3
Compare
9c451a3
to
cef166b
Compare
@gyuho Can we squash the commits into one? |
For compaction, clone the original Btree for traversal purposes, so as to not hold the lock for the duration of compaction. This allows read/write throughput by not blocking when the index tree is large (> 1M entries). mvcc: add comment for index compaction lock mvcc: explicitly unlock store to do index compaction synchronously mvcc: formatting index bench mvcc: add release note for index compaction changes mvcc: add license header
cef166b
to
f176427
Compare
@xiang90 Done. |
lgtm |
This is to address #9506 - When the BTree index grows to millions of entries, index compaction must iterate over all of them and this can take a substantial amount of time while continuing to hold the lock. This lock means that any concurrent read/writes will wait until compaction completes. By breaking compaction into batches of 10000, similar to the backend compaction, this allows relief for contention of the index lock.
We came up with the following test as a way of validating that there is contention that blocks puts to the index while compaction is ongoing. On the
master
branch this test will fail, but it passes with our provided changes. We felt this test was useful for us in proving the issue but does not fit well into the existing test suite for etcd as far as we can tell.Coauthored with @cosgroveb