
mvcc/backend: Optimize etcd compaction #11021

Closed

Conversation

jpbetz
Contributor

@jpbetz jpbetz commented Aug 12, 2019

etcd's current compaction batch size is 10k, and compactions are executed via an mvcc backend write transaction. This can result in severely degraded p99 latency under heavy load each time a compaction occurs, due in large part to the batch size.

This PR reduces the batch size to 1k.

Also, the compaction writes are currently accumulated in the mvcc backend write buffer, which must later perform a large write while holding the mvcc backend's main read/write lock, preventing new reads from initializing (ongoing reads may progress concurrently, but new ones cannot begin). Since compaction removes only boltdb records for revisions that are no longer visible to readers, the main benefit of writing via an mvcc backend write transaction--buffering writes in a way that ensures they remain visible to readers--is not needed. This PR therefore performs compaction writes directly against boltdb, avoiding lock contention in the mvcc backend and skipping the write buffer. This improves performance significantly. The kube sig-scalability team tested this against 5k kubemark clusters and validated that it largely eliminates the degraded p99 latency.

Even though this is simpler and more efficient, the boltdb writes that perform compaction block the mvcc backend write buffer from performing writeback. When this happens, the writeback will hold the main mvcc backend read/write lock, preventing new reads from initializing. Reducing the batch size to 1k helps minimize this (although if the batch size is too low, we risk losing throughput).
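
For illustration, a minimal, self-contained sketch (not the PR's actual code; the bucket name, batch limit, and pause are assumptions) of the approach described above: delete at most 1,000 compactable revision keys per batch directly through a plain boltdb write transaction, pausing between batches so normal traffic can make progress.

package compactsketch

import (
	"bytes"
	"time"

	bolt "go.etcd.io/bbolt"
)

// compactBatch deletes up to limit keys at or below endKey from bucket using a
// plain boltdb write transaction instead of the mvcc backend's buffered write
// transaction. Keys listed in keep (latest revisions of live keys) are retained.
func compactBatch(db *bolt.DB, bucket, endKey []byte, keep map[string]struct{}, limit int) (int, error) {
	var toDelete [][]byte
	err := db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
		if b == nil {
			return nil // nothing to compact yet
		}
		// First pass: collect up to limit compactable keys.
		c := b.Cursor()
		for k, _ := c.First(); k != nil && len(toDelete) < limit; k, _ = c.Next() {
			if bytes.Compare(k, endKey) > 0 {
				break // past the compaction revision
			}
			if _, ok := keep[string(k)]; !ok {
				toDelete = append(toDelete, append([]byte(nil), k...))
			}
		}
		// Second pass: delete the collected keys in the same write transaction.
		for _, k := range toDelete {
			if err := b.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
	return len(toDelete), err
}

// compact runs compactBatch until a batch comes back short, pausing between
// batches; the 1k limit and 10ms pause are illustrative, not tuned values.
func compact(db *bolt.DB, bucket, endKey []byte, keep map[string]struct{}) error {
	const batchLimit = 1000
	for {
		n, err := compactBatch(db, bucket, endKey, keep, batchLimit)
		if err != nil || n < batchLimit {
			return err
		}
		time.Sleep(10 * time.Millisecond)
	}
}

A production version would carry the scan position across batches instead of restarting from the first key; the sketch restarts for brevity, but the batching and direct-delete pattern is the same.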

Potential future work:

  • Find a more optimal batch size.
  • Improve how the mvcc backend manages locks so that it acquires the main mvcc backend read/write lock only after it has acquired the boltdb write lock, to increase concurrency.

There is a potential edge case where writes in the write buffer that are not yet flushed to boltdb become eligible for compaction. Because this PR deletes compacted revisions directly from boltdb, those revisions would not be removed when they should be. This is not a problem, however, because they are not visible to readers, and the next compaction after the write buffer is flushed will delete them from boltdb.

mvcc/kvstore_compaction.go (outdated review thread, resolved)
version/version.go (outdated review thread, resolved)
@gyuho gyuho added this to the etcd-v3.4 milestone Aug 12, 2019
@jpbetz jpbetz force-pushed the compaction-opt branch 2 times, most recently from 16f53ef to e2c3769 Compare August 12, 2019 19:55
@jpbetz jpbetz changed the title WIP - Optimize etcd compaction Optimize etcd compaction Aug 12, 2019
@jpbetz
Contributor Author

jpbetz commented Aug 12, 2019

@jingyih I remember we had an alternative option where we would still write through the mvcc backend write transaction but avoid buffering the write. I've summarized in the description my reasoning for why I think this is the simplest approach. Feedback welcome.

@jpbetz
Contributor Author

jpbetz commented Aug 12, 2019

@gyuho I was thinking of trying to get this into etcd 3.4 at some point since it's a non-functional performance enhancement. Maybe 3.4.1?

@gyuho
Contributor

gyuho commented Aug 12, 2019

@jpbetz Yes, let's get this in 3.4.0.

mvcc/kvstore_compaction.go (outdated review thread, resolved)
@xiang90
Contributor

xiang90 commented Aug 12, 2019

At Alibaba, we changed this number to something like 1000 for large-scale k8s. I remember we also made the number configurable. /cc @WIZARD-CXY can you confirm?

@jpbetz jpbetz changed the title Optimize etcd compaction mvcc/backend: Optimize etcd compaction Aug 12, 2019
@WIZARD-CXY
Contributor

We ran various benchmark tests and found that compaction indeed hurts performance. At Alibaba we set the batch size to 100 and the compaction batch interval to 10ms as well.

case <-time.After(10 * time.Millisecond):

@WIZARD-CXY
Contributor

The right batch size and compaction interval depend on disk performance and the workload actually running. It is hard to pick one best parameter, so making these configurable is necessary.
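
For illustration, a minimal sketch of making both knobs configurable; the names, defaults, and types below are illustrative and are not etcd's actual flags or configuration fields.

package compactsketch

import "time"

// CompactionConfig holds the two knobs discussed above: how many revisions to
// delete per batch and how long to pause between batches.
type CompactionConfig struct {
	BatchLimit    int           // e.g. 100 or 1000 revisions per batch
	BatchInterval time.Duration // e.g. 10 * time.Millisecond between batches
}

// runCompaction calls doBatch until a batch comes back short (nothing left to
// compact), pausing BatchInterval between batches and stopping early if stopc
// is closed, mirroring the select/time.After pattern quoted above.
func runCompaction(cfg CompactionConfig, doBatch func(limit int) (int, error), stopc <-chan struct{}) error {
	for {
		n, err := doBatch(cfg.BatchLimit)
		if err != nil || n < cfg.BatchLimit {
			return err
		}
		select {
		case <-time.After(cfg.BatchInterval):
		case <-stopc:
			return nil
		}
	}
}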

@jingyih
Contributor

jingyih commented Aug 13, 2019

Looks good to me in general. My initial concern is the discrepancy in views of the key space between s.b.ConcurrentReadTx() and s.b.Compact(); this is, however, already addressed by the PR description.

But we should investigate the test failures.

@gyuho
Contributor

gyuho commented Aug 13, 2019

@jpbetz Can we rebase from current master? And fix the test failures?

zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
plog.Printf("finished scheduled compaction of %d keys at %d (took %v)", keyCompactions, compactMainRev, time.Since(totalStart))
}
return true
Contributor

We are missing tx.Unlock() here

Contributor

Oh, nvm. See it at line 92.

@gyuho
Copy link
Contributor

gyuho commented Aug 13, 2019

Also, let's highlight this in the CHANGELOG.

Contributor

@jingyih jingyih left a comment

Added a few comments. Thanks!

zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction batch of %d keys at %d (took %v)", batchCompactions, compactMainRev, time.Since(batchStart))
Contributor

nit: use Infof to make the log level more explicit?

zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
plog.Printf("finished scheduled compaction of %d keys at %d (took %v)", keyCompactions, compactMainRev, time.Since(totalStart))
Contributor

nit: use Infof to make the log level more explicit?

@@ -275,6 +279,19 @@ type IgnoreKey struct {
Key string
}

func (b *backend) Compact(bucket []byte, keys [][]byte) error {
return b.db.Update(func(tx *bolt.Tx) error {
b := tx.Bucket(bucket)
Contributor

check if bucket exists?

@jpbetz
Contributor Author

jpbetz commented Aug 14, 2019

Superseded by #11034

We found a problem with this approach: commits get backed up behind the write-ahead buffer's boltdb write lock, which it keeps open except when it commits writes to boltdb, after which it immediately reopens the write lock.

@jpbetz jpbetz closed this Aug 14, 2019
@wenjiaswe
Contributor

I think this one is superseded by #11034 :)

@jpbetz
Contributor Author

jpbetz commented Aug 14, 2019

Thanks @jingyih and @gyuho for the reviews and sorry I had to change approaches at the 11th hour.

@jingyih
Contributor

jingyih commented Aug 14, 2019

@jpbetz Could you update the PR number in comment #11021 (comment)? As @wenjiaswe pointed out, it should be 11034. Thanks!

@gyuho
Contributor

gyuho commented Aug 15, 2019

@jingyih @jpbetz

commits get backed up behind the write-ahead buffer's boltdb write lock, which it keeps open except when it commits writes to boltdb, after which it immediately reopens the write lock.

Not sure if I understand this correctly. Do you mean that if we use a concurrent read tx, the compacted keys could be stale compared to the write buffer? How did this manifest in your tests?

jingyih added a commit to jingyih/website that referenced this pull request Sep 3, 2019
Remove mentioning of using concurrent read in compaction. The original PR etcd-io/etcd#11021 was superseded by etcd-io/etcd#11034, which no longer uses concurrent read in compaction.
jingyih added a commit to jingyih/website that referenced this pull request Sep 13, 2019
Remove mentioning of using concurrent read in compaction. The original PR etcd-io/etcd#11021 was superseded by etcd-io/etcd#11034, which no longer uses concurrent read in compaction.
k8s-ci-robot pushed a commit to kubernetes/website that referenced this pull request Sep 13, 2019
Remove mentioning of using concurrent read in compaction. The original PR etcd-io/etcd#11021 was superseded by etcd-io/etcd#11034, which no longer uses concurrent read in compaction.
it2911 pushed a commit to it2911/kubernetes-website that referenced this pull request Sep 17, 2019
Remove mentioning of using concurrent read in compaction. The original PR etcd-io/etcd#11021 was superseded by etcd-io/etcd#11034, which no longer uses concurrent read in compaction.
wahyuoi pushed a commit to wahyuoi/website that referenced this pull request Sep 18, 2019
Remove mentioning of using concurrent read in compaction. The original PR etcd-io/etcd#11021 was superseded by etcd-io/etcd#11034, which no longer uses concurrent read in compaction.
@WIZARD-CXY
Contributor

@jingyih @jpbetz I think we can work on this optimization of the compaction process again. I think it is very useful.

@jingyih
Contributor

jingyih commented Oct 16, 2019

@WIZARD-CXY Sounds good! Thanks!

@WIZARD-CXY
Contributor

WIZARD-CXY commented Jan 10, 2020

@jingyih @jpbetz Currently etcd does the index compaction first and uses it to build the set of keys (revisions) we want to keep (call it the keepSet), then ranges over boltdb to get all revisions below the compaction revision and subtracts the keepSet to get the keys to delete from boltdb.
I came up with another way: we can get the old revisions directly from the index compaction phase and use them to delete the revisions we don't want to keep. This way we no longer need the boltdb range scan, which saves a lot of time.
I implemented this idea and gave it a try, and the results are promising.
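
For illustration, a minimal, self-contained sketch of that idea, assuming the in-memory index can report the revisions it dropped during its own compaction; the key encoding follows mvcc's 17-byte revision layout (8-byte main revision, '_' separator, 8-byte sub revision), and everything else below is illustrative.

package compactsketch

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// revToKey encodes a (main, sub) revision into the 17-byte key format used by
// the mvcc key bucket.
func revToKey(main, sub int64) []byte {
	k := make([]byte, 17)
	binary.BigEndian.PutUint64(k[0:8], uint64(main))
	k[8] = '_'
	binary.BigEndian.PutUint64(k[9:17], uint64(sub))
	return k
}

// deleteDroppedRevisions deletes exactly the revisions the index reported as
// dropped, in one boltdb write transaction, with no range scan over the bucket.
func deleteDroppedRevisions(db *bolt.DB, dropped [][2]int64) error {
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return nil // nothing stored yet
		}
		for _, rev := range dropped {
			if err := b.Delete(revToKey(rev[0], rev[1])); err != nil {
				return err
			}
		}
		return nil
	})
}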

@jingyih
Contributor

jingyih commented Jan 20, 2020

@WIZARD-CXY Sounds good! Could you send out a PR?

@WIZARD-CXY
Contributor

@jingyih OK. I will try to make it happen before the Spring Festival.
