mvcc/backend: Optimize etcd compaction #11021
Conversation
16f53ef to e2c3769
@jingyih I remember we had an alternative option where we would still write through the mvcc backend write transaction but avoid buffering the write. I've summarized in the description my reasoning on why I think this is the simplest approach. Feedback welcome.
@gyuho I was thinking of trying to get this into etcd 3.4 at some point since it's a non-functional performance enhancement. Maybe 3.4.1?
@jpbetz Yes, let's get this in 3.4.0.
At Alibaba, we changed this number to something like 1000 for large-scale k8s. I remember we also made the number configurable. /cc @WIZARD-CXY can you confirm?
We ran various benchmark tests and found that compaction indeed hurts performance. At Alibaba we set this batch size to 100 and the compaction batch interval to 10ms.
Picking the batch size and compaction interval depends on the disk performance and the workload actually running, so it is hard to pick one best parameter. Making these configurable is necessary.
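As an aside, a minimal Go sketch of what making these parameters configurable could look like; the type and field names here are hypothetical, not etcd's actual configuration surface:

```go
package mvcc // assumed package name for this sketch

import "time"

// compactionConfig is a hypothetical shape for configurable compaction
// parameters; the field names are illustrative, not etcd's actual flags.
type compactionConfig struct {
	// BatchLimit caps how many revisions are deleted per compaction batch
	// (e.g. 100 in the Alibaba setup described above, 1000 in this PR).
	BatchLimit int
	// BatchInterval is the pause between batches so that other reads and
	// writes can make progress (e.g. 10ms in the Alibaba setup).
	BatchInterval time.Duration
}

// defaultCompactionConfig returns example defaults for the sketch.
func defaultCompactionConfig() compactionConfig {
	return compactionConfig{
		BatchLimit:    1000,
		BatchInterval: 10 * time.Millisecond,
	}
}
```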
Looks good to me in general. My initial concern is the discrepancy in views of the key space between […]. But we should investigate the test failures.
@jpbetz Can we rebase from current master? And fix the test failures?
zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
plog.Printf("finished scheduled compaction of %d keys at %d (took %v)", keyCompactions, compactMainRev, time.Since(totalStart))
}
return true
We are missing tx.Unlock() here.
Oh, nvm. See it at line 92.
Also, let's highlight this in the CHANGELOG.
Added a few comments. Thanks!
zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction batch of %d keys at %d (took %v)", batchCompactions, compactMainRev, time.Since(batchStart))
nit: use Infof to make the log level more explicit?
zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
plog.Printf("finished scheduled compaction of %d keys at %d (took %v)", keyCompactions, compactMainRev, time.Since(totalStart))
nit: use Infof to make the log level more explicit?
@@ -275,6 +279,19 @@ type IgnoreKey struct {
	Key string
}

func (b *backend) Compact(bucket []byte, keys [][]byte) error {
	return b.db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
check if bucket exists?
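For illustration, a minimal sketch of what the suggested nil-bucket check could look like, assuming the go.etcd.io/bbolt API used in the diff (the bucket variable is renamed to avoid shadowing the receiver); this is a sketch of the idea, not the merged implementation:

```go
package backend // assumed package name for this sketch

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

// backend is a minimal stand-in for etcd's mvcc backend struct, which wraps
// a bolt.DB; only the field needed for this sketch is included.
type backend struct {
	db *bolt.DB
}

// Compact deletes the given keys from boltdb in a single write transaction,
// adding the nil-bucket check suggested in the review.
func (b *backend) Compact(bucketName []byte, keys [][]byte) error {
	return b.db.Update(func(tx *bolt.Tx) error {
		bkt := tx.Bucket(bucketName)
		if bkt == nil {
			// Return a clear error instead of dereferencing a nil bucket.
			return fmt.Errorf("bucket %q does not exist", bucketName)
		}
		for _, key := range keys {
			if err := bkt.Delete(key); err != nil {
				return err
			}
		}
		return nil
	})
}
```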
Superseded by #11034. We found a problem with this approach where the commits get backed up behind the write-ahead buffer's boltdb write lock, which it keeps open except when it commits writes to boltdb, after which it immediately reopens the write lock.
I think this one is superseded by #11034 :)
@jpbetz Could you update the PR number in comment #11021 (comment)? As @wenjiaswe pointed out, it should be 11034. Thanks!
Not sure if I understand this correctly. Do you mean that if we use a concurrent read tx, the compacted keys could be stale compared to the write buffer? How did this manifest in your tests?
Remove mention of using concurrent read in compaction. The original PR etcd-io/etcd#11021 was superseded by etcd-io/etcd#11034, which no longer uses concurrent read in compaction.
@WIZARD-CXY Sounds good! Thanks!
@jingyih @jpbetz Now etcd does the index compaction first and uses it to get the keys (revisions) we want to keep (in this case we call it the keepSet), then ranges over boltdb to get all revisions under the compaction revision and subtracts the keepSet to get the keys we want to delete from boltdb.
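For illustration, a rough Go sketch of the keepSet approach described in the comment above; the helper names, keying the keepSet by main revision only, and the 8-byte big-endian key prefix are simplifying assumptions, not etcd's exact implementation:

```go
package mvcc // assumed package name for this sketch

import (
	"encoding/binary"
	"fmt"

	bolt "go.etcd.io/bbolt"
)

// compactWithKeepSet deletes every revision key at or below compactRev that
// is not in keepSet, which is assumed to come from the index compaction.
func compactWithKeepSet(db *bolt.DB, bucketName []byte, compactRev int64, keepSet map[int64]struct{}) error {
	return db.Update(func(tx *bolt.Tx) error {
		bkt := tx.Bucket(bucketName)
		if bkt == nil {
			return fmt.Errorf("bucket %q does not exist", bucketName)
		}
		var toDelete [][]byte
		c := bkt.Cursor()
		for k, _ := c.First(); k != nil; k, _ = c.Next() {
			// Revision keys are assumed to start with an 8-byte big-endian
			// main revision, so iteration order is revision order.
			rev := int64(binary.BigEndian.Uint64(k[:8]))
			if rev > compactRev {
				break
			}
			if _, keep := keepSet[rev]; !keep {
				// Copy the key before mutating the bucket.
				toDelete = append(toDelete, append([]byte(nil), k...))
			}
		}
		for _, k := range toDelete {
			if err := bkt.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}
```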
@WIZARD-CXY Sounds good! Could you send out a PR?
@jingyih OK. I will try to make it happen before the Spring Festival.
etcd's current compaction batch size is 10k, and compactions are executed via the mvcc backend write transaction. This can result in severely degraded p99 latency under heavy load each time a compaction occurs, due in large part to the batch size.
This PR reduces the batch size to 1k.
Also, the compaction writes are currently accumulated in the mvcc backend write buffer, which must later perform a large write while holding the mvcc backend's main read/write lock, preventing new reads from initializing (i.e. ongoing reads may progress concurrently, but new ones cannot begin). Since the compaction process removes only boltdb records for revisions that are no longer visible to readers, the main benefit of writing via an mvcc backend write transaction (buffering writes in a way that ensures they remain visible to readers) is not needed. So this PR performs compaction writes directly against boltdb, avoiding lock contention in the mvcc backend and skipping the write buffer. This improves performance significantly. The kube sig-scalability team tested this against 5k kubemark clusters and validated that it largely eliminates the degraded p99 latency.
Even though this is simpler and more efficient, the boltdb writes that perform compaction still block the mvcc backend write buffer from performing writeback. When this happens, the writeback holds the main mvcc backend read/write lock, preventing new reads from initializing. Reducing the batch size to 1k helps minimize this (though if the batch size is too low, we risk losing throughput).
Potential future work:
There is a potential edge case where writes in the write buffer that are not yet flushed to boltdb become eligible for compaction. Because compacted revisions are deleted directly from boltdb, these revisions would not be compacted as they should be. But this is not a problem: they are not visible to readers, and the next compaction operation after the write buffer is flushed will delete them from boltdb.
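For illustration, a hedged sketch of how the compaction deletes could be driven in 1k batches directly against boltdb, per the description above; backendCompact stands in for the backend.Compact method shown in the diff, and the keys slice is assumed to already exclude revisions that must be kept:

```go
package mvcc // assumed package name for this sketch

// compactionBatchSize mirrors the 1k batch size described above.
const compactionBatchSize = 1000

// compactInBatches submits the compaction deletes in fixed-size batches.
func compactInBatches(backendCompact func(bucket []byte, keys [][]byte) error, bucket []byte, keys [][]byte) error {
	for start := 0; start < len(keys); start += compactionBatchSize {
		end := start + compactionBatchSize
		if end > len(keys) {
			end = len(keys)
		}
		// Each batch is its own boltdb write transaction, so the write lock
		// is released between batches and readers can make progress.
		if err := backendCompact(bucket, keys[start:end]); err != nil {
			return err
		}
	}
	return nil
}
```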