mvcc/backend: Optimize etcd compaction #11021
Conversation
16f53ef to e2c3769
@jingyih I remember we had an alternative option where we would still write through the mvcc backend write transaction but avoid buffering the write. I've summarized in the description my reasoning on why I think this is the simplest approach. Feedback welcome.
@gyuho I was thinking of trying to get this into etcd 3.4 at some point since it's a non-functional performance enhancement. Maybe 3.4.1?
@jpbetz Yes, let's get this in 3.4.0.
At Alibaba, we changed this number to something like 1000 for large-scale k8s. I remember we also made the number configurable. /cc @WIZARD-CXY can you confirm?
We ran various benchmark tests and found that compaction indeed hurts performance. At Alibaba we set this batch size to 100 and the compaction batch interval to 10ms.
Picking the batch size and compaction interval depends on the disk performance and the workload actually running, so it is hard to pick one best parameter. Making these configurable is necessary.
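As an aside, a minimal Go sketch of what making these parameters configurable could look like; the type and field names here are hypothetical, not etcd's actual configuration surface:

```go
package mvcc // assumed package name for this sketch

import "time"

// compactionConfig is a hypothetical shape for configurable compaction
// parameters; the field names are illustrative, not etcd's actual flags.
type compactionConfig struct {
	// BatchLimit caps how many revisions are deleted per compaction batch
	// (e.g. 100 in the Alibaba setup described above, 1000 in this PR).
	BatchLimit int
	// BatchInterval is the pause between batches so that other reads and
	// writes can make progress (e.g. 10ms in the Alibaba setup).
	BatchInterval time.Duration
}

// defaultCompactionConfig returns example defaults for the sketch.
func defaultCompactionConfig() compactionConfig {
	return compactionConfig{
		BatchLimit:    1000,
		BatchInterval: 10 * time.Millisecond,
	}
}
```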
Looks good to me in general. My initial concern is the discrepancy in views of the key space between […]. But we should investigate the test failures.
@jpbetz Can we rebase from current master? And fix the test failures?
zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
plog.Printf("finished scheduled compaction of %d keys at %d (took %v)", keyCompactions, compactMainRev, time.Since(totalStart))
}
return true
We are missing tx.Unlock() here.
Oh, nvm. See it at line 92.
Also, let's highlight this in the CHANGELOG.
Added a few comments. Thanks!
zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction batch of %d keys at %d (took %v)", batchCompactions, compactMainRev, time.Since(batchStart))
nit: use Infof to make the log level more explicit?
zap.Duration("took", time.Since(totalStart)),
)
} else {
plog.Printf("finished scheduled compaction at %d (took %v)", compactMainRev, time.Since(totalStart))
plog.Printf("finished scheduled compaction of %d keys at %d (took %v)", keyCompactions, compactMainRev, time.Since(totalStart))
nit: use Infof to make the log level more explicit?
@@ -275,6 +279,19 @@ type IgnoreKey struct {
	Key string
}

func (b *backend) Compact(bucket []byte, keys [][]byte) error {
	return b.db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
check if bucket exists?
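For illustration, a minimal sketch of what the suggested nil-bucket check could look like, assuming the go.etcd.io/bbolt API used in the diff (the bucket variable is renamed to avoid shadowing the receiver); this is a sketch of the idea, not the merged implementation:

```go
package backend // assumed package name for this sketch

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

// backend is a minimal stand-in for etcd's mvcc backend struct, which wraps
// a bolt.DB; only the field needed for this sketch is included.
type backend struct {
	db *bolt.DB
}

// Compact deletes the given keys from boltdb in a single write transaction,
// adding the nil-bucket check suggested in the review.
func (b *backend) Compact(bucketName []byte, keys [][]byte) error {
	return b.db.Update(func(tx *bolt.Tx) error {
		bkt := tx.Bucket(bucketName)
		if bkt == nil {
			// Return a clear error instead of dereferencing a nil bucket.
			return fmt.Errorf("bucket %q does not exist", bucketName)
		}
		for _, key := range keys {
			if err := bkt.Delete(key); err != nil {
				return err
			}
		}
		return nil
	})
}
```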
Superseded by #11034. We found a problem with this approach where the commits get backed up behind the write-ahead buffer's boltdb write lock, which it keeps open except when it commits writes to boltdb, after which it immediately reopens the write lock.
I think this one is superseded by #11034 :)
@jpbetz Could you update the PR number in comment #11021 (comment)? As @wenjiaswe pointed out, it should be 11034. Thanks!
Not sure if I understand this correctly. Do you mean that if we use a concurrent read tx, the compacted keys could be stale compared to the write buffer? How did this manifest in your tests?
Remove mention of using concurrent read in compaction. The original PR etcd-io/etcd#11021 was superseded by etcd-io/etcd#11034, which no longer uses concurrent read in compaction.
@WIZARD-CXY Sounds good! Thanks!
@jingyih @jpbetz Now etcd does the index compaction first and uses it to get the keys (revisions) we want to keep (in this case we call it the keepSet), then ranges over boltdb to get all revisions under the compaction revision and subtracts the keepSet to get the keys we want to delete from boltdb.
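For illustration, a rough Go sketch of the keepSet approach described in the comment above; the helper names, keying the keepSet by main revision only, and the 8-byte big-endian key prefix are simplifying assumptions, not etcd's exact implementation:

```go
package mvcc // assumed package name for this sketch

import (
	"encoding/binary"
	"fmt"

	bolt "go.etcd.io/bbolt"
)

// compactWithKeepSet deletes every revision key at or below compactRev that
// is not in keepSet, which is assumed to come from the index compaction.
func compactWithKeepSet(db *bolt.DB, bucketName []byte, compactRev int64, keepSet map[int64]struct{}) error {
	return db.Update(func(tx *bolt.Tx) error {
		bkt := tx.Bucket(bucketName)
		if bkt == nil {
			return fmt.Errorf("bucket %q does not exist", bucketName)
		}
		var toDelete [][]byte
		c := bkt.Cursor()
		for k, _ := c.First(); k != nil; k, _ = c.Next() {
			// Revision keys are assumed to start with an 8-byte big-endian
			// main revision, so iteration order is revision order.
			rev := int64(binary.BigEndian.Uint64(k[:8]))
			if rev > compactRev {
				break
			}
			if _, keep := keepSet[rev]; !keep {
				// Copy the key before mutating the bucket.
				toDelete = append(toDelete, append([]byte(nil), k...))
			}
		}
		for _, k := range toDelete {
			if err := bkt.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}
```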
@WIZARD-CXY Sounds good! Could you send out a PR?
@jingyih OK. I will try to make it happen before the Spring Festival.
etcd's current compaction batch size is 10k, and compactions are executed via the mvcc backend write transaction. This can result in severely degraded p99 latency under heavy load each time a compaction occurs, due in large part to the batch size.
This PR reduces the batch size to 1k.
Also, the compaction writes are currently accumulated in the mvcc backend write buffer, which must later perform a large write while holding the mvcc backend's main read/write lock, preventing new reads from initializing (i.e. ongoing reads may progress concurrently, but new ones cannot begin). Since the compaction process removes only boltdb records for revisions that are no longer visible to readers, the main benefit of writing via an mvcc backend write transaction (buffering writes in a way that ensures they remain visible to readers) is not needed. So this PR performs compaction writes directly against boltdb, avoiding lock contention in the mvcc backend and skipping the write buffer. This improves performance significantly. The kube sig-scalability team tested this against 5k kubemark clusters and validated that it largely eliminates the degraded p99 latency.
Even though this is simpler and more efficient, the boltdb writes that perform compaction still block the mvcc backend write buffer from performing writeback. When this happens, the writeback holds the main mvcc backend read/write lock, preventing new reads from initializing. Reducing the batch size to 1k helps minimize this (though if the batch size is too low, we risk losing throughput).
Potential future work:
There is a potential edge case where writes in the write buffer that are not yet flushed to boltdb become eligible for compaction. Because compacted revisions are deleted directly from boltdb, these revisions would not be compacted as they should be. But this is not a problem: they are not visible to readers, and the next compaction operation after the write buffer is flushed will delete them from boltdb.
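For illustration, a hedged sketch of how the compaction deletes could be driven in 1k batches directly against boltdb, per the description above; backendCompact stands in for the backend.Compact method shown in the diff, and the keys slice is assumed to already exclude revisions that must be kept:

```go
package mvcc // assumed package name for this sketch

// compactionBatchSize mirrors the 1k batch size described above.
const compactionBatchSize = 1000

// compactInBatches submits the compaction deletes in fixed-size batches.
func compactInBatches(backendCompact func(bucket []byte, keys [][]byte) error, bucket []byte, keys [][]byte) error {
	for start := 0; start < len(keys); start += compactionBatchSize {
		end := start + compactionBatchSize
		if end > len(keys) {
			end = len(keys)
		}
		// Each batch is its own boltdb write transaction, so the write lock
		// is released between batches and readers can make progress.
		if err := backendCompact(bucket, keys[start:end]); err != nil {
			return err
		}
	}
	return nil
}
```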