mvcc: Clone for batch index compaction and shorten lock #9511

jcalvert · 2018-03-29T21:26:10Z

This is to address #9506 - When the BTree index grows to millions of entries, index compaction must iterate over all of them and this can take a substantial amount of time while continuing to hold the lock. This lock means that any concurrent read/writes will wait until compaction completes. By breaking compaction into batches of 10000, similar to the backend compaction, this allows relief for contention of the index lock.
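As a rough illustration of the batching idea described above, here is a minimal sketch, not the code in this PR. The `index` type, its `revs` map, and `compactBatched` are hypothetical stand-ins, and the compaction rule (drop every entry below `rev`) is a simplification of what the real key index does. The point is only that the lock is released between batches so writers get a chance to run.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// index is a hypothetical stand-in for the mvcc tree index: a mutex-guarded
// map from key to its latest revision.
type index struct {
	mu   sync.Mutex
	revs map[string]int64
}

// compactBatched deletes every entry whose revision is below rev. It takes
// the lock once per batch instead of once for the whole pass.
func (ix *index) compactBatched(rev int64, batch int) (removed int) {
	if batch < 1 {
		batch = 1
	}
	// Snapshot the key set under the lock so the iteration order is stable.
	ix.mu.Lock()
	keys := make([]string, 0, len(ix.revs))
	for k := range ix.revs {
		keys = append(keys, k)
	}
	ix.mu.Unlock()
	sort.Strings(keys)

	for start := 0; start < len(keys); start += batch {
		end := start + batch
		if end > len(keys) {
			end = len(keys)
		}
		ix.mu.Lock()
		for _, k := range keys[start:end] {
			// Re-check under the lock: a writer may have updated the key.
			if r, ok := ix.revs[k]; ok && r < rev {
				delete(ix.revs, k)
				removed++
			}
		}
		ix.mu.Unlock() // writers can acquire the lock between batches
	}
	return removed
}

func main() {
	ix := &index{revs: map[string]int64{}}
	for i := 0; i < 25; i++ {
		ix.revs[fmt.Sprintf("k%02d", i)] = int64(i)
	}
	removed := ix.compactBatched(10, 10)
	fmt.Println(removed, len(ix.revs)) // 10 15
}
```

Keys inserted after the snapshot are simply not visited in this pass, which is acceptable here because they cannot be older than the compaction revision in this simplified model.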

We came up with the following test as a way of validating that there is contention that blocks puts to the index while compaction is ongoing. On the master branch this test will fail, but it passes with our provided changes. We felt this test was useful for us in proving the issue but does not fit well into the existing test suite for etcd as far as we can tell.

Coauthored with @cosgroveb

func TestIndexCompactAndRuntime(t *testing.T) {
	ti := newTreeIndex()
	size := 1000000
	bytesN := 64
	keys := createBytesSlice(bytesN, size)
	for i := 1; i < size-1; i++ {
		ti.Put(keys[i], revision{main: int64(i), sub: int64(i)})
	}
	// Start compaction in the background, then time a concurrent Put.
	go ti.Compact(int64(500000))
	time.Sleep(200000 * time.Nanosecond)
	t1 := time.Now().UnixNano()
	ti.Put(keys[size-1], revision{main: int64(size-1), sub: int64(size-1)})
	t2 := time.Now().UnixNano() - t1
	// Fail if the Put was blocked for more than 150ms.
	if t2 > 150000000 {
		t.Errorf("Run time took too long! %v", t2)
	}
}

xiang90 · 2018-03-30T20:59:50Z

The idea looks right. But as @heyitsanthony commented here: #9384 (comment), we probably should have a better abstraction for traversing the tree in an mvcc manner without holding the lock for a long time.

cosgroveb · 2018-04-02T18:34:57Z

@xiang90 We're not sure that we fully understand the concerns being discussed in your link to #9384 and aren't super confident that we could be the ones to address the need for a better abstraction here without further guidance.

Are you looking for any particular changes from us or are you asking for us to wait until the problem being discussed in #9384 is settled? If you could provide us with a more concrete example or some pseudo-code we could certainly take a crack at it.

jcalvert · 2018-04-10T18:59:32Z

Based on the general store lock here this doesn't seem to really alleviate our problem fully and we are still investigating how to maintain consistent throughput during index compactions.

xiang90 · 2018-04-12T20:53:12Z

@jcalvert

Basically, we want to iterate the tree index without locking the lock for a long time.

We can provide a generic func say: t.fuzzyAscend just like t.tree.Ascend to do it rather than exposing the details of locking in our business logic func.
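One possible shape for such a helper is sketched below. This is only a guess at what a `fuzzyAscend` could look like: the `sortedIndex` type and its `put` method are hypothetical, and a sorted slice stands in for the btree. The helper copies out a small chunk of keys under the lock, releases the lock, calls the visitor, and resumes from the last key it saw. Keys inserted behind the cursor during the walk are not visited, which is what makes the traversal "fuzzy".

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// sortedIndex is a hypothetical stand-in for the tree index; the keys slice
// is always kept sorted.
type sortedIndex struct {
	mu   sync.Mutex
	keys []string
}

// put inserts k at its sorted position, ignoring duplicates.
func (s *sortedIndex) put(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	i := sort.SearchStrings(s.keys, k)
	if i < len(s.keys) && s.keys[i] == k {
		return
	}
	s.keys = append(s.keys, "")
	copy(s.keys[i+1:], s.keys[i:])
	s.keys[i] = k
}

// fuzzyAscend visits keys in ascending order. The lock is held only while a
// chunk of at most batch keys is copied out; visit runs without the lock.
// Iteration stops when visit returns false.
func (s *sortedIndex) fuzzyAscend(batch int, visit func(k string) bool) {
	if batch < 1 {
		batch = 1
	}
	last, started := "", false
	for {
		s.mu.Lock()
		i := 0
		if started {
			// Resume at the first key strictly greater than the last one seen.
			i = sort.Search(len(s.keys), func(j int) bool { return s.keys[j] > last })
		}
		end := i + batch
		if end > len(s.keys) {
			end = len(s.keys)
		}
		chunk := append([]string(nil), s.keys[i:end]...)
		s.mu.Unlock()

		if len(chunk) == 0 {
			return
		}
		for _, k := range chunk {
			if !visit(k) {
				return
			}
		}
		last, started = chunk[len(chunk)-1], true
	}
}

func main() {
	s := &sortedIndex{}
	for _, k := range []string{"c", "a", "e", "b", "d"} {
		s.put(k)
	}
	var seen []string
	s.fuzzyAscend(2, func(k string) bool {
		seen = append(seen, k)
		return true
	})
	fmt.Println(seen) // [a b c d e]
}
```

A visitor that needs to mutate the index would have to take the lock itself, so this shape hides the locking from the caller only for read-style traversals.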

jcalvert · 2018-04-16T20:12:30Z

@xiang90

During more extensive testing, we discovered that the lock we previously mentioned was in fact preventing updates while the index was being compacted. Even batching in groups of ten thousand is enough to cause noticeable latency. We were finally able to achieve consistent throughput during compactions by pushing the lock on the tree index down into the Ascend function. In order to preserve traverse order, we used the Clone function to produce a copy on write version of the tree so that we do not need to lock for the entire traversal. Please let us know any feedback you have, as we have experienced serious throughput degradation during index compactions.
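The approach described here can be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's code: `treeIndex`, `keyIndex`, and `snapshot` are hypothetical, and `snapshot` copies pointers into a sorted slice as a stand-in for the btree package's copy-on-write `Clone`. The traversal runs over the snapshot without holding the lock, and the lock is taken only around each per-key mutation, because the key entries are shared between the snapshot and the live index.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// keyIndex is a simplified per-key entry holding ascending revisions.
type keyIndex struct {
	key  string
	revs []int64
}

// treeIndex is a hypothetical stand-in for the mvcc tree index.
type treeIndex struct {
	sync.Mutex
	items map[string]*keyIndex
}

// snapshot stands in for a copy-on-write clone: it returns the current
// entries in key order. The entries themselves are shared, not copied.
func (ti *treeIndex) snapshot() []*keyIndex {
	ti.Lock()
	defer ti.Unlock()
	out := make([]*keyIndex, 0, len(ti.items))
	for _, ki := range ti.items {
		out = append(out, ki)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].key < out[j].key })
	return out
}

// compact drops revisions below rev and removes keys left with none.
func (ti *treeIndex) compact(rev int64) {
	for _, ki := range ti.snapshot() {
		// Lock per entry: ki is shared with the live index, so writers
		// must not touch it while it is being rewritten.
		ti.Lock()
		kept := ki.revs[:0]
		for _, r := range ki.revs {
			if r >= rev {
				kept = append(kept, r)
			}
		}
		ki.revs = kept
		// Delete only if the live index still points at this same entry.
		if len(ki.revs) == 0 && ti.items[ki.key] == ki {
			delete(ti.items, ki.key)
		}
		ti.Unlock()
	}
}

func main() {
	ti := &treeIndex{items: map[string]*keyIndex{
		"a": {key: "a", revs: []int64{1, 2}},
		"b": {key: "b", revs: []int64{5, 6}},
		"c": {key: "c", revs: []int64{3, 7}},
	}}
	ti.compact(5)
	fmt.Println(len(ti.items), ti.items["c"].revs) // 2 [7]
}
```

The trade-off is one lock/unlock pair per entry instead of one for the whole pass, which costs some raw compaction time but bounds how long any writer can be blocked.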

xiang90 · 2018-04-16T20:48:04Z

@jcalvert We cannot really modify the vendored btree. We have to push the change to upstream btree pkg first.

xiang90 · 2018-04-16T20:49:29Z

> Even batching in groups of ten thousand is enough to cause noticeable latency.

do you call yield to allow a reschedule of the go routine? i previously did some benchmark myself, and found batching of 10,000 should be good enough.

jcalvert · 2018-04-16T20:57:41Z

@xiang90 That change is already in a more recent version of the btree package. Did I not do the vendoring process correctly?

xiang90 · 2018-04-16T21:00:21Z

@jcalvert You probably need to first update the vendored pkg and the vendor lock file in another PR first.

@gyuho might help you on that.

xiang90 · 2018-04-16T21:01:40Z

mvcc/index_test.go

@@ -17,7 +17,7 @@ package mvcc
 import (
 	"reflect"
 	"testing"
-
+	"time"


strange. why this is added?

Not sure. Removing.

jcalvert · 2018-04-16T21:02:04Z

> do you call yield to allow a reschedule of the go routine? i previously did some benchmark myself, and found batching of 10,000 should be good enough.

No, we did not try that. You mean adding runtime.Gosched()? That would be sufficient to allow others to acquire the lock?

xiang90 · 2018-04-16T21:13:34Z

@jcalvert A lazy clone is probably better if the btree library already provides it.

gyuho · 2018-04-16T21:19:54Z

@jcalvert We should already have the latest btree dependency. Can you try scripts/updatedep.sh?

xiang90 · 2018-04-17T02:49:55Z

@jcalvert rebase with current master?

jcalvert · 2018-04-17T15:44:45Z

@xiang90 rebased. Thank you.

xiang90 · 2018-04-17T15:49:52Z

mvcc/index.go

+
+	clone.Ascend(func(item btree.Item) bool {
+		keyi := item.(*keyIndex)
+		ti.Lock()


add comment on why the lock is needed?

Added a comment about why the lock is needed. Let us know if this is not sufficient. Thank you.

xiang90 · 2018-04-17T15:50:20Z

mvcc/index_bench_test.go

+func BenchmarkIndexCompact100000(b *testing.B) { benchmarkIndexCompact(b, 100000) }
+func BenchmarkIndexCompact1000000(b *testing.B) { benchmarkIndexCompact(b, 1000000) }
+
+func benchmarkIndexCompact(b *testing.B, size int) {


do you have a result for the benchmark with a comparison with the result before the patch?

On local development machine -

master:

go test github.com/coreos/etcd/mvcc -v -run=^$ -bench BenchmarkIndexCompact
goos: linux
goarch: amd64
pkg: github.com/coreos/etcd/mvcc
BenchmarkIndexCompact1-16            1000000         1655 ns/op
BenchmarkIndexCompact100-16           100000        19772 ns/op
BenchmarkIndexCompact10000-16           2000       880883 ns/op
BenchmarkIndexCompact100000-16           200      8725995 ns/op
BenchmarkIndexCompact1000000-16          100    386072323 ns/op
PASS

this branch:

go test github.com/coreos/etcd/mvcc -v -run=^$ -bench BenchmarkIndexCompact
goos: linux
goarch: amd64
pkg: github.com/coreos/etcd/mvcc
BenchmarkIndexCompact1-16            1000000         1572 ns/op
BenchmarkIndexCompact100-16           100000        22216 ns/op
BenchmarkIndexCompact10000-16           2000      1058001 ns/op
BenchmarkIndexCompact100000-16           100     10484775 ns/op
BenchmarkIndexCompact1000000-16          100    423129293 ns/op

Added time likely due to the additional lock/unlock calls.

xiang90 · 2018-04-17T15:50:54Z

mvcc/kvstore.go

 	ch := make(chan struct{})
 	var j = func(ctx context.Context) {
+		keep := s.kvindex.Compact(rev)


why this needs to be changed?

The defer of s.mu.Unlock() seems to mean that throughput still stalls during index compaction, although on audit of the code it isn't clear how this is the case. We can look to see if we can create a test to validate that.

i mean why do we move the compact func into the schedule j unit?

It seemed cleaner to put it there rather than remove the deferred unlock in place of explicitly unlocking in the error cases. This could be mistaken.

Transactions do indeed touch this lock so by offloading it to the scheduled action it means that transactions won't time out waiting for the lock. If you'd prefer us to change it so that the code explicitly unlocks we can do that.

I do not think this is safe, since this func is expected to return only after the index has been compacted (so that the old revisions are no longer reachable). We still need to keep the index compaction synchronous.
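A minimal sketch of the shape being asked for here, under stated assumptions: the `store` type, its fields, and the `schedule` parameter are hypothetical and much simpler than the real kvstore. The store lock is released explicitly (on the error path and before the long index pass) instead of via `defer`, the index is compacted synchronously before the function returns, and only the follow-up backend work is handed to the scheduler.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errCompacted = errors.New("required revision has been compacted")

// store is a hypothetical, heavily simplified model of the kv store.
type store struct {
	mu          sync.Mutex // store-level lock, also taken by transactions
	compactRev  int64
	backendKept int // set by the scheduled backend job

	imu   sync.Mutex // index lock
	index []int64    // revisions still reachable through the index
}

// compactIndex drops revisions below rev and returns the ones kept.
func (s *store) compactIndex(rev int64) (keep []int64) {
	s.imu.Lock()
	defer s.imu.Unlock()
	for _, r := range s.index {
		if r >= rev {
			keep = append(keep, r)
		}
	}
	s.index = keep
	return keep
}

// compact returns only after the index has been compacted; the backend
// cleanup is passed to schedule and may run later.
func (s *store) compact(rev int64, schedule func(job func())) error {
	s.mu.Lock()
	if rev <= s.compactRev {
		s.mu.Unlock() // explicit unlock on the error path
		return errCompacted
	}
	s.compactRev = rev
	s.mu.Unlock() // release before the potentially long index pass

	keep := s.compactIndex(rev) // synchronous: done before we return
	schedule(func() {
		s.mu.Lock()
		s.backendKept = len(keep)
		s.mu.Unlock()
	})
	return nil
}

func main() {
	s := &store{index: []int64{1, 2, 3, 4, 5}}
	runNow := func(job func()) { job() } // synchronous stand-in scheduler
	err := s.compact(3, runNow)
	fmt.Println(err, len(s.index), s.backendKept) // <nil> 3 3
	fmt.Println(s.compact(3, runNow))
}
```

With this ordering, transactions waiting on the store lock are not held up by the index pass, while callers can still rely on old revisions being unreachable once `compact` returns.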

xiang90 · 2018-04-17T15:51:09Z

mvcc/index.go

@@ -17,7 +17,6 @@ package mvcc
 import (
 	"sort"
 	"sync"
-


revert this change.

xiang90 · 2018-04-17T22:16:40Z

mvcc/index_bench_test.go

+import (
+	"testing"
+
+  "go.uber.org/zap"


format? and why do we need this pkg?

xiang90 · 2018-04-17T22:16:52Z

lgtm. defer to @gyuho

gyuho · 2018-04-17T22:21:11Z

mvcc/index_bench_test.go

+
+func benchmarkIndexCompact(b *testing.B, size int) {
+  log := zap.NewNop()
+  kvindex := newTreeIndex(log)


can you run gofmt -w *.go on this?

hmm. why the tree index takes log as its arg? strange.

@xiang90 I recently added structured logger as an option (where the logger object is created at top level and passed downstream). Couldn't find an easier way to pass it around (will try to find a cleaner way). We can just pass nil for testing.

gyuho · 2018-04-17T22:25:35Z

@jcalvert Thanks again for hard work!

> copy on write version of the tree so that we do not need to lock for the entire traversal

Could you phrase this into a short release note https://github.com/coreos/etcd/blob/master/CHANGELOG-3.4.md#improved with a link to this PR, so people know who worked on this?

Also, do you have reproducible workloads that would benefit from this patch, so that I can cross-check on my side?

xiang90 · 2018-04-17T23:18:05Z

It does not make sense to pass log object at func level in my opinion. But it has nothing to do with this PR though.

gyuho · 2018-04-17T23:26:24Z

@xiang90 Agree. I will find a cleaner way!

jcalvert · 2018-04-18T15:14:55Z

@gyuho

PUT 10 million keys into server, send compaction request, immediately begin PUT, see timeouts. We discovered this trend by watching the timeouts from application servers in our Grafana dashboard. If you need, we can try to provide a client script to demonstrate. Thank you.

xiang90 · 2018-04-18T17:39:44Z

CHANGELOG-3.4.md

@@ -34,6 +34,7 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [
  - e.g. a node is removed from cluster, or [`raftpb.MsgProp` arrives at current leader while there is an ongoing leadership transfer](https://github.com/coreos/etcd/issues/8975).
 - Add [`snapshot`](https://github.com/coreos/etcd/pull/9118) package for easier snapshot workflow (see [`godoc.org/github.com/etcd/snapshot`](https://godoc.org/github.com/coreos/etcd/snapshot) for more).
 - Improve [functional tester](https://github.com/coreos/etcd/tree/master/functional) coverage: [proxy layer to run network fault tests in CI](https://github.com/coreos/etcd/pull/9081), [TLS is enabled both for server and client](https://github.com/coreos/etcd/pull/9534), [liveness mode](https://github.com/coreos/etcd/issues/9230), [shuffle test sequence](https://github.com/coreos/etcd/issues/9381), [membership reconfiguration failure cases](https://github.com/coreos/etcd/pull/9564), [disastrous quorum loss and snapshot recover from a seed member](https://github.com/coreos/etcd/pull/9565), [embedded etcd](https://github.com/coreos/etcd/pull/9572).
+- Improve [index compaction throughput](https://github.com/coreos/etcd/pull/9511) by using a copy on write clone to avoid holding the lock for the traversal of the entire index.


It won't improve the compaction throughput. It solves the blocking problem.

gyuho · 2018-04-18T20:13:14Z

@jcalvert @xiang90 I resolved some formatting issue with license header. Otherwise looks good. Will merge after CI passes. Thanks a lot!

xiang90 · 2018-04-18T20:26:50Z

@gyuho Can we squash the commits into one?

For compaction, clone the original Btree for traversal purposes, so as to not hold the lock for the duration of compaction. This allows read/write throughput by not blocking when the index tree is large (> 1M entries).

mvcc: add comment for index compaction lock
mvcc: explicitly unlock store to do index compaction synchronously
mvcc: formatting index bench
mvcc: add release note for index compaction changes
mvcc: add license header

gyuho · 2018-04-18T20:30:13Z

@xiang90 Done.

xiang90 · 2018-04-18T20:32:14Z

lgtm

gyuho added area/performance stage/investigating labels Apr 12, 2018

jcalvert force-pushed the index_compaction_breakup branch from 5a286c2 to eaee6fa Compare April 16, 2018 20:05

xiang90 reviewed Apr 16, 2018

jcalvert mentioned this pull request Apr 16, 2018

Upgrade google/btree to latest revision. #9573

Closed

jcalvert force-pushed the index_compaction_breakup branch 6 times, most recently from eb74545 to fd4e132 Compare April 17, 2018 15:41

jcalvert changed the title from "Batch index compaction" to "mvcc: Clone for batch index compaction and shorten lock" Apr 17, 2018

xiang90 reviewed Apr 17, 2018

jcalvert force-pushed the index_compaction_breakup branch from fd4e132 to 48554b4 Compare April 17, 2018 19:09

xiang90 reviewed Apr 17, 2018

gyuho reviewed Apr 17, 2018

xiang90 reviewed Apr 18, 2018

jcalvert force-pushed the index_compaction_breakup branch from 4ffe407 to 9c451a3 Compare April 18, 2018 18:04

gyuho removed the stage/investigating label Apr 18, 2018

gyuho force-pushed the index_compaction_breakup branch from 9c451a3 to cef166b Compare April 18, 2018 20:12

gyuho force-pushed the index_compaction_breakup branch from cef166b to f176427 Compare April 18, 2018 20:30

gyuho merged commit e5c9483 into etcd-io:master Apr 18, 2018
