kvserver: separate raft log #16624

irfansharif · 2017-06-20T16:52:11Z

Umbrella issue for #16361.

irfansharif · 2017-06-27T15:14:43Z

~  env COCKROACH_DEDICATED_RAFT_STORAGE=DISABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10"  > perf-disabled
~  env COCKROACH_DEDICATED_RAFT_STORAGE=ENABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10"  > perf-enabled
~  benchstat perf-enabled perf-disabled

  name                             old time/op  new time/op  delta
  ReplicaRaftStorage/vs=1024-4     1.70ms ± 4%  0.44ms ±12%  -73.90%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=4096-4     1.91ms ± 2%  0.63ms ±11%  -66.95%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=16384-4    2.74ms ± 2%  1.46ms ± 4%  -46.68%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=65536-4    5.16ms ± 5%  4.38ms ± 3%  -15.10%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=262144-4   13.6ms ± 4%  16.4ms ± 3%  +20.51%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=1048576-4  50.9ms ± 3%  71.0ms ± 4%  +39.70%  (p=0.000 n=10+8)

Early, untuned & possibly incorrect benchmarks. Will update comment once verified.

petermattis · 2017-06-27T16:31:02Z

Any insight as to why performance gets worse with a dedicated Raft storage engine at larger value sizes?

irfansharif · 2017-06-27T16:36:42Z

Not yet, no. It shouldn't happen as far as I can tell, doesn't align with the early experiments done in #16361 either. Investigating.

irfansharif · 2017-06-30T02:01:58Z

there was particularly ill-suited reflect.DeepEqual debug assertion here skewing things. That aside we have:

~ benchstat perf-disabled perf-enabled

  name                             old time/op  new time/op  delta
  ReplicaRaftStorage/vs=1024-4      320µs ± 6%   385µs ±18%  +20.29%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=4096-4      613µs ±14%   580µs ± 2%     ~     (p=0.278 n=10+9)
  ReplicaRaftStorage/vs=16384-4    2.59ms ± 3%  2.05ms ± 4%  -20.87%  (p=0.000 n=10+9)
  ReplicaRaftStorage/vs=65536-4    4.11ms ± 7%  3.29ms ± 3%  -19.97%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=262144-4   13.4ms ± 8%  10.7ms ± 3%  -20.39%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=1048576-4  56.8ms ± 3%  36.4ms ± 2%  -35.91%  (p=0.000 n=10+10)

tbg · 2017-06-30T02:41:38Z

This one? #16591

On Thu, Jun 29, 2017, 22:02 irfan sharif ***@***.***> wrote: there was particularly ill-suited reflect.DeepEqual debug assertion here skewing things. That aside we have: ~ benchstat perf-disabled perf-enabled name old time/op new time/op delta ReplicaRaftStorage/vs=1024-4 320µs ± 6% 385µs ±18% +20.29% (p=0.000 n=10+10) ReplicaRaftStorage/vs=4096-4 613µs ±14% 580µs ± 2% ~ (p=0.278 n=10+9) ReplicaRaftStorage/vs=16384-4 2.59ms ± 3% 2.05ms ± 4% -20.87% (p=0.000 n=10+9) ReplicaRaftStorage/vs=65536-4 4.11ms ± 7% 3.29ms ± 3% -19.97% (p=0.000 n=10+10) ReplicaRaftStorage/vs=262144-4 13.4ms ± 8% 10.7ms ± 3% -20.39% (p=0.000 n=10+10) ReplicaRaftStorage/vs=1048576-4 56.8ms ± 3% 36.4ms ± 2% -35.91% (p=0.000 n=10+10) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#16624 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE135J3qhkvLqnrSUqm-i2Zy8XFC-dWPks5sJFcdgaJpZM4N_3iO> .

--

…

-- Tobias

irfansharif · 2017-06-30T02:43:48Z

nope, I had sprinkled them all over when implementing this to guarantee parity across the two instances. Just forgot to remove all of them when posting the first benchmark.

petermattis · 2017-07-17T19:14:22Z

Have you experimented with disabling syncing when applying Raft log truncations? That would be the ideal scenario with regards to syncing of the non-Raft storage engine.

…

On Mon, Jul 17, 2017 at 3:08 PM, irfan sharif ***@***.***> wrote: The following are some results from experiments to determine the relative *lack* of speed ups from #16809 <#16809> (pardon the verbosity it's primarily self-documentation). I ran two long running experiments on navy, ycsb -workload F (pure writes) with the same initial-load. From 7/13 04:00 (grafana time) to 7/13 21:11 for the first run (with new changes), second from 7/13 21:27 to 7/14 14:39 First, the results: --- Before 17h21m56s 1145.0 0 / 0 / 0 1145 / 0 0 / 0 elapsed__ops/sec(total)__errors(total) 62516.6s 919.5 0 --- After 17h49m5s 1099.0 0 / 0 / 0 1099 / 0 0 / 0 elapsed__ops/sec(total)__errors(total) 64145.9s 896.2 0 Overall a 2.59% decrease in throughput. I did the same for a shorter run: --- Before 1h12m25s 1084.9 0 / 0 / 0 1085 / 0 0 / 0 elapsed__ops/sec(total)__errors(total) 4345.3s 900.9 0 --- After 1h18m40s 1060.0 0 / 0 / 0 1060 / 0 0 / 0 elapsed__ops/sec(total)__errors(total) 4720.8s 877.0 0 Similarly, 2.62% decrease in throughput. Here are some interesting graphs from the longer run, the first half is navy with the dedicated raft engine changes and the second without: [image: image] <https://user-images.githubusercontent.com/10536690/28283345-7a5f13dc-6afb-11e7-8352-fe73ac16c1ed.png> Overall the change, as is, seems to increase Raft command commit latency. Picking navy-006 to remove the outlier: [image: image] <https://user-images.githubusercontent.com/10536690/28283531-f96ee1a2-6afb-11e7-94d8-fe8c4833d781.png> We have more CGo calls than we did before (this is expected, we're separately committing to a whole another engine without any form of batching across the two): [image: image] <https://user-images.githubusercontent.com/10536690/28283518-efbf52fe-6afb-11e7-835d-4639cdf0ded9.png> We have slightly longer raft log processing times: [image: image] <https://user-images.githubusercontent.com/10536690/28283883-404689e4-6afd-11e7-89dd-253714211b72.png> Some other asides: - Overall memory usage goes slightly higher [image: image] <https://user-images.githubusercontent.com/10536690/28283595-287417f6-6afc-11e7-9d85-798207072470.png> - We seem to have more replicas/ranges per node (I don't know if this a valid inference, or why this comes about) [image: image] <https://user-images.githubusercontent.com/10536690/28283706-8b16a554-6afc-11e7-9038-6dfea3725f45.png> - Ignore the graph headings but here are RocksDB stats for the new engine [image: image] <https://user-images.githubusercontent.com/10536690/28284158-0f742596-6afe-11e7-8bba-4dde871db3bd.png> — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#16624 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AF6f99-Pl6b_hIkgyXXqFfoJoGVm4pvpks5sO7EwgaJpZM4N_3iO> .

irfansharif · 2017-07-17T19:23:23Z

The following are some results from experiments to determine the relative lack of speed ups from #16809 (pardon the verbosity it's primarily self-documentation). I ran two long running experiments on navy, ycsb -workload F (pure writes) with the same initial-load. From 7/13 04:00 (grafana time) to 7/13 21:11 for the first run (with new changes), second from 7/13 21:27 to 7/14 14:39.

First, the results:

--- Before
17h21m56s       1145.0           0 / 0 / 0        1145 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  62516.6s          919.5              0

--- After
17h49m5s       1099.0           0 / 0 / 0        1099 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  64145.9s          896.2              0

Overall a 2.59% decrease in throughput. I did the same for a shorter run:

--- Before
1h12m25s       1084.9           0 / 0 / 0        1085 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  4345.3s          900.9              0

--- After
1h18m40s       1060.0           0 / 0 / 0        1060 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  4720.8s          877.0              0

Similarly, 2.62% decrease in throughput.

Here are some interesting graphs from the longer run, the first half is navy with the dedicated raft engine changes and the second without:

Overall the change, as is, seems to increase Raft command commit latency. Picking navy-006 to remove the outlier:

We have more CGo calls than we did before (this is expected, we're separately committing to a whole another engine without any form of batching across the two):

We have slightly longer raft log processing times:

I tried using go-torch after, here are some flame graphs operating on the same data:

Before:

After:

They both mostly have the same structure (good, expected) except after we have a much wider second flame (I don't know what else to call it).
Looking into this further using callgrind (screenshots not shown) I was able to confirm a larger proportion of time spent in CGo land, happening right about here:

         .          .   4540:   isLogTruncationRequest := rResult.RaftLogDelta != nil
         .      3.62s   4541:   if err := batch.Commit(isLogTruncationRequest); err != nil {
         .          .   4542:           return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
         .          .   4543:                   errors.Wrap(err, "could not commit batch")))
         .          .   4544:   }
         .      700ms   4545:   if err := raftBatch.Commit(isLogTruncationRequest); err != nil {
         .          .   4546:           return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
         .          .   4547:                   errors.Wrap(err, "could not commit raft batch")))
         .          .   4548:   }
      10ms       10ms   4549:   elapsed := timeutil.Since(start)
         .       30ms   4550:   r.store.metrics.RaftCommandCommitLatency.RecordValue(elapsed.Nanoseconds())
         .       10ms   4551:   return rResult.Delta, nil
         .          .   4552:}
         .          .   4553:

I tried disabling raft log syncing (for both engines, before and after these changes to get a fair comparison) but still saw similar results. As for what I think is happening is our current batching process for RocksDB batch commits. They are specific to individual engines and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This creates this two-phased batching process which might explain the drop in throughput. Compare this to a better batching strategy batching all writes going down to RocksDB, regardless of the engine addressed. Looking into what else can be shared across the two instances, by default the total number of compaction threads are but there probably are more (see include/rocksdb/env.h).

irfansharif · 2017-07-17T19:23:54Z

Some other asides:

Overall memory usage goes slightly higher
We seem to have more replicas/ranges per node (I don't know if this a valid inference, or why this comes about)
Ignore the graph headings but here are RocksDB stats for the new engine. I'm surfacing these as separate metrics to track.

irfansharif · 2017-07-17T19:24:27Z

+cc @tschottdorf.

petermattis · 2017-07-17T19:50:37Z

I tried disabling raft log syncing (for both engines, before and after these changes to get a fair comparison) but still saw similar results. As for what I think is happening is our current batching process for RocksDB batch commits. They are specific to individual engines and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This creates this two-phased batching process which might explain the drop in throughput. Compare this to a better batching strategy batching all writes going down to RocksDB, regardless of the engine addressed.

Yeah, that does seem problematic. When adding the dedicated syncing goroutine, I considered adding a dedicated goroutine to perform the commits. With some restructuring, I think we could have a dedicated per-engine goroutine for committing batches, queue both batches and then wait for both of the commits to happen.

irfansharif · 2017-07-19T17:23:04Z

Rebasing on top of new master I'm seeing 0 QPS occurring quite frequently, this is ycsb -workload F on ultramarine:

  9m17s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m18s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m19s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m20s          0.0           0 / 0 / 0           0 / 0           0 / 0
elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
  9m21s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m22s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m23s        116.0           0 / 0 / 0         116 / 0           0 / 0
  9m24s        499.0           0 / 0 / 0         499 / 0           0 / 0
  9m25s        420.0           0 / 0 / 0         420 / 0           0 / 0
  9m26s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m27s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m28s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m29s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m30s        405.0           0 / 0 / 0         405 / 0           0 / 0
  9m31s         59.0           0 / 0 / 0          59 / 0           0 / 0
  9m32s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m33s          0.0           0 / 0 / 0           0 / 0           0 / 0
   ...
elapsed__ops/sec(total)__errors(total)
 773.4s          155.4              0

This is new post-rebase, I suspect it's due to #16942 and the reworked batching mechanisms. Isolating a build without those changes.

For posterity these are the numbers for without the raft log changes:

elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
   6m1s        558.0           0 / 0 / 0         558 / 0           0 / 0
   6m2s        541.0           0 / 0 / 0         541 / 0           0 / 0
   6m3s        596.0           0 / 0 / 0         596 / 0           0 / 0
   6m4s        553.0           0 / 0 / 0         553 / 0           0 / 0
   6m5s        629.0           0 / 0 / 0         629 / 0           0 / 0
   6m6s        538.0           0 / 0 / 0         538 / 0           0 / 0
   6m7s        566.0           0 / 0 / 0         566 / 0           0 / 0
   6m8s        597.0           0 / 0 / 0         597 / 0           0 / 0
 ... 
elapsed__ops/sec(total)__errors(total)
 368.0s          457.6              0

No zero QPS, almost a thrice as fast (!).

irfansharif · 2017-07-19T18:09:13Z

Nope, something else going on here.
Here's a build reverting three related commits to the new batching/syncing mechanism:

a86bba2f6 Revert "storage/engine: use SyncWAL instead of sync on commit"
f9379de0e Revert "storage/engine: unify code to commit write-only and read-write batches"
ec8bd15c0 Revert "storage/engine: lazily initialize the RocksDB batch"

   4m1s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m2s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m3s        244.0           0 / 0 / 0         244 / 0           0 / 0
   4m4s         12.0           0 / 0 / 0          12 / 0           0 / 0
   4m5s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m6s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m7s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m8s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m9s          0.0           0 / 0 / 0           0 / 0           0 / 0
  4m10s          0.0           0 / 0 / 0           0 / 0           0 / 0
  4m11s        280.0           0 / 0 / 0         280 / 0           0 / 0
 ... 
elapsed__ops/sec(total)__errors(total)
 251.3s          245.9              0

irfansharif · 2017-07-19T18:29:23Z

Zero QPS points are during lots of completely idle CPU time:

petermattis · 2017-07-19T18:32:13Z

0 QPS like that often indicates a problem with leases.

irfansharif · 2017-07-25T15:02:14Z

Circling back to this issue, still visible after restructuring around truncations moving downstream and happening on it's own dedicated queue after the fact ("truncations" only update TruncatedState, actual deletion of log entries happen after the fact). I'm sure there's a very simple oversight causing this but still investigating. Below are some graphs from a longer run last night where we observe gradual degradation to 0 QPS point, my initial theory is some funky interaction with the existing raft log queue (queueing truncations) and the snapshot queue (queueing snapshots; on initial investigation however the requests seem to originate out of raft directly). We see continuous cycle of snapshot generation + application for the same small subset of ranges (this should have been mitigated after the proposal quota pool, I've come full circle). The workload here is ycsb -workload F -splits 1000 -concurrency 1000.
Forgive the verbosity/noise, I don't know what's relevant as yet.

Slow degradation in throughput followed by intermittent peaks:

Enqueued replicas despite 0 throughput, also a lot of failures:

Surge of non-preemptive snapshots generated (master doesn't show this for the same workload):

Raft log self behind creeps up:

Replicas turn quiescent:

Other side effects, raft is blocked as well:

irfansharif · 2017-07-25T18:09:56Z

Seeing a lot of snapshot generation/application isolated to few specific ranges:

I170725 03:58:16.971243 1078385 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8570 (id=1d594482, encoded size=8836147, 34 rocksdb batches, 24 log entries)
I170725 03:58:17.112937 1078385 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 142ms [clear=0ms batch=99ms entries=0ms commit=42ms]
I170725 03:58:18.162192 1083635 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8619 (id=194c1442, encoded size=8895367, 34 rocksdb batches, 30 log entries)
I170725 03:58:18.315653 1083635 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 153ms [clear=5ms batch=57ms entries=1ms commit=90ms]
I170725 03:58:19.455216 1088924 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8665 (id=9afacbdb, encoded size=8941200, 34 rocksdb batches, 29 log entries)
I170725 03:58:19.535733 1088924 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 80ms [clear=0ms batch=36ms entries=0ms commit=44ms]
I170725 03:58:20.617621 1094108 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8703 (id=07eca121, encoded size=8972789, 35 rocksdb batches, 24 log entries)
I170725 03:58:20.861394 1094108 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 244ms [clear=0ms batch=174ms entries=1ms commit=69ms]
I170725 03:58:21.939922 1099239 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8748 (id=59467f09, encoded size=9017571, 35 rocksdb batches, 23 log entries)
I170725 03:58:22.004040 1099239 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 64ms [clear=0ms batch=8ms entries=0ms commit=55ms]
I170725 03:58:23.110255 1104397 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8802 (id=f619d9ce, encoded size=9087894, 35 rocksdb batches, 33 log entries)

bdarnell · 2017-07-25T18:10:10Z

I'm not sure what it means yet, but I think that last graph is going to be the key: we have periodic bursts of messages getting through, coinciding with an increase in the level of dropped messages (and each time the spikes/increases get smaller). It looks like it repeats at regular intervals, but not at a round number (11 minutes?)

This suggests problems in the raft transport, maybe flow control related. I'd try increasing initialConnWindowSize in rpc/context.go.

irfansharif · 2017-07-27T16:09:05Z

I'm sure there's a very simple oversight causing this but still investigating

Found the issue, it was due to Replica.mu.raftLogSize not getting updated correctly despite actual truncations taking place. This happened after a live "migration" over to the new engine when upstream of raft the "simulated truncation" (

cockroach/pkg/storage/replica_command.go

Line 1927 in 7ca57b6

ms.SysBytes = -ms.SysBytes // simulate the deletion

) did not play well with "actual" truncations happening downstream. The result of this was that despite "actual" truncations happening (log entries got deleted in RockDB), the raft log queue still accrued replicas as individual Replica.mu.raftLogSize's grew indefinitely. All the following emergent side effects could be traced down to just that.

The graphs looks steady following a 10hr run last night, albeit with a reduced throughput.

irfansharif · 2017-07-27T16:26:16Z

Here are the new degradations (implementation is as yet incomplete, awkward state debugging the previous issue), things to look out for:

Still we have more replicas queued for log truncations (I have a handle on why, addressing):

Despite this we seem to be processing fewer replica log truncations:

Side effect perhaps, raft throughput takes a hit:

Most pressing, raft log commit latencies shoot way up. To a lesser degree the same is true for commit latencies. For this run log truncations, though processed downstream, happen just as frequently. The Raft RocksDB instance is untuned (same defaults as the base engine). Investigating.

irfansharif · 2017-07-27T19:00:53Z

Another stab at this, w/ dedicated raft log gc queue "actually" deleting after the fact. Live migration takes place around 16:52, we maintain steady performance for a while before eventual 0 QPS again.
Similar symptoms as before, and at first glance similar causes (over-aggressive raft log truncations).

Dropping throughput, eventually zero, starting at 16:52:

Uptick in replicas queued in the raft log queue, again, starting at 16:52:

Again we eventually get to cyclic snapshot generation:

Right around then more replicas report themselves to be "behind" (hence the snapshots):

Replicas eventually quiesce, we're now at 0 QPS:

Raft log commit latencies shoot up, raft command commits too but not so much. Consider that the new engine is as yet untuned:

As an aside, some new metrics we're tracking now. Will aid when tuning:

petermattis · 2017-08-01T01:40:51Z

Heh, I've used the term "snapshot death spiral" for this behavior. Similar problems occurred in the past when Raft log truncation was truncating while we were trying to catch up or rebalance a replica. The heuristics now in raft_log_queue.go are supposed to avoid this, but perhaps something in this PR has broken those heuristics.

irfansharif · 2017-08-02T15:49:04Z

Fixing the zero QPS issue (more raft log truncation simulation badness, sigh) I'm running into OOM/high RocksDB memory usage behavior.

First two steep runs are running truncations in the separate raft log gc queue using tombstones, fairly infrequently at that. The disk usage for the raft engine just grows out of bound indefinitely until we hit 47 GiB when we are killed. The long running stable run is our disk usage without raft engine enabled where we seem to stay steady under the same constant load (ycsb -workload F -splits 500 -concurrency 500).
The differences between the three runs where we max out disk space:

First run, using raft deletion tombstones instead of actual deletions
Second run, actual deletions
Third run, increasing how aggressively the new raft log gc queue runs. This very negatively affects throughput (by 43% or so) but our memory usage, though still growing indefinitely, is somewhat staggered

The blip towards the end of the first run is what happens when the first node is killed having run out of memory and queries are stalled temporarily allowing the other nodes to presumably run compactions and lower disk usage.

Here are some compactions stats right before this happens on n2,s2:

I170802 14:47:44.643162 92 storage/store.go:4386  [n2,s2] sstables (raft eng read amplification = 3):
25438 0 [ 38M 1 ]: 38M
25439 5 [  4M 1 ]: 4M
25440 6 [ 79M 2 ]: 77M 1M
25441 I170802 14:47:44.643226 92 storage/store.go:4387  [n2,s2]
25442 ** Compaction Stats [default] **
25443 Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
25444 ----------------------------------------------------------------------------------------------------------------------------------------------------------
25445   L0      1/0   37.67 MB   0.9      0.0     0.0      0.0       3.7      3.7       0.0   1.0      0.0     49.6        76        99    0.770       0      0
25446   L5      1/0    3.60 MB   0.5      3.5     3.5      0.0       2.8      2.8       0.0   0.8     46.4     37.2        78        47    1.654   6896K   686K
25447   L6      2/0   78.83 MB   0.0      6.5     3.0      3.5       3.6      0.1       0.0   1.2     62.5     34.6       106        63    1.678   9644K  6277K
25448  Sum      4/0   120.09 MB   0.0     10.0     6.5      3.5      10.1      6.6       0.0   2.7     39.3     39.8       260       209    1.243     16M  6963K
25449  Int      0/0    0.00 KB   0.0      1.3     0.8      0.5       1.3      0.9       0.0   2.7     28.4     29.4        46        25    1.837   2117K   849K
Uptime(secs): 4810.2 total, 4810.2 interval
25451 Flush(GB): cumulative 3.693, interval 0.491
25452 AddFile(GB): cumulative 0.000, interval 0.000
25453 AddFile(Total Files): cumulative 0, interval 0
25454 AddFile(L0 Files): cumulative 0, interval 0
25455 AddFile(Keys): cumulative 0, interval 0
25456 Cumulative compaction: 10.09 GB write, 2.15 MB/s write, 9.98 GB read, 2.12 MB/s read, 259.7 seconds
25457 Interval compaction: 1.32 GB write, 0.28 MB/s write, 1.27 GB read, 0.27 MB/s read, 45.9 seconds
25458 Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_byte      s, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Here are some after:

29674 I170802 15:07:44.645798 92 storage/store.go:4386  [n2,s2] sstables (raft eng read amplification = 1):
29675 6 [ 7M 1 ]: 7M
29676 I170802 15:07:44.645980 92 storage/store.go:4387  [n2,s2]
29677 ** Compaction Stats [default] **
29678 Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
29679 ----------------------------------------------------------------------------------------------------------------------------------------------------------
29680   L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       3.8      3.8       0.0   1.0      0.0     49.5        78       102    0.766       0      0
29681   L5      0/0    0.00 KB   0.0      3.6     3.6      0.0       2.9      2.9       0.0   0.8     47.0     37.3        80        49    1.623   7267K   743K
29682   L6      1/0    7.23 MB   0.0      6.7     3.0      3.6       3.7      0.0       0.0   1.2     61.2     33.5       112        65    1.718     10M  6628K
29683  Sum      1/0    7.23 MB   0.0     10.3     6.7      3.7      10.3      6.7       0.0   2.7     39.2     39.3       269       216    1.247     17M  7371K
29684  Int      0/0    0.00 KB   0.0      0.1     0.1      0.1       0.0     -0.1       0.0   3.2    171.1     28.6         1         3    0.264    436K   263K
29685 Uptime(secs): 6010.2 total, 6010.2 interval
29686 Flush(GB): cumulative 3.776, interval 0.007
29687 AddFile(GB): cumulative 0.000, interval 0.000
29688 AddFile(Total Files): cumulative 0, interval 0
29689 AddFile(L0 Files): cumulative 0, interval 0
29690 AddFile(Keys): cumulative 0, interval 0
29691 Cumulative compaction: 10.33 GB write, 1.76 MB/s write, 10.32 GB read, 1.76 MB/s read, 269.3 seconds
29692 Interval compaction: 0.02 GB write, 0.00 MB/s write, 0.13 GB read, 0.02 MB/s read, 0.8 seconds
29693 Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_byte      s, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Disk size usage before (raft data is only stored in the /raft subdir, the delta is the disk space used by the single base engine):

~ crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: cockroach@cockroach-ultramarine-0001.crdb.io
11G     /mnt/data1/raft

2: cockroach@cockroach-ultramarine-0002.crdb.io
11G     /mnt/data1/raft

3: cockroach@cockroach-ultramarine-0003.crdb.io
12G     /mnt/data1/raft

4: cockroach@cockroach-ultramarine-0004.crdb.io
10G     /mnt/data1/raft

~ crl-ssh ultramarine all sudo du -sh /mnt/data1
1: cockroach@cockroach-ultramarine-0001.crdb.io
16G     /mnt/data1

2: cockroach@cockroach-ultramarine-0002.crdb.io
15G     /mnt/data1

3: cockroach@cockroach-ultramarine-0003.crdb.io
16G     /mnt/data1

4: cockroach@cockroach-ultramarine-0004.crdb.io
15G     /mnt/data1

Disk usage after, n1 was killed and restarted. Note the disk usage for the other nodes dropped in the interim.

➜  server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: cockroach@cockroach-ultramarine-0001.crdb.io
80M     /mnt/data1/raft

2: cockroach@cockroach-ultramarine-0002.crdb.io
11G     /mnt/data1/raft

3: cockroach@cockroach-ultramarine-0003.crdb.io
12G     /mnt/data1/raft

4: cockroach@cockroach-ultramarine-0004.crdb.io
11G     /mnt/data1/raft

➜  server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1
1: cockroach@cockroach-ultramarine-0001.crdb.io
4.8G    /mnt/data1

2: cockroach@cockroach-ultramarine-0002.crdb.io
17G     /mnt/data1

3: cockroach@cockroach-ultramarine-0003.crdb.io
18G     /mnt/data1

4: cockroach@cockroach-ultramarine-0004.crdb.io
17G     /mnt/data1

Before looking into this further, is there something in place currently to bound disk usage for the base engine? I'm surprised it grows so rapidly having simply moved raft log entries + truncations to a separate engine and reducing the frequency in which it occurs. Raft engines for all three runs are using the same config/tunings as the base engine.

petermattis · 2017-08-02T16:10:46Z

As mentioned offline, there is no limiter on the amount of disk space RocksDB uses. The significant disk growth is very surprising. I wonder if RocksDB isn't compacting sstables in the Raft log engine for some reason. Nothing is jumping out at me in the compaction stats. Do you have compaction stats from the Raft log engine when its disk usage grew to 47GB?

irfansharif · 2017-08-02T16:20:06Z

The snippets above are the stats from the raft log engine. Investigating.

petermattis · 2017-08-02T16:47:16Z

The snippets above are the stats from the raft log engine. Investigating.

Huh? The total data size in sstables there is tiny. I guess that makes sense on the one hand, we don't expect the Raft log engine to take much space on disk. So where is the 47GB number coming from?

irfansharif · 2017-08-02T17:52:04Z

strange, the compaction stats doesn't match up with the files on disk.

I170802 17:39:07.336853 10 storage/store.go:4386  [n1,s1] sstables (raft eng read amplification = 1):
6 [ 2M 1 ]: 2M
I170802 17:39:07.336915 10 storage/store.go:4387  [n1,s1]
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     23.7         1        15    0.087       0      0
  L6      1/0    1.82 MB   0.0      0.1     0.0      0.0       0.0     -0.0       0.0   0.9     81.5     37.7         1        14    0.047    613K   588K
 Sum      1/0    1.82 MB   0.0      0.1     0.0      0.0       0.1      0.0       0.0   1.8     27.5     28.4         2        29    0.068    613K   588K
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0

Where I'm supposed to have a single 2MB L6 file, instead I have:

cockroach@cockroach-ultramarine-0001:/mnt/data1/raft$ du -sh *
2.0M    000007.sst
2.4M    000011.sst
2.0M    000014.sst
1.8M    000017.sst
1.7M    000020.sst
1.2M    000023.sst
2.0M    000026.sst
1.5M    000029.sst
2.1M    000032.sst
1.5M    000035.sst
2.1M    000038.sst
2.1M    000041.sst
2.2M    000044.sst
1.4M    000047.sst
55M     000048.log
1.9M    000050.sst
4.0K    auxiliary
4.0K    COCKROACHDB_VERSION
4.0K    CURRENT
4.0K    IDENTITY
0       LOCK
4.0K    MANIFEST-000008
8.0K    OPTIONS-000005

blathers-crl · 2022-10-12T19:45:28Z

cc @cockroachdb/replication

sumeerbhola · 2022-10-12T22:38:09Z

@andrewbaptist suggested removing the fsync interference caused by sharing the filesystem by instead having 2 filesystems on the same block device.
There are two options for this filesystem split:

different ones for the raft log and state machine: The downside is fragmenting the disk capacity, since even we though we know that the raft logs are small, we may not have a good upper bound across all the replicas.
one for the 2 WALs and one for the sstables etc. for the 2 engines: Given that we have a fixed maximum backlog of unflushed memtables before suffering a write stall, we can put a bound on the capacity needed for the WALs.

andrewbaptist · 2022-10-14T21:24:20Z

A slightly different alternative is to not use fsync on either WAL and instead write them using AIO/O_DIRECT. If we go down that path, we would want to preallocate the WAL and then treat it as a circular buffer. Using the O_DIRECT path would mean we would need to be more explicit about our writes (need to be 4K aligned), but if we do that, then we also get some of the benefits of having an async interface to the storage layer since only the WAL write is in the critical path. This could also be done with io_uring, but I don't know a lot about that...

petermattis · 2022-10-17T13:36:46Z

The Pebble WAL is not pre-allocated (well it is, but not in the way that you're describing), but the WAL recycling functionality means it is more or less a circular buffer when operating in the steady state. We don't seem to have a good description of WAL recycling in a comment or design doc, but @jbowens or @sumeerbhola know where the bodies are buried (i.e. how this all works).

petermattis · 2022-10-17T13:39:17Z

https://github.com/cockroachdb/pebble/blob/master/record/record.go#L60-L98 talks about the low-level WAL format changes used for WAL recycling and mentions that WAL recycling is a prerequisite for using direct IO.

andrewbaptist · 2022-10-17T13:55:19Z

Completely agree that we would need to use recycling of the file to use direct io. I also don't have a good feel for what the complexity of this change would be. That said, this is one of the "easier" format changes as for an upgrade you could read the old format, delete that file and then start writing in the new format.

For this to work would require the WAL would have to be fixed sized and preallocated. The fixed size is not completely true as it could be handled like a hash table resize where there is a very rare "freeze the world" event, and all data is copied from the old log to the new log. This would be a very high latency (2-3s) event however so should be avoided as much as possible.

If this is done, there are 3 primary benefits

The WAL could be written without explicit syncs (O_DIRECT will sync if the file is preallocated). There are some questions about write caches on the drives, but talking to Seagate, they guarantee on all SSD drives that there is enough capacitance so that a write that makes it to the drive will make it to persistent storage
The LSM files could be written without concern about the OS level synces freezing the WAL writes
The removal of block allocation need for the WAL would reduce any unnecessary fragmentation and allow the file system to generally be more healthy.

andrewbaptist · 2022-10-17T14:02:18Z

Not all drives have this, but enterprise SSD will have this protection:

https://semiconductor.samsung.com/resources/others/Samsung_SSD_845DC_05_Power_loss_protection_PLP.pdf
https://www.seagate.com/solutions/network-attached-storage/power-loss-data-protection/

petermattis · 2022-10-17T14:12:21Z

I'm not sure if any additional WAL format changes are needed. WAL recycling already works (those format changes are already in) and is battle hardened. RocksDB already has code for writing the WAL using O_DIRECT and it is using the same WAL format. Even expanding the size of the WAL already works. If you need to add another WAL file code exists to do that. Perhaps there are some additional parts here, but I suspect most of the pieces are in place.

There are some questions about write caches on the drives, but talking to Seagate, they guarantee on all SSD drives that there is enough capacitance so that a write that makes it to the drive will make it to persistent storage

We'd have to chat with the cloud providers to see what guarantees they provide with regard to EBS and similar.

andrewbaptist · 2022-10-17T14:26:05Z

Reading a little more on this, the file can be opened with O_DIRECT|O_DSYNC which will be a no-op if the underlying block device has protection and will slow down writes if it doesn't by setting the FUA (Forced Unit Access). It then guarantees that as long as the file is preallocated (no allocations during writing) the data will be persistent on media before the call returns. Mixing this will AIO allows an async interface to the device so that writes can be "pipeline" and still notify the higher level app on a sync.

I think block-level sync on EBS is a no-op since it always assumes FUA is set (https://twitter.com/molson/status/1350854839090102273), and (https://www.evanjones.ca/durability-cloud.html).

If most of the pieces are in place for this that would be great! I haven't looked through pebble (or RocksDB) much before, so this might "just work". If this does all work, it would address some of the issues related to writing to both RAFT and the state in the same pebble instance, but not all of them.

The significant missing piece is part of the Init implementation to handle a RecoveryInconsistentReplica that requires applying committed raft log entries to the state machine. This missing piece will need to wait until cockroachdb#75729 is fixed. There are multiple TODOs, including related to concurrency, but the implementation is complete enough for the datadriven test to exercise many state transitions. Additionally, the test exercises loss of unsynced state, and fixup of that state in ReplicasStorage.Init, by using vfs.NewStrictMem. Informs cockroachdb#16624 Release note: None

To support a separate raft log[^1] we need to perform certain reconciliation operations at start-up to recover from a lack of atomicity between the state and log engines. This commit gets us closer to being able to do so by listing all replicas before starting the store, which means we now have a handle on which uninitialized replicas exist in the system. As a first application of this knowledge, we now ensure that every replica has a persisted FullReplicaID. Since this would not necessarily be true for stores that have seen older releases, we backfill the ReplicaID in 23.1 and can then require it to be present in a future release that forces a migration through 23.1. [^1]: cockroachdb#16624 Epic: CRDB-220 Release note: None

This introduces the concept of an Engine specific for use in a log store. In other words, it's the beginnings of a logical separation of the state machine and log engines, i.e. cockroachdb#16624. For now, both continue to be backed by the same storage.Engine, and LogEngine is not correctly used in all places. For example, snapshot application hasn't yet been updated to account for the possibility of two separate engines and writes a set of SSTS that is atomically ingested into the single engine currently present but which logically spans both engines (cockroachdb#93251). Epic: CRDB-220 Release note: None

To support a separate raft log[^1] we need to perform certain reconciliation operations at start-up to recover from a lack of atomicity between the state and log engines. This commit gets us closer to being able to do so by listing all replicas before starting the store, which means we now have a handle on which uninitialized replicas exist in the system. As a first application of this knowledge, we now ensure that every replica has a persisted FullReplicaID. Since this would not necessarily be true for stores that have seen older releases, we backfill the ReplicaID in 23.1 and can then require it to be present in a future release that forces a migration through 23.1. [^1]: cockroachdb#16624 Epic: CRDB-220 Release note: None

93317: kvserver: also load uninitialized replicas, verify replicaID r=pavelkalinnikov a=tbg To support a separate raft log[^1] we need to perform certain reconciliation operations at start-up to recover from a lack of atomicity between the state and log engines. This commit gets us closer to being able to do so by listing all replicas before starting the store, which means we now have a handle on which uninitialized replicas exist in the system. As a first application of this knowledge, we now ensure that every replica has a persisted FullReplicaID. Since this would not necessarily be true for stores that have seen older releases, we backfill the ReplicaID in 23.1 and can then require it to be present in a future release that forces a migration through 23.1. [^1]: #16624 Epic: CRDB-220 Release note: None 93721: roachpb: include tenant name in invalid name error r=andreimatei a=andreimatei Release note: none Epic: none 93736: ring: make ring.Buffer generic r=ajwerner a=ajwerner Epic: none Release note: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Andrew Werner <awerner32@gmail.com>

dianasaur323 added this to the 1.1 milestone Jun 23, 2017

tbg mentioned this issue Jul 9, 2017

storage/engine: use SyncWAL instead of sync on commit #16942

Merged

irfansharif mentioned this issue Jul 17, 2017

storage: don't send log truncations through Raft #16993

Merged

nicktrav added the T-kv-replication label Oct 12, 2022

tbg changed the title ~~storage: dedicated raft storage engine~~ kvserver: separate raft log Dec 8, 2022

tbg removed their assignment Dec 8, 2022

tbg mentioned this issue Dec 19, 2022

logstore: introduce logstore.Engine #93889

Closed

tbg mentioned this issue Feb 24, 2023

kvserver: sep-raft-log: revisit throughput under separate raft log #97653

Open

tbg mentioned this issue Jun 20, 2023

kvserver: test *Replica like *raft.RawNode #105177

Open

jbowens mentioned this issue Jul 10, 2023

storage: investigate fsync latency spikes #106231

Open

exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024

pav-kv mentioned this issue Oct 5, 2024

raft: separate log and state storage logically #132030

Open

kvserver: separate raft log #16624

kvserver: separate raft log #16624

Comments

irfansharif commented Jun 20, 2017 • edited by exalate-issue-sync bot Loading

irfansharif commented Jun 27, 2017

petermattis commented Jun 27, 2017

irfansharif commented Jun 27, 2017

irfansharif commented Jun 30, 2017

tbg commented Jun 30, 2017 via email

irfansharif commented Jun 30, 2017

petermattis commented Jul 17, 2017 via email

irfansharif commented Jul 17, 2017 • edited Loading

irfansharif commented Jul 17, 2017

irfansharif commented Jul 17, 2017

petermattis commented Jul 17, 2017

irfansharif commented Jul 19, 2017 • edited Loading

irfansharif commented Jul 19, 2017

irfansharif commented Jul 19, 2017

petermattis commented Jul 19, 2017

irfansharif commented Jul 25, 2017

irfansharif commented Jul 25, 2017

bdarnell commented Jul 25, 2017

irfansharif commented Jul 27, 2017 • edited Loading

irfansharif commented Jul 27, 2017

irfansharif commented Jul 27, 2017 • edited Loading

petermattis commented Aug 1, 2017

irfansharif commented Aug 2, 2017

petermattis commented Aug 2, 2017

irfansharif commented Aug 2, 2017 • edited Loading

petermattis commented Aug 2, 2017

irfansharif commented Aug 2, 2017

blathers-crl bot commented Oct 12, 2022

sumeerbhola commented Oct 12, 2022

andrewbaptist commented Oct 14, 2022

petermattis commented Oct 17, 2022

petermattis commented Oct 17, 2022

andrewbaptist commented Oct 17, 2022

andrewbaptist commented Oct 17, 2022

petermattis commented Oct 17, 2022

andrewbaptist commented Oct 17, 2022 • edited Loading

irfansharif commented Jun 20, 2017 •

edited by exalate-issue-sync bot

Loading

irfansharif commented Jul 17, 2017 •

edited

Loading

irfansharif commented Jul 19, 2017 •

edited

Loading

irfansharif commented Jul 27, 2017 •

edited

Loading

irfansharif commented Jul 27, 2017 •

edited

Loading

irfansharif commented Aug 2, 2017 •

edited

Loading

andrewbaptist commented Oct 17, 2022 •

edited

Loading