kvserver: separate raft log #16624
~ env COCKROACH_DEDICATED_RAFT_STORAGE=DISABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10" > perf-disabled
~ env COCKROACH_DEDICATED_RAFT_STORAGE=ENABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10" > perf-enabled
~ benchstat perf-disabled perf-enabled
name old time/op new time/op delta
ReplicaRaftStorage/vs=1024-4 1.70ms ± 4% 0.44ms ±12% -73.90% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=4096-4 1.91ms ± 2% 0.63ms ±11% -66.95% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=16384-4 2.74ms ± 2% 1.46ms ± 4% -46.68% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=65536-4 5.16ms ± 5% 4.38ms ± 3% -15.10% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=262144-4 13.6ms ± 4% 16.4ms ± 3% +20.51% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=1048576-4 50.9ms ± 3% 71.0ms ± 4% +39.70% (p=0.000 n=10+8)
Early, untuned & possibly incorrect benchmarks. Will update comment once verified.
Any insight as to why performance gets worse with a dedicated Raft storage engine at larger value sizes?
Not yet, no. It shouldn't happen as far as I can tell, and it doesn't align with the early experiments done in #16361 either. Investigating.
There was a particularly ill-suited reflect.DeepEqual debug assertion here skewing things. That aside we have:
~ benchstat perf-disabled perf-enabled
name old time/op new time/op delta
ReplicaRaftStorage/vs=1024-4 320µs ± 6% 385µs ±18% +20.29% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=4096-4 613µs ±14% 580µs ± 2% ~ (p=0.278 n=10+9)
ReplicaRaftStorage/vs=16384-4 2.59ms ± 3% 2.05ms ± 4% -20.87% (p=0.000 n=10+9)
ReplicaRaftStorage/vs=65536-4 4.11ms ± 7% 3.29ms ± 3% -19.97% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=262144-4 13.4ms ± 8% 10.7ms ± 3% -20.39% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=1048576-4 56.8ms ± 3% 36.4ms ± 2% -35.91% (p=0.000 n=10+10)
This one? #16591
nope, I had sprinkled them all over when implementing this to guarantee parity across the two instances. Just forgot to remove all of them when posting the first benchmark.
Have you experimented with disabling syncing when applying Raft log truncations? That would be the ideal scenario with regards to syncing of the non-Raft storage engine.
The following are some results from experiments to determine the relative *lack* of speedups from #16809 (pardon the verbosity, it's primarily self-documentation). I ran two long-running experiments on navy, ycsb -workload F (pure writes) with the same initial load: from 7/13 04:00 (grafana time) to 7/13 21:11 for the first run (with the new changes), and from 7/13 21:27 to 7/14 14:39 for the second. First, the results:
--- Before
17h21m56s 1145.0 0 / 0 / 0 1145 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
62516.6s 919.5 0
--- After
17h49m5s 1099.0 0 / 0 / 0 1099 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
64145.9s 896.2 0
Overall a 2.59% decrease in throughput. I did the same for a shorter run:
--- Before
1h12m25s 1084.9 0 / 0 / 0 1085 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
4345.3s 900.9 0
--- After
1h18m40s 1060.0 0 / 0 / 0 1060 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
4720.8s 877.0 0
Similarly, a 2.62% decrease in throughput.
Here are some interesting graphs from the longer run; the first half is navy with the dedicated raft engine changes and the second without:
https://user-images.githubusercontent.com/10536690/28283345-7a5f13dc-6afb-11e7-8352-fe73ac16c1ed.png
Overall the change, as is, seems to increase Raft command commit latency. Picking navy-006 to remove the outlier:
https://user-images.githubusercontent.com/10536690/28283531-f96ee1a2-6afb-11e7-94d8-fe8c4833d781.png
We have more CGo calls than we did before (this is expected; we're separately committing to a whole other engine without any form of batching across the two):
https://user-images.githubusercontent.com/10536690/28283518-efbf52fe-6afb-11e7-835d-4639cdf0ded9.png
We have slightly longer raft log processing times:
https://user-images.githubusercontent.com/10536690/28283883-404689e4-6afd-11e7-89dd-253714211b72.png
Some other asides:
- Overall memory usage goes slightly higher: https://user-images.githubusercontent.com/10536690/28283595-287417f6-6afc-11e7-9d85-798207072470.png
- We seem to have more replicas/ranges per node (I don't know if this is a valid inference, or why this comes about): https://user-images.githubusercontent.com/10536690/28283706-8b16a554-6afc-11e7-9038-6dfea3725f45.png
- Ignore the graph headings, but here are RocksDB stats for the new engine: https://user-images.githubusercontent.com/10536690/28284158-0f742596-6afe-11e7-8bba-4dde871db3bd.png
I tried using [...]; they both mostly have the same structure (good, expected) except after [...] we have a much wider second flame (I don't know what else to call it).
I tried disabling raft log syncing (for both engines, before and after these changes, to get a fair comparison) but still saw similar results. As for what I think is happening: it's our current batching process for RocksDB batch commits. It is specific to individual engines, and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This creates a two-phased batching process, which might explain the drop in throughput. Compare this to a better batching strategy: batching all writes going down to RocksDB, regardless of the engine addressed. Looking into what else can be shared across the two instances: by default the total number of compaction threads is [...], but there probably are more (see include/rocksdb/env.h).
+cc @tschottdorf.
Yeah, that does seem problematic. When adding the dedicated syncing goroutine, I considered adding a dedicated goroutine to perform the commits. With some restructuring, I think we could have a dedicated per-engine goroutine for committing batches, queue both batches, and then wait for both of the commits to happen.
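A minimal Go sketch of that idea follows; the batch interface, the committer type, and the channel size are illustrative assumptions rather than CockroachDB's actual storage API.

// Sketch: one dedicated commit goroutine per engine. A caller queues a
// batch on each engine's committer and then waits for both commits, so the
// two WAL syncs run in parallel instead of strictly back to back.
package enginesketch

// batch stands in for an engine write batch; Commit is assumed to flush it
// to that engine's WAL (syncing when sync is true).
type batch interface {
	Commit(sync bool) error
}

type commitReq struct {
	b    batch
	done chan error
}

// committer serializes commits for a single engine on one goroutine. The
// goroutine is also a natural place to later coalesce queued batches.
type committer struct {
	ch chan commitReq
}

func newCommitter() *committer {
	c := &committer{ch: make(chan commitReq, 128)}
	go func() {
		for req := range c.ch {
			req.done <- req.b.Commit(true /* sync */)
		}
	}()
	return c
}

// commitAsync enqueues the batch and returns a channel that receives the
// commit's result.
func (c *committer) commitAsync(b batch) chan error {
	done := make(chan error, 1)
	c.ch <- commitReq{b: b, done: done}
	return done
}

// commitBoth queues one batch per engine and waits for both results.
func commitBoth(raftC, stateC *committer, raftBatch, stateBatch batch) error {
	raftDone := raftC.commitAsync(raftBatch)
	stateDone := stateC.commitAsync(stateBatch)
	if err := <-raftDone; err != nil {
		<-stateDone // still wait for the other commit before returning
		return err
	}
	return <-stateDone
}

Note that this only parallelizes the two syncs; it does not restore cross-engine atomicity, which is the separate problem discussed later in this thread.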
Rebasing on top of new master, I'm seeing 0 QPS occurring quite frequently:
This is new post-rebase; I suspect it's due to #16942 and the reworked batching mechanisms. Isolating a build without those changes. For posterity, these are the numbers without the raft log changes:
No zero QPS, and almost thrice as fast (!).
Nope, something else going on here:
a86bba2f6 Revert "storage/engine: use SyncWAL instead of sync on commit"
f9379de0e Revert "storage/engine: unify code to commit write-only and read-write batches"
ec8bd15c0 Revert "storage/engine: lazily initialize the RocksDB batch"
0 QPS like that often indicates a problem with leases.
Seeing a lot of snapshot generation/application isolated to a few specific ranges:
I'm not sure what it means yet, but I think that last graph is going to be the key: we have periodic bursts of messages getting through, coinciding with an increase in the level of dropped messages (and each time the spikes/increases get smaller). It looks like it repeats at regular intervals, but not at a round number (11 minutes?). This suggests problems in the raft transport, maybe flow-control related. I'd try increasing [...].
Found the issue: it was due to cockroach/pkg/storage/replica_command.go line 1927 (at 7ca57b6); Replica.mu.raftLogSize grew indefinitely. All the following emergent side effects could be traced down to just that.
The graphs look steady following a 10hr run last night, albeit with reduced throughput.
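For readers unfamiliar with that counter: the raft log queue decides when to truncate based on an in-memory size estimate, so if truncation never credits back the bytes it removed, truncation gets triggered constantly. A simplified, hypothetical sketch of the intended bookkeeping (names are illustrative, not the actual Replica code):

// Hypothetical sketch of raft log size bookkeeping; field and type names
// are illustrative, not the real CockroachDB implementation.
package raftlogsketch

type entry struct {
	index uint64
	data  []byte
}

type replica struct {
	raftLogSize int64   // approximate size in bytes of the untruncated log
	log         []entry // stand-in for the persisted raft log
}

// truncate drops all entries with index < firstIndex and, crucially,
// subtracts their size from raftLogSize; forgetting this subtraction is
// exactly the kind of bug that makes the counter grow indefinitely.
func (r *replica) truncate(firstIndex uint64) {
	var freed int64
	kept := r.log[:0]
	for _, e := range r.log {
		if e.index < firstIndex {
			freed += int64(len(e.data))
			continue
		}
		kept = append(kept, e)
	}
	r.log = kept
	r.raftLogSize -= freed
	if r.raftLogSize < 0 {
		r.raftLogSize = 0 // the estimate is approximate; clamp at zero
	}
}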
Heh, I've used the term "snapshot death spiral" for this behavior. Similar problems occurred in the past when Raft log truncation was truncating while we were trying to catch up or rebalance a replica. The heuristics now in [...]
Fixing the zero QPS issue (more raft log truncation simulation badness, sigh), I'm running into OOM/high RocksDB memory usage behavior. The first two steep runs are running truncations in the separate raft log GC queue using tombstones, fairly infrequently at that. The disk usage for the raft engine just grows without bound until we hit 47 GiB, at which point we are killed. The long-running stable run is our disk usage without the raft engine enabled, where we seem to stay steady under the same constant load ([...]).
The blip towards the end of the first run is what happens when the first node is killed having run out of memory; queries are stalled temporarily, allowing the other nodes to presumably run compactions and lower disk usage. Here are some compaction stats right before this happens on [...]:
Here are some after:
Disk size usage before (raft data is only stored in the [...]):
Disk usage after, [...]:
Before looking into this further, is there something in place currently to bound disk usage for the base engine? I'm surprised it grows so rapidly having simply moved raft log entries + truncations to a separate engine and reducing the frequency with which they occur. Raft engines for all three runs are using the same config/tunings as the base engine.
As mentioned offline, there is no limiter on the amount of disk space RocksDB uses. The significant disk growth is very surprising. I wonder if RocksDB isn't compacting sstables in the Raft log engine for some reason. Nothing is jumping out at me in the compaction stats. Do you have compaction stats from the Raft log engine when its disk usage grew to 47GB?
The snippets above are the stats from the raft log engine. Investigating.
Huh? The total data size in sstables there is tiny. I guess that makes sense on the one hand: we don't expect the Raft log engine to take much space on disk. So where is the 47GB number coming from?
Strange, the compaction stats don't match up with the files on disk.
Where I'm supposed to have a single 2MB L6 file, instead I have: [...]
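For cross-checking that kind of mismatch, a small sketch like the following (the directory path is hypothetical) sums the *.sst files under an engine directory so the on-disk total can be compared against what the compaction stats report:

// Sketch: sum the sizes of all *.sst files under an engine directory.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dir := "/mnt/data1/cockroach/raft-engine" // hypothetical raft engine dir
	var total int64
	var count int
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && strings.HasSuffix(path, ".sst") {
			total += info.Size()
			count++
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d sstables, %.1f MiB on disk\n", count, float64(total)/(1<<20))
}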
cc @cockroachdb/replication
@andrewbaptist suggested removing the fsync interference caused by sharing the filesystem by instead having 2 filesystems on the same block device.
A slightly different alternative is to not use fsync on either WAL and instead write them using AIO/O_DIRECT. If we go down that path, we would want to preallocate the WAL and then treat it as a circular buffer. Using the O_DIRECT path would mean we would need to be more explicit about our writes (they need to be 4K aligned), but if we do that, we also get some of the benefits of having an async interface to the storage layer, since only the WAL write is in the critical path. This could also be done with io_uring, but I don't know a lot about that...
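As a rough illustration of the 4K-alignment constraint mentioned above, here is a Linux-only Go sketch (the path, size, and payload are hypothetical) that performs one aligned O_DIRECT write via golang.org/x/sys/unix:

// Sketch (Linux-only; path and sizes are illustrative): a 4K-aligned
// O_DIRECT write, the kind of constraint a preallocated, circular WAL
// would have to respect on every append.
package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

const blockSize = 4096

// alignedBuf returns a length-`size` slice whose backing array starts on a
// blockSize boundary, as O_DIRECT requires for the source buffer.
func alignedBuf(size int) []byte {
	raw := make([]byte, size+blockSize)
	shift := 0
	if rem := int(uintptr(unsafe.Pointer(&raw[0])) % blockSize); rem != 0 {
		shift = blockSize - rem
	}
	return raw[shift : shift+size]
}

func main() {
	// Hypothetical WAL path; O_DIRECT is not supported on tmpfs.
	fd, err := unix.Open("/mnt/data1/raft-wal/wal.bin",
		unix.O_CREAT|unix.O_WRONLY|unix.O_DIRECT, 0o644)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	buf := alignedBuf(blockSize) // one block; lengths must be 4K multiples
	copy(buf, "raft log entry payload")

	// Offset, length, and buffer address are all 4K-aligned.
	if _, err := unix.Pwrite(fd, buf, 0); err != nil {
		panic(err)
	}
	fmt.Println("wrote one aligned block via O_DIRECT")
}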
The Pebble WAL is not pre-allocated (well, it is, but not in the way that you're describing), but the WAL recycling functionality means it is more or less a circular buffer when operating in the steady state. We don't seem to have a good description of WAL recycling in a comment or design doc, but @jbowens or @sumeerbhola know where the bodies are buried (i.e. how this all works).
https://github.com/cockroachdb/pebble/blob/master/record/record.go#L60-L98 talks about the low-level WAL format changes used for WAL recycling and mentions that WAL recycling is a prerequisite for using direct IO. |
Completely agree that we would need to use recycling of the file to use direct IO. I also don't have a good feel for what the complexity of this change would be. That said, this is one of the "easier" format changes: for an upgrade you could read the old format, delete that file, and then start writing in the new format. For this to work, the WAL would have to be fixed-size and preallocated. The fixed size is not completely true, as it could be handled like a hash table resize, where there is a very rare "freeze the world" event and all data is copied from the old log to the new log. This would be a very high latency (2-3s) event, however, so it should be avoided as much as possible. If this is done, there are 3 primary benefits: [...]
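A tiny sketch of the fixed-size, preallocated, circular-buffer idea (sizes and names are made up for illustration); a real implementation would also stamp each block with a sequence or log number so recovery can tell the newest lap's data from older laps:

// Illustrative only: offset bookkeeping for a preallocated WAL file used
// as a circular buffer, with appends rounded up to whole 4K blocks.
package walsketch

const (
	blockSize   = 4096
	walFileSize = 512 << 20 // hypothetical fixed, preallocated 512 MiB file
)

type circularWAL struct {
	writeOff int64  // next write offset within the file
	lap      uint64 // how many times we have wrapped around
}

// placeAppend reserves space for a payload of n bytes, rounded up to whole
// blocks, wrapping to the start of the file when the tail doesn't fit.
// It assumes n is far smaller than the file itself.
func (w *circularWAL) placeAppend(n int) (offset, length int64) {
	length = ((int64(n) + blockSize - 1) / blockSize) * blockSize
	if w.writeOff+length > walFileSize {
		// Wrap: whatever remains at the tail is treated by recovery as
		// belonging to the previous lap.
		w.writeOff = 0
		w.lap++
	}
	offset = w.writeOff
	w.writeOff += length
	return offset, length
}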
Not all drives have this, but enterprise SSDs will have power-loss protection: https://semiconductor.samsung.com/resources/others/Samsung_SSD_845DC_05_Power_loss_protection_PLP.pdf
I'm not sure if any additional WAL format changes are needed. WAL recycling already works (those format changes are already in) and is battle hardened. RocksDB already has code for writing the WAL using O_DIRECT, and it is using the same WAL format. Even expanding the size of the WAL already works. If you need to add another WAL file, code exists to do that. Perhaps there are some additional parts here, but I suspect most of the pieces are in place.
We'd have to chat with the cloud providers to see what guarantees they provide with regard to EBS and similar.
Reading a little more on this, the file can be opened with [...]. I think block-level sync on EBS is a no-op since it always assumes FUA is set (https://twitter.com/molson/status/1350854839090102273 and https://www.evanjones.ca/durability-cloud.html). If most of the pieces are in place for this, that would be great! I haven't looked through Pebble (or RocksDB) much before, so this might "just work". If this does all work, it would address some of the issues related to writing to both the raft log and the state machine in the same Pebble instance, but not all of them.
The significant missing piece is part of the Init implementation to handle a RecoveryInconsistentReplica that requires applying committed raft log entries to the state machine. This missing piece will need to wait until cockroachdb#75729 is fixed. There are multiple TODOs, including some related to concurrency, but the implementation is complete enough for the datadriven test to exercise many state transitions. Additionally, the test exercises loss of unsynced state, and fixup of that state in ReplicasStorage.Init, by using vfs.NewStrictMem. Informs cockroachdb#16624 Release note: None
To support a separate raft log[^1] we need to perform certain reconciliation operations at start-up to recover from a lack of atomicity between the state and log engines. This commit gets us closer to being able to do so by listing all replicas before starting the store, which means we now have a handle on which uninitialized replicas exist in the system. As a first application of this knowledge, we now ensure that every replica has a persisted FullReplicaID. Since this would not necessarily be true for stores that have seen older releases, we backfill the ReplicaID in 23.1 and can then require it to be present in a future release that forces a migration through 23.1. [^1]: cockroachdb#16624 Epic: CRDB-220 Release note: None
This introduces the concept of an Engine specific for use in a log store. In other words, it's the beginnings of a logical separation of the state machine and log engines, i.e. cockroachdb#16624. For now, both continue to be backed by the same storage.Engine, and LogEngine is not correctly used in all places. For example, snapshot application hasn't yet been updated to account for the possibility of two separate engines and writes a set of SSTS that is atomically ingested into the single engine currently present but which logically spans both engines (cockroachdb#93251). Epic: CRDB-220 Release note: None
93317: kvserver: also load uninitialized replicas, verify replicaID r=pavelkalinnikov a=tbg
93721: roachpb: include tenant name in invalid name error r=andreimatei a=andreimatei Release note: none Epic: none
93736: ring: make ring.Buffer generic r=ajwerner a=ajwerner Epic: none Release note: None
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Umbrella issue for #16361.
Jira issue: CRDB-12188
Epic CRDB-220