[DNM] kv: prototype async Raft log writes #87050

Conversation

@nvanbenschoten (Member) commented Aug 29, 2022:

See #17500.

This commit polishes and flushes a prototype I've had lying around that demonstrates the async Raft log appends component of #17500. I'm not actively planning to productionize this, but it sounds like we may work on this project in v23.1, so this prototype might help. It also demonstrates the kind of performance wins we can expect to see on write-heavy workloads. Until now, we had only demonstrated the potential speedup in a simulated environment with [rafttoy](https://github.com/nvanbenschoten/rafttoy).

Half of the change here is to `etcd/raft` itself, which needs to be adapted to support asynchronous log writes. These changes are presented in nvanbenschoten/etcd@1d1fa32.

The other half of the change is extracting a Raft log writer component that handles asynchronously appending to a collection of Raft logs and notifying individual replicas when those writes become durable. This component is pretty basic and should probably be entirely rewritten, but it gets the job done for the prototype.
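
To make the shape concrete, here is a minimal sketch of what such a component could look like; all names below are hypothetical, not the prototype's actual API:

```go
package raftlog

// DurabilityCallback fires once a batch of appended entries is known to be
// durable, i.e., the WAL write backing it has been synced.
type DurabilityCallback func(rangeID int64, lastIndex uint64)

// Writer asynchronously appends entries to a collection of per-range Raft
// logs and notifies each replica when its writes become durable.
type Writer interface {
	// Append enqueues entries for the given range and returns immediately.
	// cb is invoked later, once the entries have been synced to disk.
	Append(rangeID int64, entries [][]byte, cb DurabilityCallback)
	// Close flushes pending appends and stops the writer.
	Close()
}
```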

The Raft log writer reveals an interesting dynamic: concurrency at this level actually hurts performance, because it leads to concurrent calls to sync Pebble's WAL, which is less performant than having a single caller. This is because Pebble only exposes a synchronous Sync API and coalesces all Sync requests onto a single thread. An async Pebble Sync API would be valuable here. See the comment in `NewWriter` for more details.

### Benchmarks

```
name                          old ops/s    new ops/s    delta
kv0/enc=false/nodes=3/cpu=32   36.4k ± 5%   46.5k ± 5%  +27.64%  (p=0.000 n=10+10)

name                          old avg(ms)  new avg(ms)  delta
kv0/enc=false/nodes=3/cpu=32    5.26 ± 3%    4.14 ± 6%  -21.33%  (p=0.000 n=8+10)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32    10.9 ± 8%     9.1 ±10%  -15.94%  (p=0.000 n=10+10)
```

These are compelling results. I haven't pushed on this enough to know whether there's actually a throughput win here, or whether the fixed concurrency and reduced average latency just make it look like there is. `kv0bench` should help answer that question.

cc @erikgrinaker


@erikgrinaker (Contributor) commented:

Thanks @nvanbenschoten! These results are certainly compelling, and even if we only ended up with half of this on real workloads that'd still be fantastic. The relatively limited scope of the changes here is also reassuring.

> it sounds like we may work on this project in v23.1

Unfortunately, this didn't make the cut, and may have to wait until 23.2. We're doing the Raft log separation this cycle, but that should dovetail nicely with this work.

@sumeerbhola (Collaborator) left a comment:


pkg/kv/kvserver/raftlog/writer.go line 73 at r1 (raw file):

```go
// some goroutine wait on a sync to be notified of durability. An asynchronous
// variant of Pebble's API (similar to Pebble's own syncQueue) that is hooked
// directly into Pebble's LogWriter.flushLoop would help.
```

I reread our Slack thread from May 19, which touched on this.

Our current synchronous scheme incurs a mean of 1.5x the latency of a single LogWriter flush+sync, since the LogWriter is typically working on something and must finish it before flushing again. We ignore the latency incurred in waiting for a raftScheduler.worker because we assume that there are enough workers that someone will be free. This assumption partly relies on the fact that there is no deterministic assignment of replicas => workers, so we don't have to worry about a hotspot caused by a poor assignment.

If we mirror that approach in these writers, i.e., no deterministic assignment and enough writers, we can continue to assume zero latency in waiting for a writer to be free. Not having a deterministic assignment is also generally good for smearing out the load. So I don't think we need an async API in Pebble.
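
As a quick sanity check on that 1.5x figure (a sketch, assuming a write arrives uniformly at random while a flush+sync of duration $S$ is already in flight):

```latex
\mathbb{E}[\text{latency}]
  = \underbrace{S/2}_{\text{wait for the in-flight flush+sync}}
  + \underbrace{S}_{\text{own flush+sync}}
  = 1.5\,S
```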

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 23, 2022
kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log

This commit replaces the call to `Term(raftAppliedIndex)` with direct
use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when
generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot`
interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also
helps with projects like cockroachdb#87050, where async Raft log writes make it
possible for a Raft leader to apply an entry before it has been appended
to the leader's own log. Such flexibility[^1] would help smooth out tail
latency in any single replica's local log writes, even if that replica
is the leader itself. This is an important characteristic of quorum
systems that we fail to provide because of the tight coupling between
the Raft leader's own log writes and the Raft leader's acknowledgement
of committed proposals.

[^1]: If safe; I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.
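
For illustration, a minimal before/after sketch of the simplification; the types below are stand-ins for the real `raftpb.SnapshotMetadata` and applied-state protos, not the actual CockroachDB code:

```go
package main

import "fmt"

// Stand-in types for the sketch.
type snapshotMetadata struct{ Index, Term uint64 }
type rangeAppliedState struct{ RaftAppliedIndex, RaftAppliedIndexTerm uint64 }

// Before: term, err := raftStorage.Term(appliedIndex), which may need to
// read the raft log and can fail once the log is truncated past the index.
// After: two field reads from migrated state, with no log access at all.
func snapshotMeta(s rangeAppliedState) snapshotMetadata {
	return snapshotMetadata{Index: s.RaftAppliedIndex, Term: s.RaftAppliedIndexTerm}
}

func main() {
	fmt.Println(snapshotMeta(rangeAppliedState{RaftAppliedIndex: 42, RaftAppliedIndexTerm: 7}))
}
```
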
craig bot pushed a commit that referenced this pull request Sep 27, 2022
88419: ui: add connected components for insights r=ericharmeling a=ericharmeling

This commit adds connected components for the workload and schema insights pages, for use in the CC Console.

Fixes #87693.

https://www.loom.com/share/08163a7c125948119ca71ac097099a29

Release note: None

88510: ui: move `latestQuery`, `latestFormattedQuery` from redux to local state r=xinhaoz a=xinhaoz

Previously, the fields `latestQuery` and `latestFormattedQuery`, representing the latest non-empty query string for a statement viewed from the details page, were being stored in redux. The purpose of these fields was to preserve the query when changing tabs on the stmt details page. Saving this in the redux store was unnecessary, so this commit moves these fields to the stmt details local state.

Release note: None


https://www.loom.com/share/a28d412fb83a429391210935982404de

88594: cloud: add version gate for auth via assume role in AWS and GCP storage and KMS r=rhu713 a=rhu713

Add a version gate for auth via assume role in AWS and GCP storage and KMS to prevent this type of auth until all nodes in the cluster have been upgraded to 22.2. The gate prevents a class of job failures where a job can sometimes succeed with assume role auth if its processors happen to all be on 22.2 nodes, but fail at other times when one of its processor nodes doesn't support assume role. This version gate preempts the issue by preventing this type of auth until the cluster has been finalized on 22.2, and gives a better error message explaining why the auth cannot be used.

It's important to note that this gate does not prevent a user from creating a BACKUP job that uses assume role auth, e.g. via the DETACHED option, because the destination storage is not accessed during planning. This is in line with existing behavior for other types of auth errors, e.g. if the user enters incorrect credentials. The BACKUP job will still fail with the version gate error when it eventually executes.

Release note: None

88596: kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log r=nvanbenschoten a=nvanbenschoten

This commit replaces the call to `Term(raftAppliedIndex)` with direct use of the new `RaftAppliedIndexTerm` field (added in c3bc064) when generating a `SnapshotMetadata` in service of the `raft.Storage.Snapshot` interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like #87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgment of committed proposals.

Release justification: None. Don't backport to release-22.2.

Release note: None.

[^1]: If safe; I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely coupled log truncation.

88800: kvserver: use type-safe atomics in raftSendQueue r=nvanbenschoten a=pavelkalinnikov

Go 1.19 introduced atomic types that enforce atomic access to variables, which in many situations is less error-prone. This commit resolves a TODO to take advantage of these types.

Release note: None
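
For illustration, a minimal sketch of the kind of change this enables (the `bytes` field is hypothetical, not the actual diff):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Before Go 1.19 this would be a plain int64 manipulated via
// atomic.AddInt64/atomic.LoadInt64, and nothing stops a non-atomic read.
// The typed atomic forces every access through atomic methods.
type raftSendQueue struct {
	bytes atomic.Int64
}

func main() {
	var q raftSendQueue
	q.bytes.Add(128)            // atomic increment
	fmt.Println(q.bytes.Load()) // atomic read; prints 128
}
```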

Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@nvanbenschoten (Member, Author) left a comment:


pkg/kv/kvserver/raftlog/writer.go line 73 at r1 (raw file):
[Sumeer and I discussed this over lunch]

> This assumption partly relies on the fact that there is no deterministic assignment of replicas => workers, so we don't have to worry about getting a hotspot because of a poor assignment. If we mirror that approach in these writers, i.e., no deterministic assignment, and enough writers, we can continue to assume zero latency in waiting for the writer to be free.

While this is true across ranges, we do still need to pay for ordering within a range, because log writes for a given range must be performed in increasing index order without the possibility of gaps. This is where the desire for an async API comes from. Without one, a log write for a given range needs to wait for both the sequencing (seqnum assignment) and the sync of previous writes for that range. Ideally, it would only need to wait for the sequencing of previous writes and could then independently wait for the sync.

Assuming multiple ranges and a full replication pipeline within each range, the sync Pebble API incurs an average-case 1.5x latency multiplier to order log writes within a range, and then another average-case 1.5x latency multiplier when syncing log writes while waiting for syncs from other ranges, due to the single LogWriter flush+sync loop you mentioned. Combined, that means the average-case latency multiplier to get through this `Writer.Append` call is 2.25x.

`shards = 1` helps because it avoids the second multiplier, but only because no one else is syncing in the system, so the LogWriter flush+sync loop idles when not being synced by this writer. I think one way to get around the absence of an async API is to stop syncing individual WriteBatch writes and to lift our own flushing loop up here that calls Pebble.SyncWAL in a loop. That would allow us to build our own async API, but I'd like to avoid that. It also only helps if no one else in the system is ever driving the Pebble LogWriter sync loop.

@sumeerbhola (Collaborator) left a comment:


pkg/kv/kvserver/raftlog/writer.go line 73 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…


I agree that CockroachDB should not need to call something like Pebble.SyncWAL in a loop, since it becomes complicated to coordinate what got synced.
I'll get you to explain the 2.25x calculation in front of a whiteboard. I get the same answer, but by a different decomposition: 1.5x for doing an fsync'd write to Pebble + 1.5x/2 for waiting for the worker thread that is doing the write to free up = 2.25x. What you are asking for would make this 1x latency. Perhaps more importantly, it would remove the 1.5x of fsync wait-for-IO time from the worker thread, which allows for higher throughput on the range (which may allow us to make ranges fatter) and makes throughput (assuming an open-loop workload) insensitive to variations in fsync latency.
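
For the record, the two decompositions agree (a sketch, in units of one flush+sync):

```latex
\underbrace{1.5 \times 1.5}_{\text{ordering within a range}\ \times\ \text{shared sync loop}} = 2.25
\qquad
\underbrace{1.5}_{\text{fsync'd write}} + \underbrace{1.5/2}_{\text{wait for busy worker}} = 2.25
```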

Making the Pebble change should be easy -- do you want it now for this prototype, or later when we are writing the production code?

@tbg (Member) left a comment:


pkg/kv/kvserver/raftlog/writer.go line 73 at r1 (raw file):

> I'll get you to explain the 2.25x calculation in front of a whiteboard

👍 I was also not able to really confirm this one. Also, depending on how you model it, I think the actual wait might be longer. When the in-flight shard finishes and ours starts, it has no chance to catch a 50%-done Pebble sync: it'll either be the one starting the sync, or it has to latch onto a sync that just started. The average is 0.5, but only if you assume both cases are equally likely. In a system where many ranges are seeing writes, the average is closer to 1.

Anyway, it seems clear that an API that asynchronously confirms in-flight fsyncs is the right solution.

@nvanbenschoten (Member, Author) left a comment:


pkg/kv/kvserver/raftlog/writer.go line 73 at r1 (raw file):

> The average is 0.5, but only if you assume both cases are equally likely. In a system where many ranges are seeing writes, the average is closer to 1.

For the case where all writer shards are writing+syncing in a tight loop, I think you are correct that each individual shard ends up participating in every second sync. This leads to an expected wait time of 1 sync, so using multiple shards doubles the latency of any individual shard's call to sync. That's directly in line with my prototype results.
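
Sketching that out, with $S$ the duration of one sync: a shard that just missed the in-flight sync waits for it to finish and then for its own,

```latex
\mathbb{E}[\text{wait}] = \underbrace{S}_{\text{sync just missed}} + \underbrace{S}_{\text{own sync}} = 2S
```

versus $S$ when a shard has the sync loop to itself, hence the doubling.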

> Making the Pebble change should be easy -- do you want it now for this prototype, or later when we are writing the production code?

If you think the changes should be easy and are willing to make them, then I'd be happy to integrate them into this PR. I've been working on productionizing the changes to etcd/raft, which should be in a good enough state to rebase this PR on shortly.

@nvanbenschoten (Member, Author) commented:

Replaced by #94165.

sumeerbhola added a commit to cockroachdb/pebble that referenced this pull request Jan 11, 2023
ApplyNoSyncWait must only be used when WriteOptions.Sync is true. It enqueues
the Batch to the WAL, adds to the memtable, and waits until the batch is
visible in the memtable, and then returns to the caller. The caller is
responsible for calling Batch.SyncWait to wait until the write to the
WAL is fsynced.

This change required splitting the WaitGroup in the Batch into two
WaitGroups, so waiting for the visibility can happen separately from
waiting for the WAL write. Additionally, the channel used as a semaphore
for reserving space in the two lock-free queues is split into two channels,
since dequeueing from these queues can happen in arbitrary order.

Benchmarks indicate that the overhead of pushing and popping on an extra
channel is tolerable. Benchmarks were run on a MacBook Pro -- note these are
not doing an actual sync, since they use io.Discard, and are only benchmarking
the commit pipeline.

Sync wait on master (old) vs this branch (new):

```
name                                               old time/op    new time/op    delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.09µs ± 6%    1.15µs ± 9%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.53µs ± 4%    1.54µs ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.54µs ± 1%    1.59µs ± 1%  +2.87%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.52µs ± 1%    1.55µs ± 1%  +2.43%  (p=0.008 n=5+5)

name                                               old speed      new speed      delta
CommitPipeline/no-sync-wait=false/parallel=1-10    14.7MB/s ± 5%  13.9MB/s ±10%    ~     (p=0.310 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10    10.5MB/s ± 4%  10.4MB/s ± 2%    ~     (p=0.841 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10    10.4MB/s ± 1%  10.1MB/s ± 1%  -2.78%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10    10.5MB/s ± 1%  10.3MB/s ± 1%  -2.35%  (p=0.008 n=5+5)

name                                               old alloc/op   new alloc/op   delta
CommitPipeline/no-sync-wait=false/parallel=1-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=2-10      1.37kB ± 0%    1.40kB ± 0%  +2.31%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=4-10      1.37kB ± 0%    1.40kB ± 0%  +2.15%  (p=0.008 n=5+5)
CommitPipeline/no-sync-wait=false/parallel=8-10      1.37kB ± 0%    1.40kB ± 0%  +2.34%  (p=0.008 n=5+5)

name                                               old allocs/op  new allocs/op  delta
CommitPipeline/no-sync-wait=false/parallel=1-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=2-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=4-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
CommitPipeline/no-sync-wait=false/parallel=8-10        2.00 ± 0%      2.00 ± 0%    ~     (all equal)
```

Sync wait on this branch (old) vs async wait on this branch (new):

```
name                            old time/op    new time/op    delta
CommitPipeline/parallel=1-10      1.15µs ± 9%    1.20µs ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10      1.54µs ± 2%    1.59µs ± 2%   +3.50%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.59µs ± 1%    1.58µs ± 1%     ~     (p=0.802 n=5+5)
CommitPipeline/parallel=8-10      1.55µs ± 1%    1.56µs ± 1%     ~     (p=0.452 n=5+5)

name                            old speed      new speed      delta
CommitPipeline/parallel=1-10    13.9MB/s ±10%  13.3MB/s ± 7%     ~     (p=0.421 n=5+5)
CommitPipeline/parallel=2-10    10.4MB/s ± 2%  10.1MB/s ± 2%   -3.36%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10    10.1MB/s ± 1%  10.1MB/s ± 1%     ~     (p=0.786 n=5+5)
CommitPipeline/parallel=8-10    10.3MB/s ± 1%  10.3MB/s ± 1%     ~     (p=0.452 n=5+5)

name                            old alloc/op   new alloc/op   delta
CommitPipeline/parallel=1-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.651 n=5+5)
CommitPipeline/parallel=2-10      1.40kB ± 0%    1.39kB ± 0%   -0.21%  (p=0.008 n=5+5)
CommitPipeline/parallel=4-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.706 n=5+5)
CommitPipeline/parallel=8-10      1.40kB ± 0%    1.40kB ± 0%     ~     (p=0.587 n=5+5)

name                            old allocs/op  new allocs/op  delta
CommitPipeline/parallel=1-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=2-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=4-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
CommitPipeline/parallel=8-10        2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

Informs cockroachdb/cockroach#17500

See discussion thread cockroachdb/cockroach#87050 (review)
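
A minimal usage sketch of the split commit/sync API described in the commit message above, assuming a Pebble version that includes `ApplyNoSyncWait` and `Batch.SyncWait`; the function and callback names are illustrative only:

```go
package walsync

import "github.com/cockroachdb/pebble"

// appendDurably commits a write without blocking on the WAL fsync, then
// reports durability via onDurable once the fsync completes.
func appendDurably(db *pebble.DB, key, value []byte, onDurable func(error)) error {
	b := db.NewBatch()
	if err := b.Set(key, value, nil); err != nil {
		return err
	}
	// Enqueue to the WAL and apply to the memtable. This returns once the
	// batch is visible, not once it is durable. WriteOptions.Sync must be
	// true when using ApplyNoSyncWait.
	if err := db.ApplyNoSyncWait(b, pebble.Sync); err != nil {
		return err
	}
	go func() {
		err := b.SyncWait() // blocks until the WAL write is fsynced
		_ = b.Close()
		onDurable(err)
	}()
	return nil
}
```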