storage/engine: use SyncWAL instead of sync on commit #16942
Conversation
@bdarnell I think the new semantics of |
Force-pushed from 8a470b6 to 208d219
That's a pretty sweet improvement. I've only glanced at the code (looks good), but just wanted to point out that we should try to avoid forcing @irfansharif to refactor. Perhaps we can hold off on these merges until #16624 is done, which also gives us a clearer view of the performance gain of this change (expected to be much smaller, since we're forcing fewer syncs, correct?) OTOH, if the refactor is negligible, good to get this in sooner rather than later. Either way, let's ping @irfansharif on all of these PRs (easier to rebase on top of something you've seen before you crashed into it).
Reviewed 3 of 4 files at r1.

pkg/storage/engine/db.h, line 104 at r1 (raw file):
Adding a comment would be nice, at least to clarify that it blocks

pkg/storage/engine/rocksdb.go, line 435 at r2 (raw file):
Was this shown to be important for performance? Is it just here to ensure we actually get cycles every millisecond? A comment would probably be helpful down the road.
Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.

pkg/storage/engine/db.h, line 104 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Oops, meant to comment that and forgot. Done.

pkg/storage/engine/rocksdb.go, line 435 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
It's not the cycles for this thread, but to prevent other goroutines from being scheduled on our thread and experiencing additional latency. Added a comment.
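For illustration, here is a minimal sketch of the kind of dedicated WAL-sync goroutine being discussed: batches that need durability register themselves, a single goroutine calls SyncWAL once for everyone, and sleeping for the minimum interval between calls both rate-limits flushes and yields the thread rather than spinning. All names here (walSyncer, enqueue, loop) are hypothetical, and the syncWAL callback stands in for the cgo call into RocksDB; this is not the PR's actual code.

```go
package main

import (
	"sync"
	"time"
)

type walSyncer struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending []*sync.WaitGroup // batches waiting for the next WAL sync
	syncWAL func() error      // stand-in for the cgo call into RocksDB's SyncWAL
}

func newWALSyncer(syncWAL func() error) *walSyncer {
	s := &walSyncer{syncWAL: syncWAL}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// enqueue registers a batch whose commit must not be acknowledged until the
// WAL has been synced; the caller blocks on the returned WaitGroup.
func (s *walSyncer) enqueue() *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	wg.Add(1)
	s.mu.Lock()
	s.pending = append(s.pending, wg)
	s.mu.Unlock()
	s.cond.Signal()
	return wg
}

// loop runs on a dedicated goroutine. One SyncWAL call covers every batch
// written before it, and sleeping for minInterval between calls rate-limits
// disk flushes while yielding the thread to other goroutines.
func (s *walSyncer) loop(minInterval time.Duration) {
	for {
		s.mu.Lock()
		for len(s.pending) == 0 {
			s.cond.Wait()
		}
		waiters := s.pending
		s.pending = nil
		s.mu.Unlock()

		err := s.syncWAL()
		_ = err // a real implementation would propagate this to the waiters
		for _, wg := range waiters {
			wg.Done()
		}
		if minInterval > 0 {
			time.Sleep(minInterval)
		}
	}
}

func main() {
	s := newWALSyncer(func() error { return nil }) // stub sync for the sketch
	go s.loop(time.Millisecond)
	wg := s.enqueue()
	wg.Wait() // returns once the syncer has called SyncWAL for this batch
}
```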
Force-pushed from a1886fb to a4d1f55
Reviewed 2 of 4 files at r1, 2 of 2 files at r3.
@bdarnell Just to be clear, is it fine for the Raft log entries and hard state updates to be visible to other goroutines before they have been synced to disk? Is there a scenario with sending a snapshot that could be problematic? I haven't thought of one, but you're the expert here.

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.
Yes, it's fine for the log entries (and HardState) to be visible before they are synced to disk. The important constraints are that these syncs happen before raft messages are sent and before the call to RawNode.Advance. Our regular locking is sufficient for this, and if other goroutines can see the data early, that's fine. In particular, snapshots are not a problem (I think they're the only thing that might access log entries or HardState from other goroutines). Snapshots do not include either the HardState or uncommitted log entries, and even if they did include log entries that were not persisted to disk, it wouldn't be a problem because raft does not infer that entries are persisted on the node that sends a snapshot.

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.
^- sounds like good comment material.
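A minimal sketch of the ordering constraint Ben describes, written against the upstream etcd/raft RawNode API (the import path shown is the one in use at the time). The engine interface and handleReady function are hypothetical stand-ins for CockroachDB's storage and transport code; the only point is the order: write (visible immediately), then sync the WAL, then send messages and call Advance.

```go
package raftsync

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// engine is a hypothetical stand-in for the storage engine: writes become
// visible to other goroutines immediately, SyncWAL makes them durable.
type engine interface {
	Write(hs raftpb.HardState, ents []raftpb.Entry) error
	SyncWAL() error
}

// handleReady shows the ordering that matters: entries and HardState may be
// read by other goroutines as soon as they are written, but the WAL must be
// synced before messages are sent and before RawNode.Advance is called.
func handleReady(rn *raft.RawNode, eng engine, send func([]raftpb.Message)) error {
	if !rn.HasReady() {
		return nil
	}
	rd := rn.Ready()

	// 1. Write entries and HardState. Visibility before durability is fine.
	if err := eng.Write(rd.HardState, rd.Entries); err != nil {
		return err
	}
	// 2. Make them durable before acting on the Ready.
	if err := eng.SyncWAL(); err != nil {
		return err
	}
	// 3. Only now send messages and acknowledge the Ready.
	send(rd.Messages)
	rn.Advance(rd)
	return nil
}
```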
The previous implementation of batching for synced commits had the unfortunate property that commits that did not require syncing were required to wait for the WAL to be synced. When RocksDB writes a batch with sync==true it internally does:

1. Add batch to WAL
2. Sync WAL
3. Add entries to mem table

Switch to using SyncWAL to explicitly sync the WAL after writing a batch. This has slightly different semantics from the above:

1. Add batch to WAL
2. Add entries to mem table
3. Sync WAL

The advantage of this new approach is that non-synced batches do not have to wait for the WAL to sync. Prior to this change, it was observed that essentially every batch was waiting for a WAL sync. Approximately half of all batch commits are performed with sync==true (half of all batch commits are for Raft log entries or Raft state). Forcing the non-synced commits to wait for the WAL added significant time.

Reworked the implementation of batch grouping. The sequence number mechanism was replaced by a per-batch sync.WaitGroup embedded in the rocksDBBatch structure. Once a batch is committed (and synced if requested), the wait group is signalled, ensuring that only the desired goroutine is woken instead of waking all of the waiting goroutines as in the previous implementation. Syncing of the WAL is performed by a dedicated thread, and only the batches which specify sync==true wait for the syncing to occur.

Added "rocksdb.min_wal_sync_interval" which specifies a minimum delay to wait between calls to SyncWAL. The default of 1ms was experimentally determined to reduce disk IOs by ~50% (vs 0ms) while leaving write throughput unchanged.

For a write-only workload (`ycsb -workload F`), write throughput improved by 15%. Raft command commit latency (unsynced commits) dropped from 30-40ms to 6-10ms.
@tschottdorf Good idea. Added Ben's comment as a comment above the batch commit of Raft log/state in |
Force-pushed from a4d1f55 to 82cbb49
The previous implementation of batching for synced commits had the
unfortunate property that commits that did not require syncing were
required to wait for the WAL to be synced. When RocksDB writes a batch
with sync==true it internally does:

1. Add batch to WAL
2. Sync WAL
3. Add entries to mem table

Switch to using SyncWAL to explicitly sync the WAL after writing a
batch. This has slightly different semantics from the above:

1. Add batch to WAL
2. Add entries to mem table
3. Sync WAL

The advantage of this new approach is that non-synced batches do not
have to wait for the WAL to sync. Prior to this change, it was observed
that essentially every batch was waiting for a WAL sync. Approximately
half of all batch commits are performed with sync==true (half of all
batch commits are for Raft log entries or Raft state). Forcing the
non-synced commits to wait for the WAL added significant time.

Reworked the implementation of batch grouping. The sequence number
mechanism was replaced by per-batch condition variables embedded in the
rocksDBBatch structure. Once a batch is committed (and synced if
requested), the condition variable is signalled, ensuring that only the
desired goroutines are woken instead of waking all of the goroutines
waiting on the condition. Syncing of the WAL is performed by a dedicated
thread, and only the batches which specify sync==true wait for the
syncing to occur.

Added "rocksdb.min_wal_sync_interval" which specifies a minimum delay
to wait between calls to SyncWAL. The default of 1ms was experimentally
determined to reduce disk IOs by ~50% (vs 0ms) while leaving write
throughput unchanged.

For a write-only workload (`ycsb -workload F`), write throughput
improved by 15%. Raft command commit latency (unsynced commits) dropped
from 30-40ms to 6-10ms.
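The two revisions of the commit message describe the per-batch wakeup as either a condition variable or a sync.WaitGroup embedded in rocksDBBatch. Below is a minimal sketch, assuming the WaitGroup variant, of how such commit grouping might look: a leader writes the queued batches in one unsynced RocksDB write, releases non-synced batches immediately, and hands sync==true batches to the WAL-sync goroutine. The commitPipeline, writeGroup, and syncWait names are illustrative, not the PR's actual code.

```go
package main

import "sync"

// batch embeds a WaitGroup that is signalled once the batch has been
// committed (and synced, if requested); only that batch's goroutine wakes.
type batch struct {
	repr     []byte
	mustSync bool
	commitWG sync.WaitGroup
	err      error
}

// commitPipeline groups concurrently committed batches.
type commitPipeline struct {
	mu         sync.Mutex
	queue      []*batch
	committing bool
	writeGroup func([]*batch) error // one unsynced RocksDB write for the whole group
	syncWait   func(*batch)         // hand off to the dedicated WAL-sync goroutine
}

// Commit enqueues the batch and blocks until it is committed (and synced if
// b.mustSync). The first goroutine to find the pipeline idle becomes the
// leader and commits everything queued behind it.
func (p *commitPipeline) Commit(b *batch) error {
	b.commitWG.Add(1)

	p.mu.Lock()
	p.queue = append(p.queue, b)
	leader := !p.committing
	if leader {
		p.committing = true
	}
	p.mu.Unlock()

	if leader {
		for {
			p.mu.Lock()
			group := p.queue
			p.queue = nil
			if len(group) == 0 {
				p.committing = false
				p.mu.Unlock()
				break
			}
			p.mu.Unlock()

			err := p.writeGroup(group) // data is now visible, not yet durable
			for _, gb := range group {
				gb.err = err
				if gb.mustSync && err == nil {
					p.syncWait(gb) // syncer calls gb.commitWG.Done() after SyncWAL
				} else {
					gb.commitWG.Done()
				}
			}
		}
	}

	b.commitWG.Wait() // wakes exactly this batch's goroutine
	return b.err
}

func main() {
	p := &commitPipeline{
		writeGroup: func([]*batch) error { return nil },  // stub: pretend to write
		syncWait:   func(b *batch) { b.commitWG.Done() }, // stub: "sync" immediately
	}
	_ = p.Commit(&batch{repr: []byte("k=v"), mustSync: true})
}
```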