storage/engine: use SyncWAL instead of sync on commit #16942

Merged

Conversation

petermattis
Collaborator

The previous implementation of batching for synced commits had the
unfortunate property that commits which did not require syncing still
had to wait for the WAL to be synced. When RocksDB writes a batch with
sync==true it internally does:

  1. Add batch to WAL
  2. Sync WAL
  3. Add entries to mem table

Switch to using SyncWAL to explicitly sync the WAL after writing a
batch. This has slightly different semantics from the above:

  1. Add batch to WAL
  2. Add entries to mem table
  3. Sync WAL

The advantage of this new approach is that non-synced batches do not
have to wait for the WAL to sync. Prior to this change, it was observed
that essentially every batch was waiting for a WAL sync. Approximately
half of all batch commits are performed with sync==true (half of all
batch commits are for Raft log entries or Raft state). Forcing the
non-synced commits to wait for the WAL sync added significant latency.
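
A minimal sketch in Go of the two commit paths, using hypothetical
stand-ins dbWrite and dbSyncWAL for the cgo wrappers around RocksDB's
Write and SyncWAL (the real wrappers live in pkg/storage/engine):

```go
package enginesketch

// dbWrite and dbSyncWAL are hypothetical stand-ins for the cgo
// wrappers around RocksDB's Write and SyncWAL.
func dbWrite(batch []byte, sync bool) error { return nil }
func dbSyncWAL() error                      { return nil }

// commitOld mirrors the previous behavior: with sync==true RocksDB
// syncs the WAL before applying the batch to the memtable, so every
// batch grouped into the same write stalls behind the fsync.
func commitOld(batch []byte) error {
	return dbWrite(batch, true /* sync */)
}

// commitNew mirrors the new behavior: the batch is written (and becomes
// visible in the memtable) without syncing; only sync==true commits
// then wait on the explicit SyncWAL call.
func commitNew(batch []byte, sync bool) error {
	if err := dbWrite(batch, false /* sync */); err != nil {
		return err
	}
	if sync {
		return dbSyncWAL()
	}
	return nil
}
```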

Reworked the implementation of batch grouping. The sequence number
mechanism was replaced by per-batch condition variables embedded in the
rocksDBBatch structure. Once a batch is committed (and synced if
requested), its condition variable is signalled, ensuring that only the
desired goroutines are woken instead of waking every goroutine waiting
on a shared condition. Syncing of the WAL is performed by a dedicated
thread, and only the batches which specify sync==true wait for the
syncing to occur.
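
A minimal sketch of the per-batch wakeup (hypothetical and heavily
simplified relative to the actual rocksDBBatch):

```go
package enginesketch

import "sync"

// pendingBatch is a hypothetical, stripped-down stand-in for
// rocksDBBatch: each batch carries its own condition variable so that
// committing a group wakes only the goroutines waiting on it.
type pendingBatch struct {
	mu        sync.Mutex
	cond      sync.Cond
	committed bool
	err       error
}

func newPendingBatch() *pendingBatch {
	b := &pendingBatch{}
	b.cond.L = &b.mu
	return b
}

// wait blocks until the leader of the commit group marks this batch
// committed (and synced, if requested).
func (b *pendingBatch) wait() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	for !b.committed {
		b.cond.Wait()
	}
	return b.err
}

// signal is called by the goroutine that committed the group; it wakes
// only the goroutine(s) waiting on this particular batch.
func (b *pendingBatch) signal(err error) {
	b.mu.Lock()
	b.committed = true
	b.err = err
	b.cond.Signal()
	b.mu.Unlock()
}
```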

Added "rocksdb.min_wal_sync_interval" which adds specifies a minimum
delay to wait between calls to SyncWAL. The default of 1ms was
experimentally determined to reduce disk IOs by ~50% (vs 0ms) while leaving write
throughput unchanged.
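
A minimal sketch (hypothetical names) of how a dedicated loop might
coalesce sync requests while honoring such a minimum interval:

```go
package enginesketch

import "time"

// walSyncLoop receives one channel per batch waiting for a WAL sync;
// each channel receives the result of the SyncWAL call it rode on.
func walSyncLoop(requests <-chan chan error, minInterval time.Duration,
	syncWAL func() error) {
	var lastSync time.Time
	for req := range requests {
		// Enforce the minimum interval so bursts of sync requests
		// coalesce into fewer disk IOs.
		if wait := minInterval - time.Since(lastSync); wait > 0 {
			time.Sleep(wait)
		}
		// Drain any requests that arrived in the meantime; they all
		// ride on the same SyncWAL call.
		waiting := []chan error{req}
	drain:
		for {
			select {
			case r, ok := <-requests:
				if !ok {
					break drain
				}
				waiting = append(waiting, r)
			default:
				break drain
			}
		}
		err := syncWAL()
		lastSync = time.Now()
		for _, w := range waiting {
			w <- err
		}
	}
}
```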

For a write-only workload (ycsb -workload F), write throughput
improved by 15%. Raft command commit latency (unsynced commits) dropped
from 30-40ms to 6-10ms.

@cockroach-teamcity
Member

This change is Reviewable

@petermattis petermattis requested review from bdarnell and tbg July 8, 2017 02:07
@petermattis
Collaborator Author

@bdarnell I think the new semantics of sync==true are kosher, but please verify. We won't respond to a Raft request to append log entries until the entries are synced to disk, but the entries will become visible to other goroutines before the WAL is synced.

@petermattis petermattis force-pushed the pmattis/rocksdb-sync-wal branch 2 times, most recently from 8a470b6 to 208d219 on July 8, 2017 14:55
@tbg
Copy link
Member

tbg commented Jul 9, 2017

That's a pretty sweet improvement. I've only glanced at the code (looks good), but just wanted to point out that we should try to avoid forcing @irfansharif to refactor. Perhaps we can hold off on these merges until #16624 is done, which also gives us a clearer view of the performance gain of this change (expected to be much smaller, since we're forcing fewer syncs, correct?)

OTOH, if the refactor is negligible, good to get this in sooner rather than later. Either way, let's ping @irfansharif on all of these PRs (easier to rebase on top of something you've seen before you crashed into it).

@petermattis
Collaborator Author

I don't think there is any conflict with #16809. Yes, the perf gain from this PR will be smaller after #16809 as we usually won't be syncing Raft command commits, though we still will for log truncation commands.

@a-robinson
Contributor

:lgtm: crazy good improvement


Reviewed 3 of 4 files at r1.
Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/engine/db.h, line 104 at r1 (raw file):

// complete.
DBStatus DBFlush(DBEngine* db);
DBStatus DBSyncWAL(DBEngine* db);

Adding a comment would be nice, at least to clarify that it blocks


pkg/storage/engine/rocksdb.go, line 435 at r2 (raw file):

func (r *RocksDB) syncLoop() {
	runtime.LockOSThread()

Was this shown to be important for performance? Is it just here to ensure we actually get cycles every millisecond? A comment would probably be helpful down the road.



@petermattis
Collaborator Author

Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/engine/db.h, line 104 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Adding a comment would be nice, at least to clarify that it blocks

Oops, meant to comment that and forgot. Done.


pkg/storage/engine/rocksdb.go, line 435 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Was this shown to be important for performance? Is it just here to ensure we actually get cycles every millisecond? A comment would probably be helpful down the road.

It's not the cycles for this thread, but to prevent other goroutines from being scheduled on our thread and experiencing additional latency. Added a comment.
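
A minimal sketch of that rationale (a hypothetical shape, not the
actual syncLoop implementation):

```go
package enginesketch

import "runtime"

// walSyncThread pins the syncing goroutine to its own OS thread so the
// Go scheduler does not place other goroutines on that thread, where
// they would stall behind the blocking fsync and see extra latency.
func walSyncThread(syncWAL func() error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for {
		_ = syncWAL() // blocks in fsync for the duration of the sync
	}
}
```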



@petermattis petermattis force-pushed the pmattis/rocksdb-sync-wal branch 2 times, most recently from a1886fb to a4d1f55 on July 10, 2017 19:29
@bdarnell
Contributor

:lgtm:


Reviewed 2 of 4 files at r1, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@petermattis
Collaborator Author

@bdarnell Just to be clear, is it fine for the Raft log entries and hard state updates to be visible to other goroutines before they have been synced to disk? Is there a scenario with sending a snapshot that could be problematic? I haven't thought of one, but you're the expert here.


Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@bdarnell
Contributor

Yes, it's fine for the log entries (and HardState) to be visible before they are synced to disk. The important constraints are that these syncs happen before raft messages are sent and before the call to RawNode.Advance. Our regular locking is sufficient for this, and if other goroutines can see the data early, that's fine. In particular, snapshots are not a problem (I think they're the only thing that might access log entries or HardState from other goroutines). Snapshots do not include either the HardState or uncommitted log entries, and even if they did include log entries that were not persisted to disk, it wouldn't be a problem because raft does not infer that the entries are persisted on the node that sends a snapshot.
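
A minimal sketch of that ordering constraint (all names hypothetical;
the real logic lives in the replica Raft handling code):

```go
package enginesketch

// ready is a hypothetical stand-in for a raft Ready: entries and
// HardState to persist, plus outbound messages.
type ready struct {
	entries  [][]byte
	messages [][]byte
}

// handleReady shows the required ordering: commit (entries become
// visible to other goroutines, which is safe), then sync the WAL, and
// only then send messages and advance raft.
func handleReady(rd ready, commit func([][]byte) error,
	syncWAL func() error, send func([][]byte), advance func()) error {
	if err := commit(rd.entries); err != nil {
		return err
	}
	if err := syncWAL(); err != nil {
		return err
	}
	send(rd.messages)
	advance()
	return nil
}
```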


Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@tbg
Member

tbg commented Jul 10, 2017 via email

@petermattis
Collaborator Author

@tschottdorf Good idea. Added Ben's comment as a comment above the batch commit of Raft log/state in replica.go.

@petermattis petermattis merged commit 4a1b7fe into cockroachdb:master Jul 10, 2017
@petermattis petermattis deleted the pmattis/rocksdb-sync-wal branch July 10, 2017 21:19