storage/engine: use SyncWAL instead of sync on commit #16942

Merged

Conversation

petermattis
Collaborator

The previous implementation of batching for synced commits had the
unfortunate property that commits which did not require syncing still
had to wait for the WAL to be synced. When RocksDB writes a batch with
sync==true it internally does:

  1. Add batch to WAL
  2. Sync WAL
  3. Add entries to mem table

Switch to using SyncWAL to explicitly sync the WAL after writing a
batch. This has slightly different semantics from the above:

  1. Add batch to WAL
  2. Add entries to mem table
  3. Sync WAL

The advantage of this new approach is that non-synced batches do not
have to wait for the WAL to sync. Prior to this change, it was observed
that essentially every batch was waiting for a WAL sync. Approximately
half of all batch commits are performed with sync==true (half of all
batch commits are for Raft log entries or Raft state). Forcing the
non-synced commits to wait for the WAL sync added significant latency.
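
A minimal sketch in Go of the two commit paths, using hypothetical
stand-ins dbWrite and dbSyncWAL for the cgo wrappers around RocksDB's
Write and SyncWAL (the real wrappers live in pkg/storage/engine):

```go
package enginesketch

// dbWrite and dbSyncWAL are hypothetical stand-ins for the cgo
// wrappers around RocksDB's Write and SyncWAL.
func dbWrite(batch []byte, sync bool) error { return nil }
func dbSyncWAL() error                      { return nil }

// commitOld mirrors the previous behavior: with sync==true RocksDB
// syncs the WAL before applying the batch to the memtable, so every
// batch grouped into the same write stalls behind the fsync.
func commitOld(batch []byte) error {
	return dbWrite(batch, true /* sync */)
}

// commitNew mirrors the new behavior: the batch is written (and becomes
// visible in the memtable) without syncing; only sync==true commits
// then wait on the explicit SyncWAL call.
func commitNew(batch []byte, sync bool) error {
	if err := dbWrite(batch, false /* sync */); err != nil {
		return err
	}
	if sync {
		return dbSyncWAL()
	}
	return nil
}
```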

Reworked the implementation of batch grouping. The sequence number
mechanism was replaced by per-batch condition variables embedded in the
rocksDBBatch structure. Once a batch is committed (and synced if
requested), its condition variable is signalled, ensuring that only the
desired goroutines are woken instead of waking every goroutine waiting
on a shared condition. Syncing of the WAL is performed by a dedicated
thread, and only the batches which specify sync==true wait for the
syncing to occur.
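
A minimal sketch of the per-batch wakeup (hypothetical and heavily
simplified relative to the actual rocksDBBatch):

```go
package enginesketch

import "sync"

// pendingBatch is a hypothetical, stripped-down stand-in for
// rocksDBBatch: each batch carries its own condition variable so that
// committing a group wakes only the goroutines waiting on it.
type pendingBatch struct {
	mu        sync.Mutex
	cond      sync.Cond
	committed bool
	err       error
}

func newPendingBatch() *pendingBatch {
	b := &pendingBatch{}
	b.cond.L = &b.mu
	return b
}

// wait blocks until the leader of the commit group marks this batch
// committed (and synced, if requested).
func (b *pendingBatch) wait() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	for !b.committed {
		b.cond.Wait()
	}
	return b.err
}

// signal is called by the goroutine that committed the group; it wakes
// only the goroutine(s) waiting on this particular batch.
func (b *pendingBatch) signal(err error) {
	b.mu.Lock()
	b.committed = true
	b.err = err
	b.cond.Signal()
	b.mu.Unlock()
}
```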

Added "rocksdb.min_wal_sync_interval" which adds specifies a minimum
delay to wait between calls to SyncWAL. The default of 1ms was
experimentally determined to reduce disk IOs by ~50% (vs 0ms) while leaving write
throughput unchanged.
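
A minimal sketch (hypothetical names) of how a dedicated loop might
coalesce sync requests while honoring such a minimum interval:

```go
package enginesketch

import "time"

// walSyncLoop receives one channel per batch waiting for a WAL sync;
// each channel receives the result of the SyncWAL call it rode on.
func walSyncLoop(requests <-chan chan error, minInterval time.Duration,
	syncWAL func() error) {
	var lastSync time.Time
	for req := range requests {
		// Enforce the minimum interval so bursts of sync requests
		// coalesce into fewer disk IOs.
		if wait := minInterval - time.Since(lastSync); wait > 0 {
			time.Sleep(wait)
		}
		// Drain any requests that arrived in the meantime; they all
		// ride on the same SyncWAL call.
		waiting := []chan error{req}
	drain:
		for {
			select {
			case r, ok := <-requests:
				if !ok {
					break drain
				}
				waiting = append(waiting, r)
			default:
				break drain
			}
		}
		err := syncWAL()
		lastSync = time.Now()
		for _, w := range waiting {
			w <- err
		}
	}
}
```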

For a write-only workload (ycsb -workload F), write throughput
improved by 15%. Raft command commit latency (unsynced commits) dropped
from 30-40ms to 6-10ms.

@cockroach-teamcity
Member

This change is Reviewable

@petermattis petermattis requested review from bdarnell and tbg July 8, 2017 02:07
@petermattis
Collaborator Author

@bdarnell I think the new semantics of sync==true are kosher, but please verify. We won't respond to a Raft request to append log entries until the entries are synced to disk, but the entries will become visible to other goroutines before the WAL is synced.

@petermattis petermattis force-pushed the pmattis/rocksdb-sync-wal branch 2 times, most recently from 8a470b6 to 208d219 on July 8, 2017 14:55
@tbg
Copy link
Member

tbg commented Jul 9, 2017

That's a pretty sweet improvement. I've only glanced at the code (looks good), but just wanted to point out that we should try to avoid forcing @irfansharif to refactor. Perhaps we can hold off on these merges until #16624 is done, which also gives us a clearer view of the performance gain of this change (expected to be much smaller, since we're forcing fewer syncs, correct?)

OTOH, if the refactor is negligible, good to get this in sooner rather than later. Either way, let's ping @irfansharif on all of these PRs (easier to rebase on top of something you've seen before you crashed into it).

@petermattis
Collaborator Author

I don't think there is any conflict with #16809. Yes, the perf gain from this PR will be smaller after #16809 as we usually won't be syncing Raft command commits, though we still will for log truncation commands.

@a-robinson
Contributor

:lgtm: crazy good improvement


Reviewed 3 of 4 files at r1.
Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/engine/db.h, line 104 at r1 (raw file):

// complete.
DBStatus DBFlush(DBEngine* db);
DBStatus DBSyncWAL(DBEngine* db);

Adding a comment would be nice, at least to clarify that it blocks


pkg/storage/engine/rocksdb.go, line 435 at r2 (raw file):

func (r *RocksDB) syncLoop() {
	runtime.LockOSThread()

Was this shown to be important for performance? Is it just here to ensure we actually get cycles every millisecond? A comment would probably be helpful down the road.



@petermattis
Collaborator Author

Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/engine/db.h, line 104 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Adding a comment would be nice, at least to clarify that it blocks

Oops, meant to comment that and forgot. Done.


pkg/storage/engine/rocksdb.go, line 435 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Was this shown to be important for performance? Is it just here to ensure we actually get cycles every millisecond? A comment would probably be helpful down the road.

It's not the cycles for this thread, but to prevent other goroutines from being scheduled on our thread and experiencing additional latency. Added a comment.
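
A minimal sketch of that rationale (a hypothetical shape, not the
actual syncLoop implementation):

```go
package enginesketch

import "runtime"

// walSyncThread pins the syncing goroutine to its own OS thread so the
// Go scheduler does not place other goroutines on that thread, where
// they would stall behind the blocking fsync and see extra latency.
func walSyncThread(syncWAL func() error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for {
		_ = syncWAL() // blocks in fsync for the duration of the sync
	}
}
```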



@petermattis petermattis force-pushed the pmattis/rocksdb-sync-wal branch 2 times, most recently from a1886fb to a4d1f55 on July 10, 2017 19:29
@bdarnell
Contributor

:lgtm:


Reviewed 2 of 4 files at r1, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@petermattis
Collaborator Author

@bdarnell Just to be clear, is it fine for the Raft log entries and hard state updates to be visible to other goroutines before they have been synced to disk? Is there a scenario with sending a snapshot that could be problematic? I haven't thought of one, but you're the expert here.


Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@bdarnell
Contributor

Yes, it's fine for the log entries (and HardState) to be visible before they are synced to disk. The important constraints are that these syncs happen before raft messages are sent and before the call to RawNode.Advance. Our regular locking is sufficient for this, and if other goroutines can see the data early, that's fine. In particular, snapshots are not a problem (I think they're the only thing that might access log entries or HardState from other goroutines). Snapshots do not include either the HardState or uncommitted log entries, and even if they did include log entries that were not persisted to disk, it wouldn't be a problem because raft does not infer that the entries are persisted on the node that sends a snapshot.
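
A minimal sketch of that ordering constraint (all names hypothetical;
the real logic lives in the replica Raft handling code):

```go
package enginesketch

// ready is a hypothetical stand-in for a raft Ready: entries and
// HardState to persist, plus outbound messages.
type ready struct {
	entries  [][]byte
	messages [][]byte
}

// handleReady shows the required ordering: commit (entries become
// visible to other goroutines, which is safe), then sync the WAL, and
// only then send messages and advance raft.
func handleReady(rd ready, commit func([][]byte) error,
	syncWAL func() error, send func([][]byte), advance func()) error {
	if err := commit(rd.entries); err != nil {
		return err
	}
	if err := syncWAL(); err != nil {
		return err
	}
	send(rd.messages)
	advance()
	return nil
}
```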


Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@tbg
Member

tbg commented Jul 10, 2017 via email

@petermattis
Collaborator Author

@tschottdorf Good idea. Added Ben's comment as a comment above the batch commit of Raft log/state in replica.go.

@petermattis petermattis merged commit 4a1b7fe into cockroachdb:master Jul 10, 2017
@petermattis petermattis deleted the pmattis/rocksdb-sync-wal branch July 10, 2017 21:19