use vectorized io when writing dirty pages #339
Conversation
Force-pushed from 14d3365 to 0d8d760
I also coded up an alternative with the existing syscall wrapper in golang.org/x/sys. That one, interestingly enough, has some issues in the manydb tests. It leads me to believe that the Go GC is re-arranging the pages, and that makes it write some weird memory into the file. The consistency checks fail here too.
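(For context, a minimal sketch of what the x/sys-based alternative boils down to; writeBatch is a hypothetical helper for illustration, not code from either branch:)

package vecio

import (
	"os"

	"golang.org/x/sys/unix"
)

// writeBatch issues a single pwritev for one batch of buffers that are
// contiguous on disk, starting at offset. The buffers must stay valid
// (and unmoved) until the call returns.
func writeBatch(f *os.File, offset int64, bufs [][]byte) error {
	_, err := unix.Pwritev(int(f.Fd()), bufs, offset)
	return err
}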
Please ping me when you're done. Also a reminder: a small PR (e.g. < 100 lines of change) is preferred.
Force-pushed from 0d8d760 to 3e65272
@ahrtr I tried to simplify a bit more, just over 100 lines added though. I still have to benchmark this further... Last time in September I remember it was 10% slower.
Just trying to get my benchmark setup up and running again, I get a minor difference in tail latency:
I'm running a couple of strace sessions now to see how many syscalls this actually saves. Ideally this should reduce tail latency on very big consecutive writes.
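(For reference, the syscall counting can be done with something like the following strace invocation; the filter set here is just an assumption about which calls are interesting:)

strace -f -c -e trace=pwrite64,pwritev,fdatasync,futex -p <etcd-pid>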
I would not expect a huge performance boost for the light write workload. But it should not be 10% slower, either :).
Let me triple-check the Docker images again; with strace I couldn't really see the new syscall being used. It's still really difficult to get the build to pick up the right bbolt version, even with go mods and replaces :/
Running strace with this PR:
Against vanilla 3.5.6:
Doesn't seem to be that much of a bottleneck. I'll get some other hardware in the cloud tomorrow for a bigger test; this was a pretty slow machine in my local network over gigabit ethernet.
Interesting... There are more write and futex calls after this change. Do you know why?
@xiang90 that's indeed interesting, I assume that one strace ran longer than the other. I'll go check where the write calls originate from, but the futex/write and fsync calls are most likely related to the WAL: https://github.com/etcd-io/etcd/blob/main/server/storage/wal/wal.go I think there's a bit more to gain in the WAL; I was thinking about block-aligned writes and direct I/O. The former was a bit of an issue for us recently with network-mounted block devices like Ceph...
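(Purely as an illustration of the direct-I/O idea, and an assumption rather than anything in this PR: on Linux that would mean opening the file with O_DIRECT and keeping buffers and offsets block-aligned, roughly like the sketch below.)

package vecio

import "golang.org/x/sys/unix"

// openDirect opens path for writing with O_DIRECT (Linux-specific), bypassing
// the page cache. All subsequent writes must use block-aligned buffers,
// offsets and lengths.
func openDirect(path string) (int, error) {
	return unix.Open(path, unix.O_WRONLY|unix.O_CREAT|unix.O_DIRECT, 0o600)
}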
Force-pushed from 3e65272 to 59e2546
Verified that the throughput and latency are not affected by this PR at all; the WAL seems to be the dominating factor. Running some tests today on an Intel Cascade Lake 4-core machine with 100 GB of SSD persistent disk on GCP; another same-spec'd machine in the same VPC runs the check commands using etcdctl. Above is a run of three operations each, spaced by a minute or two:
Left side, starting with this PR on top of 3.5.6; the right hand side after 12.35pm is vanilla 3.5.6. So not as much of an improvement as I expected to see. One thing I would still like to test is the defrag performance, which should do pretty well since defrag re-arranges the pages into a pattern that benefits pwritev. Having said that, it's probably not worth optimizing this case specifically.
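(The check commands referred to above are presumably along the lines of the following; endpoint and load size are placeholders:)

etcdctl check perf --load=m --endpoints=https://<etcd-host>:2379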
Rebasing over 3.6, the pattern is similar :/ Regarding defrag, with 3.6 (current HEAD) and a db size of 5.4 GB:
With this bbolt patch on the same database:
So that seems to be the clear winner operation here ;-)
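(The defrag timings above were presumably gathered with something like this; endpoint and timeout are placeholders:)

time etcdctl defrag --endpoints=https://<etcd-host>:2379 --command-timeout=600s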
Ah... I think you should probably just run a benchmark against boltdb itself rather than etcd? I remember there is a benchmark subcommand in bolt ctl or something :P. Then the WAL stuff will not affect your result.
@xiang90 beautiful, thanks for digging this up. I'll benchmark this in depth tomorrow using bbolt bench and post the result.
bolt_unix.go (outdated)
var curVecs []syscall.Iovec
for i := 0; i < len(pages); i++ {
	p := pages[i]
	if p.id != (lastPid+1) || len(curVecs) >= maxVec {
This doesn't seem correct. Shouldn't it be something like the following?
p.id != lastPid + lastOverflow + 1 || len(curVecs) >= maxVec
interesting, let me see how the overflow works. I was under the impression that the page id is just an increasing identifier for the page.
fixed, very curious that no e2e test case covered this. I'm also not even sure whether the overflow is in use.
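(For readers following along, a minimal, self-contained sketch of the overflow-aware grouping discussed here; the page struct below is a stand-in for bbolt's, not the PR's actual code:)

package vecio

// page is a stand-in for bbolt's page header: id is the page id, overflow is
// the number of additional pages this page spills into.
type page struct {
	id       uint64
	overflow uint32
}

// groupConsecutive splits dirty pages (sorted by id) into batches of pages
// that are physically contiguous on disk, capping each batch at maxVec
// entries so a single pwritev call never exceeds IOV_MAX iovecs.
func groupConsecutive(pages []*page, maxVec int) [][]*page {
	var (
		batches      [][]*page
		cur          []*page
		lastPid      uint64
		lastOverflow uint64
	)
	for _, p := range pages {
		// Start a new batch when the page does not directly follow the
		// previous one (including its overflow pages) or the batch is full.
		if len(cur) > 0 && (p.id != lastPid+lastOverflow+1 || len(cur) >= maxVec) {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, p)
		lastPid, lastOverflow = p.id, uint64(p.overflow)
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}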
)
for i := 0; i < len(offsets); i++ {
	_, _, e = syscall.Syscall6(syscall.SYS_PWRITEV, db.file.Fd(), uintptr(unsafe.Pointer(&iovecs[i][0])),
		uintptr(len(iovecs[i])), uintptr(offsets[i]), uintptr(offsets[i]>>0x8), 0)
The syscall package is deprecated and locked down. Please consider using golang.org/x/sys instead. I had a quick look at your other solution, vectorizedio2; it seems that you did not cast iovecs to [][]byte correctly (around line 556 in tx.go).
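(A possible shape of that cast, as a hedged sketch rather than the PR's code: unsafe.Slice, available since Go 1.17, can view each page's raw memory as a []byte without copying, which is the form golang.org/x/sys/unix.Pwritev expects.)

package vecio

import "unsafe"

// buildIovecs turns raw page pointers and sizes into the [][]byte view that
// unix.Pwritev expects. The page memory is viewed in place, not copied, so it
// must stay valid until the write has completed.
func buildIovecs(ptrs []unsafe.Pointer, sizes []int) [][]byte {
	iovecs := make([][]byte, 0, len(ptrs))
	for i, p := range ptrs {
		iovecs = append(iovecs, unsafe.Slice((*byte)(p), sizes[i]))
	}
	return iovecs
}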
yeah, let me try again with the other solution. I'm afraid this will be drastically slower due to the extra allocation however.
yep, it's significantly slower:
with x/sys:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write 199.840821ms (1.998µs/op) (500500 op/sec)
# Read 1.000347769s (17ns/op) (58823529 op/sec)
this PR:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write 173.741269ms (1.737µs/op) (575705 op/sec)
# Read 1.001078252s (14ns/op) (71428571 op/sec)
master:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write 184.005891ms (1.84µs/op) (543478 op/sec)
# Read 1.000812791s (14ns/op) (71428571 op/sec)
pushed the code into another branch master...tjungblu:bbolt:vec2
Force-pushed from 59e2546 to 530b1c5
with bench the difference is clearer:
Force-pushed from 530b1c5 to e076bd4
From the CPU perspective, yes, most of the computing is probably spent on …
Likely I'm missing something obvious:
Even further: node.write() should directly write to db.data (the mmapped buffer) instead of working on an intermediate buffer.
Thank you @xiang90. I've read the article, but it is mostly about concurrent calls to fsync and the responsibility for retries. My mental model is the following (assuming the goal is to get the highest possible performance [and we are not in a position to complicate bbolt's reliability for performance]):
Again, I'm not saying we should do it... just brainstorming what the "speed of light" version of the memory-copying optimization would be.
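(To make the brainstorming concrete, a purely hypothetical sketch of that "write through the mmap" variant, not something bbolt does today; flushRange and the alignment handling are assumptions:)

package vecio

import "golang.org/x/sys/unix"

// flushRange synchronously flushes a modified region of an mmap'ed file.
// data must be a sub-slice of the mapping whose start is page-aligned;
// msync reports write-back errors for the flushed range.
func flushRange(data []byte) error {
	return unix.Msync(data, unix.MS_SYNC)
}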
@tjungblu
Right. I actually referred to hyc_symas's comment (he is the author of LMDB) in that HN thread, not just the article :P. But I think you are probably right. Only the meta pages must be handled explicitly (split from the normal pages' mmap sync). Other pages should not matter much, assuming msync will return the error explicitly for any failing physical writes. I do not see a reason why it cannot be done. But we may want to understand the perf benefits as well :)
This makes use of the vectorized IO call pwritev to write consecutive pages in batches of up to 1024 smaller buffer writes.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
This is still a bit experimental, thus drafting it, but it should reduce the number of write syscalls by a lot and resolve a very old TODO.
Will benchmark this more in-depth tomorrow, but if somebody finds some time earlier I would appreciate some other tests as well!