
use vectorized io when writing dirty pages #339

Closed
wants to merge 1 commit into from

Conversation

tjungblu
Contributor

This is still a bit experimental, hence drafting it - but it should reduce the number of write syscalls by a lot and resolve a very old TODO.

Will benchmark this more in-depth tomorrow, but if somebody finds some time earlier I would appreciate some other tests as well!
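
For readers following along, here is a minimal sketch of the idea (not the PR's actual code, which drives the raw pwritev syscall directly; the dirtyPage struct and helper names are illustrative): sort the dirty pages, group runs of physically consecutive pages, and write each run with one vectorized call via golang.org/x/sys/unix.Pwritev. The 1024 cap corresponds to Linux's IOV_MAX limit on buffers per call.

package vectorio

import (
	"os"
	"sort"

	"golang.org/x/sys/unix"
)

const maxVec = 1024 // Linux IOV_MAX: at most 1024 buffers per pwritev call

type dirtyPage struct {
	id       uint64 // pgid; the page's file offset is id * pageSize
	overflow uint32 // number of extra physical pages this buffer spans
	buf      []byte // (overflow+1) * pageSize bytes, header included
}

// writeDirty batches runs of physically consecutive pages into single
// pwritev calls instead of issuing one pwrite per dirty page.
func writeDirty(f *os.File, pages []dirtyPage, pageSize uint64) error {
	sort.Slice(pages, func(i, j int) bool { return pages[i].id < pages[j].id })

	var iovs [][]byte
	var runOff int64
	var nextID uint64
	flush := func() error {
		if len(iovs) == 0 {
			return nil
		}
		// Note: a production version must also handle short writes.
		_, err := unix.Pwritev(int(f.Fd()), iovs, runOff)
		iovs = iovs[:0]
		return err
	}
	for _, p := range pages {
		// Start a new run when this page is not adjacent to the previous one
		// (overflow pages included) or when the iovec batch is full.
		if len(iovs) == 0 || p.id != nextID || len(iovs) >= maxVec {
			if err := flush(); err != nil {
				return err
			}
			runOff = int64(p.id * pageSize)
		}
		iovs = append(iovs, p.buf)
		nextID = p.id + uint64(p.overflow) + 1
	}
	return flush()
}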

bolt_windows.go Outdated (resolved)
@tjungblu
Contributor Author

I also coded up an alternative with an existing syscall wrapper in golang.org/x/sys that works with [][]byte:
master...tjungblu:bbolt:vectorizedio2

This one, interestingly enough, has some issues in the manydb tests. It leads me to believe that the Go GC is re-arranging the pages, which makes it write some weird memory into the file. Consistency checks fail here too.

@ahrtr
Member

ahrtr commented Nov 13, 2022

Please ping me when you're done.

Also a reminder: a small PR (e.g. < 100 lines of change) is preferred.

@tjungblu tjungblu marked this pull request as ready for review November 29, 2022 15:15
@tjungblu
Contributor Author

@ahrtr I tried to simplify a bit more, just over 100 lines added though. I still have to benchmark this further... Last time in September I remember it was 10% slower.

@tjungblu
Contributor Author

tjungblu commented Nov 29, 2022

Just trying to get my benchmark setup up and running again; I get a minor difference in tail latency:

(benchmark output struck out - previous bench was not actually running this code)

I'm running a couple of strace sessions now to see how many syscalls this actually saves. Ideally this should reduce tail latency on very big consecutive writes.

@xiang90
Contributor

xiang90 commented Nov 29, 2022

I would not expect a huge performance boost for the light write workload. But it should not be 10% slower, either :).

@tjungblu
Contributor Author

Let me triple-check the docker images again; with strace I couldn't really see the new syscall being used. It's still really difficult to get the build to pick up the right bbolt version, even with go modules and replace directives :/

@tjungblu
Contributor Author

tjungblu commented Nov 29, 2022

running check perf --load=l for a bit more load now.

strace with this PR:

root@etcd-bbolt-7dd449c8f9-jl9zq:/# strace -c -p 1

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.58   18.571681         329     56359      4400 futex
  8.81    2.384835         190     12497           write
  6.51    1.761957          70     24902     12398 read
  5.38    1.457237         208      6977           nanosleep
  5.27    1.425854        2995       476           fdatasync
  5.01    1.355508          90     14940           epoll_pwait
  0.12    0.033640          96       347           sched_yield
  0.08    0.022086          16      1351         1 rt_sigreturn
  0.07    0.018166         263        69           pwritev
  0.06    0.017491          40       437           lseek
  0.06    0.017383          20       833           setsockopt
  0.01    0.003979          29       133         2 epoll_ctl
  0.01    0.002998          24       121         1 accept4
  0.01    0.002220         222        10           mmap
  0.01    0.002153          18       119           getsockname
  0.01    0.001500          78        19           madvise
  0.00    0.000956          79        12           close
  0.00    0.000898          42        21           pwrite64
  0.00    0.000321          12        25           tgkill
  0.00    0.000215           8        25           getpid
  0.00    0.000090          45         2           openat
  0.00    0.000050          25         2           getdents64
  0.00    0.000031          31         1           fallocate
  0.00    0.000016          16         1           munmap
  0.00    0.000013          13         1           fcntl
------ ----------- ----------- --------- --------- ----------------
100.00   27.081278         226    119680     16802 total


2174 writes/s
Slowest request took too long: 1.830543s
PASS: Stddev is 0.035899s

against vanilla 3.5.6:

root@etcd-deployment-774c8fdd55-zxdk8:/# strace -c -p 1

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.64    7.943450         251     31604      2717 futex
  8.98    1.120182         173      6465           write
  7.34    0.915813          65     13896      6924 read
  6.43    0.802187        2737       293           fdatasync
  6.02    0.751166         187      4008           nanosleep
  5.09    0.635376          82      7748           epoll_pwait
  0.89    0.110715           5     21314         1 rt_sigreturn
  0.53    0.066214          33      1954           pwrite64
  0.45    0.056711          23      2387           setsockopt
  0.32    0.039322         115       341           accept4
  0.12    0.014873          79       188           sched_yield
  0.07    0.008601          25       343         2 epoll_ctl
  0.06    0.008073          23       341           getsockname
  0.06    0.007504          28       265           lseek
  0.01    0.000802          80        10           mmap
  0.00    0.000055          55         1           munmap
  0.00    0.000039          19         2           getdents64
  0.00    0.000034          17         2           openat
  0.00    0.000018          18         1           close
  0.00    0.000000           0         2           rt_sigprocmask
  0.00    0.000000           0        20           getpid
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0        20           tgkill
  0.00    0.000000           0         1           fallocate
------ ----------- ----------- --------- --------- ----------------
100.00   12.481135         136     91208      9644 total

2260 writes/s
Slowest request took too long: 1.586535s
PASS: Stddev is 0.030607s

Doesn't seem to be that much of a bottleneck, as pwritev isn't called all that often. The number of pwrite64 calls is reduced though; the remaining ones are just for the metadata.

I'll get some other hardware in the cloud tomorrow for a bigger test; this was a pretty slow machine in my local network over gigabit Ethernet.

@ahrtr ahrtr self-requested a review November 30, 2022 00:31
@xiang90
Contributor

xiang90 commented Nov 30, 2022

@tjungblu

Interesting... There are more write and futex calls after this change. Do you know why?

@tjungblu
Contributor Author

@xiang90 that's indeed interesting, I assume that one strace ran longer than the other. I'll go check out where the write calls originate from, but futex/write and fsync calls are most likely related to the WAL: https://github.com/etcd-io/etcd/blob/main/server/storage/wal/wal.go

I think there's a bit more to gain in the WAL; I was thinking about block-aligned writes and direct I/O. The former was a bit of an issue for us recently with network-mounted block devices like Ceph...

@tjungblu
Contributor Author

tjungblu commented Nov 30, 2022

Verified that the throughput and latency are not affected by this PR at all; the WAL seems to be the dominating factor.

Running some tests today on a 4-core Intel Cascade Lake instance with a 100 GB SSD persistent disk on GCP; another identically specced machine in the same VPC runs the check commands using etcdctl.

[monitoring chart covering the three runs described below]

Above is a run of three operations each, spaced by a minute or two:

check datascale --load=l
check perf --load=xl
check perf --load=xl

The left side starts with this PR on top of 3.5.6; the right-hand side, after 12.35pm, is vanilla 3.5.6. So not quite the improvement I expected to see.

One thing I would still like to test is the defrag performance; it should do pretty well, as defrag re-arranges the pages into a pattern that benefits pwritev. Having said that, it's probably not worth optimizing this case specifically.

@tjungblu
Contributor Author

rebasing over 3.6, the pattern is similar :/

regarding defrag, with 3.6 (current HEAD) and a db size of 5.4 GB:

{"level":"info","ts":"2022-11-30T14:12:44.978Z","caller":"backend/backend.go:547","msg":"finished defragmenting directory","path":"default.etcd/member/snap/db","current-db-size-bytes-diff":-24576,"current-db-size-bytes":5432389632,"current-db-size":"5.4 GB","current-db-size-in-use-bytes-diff":-24576,"current-db-size-in-use-bytes":5432369152,"current-db-size-in-use":"5.4 GB","took":"1m3.791887536s"}

with this bbolt patch on the same database:

{"level":"info","ts":"2022-11-30T14:15:31.033Z","caller":"backend/backend.go:547","msg":"finished defragmenting directory","path":"default.etcd/member/snap/db","current-db-size-bytes-diff":-24576,"current-db-size-bytes":5432389632,"current-db-size":"5.4 GB","current-db-size-in-use-bytes-diff":-24576,"current-db-size-in-use-bytes":5432369152,"current-db-size-in-use":"5.4 GB","took":"30.260298509s"}

So that seems to be the clear winner operation here ;-)

@xiang90
Contributor

xiang90 commented Nov 30, 2022

@tjungblu

Ah... I think you should probably just run a benchmark against boltdb itself rather than etcd? I remember there is a benchmark subcommand for boltdb in bolt ctl or something :P. Then the WAL stuff will not affect your results.

@xiang90
Contributor

xiang90 commented Nov 30, 2022

@tjungblu
Contributor Author

@xiang90 beautiful, thanks for digging this up. I'll benchmark this in depth tomorrow using bbolt bench and post the result.

bolt_unix.go Outdated
var curVecs []syscall.Iovec
for i := 0; i < len(pages); i++ {
	p := pages[i]
	if p.id != (lastPid+1) || len(curVecs) >= maxVec {
Member

@ahrtr ahrtr Dec 1, 2022

This does not seem correct. Shouldn't it be something like the following?

p.id != lastPid + lastOverflow + 1 || len(curVecs) >= maxVec

Contributor Author

interesting, let me see how the overflow works. I was under the impression that the page id is just an increasing identifier for the page.

Contributor Author

Fixed - very curious that no e2e test case covered this. I'm also not even sure whether overflow pages are in use there.
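
To make the overflow point concrete, here is a hedged, self-contained sketch of the corrected adjacency check ahrtr suggested: a page with overflow n occupies n+1 physical pages, so the run only continues when p.id equals lastPid + lastOverflow + 1. Variable names mirror the excerpt above; groupRuns, iovecFor and flushRun are placeholder names, not functions from the PR.

package iosketch

import "syscall"

type pgid uint64

// pageHdr is a minimal stand-in for the page header fields used here.
type pageHdr struct {
	id       pgid
	overflow uint32
}

const maxVec = 1024

// groupRuns accumulates iovecs for physically consecutive pages and hands
// each completed run to flushRun, which would issue a single pwritev.
func groupRuns(pages []pageHdr, iovecFor func(pageHdr) syscall.Iovec, flushRun func([]syscall.Iovec)) {
	var curVecs []syscall.Iovec
	var lastPid pgid
	var lastOverflow uint32
	for i := 0; i < len(pages); i++ {
		p := pages[i]
		// A page with overflow n spans n+1 physical pages, so the next
		// adjacent pgid is lastPid + lastOverflow + 1, not lastPid + 1.
		if len(curVecs) > 0 && (p.id != lastPid+pgid(lastOverflow)+1 || len(curVecs) >= maxVec) {
			flushRun(curVecs)
			curVecs = curVecs[:0]
		}
		curVecs = append(curVecs, iovecFor(p))
		lastPid, lastOverflow = p.id, p.overflow
	}
	if len(curVecs) > 0 {
		flushRun(curVecs)
	}
}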

)
for i := 0; i < len(offsets); i++ {
	_, _, e = syscall.Syscall6(syscall.SYS_PWRITEV, db.file.Fd(), uintptr(unsafe.Pointer(&iovecs[i][0])),
		uintptr(len(iovecs[i])), uintptr(offsets[i]), uintptr(offsets[i]>>0x8), 0)
Member

@ahrtr ahrtr Dec 1, 2022

The syscall package is deprecated and locked down. Please consider using golang.org/x/sys instead. I had a quick look at your other solution, vectorizedio2, and it seems that you did not cast iovecs to [][]byte correctly (around line 556 in tx.go).

Contributor Author

yeah, let me try again with the other solution. I'm afraid this will be drastically slower due to the extra allocations, however.
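
For reference, a hypothetical sketch of the kind of cast being discussed, using unsafe.Slice (Go 1.17+): it aliases the page memory without copying, but every dirty page still contributes one slice header to the [][]byte batch handed to unix.Pwritev, which is the extra-allocation concern above. The struct below only mirrors the relevant header fields from bbolt's page.go.

package pagesketch

import "unsafe"

// page mirrors the header fields at the start of every bbolt page buffer
// that matter here (the real definition lives in bbolt's page.go).
type page struct {
	id       uint64
	flags    uint16
	count    uint16
	overflow uint32
}

// pageBuffer exposes the raw page memory (header plus overflow pages) as a
// []byte without copying, suitable for inclusion in a [][]byte iovec list.
func pageBuffer(p *page, pageSize int) []byte {
	size := pageSize * (1 + int(p.overflow))
	return unsafe.Slice((*byte)(unsafe.Pointer(p)), size)
}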

Contributor Author

yep, it's significantly slower:

with x/sys:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	199.840821ms	(1.998µs/op)	(500500 op/sec)
# Read	1.000347769s	(17ns/op)	(58823529 op/sec)


this PR:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	173.741269ms	(1.737µs/op)	(575705 op/sec)
# Read	1.001078252s	(14ns/op)	(71428571 op/sec)


master:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	184.005891ms	(1.84µs/op)	(543478 op/sec)
# Read	1.000812791s	(14ns/op)	(71428571 op/sec)

Contributor Author

pushed the code into another branch master...tjungblu:bbolt:vec2

@tjungblu
Contributor Author

tjungblu commented Dec 1, 2022

with bench the difference is clearer:

this PR:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	173.741269ms	(1.737µs/op)	(575705 op/sec)
# Read	1.001078252s	(14ns/op)	(71428571 op/sec)


master:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	184.005891ms	(1.84µs/op)	(543478 op/sec)
# Read	1.000812791s	(14ns/op)	(71428571 op/sec)

@tjungblu
Contributor Author

tjungblu commented Dec 1, 2022

seems the vectorized write in the bbolt benchmark is also not a bottleneck:

[two profile charts from the bbolt bench run; the time appears dominated by mmap]

Let's stash this PR for the time being, then; it seems we can get more performance by optimizing the allocations. Thanks for the help, everyone!

@xiang90
Contributor

xiang90 commented Dec 2, 2022

From the CPU perspective, yes, most of the computing is probably spent on allocation. We made some improvements there, but there is surely a lot more that can be improved... On the I/O path, I feel there are probably even more inefficiencies.

@ptabor
Contributor

ptabor commented Dec 15, 2022

Likely I'm missing something obvious:

  • we have the bbolt disk file mmaped...
  • we only use the mmap area to make reads (db.go:933)
  • why don't we use it to perform writes as well (i.e. copy onto the mapped memory) and call msync?
    Is there a good reason?

Even further: node.write() should directly write to db.data (the mmapped buffer) instead of working on an intermediate buffer.
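
As a rough sketch of that direction (purely illustrative - bbolt currently maps the data file read-only, so this assumes a hypothetical read-write mapping; the function name is mine):

package mmapsketch

import "golang.org/x/sys/unix"

// writeViaMmap copies a dirty page straight into the (hypothetically
// writable) mapping and flushes it with msync instead of pwrite/pwritev.
// offset must be a pgid multiple of the page size, so the slice passed to
// msync stays page-aligned as the syscall requires.
func writeViaMmap(data []byte, pageBuf []byte, offset int) error {
	// Copy the dirty page into the mapping (or, going further, build it in place).
	copy(data[offset:offset+len(pageBuf)], pageBuf)
	// MS_SYNC blocks until the range is written back, which the commit path
	// needs before it may write the meta page.
	return unix.Msync(data[offset:offset+len(pageBuf)], unix.MS_SYNC)
}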

@xiang90
Contributor

xiang90 commented Dec 16, 2022

@ptabor

This behavior is from LMDB. And there are reasons for it. This link provides some of the benefits of handling writes explicitly.

@ptabor
Contributor

ptabor commented Dec 19, 2022

Thank you @xiang90. I've read the article, but it is mostly about concurrent calls of fsync and the responsibility for retries.

My mental model is the following (assuming the goal is to get the highest possible performance [and we are not in a position to compromise bbolt's reliability for performance]):

  1. The bbolt writes/syncs are single-threaded, so there is an exclusive lock under which we need to make sure the transaction is committed, whether that happens via fsync or msync of the file.
  2. We need to precisely control the write of the meta pages. A meta page can only be written once we are sure all other modifications to the tree are persisted, so writes to meta pages should be handled explicitly.
  3. All modifications other than the meta page should not be 'referencable', so the order of their writes does not matter.
  4. Working on mmapped pages would allow us to:
    • avoid allocating temporary memory for pages and 'zeroing' it
    • avoid copying the temporary content into fwrite() memory
    • the additional risk & benefit is that mmapped content can start being written to disk as soon as we (partially) modify the pages, without delaying it until the commit happens. This is an optional mode (alternatively we can do copy-on-write), but whether it makes sense depends on the ratio of rolled-back transactions. For etcd there are nearly no rolled-back transactions.

Again - I'm not saying we should do it... Just brainstorming what the "speed of light" version of the memory-copying optimization would be.

@ptabor
Contributor

ptabor commented Dec 19, 2022

@tjungblu
From the chart #339 (comment), the time seems to be dominated by mmap. But mmap should be an infrequent operation for databases that are 'mature', as it happens only when the file needs to grow. Wouldn't it be better to benchmark an already upsized file, e.g. run with a high InitialMmapSize option?
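
For what it's worth, at the library level that pre-sizing would look roughly like the sketch below. InitialMmapSize is a real bbolt option; the 10 GiB value is just an example, and the bench subcommand may need a flag added to pass it through.

package benchsketch

import bolt "go.etcd.io/bbolt"

// openPresized reserves a large mapping up front so the file never has to be
// remapped (and the benchmark never pays mmap costs) while it grows.
func openPresized(path string) (*bolt.DB, error) {
	return bolt.Open(path, 0600, &bolt.Options{
		InitialMmapSize: 10 << 30, // example: reserve 10 GiB
	})
}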

@xiang90
Contributor

xiang90 commented Dec 19, 2022

@ptabor

Right. I actually referred to hyc_symas's (he is the author of LMDB) comment in that HN thread, not just the article :P.

But I think you are probably right. Only the meta-pages must be handled explicitly (split from the normal pages mmap sync). Other pages should not matter much, assuming msync will return the error explicitly for any failing physical writes. I do not see a reason why it cannot be done. But we may want to understand the perf benefits as well :)

This makes use of the vectorized IO call pwritev to write consecutive
pages in batches of up to 1024 smaller buffer writes.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@ahrtr ahrtr deleted the branch etcd-io:master December 12, 2023 10:26
@ahrtr ahrtr closed this Dec 12, 2023