
use vectorized io when writing dirty pages #339

Closed
wants to merge 1 commit into from

Conversation

tjungblu
Contributor

This is still a bit experimental, hence drafting it - but it should reduce the number of write syscalls by a lot and resolve a very old TODO.

Will benchmark this more in-depth tomorrow, but if somebody finds some time earlier I would appreciate some other tests as well!
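
For readers following along, here is a minimal sketch of the idea (not the PR's actual code, which drives the raw pwritev syscall directly; the dirtyPage struct and helper names are illustrative): sort the dirty pages, group runs of physically consecutive pages, and write each run with one vectorized call via golang.org/x/sys/unix.Pwritev. The 1024 cap corresponds to Linux's IOV_MAX limit on buffers per call.

package vectorio

import (
	"os"
	"sort"

	"golang.org/x/sys/unix"
)

const maxVec = 1024 // Linux IOV_MAX: at most 1024 buffers per pwritev call

type dirtyPage struct {
	id       uint64 // pgid; the page's file offset is id * pageSize
	overflow uint32 // number of extra physical pages this buffer spans
	buf      []byte // (overflow+1) * pageSize bytes, header included
}

// writeDirty batches runs of physically consecutive pages into single
// pwritev calls instead of issuing one pwrite per dirty page.
func writeDirty(f *os.File, pages []dirtyPage, pageSize uint64) error {
	sort.Slice(pages, func(i, j int) bool { return pages[i].id < pages[j].id })

	var iovs [][]byte
	var runOff int64
	var nextID uint64
	flush := func() error {
		if len(iovs) == 0 {
			return nil
		}
		// Note: a production version must also handle short writes.
		_, err := unix.Pwritev(int(f.Fd()), iovs, runOff)
		iovs = iovs[:0]
		return err
	}
	for _, p := range pages {
		// Start a new run when this page is not adjacent to the previous one
		// (overflow pages included) or when the iovec batch is full.
		if len(iovs) == 0 || p.id != nextID || len(iovs) >= maxVec {
			if err := flush(); err != nil {
				return err
			}
			runOff = int64(p.id * pageSize)
		}
		iovs = append(iovs, p.buf)
		nextID = p.id + uint64(p.overflow) + 1
	}
	return flush()
}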

bolt_windows.go Outdated (resolved)
@tjungblu
Contributor Author

I also coded up an alternative with an existing syscall wrapper in golang.org/x/sys that works with [][]byte:
master...tjungblu:bbolt:vectorizedio2

This one, interestingly enough, has some issues in the manydb tests. It leads me to believe that the Go GC is re-arranging the pages, which makes it write some weird memory into the file. Consistency checks fail here too.

@ahrtr
Member

ahrtr commented Nov 13, 2022

Please ping me when you're done.

Also a reminder: a small PR (e.g. < 100 lines of change) is preferred.

@tjungblu tjungblu marked this pull request as ready for review November 29, 2022 15:15
@tjungblu
Contributor Author

@ahrtr I tried to simplify a bit more, just over 100 lines added though. I still have to benchmark this further... Last time in September I remember it was 10% slower.

@tjungblu
Contributor Author

tjungblu commented Nov 29, 2022

Just trying to get my benchmark setup up and running again; I get a minor difference in tail latency:

(benchmark output struck out - previous bench was not actually running this code)

I'm running a couple of strace sessions now to see how many syscalls this actually saves. Ideally this should reduce tail latency on very big consecutive writes.

@xiang90
Contributor

xiang90 commented Nov 29, 2022

I would not expect a huge performance boost for the light write workload. But it should not be 10% slower, either :).

@tjungblu
Contributor Author

Let me triple-check the docker images again; with strace I couldn't really see the new syscall being used. It's still really difficult to get the build to pick up the right bbolt version, even with go modules and replace directives :/

@tjungblu
Contributor Author

tjungblu commented Nov 29, 2022

running check perf --load=l for a bit more load now.

strace with this PR:

root@etcd-bbolt-7dd449c8f9-jl9zq:/# strace -c -p 1

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.58   18.571681         329     56359      4400 futex
  8.81    2.384835         190     12497           write
  6.51    1.761957          70     24902     12398 read
  5.38    1.457237         208      6977           nanosleep
  5.27    1.425854        2995       476           fdatasync
  5.01    1.355508          90     14940           epoll_pwait
  0.12    0.033640          96       347           sched_yield
  0.08    0.022086          16      1351         1 rt_sigreturn
  0.07    0.018166         263        69           pwritev
  0.06    0.017491          40       437           lseek
  0.06    0.017383          20       833           setsockopt
  0.01    0.003979          29       133         2 epoll_ctl
  0.01    0.002998          24       121         1 accept4
  0.01    0.002220         222        10           mmap
  0.01    0.002153          18       119           getsockname
  0.01    0.001500          78        19           madvise
  0.00    0.000956          79        12           close
  0.00    0.000898          42        21           pwrite64
  0.00    0.000321          12        25           tgkill
  0.00    0.000215           8        25           getpid
  0.00    0.000090          45         2           openat
  0.00    0.000050          25         2           getdents64
  0.00    0.000031          31         1           fallocate
  0.00    0.000016          16         1           munmap
  0.00    0.000013          13         1           fcntl
------ ----------- ----------- --------- --------- ----------------
100.00   27.081278         226    119680     16802 total


2174 writes/s
Slowest request took too long: 1.830543s
PASS: Stddev is 0.035899s

against vanilla 3.5.6:

root@etcd-deployment-774c8fdd55-zxdk8:/# strace -c -p 1

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.64    7.943450         251     31604      2717 futex
  8.98    1.120182         173      6465           write
  7.34    0.915813          65     13896      6924 read
  6.43    0.802187        2737       293           fdatasync
  6.02    0.751166         187      4008           nanosleep
  5.09    0.635376          82      7748           epoll_pwait
  0.89    0.110715           5     21314         1 rt_sigreturn
  0.53    0.066214          33      1954           pwrite64
  0.45    0.056711          23      2387           setsockopt
  0.32    0.039322         115       341           accept4
  0.12    0.014873          79       188           sched_yield
  0.07    0.008601          25       343         2 epoll_ctl
  0.06    0.008073          23       341           getsockname
  0.06    0.007504          28       265           lseek
  0.01    0.000802          80        10           mmap
  0.00    0.000055          55         1           munmap
  0.00    0.000039          19         2           getdents64
  0.00    0.000034          17         2           openat
  0.00    0.000018          18         1           close
  0.00    0.000000           0         2           rt_sigprocmask
  0.00    0.000000           0        20           getpid
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0        20           tgkill
  0.00    0.000000           0         1           fallocate
------ ----------- ----------- --------- --------- ----------------
100.00   12.481135         136     91208      9644 total

2260 writes/s
Slowest request took too long: 1.586535s
PASS: Stddev is 0.030607s

Doesn't seem to be that much of a bottleneck, as pwritev isn't called all that often. The number of pwrite64 calls is reduced though; the remaining ones are just for the metadata.

I'll get some other hardware in the cloud tomorrow for a bigger test; this was a pretty slow machine in my local network over gigabit Ethernet.

@ahrtr ahrtr self-requested a review November 30, 2022 00:31
@xiang90
Contributor

xiang90 commented Nov 30, 2022

@tjungblu

Interesting... There are more write and futex calls after this change. Do you know why?

@tjungblu
Contributor Author

@xiang90 that's indeed interesting, I assume that one strace ran longer than the other. I'll go check out where the write calls originate from, but futex/write and fsync calls are most likely related to the WAL: https://github.com/etcd-io/etcd/blob/main/server/storage/wal/wal.go

I think there's a bit more to gain in the WAL; I was thinking about block-aligned writes and direct I/O. The former was a bit of an issue for us recently with network-mounted block devices like Ceph...

@tjungblu
Contributor Author

tjungblu commented Nov 30, 2022

Verified that the throughput and latency are not affected by this PR at all; the WAL seems to be the dominating factor.

Running some tests today on a 4-core Intel Cascade Lake instance with a 100 GB SSD persistent disk on GCP; another identically specced machine in the same VPC runs the check commands using etcdctl.

[monitoring chart covering the three runs described below]

Above is a run of three operations each, spaced by a minute or two:

check datascale --load=l
check perf --load=xl
check perf --load=xl

The left side starts with this PR on top of 3.5.6; the right-hand side, after 12.35pm, is vanilla 3.5.6. So not quite the improvement I expected to see.

One thing I would still like to test is the defrag performance; it should do pretty well, as defrag re-arranges the pages into a pattern that benefits pwritev. Having said that, it's probably not worth optimizing this case specifically.

@tjungblu
Contributor Author

rebasing over 3.6, the pattern is similar :/

regarding defrag, with 3.6 (current HEAD) and a db size of 5.4 GB:

{"level":"info","ts":"2022-11-30T14:12:44.978Z","caller":"backend/backend.go:547","msg":"finished defragmenting directory","path":"default.etcd/member/snap/db","current-db-size-bytes-diff":-24576,"current-db-size-bytes":5432389632,"current-db-size":"5.4 GB","current-db-size-in-use-bytes-diff":-24576,"current-db-size-in-use-bytes":5432369152,"current-db-size-in-use":"5.4 GB","took":"1m3.791887536s"}

with this bbolt patch on the same database:

{"level":"info","ts":"2022-11-30T14:15:31.033Z","caller":"backend/backend.go:547","msg":"finished defragmenting directory","path":"default.etcd/member/snap/db","current-db-size-bytes-diff":-24576,"current-db-size-bytes":5432389632,"current-db-size":"5.4 GB","current-db-size-in-use-bytes-diff":-24576,"current-db-size-in-use-bytes":5432369152,"current-db-size-in-use":"5.4 GB","took":"30.260298509s"}

So that seems to be the clear winner operation here ;-)

@xiang90
Contributor

xiang90 commented Nov 30, 2022

@tjungblu

Ah... I think you should probably just run a benchmark against boltdb itself rather than etcd? I remember there is a benchmark subcommand for boltdb in bolt ctl or something :P. Then the WAL stuff will not affect your results.

@xiang90
Contributor

xiang90 commented Nov 30, 2022

@tjungblu
Contributor Author

@xiang90 beautiful, thanks for digging this up. I'll benchmark this in depth tomorrow using bbolt bench and post the result.

bolt_unix.go Outdated
var curVecs []syscall.Iovec
for i := 0; i < len(pages); i++ {
	p := pages[i]
	if p.id != (lastPid+1) || len(curVecs) >= maxVec {
Member

@ahrtr ahrtr Dec 1, 2022

This does not seem correct. Shouldn't it be something like the following?

p.id != lastPid + lastOverflow + 1 || len(curVecs) >= maxVec

Contributor Author

interesting, let me see how the overflow works. I was under the impression that the page id is just an increasing identifier for the page.

Contributor Author

Fixed - very curious that no e2e test case covered this. I'm also not even sure whether overflow pages are in use there.
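
To make the overflow point concrete, here is a hedged, self-contained sketch of the corrected adjacency check ahrtr suggested: a page with overflow n occupies n+1 physical pages, so the run only continues when p.id equals lastPid + lastOverflow + 1. Variable names mirror the excerpt above; groupRuns, iovecFor and flushRun are placeholder names, not functions from the PR.

package iosketch

import "syscall"

type pgid uint64

// pageHdr is a minimal stand-in for the page header fields used here.
type pageHdr struct {
	id       pgid
	overflow uint32
}

const maxVec = 1024

// groupRuns accumulates iovecs for physically consecutive pages and hands
// each completed run to flushRun, which would issue a single pwritev.
func groupRuns(pages []pageHdr, iovecFor func(pageHdr) syscall.Iovec, flushRun func([]syscall.Iovec)) {
	var curVecs []syscall.Iovec
	var lastPid pgid
	var lastOverflow uint32
	for i := 0; i < len(pages); i++ {
		p := pages[i]
		// A page with overflow n spans n+1 physical pages, so the next
		// adjacent pgid is lastPid + lastOverflow + 1, not lastPid + 1.
		if len(curVecs) > 0 && (p.id != lastPid+pgid(lastOverflow)+1 || len(curVecs) >= maxVec) {
			flushRun(curVecs)
			curVecs = curVecs[:0]
		}
		curVecs = append(curVecs, iovecFor(p))
		lastPid, lastOverflow = p.id, p.overflow
	}
	if len(curVecs) > 0 {
		flushRun(curVecs)
	}
}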

)
for i := 0; i < len(offsets); i++ {
	_, _, e = syscall.Syscall6(syscall.SYS_PWRITEV, db.file.Fd(), uintptr(unsafe.Pointer(&iovecs[i][0])),
		uintptr(len(iovecs[i])), uintptr(offsets[i]), uintptr(offsets[i]>>0x8), 0)
Member

@ahrtr ahrtr Dec 1, 2022

The syscall package is deprecated and locked down. Please consider using golang.org/x/sys instead. I had a quick look at your other solution, vectorizedio2, and it seems that you did not cast iovecs to [][]byte correctly (around line 556 in tx.go).

Contributor Author

yeah, let me try again with the other solution. I'm afraid this will be drastically slower due to the extra allocations, however.
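
For reference, a hypothetical sketch of the kind of cast being discussed, using unsafe.Slice (Go 1.17+): it aliases the page memory without copying, but every dirty page still contributes one slice header to the [][]byte batch handed to unix.Pwritev, which is the extra-allocation concern above. The struct below only mirrors the relevant header fields from bbolt's page.go.

package pagesketch

import "unsafe"

// page mirrors the header fields at the start of every bbolt page buffer
// that matter here (the real definition lives in bbolt's page.go).
type page struct {
	id       uint64
	flags    uint16
	count    uint16
	overflow uint32
}

// pageBuffer exposes the raw page memory (header plus overflow pages) as a
// []byte without copying, suitable for inclusion in a [][]byte iovec list.
func pageBuffer(p *page, pageSize int) []byte {
	size := pageSize * (1 + int(p.overflow))
	return unsafe.Slice((*byte)(unsafe.Pointer(p)), size)
}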

Contributor Author

yep, it's significantly slower:

with x/sys:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	199.840821ms	(1.998µs/op)	(500500 op/sec)
# Read	1.000347769s	(17ns/op)	(58823529 op/sec)


this PR:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	173.741269ms	(1.737µs/op)	(575705 op/sec)
# Read	1.001078252s	(14ns/op)	(71428571 op/sec)


master:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	184.005891ms	(1.84µs/op)	(543478 op/sec)
# Read	1.000812791s	(14ns/op)	(71428571 op/sec)

Contributor Author

pushed the code into another branch master...tjungblu:bbolt:vec2

@tjungblu
Contributor Author

tjungblu commented Dec 1, 2022

with bench the difference is clearer:

this PR:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	173.741269ms	(1.737µs/op)	(575705 op/sec)
# Read	1.001078252s	(14ns/op)	(71428571 op/sec)


master:
[tjungblu ~/git/bbolt]$ go run cmd/bbolt/main.go bench -count 100000
# Write	184.005891ms	(1.84µs/op)	(543478 op/sec)
# Read	1.000812791s	(14ns/op)	(71428571 op/sec)

@tjungblu
Contributor Author

tjungblu commented Dec 1, 2022

seems the vectorized write in the bbolt benchmark is also not a bottleneck:

[two profile charts from the bbolt bench run; the time appears dominated by mmap]

Let's stash this PR for the time being, then; it seems we can get more performance by optimizing the allocations. Thanks for the help, everyone!

@xiang90
Contributor

xiang90 commented Dec 2, 2022

From the CPU perspective, yes, most of the computing is probably spent on allocation. We made some improvements there, but there is surely a lot more that can be improved... On the I/O path, I feel there are probably even more inefficiencies.

@ptabor
Contributor

ptabor commented Dec 15, 2022

Likely I'm missing something obvious:

  • we have the bbolt disk file mmaped...
  • we only use the mmap area to make reads (db.go:933)
  • why don't we use it to perform writes as well (i.e. copy onto the mapped memory) and call msync?
    Is there a good reason?

Even further: node.write() should directly write to db.data (the mmapped buffer) instead of working on an intermediate buffer.
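
As a rough sketch of that direction (purely illustrative - bbolt currently maps the data file read-only, so this assumes a hypothetical read-write mapping; the function name is mine):

package mmapsketch

import "golang.org/x/sys/unix"

// writeViaMmap copies a dirty page straight into the (hypothetically
// writable) mapping and flushes it with msync instead of pwrite/pwritev.
// offset must be a pgid multiple of the page size, so the slice passed to
// msync stays page-aligned as the syscall requires.
func writeViaMmap(data []byte, pageBuf []byte, offset int) error {
	// Copy the dirty page into the mapping (or, going further, build it in place).
	copy(data[offset:offset+len(pageBuf)], pageBuf)
	// MS_SYNC blocks until the range is written back, which the commit path
	// needs before it may write the meta page.
	return unix.Msync(data[offset:offset+len(pageBuf)], unix.MS_SYNC)
}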

@xiang90
Contributor

xiang90 commented Dec 16, 2022

@ptabor

This behavior is from LMDB. And there are reasons for it. This link provides some of the benefits of handling writes explicitly.

@ptabor
Contributor

ptabor commented Dec 19, 2022

Thank you @xiang90. I've read the article, but it is mostly about concurrent calls of fsync and the responsibility for retries.

My mental model is the following (assuming the goal is to get the highest possible performance [and we are not in a position to compromise bbolt's reliability for performance]):

  1. The bbolt writes/syncs are single-threaded, so there is an exclusive lock under which we need to make sure the transaction is committed, whether that happens via fsync or msync of the file.
  2. We need to precisely control the write of the meta pages. A meta page can only be written once we are sure all other modifications to the tree are persisted, so writes to meta pages should be handled explicitly.
  3. All modifications other than the meta page should not be 'referencable', so the order of their writes does not matter.
  4. Working on mmapped pages would allow us to:
    • avoid allocating temporary memory for pages and 'zeroing' it
    • avoid copying the temporary content into fwrite() memory
    • the additional risk & benefit is that mmapped content can start being written to disk as soon as we (partially) modify the pages, without delaying it until the commit happens. This is an optional mode (alternatively we can do copy-on-write), but whether it makes sense depends on the ratio of rolled-back transactions. For etcd there are nearly no rolled-back transactions.

Again - I'm not saying we should do it... Just brainstorming what the "speed of light" version of the memory-copying optimization would be.

@ptabor
Contributor

ptabor commented Dec 19, 2022

@tjungblu
From the chart #339 (comment), the time seems to be dominated by mmap. But mmap should be an infrequent operation for databases that are 'mature', as it happens only when the file needs to grow. Wouldn't it be better to benchmark an already upsized file, e.g. run with a high InitialMmapSize option?
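
For what it's worth, at the library level that pre-sizing would look roughly like the sketch below. InitialMmapSize is a real bbolt option; the 10 GiB value is just an example, and the bench subcommand may need a flag added to pass it through.

package benchsketch

import bolt "go.etcd.io/bbolt"

// openPresized reserves a large mapping up front so the file never has to be
// remapped (and the benchmark never pays mmap costs) while it grows.
func openPresized(path string) (*bolt.DB, error) {
	return bolt.Open(path, 0600, &bolt.Options{
		InitialMmapSize: 10 << 30, // example: reserve 10 GiB
	})
}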

@xiang90
Contributor

xiang90 commented Dec 19, 2022

@ptabor

Right. I actually referred to hyc_symas's (he is the author of LMDB) comment in that HN thread, not just the article :P.

But I think you are probably right. Only the meta-pages must be handled explicitly (split from the normal pages mmap sync). Other pages should not matter much, assuming msync will return the error explicitly for any failing physical writes. I do not see a reason why it cannot be done. But we may want to understand the perf benefits as well :)

This makes use of the vectorized IO call pwritev to write consecutive
pages in batches of up to 1024 smaller buffer writes.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@ahrtr ahrtr deleted the branch etcd-io:master December 12, 2023 10:26
@ahrtr ahrtr closed this Dec 12, 2023