Writing to volblocksize=8k zvol is slow #824
Comments
I'm experiencing a similar issue. At first I thought I was an isolated case, but it seems you too have poor performance with 8k volblocksize. As a related note, try playing with the zvol_threads module parameter.
Oh, wait.
Have you tried benchmarking empty zvols? If you're experiencing much better performance with empty zvols versus filled zvols, you're definitely hitting #361. That completely explains the reading activity you're describing.
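For reference, a minimal sketch of how such an empty-versus-filled comparison could be run; the pool/zvol names and fio options here are illustrative assumptions, not commands taken from this thread:

```sh
# Create a fresh 8k zvol for the test (pool name "tank" is a placeholder).
zfs create -V 20G -o volblocksize=8k tank/bench8k

# Random rewrite while the zvol is still empty (no blocks allocated yet).
fio --name=empty --filename=/dev/zvol/tank/bench8k --rw=randwrite --bs=8k \
    --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based

# Fill the zvol so every block has been written once, then repeat the run.
dd if=/dev/urandom of=/dev/zvol/tank/bench8k bs=1M oflag=direct
fio --name=filled --filename=/dev/zvol/tank/bench8k --rw=randwrite --bs=8k \
    --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based
```

If the empty run is dramatically faster, the extra reads are coming from rewriting already-allocated blocks, which is the situation the comment above associates with #361.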
I can exhibit that behaviour, yes, if I write/rewrite/random rewrite with fio. Here's a more comprehensive look at the random rewrite case; same system, same setup: http://pastebin.com/ksHqed4e
Thanks, I'll incorporate that into my next test pass. I suspect, however, that the answer to this particular riddle will mostly be found within the code.
This part of my testing is noteworthy; I suspect that on systems where the CPU count is lower than the default of 32, zvol_threads should be set to the number of CPUs.
It's not so simple. With normal writes and a high-throughput pool, a high zvol_threads value … With synchronous writes, however, a low zvol_threads value will without a single doubt destroy performance, and the reason why is clearly explained in #567. That's because in the synchronous write case, zvol threads are actually blocking on each write (I/O bound), which is not the case with normal writes (CPU bound). In fact, with some heavy synchronous write workloads on large pools, even 32 threads may not be enough. So until we can come up with a better solution, we have to choose some middle ground, and it can't be too low.
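For anyone following along, a quick way to inspect and change the zvol_threads parameter discussed here, assuming it is available on your build (it is a module parameter of this ZoL era):

```sh
# Inspect the current value of the module parameter discussed above.
cat /sys/module/zfs/parameters/zvol_threads

# To change it, set it at module load time, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zvol_threads=16
# then export the pool and reload the zfs module.
```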
Oh, another little tidbit. When I observe the behaviour of …
After more testing, the following tweaks have had no noticeable impact: …
Changing the zvol primarycache from all to metadata worsened the situation. Adding another 120GB SSD as L2ARC didn't help either. Could anyone familiar with the code chime in on why 8k is such a bad case? And above all, why is there a need to read from disk when we're (re)writing with the optimal blocksize (i.e. aligned)?
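For completeness, the two tweaks mentioned above map to commands along these lines; the dataset and device names are placeholders:

```sh
# The primarycache experiment: cache only metadata for the zvol, then revert.
zfs set primarycache=metadata tank/vol8k
zfs set primarycache=all tank/vol8k

# Adding a 120GB SSD as an L2ARC (cache) device to the pool.
zpool add tank cache /dev/disk/by-id/ata-EXAMPLE_SSD
```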
Is this still an issue in HEAD? Multiple performance fixes have been merged.
I haven't looked at this issue for a while. I'll need to fire up a test during the weekend -- I'll post back with results.
@shapemaker Can you rerun your tests using the latest ZoL master source from git? I just merged several significant memory management improvements which may help for small block sizes. @cwedgwood has also recently done some testing with zvols and may have some insights into the expected performance.
Been sick for nearly 2 weeks in the meantime. I will run another test pass next week and see how it goes.
Hi guys, I have been testing ZFS on 1TB VelociRaptors all weekend. Unfortunately I'm seeing it under-utilize the drives (only 62% of what MD can do) when testing with iozone. Anyway, I have been benchmarking the zvols at different block sizes as well, and I have just done some testing with a module I compiled from today's code. Take a look at this uneven disk usage with 8k blocks during writes: … And during reads: … I don't really start seeing better performance until 32k-64k zvol block sizes, and reads are considerably worse than writes. With 64k block sizes the disk utilization is very even and performance is way up, but still a ways off from MD. This is with RAIDZ-2, though performance was pretty bad at lower block sizes with a RAID-10 layout as well. I can't get the disk usage right now because I just lost the connection to the office, but I was seeing bad performance with iozone on ZFS when reducing the record size down from 128K as well.
It just struck me that those iostat samples may be from different iozone runs. In each one, though, there are two disks that are wildly different: during writes two disks are getting heavily hit, and in the read sample (5 seconds) two disks are not read from at all. I would expect to see this if the two parities are being written to the same two disks for a large number of blocks. I'll see if they are related in the same run tomorrow.
Hmmm, here are the results from the same iozone run. Sample during writes: … You can see sda and sdd are getting smoked. Here is a sample when it switches over to reads: … That's right when it kicks over, so it looks like the writes are still getting sent to disk, but you can see sdb and sde are not really getting reads. Now with 64K blocks it is smoothed out and quite fast: … That's actually the peak; these disks hit about 210MB/s with MD :/
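The per-disk samples quoted above were presumably gathered with something along these lines; the 5-second interval matches the comment, but the exact command is an assumption:

```sh
# Extended per-device statistics (throughput, queue, utilisation) every
# 5 seconds while the iozone run is active.
iostat -xm 5
```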
I had a chance to replicate the test setup I had quite closely. It seems that this 8k volblocksize issue is gone in the latest master, so this can be closed. ARC space utilisation is still a problem with 8k zvols, but that is a separate matter. Thank you to everyone who has contributed to fixing this issue in the meantime :)
The default zvol volblocksize is 8k. However, it is also dead slow on ZoL. I've been testing this for some time, and I've done the same tests over on the OpenIndiana side, which doesn't exhibit the same slowness - for native ZFS the best volblocksize is clearly 8k.
Test setup: 8x mirror pairs of 600GB SAS 10k disks, LSI2008 HBA, 6Gb/s SAS expander, 2x Xeon E5640, 48GB RAM (arc_max=28GB). Created 5x 200GB zvols with volblocksize=128k/64k/32k/16k/8k, compression/dedup=off. Random write speeds to prefilled zvols (measured with fio 2.0.6) are as follows: …
On OpenIndiana the other blocksizes give 150-200 MB/s while the 8k blocksize zvol attains 250-400 MB/s, so the behaviour is essentially reversed. The same effect can also be seen with SATA disks on another machine.
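As a rough sketch, the setup described above could be recreated roughly like this; the pool name, device paths, prefill method, and fio flags are assumptions for illustration, only the zvol sizes and block sizes come from the description:

```sh
# Create one 200GB zvol per volblocksize under test (pool name "tank" assumed).
for bs in 128k 64k 32k 16k 8k; do
    zfs create -V 200G -o volblocksize=$bs -o compression=off -o dedup=off "tank/vol$bs"
    # Prefill so the random-rewrite test hits already-allocated blocks;
    # dd stops with "No space left on device" once the zvol is full.
    dd if=/dev/zero of="/dev/zvol/tank/vol$bs" bs=1M oflag=direct
done

# Random rewrite of the prefilled 8k zvol with an 8k request size.
fio --name=randwrite-8k --filename=/dev/zvol/tank/vol8k --rw=randwrite --bs=8k \
    --ioengine=libaio --direct=1 --iodepth=32
```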
Running zpool iostat -v alongside the write performance test (see below) looks interesting at some point: note how there are a lot of reads going on at the same time, which doesn't happen with zvols of other volblocksizes. My guess is that these reads are what kills the performance. It's also noteworthy that arcstat.pl shows roughly 5k reads/s being satisfied from the ARC (probably metadata) the whole time the write operation runs; those reads don't affect the writes, though, since they're served from the ARC.
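The monitoring described here boils down to two commands run in separate terminals while fio is writing; the 5-second interval is an arbitrary choice:

```sh
zpool iostat -v tank 5   # per-vdev read/write ops and bandwidth; shows the unexpected reads
arcstat.pl 5             # ARC hits/misses per second, e.g. the ~5k reads/s served from ARC
```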
This particular test gave these numbers after fio had run all the way through the zvol, writing 200GB of data: …
Another point worth noticing is that while arc_max=28GB, the real memory usage near the end of the test is 41GB. It looks like slab fragmentation affects this test in particular, though I can't say for sure. I find it pretty interesting that 8k is such a pathological case for ZoL while it's the best-performing volblocksize for native ZFS. Also note that there's a 120GB SSD as cache, so metadata should all be served from RAM or flash; however, that might not be the case here.
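A quick way to compare the configured ARC limit against its actual size and the machine's overall memory use at the end of a run; the paths are those used by ZoL/SPL of this vintage and may differ on other versions:

```sh
# ARC target maximum (c_max) versus current ARC size.
grep -E '^(c_max|size)' /proc/spl/kstat/zfs/arcstats

# Overall memory picture of the machine.
free -g

# SPL slab caches, to get a feel for slab usage/fragmentation.
head -n 20 /proc/spl/kmem/slab
```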
Another point is that these tests are being run with zvol_threads=16 (numcpus), since in my testing using 32 zvol threads leads to much more context switching and around 20% slower performance in both sequential and random write throughput.
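The context-switching difference is easy to eyeball while a run is active; any generic Linux tool will do, vmstat is just one option:

```sh
# Watch the context-switch rate (the "cs" column) during the fio run,
# once with zvol_threads=16 and once with 32, and compare.
vmstat 1
```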
Comments and suggestions welcome; I'd love to see more performance tuning in the future. The basic code stability has been fine after the VM tweaks. This test rig has been chugging along without issues for 50 days now.