Replies: 9 comments
-
Hi and thanks a lot for the thorough analysis! I'd definitely be interested in trying to improve performance for your use case. Off the top of my head, here are a few things that might be worth playing with:
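For illustration, this is the kind of mount-time tuning that can be played with; the values below are placeholders, and `-o perfmon=fuse` assumes a build with the performance monitor enabled:

```bash
# Mount with a larger block cache, more worker threads, verbose logging and
# (if compiled in) FUSE performance counters. Values are placeholders only.
dwarfs image.dwarfs /mnt/logs \
  -o cachesize=2g \
  -o workers=4 \
  -o debuglevel=info \
  -o perfmon=fuse
```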
This information can be very helpful in understanding where the driver is wasting time; the most interesting numbers are the cache hit/miss statistics. I think this is what happens as you increase both block size and image size: with a fixed cache size, fewer (and larger) blocks fit into the block cache, so cache misses become both more frequent and more expensive.
-
Thanks for the detailed answer and good leads!
-
FWIW, I've downloaded all 2022/2023 logs from SEC's EDGAR, extracted the files, and split each file every 1000 lines, ultimately resulting in 2,075,627 files and almost 200 GiB of data in total. I compressed this using:
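A sketch of what such an invocation might look like; the paths, compression level and block size are assumptions (`-S 26` selects 2**26-byte, i.e. 64 MiB, blocks), and `--order=path` matches the "path ordered image" referred to below:

```bash
# Build a DwarFS image from the pre-split EDGAR logs.
# Paths, -l and -S values are assumptions, not the exact command used.
mkdwarfs -i edgar-split/ -o edgar.dwarfs -l 7 -S 26 --order=path
```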
I chose path ordering for this image (more on that below). This resulted in a 16 GiB image file, and I can read data from the mounted image at around 3 GiB/s:
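Measured with something along these lines, fanning the reads out over 16 parallel processes (the mount point and the use of xargs are assumptions):

```bash
# Dump every file through 16 parallel cat processes and discard the output.
time find /mnt/edgar -type f -print0 | xargs -0 -P16 -n64 cat > /dev/null
```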
Note that this uses 16 parallel readers. Another interesting data point: searching all logs using ripgrep took 110 seconds, so that's about 1.9 GiB/s. I'm not entirely sure how ripgrep determines the order in which to scan files, but it looks somewhat random to me. I also experimented with increasing the number of workers for the FUSE driver to 8 (`-o workers=8`).
-
I've also built another DwarFS image, but this time with similarity ordering instead of path ordering.
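Roughly, the only difference between the two builds is the ordering option; the remaining flags mirror the earlier sketch and are just as much assumptions:

```bash
# Identical inputs and options, only the file ordering differs.
mkdwarfs -i edgar-split/ -o edgar-path.dwarfs       -l 7 -S 26 --order=path
mkdwarfs -i edgar-split/ -o edgar-similarity.dwarfs -l 7 -S 26 --order=similarity
```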
The similarity-ordered image is almost 5% bigger. More importantly, as suspected, the sequential read rate absolutely tanks: it's almost 200 times slower than with the path-ordered image. And, as suspected, it shows in the cache metrics: the miss rate for the similarity-ordered image is at 34%, whereas for the path-ordered image it is at 0.1%. I also used a small subset (all logs from Q2/2022, 17.24 GiB in 177,598 files) to test different compression algorithms, and the results were a bit of a surprise.
It comes at a cost, though: compression takes considerably longer.
The squashfs image was built using:
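Presumably something along these lines; the compressor, compression level and block size here are assumptions:

```bash
# Example mksquashfs invocation; compressor, level and block size are guesses.
mksquashfs logs-q2-2022/ logs-q2-2022.squashfs -comp zstd -Xcompression-level 19 -b 1M
```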
The "(10k)" and "(100k)" DwarFS images were built from the exact same data, but with files split into 10k lines and 100k lines each instead of 1k lines. So each file is 10x / 100x larger. This was mainly done to demonstrate the overhead of accessing many small files compared to fewer large files when only using a single process/thread to read the data. |
-
I've also played around with the sample log files archive you've linked to; again, the results are in line with the tests above.
If you don't mind trading a significant amount of time for 2-3% better compression, there are options for that as well.
-
With all that being said, there's still room for improvement. Internally, DwarFS is able to read data from the file system at more than 10 GiB/s on my system, which is faster than the SSDs the image is stored on. So at least in theory, it should be possible to dump the contents of a DwarFS image faster than if the data were stored raw on disk. There's definitely some overhead due to the FUSE abstraction, but that's likely only a problem when accessing small files. I'm just working on adding a sequential access detector that can trigger prefetches of file system blocks if it detects that data is accessed sequentially. In my early tests I'm seeing roughly twice the throughput for sequential access patterns because reads will stall much less frequently:
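The kind of single-reader dump behind these numbers might look like this (the mount point and exact command are assumptions):

```bash
# Read every file sequentially with a single reader and discard the data.
time find /mnt/logs -type f -exec cat {} + > /dev/null
```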
That's 17 GiB of data in 17,890 files, so about 1.5 GiB/s. When using ripgrep instead, the results are pretty much unchanged (the small difference is likely noise). That's because its access pattern doesn't trigger the detector. Nonetheless, ripgrep scans the data at 3.5 GiB/s as it's running multi-threaded.
-
Thanks a lot for your analysis, you've done all the work for me; sorry, I should have provided a larger set of data. I've still performed the analysis on my side too and updated the results, which are overall consistent with yours. I'm just not completely convinced by the compression-time trade-off, but anyway I got the answer I came for. Your idea of prefetching is interesting though; I guess sequential access is quite common, and the gain will come for free for users :).
-
Cool, glad to see it's performing nicely now.
I guess that depends on how often you compress / read. But it's definitely good to know which trade-offs can be made.
You're welcome!
This feature should be enabled by default in the upcoming v0.9.9, which is going to be released quite soon to fix #217. If you want to give it a try before the release, you can grab a build from the work branch in the meantime; e.g. this build already has support for sequential read detection, and dwarfs-0.9.8-17-g306eaaf178-Linux-x86_64-clang.tar.zst or dwarfs-universal-0.9.8-17-g306eaaf178-Linux-x86_64-clang would be the release builds for x86_64.
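Trying it out might look roughly like this (the directory layout inside the tarball and the mount point are assumptions):

```bash
# Unpack the downloaded prerelease artifact and mount an image with it.
tar xf dwarfs-0.9.8-17-g306eaaf178-Linux-x86_64-clang.tar.zst
./dwarfs-0.9.8-17-g306eaaf178-Linux-x86_64-clang/bin/dwarfs image.dwarfs /mnt/logs
```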
-
I'll move this to discussions as I think this may be worthwhile for future reference.
-
I need to store logs that are generated every day, are fairly large (around 400 MB per day, i.e. almost 150 GB per year), and consist of a rather high number of files (around 8k per day, i.e. almost 3M per year).
So I played around a bit with DwarFS, which seemed to offer compression ratios as good as tar.xz while keeping read speeds as good as SquashFS. However, when I used large block sizes (>= 2**26) for a good compression ratio with fairly large archives (1 month of logs, i.e. 13 GB uncompressed, 125 MB compressed, 240k files), reading became awfully slow, similar to reading directly from the tar.xz archive with archivemount: more than 30 minutes to read one day of logs, instead of less than 4 s with SquashFS or with DwarFS at a smaller block size / archive size.
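For context, mkdwarfs takes the block size as a power of two, so the two regimes compared here differ roughly like this (paths and all other options are placeholders):

```bash
# -S gives the block size as log2: -S 20 = 1 MiB blocks, -S 26 = 64 MiB blocks.
mkdwarfs -i logs-1month/ -o logs-small-blocks.dwarfs -S 20
mkdwarfs -i logs-1month/ -o logs-large-blocks.dwarfs -S 26
```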
So I was a bit disappointed to realize that, in the end, I could not get both a good compression ratio and a reasonable read speed at the same time. I made a fairly extensive benchmark to get a better view and to make sure there was no suitable sweet spot for my use case. It is available here if you need more detail: https://github.com/cyril42e/dwarfs-scalability/tree/master
Still, I am curious whether this is expected, fixable, or whether there is simply no way around it?
What gives me a little hope that it could be fixable is that DwarFS shows an increase in read time with archive size that SquashFS does not show with a similar block size / compression ratio (of course, SquashFS limits the block size to 2**20 = 1 MB).