-
Hi! I have a special use case for which I wanted to use dwarfs. There are 12 videos, each in 3 different languages, plus a bundled video for each of the 12 with all 3 audio tracks in a single file (48 files total). So even though video files are not usually compressible, these have a lot of data in common: probably every frame that isn't language-specific text is duplicated across 3 files, and the multilingual bundles are made up entirely of tracks copied from the others. Using a sensible sort order and a very fast deduplication utility I was able to reduce the corpus to 35% of its size in my tests. The utility is Bulat Ziganshin's "rep" filter from the old FreeArc compression suite; it's a Lempel-Ziv type of compressor with no entropy coding. The problem is that mkdwarfs' hash-based deduplication is not picking up on this. I tried reducing the window size to ridiculously small values, using a lookback value of 10 and a nilsimsa limit of 224, but the savings are barely 3%, which I'm guessing comes from the actual zstd compression. So my question is: am I doing something wrong, or is mkdwarfs simply not able to detect this duplicate data? I noticed that it starts writing to disk well before the segmentation stage is complete. Could this be the reason so much duplicate data is missed? What happens if the first file analyzed is very similar to the last one? Would it be possible to perform the complete analysis before starting to compress? I'm sorry if I misunderstood some parts of the process. Thanks in advance for any reply!
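For reference, a minimal sketch of the kind of mkdwarfs invocation being described here. The flag values are only assumptions for illustration, and the exact spelling of the nilsimsa ordering option differs between dwarfs versions, so check `mkdwarfs --help` for your build:

```bash
# Hypothetical example; the values are assumptions, not a recommendation.
#   -W 12             small segmenter match window (2^12 = 4 KiB)
#   -B 10             look back across up to 10 previous filesystem blocks
#   --order=nilsimsa  similarity-based ordering of the input files
mkdwarfs -i ./videos -o videos.dwarfs -W 12 -B 10 --order=nilsimsa
```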
-
Yeah, I recently noticed this myself! Even when using back-referencing and very high block sizes, …
-
Interesting use case, thanks for the feedback! My gut feeling when reading this was that it's simply because the lookback buffer is too small that mkdwarfs isn't able to pick up the redundancies. The rationale behind keeping the lookback buffer size limited is for a more typical use case: say you have a file that's relatively small compared to the configured filesystem block size. Even if you were able to assemble that file mostly from chunks split across 20 different filesystem blocks, you'd rarely want to, because you'd have to decompress 20 filesystem blocks in order to re-assemble the file when mounting the filesystem image. However, your use case is just begging for a) large filesystem blocks and b) a more or less unlimited lookback buffer. Thankfully, a large lookback buffer isn't much of a problem if you've got enough memory. It won't even slow down mkdwarfs. I've tried to reproduce your use case by taking a ~500 MiB …
As far as options go, I went for …
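A plausible sketch of options along those lines (large filesystem blocks plus a generous lookback); the specific values below are assumptions, not the command actually used in this test:

```bash
# Assumed values for illustration only:
#   -S 26   64 MiB filesystem blocks (the argument is log2 of the block size)
#   -B 32   allow segment references into up to 32 previous blocks
#   -L 8g   cap mkdwarfs' memory usage at 8 GiB
mkdwarfs -i ./videos -o videos.dwarfs -S 26 -B 32 -L 8g
```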
As you can see, … So I'd basically suggest you set …
-
One more thing: you can actually get away with less RAM and a smaller lookback buffer if you order the input files accordingly. I'd be surprised if any of the similarity-based ordering algorithms that … (Which reminds me of #6: it'd be really nice if you could just provide a list of files in the order in which you want to pack them.)
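If your mkdwarfs version doesn't let you pass an explicit file list, one low-tech way to control the order is to shape the input directory so that plain path-based ordering already puts similar files next to each other. The layout and file names below are hypothetical:

```bash
# Hypothetical layout: group every language variant of a talk under one prefix
# so that path ordering keeps them adjacent in the image.
#   staging/talk01/talk01_de.mkv
#   staging/talk01/talk01_en.mkv
#   staging/talk01/talk01_fr.mkv
#   staging/talk01/talk01_multi.mkv
mkdwarfs -i staging -o videos.dwarfs --order=path
```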
-
OK, I finally have some time to write a proper reply. You were right! I didn't realize that I was only searching for duplicates in some 160 MB of data. After half a day of experimenting on my slow machine, here are some takeaways:
BTW, I strongly recommend you take a look at …
-
Okay, I've now tried …
So speed is in the same ballpark, but compression is actually worse. (If I'm just using it wrong, please let me know, but this is what I inferred from the discussion above.) With … What I don't quite understand is why you're still seeing such a large discrepancy between …
Hardly. The problem here is that DwarFS is a file system, not an archiver (even though it does offer compression ratios that are comparable to archivers). The requirements for a file system can be quite different. For example, you want to be able to quickly access individual files without having to unpack the whole file system image. If DwarFS were using something like … Regarding your test, please help me to correctly understand your input data. From …
I understand that this is 4 video files that take up a total of 866 MB? So each video file by itself is just over 200 MB? And the video stream data in all those files is identical? So, I've made another test:
I was thinking, because of the size of your files, that they probably contain slightly lower quality video streams than my previous test files. As mentioned above, this could result in potentially smaller segments. So the new files are compressed more aggressively and I've also used 3 audio tracks. And indeed, in order to get results that are comparable with my previous test, I have to lower the window size quite a bit:
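The exact command and numbers from this test aren't quoted here, but the knob in question is the segmenter window size; a sketch with assumed values:

```bash
# Assumed values: a smaller -W (log2 of the window size) lets shorter identical
# runs match, at the price of a more fragmented, slower-to-build image.
mkdwarfs -i ./videos -o videos.dwarfs -S 26 -B 32 -W 10
```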
So please definitely give it a try and pass in a smaller window size. (This is something that I don't recommend in general, as it can easily lead to a fragmented file system and slow things down.)
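As an aside on the random-access point above, this is roughly what the file-system use case looks like in practice; the image name, mount point and option values are hypothetical:

```bash
# Mount the image and read a single file without unpacking anything else.
dwarfs videos.dwarfs /mnt/videos -o workers=4,cachesize=1g
mpv /mnt/videos/talk01_en.mkv
fusermount -u /mnt/videos   # unmount when done
```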
-
Done! I was finally able to build the most efficient filesystem yet, and it's only 30% of the uncompressed size. These were my options:
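A plausible sketch of an option set along the lines discussed in this thread; the values are assumptions, not the poster's actual command:

```bash
# Illustrative only: a compression preset plus large blocks, a deep lookback,
# a small match window and a generous memory limit.
mkdwarfs -i ./videos -o videos.dwarfs -l 7 -S 26 -B 16 -W 10 -L 12g
```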
I'm guessing that if I had 64 GB of memory and didn't mind waiting forever, it could get a little smaller by leveraging the few minutes where the last videos repeat parts of the first ones (the videos are from a convention, and some parts get an overview at the end). Thanks for all the help! I know a lot more about dwarfs now!