-
Hi! I have a special use case for which I wanted to use dwarfs. There are 12 videos, each in 3 different languages, plus a bundled video for each of the 12 with all 3 audio tracks in a single file (48 files total). So even though video files are not usually compressible, these have a lot of data in common: probably every frame that isn't language-specific text is duplicated across 3 files, and the multilingual bundles are made up entirely of tracks copied from the others. Using a sensible sort order and a very fast deduplication utility I was able to reduce the corpus to 35% of its size in my tests. The utility is Bulat Ziganshin's "rep" filter from the old FreeArc compression suite; it's a Lempel-Ziv type of compressor with no entropy coding. The problem is that mkdwarfs' hash-based deduplication is not picking up on this. I tried reducing the window size to ridiculously small values, using a lookback value of 10 and a nilsimsa limit of 224, but the savings are barely 3%, which I'm guessing comes from the actual zstd compression. So my question is: am I doing something wrong, or is mkdwarfs simply not able to detect this duplicate data? I noticed that it starts writing to disk well before the segmentation stage is complete. Could this be the reason so much duplicate data is missed? What happens if the first file analyzed is very similar to the last one? Would it be possible to perform the complete analysis before starting to compress? I'm sorry if I misunderstood some parts of the process. Thanks in advance for any reply!
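For reference, a minimal sketch of the kind of mkdwarfs invocation being described here. The flag values are only assumptions for illustration, and the exact spelling of the nilsimsa ordering option differs between dwarfs versions, so check `mkdwarfs --help` for your build:

```bash
# Hypothetical example; the values are assumptions, not a recommendation.
#   -W 12             small segmenter match window (2^12 = 4 KiB)
#   -B 10             look back across up to 10 previous filesystem blocks
#   --order=nilsimsa  similarity-based ordering of the input files
mkdwarfs -i ./videos -o videos.dwarfs -W 12 -B 10 --order=nilsimsa
```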
-
Yeah, I recently noticed this myself! Even when using back-referencing and very high block sizes, …
-
Interesting use case, thanks for the feedback! My gut feeling when reading this was that it's simply because the lookback buffer is too small that mkdwarfs isn't able to pick up the redundancies. The rationale behind keeping the lookback buffer size limited is for a more typical use case: say you have a file that's relatively small compared to the configured filesystem block size. Even if you were able to assemble that file mostly from chunks split across 20 different filesystem blocks, you'd rarely want to, because you'd have to decompress 20 filesystem blocks in order to re-assemble the file when mounting the filesystem image. However, your use case is just begging for a) large filesystem blocks and b) a more or less unlimited lookback buffer. Thankfully, a large lookback buffer isn't much of a problem if you've got enough memory. It won't even slow down mkdwarfs. I've tried to reproduce your use case by taking a ~500 MiB …
As far as options go, I went for …
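A plausible sketch of options along those lines (large filesystem blocks plus a generous lookback); the specific values below are assumptions, not the command actually used in this test:

```bash
# Assumed values for illustration only:
#   -S 26   64 MiB filesystem blocks (the argument is log2 of the block size)
#   -B 32   allow segment references into up to 32 previous blocks
#   -L 8g   cap mkdwarfs' memory usage at 8 GiB
mkdwarfs -i ./videos -o videos.dwarfs -S 26 -B 32 -L 8g
```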
As you can see, … So I'd basically suggest you set …
-
One more thing: you can actually get away with less RAM and a smaller lookback buffer if you order the input files accordingly. I'd be surprised if any of the similarity-based ordering algorithms that … (Which reminds me of #6: it'd be really nice if you could just provide a list of files in the order in which you want to pack them.)
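If your mkdwarfs version doesn't let you pass an explicit file list, one low-tech way to control the order is to shape the input directory so that plain path-based ordering already puts similar files next to each other. The layout and file names below are hypothetical:

```bash
# Hypothetical layout: group every language variant of a talk under one prefix
# so that path ordering keeps them adjacent in the image.
#   staging/talk01/talk01_de.mkv
#   staging/talk01/talk01_en.mkv
#   staging/talk01/talk01_fr.mkv
#   staging/talk01/talk01_multi.mkv
mkdwarfs -i staging -o videos.dwarfs --order=path
```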
-
OK, I finally have some time to write a proper reply. You were right! I didn't realize that I was only searching for duplicates in some 160 MB of data. After half a day of experimenting on my slow machine, here are some takeaways:
BTW, I strongly recommend you take a look at …
-
Okay, I've now tried …
So speed is in the same ballpark, but compression is actually worse. (If I'm just using it wrong, please let me know, but this is what I inferred from the discussion above.) With … What I don't quite understand is why you're still seeing such a large discrepancy between …
Hardly. The problem here is that DwarFS is a file system, not an archiver (even though it does offer compression ratios that are comparable to archivers). The requirements for a file system can be quite different. For example, you want to be able to quickly access individual files without having to unpack the whole file system image. If DwarFS were using something like … Regarding your test, please help me to correctly understand your input data. From …
I understand that this is 4 video files that take up a total of 866 MB? So each video file by itself is just over 200 MB? And the video stream data in all those files is identical? So, I've made another test:
I was thinking, because of the size of your files, that they probably contain slightly lower quality video streams than my previous test files. As mentioned above, this could result in potentially smaller segments. So the new files are compressed more aggressively and I've also used 3 audio tracks. And indeed, in order to get results that are comparable with my previous test, I have to lower the window size quite a bit:
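The exact command and numbers from this test aren't quoted here, but the knob in question is the segmenter window size; a sketch with assumed values:

```bash
# Assumed values: a smaller -W (log2 of the window size) lets shorter identical
# runs match, at the price of a more fragmented, slower-to-build image.
mkdwarfs -i ./videos -o videos.dwarfs -S 26 -B 32 -W 10
```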
So please definitely give it a try and pass in a smaller window size. (This is something that I don't recommend in general, as it can easily lead to a fragmented file system and slow things down.)
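As an aside on the random-access point above, this is roughly what the file-system use case looks like in practice; the image name, mount point and option values are hypothetical:

```bash
# Mount the image and read a single file without unpacking anything else.
dwarfs videos.dwarfs /mnt/videos -o workers=4,cachesize=1g
mpv /mnt/videos/talk01_en.mkv
fusermount -u /mnt/videos   # unmount when done
```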
-
Done! I was finally able to build the most efficient filesystem yet, and it's only 30% of the uncompressed size. These were my options:
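A plausible sketch of an option set along the lines discussed in this thread; the values are assumptions, not the poster's actual command:

```bash
# Illustrative only: a compression preset plus large blocks, a deep lookback,
# a small match window and a generous memory limit.
mkdwarfs -i ./videos -o videos.dwarfs -l 7 -S 26 -B 16 -W 10 -L 12g
```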
I'm guessing that if I had 64 GB of memory and didn't mind waiting forever, it could get a little smaller by leveraging the few minutes where the last videos repeat parts of the first ones (the videos are from a convention, and some parts get an overview at the end). Thanks for all the help! I know a lot more about dwarfs now!