FLAC for uncompressed PCM audio? #95

Dwedit · 2022-07-25T06:08:48Z

It appears that uncompressed PCM audio can still be found in some places.

There is the CHD (Compressed Hunk of Data) project from MAME, which provides a compressed format for Disk Images, mainly CDs. When it's time to encode a CD Audio track, that is done as FLAC.

I was once thinking about integrating FLAC into a read-only compressed filesystem, but there are things that would need to be sorted out before it would be practical. You'd need to know if a block of data was Uncompressed Audio or not, and of course, audio blocks have headers that aren't audio.

M-Gonzalo · 2023-02-04T14:37:52Z

I think the most mature lossless audio compressor right now is wavpack. It's fast enough to be decompressed on-the-fly, and of course compresses way better than any general purpose algorithm.
The only drawback I see is the special case where there are several nearly identical files. DwarFS's deduplication might exploit this, but not when they're already compressed individually.

mhx (Maintainer) · 2023-07-15T09:02:44Z

> I think the most mature lossless audio compressor right now is wavpack. It's fast enough to be decompressed on-the-fly, and of course compresses way better than any general purpose algorithm.

Just did a quick test and the flac decoder was almost 3 times as fast (168 MiB/s) as the wavpack decoder (60 MiB/s). The difference in size between the compressed files was marginal (54.3% for flac, 53.6% for wavpack).

> The only drawback I see is the special case where there are several nearly identical files. DwarFS's deduplication might exploit this, but not when they're already compressed individually.

At least in theory, it'd be perfectly reasonable to perform deduplication first and then compress whatever remains after deduplication.

In general, it might be worth chunking up large audio files before compression to improve both access times and decompression speed. I just ran a test with flac: I split a ~600 MiB WAV file into 16 MiB chunks and compressed each of them individually.

Compressing the single 600 MiB file took 15.0 seconds (flac --best).
Compressing the 38 chunks using 16 processes took less than 2 seconds. (To be expected on an 8-core CPU.)
Decompressing the 38 chunks using 16 processes took around 400 ms (that's 1.5 GiB/s).
Decompressing the single huge flac file took 3.7 seconds.

What's more, turns out that the 38 flac chunks combined were 1% smaller than the single huge flac file.

I've no idea how representative this test is (performed using a CDDA rip of Mike Oldfield's "Amarok"), but it certainly looks promising.
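
For illustration, the chunking experiment above could be sketched roughly like this. This is not the flac-based test itself: it uses Python's `zlib` as a stand-in codec so that it runs without external tools, and the names `split_pcm`, `compress_chunks` and `decompress_chunks` are hypothetical. Chunks are aligned to whole frames (an assumption of this sketch) so that each one can be decoded on its own.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_pcm(pcm: bytes, chunk_size: int, frame_size: int) -> list[bytes]:
    """Split raw PCM into chunks whose length is a multiple of frame_size."""
    if frame_size <= 0 or chunk_size < frame_size:
        raise ValueError("chunk_size must hold at least one frame")
    step = chunk_size - chunk_size % frame_size  # never cut a frame in half
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]


def compress_chunks(chunks: list[bytes], workers: int = 4) -> list[bytes]:
    """Compress every chunk independently (zlib is only a stand-in for flac)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, chunks))


def decompress_chunks(blobs: list[bytes], workers: int = 4) -> bytes:
    """Decompress all chunks in parallel and join them in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blobs))
```

With a ~600 MiB input and 16 MiB chunks, `split_pcm` would return `ceil(600 / 16) = 38` chunks, which matches the chunk count in the test above.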

mhx (Maintainer) · 2023-07-15T09:13:24Z

I completely ignored the fact that flac is a seekable format, so random access isn't really an issue. Anyhow, the ability to decompress flac multi-threaded would be quite neat.

mhx (Maintainer) · 2023-07-15T13:31:54Z

After thinking about this for a bit, I'm pretty certain the changes required to integrate PCM audio compression into DwarFS overlap nicely with the changes for better binary (executable) compression. Even better, none of these should require any metadata changes (I always try to keep the metadata as small as possible).

The basic idea would be like this:

A "categorizer" (already in the works) will identify PCM data that is suitable for flac (or wavpack) compression.
Files will be grouped by their PCM format (number of channels, bit depth, endianness, signedness, sample rate).
Within each group, files will be ordered by similarity/filename/... (similarity works surprisingly well to match e.g. full songs with short snippets from that song).
Groups are processed one after another. When each file is parsed, the header & other metadata go into a "regular" file system block, whereas the PCM data goes into a "special" PCM data block.
Segmentation analysis applies to both blocks as usual, but it likely makes sense to use a much larger window size for PCM data blocks.
PCM data blocks are compressed with the PCM codec chosen (flac or wavpack) and using the appropriate (shared) PCM format.
The header & metadata of each PCM file are compressed along with any other non-PCM data.
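
As a rough illustration of the grouping step, here is a minimal sketch using Python's standard `wave` module. It is not DwarFS code: `PcmFormat`, `read_wav` and `group_by_format` are hypothetical names, files are ordered by name only (similarity ordering is left out), and endianness and signedness are omitted from the key because the `wave` module only handles little-endian WAV files.

```python
import io
import wave
from collections import defaultdict
from typing import NamedTuple


class PcmFormat(NamedTuple):
    """Grouping key (hypothetical): the properties that share one codec setup."""
    channels: int
    bytes_per_sample: int
    sample_rate: int


def read_wav(data: bytes) -> tuple[PcmFormat, bytes]:
    """Return (format, raw PCM payload) for an in-memory WAV file."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        fmt = PcmFormat(wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        return fmt, wav.readframes(wav.getnframes())


def group_by_format(files: dict[str, bytes]) -> dict[PcmFormat, list[str]]:
    """Group file names by PCM format; each group is ordered by file name."""
    groups: dict[PcmFormat, list[str]] = defaultdict(list)
    for name in sorted(files):
        fmt, _pcm = read_wav(files[name])
        groups[fmt].append(name)
    return dict(groups)
```

In this sketch the PCM payload returned by `read_wav` is what would go into a "special" PCM data block, while the remaining header bytes would go into a regular block.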

So quite a few changes are required for building the file system as there's currently no distinction between different block types and only a single block can be active at each point in time. However, from the POV of the DwarFS image reader, the only difference is that there's a new block compression algorithm (flac/wavpack).

For binary (executable/shared lib/...) input files, the process would be very similar. They would also be grouped (e.g. by architecture) and then an appropriate compression algorithm (or filter) would be chosen. In case of e.g. multi-platform executables, data may even be distributed to multiple different blocks. Essentially the categorizer can freely define which parts of a file should be stored in which type of block.

I guess this will be a major feature for the 0.8.0 or 0.9.0 release.

1 reply

M-Gonzalo Jul 16, 2023

Love it

mhx (Maintainer) · 2023-07-15T13:42:28Z

> It appears that uncompressed PCM audio can still be found in some places.

Definitely. Especially when editing, compressed formats are rarely used. A friend of mine has already requested PCM audio support in DwarFS for backing up his music projects.

> I was once thinking about integrating FLAC into a read-only compressed filesystem, but there are things that would need to be sorted out before it would be practical. You'd need to know if a block of data was Uncompressed Audio or not, and of course, audio blocks have headers that aren't audio.

Not only headers. :) But fortunately there are already libraries that should be able to help with the parsing of the PCM files.

mhx (Maintainer) · 2023-08-24T21:30:32Z

4 replies

M-Gonzalo Aug 26, 2023

mhx Aug 26, 2023
Maintainer

Yeah, sorry, sneak preview. You can get those features on the mhx/categorizer branch, but with absolutely no guarantees. There's still a bunch of stuff to be cleaned up and sorted out; I hope this will be officially released before the end of the year.

Phantop Aug 26, 2023

Does this implementation have the ability to find PCM audio within other files? For example, I have a decrypted image of a Wii game which I know uses .wav audio in it, however the image is of course a single file.

mhx Aug 26, 2023
Maintainer

> Does this implementation have the ability to find PCM audio within other files? For example, I have a decrypted image of a Wii game which I know uses .wav audio in it, however the image is of course a single file.

No, and I'm not even sure this can be done reliably. If it's an image, there's no guarantee that the PCM audio data is stored contiguously. So even though it'd be relatively straightforward to scan files for embedded WAV/AIFF/... headers and generate appropriate fragments for downstream processing, I don't know if non-contiguous storage could be an issue.
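
To make the header-scanning idea concrete, here is a minimal sketch that looks for embedded RIFF/WAVE headers in a byte blob. It is not part of the implementation discussed here; `find_wav_candidates` is a hypothetical name, AIFF is not handled, and the sketch simply assumes each file is stored contiguously, which is exactly the open question above.

```python
import struct
from typing import Iterator


def find_wav_candidates(blob: bytes) -> Iterator[tuple[int, int]]:
    """Yield (offset, length) of byte ranges that look like embedded WAV files."""
    pos = blob.find(b"RIFF")
    while pos != -1:
        # A RIFF/WAVE file starts with "RIFF", a 32-bit little-endian size, "WAVE".
        if blob[pos + 8:pos + 12] == b"WAVE":
            (riff_size,) = struct.unpack_from("<I", blob, pos + 4)
            length = riff_size + 8  # the size field excludes the first 8 bytes
            if pos + length <= len(blob):
                yield pos, length
        pos = blob.find(b"RIFF", pos + 4)
```

Each `(offset, length)` pair is only a candidate; a real categorizer would still have to parse the chunk structure before trusting it.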

xcfmc · 2023-08-26T22:22:51Z

Here's an extreme solution... O:-)

Putting my forensics hat on... What if we fingerprinted blocks (rather than headers) to see if a block is audio? I can't remember which app had this, but one of the hex editors I used (maybe WinHex?) was able to do an FFT graphical plot of binary data. The plot made it easy to identify compressed or encrypted data (based on the use of each character in the plot). Raw audio is sinusoidal, so you would have a heavy concentration of values near 0 (or 127 in some audio formats), with fewer occurrences of values at the extremes.

My proposal would be to run an FFT on a piece of each block (post deduplication) and then try different compressors (including flac or wavpack) based on a 'frequency fingerprint'. You could also use delta values to detect audio (i.e. |byte n - byte n+1| <= ~10, OR |byte n - byte n+2| <= ~10 for stereo).

If it appears to be audio, you send about 2048 bytes of that block to FLAC, brute-forcing each of the parameter combinations (# of channels, BPS, and sample rate) until the compression % comes out low; then you send the whole chunk through it. It's a bit extreme, but with a small user-selectable sample size and multithreading, it might go fast.
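
A minimal sketch of just the delta check from this proposal (not the FFT fingerprint or the FLAC parameter search) might look like the following. The function names and the 0.9 acceptance threshold are assumptions made for illustration, and it only covers 8-bit samples; 16-bit data would first have to be decoded into samples, because the low byte of each sample behaves almost like noise.

```python
def small_delta_fraction(block: bytes, stride: int = 1, max_delta: int = 10) -> float:
    """Fraction of byte pairs `stride` apart whose absolute difference is <= max_delta."""
    pairs = len(block) - stride
    if pairs <= 0:
        return 0.0
    hits = sum(abs(block[i] - block[i + stride]) <= max_delta for i in range(pairs))
    return hits / pairs


def looks_like_8bit_audio(block: bytes, min_fraction: float = 0.9) -> bool:
    """Heuristic: mono (stride 1) or interleaved stereo (stride 2) 8-bit PCM.

    min_fraction is an arbitrary illustrative threshold, not a tuned value.
    """
    best = max(small_delta_fraction(block, 1), small_delta_fraction(block, 2))
    return best >= min_fraction
```

For uniformly random bytes, only about 8% of adjacent pairs differ by 10 or less, so compressed or encrypted blocks fall far below the threshold.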

Dwedit (Author) · 2023-09-07T17:57:34Z

I thought of a really simple classifier for a block of uncompressed audio/image data.

Basically you subtract the previous sample, take the absolute value, and sum it up.

Audio and pixel data are likely to have few discontinuities, so the gradient (sample minus previous sample) is likely to be low. Then the 'second derivative' (gradient minus previous gradient) is likely to be low too.

Picking out the sample/pixel format:
Formats to consider:
Number of channels: (Stereo, Mono, 24-bit RGB, 32-bit RGBA)
Sample Format: (8-bit unsigned/signed, 16-bit unsigned/signed, little/big endian, offset by 0/1)

You read a sample from your data and normalize it (multiply 8-bit data by 65535/255 = 257 so that it covers the same value range as 16-bit data).
You subtract the previous sample.
Take the absolute value of that.
Sum all of these.

For multiple channels (stereo audio, RGB image data, etc), the previous sample will be more than one position away.

You will end up with a 'score' (sum) for each possible interpretation of the data. Lowest sum is the most likely format for the data.

For the second derivative (if you want to use it), it's Sample - 2 * previous sample + two samples ago. You can sum the absolute value of that to get a second score.

Then you can threshold your score (possibly second score too) to see if you want to treat it as audio/image or treat it as arbitrary data.

Data that is already compressed will have extremely high scores, while uncompressed audio or images will have low scores.

Weakness: a louder sound (or higher contrast image) has a higher gradient than a quieter sound (or lower contrast image).
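
Here is a minimal Python sketch of the scoring described above, under a few assumptions of mine: the names are hypothetical, only mono/stereo and 8/16-bit formats are tried (no RGB/RGBA and no "offset by 0/1" alignment), and the sum is divided by the number of terms so that interpretations with different sample counts can be compared.

```python
import struct

# struct code and scale factor that brings each format to the 16-bit range
# (8-bit values are multiplied by 257, since 255 * 257 == 65535).
FORMATS = {
    "u8": ("<B", 257),
    "s8": ("<b", 257),
    "u16le": ("<H", 1),
    "s16le": ("<h", 1),
    "u16be": (">H", 1),
    "s16be": (">h", 1),
}


def decode(block: bytes, fmt: str) -> list[int]:
    """Interpret the block as samples of the given format, scaled to 16-bit range."""
    code, scale = FORMATS[fmt]
    size = struct.calcsize(code)
    usable = len(block) - len(block) % size  # drop a trailing partial sample
    return [value * scale for (value,) in struct.iter_unpack(code, block[:usable])]


def gradient_score(samples: list[int], channels: int) -> float:
    """Mean |sample - previous sample of the same channel| (first score)."""
    terms = len(samples) - channels
    if terms <= 0:
        return float("inf")
    total = sum(abs(samples[i] - samples[i - channels])
                for i in range(channels, len(samples)))
    return total / terms


def second_difference_score(samples: list[int], channels: int) -> float:
    """Mean |sample - 2 * previous + two samples ago| (optional second score)."""
    terms = len(samples) - 2 * channels
    if terms <= 0:
        return float("inf")
    total = sum(abs(samples[i] - 2 * samples[i - channels] + samples[i - 2 * channels])
                for i in range(2 * channels, len(samples)))
    return total / terms


def best_interpretation(block: bytes, channel_counts=(1, 2)):
    """Return ((format, channels), score) for the lowest gradient score."""
    scores = {(fmt, ch): gradient_score(decode(block, fmt), ch)
              for fmt in FORMATS for ch in channel_counts}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

The score returned by `best_interpretation` would then be compared against a threshold to decide whether to treat the block as audio/image data or as arbitrary data; picking that threshold is left open here, as it is in the description above.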

Fun fact, grayscale bitmaps compress fairly well as FLAC.

mhx (Maintainer) · 2024-01-10T16:35:10Z

If anyone wants to play with the new features, the CI build workflow now automatically uploads the binary artifacts (universal binaries as well as the regular binary tarball) for each build. These are definitely not "release quality", but it's still good to give them wider exposure to find bugs early.
