
Add decode_all and decode_all_to_vec #70

Merged · 3 commits · Sep 18, 2024

Conversation

@philipc (Contributor) commented Sep 16, 2024

These decode multiple frames at once.
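
(For orientation: the two new entry points have roughly the following shape. The exact signatures and error types are defined by the PR itself, so treat this as an assumed sketch rather than the merged API.)

    // Assumed shapes, for illustration only; the merged signatures may differ.

    /// Decode every frame in `input`, appending the decompressed bytes to
    /// `output`, reusing an existing `FrameDecoder` between frames.
    pub fn decode_all(
        decoder: &mut FrameDecoder,
        input: &[u8],
        output: &mut Vec<u8>,
    ) -> Result<(), FrameDecoderError> {
        // Frame loop omitted here; see the bounded decode-loop sketch later
        // in this thread for how one iteration can look.
        unimplemented!()
    }

    /// Convenience wrapper that creates the `FrameDecoder` internally.
    pub fn decode_all_to_vec(
        input: &[u8],
        output: &mut Vec<u8>,
    ) -> Result<(), FrameDecoderError> {
        let mut decoder = FrameDecoder::new();
        decode_all(&mut decoder, input, output)
    }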

@philipc (Contributor, Author) commented Sep 16, 2024

I need to add tests for these still. Also I'm not sure if this should be using StreamingDecoder in the implementation, or if it should use FrameDecoder directly; doing so may allow us to avoid use of io::Error.

philipc marked this pull request as draft September 16, 2024 07:37
@KillingSpark (Owner) commented Sep 16, 2024

A good test would probably be to concatenate a few of the already checked-in .zst files in memory, adding skippable frames in interesting locations (first position, somewhere in the middle, last position).

I think it would be best to use the FrameDecoder to make the errors more meaningful. Since there is no I/O going on, converting the crate's errors into io errors, only for users to then have to interpret those to work out the underlying conditions, seems unnecessarily complex.

Thanks for working on this!
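
(A rough sketch of the suggested test. The fixture paths are placeholders and the decode_all_to_vec signature follows the assumed sketch above, not necessarily the code merged in this PR.)

    #[test]
    fn decode_all_handles_skippable_frames() {
        // Two already checked-in compressed fixtures (placeholder paths).
        let frame_a = std::fs::read("tests/fixture_a.zst").unwrap();
        let frame_b = std::fs::read("tests/fixture_b.zst").unwrap();

        // A minimal skippable frame: magic 0x184D2A50 (little endian),
        // a 4-byte little-endian length, then that many payload bytes.
        let mut skippable = vec![0x50, 0x2A, 0x4D, 0x18];
        skippable.extend_from_slice(&4u32.to_le_bytes());
        skippable.extend_from_slice(&[0xAA; 4]);

        // Skippable frames in the interesting locations: first, middle, last.
        let input = [
            skippable.as_slice(),
            frame_a.as_slice(),
            skippable.as_slice(),
            frame_b.as_slice(),
            skippable.as_slice(),
        ]
        .concat();

        let mut output = Vec::new();
        decode_all_to_vec(&input, &mut output).unwrap();

        // Expect the concatenation of both decoded fixtures; compare against
        // a reference decompression of frame_a followed by frame_b here.
        assert!(!output.is_empty());
    }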

Commit: "These decode multiple frames at once."
philipc marked this pull request as ready for review September 17, 2024 05:58
Inline review comments on the following lines:

        Err(e) => return Err(e),
    };
    loop {
        decoder.decode_blocks(&mut input, BlockDecodingStrategy::UptoBlocks(1))?;
philipc (Contributor, Author):

This strategy is different from what StreamingDecoder uses, but I chose it because I assume it will reduce memory usage without affecting performance.

KillingSpark (Owner):

I think 1 block is a bit too restrictive; this means decoding in 128 KB blocks, copying the results, and continuing. I'll benchmark this to see how severe the impact is. Functionality-wise this doesn't really matter, so I won't block the PR on this.

philipc (Contributor, Author):

Is there some overhead that is incurred from copying more often? What do you think would be a better size? I don't think we want to use BlockDecodingStrategy::UptoBytes(output.len()) because if the input is a single large frame then that roughly doubles the total memory usage, and probably results in a few unwanted reallocs as the buffer grows.

Ideally the decoder would use output directly and avoid a copy in the first place, but the code doesn't seem to be set up to allow that.

@KillingSpark (Owner) commented Sep 17, 2024:

Ideally the decoder would use output directly and avoid a copy in the first place, but the code doesn't seem to be set up to allow that.

That would be ideal in terms of memory usage, but the algorithm always needs access to the last window-size bytes, where the window size is defined by the frame header anyway, so some dynamic buffering will have to be done by the decoder. You could spill data directly to the output, but that would also mean a lot of small copies.

Is there some overhead that is incurred from copying more often?

It just means that you enter the decoding loop more often to copy results in small chunks. It's not going to be a huge difference, but I'm going to guess it's noticeable.

I don't think we want to use BlockDecodingStrategy::UptoBytes(output.len())

I agree. Looking at it now, I'd limit the streaming decoder to only decode up to a reasonable limit, like a few MB, at a time. The expectation for the StreamingDecoder wasn't that the output would be a huge buffer, but I see how that can definitely be the case and be bad in terms of memory usage. This is what I'd recommend for this implementation too.

philipc (Contributor, Author):

That would be ideal in terms of memory usage, but the algorithm always needs access to the last window-size bytes, where the window size is defined by the frame header anyway, so some dynamic buffering will have to be done by the decoder.

I know you need that for streaming, but in this case it should be able to use output for the window buffer.

@KillingSpark (Owner) commented Sep 18, 2024:

I guess that's true, but I think it's pretty uncommon to know how large the decompressed frame is going to be. Maybe the sequence execution (currently implemented on a ring buffer) could be made pluggable. That would remove the need for all the memcpy calls used to collect bytes from the decoder, which would be nice. Not sure it's worth the effort/code complexity, though, to be honest.
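
(A minimal sketch of the bounded strategy recommended earlier in this thread: cap how much is decoded before draining the decoder, so the internal buffer stays at a few MB even for huge frames. The FrameDecoder method names follow the crate's existing API as used in the diff above, but the exact signatures, error plumbing, and skippable-frame handling are assumptions.)

    const LIMIT: usize = 4 * 1024 * 1024; // "a few MB" per iteration

    fn decode_one_frame(
        decoder: &mut FrameDecoder,
        input: &mut &[u8],
        output: &mut Vec<u8>,
    ) -> Result<(), Box<dyn std::error::Error>> {
        decoder.init(input)?;
        loop {
            // Decode at most LIMIT bytes worth of blocks, then drain them so
            // the decoder's buffer never grows to the size of the whole frame.
            decoder.decode_blocks(input, BlockDecodingStrategy::UptoBytes(LIMIT))?;
            if let Some(chunk) = decoder.collect() {
                output.extend_from_slice(&chunk);
            }
            if decoder.is_finished() {
                break;
            }
        }
        Ok(())
    }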

src/frame_decoder.rs (outdated review comment, resolved)
@KillingSpark (Owner) commented Sep 18, 2024

Do you have any more changes you want to get into this PR? It currently looks good to me.

@philipc (Contributor, Author) commented Sep 18, 2024

I think this can be merged. I do agree that the current BlockDecodingStrategy should be changed, but that can be done later after benchmarking.

KillingSpark merged commit b99a8b9 into KillingSpark:master Sep 18, 2024
2 checks passed
philipc deleted the decode_all branch September 18, 2024 11:56
@philipc (Contributor, Author) commented Sep 21, 2024

I did some benchmarking with a single 64 MB file. I manually tried a range of strategies, both UptoBytes and UptoBlocks. Making them too large definitely made performance worse.

Changing from UptoBlocks(1) to UptoBytes(1_000_000):

decode_all              time:   [558.81 ms 559.58 ms 560.28 ms]
                        change: [+0.7347% +0.9593% +1.1589%] (p = 0.00 < 0.05)

Changing from UptoBlocks(1) to UptoBytes(10_000_000):

decode_all              time:   [572.79 ms 574.92 ms 577.69 ms]
                        change: [+3.1492% +3.5998% +4.1800%] (p = 0.00 < 0.05)

Changing from UptoBlocks(1) to UptoBytes(1) is possibly a small improvement (and it would make sense to me for it to be better, since some blocks yield 0 bytes), but the difference is down in the noise.

@KillingSpark (Owner):

That's very unexpected. I'll have to have a look at that; my expectations were pretty much completely contrary to these results. Seems like there is something awry in either my code or my understanding of what affects its performance.

@KillingSpark (Owner):

I added a benchmark that uses decode_all rather than the vec version, because its performance does not depend on the capacity of the target. This benchmark does benefit from a bigger buffer size of 1 MB compared to UptoBytes(1). Can you confirm that?

@philipc (Contributor, Author) commented Sep 27, 2024

Yes, I see that benefit for the benchmark you added. However, I don't think this is a realistic benchmark. The test data is small, so the criterion warmup will put the source, decode buffer, and target all in the cache. That isn't going to happen in practice, because you don't decompress the same thing repeatedly in a loop.

If I modify the benchmark to move the FrameDecoder::new() call inside the loop, then I once again see that a smaller buffer size is better.
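
(Roughly what that modification looks like, assuming a criterion harness and the decode_all shape assumed earlier in this thread; the file path and benchmark name are placeholders, not the benchmark actually in the repository.)

    use criterion::{criterion_group, criterion_main, Criterion};
    // (imports for FrameDecoder / decode_all omitted; their paths depend on
    // the crate layout and are assumed here)

    fn decode_all_cold_decoder(c: &mut Criterion) {
        // Placeholder path; in this discussion it would be the 64 MB file.
        let compressed = std::fs::read("benches/data.zst").unwrap();

        c.bench_function("decode_all, fresh FrameDecoder per iteration", |b| {
            b.iter(|| {
                // Constructing the decoder inside the measured closure, so
                // warmed-up decoder buffers don't carry across iterations.
                let mut decoder = FrameDecoder::new();
                let mut output = Vec::new();
                decode_all(&mut decoder, &compressed, &mut output).unwrap();
                output
            })
        });
    }

    criterion_group!(benches, decode_all_cold_decoder);
    criterion_main!(benches);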

@philipc (Contributor, Author) commented Sep 27, 2024

Hmm, but that's not a good test either, because of course allocating a smaller buffer will be faster.

@philipc (Contributor, Author) commented Sep 27, 2024

I modified your benchmark to use the 64 MB compressed file from my previous testing, and with it I am seeing a performance improvement from changing to UptoBytes(1). (This is without the FrameDecoder::new() change.)
