
Add decode_all and decode_all_to_vec #70

Merged · 3 commits · Sep 18, 2024

Conversation

@philipc (Contributor) commented Sep 16, 2024

These decode multiple frames at once.
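
(For orientation: the two new entry points have roughly the following shape. The exact signatures and error types are defined by the PR itself, so treat this as an assumed sketch rather than the merged API.)

    // Assumed shapes, for illustration only; the merged signatures may differ.

    /// Decode every frame in `input`, appending the decompressed bytes to
    /// `output`, reusing an existing `FrameDecoder` between frames.
    pub fn decode_all(
        decoder: &mut FrameDecoder,
        input: &[u8],
        output: &mut Vec<u8>,
    ) -> Result<(), FrameDecoderError> {
        // Frame loop omitted here; see the bounded decode-loop sketch later
        // in this thread for how one iteration can look.
        unimplemented!()
    }

    /// Convenience wrapper that creates the `FrameDecoder` internally.
    pub fn decode_all_to_vec(
        input: &[u8],
        output: &mut Vec<u8>,
    ) -> Result<(), FrameDecoderError> {
        let mut decoder = FrameDecoder::new();
        decode_all(&mut decoder, input, output)
    }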

@philipc (Contributor, Author) commented Sep 16, 2024

I need to add tests for these still. Also I'm not sure if this should be using StreamingDecoder in the implementation, or if it should use FrameDecoder directly; doing so may allow us to avoid use of io::Error.

philipc marked this pull request as draft September 16, 2024 07:37
@KillingSpark (Owner) commented Sep 16, 2024

A good test would probably be to concatenate a few of the already checked-in .zst files in memory, adding skippable frames in interesting locations (first position, somewhere in the middle, last position).

I think it would be best to use the FrameDecoder to make the errors more meaningful. Since there is no I/O going on, converting the crate's errors into io errors, only for users to then have to interpret those to work out the underlying conditions, seems unnecessarily complex.

Thanks for working on this!
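
(A rough sketch of the suggested test. The fixture paths are placeholders and the decode_all_to_vec signature follows the assumed sketch above, not necessarily the code merged in this PR.)

    #[test]
    fn decode_all_handles_skippable_frames() {
        // Two already checked-in compressed fixtures (placeholder paths).
        let frame_a = std::fs::read("tests/fixture_a.zst").unwrap();
        let frame_b = std::fs::read("tests/fixture_b.zst").unwrap();

        // A minimal skippable frame: magic 0x184D2A50 (little endian),
        // a 4-byte little-endian length, then that many payload bytes.
        let mut skippable = vec![0x50, 0x2A, 0x4D, 0x18];
        skippable.extend_from_slice(&4u32.to_le_bytes());
        skippable.extend_from_slice(&[0xAA; 4]);

        // Skippable frames in the interesting locations: first, middle, last.
        let input = [
            skippable.as_slice(),
            frame_a.as_slice(),
            skippable.as_slice(),
            frame_b.as_slice(),
            skippable.as_slice(),
        ]
        .concat();

        let mut output = Vec::new();
        decode_all_to_vec(&input, &mut output).unwrap();

        // Expect the concatenation of both decoded fixtures; compare against
        // a reference decompression of frame_a followed by frame_b here.
        assert!(!output.is_empty());
    }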

Commit: "These decode multiple frames at once."
philipc marked this pull request as ready for review September 17, 2024 05:58
Inline review comments on the following lines:

        Err(e) => return Err(e),
    };
    loop {
        decoder.decode_blocks(&mut input, BlockDecodingStrategy::UptoBlocks(1))?;
philipc (Contributor, Author):

This strategy is different from what StreamingDecoder uses, but I chose it because I assume it will reduce memory usage without affecting performance.

KillingSpark (Owner):

I think 1 block is a bit too restrictive; this means decoding in 128 KB blocks, copying the results, and continuing. I'll benchmark this to see how severe the impact is. Functionality-wise this doesn't really matter, so I won't block the PR on this.

philipc (Contributor, Author):

Is there some overhead that is incurred from copying more often? What do you think would be a better size? I don't think we want to use BlockDecodingStrategy::UptoBytes(output.len()) because if the input is a single large frame then that roughly doubles the total memory usage, and probably results in a few unwanted reallocs as the buffer grows.

Ideally the decoder would use output directly and avoid a copy in the first place, but the code doesn't seem to be set up to allow that.

@KillingSpark (Owner) commented Sep 17, 2024:

Ideally the decoder would use output directly and avoid a copy in the first place, but the code doesn't seem to be set up to allow that.

That would be ideal in terms of memory usage, but the algorithm always needs access to the last window-size bytes, where the window size is defined by the frame header anyway, so some dynamic buffering will have to be done by the decoder. You could spill data directly to the output, but that would also mean a lot of small copies.

Is there some overhead that is incurred from copying more often?

It just means that you enter the decoding loop more often to copy results in small chunks. It's not going to be a huge difference, but I'm going to guess it's noticeable.

I don't think we want to use BlockDecodingStrategy::UptoBytes(output.len())

I agree. Looking at it now, I'd limit the streaming decoder to only decode up to a reasonable limit, like a few MB, at a time. The expectation for the StreamingDecoder wasn't that the output would be a huge buffer, but I see how that can definitely be the case and be bad in terms of memory usage. This is what I'd recommend for this implementation too.

philipc (Contributor, Author):

That would be ideal in terms of memory usage, but the algorithm always needs access to the last window-size bytes, where the window size is defined by the frame header anyway, so some dynamic buffering will have to be done by the decoder.

I know you need that for streaming, but in this case it should be able to use output for the window buffer.

@KillingSpark (Owner) commented Sep 18, 2024:

I guess that's true, but I think it's pretty uncommon to know how large the decompressed frame is going to be. Maybe the sequence execution (currently implemented on a ring buffer) could be made pluggable. That would remove the need for all the memcpy calls used to collect bytes from the decoder, which would be nice. Not sure it's worth the effort/code complexity, though, to be honest.
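
(A minimal sketch of the bounded strategy recommended earlier in this thread: cap how much is decoded before draining the decoder, so the internal buffer stays at a few MB even for huge frames. The FrameDecoder method names follow the crate's existing API as used in the diff above, but the exact signatures, error plumbing, and skippable-frame handling are assumptions.)

    const LIMIT: usize = 4 * 1024 * 1024; // "a few MB" per iteration

    fn decode_one_frame(
        decoder: &mut FrameDecoder,
        input: &mut &[u8],
        output: &mut Vec<u8>,
    ) -> Result<(), Box<dyn std::error::Error>> {
        decoder.init(input)?;
        loop {
            // Decode at most LIMIT bytes worth of blocks, then drain them so
            // the decoder's buffer never grows to the size of the whole frame.
            decoder.decode_blocks(input, BlockDecodingStrategy::UptoBytes(LIMIT))?;
            if let Some(chunk) = decoder.collect() {
                output.extend_from_slice(&chunk);
            }
            if decoder.is_finished() {
                break;
            }
        }
        Ok(())
    }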

src/frame_decoder.rs (outdated review comment, resolved)
@KillingSpark (Owner) commented Sep 18, 2024

Do you have any more changes you want to get into this PR? It currently looks good to me.

@philipc (Contributor, Author) commented Sep 18, 2024

I think this can be merged. I do agree that the current BlockDecodingStrategy should be changed, but that can be done later after benchmarking.

KillingSpark merged commit b99a8b9 into KillingSpark:master Sep 18, 2024
2 checks passed
philipc deleted the decode_all branch September 18, 2024 11:56
@philipc (Contributor, Author) commented Sep 21, 2024

I did some benchmarking with a single 64 MB file. I manually tried a range of strategies, both UptoBytes and UptoBlocks. Making them too large definitely made performance worse.

Changing from UptoBlocks(1) to UptoBytes(1_000_000):

decode_all              time:   [558.81 ms 559.58 ms 560.28 ms]
                        change: [+0.7347% +0.9593% +1.1589%] (p = 0.00 < 0.05)

Changing from UptoBlocks(1) to UptoBytes(10_000_000):

decode_all              time:   [572.79 ms 574.92 ms 577.69 ms]
                        change: [+3.1492% +3.5998% +4.1800%] (p = 0.00 < 0.05)

Changing from UptoBlocks(1) to UptoBytes(1) is possibly a small improvement (and it would make sense to me for it to be better, since some blocks yield 0 bytes), but the difference is down in the noise.

@KillingSpark (Owner):

That's very unexpected. I'll have to have a look at that; my expectations were pretty much completely contrary to these results. Seems like there is something awry in either my code or my understanding of what affects its performance.

@KillingSpark (Owner):

I added a benchmark that uses decode_all rather than the vec version, because its performance does not depend on the capacity of the target. This benchmark does benefit from a bigger buffer size of 1 MB compared to UptoBytes(1). Can you confirm that?

@philipc (Contributor, Author) commented Sep 27, 2024

Yes, I see that benefit for the benchmark you added. However, I don't think this is a realistic benchmark. The test data is small, so the criterion warmup will put the source, decode buffer, and target all in the cache. That isn't going to happen in practice, because you don't decompress the same thing repeatedly in a loop.

If I modify the benchmark to move the FrameDecoder::new() call inside the loop, then I once again see that a smaller buffer size is better.
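
(Roughly what that modification looks like, assuming a criterion harness and the decode_all shape assumed earlier in this thread; the file path and benchmark name are placeholders, not the benchmark actually in the repository.)

    use criterion::{criterion_group, criterion_main, Criterion};
    // (imports for FrameDecoder / decode_all omitted; their paths depend on
    // the crate layout and are assumed here)

    fn decode_all_cold_decoder(c: &mut Criterion) {
        // Placeholder path; in this discussion it would be the 64 MB file.
        let compressed = std::fs::read("benches/data.zst").unwrap();

        c.bench_function("decode_all, fresh FrameDecoder per iteration", |b| {
            b.iter(|| {
                // Constructing the decoder inside the measured closure, so
                // warmed-up decoder buffers don't carry across iterations.
                let mut decoder = FrameDecoder::new();
                let mut output = Vec::new();
                decode_all(&mut decoder, &compressed, &mut output).unwrap();
                output
            })
        });
    }

    criterion_group!(benches, decode_all_cold_decoder);
    criterion_main!(benches);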

@philipc (Contributor, Author) commented Sep 27, 2024

Hmm, but that's not a good test either, because of course allocating a smaller buffer will be faster.

@philipc (Contributor, Author) commented Sep 27, 2024

I modified your benchmark to use the 64 MB compressed file from my previous testing, and with it I am seeing a performance improvement from changing to UptoBytes(1). (This is without the FrameDecoder::new() change.)
