Add decode_all and decode_all_to_vec #70

Conversation
I still need to add tests for these. Also, I'm not sure if this should be using …

A good test would probably be to concatenate a few of the already checked-in .zst files in memory, adding skippable frames in interesting locations (first position, somewhere in the middle, last position). I think it would be best to use the FrameDecoder to make the errors more meaningful. Since there is no IO going on, converting the crate's errors to io errors just forces users to interpret those to understand the underlying conditions, which seems unnecessarily complex. Thanks for working on this!
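A rough sketch of such a test, assuming the fixture paths and a `decode_all_to_vec(&[u8]) -> Result<Vec<u8>, _>` signature (neither is confirmed by this PR): concatenate two checked-in .zst files with skippable frames at the start, middle, and end, and compare against decoding each file on its own.

```rust
/// Build a skippable frame: a magic number in 0x184D2A50..=0x184D2A5F,
/// a little-endian u32 payload length, then the payload itself.
fn skippable_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::new();
    frame.extend_from_slice(&0x184D2A50u32.to_le_bytes());
    frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    frame.extend_from_slice(payload);
    frame
}

#[test]
fn decode_all_handles_concatenated_and_skippable_frames() {
    let a = include_bytes!("../decodecorpus_files/z000088.zst"); // assumed fixture path
    let b = include_bytes!("../decodecorpus_files/z000089.zst"); // assumed fixture path

    // Skippable frames in the first, middle, and last positions.
    let mut input = Vec::new();
    input.extend_from_slice(&skippable_frame(b"first"));
    input.extend_from_slice(a);
    input.extend_from_slice(&skippable_frame(b"middle"));
    input.extend_from_slice(b);
    input.extend_from_slice(&skippable_frame(b"last"));

    // Skippable frames must contribute nothing; the regular frames must
    // decode in order.
    let expected = [
        ruzstd::decode_all_to_vec(a).unwrap(),
        ruzstd::decode_all_to_vec(b).unwrap(),
    ]
    .concat();
    assert_eq!(ruzstd::decode_all_to_vec(&input).unwrap(), expected);
}
```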
These decode multiple frames at once.
src/frame_decoder.rs (outdated)

```rust
    Err(e) => return Err(e),
};
loop {
    decoder.decode_blocks(&mut input, BlockDecodingStrategy::UptoBlocks(1))?;
```
This strategy is different from what StreamingDecoder uses, but I chose it because I assume it will reduce memory usage without affecting performance.
I think 1 block is a bit too restrictive; it means decoding in 128 KB blocks, copying the results, and continuing. I'll benchmark this to see how severe the impact is. Functionality-wise this doesn't really matter, so I won't block the PR on this.
Is there some overhead that is incurred from copying more often? What do you think would be a better size? I don't think we want to use BlockDecodingStrategy::UptoBytes(output.len()), because if the input is a single large frame that roughly doubles the total memory usage, and probably results in a few unwanted reallocs as the buffer grows. Ideally the decoder would use output directly and avoid a copy in the first place, but the code doesn't seem to be set up to allow that.
> Ideally the decoder would use output directly and avoid a copy in the first place, but the code doesn't seem to be set up to allow that.

That would be ideal in terms of memory usage, but the algorithm always needs access to the last window-size bytes (where the window size is defined by the frame header), so some dynamic buffering has to be done by the decoder anyway. You could spill data directly to the output, but that would also mean a lot of small copies.
> Is there some overhead that is incurred from copying more often?

It just means that you enter the decoding loop more often to copy results in small chunks. It's not going to be a huge difference, but I'm going to guess it's noticeable.
> I don't think we want to use BlockDecodingStrategy::UptoBytes(output.len())

I agree. Looking at it now, I'd limit the streaming decoder to only decode up to a reasonable limit per call, like a few MB. The expectation for the StreamingDecoder wasn't that the output would be a huge buffer, but I see how that can definitely be the case and be bad in terms of memory usage. This is what I'd recommend for this implementation too.
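For concreteness, a minimal sketch of the chunked drain loop suggested here, capping each decode_blocks call at a few MB instead of one block. The 2 MiB cap, the module path, and the collect()/is_finished() draining calls are my assumptions about the FrameDecoder API, not the PR's final code.

```rust
use ruzstd::frame_decoder::{BlockDecodingStrategy, FrameDecoder, FrameDecoderError};

// Decode at most this many bytes per decode_blocks call before copying the
// results out; the exact value ("a few MB") is taken from the discussion above.
const BATCH_SIZE: usize = 2 * 1024 * 1024;

/// Drain a single frame from `input` into `output` in BATCH_SIZE chunks.
/// Assumes `decoder` has already parsed the frame header for this frame.
fn drain_frame(
    decoder: &mut FrameDecoder,
    mut input: &[u8],
    output: &mut Vec<u8>,
) -> Result<(), FrameDecoderError> {
    loop {
        decoder.decode_blocks(&mut input, BlockDecodingStrategy::UptoBytes(BATCH_SIZE))?;
        // Copy whatever the decoder has buffered so far into the caller's vector.
        // (collect()/is_finished() reflect my reading of the FrameDecoder API and
        // may not match it exactly.)
        if let Some(bytes) = decoder.collect() {
            output.extend_from_slice(&bytes);
        }
        if decoder.is_finished() {
            break;
        }
    }
    Ok(())
}
```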
> That would be ideal in terms of memory usage but the algorithm always needs access to the last window-size bytes where the window size is defined by the frame header anyway, so some dynamic buffering will have to be done by the decoder.

I know you need that for streaming, but in this case it should be able to use output for the window buffer.
I guess that's true, but I think it's pretty uncommon to know how large the decompressed frame is going to be. Maybe the sequence execution (currently implemented on a ring buffer) could be made pluggable; that would remove the need for all the memcpy calls used to collect bytes from the decoder, which would be nice. Not sure it's worth the effort/code complexity though, tbh.
Do you have any more changes you want to get into this PR? It currently looks good to me.

I think this can be merged. I do agree that the current …
I did some benchmarking with a single 64 MB file. I manually tried a range of strategies, both …

- Changing from …
- Changing from …
- Changing from …
That's very unexpected. I'll have to have a look at that; my expectations were pretty much completely contrary to these results. Seems like there is something awry in either my code or my understanding of what affects the performance of the code.
I added a benchmark that uses decode_all, not the vec version, because its performance does not depend on the capacity of the target. This benchmark does benefit from a bigger buffer size of 1 MB compared to UpToBytes(1). Can you confirm that?
Yes, I see that benefit for the benchmark you added. However, I don't think this is a realistic benchmark. The test data is small, so the criterion warmup will put all of the source, the decode buffer, and the target in the cache. That isn't going to happen in practice, because in practice you don't decompress the same thing repeatedly in a loop. If I modify the benchmark to move the …
Hmm, but that's not a good test either, because of course allocating a smaller buffer will be faster. |
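To illustrate the kind of benchmark variation being discussed (the fixture path and the decode_all/decode_all_to_vec signatures are assumptions, not the PR's actual benchmark): allocating the target inside iter() pays for the allocation every round, while reusing a pre-existing target keeps it warm in the cache across rounds.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn decode_benchmarks(c: &mut Criterion) {
    let input = std::fs::read("benches/data/big_64mb.zst").unwrap(); // assumed path

    // Variant 1: fresh target every iteration (measures the allocation too).
    c.bench_function("decode_all fresh target", |b| {
        b.iter(|| black_box(ruzstd::decode_all_to_vec(black_box(&input)).unwrap()))
    });

    // Variant 2: reuse one target buffer across iterations (stays cached).
    c.bench_function("decode_all reused target", |b| {
        let mut output = Vec::new();
        b.iter(|| {
            output.clear();
            ruzstd::decode_all(black_box(&input), &mut output).unwrap();
            black_box(output.len())
        })
    });
}

criterion_group!(benches, decode_benchmarks);
criterion_main!(benches);
```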
I modified your benchmark to use the 64 MB compressed file from my previous testing, and with it I am seeing a performance improvement by changing to …