This repository has been archived by the owner on Oct 28, 2023. It is now read-only.

Additional Performance gains #16

Closed

Conversation

austinjones
Contributor

This builds on PR1 to improve performance by another ~15%.

Rust iterators have some great specialization features - and `.zip()` has a neat specialization when joining two iterators that are both `DoubleEndedIterator + ExactSizeIterator` - which `std::slice::Iter` implements! See zip.rs.

@austinjones
Contributor Author

The main reason I made this a second PR is the drawback of splitting better_match into two separate loops. The split actually makes a difference thanks to cache hits, but it means 'black on black' guide areas need many loop iterations to stack up cost. So, I put the color cost first, even when alpha is small.
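A minimal sketch of the two-loop split described above. The function and table names (`better_match_cost`, `color_cost`, `alpha_cost`) are illustrative placeholders, not the crate's actual API; the point is that summing the color cost in its own loop, before the alpha cost, keeps the accumulated cost meaningful even in low-alpha regions:

```rust
// Hypothetical sketch: two precomputed per-value cost tables, summed in
// two separate loops. Splitting the loops improves cache behavior because
// each pass streams over one contiguous slice.
fn better_match_cost(
    color_cost: &[u32; 256], // precomputed cost, indexed by color distance
    alpha_cost: &[u32; 256], // precomputed cost, indexed by alpha difference
    color_dists: &[u8],
    alpha_diffs: &[u8],
) -> u32 {
    // Color cost first: in 'black on black' guide areas where alpha is
    // small, the color term still accumulates cost immediately.
    let color: u32 = color_dists.iter().map(|&d| color_cost[d as usize]).sum();
    let alpha: u32 = alpha_diffs.iter().map(|&d| alpha_cost[d as usize]).sum();
    color + alpha
}
```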

austinjones and others added 3 commits September 10, 2019 14:41
I ran cargo flamegraph, and it turns out a huge portion of the runtime was spent in find_match and find_better_match.  It's a very, very hot loop.

Almost all of the work done in the inner loop (find_better_match) is a function of two u8s... it can be precomputed!

Also, the alpha masks can be rendered into these precomputed cost functions, avoiding the need to do any alpha computations in the loop.
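The precomputation idea can be sketched as follows. This is an assumption-level illustration, not the PR's actual table layout: any function of two u8s has only 256 × 256 = 65,536 possible inputs, so it can be evaluated once up front, and the hot loop shrinks to a single indexed load (with the alpha mask folded into the precomputed entries, as described above):

```rust
// Build a 256x256 lookup table for an arbitrary cost function of two u8s.
// The closure here is a placeholder for whatever per-pixel cost the inner
// loop computes.
fn build_cost_table(cost: impl Fn(u8, u8) -> u32) -> Vec<u32> {
    let mut table = vec![0u32; 256 * 256];
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            // Row-major indexing: high byte is `a`, low byte is `b`.
            table[(a as usize) << 8 | b as usize] = cost(a, b);
        }
    }
    table
}

// In the hot loop, the two-argument function becomes one array read.
fn lookup(table: &[u32], a: u8, b: u8) -> u32 {
    table[(a as usize) << 8 | b as usize]
}
```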

I ran all the examples and couldn't find any visible artifacts.
I found a few small bugs while looking at @zicklag's comments on EmbarkStudios#14

First: there is a numerical precision bug in the calculation of the distance gaussians. The exp() function used to be f64::exp(), but I was using f32::exp().
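To illustrate the gap (this is a standalone example, not the crate's actual gaussian code): f32 carries only about 7 significant decimal digits, so evaluating the same gaussian weight through `f32::exp` instead of `f64::exp` perturbs the result at roughly the 1e-7 level:

```rust
// Relative error between a distance gaussian evaluated in f64 vs f32.
// `x` stands in for a normalized distance; the exact formula is illustrative.
fn gaussian_rel_err(x: f64) -> f64 {
    let fine = (-x * x).exp(); // f64::exp
    let coarse = (-(x as f32) * (x as f32)).exp() as f64; // f32::exp, widened
    ((coarse - fine) / fine).abs()
}
```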

Second: there were missing entries in the precomputed function table. Range bounds are exclusive... but 256 is not a valid u8... so the loop needs 0..=255u8, the inclusive range syntax that was made for exactly this situation.
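The range pitfall in isolation: `0..256u8` is rejected outright because the literal 256 overflows u8, and the tempting workaround `0..255u8` silently drops the final value, which is what produced the missing table entries. Only the inclusive range covers all 256 byte values:

```rust
// Compare how many values each range form actually yields over u8.
fn byte_range_sizes() -> (usize, usize) {
    let exclusive = (0..255u8).count(); // 255 values: misses 255 itself
    let inclusive = (0..=255u8).count(); // all 256 values of a u8
    (exclusive, inclusive)
}
```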
…ecialization for ExactSizeIterator

Rust iterators have some great specialization features - and `.zip()` has fantastic performance when joining two `DoubleEndedIterator + ExactSizeIterator` iterators - which `std::slice::Iter` implements! The trick is that it can combine the bounds checks for the zipped-together items. And the benches look faster than they used to be - so LLVM may be optimizing the bounds checks away.
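A minimal sketch of the pattern (the function here is a generic example, not code from the PR): zipping two slice iterators lets the iterator machinery use both sides' exact lengths, so the per-element bounds checks can be merged or eliminated entirely:

```rust
// Zipping two slice iterators: `std::slice::Iter` is both a
// DoubleEndedIterator and an ExactSizeIterator, so .zip() can fold the
// two bounds checks into one (and LLVM often removes it altogether).
fn dot_u32(a: &[u32], b: &[u32]) -> u32 {
    a.iter().zip(b.iter()).map(|(&x, &y)| x * y).sum()
}
```

If the slices differ in length, `.zip()` simply stops at the shorter one, so no explicit length assertion is needed in the hot loop.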

These changes bring the example01 runtime down to
13.06 real
39.56 user

from:
15.50 real
50.72 sys
@arirawr arirawr requested a review from Jake-Shadle September 11, 2019 12:11
@arirawr arirawr added the enhancement New feature or request label Sep 12, 2019
@Jake-Shadle Jake-Shadle mentioned this pull request Sep 12, 2019
@Jake-Shadle
Member

This PR was merged into #28, thanks again for the changes!

@Jake-Shadle Jake-Shadle removed their request for review September 13, 2019 10:14
@austinjones austinjones deleted the pr.addl-performance branch September 13, 2019 17:27
@austinjones
Contributor Author

No problem @Jake-Shadle. Thanks for merging!
