This repository has been archived by the owner on Oct 28, 2023. It is now read-only.

Additional Performance gains #16

Closed

Conversation

austinjones
Contributor

This builds on PR1 to improve performance by another ~15%.

Rust iterators have some great specialization features - and `.zip()` has a neat specialization when joining two iterators that are both `DoubleEndedIterator + ExactSizeIterator` - which `std::slice::Iter` implements! See zip.rs.

@austinjones
Contributor Author

The main reason I made this a second PR is the drawback of splitting better_match into two separate loops. The split actually makes a difference thanks to cache hits, but it means 'black on black' guide areas need many loop iterations to stack up cost. So, I put the color cost first, even when alpha is small.
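A minimal sketch of the two-loop split described above. The function and table names (`better_match_cost`, `color_cost`, `alpha_cost`) are illustrative placeholders, not the crate's actual API; the point is that summing the color cost in its own loop, before the alpha cost, keeps the accumulated cost meaningful even in low-alpha regions:

```rust
// Hypothetical sketch: two precomputed per-value cost tables, summed in
// two separate loops. Splitting the loops improves cache behavior because
// each pass streams over one contiguous slice.
fn better_match_cost(
    color_cost: &[u32; 256], // precomputed cost, indexed by color distance
    alpha_cost: &[u32; 256], // precomputed cost, indexed by alpha difference
    color_dists: &[u8],
    alpha_diffs: &[u8],
) -> u32 {
    // Color cost first: in 'black on black' guide areas where alpha is
    // small, the color term still accumulates cost immediately.
    let color: u32 = color_dists.iter().map(|&d| color_cost[d as usize]).sum();
    let alpha: u32 = alpha_diffs.iter().map(|&d| alpha_cost[d as usize]).sum();
    color + alpha
}
```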

austinjones and others added 3 commits September 10, 2019 14:41
I ran cargo flamegraph, and it turns out a huge portion of the runtime was spent in find_match and find_better_match.  It's a very, very hot loop.

Almost all of the work done in the inner loop (find_better_match) is a function of two u8s... it can be precomputed!

Also, the alpha masks can be rendered into these precomputed cost functions, avoiding the need to do any alpha computations in the loop.
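The precomputation idea can be sketched as follows. This is an assumption-level illustration, not the PR's actual table layout: any function of two u8s has only 256 × 256 = 65,536 possible inputs, so it can be evaluated once up front, and the hot loop shrinks to a single indexed load (with the alpha mask folded into the precomputed entries, as described above):

```rust
// Build a 256x256 lookup table for an arbitrary cost function of two u8s.
// The closure here is a placeholder for whatever per-pixel cost the inner
// loop computes.
fn build_cost_table(cost: impl Fn(u8, u8) -> u32) -> Vec<u32> {
    let mut table = vec![0u32; 256 * 256];
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            // Row-major indexing: high byte is `a`, low byte is `b`.
            table[(a as usize) << 8 | b as usize] = cost(a, b);
        }
    }
    table
}

// In the hot loop, the two-argument function becomes one array read.
fn lookup(table: &[u32], a: u8, b: u8) -> u32 {
    table[(a as usize) << 8 | b as usize]
}
```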

I ran all the examples and couldn't find any visible artifacts.
I found a few small bugs while looking at @zicklag's comments on EmbarkStudios#14

First: there is a numerical precision bug in the calculation of the distance gaussians. The exp() function used to be f64::exp(), but I was using f32::exp().
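To illustrate the gap (this is a standalone example, not the crate's actual gaussian code): f32 carries only about 7 significant decimal digits, so evaluating the same gaussian weight through `f32::exp` instead of `f64::exp` perturbs the result at roughly the 1e-7 level:

```rust
// Relative error between a distance gaussian evaluated in f64 vs f32.
// `x` stands in for a normalized distance; the exact formula is illustrative.
fn gaussian_rel_err(x: f64) -> f64 {
    let fine = (-x * x).exp(); // f64::exp
    let coarse = (-(x as f32) * (x as f32)).exp() as f64; // f32::exp, widened
    ((coarse - fine) / fine).abs()
}
```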

Second: there were missing entries in the precomputed function table. Range bounds are exclusive... but 256 is not a valid u8... so the loop needs 0..=255u8, the inclusive range syntax that was made for exactly this situation.
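The range pitfall in isolation: `0..256u8` is rejected outright because the literal 256 overflows u8, and the tempting workaround `0..255u8` silently drops the final value, which is what produced the missing table entries. Only the inclusive range covers all 256 byte values:

```rust
// Compare how many values each range form actually yields over u8.
fn byte_range_sizes() -> (usize, usize) {
    let exclusive = (0..255u8).count(); // 255 values: misses 255 itself
    let inclusive = (0..=255u8).count(); // all 256 values of a u8
    (exclusive, inclusive)
}
```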
…ecialization for ExactSizeIterator

Rust iterators have some great specialization features - and `.zip()` has fantastic performance when joining two `DoubleEndedIterator + ExactSizeIterator` iterators - which `std::slice::Iter` implements! The trick is that it can combine the bounds checks for the zipped-together items. And the benches look faster than they used to be - so LLVM may be optimizing the bounds checks away.
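A minimal sketch of the pattern (the function here is a generic example, not code from the PR): zipping two slice iterators lets the iterator machinery use both sides' exact lengths, so the per-element bounds checks can be merged or eliminated entirely:

```rust
// Zipping two slice iterators: `std::slice::Iter` is both a
// DoubleEndedIterator and an ExactSizeIterator, so .zip() can fold the
// two bounds checks into one (and LLVM often removes it altogether).
fn dot_u32(a: &[u32], b: &[u32]) -> u32 {
    a.iter().zip(b.iter()).map(|(&x, &y)| x * y).sum()
}
```

If the slices differ in length, `.zip()` simply stops at the shorter one, so no explicit length assertion is needed in the hot loop.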

These changes bring the example01 runtime down to
13.06 real
39.56 user

from:
15.50 real
50.72 sys
@arirawr arirawr requested a review from Jake-Shadle September 11, 2019 12:11
@arirawr arirawr added the enhancement New feature or request label Sep 12, 2019
@Jake-Shadle Jake-Shadle mentioned this pull request Sep 12, 2019
@Jake-Shadle
Member

This PR was merged into #28, thanks again for the changes!

@Jake-Shadle Jake-Shadle removed their request for review September 13, 2019 10:14
@austinjones austinjones deleted the pr.addl-performance branch September 13, 2019 17:27
@austinjones
Contributor Author

No problem @Jake-Shadle. Thanks for merging!
