MP3: move bounds checks out of the hot loop #101

Shnatsel · 2022-01-28T00:24:18Z

Move bounds checks out of the hot loop to be performed only once outside it. This should provide a modest performance boost, especially on CPUs with less powerful speculative execution.

Reduces the output of cargo asm for this function from 488 to 412 lines, when annotated with #[inline(never)] to make it easy to inspect.

Instruction counts before

125 movss, 92 mov, 85 lea, 25 cmp, 18 call, 16 mulss, 14 xor, 14 jmp, 13 addss, 11 jne, 6 ud2, 6 push, 6 pop, 6 jae, 5 jb, 5 add, 3 je, 1 subss, 1 sub, 1 ret

Instruction counts after

125 movss, 74 lea, 68 mov, 16 mulss, 14 xor, 14 jmp, 14 cmp, 13 addss, 12 call, 11 jne, 6 push, 6 pop, 5 add, 3 je, 1 subss, 1 sub, 1 ret

This doesn't introduce any new panics - the code would have panicked anyway, now the check is simply performed once per function call outside the hot loop.
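A minimal sketch of the idea, not the actual Symphonia code: the function names, the `gain` parameter, and the loop body are hypothetical stand-ins. The first version indexes the slice inside the loop, so each access may carry its own bounds check; the second converts the 18-sample range to a fixed-size array reference up front, so any out-of-range panic happens once, before the loop.

```rust
use std::convert::TryInto;

/// Hypothetical kernel: every `samples[start + i]` may be bounds-checked
/// unless the optimizer can prove the index is in range.
fn scale_checked_per_access(samples: &mut [f32], start: usize, gain: f32) {
    for i in 0..18 {
        samples[start + i] *= gain;
    }
}

/// Same work, but the range check (slicing) and the length check (conversion)
/// both run once, outside the loop. Panics here if the range is out of bounds.
fn scale_checked_once(samples: &mut [f32], start: usize, gain: f32) {
    let sub_band: &mut [f32; 18] = (&mut samples[start..start + 18]).try_into().unwrap();
    for s in sub_band.iter_mut() {
        *s *= gain;
    }
}

fn main() {
    let mut a = vec![1.0f32; 36];
    let mut b = a.clone();
    scale_checked_per_access(&mut a, 18, 2.0);
    scale_checked_once(&mut b, 18, 2.0);
    // Both versions produce the same samples.
    assert_eq!(a, b);
}
```

Both functions panic on the same inputs; only the point at which the check runs differs.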

There might be a more elegant iterator-based way than .try_into().unwrap(); I haven't really looked into it.
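One possible iterator-based shape, offered only as a sketch under the assumption that the sub-bands are consecutive, non-overlapping 18-sample runs; `scale_sub_bands` and `gain` are hypothetical names, and this is not what the PR does. `chunks_exact_mut` hands out slices of exactly the requested length, so no explicit conversion or `unwrap()` is needed.

```rust
/// Hypothetical example: apply a gain to every complete 18-sample sub-band.
fn scale_sub_bands(samples: &mut [f32], gain: f32) {
    // Each `sub_band` is a `&mut [f32]` of length exactly 18 that points into
    // `samples`; a trailing remainder shorter than 18 is skipped.
    for sub_band in samples.chunks_exact_mut(18) {
        for s in sub_band.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    // 40 samples: two full sub-bands plus a 4-sample remainder.
    let mut samples = vec![1.0f32; 40];
    scale_sub_bands(&mut samples, 0.5);
    assert!(samples[..36].iter().all(|&s| s == 0.5));
    assert!(samples[36..].iter().all(|&s| s == 1.0));
}
```

Whether this optimizes as well as the fixed-size array reference would need to be checked in the generated assembly.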

Many other functions in this file can be given a similar treatment; this should provide a total boost of a few % in terms of performance for end-to-end decoding. This PR is meant to demonstrate the approach. I could proceed with converting the rest to this idiom, if you wish.

pdeljanov · 2022-01-28T00:32:26Z

I actually used this technique in the new fft module before the fft32 call. It worked pretty well there so I'd be interested in making similar changes to the MP3 decoder if it offers us any performance uplift.

I use this script to benchmark changes against either ffmpeg or a baseline version.

#!/bin/bash
IN="${1@Q}"
hyperfine -m 20 "ffmpeg -threads 1 -benchmark -v 0 -i ${IN} -f null -" "./target/release/symphonia-play --decode-only ${IN}"
# hyperfine -m 20 "./symphonia-play-baseline --decode-only ${IN}" "./target/release/symphonia-play --decode-only ${IN}"

What kind of numbers are you seeing?

Shnatsel · 2022-01-28T00:59:30Z

My machine is usually too noisy to pick up the difference of a few % (which is why I've been relying on instruction counts), but I'll give it a shot.

I know there is some way to measure end-to-end instruction counts too, I just don't know what it is. IIRC rustc benchmarking uses it.

Shnatsel · 2022-01-28T02:14:36Z

I had to crank the test count all the way up to -m 1000 to get results that are reproducible from test to test, but yes, I am seeing a 1% improvement in runtime from this. No difference in instruction count according to perf stat -e instructions, curiously.

Shnatsel · 2022-01-28T03:28:59Z

I've applied the same technique to imdct36() and the loop inside hybrid_synthesis(), and got the total performance increase to 3% compared to master.

The instruction count for hybrid_synthesis() exploded from 1154 to 1658; I think the removed bounds checks have allowed the compiler to unroll the fixed-size loop in hybrid_synthesis().

pdeljanov · 2022-01-29T00:08:05Z

So this is fun. These changes result in incorrect decoding (try running symphonia-check on a few files). I was confused for a while because the code looked correct, but it turns out this line:

let sub_band: &mut [f32; 18] = &mut samples[start..(start + 18)].try_into().unwrap();

needs to be

let sub_band: &mut [f32; 18] = (&mut samples[start..(start + 18)]).try_into().unwrap();

I believe in the first case try_into is making a [f32; 18] copy of the 18 samples, and then you're getting a mutable reference to that copy. Whereas in the second case we're getting a mutable slice &mut [f32] first, and then try_into yields a &mut [f32; 18] pointing at the original samples.
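The difference can be reproduced in isolation. The sketch below uses a 2-element window and hypothetical function names rather than the decoder's code. Because a method call binds more tightly than the `&mut` operator, the unparenthesized form parses as `&mut (samples[0..2].try_into().unwrap())`, so writes land in a temporary array and are lost.

```rust
use std::convert::TryInto;

/// Buggy form: `try_into` produces an owned `[f32; 2]` copy, and `&mut`
/// borrows that temporary, so the write never reaches `samples`.
fn write_via_copy(samples: &mut [f32]) {
    let sub: &mut [f32; 2] = &mut samples[0..2].try_into().unwrap();
    sub[0] = 1.0;
}

/// Fixed form: the parentheses create the `&mut [f32]` first, and `try_into`
/// reinterprets it as `&mut [f32; 2]` pointing at the same memory.
fn write_in_place(samples: &mut [f32]) {
    let sub: &mut [f32; 2] = (&mut samples[0..2]).try_into().unwrap();
    sub[0] = 1.0;
}

fn main() {
    let mut a = [0.0f32; 4];
    write_via_copy(&mut a);
    assert_eq!(a[0], 0.0); // the original is untouched

    let mut b = [0.0f32; 4];
    write_in_place(&mut b);
    assert_eq!(b[0], 1.0); // the original is modified
}
```

Both forms compile without error, which is why the bug only showed up as incorrect output.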

Shnatsel · 2022-01-29T00:39:51Z

Yep, that seems to have been it! With the fix applied symphonia-check seems to pass. But the performance improvement is back down to 1%.

My apologies for causing breakage - I did not expect such a simple change to cause so much trouble.

pdeljanov · 2022-01-29T03:36:53Z

No problem. It was a very non-obvious issue.

I'm measuring about a 1-3% change on my system (Linux 5.16, Intel Core i7 4790k), though it varies. It's something, but this definitely isn't the hottest part of the decoder so major gains are hard to come by.

Sorry for forgetting to mention it earlier, but there's a clippy warning too. I'll merge the PR after that's cleaned up and I test a bit more.

Thanks!

Shnatsel · 2022-01-29T19:24:31Z

Yeah, eliminating bounds checks usually results only in single-digit gains even in hot codepaths. At least x86_64 CPUs are very good at speculating past them.

Shnatsel · 2022-01-29T19:24:55Z

Clippy lint fixed!

pdeljanov · 2022-01-30T04:06:29Z

Thanks, merged!

Move bounds checks out of the hot loop to be performed only once outside it

34b6dd2

Shnatsel added 3 commits January 28, 2022 03:31

Hopefully slightly less bounds checks

727652b

cleaner imdct36 calls and possibly a bit more bounds checks removed

abdd038

Add a comment

86bec86

Allow eliding bounds checks in the overlap processing after imdct12 as well

2645c85

Fix incorrect decoding

9adfe12

pdeljanov added this to the v0.5 milestone Jan 29, 2022

Apply clippy lint

db3eb53

Rustfmt

6562371

pdeljanov merged commit 3ec7761 into pdeljanov:master Jan 30, 2022

Shnatsel deleted the less-bounds-checks branch February 1, 2022 01:34
