-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MP3: move bounds checks out of the hot loop #101
Conversation
I actually used this technique in the new I use this script to benchmark changes against either
What kind of numbers are you seeing? |
My machine is usually too noisy to pick up the difference of a few % (which is why I've been relying on instruction counts), but I'll give it a shot. I know there is some way to measure end-to-end instruction counts too, I just don't know what it is. IIRC rustc benchmarking uses it. |
I had to crank the test count all the way up to |
I've applied the same technique to The instruction count for |
So this is fun. These changes result in incorrect decoding (try running
needs to be
I believe in the first case |
Yep, that seems to have been it! With the fix applied My apologies for causing breakage - I did not expect such a simple change to cause so much trouble. |
No problem. It was a very non-obvious issue. I'm measuring about a 1-3% change on my system (Linux 5.16, Intel Core i7 4790k), though it varies. It's something, but this definitely isn't the hottest part of the decoder so major gains are hard to come by. Sorry for forgetting to mention it earlier, but there's a clippy warning too. I'll merge the PR after that's cleaned up and I test a bit more. Thanks! |
Yeah, eliminating bounds checks usually results only in single-digit gains even in hot codepaths. At least x86_64 CPUs are very good at speculating past them. |
Clippy lint fixed! |
Thanks, merged! |
Move bounds checks out of the hot loop to be performed only once outside it. This should provide a modest performance boost, especially on CPUs with less powerful speculative execution.
Reduces the output of
cargo asm
for this function from 488 to 412 lines, when annotated with#[inline(never)]
to make it easy to inspect.Instruction counts before
125 movss 92 mov 85 lea 25 cmp 18 call 16 mulss 14 xor 14 jmp 13 addss 11 jne 6 ud2 6 push 6 pop 6 jae 5 jb 5 add 3 je 1 subss 1 sub 1 retInstruction counts after
125 movss 74 lea 68 mov 16 mulss 14 xor 14 jmp 14 cmp 13 addss 12 call 11 jne 6 push 6 pop 5 add 3 je 1 subss 1 sub 1 retThis doesn't introduce any new panics - the code would have panicked anyway, now the check is simply performed once per function call outside the hot loop.
There might be a more elegant iterator-based way than
.try_into().unwrap()
; I haven't really looked into it.Many other functions in this file can be given a similar treatment; this should provide a total boost of a few % in terms of performance for end-to-end decoding. This PR is meant to demonstrate the approach. I could proceed with converting the rest to this idiom, if you wish.