Make the fast inverse test throughput-limited rather than latency-limited #7958
This test is currently failing on a Cortex-A76 buildbot. Because it is a recursive update definition, each step depends on the result of the previous one, so it ends up limited by instruction latencies rather than throughputs. On an A76 (a reasonable CPU to assume for a generic ARM target), multiplying by a fast inverse has a total latency of frecpe + frecps + fmul = 11 cycles, whereas the Cortex-A76 optimization guide gives an fdiv latency of 7-10 cycles. The cycle costs (sum of inverse throughputs), however, are 3 and 8 respectively, so fast_inverse is still a good idea for most imaging workloads that aren't the goofy recursive thing in the test. So hopefully, if I just change the test to be throughput-limited, it'll fix it.
Still disabled on M1, because fdiv there has a throughput of 1?!