Something else that's interesting on the integer equivalent:
using T = int;
constexpr int SIZE = 128;
T A[SIZE][16];
T B[SIZE][16];
T foo()
{
    T sum = (T)0;
    for (int i = 1; i < 32; ++i)
        for (int j = 0; j < 4; ++j)
            sum += A[i][j] * B[i][j];
    return sum;
}
Extended Description
https://godbolt.org/z/jE1e9rT5j
(NOTE: disabled fma on gcc to prevent fmul+fadd->fma diff)
clang -g0 -O3 -march=znver2
The clang code has several issues:
1 - if we'd used a better indvar we could have avoided some very large offsets in the address math (put A and B in registers and use a better range/increment for %rax).
2 - GCC recognises that the array is fully dereferenceable, allowing it to use fewer (vector) loads and then extract/shuffle the elements that it requires.
3 - we fail to ensure the per-loop reduction is in a form where we can use HADDPS (on targets where it's fast).
4 - the LoopMicroOpBufferSize in the znver3 model has a VERY unexpected effect on unrolling - I'm not sure that just copying AMD's hardware specs matches clang's interpretation of the buffer size.
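To make points 1 and 3 concrete, here is a rough source-level sketch, not the code from the godbolt link. It assumes the float kernel has the same shape as the integer example with T = float. The names foo_reference, foo_pairwise and fill_and_check are hypothetical. foo_pairwise walks row pointers instead of indexing from the array base, and sums each row's four products as (p0+p1)+(p2+p3), which is the order two HADDPS steps would produce. Without flags that permit reassociation a compiler cannot reorder float additions on its own, so the two functions may round differently on arbitrary data.

```cpp
#include <cassert>

using T = float;            // assumption: float variant of the integer example
constexpr int SIZE = 128;
T A[SIZE][16];
T B[SIZE][16];

// Same loop nest as the integer example: strictly sequential adds.
T foo_reference()
{
    T sum = (T)0;
    for (int i = 1; i < 32; ++i)
        for (int j = 0; j < 4; ++j)
            sum += A[i][j] * B[i][j];
    return sum;
}

// Hypothetical rewrite: row pointers are the induction variables, and the
// four products of a row are reduced pairwise, (p0+p1)+(p2+p3).
T foo_pairwise()
{
    T sum = (T)0;
    T (*a)[16] = A + 1;     // points at row A[1]
    T (*b)[16] = B + 1;     // points at row B[1]
    for (T (*end)[16] = A + 32; a != end; ++a, ++b) {
        T p0 = (*a)[0] * (*b)[0];
        T p1 = (*a)[1] * (*b)[1];
        T p2 = (*a)[2] * (*b)[2];
        T p3 = (*a)[3] * (*b)[3];
        sum += (p0 + p1) + (p2 + p3);
    }
    return sum;
}

// Usage check with small integer-valued inputs, so every intermediate sum
// is exactly representable and both orderings must agree bit for bit.
void fill_and_check()
{
    for (int i = 0; i < SIZE; ++i)
        for (int j = 0; j < 16; ++j) {
            A[i][j] = (T)(i + j);
            B[i][j] = (T)2;
        }
    assert(foo_reference() == foo_pairwise());
}
```

Whether clang would actually select HADDPS for the second form is not something this sketch shows; it only pins down the reduction order and the induction variables being discussed.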