The CMOV instruction used for conditional copy is likely optimal for 4 to 6 limbs.
From Agner Fog's instruction tables:
https://www.agner.org/optimize/instruction_tables.pdf
The reciprocal throughput is 0.5 cycles, so 2 independent CMOVs can be issued per cycle and a 4-6 limb Fp element takes 2-3 cycles to copy.
However, when a table is precomputed for scalar multiplication/signing with 8 EC elements, each made of 3 Fp coordinates of 4-6 limbs, SSE or AVX lets us load 2x4 or 2x8 limbs per cycle (2 vector loads per cycle, bottlenecked by memory speed).
This would reduce the overhead of table access. Note that LSB set recoding (#73) uses tables with 64 to 256 EC elements (192+ Fp elements, hence thousands of limbs).
i.e. to vectorize:
constantine/constantine/elliptic/ec_endomorphism_accel.nim
Lines 200 to 206 in 00ff599
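To make the idea concrete, here is a rough sketch in C of a table lookup that reads every entry and keeps only the selected one, first limb by limb and then with 128-bit SSE2 registers. This is not the Constantine code referenced above: the names `lookup_scalar`, `lookup_vector`, `LIMBS`, `EC_LIMBS` and `TABLE_SIZE` are hypothetical, 4 limbs of 64 bits per Fp coordinate is an assumption, and the table is assumed to be a flat array of limbs. Whether the compiled code is actually branch-free has to be checked in the generated assembly.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumed layout: 8 EC elements, 3 Fp coordinates each, 4 x 64-bit limbs per Fp. */
enum { LIMBS = 4, COORDS = 3, EC_LIMBS = LIMBS * COORDS, TABLE_SIZE = 8 };

/* All ones when i == index, zero otherwise, computed without a branch. */
static uint64_t eq_mask(uint64_t i, uint64_t index)
{
    uint64_t diff = i ^ index;
    return ((diff | (0 - diff)) >> 63) - 1;
}

/* Scalar baseline: one masked select per limb. Every entry is read. */
static void lookup_scalar(uint64_t out[EC_LIMBS], const uint64_t *table,
                          uint64_t index)
{
    memset(out, 0, EC_LIMBS * sizeof(uint64_t));
    for (uint64_t i = 0; i < TABLE_SIZE; i++) {
        uint64_t mask = eq_mask(i, index);
        for (size_t j = 0; j < EC_LIMBS; j++)
            out[j] |= table[i * EC_LIMBS + j] & mask;
    }
}

/* Vector variant: the same masked select, 2 limbs per 128-bit register.
   Falls back to the scalar version when SSE2 is not available. */
static void lookup_vector(uint64_t out[EC_LIMBS], const uint64_t *table,
                          uint64_t index)
{
#if defined(__SSE2__)
    __m128i acc[EC_LIMBS / 2];
    for (size_t j = 0; j < EC_LIMBS / 2; j++)
        acc[j] = _mm_setzero_si128();
    for (uint64_t i = 0; i < TABLE_SIZE; i++) {
        /* Broadcast the mask to both 64-bit lanes. */
        __m128i vmask = _mm_set1_epi64x((long long)eq_mask(i, index));
        for (size_t j = 0; j < EC_LIMBS / 2; j++) {
            __m128i v = _mm_loadu_si128(
                (const __m128i *)&table[i * EC_LIMBS + 2 * j]);
            acc[j] = _mm_or_si128(acc[j], _mm_and_si128(v, vmask));
        }
    }
    for (size_t j = 0; j < EC_LIMBS / 2; j++)
        _mm_storeu_si128((__m128i *)&out[2 * j], acc[j]);
#else
    lookup_scalar(out, table, index);
#endif
}
```

An AVX2 variant would follow the same pattern with 256-bit registers holding 4 limbs each. Whether it beats CMOV in practice would need benchmarking on the target CPUs.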