-
That depends on whether your tensor's strides support that "recast" or not, but you can also just issue 2 TMA instructions in that case. It should not matter for perf if your box is already hitting the 256-element limit for fp8.
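To make the two options concrete, here is a rough sketch in plain C++ (not CuTe code). It assumes the per-dimension box limit is 256 elements, as discussed in this thread. The names `kMaxBoxExtent`, `BoxSlice`, `split_into_boxes`, and `can_recast_8bit_to_16bit` are made up for illustration and are not part of CUTLASS or CUDA. The recast check only looks at extents and strides; other TMA requirements (such as alignment) are not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: per-dimension TMA box limit, counted in elements.
constexpr std::uint32_t kMaxBoxExtent = 256;

// One slice of the copy: where it starts and how many elements it covers.
struct BoxSlice {
  std::uint32_t offset;
  std::uint32_t extent;
};

// Hypothetical helper: split `extent` elements along one axis into slices
// that each fit within the box limit. Each slice would be one TMA copy.
std::vector<BoxSlice> split_into_boxes(std::uint32_t extent) {
  assert(extent > 0);
  std::vector<BoxSlice> slices;
  std::uint32_t offset = 0;
  std::uint32_t remaining = extent;
  while (remaining > 0) {
    const std::uint32_t n = std::min(kMaxBoxExtent, remaining);
    slices.push_back({offset, n});
    offset += n;
    remaining -= n;
  }
  return slices;
}

// Hypothetical check: can a tensor of 1-byte elements (fp8/int8) be viewed
// as 2-byte elements (e.g. float16)? The contiguous extent must be even and
// every non-contiguous stride (in 1-byte elements) must be even, so that
// each stride is a whole number of 2-byte elements.
bool can_recast_8bit_to_16bit(std::uint32_t contiguous_extent,
                              const std::vector<std::int64_t>& outer_strides) {
  if (contiguous_extent % 2 != 0) return false;
  for (const std::int64_t s : outer_strides) {
    if (s % 2 != 0) return false;
  }
  return true;
}
```

For 512 fp8 elements, `split_into_boxes(512)` yields two slices of 256 elements each, which corresponds to issuing 2 TMA instructions. The recast route only applies when `can_recast_8bit_to_16bit` holds for your layout, in which case 512 one-byte elements become 256 two-byte elements.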
-
In the sm90 copy atom, I notice that:
So is there a limit that each copy cannot be larger than 256 elements for each data type? If I want to copy 512 fp8/int8 elements along axis 1, can I set the copy dtype to float16 and the copy shape to 256?