Support parallel split K mode for profiling #277
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@manishucsd and @kerrmudgeon, would you please help @Peter9606 here? Maybe point him to the conv parallel split-K code?
Hi @Peter9606 , I have reviewed your code. It looks like you have followed the changes in convolutions to support parallel reductions. A few points to consider:
Thanks!
I finally get it, thank you very much!
1. find gemm kernel by preference key
2. switch m n for reduction kernel
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
@manishucsd Now parallel split K reduction profiling works for SIMT fp32, but only through a newly added reduction kernel with smaller alignment. It is still not clear why the reduction kernel with larger alignment cannot be selected.
It seems that this version can run successfully for fp16 with align1/2/4.
This PR has been labeled
Now it can support
It still requires 128-bit alignment in the reduction. In this PR, I removed the small-alignment reduction code, which requires some extra logic to find the correct reduction configuration. To support it, we need to put the alignment into the FunctionalKey or PreferenceKey of the reduction operation and use problem_size.m to decide the correct reduction kernel to use. We welcome the community to extend this PR to support small-alignment reduction.
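To illustrate the alignment constraint mentioned above, here is a minimal sketch of how the widest usable vector access could be derived from an extent such as problem_size.m. The helper name `max_alignment_elements` is hypothetical and is not part of CUTLASS; it also assumes that the extent passed in is the contiguous dimension of the output, which depends on the layout.

```cpp
#include <cassert>

// Hypothetical helper (not a CUTLASS API): starting from a 128-bit access,
// halve the access width until the number of elements per access evenly
// divides `extent`. Returns the number of elements per access.
inline int max_alignment_elements(int extent, int element_bits) {
  assert(extent > 0 && element_bits > 0 && 128 % element_bits == 0);
  int elements = 128 / element_bits;  // elements per 128-bit access
  while (elements > 1 && extent % elements != 0) {
    elements /= 2;  // fall back to 64-bit, 32-bit, ... accesses
  }
  return elements;
}
```

Under these assumptions, an fp32 extent of 242 only allows two elements per access (64 bits), so a reduction kernel that insists on 128-bit alignment would not be usable for it, while an extent of 256 would be.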
I'm trying to add support for parallel split K profiling, and this patch is what I modified.
Unfortunately, it only works for a very small portion of problem sizes, namely those where m equals n. Also, to make it work, I had to hard-code the number of elements computed per operation during the epilogue to 1, which is obviously not correct. I hope someone can correct it.
A command line sample, if anyone wants to try it:
./cutlass_profiler --split_k_slices=2 --m=242 --n=242 --k=300 --split_k_mode=parallel
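For readers unfamiliar with the mode being profiled, the following is a CPU-only reference sketch of what parallel split K computes: each K slice writes its own partial product into a workspace, and a separate reduction pass sums the partials. The function name `splitk_parallel_gemm`, the row-major layouts, and the even ceil-division of K are assumptions made for illustration; the actual CUTLASS kernels run the slices concurrently on the GPU and may partition K differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative reference only (not CUTLASS code).
// A is m x k, B is k x n, both row-major; returns row-major D = A * B (m x n).
std::vector<float> splitk_parallel_gemm(const std::vector<float>& A,
                                        const std::vector<float>& B,
                                        int m, int n, int k, int slices) {
  assert(m > 0 && n > 0 && k > 0 && slices > 0);
  assert(A.size() == static_cast<std::size_t>(m) * k);
  assert(B.size() == static_cast<std::size_t>(k) * n);

  const std::size_t mn = static_cast<std::size_t>(m) * n;
  const int k_per_slice = (k + slices - 1) / slices;  // ceil division

  // Stage 1: each slice accumulates its partial product into its own
  // m x n region of the workspace (slices are independent of each other).
  std::vector<float> workspace(mn * slices, 0.0f);
  for (int s = 0; s < slices; ++s) {
    const int k_begin = s * k_per_slice;
    const int k_end = std::min(k, k_begin + k_per_slice);
    float* partial = workspace.data() + mn * s;
    for (int i = 0; i < m; ++i)
      for (int j = 0; j < n; ++j)
        for (int kk = k_begin; kk < k_end; ++kk)
          partial[i * n + j] += A[i * k + kk] * B[kk * n + j];
  }

  // Stage 2: the reduction sums the partial results across slices.
  std::vector<float> D(mn, 0.0f);
  for (int s = 0; s < slices; ++s)
    for (std::size_t idx = 0; idx < mn; ++idx)
      D[idx] += workspace[mn * s + idx];
  return D;
}
```

The second stage corresponds to the separate reduction kernel discussed in this thread, which is why its alignment requirements matter independently of the GEMM kernel's.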