Performance gemv vs gemm #83

Closed
JanAbbing opened this issue Jul 19, 2016 · 5 comments

@JanAbbing

JanAbbing commented Jul 19, 2016

Hello,

I encountered a weird runtime difference between the gemv and the gemm routine.
When I run both with the input M=4096, N=1, K=4096 on my GTX480, the gemm routine takes 3.04 ms and the gemv routine takes 5.51 ms. I would have expected gemv to be faster than gemm, because it is made for exactly this kind of input. Could it be that gemv isn't yet optimized for the GTX480, or is it normal that it is slower? With cuBLAS it is the other way around: Sgemv is almost 2 times faster than Sgemm.

I call the gemv routine like this:
./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up true -runs 100

Greetings,
Jan
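For readers comparing the two routines: a GEMM call with N=1 computes the same result as a GEMV call, which is why the timings above are directly comparable. The following is a minimal pure-Python sketch of that equivalence, not CLBlast code; the helper names `gemm` and `gemv` and the small sizes are illustrative assumptions standing in for the M=4096, N=1, K=4096 case.

```python
def gemm(m, n, k, a, b):
    """Reference C = A * B on row-major flat lists; A is m x k, B is k x n."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = acc
    return c


def gemv(m, n, a, x):
    """Reference y = A * x on a row-major flat list; A is m x n."""
    return [sum(a[i * n + j] * x[j] for j in range(n)) for i in range(m)]


# Small stand-in for the sizes in the report.
M, K = 4, 3
A = [float(i + 1) for i in range(M * K)]
X = [1.0, 0.5, -2.0]

y_from_gemm = gemm(M, 1, K, A, X)   # B is the K x 1 matrix holding x
y_from_gemv = gemv(M, K, A, X)      # gemv's (m, n) correspond to gemm's (M, K)
print(y_from_gemm == y_from_gemv)
```

Note that the `-m 4096 -n 4096` arguments of the gemv client correspond to M and K of the gemm problem, with N fixed to 1.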

@CNugteren
Owner

You are right, the GEMV kernel is not particularly fast if the matrix is rotated. It's been a while since I looked at it, and I had completely forgotten about it. You can see similar results in the doc/performance folder of CLBlast, which also includes a GTX480 graph. You can generate such a graph on your own system with the included tools (see README).

Also on my system with the latest version of CLBlast I see this behaviour:

./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 101
                                                                                                                         | <--       CLBlast       --> | <--       clBLAS        --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta;     ms_1; GFLOPS_1;    GBs_1;     ms_2; GFLOPS_2;    GBs_2;  
       4K;       4K;      101;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;    20.28;      1.7;      3.3;     5.87;      5.7;     11.4;  

./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 102
                                                                                                                         | <--       CLBlast       --> | <--       clBLAS        --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta;     ms_1; GFLOPS_1;    GBs_1;     ms_2; GFLOPS_2;    GBs_2;  
       4K;       4K;      102;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;     1.70;     19.7;     39.4;     2.77;     12.1;     24.3;  
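The GFLOPS and GB/s columns can be roughly reproduced from the measured time. The sketch below assumes single precision (4 bytes per element), 2·m·n floating-point operations, and memory traffic of one read of A and x plus one write of y. These formulas are assumptions that happen to match the printed numbers closely; they are not taken from the client's source, and `gemv_metrics` is a hypothetical helper name.

```python
def gemv_metrics(m, n, ms, bytes_per_element=4):
    """Return (GFLOPS, GB/s) for one GEMV call that took `ms` milliseconds."""
    flops = 2 * m * n                                    # one multiply and one add per element of A
    bytes_moved = (m * n + n + m) * bytes_per_element    # assumed: read A and x, write y
    seconds = ms * 1e-3
    return flops / seconds / 1e9, bytes_moved / seconds / 1e9


# Timings taken from the two CLBlast runs above.
for label, ms in [("layout 101", 20.28), ("layout 102", 1.70)]:
    gflops, gbs = gemv_metrics(4096, 4096, ms)
    print(f"{label}: {gflops:.1f} GFLOPS, {gbs:.1f} GB/s")
```

Because GEMV does only two operations per matrix element loaded, it is bound by memory bandwidth, so the GB/s column is the more telling one.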

For now, you can get decent performance again if you rotate the matrix (either use column-major layout or set the transpose option). I'll take a more in-depth look at the kernel soon and try to improve it for rotated matrices. I'll keep you up-to-date.
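Why the layout matters can be sketched by looking at which addresses neighbouring outputs read. The example below assumes one work-item per output element, with neighbouring work-items handling neighbouring outputs. That is a common GPU mapping, but it is an assumption here and not a confirmed detail of the CLBlast kernel. The layout codes 101 (row-major) and 102 (column-major) are the ones shown in the client output; `offset` and `neighbour_stride` are hypothetical helper names.

```python
ROW_MAJOR, COL_MAJOR = 101, 102   # layout codes shown in the client output


def offset(layout, row, col, ld):
    """Position of A[row][col] in a flat buffer with leading dimension ld."""
    return row * ld + col if layout == ROW_MAJOR else col * ld + row


def neighbour_stride(layout, transposed, ld):
    """Distance in elements between what output i and output i+1 read at the same step.

    Without transpose, output i walks along row i of A; with transpose it
    walks along column i.
    """
    if transposed:
        return offset(layout, 0, 1, ld) - offset(layout, 0, 0, ld)
    return offset(layout, 1, 0, ld) - offset(layout, 0, 0, ld)


lda = 4096
for layout in (ROW_MAJOR, COL_MAJOR):
    for transposed in (False, True):
        print(layout, transposed, neighbour_stride(layout, transposed, lda))
```

A stride of 1 means neighbouring work-items read neighbouring addresses, which a GPU can combine into few memory transactions. A stride of `lda` means every read lands in a different memory segment, which fits the slow row-major, non-transposed case above.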

@CNugteren
Owner

CNugteren commented Jul 23, 2016

I've designed a new kernel for the rotated case. It has much better data locality since it now loads a tile of matrix A into local memory. This also enables coalescing. On my device this already improves performance to the clBLAS level (old and new experiments are shown one below the other for comparison):

./clblast_client_xgemv -m 4096 -n 4096 -alpha 1 -beta 0 -warm_up -layout 101
                                                                                                                         | <--       CLBlast       --> | <--       clBLAS        --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta;     ms_1; GFLOPS_1;    GBs_1;     ms_2; GFLOPS_2;    GBs_2;  
       4K;       4K;      101;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;    20.28;      1.7;      3.3;     5.87;      5.7;     11.4;    <--- old
       4K;       4K;      101;      111;       4K;        1;        1;        0;        0;        0; 1.000000; 0.000000;     4.77;      7.0;     14.1;     5.79;      5.8;     11.6;    <--- new

The new kernel can already be found in the gemv_performance branch. However, I'll need to make some changes to the tuning database (preferably re-tune for all devices) since the kernel has changed significantly. I will also try to run it on a GTX 480 or similar to verify that performance has improved.
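The tiling idea described in this comment can be illustrated with a sequential simulation. This is only a sketch of the general technique (stage a tile of A, then let each work-item reduce its own row from the staged copy); it is not the kernel from the gemv_performance branch, and the name `gemv_tiled` and the parameter `wgs` (work-group size) are hypothetical.

```python
def gemv_tiled(m, n, a, x, wgs=4):
    """y = A * x for row-major A (m x n), processed in wgs x wgs tiles.

    Each outer iteration models one work-group producing up to `wgs` outputs.
    """
    y = [0.0] * m
    for row0 in range(0, m, wgs):                 # one work-group per block of rows
        rows = min(wgs, m - row0)
        acc = [0.0] * rows
        for col0 in range(0, n, wgs):             # march over the tiles of that block
            cols = min(wgs, n - col0)
            # Cooperative load: the inner index runs along a row of A, so
            # neighbouring "work-items" touch neighbouring addresses.
            tile = [[a[(row0 + r) * n + (col0 + c)] for c in range(cols)]
                    for r in range(rows)]
            # Each work-item then reduces its own row out of the staged tile.
            for r in range(rows):
                for c in range(cols):
                    acc[r] += tile[r][c] * x[col0 + c]
        y[row0:row0 + rows] = acc
    return y


m, n = 6, 10
a = [float((i * 7) % 5) for i in range(m * n)]
x = [float(j % 3) for j in range(n)]
reference = [sum(a[i * n + j] * x[j] for j in range(n)) for i in range(m)]
print(gemv_tiled(m, n, a, x) == reference)
```

On a real GPU the `tile` list would live in local memory, and the load and reduce phases would be separated by a barrier. The point of the staging step is that the load order can differ from the order in which each work-item later consumes the data.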

@blueberry

When this is ready for the main branch, please put up a visible reminder, and I'll retune CLBlast for the devices I have.

@CNugteren
Owner

This is now merged into the development branch. @JanAbbing Could you re-run your experiment and verify whether this issue is fixed? If it is not fast right away with the default settings, could you re-tune it for your GPU and upload the corresponding JSON files here? Thanks!

@JanAbbing
Author

It improved quite a bit, to 2.17 ms.

Thanks for the help!
