[CPU][Kernel] Single socket spmm #3024
Conversation
To trigger regression tests:
Could you also provide some performance results to help people understand the benefit of this PR?
I see the code constructs a block sparse matrix in every sparse matrix multiplication. Do you know how much overhead the block sparse matrix construction adds? Can we get further speedup if we cache the block sparse matrix?
Sure!
Block sparse matrix construction is only invoked for denser matrices with decently large dimensions. For example, for Reddit with dim = 602, it is invoked and consumes 10-12% of the time. But given that DGL now only performs SpMM on the Reddit hidden layer with dim = 16, blocking is not used there. So, for denser matrices with large enough dimensions in full-batch training, caching the blocks would help by about 10%. Blocking is also not used for minibatch training because those matrices are sparser, and we cannot cache them anyway.
I'm just curious. It's better not to cache it for now, which keeps the implementation more compatible with DGL's current API.
Hi @zheng-da, I have moved the optimized code to a new header file. I named it "src/array/cpu/spmm_blocking_libxsmm.h" because I thought "spmm_blocking.h" did not capture all of its improvements.
First batch of suggestions on coding style to help pass the lint check.
Hi @sanchit-misra, thank you very much for the contribution. The overall change makes sense, but I may need more time to understand the details. I saw the lint check has failed, so I put down some suggestions on coding style to help you pass it. Generally, we follow the Google Coding Style; our line limit is 100 characters. BTW, I saw you added a new submodule, libxsmm. Is it the snapshot of the latest stable version? I saw the latest release is 1.16.1. If everything you need is available in it, I would suggest checking out that version.
Hi @jermainewang, thanks a lot for your comments. I have made some of the coding style changes you suggested and will try to make all the changes required to pass the lint check today. The libxsmm release 1.16.1 is very old and does not have any of the features I am using, so I am using a recent commit.
Finished the Regression test. Result table is at https://dgl-asv-data.s3-us-west-2.amazonaws.com/2b58e634ad578b0b48a4c8144c2b01c76038d1d1_c59xlarge/results/result.csv. Jenkins job link is https://ci.dgl.ai/job/dgl/job/PR-3024/17/display/redirect.
Hi @jermainewang, the regression tests seem to show that while our PR is faster by up to 6.7x in many cases, in some cases it is also slower than the master branch.
The Reddit dataset in the regression test has some problems; this is not related to this pull request.
The regression test is at https://github.com/dmlc/dgl/blob/master/benchmarks/benchmarks/kernel/bench_gspmm_copy_u.py, and you can call the function directly by commenting out those decorators. The utils module is at https://github.com/dmlc/dgl/blob/master/benchmarks/benchmarks/utils.py. You can append the path to the
Thanks @VoVAllen. It seems from the output of the regression tests that the tests in bench_gspmm_copy_u.py are fine.
The result of ogbn-proteins in bench_gspmm_copy_u seems much worse than before. ogbn-proteins is a very dense graph, which makes it a bit different from the other datasets. Dataset stats are available at https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins
Did you run the regression tests with the flag / env variable DGL_CPU_INTEL_KERNEL_ENABLED=1? If not, is it possible to rerun the tests with this flag and compare the results? It would give a good historical overview of the kernel optimizations.
I am a little confused. The csv file reports GFLOPS numbers for this PR and the master branch. I am guessing higher is better, right? So, for gspmm_copy_u with the proteins dataset, the GFLOPS for this PR are 3.3-6.7x higher than master, which means this PR is much better than master. Or am I reading it wrong?
The regression tests were run by the dgl bot.
@jermainewang @VoVAllen Could you check whether the dgl bot is running the benchmarks with the flag / env variable DGL_CPU_INTEL_KERNEL_ENABLED=1? It would be great to have a comparison against the xbyak optimization, which would give useful insights for future kernel optimizations.
@sanchit-misra My bad. You are right.
@ksadowski13 Sure. I'll manually test that and get the results back later.
message(STATUS "Build with AVX optimization.")
if(USE_LIBXSMM)
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DUSE_AVX -DUSE_LIBXSMM -DDGL_CPU_LLC_SIZE=40000000")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_AVX -DUSE_LIBXSMM -DDGL_CPU_LLC_SIZE=40000000")
How should users set a proper DGL_CPU_LLC_SIZE value?
This is the size of the LLC of the CPU the code runs on. The macro here is just a failsafe: the code automatically gets the LLC size using sysconf() in the function getLLCSize(), and only if that fails does it use this number. The default I have used here is quite safe for most server-class CPUs. Only if sysconf fails and this number is also bigger than the user's LLC will they have to set the LLC size here.
Only if sysconf fails and this number is also bigger than the user's LLC, then they will have to set the LLC size here.
In this case, will the program crash or just run with a sub-optimal config?
It won't crash. It will just run with a sub-optimal config.
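For reference, a minimal sketch of the detection-plus-fallback behavior described in this thread (assuming a glibc-style sysconf(_SC_LEVEL3_CACHE_SIZE); the actual getLLCSize() in the PR may differ in detail):

#include <unistd.h>
#include <cstdint>

#ifndef DGL_CPU_LLC_SIZE
#define DGL_CPU_LLC_SIZE 40000000  // compile-time failsafe, set from CMakeLists.txt
#endif

// Query the last-level cache size from the OS; fall back to the macro.
int64_t getLLCSize() {
  int64_t sz = sysconf(_SC_LEVEL3_CACHE_SIZE);  // GNU extension; may return -1 or 0
  if (sz <= 0) {
    // sysconf failed: run with the compile-time default. If this is larger than
    // the real LLC, the kernel still works, just with sub-optimal blocking.
    sz = DGL_CPU_LLC_SIZE;
  }
  return sz;
}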
{
  for (IdType k = 0; k < num_K_blocks; k++) {
#pragma omp for schedule(dynamic)
    for (IdType m = 0; m < num_M_blocks; m++) {
I wonder why not
#pragma omp parallel for
for (IdType m = 0; m < num_M_blocks; m++) {
  for (IdType k = 0; k < num_K_blocks; k++) {
Please see this picture of the tiling for the copy-lhs + reduce kind of SpMM. Here, A is the adjacency matrix, B is the input feature matrix containing the features of all the neighbors, and C is the output feature matrix, so this is roughly equivalent to computing A x B = C. If we use the k-loop as the outer loop, all threads work on the same block of B, and that block stays in cache. On the other hand, if we use the m-loop as the outer loop, different threads might work on different blocks of B, which makes it harder to keep blocks of B in cache. While that may work for some workloads, keeping the k-loop as the outer loop provides better guarantees.
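To make the loop ordering concrete, here is a simplified sketch of the structure discussed above (not the PR's actual kernel, which additionally extracts CSR sub-blocks and calls LIBXSMM micro-kernels; the per-block worker here is a stand-in):

#include <omp.h>
#include <cstdint>
#include <functional>

// Sketch of the blocked SpMM loop order: the k-loop (blocks of B) is outermost,
// and the m-loop (row blocks of C) is split across threads.
void BlockedSpMMSketch(int64_t num_M_blocks, int64_t num_K_blocks,
                       const std::function<void(int64_t, int64_t)>& process_block) {
#pragma omp parallel
  {
    for (int64_t k = 0; k < num_K_blocks; k++) {
      // All threads work on the same block of B at the same time, so that
      // block stays resident in cache while its contribution is accumulated.
#pragma omp for schedule(dynamic)
      for (int64_t m = 0; m < num_M_blocks; m++) {
        // Dynamic scheduling balances the irregular nonzero counts per row block.
        process_block(m, k);  // accumulate A[m][k] * B[k] into C[m]
      }
      // Implicit barrier at the end of the omp for: threads advance to the
      // next block of B together.
    }
  }
}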
@sanchit-misra I've uploaded my comments. The overall code logic is quite clean (thanks for the comments! They indeed help a lot!). I feel it is very close to being merged unless there are other performance regressions. Thanks for the great effort!
Thanks. Will work on them.
Great! I wanted to mention the following points regarding results.csv at the following link, which shows worse performance for this PR in some cases.
Please find the detailed results attached here:
Hi @jermainewang, I believe I have made all the changes you suggested and answered all your questions. Please let me know if you need anything else.
intel_kernel.xlsx |
Thanks @VoVAllen. As can be seen in the results, this PR gets speedups of up to 6.7x compared to the xbyak run with DGL_CPU_INTEL_KERNEL_ENABLED=1. There are a few cases in your Excel sheet for which this PR gets lower performance. I went through all of those cases and verified that in every one of them the naive SpMM code runs whether we use this PR or the master branch, because neither this PR nor xbyak provides an optimized implementation for use_bcast=1. So the same naive code performs differently between the two runs (probably because of run-to-run variation), and the slowdown has nothing to do with this PR.
@VoVAllen Thanks for the results. Do the kernel benchmarks cover only the full-graph scenario?
LGTM. I think the PR is ready to go.
Great to know that this has been merged. Thanks everyone for all the help!
Description
This PR contains the single-socket optimizations of SpMM for Xeon CPUs, as described in our DistGNN paper: https://arxiv.org/abs/2104.06700
Changes
Provided Xeon-optimized implementations of SpMMSumCsr() and SpMMCmpCsr(). We have observed up to 4.4x speedup on the SpMM kernel with no change in accuracy.
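For readers unfamiliar with the kernel, the following is a naive reference of what the copy-u + sum form of SpMM computes on a CSR graph (a sketch for illustration only; the optimized SpMMSumCsr() in this PR has a different, templated signature and uses the blocked/LIBXSMM code path discussed above):

#include <cstdint>
#include <vector>

// Naive reference: for every destination node i, sum the features of its source
// neighbors u into out[i]. The optimized kernel computes the same result using
// cache blocking and LIBXSMM micro-kernels.
void NaiveSpMMSumCsr(const std::vector<int64_t>& indptr,   // CSR row offsets (destination nodes)
                     const std::vector<int64_t>& indices,  // CSR column indices (source nodes)
                     const std::vector<float>& ufeat,      // source features, shape [num_src, dim]
                     std::vector<float>* out,              // output features, shape [num_dst, dim], zero-initialized
                     int64_t dim) {
  const int64_t num_dst = static_cast<int64_t>(indptr.size()) - 1;
  for (int64_t i = 0; i < num_dst; ++i) {
    for (int64_t e = indptr[i]; e < indptr[i + 1]; ++e) {
      const int64_t u = indices[e];
      for (int64_t k = 0; k < dim; ++k) {
        (*out)[i * dim + k] += ufeat[u * dim + k];
      }
    }
  }
}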