
[CPU][Kernel] Single socket spmm #3024

Merged
23 commits merged into dmlc:master on Jul 13, 2021

Conversation

sanchit-misra
Contributor

Description

This PR contains the single-socket optimizations of SpMM for Xeon described in our DistGNN paper: https://arxiv.org/abs/2104.06700

Checklist

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • To the best of my knowledge, examples get faster or equal performance and accuracy is not affected.

Changes

Provided Xeon-optimized implementations of SpMMSumCsr() and SpMMCmpCsr(). We have observed up to 4.4x speedup on the SpMM kernel with no change in accuracy.

@dgl-bot
Collaborator

dgl-bot commented Jun 15, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@zheng-da
Collaborator

Could you also provide some performance results to help people understand the benefit of this PR?

@zheng-da
Collaborator

I see the code constructs a block sparse matrix in every sparse matrix multiplication. Do you know how much overhead the block sparse matrix construction adds? Can we get further speedup if we cache the block sparse matrix?

@sanchit-misra
Contributor Author

Could you also provide some performance results to help people understand the benefit of this PR?

Sure!

@sanchit-misra
Contributor Author

I see the code constructs a block sparse matrix in every sparse matrix multiplication. Do you know how much overhead the block sparse matrix construction adds? Can we get further speedup if we cache the block sparse matrix?

Block sparse matrix construction is only invoked for denser matrices with reasonably large feature dimensions. For example, for Reddit with dim = 602 it is invoked and consumes 10-12% of the time. But since DGL now only performs SpMM on the Reddit hidden layer with dim = 16, blocking is not used there. So, for denser matrices with large enough dimensions in full-batch training, caching the blocks would help by about 10%.

It is also not used for minibatch training because those matrices are sparser, and we can't cache them anyway.
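To make the trade-off concrete, here is a rough, hypothetical sketch of the kind of check that decides whether building the blocked representation is worthwhile. The function name, threshold, and working-set estimate are illustrative assumptions for this discussion, not the actual DGL heuristic.

// Hypothetical sketch only: decide whether constructing the block sparse
// (blocked CSR) representation is worth its overhead before running SpMM.
#include <cstdint>

bool ShouldUseBlocking(int64_t num_cols, int64_t feat_dim, int64_t llc_bytes) {
  // Blocking pays off only when the dense feature panel (num_cols x feat_dim
  // floats) cannot stay resident in the last-level cache on its own.
  const int64_t kMinFeatDim = 32;  // assumed threshold on the feature dimension
  const int64_t dense_bytes =
      num_cols * feat_dim * static_cast<int64_t>(sizeof(float));
  return feat_dim >= kMinFeatDim && dense_bytes > llc_bytes;
}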

@zheng-da
Collaborator

I'm just curious. It's better not to cache it for now, which makes the implementation more compatible with DGL's current API.

@sanchit-misra
Contributor Author

Hi @zheng-da, I have moved the optimized code to a new header file. I have named it "src/array/cpu/spmm_blocking_libxsmm.h" because I thought "spmm_blocking.h" did not capture all its improvements.
I am running tests again and will share the speedup numbers as soon as I have them.

@sanchit-misra
Contributor Author

Could you also provide some performance results to help people understand the benefit of this PR?

Please find the graphs for multiple workloads below. For the baseline, we have enabled the xbyak optimization.

Full batch Training:
(performance charts attached)

Minibatch Training:
(performance charts attached)

@jermainewang
Member

First batch of suggestions on coding style to help pass the lint check.

@jermainewang
Member

Hi @sanchit-misra, thank you very much for the contribution. The overall change makes sense but I may need more time to understand the details. I saw the lint check has failed, so I put down some suggestions on coding style to help you pass it. Generally, we follow the Google Coding Style. Our line limit is 100 characters.
You can click the CI details button to see the lint errors. Once that is done, you also need to pass the building and testing phases. In the meantime I will study your code and give more comments.

BTW, I saw you added a new submodule, libxsmm. Is it a snapshot of the latest stable version? I saw the latest release is 1.16.1. If everything you need is available in it, I would suggest checking out that version.

@sanchit-misra
Contributor Author

Hi @sanchit-misra, thank you very much for the contribution. The overall change makes sense but I may need more time to understand the details. I saw the lint check has failed, so I put down some suggestions on coding style to help you pass it. Generally, we follow the Google Coding Style. Our line limit is 100 characters.
You can click the CI details button to see the lint errors. Once that is done, you also need to pass the building and testing phases. In the meantime I will study your code and give more comments.

BTW, I saw you added a new submodule, libxsmm. Is it a snapshot of the latest stable version? I saw the latest release is 1.16.1. If everything you need is available in it, I would suggest checking out that version.

Hi @jermainewang, thanks a lot for your comments. I have made some of the coding style changes you suggested and will try to make all the changes required to pass the lint check today. The libxsmm release 1.16.1 is very old and does not have any of the features I am using, so I am using a recent commit.

@jermainewang added this to the v0.7 milestone on Jun 23, 2021
@dgl-bot
Collaborator

dgl-bot commented Jul 6, 2021

@sanchit-misra
Contributor Author

Hi @jermainewang,

The regression tests seem to show that while our PR is faster by up to 6.7x in many cases, in some cases it is also slower than the master branch.
Nearly 10 of the failing tests are from gsddmm even though we did not change anything in gsddmm. Could you please help me understand why these are failing?
There are another 10 tests failing in gspmm, all of the u_mul_e_sum type. Could you please let me know how I can run these individual tests to figure out what went wrong?
Some of the tests don't have any output. Is that a problem?

@VoVAllen
Collaborator

VoVAllen commented Jul 6, 2021

The Reddit dataset in the regression test has some problems. They are not related to this pull request.

@VoVAllen
Collaborator

VoVAllen commented Jul 6, 2021

The regression test is at https://github.com/dmlc/dgl/blob/master/benchmarks/benchmarks/kernel/bench_gspmm_copy_u.py. You can call the function directly by commenting out the decorators.

The utils module is at https://github.com/dmlc/dgl/blob/master/benchmarks/benchmarks/utils.py. You can append its path to sys.path and import utils directly.

@sanchit-misra
Contributor Author

The regression test is at https://github.com/dmlc/dgl/blob/master/benchmarks/benchmarks/kernel/bench_gspmm_copy_u.py. You can call the function directly by commenting out the decorators.

The utils module is at https://github.com/dmlc/dgl/blob/master/benchmarks/benchmarks/utils.py. You can append its path to sys.path and import utils directly.

Thanks @VoVAllen. From the output of the regression tests, it seems the tests in bench_gspmm_copy_u.py are fine.
But there are issues in https://github.com/dmlc/dgl/blob/master/benchmarks/benchmarks/kernel/bench_gspmm_u_mul_e_sum.py
Is that the correct understanding?

@VoVAllen
Collaborator

VoVAllen commented Jul 6, 2021

The result of ogbn-proteins in bench_gspmm_copy_u seems much worse than before. And ogbn-proteins is a very dense graph, which is a bit different from the other datasets. Dataset stats are available at https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins

@krzysztof-daniell
Contributor

Did you run the regression tests with the flag / env variable DGL_CPU_INTEL_KERNEL_ENABLED=1? If not, is it possible to rerun the tests with this flag and compare the results? It would give a good historical overview of the kernel optimizations.

@sanchit-misra
Contributor Author

The result of ogbn-proteins in bench_gspmm_copy_u seems much worse than before. And ogbn-proteins is a very dense graph, which is a bit different from the other datasets. Dataset stats are available at https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins

I am a little confused. The CSV file reports GFLOPS numbers for this PR and the master branch. I am guessing higher is better, right? So, for gspmm_copy_u with the proteins dataset, the GFLOPS for this PR are 3.3-6.7x higher than master. So, this PR is much better than master. Or am I reading it wrong?

@sanchit-misra
Contributor Author

Did you run the regression tests with the flag / env variable DGL_CPU_INTEL_KERNEL_ENABLED=1? If not, is it possible to rerun the tests with this flag and compare the results? It would give a good historical overview of the kernel optimizations.

The regression tests were run by dgl bot.

@krzysztof-daniell
Contributor

Did you run the regression tests with the flag / env variable DGL_CPU_INTEL_KERNEL_ENABLED=1? If not, is it possible to rerun the tests with this flag and compare the results? It would give a good historical overview of the kernel optimizations.

The regression tests were run by dgl bot.

@jermainewang @VoVAllen Could you check whether dgl-bot is running the benchmarks with the flag / env variable DGL_CPU_INTEL_KERNEL_ENABLED=1? It would be great to have a comparison with the xbyak optimization, which will give good insights for future kernel optimizations.

@VoVAllen
Collaborator

VoVAllen commented Jul 6, 2021

@sanchit-misra My bad. You are right

@VoVAllen
Collaborator

VoVAllen commented Jul 6, 2021

@ksadowski13 Sure. I'll manually test that and get the result back later

message(STATUS "Build with AVX optimization.")
if(USE_LIBXSMM)
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DUSE_AVX -DUSE_LIBXSMM -DDGL_CPU_LLC_SIZE=40000000")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DUSE_AVX -DUSE_LIBXSMM -DDGL_CPU_LLC_SIZE=40000000")
Member

How should users set a proper DGL_CPU_LLC_SIZE value?

Contributor Author

This is the size of the LLC of the CPU the code is run on. The macro here is just a failsafe: the code automatically gets the LLC size using sysconf() in the function getLLCSize(), and only if that fails does it use this number. The default I have used here is quite safe for most server-class CPUs. Only if sysconf fails and this default is also bigger than the user's LLC will they have to set the LLC size here.
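For illustration, here is a minimal standalone sketch of the fallback behaviour described above, assuming a Linux/glibc sysconf query. getLLCSize() is the helper mentioned in the PR; this simplified version only illustrates the idea and is not the actual DGL code.

#include <unistd.h>
#include <cstdint>

#ifndef DGL_CPU_LLC_SIZE
#define DGL_CPU_LLC_SIZE 40000000  // compile-time failsafe, as set in CMakeLists.txt
#endif

// Query the last-level (L3) cache size from the OS; fall back to the
// compile-time constant if the query is unavailable.
int64_t GetLLCSizeSketch() {
  long llc = sysconf(_SC_LEVEL3_CACHE_SIZE);  // may return -1 or 0 if unknown
  if (llc <= 0) {
    // Over-estimating the LLC only leads to a sub-optimal blocking factor,
    // not a crash, which matches the discussion below.
    return DGL_CPU_LLC_SIZE;
  }
  return static_cast<int64_t>(llc);
}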

Member

Only if sysconf fails and this default is also bigger than the user's LLC will they have to set the LLC size here.

In this case, will the program crash or just run with a sub-optimal config?

Contributor Author

It won't crash. It will just run with a sub-optimal config.

{
  for (IdType k = 0; k < num_K_blocks; k++) {
#pragma omp for schedule(dynamic)
    for (IdType m = 0; m < num_M_blocks; m++) {
Member

I wonder why not

#pragma omp parallel for
    for (IdType m = 0; m < num_M_blocks; m++) {
      for (IdType k = 0; k < num_K_blocks; k++) {

Contributor Author

Please see the picture below for the tiling of the copy-lhs-reduce kind of SpMM. Here, A is the adjacency matrix, B is the input feature array containing the features of all the neighbors, and C is the output feature array. This is roughly equivalent to computing A x B = C. If we use the k-loop as the outer loop, all the threads work on the same block of B, and that block stays in cache. On the other hand, if we use the m-loop as the outer loop, different threads might be working on different blocks of B, which makes it harder to keep blocks of B in cache. While that may work for some workloads, keeping the k-loop as the outer loop provides better guarantees.

(figure: tiling diagram of the adjacency matrix A, input feature matrix B, and output feature matrix C for the blocked SpMM)
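As a standalone illustration of the argument above (not the actual DGL kernel, which operates on CSR blocks and dispatches to libxsmm), here is a minimal OpenMP sketch contrasting the two loop orders. ProcessBlock, the block counts, and the function names are placeholders.

#include <cstdint>

using IdType = int64_t;

// Placeholder for the per-block work: C[m] += A[m][k] * B[k].
void ProcessBlock(IdType m, IdType k) { (void)m; (void)k; }

// (a) k-loop outside: at any given time all threads work on the same block
// column of B, so that block of B stays resident in the shared LLC. The
// implicit barrier at the end of the "omp for" keeps the threads on the same k.
void SpmmKOuter(IdType num_M_blocks, IdType num_K_blocks) {
  #pragma omp parallel
  {
    for (IdType k = 0; k < num_K_blocks; ++k) {
      #pragma omp for schedule(dynamic)
      for (IdType m = 0; m < num_M_blocks; ++m) {
        ProcessBlock(m, k);
      }
    }
  }
}

// (b) m-loop outside: threads advance through k independently, so different
// threads may be touching different blocks of B at the same time, which makes
// it harder to keep any one block of B in cache.
void SpmmMOuter(IdType num_M_blocks, IdType num_K_blocks) {
  #pragma omp parallel for schedule(dynamic)
  for (IdType m = 0; m < num_M_blocks; ++m) {
    for (IdType k = 0; k < num_K_blocks; ++k) {
      ProcessBlock(m, k);
    }
  }
}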

@jermainewang
Member

@sanchit-misra I've uploaded my comments. The overall code logic is quite clean (thanks for the comments, they indeed help a lot!). I feel it is very close to being merged unless there are other performance regressions. Thanks for the great effort!

@sanchit-misra
Contributor Author

@sanchit-misra I've uploaded my comments. The overall code logic is quite clean (thanks for the comments, they indeed help a lot!). I feel it is very close to being merged unless there are other performance regressions. Thanks for the great effort!

Thanks. Will work on them.

@sanchit-misra
Contributor Author

@sanchit-misra My bad. You are right

Great! I wanted to mention the following points regarding result.csv at the link below, which shows worse performance for this PR in some cases:
https://dgl-asv-data.s3-us-west-2.amazonaws.com/2b58e634ad578b0b48a4c8144c2b01c76038d1d1_c59xlarge/results/result.csv

  1. This PR has not modified the sddmm code at all, so the cases where the sddmm code runs slower are not because of this PR.
  2. For the gspmm cases where result.csv shows worse performance for PR #3024, the naive SpMM code is being run for both the master branch and PR #3024, because that path is not supported by the optimized code in this PR. So the comparison is just between the naive code running on master and the naive code running on this PR. Performance also oscillates quite a bit from run to run, which could explain the observed slowdown.
  3. I also did a run at my end comparing this PR with the xbyak code and the naive code. These experiments were performed on a single socket of an Intel Xeon Platinum 8280 with 28 cores (https://ark.intel.com/content/www/us/en/ark/products/192478/intel-xeon-platinum-8280-processor-38-5m-cache-2-70-ghz.html). I used 28 threads with scatter KMP_AFFINITY. Wherever applicable, the xbyak code runs faster than the naive code. The libxsmm-based code in this PR is applicable in more cases than xbyak and, whenever applicable, is faster than both the naive and the xbyak code.

Please find attached detailed results here:
result.xlsx

@sanchit-misra
Contributor Author

Hi @jermainewang, I believe I have made all the changes suggested by you and answered all your questions. Please let me know if you need anything else.

@VoVAllen
Collaborator

VoVAllen commented Jul 8, 2021

  row test_name params unit this_PR_#3024 machine DGL_CPU_INTEL_KERNEL_ENABLED=0 DGL_CPU_INTEL_KERNEL_ENABLED=1
27 kernel.bench_gspmm_copy_u.track_flops 'ogbn-arxiv', 4, 'sum' GFLOPS 6.05 c5.9xlarge-cpu 3.96 5.23
28 kernel.bench_gspmm_copy_u.track_flops 'ogbn-arxiv', 4, 'max' GFLOPS 2.85 c5.9xlarge-cpu 1.96 2.28
29 kernel.bench_gspmm_copy_u.track_flops 'ogbn-arxiv', 32, 'sum' GFLOPS 9.03 c5.9xlarge-cpu 4.4 7.66
30 kernel.bench_gspmm_copy_u.track_flops 'ogbn-arxiv', 32, 'max' GFLOPS 2.58 c5.9xlarge-cpu 1.81 1.93
31 kernel.bench_gspmm_copy_u.track_flops 'ogbn-arxiv', 256, 'sum' GFLOPS 5.98 c5.9xlarge-cpu 4.29 4.71
32 kernel.bench_gspmm_copy_u.track_flops 'ogbn-arxiv', 256, 'max' GFLOPS 1.93 c5.9xlarge-cpu 1.6 1.66
39 kernel.bench_gspmm_copy_u.track_flops 'ogbn-proteins', 4, 'sum' GFLOPS 29.24 c5.9xlarge-cpu 8.92 9.22
40 kernel.bench_gspmm_copy_u.track_flops 'ogbn-proteins', 4, 'max' GFLOPS 23.54 c5.9xlarge-cpu 5.83 5.91
41 kernel.bench_gspmm_copy_u.track_flops 'ogbn-proteins', 32, 'sum' GFLOPS 43.61 c5.9xlarge-cpu 9.32 40.1
42 kernel.bench_gspmm_copy_u.track_flops 'ogbn-proteins', 32, 'max' GFLOPS 35.58 c5.9xlarge-cpu 7.67 7.85
43 kernel.bench_gspmm_copy_u.track_flops 'ogbn-proteins', 256, 'sum' GFLOPS 58.5 c5.9xlarge-cpu 9.19 15.96
44 kernel.bench_gspmm_copy_u.track_flops 'ogbn-proteins', 256, 'max' GFLOPS 47 c5.9xlarge-cpu 6.98 7.1
45 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 4, 0 GFLOPS 4.83 c5.9xlarge-cpu 3.54 4.69
46 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 4, 1 GFLOPS 3.66 c5.9xlarge-cpu 3.89 5.12
47 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 4, 4 GFLOPS 4.84 c5.9xlarge-cpu 3.46 4.73
48 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 32, 0 GFLOPS 9.94 c5.9xlarge-cpu 4.61 9.27
49 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 32, 1 GFLOPS 5.82 c5.9xlarge-cpu 6.7 6.74
50 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 32, 4 GFLOPS 4.86 c5.9xlarge-cpu 5.71 5.62
51 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 256, 0 GFLOPS 8.25 c5.9xlarge-cpu 6.47 6.12
52 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 256, 1 GFLOPS 6.01 c5.9xlarge-cpu 6.67 7.04
53 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-arxiv', 256, 4 GFLOPS 5.88 c5.9xlarge-cpu 6.67 6.87
63 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 4, 0 GFLOPS 7.87 c5.9xlarge-cpu 3.28 3.41
64 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 4, 1 GFLOPS 3.01 c5.9xlarge-cpu 3.03 3.13
65 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 4, 4 GFLOPS 7.87 c5.9xlarge-cpu 3.27 3.38
66 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 32, 0 GFLOPS 21.12 c5.9xlarge-cpu 4.95 12.71
67 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 32, 1 GFLOPS 6.29 c5.9xlarge-cpu 6.76 7.03
68 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 32, 4 GFLOPS 5.55 c5.9xlarge-cpu 5.95 6.21
69 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 256, 0 GFLOPS   c5.9xlarge-cpu    
70 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 256, 1 GFLOPS 8.8 c5.9xlarge-cpu 10.1 10.4
71 kernel.bench_gspmm_u_mul_e_sum.track_flops 'ogbn-proteins', 256, 4 GFLOPS 8.5 c5.9xlarge-cpu 9.77 10.04

@VoVAllen
Collaborator

VoVAllen commented Jul 8, 2021

intel_kernel.xlsx
Here are the results compared to the previous kernel with DGL_CPU_INTEL_KERNEL_ENABLED=1.

@sanchit-misra
Contributor Author

intel_kernel.xlsx
Here are the results compared to the previous kernel with DGL_CPU_INTEL_KERNEL_ENABLED=1.

Thanks @VoVAllen. As can be seen in the results, this PR gets speedups of up to 6.7x compared to the xbyak run (DGL_CPU_INTEL_KERNEL_ENABLED=1). There are a few cases in your Excel sheet for which this PR gets lower performance. I went through all of those cases and verified that in all of them the naive SpMM code runs whether we use this PR or the master branch, because neither this PR nor xbyak provides an optimized implementation for use_bcast=1. So the same naive code is performing differently between the two runs (probably because of run-to-run variation) and the slowdown has nothing to do with this PR.

@krzysztof-daniell
Contributor

@VoVAllen Thanks for the results. Do the kernel benchmarks cover only the full-graph scenario?

@jermainewang
Member

LGTM. I think the PR is ready to go.

@jermainewang merged commit fac75e1 into dmlc:master on Jul 13, 2021
@sanchit-misra
Contributor Author

Great to know that this has been merged. Thanks everyone for all the help!
