Cluster Gauss-Seidel and SGS #455

Merged: 31 commits, Nov 19, 2019


@brian-kelley
Contributor

brian-kelley commented Aug 19, 2019

Algorithm idea:
- Use RCM to reduce the matrix envelope.
- Build a cluster graph from equal-size contiguous groups of rows in RCM order. Its edges are the union of the edges between vertices in different clusters.
- Color the cluster graph.
- Run Gauss-Seidel: rows within a cluster are processed serially, but clusters of the same color run in parallel.
- In practice, this converges faster than traditional coloring-based GS while preserving parallelism.
- On cfd1 from the SuiteSparse collection, traditional GS fails to converge at all, but this technique converges (slowly...).
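
The steps above can be sketched in plain serial C++. This is only an illustration under stated assumptions, not the code in this PR: `CsrMatrix`, `shiftedLaplacian1d`, `colorClusters`, `clusterGaussSeidelSweep` and `residualMaxNorm` are hypothetical names, the matrix is assumed structurally symmetric, and the RCM pass is skipped because the toy matrix is already banded. The loop marked "parallelizable" is where a parallel implementation would dispatch the clusters of one color concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Minimal CSR matrix. All names in this sketch are illustrative, not KokkosKernels types.
struct CsrMatrix {
  int n = 0;
  std::vector<int> rowptr;   // size n + 1
  std::vector<int> colind;   // column index of each stored entry
  std::vector<double> vals;  // value of each stored entry
};

// Tridiagonal test matrix (4 on the diagonal, -1 off it): strictly diagonally
// dominant and already banded, so this toy example needs no RCM reordering.
CsrMatrix shiftedLaplacian1d(int n) {
  CsrMatrix A;
  A.n = n;
  A.rowptr.push_back(0);
  for (int i = 0; i < n; ++i) {
    if (i > 0) { A.colind.push_back(i - 1); A.vals.push_back(-1.0); }
    A.colind.push_back(i); A.vals.push_back(4.0);
    if (i + 1 < n) { A.colind.push_back(i + 1); A.vals.push_back(-1.0); }
    A.rowptr.push_back(static_cast<int>(A.colind.size()));
  }
  return A;
}

// Greedy distance-1 coloring of the cluster graph. Cluster c owns rows
// [c * clusterSize, min(n, (c + 1) * clusterSize)). Two clusters are adjacent
// when a matrix entry couples a row of one to a column owned by the other
// (assumes a structurally symmetric matrix).
std::vector<int> colorClusters(const CsrMatrix& A, int clusterSize) {
  const int numClusters = (A.n + clusterSize - 1) / clusterSize;
  std::vector<int> color(numClusters, -1);
  for (int c = 0; c < numClusters; ++c) {
    std::vector<bool> used(numClusters, false);
    const int rowEnd = std::min(A.n, (c + 1) * clusterSize);
    for (int i = c * clusterSize; i < rowEnd; ++i)
      for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
        const int nbr = A.colind[k] / clusterSize;
        if (nbr != c && color[nbr] >= 0) used[color[nbr]] = true;
      }
    int pick = 0;
    while (used[pick]) ++pick;  // smallest color not taken by a colored neighbor
    color[c] = pick;
  }
  return color;
}

// One forward sweep. Colors are processed one after another; clusters that
// share a color are not coupled, so they could be updated in parallel. Rows
// inside a cluster are updated serially, in order.
void clusterGaussSeidelSweep(const CsrMatrix& A, const std::vector<int>& color,
                             int clusterSize, const std::vector<double>& b,
                             std::vector<double>& x) {
  const int numClusters = static_cast<int>(color.size());
  const int numColors = *std::max_element(color.begin(), color.end()) + 1;
  for (int col = 0; col < numColors; ++col)
    for (int c = 0; c < numClusters; ++c) {  // parallelizable loop
      if (color[c] != col) continue;
      const int rowEnd = std::min(A.n, (c + 1) * clusterSize);
      for (int i = c * clusterSize; i < rowEnd; ++i) {
        double sum = b[i], diag = 0.0;
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
          if (A.colind[k] == i) diag = A.vals[k];
          else sum -= A.vals[k] * x[A.colind[k]];
        }
        x[i] = sum / diag;
      }
    }
}

// Max-norm of the residual b - A x.
double residualMaxNorm(const CsrMatrix& A, const std::vector<double>& b,
                       const std::vector<double>& x) {
  double worst = 0.0;
  for (int i = 0; i < A.n; ++i) {
    double r = b[i];
    for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) r -= A.vals[k] * x[A.colind[k]];
    worst = std::max(worst, std::fabs(r));
  }
  return worst;
}
```

To try it, build a matrix with `shiftedLaplacian1d`, color it once with `colorClusters`, and call `clusterGaussSeidelSweep` repeatedly while watching `residualMaxNorm`. The cluster size here is just a parameter of the sketch.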

@brian-kelley
Contributor Author

Spot-check on kokkos-dev, Aug. 19:
#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=1188 run_time=880
clang-4.0.1-Pthread_Serial-release build_time=1192 run_time=1139
cuda-8.0.44-Cuda_OpenMP-release build_time=2343 run_time=768
gcc-5.3.0-Serial-hwloc-release build_time=643 run_time=202
gcc-5.3.0-Serial-release build_time=645 run_time=203
gcc-7.2.0-Serial-hwloc-release build_time=447 run_time=168
gcc-7.2.0-Serial-release build_time=442 run_time=166
#######################################################
FAILED TESTS
#######################################################
intel-17.0.1-OpenMP-hwloc-release (test failed)
intel-17.0.1-OpenMP-release (test failed)

Failing tests are not related:
[ FAILED ] openmp.sparse_spmv_struct_double_int64_t_int_TestExecSpace
[ FAILED ] openmp.sparse_spmv_struct_double_int64_t_size_t_TestExecSpace
#####################################################################
white spot check results (so far, CUDA is still running):
gcc-6.4.0-OpenMP_Serial-release build_time=592 run_time=342
gcc-7.2.0-OpenMP-release build_time=386 run_time=139
gcc-7.2.0-OpenMP_Serial-release build_time=671 run_time=385
gcc-7.2.0-Serial-release build_time=272 run_time=183
ibm-16.1.0-Serial-release build_time=1328 run_time=275

@brian-kelley
Contributor Author

Now the spot checks are clean:
<<< white >>>
#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1112 run_time=438
cuda-9.2.88-Cuda_OpenMP-release build_time=1203 run_time=374
gcc-6.4.0-OpenMP_Serial-release build_time=640 run_time=364
gcc-7.2.0-OpenMP-release build_time=434 run_time=150
gcc-7.2.0-OpenMP_Serial-release build_time=718 run_time=408
gcc-7.2.0-Serial-release build_time=272 run_time=178
ibm-16.1.0-Serial-release build_time=1505 run_time=277
#######################################################
FAILED TESTS
#######################################################

<<< kokkos-dev >>>
#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=1122 run_time=893
clang-4.0.1-Pthread_Serial-release build_time=868 run_time=1452
cuda-8.0.44-Cuda_OpenMP-release build_time=3389 run_time=2232
gcc-5.3.0-Serial-hwloc-release build_time=484 run_time=194
gcc-5.3.0-Serial-release build_time=481 run_time=189
gcc-7.2.0-Serial-hwloc-release build_time=424 run_time=182
gcc-7.2.0-Serial-release build_time=415 run_time=167
intel-17.0.1-OpenMP-hwloc-release build_time=898 run_time=140
intel-17.0.1-OpenMP-release build_time=905 run_time=119
#######################################################
FAILED TESTS
#######################################################

@srajama1
Contributor

@brian-kelley: Can you add a Bowman spot-check and a wiki update for the new feature, please? A benchmark page update would also be useful. Thanks for these! This will be useful for our apps. @lucbv, can we add a TODO to evaluate this SGS as an option for the momentum solves?

@brian-kelley
Contributor Author

@srajama1 Bowman spot checks:
#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1547 run_time=614
intel-16.4.258-Pthread_Serial-release build_time=2512 run_time=1313
intel-16.4.258-Serial-release build_time=1500 run_time=647
intel-17.2.174-OpenMP-release build_time=2001 run_time=397
intel-17.2.174-OpenMP_Serial-release build_time=2962 run_time=1139
intel-17.2.174-Pthread-release build_time=1348 run_time=650
intel-17.2.174-Pthread_Serial-release build_time=2472 run_time=1321
intel-17.2.174-Serial-release build_time=1416 run_time=687
intel-18.2.199-OpenMP-release build_time=1619 run_time=445
intel-18.2.199-OpenMP_Serial-release build_time=2859 run_time=1064
intel-18.2.199-Pthread-release build_time=1249 run_time=655
intel-18.2.199-Pthread_Serial-release build_time=2452 run_time=1274
intel-18.2.199-Serial-release build_time=1245 run_time=684
#######################################################
FAILED TESTS
#######################################################

Contributor

@srajama1 left a comment

@brian-kelley Sorry for my delay in reviewing this. Please see comments below.

src/sparse/impl/KokkosSparse_gauss_seidel_impl.hpp (outdated)
namespace Impl{

template <typename HandleType, typename lno_row_view_t, typename lno_nnz_view_t>
struct RCM
Contributor

It would be useful to expose RCM to users rather than keeping it in Impl. It doesn't have to be part of this PR, though. We could add an issue.
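
For context on what the RCM step computes, here is a minimal serial sketch of reverse Cuthill-McKee on an adjacency-list graph. It is not the `RCM` struct from this PR: `Graph`, `reverseCuthillMcKee` and `bandwidth` are hypothetical names, the graph is assumed symmetric with no self loops, and each component is simply rooted at a low-degree vertex rather than a carefully chosen peripheral one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <queue>
#include <vector>

// Adjacency lists of a symmetric graph without self loops (illustrative type).
using Graph = std::vector<std::vector<int>>;

// Returns perm, where perm[k] is the original vertex placed at new position k.
std::vector<int> reverseCuthillMcKee(const Graph& g) {
  const int n = static_cast<int>(g.size());
  auto lowerDegree = [&](int a, int b) { return g[a].size() < g[b].size(); };

  // Candidate roots sorted by degree, so each component starts at a low-degree vertex.
  std::vector<int> byDegree(n);
  for (int v = 0; v < n; ++v) byDegree[v] = v;
  std::stable_sort(byDegree.begin(), byDegree.end(), lowerDegree);

  std::vector<int> order;
  order.reserve(n);
  std::vector<bool> visited(n, false);
  for (int root : byDegree) {
    if (visited[root]) continue;
    std::queue<int> frontier;
    frontier.push(root);
    visited[root] = true;
    while (!frontier.empty()) {  // breadth-first search
      const int v = frontier.front();
      frontier.pop();
      order.push_back(v);
      std::vector<int> nbrs;
      for (int w : g[v])
        if (!visited[w]) { nbrs.push_back(w); visited[w] = true; }
      // Cuthill-McKee visits unvisited neighbors in order of increasing degree.
      std::stable_sort(nbrs.begin(), nbrs.end(), lowerDegree);
      for (int w : nbrs) frontier.push(w);
    }
  }
  std::reverse(order.begin(), order.end());  // the "reverse" in RCM
  return order;
}

// Bandwidth under a permutation: the largest distance between the new
// positions of two adjacent vertices.
int bandwidth(const Graph& g, const std::vector<int>& perm) {
  std::vector<int> pos(perm.size());
  for (int k = 0; k < static_cast<int>(perm.size()); ++k) pos[perm[k]] = k;
  int bw = 0;
  for (int u = 0; u < static_cast<int>(g.size()); ++u)
    for (int v : g[u]) bw = std::max(bw, std::abs(pos[u] - pos[v]));
  return bw;
}
```

On a path graph whose vertices are labeled in scrambled order, the returned permutation brings the bandwidth back down to 1, which is the kind of envelope reduction the clustering step relies on.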

//radix sort keys according to their corresponding values ascending.
//keys are NOT preserved since the use of this in RCM doesn't care about degree after sorting
template<typename size_type, typename KeyType, typename ValueType, typename IndexType, typename member_t>
KOKKOS_INLINE_FUNCTION static void
Contributor

Will this go away with PR #461 ?

Contributor Author

@srajama1 Yes it will.

}
}

//Functor that does breadth-first search on a sparse graph.
Contributor

It would be nice to expose BFS to the users as well. We could file an issue and come back later. No need to modify this PR.

unit_test/sparse/Test_Sparse_gauss_seidel.hpp
@srajama1
Contributor

One more comment: What is the default cluster size? Does the user have to set it before calling?

Algorithm idea:
  -Use RCM to reduce matrix envelope
  -Build cluster graph using contiguous groups of rows in RCM order.
  -Dist-1 color the cluster graph.
  -Run Gauss-Seidel, running within each cluster color in parallel.
RCM-based cluster finding is very slow. Working on two different
clustering algorithms that should be both faster (esp. on GPU) and
produce higher quality clusters (sparser cluster graph).
Fast partitioning (clustering) works!
If > 50 entries per row, use Cuthill-McKee clustering.
Otherwise, use SSSP clustering.
Apparent bug in cluster color -> vertex color mapping, since
bodyy5.mtx triggers a crash sometimes during create_reverse_map
Not using fixed iteration count; instead, run until variance
of cluster size fails to improve
Both range and team policy versions.
Much more robust and also produces
better quality for large clusters.
These are: balloon (default), RCM, and do-nothing
Also, checks that the scaled residual is no higher than 1.
The matrix is randomly generated to be diagonally dominant,
so if the residual blows up it is a bug in Gauss-Seidel.
Needs cleanup + more testing before pushing
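
The residual check mentioned in the commit notes above can be illustrated with a small dense sketch. This is not the unit test from this PR; `randomDiagDominant`, `gaussSeidelSweep` and `scaledResidual` are hypothetical names, and the 2-norm scaling by the right-hand side is an assumption. With a zero initial guess the scaled residual starts at 1, and on a strictly diagonally dominant matrix Gauss-Seidel converges, so a value above 1 after the sweeps would point to a bug.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

using Dense = std::vector<std::vector<double>>;  // illustrative dense matrix type

// Random matrix whose diagonal strictly dominates each row's off-diagonal sum.
Dense randomDiagDominant(int n, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> dist(-1.0, 1.0);
  Dense A(n, std::vector<double>(n, 0.0));
  for (int i = 0; i < n; ++i) {
    double offDiagSum = 0.0;
    for (int j = 0; j < n; ++j) {
      if (j == i) continue;
      A[i][j] = dist(rng);
      offDiagSum += std::fabs(A[i][j]);
    }
    A[i][i] = 1.5 * offDiagSum + 1.0;
  }
  return A;
}

// One forward Gauss-Seidel sweep on A x = b, updating x in place.
void gaussSeidelSweep(const Dense& A, const std::vector<double>& b,
                      std::vector<double>& x) {
  const int n = static_cast<int>(b.size());
  for (int i = 0; i < n; ++i) {
    double s = b[i];
    for (int j = 0; j < n; ++j)
      if (j != i) s -= A[i][j] * x[j];
    x[i] = s / A[i][i];
  }
}

// ||b - A x||_2 / ||b||_2
double scaledResidual(const Dense& A, const std::vector<double>& b,
                      const std::vector<double>& x) {
  const int n = static_cast<int>(b.size());
  double rr = 0.0, bb = 0.0;
  for (int i = 0; i < n; ++i) {
    double r = b[i];
    for (int j = 0; j < n; ++j) r -= A[i][j] * x[j];
    rr += r * r;
    bb += b[i] * b[i];
  }
  return std::sqrt(rr / bb);
}
```

A test along these lines would run a fixed number of sweeps from `x = 0` and then require `scaledResidual(A, b, x) <= 1.0`.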
@brian-kelley
Contributor Author

@srajama1 The cluster size needs to be set when the user creates the GS handle. There are now two overloads of create_gs_handle, this one for point coloring:

void create_gs_handle(KokkosSparse::GSAlgorithm gs_algorithm = KokkosSparse::GS_DEFAULT)

and this one for cluster:

void create_gs_handle(KokkosSparse::ClusteringAlgorithm clusterAlgo, nnz_lno_t verts_per_cluster)

No default is set for the cluster size. In practice it seems like anywhere between 8 and 64 is reasonable. That will go in the wiki entry.

@brian-kelley
Contributor Author

I'm running spot checks now. The numerical results are looking good. On af_shell7 (504k rows, 17.5M entries, and SPD) the preconditioned CG iteration counts were:

  • MTSGS: 2068
  • Cluster SGS with cluster size 64: 1682
  • Sequential GS: 1646
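
For readers who want to see what "preconditioned CG with symmetric Gauss-Seidel" means mechanically, here is a small dense sketch. It is not the benchmark code behind the numbers above; `applySgs` and `pcg` are hypothetical names, and the matrix is assumed symmetric positive definite. Applying the preconditioner is one forward sweep followed by one backward sweep on A z = r starting from z = 0, which is equivalent to solving with M = (D + L) D^{-1} (D + U).

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Dense = std::vector<Vec>;  // illustrative dense SPD matrix

Vec matvec(const Dense& A, const Vec& x) {
  Vec y(x.size(), 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
  return y;
}

double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// z = M^{-1} r: forward then backward Gauss-Seidel sweep from z = 0.
Vec applySgs(const Dense& A, const Vec& r) {
  const int n = static_cast<int>(r.size());
  Vec z(n, 0.0);
  auto updateRow = [&](int i) {
    double s = r[i];
    for (int j = 0; j < n; ++j)
      if (j != i) s -= A[i][j] * z[j];
    z[i] = s / A[i][i];
  };
  for (int i = 0; i < n; ++i) updateRow(i);       // forward sweep
  for (int i = n - 1; i >= 0; --i) updateRow(i);  // backward sweep
  return z;
}

// Preconditioned CG. Returns the iteration count at which
// ||r||_2 <= tol * ||b||_2 was reached, or -1 if maxIters was not enough.
int pcg(const Dense& A, const Vec& b, Vec& x,
        const std::function<Vec(const Vec&)>& precond, double tol, int maxIters) {
  const std::size_t n = b.size();
  Vec Ax = matvec(A, x), r(n);
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ax[i];
  Vec z = precond(r), p = z;
  double rz = dot(r, z);
  const double stop = tol * std::sqrt(dot(b, b));
  for (int it = 0; it < maxIters; ++it) {
    if (std::sqrt(dot(r, r)) <= stop) return it;
    const Vec Ap = matvec(A, p);
    const double alpha = rz / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
    z = precond(r);
    const double rzNew = dot(r, z);
    for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + (rzNew / rz) * p[i];
    rz = rzNew;
  }
  return std::sqrt(dot(r, r)) <= stop ? maxIters : -1;
}
```

Passing `[&](const Vec& r) { return applySgs(A, r); }` as `precond` gives SGS-preconditioned CG, while an identity lambda gives plain CG, so the two iteration counts can be compared on the same system. Swapping the sweep order inside `applySgs` for a colored or clustered ordering changes the preconditioner and therefore the count, which is the effect the numbers above measure.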

@srajama1
Contributor

These iteration numbers look really good. Looking forward to getting this into develop.

The block PCG perf test gets built in the Makefile-based build,
so it's built by test_all_sandia.
@srajama1
Contributor

I saw a comment about an error in email, but I don't see it on the website. Maybe you have resolved it?

@brian-kelley
Contributor Author

@srajama1 Yeah I deleted that comment, I forgot to checkout the right branch :)

@brian-kelley
Contributor Author

brian-kelley commented Nov 14, 2019

@srajama1 Actually, that error is still happening. On kokkos-dev, the Cuda_OpenMP-release build works, but when KokkosKernels_UnitTest_Cuda is actually run, I get:

[bmkelle@kokkos-dev unit_test]$ ./KokkosKernels_UnitTest_Cuda 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaGetDeviceCount( & m_cudaDevCount ) error( cudaErrorUnknown): unknown error /ascldap/users/bmkelle/ClusterGaussSeidel/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:204
Traceback functionality not available

Aborted (core dumped)

I think the OpenMP backend is getting initialized but not the Cuda...

The output for KokkosKernels_UnitTest_OpenMP is exactly the same (it shouldn't ever be calling any CUDA runtime functions but it is calling cudaGetDeviceCount).

I have the right modules loaded:

  1) sems-env 2) kokkos-env 3) kokkos-cuda/8.0.44 4) sems-gcc/5.3.0 5) kokkos-hwloc/1.10.1/base

@brian-kelley
Contributor Author

@srajama1 Nathan found out it is a system driver issue and it should be fixed pretty soon.

@srajama1
Contributor

I am glad you found the reason. Thanks @ndellingwood @brian-kelley

@brian-kelley
Contributor Author

@srajama1 CUDA on kokkos-dev is still broken, but below are successful test outputs from bowman and white.

The only build that fails on kokkos-dev is GCC 5.3.0, CUDA 8.0, running on Kepler. Should I try to run test_all_sandia with a similar configuration on another machine, wait for kokkos-dev drivers to be fixed, or is bowman+white enough?

#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1640 run_time=728
intel-16.4.258-Pthread_Serial-release build_time=2749 run_time=1539
intel-16.4.258-Serial-release build_time=1568 run_time=746
intel-17.2.174-OpenMP-release build_time=2130 run_time=432
intel-17.2.174-OpenMP_Serial-release build_time=3152 run_time=1380
intel-17.2.174-Pthread-release build_time=1405 run_time=757
intel-17.2.174-Pthread_Serial-release build_time=2645 run_time=1574
intel-17.2.174-Serial-release build_time=1530 run_time=771
intel-18.2.199-OpenMP-release build_time=1695 run_time=536
intel-18.2.199-OpenMP_Serial-release build_time=3084 run_time=1192
intel-18.2.199-Pthread-release build_time=1383 run_time=769
intel-18.2.199-Pthread_Serial-release build_time=2743 run_time=1497
intel-18.2.199-Serial-release build_time=1330 run_time=779
#######################################################
FAILED TESTS
#######################################################

#######################################################
PASSED TESTS
#######################################################
cuda-10.0.130-Cuda_Serial-release build_time=1218 run_time=438
cuda-9.2.88-Cuda_OpenMP-release build_time=1241 run_time=357
gcc-6.4.0-OpenMP_Serial-release build_time=637 run_time=410
gcc-7.2.0-OpenMP-release build_time=390 run_time=190
gcc-7.2.0-OpenMP_Serial-release build_time=612 run_time=777
gcc-7.2.0-Serial-release build_time=293 run_time=198
gcc-7.4.0-OpenMP-release build_time=400 run_time=184
ibm-16.1.0-Serial-release build_time=1360 run_time=300
#######################################################
FAILED TESTS
#######################################################

@srajama1
Contributor

I am ok with pushing this with testing on white.

Contributor

@srajama1 left a comment

@brian-kelley: Thanks for your persistence in getting this in!

@brian-kelley
Contributor Author

@srajama1 Cool, I'm ready to merge it then.
