Cuda/11.2.2 TPL spot check test failures #1586

Closed
e10harvey opened this issue Nov 9, 2022 · 9 comments

Comments

@e10harvey
Contributor

The following tests are failing when cusparse or cublas is enabled:

7: [ RUN      ] cuda.rotg_double
 7/23 Test  #7: blas_cuda ........................***Exception: SegFault  4.44 sec

--
Expected: (result_res) < (params.tolerance * initial_norms[c]), actual: inf vs 1.27438e-05
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_block_gauss_seidel.hpp:356: Failure
Expected: (result_res) < (params.tolerance * initial_norms[c]), actual: nan vs 1.32129e-05
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_block_gauss_seidel.hpp:356: Failure
Expected: (result_res) < (params.tolerance * initial_norms[c]), actual: inf vs 1.27438e-05
[  FAILED  ] cuda.sparse_bsr_gauss_seidel_rank2_kokkos_complex_double_int_int_TestExecSpace (2245 ms)
--
Expected: (res) < (initial_norms[i]), actual: nan vs 257.853
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_gauss_seidel.hpp:383: Failure
Expected: (res) < (initial_norms[i]), actual: nan vs 259.395
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_gauss_seidel.hpp:383: Failure
Expected: (res) < (initial_norms[i]), actual: nan vs 257.853
[  FAILED  ] cuda.sparse_gauss_seidel_asymmetric_rank2_kokkos_complex_double_int_int_TestExecSpace (355 ms)
--
Expected: (res) < (initial_norms[i]), actual: nan vs 259.395
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_gauss_seidel.hpp:383: Failure
Expected: (res) < (initial_norms[i]), actual: nan vs 257.853
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_gauss_seidel.hpp:383: Failure
Expected: (res) < (initial_norms[i]), actual: 263.819 vs 263.819
[  FAILED  ] cuda.sparse_gauss_seidel_symmetric_rank2_kokkos_complex_double_int_int_TestExecSpace (311 ms)
--
  Actual: true
Expected: false
KokkosSparse::Test::spmv 2D, mode H: threw exception:
cusparseSpMM_bufferSize( cusparseHandle, opA, opB, &alpha, A_cusparse, vecX, &beta, vecY, computeType, alg, &bufferSize) error( CUSPARSE_STATUS_INVALID_VALUE): invalid value /path/to/kokkos-kernels/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_decl.hpp:191

[  FAILED  ] cuda.sparse_spmv_mv_double_int_int_LayoutLeft_TestExecSpace (13679 ms)
--
KokkosSparse::Test::spmv_mv: 200 errors of 200 for mv 29 (alpha=(2.5,0), beta=(2.5,0), mode = H)
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_spmv.hpp:248: Failure
Value of: num_errors == 0
  Actual: false
Expected: true
[  FAILED  ] cuda.sparse_spmv_mv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (59250 ms)
--
KokkosKernels::UnitTests::spmv_mv_struct: 10 errors of 10 with params: 1 1.000000 1.000000, in vector 1
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_spmv.hpp:344: Failure
Value of: num_errors == 0
  Actual: false
Expected: true
[  FAILED  ] cuda.sparse_spmv_mv_struct_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (14 ms)
--
KokkosSparse::Test::spmv_bsr: 13 errors of 13 with params: N 3.100000 2.500000
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_spmv_bsr.hpp:543: Failure
Value of: num_errors == 0
  Actual: false
Expected: true
[  FAILED  ] cuda.sparse_bsrmat_times_vec_double_int_int_TestExecSpace (4339 ms)
--
KokkosSparse::Test::spmv_bsr: 13 errors of 13 with params: N 3.100000 2.500000
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_spmv_bsr.hpp:543: Failure
Value of: num_errors == 0
  Actual: false
Expected: true
[  FAILED  ] cuda.sparse_bsrmat_times_vec_kokkos_complex_double_int_int_TestExecSpace (7068 ms)
--
WARNING: Controls::getParameter for name "algorithm" was unset
 ** On entry to cusparseSpMM_bufferSize(): conjugate transpose (opA) is not supported for A data type (CUDA_R_64F)

unknown file: Failure
C++ exception with description "cusparseSpMM_bufferSize( cusparseHandle, opA, opB, &alpha, A_cusparse, vecX, &beta, vecY, computeType, alg, &bufferSize) error( CUSPARSE_STATUS_INVALID_VALUE): invalid value /path/to/kokkos-kernels/sparse/tpls/KokkosSparse_spmv_mv_tpl_spec_decl.hpp:191" thrown in the test body.
[  FAILED  ] cuda.sparse_bsrmat_times_multivec_double_int_int_LayoutLeft_TestExecSpace (2421 ms)
--
KokkosSparse::Test::spm_mv_bsr: 13 errors of 13 with params: H 3.100000 2.500000
/path/to/kokkos-kernels/sparse/unit_test/Test_Sparse_spmv_bsr.hpp:585: Failure
Value of: num_errors == 0
  Actual: false
Expected: true
[  FAILED  ] cuda.sparse_bsrmat_times_multivec_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace (4976 ms)

The following tests FAILED:
	  7 - blas_cuda (SEGFAULT)
	 11 - sparse_cuda (Failed)
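A note on the `cusparseSpMM_bufferSize` failures above: the cuSPARSE message says that conjugate transpose is not supported when A holds real data (`CUDA_R_64F`). For a real matrix the conjugate transpose is identical to the plain transpose, so one possible workaround is to request a plain transpose whenever the scalar type is real. The following is only a minimal sketch of that idea, not necessarily what any fix in Kokkos Kernels does. `SparseOp`, `is_complex`, and `select_op` are hypothetical names, and the enum is a stand-in for `cusparseOperation_t` so the snippet compiles without CUDA.

```cpp
#include <cassert>
#include <complex>
#include <type_traits>

// Hypothetical stand-in for cusparseOperation_t (assumption, not the real enum).
enum class SparseOp { NonTranspose, Transpose, ConjugateTranspose };

// Trait that is true only for std::complex<T>.
template <class T>
struct is_complex : std::false_type {};
template <class T>
struct is_complex<std::complex<T>> : std::true_type {};

// Map a mode character ('N', 'T', 'H') to an operation.
// For real scalars, 'H' falls back to Transpose because conj(A^T) == A^T.
template <class Scalar>
SparseOp select_op(char mode) {
  switch (mode) {
    case 'N': return SparseOp::NonTranspose;
    case 'T': return SparseOp::Transpose;
    case 'H':
      return is_complex<Scalar>::value ? SparseOp::ConjugateTranspose
                                       : SparseOp::Transpose;
    default:
      assert(false && "unsupported mode");
      return SparseOp::NonTranspose;
  }
}
```

With this mapping, `select_op<double>('H')` yields `SparseOp::Transpose`, while `select_op<std::complex<double>>('H')` still yields `SparseOp::ConjugateTranspose`.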

Reproducer:

module purge
module load cmake/3.21.2 cuda/11.2.2 openblas/0.3.20/gcc/9.3.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Volta70 --compiler=$KOKKOS_ROOT/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="17" --ldflags="" --with-cuda=$CUDA_ROOT  --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-tpls=blas,cublas,cusparse --user-blas-path=$OPENBLAS_ROOT/lib --user-lapack-path=$OPENBLAS_ROOT/lib --user-blas-lib=blas --user-lapack-lib=lapack --extra-linker-flags=-lgfortran,-lm --with-options= --with-cuda-options=,enable_lambda   --no-examples
@e10harvey e10harvey added the bug label Nov 9, 2022
@lucbv
Contributor

lucbv commented Nov 9, 2022

I cannot reproduce the issue with cuda.rotg_double. This might have been fixed by the merge of PR #1581; did you test with a develop prior to that PR merging?

bash-4.4$ ./blas/unit_test/KokkosKernels_blas_cuda --gtest_filter=cuda.rot*
Note: Google Test filter = cuda.rot*
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from cuda
[ RUN      ] cuda.rot_double
[       OK ] cuda.rot_double (1867 ms)
[ RUN      ] cuda.rot_complex_double
[       OK ] cuda.rot_complex_double (35 ms)
[ RUN      ] cuda.rotg_double
[       OK ] cuda.rotg_double (1 ms)
[ RUN      ] cuda.rotg_complex_double
[       OK ] cuda.rotg_complex_double (2 ms)
[ RUN      ] cuda.rotm_double
[       OK ] cuda.rotm_double (1 ms)
[ RUN      ] cuda.rotmg_double_int_int_TestExecSpace
[       OK ] cuda.rotmg_double_int_int_TestExecSpace (3 ms)
[ RUN      ] cuda.rotmg_double_int_size_t_TestExecSpace
[       OK ] cuda.rotmg_double_int_size_t_TestExecSpace (2 ms)
[----------] 7 tests from cuda (1911 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (1911 ms total)
[  PASSED  ] 7 tests.

@lucbv
Contributor

lucbv commented Nov 9, 2022

After the fix in PR #1587 merges, the remaining errors are:

bash-4.4$ ./sparse/unit_test/KokkosKernels_sparse_cuda
[==========] Running 135 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 135 tests from cuda
...
[----------] Global test environment tear-down
[==========] 135 tests from 1 test case ran. (647139 ms total)
[  PASSED  ] 127 tests.
[  FAILED  ] 8 tests, listed below:
[  FAILED  ] cuda.sparse_bsr_gauss_seidel_rank2_kokkos_complex_double_int_int_TestExecSpace
[  FAILED  ] cuda.sparse_gauss_seidel_asymmetric_rank2_kokkos_complex_double_int_int_TestExecSpace
[  FAILED  ] cuda.sparse_gauss_seidel_symmetric_rank2_kokkos_complex_double_int_int_TestExecSpace
[  FAILED  ] cuda.sparse_spmv_mv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
[  FAILED  ] cuda.sparse_spmv_mv_struct_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace
[  FAILED  ] cuda.sparse_bsrmat_times_vec_double_int_int_TestExecSpace
[  FAILED  ] cuda.sparse_bsrmat_times_vec_kokkos_complex_double_int_int_TestExecSpace
[  FAILED  ] cuda.sparse_bsrmat_times_multivec_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace

I will have a look at the spmv_mv and spmv_mv_struct failures; some of them seem to be due to tolerance issues that should be easy to solve.
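To illustrate what a tolerance issue can look like in a test of this kind, here is a minimal sketch of a relative comparison for complex values. It is not the check used in `Test_Sparse_spmv.hpp`; `approx_equal` and `eps_multiplier` are hypothetical names, and the choice to scale by `max(1, |expected|)` is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <limits>

// Hypothetical helper: returns true when |expected - actual| is within
// eps_multiplier * machine epsilon, scaled by max(1, |expected|).
template <class Real>
bool approx_equal(std::complex<Real> expected, std::complex<Real> actual,
                  Real eps_multiplier) {
  assert(eps_multiplier > Real(0));
  const Real diff  = std::abs(expected - actual);
  const Real scale = std::max(Real(1), std::abs(expected));
  // If diff is NaN or inf the comparison is false, so such results
  // are reported as mismatches rather than silently accepted.
  return diff <= eps_multiplier * std::numeric_limits<Real>::epsilon() * scale;
}
```

A bound that is too tight for the amount of accumulated rounding error (for example, a fixed absolute threshold on long dot products) can flag correct results as errors, whereas a bound scaled by the magnitude of the expected value is usually more forgiving.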

@e10harvey
Contributor Author

> did you test with a develop prior to that PR merging?

Yes, I think so.

@lucbv
Contributor

lucbv commented Nov 28, 2022

After the merge of PR #1604 by @vqd8a most of the errors are gone, I only see these:

[  FAILED  ] cuda.sparse_bsrmat_times_vec_double_int_int_TestExecSpace
[  FAILED  ] cuda.sparse_bsrmat_times_vec_kokkos_complex_double_int_int_TestExecSpace

@brian-kelley
Contributor

@lucbv I couldn't replicate the GS failures on Weaver (11.2, complex enabled) even before #1604 merged, but I guess it's fixed now?

@lucbv
Contributor

lucbv commented Nov 28, 2022

@brian-kelley the error seems fixed as far as I can tell. We are pretty close to having all the CUDA TPL issues resolved, which is good!

@e10harvey
Contributor Author

> After the merge of PR #1604 by @vqd8a most of the errors are gone, I only see these:

I'll look into those now.

@lucbv
Contributor

lucbv commented Mar 14, 2023

Do we want to close this issue, or are we still observing problems with the CUDA 11.2 TPLs?

@ndellingwood
Contributor

I don't see these tests failing in the cuda/11.2 nightly build with the cusparse and cublas TPLs, so I think it is safe to close.
