Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests fail on POWER machines #621

Closed
ivan23kor opened this issue Mar 10, 2022 · 48 comments
Closed

Tests fail on POWER machines #621

ivan23kor opened this issue Mar 10, 2022 · 48 comments

Comments

@ivan23kor
Copy link
Contributor

ivan23kor commented Mar 10, 2022

  • testsuite-run-fast fails on POWER9 and POWER10 with error message:
    4440 Segmentation fault ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite
  • POWER7 dgemm microkernel includes altivec.h, which defines type bool. Because of that, edge-case handling macros here and here use a conflicting type bool from stdbool. A possible solution is to check for POWER architecture in edge_case_macros and define _bool as int for POWER and bool for other architectures.
  • POWER10 build fails because POWER10 microkernels use the old microkernel API without edge handling (addressed in POWER10: edge cases in microkernel #620). Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with -s power10. Is it intentional?

POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default, are there plans to support a fast sgemm microkernel for POWER9?

@devinamatthews
Copy link
Member

@ivan23kor can QEMU be used to test POWER 7/8/9/10? If so what are the necessary flags so that we can add Travis CI tests?

@ivan23kor
Copy link
Contributor Author

@devinamatthews unfortunately I don't know about QEMU as I am running on actual hardware.

@RajalakshmiSR
Copy link

Looks like this is happening on POWER9 after ee9ff98

This is the stack trace :
Program received signal SIGSEGV, Segmentation fault.
bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608, cs_c=270180544,
data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138
138 INSERT_GENTFUNC_BASIC3( gemmtrsmbb_l, BLIS_CNAME_INFIX, BLIS_REF_SUFFIX, BLIS_TRSM_L_UKR )
(gdb) bt
#0 bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608,
cs_c=270180544, data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138
#1 0x000000001010b2a8 in bli_sgemmtrsm_l_ukernel (m=4, n=16, k=100, alpha=0x7fffffffccf0, a1x=0x7ffff6ef10c0, a11=0x7ffff6ef1700, bx1=0x7fffebd51098, b11=0x7fffebd52998,
c11=0x101aa0c0, rs_c=16, cs_c=1, data=0x7fffffffc200, cntx=0x101a08c0) at frame/3/bli_l3_ukr_tapi.c:124
#2 0x000000001003da60 in bli_gemmtrsm_ukernel (alpha=, a1x=, a11=0x7fffffffc5d0, bx1=, b11=, c11=,
cntx=0x101a08c0) at frame/3/bli_l3_ukr_oapi.c:182
#3 0x000000001000bbec in libblis_test_gemmtrsm_ukr_impl (iface=, side=, alpha=, a1x=, a11=,
bx1=, b11=, c11=, cntx=0x101a08c0) at testsuite/src/test_gemmtrsm_ukr.c:435
#4 0x000000001000c534 in libblis_test_gemmtrsm_ukr_experiment (params=0x7fffffffe130, op=, iface=, dc_str=,
pc_str=, sc_str=, p_cur=, perf=0x7fffffffd040, resid=0x7fffffffd038) at testsuite/src/test_gemmtrsm_ukr.c:355
#5 0x0000000010017d94 in libblis_test_op_driver (tdata=0x101a3e00, params=0x7fffffffe130, op=, iface=, op_str=0x1014a810 "gemmtrsm_ukr",
p_types=, o_types=, thresh=0x10190638 , f_exp=0x1000c0d0 <libblis_test_gemmtrsm_ukr_experiment>) at testsuite/src/test_libblis.c:2200
#6 0x000000001000bb7c in libblis_test_gemmtrsm_ukr (tdata=0x101a3e00, params=0x7fffffffe130, op=0x7fffffffde30) at testsuite/src/test_gemmtrsm_ukr.c:151
#7 0x0000000010011b9c in libblis_test_level3_ukrs (tdata=0x101a3e00, params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:311
#8 0x0000000010011d94 in libblis_test_all_ops (tdata=0x101a3e00, params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:230
#9 0x0000000010011e00 in libblis_test_thread_entry (tdata_void=) at testsuite/src/test_libblis.c:115
#10 0x0000000010011f44 in libblis_test_thread_decorator (params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:176
#11 0x00000000100019e8 in main (argc=, argv=0x7fffffffe5d8) at testsuite/src/test_libblis.c:84

So there is no special TRSM kernel in the power9 directory. I could see only specialized SGEMM and DGEMM kernel for POWER9.

@Flamefire
Copy link
Contributor

Flamefire commented Jul 21, 2022

Looks like this is happening on POWER9 after ee9ff98

Having done a git bisect on a POWER9 system I can confirm this too.

Further investigation points at missing modification of ref_kernels/3/bb/bli_gemmtrsmbb_ref.c (parameters m&n are not introduced)

After changing that file similar to ee9ff98#diff-512c6a50b6244efae7b93e599fb1b295377718d298cac465e3f83db3718c4b03 I see many failures of the kind

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_cgemm1m_nn_ccc                100   100   100     1.34   1.97e-08   PASS
blis_cgemm_nn_ccc                  100   100   100     3.64   1.09e-08   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_zgemm1m_nn_ccc                100   100   100    17.96   3.82e-03   FAILURE
blis_zgemm_nn_ccc                  100   100   100     4.16   2.25e-17   PASS

I.e. Only the "z" datatype on the "*1m" is affected.

Using the "generic" configuration works.

@devinamatthews
Copy link
Member

@fgvanzee the issue seems to be missing edge case handling in gemmtrsmbb_ref. Since you fixed the other trsm/gemmtrsm kernels can you fix this?

@fgvanzee
Copy link
Member

I'll take a look at it.

@fgvanzee
Copy link
Member

fgvanzee commented Jul 21, 2022

Uhh, am I missing something, @devinamatthews?

$ ls ref_kernels/3
bli_gemm_ref.c  bli_gemmsup_ref.c  bli_gemmtrsm_ref.c  bli_trsm_ref.c  old

It's not clear what I need to fix.

@devinamatthews
Copy link
Member

devinamatthews commented Jul 21, 2022 via email

@fgvanzee
Copy link
Member

The bb kernels live in some strangebplace.

Oh wait, didn't we fold them into the conventional reference microkernels?

@fgvanzee
Copy link
Member

@devinamatthews Ah, I think I see the problem now. It's not that the gemmtrsm ukernels lack edge-case handling; it's that they haven't been updated according to the latest way that the bb functionality has been folded in. (By contrast, it appears that the gemm ukernel has been updated. So I'll use that as my guide.)

I'll try to work on this more tomorrow.

@devinamatthews
Copy link
Member

@fgvanzee gemmtrsm reference kernels has been updated:

	PASTEMAC(ch,bcastbbs_mxn) \
	( \
	  m, \
	  n, \
	  b11, rs_b, cs_b  \
	); \

etc. @Flamefire did you test the tip of the master branch (currently af3a41e)?

@fgvanzee
Copy link
Member

@devinamatthews Ah, yes, I had previously overlooked this:

    const inc_t rs_b   = packnr; \
    const inc_t cs_b   = bli_cntx_get_blksz_def_dt( dt, BLIS_BBN, cntx ); \

@Flamefire
Copy link
Contributor

Flamefire commented Jul 25, 2022

I successfully tested af3a41e on Power9LE

Do you know which commit fixed that?

@devinamatthews
Copy link
Member

The Power9/10 kernels use a "broadcast packing" format, which messed up a lot of the older code, or rather, required some bespoke code which interacted poorly with [cz]gemm1m. I rewrote all of the reference packing kernels in ae10d94 in part to streamline this issue (note that that commit has a bug which was later fixed in 667f201.

I guess we can close the issue then?

@Flamefire
Copy link
Contributor

Yes, thanks!

@RajalakshmiSR
Copy link

Thanks for fixing this. make check now shows only these failures on POWER9.
Filename: out.zblat3

******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY *******
******* ZHEMM FAILED ON CALL NUMBER:
274: ZHEMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .

ZSYMM PASSED THE TESTS OF ERROR-EXITS

******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY *******
******* ZSYMM FAILED ON CALL NUMBER:
274: ZSYMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .

@Flamefire
Copy link
Contributor

@RajalakshmiSR Maybe try compiling with -fstack-protector-strong when using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.

@devinamatthews
Copy link
Member

@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:

#include "blis.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv)
{
    obj_t A, B, C, a, b, c;

    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &A );
    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &B );
    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &C );

    bli_setm( &BLIS_ONE, &A );
    bli_setm( &BLIS_ONE, &B );
    bli_setm( &BLIS_ZERO, &C );

    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &A ), 1, 2, &a );
    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &B ), 1, 2, &b );
    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &C ), 1, 2, &c );
    bli_obj_set_struc( BLIS_HERMITIAN, &a );
    bli_obj_set_uplo( BLIS_UPPER, &a );

    bli_printm( "before:", &C, "%4.1f", "" );
    
    bli_hemm( BLIS_RIGHT, &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );
    
    bli_printm( "after:", &C, "%4.1f", "" );

    return 0;
}

It should print:

before:
 0.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

after:
 1.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

Any change in the zero values in the "after" matrix would be a problem.

@fgvanzee
Copy link
Member

fgvanzee commented Jul 27, 2022

Not sure if this is relevant, but I just realized this morning that unless @devinamatthews's recent changes (vis-a-vis merging the bb packm functionality into a single reference kernel) extends bb support to triangular packing, any bb-utilizing subconfiguration needs to #define BLIS_DISABLE_TRMM_RIGHT, as is done for power9:

$ grep -R BLIS_DISABLE_TRMM_RIGHT config
./config/old/haswellbb/bli_family_haswell.h:#define BLIS_DISABLE_TRMM_RIGHT
./config/power9/bli_family_power9.h:#define BLIS_DISABLE_TRMM_RIGHT

This macro, as its name suggests, disables right-side trmm, expressing it instead in terms of left-side trmm. This forces the triangular matrix to always be on the left, and therefore a column-preferential gemm microkernel will always be guaranteed to assume broadcast values to exist in the right-hand matrix.

Like power9, power10 also seems to use explicit broadcasting of B during packing, but it fails to set the above macro. That could cause issues for trmm, and possibly non-deterministic behavior elsewhere.

@devinamatthews
Copy link
Member

And hemm/symm too right?

@fgvanzee
Copy link
Member

fgvanzee commented Jul 27, 2022

Yes, that's right. hemm/symm also require their own macros (because the Hermitian and symmetric packing formats also would presumably not support explicit broadcast). And trmm3 (for same reasons as trmm).

@fgvanzee
Copy link
Member

fgvanzee commented Jul 27, 2022

Also just noticed that power10's subconfig registers preferences for complex gemm ukernels without registering any actual complex ukernels. 🤔

void bli_cntx_init_power10( cntx_t* cntx )
{
    blksz_t blkszs[ BLIS_NUM_BLKSZS ];

    // Set default kernel blocksizes and functions.
    bli_cntx_init_power10_ref( cntx );

    // -------------------------------------------------------------------------

    // Update the context with optimized native gemm micro-kernels.
    bli_cntx_set_ukrs
    (
      cntx,

      // level-3
      BLIS_GEMM_UKR, BLIS_FLOAT,  bli_sgemm_power10_mma_8x16,
      BLIS_GEMM_UKR, BLIS_DOUBLE, bli_dgemm_power10_mma_8x8,

      BLIS_VA_END
    );

    // Update the context with storage preferences.
    bli_cntx_set_ukr_prefs
    (
      cntx,

      // level-3
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_FLOAT,    TRUE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_DOUBLE,   TRUE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_SCOMPLEX, FALSE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_DCOMPLEX, FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_FLOAT,    FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_FLOAT,    FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DOUBLE,   FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DOUBLE,   FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,

      BLIS_VA_END
    );

Could be harmless since 1m method code should always read the preference of the real domain ukernel. But nonetheless I think those extra entries should be nixed unless there's something I missing.

EDIT: It would appear that the gemm1m virtual microkernel is doing The Right Thing:tm: by grabbing the preference of the real domain ukernel, so the above is likely not a correctness bug:

    PASTECH(chr,gemm_ukr_ft) \
                      rgemm_ukr = bli_cntx_get_ukr_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
    const bool        col_pref  = bli_cntx_ukr_prefers_cols_dt( dt_r, BLIS_GEMM_UKR, cntx ); \

Although it's still confusing AF to anyone who isn't me and @devinamatthews a core BLIS developer and therefore should be cleaned up.

@fgvanzee
Copy link
Member

Another observation: power10 uses row-preferential ukernels (which use broadcast values of A, not B) but has broadcast-B values defined in config/power10/bli_kernel_defs_power10.h. 🤔

It's a wonder any of this power10 code works at all.

@fgvanzee
Copy link
Member

Okay, I'm going to consult with @nicholaiTukanov, the original author of the power10 subconfig and kernel set, to make sure I understand how things should be registered before I begin suggesting any bugfixes.

@RajalakshmiSR
Copy link

RajalakshmiSR commented Jul 27, 2022

@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:

It should print:

before:
 0.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

after:
 1.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

Any change in the zero values in the "after" matrix would be a problem.

Yes, I could see the same result.

@RajalakshmiSR
Copy link

@RajalakshmiSR Maybe try compiling with -fstack-protector-strong when using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.

Still the same issue. Tried it like CFLAGS="-fstack-protector-strong" ./configure power9

@devinamatthews
Copy link
Member

Yes, I could see the same result.

@RajalakshmiSR strange, that is exactly the test that is supposed to be failing in zblat3. I might write a Fortran version just in case that makes a difference.

@devinamatthews
Copy link
Member

Here's a Fortran version. It can't get much more similar to the failing test.

program main

double complex A(2,2), B(2,2), C(2,2)
double complex alpha, beta

A(1,1) = (1.0,0.0)
A(2,1) = (1.0,0.0)
A(1,2) = (1.0,0.0)
A(2,2) = (1.0,0.0)

B(1,1) = (1.0,0.0)
B(2,1) = (1.0,0.0)
B(1,2) = (1.0,0.0)
B(2,2) = (1.0,0.0)

C(1,1) = (0.0,0.0)
C(2,1) = (0.0,0.0)
C(1,2) = (0.0,0.0)
C(2,2) = (0.0,0.0)

alpha = (1.0,0.0)
beta = (0.0,0.0)

WRITE(*,*)'before:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)

CALL ZHEMM('R','U',1,1,alpha,A,2,B,2,beta,C,2)

WRITE(*,*)'after:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)

END PROGRAM MAIN

@fgvanzee
Copy link
Member

@ivan23kor I fixed what was probably a power10 bug in 5b29893, although I can't be sure how it would have manifested. If you can, please give this commit a try. (We don't have easy access to any Power hardware.)

@fgvanzee
Copy link
Member

Sorry these initial questions weren't answered in a timely manner.

POWER10 build fails because POWER10 microkernels use the old microkernel API without edge handling (addressed in #620).

@ivan23kor Thank you for catching that, and submitting the PR fix.

Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with -s power10. Is it intentional?

For now, yes. That code could be transitioned from being a sandbox to being an addon, but even then you'd have to explicitly request it via configure.

POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default, are there plans to support a fast sgemm microkernel for POWER9?

Not at this time. We don't have any Power expertise or hardware access in our core development group. The collaborator who wrote the power9 dgemm microkernel has moved on to other adventures, although he still stays involved in our community. Care to comment, @nicholaiTukanov?

@RajalakshmiSR
Copy link

We can submit a request in OSUOSL to access POWER9 systems for CI service.
https://osuosl.org/services/powerdev/
As of now, the only optimized fast kernel is dgemm for POWER9. There are no current plans to extend it for POWER9.

@nicholaiTukanov
Copy link
Contributor

Hey @ivan23kor, glad to see someone using and testing the POWER kernels 😄

POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default, are there plans to support a fast sgemm microkernel for POWER9?

Due to other commitments, I will not be able to work on these. However, if one was interested, they could take my dgemm code as a starting point for other real domain kernels.

Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with -s power10. Is it intentional?

This is intentional, since BLIS doesn't support low precision/integer datatypes. Hopefully, with all the changes @devinamatthews and @fgvanzee are making, this won't be the case soon.

@nisanthmp
Copy link
Contributor

Not sure if this is relevant, but I just realized this morning that unless @devinamatthews's recent changes (vis-a-vis merging the bb packm functionality into a single reference kernel) extends bb support to triangular packing, any bb-utilizing subconfiguration needs to #define BLIS_DISABLE_TRMM_RIGHT, as is done for power9:

$ grep -R BLIS_DISABLE_TRMM_RIGHT config
./config/old/haswellbb/bli_family_haswell.h:#define BLIS_DISABLE_TRMM_RIGHT
./config/power9/bli_family_power9.h:#define BLIS_DISABLE_TRMM_RIGHT

This macro, as its name suggests, disables right-side trmm, expressing it instead in terms of left-side trmm. This forces the triangular matrix to always be on the left, and therefore a column-preferential gemm microkernel will always be guaranteed to assume broadcast values to exist in the right-hand matrix.

Like power9, power10 also seems to use explicit broadcasting of B during packing, but it fails to set the above macro. That could cause issues for trmm, and possibly non-deterministic behavior elsewhere.

@fgvanzee
Can you please help me understand, whether disabling right-side 'hmm' and 'symm' was done for performance improvement or functional correctness? (in the case of power9)

@fgvanzee
Copy link
Member

fgvanzee commented Dec 12, 2022

Can you please help me understand, whether disabling right-side 'hmm' and 'symm' was done for performance improvement or functional correctness? (in the case of power9)

Functional correctness (although there are performance considerations lurking within).

We actually don't want to disable right-sided hemm and symm implementations. Having access to both left- and right-sided implementations allows us to efficiently handle mismatches between the storage of matrix C and the IO preference of the gemm microkernel -- via a transposition of the entire operation. (In those cases, in order to satisfy the microkernel preference, a left-sided case executes via the right-sided code path, or vice versa.)

But the right-sided hemm and symm code paths imply that "matrix B" (that is, the matrix on the right) is the one with Hermitian or symmetric structure. The power9 kernel requires the packed micropanels from "matrix B" (again, the matrix on the right) to be duplicated. And, for now, BLIS does not support packing a Hermitian/symmetric matrix to a buffer while also duplicating the elements.

So, what are the consequences of disabling right-sided hemm and symm? It means that all cases, both left and right, flow through the left-sided code path. (Notice that a right-sided case can be reexpressed as a left-sided case by transposing the entire operation -- the same technique we typically reserve for reconciling storage-of-C / IO-preference mismatches.) This means we can't always guarantee that the storage of C matches the IO preference of the microkernel, which means the microkernel's element-wise IO branch will sometimes execute (which is slower).

One way to mitigate this performance issue would be to add branches to the power9 microkernel that perform in-register transposes when there is a mismatch so that vector loads/stores could still be used to update C (rather than doing it element-by-element), but this in-register transpose is sometimes expensive and will only lessen (but never eliminate) the performance drag of calling the gemm microkernel on C with an unsatisfied IO preference.

@nisanthmp
Copy link
Contributor

Thanks @fgvanzee for the explanation.

With regards to the current HEMM/SYMM failures (power9, BLAS tests): Since row and column strides of C are also swapped, in the case of inducing transpose when hermitian/symmetric matrix B multiplies A from the right, 1m optimisation's row and column strides (of C) calculations are going wrong. This happens in the case of BLIS_DISABLE_HEMM_RIGHT (or BLIS_DISABLE_SYMM_RIGHT) are defined. So checked after removing these macros in power9 case, and the tests are passing.

The other option is to fix the 1m optimisation's row and column strides (of C) calculations, while retaining BLIS_DISABLE_HEMM_RIGHT and BLIS_DISABLE_SYMM_RIGHT macros.

Please suggest the optimal path to take.

@fgvanzee
Copy link
Member

@nisanthmp Thanks for that update. I don't quite understand the issue you described with 1m and the row/column strides of C. Is this something I would be able to reproduce on an Intel system (where the microkernel does not need to duplicate/broadcast elements of B during packing)?

Stepping through a concrete example might also help.

@nisanthmp
Copy link
Contributor

nisanthmp commented Dec 14, 2022

@fgvanzee I'm trying to find an Intel based system, where I can reproduce the error. I need the 1m optimisations enabled while BLIS_DISABLE_HEMM_RIGHT defined. Is there a way to enable 1m optimisations on any Intel based system?

On power9 based system, I was able to reproduce the error with the code snippet, pasted below:

#include<stdio.h>
#include "blis.h"

#define DIM 1
#define LDM 2
int main() {

dcomplex a[LDM * DIM] = { {-0.412587, -10000000000.000000}, {-10000000000.000000, 10000000000.000000} };
dcomplex b[LDM * DIM] = { {0.336663, 0.050949}, {-10000000000.000000, 10000000000.000000} };
dcomplex c[LDM * DIM];
int I, J, M, N, lda, ldb, ldc;
dcomplex alpha, beta;

M = DIM;
N = M;
lda = 2;
ldb = 2;
ldc = 2;
alpha.real = 1.0; alpha.imag = 0.0;
beta.real = 0.0; beta.imag = 0.0;

printf("a = \n");
for ( I = 0; I < LDM; I ++ ) {
	for ( J = 0; J < N; J ++ ) {
		printf("%lf + i%lf\t", a[J * LDM + I].real, a[J * LDM + I].imag);
	}
	printf("\n");
}
printf("b = \n");
for ( I = 0; I < LDM; I ++ ) {
	for ( J = 0; J < N; J ++ ) {
		printf("%lf + i%lf\t", b[J * LDM + I].real, b[J * LDM + I].imag);
	}
	printf("\n");
}

zhemm_("R","U",&M,&N,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);

printf("c = \n");
for ( I = 0; I < LDM; I ++ ) {
	for ( J = 0; J < N; J ++ ) {
		printf("%lf + i%lf\t", c[J * LDM + I].real, c[J * LDM + I].imag);
	}
	printf("\n");
}

return 0;
}

@nisanthmp
Copy link
Contributor

I will try to explain what I was trying to say with regards to 1m related optimisations and row and column strides of C:

When I run the above code, with 1m optimisations and BLIS_DISABLE_HEMM_RIGHT defined, I get the following output.

a =
-0.412587 + i-10000000000.000000
-10000000000.000000 + i10000000000.000000
b =
0.336663 + i0.050949
-10000000000.000000 + i10000000000.000000
Before 1m specific optimization:
dt_exec: 3, dt_c: 3, m: 1, n: 1, k: 1, pd_a: 6, ps_a: 12, pd_b: 6, ps_b: 12, rs_c: 2, cs_c: 1
After 1m specific optimization:
dt_exec: 2, dt_c: 2, m: 2, n: 1, k: 2, pd_a: 12, ps_a: 24, pd_b: 6, ps_b: 24, rs_c: 2, cs_c: 2
c =
-0.138903 + i0.000000
-0.021021 + i0.000000


The prints are in "frame/3/gemm/bli_gemm_ker_var2.c", around the 1m specific optimisations to call real domain macro-kernel instead of 1m virtual micro-kernel:

    // If 1m is being employed on a column- or row-stored matrix with a
    // real-valued beta, we can use the real domain macro-kernel, which
    // eliminates a little overhead associated with the 1m virtual
    // micro-kernel.
    // Only employ this optimization if the storage datatype of C is
    // equal to the execution/computation datatype.
#if 1
    printf("Before 1m specific optimization:\n");
    printf("dt_exec: %d, dt_c: %d, m: %d, n: %d, k: %d, pd_a: %d, ps_a: %d, pd_b: %d, ps_b: %d, rs_c: %d, cs_c: %d\n", dt_exec, dt_c, m, n, k, pd_a, ps_a, pd_b, ps_b, rs_c, cs_c);
    if ( bli_cntx_method( cntx ) == BLIS_1M )
    {
            bli_gemm_ind_recast_1m_params
            (
              &dt_exec,
              &dt_c,
              schema_a,
              c,
              &m, &n, &k,
              &pd_a, &ps_a,
              &pd_b, &ps_b,
              &rs_c, &cs_c
            );
            printf("After 1m specific optimization:\n");
            printf("dt_exec: %d, dt_c: %d, m: %d, n: %d, k: %d, pd_a: %d, ps_a: %d, pd_b: %d, ps_b: %d, rs_c: %d, cs_c: %d\n", dt_exec, dt_c, m, n, k, pd_a, ps_a, pd_b, ps_b, rs_c, cs_c);

    }
#endif

As you can see in the printed output, before calling "bli_gemm_ind_recast_1m_params()", row and column strides of C are rs_c: 2, cs_c: 1, and afterwards, they are rs_c: 2, cs_c: 2.

if the values of rs_c and cs_c were 1 and 4 respectively, the computation would produce correct results. That's why, I'm suspecting that the row and column strides calculations for C are going wrong.

Hope I'm able to explain this properly.

@nisanthmp
Copy link
Contributor

@fgvanzee I could find an Intel system to reproduce this. The same test code listed above was run on a Xeon 8380 (Ice Lake) system, with BLIS_DISABLE_HEMM_RIGHT defined.

@fgvanzee
Copy link
Member

@nisanthmp Thank you for all of this information. I will start trying to reproduce on my end.

@fgvanzee
Copy link
Member

@nisanthmp I tried to use your code to reproduce the error as follows:

  1. I added #define BLIS_DISABLE_HEMM_RIGHT to config/haswell/bli_family_haswell.h.
  2. I configured BLIS with auto, which chose the haswell subconfig on my Intel desktop.
  3. make
  4. I compiled the test driver code you provided and ran it. It gave me this output, which I think is what it is supposed to give:
a = 
-0.412587 + i-10000000000.000000	
-10000000000.000000 + i10000000000.000000	
b = 
0.336663 + i0.050949	
-10000000000.000000 + i10000000000.000000	
Before 1m specific optimization:
dt_exec: 3, dt_c: 3, m: 1, n: 1, k: 1, pd_a: 6, ps_a: 6, pd_b: 4, ps_b: 8, rs_c: 2, cs_c: 1
After 1m specific optimization:
dt_exec: 2, dt_c: 2, m: 1, n: 2, k: 2, pd_a: 6, ps_a: 12, pd_b: 8, ps_b: 16, rs_c: 4, cs_c: 1
c = 
-0.138903 + i-0.021021	
0.000000 + i0.000000

Did I miss anything? Perhaps you could give further details on how you configured and built for your Intel Xeon 8380.

@nisanthmp
Copy link
Contributor

nisanthmp commented Dec 15, 2022

@fgvanzee I was also not able to reproduce the error on all intel systems. I could do it only on the Ice Lake (Xeon 8380) based system, out of the available Intel systems that I have access to. The only difference in configuration was that, I configured for "x86_64" instead of "auto". Other than that, everything looks the same.

I tried with "configure auto" also (on Xeon 8380). That way it chose "skx", so added "#define BLIS_DISABLE_HEMM_RIGHT" in "config/skx/bli_family_skx.h " and the error is still there.

@fgvanzee
Copy link
Member

fgvanzee commented Dec 15, 2022

I tried with "configure auto" also (on Xeon 8380). That way it chose "skx", so added "#define BLIS_DISABLE_HEMM_RIGHT" in "config/skx/bli_family_skx.h " and the error is still there.

@nisanthmp Ahhh, that helps (knowing the exact subconfig it chose)! Thanks for your reply. I'll take a closer look at skx and see if I can find anything out. (I might also have a SkylakeX system to test on, which, while not the same as Ice Lake should be comparable since it also uses the skx subconfig.)

EDIT: Could you confirm which version/commit of BLIS you are using?

@fgvanzee
Copy link
Member

fgvanzee commented Dec 15, 2022

@nisanthmp I was able to reproduce the error on my haswell-based system by:

  1. Switching to column-preferential microkernels in bli_cntx_init_haswell.c. (The skx subconfig uses kernels that prefer IO by columns.)
  2. Disabling registration of all c and z kernels and blocksizes in bli_cntx_init_haswell.c. (The skx subconfig only registers real-domain kernels/values.) This forces use of the 1m implementation for complex datatypes.
  3. Adding #define BLIS_DISABLE_HEMM_RIGHT to bli_family_haswell.h. (The skx subconfig does not disable any right-sided operations by default.)
  4. Running your test driver code.

I'll continue to investigate and get back to you! Thank you for your patience and diligent bug reports.

fgvanzee added a commit that referenced this issue Dec 16, 2022
Details:
- Fixed a bug in right-sided hemm when:
  - using the 1m method
  - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration
  - the storage of C matches the gemm microkernel IO preference PRIOR to
    the right-sidedness being detected and recast in terms of the left
    side code path.
  It turns out that bli_gemm_ind_recast_1m_params() was applying its
  optimization (recasting a complex-domain macrokernel calling a 1m
  virtual microkernel to a real-domain macrokernel calling the real-
  domain microkernel) in situations in which it should not have. The
  optimization was silently assuming that the storage of C always
  matched that of the microkernel preference, since the front-end
  would have already had a chance to transpose the operation to bring
  the two into agreement. However, by disabling right-sided hemm, we
  deprive BLIS of that flexiblity, and thus suddenly the assumption was
  no longer holding in all cases. Thanks to Nisanth M P for reporting
  this bug in Issue #621.
- The original bug, and this bugfix, also extend to symm when
  BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.
@fgvanzee
Copy link
Member

fgvanzee commented Dec 16, 2022

@nisanthmp #697 is the best I can do at fixing the problem. Basically, we have to avoid the 1m-specific optimization if BLIS_DISABLE_HEMM_RIGHT is defined (and if that results in the storage of C mismatching the gemm microkernel IO preference). Please confirm the fix on your end, and I'll merge it shortly thereafter.

Thanks for your feedback on this!

@nisanthmp
Copy link
Contributor

@fgvanzee I checked #697 on a power9 system and it's working fine, please merge. Thank you for prioritising this and fixing it!

fgvanzee added a commit that referenced this issue Dec 16, 2022
Details:
- Fixed a bug in right-sided hemm when:
  - using the 1m method,
  - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration,
    and
  - the storage of C matches the gemm microkernel IO preference PRIOR to
    the right-sidedness being detected and recast in terms of the left-
    side code path.
  It turns out that bli_gemm_ind_recast_1m_params() was applying its
  optimization (recasting a complex-domain macrokernel calling a 1m
  virtual microkernel to a real-domain macrokernel calling the real-
  domain microkernel) in situations in which it should not have. The
  optimization was silently assuming that the storage of C always
  matched that of the microkernel preference, since the front-end (in
  this case, bli_hemm_front()) would have already had a chance to 
  transpose the operation to bring the two into agreement. However, by 
  disabling right-sided hemm, we deprive BLIS of that flexibility (as a
  transposed left-sided case would necessarily have to become a right-
  sided case), and thus the assumption was no longer holding in all 
  cases. Thanks to Nisanth M P for reporting this bug in Issue #621.
- The aforementioned bug, and its bugfix, also apply to symm when
  BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.
@nisanthmp
Copy link
Contributor

@fgvanzee, Shall we please close this issue now?

@fgvanzee
Copy link
Member

fgvanzee commented Jan 4, 2023

Certainly. Thank you again for your patience and feedback on this issue, @nisanthmp!

@fgvanzee fgvanzee closed this as completed Jan 4, 2023
fgvanzee added a commit that referenced this issue May 20, 2024
Details:
- Fixed a bug in right-sided hemm when:
  - using the 1m method,
  - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration,
    and
  - the storage of C matches the gemm microkernel IO preference PRIOR to
    the right-sidedness being detected and recast in terms of the left-
    side code path.
  It turns out that bli_gemm_ind_recast_1m_params() was applying its
  optimization (recasting a complex-domain macrokernel calling a 1m
  virtual microkernel to a real-domain macrokernel calling the real-
  domain microkernel) in situations in which it should not have. The
  optimization was silently assuming that the storage of C always
  matched that of the microkernel preference, since the front-end (in
  this case, bli_hemm_front()) would have already had a chance to
  transpose the operation to bring the two into agreement. However, by
  disabling right-sided hemm, we deprive BLIS of that flexibility (as a
  transposed left-sided case would necessarily have to become a right-
  sided case), and thus the assumption was no longer holding in all
  cases. Thanks to Nisanth M P for reporting this bug in Issue #621.
- The aforementioned bug, and its bugfix, also apply to symm when
  BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.
- (cherry picked from commit 3accacf)

Fixed perf of mt sup with packing, and mt gemmlike. (#696)

Details:
- Brought the gemmsup code path up to date relative to the latest
  thrinfo_t semantics introduced in the October Omnibus commit
  (aeb5f0c). This was done by passing the prenode (instead of the
  current node) into the packm variant within bli_l3_sup_packm.c as well
  as creating the prenodes and attaching them to the thrinfo_t tree in
  bli_l3_sup_thrinfo_create(). These changes erase the performance
  degradation introduced in the omnibus when running multithreaded sup
  with optional packing enabled. Special thanks to Devin Matthews for
  sussing out this fix in short order.
- Fixed the gemmlike sandbox in a manner similar to that of sup with
  packing, described above. This also involved passing the prenode into
  the local gemmlike packm variant. (Recall that gemmlike recycles the
  use of bli_l3_sup_thrinfo_create(), so it automatically inherits that
  part of the sup fix described above.)
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and
  bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and
  bli_thrinfo_thread_id(), respectively.
- (cherry picked from 4833ba2)

Fixed _gemm_small() prototype; disabled gemm_small.

Details:
- Fixed a mismatch between the prototype for bli_gemm_small() in
  bli_gemm_front.h and the actual definition of bli_gemm_small() in
  kernels/zen/3/bli_gemm_small.c. The former was erroneously declaring
  the cntl_t* argument as 'const'. Thanks to Jeff Diamond for reporting
  this issue.
- Commented out BLIS_ENABLE_SMALL_MATRIX, BLIS_ENABLE_SMALL_MATRIX_TRSM
  macro definitions in config/zen3/bli_family_zen3.h. AMD's small matrix
  implementation should probably remain disabled in vanilla BLIS, at
  least for now.
- (cherry picked from db10dd8)

Trival whitespace/comment tweaks.

Details:
- Trivial whitespace and comment changes, most of which ideally would
  have been part of the previous commit pertaining to HPX (2b05948).
- (cherry picked from f0337b7)

blis support for hpx (#682)

- Implement threading backend via HPX.
- HPX is an asynchronous many task runtime system used in high
  performance computing applications. The runtime implements the
  ISO C++ parallelism specification and provides a user-space
  thread implementation.
- This PR provides BLIS a thread backend implementation using HPX
  and resolves feature request #681. The configuration script,
  makefiles, and testsuite have been updated to support an HPX
  build option. The addition of HPX support provides other
  developers an exemplar for integrating other C++ threading
  backends into BLIS.
- (cherry picked from 2b05948)

Fixed subtle barrier_fpa bug in bli_thrcomm.c. (#690)

Details:
- In bli_thrcommo.c, correctly initialize the BLIS_OPENMP element of the
  barrier function pointer array (barrier_fpa) to NULL when
  BLIS_ENABLE_OPENMP is *not* defined. Similarly, initialize the
  BLIS_POSIX element of barrier_fpa to NULL when BLIS_ENABLE_PTHREADS is
  not enabled. This bug was introduced in a1a5a9b and was likely the
  result of an incomplete edit. The effects of the bug would have
  likely manifested when querying a thrcomm_t that was initialized with
  a timpl_t value corresponding to a threading implementation that was
  omitted from the -t option at configure-time.
- (cherry picked from e1ea25d)

Enhance emacs formatting of C files to remove trailing whitespace and ensure
 a newline at the end of file
- (cherry picked from dc6e5f3)

Delete mpi_test garbage. (#689)

Details:
- tlrmchlsmth: "What even is this? No comments, no commit message, not
  used by anything. Trash."
- (cherry picked from 713d078)

Some decluttering of the top-level directory.

Details:
- Relocated 'mpi_test' directory to test/mpi_test.
- Relocated 'so_version' and 'version' files from top-level directory to
  'build' directory.
- Updated build/bump-version.sh script to accommodate relocation of
  'version' file to 'build' directory.
- Updated configure script to accommodate relocation of 'so_version'
  file to 'build' directory.
- Updated INSTALL file to replace pointers to blis-devel mailing list
  with a pointer to docs/Discord.md.
- Updated RELEASING file to contain a reminder to consider whether the
  so_version file should be updated prior to the release.
- (cherry picked from 8d813f7)

Fix typo in configure --help text. (#686)

Details:
- Fixed a misspelling in the --help description for the --int-size (-i)
  configure option.
- (cherry picked from 6774bf0)

Support --nosup, --sup configure options. (#684)

Details:
- Added --nosup and --sup as alternative ways of requesting that sup be
  disabled or enabled. These are analagous to --disable-sup-handling and
  --enable-sup-handling, respectively. (I got tired of typing out
  --disable-sup-handling and needed a shorthand notation.)
- Tweaked message output by configure when sup is enable/disabled for
  clarity and specificity.
- Whitespace changes.
- (cherry picked from edcc2f9)

Add mention of Wilkinson Prize to README.md. (#683)

Details:
- Added blurbs and links to Wilkinson Prize to README.md.
- Added mention of both Best Paper and Wilkinson Prizes to the top of
  README.md.
- Other minor tweaks.
- (cherry picked from 5eea6ad)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants