Tests fail on POWER machines #621
Comments
@ivan23kor can QEMU be used to test POWER 7/8/9/10? If so what are the necessary flags so that we can add Travis CI tests? |
@devinamatthews unfortunately I don't know about QEMU as I am running on actual hardware. |
Looks like this is happening on POWER9 after ee9ff98 This is the stack trace : So there is no special TRSM kernel in the power9 directory. I could see only specialized SGEMM and DGEMM kernel for POWER9. |
Having done a Further investigation points at missing modification of After changing that file similar to ee9ff98#diff-512c6a50b6244efae7b93e599fb1b295377718d298cac465e3f83db3718c4b03 I see many failures of the kind
I.e. Only the "z" datatype on the "*1m" is affected. Using the "generic" configuration works. |
@fgvanzee the issue seems to be missing edge case handling in |
I'll take a look at it. |
Uhh, am I missing something, @devinamatthews?
It's not clear what I need to fix. |
The bb kernels live in some strange place.
$ ls ref_kernels/3
bli_gemm_ref.c bli_gemmsup_ref.c bli_gemmtrsm_ref.c bli_trsm_ref.c old
|
Oh wait, didn't we fold them into the conventional reference microkernels? |
@devinamatthews Ah, I think I see the problem now. It's not that the I'll try to work on this more tomorrow. |
@fgvanzee
PASTEMAC(ch,bcastbbs_mxn) \
( \
m, \
n, \
b11, rs_b, cs_b \
); \
etc. @Flamefire did you test the tip of the |
@devinamatthews Ah, yes, I had previously overlooked this:
const inc_t rs_b = packnr; \
const inc_t cs_b = bli_cntx_get_blksz_def_dt( dt, BLIS_BBN, cntx ); \ |
I successfully tested af3a41e on Power9LE Do you know which commit fixed that? |
The Power9/10 kernels use a "broadcast packing" format, which messed up a lot of the older code, or rather, required some bespoke code which interacted poorly with I guess we can close the issue then? |
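To make the "broadcast packing" idea mentioned above concrete, here is a minimal sketch in plain C. It is not the BLIS packing code: the helper name pack_row_bcast and the duplication factor bb are hypothetical, chosen only for illustration. Each element of a row of B is written bb times in a row, so a kernel can load an already-broadcast value directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): pack one row of n elements
   of B into p, writing each element bb times consecutively. The packed
   row occupies n * bb doubles, so element j starts at offset j * bb. */
static void pack_row_bcast( size_t n, size_t bb, const double* b, double* p )
{
    assert( bb >= 1 );
    for ( size_t j = 0; j < n; ++j )
        for ( size_t d = 0; d < bb; ++d )
            p[ j * bb + d ] = b[ j ];
}
```

With bb = 2, the row {1, 2, 3} packs to {1, 1, 2, 2, 3, 3}. Any code that walks the packed buffer has to step by bb between logical columns, which is consistent with the cs_b line quoted earlier in the thread taking its value from the BLIS_BBN blocksize.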
Yes, thanks! |
Thanks for fixing this.
******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY *******
ZSYMM PASSED THE TESTS OF ERROR-EXITS
******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY ******* |
@RajalakshmiSR Maybe try compiling with |
@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:
#include "blis.h"
#include <stdio.h>
#include <string.h>
int main(int argc, char** argv)
{
obj_t A, B, C, a, b, c;
bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &A );
bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &B );
bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &C );
bli_setm( &BLIS_ONE, &A );
bli_setm( &BLIS_ONE, &B );
bli_setm( &BLIS_ZERO, &C );
bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &A ), 1, 2, &a );
bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &B ), 1, 2, &b );
bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &C ), 1, 2, &c );
bli_obj_set_struc( BLIS_HERMITIAN, &a );
bli_obj_set_uplo( BLIS_UPPER, &a );
bli_printm( "before:", &C, "%4.1f", "" );
bli_hemm( BLIS_RIGHT, &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );
bli_printm( "after:", &C, "%4.1f", "" );
return 0;
}
It should print:
Any change in the zero values in the "after" matrix would be a problem. |
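The check described above (nothing outside the 1x1 target may change) can also be automated instead of being read off the printed matrix. The following is a minimal sketch in plain C that does not call BLIS. The helper count_changed_outside is hypothetical, and for brevity it works on real-valued, column-major buffers; a double-complex matrix could be checked the same way by treating each entry as two doubles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count entries of an M x N column-major buffer
   (leading dimension ld) that differ between `before` and `after`,
   ignoring the m x n top-left block the operation is allowed to write.
   memcmp is used so that any bitwise change is detected. */
static size_t count_changed_outside( size_t M, size_t N, size_t ld,
                                     size_t m, size_t n,
                                     const double* before,
                                     const double* after )
{
    assert( m <= M && n <= N && ld >= M );
    size_t changed = 0;
    for ( size_t j = 0; j < N; ++j )
        for ( size_t i = 0; i < M; ++i )
        {
            if ( i < m && j < n ) continue; /* inside the target block */
            if ( memcmp( &before[ i + j*ld ], &after[ i + j*ld ],
                         sizeof( double ) ) != 0 )
                ++changed;
        }
    return changed;
}
```

A snapshot of C taken before the call and compared afterwards with M = N = 2 and m = n = 1 should report zero changed entries if the operation stays within bounds.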
Not sure if this is relevant, but I just realized this morning that unless @devinamatthews's recent changes (vis-a-vis merging the bb
This macro, as its name suggests, disables right-side Like |
And hemm/symm too right? |
Yes, that's right. |
Also just noticed that
void bli_cntx_init_power10( cntx_t* cntx )
{
blksz_t blkszs[ BLIS_NUM_BLKSZS ];
// Set default kernel blocksizes and functions.
bli_cntx_init_power10_ref( cntx );
// -------------------------------------------------------------------------
// Update the context with optimized native gemm micro-kernels.
bli_cntx_set_ukrs
(
cntx,
// level-3
BLIS_GEMM_UKR, BLIS_FLOAT, bli_sgemm_power10_mma_8x16,
BLIS_GEMM_UKR, BLIS_DOUBLE, bli_dgemm_power10_mma_8x8,
BLIS_VA_END
);
// Update the context with storage preferences.
bli_cntx_set_ukr_prefs
(
cntx,
// level-3
BLIS_GEMM_UKR_ROW_PREF, BLIS_FLOAT, TRUE,
BLIS_GEMM_UKR_ROW_PREF, BLIS_DOUBLE, TRUE,
BLIS_GEMM_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
BLIS_GEMM_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_FLOAT, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_FLOAT, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DOUBLE, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DOUBLE, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
BLIS_VA_END
);
Could be harmless since 1m method code should always read the preference of the real domain ukernel. But nonetheless I think those extra entries should be nixed unless there's something I'm missing. EDIT: It would appear that the
PASTECH(chr,gemm_ukr_ft) \
rgemm_ukr = bli_cntx_get_ukr_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
const bool col_pref = bli_cntx_ukr_prefers_cols_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
Although it's still confusing AF to anyone who isn't |
Another observation: It's a wonder any of this |
Okay, I'm going to consult with @nicholaiTukanov, the original author of the |
Yes, I could see the same result. |
Still the same issue. Tried it like |
@RajalakshmiSR strange, that is exactly the test that is supposed to be failing in zblat3. I might write a Fortran version just in case that makes a difference. |
Here's a Fortran version. It can't get much more similar to the failing test.
program main
double complex A(2,2), B(2,2), C(2,2)
double complex alpha, beta
A(1,1) = (1.0,0.0)
A(2,1) = (1.0,0.0)
A(1,2) = (1.0,0.0)
A(2,2) = (1.0,0.0)
B(1,1) = (1.0,0.0)
B(2,1) = (1.0,0.0)
B(1,2) = (1.0,0.0)
B(2,2) = (1.0,0.0)
C(1,1) = (0.0,0.0)
C(2,1) = (0.0,0.0)
C(1,2) = (0.0,0.0)
C(2,2) = (0.0,0.0)
alpha = (1.0,0.0)
beta = (0.0,0.0)
WRITE(*,*)'before:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)
CALL ZHEMM('R','U',1,1,alpha,A,2,B,2,beta,C,2)
WRITE(*,*)'after:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)
END PROGRAM MAIN |
@ivan23kor I fixed what was probably a |
Sorry these initial questions weren't answered in a timely manner.
@ivan23kor Thank you for catching that, and submitting the PR fix.
For now, yes. That code could be transitioned from being a sandbox to being an addon, but even then you'd have to explicitly request it via
Not at this time. We don't have any Power expertise or hardware access in our core development group. The collaborator who wrote the |
We can submit a request in OSUOSL to access POWER9 systems for CI service. |
Hey @ivan23kor, glad to see someone using and testing the POWER kernels 😄
Due to other commitments, I will not be able to work on these. However, if one was interested, they could take my dgemm code as a starting point for other real domain kernels.
This is intentional, since BLIS doesn't support low precision/integer datatypes. Hopefully, with all the changes @devinamatthews and @fgvanzee are making, this won't be the case soon. |
@fgvanzee |
Functional correctness (although there are performance considerations lurking within). We actually don't want to disable right-sided But the right-sided So, what are the consequences of disabling right-sided One way to mitigate this performance issue would be to add branches to the power9 microkernel that perform in-register transposes when there is a mismatch so that vector loads/stores could still be used to update C (rather than doing it element-by-element), but this in-register transpose is sometimes expensive and will only lessen (but never eliminate) the performance drag of calling the |
Thanks @fgvanzee for the explanation. Regarding the current HEMM/SYMM failures (power9, BLAS tests): when a transpose is induced because the Hermitian/symmetric matrix B multiplies A from the right, the row and column strides of C are swapped as well, and the 1m optimisation's calculation of the row and column strides of C goes wrong. This happens when BLIS_DISABLE_HEMM_RIGHT (or BLIS_DISABLE_SYMM_RIGHT) is defined. I checked by removing these macros in the power9 case, and the tests pass. Please suggest the optimal path to take. |
@nisanthmp Thanks for that update. I don't quite understand the issue you described with 1m and the row/column strides of C. Is this something I would be able to reproduce on an Intel system (where the microkernel does not need to duplicate/broadcast elements of B during packing)? Stepping through a concrete example might also help. |
@fgvanzee I'm trying to find an Intel based system, where I can reproduce the error. I need the 1m optimisations enabled while BLIS_DISABLE_HEMM_RIGHT defined. Is there a way to enable 1m optimisations on any Intel based system? On power9 based system, I was able to reproduce the error with the code snippet, pasted below:
|
I will try to explain what I was trying to say with regards to 1m related optimisations and row and column strides of C: When I run the above code, with 1m optimisations and BLIS_DISABLE_HEMM_RIGHT defined, I get the following output. a = The prints are in "frame/3/gemm/bli_gemm_ker_var2.c", around the 1m specific optimisations to call real domain macro-kernel instead of 1m virtual micro-kernel:
As you can see in the printed output, before calling "bli_gemm_ind_recast_1m_params()", row and column strides of C are rs_c: 2, cs_c: 1, and afterwards, they are rs_c: 2, cs_c: 2. If the values of rs_c and cs_c were 1 and 4 respectively, the computation would produce correct results. That is why I suspect that the row and column stride calculations for C are going wrong. I hope this explains it properly. |
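The stride values quoted above can be checked against the general rule for viewing interleaved complex storage as real storage. The sketch below is not the body of bli_gemm_ind_recast_1m_params(); the helper name real_view_strides is hypothetical, and the rule it encodes is only an assumption about what such a recast is meant to compute. Since a complex element is two consecutive reals, only the non-unit stride doubles: a column-stored complex matrix becomes a real matrix with twice as many rows, and a row-stored one becomes a real matrix with twice as many columns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the strides of a complex matrix (in units
   of complex elements), compute the strides of the same memory viewed
   as a real matrix (in units of real elements). Returns 1 on success,
   or 0 if neither stride is unit, in which case no such view exists. */
static int real_view_strides( long rs_c, long cs_c, long* rs_r, long* cs_r )
{
    assert( rs_r != NULL && cs_r != NULL );
    if ( rs_c == 1 ) { *rs_r = 1;        *cs_r = 2 * cs_c; return 1; }
    if ( cs_c == 1 ) { *rs_r = 2 * rs_c; *cs_r = 1;        return 1; }
    return 0;
}
```

Under this rule, strides (rs_c, cs_c) = (1, 2) map to (1, 4), which matches the "1 and 4" named in the comment as the values that would give correct results, and (2, 1) maps to (4, 1). The reported result (2, 2) has no unit stride, so it is not a valid real-domain view under either branch.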
@fgvanzee I could find an Intel system to reproduce this. The same test code listed above was run on a Xeon 8380 (Ice Lake) system, with BLIS_DISABLE_HEMM_RIGHT defined. |
@nisanthmp Thank you for all of this information. I will start trying to reproduce on my end. |
@nisanthmp I tried to use your code to reproduce the error as follows:
Did I miss anything? Perhaps you could give further details on how you configured and built for your Intel Xeon 8380. |
@fgvanzee I was also not able to reproduce the error on all Intel systems. Of the Intel systems I have access to, I could reproduce it only on the Ice Lake (Xeon 8380) based system. The only difference in configuration was that I configured for "x86_64" instead of "auto". Other than that, everything looks the same. I also tried "configure auto" on the Xeon 8380. That way it chose "skx", so I added "#define BLIS_DISABLE_HEMM_RIGHT" in "config/skx/bli_family_skx.h" and the error is still there. |
@nisanthmp Ahhh, that helps (knowing the exact subconfig it chose)! Thanks for your reply. I'll take a closer look at
|
@nisanthmp I was able to reproduce the error on my
I'll continue to investigate and get back to you! Thank you for your patience and diligent bug reports. |
Details:
- Fixed a bug in right-sided hemm when:
  - using the 1m method
  - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration
  - the storage of C matches the gemm microkernel IO preference PRIOR to the right-sidedness being detected and recast in terms of the left side code path.
  It turns out that bli_gemm_ind_recast_1m_params() was applying its optimization (recasting a complex-domain macrokernel calling a 1m virtual microkernel to a real-domain macrokernel calling the real-domain microkernel) in situations in which it should not have. The optimization was silently assuming that the storage of C always matched that of the microkernel preference, since the front-end would have already had a chance to transpose the operation to bring the two into agreement. However, by disabling right-sided hemm, we deprive BLIS of that flexibility, and thus suddenly the assumption was no longer holding in all cases. Thanks to Nisanth M P for reporting this bug in Issue #621.
- The original bug, and this bugfix, also extend to symm when BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.
@nisanthmp #697 is the best I can do at fixing the problem. Basically, we have to avoid the 1m-specific optimization if Thanks for your feedback on this! |
Details:
- Fixed a bug in right-sided hemm when:
  - using the 1m method,
  - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration, and
  - the storage of C matches the gemm microkernel IO preference PRIOR to the right-sidedness being detected and recast in terms of the left-side code path.
  It turns out that bli_gemm_ind_recast_1m_params() was applying its optimization (recasting a complex-domain macrokernel calling a 1m virtual microkernel to a real-domain macrokernel calling the real-domain microkernel) in situations in which it should not have. The optimization was silently assuming that the storage of C always matched that of the microkernel preference, since the front-end (in this case, bli_hemm_front()) would have already had a chance to transpose the operation to bring the two into agreement. However, by disabling right-sided hemm, we deprive BLIS of that flexibility (as a transposed left-sided case would necessarily have to become a right-sided case), and thus the assumption was no longer holding in all cases. Thanks to Nisanth M P for reporting this bug in Issue #621.
- The aforementioned bug, and its bugfix, also apply to symm when BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.
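The condition described in the commit message can be pictured as a simple guard. This is a minimal sketch, not the code from the actual fix: the function name may_recast_to_real_domain and its parameters are hypothetical, and the real change may be structured differently. It only shows the idea of applying the real-domain recast when the storage of C agrees with the microkernel's preference and falling back otherwise.

```c
#include <assert.h>

/* Hypothetical guard: return 1 if the 1m real-domain recast may be
   applied, i.e. if the storage of C (given by its strides) agrees with
   the microkernel's IO preference. ukr_prefers_rows is nonzero for a
   row-preferring microkernel and zero for a column-preferring one. */
static int may_recast_to_real_domain( long rs_c, long cs_c, int ukr_prefers_rows )
{
    assert( rs_c >= 1 && cs_c >= 1 );
    const int row_stored = ( cs_c == 1 );
    const int col_stored = ( rs_c == 1 );
    return ukr_prefers_rows ? row_stored : col_stored;
}
```

When the guard returns 0, the caller would keep the complex-domain macrokernel with the 1m virtual microkernel, trading some performance for correctness.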
@fgvanzee, Shall we please close this issue now? |
Certainly. Thank you again for your patience and feedback on this issue, @nisanthmp! |
Details:
- Fixed a bug in right-sided hemm when:
  - using the 1m method,
  - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration, and
  - the storage of C matches the gemm microkernel IO preference PRIOR to the right-sidedness being detected and recast in terms of the left-side code path.
  It turns out that bli_gemm_ind_recast_1m_params() was applying its optimization (recasting a complex-domain macrokernel calling a 1m virtual microkernel to a real-domain macrokernel calling the real-domain microkernel) in situations in which it should not have. The optimization was silently assuming that the storage of C always matched that of the microkernel preference, since the front-end (in this case, bli_hemm_front()) would have already had a chance to transpose the operation to bring the two into agreement. However, by disabling right-sided hemm, we deprive BLIS of that flexibility (as a transposed left-sided case would necessarily have to become a right-sided case), and thus the assumption was no longer holding in all cases. Thanks to Nisanth M P for reporting this bug in Issue #621.
- The aforementioned bug, and its bugfix, also apply to symm when BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.
- (cherry picked from commit 3accacf)

Fixed perf of mt sup with packing, and mt gemmlike. (#696)
Details:
- Brought the gemmsup code path up to date relative to the latest thrinfo_t semantics introduced in the October Omnibus commit (aeb5f0c). This was done by passing the prenode (instead of the current node) into the packm variant within bli_l3_sup_packm.c as well as creating the prenodes and attaching them to the thrinfo_t tree in bli_l3_sup_thrinfo_create(). These changes erase the performance degradation introduced in the omnibus when running multithreaded sup with optional packing enabled. Special thanks to Devin Matthews for sussing out this fix in short order.
- Fixed the gemmlike sandbox in a manner similar to that of sup with packing, described above. This also involved passing the prenode into the local gemmlike packm variant. (Recall that gemmlike recycles the use of bli_l3_sup_thrinfo_create(), so it automatically inherits that part of the sup fix described above.)
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and bli_thrinfo_thread_id(), respectively.
- (cherry picked from 4833ba2)

Fixed _gemm_small() prototype; disabled gemm_small.
Details:
- Fixed a mismatch between the prototype for bli_gemm_small() in bli_gemm_front.h and the actual definition of bli_gemm_small() in kernels/zen/3/bli_gemm_small.c. The former was erroneously declaring the cntl_t* argument as 'const'. Thanks to Jeff Diamond for reporting this issue.
- Commented out BLIS_ENABLE_SMALL_MATRIX, BLIS_ENABLE_SMALL_MATRIX_TRSM macro definitions in config/zen3/bli_family_zen3.h. AMD's small matrix implementation should probably remain disabled in vanilla BLIS, at least for now.
- (cherry picked from db10dd8)

Trivial whitespace/comment tweaks.
Details:
- Trivial whitespace and comment changes, most of which ideally would have been part of the previous commit pertaining to HPX (2b05948).
- (cherry picked from f0337b7)

blis support for hpx (#682)
- Implement threading backend via HPX.
- HPX is an asynchronous many task runtime system used in high performance computing applications. The runtime implements the ISO C++ parallelism specification and provides a user-space thread implementation.
- This PR provides BLIS a thread backend implementation using HPX and resolves feature request #681. The configuration script, makefiles, and testsuite have been updated to support an HPX build option. The addition of HPX support provides other developers an exemplar for integrating other C++ threading backends into BLIS.
- (cherry picked from 2b05948)

Fixed subtle barrier_fpa bug in bli_thrcomm.c. (#690)
Details:
- In bli_thrcomm.c, correctly initialize the BLIS_OPENMP element of the barrier function pointer array (barrier_fpa) to NULL when BLIS_ENABLE_OPENMP is *not* defined. Similarly, initialize the BLIS_POSIX element of barrier_fpa to NULL when BLIS_ENABLE_PTHREADS is not enabled. This bug was introduced in a1a5a9b and was likely the result of an incomplete edit. The effects of the bug would have likely manifested when querying a thrcomm_t that was initialized with a timpl_t value corresponding to a threading implementation that was omitted from the -t option at configure-time.
- (cherry picked from e1ea25d)

Enhance emacs formatting of C files to remove trailing whitespace and ensure a newline at the end of file
- (cherry picked from dc6e5f3)

Delete mpi_test garbage. (#689)
Details:
- tlrmchlsmth: "What even is this? No comments, no commit message, not used by anything. Trash."
- (cherry picked from 713d078)

Some decluttering of the top-level directory.
Details:
- Relocated 'mpi_test' directory to test/mpi_test.
- Relocated 'so_version' and 'version' files from top-level directory to 'build' directory.
- Updated build/bump-version.sh script to accommodate relocation of 'version' file to 'build' directory.
- Updated configure script to accommodate relocation of 'so_version' file to 'build' directory.
- Updated INSTALL file to replace pointers to blis-devel mailing list with a pointer to docs/Discord.md.
- Updated RELEASING file to contain a reminder to consider whether the so_version file should be updated prior to the release.
- (cherry picked from 8d813f7)

Fix typo in configure --help text. (#686)
Details:
- Fixed a misspelling in the --help description for the --int-size (-i) configure option.
- (cherry picked from 6774bf0)

Support --nosup, --sup configure options. (#684)
Details:
- Added --nosup and --sup as alternative ways of requesting that sup be disabled or enabled. These are analogous to --disable-sup-handling and --enable-sup-handling, respectively. (I got tired of typing out --disable-sup-handling and needed a shorthand notation.)
- Tweaked message output by configure when sup is enabled/disabled for clarity and specificity.
- Whitespace changes.
- (cherry picked from edcc2f9)

Add mention of Wilkinson Prize to README.md. (#683)
Details:
- Added blurbs and links to Wilkinson Prize to README.md.
- Added mention of both Best Paper and Wilkinson Prizes to the top of README.md.
- Other minor tweaks.
- (cherry picked from 5eea6ad)
4440 Segmentation fault ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite
bool
. Because of that, edge-case handling macros here and here use a conflicting type bool from stdbool. A possible solution is to check for POWER architecture in edge_case_macros and define _bool as int for POWER and bool for other architectures.
-s power10
. Is it intentional?
POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default. Are there plans to support a fast sgemm microkernel for POWER9?
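The workaround proposed above for the bool conflict (an int-based boolean on POWER, the stdbool type elsewhere) could look roughly like the following. This is a minimal sketch under stated assumptions: the type name edge_bool_t and the helper is_edge_case are hypothetical and are not the names used in BLIS, and the architecture test relies on the __powerpc__, __powerpc64__, __PPC__ and __PPC64__ macros that GCC and Clang predefine on POWER targets.

```c
#include <assert.h>

/* Pick a boolean type that cannot clash with a redefined `bool` on
   POWER; stdbool.h is only pulled in on other architectures. */
#if defined(__powerpc__) || defined(__powerpc64__) || \
    defined(__PPC__)     || defined(__PPC64__)
typedef int edge_bool_t;   /* hypothetical name, POWER branch */
#else
#include <stdbool.h>
typedef bool edge_bool_t;  /* hypothetical name, all other targets */
#endif

/* Hypothetical example use: report whether a dimension m leaves a
   partial (edge-case) tile for a register blocksize mr. */
static edge_bool_t is_edge_case( long m, long mr )
{
    assert( mr > 0 );
    return ( m % mr ) != 0;
}
```

Because both branches yield a type that converts cleanly to and from integer truth values, code using the typedef does not need to know which branch was taken.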