-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hobbled multithreaded performance of gemmlike sandbox (and sup when packing) #696
Conversation
Details: - Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and bli_thrinfo_thread_id(), respectively. This change probably should have been included in aeb5f0c. - This change is the first of multiple that are necessary to get the gemmlike sandbox operating at full performance again. (It appears each thread is packing every micropanel, and so the thrinfo_t tree needs to be reconfigured. Same goes for sup when optional packing is requested.)
@fgvanzee I pushed a suggested fix to |
Thanks, I'll take a look at your branch and try to pick up where you left off. 👍🏼 |
@devinamatthews Hmm, oddly, your commit definitely helps, but it seems to make up only about half the performance shortfall relative to using the conventional code path. I'll keep digging into it. |
@devinamatthews I quickly realized that the true comparison isn't the conventional code path, but rather sup immediately prior to the omnibus commit. Once I did that, your latest code is only about 2% slower. I'll try to see what could account for that, but if I can't find it, I probably won't stress over it. I'll also look at fixing the sandbox. Thanks for your help. |
I need more sleep or something. Turns out that your fix is 2% faster than the pre-omnibus, not the other way around. Mission accomplished. 🎉 |
Details: - Brought the gemmsup code path up to date relative to the latest thrinfo_t semantics introduced in the October Omnibus commit (aeb5f0c). This was done by passing in the prenode (instead of the current node) into the packm variant within bli_l3_sup_packm.c as well as creating the prenodes and attaching them to the thrinfo_t tree in bli_l3_sup_thrinfo_create(). These changes erase the performance degradation introduced in the omnibus when running multithreaded sup with optional packing enabled. Special thanks to Devin Matthews for sussing out this fix in short order. - Fixed the gemmlike sandbox in a manner similar to that of sup with packing, described above. This also involved passing the prenode into the local gemmlike packm variant. That gemmlike sandbox code currently recycles use of bli_l3_sup_thrinfo_create(), so no other changes were needed at this time.
Details: - Fixed a bug in right-sided hemm when: - using the 1m method, - #defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration, and - the storage of C matches the gemm microkernel IO preference PRIOR to the right-sidedness being detected and recast in terms of the left- side code path. It turns out that bli_gemm_ind_recast_1m_params() was applying its optimization (recasting a complex-domain macrokernel calling a 1m virtual microkernel to a real-domain macrokernel calling the real- domain microkernel) in situations in which it should not have. The optimization was silently assuming that the storage of C always matched that of the microkernel preference, since the front-end (in this case, bli_hemm_front()) would have already had a chance to transpose the operation to bring the two into agreement. However, by disabling right-sided hemm, we deprive BLIS of that flexibility (as a transposed left-sided case would necessarily have to become a right- sided case), and thus the assumption was no longer holding in all cases. Thanks to Nisanth M P for reporting this bug in Issue #621. - The aforementioned bug, and its bugfix, also apply to symm when BLIS_DISABLE_SYMM_RIGHT is defined. - Comment updates. - CREDITS file update. - (cherry picked from commit 3accacf) Fixed perf of mt sup with packing, and mt gemmlike. (#696) Details: - Brought the gemmsup code path up to date relative to the latest thrinfo_t semantics introduced in the October Omnibus commit (aeb5f0c). This was done by passing the prenode (instead of the current node) into the packm variant within bli_l3_sup_packm.c as well as creating the prenodes and attaching them to the thrinfo_t tree in bli_l3_sup_thrinfo_create(). These changes erase the performance degradation introduced in the omnibus when running multithreaded sup with optional packing enabled. Special thanks to Devin Matthews for sussing out this fix in short order. - Fixed the gemmlike sandbox in a manner similar to that of sup with packing, described above. This also involved passing the prenode into the local gemmlike packm variant. (Recall that gemmlike recycles the use of bli_l3_sup_thrinfo_create(), so it automatically inherits that part of the sup fix described above.) - Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and bli_thrinfo_thread_id(), respectively. - (cherry picked from 4833ba2) Fixed _gemm_small() prototype; disabled gemm_small. Details: - Fixed a mismatch between the prototype for bli_gemm_small() in bli_gemm_front.h and the actual definition of bli_gemm_small() in kernels/zen/3/bli_gemm_small.c. The former was erroneously declaring the cntl_t* argument as 'const'. Thanks to Jeff Diamond for reporting this issue. - Commented out BLIS_ENABLE_SMALL_MATRIX, BLIS_ENABLE_SMALL_MATRIX_TRSM macro definitions in config/zen3/bli_family_zen3.h. AMD's small matrix implementation should probably remain disabled in vanilla BLIS, at least for now. - (cherry picked from db10dd8) Trival whitespace/comment tweaks. Details: - Trivial whitespace and comment changes, most of which ideally would have been part of the previous commit pertaining to HPX (2b05948). - (cherry picked from f0337b7) blis support for hpx (#682) - Implement threading backend via HPX. - HPX is an asynchronous many task runtime system used in high performance computing applications. The runtime implements the ISO C++ parallelism specification and provides a user-space thread implementation. - This PR provides BLIS a thread backend implementation using HPX and resolves feature request #681. The configuration script, makefiles, and testsuite have been updated to support an HPX build option. The addition of HPX support provides other developers an exemplar for integrating other C++ threading backends into BLIS. - (cherry picked from 2b05948) Fixed subtle barrier_fpa bug in bli_thrcomm.c. (#690) Details: - In bli_thrcommo.c, correctly initialize the BLIS_OPENMP element of the barrier function pointer array (barrier_fpa) to NULL when BLIS_ENABLE_OPENMP is *not* defined. Similarly, initialize the BLIS_POSIX element of barrier_fpa to NULL when BLIS_ENABLE_PTHREADS is not enabled. This bug was introduced in a1a5a9b and was likely the result of an incomplete edit. The effects of the bug would have likely manifested when querying a thrcomm_t that was initialized with a timpl_t value corresponding to a threading implementation that was omitted from the -t option at configure-time. - (cherry picked from e1ea25d) Enhance emacs formatting of C files to remove trailing whitespace and ensure a newline at the end of file - (cherry picked from dc6e5f3) Delete mpi_test garbage. (#689) Details: - tlrmchlsmth: "What even is this? No comments, no commit message, not used by anything. Trash." - (cherry picked from 713d078) Some decluttering of the top-level directory. Details: - Relocated 'mpi_test' directory to test/mpi_test. - Relocated 'so_version' and 'version' files from top-level directory to 'build' directory. - Updated build/bump-version.sh script to accommodate relocation of 'version' file to 'build' directory. - Updated configure script to accommodate relocation of 'so_version' file to 'build' directory. - Updated INSTALL file to replace pointers to blis-devel mailing list with a pointer to docs/Discord.md. - Updated RELEASING file to contain a reminder to consider whether the so_version file should be updated prior to the release. - (cherry picked from 8d813f7) Fix typo in configure --help text. (#686) Details: - Fixed a misspelling in the --help description for the --int-size (-i) configure option. - (cherry picked from 6774bf0) Support --nosup, --sup configure options. (#684) Details: - Added --nosup and --sup as alternative ways of requesting that sup be disabled or enabled. These are analagous to --disable-sup-handling and --enable-sup-handling, respectively. (I got tired of typing out --disable-sup-handling and needed a shorthand notation.) - Tweaked message output by configure when sup is enable/disabled for clarity and specificity. - Whitespace changes. - (cherry picked from edcc2f9) Add mention of Wilkinson Prize to README.md. (#683) Details: - Added blurbs and links to Wilkinson Prize to README.md. - Added mention of both Best Paper and Wilkinson Prizes to the top of README.md. - Other minor tweaks. - (cherry picked from 5eea6ad)
This PR is to track the changes needed to fix a couple (probably related) issues in the recent Omnibus PR #678 (aeb5f0c):
-s gemmlike
and settingBLIS_JR_NT=4
, performance ofgemm
is depressed by about 25-30% relative to the preceding commit.BLIS_JR_NT=4
, performance of thegemm
sup code path is similarly depressed when optional packing is enabled (viaBLIS_PACK_A=1 BLIS_PACK_B=1
). Performance is unaffected when packing is avoided (the default).My theory is that every thread is packing every micropanel redundantly, which is an issue that @devinamatthews ran into when developing aeb5f0c. This explanation would be consistent with observations and with the state of the code itself, since the sup
thrinfo_t
tree creation inbli_l3_thrinfo.c
does not utilize the sub-prenodes like the conventionalthrinfo_t
tree does.