TFactor: bulkerify GEMV (both MC and GPU) #1214

albestro · 2024-11-15T17:50:06Z

This PR tries to increase the parallelism of the GEMV step of TFactor, which is serial by construction (all results go into the same block tile). It does so by storing partial results in workspaces and reducing them before the final TRMV step.

Algorithmically, the main change is that the stepGEMV loop has been replaced with a single stepGEMVAll. The same concept is applied to both backends, MC and GPU:

MC: pika::bulk splits the input tiles over multiple tasks; each task stores its partial results in its own workspace, and a single task performs the reduction at the end.
GPU: similarly to MC, the work is forked over different pika tasks, each computing a partial result on a different GPU stream. These tasks then join into a single task that performs the reduction.
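
The workspace-plus-reduction idea described above can be sketched in plain C++. This is not the DLA-Future code: the function name `gemv_t_with_workspaces`, the row-major layout and the ceiling-division chunking are assumptions made for illustration, and the worker loop runs serially to keep the example dependency-free. Each iteration writes only to its own workspace, which is what would allow the iterations to run as concurrent tasks (for example through pika::bulk).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = A^T * x for a row-major (m x n) matrix A.
// The rows of A are split into contiguous chunks, one per worker. Each worker
// accumulates into its own workspace, and the workspaces are summed at the end.
std::vector<double> gemv_t_with_workspaces(const std::vector<double>& a, std::size_t m,
                                           std::size_t n, const std::vector<double>& x,
                                           std::size_t nworkers) {
  assert(a.size() == m * n && x.size() == m);
  // Never use more workers than rows, and always at least one.
  nworkers = std::max<std::size_t>(1, std::min(nworkers, m));
  const std::size_t chunk = (m + nworkers - 1) / nworkers;  // rows per worker, rounded up

  // One zero-initialised workspace of length n per worker.
  std::vector<std::vector<double>> ws(nworkers, std::vector<double>(n, 0.0));

  // "Fork": iteration w only reads A and x, and only writes ws[w].
  for (std::size_t w = 0; w < nworkers; ++w) {
    const std::size_t begin = std::min(w * chunk, m);
    const std::size_t end = std::min(begin + chunk, m);
    for (std::size_t i = begin; i < end; ++i)
      for (std::size_t j = 0; j < n; ++j)
        ws[w][j] += a[i * n + j] * x[i];
  }

  // "Join": a single reduction sums the partial results.
  std::vector<double> y(n, 0.0);
  for (const auto& partial : ws)
    for (std::size_t j = 0; j < n; ++j)
      y[j] += partial[j];
  return y;
}
```

The cost of this approach is the extra memory for the workspaces and the final reduction, which is the trade-off the description mentions.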

To implement this solution, workspaces for intermediate results have been added. (There is another option that requires neither additional workspaces nor a reduction, which we might explore for MC in another PR.)

TODO:

This work is based on TFactor: simplification and cleanup #1213
check with @msimberg if we can do something different instead of the "fork" (e.g. a bulk for the GPU but each pika thread should get a different cuda stream)
define better how many workspaces are needed (currently an arbitrary choice)
profile and test performance of this solution
Verify if there is anything to be "ported" from computeTFactor parallelise computation with bulk (MC Local and Distributed) #798

Close #798.

EDIT: since this is conceptually similar to #798, which is going to be closed as soon as this gets merged, I migrated the doc fixes made there to this PR.

albestro · 2024-12-02T16:56:41Z

cscs-ci run

just to check if there are any major problems

rasolca

LGTM.
Please name new functions in snake case, and rename existing internal functions if possible.

rasolca · 2024-12-04T10:58:46Z

include/dlaf/factorization/qr/t_factor_impl.h

+    matrix::Matrix<T, Device::GPU> ws_T({nworkspaces * nrefls_step, nrefls_step},
+                                        {nrefls_step, nrefls_step});


All the Ts are allocated at scheduling time. Better to reuse them.
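
One way to read this suggestion is to allocate the workspace once in the caller and lend it to every step. The sketch below is only an illustration of that pattern: `Workspace`, `tile` and `t_factor_step` are hypothetical names, the tiles are assumed to be stored one after another in a single buffer, and the body of the step is a placeholder rather than the real GEMV work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workspace: `nworkspaces` square tiles of side `nrefls_step`,
// stored one after another in a single allocation.
struct Workspace {
  std::size_t nworkspaces;
  std::size_t nrefls_step;
  std::vector<double> data;

  Workspace(std::size_t nws, std::size_t n)
      : nworkspaces(nws), nrefls_step(n), data(nws * n * n, 0.0) {}

  // Pointer to the first element of tile `w`.
  double* tile(std::size_t w) {
    assert(w < nworkspaces);
    return data.data() + w * nrefls_step * nrefls_step;
  }
};

// Hypothetical step: borrows the workspace instead of allocating its own.
// It checks the size, clears stale partial results, then does its work.
void t_factor_step(Workspace& ws, std::size_t nworkers, std::size_t k) {
  assert(nworkers <= ws.nworkspaces && k <= ws.nrefls_step);
  std::fill(ws.data.begin(), ws.data.end(), 0.0);
  // Placeholder for the real work: mark the first element of each used tile.
  for (std::size_t w = 0; w < nworkers; ++w)
    ws.tile(w)[0] = static_cast<double>(k);
}
```

A caller would construct one `Workspace` before its loop over panels and pass it to each `t_factor_step` call. Resetting the buffer at the start of each step matters, because a reused workspace still holds the previous step's partial results.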

msimberg

Only a few questions, nothing blocking.

include/dlaf/eigensolver/bt_reduction_to_band/impl.h

include/dlaf/factorization/qr/api.h

include/dlaf/factorization/qr/t_factor_impl.h

rasolca · 2024-12-06T10:58:27Z

include/dlaf/factorization/qr/internal/get_tfactor_nworkers.h

+
+namespace dlaf::factorization::internal {
+
+inline size_t getTFactorNWorkers() noexcept {


What about making this snake case?

see 36f179d

include/dlaf/tune.h

albestro · 2024-12-10T14:37:50Z

cscs-ci run

albestro · 2024-12-11T16:16:27Z

Converted to draft in order to prevent merging, since I'm still doing some checks.

include/dlaf/tune.h


albestro · 2024-12-17T13:53:01Z

cscs-ci run

codecov-commenter · 2024-12-17T14:47:32Z

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 89.94083% with 17 lines in your changes missing coverage. Please review.

Project coverage is 94.02%. Comparing base (e8c7f2c) to head (383dde5).
Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
include/dlaf/factorization/qr/t_factor_impl.h	85.84%	10 Missing and 6 partials ⚠️
src/tune.cpp	0.00%	1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1214      +/-   ##
==========================================
- Coverage   94.38%   94.02%   -0.37%     
==========================================
  Files         154      155       +1     
  Lines        9374     9461      +87     
  Branches     1160     1163       +3     
==========================================
+ Hits         8848     8896      +48     
- Misses        320      340      +20     
- Partials      206      225      +19



albestro added the Type:Optimization label Nov 15, 2024

albestro added this to the Optimizations milestone Nov 15, 2024

albestro self-assigned this Nov 15, 2024

albestro force-pushed the alby/tfactor-optim/bulk branch from 6558780 to bf17fcc Compare November 18, 2024 13:16

Base automatically changed from alby/tfactor-optim/no-gemv-divergence to master November 22, 2024 11:28

albestro force-pushed the alby/tfactor-optim/bulk branch 4 times, most recently from 37ea75b to 2845339 Compare December 2, 2024 16:48

albestro requested review from msimberg, rasolca and RMeli December 3, 2024 16:08

albestro marked this pull request as ready for review December 3, 2024 16:08

rasolca requested changes Dec 4, 2024


albestro requested a review from rasolca December 5, 2024 11:22

msimberg approved these changes Dec 5, 2024


albestro force-pushed the alby/tfactor-optim/bulk branch from 05dc48d to f83f317 Compare December 5, 2024 15:00

rasolca reviewed Dec 6, 2024


msimberg reviewed Dec 9, 2024


albestro force-pushed the alby/tfactor-optim/bulk branch from 0397c9e to d370894 Compare December 9, 2024 09:47

albestro requested a review from rasolca December 9, 2024 09:47

rasolca approved these changes Dec 10, 2024


msimberg approved these changes Dec 10, 2024


albestro marked this pull request as draft December 11, 2024 16:15

albestro commented Dec 11, 2024


albestro added 2 commits December 13, 2024 14:21

factor out gemvLoop for MC

b1197be

mc-local: add stepGEMVAll with bulk using workspaces

9647ec8

albestro and others added 21 commits December 13, 2024 14:21

wip: change api accepting an external workspace

27978c9

adapt algorithms to new api with external workspace

6765ccb

adapt also the test to the new api

ff577cd

basic check and doc for workspace

f5dadcf

fix inshpect

c5f4d75

use lowercase in snake_case

a967d6c

snake_case also for config getter

a6db55e

do not expose internal header

d4cd568

Update include/dlaf/tune.h

f3c0ba0

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

remove superfluous spack

89d72f7

add missing cli option for dlaf:tfactor-nworkers

f1719fa

bug fix: tfactor workspaces allocation

87c952b

bug fix: replace sender workspace instead of appending

3d3b93c

bug fix: missing default initialization to a proper value

92aa5b7

bug fix: diag has to be reset at the end

eb9def0

taus are set for gemv, but then full tile is reduced, hence also the diag with taus.

bug fix: not all workspaces should be used

202f1cc

bug fix: race-condition on k using tile_t

c0a7660

tile_t might get moved by first worker before any other work is able to use its size
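
The hazard described in this commit can be illustrated with a small, serial C++ sketch. `Tile`, `consume` and the two `k_read_*` functions are hypothetical stand-ins, not DLA-Future types. The point is only the ordering: a size derived from an object has to be read before any consumer is allowed to move from that object.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tile whose size is derived from its storage.
struct Tile {
  std::vector<double> data;
  std::size_t size() const { return data.size(); }
};

// A worker that takes ownership of the tile: the caller's object is moved-from afterwards.
double consume(Tile t) {
  return std::accumulate(t.data.begin(), t.data.end(), 0.0);
}

// Problematic order: the size is queried after the tile has been moved away.
// Move construction leaves the source std::vector empty, so this returns 0.
std::size_t k_read_after_move(Tile tile) {
  (void) consume(std::move(tile));
  return tile.size();
}

// Safe order: k is captured first, so it stays valid whatever happens to the tile.
std::size_t k_read_before_move(Tile tile) {
  const std::size_t k = tile.size();
  (void) consume(std::move(tile));
  return k;
}
```

With concurrent workers the problematic order becomes a race, because whether a worker sees the real size or zero depends on scheduling. Passing the tile by `const&` to workers that only read it removes the move altogether.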

avoid moves by using const& in loop_gemv

dec050a

minor changes

fc12aae

minor change: drop unused step_trmv and reorder gpu helpers

3f2a53a

minor changes and cleanup to doc

5997e52

albestro force-pushed the alby/tfactor-optim/bulk branch from f0638e1 to 5997e52 Compare December 17, 2024 09:10

bug fix: worker might not get used anyway

383dde5

in case of odd division of tiles, it might end up being selected but not used.
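
The "selected but not used" case can be reproduced with a little arithmetic. The sketch assumes tiles are split into contiguous chunks using ceiling division; the actual chunking used by pika::bulk or by this PR may differ, and `worker_range` and `used_workers` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of tile indices assigned to one worker.
struct Range {
  std::size_t begin;
  std::size_t end;
  bool empty() const { return begin == end; }
};

// Tiles assigned to worker `w` when `ntiles` are split over `nworkers`
// in chunks of ceil(ntiles / nworkers) tiles.
Range worker_range(std::size_t ntiles, std::size_t nworkers, std::size_t w) {
  assert(nworkers > 0);
  const std::size_t chunk = (ntiles + nworkers - 1) / nworkers;
  const std::size_t begin = std::min(w * chunk, ntiles);
  const std::size_t end = std::min(begin + chunk, ntiles);
  return {begin, end};
}

// Number of workers that receive at least one tile.
std::size_t used_workers(std::size_t ntiles, std::size_t nworkers) {
  assert(nworkers > 0);
  if (ntiles == 0)
    return 0;
  const std::size_t chunk = (ntiles + nworkers - 1) / nworkers;
  return (ntiles + chunk - 1) / chunk;
}
```

With 5 tiles and 4 workers the chunk size is 2, so workers 0 to 2 receive tiles [0, 2), [2, 4) and [4, 5), while worker 3 receives an empty range. Under this assumption only three workspaces contain partial results.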

albestro added 2 commits December 19, 2024 16:41

bug fix: workspace for reduction to band was too big.

5f2ddd4

add assert also for gpu code that calls add manually

bug fix: workspace in bt_red2band migth have not been reset

ffd3135

albestro mentioned this pull request Dec 20, 2024

Update benchmark scripts: basic separation for alps vcluster, number of cores, gpu defaults, ... #1254

Draft

1 task

fix conflict

d199b88

albestro force-pushed the alby/tfactor-optim/bulk branch from 2f90f23 to d199b88 Compare December 20, 2024 15:14
