TriSolver (dist): move sorting permutation from CPU to GPU #1118
cscs-ci run
180c08d to 217cf55
// @param perm_sorted array[n] current -> initial (i.e. evals[i] -> types[perm_sorted[i]])
// @param index_sorted array[n] global(sort(non-deflated)|sort(deflated))) -> initial
// @param index_sorted_coltype array[n] local(sort(upper)|sort(dense)|sort(lower)|sort(deflated))) -> initial
// @param i5_lc array[n_lc] local(sort(upper)|sort(dense)|sort(lower)|sort(deflated))) -> initial
note-to-self: specify that they are local indices, while in index_sorted_coltype they are global indices
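The "current -> initial" convention used in the parameter comments above can be illustrated with a minimal sketch. It assumes a hypothetical `ColType` enum and hypothetical helpers `sortByColType` and `applyPermutation`; none of these names come from the PR. The sketch only groups entries by type and keeps the initial order inside each group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical column categories, in the order used by the comments above:
// upper, dense, lower, deflated. Not the library's actual type.
enum class ColType { Upper = 0, Dense = 1, Lower = 2, Deflated = 3 };

// Returns a permutation with the "current -> initial" convention:
// position i of the sorted sequence holds the entry initially at perm[i].
std::vector<std::size_t> sortByColType(const std::vector<ColType>& types) {
  std::vector<std::size_t> perm(types.size());
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  // stable_sort keeps the initial relative order inside each category.
  std::stable_sort(perm.begin(), perm.end(), [&types](std::size_t a, std::size_t b) {
    return static_cast<int>(types[a]) < static_cast<int>(types[b]);
  });
  return perm;
}

// Applies a "current -> initial" permutation: out[i] = in[perm[i]].
template <class T>
std::vector<T> applyPermutation(const std::vector<std::size_t>& perm,
                                const std::vector<T>& in) {
  assert(perm.size() == in.size());
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < perm.size(); ++i)
    out[i] = in[perm[i]];
  return out;
}
```

With this convention, applying the permutation to the type array itself yields the types in sorted order, which matches the `evals[i] -> types[perm_sorted[i]]` reading of the comment.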
Looks good.
// Note:
// These are not implementation constraints, but more logic constraints. Indeed, these ensure that
// the range [i_begin, i_end] is square in terms of elements (it would not make sense to have it square
// in terms of number of tiles). Moreover, by requiring mat_in and mat_out matrices to have the same
// shape, it is ensured that range [i_begin, i_end] is actually the same on both sides.
Constraints should be revised (in a different PR).
std::move(setup_permute_fn)) |
ex::unpack() | ex::bulk(subm_dist.size().get<C>(), permute_fn));
ex::start_detached(std::move(sender) | ex::transfer(di::getBackendScheduler<Backend::MC>()) |
ex::bulk(nperms, std::move(permute_fn)));
Number of tasks created by bulk might be huge.
I suggest addressing in a new PR.
The number might be large, but not larger than num_threads. That may of course still be too much, but just keep in mind that the thread_pool_scheduler specialization of bulk will not blindly create nperms tasks.
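The idea in the comment above, that the number of tasks is bounded by the number of threads rather than by the iteration count, can be illustrated with a small sketch. This is not the scheduler's actual implementation; `splitIntoChunks` is a hypothetical helper that only shows how `n` iterations could be grouped into at most `max_tasks` contiguous chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits the index range [0, n) into at most max_tasks contiguous,
// non-empty half-open chunks [begin, end). Hypothetical helper, for
// illustration only.
std::vector<std::pair<std::size_t, std::size_t>> splitIntoChunks(std::size_t n,
                                                                 std::size_t max_tasks) {
  assert(max_tasks > 0);
  std::vector<std::pair<std::size_t, std::size_t>> chunks;
  const std::size_t ntasks = std::min(n, max_tasks);
  if (ntasks == 0)
    return chunks;
  const std::size_t chunk = (n + ntasks - 1) / ntasks;  // ceil(n / ntasks)
  for (std::size_t begin = 0; begin < n; begin += chunk)
    chunks.emplace_back(begin, std::min(n, begin + chunk));
  return chunks;
}
```

Each chunk would then be handled by a single task that loops over its own indices, so a large `nperms` translates into longer loops per task instead of more tasks.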
Can't comment on algorithmic changes. Looks good otherwise.
Back to draft due to frequent hangs on santis
with a distributed matrix
required to be compatible with both local and distributed usage
11f2185 to dd8c4ec
This PR aims at dropping the custom permuteJustLocal and, by transforming the permutation indices, reducing the use case to one that the existing local permutation implementation, which exists for both backends, can handle.

- It might be possible to drop i5 (for distributed implementation)
- permute API? Should we separate the "distributed" use case (at least formally), or is it enough to review the assumptions?
- Evaluate if it is worth switching to MatrixRef (just for the code changed)

Notes
Since PR #967, each rank sorts eigenvalues by type (upper, dense, lower, deflated) independently of the other ranks. At the time of that PR, for convenience, we opted to perform the sort with a custom permutation procedure, permuteJustLocal, that was able to deal with global indices but applied the permutation only to the local part. In addition, permuteJustLocal was implemented only on CPU, because a GPU implementation would have required a major effort that was not worthwhile given the inherently GPU-inefficient type of operations involved.
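The kind of index transformation described above, turning global indices into ones a purely local permutation can consume, can be sketched as follows. The sketch assumes a 1D block-cyclic distribution with block size `nb` over `nranks` ranks, starting from rank 0; `BlockCyclic1D`, `ownerRank`, `localIndex` and `toLocalPermutation` are hypothetical names for illustration and are not the library's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D block-cyclic distribution: blocks of nb consecutive
// global indices are dealt round-robin to nranks ranks, starting at rank 0.
struct BlockCyclic1D {
  std::size_t nb;
  std::size_t nranks;
};

// Rank owning global index g.
std::size_t ownerRank(const BlockCyclic1D& d, std::size_t g) {
  return (g / d.nb) % d.nranks;
}

// Local index of global index g on its owning rank.
std::size_t localIndex(const BlockCyclic1D& d, std::size_t g) {
  const std::size_t local_block = (g / d.nb) / d.nranks;
  return local_block * d.nb + g % d.nb;
}

// Converts a permutation expressed with global indices into one expressed
// with local indices. Assumes every referenced index is owned by `rank`,
// i.e. the permutation never moves data between ranks.
std::vector<std::size_t> toLocalPermutation(const BlockCyclic1D& d, std::size_t rank,
                                            const std::vector<std::size_t>& global_perm) {
  std::vector<std::size_t> local_perm;
  local_perm.reserve(global_perm.size());
  for (std::size_t g : global_perm) {
    assert(ownerRank(d, g) == rank);
    local_perm.push_back(localIndex(d, g));
  }
  return local_perm;
}
```

For example, with `nb = 2` and two ranks, rank 1 owns global indices 2, 3, 6 and 7, which map to local indices 0 to 3. Once the permutation is expressed this way, it can be handed to a permutation routine that knows nothing about the distribution.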