TriSolver (dist): move sorting permutation from CPU to GPU #1118

Merged
rasolca merged 20 commits into master from alby/permute-local on Jun 21, 2024

Conversation

@albestro (Collaborator) commented Apr 9, 2024

This PR aims at dropping the custom permuteJustLocal by transforming the permutation indices so that the use-case becomes manageable with the existing local permutation implementation, which exists for both backends.

  • Cleanup implementation
  • It might be possible to drop i5 (for the distributed implementation)
  • What to do about the permute API? Should we separate the "distributed" use-case (at least formally), or is it enough to review the assumptions?
  • Evaluate whether it is worth switching to MatrixRef (just for the changed code)
  • Make it work on GPU
  • Add a unit test for the new use-case with distributed matrices

Notes

Since PR #967, each rank sorts eigenvalues by type (upper, dense, lower, deflated) independently from the other ranks. At the time of that PR, for convenience, we opted to perform the sort with a custom permutation procedure, permuteJustLocal, which was able to deal with global indices but applied the permutation only to the local part. In addition, permuteJustLocal was implemented only on the CPU, because a GPU implementation would have required a major effort that was not worthwhile given the inherently GPU-unfriendly nature of the operations involved.
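For illustration only (this is not code from the PR), here is a minimal sketch of the underlying idea: a permutation that is expressed with global indices but only ever moves locally owned entries can be recast as a plain local permutation, which an existing local (CPU or GPU) permutation kernel can then apply. The 1D block-cyclic helpers and all names below are assumptions made for the example, not DLA-Future API.

// Hypothetical sketch: derive a local permutation from a global one, assuming a
// 1D block-cyclic distribution with block size nb over nranks ranks (rank 0
// owns the first block). perm_gl follows the "sorted position -> original index"
// convention: perm_gl[i] is the original (global) index of the i-th sorted entry.
#include <cstddef>
#include <vector>

bool is_local(std::size_t i_gl, std::size_t nb, int rank, int nranks) {
  return static_cast<int>((i_gl / nb) % static_cast<std::size_t>(nranks)) == rank;
}

std::size_t global_to_local(std::size_t i_gl, std::size_t nb, int nranks) {
  return (i_gl / (nb * static_cast<std::size_t>(nranks))) * nb + i_gl % nb;
}

// Since each rank sorts its part independently, the original entry of a locally
// owned sorted position is also locally owned, so both sides can be translated
// to local indices and fed to the existing local permutation implementation.
std::vector<std::size_t> to_local_perm(const std::vector<std::size_t>& perm_gl,
                                       std::size_t nb, int rank, int nranks) {
  std::vector<std::size_t> perm_lc;
  for (std::size_t i_gl = 0; i_gl < perm_gl.size(); ++i_gl) {
    if (!is_local(i_gl, nb, rank, nranks))
      continue;
    perm_lc.push_back(global_to_local(perm_gl[i_gl], nb, nranks));
  }
  return perm_lc;
}

The resulting perm_lc is an ordinary local index array, which is what makes reusing the existing local permutation implementation (and hence its GPU backend) possible.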

@albestro (Collaborator, Author) commented Apr 9, 2024

cscs-ci run

@albestro albestro marked this pull request as draft April 9, 2024 12:24

(several similar "cscs-ci run" comments from albestro follow)

Review comment on the following lines:

// @param perm_sorted array[n] current -> initial (i.e. evals[i] -> types[perm_sorted[i]])
// @param index_sorted array[n] global(sort(non-deflated)|sort(deflated)) -> initial
// @param index_sorted_coltype array[n] local(sort(upper)|sort(dense)|sort(lower)|sort(deflated)) -> initial
// @param i5_lc array[n_lc] local(sort(upper)|sort(dense)|sort(lower)|sort(deflated)) -> initial
@albestro (Collaborator, Author) commented:

note-to-self: specify that they are local indices, while in index_sorted_coltype they are global indices
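Purely as an illustration (not from the PR), the difference between the two kinds of indices, assuming a 1D block-cyclic distribution with block size 2 over 2 ranks (the actual distribution and parameters in the library may differ):

#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t n = 8, nb = 2;   // assumed problem size and block size
  const std::size_t nranks = 2;      // assumed number of ranks
  for (std::size_t i_gl = 0; i_gl < n; ++i_gl) {
    const std::size_t owner = (i_gl / nb) % nranks;
    const std::size_t i_lc = (i_gl / (nb * nranks)) * nb + i_gl % nb;
    // e.g. global index 4 belongs to rank 0 with local index 2 there, so an
    // array of global indices (like index_sorted_coltype) would store 4 while
    // an array of local indices (like i5_lc) would store 2.
    std::printf("global %zu -> rank %zu, local %zu\n", i_gl, owner, i_lc);
  }
}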

@albestro (Collaborator, Author) commented:

cscs-ci run

@albestro albestro marked this pull request as ready for review May 27, 2024 10:35
@rasolca (Collaborator) left a review:

Looks good.

Comment on lines 55 to 59
// Note:
// These are not implementation constraints, but more logic constraints. Indeed, these ensure that
// the range [i_begin, i_end] is square in terms of elements (it would not make sense to have it square
// in terms of number of tiles). Moreover, by requiring mat_in and mat_out matrices to have the same
// shape, it is ensured that range [i_begin, i_end] is actually the same on both sides.
Collaborator comment:

Constraints should be revised (in a different PR).

std::move(setup_permute_fn)) |
ex::unpack() | ex::bulk(subm_dist.size().get<C>(), permute_fn));
ex::start_detached(std::move(sender) | ex::transfer(di::getBackendScheduler<Backend::MC>()) |
ex::bulk(nperms, std::move(permute_fn)));
Collaborator comment:

The number of tasks created by bulk might be huge. I suggest addressing this in a new PR.

Collaborator comment:

The number might be large, but not larger than num_threads. That may of course still be too much, but just keep in mind that the thread_pool_scheduler specialization of bulk will not blindly create nperms tasks.
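As a purely illustrative aside (not code from the PR), the general technique being discussed, bounding the number of parallel tasks by chunking the iteration space instead of spawning one task per element, looks roughly like this; plain std::thread is used instead of pika's ex::bulk so the snippet is self-contained, and the chunk count and per-element work are placeholders:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
  const std::size_t nperms = 100000;  // e.g. number of permutation entries
  std::vector<std::size_t> out(nperms);

  // One task per chunk instead of one task per element.
  const std::size_t nchunks =
      std::max<std::size_t>(1, std::thread::hardware_concurrency());
  const std::size_t chunk = (nperms + nchunks - 1) / nchunks;

  std::vector<std::thread> workers;
  for (std::size_t c = 0; c < nchunks; ++c) {
    workers.emplace_back([&out, c, chunk, nperms] {
      const std::size_t begin = c * chunk;
      const std::size_t end = std::min(begin + chunk, nperms);
      for (std::size_t i = begin; i < end; ++i)
        out[i] = i;  // placeholder per-element work
    });
  }
  for (auto& w : workers)
    w.join();
}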

@rasolca (Collaborator) commented May 28, 2024

cscs-ci run

@msimberg (Collaborator) left a review:

Can't comment on algorithmic changes. Looks good otherwise.

@rasolca rasolca marked this pull request as draft May 29, 2024 10:06
@rasolca (Collaborator) commented May 29, 2024

Back to draft due to frequent hangs on santis

@rasolca rasolca removed this from the v0.5.0 milestone May 29, 2024
@rasolca rasolca modified the milestones: Optimizations, v0.5.1 May 29, 2024
@albestro albestro force-pushed the alby/permute-local branch from 11f2185 to dd8c4ec on June 4, 2024 12:28
@albestro albestro marked this pull request as ready for review June 21, 2024 08:57
@rasolca (Collaborator) commented Jun 21, 2024

cscs-ci run

@rasolca rasolca merged commit 0252d92 into master Jun 21, 2024
4 of 5 checks passed
@rasolca rasolca deleted the alby/permute-local branch June 21, 2024 18:46

Successfully merging this pull request may close these issues.

Tridiagonal Solver (dist): Migrate permutation of local eigenvectors to GPU
3 participants