Support ExtendedCpuGpuSplit pattern, and more #1
Profiling with Nsight Systems reports that a large share (~90%) of the run time is spent in `cuMemHostAlloc` when an array slice (e.g., `Uin(:, :, :, :, n)`) is passed as an argument to an SFR. This commit removes `cuMemHostAlloc` from the profiling results by introducing a "wrapper" subroutine for each SFR. However, overall performance is unchanged, even though Nsight Systems no longer reports `cuMemHostAlloc`. The profiling results may have been misleading, but I am pushing this commit so that this can be investigated further.
In the meantime, support for the
Can we revert the cherry-picking and merge this PR instead? The subroutine wrappers should have little or no impact on performance, and they relieve the I prefer to maintain the
Those cherry-pick commits should not have been done on the main branch. The shell command I am using to restore the main branch to the desired state is `git restore --staged --worktree --source=1f5a60a88315ca854516b254d5796b8e81d8d3f6 :/`. A proper merge commit, merging the GitHub PR #1, is to follow soon. The changes undone here will then be applied once more, together with others.
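For readers unfamiliar with this use of `git restore`, here is a minimal sketch of the technique in a throwaway repository. The file names, commit messages, and identity settings are hypothetical and are not taken from this project. It assumes Git 2.23 or newer, where `git restore` is available and removes tracked files that are absent from the source commit by default.

```shell
#!/bin/sh
set -eu

# Throwaway repository; file names are hypothetical, not this project's history.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "good state" > solver.txt
git add solver.txt
git commit -q -m "desired state"
good=$(git rev-parse HEAD)

# Simulate a commit that should not have landed on the branch.
echo "unwanted change" > solver.txt
echo "extra" > extra.txt
git add solver.txt extra.txt
git commit -q -m "unwanted cherry-pick"

# Make the index and working tree match $good for the whole repository (:/).
git restore --staged --worktree --source="$good" :/
git commit -q -m "Restore tree to desired state"

# History is kept; only the tree content now matches the earlier commit.
restored_tree=$(git rev-parse "HEAD^{tree}")
good_tree=$(git rev-parse "$good^{tree}")
commit_count=$(git rev-list --count HEAD)
```

Unlike `git reset --hard`, this approach adds a new commit on top of the unwanted ones instead of rewriting the branch, so a shared branch does not need a force-push.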
Since you have indicated that the extra wrapping can easily be removed again, in case it does more harm than good, I approve.
This PR contains the following:
pushTile
support for the ExtendedCpuGpuSplit pattern.