-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reduce then scan algorithm for transform scan family #1762
Conversation
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
9eb3099
to
fc43983
Compare
c8a274a
to
4e4c0bc
Compare
3d277c6
to
4df758a
Compare
ea28133
to
d01aeb1
Compare
47a227c
to
c471ed5
Compare
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had another look and I think it is in a good state. Clang-format is red, but I prefer the current version since it is easier to read.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Show resolved
Hide resolved
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
{ | ||
template <typename _OutRng, typename ValueType> | ||
void | ||
operator()(const _OutRng& __out_rng, std::size_t __id, const ValueType& __v) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant reference const _OutRng& __out_rng
looks suspicious, because the range is output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I removed the const
here this and fixed a couple other range types in the kernel. I will go through the remaining PRs to check for similar issues. (I think output ranges for the other helpers probably have the same issue)
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please wait for another approval since there are partial reviews of several people.
// and then write back the final values to memory | ||
if (__sub_group_id == 0) | ||
{ | ||
// step 1) load to Xe SLM the WG-local S prefix sums |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the reference to Xe
here since this algorithm is designed to work on non-Xe-based devices as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few minor things I found.
// and then write back the final values to memory | ||
if (__sub_group_id == 0) | ||
{ | ||
// step 1) load to Xe SLM the WG-local S prefix sums |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
Co-authored-by: Matthew Michel <106704043+mmichel11@users.noreply.github.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is in a good state and is okay to be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. We have checked tests and performance with the latest version.
@@ -91,35 +91,15 @@ oneapi::dpl::__internal::__difference_t<_Range2> | |||
__pattern_transform_scan_base(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Range1&& __rng1, _Range2&& __rng2, | |||
_UnaryOperation __unary_op, _InitType __init, _BinaryOperation __binary_op, _Inclusive) | |||
{ | |||
if (__rng1.empty()) | |||
oneapi::dpl::__internal::__difference_t<_Range1> __n = __rng1.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment I forgot to send earlier:
__n
is also the return value, and the function is declared to return __difference_t<_Range2>
, not that of _Range1
. In the previous code, we took the size of __rng1
and converted it to the return type. But I wonder if we should just take the size of __rng2
instead (if we assume ranges being of equal size), or maybe change the return type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. We created PR #1811 to address this issue.
This PR Adds a generic reduce_then_scan algorithm for scan-like algorithms, and uses this algorithm for transform scan family of scan-like algorithms where it is beneficial.
This PR is targeted after a pair of other PRs #1769 + #1770, which lay some plumbing for these changes. Those two PRs were separated from this one to provide more focus to this change.
It is important to understand that subsequent PRs will add other families of scan-like algorithms (copy_if, partition, unique, ...), which may justify some aspects of the implementation of the reduce_then_scan kernels that could seem overcomplicated for the transform scan family alone.
This PR is targeted to #1770, to allow for a clean diff, and is a part of the following sequence of PRs meant to be merged in order:
#1769 [MERGED]
Relocate __lazy_ctor_storage to utils header#1770 [MERGED]
Use __result_and_scratch_storage within scan kernels#1762 Add reduce_then_scan algorithm for transform scan family (This PR)
#1763 Make Copy_if family of APIs use reduce_then_scan algorithm
#1764 Make Partition family of APIs use reduce_then_scan algorithm
#1765 Make Unique family of APIs use reduce_then_scan algorithm
This work is a collaboration between @mmichel11 @adamfidel and @danhoeflinger, and based upon an original prototype by Ted Painter.