Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reduce then scan algorithm for transform scan family #1762

Merged
merged 88 commits into from
Aug 30, 2024
Merged
Show file tree
Hide file tree
Changes from 84 commits
Commits
Show all changes
88 commits
Select commit Hold shift + click to select a range
d07ada2
Checkpoint for reduce then scan integration
mmichel11 Jul 8, 2024
6244266
Introduce a parallel_backend_sycl_reduce_then_scan.h file to contain …
mmichel11 Jul 9, 2024
ccdb3b0
Port of kernels from two-pass scan KT branch
mmichel11 Jul 10, 2024
b465d84
Move the single-element last element storage for exclusive_scan after…
mmichel11 Jul 11, 2024
47360a0
Use init value type for init processing helper
adamfidel Jul 11, 2024
3bf0602
Lower single work-group upper limit to 2048 elements (empirically found)
mmichel11 Jul 12, 2024
46c1a50
[PROTOTYPE] Generalized two pass algorithm and copy_if (#1700)
danhoeflinger Jul 17, 2024
38c1b19
bug fix for global race on block carry-out
danhoeflinger Jul 18, 2024
72d42c2
bugfix for elements to process in partial subgroup scan
danhoeflinger Jul 18, 2024
ecce124
[PROTOTYPE] Add unused temporary storage to single work-group scan to…
adamfidel Jul 18, 2024
39ebdbe
Add temporary work-group size cap for FPGA_EMU testing
mmichel11 Jul 19, 2024
e4e30e1
[PROTOTYPE] Resolve conversion issues between internal tuple and std:…
mmichel11 Jul 19, 2024
3732c12
Use __dpl_sycl::__local_accessor
adamfidel Jul 22, 2024
1745e0c
bugfix for overruning input for small non multiples of subgroup size
danhoeflinger Jul 22, 2024
0921941
Check if a subgroup is active before fetching its carry and grab the …
mmichel11 Jul 23, 2024
8effa03
Comment out std::complex tests in scan_by_segment tests
mmichel11 Jul 23, 2024
c22231a
renaming __out as it seems to be a keyword
danhoeflinger Jul 24, 2024
598f569
fixing device copyable for helpers
danhoeflinger Jul 29, 2024
96b4fd2
Remove commented code that remained after rebase
mmichel11 Jul 31, 2024
8f759a3
[PROTOTYPE] Add fallback to legacy scan implementation for CPU device…
mmichel11 Jul 31, 2024
6da54e7
[PROTOTYPE] partition, unique families and ranges API (#1708)
danhoeflinger Aug 2, 2024
13cecbf
fix windows issue regression __out
danhoeflinger Aug 4, 2024
2daefab
fix for missing assigner in copy if pattern
danhoeflinger Aug 4, 2024
4a83e1b
fix unique same mangled name problem
danhoeflinger Aug 4, 2024
299b28b
[PROTOTYPE] Cleanup reduce-then-scan code (#1760)
mmichel11 Aug 5, 2024
8266882
restoring removed whitespace line
danhoeflinger Aug 5, 2024
453d4ca
removing unnecessay storage type from kernel name
danhoeflinger Aug 5, 2024
78e33ac
remove unique pattern family from reduce_then_scan
danhoeflinger Aug 5, 2024
8267513
remove partition pattern family from reduce_then_scan
danhoeflinger Aug 5, 2024
d37746e
remove copy_if pattern family from reduce_then_scan
danhoeflinger Aug 5, 2024
404c4ef
remove unnecessary barrier + cleanup unnecessary lazy value
danhoeflinger Aug 5, 2024
060f649
clang format
danhoeflinger Aug 5, 2024
0beebd1
codespell
danhoeflinger Aug 5, 2024
90e6e62
restoring whitespace only changes
danhoeflinger Aug 6, 2024
ef5d377
removing unnecessary using
danhoeflinger Aug 6, 2024
bca0002
reverting formatting only changes
danhoeflinger Aug 6, 2024
68c75e5
remove max and TODO
danhoeflinger Aug 6, 2024
dddb050
remove extra braces, add comments
danhoeflinger Aug 6, 2024
dc2de26
removing formatting only changes
danhoeflinger Aug 6, 2024
165b1a5
removing unnecessary decay
danhoeflinger Aug 7, 2024
b9f0f4e
removing unused forwarding references
danhoeflinger Aug 7, 2024
bd144a4
clang-formatting
danhoeflinger Aug 8, 2024
d809051
adding comment and different threshold for different implementations
danhoeflinger Aug 14, 2024
1647722
checking is_gpu rather than !is_cpu
danhoeflinger Aug 14, 2024
0271b40
use dpl_bit_ceil
danhoeflinger Aug 14, 2024
6cfc979
removing bad formatting only changes (::std::)
danhoeflinger Aug 14, 2024
cc03af1
fixing result_and_scratch_storage creation
danhoeflinger Aug 15, 2024
98de25d
spelling
danhoeflinger Aug 15, 2024
59933c1
fixing single pass scan KT from change to single-wg check
danhoeflinger Aug 15, 2024
94e6e97
clarifying comment language
danhoeflinger Aug 15, 2024
ddaad55
refactor subgroup scan to reduce redundant code
danhoeflinger Aug 15, 2024
1fc0f59
refactoring full block / full thread logic to remove redundancy
danhoeflinger Aug 15, 2024
a5753d0
passing storage container by ref
danhoeflinger Aug 15, 2024
761ec51
__g -> __group_id
danhoeflinger Aug 15, 2024
4d8c92d
__group_start_idx -> __group_start_id
danhoeflinger Aug 15, 2024
55db83e
minor variable naming and helpers
danhoeflinger Aug 15, 2024
f3768bf
improving comments, removing unused variable
danhoeflinger Aug 15, 2024
f1361d2
__prefer_reduce_then_scan -> __is_gpu_with_sg_32
danhoeflinger Aug 15, 2024
b67b987
comment for temporary storage
danhoeflinger Aug 15, 2024
f3aec73
fold initial value into __carry_offset
danhoeflinger Aug 15, 2024
15d09e2
running tally of __reduction_scan_id
danhoeflinger Aug 15, 2024
6bbe469
_idx -> _id
danhoeflinger Aug 15, 2024
a7d00db
running tally of __load_reduction_id rather than recalculating
danhoeflinger Aug 15, 2024
f54e298
running tally of __reduction_id rather than recalculating
danhoeflinger Aug 15, 2024
d11dd6f
comment improvement
danhoeflinger Aug 15, 2024
1a29790
refactor for readability
danhoeflinger Aug 15, 2024
e936e83
formatting
danhoeflinger Aug 15, 2024
1b4f365
removing extra space
danhoeflinger Aug 15, 2024
0ca6f48
rename variables for consistency
danhoeflinger Aug 15, 2024
df6a223
fixing misleading names
danhoeflinger Aug 15, 2024
6e470e5
Address reviewer feedback
danhoeflinger Aug 19, 2024
528e04a
fix bugs from 6e470e5253]
danhoeflinger Aug 19, 2024
8104f1f
Simplify conversions in __gen_transform_input
mmichel11 Aug 19, 2024
a5367d1
Move def of __n_uniform closer to its use
adamfidel Aug 19, 2024
6096e7a
Add alias for __dpl_sycl::__sub_group and replace templates
adamfidel Aug 21, 2024
60c8516
auto -> real types and formatting
danhoeflinger Aug 21, 2024
8121d67
fixing type of subgroup id returns
danhoeflinger Aug 21, 2024
48724db
shrinking subgroup size id types
danhoeflinger Aug 21, 2024
3cc61db
adjust type to depend on input range
danhoeflinger Aug 21, 2024
c2c7e35
idx -> id
danhoeflinger Aug 22, 2024
ff7b256
shrinking types, switch branch to min, remove double deref
danhoeflinger Aug 22, 2024
8a36d5a
Adjust block size for reduce-then-scan based on input type (#1782)
adamfidel Aug 22, 2024
9520f3c
shrinking missed types
danhoeflinger Aug 22, 2024
5a928fd
bugfix for windows
danhoeflinger Aug 23, 2024
e57573f
fixing range types
danhoeflinger Aug 27, 2024
93189b0
minor comments from review + formatting
danhoeflinger Aug 29, 2024
4e4568e
Apply std:: suggestions
danhoeflinger Aug 29, 2024
af82182
rounding workgroup size down to mult of subgroup size
danhoeflinger Aug 29, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion include/oneapi/dpl/experimental/kt/single_pass_scan.h
Original file line number Diff line number Diff line change
Expand Up @@ -332,7 +332,7 @@ __single_pass_scan(sycl::queue __queue, _InRange&& __in_rng, _OutRange&& __out_r
auto __n_uniform = ::oneapi::dpl::__internal::__dpl_bit_ceil(__n);

// Perform a single-work group scan if the input is small
if (oneapi::dpl::__par_backend_hetero::__group_scan_fits_in_slm<_Type>(__queue, __n, __n_uniform))
if (oneapi::dpl::__par_backend_hetero::__group_scan_fits_in_slm<_Type>(__queue, __n, __n_uniform, /*limit=*/16384))
{
return oneapi::dpl::__par_backend_hetero::__parallel_transform_scan_single_group(
oneapi::dpl::__internal::__device_backend_tag{},
Expand Down
3 changes: 2 additions & 1 deletion include/oneapi/dpl/internal/exclusive_scan_by_segment_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -161,7 +161,8 @@ exclusive_scan_by_segment_impl(__internal::__hetero_tag<_BackendTag>, Policy&& p
transform_inclusive_scan(::std::move(policy2), make_zip_iterator(_temp.get(), _flags.get()),
make_zip_iterator(_temp.get(), _flags.get()) + n, make_zip_iterator(result, _flags.get()),
internal::segmented_scan_fun<ValueType, FlagType, Operator>(binary_op),
oneapi::dpl::__internal::__no_op(), ::std::make_tuple(init, FlagType(1)));
oneapi::dpl::__internal::__no_op(),
oneapi::dpl::__internal::make_tuple(init, FlagType(1)));
return result + n;
}

Expand Down
83 changes: 68 additions & 15 deletions include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@
#include "parallel_backend_sycl_reduce.h"
#include "parallel_backend_sycl_merge.h"
#include "parallel_backend_sycl_merge_sort.h"
#include "parallel_backend_sycl_reduce_then_scan.h"
#include "execution_sycl_defs.h"
#include "sycl_iterator.h"
#include "unseq_backend_sycl.h"
Expand Down Expand Up @@ -753,10 +754,9 @@ __parallel_transform_scan_base(oneapi::dpl::__internal::__device_backend_tag, _E

template <typename _Type>
bool
__group_scan_fits_in_slm(const sycl::queue& __queue, ::std::size_t __n, ::std::size_t __n_uniform)
__group_scan_fits_in_slm(const sycl::queue& __queue, std::size_t __n, std::size_t __n_uniform,
std::size_t __single_group_upper_limit)
{
constexpr int __single_group_upper_limit = 16384;

// Pessimistically only use half of the memory to take into account memory used by compiled kernel
const ::std::size_t __max_slm_size =
__queue.get_device().template get_info<sycl::info::device::local_mem_size>() / 2;
Expand All @@ -765,6 +765,37 @@ __group_scan_fits_in_slm(const sycl::queue& __queue, ::std::size_t __n, ::std::s
return (__n <= __single_group_upper_limit && __max_slm_size >= __req_slm_size);
}

template <typename _UnaryOp>
struct __gen_transform_input
{
template <typename _InRng>
auto
operator()(const _InRng& __in_rng, std::size_t __id) const
{
// We explicitly convert __in_rng[__id] to the value type of _InRng to properly handle the case where we
// process zip_iterator input where the reference type is a tuple of a references. This prevents the caller
// from modifying the input range when altering the return of this functor.
using _ValueType = oneapi::dpl::__internal::__value_t<_InRng>;
return __unary_op(_ValueType{__in_rng[__id]});
}
_UnaryOp __unary_op;
};

struct __simple_write_to_id
{
template <typename _OutRng, typename ValueType>
void
operator()(const _OutRng& __out_rng, std::size_t __id, const ValueType& __v) const
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constant reference const _OutRng& __out_rng looks suspicious, because the range is output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I removed the const here this and fixed a couple other range types in the kernel. I will go through the remaining PRs to check for similar issues. (I think output ranges for the other helpers probably have the same issue)

{
// Use of an explicit cast to our internal tuple type is required to resolve conversion issues between our
// internal tuple and std::tuple. If the underlying type is not a tuple, then the type will just be passed through.
using _ConvertedTupleType =
typename oneapi::dpl::__internal::__get_tuple_type<std::decay_t<decltype(__v)>,
std::decay_t<decltype(__out_rng[__id])>>::__type;
__out_rng[__id] = static_cast<_ConvertedTupleType>(__v);
}
};

template <typename _ExecutionPolicy, typename _Range1, typename _Range2, typename _UnaryOperation, typename _InitType,
typename _BinaryOperation, typename _Inclusive>
auto
Expand All @@ -773,24 +804,46 @@ __parallel_transform_scan(oneapi::dpl::__internal::__device_backend_tag __backen
_InitType __init, _BinaryOperation __binary_op, _Inclusive)
{
using _Type = typename _InitType::__value_type;
// Reduce-then-scan is dependent on sycl::shift_group_right which requires the underlying type to be trivially
// copyable. If this is not met, then we must fallback to the multi pass scan implementation. The single
// work-group implementation requires a fundamental type which must also be trivially copyable.
if constexpr (std::is_trivially_copyable_v<_Type>)
{
bool __use_reduce_then_scan = oneapi::dpl::__par_backend_hetero::__is_gpu_with_sg_32(__exec);

// Next power of 2 greater than or equal to __n
auto __n_uniform = __n;
if ((__n_uniform & (__n_uniform - 1)) != 0)
__n_uniform = oneapi::dpl::__internal::__dpl_bit_floor(__n) << 1;
// TODO: Consider re-implementing single group scan to support types without known identities. This could also
// allow us to use single wg scan for the last block of reduce-then-scan if it is sufficiently small.
constexpr bool __can_use_group_scan = unseq_backend::__has_known_identity<_BinaryOperation, _Type>::value;
if constexpr (__can_use_group_scan)
{
// Next power of 2 greater than or equal to __n
std::size_t __n_uniform = oneapi::dpl::__internal::__dpl_bit_ceil(__n);

constexpr bool __can_use_group_scan = unseq_backend::__has_known_identity<_BinaryOperation, _Type>::value;
if constexpr (__can_use_group_scan)
{
if (__group_scan_fits_in_slm<_Type>(__exec.queue(), __n, __n_uniform))
// Empirically found values for reduce-then-scan and multi pass scan implementation for single wg cutoff
std::size_t __single_group_upper_limit = __use_reduce_then_scan ? 2048 : 16384;
if (__group_scan_fits_in_slm<_Type>(__exec.queue(), __n, __n_uniform, __single_group_upper_limit))
julianmi marked this conversation as resolved.
Show resolved Hide resolved
{
return __parallel_transform_scan_single_group(
__backend_tag, std::forward<_ExecutionPolicy>(__exec), ::std::forward<_Range1>(__in_rng),
::std::forward<_Range2>(__out_rng), __n, __unary_op, __init, __binary_op, _Inclusive{});
danhoeflinger marked this conversation as resolved.
Show resolved Hide resolved
}
}
if (__use_reduce_then_scan)
{
return __parallel_transform_scan_single_group(
__backend_tag, std::forward<_ExecutionPolicy>(__exec), ::std::forward<_Range1>(__in_rng),
::std::forward<_Range2>(__out_rng), __n, __unary_op, __init, __binary_op, _Inclusive{});
using _GenInput = oneapi::dpl::__par_backend_hetero::__gen_transform_input<_UnaryOperation>;
using _ScanInputTransform = oneapi::dpl::__internal::__no_op;
using _WriteOp = oneapi::dpl::__par_backend_hetero::__simple_write_to_id;

_GenInput __gen_transform{__unary_op};

return __parallel_transform_reduce_then_scan(
__backend_tag, std::forward<_ExecutionPolicy>(__exec), std::forward<_Range1>(__in_rng),
std::forward<_Range2>(__out_rng), __gen_transform, __binary_op, __gen_transform, _ScanInputTransform{},
_WriteOp{}, __init, _Inclusive{});
}
}

// Either we can't use group scan or this input is too big for one workgroup
//else use multi pass scan implementation
using _Assigner = unseq_backend::__scan_assigner;
using _NoAssign = unseq_backend::__scan_no_assign;
using _UnaryFunctor = unseq_backend::walk_n<_ExecutionPolicy, _UnaryOperation>;
Expand Down
Loading
Loading