
Single-pass scan kernel template #1320

Merged: 78 commits, Apr 24, 2024 (changes shown from 30 commits)

Commits
58c2639
Start of single-pass scan kernel template
adamfidel Aug 18, 2023
e0a676d
Fix hang in inclusive scan
adamfidel Aug 24, 2023
956f139
Debug statements for scan kernel template
adamfidel Aug 31, 2023
deb92cb
Update scan kernel template test
adamfidel Sep 6, 2023
fd0af78
Merge remote-tracking branch 'antcarcomp01/dev/adamfidel/scan_kt' int…
adamfidel Sep 13, 2023
590b1c0
Only have a single work-item per group query for previous tile status
adamfidel Sep 14, 2023
5f4069a
First attempt at parallel lookback
adamfidel Sep 18, 2023
6d4aa3d
Working cooperative lookback
adamfidel Sep 22, 2023
7e32a6f
Fix correctness issue with non-power-of-2 sizes
adamfidel Oct 25, 2023
1fcdba4
Code cleanup and bug fixes
adamfidel Nov 17, 2023
db77c7d
Remove accidental debug statement
adamfidel Nov 17, 2023
0228724
Test with floats
adamfidel Nov 27, 2023
b911bc0
Fix incorrect values for size_t scan
adamfidel Dec 13, 2023
8dacc2f
Add support for sizes > 2^30
adamfidel Dec 18, 2023
69b5afe
Uglify
adamfidel Dec 19, 2023
d88abb1
Remove debug code
adamfidel Dec 19, 2023
53e17b1
Merge branch 'main' into dev/adamfidel/scan_kt
adamfidel Jan 8, 2024
ce2bb8a
Move single-pass scan test to same place as ESIMD radix sort test
adamfidel Jan 8, 2024
d67d579
Move kernel template to other kernel template directory
adamfidel Jan 8, 2024
6d7929a
Restructure scan KT tests to be similar to sort KT tests
adamfidel Jan 8, 2024
4b01e64
Delete old test code
adamfidel Jan 8, 2024
6e97ed2
clang-format
adamfidel Jan 8, 2024
063ac92
Re-arrange assert for 64-bit atomics
adamfidel Jan 8, 2024
ec5a47c
Rename entry from single_pass_inclusive_scan -> inclusive_scan
adamfidel Jan 11, 2024
217a848
Rewrite kernel submission of init kernel to use provided kernel name
adamfidel Jan 11, 2024
28ab915
Rewrite main kernel to use user-provided kernel name
adamfidel Jan 11, 2024
d48e6d6
Move input values to local memory to process them there
adamfidel Jan 12, 2024
c2a6686
Merge remote-tracking branch 'github/dev/adamfidel/scan_kt' into dev/…
adamfidel Jan 12, 2024
9cf1ec1
fix issues caused by merge
adamfidel Jan 12, 2024
9d621a9
Address various pull request comments
adamfidel Jan 12, 2024
605cc42
KT scan tests generation through CMakeLists.txt (#1351)
SergeyKopienko Jan 17, 2024
c747f75
Remove unnecessary chunking logic
adamfidel Jan 17, 2024
b5136ce
Merge remote-tracking branch 'github/dev/adamfidel/scan_kt' into dev/…
adamfidel Jan 17, 2024
96c25ef
Separate out flags and value to support signed integers
adamfidel Jan 17, 2024
5a7626e
Use acq_rel instead of seq_cst
adamfidel Jan 17, 2024
de2061b
Fix duplicate kernel name errors
adamfidel Jan 23, 2024
d78db9c
Fix tests using oneapi::dpl::begin/end
adamfidel Jan 24, 2024
d4f019f
Fill padding with identity values to fix errors in test
adamfidel Jan 29, 2024
64edea3
Merge remote-tracking branch 'github/main' into dev/adamfidel/scan_kt
adamfidel Feb 1, 2024
e272b18
Assert that malloc_device returned non-null ptrs
adamfidel Feb 16, 2024
cdbec2d
Merge remote-tracking branch 'github/main' into dev/adamfidel/scan_kt
adamfidel Feb 16, 2024
3e57dbe
Pass ranges by forward ref
adamfidel Feb 16, 2024
e55c491
Use sycl::reqd_sub_group_size instead of intel::reqd_sub_group_size
adamfidel Feb 20, 2024
b944c2b
Replace raw unroll pragma with _ONEDPL_PRAGMA_UNROLL macro
adamfidel Feb 26, 2024
49dea3f
Call single-group scan if possible
adamfidel Feb 28, 2024
5262700
Change namespace from kt::igpu to kt::gpu
adamfidel Feb 29, 2024
e7c32cb
Use single malloc for values array
adamfidel Mar 13, 2024
5a6352f
Use only a single malloc if the value type is the same as the flag st…
adamfidel Mar 13, 2024
85c730c
Add missing free for status value array
adamfidel Mar 13, 2024
f135477
Fix include guard
adamfidel Mar 18, 2024
4e65b42
Rename _FlagType to _FlagStorageType
adamfidel Mar 18, 2024
f8f8508
Replace group.leader() with is first work-item in subgroup
adamfidel Mar 20, 2024
c0e22e3
Simplify device memory allocation into a single device_malloc
adamfidel Mar 22, 2024
404fcc2
Fix all_view test
adamfidel Apr 1, 2024
0e7a87c
Support passing sycl::buffer directly
adamfidel Apr 3, 2024
d6b60a6
Use __dpl_bit_ceil
adamfidel Apr 3, 2024
a899426
Promote types for sizes to 64 bits
adamfidel Apr 3, 2024
a8caee1
Merge commit '6a45be7078a636676b6a128a142e5d02213722ca' into dev/adam…
adamfidel Apr 9, 2024
41adacc
Add new device_backend_tag for call to single-group scan
adamfidel Apr 9, 2024
247bffb
Fix duplicate kernel name error with single-group scan
adamfidel Apr 9, 2024
80e6cd4
clang format
adamfidel Apr 9, 2024
0d0efeb
Fix alignment of status values
adamfidel Apr 17, 2024
b530eec
Rename [build/run]-igpu-tests to [build/run]-scan-kt-tests
adamfidel Apr 17, 2024
849db06
Remove template for queue type
adamfidel Apr 17, 2024
d076f11
Fix access mode for input
adamfidel Apr 17, 2024
eb6fb65
Add missing includes
adamfidel Apr 17, 2024
f1e0709
::std -> std:: and replacing assert with exception
adamfidel Apr 17, 2024
265ab8f
Fix number of elements vs number of bytes
adamfidel Apr 19, 2024
fe18dac
Improve test data generation, especially for multiplies
adamfidel Apr 19, 2024
e96c8e9
Merge remote-tracking branch 'github/dev/adamfidel/scan_kt' into dev/…
adamfidel Apr 19, 2024
0df3640
Fix CMake target without any constant params
adamfidel Apr 19, 2024
33c8982
Adding a few includes for completeness
adamfidel Apr 22, 2024
e431f01
Address PR comments
adamfidel Apr 23, 2024
6bcc32d
Merge remote-tracking branch 'github/main' into dev/adamfidel/scan_kt
adamfidel Apr 23, 2024
5f2fea6
Use new get_new_kernel_params function
adamfidel Apr 23, 2024
978e0fa
Correctly pass kernel name with optional_kernel_name
adamfidel Apr 23, 2024
804c8c7
clang-format + Moving around get_new_kernel_params
adamfidel Apr 23, 2024
7cd2d63
clang-format
adamfidel Apr 23, 2024
2 changes: 2 additions & 0 deletions include/oneapi/dpl/experimental/kernel_templates
@@ -18,4 +18,6 @@
# include "kt/esimd_radix_sort.h"
#endif

#include "kt/single_pass_scan.h"

#endif // _ONEDPL_KERNEL_TEMPLATES
337 changes: 337 additions & 0 deletions include/oneapi/dpl/experimental/kt/single_pass_scan.h
@@ -0,0 +1,337 @@
// -*- C++ -*-
//===----------------------------------------------------------------------===//
//
// Copyright (C) Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// This file incorporates work covered by the following copyright and permission
// notice:
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
//
//===----------------------------------------------------------------------===//

#ifndef _ONEDPL_parallel_backend_sycl_scan_H
#define _ONEDPL_parallel_backend_sycl_scan_H
#include "../../pstl/hetero/dpcpp/sycl_defs.h"
#include "../../pstl/hetero/dpcpp/unseq_backend_sycl.h"

namespace oneapi::dpl::experimental::kt
{

inline namespace igpu
{

namespace __impl
{

template <typename... _Name>
class __lookback_init_kernel;

template <typename... _Name>
class __lookback_kernel;

static constexpr int SUBGROUP_SIZE = 32;

template <typename _T>
struct __scan_status_flag
{
// 00xxxx - not computed
// 01xxxx - partial
// 10xxxx - full
// 110000 - out of bounds

static constexpr bool __is_larger_than_32_bits = sizeof(_T) * 8 > 32;
using _StorageType = ::std::conditional_t<__is_larger_than_32_bits, ::std::uint64_t, ::std::uint32_t>;
using _AtomicRefT = sycl::atomic_ref<_StorageType, sycl::memory_order::relaxed, sycl::memory_scope::device,
sycl::access::address_space::global_space>;

static constexpr ::std::size_t __flag_length = sizeof(_StorageType);

// Shift at the storage width: a plain 1ul may be only 32 bits wide, which
// would make the shift undefined for a 64-bit _StorageType
static constexpr _StorageType __partial_mask = _StorageType{1} << (__flag_length * 8 - 2);
static constexpr _StorageType __full_mask = _StorageType{1} << (__flag_length * 8 - 1);
static constexpr _StorageType __value_mask = ~(__partial_mask | __full_mask);
static constexpr _StorageType __oob_value = __partial_mask | __full_mask;

static constexpr int __padding = SUBGROUP_SIZE;

__scan_status_flag(_StorageType* __flags_begin, const std::uint32_t __tile_id)
: __atomic_flag(*(__flags_begin + __tile_id + __padding))
{
}

void
set_partial(const _T __val)
{
__atomic_flag.store(__val | __partial_mask);
}

void
set_full(const _T __val)
{
__atomic_flag.store(__val | __full_mask);
}

template <typename _Subgroup, typename _BinaryOp>
_T
cooperative_lookback(::std::uint32_t __tile_id, const _Subgroup& __subgroup, _BinaryOp __binary_op,
_StorageType* __flags_begin)
{
_T __running = oneapi::dpl::unseq_backend::__known_identity<_BinaryOp, _T>;
auto __local_id = __subgroup.get_local_id();

for (int __tile = static_cast<int>(__tile_id) - 1; __tile >= 0; __tile -= SUBGROUP_SIZE)
{
_AtomicRefT __tile_atomic(*(__flags_begin + __tile + __padding - __local_id));
_StorageType __tile_val = 0;
do
{
__tile_val = __tile_atomic.load();
} while (!sycl::all_of_group(__subgroup, __tile_val != 0));

bool __is_full = (__tile_val & __full_mask) && ((__tile_val & __partial_mask) == 0);
auto __is_full_ballot = sycl::ext::oneapi::group_ballot(__subgroup, __is_full);
::std::uint32_t __is_full_ballot_bits{};
__is_full_ballot.extract_bits(__is_full_ballot_bits);

auto __lowest_item_with_full = sycl::ctz(__is_full_ballot_bits);
_T __contribution = __local_id <= __lowest_item_with_full ? __tile_val & __value_mask : oneapi::dpl::unseq_backend::__known_identity<_BinaryOp, _T>;

// Running reduction of all of the partial results from the tiles found, as well as the full contribution from the closest tile (if any)
__running = __binary_op(__running, sycl::reduce_over_group(__subgroup, __contribution, __binary_op));

// If we found a full value, we can stop looking at previous tiles. Otherwise,
// keep going through tiles until we either find a full tile or we've completely
// recomputed the prefix using partial values
if (__is_full_ballot_bits)
break;
}
return __running;
}

_AtomicRefT __atomic_flag;
};

template <typename _KernelName>
struct __lookback_init_submitter;

template <typename... _Name>
struct __lookback_init_submitter<oneapi::dpl::__par_backend_hetero::__internal::__optional_kernel_name<_Name...>>
{
template <typename _StatusFlags, typename _Flag>
sycl::event
operator()(sycl::queue __q, _StatusFlags&& __status_flags, ::std::size_t __status_flags_size,
::std::uint16_t __status_flag_padding, _Flag __oob_value) const
{
return __q.submit([&](sycl::handler& __hdl) {
__hdl.parallel_for<_Name...>(sycl::range<1>{__status_flags_size}, [=](const sycl::item<1>& __item) {
auto __id = __item.get_linear_id();
__status_flags[__id] = __id < __status_flag_padding ? __oob_value : 0;
});
});
}
};

template <::std::uint16_t __data_per_workitem, ::std::uint16_t __workgroup_size, typename _Type, typename _FlagType,
typename _KernelName>
struct __lookback_submitter;

template <::std::uint16_t __data_per_workitem, ::std::uint16_t __workgroup_size, typename _Type, typename _FlagType,
typename... _Name>
struct __lookback_submitter<__data_per_workitem, __workgroup_size, _Type, _FlagType,
oneapi::dpl::__par_backend_hetero::__internal::__optional_kernel_name<_Name...>>
{
using _FlagStorageType = typename _FlagType::_StorageType;
static constexpr std::uint32_t __elems_in_tile = __workgroup_size * __data_per_workitem;

template <typename _InRng, typename _OutRng, typename _BinaryOp, typename _StatusFlags>
sycl::event
operator()(sycl::queue __q, sycl::event __prev_event, _InRng&& __in_rng, _OutRng&& __out_rng, _BinaryOp __binary_op,
::std::size_t __n, _StatusFlags&& __status_flags, ::std::size_t __status_flags_size,
::std::size_t __current_num_items) const
{
return __q.submit([&](sycl::handler& __hdl) {
auto __tile_vals = sycl::local_accessor<_Type, 1>(sycl::range<1>{__elems_in_tile}, __hdl);
__hdl.depends_on(__prev_event);

oneapi::dpl::__ranges::__require_access(__hdl, __in_rng, __out_rng);
__hdl.parallel_for<_Name...>(
sycl::nd_range<1>(__current_num_items, __workgroup_size),
[=](const sycl::nd_item<1>& __item) [[intel::reqd_sub_group_size(SUBGROUP_SIZE)]] {
Review comment (Contributor): The intel:: attribute can be replaced with a more portable abstraction: [[_ONEDPL_SYCL_REQD_SUB_GROUP_SIZE_IF_SUPPORTED(__req_sub_group_size)]].

Reply (Contributor): Usage of [[_ONEDPL_SYCL_REQD_SUB_GROUP_SIZE_IF_SUPPORTED(__req_sub_group_size)]] could be problematic here, since other portions of the code assume the subgroup size is always 32. If we want this solution to be device generic, I think we would also need to query the subgroup size at runtime in the kernel to be correct.
auto __group = __item.get_group();
auto __subgroup = __item.get_sub_group();
auto __local_id = __item.get_local_id(0);

::std::uint32_t __tile_id = 0;

// Obtain unique ID for this work-group that will be used in decoupled lookback
if (__group.leader())
{
sycl::atomic_ref<_FlagStorageType, sycl::memory_order::relaxed, sycl::memory_scope::device,
sycl::access::address_space::global_space>
__idx_atomic(__status_flags[__status_flags_size - 1]);
__tile_id = __idx_atomic.fetch_add(1);
}

__tile_id = sycl::group_broadcast(__group, __tile_id, 0);

// TODO: only need the cast if size is greater than 2^30, maybe specialize?
::std::size_t __current_offset = static_cast<::std::size_t>(__tile_id) * __elems_in_tile;
auto __out_begin = __out_rng.begin() + __current_offset;

if (__current_offset >= __n)
return;

// Global load into local
auto __wg_current_offset = (__tile_id * __elems_in_tile);
auto __wg_next_offset = ((__tile_id + 1) * __elems_in_tile);
auto __wg_local_memory_size = __elems_in_tile;

if (__wg_next_offset > __n)
__wg_local_memory_size = __n - __wg_current_offset;

if (__wg_next_offset <= __n)
{
#pragma unroll
for (std::uint32_t __i = 0; __i < __data_per_workitem; ++__i)
{
__tile_vals[__local_id + __workgroup_size * __i] = __in_rng[__wg_current_offset + __local_id + __workgroup_size * __i];
}
}
else
{
#pragma unroll
for (std::uint32_t __i = 0; __i < __data_per_workitem; ++__i)
{
if (__wg_current_offset + __local_id + __workgroup_size * __i < __n)
{
__tile_vals[__local_id + __workgroup_size * __i] = __in_rng[__wg_current_offset + __local_id + __workgroup_size * __i];
}
}
}

auto __tile_vals_ptr = __dpl_sycl::__get_accessor_ptr(__tile_vals);
_Type __local_reduction = sycl::joint_reduce(__group, __tile_vals_ptr, __tile_vals_ptr+__wg_local_memory_size, __binary_op);
_Type __prev_tile_reduction = 0;

// The first sub-group will query the previous tiles to find a prefix
if (__subgroup.get_group_id() == 0)
{
_FlagType __flag(__status_flags, __tile_id);

if (__group.leader())
__flag.set_partial(__local_reduction);

__prev_tile_reduction = __flag.cooperative_lookback(__tile_id, __subgroup, __binary_op, __status_flags);

if (__group.leader())
__flag.set_full(__binary_op(__prev_tile_reduction, __local_reduction));
}

__prev_tile_reduction = sycl::group_broadcast(__group, __prev_tile_reduction, 0);

sycl::joint_inclusive_scan(__group, __tile_vals_ptr, __tile_vals_ptr+__wg_local_memory_size, __out_begin, __binary_op, __prev_tile_reduction);
});
});
}
};

template <bool _Inclusive, typename _InRange, typename _OutRange, typename _BinaryOp, typename _KernelParam>
sycl::event
__single_pass_scan(sycl::queue __queue, _InRange&& __in_rng, _OutRange&& __out_rng, _BinaryOp __binary_op, _KernelParam)
{
using _Type = oneapi::dpl::__internal::__value_t<_InRange>;
using _FlagType = __scan_status_flag<_Type>;
using _FlagStorageType = typename _FlagType::_StorageType;

using _KernelName = typename _KernelParam::kernel_name;
using _LookbackInitKernel =
oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_provider<__lookback_init_kernel<_KernelName>>;
using _LookbackKernel =
oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_provider<__lookback_kernel<_KernelName>>;

const ::std::size_t __n = __in_rng.size();

if (__n == 0)
return sycl::event{};

static_assert(_Inclusive, "Single-pass scan only available for inclusive scan");
static_assert(oneapi::dpl::unseq_backend::__has_known_identity<_BinaryOp, _Type>::value, "Only binary operators with known identity values are supported");

assert("This device does not support 64-bit atomics" &&
(sizeof(_Type) * 8 <= 32 || __queue.get_device().has(sycl::aspect::atomic64)));

// We need to process the input array by 2^30 chunks for 32-bit ints
constexpr ::std::size_t __chunk_size = 1ul << (sizeof(_Type) * 8 - 2);
const ::std::size_t __num_chunks = oneapi::dpl::__internal::__dpl_ceiling_div(__n, __chunk_size);

constexpr ::std::size_t __workgroup_size = _KernelParam::workgroup_size;
constexpr ::std::size_t __data_per_workitem = _KernelParam::data_per_workitem;

// Avoid non_uniform n by padding up to a multiple of workgroup_size
std::uint32_t __elems_in_tile = __workgroup_size * __data_per_workitem;
::std::size_t __num_wgs = oneapi::dpl::__internal::__dpl_ceiling_div(__n, __elems_in_tile);

constexpr int __status_flag_padding = SUBGROUP_SIZE;
std::uint32_t __status_flags_size = __num_wgs + 1 + __status_flag_padding;

_FlagStorageType* __status_flags = sycl::malloc_device<_FlagStorageType>(__status_flags_size, __queue);
Review comment (Contributor): Do you think one large sycl::malloc_device call, setting __status_vals_full, __status_vals_partial, and __status_flags to offsets within the allocation, may be quicker than three separate mallocs?

Reply (Author): Yes, I think it is better to use a single allocation where possible. I have updated the code to do that.

auto __fill_event = __lookback_init_submitter<_LookbackInitKernel>{}(__queue, __status_flags, __status_flags_size,
__status_flag_padding, _FlagType::__oob_value);

sycl::event __prev_event = __fill_event;
for (int __chunk = 0; __chunk < __num_chunks; ++__chunk)
{
::std::size_t __current_chunk_size = __chunk == __num_chunks - 1 ? __n - __chunk * __chunk_size : __chunk_size;
::std::size_t __current_num_wgs =
oneapi::dpl::__internal::__dpl_ceiling_div(__current_chunk_size, __elems_in_tile);
::std::size_t __current_num_items = __current_num_wgs * __workgroup_size;

auto __event = __lookback_submitter<__data_per_workitem, __workgroup_size, _Type, _FlagType, _LookbackKernel>{}(
__queue, __prev_event, __in_rng, __out_rng, __binary_op, __n, __status_flags, __status_flags_size,
__current_num_items);
__prev_event = __event;
}

auto __free_event = __queue.submit([=](sycl::handler& __hdl) {
__hdl.depends_on(__prev_event);
__hdl.host_task([=]() { sycl::free(__status_flags, __queue); });
});

return __free_event;
}

} // namespace __impl

template <typename _InRng, typename _OutRng, typename _BinaryOp, typename _KernelParam>
sycl::event
inclusive_scan(sycl::queue __queue, _InRng __in_rng, _OutRng __out_rng, _BinaryOp __binary_op, _KernelParam __param = {})
{

return __impl::__single_pass_scan<true>(__queue, __in_rng, __out_rng, __binary_op, __param);
}

template <typename _InIterator, typename _OutIterator, typename _BinaryOp, typename _KernelParam>
sycl::event
inclusive_scan(sycl::queue __queue, _InIterator __in_begin, _InIterator __in_end, _OutIterator __out_begin,
_BinaryOp __binary_op, _KernelParam __param = {})
Comment on lines +417 to +420

Review comment (danhoeflinger, Mar 13, 2024): Do we need an overload with an init value to satisfy our current targets/goals? (If so, this brings up the question of what sort of data we accept as init: a scalar, a device_ptr, a single-element range/iterator?)

Reply (Author): The plan is to include a version with a single scalar init value to match the functionality of the original scan.

Reply (danhoeflinger): OK, so does that need to go in this PR, or do you plan to add it in a follow-up PR?

Reply (Author): That's a good question. My goal was to have it implemented in this PR, but it is taking longer to implement and debug correctness than I expected. I am still planning to update this PR with the changes for an initial value, but if it drags on too long, I might defer it to a follow-up PR.
{
auto __n = __in_end - __in_begin;

auto __keep1 = oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read, _InIterator>();
auto __buf1 = __keep1(__in_begin, __in_end);
auto __keep2 = oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _OutIterator>();
auto __buf2 = __keep2(__out_begin, __out_begin + __n);

return __impl::__single_pass_scan<true>(__queue, __buf1.all_view(), __buf2.all_view(), __binary_op, __param);
}

} // namespace igpu
Review comment (dmitriy-sobolev, Jan 12, 2024): What does igpu stand for? Am I correct in assuming that this KT is targeted at Intel GPUs?

Reply (Author): Yes, it is for Intel GPUs, as per the design doc.

Reply (Contributor): I am not sure whether it was discussed, but we could consider putting it in a gpu namespace instead of igpu. If the implementation does not use any instructions specific to Intel GPUs (and is not expected to), then it might perform well on other cards due to being highly configurable. What do you think?

} // namespace oneapi::dpl::experimental::kt

#endif /* _ONEDPL_parallel_backend_sycl_scan_H */
6 changes: 6 additions & 0 deletions test/kt/CMakeLists.txt
@@ -13,6 +13,7 @@
##===----------------------------------------------------------------------===##

option(ONEDPL_TEST_ENABLE_KT_ESIMD "Enable ESIMD-based kernel template tests")
option(ONEDPL_TEST_ENABLE_KT_SYCL "Enable SYCL-based kernel template tests")

function(_generate_test _target_name _test_path)
add_executable(${_target_name} EXCLUDE_FROM_ALL ${_test_path})
@@ -122,3 +123,8 @@ if (ONEDPL_TEST_ENABLE_KT_ESIMD)
# Pin some cases to track them, e.g. because they fail
_generate_esimd_sort_test("256" "32" "double" "" 1000) # segfault
endif()


if (ONEDPL_TEST_ENABLE_KT_SYCL)
_generate_test("single_pass_scan" "single_pass_scan.cpp")
endif()