Implementation of sort/argsort and set functions per array API spec #1483
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1483/index.html
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_20 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_23 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_27 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_32 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_33 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_35 ran successfully.
3be1f65 to 18e0db3
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_40 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_41 ran successfully.
90578b6 to 1849fff
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_43 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_44 ran successfully.
0660a78 to 7003c90
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_49 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_67 ran successfully.
@oleksandr-pavlyk For me, using the async implementation shaves off ~1 ms for small arrays compared to the synchronous one.
82da567 to 26d97ba
@ndgrigorian I have switched to the asynchronous implementation per your feedback, but kept the synchronous one around just in case.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_74 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_77 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_78 ran successfully.
09620d7 to 60c6ad6
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_79 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_82 ran successfully.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_83 ran successfully.
This is a `std::vector<sycl::event>` passed by reference to collect the events associated with host_task submissions. The synchronizing call `mask_positions` releases the GIL before waiting on these events. It is either this, or the accumulated host tasks must be returned to the user.
Changed hyperparameter choices to be different for CPU and GPU, resulting in a 20% performance gain on GPU. The non-recursive implementation avoids repeated USM allocations, resulting in performance gains for large arrays. Furthermore, corrected the base step kernel to accumulate in outputT rather than in size_t, which realizes additional savings when int32 is used as the accumulator type.

Using the example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
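For readers unfamiliar with how a cumulative sum drives boolean-mask extraction such as `ag[bg]`, here is a minimal pure-Python sketch of the idea. It is not the dpctl kernel: the function name `extract_by_mask` is hypothetical, the data are plain lists, and the loop is sequential, whereas the device implementation discussed above is parallel.

```python
from itertools import accumulate


def extract_by_mask(values, mask):
    """Return the elements of `values` where `mask` is true, in order.

    Hypothetical helper for illustration only; not part of dpctl.
    """
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    # Inclusive cumulative sum of the mask: positions[i] counts the
    # selected elements among mask[0..i].
    positions = list(accumulate(int(bool(m)) for m in mask))
    n_selected = positions[-1] if positions else 0
    out = [None] * n_selected
    # Each selected element i lands at index positions[i] - 1. The writes
    # do not depend on each other, which is why this step parallelizes.
    for i, selected in enumerate(mask):
        if selected:
            out[positions[i] - 1] = values[i]
    return out
```

The last entry of the cumulative sum gives the output size, so the result buffer can be allocated once before any element is copied.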
As the array will be copied if it isn't C-contiguous regardless, this change does not impact performance and permits calls to set functions with strided, ND data
Set functions were making copies and then sorting the original data, resulting in incorrect results.
`unique_all` would not return a `UniqueAllResult` tuple when exiting early for an array of all-unique elements.
Fixed problems throughout the `unique` functions where the cumulative sum was being allocated on a separate (default) queue, causing inputs with a non-default queue to fail.
`unique` functions now behave correctly for 0-size array inputs.
Were previously returning a 1D array of indices rather than an array with the same shape as input `x`
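To illustrate what "an array with the same shape as input `x`" means for inverse indices, here is a small pure-Python sketch on nested lists. The function `unique_inverse_2d` is hypothetical and only mirrors the array API semantics for a 2-D input of hashable, non-NaN values; it says nothing about how dpctl computes the result.

```python
def unique_inverse_2d(x):
    """Return (values, inverse_indices) for a list of equal-length rows.

    Hypothetical illustration: inverse_indices has the same nested shape
    as x, and values[inverse_indices[i][j]] == x[i][j] for every i, j.
    """
    flat = [v for row in x for v in row]
    values = sorted(set(flat))
    index_of = {v: i for i, v in enumerate(values)}
    # Build the indices row by row so the 2-D shape of x is preserved
    # instead of being flattened to 1-D.
    inverse_indices = [[index_of[v] for v in row] for row in x]
    return values, inverse_indices
```

For example, `unique_inverse_2d([[3, 1], [1, 2]])` yields values `[1, 2, 3]` and a 2x2 nested list of indices, from which the input can be rebuilt element by element.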
As mandated by the array API spec
Changed the sync implementation of set functions so that the test suite passes with it, as it does with the async implementation.
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_85 ran successfully.
de20631 to 08e5dac
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_87 ran successfully.
…x FP types
We use extended comparison operators compatible with NumPy's behavior: https://numpy.org/devdocs/reference/generated/numpy.sort.html
Specifically, we use [R, nan] block ordering for reals, and [(R, R), (R, nan), (nan, R), (nan, nan)] for complexes.
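The block ordering described above can be sketched as sort keys in pure Python. This is only an illustration of the ordering, not the comparison operators used in the PR; the names `real_sort_key` and `complex_sort_key` are hypothetical. Within a block, ordering by the non-NaN part follows what the NumPy `sort` documentation describes.

```python
import math


def real_sort_key(x):
    """Key placing all real numbers before NaN: [R, nan]."""
    is_nan = math.isnan(x)
    return (is_nan, 0.0 if is_nan else x)


def complex_sort_key(z):
    """Key giving blocks [(R, R), (R, nan), (nan, R), (nan, nan)]."""
    re_nan = math.isnan(z.real)
    im_nan = math.isnan(z.imag)
    # Block index: 0 for (R, R), 1 for (R, nan), 2 for (nan, R),
    # 3 for (nan, nan).
    block = 2 * int(re_nan) + int(im_nan)
    # Inside a block, compare by whichever parts are not NaN; NaN parts
    # are replaced by 0.0 so the tuple comparison stays well defined.
    return (block, 0.0 if re_nan else z.real, 0.0 if im_nan else z.imag)
```

Passing these keys to the built-in `sorted` reproduces the block layout, for example `sorted(data, key=complex_sort_key)`.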
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_89 ran successfully.
LGTM, thank you @oleksandr-pavlyk, very close to full array API coverage.
Thank you, works fine for zero-dimensional arrays and arrays with `nan`.
Implemented a data-parallel merge-sort algorithm, which powers the array API operations `dpctl.tensor.sort` and `dpctl.tensor.argsort`. Implemented set functions: `dpctl.tensor.unique_values`, `dpctl.tensor.unique_counts`, `dpctl.tensor.unique_inverse`, and `dpctl.tensor.unique_all`.
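As a rough illustration of how one merge sort can serve both `sort` and `argsort`, here is a sequential, bottom-up sketch in pure Python. It is not the SYCL kernel from this PR: the bottom-up structure, the function names, and the use of a key function are assumptions made for the example. The point is that each pass merges disjoint pairs of runs, so the merges within a pass are independent of one another.

```python
def merge_runs(src, dst, lo, mid, hi, key):
    """Stable merge of sorted src[lo:mid] and src[mid:hi] into dst[lo:hi]."""
    i, j = lo, mid
    for k in range(lo, hi):
        # Take from the left run on ties so equal keys keep their order.
        if i < mid and (j >= hi or key(src[i]) <= key(src[j])):
            dst[k] = src[i]
            i += 1
        else:
            dst[k] = src[j]
            j += 1


def merge_sort(items, key=lambda v: v):
    """Return a new sorted list using bottom-up (non-recursive) merging."""
    src = list(items)
    dst = src[:]  # scratch buffer, allocated once and reused every pass
    n = len(src)
    width = 1
    while width < n:
        # Merges in one pass touch disjoint slices, so they could run
        # concurrently in a data-parallel setting.
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            merge_runs(src, dst, lo, mid, hi, key)
        src, dst = dst, src
        width *= 2
    return src


def argsort(items):
    """Return indices that sort `items`, by sorting indices keyed on values."""
    return merge_sort(range(len(items)), key=lambda i: items[i])
```

Sorting index positions keyed on the values they point to is what lets `argsort` reuse the same merge routine as `sort`.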