Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of sort/argsort and set functions per array API spec #1483

Merged
merged 34 commits into from
Jan 13, 2024

Conversation

oleksandr-pavlyk
Copy link
Collaborator

@oleksandr-pavlyk oleksandr-pavlyk commented Dec 6, 2023

Implemented data-parallel merge-sort algorithm, which powers array API operations dpct.tensor.sort and dpctl.tensor.argsort. Implemented set functions: dpctl.tensor.unique_values, dpctl.tensor.unique_counts, dpctl.tensor.unique_inverse, and doctl.tensor.unique_all.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

Copy link

github-actions bot commented Dec 6, 2023

Copy link

github-actions bot commented Dec 6, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_20 ran successfully.
Passed: 881
Failed: 50
Skipped: 59

Copy link

github-actions bot commented Dec 7, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_23 ran successfully.
Passed: 881
Failed: 50
Skipped: 59

@coveralls
Copy link
Collaborator

coveralls commented Dec 7, 2023

Coverage Status

coverage: 89.993% (+0.3%) from 89.693%
when pulling 5469832 on feature/sorting
into f8b2202 on master.

Copy link

github-actions bot commented Dec 7, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_27 ran successfully.
Passed: 881
Failed: 50
Skipped: 59

Copy link

github-actions bot commented Dec 8, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_32 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

Copy link

github-actions bot commented Dec 8, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_33 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_35 ran successfully.
Passed: 890
Failed: 42
Skipped: 59

Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_40 ran successfully.
Passed: 889
Failed: 43
Skipped: 59

Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_41 ran successfully.
Passed: 890
Failed: 42
Skipped: 59

@oleksandr-pavlyk oleksandr-pavlyk marked this pull request as ready for review December 9, 2023 13:12
Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_43 ran successfully.
Passed: 894
Failed: 38
Skipped: 59

Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_44 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_49 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_67 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

@ndgrigorian
Copy link
Collaborator

ndgrigorian commented Dec 18, 2023

@oleksandr-pavlyk
Tests seem to pass using the asynchronous unique_* functions. Should I go ahead and remove set_functions_sync.py and transition to the asynchronous implementations?

For me, using the async implementation shaves off ~1ms for small arrays

In [2]: x = dpt.ones((10), dtype="c8")

In [3]: %time z = dpt.unique_values(x)
CPU times: user 267 ms, sys: 20.1 ms, total: 287 ms
Wall time: 296 ms

In [4]: %time z = dpt.unique_values(x)
CPU times: user 2.77 ms, sys: 0 ns, total: 2.77 ms
Wall time: 4.52 ms

In [5]: %time z = dpt.unique_values(x)
CPU times: user 1.51 ms, sys: 371 µs, total: 1.88 ms
Wall time: 1.8 ms

In [6]: %time z = dpt.unique_values(x)
CPU times: user 1.6 ms, sys: 0 ns, total: 1.6 ms
Wall time: 1.59 ms

In [7]: %time z = dpt.unique_values(x)
CPU times: user 1.36 ms, sys: 0 ns, total: 1.36 ms
Wall time: 1.29 ms

In [8]: %time z = dpt.unique_values(x)
CPU times: user 2.05 ms, sys: 0 ns, total: 2.05 ms
Wall time: 1.78 ms

compared to synchronized

In [2]: x = dpt.ones((10), dtype="c8")

In [3]: %time z = dpt.unique_values(x)
CPU times: user 264 ms, sys: 14.6 ms, total: 279 ms
Wall time: 287 ms

In [4]: %time z = dpt.unique_values(x)
CPU times: user 1.6 ms, sys: 1.07 ms, total: 2.67 ms
Wall time: 3.43 ms

In [5]: %time z = dpt.unique_values(x)
CPU times: user 1.62 ms, sys: 1.08 ms, total: 2.7 ms
Wall time: 3.78 ms

In [6]: %time z = dpt.unique_values(x)
CPU times: user 2.69 ms, sys: 0 ns, total: 2.69 ms
Wall time: 3.03 ms

In [7]: %time z = dpt.unique_values(x)
CPU times: user 0 ns, sys: 3.01 ms, total: 3.01 ms
Wall time: 3.33 ms

@oleksandr-pavlyk
Copy link
Collaborator Author

@ndgrigorian I have switched to asynchronous implementation per your feedback, but kept the synchronous one around just in case.

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_74 ran successfully.
Passed: 890
Failed: 42
Skipped: 59

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_77 ran successfully.
Passed: 895
Failed: 37
Skipped: 59

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_78 ran successfully.
Passed: 892
Failed: 40
Skipped: 59

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_79 ran successfully.
Passed: 896
Failed: 36
Skipped: 59

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_82 ran successfully.
Passed: 895
Failed: 37
Skipped: 59

Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_83 ran successfully.
Passed: 894
Failed: 38
Skipped: 59

oleksandr-pavlyk and others added 18 commits January 8, 2024 13:30
This is `std::vector<sycl::event>` passed by reference to collect
events associated with host_task submissions.

The synchronizing call `mask_positions` is releasing GIL before
wait on these events is called.

It is either this, or accumulated host tasks must be returned to
the user.
Changed hyperparameter choices to be different for CPU and GPU, resulting
in 20% performance gain on GPU.

The non-recursive implementation allows to avoid repeated USM allocations,
resulting in performance gains for large arrays.

Furthermore, corrected base step kernel to accumulate in outputT rather than
in size_t, which additionally realizes savings when int32 is used as
accumulator type.

Using example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
As the array will be copied if it isn't C-contiguous regardless,
this change does not impact performance and permits calls to set functions
with strided, ND data
Set functions were making copies and then sorting the original data,
resulting in incorrect results.
`unique_all` would not return a `UniqueAllResult` tuple when early exiting for an array of all unique elements

Fixed problems through `unique` functions where the cumulative sum was being allocated on a separate (default) queue, causing inputs with a non-default queue to fail

`unique` functions now behave correctly for 0-size array inputs
Were previously returning a 1D array of indices rather than an array with the same shape as input `x`
Changes to sync implementation of set functions so that test
suite passes with sync implementation as it does for async
implementation.
Copy link

github-actions bot commented Jan 8, 2024

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_85 ran successfully.
Passed: 893
Failed: 16
Skipped: 82

Copy link

github-actions bot commented Jan 8, 2024

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_87 ran successfully.
Passed: 893
Failed: 16
Skipped: 82

…x FP types

We use extended comparison operators compatible with NumPy's behavior:

https://numpy.org/devdocs/reference/generated/numpy.sort.html

Specifically, we use [R, nan] block ordering for reals, and
[(R, R), (R, nan), (nan, R), (nan, nan)] for complexes.
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_89 ran successfully.
Passed: 895
Failed: 14
Skipped: 82

Copy link
Collaborator

@ndgrigorian ndgrigorian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @oleksandr-pavlyk , very close to full array API coverage.

Copy link
Collaborator

@vtavana vtavana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, works fine for zero dimensional arrays and arrays with nan.

@oleksandr-pavlyk oleksandr-pavlyk merged commit fd84252 into master Jan 13, 2024
45 of 49 checks passed
@oleksandr-pavlyk oleksandr-pavlyk deleted the feature/sorting branch January 13, 2024 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants