Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of reduction for small number of elements to reduce for types where tree-reduction is needed #1463

Merged
merged 10 commits into from
Nov 4, 2023

Conversation

oleksandr-pavlyk
Copy link
Collaborator

@oleksandr-pavlyk oleksandr-pavlyk commented Nov 2, 2023

  1. Renamed misspelled variable

  2. If reduction_nelems is small, use SequentialReductionKernel for tree-reductions as it is done for atomic reduction

  3. Tweak scaling down logic for moderately-sized number of elements to reduce. Closes Sensible performance degradation in dpt.tensor.sum #1461

    We should also use max_wg if the iter_nelems is very small (one), since choosing max_wg for large iter_nelems may lead to under-utilization of GPU.


  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

1. Renamed misspelled variable
2. If reduction_nelems is small, used SequentialReductionKernel
   for tree-reductions as it is done for atomic reduction
3. Tweak scaling down logic for moderately-sized number of elements
   to reduce.

   We should also use max_wg if the iter_nelems is very small (one),
   since choosing max_wg for large iter_nelems may lead to under-
   utilization of GPU.
Copy link

github-actions bot commented Nov 2, 2023

Copy link

github-actions bot commented Nov 2, 2023

Array API standard conformance tests for dpctl=0.15.1dev0=py310ha25a700_75 ran successfully.
Passed: 935
Failed: 65
Skipped: 119

_tensor_impl continues holding constructors, where, clip

_tensor_elementwise_impl holds elementwise functions
_tensor_reductions_impl holds reduction functions.
@antonwolfy
Copy link
Collaborator

The PR resolves the performance issue gh-1461, L2 norm benchmark is back to expected values. Thank you!

@oleksandr-pavlyk oleksandr-pavlyk marked this pull request as ready for review November 3, 2023 00:31
@coveralls
Copy link
Collaborator

coveralls commented Nov 3, 2023

Coverage Status

coverage: 85.785% (+0.04%) from 85.748%
when pulling d4d4992 on optimize-small-size-tree-reduction
into 11ecba8 on master.

Copy link

github-actions bot commented Nov 3, 2023

Array API standard conformance tests for dpctl=0.15.1dev0=py310ha25a700_78 ran successfully.
Passed: 935
Failed: 65
Skipped: 119

Added stable API to retrieve implementation functions in each elementwise
function class instance to allow `dpnp` to access that information using
stable API.
…at types

Added entries for float and double types to TypePairSupportDataForCompReductionAtomic
as spotted by @ndgrigorian in the PR review.

Also moved comments around.
This removes use of dpnp.matmul from the example, making this example
self-contained.
@ndgrigorian
Copy link
Collaborator

@oleksandr-pavlyk
I realized while looking through reductions.hpp that floating-point types are not included in TypePairSupportDataForCompReductionAtomic, so even though we permit atomics for max and min on floating-point arrays, there aren't any functions to call in the table.

float and double should be added to fix this.

Copy link

github-actions bot commented Nov 3, 2023

Array API standard conformance tests for dpctl=0.15.1dev0=py310ha25a700_79 ran successfully.
Passed: 935
Failed: 65
Skipped: 119

ndgrigorian and others added 2 commits November 3, 2023 13:19
…ts (#1464)

* Adds SequentialSearchReduction functor to search reductions

* Search reductions use correct branch for float16

constexpr branch logic accounted for floating point types but not sycl::half,
which meant NaNs were not propagating for float16 data
Copy link

github-actions bot commented Nov 3, 2023

Array API standard conformance tests for dpctl=0.15.1dev0=py310ha25a700_81 ran successfully.
Passed: 935
Failed: 65
Skipped: 119

@ndgrigorian
Copy link
Collaborator

#include "boolean_reductions.hpp"

#include "reductions/reduction_common.hpp"

These includes in tensor_ctors.cpp should be unnecessary with these changes.

Copy link

github-actions bot commented Nov 3, 2023

Array API standard conformance tests for dpctl=0.15.1dev0=py310ha25a700_82 ran successfully.
Passed: 935
Failed: 65
Skipped: 119

@oleksandr-pavlyk
Copy link
Collaborator Author

oleksandr-pavlyk commented Nov 4, 2023

#include "boolean_reductions.hpp"

#include "reductions/reduction_common.hpp"

These includes in tensor_ctors.cpp should be unnecessary with these changes.

Thank you @ndgrigorian, good catch!

Copy link
Collaborator

@ndgrigorian ndgrigorian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @oleksandr-pavlyk, LGTM

Copy link

github-actions bot commented Nov 4, 2023

Array API standard conformance tests for dpctl=0.15.1dev0=py310ha25a700_84 ran successfully.
Passed: 935
Failed: 65
Skipped: 119

@oleksandr-pavlyk oleksandr-pavlyk merged commit 9018745 into master Nov 4, 2023
26 checks passed
@oleksandr-pavlyk oleksandr-pavlyk deleted the optimize-small-size-tree-reduction branch November 4, 2023 03:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sensible performance degradation in dpt.tensor.sum
4 participants