Support max_n and min_n reductions on GPU #1196

Merged: 2 commits merged into holoviz:main from cuda_mutexes on Mar 24, 2023

Conversation

ianthomas23 (Member)

Closes #1177.

This adds support for max_n and min_n reductions on a GPU, both with and without dask. The key change is new CUDA mutex functionality to support CUDA append functions (i.e. individual pixel callbacks) that do more than a simple get/set operation. Because of the massively parallel nature of CUDA hardware, multiple threads can access the same canvas pixel at the same time, so until now we have been restricted to CUDA atomic operations (https://numba.readthedocs.io/en/stable/cuda/intrinsics.html#supported-atomic-operations) in append functions. With the new mutex we can lock access to a particular pixel to a single thread at a time and thus perform more complicated operations, such as those needed for max_n, without any race conditions.
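
For illustration, here is a minimal sketch (not the code in this PR) of how a spin-lock mutex built from numba's existing CUDA atomics can guard a multi-element per-pixel update such as the one max_n needs; the append function signature and the aggregation layout are assumptions for the example:

```python
from numba import cuda


@cuda.jit(device=True)
def cuda_mutex_lock(mutex):
    # Spin until we swap the single lock word from 0 (unlocked) to 1 (locked).
    # compare_and_swap operates on the first element of a 1D array.
    while cuda.atomic.compare_and_swap(mutex, 0, 1) != 0:
        pass
    cuda.threadfence()  # make the previous holder's writes visible to us


@cuda.jit(device=True)
def cuda_mutex_unlock(mutex):
    cuda.threadfence()  # flush our writes before releasing the lock
    cuda.atomic.compare_and_swap(mutex, 1, 0)


@cuda.jit(device=True)
def append_max_n(x, y, value, agg, mutex):
    # Keeping the n largest values per pixel is a read-modify-write over
    # several elements, which plain atomics cannot express. Assumes
    # agg[y, x, :] holds the current n largest values in descending order,
    # initialised to -inf.
    cuda_mutex_lock(mutex)
    for i in range(agg.shape[2]):
        if value > agg[y, x, i]:
            # Shift smaller values down and insert the new one.
            for j in range(agg.shape[2] - 1, i, -1):
                agg[y, x, j] = agg[y, x, j - 1]
            agg[y, x, i] = value
            break
    cuda_mutex_unlock(mutex)
```

Intra-warp contention with a spin lock needs care on real hardware, and the actual implementation has to slot into datashader's generated append functions; the sketch only shows the locking idea.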

In the implementation we need to pass the mutex (a cupy array) to the CUDA append functions. This is achieved within the expand_aggs_and_cols framework by appending the mutex array in the make_info function, which is where other arrays and/or dataframe columns are extracted and passed to the append functions. This ensures that there is only ever a single shared mutex, even if multiple reductions need it.
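
As a rough sketch of that plumbing (with hypothetical names; the real make_info lives in datashader/compiler.py and its signature differs), the mutex is created once as a cupy array and appended to the tuple of arrays forwarded to the append functions:

```python
import cupy


def make_info(df, reductions, use_cuda_mutex):
    # Hypothetical shape of the idea: gather the arrays/columns each
    # reduction needs, as make_info already does for other inputs.
    info = tuple(red.extract_column(df) for red in reductions)  # illustrative
    if use_cuda_mutex:
        # A single zero-initialised lock word (0 == unlocked). Creating it
        # here means every append function receives the same shared mutex,
        # however many reductions ask for it.
        mutex = cupy.zeros(1, dtype=cupy.int32)
        info = info + (mutex,)
    return info
```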

This implementation is limited by what is currently available in numba 0.56, which means we can only lock/unlock the mutex as a whole rather than its individual elements/pixels, so performance will not be great. Numba PR numba/numba#8790 will allow us to lock individual pixels, so when numba 0.57 is released I will write another PR that uses the fast route if it is available and otherwise falls back to this slower one.
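
A hedged sketch of that faster route, assuming the indexed compare-and-swap lands as cuda.atomic.cas(ary, idx, old, val) (the change tracked in numba/numba#8790); the helper names and feature check below are illustrative only:

```python
from numba import cuda


def per_pixel_mutex_available():
    # One possible runtime check: use the per-pixel route only if this numba
    # build exposes an indexed CAS, otherwise fall back to the whole-array lock.
    return hasattr(cuda.atomic, "cas")


@cuda.jit(device=True)
def lock_pixel(mutex, index):
    # `mutex` is a flat int32 array with one lock word per canvas pixel
    # (index = y * canvas_width + x), so only threads writing to the same
    # pixel contend with each other.
    while cuda.atomic.cas(mutex, index, 0, 1) != 0:
        pass
    cuda.threadfence()


@cuda.jit(device=True)
def unlock_pixel(mutex, index):
    cuda.threadfence()
    cuda.atomic.cas(mutex, index, 1, 0)
```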

There is no support yet for where(max_n) on CUDA, but this will follow in another PR soon.

ianthomas23 added this to the v0.14.5 milestone on Mar 22, 2023

codecov bot commented Mar 22, 2023

Codecov Report

Merging #1196 (8f9ec81) into main (ed0d58e) will decrease coverage by 0.80%.
The diff coverage is 32.00%.

@@            Coverage Diff             @@
##             main    #1196      +/-   ##
==========================================
- Coverage   85.48%   84.68%   -0.80%     
==========================================
  Files          35       35              
  Lines        8232     8345     +113     
==========================================
+ Hits         7037     7067      +30     
- Misses       1195     1278      +83     
| Impacted Files | Coverage Δ |
| --- | --- |
| datashader/transfer_functions/_cuda_utils.py | 23.52% <17.77%> (-2.85%) ⬇️ |
| datashader/reductions.py | 83.11% <30.64%> (-3.06%) ⬇️ |
| datashader/compiler.py | 92.81% <72.22%> (-2.94%) ⬇️ |


ianthomas23 (Member, Author)

Test failures are related to rioxarray which released 0.14.0 yesterday.

ianthomas23 (Member, Author)

> Test failures are related to rioxarray which released 0.14.0 yesterday.

This was a dependency issue in the conda-forge rioxarray build, which has now been fixed: conda-forge/rioxarray-feedstock#70.

jbednar (Member) commented Mar 24, 2023

Nice. Thanks!

ianthomas23 merged commit 3d2f7df into holoviz:main on Mar 24, 2023
ianthomas23 deleted the cuda_mutexes branch on March 24, 2023 at 18:00
Successfully merging this pull request may close these issues: CUDA mutexes for complicated reductions (#1177)