Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate GPU Error Checking Function #350

Merged
merged 5 commits into from
Dec 8, 2023

Conversation

bcaddy
Copy link
Collaborator

@bcaddy bcaddy commented Oct 27, 2023

GPU Error Checking

I replaced the 3 macros and 7 functions for GPU error checking with a single overloaded function; one overload for CUDA/HIP checking and one for CUFFT/HIPFFT checking. The function supports wrapping a CUDA call or being called with no arguments to check the latest error.

I also added error checking to some cudaMallocs that were missing them or used them in a non-standard way.

The other major change is the deprecation of the CUDA_ERROR_CHECK macro. Now error checking is on by default and can be disabled with the new DISABLE_GPU_ERROR_CHECKING macro.

This should resolve Issue #286 and possibly #296 as well, subject to discussion in that issue.

@bcaddy bcaddy linked an issue Oct 27, 2023 that may be closed by this pull request
@bcaddy bcaddy force-pushed the dev-iss286and296 branch 2 times, most recently from 5d4941e to 42d17ef Compare October 27, 2023 20:45
Consolodate all GPU error checking functions and macros into one
overloaded function with one overload for CUDA/HIP errors and one for
CUFFT/HIPFFT errors. That one doesn't use any macros and supports all
the usual usage modes. It does utilize the
`experimental::source_location` class. That class is supported on all
compilers that we use or expect others to use but if it doesn't work
for you then commenting out the relevant lines should be sufficient.

Replaced all calls to `CHECK`, `CudaSafeCall`, `CudaCheckError`, and
`gpErrchk` with `GPU_Error_Check`.
Some already had error checks on the next line but now they're all
wrapped as is standard in the rest of the code
The CUDA_ERROR_CHECK macro that turns on error checking has been
deprecated in favor of the new DISABLE_GPU_ERROR_CHECKING macro which
disable error checking. Error checking is now on by default unless
compiled with the DISABLE_GPU_ERROR_CHECKING macro.
Copy link
Collaborator

@mabruzzo mabruzzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've gone through -- it looks good to me.

I assume that you have confirmed that any of the implicit grid-synchronizations performed by the old error checking macros were entirely unnecessary, right? If so, we're good to merge.

@bcaddy
Copy link
Collaborator Author

bcaddy commented Dec 8, 2023

Yep, everywhere that the implicit sync seemed like it might be relevant were on functions that already contain an implicit sync (like moving or allocating). Since we only use 1 GPU stream there's an implicit sync between all kernels.

@evaneschneider evaneschneider merged commit 39bd9e4 into cholla-hydro:dev Dec 8, 2023
10 checks passed
@bcaddy bcaddy deleted the dev-iss286and296 branch January 22, 2024 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants