Consolidate GPU Error Checking Function #350

bcaddy · 2023-10-27T20:07:45Z

GPU Error Checking

I replaced the 3 macros and 7 functions for GPU error checking with a single overloaded function: one overload for CUDA/HIP checking and one for CUFFT/HIPFFT checking. The function can either wrap a CUDA call or be called with no arguments to check the latest error.

I also added error checking to some `cudaMalloc` calls that were missing it or used it in a non-standard way.

The other major change is the deprecation of the CUDA_ERROR_CHECK macro. Now error checking is on by default and can be disabled with the new DISABLE_GPU_ERROR_CHECKING macro.

This should resolve Issue #286 and possibly #296 as well, subject to discussion in that issue.

Consolidate all GPU error checking functions and macros into one overloaded function, with one overload for CUDA/HIP errors and one for CUFFT/HIPFFT errors. The new function doesn't use any macros and supports all the usual usage modes. It does use the `experimental::source_location` class. That class is supported on all compilers that we use or expect others to use, but if it doesn't work for you then commenting out the relevant lines should be sufficient. Replaced all calls to `CHECK`, `CudaSafeCall`, `CudaCheckError`, and `gpErrchk` with `GPU_Error_Check`.
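For readers who want a picture of the pattern described above, here is a minimal sketch, not the code in this PR. It assumes libstdc++'s `<experimental/source_location>` header is available. The names `FakeGpuError`, `FakeFftResult`, `Fake_Last_Error`, `Report_Failure`, and `GPU_Error_Check_Sketch` are hypothetical stand-ins chosen so the example builds without a GPU toolkit, and it throws on failure only to keep the sketch easy to exercise.

```cpp
#include <experimental/source_location>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins so the sketch builds without a GPU toolkit. Real code
// would use cudaError_t/hipError_t and cufftResult/hipfftResult instead.
enum class FakeGpuError { Success = 0, LaunchFailure = 1 };
enum class FakeFftResult { Success = 0, InvalidPlan = 1 };

// Hypothetical stand-in for a "latest error" query such as cudaPeekAtLastError().
inline FakeGpuError Fake_Last_Error() { return FakeGpuError::Success; }

using Source_Location = std::experimental::source_location;

// Shared helper: builds a message naming the call site and throws. A production
// version might print and exit instead; throwing keeps the sketch testable.
[[noreturn]] inline void Report_Failure(const char* kind, int code, const Source_Location& loc)
{
  std::ostringstream msg;
  msg << kind << " error " << code << " at " << loc.file_name() << ":" << loc.line()
      << " in " << loc.function_name();
  throw std::runtime_error(msg.str());
}

// Overload 1: runtime errors. With no arguments it checks the latest error;
// with one argument it checks the status returned by the wrapped call.
// The defaulted `loc` is evaluated at the call site, so no macro is needed.
inline void GPU_Error_Check_Sketch(FakeGpuError code    = Fake_Last_Error(),
                                   Source_Location loc = Source_Location::current())
{
  if (code != FakeGpuError::Success) Report_Failure("GPU", static_cast<int>(code), loc);
}

// Overload 2: FFT library results. `code` has no default here, so a
// zero-argument call resolves unambiguously to overload 1.
inline void GPU_Error_Check_Sketch(FakeFftResult code,
                                   Source_Location loc = Source_Location::current())
{
  if (code != FakeFftResult::Success) Report_Failure("FFT", static_cast<int>(code), loc);
}
```

In this sketch, wrapping a call looks like `GPU_Error_Check_Sketch(some_call_returning_a_status())`, while a bare `GPU_Error_Check_Sketch()` checks the latest error. The reported file and line are those of the caller because the default argument is evaluated where the function is called.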

Some `cudaMalloc` calls already had error checks on the next line, but now they're all wrapped, as is standard in the rest of the code.

The CUDA_ERROR_CHECK macro that turned on error checking has been deprecated in favor of the new DISABLE_GPU_ERROR_CHECKING macro, which disables error checking. Error checking is now on by default unless the code is compiled with the DISABLE_GPU_ERROR_CHECKING macro defined.
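As a rough illustration of the opt-out pattern, the sketch below compiles the check in unless `DISABLE_GPU_ERROR_CHECKING` is defined, for example with a `-DDISABLE_GPU_ERROR_CHECKING` compiler flag. It is an assumption about how such a switch could be wired up, not the implementation in this PR; `FakeGpuError`, `Checked_Call`, and `Error_Checking_Enabled` are hypothetical names.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t / hipError_t.
enum class FakeGpuError { Success = 0, LaunchFailure = 1 };

// The check is compiled in by default and compiled out only when
// DISABLE_GPU_ERROR_CHECKING is defined at build time.
inline void Checked_Call(FakeGpuError code)
{
#ifndef DISABLE_GPU_ERROR_CHECKING
  if (code != FakeGpuError::Success) {
    throw std::runtime_error("GPU error code " + std::to_string(static_cast<int>(code)));
  }
#else
  (void)code;  // avoid an unused-parameter warning when checks are compiled out
#endif
}

// Reports whether checks are compiled into this build.
constexpr bool Error_Checking_Enabled()
{
#ifndef DISABLE_GPU_ERROR_CHECKING
  return true;
#else
  return false;
#endif
}
```

Because `Checked_Call` is a function rather than a macro, the wrapped expression is still evaluated when checking is disabled; only the inspection of its result is skipped.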

mabruzzo

So I've gone through -- it looks good to me.

I assume that you have confirmed that any of the implicit grid-synchronizations performed by the old error checking macros were entirely unnecessary, right? If so, we're good to merge.

bcaddy · 2023-12-08T21:06:28Z

Yep, every place where the implicit sync seemed like it might be relevant was a call to a function that already contains an implicit sync (like moving or allocating). Since we only use 1 GPU stream, there's an implicit sync between all kernels.

bcaddy linked an issue Oct 27, 2023 that may be closed by this pull request

Multiple versions of cuda error checking with unclear performance impacts #286

Closed

bcaddy mentioned this pull request Oct 27, 2023

What should we do about device memory checking? #296

Closed

bcaddy linked an issue Oct 27, 2023 that may be closed by this pull request

What should we do about device memory checking? #296

Closed

bcaddy force-pushed the dev-iss286and296 branch 2 times, most recently from 5d4941e to 42d17ef Compare October 27, 2023 20:45

bcaddy added 5 commits November 27, 2023 11:00

Remove accidentally committed file

4477747

Add missing error checks to some cudaMallocs

971bacf

Some already had error checks on the next line but now they're all wrapped as is standard in the rest of the code

Add HIPifly macros needed for GPU error checking

cdbed5d

bcaddy force-pushed the dev-iss286and296 branch from 42d17ef to cdbed5d Compare November 27, 2023 16:01

bcaddy mentioned this pull request Dec 8, 2023

Multiple versions of cuda error checking with unclear performance impacts #286

Closed

mabruzzo approved these changes Dec 8, 2023

evaneschneider merged commit 39bd9e4 into cholla-hydro:dev Dec 8, 2023
10 checks passed

bcaddy deleted the dev-iss286and296 branch January 22, 2024 18:31

mabruzzo mentioned this pull request Oct 19, 2024

More thread crash improvements #417

Open
