-
Notifications
You must be signed in to change notification settings - Fork 447
Better error message for no GPU or incompatible GPU #577
Conversation
To give user some clue what's happening if the program gets compiled on a node with no GPU or if it gets compiled with different compute capability than the one it's running on. In both scenarios no good error message were produced before. The proposed changes will improve the user experience and make it easier for users to troubleshoot problems. This fix is for addressing the issue#1785 reported on Thrust https://github.com/NVIDIA/thrust/issues/1785
From issue#1785 on thrust (NVIDIA/cccl#818), for this small test case:
when compiled with -gpu=cc60 and then run it on a system with cc80, the error message would be: This doesn't help user to understand what's happening. I tried to address it in this change so that better message will show up: |
@allisonvacanti @jrhemstad |
Thanks @zkhatami! This is a nice addition. We'll have @senior-zero take a look. |
if (device < 0) { | ||
printf("No GPU is available\n"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this for debugging? I don't think we want to keep this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to keep this as well when a code gets compiled on a node with no GPU. Since now it gives this error that doesn't help user:
terminate called after throwing an instance of 'thrust::system::system_error' what(): radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal Aborted (core dumped)
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, " | ||
"different from the one that it was compiled for\n", | ||
sm_version/100, (sm_version%100)/10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
printf
isn't a very robust or canonical way of reporting errors. This should really be an exception, but CUB doesn't currently throw exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other problem with this approach is it is dependent on our current strategy of using cudaFuncGetAttributes
for querying the PTX version at runtime.
This will no longer work when we migrate to the alternative approach described here: https://github.com/NVIDIA/cub/issues/556
Is there any release targeted for NVIDIA/thrust#556? If its unknown, then might be beneficial from user perspective to have this fix with current CUB design. |
I agree, however, I'm not sure about our ability to preserve this functionality into the future.
That's the problem, I don't think we do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution! Some minor questions below
int device; | ||
if (!CubDebug(cudaGetDevice(&device))) { | ||
if (!CubDebug(SmVersionUncached(sm_version, device))) { | ||
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include something like stdio in this header if we are going to use printf unconditionally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should it be fprintf(stderr, ...)
instead?
if (!CubDebug(cudaGetDevice(&device))) { | ||
if (!CubDebug(SmVersionUncached(sm_version, device))) { | ||
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, " | ||
"different from the one that it was compiled for\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: we could report actual architectures we were compiled against. We are using the list in any case for our namespace generation:
#define CUB_DETAIL_MAGIC_NS_BEGIN inline namespace CUB_DETAIL_MAGIC_NS_NAME(CUB_VERSION, NV_TARGET_SM_INTEGER_LIST) {
#else // not defined(_NVHPC_CUDA)
#define CUB_DETAIL_MAGIC_NS_BEGIN inline namespace CUB_DETAIL_MAGIC_NS_NAME(CUB_VERSION, __CUDA_ARCH_LIST__) {
We could probably concatenate appropriate lists and stringify the result as a follow up at some point.
if (result != cudaSuccess) { | ||
int sm_version; | ||
int device; | ||
if (!CubDebug(cudaGetDevice(&device))) { | ||
if (!CubDebug(SmVersionUncached(sm_version, device))) { | ||
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, " | ||
"different from the one that it was compiled for\n", | ||
sm_version/100, (sm_version%100)/10); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some style guides:
if (result != cudaSuccess) { | |
int sm_version; | |
int device; | |
if (!CubDebug(cudaGetDevice(&device))) { | |
if (!CubDebug(SmVersionUncached(sm_version, device))) { | |
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, " | |
"different from the one that it was compiled for\n", | |
sm_version/100, (sm_version%100)/10); | |
} | |
} | |
} | |
if (result != cudaSuccess) | |
{ | |
int sm_version{}; | |
int device{-1}; | |
if (CubDebug(cudaGetDevice(&device)) == cudaSuccess) | |
{ | |
if (CubDebug(SmVersionUncached(sm_version, device)) == cudaSuccess) | |
{ | |
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, " | |
"different from the one that it was compiled for\n", | |
sm_version / 100, (sm_version % 100) / 10); | |
} | |
} | |
} |
I agree with Jake that writing error messages to stdout or stderr is not the best way to report this problem. The use cases that we care about are calling Thrust. NVC++ stdpar never calls CUB directly. Thrust usually reports device-side problems by throwing a I wonder if |
@jrhemstad @dkolsen-pgi @senior-zero Still, I'm not able to add any reviewers here. The Reviewers button is still disabled for me. |
Hello @zkhatami!
If the change is now on the thrust part, can this PR be closed?
The thrust PR seems to have 3 reviewers at the moment. Let me know if I can help requesting reviews from other maintainers. |
I'm closing this pull since I've moved the changes to thrust. |
To give user some clue what's happening if the program gets compiled on a node with no GPU or if it gets compiled with different compute capability than the one it's running on. In both scenarios no good error message were produced before. The proposed changes will improve the user experience and make it easier for users to troubleshoot problems.
This fix is for addressing the issue#1785 reported on Thrust NVIDIA/cccl#818