This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Better error message for no GPU or incompatible GPU #577

Closed

Conversation

zkhatami

To give the user some clue about what's happening if the program gets compiled on a node with no GPU, or if it gets compiled for a different compute capability than the one it's running on. In both scenarios no good error message was produced before. The proposed changes will improve the user experience and make it easier for users to troubleshoot problems.

This fix addresses issue #1785 reported on Thrust: NVIDIA/cccl#818

@zkhatami zkhatami marked this pull request as draft September 21, 2022 17:57
@zkhatami zkhatami marked this pull request as ready for review September 21, 2022 17:58
@zkhatami zkhatami marked this pull request as draft September 21, 2022 18:33
@zkhatami zkhatami marked this pull request as ready for review September 21, 2022 18:34
@zkhatami
Author

From issue #1785 on Thrust (NVIDIA/cccl#818), for this small test case:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
  thrust::device_vector<int> dv;
  thrust::sort(dv.begin(), dv.end());
}

when compiled with -gpu=cc60 and then run on a system with cc80, the error message is:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 1st step: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
Aborted

This doesn't help the user understand what's happening. I tried to address it in this change so that a better message shows up:
Incompatible GPU: you are trying to run this program on sm_80, different from the one that it was compiled for

@zkhatami
Author

@allisonvacanti @jrhemstad
The "Reviewers" button is disabled for me for some reason and I cant fix it, so and I am not able to add any reviewers for this change. Would you please take a look at the changes and share your comments with me? Thanks!

@zkhatami zkhatami marked this pull request as draft September 21, 2022 18:49
@zkhatami zkhatami marked this pull request as ready for review September 21, 2022 18:50
@jrhemstad
Collaborator

Thanks @zkhatami! This is a nice addition. We'll have @senior-zero take a look.

Comment on lines +458 to +460
if (device < 0) {
printf("No GPU is available\n");
}
Collaborator

Was this for debugging? I don't think we want to keep this.

Author

We want to keep this for the case when the code gets compiled on a node with no GPU, since currently it gives this error, which doesn't help the user:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
Aborted (core dumped)

Comment on lines +376 to +378
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version/100, (sm_version%100)/10);
Collaborator

printf isn't a very robust or canonical way of reporting errors. This should really be an exception, but CUB doesn't currently throw exceptions.

Collaborator

@jrhemstad jrhemstad left a comment

The other problem with this approach is that it depends on our current strategy of using cudaFuncGetAttributes for querying the PTX version at runtime.

This will no longer work when we migrate to the alternative approach described here: https://github.com/NVIDIA/cub/issues/556
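
For context, a minimal sketch of the detection pattern being discussed, assuming the arch mismatch surfaces as a failure when probing a kernel with cudaFuncGetAttributes; the kernel and helper names below are illustrative, not CUB's actual code:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Probe a trivial kernel; if the probe fails, this binary likely contains no
// code usable on the current device, so report the device's actual SM version.
inline void report_incompatible_gpu()
{
  cudaFuncAttributes attrs;
  cudaError_t result =
    cudaFuncGetAttributes(&attrs, reinterpret_cast<const void *>(empty_kernel));
  if (result != cudaSuccess)
  {
    int device = -1;
    int major  = 0;
    int minor  = 0;
    if (cudaGetDevice(&device) == cudaSuccess &&
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device) == cudaSuccess &&
        cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device) == cudaSuccess)
    {
      std::fprintf(stderr,
                   "Incompatible GPU: you are trying to run this program on sm_%d%d, "
                   "different from the one that it was compiled for\n",
                   major, minor);
    }
  }
}

If the alternative approach in the linked issue no longer relies on cudaFuncGetAttributes, this early probe would need a different trigger.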

@zkhatami
Author

Is there any release targeted for NVIDIA/thrust#556? If it's unknown, then it might be beneficial from the user's perspective to have this fix with the current CUB design.
In this proposal, is there any way to test for the GPU incompatibility at the very beginning, like the one we currently have using the failure of cudaFuncGetAttributes?

@jrhemstad
Collaborator

> have this fix with the current CUB design.

I agree, however, I'm not sure about our ability to preserve this functionality into the future.

> In this proposal, is there any way to test for the GPU incompatibility at the very beginning, like the one we currently have using the failure of cudaFuncGetAttributes?

That's the problem, I don't think we do.

Collaborator

@gevtushenko gevtushenko left a comment

Thank you for the contribution! Some minor questions below

int device;
if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
Collaborator

Should we include something like stdio in this header if we are going to use printf unconditionally?

Collaborator

Also, should it be fprintf(stderr, ...) instead?

if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
Collaborator

Optional: we could report the actual architectures we were compiled against. We are already using this list for our namespace generation in any case:

#define CUB_DETAIL_MAGIC_NS_BEGIN inline namespace CUB_DETAIL_MAGIC_NS_NAME(CUB_VERSION, NV_TARGET_SM_INTEGER_LIST) {
#else // not defined(_NVHPC_CUDA)
#define CUB_DETAIL_MAGIC_NS_BEGIN inline namespace CUB_DETAIL_MAGIC_NS_NAME(CUB_VERSION, __CUDA_ARCH_LIST__) {

We could probably concatenate the appropriate lists and stringify the result as a follow-up at some point.
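
For illustration, a rough sketch of that stringification, assuming __CUDA_ARCH_LIST__ expands to a comma-separated list such as 600,700,800 (the CUB_DETAIL_STRINGIFY macro names below are hypothetical, not existing CUB macros):

#define CUB_DETAIL_STRINGIFY_IMPL(...) #__VA_ARGS__
#define CUB_DETAIL_STRINGIFY(...)      CUB_DETAIL_STRINGIFY_IMPL(__VA_ARGS__)

// Hypothetical use in the error message:
//   printf("Incompatible GPU: this binary was compiled for sm architectures: %s\n",
//          CUB_DETAIL_STRINGIFY(__CUDA_ARCH_LIST__));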

Comment on lines +371 to +381
if (result != cudaSuccess) {
int sm_version;
int device;
if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version/100, (sm_version%100)/10);
}
}
}
Collaborator

Some style suggestions:

Suggested change
if (result != cudaSuccess) {
int sm_version;
int device;
if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version/100, (sm_version%100)/10);
}
}
}
if (result != cudaSuccess)
{
int sm_version{};
int device{-1};
if (CubDebug(cudaGetDevice(&device)) == cudaSuccess)
{
if (CubDebug(SmVersionUncached(sm_version, device)) == cudaSuccess)
{
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version / 100, (sm_version % 100) / 10);
}
}
}

@dkolsen-pgi
Collaborator

I agree with Jake that writing error messages to stdout or stderr is not the best way to report this problem. The use cases that we care about go through Thrust; NVC++ stdpar never calls CUB directly. Thrust usually reports device-side problems by throwing a thrust::system_error exception. A lack of GPU, or a GPU with an incompatible architecture, should also be reported with a thrust::system_error exception, with a custom message in the exception that clearly explains the problem.

I wonder if get_ptx_version in thrust/system/cuda/detail/core/util.h would be a better place to put this logic.
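
As a rough illustration of that suggestion (not the actual Thrust change; the helper name, the checked error codes, and the message text are assumptions), the condition could be surfaced like this:

#include <cuda_runtime_api.h>
#include <thrust/system_error.h>
#include <thrust/system/cuda/error.h>

inline void throw_if_incompatible_gpu(cudaError_t status)
{
  if (status == cudaErrorUnsupportedPtxVersion || status == cudaErrorInvalidDevice)
  {
    // thrust::system_error carries both the CUDA error code and a custom
    // message that clearly explains the problem.
    throw thrust::system_error(status, thrust::cuda_category(),
                               "No GPU or incompatible GPU: this program was not "
                               "compiled for the device it is running on");
  }
}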

@zkhatami
Author

@jrhemstad @dkolsen-pgi @senior-zero
I've moved my changes to thrust instead based on David's suggestion. Here is my pull request:
NVIDIA/thrust#1848

Still, I'm not able to add any reviewers here. The Reviewers button is still disabled for me.
Would you please take a look at the changes and share your comments with me? Thanks!

@gevtushenko
Collaborator

gevtushenko commented Jan 26, 2023

Hello @zkhatami!

> I've moved my changes to thrust instead based on David's suggestion.

If the change is now on the Thrust side, can this PR be closed?

> Still, I'm not able to add any reviewers here.

The Thrust PR seems to have 3 reviewers at the moment. Let me know if I can help with requesting reviews from other maintainers.

@zkhatami
Author

I'm closing this pull request since I've moved the changes to Thrust.

@zkhatami zkhatami closed this Jan 27, 2023