This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Better error message for no GPU or incompatible GPU #577

Closed

Conversation

zkhatami

To give the user some clue about what's happening if the program gets compiled on a node with no GPU, or if it gets compiled for a different compute capability than the one it's running on. In both scenarios no good error message was produced before. The proposed changes will improve the user experience and make it easier for users to troubleshoot problems.

This fix addresses issue #1785 reported on Thrust: NVIDIA/cccl#818

@zkhatami zkhatami marked this pull request as draft September 21, 2022 17:57
@zkhatami zkhatami marked this pull request as ready for review September 21, 2022 17:58
@zkhatami zkhatami marked this pull request as draft September 21, 2022 18:33
@zkhatami zkhatami marked this pull request as ready for review September 21, 2022 18:34
@zkhatami
Author

From issue #1785 on Thrust (NVIDIA/cccl#818), for this small test case:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
  thrust::device_vector<int> dv;
  thrust::sort(dv.begin(), dv.end());
}

when compiled with -gpu=cc60 and then run on a system with cc80, the error message is:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 1st step: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
Aborted

This doesn't help the user understand what's happening. I tried to address it in this change so that a better message shows up:
Incompatible GPU: you are trying to run this program on sm_80, different from the one that it was compiled for

@zkhatami
Author

@allisonvacanti @jrhemstad
The "Reviewers" button is disabled for me for some reason and I cant fix it, so and I am not able to add any reviewers for this change. Would you please take a look at the changes and share your comments with me? Thanks!

@zkhatami zkhatami marked this pull request as draft September 21, 2022 18:49
@zkhatami zkhatami marked this pull request as ready for review September 21, 2022 18:50
@jrhemstad
Collaborator

Thanks @zkhatami! This is a nice addition. We'll have @senior-zero take a look.

Comment on lines +458 to +460
if (device < 0) {
printf("No GPU is available\n");
}
Collaborator

Was this for debugging? I don't think we want to keep this.

Author

We want to keep this for the case when the code gets compiled on a node with no GPU, since currently it gives this error, which doesn't help the user:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal
Aborted (core dumped)

Comment on lines +376 to +378
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version/100, (sm_version%100)/10);
Collaborator

printf isn't a very robust or canonical way of reporting errors. This should really be an exception, but CUB doesn't currently throw exceptions.

Collaborator

@jrhemstad jrhemstad left a comment

The other problem with this approach is that it depends on our current strategy of using cudaFuncGetAttributes for querying the PTX version at runtime.

This will no longer work when we migrate to the alternative approach described here: https://github.com/NVIDIA/cub/issues/556
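
For context, a minimal sketch of the detection pattern being discussed, assuming the arch mismatch surfaces as a failure when probing a kernel with cudaFuncGetAttributes; the kernel and helper names below are illustrative, not CUB's actual code:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Probe a trivial kernel; if the probe fails, this binary likely contains no
// code usable on the current device, so report the device's actual SM version.
inline void report_incompatible_gpu()
{
  cudaFuncAttributes attrs;
  cudaError_t result =
    cudaFuncGetAttributes(&attrs, reinterpret_cast<const void *>(empty_kernel));
  if (result != cudaSuccess)
  {
    int device = -1;
    int major  = 0;
    int minor  = 0;
    if (cudaGetDevice(&device) == cudaSuccess &&
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device) == cudaSuccess &&
        cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device) == cudaSuccess)
    {
      std::fprintf(stderr,
                   "Incompatible GPU: you are trying to run this program on sm_%d%d, "
                   "different from the one that it was compiled for\n",
                   major, minor);
    }
  }
}

If the alternative approach in the linked issue no longer relies on cudaFuncGetAttributes, this early probe would need a different trigger.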

@zkhatami
Author

Is there any release targeted for NVIDIA/thrust#556? If it's unknown, then it might be beneficial from the user's perspective to have this fix with the current CUB design.
In this proposal, is there any way to test for the GPU incompatibility at the very beginning, like the one we currently have using the failure of cudaFuncGetAttributes?

@jrhemstad
Collaborator

> have this fix with the current CUB design.

I agree, however, I'm not sure about our ability to preserve this functionality into the future.

> In this proposal, is there any way to test for the GPU incompatibility at the very beginning, like the one we currently have using the failure of cudaFuncGetAttributes?

That's the problem, I don't think we do.

Collaborator

@gevtushenko gevtushenko left a comment

Thank you for the contribution! Some minor questions below

int device;
if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
Collaborator

Should we include something like stdio in this header if we are going to use printf unconditionally?

Collaborator

Also, should it be fprintf(stderr, ...) instead?

if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
Collaborator

Optional: we could report the actual architectures we were compiled against. We are already using this list for our namespace generation in any case:

#define CUB_DETAIL_MAGIC_NS_BEGIN inline namespace CUB_DETAIL_MAGIC_NS_NAME(CUB_VERSION, NV_TARGET_SM_INTEGER_LIST) {
#else // not defined(_NVHPC_CUDA)
#define CUB_DETAIL_MAGIC_NS_BEGIN inline namespace CUB_DETAIL_MAGIC_NS_NAME(CUB_VERSION, __CUDA_ARCH_LIST__) {

We could probably concatenate the appropriate lists and stringify the result as a follow-up at some point.
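
For illustration, a rough sketch of that stringification, assuming __CUDA_ARCH_LIST__ expands to a comma-separated list such as 600,700,800 (the CUB_DETAIL_STRINGIFY macro names below are hypothetical, not existing CUB macros):

#define CUB_DETAIL_STRINGIFY_IMPL(...) #__VA_ARGS__
#define CUB_DETAIL_STRINGIFY(...)      CUB_DETAIL_STRINGIFY_IMPL(__VA_ARGS__)

// Hypothetical use in the error message:
//   printf("Incompatible GPU: this binary was compiled for sm architectures: %s\n",
//          CUB_DETAIL_STRINGIFY(__CUDA_ARCH_LIST__));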

Comment on lines +371 to +381
if (result != cudaSuccess) {
int sm_version;
int device;
if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version/100, (sm_version%100)/10);
}
}
}
Collaborator

Some style suggestions:

Suggested change
if (result != cudaSuccess) {
int sm_version;
int device;
if (!CubDebug(cudaGetDevice(&device))) {
if (!CubDebug(SmVersionUncached(sm_version, device))) {
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version/100, (sm_version%100)/10);
}
}
}
if (result != cudaSuccess)
{
int sm_version{};
int device{-1};
if (CubDebug(cudaGetDevice(&device)) == cudaSuccess)
{
if (CubDebug(SmVersionUncached(sm_version, device)) == cudaSuccess)
{
printf("Incompatible GPU: you are trying to run this program on sm_%d%d, "
"different from the one that it was compiled for\n",
sm_version / 100, (sm_version % 100) / 10);
}
}
}

@dkolsen-pgi
Collaborator

I agree with Jake that writing error messages to stdout or stderr is not the best way to report this problem. The use cases that we care about go through Thrust; NVC++ stdpar never calls CUB directly. Thrust usually reports device-side problems by throwing a thrust::system_error exception. A lack of GPU, or a GPU with an incompatible architecture, should also be reported with a thrust::system_error exception, with a custom message in the exception that clearly explains the problem.

I wonder if get_ptx_version in thrust/system/cuda/detail/core/util.h would be a better place to put this logic.
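
As a rough illustration of that suggestion (not the actual Thrust change; the helper name, the checked error codes, and the message text are assumptions), the condition could be surfaced like this:

#include <cuda_runtime_api.h>
#include <thrust/system_error.h>
#include <thrust/system/cuda/error.h>

inline void throw_if_incompatible_gpu(cudaError_t status)
{
  if (status == cudaErrorUnsupportedPtxVersion || status == cudaErrorInvalidDevice)
  {
    // thrust::system_error carries both the CUDA error code and a custom
    // message that clearly explains the problem.
    throw thrust::system_error(status, thrust::cuda_category(),
                               "No GPU or incompatible GPU: this program was not "
                               "compiled for the device it is running on");
  }
}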

@zkhatami
Author

@jrhemstad @dkolsen-pgi @senior-zero
I've moved my changes to thrust instead based on David's suggestion. Here is my pull request:
NVIDIA/thrust#1848

Still, I'm not able to add any reviewers here. The Reviewers button is still disabled for me.
Would you please take a look at the changes and share your comments with me? Thanks!

@gevtushenko
Collaborator

gevtushenko commented Jan 26, 2023

Hello @zkhatami!

> I've moved my changes to thrust instead based on David's suggestion.

If the change is now on the Thrust side, can this PR be closed?

> Still, I'm not able to add any reviewers here.

The Thrust PR seems to have 3 reviewers at the moment. Let me know if I can help with requesting reviews from other maintainers.

@zkhatami
Author

I'm closing this pull request since I've moved the changes to Thrust.

@zkhatami zkhatami closed this Jan 27, 2023