
[DISCUSSION] Alternative solution for determining compute capability at runtime #898

Open
jrhemstad opened this issue Aug 19, 2022 · 1 comment
Labels: cub (For all items related to CUB)

@jrhemstad (Collaborator) commented Aug 19, 2022

Current Situation

As discussed in NVIDIA/cub#545, CUB needs to query the current device's compute capability in order to know which tuning policy to use for launching the kernel.

Currently, CUB does this by using cudaFuncGetAttributes on an EmptyKernel<void>.
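For reference, here is a minimal sketch of that mechanism, with an illustrative stand-in for CUB's internal kernel and an illustrative wrapper (not CUB's actual internals):

```cpp
#include <cuda_runtime.h>

// Illustrative stand-in for CUB's internal cub::EmptyKernel<void>.
template <typename T>
__global__ void EmptyKernel() {}

// Query the PTX version the runtime selected for EmptyKernel on the
// current device; cudaFuncAttributes::ptxVersion is encoded as
// 10 * major + minor, e.g. 80 for compute capability 8.0.
cudaError_t GetPtxVersion(int& ptx_version)
{
  cudaFuncAttributes attr;
  cudaError_t error = cudaFuncGetAttributes(&attr, EmptyKernel<void>);
  if (error == cudaSuccess)
  {
    ptx_version = attr.ptxVersion;
  }
  return error;
}
```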

As discussed in NVIDIA/cub#545, this runs into problems due to the nuanced relationship among a kernel's linkage, its enclosing function, and the architectures used to compile the TU. The end result is that we can end up querying a version of EmptyKernel with a different PTX version than we expect.

Proposal

The goal of the machinery described above is to determine which PTX version of a given kernel will be used when it is invoked.

However, there is another way for CUB to do this.

We could instead use cudaGetDeviceProperties. The resulting cudaDeviceProp structure has cudaDeviceProp::major and cudaDeviceProp::minor members that report the major/minor compute capability of the current device.

Alternatively, we could use cudaDeviceGetAttribute and query cudaDevAttrComputeCapabilityMajor and cudaDevAttrComputeCapabilityMinor.
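A minimal sketch of the attribute-based query (the wrapper name is hypothetical):

```cpp
#include <cuda_runtime.h>

// Query the compute capability of the current device directly,
// without consulting any kernel's attributes.
cudaError_t GetDeviceArch(int& major, int& minor)
{
  int device = 0;
  cudaError_t error = cudaGetDevice(&device);
  if (error != cudaSuccess) return error;

  error = cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  if (error != cudaSuccess) return error;

  return cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
}
```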

In addition, we would have to cache somewhere internal to CUB the list of architectures used to compile a particular TU (__CUDA_ARCH_LIST__) so we can select the compiled architecture closest to the compute capability of the current device, as sketched below.
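A sketch of that selection, assuming a recent nvcc that defines __CUDA_ARCH_LIST__ (a comma-separated list of __CUDA_ARCH__-style values, e.g. 520,700,800) and a device arch encoded the same way (100 * major + 10 * minor); the function name is illustrative:

```cpp
#include <initializer_list>

// Pick the highest compiled architecture that does not exceed the
// device's arch (both encoded as 100 * major + 10 * minor, e.g. 860
// for sm_86). Returns 0 if no compiled arch is usable on this device.
inline int SelectArch(int device_arch)
{
  int best = 0;
  for (int arch : {__CUDA_ARCH_LIST__}) // e.g. expands to 520,700,800
  {
    if (arch <= device_arch && arch > best)
    {
      best = arch;
    }
  }
  return best;
}
```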

Additional Context

It is generally recommended to avoid cudaGetDeviceProperties and to instead query the specific attribute of interest via cudaDeviceGetAttribute, as cudaGetDeviceProperties can be quite slow.

Since the cudaDeviceAttr enum does include cudaDevAttrComputeCapabilityMajor and cudaDevAttrComputeCapabilityMinor, the faster cudaDeviceGetAttribute path is available here.

Even so, I don't think performance will be a serious issue either way, as we cache the result anyways.
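An illustrative shape for that caching, assuming the calling thread's current device does not change (the helper name is hypothetical):

```cpp
#include <cuda_runtime.h>

// One-time, thread-safe query (magic static). Assumes the calling
// thread's current device is fixed; a real implementation would cache
// one entry per device ID and check the returned error codes.
inline int CurrentDeviceArch()
{
  static const int arch = [] {
    int device = 0, major = 0, minor = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
    return 100 * major + 10 * minor;
  }();
  return arch;
}
```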

@alliepiper (Collaborator) commented

Just to make it more explicit -- the main difference between these approaches is that cudaFuncGetAttributes tells us which available PTX target will be used on the current device, while cudaDeviceGetAttribute returns the SM architecture of the device itself. For example, on an sm_86 device in a TU compiled for sm_70 and sm_80, the former reports 8.0 (the selected target) while the latter reports 8.6. As you mention, we can combine the latter with the list of target architectures (e.g. __CUDA_ARCH_LIST__ on nvcc) to figure out which PTX target will be selected.

I'm in favor of finding a better solution than the empty_kernel approach, which has been troublesome for a variety of reasons. If the approach of querying the SM arch and refining with __CUDA_ARCH_LIST__ works, it sounds good to me and is worth experimenting with for 2.1.
