This program showcases an implementation of a simple matrix transpose kernel that uses a different codepath depending on the target GPU architecture.
- A number of constants are defined to control the problem details and the kernel launch parameters.
- The input matrix is set up in host memory.
- The necessary amount of device memory is allocated and input is copied to the device.
- The GPU transposition kernel is launched with the previously defined arguments.
- The kernel will have two different codepaths for its data movement, depending on the target architecture.
- The transposed matrix is copied back to the host and all device memory is freed.
- The elements of the result matrix are compared with the expected result. The result of the comparison is printed to the standard output.
This example showcases how a GPU kernel can follow two different codepaths depending on the target architecture, selected at compile time via architecture-specific compiler definitions.
This is useful, for example, when you want to use architecture-specific inline assembly when compiling for a specific architecture, without losing compatibility with other architectures (see the inline_assembly example).
These architecture-specific compiler definitions only exist within GPU kernels. If you would like to have GPU architecture-specific host-side code, you could query the stream/device information at runtime.
Device symbols:

- `threadIdx`, `blockIdx`, `blockDim`
- `__gfx1010__`, `__gfx1011__`, `__gfx1012__`, `__gfx1030__`, `__gfx1031__`, `__gfx1100__`, `__gfx1101__`, `__gfx1102__`
Host symbols:

- `hipMalloc`
- `hipMemcpy`
- `hipGetLastError`
- `hipFree`