From ef46f9761f08df802abe7dac23ab2a6fd086e693 Mon Sep 17 00:00:00 2001 From: Matthias Knorr Date: Thu, 28 Nov 2024 14:20:58 +0100 Subject: [PATCH] Review fixes --- docs/how-to/hip_cpp_language_extensions.rst | 37 +++++++++++---------- docs/how-to/kernel_language_cpp_support.rst | 7 ++-- 2 files changed, 23 insertions(+), 21 deletions(-) diff --git a/docs/how-to/hip_cpp_language_extensions.rst b/docs/how-to/hip_cpp_language_extensions.rst index 879d92b70d..131fffc8ae 100644 --- a/docs/how-to/hip_cpp_language_extensions.rst +++ b/docs/how-to/hip_cpp_language_extensions.rst @@ -68,8 +68,9 @@ There are some restrictions on the parameters of kernels. Kernels can't: Kernels can have variadic template parameters, but only one parameter pack, which must be the last item in the template parameter list. -Unlike CUDA, HIP does not support dynamic parallelism, meaning that kernels can -not be called from the device. +.. note:: + Unlike CUDA, HIP does not support dynamic parallelism, meaning that kernels + can not be called from the device. Calling __global__ functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -79,8 +80,8 @@ need an additional configuration, that specifies the grid and block dimensions (i.e. the amount of threads to be launched), as well as specifying the amount of shared memory per block and which stream to execute the kernel on. -Kernels are called using the ``<<<>>>`` syntax known from CUDA, but HIP also -supports the ``hipLaunchKernelGGL`` macro. +Kernels are called using the triple chevron ``<<<>>>`` syntax known from CUDA, +but HIP also supports the ``hipLaunchKernelGGL`` macro. When using ``hipLaunchKernelGGL``, the first five configuration parameters must be: @@ -167,8 +168,8 @@ launched with more threads than ``MAX_THREADS_PER_BLOCK``. If no ``__launch_bounds__`` are specified, ``MAX_THREADS_PER_BLOCK`` is the maximum block size supported by the device (see -:doc:`reference/hardware_features`). Reducing ``MAX_THREADS_PER_BLOCK`` allows -the compiler to use more resources per thread than an unconstrained +:doc:`../reference/hardware_features`). Reducing ``MAX_THREADS_PER_BLOCK`` +allows the compiler to use more resources per thread than an unconstrained compilation. This might however reduce the amount of blocks that can run concurrently on a CU, thereby reducing occupancy and trading thread-level parallelism for instruction-level parallelism. @@ -187,9 +188,9 @@ When launching kernels HIP will validate the launch configuration to make sure the requested block size is not larger than ``MAX_THREADS_PER_BLOCK`` and return an error if it is exceeded. -If :doc:`AMD_LOG_LEVEL ` is set, detailed information will be -shown in the error log message, including the launch configuration of the -kernel and the specified ``__launch_bounds__``. +If :doc:`AMD_LOG_LEVEL <./logging>` is set, detailed information will be shown +in the error log message, including the launch configuration of the kernel and +the specified ``__launch_bounds__``. MIN_WARPS_PER_EXECUTION_UNIT ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -210,11 +211,11 @@ The available registers per Compute Unit are listed in :doc:`rocm:reference/gpu-arch-specs`. Beware that these values are per Compute Unit, not per Execution Unit. On AMD GPUs a Compute Unit consists of 4 Execution Units, also known as SIMDs, each with their own register file. For more -information see :doc:`understand/hardware_implementation`. +information see :doc:`../understand/hardware_implementation`. :cpp:struct:`hipDeviceProp_t` also has a field ``executionUnitsPerMultiprocessor``. Porting from CUDA __launch_bounds -"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ CUDA defines the ``__launch_bounds`` qualifier which works similar to ``__launch_bounds__``: @@ -223,10 +224,12 @@ CUDA defines the ``__launch_bounds`` qualifier which works similar to __launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR) -The first parameter is the same as HIP's implementation. The second parameter of -``__launch_bounds`` must be converted to the format used ``__launch_bounds__``, -which uses warps and execution units rather than blocks and multiprocessors. -This conversion is performed automatically by :doc:`HIPIFY `. +The first parameter is the same as HIP's implementation, but +``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to +``MIN_WARPS_PER_EXECUTION``, which uses warps and execution units rather than +blocks and multiprocessors. This conversion is performed automatically by +:doc:`HIPIFY `, or can be done manually with the following +equation. .. code-block:: cpp @@ -520,8 +523,8 @@ the following workaround: #. Build HIP with the ``HIP_COHERENT_HOST_ALLOC`` environment variable enabled. #. Modify kernels that use ``__threadfence_system()`` as follows: -* Ensure the kernel operates only on fine-grained system memory, which should be allocated with - ``hipHostMalloc()``. +* Ensure the kernel operates only on fine-grained system memory, which should be + allocated with ``hipHostMalloc()``. * Remove ``memcpy`` for all allocated fine-grained system memory regions. .. _synchronization_functions: diff --git a/docs/how-to/kernel_language_cpp_support.rst b/docs/how-to/kernel_language_cpp_support.rst index 4d571f47e7..abb291369d 100644 --- a/docs/how-to/kernel_language_cpp_support.rst +++ b/docs/how-to/kernel_language_cpp_support.rst @@ -44,10 +44,9 @@ Assertions -------------------------------------------------------------------------------- The ``assert`` function is supported in device code. Assertions are used for -debugging purposes. When the input expression equals to zero, the execution will -be stopped. -HIP provides its own implementation for ``assert`` for usage in device code in -``hip/hip_runtime.h`` +debugging purposes. When the input expression equals zero, the execution will be +stopped. HIP provides its own implementation for ``assert`` for usage in device +code in ``hip/hip_runtime.h``. .. code-block:: cpp