diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index f4f1d19c5e..b32c2e0df3 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -10,6 +10,8 @@ jobs:
     steps:
     - uses: actions/checkout@v2
     - uses: actions/setup-python@v2
+      with:
+        python-version: '3.9'
     - name: Install prerequisites
       run: |
         sudo apt-get update -qq
diff --git a/source/elements/oneMKL/source/domains/blas/axpby.rst b/source/elements/oneMKL/source/domains/blas/axpby.rst
new file mode 100644
index 0000000000..2fc7fd4908
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/axpby.rst
@@ -0,0 +1,214 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_axpby:
+
+axpby
+=====
+
+Computes a scalar-vector product added to a scaled vector.
+
+.. _onemkl_blas_axpby_description:
+
+.. rubric:: Description
+
+The ``axpby`` routines compute two scalar-vector products and add them:
+
+.. math::
+
+      y \leftarrow beta * y + alpha * x
+
+where ``x`` and ``y`` are vectors of ``n`` elements and ``alpha`` and ``beta`` are scalars.
+
+``axpby`` supports the following precisions.
+
+   .. list-table::
+      :header-rows: 1
+
+      * - T
+      * - ``float``
+      * - ``double``
+      * - ``std::complex<float>``
+      * - ``std::complex<double>``
+
+.. _onemkl_blas_axpby_buffer:
+
+axpby (Buffer Version)
+----------------------
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void axpby(sycl::queue &queue,
+                  std::int64_t n,
+                  T alpha,
+                  sycl::buffer<T,1> &x, std::int64_t incx,
+                  T beta,
+                  sycl::buffer<T,1> &y, std::int64_t incy)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void axpby(sycl::queue &queue,
+                  std::int64_t n,
+                  T alpha,
+                  sycl::buffer<T,1> &x, std::int64_t incx,
+                  T beta,
+                  sycl::buffer<T,1> &y, std::int64_t incy)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in vectors ``x`` and ``y``.
+
+   alpha
+      Specifies the scalar ``alpha``.
+
+   x
+      Buffer holding input vector ``x``. The buffer must be of size at least
+      (1 + (``n`` – 1)*abs(``incx``)). See :ref:`matrix-storage` for
+      more details.
+
+   incx
+      Stride between two consecutive elements of the ``x`` vector.
+
+   beta
+      Specifies the scalar ``beta``.
+
+   y
+      Buffer holding input/output vector ``y``. The buffer must be of size at least
+      (1 + (``n`` – 1)*abs(``incy``)). See :ref:`matrix-storage` for
+      more details.
+
+   incy
+      Stride between two consecutive elements of the ``y`` vector.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Buffer holding the updated vector ``y``.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument`
+
+   :ref:`oneapi::mkl::unsupported_device`
+
+   :ref:`oneapi::mkl::host_bad_alloc`
+
+   :ref:`oneapi::mkl::device_bad_alloc`
+
+   :ref:`oneapi::mkl::unimplemented`
+
+.. _onemkl_blas_axpby_usm:
+
+axpby (USM Version)
+-------------------
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event axpby(sycl::queue &queue,
+                         std::int64_t n,
+                         T alpha,
+                         const T *x, std::int64_t incx,
+                         const T beta,
+                         T *y, std::int64_t incy,
+                         const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event axpby(sycl::queue &queue,
+                         std::int64_t n,
+                         T alpha,
+                         const T *x, std::int64_t incx,
+                         const T beta,
+                         T *y, std::int64_t incy,
+                         const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in vectors ``x`` and ``y``.
+
+   alpha
+      Specifies the scalar ``alpha``.
+
+   x
+      Pointer to the input vector ``x``. The allocated memory must be
+      of size at least (1 + (``n`` – 1)*abs(``incx``)). See
+      :ref:`matrix-storage` for more details.
+
+   incx
+      Stride between consecutive elements of the ``x`` vector.
+
+   beta
+      Specifies the scalar ``beta``.
+
+   y
+      Pointer to the input/output vector ``y``. The allocated memory must be
+      of size at least (1 + (``n`` – 1)*abs(``incy``)). See
+      :ref:`matrix-storage` for more details.
+
+   incy
+      Stride between consecutive elements of the ``y`` vector.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Pointer to the updated vector ``y``.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument`
+
+   :ref:`oneapi::mkl::unsupported_device`
+
+   :ref:`oneapi::mkl::host_bad_alloc`
+
+   :ref:`oneapi::mkl::device_bad_alloc`
+
+   :ref:`oneapi::mkl::unimplemented`
+
+   **Parent topic:** :ref:`blas-like-extensions`
+
diff --git a/source/elements/oneMKL/source/domains/blas/axpy_batch.rst b/source/elements/oneMKL/source/domains/blas/axpy_batch.rst
index 942b110bc4..2e41b99e16 100644
--- a/source/elements/oneMKL/source/domains/blas/axpy_batch.rst
+++ b/source/elements/oneMKL/source/domains/blas/axpy_batch.rst
@@ -244,7 +244,7 @@ The total number of vectors in ``x`` and ``y`` are given by the ``batch_size`` p
 
    x
       Array of pointers to input vectors ``X`` with size ``total_batch_count``.
-      The size of array allocated for the ``X`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incx[i]``))``.
+      The size of array allocated for the ``X`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incx[i]``)).
       See :ref:`matrix-storage` for more details.
 
    incx
@@ -252,7 +252,7 @@ The total number of vectors in ``x`` and ``y`` are given by the ``batch_size`` p
 
    y
       Array of pointers to input/output vectors ``Y`` with size ``total_batch_count``.
-      The size of array allocated for the ``Y`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incy[i]``))``.
+      The size of array allocated for the ``Y`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incy[i]``)).
       See :ref:`matrix-storage` for more details.
 
    incy
diff --git a/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst b/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst
index 9ba5d5d148..d2cbb39acb 100644
--- a/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst
+++ b/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst
@@ -46,7 +46,12 @@ BLAS-like Extensions
     :hidden:
 
     axpy_batch
+    axpby
+    copy_batch
+    dgmm_batch
     gemm_batch
+    gemv_batch
+    syrk_batch
     trsm_batch
     gemmt
     gemm_bias
diff --git a/source/elements/oneMKL/source/domains/blas/copy_batch.rst b/source/elements/oneMKL/source/domains/blas/copy_batch.rst
new file mode 100644
index 0000000000..93f67dd979
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/copy_batch.rst
@@ -0,0 +1,373 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_copy_batch:
+
+copy_batch
+==========
+
+Computes a group of ``copy`` operations.
+
+.. _onemkl_blas_copy_batch_description:
+
+.. rubric:: Description
+
+The ``copy_batch`` routines are batched versions of :ref:`onemkl_blas_copy`, performing
+multiple ``copy`` operations in a single call. Each ``copy``
+operation copies one vector to another.
+
+``copy_batch`` supports the following precisions for data.
+
+   .. list-table::
+      :header-rows: 1
+
+      * - T
+      * - ``float``
+      * - ``double``
+      * - ``std::complex<float>``
+      * - ``std::complex<double>``
+
+.. _onemkl_blas_copy_batch_buffer:
+
+copy_batch (Buffer Version)
+---------------------------
+
+.. rubric:: Description
+
+The buffer version of ``copy_batch`` supports only the strided API.
+
+The strided API operation is defined as:
+::
+
+   for i = 0 … batch_size – 1
+      X and Y are vectors at offset i * stridex, i * stridey in x and y
+      Y := X
+   end for
+
+where:
+
+``X`` and ``Y`` are vectors.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void copy_batch(sycl::queue &queue,
+                       std::int64_t n,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,1> &y,
+                       std::int64_t incy,
+                       std::int64_t stridey,
+                       std::int64_t batch_size)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void copy_batch(sycl::queue &queue,
+                       std::int64_t n,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,1> &y,
+                       std::int64_t incy,
+                       std::int64_t stridey,
+                       std::int64_t batch_size)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in ``X`` and ``Y``.
+
+   x
+      Buffer holding input vectors ``X`` with size ``stridex`` * ``batch_size``.
+
+   incx
+      Stride of vector ``X``.
+
+   stridex
+      Stride between different ``X`` vectors.
+
+   y
+      Buffer holding input/output vectors ``Y`` with size ``stridey`` * ``batch_size``.
+
+   incy
+      Stride of vector ``Y``.
+
+   stridey
+      Stride between different ``Y`` vectors.
+
+   batch_size
+      Specifies the number of ``copy`` operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Output buffer, overwritten by ``batch_size`` ``copy`` operations.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument`
+
+
+   :ref:`oneapi::mkl::unsupported_device`
+
+
+   :ref:`oneapi::mkl::host_bad_alloc`
+
+
+   :ref:`oneapi::mkl::device_bad_alloc`
+
+
+   :ref:`oneapi::mkl::unimplemented`
+
+
+.. _onemkl_blas_copy_batch_usm:
+
+copy_batch (USM Version)
+------------------------
+
+.. rubric:: Description
+
+The USM version of ``copy_batch`` supports the group API and the strided API.
+
+The group API operation is defined as
+::
+
+   idx = 0
+   for i = 0 … group_count – 1
+      for j = 0 … group_size[i] – 1
+         X and Y are vectors in x[idx] and y[idx]
+         Y := X
+         idx := idx + 1
+      end for
+   end for
+
+The strided API operation is defined as
+::
+
+   for i = 0 … batch_size – 1
+      X and Y are vectors at offset i * stridex, i * stridey in x and y
+      Y := X
+   end for
+
+where:
+
+``X`` and ``Y`` are vectors.
+
+For the group API, the ``x`` and ``y`` arrays contain the pointers to all the input and output vectors.
+The total number of vectors in ``x`` and ``y`` is given by:
+
+.. math::
+
+      total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i]
+
+For the strided API, ``x`` and ``y`` point to all the input and output vectors.
+The total number of vectors in ``x`` and ``y`` is given by the ``batch_size`` parameter.
+
+**Group API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t *n,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **y,
+                              std::int64_t *incy,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t *n,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **y,
+                              std::int64_t *incy,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Array of ``group_count`` integers. ``n[i]`` specifies the number of elements in vectors ``X`` and ``Y`` for every vector in group ``i``.
+
+   x
+      Array of pointers to input vectors ``X`` with size ``total_batch_count``.
+      The size of the array allocated for each ``X`` vector in group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incx[i]``)).
+      See :ref:`matrix-storage` for more details.
+
+   incx
+      Array of ``group_count`` integers. ``incx[i]`` specifies the stride of vector ``X`` in group ``i``.
+
+   y
+      Array of pointers to input/output vectors ``Y`` with size ``total_batch_count``.
+      The size of the array allocated for each ``Y`` vector in group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incy[i]``)).
+      See :ref:`matrix-storage` for more details.
+
+   incy
+      Array of ``group_count`` integers. ``incy[i]`` specifies the stride of vector ``Y`` in group ``i``.
+
+   group_count
+      Number of groups. Must be at least 0.
+
+   group_size
+      Array of ``group_count`` integers. ``group_size[i]`` specifies the number of ``copy`` operations in group ``i``.
+      Each element in ``group_size`` must be at least 0.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Array of pointers holding the ``Y`` vectors, overwritten by ``total_batch_count`` ``copy`` operations.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t n,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *y,
+                              std::int64_t incy,
+                              std::int64_t stridey,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t n,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *y,
+                              std::int64_t incy,
+                              std::int64_t stridey,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in ``X`` and ``Y``.
+
+   x
+      Pointer to input vectors ``X`` with size ``stridex`` * ``batch_size``.
+
+   incx
+      Stride of vector ``X``.
+
+   stridex
+      Stride between different ``X`` vectors.
+
+   y
+      Pointer to input/output vectors ``Y`` with size ``stridey`` * ``batch_size``.
+
+   incy
+      Stride of vector ``Y``.
+
+   stridey
+      Stride between different ``Y`` vectors.
+
+   batch_size
+      Specifies the number of ``copy`` operations to perform.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Output vectors, overwritten by ``batch_size`` ``copy`` operations.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument`
+
+
+
+   :ref:`oneapi::mkl::unsupported_device`
+
+
+   :ref:`oneapi::mkl::host_bad_alloc`
+
+
+   :ref:`oneapi::mkl::device_bad_alloc`
+
+
+   :ref:`oneapi::mkl::unimplemented`
+
+
+   **Parent topic:** :ref:`blas-like-extensions`
diff --git a/source/elements/oneMKL/source/domains/blas/dgmm_batch.rst b/source/elements/oneMKL/source/domains/blas/dgmm_batch.rst
new file mode 100644
index 0000000000..403a8a156c
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/dgmm_batch.rst
@@ -0,0 +1,507 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_dgmm_batch:
+
+dgmm_batch
+==========
+
+Computes a group of ``dgmm`` operations.
+
+.. _onemkl_blas_dgmm_batch_description:
+
+.. rubric:: Description
+
+The ``dgmm_batch`` routines perform
+multiple diagonal matrix-matrix product operations in a single call.
+
+``dgmm_batch`` supports the following precisions.
+
+   .. list-table::
+      :header-rows: 1
+
+      * - T
+      * - ``float``
+      * - ``double``
+      * - ``std::complex<float>``
+      * - ``std::complex<double>``
+
+.. _onemkl_blas_dgmm_batch_buffer:
+
+dgmm_batch (Buffer Version)
+---------------------------
+
+.. rubric:: Description
+
+The buffer version of ``dgmm_batch`` supports only the strided API.
+
+The strided API operation is defined as:
+::
+
+   for i = 0 … batch_size – 1
+      A and C are matrices at offset i * stridea in a, i * stridec in c.
+      X is a vector at offset i * stridex in x
+      C := diag(X) * A or C := A * diag(X), depending on left_right
+   end for
+
+where:
+
+``A`` and ``C`` are matrices,
+
+``X`` is a diagonal matrix stored as a vector.
+
+The ``a``, ``x``, and ``c`` buffers contain all the matrices and vectors. Consecutive
+operands are separated by ``stridea``, ``stridex``, and ``stridec`` elements, respectively.
+The total number of operations is given by the ``batch_size`` parameter.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void dgmm_batch(sycl::queue &queue,
+                       oneapi::mkl::side left_right,
+                       std::int64_t m,
+                       std::int64_t n,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,1> &c,
+                       std::int64_t ldc,
+                       std::int64_t stridec,
+                       std::int64_t batch_size)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void dgmm_batch(sycl::queue &queue,
+                       oneapi::mkl::side left_right,
+                       std::int64_t m,
+                       std::int64_t n,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,1> &c,
+                       std::int64_t ldc,
+                       std::int64_t stridec,
+                       std::int64_t batch_size)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   left_right
+      Specifies the position of the diagonal matrix in the product.
+      See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Number of rows of matrices ``A`` and ``C``. Must be at least zero.
+
+   n
+      Number of columns of matrices ``A`` and ``C``. Must be at least zero.
+
+   a
+      Buffer holding the input matrices ``A`` with size ``stridea`` *
+      ``batch_size``. Must be of size at least ``lda`` * ``k`` +
+      ``stridea`` * (``batch_size`` - 1), where ``k`` is ``n`` if
+      column major layout is used or ``m`` if row major layout is
+      used.
+
+   lda
+      The leading dimension of the matrices ``A``. It must be positive
+      and at least ``m`` if column major layout is used or at least
+      ``n`` if row major layout is used.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   x
+      Buffer holding the input vectors ``X`` with size ``stridex`` *
+      ``batch_size``. Must be of size at least
+      (1 + (``len`` - 1)*abs(``incx``)) + ``stridex`` * (``batch_size`` - 1)
+      where ``len`` is ``n`` if the diagonal matrix is on the right
+      of the product or ``m`` otherwise.
+
+   incx
+      Stride between two consecutive elements of the ``x`` vectors.
+
+   stridex
+      Stride between different ``X`` vectors. Must be at least zero.
+
+   c
+      Buffer holding input/output matrices ``C`` with size ``stridec`` * ``batch_size``.
+
+   ldc
+      The leading dimension of the matrices ``C``. It must be positive and at least
+      ``m`` if column major layout is used to store matrices or at
+      least ``n`` if row major layout is used to store matrices.
+
+   stridec
+      Stride between different ``C`` matrices. Must be at least
+      ``ldc`` * ``n`` if column major layout is used or ``ldc`` * ``m`` if row
+      major layout is used.
+
+   batch_size
+      Specifies the number of diagonal matrix-matrix product operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output overwritten by ``batch_size`` diagonal matrix-matrix product
+      operations.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument`
+
+
+   :ref:`oneapi::mkl::unsupported_device`
+
+
+   :ref:`oneapi::mkl::host_bad_alloc`
+
+
+   :ref:`oneapi::mkl::device_bad_alloc`
+
+
+   :ref:`oneapi::mkl::unimplemented`
+
+
+.. _onemkl_blas_dgmm_batch_usm:
+
+dgmm_batch (USM Version)
+------------------------
+
+.. rubric:: Description
+
+The USM version of ``dgmm_batch`` supports the group API and the strided API.
+
+The group API operation is defined as:
+::
+
+   idx = 0
+   for i = 0 … group_count – 1
+      for j = 0 … group_size[i] – 1
+         A and C are m[i] x n[i] matrices at position idx in a and c
+         X is a vector of size m[i] or n[i], depending on left_right[i], at position idx in x
+         if (left_right[i] == oneapi::mkl::side::left)
+            C := diag(X) * A
+         else
+            C := A * diag(X)
+         idx := idx + 1
+      end for
+   end for
+
+The strided API operation is defined as
+::
+
+   for i = 0 … batch_size – 1
+      A and C are matrices at offset i * stridea in a, i * stridec in c.
+      X is a vector at offset i * stridex in x
+      C := diag(X) * A or C := A * diag(X), depending on left_right
+   end for
+
+where:
+
+``A`` and ``C`` are matrices,
+
+``X`` is a diagonal matrix stored as a vector.
+
+In the strided API, consecutive ``A``, ``X``, and ``C`` operands are separated
+by ``stridea``, ``stridex``, and ``stridec`` elements, respectively, within
+the memory pointed to by ``a``, ``x``, and ``c``.
+
+For the group API, the ``a``, ``x``, and ``c`` arrays contain the pointers to all the matrices and vectors.
+The total number of pointers in each array is given by:
+
+.. math::
+
+      total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i]
+
+For the strided API, the total number of operations is given by the
+``batch_size`` parameter.
+
+**Group API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              oneapi::mkl::side *left_right,
+                              std::int64_t *m,
+                              std::int64_t *n,
+                              const T **a,
+                              std::int64_t *lda,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **c,
+                              std::int64_t *ldc,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              oneapi::mkl::side *left_right,
+                              std::int64_t *m,
+                              std::int64_t *n,
+                              const T **a,
+                              std::int64_t *lda,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **c,
+                              std::int64_t *ldc,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   left_right
+      Array of ``group_count`` ``oneapi::mkl::side`` values. ``left_right[i]`` specifies the position of the diagonal matrix in the product for every operation in group ``i``.
+      See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Array of ``group_count`` integers. ``m[i]`` specifies the
+      number of rows of ``A`` and ``C`` for every matrix in group ``i``. All entries must be at least zero.
+
+   n
+      Array of ``group_count`` integers. ``n[i]`` specifies the
+      number of columns of ``A`` and ``C`` for every matrix in group ``i``. All entries must be at least zero.
+
+   a
+      Array of pointers to input matrices ``A`` with size
+      ``total_batch_count``. Each ``A`` matrix in group ``i`` must be of size at least ``lda[i]`` * ``n[i]`` if
+      column major layout is used or at least ``lda[i]`` * ``m[i]`` if row major
+      layout is used.
+      See :ref:`matrix-storage` for more details.
+
+   lda
+      Array of ``group_count`` integers. ``lda[i]`` specifies the
+      leading dimension of ``A`` for every matrix in group ``i``. All
+      entries must be positive and at least ``m[i]`` if column major
+      layout is used or at least ``n[i]`` if row major layout is used.
+
+   x
+      Array of pointers to input vectors ``X`` with size
+      ``total_batch_count``. Each ``X`` vector in group ``i`` must be of size at least (1 + (``len[i]`` –
+      1)*abs(``incx[i]``)), where ``len[i]`` is ``n[i]`` if the diagonal matrix is on the
+      right of the product or ``m[i]`` otherwise.
+      See :ref:`matrix-storage` for more details.
+
+   incx
+      Array of ``group_count`` integers. ``incx[i]`` specifies the
+      stride of ``x`` for every vector in group ``i``. All entries
+      must be positive.
+   c
+      Array of pointers to input/output matrices ``C`` with size ``total_batch_count``.
+      Each ``C`` matrix in group ``i`` must be of size at least
+      ``ldc[i]`` * ``n[i]``
+      if column major layout is used or at least
+      ``ldc[i]`` * ``m[i]``
+      if row major layout is used.
+      See :ref:`matrix-storage` for more details.
+
+   ldc
+      Array of ``group_count`` integers. ``ldc[i]`` specifies the
+      leading dimension of ``C`` for every matrix in group ``i``. All
+      entries must be positive and ``ldc[i]`` must be at least
+      ``m[i]`` if column major layout is used to store matrices or at
+      least ``n[i]`` if row major layout is used to store matrices.
+
+   group_count
+      Specifies the number of groups. Must be at least 0.
+
+   group_size
+      Array of ``group_count`` integers. ``group_size[i]`` specifies the
+      number of diagonal matrix-matrix product operations in group ``i``.
+      All entries must be at least 0.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output overwritten by ``total_batch_count`` diagonal matrix-matrix product
+      operations.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              oneapi::mkl::side left_right,
+                              std::int64_t m,
+                              std::int64_t n,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *c,
+                              std::int64_t ldc,
+                              std::int64_t stridec,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              oneapi::mkl::side left_right,
+                              std::int64_t m,
+                              std::int64_t n,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *c,
+                              std::int64_t ldc,
+                              std::int64_t stridec,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   left_right
+      Specifies the position of the diagonal matrix in the product.
+      See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Number of rows of ``A``. Must be at least zero.
+
+   n
+      Number of columns of ``A``. Must be at least zero.
+
+   a
+      Pointer to input matrices ``A`` with size ``stridea`` *
+      ``batch_size``. Must be of size at least
+      ``lda`` * ``k`` + ``stridea`` * (``batch_size`` - 1)
+      where ``k`` is ``n`` if column major layout is used
+      or ``m`` if row major layout is used.
+
+   lda
+      The leading dimension of the matrices ``A``. It must be positive
+      and at least ``m`` if column major layout is used or at least
+      ``n`` if row major layout is used.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   x
+      Pointer to input vectors ``X`` with size ``stridex`` * ``batch_size``.
+      Must be of size at least
+      (1 + (``len`` - 1)*abs(``incx``)) + ``stridex`` * (``batch_size`` - 1)
+      where ``len`` is ``n`` if the diagonal matrix is on the right
+      of the product or ``m`` otherwise.
+
+   incx
+      Stride between two consecutive elements of the ``x`` vector.
+
+   stridex
+      Stride between different ``X`` vectors. Must be at least zero.
+
+   c
+      Pointer to input/output matrices ``C`` with size ``stridec`` * ``batch_size``.
+
+   ldc
+      The leading dimension of the matrices ``C``. It must be positive and at least
+      ``m`` if column major layout is used to store matrices or at
+      least ``n`` if row major layout is used to store matrices.
+
+   stridec
+      Stride between different ``C`` matrices. Must be at least
+      ``ldc`` * ``n`` if column major layout is used or
+      ``ldc`` * ``m`` if row major layout is used.
+
+   batch_size
+      Specifies the number of diagonal matrix-matrix product operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output overwritten by ``batch_size`` diagonal matrix-matrix product
+      operations.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument`
+
+
+
+   :ref:`oneapi::mkl::unsupported_device`
+
+
+   :ref:`oneapi::mkl::host_bad_alloc`
+
+
+   :ref:`oneapi::mkl::device_bad_alloc`
+
+
+   :ref:`oneapi::mkl::unimplemented`
+
+
+   **Parent topic:** :ref:`blas-like-extensions`
diff --git a/source/elements/oneMKL/source/domains/blas/gemm.rst b/source/elements/oneMKL/source/domains/blas/gemm.rst
index bcefd90c2e..c981f2dff2 100644
--- a/source/elements/oneMKL/source/domains/blas/gemm.rst
+++ b/source/elements/oneMKL/source/domains/blas/gemm.rst
@@ -53,6 +53,10 @@ op(``X``) = ``X``\ :sup:`H`,
         - ``half``
         - ``half``
         - ``half``
+      * - ``float``
+        - ``bfloat16``
+        - ``bfloat16``
+        - ``float``
       * - ``float``
         - ``float``
         - ``float``
diff --git a/source/elements/oneMKL/source/domains/blas/gemm_batch.rst b/source/elements/oneMKL/source/domains/blas/gemm_batch.rst
index 94a15c11d0..32be79475a 100644
--- a/source/elements/oneMKL/source/domains/blas/gemm_batch.rst
+++ b/source/elements/oneMKL/source/domains/blas/gemm_batch.rst
@@ -23,6 +23,7 @@ operation perform a matrix-matrix product with general matrices.
       :header-rows: 1
 
       * - T
+      * - ``half``
       * - ``float``
       * - ``double``
       * - ``std::complex<float>``
@@ -244,7 +245,8 @@ gemm_batch (USM Version)
 
 .. rubric:: Description
 
-The USM version of ``gemm_batch`` supports the group API and strided API.
+The USM version of ``gemm_batch`` supports the group API and the strided API.
+The group API supports pointer and span inputs.
 
 The group API operation is defined as:
 ::
@@ -258,6 +260,25 @@ The group API operation is defined as:
       end for
    end for
 
+The advantage of using spans instead of pointers is that the size of
+each array can vary and can be queried at runtime. For each GEMM
+parameter except the output matrices, the span can be of size 1, of
+size ``group_count``, or of size equal to the total batch size. For
+the output matrices, the span size must equal the total batch size to
+ensure that all computations are independent.
+
+Depending on the size of the spans, each GEMM parameter is used as follows:
+
+ - If the span has size 1, the parameter is reused for all GEMM
+   computations.
+
+ - If the span has size ``group_count``, the parameter is reused for all
+   GEMM computations within a group, but each group can have a different
+   value for this parameter. This matches the pointer-based group API.
+
+ - If the span size equals the total batch size, each GEMM
+   computation uses a different value for this parameter.
+
 The strided API operation is defined as
 ::
 
@@ -311,6 +332,24 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
                               std::int64_t group_count,
                               std::int64_t *group_size,
                               const std::vector<sycl::event> &dependencies = {})
+
+       sycl::event gemm_batch(sycl::queue &queue,
+                              const sycl::span<oneapi::mkl::transpose> &transa,
+                              const sycl::span<oneapi::mkl::transpose> &transb,
+                              const sycl::span<std::int64_t> &m,
+                              const sycl::span<std::int64_t> &n,
+                              const sycl::span<std::int64_t> &k,
+                              const sycl::span<T> &alpha,
+                              const sycl::span<const T*> &a,
+                              const sycl::span<std::int64_t> &lda,
+                              const sycl::span<const T*> &b,
+                              const sycl::span<std::int64_t> &ldb,
+                              const sycl::span<T> &beta,
+                              sycl::span<T*> &c,
+                              const sycl::span<std::int64_t> &ldc,
+                              size_t group_count,
+                              const sycl::span<size_t> &group_sizes,
+                              const std::vector<sycl::event> &dependencies = {})
    }
 .. code-block:: cpp
 
@@ -332,6 +371,24 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
                               std::int64_t group_count,
                               std::int64_t *group_size,
                               const std::vector<sycl::event> &dependencies = {})
+
+       sycl::event gemm_batch(sycl::queue &queue,
+                              const sycl::span<oneapi::mkl::transpose> &transa,
+                              const sycl::span<oneapi::mkl::transpose> &transb,
+                              const sycl::span<std::int64_t> &m,
+                              const sycl::span<std::int64_t> &n,
+                              const sycl::span<std::int64_t> &k,
+                              const sycl::span<T> &alpha,
+                              const sycl::span<const T*> &a,
+                              const sycl::span<std::int64_t> &lda,
+                              const sycl::span<const T*> &b,
+                              const sycl::span<std::int64_t> &ldb,
+                              const sycl::span<T> &beta,
+                              sycl::span<T*> &c,
+                              const sycl::span<std::int64_t> &ldc,
+                              size_t group_count,
+                              const sycl::span<size_t> &group_sizes,
+                              const std::vector<sycl::event> &dependencies = {})
    }
 
 .. container:: section
@@ -342,37 +399,37 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
       The queue where the routine should be executed.
 
    transa
-      Array of ``group_count`` ``onemkl::transpose`` values. ``transa[i]`` specifies the form of op(``A``) used in
+      Array or span of ``group_count`` ``onemkl::transpose`` values. ``transa[i]`` specifies the form of op(``A``) used in
       the matrix multiplication in group ``i``. See :ref:`onemkl_datatypes` for more details.
 
    transb
-      Array of ``group_count`` ``onemkl::transpose`` values. ``transb[i]`` specifies the form of op(``B``) used in
+      Array or span of ``group_count`` ``onemkl::transpose`` values. ``transb[i]`` specifies the form of op(``B``) used in
       the matrix multiplication in group ``i``. See :ref:`onemkl_datatypes` for more details.
 
    m
-      Array of ``group_count`` integers. ``m[i]`` specifies the
+      Array or span of ``group_count`` integers. ``m[i]`` specifies the
       number of rows of op(``A``) and ``C`` for every matrix in group ``i``. All entries must be at least zero.
 
    n
-      Array of ``group_count`` integers. ``n[i]`` specifies the
+      Array or span of ``group_count`` integers. ``n[i]`` specifies the
       number of columns of op(``B``) and ``C`` for every matrix in group ``i``. All entries must be at least zero.
 
    k
-      Array of ``group_count`` integers. ``k[i]`` specifies the
+      Array or span of ``group_count`` integers. ``k[i]`` specifies the
       number of columns of op(``A``) and rows of op(``B``) for every matrix in group ``i``. All entries must be at least zero.
 
    alpha
-      Array of ``group_count`` scalar elements. ``alpha[i]`` specifies the scaling factor for every matrix-matrix
+      Array or span of ``group_count`` scalar elements. ``alpha[i]`` specifies the scaling factor for every matrix-matrix
       product in group ``i``.
 
    a
-      Array of pointers to input matrices ``A`` with size ``total_batch_count``.
+      Array or span of pointers to input matrices ``A`` with size ``total_batch_count``.
 
       See :ref:`matrix-storage` for more details.
 
    lda
-      Array of ``group_count`` integers. ``lda[i]`` specifies the
+      Array or span of ``group_count`` integers. ``lda[i]`` specifies the
       leading dimension of ``A`` for every matrix in group ``i``. All
       entries must be positive.
 
@@ -390,13 +447,12 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
            - ``lda[i]`` must be at least ``m[i]``.
 
    b
-      Array of pointers to input matrices ``B`` with size ``total_batch_count``.
+      Array or span of pointers to input matrices ``B`` with size ``total_batch_count``.
 
       See :ref:`matrix-storage` for more details.
 
    ldb
-
-      Array of ``group_count`` integers. ``ldb[i]`` specifies the
+      Array or span of ``group_count`` integers. ``ldb[i]`` specifies the
       leading dimension of ``B`` for every matrix in group ``i``. All
       entries must be positive.
 
@@ -414,16 +470,16 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
            - ``ldb[i]`` must be at least ``k[i]``.
 
    beta
-      Array of ``group_count`` scalar elements. ``beta[i]`` specifies the scaling factor for matrix ``C``
+      Array or span of ``group_count`` scalar elements. ``beta[i]`` specifies the scaling factor for matrix ``C``
       for every matrix in group ``i``.
 
    c
-      Array of pointers to input/output matrices ``C`` with size ``total_batch_count``.
+      Array or span of pointers to input/output matrices ``C`` with size ``total_batch_count``.
 
       See :ref:`matrix-storage` for more details.
 
    ldc
-      Array of ``group_count`` integers. ``ldc[i]`` specifies the
+      Array or span of ``group_count`` integers. ``ldc[i]`` specifies the
       leading dimension of ``C`` for every matrix in group ``i``. All
       entries must be positive and ``ldc[i]`` must be at least
       ``m[i]`` if column major layout is used to store matrices or at
@@ -433,7 +489,7 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
       Specifies the number of groups. Must be at least 0.
 
    group_size
-      Array of ``group_count`` integers. ``group_size[i]`` specifies the
+      Array or span of ``group_count`` integers. ``group_size[i]`` specifies the
       number of matrix multiply products in group ``i``. All entries must be at least 0.
 
    dependencies
@@ -461,6 +517,27 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
 
    Output event to wait on to ensure computation is complete.
 
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Overwritten by the ``m[i]``-by-``n[i]`` matrix calculated by
+      (``alpha[i]`` * op(``A``)*op(``B``) + ``beta[i]`` * ``C``) for group ``i``.
+
+.. container:: section
+
+   .. rubric:: Notes
+
+   If ``beta`` = 0, matrix ``C`` does not need to be initialized
+   before calling ``gemm_batch``.
+
+.. container:: section
+
+   ..
rubric:: Return Values + + Output event to wait on to ensure computation is complete. + **Strided API** .. rubric:: Syntax diff --git a/source/elements/oneMKL/source/domains/blas/gemv.rst b/source/elements/oneMKL/source/domains/blas/gemv.rst index 5b8f2c58b2..8d7ac094eb 100644 --- a/source/elements/oneMKL/source/domains/blas/gemv.rst +++ b/source/elements/oneMKL/source/domains/blas/gemv.rst @@ -230,7 +230,7 @@ gemv (USM Version) Scaling factor for the matrix-vector product. a - The pointer to the input matrix ``A``. Must have a size of at + Pointer to the input matrix ``A``. Must have a size of at least ``lda``\ \*\ ``n`` if column major layout is used or at least ``lda``\ \*\ ``m`` if row major layout is used. See :ref:`matrix-storage` for more details. diff --git a/source/elements/oneMKL/source/domains/blas/gemv_batch.rst b/source/elements/oneMKL/source/domains/blas/gemv_batch.rst new file mode 100644 index 0000000000..7c047b9517 --- /dev/null +++ b/source/elements/oneMKL/source/domains/blas/gemv_batch.rst @@ -0,0 +1,517 @@ +.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation +.. +.. SPDX-License-Identifier: CC-BY-4.0 + +.. _onemkl_blas_gemv_batch: + +gemv_batch +========== + +Computes a group of ``gemv`` operations. + +.. _onemkl_blas_gemv_batch_description: + +.. rubric:: Description + +The ``gemv_batch`` routines are batched versions of +:ref:`onemkl_blas_gemv`, performing multiple ``gemv`` operations in a +single call. Each ``gemv`` operation performs a scalar-matrix-vector +product and adds the result to a scalar-vector product. + +``gemv_batch`` supports the following precisions. + + .. list-table:: + :header-rows: 1 + + * - T + * - ``float`` + * - ``double`` + * - ``std::complex`` + * - ``std::complex`` + +.. _onemkl_blas_gemv_batch_buffer: + +gemv_batch (Buffer Version) +--------------------------- + +.. rubric:: Description + +The buffer version of ``gemv_batch`` supports only the strided API.
+ +The strided API operation is defined as: +:: + + for i = 0 … batch_size – 1 + A is a matrix at offset i * stridea in a. + X and Y are vectors at offset i * stridex, i * stridey in x and y. + Y := alpha * op(A) * X + beta * Y + end for + +where: + +op(A) is one of op(A) = A, or op(A) = A\ :sup:`T`, or op(A) = A\ :sup:`H`, + +``alpha`` and ``beta`` are scalars, + +``A`` is a matrix and ``X`` and ``Y`` are vectors. + +The ``a``, ``x`` and ``y`` buffers contain all the input matrices and vectors. The stride +between consecutive matrices or vectors is given by the corresponding stride parameter. The total number of +matrices in ``a`` and vectors in ``x`` and ``y`` is given by the ``batch_size`` +parameter. + +**Strided API** + +.. rubric:: Syntax + +.. code-block:: cpp + + namespace oneapi::mkl::blas::column_major { + void gemv_batch(sycl::queue &queue, + onemkl::transpose trans, + std::int64_t m, + std::int64_t n, + T alpha, + sycl::buffer &a, + std::int64_t lda, + std::int64_t stridea, + sycl::buffer &x, + std::int64_t incx, + std::int64_t stridex, + T beta, + sycl::buffer &y, + std::int64_t incy, + std::int64_t stridey, + std::int64_t batch_size) + } +.. code-block:: cpp + + namespace oneapi::mkl::blas::row_major { + void gemv_batch(sycl::queue &queue, + onemkl::transpose trans, + std::int64_t m, + std::int64_t n, + T alpha, + sycl::buffer &a, + std::int64_t lda, + std::int64_t stridea, + sycl::buffer &x, + std::int64_t incx, + std::int64_t stridex, + T beta, + sycl::buffer &y, + std::int64_t incy, + std::int64_t stridey, + std::int64_t batch_size) + } + +.. container:: section + + .. rubric:: Input Parameters + + queue + The queue where the routine should be executed. + + trans + Specifies op(``A``), the transposition operation applied to the + matrices ``A``. See :ref:`onemkl_datatypes` for more details. + + m + Number of rows of op(``A``). Must be at least zero. + + n + Number of columns of op(``A``). Must be at least zero. + + alpha + Scaling factor for the matrix-vector products.
+ + a + Buffer holding the input matrices ``A`` with size ``stridea`` * ``batch_size``. + + lda + The leading dimension of the matrices ``A``. It must be positive + and at least ``m`` if column major layout is used or at least + ``n`` if row major layout is used. + + stridea + Stride between different ``A`` matrices. + + x + Buffer holding the input vectors ``X`` with size ``stridex`` * ``batch_size``. + + incx + The stride of the vector ``X``. It must be positive. + + stridex + Stride between different consecutive ``X`` vectors, must be at least 0. + + beta + Scaling factor for the vector ``Y``. + + y + Buffer holding input/output vectors ``Y`` with size ``stridey`` * ``batch_size``. + + incy + Stride between two consecutive elements of the ``y`` vectors. + + stridey + Stride between two consecutive ``Y`` vectors. Must be at least + (1 + (len-1)*abs(incy)) where ``len`` is ``m`` if the matrix ``A`` + is non transpose or ``n`` otherwise. + + batch_size + Specifies the number of matrix-vector operations to perform. + +.. container:: section + + .. rubric:: Output Parameters + + y + Output overwritten by ``batch_size`` matrix-vector product + operations of the form ``alpha`` * op(``A``) * ``X`` + ``beta`` * ``Y``. + +.. container:: section + + .. rubric:: Throws + + This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here. + + :ref:`oneapi::mkl::invalid_argument` + + + :ref:`oneapi::mkl::unsupported_device` + + + :ref:`oneapi::mkl::host_bad_alloc` + + + :ref:`oneapi::mkl::device_bad_alloc` + + + :ref:`oneapi::mkl::unimplemented` + + +.. _onemkl_blas_gemv_batch_usm: + +gemv_batch (USM Version) +--------------------------- + +.. rubric:: Description + +The USM version of ``gemv_batch`` supports the group API and strided API. 
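Both the buffer and USM versions of ``gemv_batch`` support the strided API. As an illustration only, the following is a plain host-side C++ sketch of the strided semantics for real types in column-major storage. It is not the oneMKL API: ``Op`` and ``ref_gemv_batch_strided`` are hypothetical names. It assumes, as in standard BLAS ``gemv``, that ``m`` and ``n`` are the dimensions of the stored matrix ``A``, so ``Y`` has length ``m`` when op(``A``) = ``A`` and length ``n`` when op(``A``) = ``A^T`` (which matches the ``stridey`` requirement of this routine), and that ``incx`` and ``incy`` are positive.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; this is not the oneMKL API.
enum class Op { N, T };

// Host reference for strided gemv_batch on real T, column-major storage.
// Assumes m x n are the dimensions of the stored A (standard BLAS gemv
// convention) and that incx and incy are positive.
template <typename T>
void ref_gemv_batch_strided(Op trans, std::int64_t m, std::int64_t n,
                            T alpha, const T *a, std::int64_t lda,
                            std::int64_t stridea, const T *x,
                            std::int64_t incx, std::int64_t stridex, T beta,
                            T *y, std::int64_t incy, std::int64_t stridey,
                            std::int64_t batch_size) {
    assert(incx > 0 && incy > 0);
    // op(A) = A maps n-vectors to m-vectors; op(A) = A^T does the reverse.
    const std::int64_t leny = (trans == Op::N) ? m : n;
    const std::int64_t lenx = (trans == Op::N) ? n : m;
    for (std::int64_t b = 0; b < batch_size; ++b) {
        const T *A = a + b * stridea;  // b-th matrix
        const T *X = x + b * stridex;  // b-th input vector
        T *Y = y + b * stridey;        // b-th input/output vector
        for (std::int64_t i = 0; i < leny; ++i) {
            T acc = T(0);
            for (std::int64_t j = 0; j < lenx; ++j) {
                // Column-major: element (r, c) of A is A[r + c * lda].
                const T op_ij = (trans == Op::N) ? A[i + j * lda]
                                                 : A[j + i * lda];
                acc += op_ij * X[j * incx];
            }
            Y[i * incy] = alpha * acc + beta * Y[i * incy];
        }
    }
}
```

Each problem reads only the slice that starts at ``b * stridea``, ``b * stridex`` and ``b * stridey``, so the strides alone determine where the ``batch_size`` problems live inside the three buffers.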
+ +The group API operation is defined as: +:: + + idx = 0 + for i = 0 … group_count – 1 + for j = 0 … group_size[i] – 1 + A is an m[i] x n[i] matrix in a[idx] + X and Y are vectors in x[idx] and y[idx] + Y := alpha[i] * op(A) * X + beta[i] * Y + idx = idx + 1 + end for + end for + +The strided API operation is defined as: +:: + + for i = 0 … batch_size – 1 + A is a matrix at offset i * stridea in a. + X and Y are vectors at offset i * stridex, i * stridey in x and y. + Y := alpha * op(A) * X + beta * Y + end for + +where: + +op(A) is one of op(A) = A, or op(A) = A\ :sup:`T`, or op(A) = A\ :sup:`H`, + +``alpha`` and ``beta`` are scalars, + +``A`` is a matrix and ``X`` and ``Y`` are vectors. + +For the group API, the ``x`` and ``y`` arrays contain the pointers to all the input vectors. +The ``a`` array contains the pointers to all the input matrices. +The total number of vectors in ``x`` and ``y`` and of matrices in ``a`` is given by: + +.. math:: + + total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i] + +For the strided API, the ``x`` and ``y`` arrays contain all the input +vectors, and the ``a`` array contains all the input matrices. The +total number of vectors in ``x`` and ``y`` and of matrices in ``a`` is given by the +``batch_size`` parameter. + +**Group API** + +.. rubric:: Syntax + +.. code-block:: cpp + + namespace oneapi::mkl::blas::column_major { + sycl::event gemv_batch(sycl::queue &queue, + onemkl::transpose *trans, + std::int64_t *m, + std::int64_t *n, + T *alpha, + const T **a, + std::int64_t *lda, + const T **x, + std::int64_t *incx, + T *beta, + T **y, + std::int64_t *incy, + std::int64_t group_count, + std::int64_t *group_size, + const std::vector &dependencies = {}) + } +.. 
code-block:: cpp + + namespace oneapi::mkl::blas::row_major { + sycl::event gemv_batch(sycl::queue &queue, + onemkl::transpose *trans, + std::int64_t *m, + std::int64_t *n, + T *alpha, + const T **a, + std::int64_t *lda, + const T **x, + std::int64_t *incx, + T *beta, + T **y, + std::int64_t *incy, + std::int64_t group_count, + std::int64_t *group_size, + const std::vector &dependencies = {}) + } + +.. container:: section + + .. rubric:: Input Parameters + + queue + The queue where the routine should be executed. + + trans + Array of ``group_count`` ``onemkl::transpose`` values. ``trans[i]`` specifies the form of op(``A``) used in + the matrix-vector product in group ``i``. See :ref:`onemkl_datatypes` for more details. + + m + Array of ``group_count`` integers. ``m[i]`` specifies the + number of rows of op(``A``) for every matrix in group ``i``. All entries must be at least zero. + + n + Array of ``group_count`` integers. ``n[i]`` specifies the + number of columns of op(``A``) for every matrix in group ``i``. All entries must be at least zero. + + alpha + Array of ``group_count`` scalar elements. ``alpha[i]`` specifies + the scaling factor for every matrix-vector product in group + ``i``. + + a + Array of pointers to input matrices ``A`` with size ``total_batch_count``. + + See :ref:`matrix-storage` for more details. + + lda + Array of ``group_count`` integers. ``lda[i]`` specifies the + leading dimension of ``A`` for every matrix in group ``i``. All + entries must be positive and at least ``m`` if column major + layout is used or at least ``n`` if row major layout is used. + + x + Array of pointers to input vectors ``X`` with size ``total_batch_count``. + + See :ref:`matrix-storage` for more details. + + incx + Array of ``group_count`` integers. ``incx[i]`` specifies the + stride of ``X`` for every vector in group ``i``. All + entries must be positive. + + beta + Array of ``group_count`` scalar elements. 
``beta[i]`` specifies + the scaling factor for vector ``Y`` for every vector in group + ``i``. + + y + Array of pointers to input/output vectors ``Y`` with size ``total_batch_count``. + + See :ref:`matrix-storage` for more details. + + incy + Array of ``group_count`` integers. ``incy[i]`` specifies the + leading dimension of ``Y`` for every vector in group ``i``. All + entries must be positive and ``incy[i]`` must be at least + ``m[i]`` if column major layout is used or at + least ``n[i]`` if row major layout is used. + + group_count + Specifies the number of groups. Must be at least 0. + + group_size + Array of ``group_count`` integers. ``group_size[i]`` specifies the + number of matrix-vector products in group ``i``. All entries must be at least 0. + + dependencies + List of events to wait for before starting computation, if any. + If omitted, defaults to no dependencies. + +.. container:: section + + .. rubric:: Output Parameters + + y + Overwritten by vector calculated by + (``alpha[i]`` * op(``A``) * ``X`` + ``beta[i]`` * ``Y``) for group ``i``. + +.. container:: section + + .. rubric:: Return Values + + Output event to wait on to ensure computation is complete. + +**Strided API** + +.. rubric:: Syntax + +.. code-block:: cpp + + namespace oneapi::mkl::blas::column_major { + sycl::event gemv_batch(sycl::queue &queue, + onemkl::transpose trans, + std::int64_t m, + std::int64_t n, + T alpha, + const T *a, + std::int64_t lda, + std::int64_t stridea, + const T *x, + std::int64_t incx, + std::int64_t stridex, + T beta, + T *y, + std::int64_t incy, + std::int64_t stridey, + std::int64_t batch_size, + const std::vector &dependencies = {}) + } +.. 
code-block:: cpp + + namespace oneapi::mkl::blas::row_major { + sycl::event gemv_batch(sycl::queue &queue, + onemkl::transpose trans, + std::int64_t m, + std::int64_t n, + T alpha, + const T *a, + std::int64_t lda, + std::int64_t stridea, + const T *x, + std::int64_t incx, + std::int64_t stridex, + T beta, + T *y, + std::int64_t incy, + std::int64_t stridey, + std::int64_t batch_size, + const std::vector &dependencies = {}) + } + + +.. container:: section + + .. rubric:: Input Parameters + + queue + The queue where the routine should be executed. + + trans + Specifies op(``A``) the transposition operation applied to the + matrices ``A``. See :ref:`onemkl_datatypes` for more details. + + m + Number of rows of op(``A``). Must be at least zero. + + n + Number of columns of op(``A``). Must be at least zero. + + alpha + Scaling factor for the matrix-vector products. + + a + Pointer to the input matrices ``A`` with size ``stridea`` * ``batch_size``. + + lda + The leading dimension of the matrices ``A``. It must be positive + and at least ``m`` if column major layout is used or at least + ``n`` if row major layout is used. + + stridea + Stride between different ``A`` matrices. + + x + Pointer to the input vectors ``X`` with size ``stridex`` * ``batch_size``. + + incx + Stride of the vector ``X``. It must be positive. + + stridex + Stride between different consecutive ``X`` vectors, must be at least 0. + + beta + Scaling factor for the vector ``Y``. + + y + Pointer to the input/output vectors ``Y`` with size ``stridey`` * ``batch_size``. + + incy + Stride between two consecutive elements of the ``y`` vectors. + + stridey + Stride between two consecutive ``Y`` vectors. Must be at least + (1 + (len-1)*abs(incy)) where ``len`` is ``m`` if the matrix ``A`` + is non transpose or ``n`` otherwise. + + batch_size + Specifies the number of matrix-vector operations to perform. + +.. container:: section + + .. 
rubric:: Output Parameters + + y + Output overwritten by ``batch_size`` matrix-vector product + operations of the form ``alpha`` * op(``A``) * ``X`` + ``beta`` * ``Y``. + +.. container:: section + + .. rubric:: Return Values + + Output event to wait on to ensure computation is complete. + +.. container:: section + + .. rubric:: Throws + + This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here. + + :ref:`oneapi::mkl::invalid_argument` + + + + :ref:`oneapi::mkl::unsupported_device` + + + :ref:`oneapi::mkl::host_bad_alloc` + + + :ref:`oneapi::mkl::device_bad_alloc` + + + :ref:`oneapi::mkl::unimplemented` + + + **Parent topic:** :ref:`blas-like-extensions` diff --git a/source/elements/oneMKL/source/domains/blas/syrk_batch.rst b/source/elements/oneMKL/source/domains/blas/syrk_batch.rst new file mode 100644 index 0000000000..5b7cd71d42 --- /dev/null +++ b/source/elements/oneMKL/source/domains/blas/syrk_batch.rst @@ -0,0 +1,529 @@ +.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation +.. +.. SPDX-License-Identifier: CC-BY-4.0 + +.. _onemkl_blas_syrk_batch: + +syrk_batch +========== + +Computes a group of ``syrk`` operations. + +.. _onemkl_blas_syrk_batch_description: + +.. rubric:: Description + +The ``syrk_batch`` routines are batched versions of :ref:`onemkl_blas_syrk`, performing +multiple ``syrk`` operations in a single call. Each ``syrk`` +operation performs a symmetric rank-k update. + +``syrk_batch`` supports the following precisions. + + .. list-table:: + :header-rows: 1 + + * - T + * - ``float`` + * - ``double`` + * - ``std::complex`` + * - ``std::complex`` + +.. _onemkl_blas_syrk_batch_buffer: + +syrk_batch (Buffer Version) +--------------------------- + +.. rubric:: Description + +The buffer version of ``syrk_batch`` supports only the strided API.
+ +The strided API operation is defined as: +:: + + for i = 0 … batch_size – 1 + A and C are matrices at offset i * stridea, i * stridec in a and c. + C := alpha * op(A) * op(A)^T + beta * C + end for + +where: + +op(X) is one of op(X) = X, or op(X) = X\ :sup:`T`, or op(X) = X\ :sup:`H`, + +``alpha`` and ``beta`` are scalars, + +``A`` and ``C`` are matrices, + +op(``A``) is ``n`` x ``k`` and ``C`` is ``n`` x ``n``. + +The ``a`` and ``c`` buffers contain all the input matrices. The stride +between matrices is given by the stride parameter. The total number +of matrices in ``a`` and ``c`` buffers is given by the ``batch_size`` parameter. + +**Strided API** + +.. rubric:: Syntax + +.. code-block:: cpp + + namespace oneapi::mkl::blas::column_major { + void syrk_batch(sycl::queue &queue, + onemkl::uplo upper_lower, + onemkl::transpose trans, + std::int64_t n, + std::int64_t k, + T alpha, + sycl::buffer &a, + std::int64_t lda, + std::int64_t stridea, + T beta, + sycl::buffer &c, + std::int64_t ldc, + std::int64_t stridec, + std::int64_t batch_size) + } +.. code-block:: cpp + + namespace oneapi::mkl::blas::row_major { + void syrk_batch(sycl::queue &queue, + onemkl::uplo upper_lower, + onemkl::transpose trans, + std::int64_t n, + std::int64_t k, + T alpha, + sycl::buffer &a, + std::int64_t lda, + std::int64_t stridea, + T beta, + sycl::buffer &c, + std::int64_t ldc, + std::int64_t stridec, + std::int64_t batch_size) + } + +.. container:: section + + .. rubric:: Input Parameters + + queue + The queue where the routine should be executed. + + upper_lower + Specifies whether data in ``C`` is stored in its upper or lower triangle. + For more details, see :ref:`onemkl_datatypes`. + + trans + Specifies op(``A``) the transposition operation applied to the + matrix ``A``. Conjugation is never performed, even if trans = + transpose::conjtrans. See :ref:`onemkl_datatypes` for more + details. + + n + Number of rows and columns of ``C``. + Must be at least zero. 
+ + k + Number of columns of op(``A``). + Must be at least zero. + + alpha + Scaling factor for the rank-k update. + + a + Buffer holding the input matrices ``A`` with size ``stridea`` * ``batch_size``. + + lda + The leading dimension of the matrices ``A``. It must be positive. + + .. list-table:: + :header-rows: 1 + + * - + - ``A`` not transposed + - ``A`` transposed + * - Column major + - ``lda`` must be at least ``n``. + - ``lda`` must be at least ``k``. + * - Row major + - ``lda`` must be at least ``k``. + - ``lda`` must be at least ``n``. + + stridea + Stride between different ``A`` matrices. + + beta + Scaling factor for the matrices ``C``. + + c + Buffer holding input/output matrices ``C`` with size ``stridec`` * ``batch_size``. + + ldc + The leading dimension of the matrices ``C``. It must be positive + and at least ``n``. + + stridec + Stride between different ``C`` matrices. Must be at least + ``ldc`` * ``n``. + + batch_size + Specifies the number of rank-k update operations to perform. + +.. container:: section + + .. rubric:: Output Parameters + + c + Output buffer, overwritten by ``batch_size`` rank-k update + operations of the form ``alpha`` * op(``A``)*op(``A``)^T + ``beta`` * ``C``. + +.. container:: section + + .. rubric:: Throws + + This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here. + + :ref:`oneapi::mkl::invalid_argument` + + + :ref:`oneapi::mkl::unsupported_device` + + + :ref:`oneapi::mkl::host_bad_alloc` + + + :ref:`oneapi::mkl::device_bad_alloc` + + + :ref:`oneapi::mkl::unimplemented` + + +.. _onemkl_blas_syrk_batch_usm: + +syrk_batch (USM Version) +--------------------------- + +.. rubric:: Description + +The USM version of ``syrk_batch`` supports the group API and strided API. 
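The group API of ``syrk_batch`` walks the groups in order and applies the parameters of group ``i`` to ``group_size[i]`` consecutive entries of ``a`` and ``c``. A minimal host-side C++ sketch of that indexing, for real types in column-major storage, follows. It is not the oneMKL API: ``Uplo``, ``Op`` and ``ref_syrk_batch_group`` are hypothetical names, and writing only the triangle of each ``C`` selected by ``upper_lower[i]`` is an assumption based on the usual ``syrk`` convention.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; this is not the oneMKL API.
enum class Uplo { Upper, Lower };
enum class Op { N, T };

// Host reference for the syrk_batch group API on real T, column-major.
// Every problem in group g computes C := alpha[g]*op(A)*op(A)^T + beta[g]*C
// and writes only the triangle of C selected by upper_lower[g].
template <typename T>
void ref_syrk_batch_group(const Uplo *upper_lower, const Op *trans,
                          const std::int64_t *n, const std::int64_t *k,
                          const T *alpha, const T *const *a,
                          const std::int64_t *lda, const T *beta,
                          T *const *c, const std::int64_t *ldc,
                          std::int64_t group_count,
                          const std::int64_t *group_size) {
    assert(group_count >= 0);
    std::int64_t idx = 0;  // running index into a[] and c[] across groups
    for (std::int64_t g = 0; g < group_count; ++g) {
        for (std::int64_t s = 0; s < group_size[g]; ++s, ++idx) {
            const T *A = a[idx];
            T *C = c[idx];
            const bool upper = (upper_lower[g] == Uplo::Upper);
            const bool nt = (trans[g] == Op::N);
            for (std::int64_t col = 0; col < n[g]; ++col) {
                // Rows of column `col` that lie in the selected triangle.
                const std::int64_t r0 = upper ? 0 : col;
                const std::int64_t r1 = upper ? col + 1 : n[g];
                for (std::int64_t row = r0; row < r1; ++row) {
                    T acc = T(0);
                    for (std::int64_t p = 0; p < k[g]; ++p) {
                        // op(A) is n x k; read op(A)(row, p) and op(A)(col, p).
                        const T ar = nt ? A[row + p * lda[g]] : A[p + row * lda[g]];
                        const T ac = nt ? A[col + p * lda[g]] : A[p + col * lda[g]];
                        acc += ar * ac;
                    }
                    T &cij = C[row + col * ldc[g]];
                    cij = alpha[g] * acc + beta[g] * cij;
                }
            }
        }
    }
}
```

Keeping one running index ``idx`` across groups is what lets ``a`` and ``c`` hold ``total_batch_count`` pointers while every other array holds only ``group_count`` entries.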
+ +The group API operation is defined as: +:: + + idx = 0 + for i = 0 … group_count – 1 + for j = 0 … group_size[i] – 1 + A and C are matrices in a[idx] and c[idx] + C := alpha[i] * op(A) * op(A)^T + beta[i] * C + idx = idx + 1 + end for + end for + +The strided API operation is defined as: +:: + + for i = 0 … batch_size – 1 + A and C are matrices at offset i * stridea, i * stridec in a and c. + C := alpha * op(A) * op(A)^T + beta * C + end for + +where: + +op(X) is one of op(X) = X, or op(X) = X\ :sup:`T`, or op(X) = X\ :sup:`H`, + +``alpha`` and ``beta`` are scalars, + +``A`` and ``C`` are matrices, + +op(``A``) is ``n`` x ``k`` and ``C`` is ``n`` x ``n``. + + +For the group API, the ``a`` and ``c`` arrays contain the pointers to all the input matrices. +The total number of matrices in ``a`` and ``c`` is given by: + +.. math:: + + total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i] + +For the strided API, the ``a`` and ``c`` arrays contain all the input matrices. The total number of matrices +in ``a`` and ``c`` is given by the ``batch_size`` parameter. + +**Group API** + +.. rubric:: Syntax + +.. code-block:: cpp + + namespace oneapi::mkl::blas::column_major { + sycl::event syrk_batch(sycl::queue &queue, + uplo *upper_lower, + transpose *trans, + std::int64_t *n, + std::int64_t *k, + T *alpha, + const T **a, + std::int64_t *lda, + T *beta, + T **c, + std::int64_t *ldc, + std::int64_t group_count, + std::int64_t *group_size, + const std::vector &dependencies = {}) + } +.. code-block:: cpp + + namespace oneapi::mkl::blas::row_major { + sycl::event syrk_batch(sycl::queue &queue, + uplo *upper_lower, + transpose *trans, + std::int64_t *n, + std::int64_t *k, + T *alpha, + const T **a, + std::int64_t *lda, + T *beta, + T **c, + std::int64_t *ldc, + std::int64_t group_count, + std::int64_t *group_size, + const std::vector &dependencies = {}) + } + +.. container:: section + + .. rubric:: Input Parameters + + queue + The queue where the routine should be executed.
+ + upper_lower + Array of ``group_count`` ``onemkl::uplo`` + values. ``upper_lower[i]`` specifies whether the data in ``C`` for every + matrix in group ``i`` is stored in its upper or lower triangle. See :ref:`onemkl_datatypes` for more details. + + trans + Array of ``group_count`` ``onemkl::transpose`` values. ``trans[i]`` specifies the form of op(``A``) used in + the rank-k update in group ``i``. See :ref:`onemkl_datatypes` for more details. + + n + Array of ``group_count`` integers. ``n[i]`` specifies the + number of rows and columns of ``C`` for every matrix in group ``i``. All entries must be at least zero. + + k + Array of ``group_count`` integers. ``k[i]`` specifies the + number of columns of op(``A``) for every matrix in group ``i``. All entries must be at + least zero. + + alpha + Array of ``group_count`` scalar elements. ``alpha[i]`` specifies the scaling factor for every rank-k update in group ``i``. + + a + Array of pointers to input matrices ``A`` with size ``total_batch_count``. + + See :ref:`matrix-storage` for more details. + + lda + Array of ``group_count`` integers. ``lda[i]`` specifies the + leading dimension of ``A`` for every matrix in group ``i``. All + entries must be positive. + + .. list-table:: + :header-rows: 1 + + * - + - ``A`` not transposed + - ``A`` transposed + * - Column major + - ``lda[i]`` must be at least ``n[i]``. + - ``lda[i]`` must be at least ``k[i]``. + * - Row major + - ``lda[i]`` must be at least ``k[i]``. + - ``lda[i]`` must be at least ``n[i]``. + + beta + Array of ``group_count`` scalar elements. ``beta[i]`` specifies the scaling factor for matrix ``C`` + for every matrix in group ``i``. + + c + Array of pointers to input/output matrices ``C`` with size ``total_batch_count``. + + See :ref:`matrix-storage` for more details. + + ldc + Array of ``group_count`` integers. ``ldc[i]`` specifies the + leading dimension of ``C`` for every matrix in group ``i``. All + entries must be positive and ``ldc[i]`` must be at least ``n[i]``. 
+ + group_count + Specifies the number of groups.
Must be at least 0. + + group_size + Array of ``group_count`` integers. ``group_size[i]`` specifies the + number of rank-k update products in group ``i``. All entries must be at least 0. + + dependencies + List of events to wait for before starting computation, if any. + If omitted, defaults to no dependencies. + +.. container:: section + + .. rubric:: Output Parameters + + c + Overwritten by the ``n[i]``-by-``n[i]`` matrix calculated by + (``alpha[i]`` * op(``A``)*op(``A``)^T + ``beta[i]`` * ``C``) for group ``i``. + +.. container:: section + + .. rubric:: Return Values + + Output event to wait on to ensure computation is complete. + +**Strided API** + +.. rubric:: Syntax + +.. code-block:: cpp + + namespace oneapi::mkl::blas::column_major { + sycl::event syrk_batch(sycl::queue &queue, + uplo upper_lower, + transpose trans, + std::int64_t n, + std::int64_t k, + T alpha, + const T *a, + std::int64_t lda, + std::int64_t stride_a, + T beta, + T *c, + std::int64_t ldc, + std::int64_t stride_c, + std::int64_t batch_size, + const std::vector &dependencies = {}) + } +.. code-block:: cpp + + namespace oneapi::mkl::blas::row_major { + sycl::event syrk_batch(sycl::queue &queue, + uplo upper_lower, + transpose trans, + std::int64_t n, + std::int64_t k, + T alpha, + const T *a, + std::int64_t lda, + std::int64_t stride_a, + T beta, + T *c, + std::int64_t ldc, + std::int64_t stride_c, + std::int64_t batch_size, + const std::vector &dependencies = {}) + } + +.. container:: section + + .. rubric:: Input Parameters + + queue + The queue where the routine should be executed. + + upper_lower + Specifies whether data in ``C`` is stored in its upper or lower triangle. + For more details, see :ref:`onemkl_datatypes`. + + trans + Specifies op(``A``) the transposition operation applied to the + matrices ``A``. Conjugation is never performed, even if trans = + transpose::conjtrans. See :ref:`onemkl_datatypes` for more + details. + + n + Number of rows and columns of ``C``. 
+ Must be at least zero. + + k + Number of columns of op(``A``). + Must be at least zero. + + alpha + Scaling factor for the rank-k updates. + + a + Pointer to input matrices ``A`` with size ``stridea`` * ``batch_size``. + + lda + The leading dimension of the matrices ``A``. It must be positive. + + .. list-table:: + :header-rows: 1 + + * - + - ``A`` not transposed + - ``A`` transposed + * - Column major + - ``lda`` must be at least ``n``. + - ``lda`` must be at least ``k``. + * - Row major + - ``lda`` must be at least ``k``. + - ``lda`` must be at least ``n``. + + stridea + Stride between different ``A`` matrices. + + beta + Scaling factor for the matrices ``C``. + + c + Pointer to input/output matrices ``C`` with size ``stridec`` * ``batch_size``. + + ldc + The leading dimension of the matrices ``C``. It must be positive + and at least ``n``. + + stridec + Stride between different ``C`` matrices. + + batch_size + Specifies the number of rank-k update operations to perform. + + dependencies + List of events to wait for before starting computation, if any. + If omitted, defaults to no dependencies. + +.. container:: section + + .. rubric:: Output Parameters + + c + Output matrices, overwritten by ``batch_size`` rank-k update + operations of the form ``alpha`` * op(``A``)*op(``A``)^T + ``beta`` * ``C``. + +.. container:: section + + .. rubric:: Return Values + + Output event to wait on to ensure computation is complete. + +.. container:: section + + .. rubric:: Throws + + This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here. 
+ + :ref:`oneapi::mkl::invalid_argument` + + + + :ref:`oneapi::mkl::unsupported_device` + + + :ref:`oneapi::mkl::host_bad_alloc` + + + :ref:`oneapi::mkl::device_bad_alloc` + + + :ref:`oneapi::mkl::unimplemented` + + + **Parent topic:** :ref:`blas-like-extensions` diff --git a/source/elements/oneMKL/source/domains/blas/trsm_batch.rst b/source/elements/oneMKL/source/domains/blas/trsm_batch.rst index 7bade164b7..55eb27381e 100644 --- a/source/elements/oneMKL/source/domains/blas/trsm_batch.rst +++ b/source/elements/oneMKL/source/domains/blas/trsm_batch.rst @@ -184,6 +184,25 @@ of matrices in ``a`` and ``b`` buffers are given by the ``batch_size`` parameter If ``alpha`` = 0, matrix ``B`` is set to zero and the matrices ``A`` and ``B`` do not need to be initialized before calling ``trsm_batch``. +.. container:: section + + .. rubric:: Throws + + This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here. + + :ref:`oneapi::mkl::invalid_argument` + + + :ref:`oneapi::mkl::unsupported_device` + + + :ref:`oneapi::mkl::host_bad_alloc` + + + :ref:`oneapi::mkl::device_bad_alloc` + + + :ref:`oneapi::mkl::unimplemented` trsm_batch (USM Version) --------------------------- @@ -500,28 +519,6 @@ in ``a`` and ``b`` are given by the ``batch_size`` parameter. Output event to wait on to ensure computation is complete. - **Parent topic:** :ref:`blas-like-extensions` -.. container:: section - - .. rubric:: Throws - - This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here. - - :ref:`oneapi::mkl::invalid_argument` - - - :ref:`oneapi::mkl::unsupported_device` - - - :ref:`oneapi::mkl::host_bad_alloc` - - - :ref:`oneapi::mkl::device_bad_alloc` - - - :ref:`oneapi::mkl::unimplemented` - - .. 
container:: section .. rubric:: Throws