diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index f4f1d19c5e..b32c2e0df3 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -10,6 +10,8 @@ jobs:
     steps:
     - uses: actions/checkout@v2
     - uses: actions/setup-python@v2
+      with:
+        python-version: '3.9'
     - name: Install prerequisites
       run: |
           sudo apt-get update -qq
diff --git a/source/elements/oneMKL/source/domains/blas/axpby.rst b/source/elements/oneMKL/source/domains/blas/axpby.rst
new file mode 100644
index 0000000000..2fc7fd4908
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/axpby.rst
@@ -0,0 +1,214 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_axpby:
+
+axpby
+=====
+
+Computes a vector-scalar product added to a scaled vector.
+
+.. _onemkl_blas_axpby_description:
+
+.. rubric:: Description
+
+The ``axpby`` routines compute two scalar-vector products and add them:
+
+.. math::
+
+      y \leftarrow beta * y + alpha * x
+
+where ``x`` and ``y`` are vectors of ``n`` elements and ``alpha`` and ``beta`` are scalars.
+
+``axpby`` supports the following precisions.
+
+   .. list-table::
+      :header-rows: 1
+
+      * -  T
+      * -  ``float``
+      * -  ``double``
+      * -  ``std::complex<float>``
+      * -  ``std::complex<double>``
+
+.. _onemkl_blas_axpby_buffer:
+
+axpby (Buffer Version)
+----------------------
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void axpby(sycl::queue &queue,
+                 std::int64_t n,
+                 T alpha,
+                 sycl::buffer<T,1> &x, std::int64_t incx,
+                 T beta,
+                 sycl::buffer<T,1> &y, std::int64_t incy)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void axpby(sycl::queue &queue,
+                 std::int64_t n,
+                 T alpha,
+                 sycl::buffer<T,1> &x, std::int64_t incx,
+                 T beta,
+                 sycl::buffer<T,1> &y, std::int64_t incy)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in vectors ``x`` and ``y``.
+
+   alpha
+      Specifies the scalar ``alpha``.
+
+   x
+      Buffer holding input vector ``x``. The buffer must be of size at least
+      (1 + (``n`` – 1)*abs(``incx``)). See :ref:`matrix-storage` for
+      more details.
+
+   incx
+      Stride between two consecutive elements of the ``x`` vector.
+
+   beta
+      Specifies the scalar ``beta``.
+
+   y
+      Buffer holding input vector ``y``. The buffer must be of size at least
+      (1 + (``n`` – 1)*abs(``incy``)). See :ref:`matrix-storage` for
+      more details.
+
+   incy
+      Stride between two consecutive elements of the ``y`` vector.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Buffer holding the updated vector ``y``.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+
+.. _onemkl_blas_axpby_usm:
+
+axpby (USM Version)
+-------------------
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event axpby(sycl::queue &queue,
+                        std::int64_t n,
+                        T alpha,
+                        const T *x, std::int64_t incx,
+                        const T beta,
+                        T *y, std::int64_t incy,
+                        const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event axpby(sycl::queue &queue,
+                        std::int64_t n,
+                        T alpha,
+                        const T *x, std::int64_t incx,
+                        const T beta,
+                        T *y, std::int64_t incy,
+                        const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in vectors ``x`` and ``y``.
+
+   alpha
+      Specifies the scalar ``alpha``.
+
+   beta
+      Specifies the scalar ``beta``.
+
+   x
+      Pointer to the input vector ``x``. The allocated memory must be
+      of size at least (1 + (``n`` – 1)*abs(``incx``)). See
+      :ref:`matrix-storage` for more details.
+
+   incx
+      Stride between consecutive elements of the ``x`` vector.
+
+   y
+      Pointer to the input vector ``y``. The allocated memory must be
+      of size at least (1 + (``n`` – 1)*abs(``incy``)). See
+      :ref:`matrix-storage` for more details.
+
+   incy
+      Stride between consecutive elements of the ``y`` vector.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Array holding the updated vector ``y``.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+
+   **Parent topic:** :ref:`blas-like-extensions`
+
diff --git a/source/elements/oneMKL/source/domains/blas/axpy_batch.rst b/source/elements/oneMKL/source/domains/blas/axpy_batch.rst
index 942b110bc4..2e41b99e16 100644
--- a/source/elements/oneMKL/source/domains/blas/axpy_batch.rst
+++ b/source/elements/oneMKL/source/domains/blas/axpy_batch.rst
@@ -244,7 +244,7 @@ The total number of vectors in ``x`` and ``y`` are given by the ``batch_size`` p
 
    x
       Array of pointers to input vectors ``X`` with size ``total_batch_count``.
-      The size of array allocated for the ``X`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incx[i]``))``. 
+      The size of array allocated for the ``X`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incx[i]``)). 
       See :ref:`matrix-storage` for more details.
 
    incx
@@ -252,7 +252,7 @@ The total number of vectors in ``x`` and ``y`` are given by the ``batch_size`` p
  
    y
       Array of pointers to input/output vectors ``Y`` with size ``total_batch_count``.
-      The size of array allocated for the ``Y`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incy[i]``))``. 
+      The size of array allocated for the ``Y`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incy[i]``)). 
       See :ref:`matrix-storage` for more details.
 
    incy
diff --git a/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst b/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst
index 9ba5d5d148..d2cbb39acb 100644
--- a/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst
+++ b/source/elements/oneMKL/source/domains/blas/blas-like-extensions.rst
@@ -46,7 +46,12 @@ BLAS-like Extensions
     :hidden:
 
     axpy_batch
+    axpby
+    copy_batch
+    dgmm_batch
     gemm_batch
+    gemv_batch
+    syrk_batch
     trsm_batch
     gemmt
     gemm_bias
diff --git a/source/elements/oneMKL/source/domains/blas/copy_batch.rst b/source/elements/oneMKL/source/domains/blas/copy_batch.rst
new file mode 100644
index 0000000000..93f67dd979
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/copy_batch.rst
@@ -0,0 +1,373 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_copy_batch:
+
+copy_batch
+==========
+
+Computes a group of ``copy`` operations.
+
+.. _onemkl_blas_copy_batch_description:
+
+.. rubric:: Description
+
+The ``copy_batch`` routines are batched versions of :ref:`onemkl_blas_copy`, performing
+multiple ``copy`` operations in a single call. Each ``copy`` 
+operation copies one vector to another.
+   
+``copy_batch`` supports the following precisions for data.
+
+   .. list-table:: 
+      :header-rows: 1
+
+      * -  T 
+      * -  ``float`` 
+      * -  ``double`` 
+      * -  ``std::complex<float>`` 
+      * -  ``std::complex<double>`` 
+
+.. _onemkl_blas_copy_batch_buffer:
+
+copy_batch (Buffer Version)
+---------------------------
+
+.. rubric:: Description
+
+The buffer version of ``copy_batch`` supports only the strided API. 
+
+The strided API operation is defined as:
+::
+  
+   for i = 0 … batch_size – 1
+      X and Y are vectors at offset i * stridex, i * stridey in x and y
+      Y := X
+   end for
+
+where:
+
+``X`` and ``Y`` are vectors.
+   
+**Strided API**
+
+.. rubric:: Syntax
+ 
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void copy_batch(sycl::queue &queue,
+                       std::int64_t n,
+                       sycl::buffer<T,
+                       1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,
+                       1> &y,
+                       std::int64_t incy,
+                       std::int64_t stridey,
+                       std::int64_t batch_size)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void copy_batch(sycl::queue &queue,
+                       std::int64_t n,
+                       sycl::buffer<T,
+                       1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,
+                       1> &y,
+                       std::int64_t incy,
+                       std::int64_t stridey,
+                       std::int64_t batch_size)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in ``X`` and ``Y``.
+
+   x
+      Buffer holding input vectors ``X`` with size ``stridex`` * ``batch_size``.
+
+   incx 
+      Stride of vector ``X``.
+
+   stridex 
+      Stride between different ``X`` vectors.
+
+   y
+      Buffer holding input/output vectors ``Y`` with size ``stridey`` * ``batch_size``.
+
+   incy 
+      Stride of vector ``Y``.
+   
+   stridey 
+      Stride between different ``Y`` vectors.
+
+   batch_size 
+      Specifies the number of ``copy`` operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Output buffer, overwritten by ``batch_size`` ``copy`` operations.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+.. _onemkl_blas_copy_batch_usm:
+
+copy_batch (USM Version)
+------------------------
+
+.. rubric:: Description
+
+The USM version of ``copy_batch`` supports the group API and strided API. 
+
+The group API operation is defined as
+::
+   
+   idx = 0
+   for i = 0 … group_count – 1
+       for j = 0 … group_size[i] – 1
+           X and Y are vectors in x[idx] and y[idx]
+           Y := X
+           idx := idx + 1
+       end for
+   end for
+
+The strided API operation is defined as
+::
+   
+   for i = 0 … batch_size – 1
+      X and Y are vectors at offset i * stridex, i * stridey in x and y
+      Y := X
+   end for
+
+where:
+
+``X`` and ``Y`` are vectors.
+
+For group API, ``x`` and ``y`` arrays contain the pointers for all the input vectors. 
+The total number of vectors in ``x`` and ``y`` is given by:
+
+.. math::
+
+      total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i]    
+
+For strided API, ``x`` and ``y`` arrays contain all the input vectors. 
+The total number of vectors in ``x`` and ``y`` is given by the ``batch_size`` parameter.
+
+**Group API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t *n,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **y,
+                              std::int64_t *incy,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t *n,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **y,
+                              std::int64_t *incy,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Array of ``group_count`` integers. ``n[i]`` specifies the number of elements in vectors ``X`` and ``Y`` for every vector in group ``i``.
+
+   x
+      Array of pointers to input vectors ``X`` with size ``total_batch_count``.
+      The size of array allocated for the ``X`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incx[i]``)). 
+      See :ref:`matrix-storage` for more details.
+
+   incx
+      Array of ``group_count`` integers. ``incx[i]`` specifies the stride of vector ``X`` in group ``i``.
+ 
+   y
+      Array of pointers to input/output vectors ``Y`` with size ``total_batch_count``.
+      The size of array allocated for the ``Y`` vector of the group ``i`` must be at least (1 + (``n[i]`` – 1)*abs(``incy[i]``)). 
+      See :ref:`matrix-storage` for more details.
+
+   incy
+      Array of ``group_count`` integers. ``incy[i]`` specifies the stride of vector ``Y`` in group ``i``.
+
+   group_count
+      Number of groups. Must be at least 0.
+
+   group_size
+      Array of ``group_count`` integers. ``group_size[i]`` specifies the number of ``copy`` operations in group ``i``. 
+      Each element in ``group_size`` must be at least 0.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Array of pointers holding the ``Y`` vectors, overwritten by ``total_batch_count`` ``copy`` operations.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t n,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *y,
+                              std::int64_t incy,
+                              std::int64_t stridey,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event copy_batch(sycl::queue &queue,
+                              std::int64_t n,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *y,
+                              std::int64_t incy,
+                              std::int64_t stridey,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   n
+      Number of elements in ``X`` and ``Y``.
+
+   x
+      Pointer to input vectors ``X`` with size ``stridex`` * ``batch_size``.
+
+   incx 
+      Stride of vector ``X``.
+   
+   stridex 
+      Stride between different ``X`` vectors.
+
+   y
+      Pointer to input/output vectors ``Y`` with size ``stridey`` * ``batch_size``.
+
+   incy 
+      Stride of vector ``Y``.
+   
+   stridey 
+      Stride between different ``Y`` vectors.
+
+   batch_size 
+      Specifies the number of ``copy`` operations to perform.
+  
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Output vectors, overwritten by ``batch_size`` ``copy`` operations.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+   **Parent topic:** :ref:`blas-like-extensions`
diff --git a/source/elements/oneMKL/source/domains/blas/dgmm_batch.rst b/source/elements/oneMKL/source/domains/blas/dgmm_batch.rst
new file mode 100644
index 0000000000..403a8a156c
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/dgmm_batch.rst
@@ -0,0 +1,507 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_dgmm_batch:
+
+dgmm_batch
+==========
+
+Computes a group of ``dgmm`` operations.
+
+.. _onemkl_blas_dgmm_batch_description:
+
+.. rubric:: Description
+
+The ``dgmm_batch`` routines perform
+multiple diagonal matrix-matrix product operations in a single call.
+   
+``dgmm_batch`` supports the following precisions.
+
+   .. list-table:: 
+      :header-rows: 1
+
+      * -  T 
+      * -  ``float`` 
+      * -  ``double`` 
+      * -  ``std::complex<float>`` 
+      * -  ``std::complex<double>`` 
+
+.. _onemkl_blas_dgmm_batch_buffer:
+
+dgmm_batch (Buffer Version)
+---------------------------
+
+.. rubric:: Description
+
+The buffer version of ``dgmm_batch`` supports only the strided API. 
+
+The strided API operation is defined as:
+::
+
+   for i = 0 … batch_size – 1
+       A and C are matrices at offset i * stridea in a, i * stridec in c.
+       X is a vector at offset i * stridex in x
+       C := diag(X) * A or C := A * diag(X)
+   end for
+
+where:
+
+``A`` is a matrix,
+
+``X`` is a diagonal matrix stored as a vector.
+
+The ``a`` and ``x`` buffers contain all the input matrices. The stride 
+between matrices is given by the stride parameter. The total number
+of matrices in ``a`` and ``x`` buffers is given by the ``batch_size`` parameter.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void dgmm_batch(sycl::queue &queue,
+                       oneapi::mkl::side left_right,
+                       std::int64_t m,
+                       std::int64_t n,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,1> &c,
+                       std::int64_t ldc,
+                       std::int64_t stridec,
+                       std::int64_t batch_size)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void dgmm_batch(sycl::queue &queue,
+                       oneapi::mkl::side left_right,
+                       std::int64_t m,
+                       std::int64_t n,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       sycl::buffer<T,1> &c,
+                       std::int64_t ldc,
+                       std::int64_t stridec,
+                       std::int64_t batch_size)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   left_right
+      Specifies the position of the diagonal matrix in the product.
+      See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Number of rows of matrices ``A`` and ``C``. Must be at least zero.
+
+   n
+      Number of columns of matrices ``A`` and ``C``. Must be at least zero.
+
+   a
+      Buffer holding the input matrices ``A`` with size ``stridea`` *
+      ``batch_size``. Must be of size at least
+      ``lda`` * ``j`` + ``stridea`` * (``batch_size`` - 1),
+      where ``j`` is ``n`` if column major layout is used
+      or ``m`` if row major layout is used.
+
+   lda
+      The leading dimension of the matrices ``A``. It must be positive
+      and at least ``m`` if column major layout is used or at least
+      ``n`` if row major layout is used.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   x
+      Buffer holding the input matrices ``X`` with size ``stridex`` *
+      ``batch_size``.  Must be of size at least 
+      (1 + (``len`` - 1)*abs(``incx``)) + ``stridex`` * (``batch_size`` - 1) 
+      where ``len`` is ``n`` if the diagonal matrix is on the right 
+      of the product or ``m`` otherwise.
+
+   incx
+      Stride between two consecutive elements of the ``x`` vectors.
+
+   stridex
+      Stride between different ``X`` vectors. Must be at least 0.
+
+   c
+      Buffer holding input/output matrices ``C`` with size ``stridec`` * ``batch_size``.
+
+   ldc
+      The leading dimension of the matrices ``C``. It must be positive and at least
+      ``m`` if column major layout is used to store matrices or at
+      least ``n`` if row major layout is used to store matrices.
+
+   stridec
+      Stride between different ``C`` matrices. Must be at least
+      ``ldc`` * ``n`` if column major layout is used or ``ldc`` * ``m`` if row
+      major layout is used.
+
+   batch_size
+      Specifies the number of diagonal matrix-matrix product operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output overwritten by ``batch_size`` diagonal matrix-matrix product
+      operations.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+.. _onemkl_blas_dgmm_batch_usm:
+
+dgmm_batch (USM Version)
+------------------------
+
+.. rubric:: Description
+
+The USM version of ``dgmm_batch`` supports the group API and strided API. 
+
+The group API operation is defined as:
+::
+
+   idx = 0
+   for i = 0 … group_count – 1
+       for j = 0 … group_size[i] – 1
+           a and c are matrices of size mxn at position idx in a_array and c_array
+           x is a vector of size m or n depending on left_right, at position idx in x_array
+           if (left_right == oneapi::mkl::side::left)
+               c := diag(x) * a
+           else
+               c := a * diag(x)
+           idx := idx + 1
+       end for
+   end for
+
+The strided API operation is defined as
+::
+
+   for i = 0 … batch_size – 1
+       A and C are matrices at offset i * stridea in a, i * stridec in c.
+       X is a vector at offset i * stridex in x
+       C := diag(X) * A or C := A * diag(X)
+   end for
+
+where:
+
+``A`` is a matrix,
+
+``X`` is a diagonal matrix stored as a vector.
+
+The ``a`` and ``x`` arrays contain all the input matrices. The stride
+between matrices is given by the stride parameter. The total number
+of matrices in ``a`` and ``x`` arrays is given by the ``batch_size`` parameter.
+ 
+For group API, ``a`` and ``x`` arrays contain the pointers for all the input matrices. 
+The total number of matrices in ``a`` and ``x`` is given by:
+
+.. math::
+
+      total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i]    
+ 
+For strided API, ``a`` and ``x`` arrays contain all the input matrices. The total number of matrices
+in ``a`` and ``x`` is given by the ``batch_size`` parameter.
+   
+**Group API**
+
+.. rubric:: Syntax
+   
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              oneapi::mkl::side *left_right,
+                              std::int64_t *m,
+                              std::int64_t *n,
+                              const T **a,
+                              std::int64_t *lda,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **c,
+                              std::int64_t *ldc,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              oneapi::mkl::side *left_right,
+                              std::int64_t *m,
+                              std::int64_t *n,
+                              const T **a,
+                              std::int64_t *lda,
+                              const T **x,
+                              std::int64_t *incx,
+                              T **c,
+                              std::int64_t *ldc,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   left_right
+      Specifies the position of the diagonal matrix in the product.
+      See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Array of ``group_count`` integers. ``m[i]`` specifies the
+      number of rows of ``A`` for every matrix in group ``i``. All entries must be at least zero.
+
+   n
+      Array of ``group_count`` integers. ``n[i]`` specifies the
+      number of columns of ``A`` for every matrix in group ``i``. All entries must be at least zero.
+
+   a
+      Array of pointers to input matrices ``A`` with size
+      ``total_batch_count``.  Must be of size at least ``lda[i]`` * ``n[i]`` if
+      column major layout is used or at least ``lda[i]`` * ``m[i]`` if row major
+      layout is used.
+      See :ref:`matrix-storage` for more details.
+
+   lda
+      Array of ``group_count`` integers. ``lda[i]`` specifies the
+      leading dimension of ``A`` for every matrix in group ``i``. All
+      entries must be positive and at least ``m[i]`` if column major
+      layout is used or at least ``n[i]`` if row major layout is used.
+
+   x
+      Array of pointers to input vectors ``X`` with size
+      ``total_batch_count``. Each vector must be of size at least
+      (1 + (``len[i]`` – 1)*abs(``incx[i]``)), where ``len[i]`` is ``n[i]`` if the
+      diagonal matrix is on the right of the product or ``m[i]`` otherwise.
+      See :ref:`matrix-storage` for more details.
+
+   incx
+      Array of ``group_count`` integers. ``incx[i]`` specifies the
+      stride of ``x`` for every vector in group ``i``. All entries
+      must be positive.
+   c
+      Array of pointers to input/output matrices ``C`` with size ``total_batch_count``. 
+      Must be of size at least
+      ``ldc[i]`` * ``n[i]``
+      if column major layout is used or at least
+      ``ldc[i]`` * ``m[i]``
+      if row major layout is used.
+      See :ref:`matrix-storage` for more details.
+
+   ldc
+      Array of ``group_count`` integers. ``ldc[i]`` specifies the
+      leading dimension of ``C`` for every matrix in group ``i``.  All
+      entries must be positive and ``ldc[i]`` must be at least
+      ``m[i]`` if column major layout is used to store matrices or at
+      least ``n[i]`` if row major layout is used to store matrices.
+
+   group_count
+      Specifies the number of groups. Must be at least 0.
+
+   group_size
+      Array of ``group_count`` integers. ``group_size[i]`` specifies the
+      number of diagonal matrix-matrix product operations in group ``i``.
+      All entries must be at least 0.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output overwritten by ``total_batch_count`` diagonal matrix-matrix product
+      operations.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              onemkl::mkl::side left_right,
+                              std::int64_t m,
+                              std::int64_t n,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *c,
+                              std::int64_t ldc,
+                              std::int64_t stridec,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event dgmm_batch(sycl::queue &queue,
+                              onemkl::mkl::side left_right,
+                              std::int64_t m,
+                              std::int64_t n,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T *c,
+                              std::int64_t ldc,
+                              std::int64_t stridec,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   left_right
+      Specifies the position of the diagonal matrix in the product.
+      See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Number of rows of ``A``. Must be at least zero.
+
+   n
+      Number of columns of ``A``. Must be at least zero.
+
+   a
+      Pointer to input matrices ``A`` with size ``stridea`` *
+      ``batch_size``.  Must be of size at least
+      ``lda`` * ``k`` + ``stridea`` * (``batch_size`` - 1) 
+      where ``k`` is ``n`` if column major layout is used 
+      or ``m`` if row major layout is used.
+
+   lda
+      The leading dimension of the matrices ``A``. Must be positive
+      and at least ``m`` if column major layout is used or at least
+      ``n`` if row major layout is used.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   x
+      Pointer to input vectors ``X`` with size ``stridex`` * ``batch_size``.
+      Must be of size at least
+      (1 + (``len`` - 1)*abs(``incx``)) + ``stridex`` * (``batch_size`` - 1)
+      where ``len`` is ``n`` if the diagonal matrix is on the right
+      of the product or ``m`` otherwise.
+
+   incx
+      Stride between two consecutive elements of the ``x`` vector.
+
+   stridex
+      Stride between different ``X`` vectors. Must be at least 0.
+
+   c
+      Pointer to input/output matrices ``C`` with size ``stridec`` * ``batch_size``.
+
+   ldc
+      The leading dimension of the matrices ``C``. It must be positive and at least
+      ``m`` if column major layout is used to store matrices or at
+      least ``n`` if row major layout is used to store matrices.
+
+   stridec
+      Stride between different ``C`` matrices. Must be at least
+      ``ldc`` * ``n`` if column major layout is used or 
+      ``ldc`` * ``m`` if row major layout is used.
+
+   batch_size
+      Specifies the number of diagonal matrix-matrix product operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output overwritten by ``batch_size`` diagonal matrix-matrix product
+      operations.
+
+.. container:: section
+      
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+   **Parent topic:** :ref:`blas-like-extensions`
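
The strided ``dgmm_batch`` parameters documented above can be illustrated with a host-side reference loop. This is only a sketch under stated assumptions: it takes the operation to be ``C := diag(x) * A`` (left) or ``C := A * diag(x)`` (right), as the parameter descriptions suggest, uses column major storage, and assumes ``incx > 0``. The names ``ref_side`` and ``ref_dgmm_batch_strided`` are hypothetical and are not part of oneMKL.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the side enumeration; not the oneMKL type.
enum class ref_side { left, right };

// Host-side reference for the column-major strided operation (assumed semantics):
// for each batch i, C_i := diag(x_i) * A_i (left) or C_i := A_i * diag(x_i) (right).
// Assumes incx > 0 to keep the indexing simple.
template <typename T>
void ref_dgmm_batch_strided(ref_side left_right, std::int64_t m, std::int64_t n,
                            const T *a, std::int64_t lda, std::int64_t stridea,
                            const T *x, std::int64_t incx, std::int64_t stridex,
                            T *c, std::int64_t ldc, std::int64_t stridec,
                            std::int64_t batch_size) {
    for (std::int64_t i = 0; i < batch_size; ++i) {
        const T *A = a + i * stridea;  // i-th input matrix
        const T *X = x + i * stridex;  // i-th diagonal, stored as a vector
        T *C = c + i * stridec;        // i-th output matrix
        for (std::int64_t col = 0; col < n; ++col) {
            for (std::int64_t row = 0; row < m; ++row) {
                // Left: element is scaled by x[row]; right: by x[col].
                const std::int64_t k = (left_right == ref_side::left) ? row : col;
                C[row + col * ldc] = A[row + col * lda] * X[k * incx];
            }
        }
    }
}
```

With the diagonal on the left, ``x`` needs ``m`` logical elements per batch entry; on the right it needs ``n``, which matches the ``len`` rule in the description of ``x``.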
diff --git a/source/elements/oneMKL/source/domains/blas/gemm.rst b/source/elements/oneMKL/source/domains/blas/gemm.rst
index bcefd90c2e..c981f2dff2 100644
--- a/source/elements/oneMKL/source/domains/blas/gemm.rst
+++ b/source/elements/oneMKL/source/domains/blas/gemm.rst
@@ -53,6 +53,10 @@ op(``X``) = ``X``\ :sup:`H`,
        -  ``half`` 
        -  ``half`` 
        -  ``half`` 
+     * -  ``float``
+       -  ``bfloat16``
+       -  ``bfloat16``
+       -  ``float``
      * -  ``float`` 
        -  ``float`` 
        -  ``float`` 
diff --git a/source/elements/oneMKL/source/domains/blas/gemm_batch.rst b/source/elements/oneMKL/source/domains/blas/gemm_batch.rst
index 94a15c11d0..32be79475a 100644
--- a/source/elements/oneMKL/source/domains/blas/gemm_batch.rst
+++ b/source/elements/oneMKL/source/domains/blas/gemm_batch.rst
@@ -23,6 +23,7 @@ operation perform a matrix-matrix product with general matrices.
       :header-rows: 1
 
       * -  T 
+      * -  ``half``
       * -  ``float`` 
       * -  ``double`` 
       * -  ``std::complex<float>`` 
@@ -244,7 +245,8 @@ gemm_batch (USM Version)
 
 .. rubric:: Description
 
-The USM version of ``gemm_batch`` supports the group API and strided API. 
+The USM version of ``gemm_batch`` supports the group API and the strided API.
+The group API supports pointer and span inputs.
 
 The group API operation is defined as:
 ::
@@ -258,6 +260,25 @@ The group API operation is defined as:
        end for
    end for
 
+The advantage of using a span instead of a pointer is that the size of
+the array can vary and the size of the span can be queried at
+runtime. For each GEMM parameter, except the output matrices, the span
+can be of size 1, the number of groups or the total batch size. For
+the output matrices, to ensure all computations are independent, the size
+of the span must be the total batch size.
+
+Depending on the size of the spans, each parameter for the GEMM computation is used as follows:
+
+  - If the span has size 1, the parameter is reused for all GEMM
+    computations.
+
+  - If the span has size ``group_count``, the parameter is reused for all
+    GEMM computations within a group, but each group will have a different value
+    for this parameter.  This is like the ``gemm_batch`` group API with pointers.
+
+  - If the span has size equal to the total batch size, each GEMM
+    computation will use a different value for this parameter.
+
 The strided API operation is defined as
 ::
 
@@ -311,6 +332,24 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
                               std::int64_t group_count,
                               std::int64_t *group_size,
                               const std::vector<sycl::event> &dependencies = {})
+
+       sycl::event gemm_batch(sycl::queue &queue,
+                              const sycl::span<onemkl::transpose> &transa,
+                              const sycl::span<onemkl::transpose> &transb,
+                              const sycl::span<std::int64_t> &m,
+                              const sycl::span<std::int64_t> &n,
+                              const sycl::span<std::int64_t> &k,
+                              const sycl::span<T> &alpha,
+                              const sycl::span<const T*> &a,
+                              const sycl::span<std::int64_t> &lda,
+                              const sycl::span<const T*> &b,
+                              const sycl::span<std::int64_t> &ldb,
+                              const sycl::span<T> &beta,
+                              sycl::span<T*> &c,
+                              const sycl::span<std::int64_t> &ldc,
+                              size_t group_count,
+                              const sycl::span<size_t> &group_sizes,
+                              const std::vector<sycl::event> &dependencies = {})
    }
 .. code-block:: cpp
 
@@ -332,6 +371,24 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
                               std::int64_t group_count,
                               std::int64_t *group_size,
                               const std::vector<sycl::event> &dependencies = {})
+
+       sycl::event gemm_batch(sycl::queue &queue,
+                              const sycl::span<onemkl::transpose> &transa,
+                              const sycl::span<onemkl::transpose> &transb,
+                              const sycl::span<std::int64_t> &m,
+                              const sycl::span<std::int64_t> &n,
+                              const sycl::span<std::int64_t> &k,
+                              const sycl::span<T> &alpha,
+                              const sycl::span<const T*> &a,
+                              const sycl::span<std::int64_t> &lda,
+                              const sycl::span<const T*> &b,
+                              const sycl::span<std::int64_t> &ldb,
+                              const sycl::span<T> &beta,
+                              sycl::span<T*> &c,
+                              const sycl::span<std::int64_t> &ldc,
+                              size_t group_count,
+                              const sycl::span<size_t> &group_sizes,
+                              const std::vector<sycl::event> &dependencies = {})
    }
 
 .. container:: section
@@ -342,37 +399,37 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
       The queue where the routine should be executed.
 
    transa
-      Array of ``group_count`` ``onemkl::transpose`` values. ``transa[i]`` specifies the form of op(``A``) used in
+      Array or span of ``group_count`` ``onemkl::transpose`` values. ``transa[i]`` specifies the form of op(``A``) used in
       the matrix multiplication in group ``i``. See :ref:`onemkl_datatypes` for more details.
 
    transb
-      Array of ``group_count`` ``onemkl::transpose`` values. ``transb[i]`` specifies the form of op(``B``) used in
+      Array or span of ``group_count`` ``onemkl::transpose`` values. ``transb[i]`` specifies the form of op(``B``) used in
       the matrix multiplication in group ``i``. See :ref:`onemkl_datatypes` for more details.
 
    m
-      Array of ``group_count`` integers. ``m[i]`` specifies the
+      Array or span of ``group_count`` integers. ``m[i]`` specifies the
       number of rows of op(``A``) and ``C`` for every matrix in group ``i``. All entries must be at least zero.
 
    n
-      Array of ``group_count`` integers. ``n[i]`` specifies the
+      Array or span of ``group_count`` integers. ``n[i]`` specifies the
       number of columns of op(``B``) and ``C`` for every matrix in group ``i``. All entries must be at least zero.
 
    k
-      Array of ``group_count`` integers. ``k[i]`` specifies the
+      Array or span of ``group_count`` integers. ``k[i]`` specifies the
       number of columns of op(``A``) and rows of op(``B``) for every matrix in group ``i``. All entries must be at
       least zero.
 
    alpha
-      Array of ``group_count`` scalar elements. ``alpha[i]`` specifies the scaling factor for every matrix-matrix
+      Array or span of ``group_count`` scalar elements. ``alpha[i]`` specifies the scaling factor for every matrix-matrix
       product in group ``i``.
 
    a
-      Array of pointers to input matrices ``A`` with size ``total_batch_count``. 
+      Array of pointers or span of input matrices ``A`` with size ``total_batch_count``. 
       
       See :ref:`matrix-storage` for more details.
 
    lda
-      Array of ``group_count`` integers. ``lda[i]`` specifies the
+      Array or span of ``group_count`` integers. ``lda[i]`` specifies the
       leading dimension of ``A`` for every matrix in group ``i``. All
       entries must be positive.
 
@@ -390,13 +447,12 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
            - ``lda[i]`` must be at least ``m[i]``.
              
    b
-      Array of pointers to input matrices ``B`` with size ``total_batch_count``. 
+      Array of pointers or span of input matrices ``B`` with size ``total_batch_count``. 
       
       See :ref:`matrix-storage` for more details.
 
    ldb
-   
-      Array of ``group_count`` integers. ``ldb[i]`` specifies the
+      Array or span of ``group_count`` integers. ``ldb[i]`` specifies the
       leading dimension of ``B`` for every matrix in group ``i``. All
       entries must be positive.
 
@@ -414,16 +470,16 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
            - ``ldb[i]`` must be at least ``k[i]``.
              
    beta
-      Array of ``group_count`` scalar elements. ``beta[i]`` specifies the scaling factor for matrix ``C`` 
+      Array or span of ``group_count`` scalar elements. ``beta[i]`` specifies the scaling factor for matrix ``C`` 
       for every matrix in group ``i``.
 
    c
-      Array of pointers to input/output matrices ``C`` with size ``total_batch_count``. 
+      Array of pointers or span of input/output matrices ``C`` with size ``total_batch_count``. 
       
       See :ref:`matrix-storage` for more details.
 
    ldc
-      Array of ``group_count`` integers. ``ldc[i]`` specifies the
+      Array or span of ``group_count`` integers. ``ldc[i]`` specifies the
       leading dimension of ``C`` for every matrix in group ``i``.  All
       entries must be positive and ``ldc[i]`` must be at least
       ``m[i]`` if column major layout is used to store matrices or at
@@ -433,7 +489,7 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
       Specifies the number of groups. Must be at least 0.
 
    group_size
-      Array of ``group_count`` integers. ``group_size[i]`` specifies the
+      Array or span of ``group_count`` integers. ``group_size[i]`` specifies the
       number of matrix multiply products in group ``i``. All entries must be at least 0.
 
    dependencies
@@ -461,6 +517,27 @@ in ``a``, ``b`` and ``c`` are given by the ``batch_size`` parameter.
 
    Output event to wait on to ensure computation is complete.
 
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Overwritten by the ``m[i]``-by-``n[i]`` matrix calculated by
+      (``alpha[i]`` * op(``A``)*op(``B``) + ``beta[i]`` * ``C``) for group ``i``.
+
+.. container:: section
+
+   .. rubric:: Notes
+
+   If ``beta`` = 0, matrix ``C`` does not need to be initialized
+   before calling ``gemm_batch``.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
 **Strided API**
 
 .. rubric:: Syntax
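
The span-size rules added to ``gemm_batch.rst`` (size 1, ``group_count``, or total batch size) can be sketched as an index-selection helper. This is an illustrative assumption about how an implementation might resolve a parameter for a given GEMM, not the library's code; ``param_index`` and its arguments are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: returns which element of a parameter span of length
// `span_size` applies to the GEMM with global index `batch_idx`, which
// belongs to group `group_idx`.
inline std::size_t param_index(std::size_t span_size, std::size_t group_count,
                               std::size_t total_batch_count,
                               std::size_t group_idx, std::size_t batch_idx) {
    if (span_size == 1) return 0;                          // one value shared by every GEMM
    if (span_size == total_batch_count) return batch_idx;  // one value per GEMM
    if (span_size == group_count) return group_idx;        // one value per group
    throw std::invalid_argument(
        "span size must be 1, group_count, or the total batch size");
}
```

For two groups of sizes 2 and 3, the last GEMM has ``group_idx`` 1 and ``batch_idx`` 4, so a span of size 2 yields element 1 and a span of size 5 yields element 4. When ``group_count`` equals the total batch size the two interpretations are ambiguous; this sketch picks the per-GEMM one.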
diff --git a/source/elements/oneMKL/source/domains/blas/gemv.rst b/source/elements/oneMKL/source/domains/blas/gemv.rst
index 5b8f2c58b2..8d7ac094eb 100644
--- a/source/elements/oneMKL/source/domains/blas/gemv.rst
+++ b/source/elements/oneMKL/source/domains/blas/gemv.rst
@@ -230,7 +230,7 @@ gemv (USM Version)
       Scaling factor for the matrix-vector product.
 
    a
-      The pointer to the input matrix ``A``. Must have a size of at
+      Pointer to the input matrix ``A``. Must have a size of at
       least ``lda``\ \*\ ``n`` if column major layout is used or at
       least ``lda``\ \*\ ``m`` if row major layout is used. See
       :ref:`matrix-storage` for more details.
diff --git a/source/elements/oneMKL/source/domains/blas/gemv_batch.rst b/source/elements/oneMKL/source/domains/blas/gemv_batch.rst
new file mode 100644
index 0000000000..7c047b9517
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/gemv_batch.rst
@@ -0,0 +1,517 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_gemv_batch:
+
+gemv_batch
+==========
+
+Computes a group of ``gemv`` operations.
+
+.. _onemkl_blas_gemv_batch_description:
+
+.. rubric:: Description
+
+The ``gemv_batch`` routines are batched versions of
+:ref:`onemkl_blas_gemv`, performing multiple ``gemv`` operations in a
+single call. Each ``gemv`` operation performs a scalar-matrix-vector
+product and adds the result to a scalar-vector product.
+   
+``gemv_batch`` supports the following precisions.
+
+   .. list-table:: 
+      :header-rows: 1
+
+      * -  T 
+      * -  ``float`` 
+      * -  ``double`` 
+      * -  ``std::complex<float>`` 
+      * -  ``std::complex<double>`` 
+
+.. _onemkl_blas_gemv_batch_buffer:
+
+gemv_batch (Buffer Version)
+---------------------------
+
+.. rubric:: Description
+
+The buffer version of ``gemv_batch`` supports only the strided API. 
+
+The strided API operation is defined as:
+::
+
+   for i = 0 … batch_size – 1
+       A is a matrix at offset i * stridea in a.
+       X and Y are vectors at offset i * stridex, i * stridey, in x and y.
+       Y := alpha * op(A) * X + beta * Y
+   end for
+
+where:
+
+op(A) is one of op(A) = A, or op(A) = A\ :sup:`T`, or op(A) = A\ :sup:`H`,
+
+``alpha`` and ``beta`` are scalars,
+
+``A`` is a matrix and ``X`` and ``Y`` are vectors.
+
+The ``x`` and ``y`` buffers contain all the input vectors. The stride
+between vectors is given by the ``stridex`` and ``stridey`` parameters.
+The total number of
+vectors in ``x`` and ``y`` buffers is given by the ``batch_size``
+parameter.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void gemv_batch(sycl::queue &queue,
+                       onemkl::transpose trans,
+                       std::int64_t m,
+                       std::int64_t n,
+                       T alpha,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       T beta,
+                       sycl::buffer<T,1> &y,
+                       std::int64_t incy,
+                       std::int64_t stridey,
+                       std::int64_t batch_size)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void gemv_batch(sycl::queue &queue,
+                       onemkl::transpose trans,
+                       std::int64_t m,
+                       std::int64_t n,
+                       T alpha,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       sycl::buffer<T,1> &x,
+                       std::int64_t incx,
+                       std::int64_t stridex,
+                       T beta,
+                       sycl::buffer<T,1> &y,
+                       std::int64_t incy,
+                       std::int64_t stridey,
+                       std::int64_t batch_size)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   trans
+      Specifies op(``A``), the transposition operation applied to the
+      matrices ``A``. See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Number of rows of op(``A``). Must be at least zero.
+
+   n
+      Number of columns of op(``A``). Must be at least zero.
+
+   alpha
+      Scaling factor for the matrix-vector products.
+
+   a
+      Buffer holding the input matrices ``A`` with size ``stridea`` * ``batch_size``.
+
+   lda
+      The leading dimension of the matrices ``A``. It must be positive
+      and at least ``m`` if column major layout is used or at least
+      ``n`` if row major layout is used.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   x
+      Buffer holding the input vectors ``X`` with size ``stridex`` * ``batch_size``.
+
+   incx
+      The stride of the vector ``X``. It must be positive.
+
+   stridex
+      Stride between two consecutive ``X`` vectors. Must be at least 0.
+
+   beta
+      Scaling factor for the vector ``Y``.
+
+   y
+      Buffer holding input/output vectors ``Y`` with size ``stridey`` * ``batch_size``.
+
+   incy
+      Stride between two consecutive elements of the ``y`` vectors.
+
+   stridey
+      Stride between two consecutive ``Y`` vectors. Must be at least
+      (1 + (len-1)*abs(incy)) where ``len`` is ``m`` if the matrix ``A``
+      is not transposed or ``n`` otherwise.
+
+   batch_size
+      Specifies the number of matrix-vector operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Output overwritten by ``batch_size`` matrix-vector product
+      operations of the form ``alpha`` * op(``A``) * ``X`` + ``beta`` * ``Y``.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+.. _onemkl_blas_gemv_batch_usm:
+
+gemv_batch (USM Version)
+---------------------------
+
+.. rubric:: Description
+
+The USM version of ``gemv_batch`` supports the group API and strided API. 
+
+The group API operation is defined as:
+::
+
+   idx = 0
+   for i = 0 … group_count – 1
+       for j = 0 … group_size[i] – 1
+           A is an m x n matrix in a[idx]
+           X and Y are vectors in x[idx] and y[idx]
+           Y := alpha[i] * op(A) * X + beta[i] * Y
+           idx = idx + 1
+       end for
+   end for
+
+The strided API operation is defined as
+::
+
+   for i = 0 … batch_size – 1
+       A is a matrix at offset i * stridea in a.
+       X and Y are vectors at offset i * stridex, i * stridey in x and y.
+       Y := alpha * op(A) * X + beta * Y
+   end for
+
+where:
+
+op(A) is one of op(A) = A, or op(A) = A\ :sup:`T`, or op(A) = A\ :sup:`H`,
+
+``alpha`` and ``beta`` are scalars,
+
+``A`` is a matrix and ``X`` and ``Y`` are vectors,
+
+For the group API, the ``x`` and ``y`` arrays contain the pointers to all the input vectors.
+The ``a`` array contains the pointers to all input matrices.
+The total number of vectors in ``x`` and ``y`` and matrices in ``a`` is given by:
+
+.. math::
+
+      total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i]    
+ 
+For the strided API, the ``x`` and ``y`` arrays contain all the input
+vectors. The ``a`` array contains all the input matrices.  The
+total number of vectors in ``x`` and ``y`` and matrices in ``a`` is given by the
+``batch_size`` parameter.
+   
+**Group API**
+
+.. rubric:: Syntax
+   
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event gemv_batch(sycl::queue &queue,
+                              onemkl::transpose *trans,
+                              std::int64_t *m,
+                              std::int64_t *n,
+                              T *alpha,
+                              const T **a,
+                              std::int64_t *lda,
+                              const T **x,
+                              std::int64_t *incx,
+                              T *beta,
+                              T **y,
+                              std::int64_t *incy,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event gemv_batch(sycl::queue &queue,
+                              onemkl::transpose *trans,
+                              std::int64_t *m,
+                              std::int64_t *n,
+                              T *alpha,
+                              const T **a,
+                              std::int64_t *lda,
+                              const T **x,
+                              std::int64_t *incx,
+                              T *beta,
+                              T **y,
+                              std::int64_t *incy,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   trans
+      Array of ``group_count`` ``onemkl::transpose`` values. ``trans[i]`` specifies the form of op(``A``) used in
+      the matrix-vector product in group ``i``. See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Array of ``group_count`` integers. ``m[i]`` specifies the
+      number of rows of op(``A``) for every matrix in group ``i``. All entries must be at least zero.
+
+   n
+      Array of ``group_count`` integers. ``n[i]`` specifies the
+      number of columns of op(``A``) for every matrix in group ``i``. All entries must be at least zero.
+
+   alpha
+      Array of ``group_count`` scalar elements. ``alpha[i]`` specifies
+      the scaling factor for every matrix-vector product in group
+      ``i``.
+
+   a
+      Array of pointers to input matrices ``A`` with size ``total_batch_count``. 
+      
+      See :ref:`matrix-storage` for more details.
+
+   lda
+      Array of ``group_count`` integers. ``lda[i]`` specifies the
+      leading dimension of ``A`` for every matrix in group ``i``. All
+      entries must be positive and at least ``m[i]`` if column major
+      layout is used or at least ``n[i]`` if row major layout is used.
+             
+   x
+      Array of pointers to input vectors ``X`` with size ``total_batch_count``. 
+      
+      See :ref:`matrix-storage` for more details.
+
+   incx
+      Array of ``group_count`` integers. ``incx[i]`` specifies the
+      stride of ``X`` for every vector in group ``i``. All
+      entries must be positive.
+             
+   beta
+      Array of ``group_count`` scalar elements. ``beta[i]`` specifies
+      the scaling factor for vector ``Y`` for every vector in group
+      ``i``.
+
+   y
+      Array of pointers to input/output vectors ``Y`` with size ``total_batch_count``. 
+      
+      See :ref:`matrix-storage` for more details.
+
+   incy
+      Array of ``group_count`` integers. ``incy[i]`` specifies the
+      stride of ``Y`` for every vector in group ``i``.  All
+      entries must be positive.
+
+   group_count
+      Specifies the number of groups. Must be at least 0.
+
+   group_size
+      Array of ``group_count`` integers. ``group_size[i]`` specifies the
+      number of matrix-vector products in group ``i``. All entries must be at least 0.
+
+   dependencies
+         List of events to wait for before starting computation, if any.
+         If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Overwritten by vector calculated by 
+      (``alpha[i]`` * op(``A``) * ``X`` + ``beta[i]`` * ``Y``) for group ``i``.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event gemv_batch(sycl::queue &queue,
+                              onemkl::transpose trans,
+                              std::int64_t m,
+                              std::int64_t n,
+                              T alpha,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T beta,
+                              T *y,
+                              std::int64_t incy,
+                              std::int64_t stridey,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event gemv_batch(sycl::queue &queue,
+                              onemkl::transpose trans,
+                              std::int64_t m,
+                              std::int64_t n,
+                              T alpha,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              const T *x,
+                              std::int64_t incx,
+                              std::int64_t stridex,
+                              T beta,
+                              T *y,
+                              std::int64_t incy,
+                              std::int64_t stridey,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   trans
+      Specifies op(``A``), the transposition operation applied to the
+      matrices ``A``. See :ref:`onemkl_datatypes` for more details.
+
+   m
+      Number of rows of op(``A``). Must be at least zero.
+
+   n
+      Number of columns of op(``A``). Must be at least zero.
+
+   alpha
+      Scaling factor for the matrix-vector products.
+
+   a
+      Pointer to the input matrices ``A`` with size ``stridea`` * ``batch_size``.
+
+   lda
+      The leading dimension of the matrices ``A``. It must be positive
+      and at least ``m`` if column major layout is used or at least
+      ``n`` if row major layout is used.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   x
+      Pointer to the input vectors ``X`` with size ``stridex`` * ``batch_size``.
+
+   incx
+      Stride of the vector ``X``. It must be positive.
+
+   stridex
+      Stride between two consecutive ``X`` vectors. Must be at least 0.
+
+   beta
+      Scaling factor for the vector ``Y``.
+
+   y
+      Pointer to the input/output vectors ``Y`` with size ``stridey`` * ``batch_size``.
+
+   incy
+      Stride between two consecutive elements of the ``y`` vectors.
+
+   stridey
+      Stride between two consecutive ``Y`` vectors. Must be at least
+      (1 + (len-1)*abs(incy)) where ``len`` is ``m`` if the matrix ``A``
+      is not transposed or ``n`` otherwise.
+
+   batch_size
+      Specifies the number of matrix-vector operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   y
+      Output overwritten by ``batch_size`` matrix-vector product
+      operations of the form ``alpha`` * op(``A``) * ``X`` + ``beta`` * ``Y``.
+
+.. container:: section
+      
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+   **Parent topic:** :ref:`blas-like-extensions`
diff --git a/source/elements/oneMKL/source/domains/blas/syrk_batch.rst b/source/elements/oneMKL/source/domains/blas/syrk_batch.rst
new file mode 100644
index 0000000000..5b7cd71d42
--- /dev/null
+++ b/source/elements/oneMKL/source/domains/blas/syrk_batch.rst
@@ -0,0 +1,529 @@
+.. SPDX-FileCopyrightText: 2019-2020 Intel Corporation
+..
+.. SPDX-License-Identifier: CC-BY-4.0
+
+.. _onemkl_blas_syrk_batch:
+
+syrk_batch
+==========
+
+Computes a group of ``syrk`` operations.
+
+.. _onemkl_blas_syrk_batch_description:
+
+.. rubric:: Description
+
+The ``syrk_batch`` routines are batched versions of :ref:`onemkl_blas_syrk`, performing
+multiple ``syrk`` operations in a single call. Each ``syrk``
+operation performs a rank-k update with general matrices.
+   
+``syrk_batch`` supports the following precisions.
+
+   .. list-table:: 
+      :header-rows: 1
+
+      * -  T 
+      * -  ``float`` 
+      * -  ``double`` 
+      * -  ``std::complex<float>`` 
+      * -  ``std::complex<double>`` 
+
+.. _onemkl_blas_syrk_batch_buffer:
+
+syrk_batch (Buffer Version)
+---------------------------
+
+.. rubric:: Description
+
+The buffer version of ``syrk_batch`` supports only the strided API. 
+
+The strided API operation is defined as:
+::
+
+   for i = 0 … batch_size – 1
+       A and C are matrices at offset i * stridea, i * stridec in a and c.
+       C := alpha * op(A) * op(A)^T + beta * C
+   end for
+
+where:
+
+op(X) is one of op(X) = X, or op(X) = X\ :sup:`T`, or op(X) = X\ :sup:`H`,
+
+``alpha`` and ``beta`` are scalars,
+
+``A`` and ``C`` are matrices,
+
+op(``A``) is ``n`` x ``k`` and ``C`` is ``n`` x ``n``.
+
+The ``a`` and ``c`` buffers contain all the input matrices. The stride
+between matrices is given by the ``stridea`` and ``stridec`` parameters. The total number
+of matrices in the ``a`` and ``c`` buffers is given by the ``batch_size`` parameter.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       void syrk_batch(sycl::queue &queue,
+                       uplo upper_lower,
+                       transpose trans,
+                       std::int64_t n,
+                       std::int64_t k,
+                       T alpha,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       T beta,
+                       sycl::buffer<T,1> &c,
+                       std::int64_t ldc,
+                       std::int64_t stridec,
+                       std::int64_t batch_size)
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       void syrk_batch(sycl::queue &queue,
+                       uplo upper_lower,
+                       transpose trans,
+                       std::int64_t n,
+                       std::int64_t k,
+                       T alpha,
+                       sycl::buffer<T,1> &a,
+                       std::int64_t lda,
+                       std::int64_t stridea,
+                       T beta,
+                       sycl::buffer<T,1> &c,
+                       std::int64_t ldc,
+                       std::int64_t stridec,
+                       std::int64_t batch_size)
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   upper_lower
+      Specifies whether data in ``C`` is stored in its upper or lower triangle.
+      For more details, see :ref:`onemkl_datatypes`.
+
+   trans
+      Specifies op(``A``), the transposition operation applied to the
+      matrices ``A``. Conjugation is never performed, even if ``trans`` =
+      ``transpose::conjtrans``. See :ref:`onemkl_datatypes` for more
+      details.
+
+   n
+      Number of rows and columns of ``C``.
+      Must be at least zero.
+
+   k
+      Number of columns of op(``A``).
+      Must be at least zero.
+
+   alpha
+      Scaling factor for the rank-k update.
+
+   a
+      Buffer holding the input matrices ``A`` with size ``stridea`` * ``batch_size``.
+
+   lda
+      The leading dimension of the matrices ``A``. It must be positive.
+
+      .. list-table::
+         :header-rows: 1
+
+         * -
+           - ``A`` not transposed
+           - ``A`` transposed
+         * - Column major
+           - ``lda`` must be at least ``n``.
+           - ``lda`` must be at least ``k``.
+         * - Row major
+           - ``lda`` must be at least ``k``.
+           - ``lda`` must be at least ``n``.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   beta
+      Scaling factor for the matrices ``C``.
+
+   c
+      Buffer holding input/output matrices ``C`` with size ``stridec`` * ``batch_size``.
+
+   ldc
+      The leading dimension of the matrices ``C``. It must be positive
+      and at least ``n``.
+
+   stridec
+      Stride between different ``C`` matrices. Must be at least
+      ``ldc`` * ``n``.
+
+   batch_size
+      Specifies the number of rank-k update operations to perform.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output buffer, overwritten by ``batch_size`` rank-k update
+      operations of the form ``alpha`` * op(``A``)*op(``A``)^T + ``beta`` * ``C``.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+.. _onemkl_blas_syrk_batch_usm:
+
+syrk_batch (USM Version)
+---------------------------
+
+.. rubric:: Description
+
+The USM version of ``syrk_batch`` supports the group API and strided API. 
+
+The group API operation is defined as:
+::
+
+   idx = 0
+   for i = 0 … group_count – 1
+       for j = 0 … group_size – 1
+           A and C are matrices in a[idx] and c[idx]
+           C := alpha[i] * op(A) * op(A)^T + beta[i] * C
+           idx = idx + 1
+       end for
+   end for
+
+The strided API operation is defined as:
+::
+
+   for i = 0 … batch_size – 1
+       A and C are matrices at offset i * stridea, i * stridec in a and c.
+       C := alpha * op(A) * op(A)^T + beta * C
+   end for
+
+where:
+
+op(X) is one of op(X) = X, or op(X) = X\ :sup:`T`, or op(X) = X\ :sup:`H`,
+
+``alpha`` and ``beta`` are scalars,
+
+``A`` and ``C`` are matrices,
+
+op(``A``) is ``n`` x ``k`` and ``C`` is ``n`` x ``n``.
+
+ 
+For the group API, the ``a`` and ``c`` arrays contain the pointers to all the input matrices.
+The total number of matrices in ``a`` and ``c`` is given by:
+
+.. math::
+
+      total\_batch\_count = \sum_{i=0}^{group\_count-1}group\_size[i]    
+ 
+For the strided API, the ``a`` and ``c`` arrays contain all the input matrices. The total number of matrices
+in ``a`` and ``c`` is given by the ``batch_size`` parameter.
+   
+**Group API**
+
+.. rubric:: Syntax
+   
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event syrk_batch(sycl::queue &queue,
+                              uplo *upper_lower,
+                              transpose *trans,
+                              std::int64_t *n,
+                              std::int64_t *k,
+                              T *alpha,
+                              const T **a,
+                              std::int64_t *lda,
+                              T *beta,
+                              T **c,
+                              std::int64_t *ldc,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event syrk_batch(sycl::queue &queue,
+                              uplo *upper_lower,
+                              transpose *trans,
+                              std::int64_t *n,
+                              std::int64_t *k,
+                              T *alpha,
+                              const T **a,
+                              std::int64_t *lda,
+                              T *beta,
+                              T **c,
+                              std::int64_t *ldc,
+                              std::int64_t group_count,
+                              std::int64_t *group_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   upper_lower
+      Array of ``group_count`` ``oneapi::mkl::uplo``
+      values. ``upper_lower[i]`` specifies whether data in ``C`` for every
+      matrix in group ``i`` is stored in its upper or lower triangle.
+
+   trans
+      Array of ``group_count`` ``oneapi::mkl::transpose`` values. ``trans[i]`` specifies the form of op(``A``) used in
+      the rank-k update in group ``i``. See :ref:`onemkl_datatypes` for more details.
+
+   n
+      Array of ``group_count`` integers. ``n[i]`` specifies the
+      number of rows and columns of ``C`` for every matrix in group ``i``. All entries must be at least zero.
+
+   k
+      Array of ``group_count`` integers. ``k[i]`` specifies the
+      number of columns of op(``A``) for every matrix in group ``i``. All entries must be at
+      least zero.
+
+   alpha
+      Array of ``group_count`` scalar elements. ``alpha[i]`` specifies the scaling factor for every rank-k update in group ``i``.
+
+   a
+      Array of pointers to input matrices ``A`` with size ``total_batch_count``. 
+      
+      See :ref:`matrix-storage` for more details.
+
+   lda
+      Array of ``group_count`` integers. ``lda[i]`` specifies the
+      leading dimension of ``A`` for every matrix in group ``i``. All
+      entries must be positive.
+
+      .. list-table::
+         :header-rows: 1
+
+         * -
+           - ``A`` not transposed
+           - ``A`` transposed
+         * - Column major
+           - ``lda[i]`` must be at least ``n[i]``.
+           - ``lda[i]`` must be at least ``k[i]``.
+         * - Row major
+           - ``lda[i]`` must be at least ``k[i]``.
+           - ``lda[i]`` must be at least ``n[i]``.
+             
+   beta
+      Array of ``group_count`` scalar elements. ``beta[i]`` specifies the scaling factor for matrix ``C`` 
+      for every matrix in group ``i``.
+
+   c
+      Array of pointers to input/output matrices ``C`` with size ``total_batch_count``. 
+      
+      See :ref:`matrix-storage` for more details.
+
+   ldc
+      Array of ``group_count`` integers. ``ldc[i]`` specifies the
+      leading dimension of ``C`` for every matrix in group ``i``.  All
+      entries must be positive and ``ldc[i]`` must be at least ``n[i]``.
+
+   group_count
+      Specifies the number of groups. Must be at least 0.
+
+   group_size
+      Array of ``group_count`` integers. ``group_size[i]`` specifies the
+      number of rank-k update operations in group ``i``. All entries must be at least 0.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Overwritten by the ``n[i]``-by-``n[i]`` matrix calculated by 
+      (``alpha[i]`` * op(``A``)*op(``A``)^T + ``beta[i]`` * ``C``) for group ``i``.
+
+.. container:: section
+
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+**Strided API**
+
+.. rubric:: Syntax
+
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::column_major {
+       sycl::event syrk_batch(sycl::queue &queue,
+                              uplo upper_lower,
+                              transpose trans,
+                              std::int64_t n,
+                              std::int64_t k,
+                              T alpha,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              T beta,
+                              T *c,
+                              std::int64_t ldc,
+                              std::int64_t stridec,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+.. code-block:: cpp
+
+   namespace oneapi::mkl::blas::row_major {
+       sycl::event syrk_batch(sycl::queue &queue,
+                              uplo upper_lower,
+                              transpose trans,
+                              std::int64_t n,
+                              std::int64_t k,
+                              T alpha,
+                              const T *a,
+                              std::int64_t lda,
+                              std::int64_t stridea,
+                              T beta,
+                              T *c,
+                              std::int64_t ldc,
+                              std::int64_t stridec,
+                              std::int64_t batch_size,
+                              const std::vector<sycl::event> &dependencies = {})
+   }
+
+.. container:: section
+
+   .. rubric:: Input Parameters
+
+   queue
+      The queue where the routine should be executed.
+
+   upper_lower
+      Specifies whether data in ``C`` is stored in its upper or lower triangle.
+      For more details, see :ref:`onemkl_datatypes`.
+
+   trans
+      Specifies op(``A``), the transposition operation applied to the
+      matrices ``A``. Conjugation is never performed, even if ``trans`` =
+      ``transpose::conjtrans``. See :ref:`onemkl_datatypes` for more
+      details.
+
+   n
+      Number of rows and columns of ``C``.
+      Must be at least zero.
+
+   k
+      Number of columns of op(``A``).
+      Must be at least zero.
+
+   alpha
+      Scaling factor for the rank-k updates.
+
+   a
+      Pointer to input matrices ``A`` with size ``stridea`` * ``batch_size``.
+
+   lda
+      The leading dimension of the matrices ``A``. It must be positive.
+
+      .. list-table::
+         :header-rows: 1
+
+         * -
+           - ``A`` not transposed
+           - ``A`` transposed
+         * - Column major
+           - ``lda`` must be at least ``n``.
+           - ``lda`` must be at least ``k``.
+         * - Row major
+           - ``lda`` must be at least ``k``.
+           - ``lda`` must be at least ``n``.
+
+   stridea
+      Stride between different ``A`` matrices.
+
+   beta
+      Scaling factor for the matrices ``C``.
+
+   c
+      Pointer to input/output matrices ``C`` with size ``stridec`` * ``batch_size``.
+
+   ldc
+      The leading dimension of the matrices ``C``. It must be positive
+      and at least ``n``.
+
+   stridec
+      Stride between different ``C`` matrices. Must be at least ``ldc`` * ``n``.
+
+   batch_size
+      Specifies the number of rank-k update operations to perform.
+
+   dependencies
+      List of events to wait for before starting computation, if any.
+      If omitted, defaults to no dependencies.
+
+.. container:: section
+
+   .. rubric:: Output Parameters
+
+   c
+      Output matrices, overwritten by ``batch_size`` rank-k update
+      operations of the form ``alpha`` * op(``A``)*op(``A``)^T + ``beta`` * ``C``.
+
+.. container:: section
+      
+   .. rubric:: Return Values
+
+   Output event to wait on to ensure computation is complete.
+
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
+      
+
+   **Parent topic:** :ref:`blas-like-extensions`
diff --git a/source/elements/oneMKL/source/domains/blas/trsm_batch.rst b/source/elements/oneMKL/source/domains/blas/trsm_batch.rst
index 7bade164b7..55eb27381e 100644
--- a/source/elements/oneMKL/source/domains/blas/trsm_batch.rst
+++ b/source/elements/oneMKL/source/domains/blas/trsm_batch.rst
@@ -184,6 +184,25 @@ of matrices in ``a`` and ``b`` buffers are given by the ``batch_size`` parameter
    If ``alpha`` = 0, matrix ``B`` is set to zero and the matrices ``A``
    and ``B`` do not need to be initialized before calling ``trsm_batch``.
 
+.. container:: section
+
+   .. rubric:: Throws
+
+   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
+
+   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
+       
+   
+   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
+       
+
+   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
+       
+
+   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
 
 trsm_batch (USM Version)
 ---------------------------
@@ -500,28 +519,6 @@ in ``a`` and ``b`` are given by the ``batch_size`` parameter.
 
    Output event to wait on to ensure computation is complete.
 
-   **Parent topic:** :ref:`blas-like-extensions`
-.. container:: section
-
-   .. rubric:: Throws
-
-   This routine shall throw the following exceptions if the associated condition is detected. An implementation may throw additional implementation-specific exception(s) in case of error conditions not covered here.
-
-   :ref:`oneapi::mkl::invalid_argument<onemkl_exception_invalid_argument>`
-       
-   
-   :ref:`oneapi::mkl::unsupported_device<onemkl_exception_unsupported_device>`
-       
-
-   :ref:`oneapi::mkl::host_bad_alloc<onemkl_exception_host_bad_alloc>`
-       
-
-   :ref:`oneapi::mkl::device_bad_alloc<onemkl_exception_device_bad_alloc>`
-       
-
-   :ref:`oneapi::mkl::unimplemented<onemkl_exception_unimplemented>`
-      
-
 .. container:: section
 
    .. rubric:: Throws