Preview version 0.11.0

CNugteren released this 02 May 20:41
· 758 commits to master since this release

Version 0.11.0

  • Improved the internal program source and binary caches for scalability and speed (thanks to 'intelfx')
  • Fixed a bug that caused the binary to be re-created even if it was already in the cache
  • Fixed a bug when using offsets in the direct version of the GEMM kernels
  • Fixed the missing cl_khr_fp64 extension when running double-precision routines on Intel CPUs
  • Fixed tests on Apple's CPU OpenCL implementation; still not fast, but at least correct
  • Fixed bugs in the half-precision routines HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM
  • Tests now also exit with an error code when OpenCL errors or compilation errors occur
  • Tests now also check the L2 error for half-precision routines
  • Clients can now test against cuBLAS on NVIDIA systems for performance comparisons (-DCUBLAS=ON)
  • Replaced the R graph scripts with Python/Matplotlib scripts
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
  • Added the OverrideParameters function to the API to be able to supply custom tuning parameters
  • Added triangular solver (level-2 & level-3) routines:
    • STRSV/DTRSV/CTRSV/ZTRSV (experimental, un-optimized)
    • STRSM/DTRSM/CTRSM/ZTRSM (experimental, un-optimized)
  • Added batched (not part of the BLAS standard) routines:
    • SAXPYBATCHED/DAXPYBATCHED/CAXPYBATCHED/ZAXPYBATCHED/HAXPYBATCHED (batched version of AXPY)
    • SGEMMBATCHED/DGEMMBATCHED/CGEMMBATCHED/ZGEMMBATCHED/HGEMMBATCHED (batched version of GEMM)
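
Since the batched routines are not part of the BLAS standard, a small reference sketch of what a batched AXPY computes may help. The sketch below is plain Python on the CPU, not the library's API: the function name `axpy_batched` and its argument layout are hypothetical, and it assumes each batch has its own alpha and its own offsets into shared x and y buffers, with a common vector length and stride.

```python
def axpy_batched(n, alphas, x, x_offsets, x_inc, y, y_offsets, y_inc):
    """Reference semantics (illustrative only): y_b := alpha_b * x_b + y_b per batch.

    Assumption: every batch b reads n elements of x starting at x_offsets[b]
    and updates n elements of y starting at y_offsets[b], in place.
    """
    if not (len(alphas) == len(x_offsets) == len(y_offsets)):
        raise ValueError("one alpha and one offset pair is required per batch")
    for alpha, x_off, y_off in zip(alphas, x_offsets, y_offsets):
        for i in range(n):
            # Same update as a single AXPY, applied to this batch's slice.
            y[y_off + i * y_inc] += alpha * x[x_off + i * x_inc]
    return y


# Two batches of length 3, packed back to back in shared buffers.
x = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
y = [0.0] * 6
result = axpy_batched(3, [2.0, 0.5], x, [0, 3], 1, y, [0, 3], 1)
```

Here the first batch is scaled by 2.0 and the second by 0.5, so one call replaces two separate AXPY calls. A batched GEMM follows the same idea, with one small matrix multiplication per batch instead of one vector update.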