Version 0.11.0
- Improved the internal program source and binary caches for scalability and speed (thanks to 'intelfx')
- Fixed a bug that caused the binary to be re-created even if it was already in the cache
- Fixed a bug when using offsets in the direct version of the GEMM kernels
- Fixed a missing cl_khr_fp64 when running double-precision on Intel CPUs
- Fixed tests on Apple's CPU OpenCL implementation; still not fast but correct at least
- Fixed bugs in the half-precision routines HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM
- Tests now also exit with an error code when OpenCL errors or compilation errors occur
- Tests now also check for the L2 error in case of half-precision
- Clients can now test against cuBLAS on NVIDIA systems for performance comparisons (-DCUBLAS=ON)
- Replaced the R graph scripts with Python/Matplotlib scripts
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added the OverrideParameters function to the API to be able to supply custom tuning parameters
- Added triangular solver (level-2 & level-3) routines:
  - STRSV/DTRSV/CTRSV/ZTRSV (experimental, un-optimized)
  - STRSM/DTRSM/CTRSM/ZTRSM (experimental, un-optimized)
- Added batched (not part of the BLAS standard) routines:
  - SAXPYBATCHED/DAXPYBATCHED/CAXPYBATCHED/ZAXPYBATCHED/HAXPYBATCHED (batched version of AXPY)
  - SGEMMBATCHED/DGEMMBATCHED/CGEMMBATCHED/ZGEMMBATCHED/HGEMMBATCHED (batched version of GEMM)
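
As a rough illustration of what "batched version of AXPY" means, the plain-Python sketch below applies y := alpha * x + y to several independent sub-vectors in one call. It is a reference for the semantics only, not the library's API: the function name `axpy_batched`, its argument order, and the use of one scalar and one pair of offsets per batch entry into shared flat buffers are assumptions made for this example.

```python
def axpy_batched(n, alphas, x, x_offsets, y, y_offsets, x_inc=1, y_inc=1):
    """Reference semantics (hypothetical helper): for each batch entry b,
    compute y_b := alphas[b] * x_b + y_b on n elements, in place."""
    if not (len(alphas) == len(x_offsets) == len(y_offsets)):
        raise ValueError("alphas and offsets need one entry per batch")
    for alpha, x_off, y_off in zip(alphas, x_offsets, y_offsets):
        # Each batch entry reads/writes its own slice of the shared buffers.
        for i in range(n):
            y[y_off + i * y_inc] += alpha * x[x_off + i * x_inc]
    return y


# Two batch entries of length 2, stored back to back in flat lists.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
result = axpy_batched(2, [2.0, -1.0], x, [0, 2], y, [0, 2])
print(result)  # [12.0, 24.0, 27.0, 36.0]
```

A batched GEMM follows the same pattern, with one small matrix multiplication per batch entry instead of one vector update.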