CLBlast 1.3.0
CLBlast version 1.3.0. Changes since previous release (version 1.2.0):
- Re-designed and integrated the auto-tuner, no more dependency on CLTune
- Made it possible to override the tuning parameters in the clients straight from JSON tuning files
- Added OpenCL pre-processor to unroll loops and perform array-to-register promotions for compilers
  which don't do this themselves (ARM Mali) - greatly improves performance on these platforms
- Added first tuners for the TRSV (block size) and TRSM (invert kernel) routines
- Added an optional argument to the GEMM routine to provide a pre-allocated temporary buffer
- Fixed an issue with a crashing/hanging AMD APP compiler with the TRSM routine (invert kernel)
- Improved compilation time by splitting the tuning database into multiple compilation units
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see README)
- Added the RetrieveParameters function to the API to be able to inspect the tuning parameters
- Added a strided-batched (not part of the BLAS standard) routine, faster but less generic compared
  to the existing xGEMMBATCHED routines:
  - SGEMMSTRIDEDBATCHED/DGEMMSTRIDEDBATCHED/CGEMMSTRIDEDBATCHED/ZGEMMSTRIDEDBATCHED/HGEMMSTRIDEDBATCHED
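
To illustrate what "strided-batched" means in the last item, here is a minimal CPU reference sketch in C++. It is not CLBlast code and does not call the library: the function name ReferenceGemmStridedBatched and its simplified argument list are hypothetical, and it assumes row-major storage, no transposes, and tightly packed matrices (no leading dimensions or buffer offsets). The point it shows is that every matrix in the batch is located at a fixed stride from the previous one, rather than at an arbitrary per-batch offset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, for illustration only (not the CLBlast API).
// Computes C_i = alpha * A_i * B_i + beta * C_i for i in [0, batch_count).
// Assumptions: row-major, no transposes, tightly packed matrices.
// Matrix i of each operand starts at element (i * stride) of its buffer.
void ReferenceGemmStridedBatched(std::size_t m, std::size_t n, std::size_t k,
                                 float alpha,
                                 const std::vector<float>& a, std::size_t a_stride,
                                 const std::vector<float>& b, std::size_t b_stride,
                                 float beta,
                                 std::vector<float>& c, std::size_t c_stride,
                                 std::size_t batch_count) {
  if (batch_count == 0) return;

  // Each buffer must hold the last matrix of the batch in full.
  assert(a.size() >= (batch_count - 1) * a_stride + m * k);
  assert(b.size() >= (batch_count - 1) * b_stride + k * n);
  assert(c.size() >= (batch_count - 1) * c_stride + m * n);

  for (std::size_t batch = 0; batch < batch_count; ++batch) {
    // One fixed stride per operand replaces a per-batch offset array.
    const float* a_mat = a.data() + batch * a_stride;
    const float* b_mat = b.data() + batch * b_stride;
    float* c_mat = c.data() + batch * c_stride;

    for (std::size_t i = 0; i < m; ++i) {
      for (std::size_t j = 0; j < n; ++j) {
        float acc = 0.0f;
        for (std::size_t p = 0; p < k; ++p) {
          acc += a_mat[i * k + p] * b_mat[p * n + j];
        }
        c_mat[i * n + j] = alpha * acc + beta * c_mat[i * n + j];
      }
    }
  }
}
```

With two 2x2 matrices per operand stored back to back, a stride of 4 elements selects the second matrix of each buffer. A fixed stride is less flexible than per-batch offsets, which is the "less generic" trade-off mentioned above.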