
CLBlast 1.3.0

@CNugteren CNugteren released this 29 Jan 20:08

CLBlast version 1.3.0. Changes since previous release (version 1.2.0):

  • Re-designed and integrated the auto-tuner, no more dependency on CLTune
  • Made it possible to override the tuning parameters in the clients straight from JSON tuning files
  • Added an OpenCL pre-processor that unrolls loops and promotes arrays to registers for compilers
    which don't do this themselves (ARM Mali); this greatly improves performance on these platforms
  • Added first tuners for the TRSV (block size) and TRSM (invert kernel) routines
  • Added an optional argument to the GEMM routine to provide a pre-allocated temporary buffer
  • Fixed an issue with a crashing/hanging AMD APP compiler with the TRSM routine (invert kernel)
  • Improved compilation time by splitting the tuning database into multiple compilation units
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
  • Added the RetrieveParameters function to the API to make it possible to inspect the tuning parameters
  • Added a strided-batched (not part of the BLAS standard) routine, faster but less generic compared
    to the existing xGEMMBATCHED routines:
    • SGEMMSTRIDEDBATCHED/DGEMMSTRIDEDBATCHED/CGEMMSTRIDEDBATCHED/ZGEMMSTRIDEDBATCHED/HGEMMSTRIDEDBATCHED
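
To illustrate what "strided-batched" means in the last item, here is a minimal CPU reference sketch in plain Python. It is not CLBlast code and does not call the library: the function name `gemm_strided_batched_reference`, its argument order, and the row-major, no-transpose layout are assumptions chosen for illustration. The idea it shows is that matrix `i` of each operand starts at `i * stride` in one flat buffer, so a single stride per operand and a single alpha/beta pair describe the whole batch. That fixed layout is presumably what makes the routine less generic than xGEMMBATCHED.

```python
def gemm_strided_batched_reference(m, n, k, alpha,
                                   a, a_ld, a_stride,
                                   b, b_ld, b_stride,
                                   beta,
                                   c, c_ld, c_stride,
                                   batch_count):
    """Hypothetical CPU reference: C_i = alpha * A_i * B_i + beta * C_i.

    All matrices are row-major and stored back to back in flat lists.
    Matrix i of each operand begins at index i * stride (an assumption
    made for this sketch, not a statement about CLBlast internals).
    """
    for batch in range(batch_count):
        # The per-batch offsets are implied by the fixed strides.
        a_off = batch * a_stride
        b_off = batch * b_stride
        c_off = batch * c_stride
        for row in range(m):
            for col in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[a_off + row * a_ld + p] * b[b_off + p * b_ld + col]
                idx = c_off + row * c_ld + col
                c[idx] = alpha * acc + beta * c[idx]
    return c


# Two 2x2 products stored contiguously, so every stride is 4 elements.
a = [1, 2, 3, 4,   1, 0, 0, 1]   # A_0 = [[1,2],[3,4]], A_1 = identity
b = [1, 0, 0, 1,   5, 6, 7, 8]   # B_0 = identity,      B_1 = [[5,6],[7,8]]
c = [0.0] * 8
gemm_strided_batched_reference(2, 2, 2, 1.0,
                               a, 2, 4,
                               b, 2, 4,
                               0.0,
                               c, 2, 4,
                               batch_count=2)
print(c)
```

A more general batched interface would instead take an explicit list of offsets for each operand, which allows arbitrary placement of the matrices at the cost of extra bookkeeping per batch entry.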