Releases: CNugteren/CLBlast

CLBlast 1.3.0

29 Jan 20:08

CLBlast version 1.3.0. Changes since previous release (version 1.2.0):

  • Re-designed and integrated the auto-tuner, no more dependency on CLTune
  • Made it possible to override the tuning parameters in the clients straight from JSON tuning files
  • Added OpenCL pre-processor to unroll loops and perform array-to-register promotions for compilers
    which don't do this themselves (ARM Mali) - greatly improves performance on these platforms
  • Added first tuners for the TRSV (block size) and TRSM (invert kernel) routines
  • Added an optional argument to the GEMM routine to provide a pre-allocated temporary buffer
  • Fixed an issue with a crashing/hanging AMD APP compiler with the TRSM routine (invert kernel)
  • Improved compilation time by splitting the tuning database into multiple compilation units
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
  • Added the RetrieveParameters function to the API to be able to inspect the tuning parameters
  • Added a strided-batched (not part of the BLAS standard) routine, faster but less generic compared
    to the existing xGEMMBATCHED routines:
    • SGEMMSTRIDEDBATCHED/DGEMMSTRIDEDBATCHED/CGEMMSTRIDEDBATCHED/ZGEMMSTRIDEDBATCHED/HGEMMSTRIDEDBATCHED
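
As a rough illustration of what a strided-batched GEMM computes, here is a plain-Python reference sketch. It is not CLBlast code: the function name and parameter names are made up for this example, and it assumes row-major storage, no transposition, and one alpha/beta pair shared by all batches. The point is that every matrix in the batch starts a fixed stride after the previous one in a single flat buffer, which is presumably what makes the routine less generic than a batched interface with arbitrary per-batch offsets.

```python
def gemm_strided_batched_reference(m, n, k, alpha,
                                   a, a_ld, a_stride,
                                   b, b_ld, b_stride,
                                   beta,
                                   c, c_ld, c_stride,
                                   batch_count):
    """Compute C_i = alpha * A_i * B_i + beta * C_i for every batch i, in place.

    All matrices live in flat row-major lists; batch i starts at i * stride.
    (Hypothetical reference helper, not part of the CLBlast API.)
    """
    for batch in range(batch_count):
        a_base = batch * a_stride
        b_base = batch * b_stride
        c_base = batch * c_stride
        for row in range(m):
            for col in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[a_base + row * a_ld + p] * b[b_base + p * b_ld + col]
                c_index = c_base + row * c_ld + col
                c[c_index] = alpha * acc + beta * c[c_index]
    return c


# Two 2x2 problems stored back to back (stride of 4 elements per matrix).
a = [1.0, 2.0, 3.0, 4.0,   5.0, 6.0, 7.0, 8.0]
b = [1.0, 0.0, 0.0, 1.0,   0.0, 1.0, 1.0, 0.0]  # identity, then a column swap
c = [0.0] * 8
gemm_strided_batched_reference(2, 2, 2, 1.0, a, 2, 4, b, 2, 4, 0.0, c, 2, 4, 2)
print(c)
```

The first result block equals the first A matrix unchanged; the second has its columns swapped.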

CLBlast 1.2.0

08 Nov 20:50

CLBlast version 1.2.0. Changes since previous release (version 1.1.1):

  • Fixed a bug in the TRSM/TRSV routines due to missing synchronisations after GEMM/GEMV calls
  • Fixed a bug in TRSM when using the a-offset argument
  • Added a CUDA API to CLBlast:
    • The library and kernels can be compiled with the CUDA driver API and NVRTC (requires CUDA 7.5)
    • Two CUDA API sample programs are added: SGEMM and DAXPY
    • All correctness tests and performance clients work on CUDA like they did for OpenCL
  • Kernels are now cached based on their tuning parameters: fits the use-case of 'OverrideParameters'
  • Cross-compiling for Android is now supported using CMake; instructions are added to the README
  • Improved performance for small GEMM problems by going from 3 to 1 optional temporary buffers
  • GEMM kernel selection (direct vs indirect) is now done automatically using a new tuner
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)

CLBlast 1.1.0

30 Sep 16:04

CLBlast version 1.1.0. Changes since previous release (version 1.0.1):

  • The tuning database now has defaults per architecture (e.g. NVIDIA Kepler SM3.5, AMD Fiji)
  • The tuning database now has a dictionary to translate vendor/device names to a common set
  • The tuners can now distinguish between different AMD GPU board names of the same architecture
  • The tuners can now use particle-swarm optimisation to search more efficiently (thanks to 'mcian')
  • Improved performance for small problems on NVIDIA hardware by caching the device name
  • Further improved compilation time of database.cpp
  • Added a small diagnostics helper executable
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
  • Added non-BLAS routines:
    • SIM2COL/DIM2COL/CIM2COL/ZIM2COL/HIM2COL (im2col transform as used to express convolution as GEMM)
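
To make the "convolution as GEMM" idea concrete, the following is a minimal pure-Python sketch of an im2col transform. It is not the CLBlast routine: the function names, the output layout (one row per channel/kernel position, one column per output pixel), and the restriction to unit stride without padding or dilation are all assumptions chosen to keep the example short.

```python
def im2col_reference(image, channels, height, width, kernel_h, kernel_w):
    """Return (col, out_h, out_w) for unit stride, no padding, no dilation.

    image: flat list ordered by channel, then row, then column.
    col:   list of rows, one per (channel, kernel row, kernel column),
           with one entry per output pixel.
    (Hypothetical reference helper, not part of the CLBlast API.)
    """
    out_h = height - kernel_h + 1
    out_w = width - kernel_w + 1
    col = []
    for ch in range(channels):
        for kh in range(kernel_h):
            for kw in range(kernel_w):
                row = []
                for y in range(out_h):
                    for x in range(out_w):
                        row.append(image[(ch * height + y + kh) * width + x + kw])
                col.append(row)
    return col, out_h, out_w


def convolve_via_gemm(filter_weights, col):
    """Apply one filter as a (1 x K) by (K x N) matrix product."""
    n = len(col[0])
    return [sum(filter_weights[r] * col[r][j] for r in range(len(col)))
            for j in range(n)]


image = [1, 2, 3,
         4, 5, 6,
         7, 8, 9]
col, out_h, out_w = im2col_reference(image, 1, 3, 3, 2, 2)
result = convolve_via_gemm([1, 1, 1, 1], col)  # 2x2 box filter
print(col, result)
```

Once the patches are laid out as columns, applying many filters at once is a single matrix product, which is why a fast GEMM makes this transform attractive.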

CLBlast 1.0.1

08 Aug 18:53

CLBlast version 1.0.1. Changes since previous release (version 1.0.0):

  • Fixed a bug in the direct version of the GEMM kernel

CLBlast 1.0.0

30 Jul 18:56

CLBlast version 1.0.0. Changes since previous release (version 0.11.0):

  • Fixed a bug in the TRSM routine for alpha != 1
  • Fixed a bug in the cache related to multi-device contexts (thanks to 'kpot')
  • Fixed a bug in the direct version of the GEMM kernel
  • Fixed several warnings for MSVC and Clang
  • Added support for Mesa Clover and AMD's ROCm by making the inline keyword optional in kernels
  • Performance reports are now external at https://cnugteren.github.io/clblast
  • Greatly improved compilation time of database.cpp
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
  • Added non-BLAS level-1 routines:
    • iSAMIN/iDAMIN/iCAMIN/iZAMIN (absolute minimum version of the ixAMAX BLAS routines)
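
For readers unfamiliar with the ixAMAX family, this small sketch shows what an "absolute minimum" index routine computes for real-valued input. The function name, the zero-based result, and the offset/increment parameters are assumptions for illustration, not the CLBlast signature. Complex variants in BLAS conventionally compare |re| + |im| rather than the modulus, which this sketch does not cover.

```python
def iamin_reference(n, x, x_offset=0, x_inc=1):
    """Return the zero-based position of the element with the smallest |x[i]|.

    Positions count strided elements, so position i refers to
    x[x_offset + i * x_inc]. Ties resolve to the first occurrence.
    (Hypothetical reference helper for real-valued input only.)
    """
    if n <= 0:
        return 0
    best_index = 0
    best_value = abs(x[x_offset])
    for i in range(1, n):
        value = abs(x[x_offset + i * x_inc])
        if value < best_value:
            best_index, best_value = i, value
    return best_index


print(iamin_reference(4, [3.0, -1.0, 2.0, -1.0]))
```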

Preview version 0.11.0

02 May 20:41

Version 0.11.0

  • Improved the internal program source and binary caches for scalability and speed (thanks to 'intelfx')
  • Fixed a bug that caused the binary to be re-created even if it was in the cache
  • Fixed a bug when using offsets in the direct version of the GEMM kernels
  • Fixed a missing cl_khr_fp64 when running double-precision on Intel CPUs
  • Fixed tests on Apple's CPU OpenCL implementation; still not fast but correct at least
  • Fixed bugs in the half-precision routines HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM
  • Tests now also exit with an error code when OpenCL errors or compilation errors occur
  • Tests now also check for the L2 error in case of half-precision
  • Clients can now test against cuBLAS on NVIDIA systems for performance comparisons (-DCUBLAS=ON)
  • Replaced the R graph scripts with Python/Matplotlib scripts
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
  • Added the OverrideParameters function to the API to be able to supply custom tuning parameters
  • Added triangular solver (level-2 & level-3) routines:
    • STRSV/DTRSV/CTRSV/ZTRSV (experimental, un-optimized)
    • STRSM/DTRSM/CTRSM/ZTRSM (experimental, un-optimized)
  • Added batched (not part of the BLAS standard) routines:
    • SAXPYBATCHED/DAXPYBATCHED/CAXPYBATCHED/ZAXPYBATCHED/HAXPYBATCHED (batched version of AXPY)
    • SGEMMBATCHED/DGEMMBATCHED/CGEMMBATCHED/ZGEMMBATCHED/HGEMMBATCHED (batched version of GEMM)

Preview version 0.10.0

27 Nov 15:02

Version 0.10.0

  • Updated to version 8.0 of the CLCudaAPI C++11 OpenCL header
  • Changed the enums in the C API to avoid potential name clashes with external code
  • Added a Netlib CBLAS compatible API (not recommended for full control over performance)
  • Greatly improved the way exceptions are handled in the library (thanks to 'intelfx')
  • Improved performance of GEMM kernels for small sizes by using a direct single-kernel implementation
  • Fixed a bug in the tests and samples related to waiting for an invalid event
  • Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
  • Fixed a bug in the TRMM routine that would overwrite input data before consuming everything
  • Added support for compilation under Visual Studio 2013 (MSVC++ 12.0)
  • Added an option to set OpenCL compiler options through the environment variable CLBLAST_BUILD_OPTIONS
  • Added an option to run tuned kernels multiple times to average execution times
  • Added an option to build a static version of the library
  • Made it possible to use the command-line environment variables everywhere and without re-running CMake
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)

Preview version 0.9.0

13 Sep 19:20

Version 0.9.0

  • Updated to version 6.0 of the CLCudaAPI C++11 OpenCL header
  • Significantly improved performance of rotated GEMV computations
  • Improved performance on unseen/un-tuned devices by a better default tuning parameter selection
  • Fixed the MSVC dllimport and dllexport declarations
  • Fixed memory leaks related to events not being released
  • Fixed a bug with a size_t and cl_ulong mismatch on 32-bit systems
  • Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
  • Fixed a performance issue (caused by fp16 support) by optimizing alpha/beta parameter passing to kernels
  • Fixed a bug in the OpenCL kernels: now placing __kernel before __attribute__
  • Fixed a bug in level-3 routines when beta is zero and matrix C contains NaNs
  • Added an option (-warm_up) to do a warm-up run before timing in the performance clients
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
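
The beta-is-zero fix above touches a general BLAS convention: when beta is zero, the old contents of C should not influence the result, even if they are NaN. The sketch below only illustrates why a naive update gets this wrong; the function names are hypothetical and it says nothing about how the CLBlast kernels implement the fix.

```python
import math


def update_naive(alpha, product, beta, c_old):
    """Always multiplies the old value, so 0 * NaN leaks NaN into the result."""
    return alpha * product + beta * c_old


def update_guarded(alpha, product, beta, c_old):
    """Skips the old value entirely when beta is zero."""
    if beta == 0:
        return alpha * product
    return alpha * product + beta * c_old


nan = float("nan")
naive = update_naive(1.0, 2.5, 0.0, nan)
guarded = update_guarded(1.0, 2.5, 0.0, nan)
print(math.isnan(naive), guarded)
```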

Preview version 0.8.0

28 Jun 20:39

Version 0.8.0

  • Added support for half-precision floating-point (fp16) in the library
  • Made it possible to compile the performance tests (clients) separately from the correctness tests
  • Made a reference BLAS and head-to-head performance comparison optional in the clients
  • Increased the verbosity of the "-verbose" option in the correctness tests
  • Refactored the host code for better compilation times and fewer lines of code
  • Added Appveyor continuous integration and increased coverage of the Travis builds
  • Improved the API documentation
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see README)
  • Added half-precision routines:
    • Level-1: HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
    • Level-2: HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV/HGER/HSYR/HSPR/HSYR2/HSPR2
    • Level-3: HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
  • Added non-BLAS routines:
    • SOMATCOPY/DOMATCOPY/COMATCOPY/ZOMATCOPY/HOMATCOPY (matrix copy, scaling, and/or transpose)
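
As a sketch of what an out-of-place matrix copy with scaling and optional transpose does, here is a pure-Python reference. The function name, parameter order, and row-major layout are assumptions made for this example and do not reflect the CLBlast signature.

```python
def omatcopy_reference(rows, cols, alpha, a, a_ld, b, b_ld, transpose=False):
    """Write B = alpha * A (or alpha * A^T when transpose is set) into b.

    a holds a rows x cols matrix as a flat row-major list with leading
    dimension a_ld; b is the flat output with leading dimension b_ld.
    (Hypothetical reference helper, not part of the CLBlast API.)
    """
    for r in range(rows):
        for col in range(cols):
            value = alpha * a[r * a_ld + col]
            if transpose:
                b[col * b_ld + r] = value
            else:
                b[r * b_ld + col] = value
    return b


a = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]  # 2 x 3 input
b = [0.0] * 6        # 3 x 2 output
omatcopy_reference(2, 3, 2.0, a, 3, b, 2, transpose=True)
print(b)
```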

Preview version 0.7.1

18 May 19:20

Version 0.7.1 (bug-fix release)

  • Improved performance of large power-of-2 xGEMM kernels for AMD GPUs
  • Fixed a bug in the xGEMM routine related to the event being set incorrectly
  • Made MSVC link the run-time libraries statically