Releases · CNugteren/CLBlast

29 Jan 20:08

CNugteren

1.3.0

37c5e8f

CLBlast 1.3.0

CLBlast version 1.3.0. Changes since previous release (version 1.2.0):

Re-designed and integrated the auto-tuner, no more dependency on CLTune
Made it possible to override the tuning parameters in the clients straight from JSON tuning files
Added OpenCL pre-processor to unroll loops and perform array-to-register promotions for compilers which don't do this themselves (ARM Mali) - greatly improves performance on these platforms
Added first tuners for the TRSV (block size) and TRSM (invert kernel) routines
Added an optional argument to the GEMM routine to provide a pre-allocated temporary buffer
Fixed an issue with a crashing/hanging AMD APP compiler with the TRSM routine (invert kernel)
Improved compilation time by splitting the tuning database into multiple compilation units
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
Added the RetrieveParameters function to the API to be able to inspect the tuning parameters
Added a strided-batched (not part of the BLAS standard) routine, faster but less generic compared to the existing xGEMMBATCHED routines:
- SGEMMSTRIDEDBATCHED/DGEMMSTRIDEDBATCHED/CGEMMSTRIDEDBATCHED/ZGEMMSTRIDEDBATCHED/HGEMMSTRIDEDBATCHED
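
The trade-off between the two batched interfaces can be illustrated with a small sketch. It is plain Python rather than CLBlast code, and the function names and the row-major flat-buffer layout are assumptions made for illustration. The batched form takes one arbitrary offset per matrix, while the strided form derives every offset from a single fixed stride, which is the sense in which it is less generic.

```python
def gemm_at(a, b, c, m, n, k, a_off, b_off, c_off):
    """C = A * B for one row-major matrix triple stored in flat lists."""
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[a_off + i * k + p] * b[b_off + p * n + j]
            c[c_off + i * n + j] = acc


def gemm_batched(a, b, c, m, n, k, a_offsets, b_offsets, c_offsets):
    """Generic form: one arbitrary offset per matrix in the batch."""
    for a_off, b_off, c_off in zip(a_offsets, b_offsets, c_offsets):
        gemm_at(a, b, c, m, n, k, a_off, b_off, c_off)


def gemm_strided_batched(a, b, c, m, n, k,
                         a_stride, b_stride, c_stride, batch_count):
    """Strided form: offsets are implied by a fixed stride between matrices."""
    for batch in range(batch_count):
        gemm_at(a, b, c, m, n, k,
                batch * a_stride, batch * b_stride, batch * c_stride)


# Two 2x2 products stored back to back in each buffer.
m = n = k = 2
a = [1.0, 2.0, 3.0, 4.0, 1.0, 0.0, 0.0, 1.0]
b = [1.0, 0.0, 0.0, 1.0, 5.0, 6.0, 7.0, 8.0]
c_strided = [0.0] * 8
c_batched = [0.0] * 8
gemm_strided_batched(a, b, c_strided, m, n, k, 4, 4, 4, 2)
gemm_batched(a, b, c_batched, m, n, k, [0, 4], [0, 4], [0, 4])
```

With evenly spaced matrices both forms compute the same result; only the batched form can address matrices at irregular offsets.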


08 Nov 20:50

CNugteren

1.2.0

5d5e3f9

CLBlast 1.2.0

CLBlast version 1.2.0. Changes since previous release (version 1.1.1):

Fixed a bug in the TRSM/TRSV routines due to missing synchronisations after GEMM/GEMV calls
Fixed a bug in TRSM when using the a-offset argument
Added a CUDA API to CLBlast:
- The library and kernels can be compiled with the CUDA driver API and NVRTC (requires CUDA 7.5)
- Two CUDA API sample programs are added: SGEMM and DAXPY
- All correctness tests and performance clients work on CUDA like they did for OpenCL
Kernels are now cached based on their tuning parameters: fits the use-case of 'OverrideParameters'
Cross-compiling for Android is now supported using CMake; instructions are added to the README
Improved performance for small GEMM problems by going from 3 to 1 optional temporary buffers
GEMM kernel selection (direct vs indirect) is now done automatically using a new tuner
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
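
Caching kernels by their tuning parameters can be pictured as a lookup table whose key includes those parameters, so that overriding them yields a new entry instead of reusing a stale kernel. The sketch below shows only the idea; `get_kernel`, `compile_kernel`, the key layout, and the parameter names and values are hypothetical and not taken from the CLBlast sources.

```python
compile_count = 0  # counts how often the (expensive) compile step runs
_cache = {}


def compile_kernel(routine, parameters):
    """Stand-in for an expensive kernel compilation (hypothetical)."""
    global compile_count
    compile_count += 1
    defines = " ".join(
        f"-D{name}={value}" for name, value in sorted(parameters.items())
    )
    return f"{routine} [{defines}]"


def get_kernel(device, routine, parameters):
    """Return a cached kernel, compiling only for unseen parameter sets."""
    # Sorting makes the key independent of dictionary insertion order.
    key = (device, routine, tuple(sorted(parameters.items())))
    if key not in _cache:
        _cache[key] = compile_kernel(routine, parameters)
    return _cache[key]


first = get_kernel("device0", "Xaxpy", {"WGS": 64, "VW": 1})
again = get_kernel("device0", "Xaxpy", {"VW": 1, "WGS": 64})   # cache hit
tuned = get_kernel("device0", "Xaxpy", {"WGS": 128, "VW": 2})  # new entry
back = get_kernel("device0", "Xaxpy", {"WGS": 64, "VW": 1})    # still cached
```

Because the parameters are part of the key, switching back to an earlier parameter set does not trigger a recompilation.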


30 Sep 16:04

CNugteren

1.1.0

ef082bb

CLBlast 1.1.0

CLBlast version 1.1.0. Changes since previous release (version 1.0.1):

The tuning database now has defaults per architecture (e.g. NVIDIA Kepler SM3.5, AMD Fiji)
The tuning database now has a dictionary to translate vendor/device names to a common set
The tuners can now distinguish between different AMD GPU board names of the same architecture
The tuners can now use particle-swarm optimisation to search more efficiently (thanks to 'mcian')
Improved performance for small problems on NVIDIA hardware by caching the device name
Further improved compilation time of database.cpp
Added a small diagnostics helper executable
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
Added non-BLAS routines:
- SIM2COL/DIM2COL/CIM2COL/ZIM2COL/HIM2COL (im2col transform as used to express convolution as GEMM)
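
As a rough illustration of how im2col lets a convolution be expressed as a matrix product, the following sketch unfolds every kernel-sized patch of an image into one column and then multiplies by the flattened kernel. It is plain Python with assumed simplifications (single channel, stride 1, no padding, no dilation) and does not mirror the argument list of the CLBlast routine.

```python
def im2col(image, kernel_h, kernel_w):
    """Unfold each kernel-sized patch of a 2D image into one column.

    Returns a matrix with kernel_h * kernel_w rows and one column per
    output position (stride 1, no padding, single channel).
    """
    height, width = len(image), len(image[0])
    out_h, out_w = height - kernel_h + 1, width - kernel_w + 1
    columns = [[0.0] * (out_h * out_w) for _ in range(kernel_h * kernel_w)]
    for kh in range(kernel_h):
        for kw in range(kernel_w):
            row = kh * kernel_w + kw
            for oh in range(out_h):
                for ow in range(out_w):
                    columns[row][oh * out_w + ow] = image[oh + kh][ow + kw]
    return columns


def conv_as_gemm(image, kernel):
    """Cross-correlation as a (1 x K) times (K x N) matrix product."""
    columns = im2col(image, len(kernel), len(kernel[0]))
    flat_kernel = [value for row in kernel for value in row]
    n = len(columns[0])
    return [
        sum(flat_kernel[r] * columns[r][j] for r in range(len(flat_kernel)))
        for j in range(n)
    ]


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0],
          [0.0, 1.0]]
result = conv_as_gemm(image, kernel)  # flattened 2x2 output
```

With several filters, the flattened kernels form the rows of a matrix and the product becomes a regular GEMM.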


08 Aug 18:53

CNugteren

1.0.1

eb89683

CLBlast 1.0.1

CLBlast version 1.0.1. Changes since previous release (version 1.0.0):

Fixed a bug in the direct version of the GEMM kernel


30 Jul 18:56

CNugteren

1.0.0

1155c06

CLBlast 1.0.0

CLBlast version 1.0.0. Changes since previous release (version 0.11.0):

Fixed a bug in the TRSM routine for alpha != 1
Fixed a bug in the cache related to multi-device contexts (thanks to 'kpot')
Fixed a bug in the direct version of the GEMM kernel
Fixed several warnings for MSVC and Clang
Added support for Mesa Clover and AMD's ROCm by making the inline keyword optional in kernels
Performance reports are now external at https://cnugteren.github.io/clblast
Greatly improved compilation time of database.cpp
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
Added non-BLAS level-1 routines:
- iSAMIN/iDAMIN/iCAMIN/iZAMIN (absolute minimum version of the ixAMAX BLAS routines)
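
The relationship between the absolute-minimum routines and their ixAMAX counterparts can be sketched as follows. This is an illustration for real-valued vectors only; the 0-based index and the first-occurrence tie-breaking are assumptions of the sketch, not statements about the library's conventions.

```python
def iamax(x):
    """Index of the element with the largest absolute value."""
    return max(range(len(x)), key=lambda i: abs(x[i]))


def iamin(x):
    """Index of the element with the smallest absolute value."""
    return min(range(len(x)), key=lambda i: abs(x[i]))


x = [3.0, -0.5, 2.0, -7.0]
largest = iamax(x)   # points at -7.0
smallest = iamin(x)  # points at -0.5
```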


02 May 20:41

CNugteren

0.11.0

606f287

Preview version 0.11.0

Version 0.11.0

Improved the internal program source and binary caches for scalability and speed (thanks to 'intelfx')
Fixed a bug that required re-creating the binary even if it was in the cache
Fixed a bug when using offsets in the direct version of the GEMM kernels
Fixed a missing cl_khr_fp64 when running double-precision on Intel CPUs
Fixed tests on Apple's CPU OpenCL implementation; still not fast but correct at least
Fixed bugs in the half-precision routines HTBMV/HTPMV/HTRMV/HSYR2K/HTRMM
Tests now also exit with an error code when OpenCL errors or compilation errors occur
Tests now also check for the L2 error in case of half-precision
Clients can now test against cuBLAS on NVIDIA systems for performance comparisons (-DCUBLAS=ON)
Replaced the R graph scripts with Python/Matplotlib scripts
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
Added the OverrideParameters function to the API to be able to supply custom tuning parameters
Added triangular solver (level-2 & level-3) routines:
- STRSV/DTRSV/CTRSV/ZTRSV (experimental, un-optimized)
- STRSM/DTRSM/CTRSM/ZTRSM (experimental, un-optimized)
Added batched (not part of the BLAS standard) routines:
- SAXPYBATCHED/DAXPYBATCHED/CAXPYBATCHED/ZAXPYBATCHED/HAXPYBATCHED (batched version of AXPY)
- SGEMMBATCHED/DGEMMBATCHED/CGEMMBATCHED/ZGEMMBATCHED/HGEMMBATCHED (batched version of GEMM)
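
For readers unfamiliar with the triangular solvers listed above, the operation behind TRSV can be sketched with textbook forward substitution. The code assumes a lower-triangular, non-unit-diagonal matrix stored as nested lists; it shows the mathematical operation only and says nothing about how the CLBlast kernels are organised.

```python
def trsv_lower(a, b):
    """Solve A x = b for x, where A is lower triangular (forward substitution)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        # Subtract the contributions of the already-solved unknowns.
        for j in range(i):
            acc -= a[i][j] * x[j]
        x[i] = acc / a[i][i]
    return x


a = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [2.0, 7.0, 32.0]
x = trsv_lower(a, b)
```

TRSM applies the same idea to a matrix of right-hand sides instead of a single vector.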


27 Nov 15:02

CNugteren

0.10.0

e52f9a9

Preview version 0.10.0

Version 0.10.0

Updated to version 8.0 of the CLCudaAPI C++11 OpenCL header
Changed the enums in the C API to avoid potential name clashes with external code
Added a Netlib CBLAS compatible API (not recommended for full control over performance)
Greatly improved the way exceptions are handled in the library (thanks to 'intelfx')
Improved performance of GEMM kernels for small sizes by using a direct single-kernel implementation
Fixed a bug in the tests and samples related to waiting for an invalid event
Fixed a bug in the SYRK/SYR2K/HERK/HER2K routines that would occur with specific tuning parameters
Fixed a bug in the TRMM routine that would overwrite input data before consuming everything
Added support for compilation under Visual Studio 2013 (MSVC++ 12.0)
Added an option to set OpenCL compiler options through the env variable CLBLAST_BUILD_OPTIONS
Added an option to run tuned kernels multiple times to average execution times
Added an option to build a static version of the library
Made it possible to use the command-line environment variables everywhere and without re-running CMake
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)


13 Sep 19:20

CNugteren

0.9.0

f07ac22

Preview version 0.9.0

Version 0.9.0

Updated to version 6.0 of the CLCudaAPI C++11 OpenCL header
Significantly improved performance of rotated GEMV computations
Improved performance on unseen/un-tuned devices by a better default tuning parameter selection
Fixed the MSVC dllimport and dllexport declarations
Fixed memory leaks related to events not being released
Fixed a bug with a size_t and cl_ulong mismatch on 32-bit systems
Fixed a bug related to the cache and retrieval of programs based on the OpenCL context
Fixed a performance issue (caused by fp16 support) by optimizing alpha/beta parameter passing to kernels
Fixed a bug in the OpenCL kernels: now placing __kernel before __attribute__
Fixed a bug in level-3 routines when beta is zero and matrix C contains NaNs
Added an option (-warm_up) to do a warm-up run before timing in the performance clients
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
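
The beta-is-zero fix listed above concerns a general floating-point pitfall: 0 * NaN is NaN, so a GEMM that always evaluates beta * C lets stale NaNs in C leak into the output. The sketch below contrasts the two behaviours in plain Python; it illustrates the issue only and is not the library's kernel code.

```python
import math


def gemm_naive(alpha, a, b, beta, c):
    """C = alpha * A * B + beta * C, always reading C."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def gemm_beta_aware(alpha, a, b, beta, c):
    """Same operation, but C is not read at all when beta is zero."""
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            value = alpha * sum(a[i][p] * b[p][j] for p in range(k))
            if beta != 0.0:
                value += beta * c[i][j]
            row.append(value)
        out.append(row)
    return out


a = [[1.0, 2.0]]
b = [[3.0], [4.0]]
c = [[float("nan")]]  # uninitialised output buffer containing a NaN
naive = gemm_naive(1.0, a, b, 0.0, c)
aware = gemm_beta_aware(1.0, a, b, 0.0, c)
```

Here the naive version returns NaN while the beta-aware version returns the expected product.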


28 Jun 20:39

CNugteren

0.8.0

7c13bac

Preview version 0.8.0

Version 0.8.0

Added support for half-precision floating-point (fp16) in the library
Made it possible to compile the performance tests (clients) separately from the correctness tests
Made a reference BLAS and head-to-head performance comparison optional in the clients
Increased the verbosity of the "-verbose" option in the correctness tests
Refactored the host code for better compilation times and fewer lines of code
Added Appveyor continuous integration and increased coverage of the Travis builds
Improved the API documentation
Various minor fixes and enhancements
Added tuned parameters for various devices (see README)
Added half-precision routines:
- Level-1: HSWAP/HSCAL/HCOPY/HAXPY/HDOT/HNRM2/HASUM/HSUM/iHAMAX/iHMAX/iHMIN
- Level-2: HGEMV/HGBMV/HHEMV/HHBMV/HHPMV/HSYMV/HSBMV/HSPMV/HTRMV/HTBMV/HTPMV/HGER/HSYR/HSPR/HSYR2/HSPR2
- Level-3: HGEMM/HSYMM/HSYRK/HSYR2K/HTRMM
Added non-BLAS routines:
- SOMATCOPY/DOMATCOPY/COMATCOPY/ZOMATCOPY/HOMATCOPY (matrix copy, scaling, and/or transpose)
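
The "matrix copy, scaling, and/or transpose" operation can be summarised as B = alpha * op(A), where op is either the identity or a transpose. The following sketch shows this for real-valued matrices stored as nested lists; the function name and signature are illustrative and do not mirror the CLBlast interface.

```python
def omatcopy(alpha, a, transpose=False):
    """Return alpha * A, or alpha * A^T when transpose is True."""
    rows, cols = len(a), len(a[0])
    if transpose:
        return [[alpha * a[i][j] for i in range(rows)] for j in range(cols)]
    return [[alpha * value for value in row] for row in a]


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
scaled = omatcopy(2.0, a)
scaled_transposed = omatcopy(2.0, a, transpose=True)
```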


18 May 19:20

CNugteren

0.7.1

181eb20

Preview version 0.7.1

Version 0.7.1 (bug-fix release)

Improved performance of large power-of-2 xGEMM kernels for AMD GPUs
Fixed a bug in the xGEMM routine related to the event being set incorrectly
Made MSVC link the run-time libraries statically
