Releases: ROCm/Tensile

Make rocBLAS build with pre-ROCm 1.9 compilers work

21 Sep 16:24
afc560a

A small incremental release to make rocBLAS v14.2.5 buildable with pre-ROCm 1.9 compilers.

V4.5.0 Performance improvements, Bug fixes, add hpa_hgemm

12 Sep 14:55

Features

  • add support for vega20
  • add hpa_hgemm assembly and source
  • tuning for sgemm and hgemm
  • bug fixes for sgemm and hgemm small sizes
  • use SGPR for alpha and beta
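For readers unfamiliar with the "hpa" prefix: it is taken here to mean high-precision accumulate, that is, half-precision inputs summed in a wider accumulator. The sketch below is an illustration only, not Tensile code. The function names are hypothetical, and a Python float (binary64) stands in for the wider accumulator.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_half_accumulate(a, b):
    """Dot product where the running sum is rounded to half after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_half(to_half(x) * to_half(y))
        acc = to_half(acc + product)
    return acc


def dot_high_precision_accumulate(a, b):
    """Dot product of half inputs with a wider accumulator (hypothetical
    stand-in for an hpa kernel); only the final result is rounded to half."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_half(x) * to_half(y)
    return to_half(acc)


ones = [1.0] * 4096
half_sum = dot_half_accumulate(ones, ones)
hpa_sum = dot_high_precision_accumulate(ones, ones)
```

With 4096 ones, the half-only sum stops growing once it reaches 2048, because 2049 is not representable in binary16, while the wider accumulator reaches 4096.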

V4.4.0 Performance Improvements and Bug Fixes

10 Aug 03:43

Features

  • Support Global Split U for half and double
  • Support Local Split U for half and hpa
  • Fix beta for hpa
  • Add AssertFree0ElementMultiple requirement and runtime launch check
  • Intercept solution selection logic and call hgemm HIP kernel when summation index or first free index is odd
  • correct reordered_schedules fallback for hgemm
  • disable PreciseBoundsCheck
  • update rocblas_hgemm_asm_full.yaml to call source with VW=2 for m,n,k <= 32
  • update rocblas_hgemm_asm_full.yaml to call source with VW=1 for m,n,k == 1
  • Use alternating sign in random init for half
  • use hipGetDevice in place of hipCtxGetDevice
  • use _Float16 in place of __fp16
  • add device to llvm_fma_v2f16
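The odd-size interception listed above can be pictured as a simple predicate. This is a minimal sketch, not the actual selection code: the function name and return values are hypothetical, and it assumes that the first free index corresponds to M and the summation index to K.

```python
def choose_hgemm_path(m: int, n: int, k: int) -> str:
    """Pick a kernel family for an hgemm problem of size m x n x k.

    Assumption: free index 0 is M and the summation index is K.
    Returns "source" (HIP kernel) when either is odd, else "assembly".
    """
    if m % 2 != 0 or k % 2 != 0:
        return "source"
    return "assembly"


even_case = choose_hgemm_path(64, 64, 64)
odd_m_case = choose_hgemm_path(63, 64, 64)
odd_k_case = choose_hgemm_path(64, 64, 33)
odd_n_case = choose_hgemm_path(64, 63, 64)
```

Under this assumption an odd N alone does not trigger the fallback, since N is neither the first free index nor the summation index.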

V4.3.0 Performance Improvements and Bug Fixes

28 Jun 21:36

Features

  • source kernels for k<=128 to fix stride_b=0, batch_count > 1
  • __hfma no longer needed
  • Modify default handling for LdsPad, if -1, only pad the TLU=0 cases
  • Combine second-to-last MAC iter into common loop
  • Reset local pointers at iteration based on PrefetchLocalRead
  • Multi-thread the kernel writing, provides 3X-4X speedup for build
  • Support -1 default LdsPad (matches VectorWidth)
  • refactor .yaml files
  • Optimize overhang calculation
  • use glvw in overhang calculation
  • Enable CodeFromFiles
  • Feature detect invalid kernel
  • Change order to better match write batching reclaim algorithm
  • Allocate LoopCounters in middle of SGPRs so tmp sgpr recovery works
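The two LdsPad items above, read together, suggest the following default resolution. This is a hedged sketch of one reading, not Tensile's implementation: the function and parameter names are hypothetical, and `tlu` is treated as a plain boolean flag.

```python
def resolve_lds_pad(lds_pad: int, tlu: bool, vector_width: int) -> int:
    """Resolve the effective LDS padding.

    Assumed reading of the release notes: an explicit value is used as given;
    the -1 default pads by VectorWidth, but only in the TLU=0 case.
    """
    if lds_pad != -1:
        return lds_pad
    return 0 if tlu else vector_width


explicit = resolve_lds_pad(8, tlu=False, vector_width=4)
default_tlu0 = resolve_lds_pad(-1, tlu=False, vector_width=4)
default_tlu1 = resolve_lds_pad(-1, tlu=True, vector_width=4)
```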

V4.2.0 Performance improvements

15 May 22:10

Features

  • Fractional global capability
  • Additional ResNet sizes
  • Round up for half vgprs
  • Initial code for PersistentKernel (disabled)
  • Feature inner unroll2
  • Enable BufferStore and buffer_atomic_cmpswap for GSU>1
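As background for the GSU item: Global Split U divides the summation (K) dimension among several workgroups, so each one contributes a partial sum that must be accumulated into the same output element, which is why an atomic operation is involved when GSU>1. The sketch below shows the idea on a single dot product. It is an illustration only: the function names are hypothetical, and the contiguous chunking is an assumption, not necessarily how Tensile assigns K to workgroups.

```python
def split_k_ranges(k: int, gsu: int):
    """Split range(k) into at most gsu contiguous chunks (assumed layout)."""
    if gsu < 1:
        raise ValueError("gsu must be at least 1")
    chunk = max(1, -(-k // gsu))  # ceiling division, never zero
    return [(start, min(start + chunk, k)) for start in range(0, k, chunk)]


def dot_with_global_split_u(a, b, gsu: int):
    """Accumulate one output element from gsu partial sums.

    Each loop iteration stands in for one workgroup. On a GPU the final
    accumulation into c would have to be atomic, since workgroups run
    concurrently; here it is a plain sequential add.
    """
    c = 0
    for start, stop in split_k_ranges(len(a), gsu):
        partial = sum(a[i] * b[i] for i in range(start, stop))
        c += partial
    return c


a = list(range(1, 11))
b = [2] * 10
ranges = split_k_ranges(len(a), 4)
result = dot_with_global_split_u(a, b, 4)
```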

V4.1.1 Performance improvements

26 Apr 20:55

Features

  • Support LSHL_ADD
  • Vectorize the store-C path
  • Enable DirectToLds for half
  • Fix sync with DirectToLds when PrefetchLocalRead=0
  • Optimize solution merging using lookup
  • Align MAC blocks when using half datatype
  • Add mi25 Device 6860 to vega10
  • Train for DataInitTypeBeta: 0
  • Add ResNet1x1 to Exact sizes

v4.0.2 Performance improvements and initial mixed precision support

10 Apr 22:31

Features

  • Initial mixed precision support
  • Performance Improvements
    • Use Buffer Load for global reads (save registers, reduce instruction count)
    • Support DirectToLds (save registers, reduce latency)
    • Reduce global read offset vgprs (save registers)
    • Use Buffer Store for global stores (reduce instruction count)
    • Optimize global store address calculation (reduce instruction count)
    • Support LdsPad to reduce LDS write bank conflicts
  • Improve debug for assembly path (asserts, state dump, init LDS)

v3.6.0 Hgemm and thread safety fix

05 Mar 21:23

Features:

  • Hgemm
    • assembly for gfx900, source for gfx803
  • Bug fixes:
    • Additional thread safety fix for solution lookup and module storage

v3.5.1 Hgemm Assembly

06 Feb 21:46
e0a583f

Features:

  • Hgemm has been implemented in assembly.
    • For gfx900 only.
    • Does not support matrix dimensions M = 1 or N = 1.
    • GlobalSplitU and LocalSplitU have not yet been implemented for it.

Bug fixes:

  • fix thread safety for solution lookup and assembly module storage.

v3.4.0 Dgemm Assembly

01 Dec 15:00

Features:

  • Dgemm has been implemented in assembly.
    • GlobalSplitU and LocalSplitU have not yet been implemented for it.
    • 64x64x8 with prefetching is fastest kernel configuration.