
permit NNlibCUDA to use Float16 #363

Open
bjarthur wants to merge 1 commit into master
Conversation

@bjarthur (Contributor)

In conjunction with FluxML/NNlibCUDA.jl#32, this adds support for half-precision gemm, for which Nvidia provides a special kernel. See JuliaGPU/CUDA.jl#1080.
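For context, a rough sketch of what the two PRs together should enable (assuming both are merged; the array sizes are arbitrary):

```julia
using NNlib, NNlibCUDA, CUDA

A = CuArray(rand(Float16, 8, 10, 32))   # batch of 32 matrices, each 8×10
B = CuArray(rand(Float16, 10, 6, 32))   # batch of 32 matrices, each 10×6

C = batched_mul(A, B)   # 8×6×32 CuArray{Float16}, via Nvidia's half-precision batched gemm
```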

```diff
@@ -220,7 +220,7 @@ _batched_mul!(::Type, C, A, B, α::Number, β::Number) = batched_mul_generic!(C,
 _batched_mul!(::Type{DT}, C, A, B, α::Number, β::Number) where {DT<:DenseArray{T}} where {T<:BlasFloat} =
     _batched_try_gemm!(DT, C, A, B, α, β)

 function _batched_try_gemm!(::Type{DT}, C, A, B, α::Number, β::Number) where {DT<:DenseArray{T}} where {T<:BlasFloat}
```
@mcabbott (Member) · Nov 19, 2021

My concern with this change (removing the {T<:BlasFloat} restriction, which the diff doesn't highlight well) is that it may send weird number types (like Dual, or BigFloat) down the path towards batched_gemm!, which won't accept them.

Perhaps, to widen safely here, the _batched_gemm!(::Type{<:Array}, ...) method below needs to be restricted to Array{<:BlasFloat}, with a new method offering another path to batched_mul_generic! at that stage?

The dispatch in this file is pretty convoluted! Maybe there's another tidier solution.

Float16 would be good to have, though. Thanks for digging.
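A minimal toy sketch of that guard, just to show the dispatch pattern being suggested (the argument list of NNlib's real _batched_gemm! differs; the returned symbols are placeholders):

```julia
using LinearAlgebra: BlasFloat

# BLAS-supported element types keep going to the batched_gemm! path...
_batched_gemm!(::Type{<:Array{<:BlasFloat}}, C, A, B, α, β) = :gemm_path
# ...while every other Array eltype (Dual, BigFloat, ...) gets a new method
# that would route back to batched_mul_generic! instead.
_batched_gemm!(::Type{<:Array}, C, A, B, α, β) = :generic_path

_batched_gemm!(Array{Float32}, nothing, nothing, nothing, 1, 0)   # => :gemm_path
_batched_gemm!(Array{BigFloat}, nothing, nothing, nothing, 1, 0)  # => :generic_path
```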

@bjarthur (Contributor, Author) · Nov 19, 2021

The only place this method (i.e. _batched_try_gemm!) is currently called is from the method immediately above (i.e. _batched_mul!() where {T<:BlasFloat}). Widening _batched_try_gemm! to types other than BlasFloat permits the proposed new _batched_mul!() where {T<:Float16} in FluxML/NNlibCUDA.jl#32 to call it too. I don't think there's any danger of weird number types getting where they shouldn't.
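Roughly, the companion method proposed in FluxML/NNlibCUDA.jl#32 looks something like the sketch below (see that PR for the real code; this only illustrates why the widened signature is needed):

```julia
using NNlib, CUDA

# Hypothetical sketch: CuArray{Float16} batches re-enter NNlib's gemm chain,
# which only works if _batched_try_gemm! no longer demands T<:BlasFloat.
NNlib._batched_mul!(::Type{DT}, C, A, B, α::Number, β::Number) where {DT<:CuArray{Float16}} =
    NNlib._batched_try_gemm!(DT, C, A, B, α, β)
```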

@mcabbott (Member)

Oh, now I see better what you're proposing. There are two jumps to the CUDA package, in order to allow Float16 only for CuArrays, not for Arrays, which is the desired behaviour. The first jump comes back to this package's chain of functions.

It does seem slightly weird to jump twice. Let me think a bit more, I'd be happier if there was exactly one point in the chain where dispatch cared about CuArrays.
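For readers following along, the two jumps look roughly like this (an informal outline, not exact code):

```julia
# _batched_mul!(::Type{<:CuArray{Float16}}, C, A, B, α, β)   # jump 1: method added in NNlibCUDA.jl#32
#  └─ _batched_try_gemm!(...)                                # back in NNlib (widened by this PR)
#      └─ _batched_gemm!(::Type{<:CuArray}, ...)             # jump 2: CuArray method in NNlibCUDA
#          └─ CUBLAS half-precision batched gemm             # from JuliaGPU/CUDA.jl#1080
```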

@bjarthur (Contributor, Author)

ping

@mcabbott (Member)

Sorry I dropped the ball here. I think we should do this, or at least I certainly didn't get around to thinking up a better way.

Could you perhaps add some comments explaining a bit what's going on? Having dispatch at two points, instead of just reading down the page & at some point jumping to CUDA, is one step trickier to read. Maybe the where {DT<:DenseArray{T}} where {T<:BlasFloat} = ... method can explain that there's another path through here for CuArray{Float16}?
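Something along these lines, perhaps (the comment wording is only a suggestion):

```julia
# DenseArrays of BLAS eltypes try to use batched gemm.
# NB: _batched_try_gemm! deliberately does not require T<:BlasFloat, because
# NNlibCUDA re-enters it with CuArray{Float16} (see FluxML/NNlibCUDA.jl#32).
_batched_mul!(::Type{DT}, C, A, B, α::Number, β::Number) where {DT<:DenseArray{T}} where {T<:BlasFloat} =
    _batched_try_gemm!(DT, C, A, B, α, β)
```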
