Support LinearAlgebra.kron for all GPU backends #558
@maleadt Do you mean something like the following? (I just changed ...; sorry that I'm not so familiar with the development of GPU code.)

```julia
using LinearAlgebra
using CUDA       # the kernels below use the CUDA.jl programming model
using GPUArrays  # provides AbstractGPUMatrix

function LinearAlgebra.kron!(C::AbstractGPUMatrix{TC}, A::AbstractGPUMatrix{TA}, B::AbstractGPUMatrix{TB}) where {TA,TB,TC}
    # Kernel parallelized over the entries of A: each thread walks a
    # grid-stride loop over A and fills the corresponding p×q blocks of C.
    function _kron_mat_kernelA!(C, A, B, m, n, p, q)
        index_i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        index_j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
        stride_i = blockDim().x * gridDim().x
        stride_j = blockDim().y * gridDim().y
        index_i > m && return
        index_j > n && return
        for i in index_i:stride_i:m
            for j in index_j:stride_j:n
                for k in 1:p
                    for l in 1:q
                        @inbounds C[(i-1)*p+k, (j-1)*q+l] = A[i,j] * B[k,l]
                    end
                end
            end
        end
        return nothing
    end

    # Kernel parallelized over the entries of B instead.
    function _kron_mat_kernelB!(C, A, B, m, n, p, q)
        index_p = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        index_q = (blockIdx().y - 1) * blockDim().y + threadIdx().y
        stride_p = blockDim().x * gridDim().x
        stride_q = blockDim().y * gridDim().y
        index_p > p && return
        index_q > q && return
        for i in 1:m
            for j in 1:n
                for k in index_p:stride_p:p
                    for l in index_q:stride_q:q
                        @inbounds C[(i-1)*p+k, (j-1)*q+l] = A[i,j] * B[k,l]
                    end
                end
            end
        end
        return nothing
    end

    m, n = size(A)
    p, q = size(B)

    # Use different kernels depending on the size of the matrices,
    # parallelizing over the matrix with the larger number of elements.
    if m*n >= p*q
        kernel = @cuda launch=false _kron_mat_kernelA!(C, A, B, m, n, p, q)
        sizes = (m, n)
    else
        kernel = @cuda launch=false _kron_mat_kernelB!(C, A, B, m, n, p, q)
        sizes = (p, q)
    end

    # Split the available threads and blocks across the two dimensions,
    # proportionally to the aspect ratio of the parallelized matrix.
    config = launch_configuration(kernel.fun)
    dim_ratio = sizes[1] / sizes[2]
    max_threads_i = max(1, floor(Int, sqrt(config.threads * dim_ratio)))
    max_threads_j = max(1, floor(Int, sqrt(config.threads / dim_ratio)))
    max_blocks_i = max(1, floor(Int, sqrt(config.blocks * dim_ratio)))
    max_blocks_j = max(1, floor(Int, sqrt(config.blocks / dim_ratio)))
    threads_i = min(sizes[1], max_threads_i)
    threads_j = min(sizes[2], max_threads_j)
    threads = (threads_i, threads_j)
    blocks_i = min(cld(sizes[1], threads_i), max_blocks_i)
    blocks_j = min(cld(sizes[2], threads_j), max_blocks_j)
    blocks = (blocks_i, blocks_j)

    kernel(C, A, B, m, n, p, q; threads=threads, blocks=blocks)
    return C
end

function LinearAlgebra.kron(A::AbstractGPUMatrix{TA}, B::AbstractGPUMatrix{TB}) where {TA,TB}
    m, n = size(A)
    p, q = size(B)
    T = promote_type(TA, TB)
    C = similar(A, T, m*p, n*q)
    return kron!(C, A, B)
end
```
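For what it's worth, a quick way to sanity-check a method like this against the CPU implementation (a hypothetical session; assumes a working CUDA.jl setup):

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 3, 4)
B = CUDA.rand(Float32, 5, 2)

C = kron(A, B)  # 15×8 CuMatrix{Float32}

# Compare against the CPU result computed by the generic method.
@assert Array(C) ≈ kron(Array(A), Array(B))
```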
Ah, I glanced over the kernels. Yeah, those will have to be ported to GPUArrays' DSL.
Hmm, not sure how to contribute. I checked the documentation of ..., and I think you guys define ...
Okay, maybe I will just wait.
The docs are pretty bad; it's easier to look at some existing kernels. It's a very shallow abstraction over the CUDA stuff you're probably familiar with. But waiting for KA.jl is probably fine as well, if you don't need this urgently.
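For a flavor of the DSL, here's a minimal sketch of what a `kron!` kernel could look like with `gpu_call` (untested; `kron_gpucall!` is just an illustrative name):

```julia
using LinearAlgebra
using GPUArrays  # provides gpu_call and @linearidx

# A minimal sketch, assuming GPUArrays' gpu_call API: one work-item per
# entry of A, each filling the corresponding p×q block of C.
function kron_gpucall!(C::AbstractGPUMatrix, A::AbstractGPUMatrix, B::AbstractGPUMatrix)
    m, n = size(A)
    p, q = size(B)
    gpu_call(C, A, B; elements = m * n) do ctx, C, A, B
        idx = @linearidx A       # bounds-guarded linear index into A
        i = (idx - 1) % m + 1    # column-major coordinates of A[idx]
        j = (idx - 1) ÷ m + 1
        for k in 1:p, l in 1:q
            @inbounds C[(i-1)*p+k, (j-1)*q+l] = A[i, j] * B[k, l]
        end
        return
    end
    return C
end
```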
So you guys are making the move to KernelAbstractions.jl?
Yes, see the open PRs here and on the various back-ends.
Correct! I won't be able to work on it this week, but I hope to merge it next week. So if you want, you can already create a PR on GPUArrays.jl that's based on the KA.jl PR.
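In case it helps, a rough sketch of how the same kernel might look in KernelAbstractions.jl (assuming the KA.jl v0.9-style API; `kron_kernel!` and `kron_ka!` are illustrative names):

```julia
using KernelAbstractions
using LinearAlgebra

# One work-item per entry of A; each fills the corresponding block of C.
@kernel function kron_kernel!(C, @Const(A), @Const(B))
    i, j = @index(Global, NTuple)
    p, q = size(B)
    for k in 1:p, l in 1:q
        @inbounds C[(i-1)*p+k, (j-1)*q+l] = A[i, j] * B[k, l]
    end
end

function kron_ka!(C, A, B)
    backend = get_backend(C)  # works for any KA-supported backend
    kron_kernel!(backend)(C, A, B; ndrange = size(A))
    KernelAbstractions.synchronize(backend)
    return C
end
```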
@maleadt

```julia
kron(A::Transpose{T,<:AbstractGPUVecOrMat}, B::AbstractGPUVecOrMat) where {T} = kron(typeof(B)(A), B)
```

Because the actual backend has not been decided yet at this stage, I cannot convert `A` to a concrete array type directly. Also, in this way, the following method cannot be defined:

```julia
kron(A::Transpose{TA,<:AbstractGPUVecOrMat}, B::Transpose{TB,<:AbstractGPUVecOrMat}) where {TA,TB} = ?
```

Do you guys have some functions to do this?
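The closest I can get without naming a backend is something like the following (untested sketch; it relies on `copy` materializing a `Transpose` into a plain matrix on the same backend as its parent):

```julia
using LinearAlgebra
using GPUArrays  # AbstractGPUMatrix

# Untested sketch: materialize the lazy wrapper with `copy`, which yields a
# plain matrix on the same backend as parent(A), then reuse the dense method.
LinearAlgebra.kron(A::Transpose{T,<:AbstractGPUMatrix}, B::AbstractGPUMatrix) where {T} =
    kron(copy(A), B)

LinearAlgebra.kron(A::Transpose{TA,<:AbstractGPUMatrix}, B::Transpose{TB,<:AbstractGPUMatrix}) where {TA,TB} =
    kron(copy(A), copy(B))
```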
I'm not following; why would you want this call, materializing the transpose? Typically we access the underlying array by calling `parent`.
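For reference, `parent` returns the wrapped array without copying:

```julia
julia> using LinearAlgebra

julia> A = rand(3, 2);

julia> At = transpose(A);

julia> parent(At) === A   # the wrapper is lazy; parent unwraps it for free
true

julia> At[2, 1] == A[1, 2]  # indices swap through the wrapper
true
```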
Fixed by #575.
As @maleadt mentioned in JuliaGPU/Metal.jl#422, I re-open a new issue here. The current `LinearAlgebra.kron` only supports `CuArray`, and the other GPU array types fall back to scalar indexing. Also, the methods for Kronecker products of `Transpose{T,<:AbstractGPUArray}` or `Adjoint{T,<:AbstractGPUArray}` are missing.
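To illustrate the fallback (a hypothetical session; assumes Metal.jl, where the problem was originally reported, and its `mtl` conversion helper):

```julia
using Metal, LinearAlgebra

A = mtl(rand(Float32, 4, 4))
B = mtl(rand(Float32, 2, 2))

# With no specialized GPU method, this dispatches to the generic kron from
# LinearAlgebra, which reads the matrices element by element and therefore
# trips the "scalar indexing is disallowed" check.
kron(A, B)
```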