Switch to KernelAbstractions.jl #559
@@ -1,53 +1,32 @@
 # Interface

 To extend the above functionality to a new array type, you should use the types and
-implement the interfaces listed on this page. GPUArrays is design around having two
-different array types to represent a GPU array: one that only ever lives on the host, and
+implement the interfaces listed on this page. GPUArrays is designed around having two
+different array types to represent a GPU array: one that exists only on the host, and
 one that actually can be instantiated on the device (i.e. in kernels).
+Device functionality is then handled by [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl).

+## Host abstractions

-## Device functionality

-Several types and interfaces are related to the device and execution of code on it. First of
-all, you need to provide a type that represents your execution back-end and a way to call
-kernels:
+You should provide an array type that builds on the `AbstractGPUArray` supertype, such as:

-```@docs
-GPUArrays.AbstractGPUBackend
-GPUArrays.AbstractKernelContext
-GPUArrays.gpu_call
-GPUArrays.thread_block_heuristic
-```
+mutable struct CustomArray{T, N} <: AbstractGPUArray{T, N}
+    data::DataRef{Vector{UInt8}}
+    offset::Int
+    dims::Dims{N}
+    ...
+end

-You then need to provide implementations of certain methods that will be executed on the
-device itself:
-
-```@docs
-GPUArrays.AbstractDeviceArray
-GPUArrays.LocalMemory
-GPUArrays.synchronize_threads
-GPUArrays.blockidx
-GPUArrays.blockdim
-GPUArrays.threadidx
-GPUArrays.griddim
-```

+This will allow your defined type (in this case `JLArray`) to use the GPUArrays interface where available.
+To be able to actually use the functionality that is defined for `AbstractGPUArray`s, you need to define the backend, like so:

-## Host abstractions
-
-You should provide an array type that builds on the `AbstractGPUArray` supertype:
-
-```@docs
-AbstractGPUArray
-```
-
-First of all, you should implement operations that are expected to be defined for any
-`AbstractArray` type. Refer to the Julia manual for more details, or look at the `JLArray`
-reference implementation.
-
-To be able to actually use the functionality that is defined for `AbstractGPUArray`s, you
-should provide implementations of the following interfaces:
-
-```@docs
-GPUArrays.backend
+import KernelAbstractions: Backend
+struct CustomBackend <: KernelAbstractions.GPU end
+KernelAbstractions.get_backend(a::CA) where CA <: CustomArray = CustomBackend()
 ```

+There are numerous examples of potential interfaces for GPUArrays, such as with [JLArrays](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/lib/JLArrays/src/JLArrays.jl), [CuArrays](https://github.com/JuliaGPU/CUDA.jl/blob/master/src/gpuarrays.jl), and [ROCArrays](https://github.com/JuliaGPU/AMDGPU.jl/blob/master/src/gpuarrays.jl).
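
As a usage illustration (my sketch, not part of the diff): once `get_backend` is defined for an array type, generic KernelAbstractions kernels can be written against it and launched without backend-specific code. The example below uses the `JLArray`/`JLBackend` pair that this PR adds; the kernel name and data are made up.

```julia
using JLArrays, KernelAbstractions

# A backend-agnostic kernel: the same code runs on any KernelAbstractions backend.
@kernel function scale!(a, b)
    i = @index(Global)              # global linear index of this work-item
    @inbounds a[i] = 2 * b[i]
end

b = jl(rand(Float32, 1024))         # JLArray wrapping host data
a = similar(b)

backend = KernelAbstractions.get_backend(a)   # JLBackend() for a JLArray
scale!(backend)(a, b; ndrange = length(a))    # instantiate for the backend and launch
@assert Array(a) == 2 .* Array(b)             # copy back to the host and check
```

With the `JLBackend` defined in this PR, such a launch is transparently re-targeted to KernelAbstractions' CPU backend (see `convert_to_cpu` further down).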
lib/JLArrays/src/JLArrays.jl
@@ -1,53 +1,30 @@
 # reference implementation on the CPU

-# note that most of the code in this file serves to define a functional array type,
-# the actual implementation of GPUArrays-interfaces is much more limited.
+# This acts as a wrapper around KernelAbstractions's parallel CPU
+# functionality. It is useful for testing GPUArrays (and other packages)
+# when no GPU is present.
+# This file follows conventions from AMDGPU.jl

 module JLArrays

-export JLArray, JLVector, JLMatrix, jl
+export JLArray, JLVector, JLMatrix, jl, JLBackend

 using GPUArrays

 using Adapt

+import KernelAbstractions
+import KernelAbstractions: Adapt, StaticArrays, Backend, Kernel, StaticSize, DynamicSize, partition, blocks, workitems, launch_config

 #
 # Device functionality
 #

-const MAXTHREADS = 256

 ## execution

-struct JLBackend <: AbstractGPUBackend end
-
-mutable struct JLKernelContext <: AbstractKernelContext
-    blockdim::Int
-    griddim::Int
-    blockidx::Int
-    threadidx::Int
-
-    localmem_counter::Int
-    localmems::Vector{Vector{Array}}
-end
-
-function JLKernelContext(threads::Int, blockdim::Int)
-    blockcount = prod(blockdim)
-    lmems = [Vector{Array}() for i in 1:blockcount]
-    JLKernelContext(threads, blockdim, 1, 1, 0, lmems)
-end
-
-function JLKernelContext(ctx::JLKernelContext, threadidx::Int)
-    JLKernelContext(
-        ctx.blockdim,
-        ctx.griddim,
-        ctx.blockidx,
-        threadidx,
-        0,
-        ctx.localmems
-    )
+struct JLBackend <: KernelAbstractions.GPU
+    static::Bool
+    JLBackend(;static::Bool=false) = new(static)
 end

 struct Adaptor end
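
A small aside (my illustration, not from the diff): the `static` flag on the new `JLBackend` mirrors the scheduling option of `KernelAbstractions.CPU`, to which it is forwarded when kernels are re-targeted to the CPU later in this file.

```julia
# Hypothetical construction of the backend defined above.
backend        = JLBackend()              # dynamic task scheduling (default)
static_backend = JLBackend(static = true) # later forwarded as CPU(; static = true)
```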
@@ -60,27 +37,6 @@ end
 Base.getindex(r::JlRefValue) = r.x
 Adapt.adapt_structure(to::Adaptor, r::Base.RefValue) = JlRefValue(adapt(to, r[]))

-function GPUArrays.gpu_call(::JLBackend, f, args, threads::Int, blocks::Int;
-                            name::Union{String,Nothing})
-    ctx = JLKernelContext(threads, blocks)
-    device_args = jlconvert.(args)
-    tasks = Array{Task}(undef, threads)
-    for blockidx in 1:blocks
-        ctx.blockidx = blockidx
-        for threadidx in 1:threads
-            thread_ctx = JLKernelContext(ctx, threadidx)
-            tasks[threadidx] = @async f(thread_ctx, device_args...)
-            # TODO: require 1.3 and use Base.Threads.@spawn for actual multithreading
-            # (this would require a different synchronization mechanism)
-        end
-        for t in tasks
-            fetch(t)
-        end
-    end
-    return
-end

 ## executed on-device

 # array type
@@ -108,42 +64,6 @@ end
 @inline Base.setindex!(A::JLDeviceArray, x, index::Integer) = setindex!(typed_data(A), x, index)

-# indexing
-
-for f in (:blockidx, :blockdim, :threadidx, :griddim)
-    @eval GPUArrays.$f(ctx::JLKernelContext) = ctx.$f
-end
-
-# memory
-
-function GPUArrays.LocalMemory(ctx::JLKernelContext, ::Type{T}, ::Val{dims}, ::Val{id}) where {T, dims, id}
-    ctx.localmem_counter += 1
-    lmems = ctx.localmems[blockidx(ctx)]
-
-    # first invocation in block
-    data = if length(lmems) < ctx.localmem_counter
-        lmem = fill(zero(T), dims)
-        push!(lmems, lmem)
-        lmem
-    else
-        lmems[ctx.localmem_counter]
-    end
-
-    N = length(dims)
-    JLDeviceArray{T,N}(data, tuple(dims...))
-end
-
-# synchronization
-
-@inline function GPUArrays.synchronize_threads(::JLKernelContext)
-    # All threads are getting started asynchronously, so a yield will yield to the next
-    # execution of the same function, which should call yield at the exact same point in the
-    # program, leading to a chain of yields effectively syncing the tasks (threads).
-    yield()
-    return
-end

 #
 # Host abstractions
 #
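
For orientation, the device-side primitives deleted here map onto KernelAbstractions macros inside a `@kernel` body. A rough sketch of the correspondence (my example, assuming a 256-wide workgroup; not part of the diff):

```julia
using KernelAbstractions

@kernel function reverse_each_block!(a)
    i   = @index(Global, Linear)       # replaces threadidx/blockidx/blockdim/griddim bookkeeping
    li  = @index(Local, Linear)
    tmp = @localmem eltype(a) (256,)   # replaces GPUArrays.LocalMemory
    tmp[li] = a[i]
    @synchronize                       # replaces GPUArrays.synchronize_threads
    a[i] = tmp[257 - li]               # reverse the elements handled by this workgroup
end
```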
@@ -409,8 +329,6 @@ end

 ## GPUArrays interfaces

-GPUArrays.backend(::Type{<:JLArray}) = JLBackend()
-
 Adapt.adapt_storage(::Adaptor, x::JLArray{T,N}) where {T,N} =
     JLDeviceArray{T,N}(x.data[], x.offset, x.dims)

@@ -423,4 +341,50 @@ function GPUArrays.mapreducedim!(f, op, R::AnyJLArray, A::Union{AbstractArray,Br
     R
 end

+## KernelAbstractions interface
+
+KernelAbstractions.get_backend(a::JLA) where JLA <: JLArray = JLBackend()
+
+function KernelAbstractions.mkcontext(kernel::Kernel{JLBackend}, I, _ndrange, iterspace, ::Dynamic) where Dynamic
+    return KernelAbstractions.CompilerMetadata{KernelAbstractions.ndrange(kernel), Dynamic}(I, _ndrange, iterspace)
+end
+
+KernelAbstractions.allocate(::JLBackend, ::Type{T}, dims::Tuple) where T = JLArray{T}(undef, dims)
+
+@inline function launch_config(kernel::Kernel{JLBackend}, ndrange, workgroupsize)
+    if ndrange isa Integer
+        ndrange = (ndrange,)
+    end
+    if workgroupsize isa Integer
+        workgroupsize = (workgroupsize, )
+    end
+
+    if KernelAbstractions.workgroupsize(kernel) <: DynamicSize && workgroupsize === nothing
+        workgroupsize = (1024,) # Vectorization, 4x unrolling, minimal grain size
+    end
+    iterspace, dynamic = partition(kernel, ndrange, workgroupsize)
+    # partition checked that the ndranges agreed
+    if KernelAbstractions.ndrange(kernel) <: StaticSize
+        ndrange = nothing
+    end
+
+    return ndrange, workgroupsize, iterspace, dynamic
+end
+
+KernelAbstractions.isgpu(b::JLBackend) = false
+
+function convert_to_cpu(obj::Kernel{JLBackend, W, N, F}) where {W, N, F}
+    return Kernel{typeof(KernelAbstractions.CPU(; static = obj.backend.static)), W, N, F}(KernelAbstractions.CPU(; static = obj.backend.static), obj.f)
+end

Inline review comments on `convert_to_cpu`:

> This is clever xD
>
> Can you explain? I didn't get what this was for, and it seems unused?
>
> Unless I did something wrong, it's used for kernel configuration, and will essentially transform anything that has a JLBackend into a `KernelAbstractions.CPU` kernel.

+function (obj::Kernel{JLBackend})(args...; ndrange=nothing, workgroupsize=nothing)
+    device_args = jlconvert.(args)
+    new_obj = convert_to_cpu(obj)
+    new_obj(device_args...; ndrange, workgroupsize)
+end
+
+Adapt.adapt_storage(::JLBackend, a::Array) = Adapt.adapt(JLArrays.JLArray, a)
+Adapt.adapt_storage(::JLBackend, a::JLArrays.JLArray) = a
+Adapt.adapt_storage(::KernelAbstractions.CPU, a::JLArrays.JLArray) = convert(Array, a)

 end
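
The `adapt_storage` rules at the end of the module can be exercised directly with Adapt; a small round-trip sketch (mine, not from the PR):

```julia
using Adapt, JLArrays, KernelAbstractions

host = rand(Float32, 4)
dev  = adapt(JLBackend(), host)               # Array   -> JLArray
back = adapt(KernelAbstractions.CPU(), dev)   # JLArray -> Array
@assert back == host
```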
Review discussion on `get_backend`:

> @vchuravy Does KA.jl already support recursing into wrapped arrays for `get_backend` queries?
>
> Uhm I don't think so, but we can add that.
>
> I'm not entirely sure we want to; it would pull the whole Union mess of Adapt's `WrappedArray` into KA.jl. On the other hand, there's not much of an alternative right now if we want the ability to launch kernels on wrapped arrays...
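
For context, the recursion being discussed would look roughly like the following hypothetical sketch (not something this PR defines); a general version would have to cover Adapt's whole `WrappedArray` union rather than a single wrapper type:

```julia
import KernelAbstractions

# Forward get_backend through a wrapper by recursing into its parent array.
KernelAbstractions.get_backend(a::SubArray) = KernelAbstractions.get_backend(parent(a))
```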