adding atomic support with atomix #299

leios · 2022-05-25T09:15:12Z

After some discussions on #282, we decided to use Atomix for atomic support in KA.

A few quick questions:

Because Base (and CUDA) both have an @atomic macro, we need to specify that we are using the Atomix.@atomic macro in code that needs atomic operations. Should we overdub any @atomic macros in KA to specifically use Atomix?
Should we add in the tests from Atomic attempts #282?
What about atomic primitives like atomic_add!(...), and atomic_sub!(...) from Atomic attempts #282? These come from either CUDA or Core.Intrinsics. Maybe it's a good idea to use Atomix on top of Atomic attempts #282? I don't know how many people will use the primitives over the macro, to be honest.

Note, this should not be merged until JuliaRegistries/General#61002 is automerged.

tkf

KernelAbstractions.jl doesn't have to depend on UnsafeAtomicsLLVM.jl (and LLVM.jl)

Project.toml

Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>

vchuravy

Looks great!

Probably needs docs as well as AMDGPU support.

lib/CUDAKernels/Project.toml

src/KernelAbstractions.jl

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

examples/histogram.jl

pxl-th · 2022-05-31T09:19:51Z

Hi!

As for 3. What about atomic primitives like atomic_add!(...), I'd like to say that I have several kernels that use atomic_add! specifically because it returns the old value after adding. I'm not sure if this is achievable with macros.

Also I'm curious if it will support things like:

@atomic max(x[i], v)

Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>

leios · 2022-05-31T19:02:18Z

I don't mind reworking this PR and #282 so we get both the macro and better ordering support from Atomix and also the atomic_... functions from either Core.Intrinsics or CUDA. I have a branch locally that basically does this and it works fine for my purposes.

I figure most people will want to use the macro, but some people will prefer the atomic_... functions, so why not just do both?

vchuravy · 2022-05-31T19:05:39Z

Let's merge this for now and then you can open a second PR?

leios · 2022-05-31T19:06:33Z

This one is not ready to be merged

vchuravy · 2022-05-31T19:16:26Z

Oops. I got excited that it passed tests :)

leios · 2022-05-31T19:28:08Z

It was missing docs and tests, at least... I will add them when I get the chance. To be fair, atomix should have all the necessary tests, I just wanted to double check here. Documentation does not need to be long, but having a section for atomics with an example would go a long way.

leios · 2022-05-31T19:29:39Z

I was just waiting to add docs until we settled the atomic "primitive" discussion.

claforte · 2022-05-31T20:24:18Z

Thanks a lot @leios ! @pxl-th and a few others in my team are very much looking forward to this PR being merged for our Instant NeRF (3D reconstruction) Julia implementation. If you'd like a sneak preview, let me know, I can invite you to our private Discord and Github. :-)

pxl-th · 2022-06-04T08:09:15Z

I've tried this PR and it looks like on CPU it only supports integer types.
While on GPU I get unsupported dynamic function invocation (call to modify!) for any type.
I'm on Julia 1.8.0-rc1, but the same errors are present on 1.7.2.

Error

ERROR: LoadError: InvalidIRError: compiling kernel #gpu_splat!(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to modify!)
Stacktrace:
 [1] modify!
   @ ~/.julia/packages/Atomix/F9VIX/src/core.jl:33
 [2] macro expansion
   @ ~/code/a.jl:28
 [3] gpu_splat!
   @ ~/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80
 [4] gpu_splat!
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/validation.jl:139
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:409 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:407 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
  [7] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [11] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:293
 [13] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/4VLF4/src/CUDAKernels.jl:272
 [15] main()
    @ Main ~/code/a.jl:40
 [16] top-level scope
    @ ~/code/a.jl:42
in expression starting at /home/pxl-th/code/a.jl:42

MWE:

Code

using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic

CUDA.allowscalar(false)

n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512

Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)

to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)

@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    @atomic max(grid[idx], mlp_out[i])
end

function main()
    #device = CPU()
    device = CUDADevice()

    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n) # errors on CPU with Float32
    grid = rand(device, Int64, n) # errors on CPU with Float32

    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()

leios · 2022-06-04T09:32:22Z

I was unable to replicate this error by running the provided code with 1.7.1 and 1.8.0-beta3 (just pulled from git). What OS are you using? Also, could you show the outputs of ] st?

pxl-th · 2022-06-04T11:17:11Z

I'm on Ubuntu 22.04
CPU: AMD Ryzen 7 5800HS
GPU: NVIDIA GeForce RTX 3060

]st:

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [a9b6321e] Atomix v0.1.0
  [052768ef] CUDA v3.10.1
  [72cfdca4] CUDAKernels v0.4.1
  [5789e2e9] FileIO v1.14.0
  [a09fc81d] ImageCore v0.9.3
  [82e4d734] ImageIO v0.6.5
  [02fcd773] ImageTransformations v0.9.4
  [b835a17e] JpegTurbo v0.1.1
  [63c18a36] KernelAbstractions v0.8.1 `https://github.com/JuliaGPU/KernelAbstractions.jl.git#master`

pxl-th · 2022-06-04T11:20:45Z

I've just updated MWE code, before I included code that does not error :)
You can also change grid & mlp_out eltypes to Float32 and to see that it does not work with them.

leios · 2022-06-04T12:25:24Z

Right, I see the comments now, sorry!

try ]add CUDAKernels#master?

pxl-th · 2022-06-04T15:16:21Z

Yes, that works, thanks!

Although there is another issue, which is not critical for me, but might be worth mentioning:

MWE:

Code

using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic

CUDA.allowscalar(false)

const NERF_STEPS = UInt32(1024)
const MIN_CONE_STEPSIZE = √3f0 / NERF_STEPS

n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512

Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)

Base.zeros(::CPU, T, shape) = zeros(T, shape)
Base.zeros(::CUDADevice, T, shape) = CUDA.zeros(T, shape)

to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)

@inline density_activation(x) = exp(x)

@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    old, new = @atomic max(grid[idx], mlp_out[i])
    @atomic grid[idx] = old
end

function main()
    # device = CPU()
    device = CUDADevice()

    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n)
    grid = zeros(device, Int64, n)

    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()

Error

ERROR: LoadError: LLVM error: Cannot select: 0x77adc60: ch = AtomicStore<(store seq_cst (s64) into %ir.41, addrspace 1)> 0x4926e08:1, 0x6f629c8, 0x4926e08, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:245 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:201 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:11 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:14 @[ /home/pxl-th/code/a.jl:29 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
  0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
    0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x4926cd0: i64 = Register %0
      0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x66aca88: i64 = Register %9
        0x66ac0c8: i32 = Constant<3>
    0x4926720: i64 = Constant<-8>
  0x4926e08: i64,ch = AtomicLoadMax<(load store seq_cst (s64) on %ir.39, addrspace 1)> 0x7195c50:1, 0x6f629c8, 0x7195c50, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:374 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:18 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ]
    0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
          0x4926cd0: i64 = Register %0
        0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x66aca88: i64 = Register %9
          0x66ac0c8: i32 = Constant<3>
      0x4926720: i64 = Constant<-8>
    0x7195c50: i64,ch = llvm.nvvm.ldg.global.i<(load (s64) from %ir.34, addrspace 1)> 0x65cc278, TargetConstant:i64<5104>, 0x6f62b68, Constant:i32<8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:219 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:40 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ] ] ]
      0x49271b0: i64 = TargetConstant<5104>
      0x6f62b68: i64 = add 0x4926b98, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x4926b98: i64 = add 0x6f62278, 0x4926ed8, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x6f62278: i64,ch = CopyFromReg 0x65cc278, Register:i64 %4, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x7803a58: i64 = Register %4
          0x4926ed8: i64 = shl nuw nsw 0x71ce2d8, Constant:i32<3>, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce2d8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %8, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x77ae620: i64 = Register %8
            0x66ac0c8: i32 = Constant<3>
        0x4926720: i64 = Constant<-8>
      0x7196880: i32 = Constant<8>
In function: _Z21julia_gpu_splat__430516CompilerMetadataI10StaticSizeI5_16__E12DynamicCheckvv7NDRangeILi1ES0_I4_1__ES0_I6_512__EvvEE13CuDeviceArrayI5Int64Li1ELi1EES3_I6UInt32Li1ELi1EES3_IS4_Li1ELi1EE
Stacktrace:
  [1] handle_error(reason::Cstring)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/core/context.jl:105
  [2] LLVMTargetMachineEmitToMemoryBuffer
    @ ~/.julia/packages/LLVM/YSJ2s/lib/13/libLLVM_h.jl:947 [inlined]
  [3] emit(tm::LLVM.TargetMachine, mod::LLVM.Module, filetype::LLVM.API.LLVMCodeGenFileType)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/targetmachine.jl:45
  [4] mcgen(job::GPUCompiler.CompilerJob, mod::LLVM.Module, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/mcgen.jl:74
  [5] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:421 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [8] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:418 [inlined]
  [9] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
 [10] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
 [11] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
 [12] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
 [13] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [14] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [15] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [16] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:292
 [17] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [18] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.StaticSize{(16,)}, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Nothing, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:273
 [19] Kernel
    @ ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:268 [inlined]
 [20] main()
    @ Main ~/code/a.jl:42
 [21] top-level scope
    @ ~/code/a.jl:45
in expression starting at /home/pxl-th/code/a.jl:45

leios · 2022-06-04T16:47:31Z

Ah, I can replicate this error, but I am not sure if it is an Atomix or KernelAbstractions issue. It seems like the CPU version works fine, so maybe it's with UnsafeAtomicsLLVM?

Would you be willing to open up a new issue either here or on Atomix (https://github.com/JuliaConcurrent/Atomix.jl) and ping @tkf?

tkf · 2022-06-05T04:45:35Z

It's an LLVM issue but workaroundable at the level of (e.g.) CUDA.jl. See: JuliaConcurrent/Atomix.jl#33

leios · 2022-06-14T06:27:31Z

@pxl-th, if you are still having trouble with Atomix, I created a separate PR with the atomic support from Core.Intrinsics and CUDA directly in #306. I also added the pkg commands to load in the subdirectory of CUDAKernels in a comment so you can just use it for now if you need.

I've been struggling to get things to work as well, so I also added testing infrastructure for Atomix in #308. Hopefully we can iron out all the details there and get all this sorted. If you have run into any issues, please document them there!

atting atomic attempts with atomix

3e5e3a2

tkf reviewed May 25, 2022

View reviewed changes

Project.toml Outdated Show resolved Hide resolved

Project.toml Outdated Show resolved Hide resolved

Apply suggestions from code review

7b9a7aa

Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>

vchuravy reviewed May 25, 2022

View reviewed changes

lib/CUDAKernels/Project.toml Outdated Show resolved Hide resolved

src/KernelAbstractions.jl Outdated Show resolved Hide resolved

Apply suggestions from code review

b0f7374

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

tkf reviewed May 25, 2022

View reviewed changes

examples/histogram.jl Outdated Show resolved Hide resolved

leios changed the title ~~atting atomic support with atomix~~ adding atomic support with atomix May 31, 2022

vchuravy marked this pull request as ready for review May 31, 2022 18:25

Update examples/histogram.jl

9384238

Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>

vchuravy merged commit 6374613 into JuliaGPU:master May 31, 2022

This was referenced Jun 8, 2022

Atomic attempts #282

Closed

Support atomics #7

Closed

leios mentioned this pull request Jun 15, 2022

adding atomix tests #308

Draft

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding atomic support with atomix #299

adding atomic support with atomix #299

leios commented May 25, 2022 •

edited

Loading

tkf left a comment

vchuravy left a comment

pxl-th commented May 31, 2022 •

edited

Loading

leios commented May 31, 2022

vchuravy commented May 31, 2022

leios commented May 31, 2022

vchuravy commented May 31, 2022

leios commented May 31, 2022

leios commented May 31, 2022

claforte commented May 31, 2022 •

edited

Loading

pxl-th commented Jun 4, 2022 •

edited

Loading

leios commented Jun 4, 2022

pxl-th commented Jun 4, 2022

pxl-th commented Jun 4, 2022 •

edited

Loading

leios commented Jun 4, 2022

pxl-th commented Jun 4, 2022 •

edited

Loading

leios commented Jun 4, 2022 •

edited

Loading

tkf commented Jun 5, 2022

leios commented Jun 14, 2022

adding atomic support with atomix #299

adding atomic support with atomix #299

Conversation

leios commented May 25, 2022 • edited Loading

tkf left a comment

Choose a reason for hiding this comment

vchuravy left a comment

Choose a reason for hiding this comment

pxl-th commented May 31, 2022 • edited Loading

leios commented May 31, 2022

vchuravy commented May 31, 2022

leios commented May 31, 2022

vchuravy commented May 31, 2022

leios commented May 31, 2022

leios commented May 31, 2022

claforte commented May 31, 2022 • edited Loading

pxl-th commented Jun 4, 2022 • edited Loading

leios commented Jun 4, 2022

pxl-th commented Jun 4, 2022

pxl-th commented Jun 4, 2022 • edited Loading

leios commented Jun 4, 2022

pxl-th commented Jun 4, 2022 • edited Loading

leios commented Jun 4, 2022 • edited Loading

tkf commented Jun 5, 2022

leios commented Jun 14, 2022

leios commented May 25, 2022 •

edited

Loading

pxl-th commented May 31, 2022 •

edited

Loading

claforte commented May 31, 2022 •

edited

Loading

pxl-th commented Jun 4, 2022 •

edited

Loading

pxl-th commented Jun 4, 2022 •

edited

Loading

pxl-th commented Jun 4, 2022 •

edited

Loading

leios commented Jun 4, 2022 •

edited

Loading