Unconditional errors result in dynamic invocations #649

Open
kichappa opened this issue Nov 22, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@kichappa

kichappa commented Nov 22, 2024

Describe the bug

Any use of shfl_sync throws an error saying shfl_recurse is a dynamic function.

To reproduce

The Minimal Working Example (MWE) for this bug:

Attempting to do a stream compaction:

using CUDA

# define a new array of 64 elements and fill it with random ones and zeros
a = rand(0:1, 64)

a_gpu = CuArray(a)
b_gpu = CUDA.zeros(Int64, 64)
count = CUDA.zeros(Int64, 1)

function mykernel!(in, out, count)
    threadNum = threadIdx().x + blockDim().x * (blockIdx().x - 1) # 1-indexed
    warpNum = (threadIdx().x - 1) ÷ 32 # 0-indexed
    laneNum = (threadIdx().x - 1) % 32 # 0-indexed

    shared_count = CuDynamicSharedArray(Int64, 1)

    if threadNum == 1
        shared_count[1] = 0
    end
    sync_threads()

    if threadNum <= 64
        is_nonzero = in[threadNum] != 0
        mask = CUDA.vote_ballot_sync(0xffffffff, is_nonzero)
        warp_count = count_ones(mask)

        warp_offset = 0
        if laneNum == 0
            warp_offset = CUDA.atomic_add!(pointer(shared_count, 1), warp_count)
        end
        warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, Int32(0)) #<<<<< This is the BUG code.

        if is_nonzero
            index = count_ones(mask & ((UInt32(1) << laneNum) - 1)) + warp_offset
            out[index+1] = threadNum
        end
    end
    sync_threads()

    if threadIdx().x == 1
        CUDA.atomic_add!(CUDA.pointer(count), shared_count[1])
    end
    return
end

@cuda threads = 64 blocks = 1 shmem=sizeof(Int64) mykernel!(a_gpu, b_gpu, count)

println("nonzeros:$(collect(count))")
println(collect(b_gpu))
Manifest.toml

Package versions:
Status `~/.julia/environments/v1.11/Project.toml`
  [052768ef] CUDA v5.5.2

CUDA details:
CUDA runtime version: 12.6.0
CUDA driver version: 12.6.0
CUDA capability: 9.0.0

Expected behavior

The expected behavior is that the shuffle call does not throw an error, and that all zeros in a are removed when the data is compacted into b.

Version info

Details on Julia:

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Platinum 8462Y+
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, sapphirerapids)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)

Details on CUDA:

CUDA driver 12.6
NVIDIA driver 550.90.7

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.90.7

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA H100 80GB HBM3 (sm_90, 77.409 GiB / 79.647 GiB available)
@kichappa kichappa added the bug Something isn't working label Nov 22, 2024
@maleadt
Member

maleadt commented Nov 22, 2024

Fascinating. So the actual bug is that you're incorrectly invoking shfl_sync with the lane set to 0, while CUDA.jl uses 1-based indices everywhere. However, this manifests as a dynamic invocation because Julia detects that an InexactError is thrown unconditionally and deoptimizes the call into a dynamic invocation.

@aviatesk Is this the throw-block deoptimization? I tested this on 1.11. Can we disable this deoptimization so that compilation succeeds and the user gets a chance to see the error being generated at run time?
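For reference, a minimal sketch of the corrected call, assuming the intent is to broadcast warp_offset from the first lane of the warp (the lane argument in CUDA.jl is 1-based):

# broadcast warp_offset from lane 1 (the first lane) to the whole warp
warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, 1)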

@maleadt maleadt added enhancement New feature or request and removed bug Something isn't working labels Nov 22, 2024
@maleadt maleadt changed the title Warp Shuffle doesn't work Unconditional errors result in dynamic invocations Nov 22, 2024
@maleadt maleadt transferred this issue from JuliaGPU/CUDA.jl Nov 22, 2024
@kichappa
Author

kichappa commented Nov 22, 2024

Holy shit! That was a 3-hour torment for me until your response. Thanks a lot, and yeah, the suggestion would have helped me.

@aviatesk
Contributor

Yes, since unoptimize_throw_blocks is still present in v1.11, you might be able to resolve this issue by disabling it like InferenceParams(::GPUInterpreter) = InferenceParams(; unoptimize_throw_blocks=false).

@maleadt
Member

maleadt commented Nov 22, 2024

Hmm, we should already do that:

function inference_params(@nospecialize(job::CompilerJob))
    if VERSION >= v"1.12.0-DEV.1017"
        CC.InferenceParams()
    else
        CC.InferenceParams(; unoptimize_throw_blocks=false)
    end
end

@aviatesk
Contributor

That does seem to be the case... Taking a look at the inference results from GPUInterpreter might reveal something?
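
One way to look at those inference results, as a minimal sketch assuming CUDA.jl's reflection macros apply here, is to dump the inferred device code for the failing launch and check which call inference could not resolve:

# print the device-side inferred code (with type-instability highlighting)
# for the kernel from the MWE above
CUDA.@device_code_warntype @cuda threads=64 blocks=1 shmem=sizeof(Int64) mykernel!(a_gpu, b_gpu, count)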

@kichappa
Author

kichappa commented Dec 1, 2024

I might have discovered another area where dynamic invocations occur. atomic_add! raises errors because 1.0 is not of type Float32, causing a type mismatch with the memory behind the pointer.

function kernel()
    c_shared = CuDynamicSharedArray(Float32, 2)
    if threadIdx().x == 1
        c_shared[1] = 0.0
        c_shared[2] = 0.0
    end
    sync_threads()

    if threadIdx().x % 2 == 0
        CUDA.atomic_add!(pointer(c_shared, 1), 1.0)
    else
        CUDA.atomic_add!(CUDA.pointer(c_shared, 2), 1.0)
    end
    sync_threads()
    return
end

@cuda threads = 10 blocks = 1 shmem=sizeof(Float32)*2 kernel()

Reason: unsupported dynamic function invocation (call to atomic_add!)
Stacktrace:
 [1] kernel
   @ ~/test.jl:26
Reason: unsupported dynamic function invocation (call to atomic_add!)
Stacktrace:
 [1] kernel
   @ ~/test.jl:28

@maleadt
Member

maleadt commented Dec 2, 2024

atomic_add! raises errors because 1.0 is not of type Float32, causing a type mismatch with the memory behind the pointer.

Yeah, that's expected behavior of the low-level interface. Only the high-level @atomic performs automatic type conversion.
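
For reference, a minimal sketch of both options, assuming the same shared-memory setup as in the kernel above:

# Option 1: low-level interface, so the value must already match the element type
CUDA.atomic_add!(pointer(c_shared, 1), 1.0f0)

# Option 2: high-level macro, which converts the value to the array's element type
CUDA.@atomic c_shared[1] += 1.0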

@kichappa
Author

kichappa commented Dec 2, 2024

Yes, that's expected. But should it lead to a dynamic invocation? My question is why a dynamic invocation was deemed unnecessary for shfl_sync but not for atomic_add!.

@maleadt
Member

maleadt commented Dec 3, 2024

I guess you could interpret calling a function with the wrong arguments as throwing an unconditional MethodError, but generally I find it less surprising that doing so results in a dynamic invocation, whereas in the case of shfl the underlying error was an unconditional InexactError.
