Unconditional errors result in dynamic invocations #649

Open
kichappa opened this issue Nov 22, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@kichappa

kichappa commented Nov 22, 2024

Describe the bug

Any use of shfl_sync throws an error saying shfl_recurse is a dynamic function.

To reproduce

The Minimal Working Example (MWE) for this bug:

Attempting to do a stream compaction:

using CUDA

# define a new array of 64 elements and fill it with random ones and zeros
a = rand(0:1, 64)

a_gpu = CuArray(a)
b_gpu = CUDA.zeros(Int64, 64)
count = CUDA.zeros(Int64, 1)

function mykernel!(in, out, count)
    threadNum = threadIdx().x + blockDim().x * (blockIdx().x - 1) # 1-indexed
    warpNum = (threadIdx().x - 1) ÷ 32 # 0-indexed
    laneNum = (threadIdx().x - 1) % 32 # 0-indexed

    shared_count = CuDynamicSharedArray(Int64, 1)

    if threadNum == 1
        shared_count[1] = 0
    end
    sync_threads()

    if threadNum <= 64
        is_nonzero = in[threadNum] != 0
        mask = CUDA.vote_ballot_sync(0xffffffff, is_nonzero)
        warp_count = count_ones(mask)

        warp_offset = 0
        if laneNum == 0
            warp_offset = CUDA.atomic_add!(pointer(shared_count, 1), warp_count)
        end
        warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, Int32(0)) #<<<<< This is the BUG code.

        if is_nonzero
            index = count_ones(mask & ((UInt32(1) << laneNum) - 1)) + warp_offset
            out[index+1] = threadNum
        end
    end
    sync_threads()

    if threadIdx().x == 1
        CUDA.atomic_add!(CUDA.pointer(count), shared_count[1])
    end
    return
end

@cuda threads = 64 blocks = 1 shmem=sizeof(Int64) mykernel!(a_gpu, b_gpu, count)

println("nonzeros:$(collect(count))")
println(collect(b_gpu))
Manifest.toml

Package versions:
Status `~/.julia/environments/v1.11/Project.toml`
  [052768ef] CUDA v5.5.2

CUDA details:
CUDA runtime version: 12.6.0
CUDA driver version: 12.6.0
CUDA capability: 9.0.0

Expected behavior

The expected behavior is that the shuffle call does not throw an error, and that all zeros in a are removed when the data is compacted into b.

Version info

Details on Julia:

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Platinum 8462Y+
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, sapphirerapids)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)

Details on CUDA:

CUDA driver 12.6
NVIDIA driver 550.90.7

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.90.7

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA H100 80GB HBM3 (sm_90, 77.409 GiB / 79.647 GiB available)
@kichappa kichappa added the bug Something isn't working label Nov 22, 2024
@maleadt
Member

maleadt commented Nov 22, 2024

Fascinating. So the actual bug is that you're incorrectly invoking shfl_sync with the lane set to 0, while CUDA.jl uses 1-based indices everywhere. However, this manifests as a dynamic invocation because Julia detects that an InexactError is thrown unconditionally and deoptimizes the call into a dynamic invocation.

@aviatesk Is this the throw-block deoptimization? I tested this on 1.11. Can we disable this deoptimization so that compilation succeeds and the user gets a chance to see the error being generated at run time?
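For reference, a minimal sketch of the corrected call, assuming the intent is to broadcast warp_offset from the first lane of the warp (the lane argument in CUDA.jl is 1-based):

# broadcast warp_offset from lane 1 (the first lane) to the whole warp
warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, 1)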

@maleadt maleadt added enhancement New feature or request and removed bug Something isn't working labels Nov 22, 2024
@maleadt maleadt changed the title Warp Shuffle doesn't work Unconditional errors result in dynamic invocations Nov 22, 2024
@maleadt maleadt transferred this issue from JuliaGPU/CUDA.jl Nov 22, 2024
@kichappa
Author

kichappa commented Nov 22, 2024

Holy shit! That was a 3-hour torment for me until your response. Thanks a lot, and yeah, the suggestion would have helped me.

@aviatesk
Contributor

Yes, since unoptimize_throw_blocks is still present in v1.11, you might be able to resolve this issue by disabling it like InferenceParams(::GPUInterpreter) = InferenceParams(; unoptimize_throw_blocks=false).

@maleadt
Member

maleadt commented Nov 22, 2024

Hmm, we should already do that:

function inference_params(@nospecialize(job::CompilerJob))
    if VERSION >= v"1.12.0-DEV.1017"
        CC.InferenceParams()
    else
        CC.InferenceParams(; unoptimize_throw_blocks=false)
    end
end

@aviatesk
Contributor

That does seem to be the case... Taking a look at the inference results from GPUInterpreter might reveal something?
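
One way to look at those inference results, as a minimal sketch assuming CUDA.jl's reflection macros apply here, is to dump the inferred device code for the failing launch and check which call inference could not resolve:

# print the device-side inferred code (with type-instability highlighting)
# for the kernel from the MWE above
CUDA.@device_code_warntype @cuda threads=64 blocks=1 shmem=sizeof(Int64) mykernel!(a_gpu, b_gpu, count)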

@kichappa
Author

kichappa commented Dec 1, 2024

I might have discovered another area where dynamic invocations occur. atomic_add! raises errors because 1.0 is not of type Float32, causing a type mismatch with the memory behind the pointer.

function kernel()
    c_shared = CuDynamicSharedArray(Float32, 2)
    if threadIdx().x == 1
        c_shared[1] = 0.0
        c_shared[2] = 0.0
    end
    sync_threads()

    if threadIdx().x % 2 == 0
        CUDA.atomic_add!(pointer(c_shared, 1), 1.0)
    else
        CUDA.atomic_add!(CUDA.pointer(c_shared, 2), 1.0)
    end
    sync_threads()
    return
end

@cuda threads = 10 blocks = 1 shmem=sizeof(Float32)*2 kernel()

Reason: unsupported dynamic function invocation (call to atomic_add!)
Stacktrace:
 [1] kernel
   @ ~/test.jl:26
Reason: unsupported dynamic function invocation (call to atomic_add!)
Stacktrace:
 [1] kernel
   @ ~/test.jl:28

@maleadt
Member

maleadt commented Dec 2, 2024

atomic_add! raises errors because 1.0 is not of type Float32, causing a type mismatch with the memory behind the pointer.

Yeah, that's expected behavior of the low-level interface. Only the high-level @atomic performs automatic type conversion.
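
For reference, a minimal sketch of both options, assuming the same shared-memory setup as in the kernel above:

# Option 1: low-level interface, so the value must already match the element type
CUDA.atomic_add!(pointer(c_shared, 1), 1.0f0)

# Option 2: high-level macro, which converts the value to the array's element type
CUDA.@atomic c_shared[1] += 1.0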

@kichappa
Author

kichappa commented Dec 2, 2024

Yes, that's expected. But should it lead to a dynamic invocation? My question is why a dynamic invocation was deemed unnecessary for shfl_sync but not for atomic_add!.

@maleadt
Member

maleadt commented Dec 3, 2024

I guess you could interpret calling a function with the wrong arguments as throwing an unconditional MethodError, but generally I find it less surprising that doing so results in a dynamic invocation, whereas in the case of shfl the underlying error was an unconditional InexactError.
