1.9 compatibility #1710

maleadt · 2022-12-28T11:51:45Z

- device/intrinsics/wmma: wrong use of extern. Fix: Use plain llvmcall calling convention for WMMA intrinsics. #1709
- cublas: method error calling axp(b)y!. Fix: CUBLAS: test against generic axp(b)y, not the BLAS-specific one. #1713
- cusolver/dense: accessing non-existent field of factorization struct. Fix: Fix LU getproperty invoke. #1714
- cusolver/dense: scalar indexing during cholesky::cholcopy. Fix: Specialize cholcopy to avoid scalar indexing. #1716
- intrinsics/memory: multiple shared memory arrays are aliasing. Fix: Switch back to LLVM's IR linker. JuliaLang/julia#48106
- sorting: heterogeneously-typed kwarg call fails to compile statically in the presence of a Core.throw_inexacterror overlay. Fix: Avoid a couple of InexactErrors in the IdDict code. JuliaLang/julia#48116


maleadt · 2022-12-29T09:46:45Z

Sorting regression (a jl_invoke where we previously weren't getting one) reduced to:

using CUDA

function kernel()
    @cuda dynamic=true threads=Int32(1) blocks=Int64(1) identity(nothing)
    return
end

function main()
    @cuda kernel()
end

Looks like there's some inference regression when splatting heterogeneous kwarg tuples.

Further reduced to:

child(; kwargs...) = return
function parent()
    child(; a=1f0, b=1.0)
    return
end

CUDA.code_llvm(parent, Tuple{}) shows a dynamic call, but regular code_llvm doesn't...
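The host side of that comparison can be checked programmatically rather than by reading the IR. This is a rough sketch, not part of the original report: it captures the host IR as a string and searches it for the runtime entry points that indicate dynamic dispatch. The name `caller` is a hypothetical rename of `parent` (to stay clear of `Base.parent`), and the two symbol names searched for are an assumption about how dynamic calls show up in the printed IR.

```julia
using InteractiveUtils

child(; kwargs...) = return

# `caller` is a hypothetical rename of `parent` from the reduction above.
function caller()
    child(; a=1f0, b=1.0)
    return
end

# Capture the host LLVM IR as a string instead of printing it.
ir = sprint(io -> code_llvm(io, caller, Tuple{}; debuginfo=:none))

# Assumption: dynamic dispatch appears as a call to one of these runtime functions.
has_dynamic_call = occursin("jl_invoke", ir) || occursin("jl_apply_generic", ir)
@show has_dynamic_call
```

The same string search could in principle be applied to the IR produced by the GPU compiler to flag the regression automatically.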

Further reduced to:

using GPUCompiler


child(; kwargs...) = return
function parent()
    child(; a=1f0, b=1.0)
    return
end

# this override introduces a `jl_invoke`
GPUCompiler.@override GPUCompiler.GLOBAL_METHOD_TABLE @noinline Core.throw_inexacterror(f::Symbol, ::Type{T}, val) where {T} =
    return

module DummyRuntime
    # dummy methods
    signal_exception() = return
    malloc(sz) = C_NULL
    report_oom(sz) = return
    report_exception(ex) = return
    report_exception_name(ex) = return
    report_exception_frame(idx, func, file, line) = return
end

struct DummyCompilerParams <: AbstractCompilerParams end
GPUCompiler.runtime_module(::CompilerJob{<:Any,DummyCompilerParams}) = DummyRuntime

function main()
    source = FunctionSpec(typeof(parent))
    target = NativeCompilerTarget()
    params = DummyCompilerParams()
    job = CompilerJob(target, source, params)

    JuliaContext() do ctx
        string(GPUCompiler.compile(:llvm, job; ctx)[1])
    end
end

isinteractive() || main()

i.e. adding that overlay on Core.throw_inexacterror breaks static GPU compilation.

~~Bisected to JuliaLang/julia#43800.~~
EDIT: nope, bisected incorrectly (a spurious segfault during sysimg generation corrupted the results)

Now bisected to JuliaLang/julia#44224. @aviatesk, any quick thoughts? I'll also try to reduce this to a simpler AbsInt+overlay MWE.

maleadt · 2023-01-02T15:56:18Z

The shmem issue can be reproduced with:

@inline shmem() = Base.llvmcall(("""
        @shmem = internal global [1 x i8] zeroinitializer, align 32

        define i8* @entry() #0 {
            ret i8* getelementptr inbounds ([1 x i8], [1 x i8]* @shmem, i64 0, i64 0)
        }

        attributes #0 = { alwaysinline }""", "entry"),
    Core.LLVMPtr{Int8,0}, Tuple{})

function main()
    ptr1 = reinterpret(Ptr{Int8}, shmem())
    arr1 = unsafe_wrap(Array, ptr1, 1)
    ptr2 = reinterpret(Ptr{Int8}, shmem())
    arr2 = unsafe_wrap(Array, ptr2, 1)
    @inbounds begin
        arr1[] = 1
        arr2[]
    end
end

using InteractiveUtils
@code_llvm debuginfo=:none dump_module=true main()
@show main()

On 1.8, this yields two separate shmem variables:

@shmem = internal global [1 x i8] zeroinitializer, align 32
@shmem.5 = internal global [1 x i8] zeroinitializer, align 32
...
  %6 = call nonnull {}* inttoptr (i64 140193080728800 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140192742526720 to {}*), i64 ptrtoint ([1 x i8]* @shmem to i64), i64 1, i32 0)
  %8 = call nonnull {}* inttoptr (i64 140193080728800 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140192742526720 to {}*), i64 ptrtoint ([1 x i8]* @shmem.5 to i64), i64 1, i32 0)

While on 1.9:

@shmem = internal global [1 x i8] zeroinitializer, align 32

  %6 = call nonnull {}* inttoptr (i64 140218959557040 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140218633426224 to {}*), i64 ptrtoint ([1 x i8]* @shmem to i64), i64 1, i32 0)
  %8 = call nonnull {}* inttoptr (i64 140218959557040 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140218633426224 to {}*), i64 ptrtoint ([1 x i8]* @shmem to i64), i64 1, i32 0)

Bisected to JuliaLang/julia#44440. cc @jpsamaroo @pchintalapudi

maleadt · 2023-01-02T18:18:56Z

@dkarrasch Can you chime in on the cholcopy changes? I think JuliaLang/julia#44756 or JuliaLang/julia#47063 broke GPU compatibility, because we now dispatch to a copy function we don't implement (resulting in a for loop processing items one by one, while every GPU operation needs to be vectorized).

The problem is with LinearAlgebra.cholcopy(Hermitian(::CuArray, :L)). It used to be implemented in terms of copy_oftype, which did a copyto!(similar(A, T), A). Now there's eigencopy_oftype, which calls Hermitian(copy_similar(A, S), sym_uplo(A.uplo)), and that does a copyto! with mixed IndexingStyles that GPUArrays doesn't support. Should we implement this version of copyto!, or are we missing something else (why did this work before)?

dkarrasch · 2023-01-02T18:28:42Z

What's the return type of 3-arg similar for a CuArray? I think that's the main difference between copy_oftype (uses 2-arg similar) and copy_similar.

maleadt · 2023-01-02T18:46:40Z

> What's the return type of 3-arg similar for a CuArray? I think that's the main difference between copy_oftype (uses 2-arg similar) and copy_similar.

3-arg similar demotes back to (Cu)Array, while 2-arg preserves the structure:

julia> a = CUDA.rand(10,10);

julia> typeof(similar(Hermitian(a, :L), Float32, size(Hermitian(a, :L))))
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

julia> typeof(similar(Hermitian(Array(a), :L), Float32, size(Hermitian(Array(a), :L))))
Matrix{Float32} (alias for Array{Float32, 2})

julia> typeof(similar(Hermitian(a, :L), Float32))
Hermitian{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}

julia> typeof(similar(Hermitian(Array(a), :L), Float32))
Hermitian{Float32, Matrix{Float32}}

But this is, as demonstrated, similar to how Array works.

dkarrasch · 2023-01-02T18:51:55Z

Aha, I think it requires the same fix as in JuliaLinearAlgebra/BandedMatrices.jl#276. If you overload (potentially in a VERSION branch)

LinearAlgebra.cholcopy(A::RealHermSymComplexHerm{<:Any,<:CuArray}) =
    copyto!(similar(A, LinearAlgebra.choltype(A)), A)

does that fix the issue? The reason I generically switched to copy_similar (and hence 3-arg similar) is to have writable copies that we can pass to the in-place methods. That is what you need to pass structured matrices from LinearAlgebra to decomposition-related functions, but apparently breaks other packages that have opposite requirements.
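The "VERSION branch" variant of that overload might look like the sketch below. The cutoff `v"1.9.0-"` is an assumption (the thread does not state an exact bound), and the type alias is qualified with `LinearAlgebra.` because it is not exported. The change that actually landed in CUDA.jl is #1716 and may differ from this.

```julia
using LinearAlgebra
using CUDA

# Assumption: the copy_similar-based cholcopy first appears in 1.9 prereleases,
# so the override is only needed from v"1.9.0-" onwards.
@static if VERSION >= v"1.9.0-"
    # Copy via 2-arg `similar`, which keeps the Hermitian/Symmetric wrapper
    # around a CuArray and avoids the element-by-element fallback.
    LinearAlgebra.cholcopy(A::LinearAlgebra.RealHermSymComplexHerm{<:Any,<:CuArray}) =
        copyto!(similar(A, LinearAlgebra.choltype(A)), A)
end
```

On older Julia versions the `@static` branch is dropped when the macro expands, so the existing cholcopy path stays in place.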

maleadt · 2023-01-02T19:33:08Z

Yep, that seems to work, thanks!

maleadt · 2023-01-05T16:56:02Z

All tests work on the beta3 branch from JuliaLang/julia#48075.

maleadt added the bug label Dec 28, 2022

maleadt closed this as completed Jan 5, 2023

cncastillo mentioned this issue Jan 5, 2023: Conversion of GPU objects to F32 or F64 fails in Julia 1.9 JuliaHealth/KomaMRI.jl#145 (closed)
