1.9 compatibility #1710

Closed
6 tasks done
maleadt opened this issue Dec 28, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@maleadt
Member

maleadt commented Dec 28, 2022

@maleadt maleadt added the bug Something isn't working label Dec 28, 2022
@maleadt
Member Author

maleadt commented Dec 29, 2022

Sorting regression (a jl_invoke where we previously weren't getting one) reduced to:

using CUDA

function kernel()
    @cuda dynamic=true threads=Int32(1) blocks=Int64(1) identity(nothing)
    return
end

function main()
    @cuda kernel()
end

Looks like there's some inference regression when splatting heterogeneous kwarg tuples.


Further reduced to:

child(; kwargs...) = return
function parent()
    child(; a=1f0, b=1.0)
    return
end

CUDA.code_llvm(parent, Tuple{}) shows a dynamic call, but regular code_llvm doesn't...
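For anyone who wants to poke at the CPU side of this comparison, here is a minimal sketch using only InteractiveUtils. The names kw_child and kw_parent are hypothetical renames of the functions above (chosen to avoid shadowing Base.parent), and searching the IR text for "jl_invoke" is just one assumed way of spotting a dynamic call, not how the GPU toolchain detects it.

```julia
using InteractiveUtils

# Hypothetical renames of the reduced example, to avoid clashing with Base.parent.
kw_child(; kwargs...) = return
function kw_parent()
    kw_child(; a=1f0, b=1.0)
    return
end

# The keyword arguments are collected into a NamedTuple with mixed field types,
# which is the "heterogeneous" part of the reduction.
nt = (; a=1f0, b=1.0)
@assert typeof(nt) == NamedTuple{(:a, :b), Tuple{Float32, Float64}}

# Capture the host LLVM IR as a string so it can be inspected programmatically.
ir = sprint(io -> code_llvm(io, kw_parent, Tuple{}; debuginfo=:none))

# Report whether the host IR mentions a dynamic invoke; the result is not asserted
# here because it depends on the Julia version in use.
println("host IR mentions jl_invoke: ", occursin("jl_invoke", ir))
```

This only inspects the regular (host) compiler; the regression reported here shows up through CUDA.code_llvm, which goes through GPUCompiler instead.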


Further reduced to:

using GPUCompiler


child(; kwargs...) = return
function parent()
    child(; a=1f0, b=1.0)
    return
end

# this override introduces a `jl_invoke`
GPUCompiler.@override GPUCompiler.GLOBAL_METHOD_TABLE @noinline Core.throw_inexacterror(f::Symbol, ::Type{T}, val) where {T} =
    return

module DummyRuntime
    # dummy methods
    signal_exception() = return
    malloc(sz) = C_NULL
    report_oom(sz) = return
    report_exception(ex) = return
    report_exception_name(ex) = return
    report_exception_frame(idx, func, file, line) = return
end

struct DummyCompilerParams <: AbstractCompilerParams end
GPUCompiler.runtime_module(::CompilerJob{<:Any,DummyCompilerParams}) = DummyRuntime

function main()
    source = FunctionSpec(typeof(parent))
    target = NativeCompilerTarget()
    params = DummyCompilerParams()
    job = CompilerJob(target, source, params)

    JuliaContext() do ctx
        string(GPUCompiler.compile(:llvm, job; ctx)[1])
    end
end

isinteractive() || main()

i.e. adding that overlay on Core.throw_inexacterror breaks static GPU compilation.


Bisected to JuliaLang/julia#43800.
EDIT: nope, bisected incorrectly (a spurious segfault during sysimg generation corrupted the results)


Now bisected to JuliaLang/julia#44224. @aviatesk, any quick thoughts? I'll also try to reduce this to a simpler AbsInt+overlay MWE.

@maleadt
Member Author

maleadt commented Jan 2, 2023

The shmem issue can be reproduced with:

@inline shmem() = Base.llvmcall(("""
        @shmem = internal global [1 x i8] zeroinitializer, align 32

        define i8* @entry() #0 {
            ret i8* getelementptr inbounds ([1 x i8], [1 x i8]* @shmem, i64 0, i64 0)
        }

        attributes #0 = { alwaysinline }""", "entry"),
    Core.LLVMPtr{Int8,0}, Tuple{})

function main()
    ptr1 = reinterpret(Ptr{Int8}, shmem())
    arr1 = unsafe_wrap(Array, ptr1, 1)
    ptr2 = reinterpret(Ptr{Int8}, shmem())
    arr2 = unsafe_wrap(Array, ptr2, 1)
    @inbounds begin
        arr1[] = 1
        arr2[]
    end
end

using InteractiveUtils
@code_llvm debuginfo=:none dump_module=true main()
@show main()

On 1.8, this yields two separate shmem variables:

@shmem = internal global [1 x i8] zeroinitializer, align 32
@shmem.5 = internal global [1 x i8] zeroinitializer, align 32
...
  %6 = call nonnull {}* inttoptr (i64 140193080728800 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140192742526720 to {}*), i64 ptrtoint ([1 x i8]* @shmem to i64), i64 1, i32 0)
  %8 = call nonnull {}* inttoptr (i64 140193080728800 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140192742526720 to {}*), i64 ptrtoint ([1 x i8]* @shmem.5 to i64), i64 1, i32 0)

While on 1.9:

@shmem = internal global [1 x i8] zeroinitializer, align 32

  %6 = call nonnull {}* inttoptr (i64 140218959557040 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140218633426224 to {}*), i64 ptrtoint ([1 x i8]* @shmem to i64), i64 1, i32 0)
  %8 = call nonnull {}* inttoptr (i64 140218959557040 to {}* ({}*, i64, i64, i32)*)({}* inttoptr (i64 140218633426224 to {}*), i64 ptrtoint ([1 x i8]* @shmem to i64), i64 1, i32 0)

Bisected to JuliaLang/julia#44440. cc @jpsamaroo @pchintalapudi

@maleadt
Member Author

maleadt commented Jan 2, 2023

@dkarrasch Can you chime in on the cholcopy changes? I think JuliaLang/julia#44756 or JuliaLang/julia#47063 broke GPU compatibility, because we now dispatch to a copy function we don't implement (resulting in a for loop processing individual items, whereas every GPU operation needs to be vectorized).

The problem is with LinearAlgebra.cholcopy(Hermitian(::CuArray, :L)), which used to be implemented in terms of copy_oftype, which did a copyto!(similar(A, T), A). Now there's eigencopy_oftype, which calls Hermitian(copy_similar(A, S), sym_uplo(A.uplo)) and so does a copyto! with mixed IndexStyles that GPUArrays doesn't support. Should we implement this version of copyto!, or are we missing something else (why did this work before)?

@dkarrasch
Contributor

What's the return type of 3-arg similar for a CuArray? I think that's the main difference between copy_oftype (uses 2-arg similar) and copy_similar.

@maleadt
Member Author

maleadt commented Jan 2, 2023

> What's the return type of 3-arg similar for a CuArray? I think that's the main difference between copy_oftype (uses 2-arg similar) and copy_similar.

3-arg similar demotes back to (Cu)Array, while 2-arg preserves the structure:

julia> a = CUDA.rand(10,10);

julia> typeof(similar(Hermitian(a, :L), Float32, size(Hermitian(a, :L))))
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

julia> typeof(similar(Hermitian(Array(a), :L), Float32, size(Hermitian(Array(a), :L))))
Matrix{Float32} (alias for Array{Float32, 2})

julia> typeof(similar(Hermitian(a, :L), Float32))
Hermitian{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}

julia> typeof(similar(Hermitian(Array(a), :L), Float32))
Hermitian{Float32, Matrix{Float32}}

But this is, as demonstrated, similar to how Array works.
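The Array half of this comparison can be checked without a GPU. The sketch below uses only the LinearAlgebra stdlib; the variable names are illustrative. It shows the two return types and the differing IndexStyle of source and destination when copying into the 3-arg result, which is presumably the combination that has no vectorized path for GPU arrays.

```julia
using LinearAlgebra

A = rand(Float32, 4, 4)
H = Hermitian(A, :L)

# 2-arg similar keeps the Hermitian wrapper; 3-arg similar returns a plain array.
wrapped = similar(H, Float32)
plain = similar(H, Float32, size(H))
@assert wrapped isa Hermitian{Float32, Matrix{Float32}}
@assert plain isa Matrix{Float32}

# Source and destination use different indexing styles.
@assert IndexStyle(H) isa IndexCartesian
@assert IndexStyle(plain) isa IndexLinear

# On the CPU, the mixed-style copy works and materializes the full
# Hermitian matrix into the plain array.
copyto!(plain, H)
@assert plain == Matrix(H)
@assert ishermitian(plain)
```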

@dkarrasch
Contributor

Aha, I think it requires the same fix as in JuliaLinearAlgebra/BandedMatrices.jl#276. If you overload (potentially in a VERSION branch)

LinearAlgebra.cholcopy(A::RealHermSymComplexHerm{<:Any,<:CuArray}) =
    copyto!(similar(A, LinearAlgebra.choltype(A)), A)

does that fix the issue? The reason I generically switched to copy_similar (and hence 3-arg similar) is to have writable copies that we can pass to the in-place methods. That is what you need to pass structured matrices from LinearAlgebra to decomposition-related functions, but apparently breaks other packages that have opposite requirements.

@maleadt
Member Author

maleadt commented Jan 2, 2023

Yep, that seems to work, thanks!

@maleadt
Member Author

maleadt commented Jan 5, 2023

All tests work on the beta3 branch from JuliaLang/julia#48075.
