Rework the GPUCompiler interface to avoid needless compiler specialization #227
Conversation
Despite the refactor, I'm still seeing excessive compilation by Julia when launching the first CUDA.jl kernel. I'm going to summarize my findings here so that I can link a couple of people to this. The problem is that certain methods covered by GPUCompiler's precompilation directives get recompiled when they are used from CUDA.jl. That used to be caused by specialization and by invalidations, but I think I have eliminated those in this PR (and in the CUDA.jl counterpart, JuliaGPU/CUDA.jl#1066). As an example, let's look at the method instance of GPUCompiler.emit_llvm:

julia> using GPUCompiler, MethodAnalysis
julia> mi = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)
# a method was precompiled
julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88 … 0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)
julia> mi.cache.min_world
0x00000000000079fd
julia> mi.cache.max_world
0x0000000000000000

IIUC, loading CUDA.jl does not invalidate that method instance, since the world bounds of the cached code instance remain the same:

julia> using CUDA
julia> mi2 = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)
julia> mi2 === mi
true
# precompilation result still valid
julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88 … 0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)
julia> mi.cache.min_world
0x00000000000079fd
julia> mi.cache.max_world
0x0000000000000000

HOWEVER, when I actually trigger compilation, the CI cache gets a second entry! The only fields that differ between these two CIs are max_world and precompile:

julia> @cuda identity(nothing)
CUDA.HostKernel{typeof(identity), Tuple{Nothing}}(identity, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8), CuModule(Ptr{Nothing} @0x00000000037c7100, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8)), CuFunction(Ptr{Nothing} @0x0000000004f97390, CuModule(Ptr{Nothing} @0x00000000037c7100, CuContext(0x00000000019ee6d0, instance b1996a79f885a5d8))))
julia> mi3 = only(methodinstances(GPUCompiler.emit_llvm))
MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance)
julia> mi3 === mi
true
julia> mi.cache
Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), Core.CodeInstance(MethodInstance for GPUCompiler.emit_llvm(::CompilerJob, ::Core.MethodInstance), #undef, 0x00000000000079fd, 0x0000000000000000, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88 … 0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, true, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000), 0x00000000000079fd, 0xffffffffffffffff, Tuple{Any, NamedTuple{(:entry, :compiled), _A} where _A<:Tuple{LLVM.Function, Any}}, #undef, UInt8[0x0c, 0x03, 0x00, 0x00, 0x00, 0x08, 0x08, 0x08, 0x16, 0x88 … 0x01, 0x11, 0x02, 0x2b, 0x3c, 0x00, 0xbf, 0x3d, 0x01, 0x01], false, false, Ptr{Nothing} @0x0000000000000000, Ptr{Nothing} @0x0000000000000000)
julia> for field in (:def, :inferred, :invoke, :isspecsig, :max_world, :min_world, :rettype, :specptr, :precompile)
@show field getfield(mi.cache, field) == getfield(mi.cache.next, field)
end
field = :def
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :inferred
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :invoke
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :isspecsig
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :max_world
getfield(mi.cache, field) == getfield(mi.cache.next, field) = false
field = :min_world
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :rettype
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :specptr
getfield(mi.cache, field) == getfield(mi.cache.next, field) = true
field = :precompile
getfield(mi.cache, field) == getfield(mi.cache.next, field) = false
julia> mi.cache.precompile
false
julia> mi.cache.max_world
0xffffffffffffffff
julia> mi.cache.next.precompile
true
julia> mi.cache.next.max_world
0x0000000000000000

I don't understand why we are inferring a new version of emit_llvm here.
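For reference, the chain of cached CodeInstances inspected above can be walked with a small helper like the following (a sketch; the field names match the CodeInstance layout of the Julia version used here and may differ on other versions). Calling it on mi after the kernel launch should print the two entries shown above, differing only in max_world and precompile:

julia> function walk_cache(mi::Core.MethodInstance)
           ci = isdefined(mi, :cache) ? mi.cache : nothing
           while ci !== nothing
               # world bounds and precompile flag of every cached CodeInstance
               @show ci.min_world ci.max_world ci.precompile
               ci = isdefined(ci, :next) ? ci.next : nothing
           end
       end;

julia> walk_cache(mi)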
You can achieve this with the following combination:

f(@nospecialize(x)) = 1
g(x) = f(Base.inferencebarrier(x)) # this causes inference to use `Any` as the type of `x` when inferring `f`
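A quick way to convince yourself that this combination avoids per-argument-type specialization (a sketch; MethodAnalysis is only used for inspection here):

julia> using MethodAnalysis

julia> f(@nospecialize(x)) = 1;

julia> g(x) = f(Base.inferencebarrier(x));

julia> g(1); g("one");

julia> methodinstances(f)  # should list only the f(::Any) instance, not f(::Int64) or f(::String)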
Possibly const-prop? Try breaking that by push/pop to a global container.
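A minimal sketch of that trick (the names below are made up): round-tripping the value through a global Vector{Any} hides both its concrete type and its value from the compiler, much like Base.inferencebarrier does:

const ARG_STASH = Any[]  # hypothetical scratch container

function launder(@nospecialize(x))
    # push/pop through a Vector{Any} so the compiler cannot constant-propagate
    # or type-specialize on `x`
    push!(ARG_STASH, x)
    return pop!(ARG_STASH)
end

g(x) = f(launder(x))  # reusing `f` from the snippet above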
After the recent refactor of CompilerJob, I don't see unnecessary specialization anymore:

# define a new kernel
julia> bar(x) = nothing
bar (generic function with 1 method)
julia> Metal.mtlfunction(bar, Tuple{Nothing})
precompile(Tuple{typeof(GPUCompiler.get_world_generator), Any, Type{Type{typeof(Main.bar)}}, Type{Type{Tuple{Nothing}}}})
precompile(Tuple{typeof(Metal.mtlfunction), typeof(Main.bar), Type{Tuple{Nothing}}})
precompile(Tuple{typeof(Base.vect), Type{typeof(Main.bar)}, Vararg{DataType}})
precompile(Tuple{Type{Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}}, Function, Metal.MTL.MTLComputePipelineStateInstance})
precompile(Tuple{typeof(Base.show), Base.IOContext{Base.TTY}, Base.Multimedia.MIME{Symbol("text/plain")}, Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}})
precompile(Tuple{typeof(Base.sizeof), Metal.HostKernel{typeof(Main.bar), Tuple{Nothing}}})
Metal.HostKernel{typeof(bar), Tuple{Nothing}}(bar, Metal.MTL.MTLComputePipelineStateInstance (object of type AGXG13XFamilyComputePipeline))
I'm trying to make it possible to precompile most of the compiler, but it's proving to be hard. First, I removed the function from the CompilerJob, since it's otherwise too easy to invalidate the precompilation results. I think that we also have to use @invokelatest with all of the GPUCompiler interfaces, because even with type assertions and @noinline the generic implementations of the interface are referred to literally in the compiled code. However, I'm still seeing emit_llvm getting re-compiled when using CUDA.jl, even though I don't immediately spot invalidations that would explain this (is it possible to go backwards -- start from a method that got invalidated to find the definition that invalidated it?). Strangely, the method instance also lists the specific compiler target, even though job is passed as @nospecialize in emit_llvm...
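As an illustration (a sketch, not GPUCompiler's actual code; runtime_module and load_runtime are made-up names), calling an interface function through Base.invokelatest with a return-type assertion keeps the caller's compiled code from referencing the generic implementation directly, while the assertion still gives inference a concrete return type:

# Overridable interface function; back-ends add methods for their own job types.
runtime_module(@nospecialize(job)) = error("not implemented")

function load_runtime(@nospecialize(job))
    # invokelatest forces a dynamic call in the latest world, so the caller has no
    # backedge to (and is not invalidated by) whichever method ends up being called;
    # the type assertion keeps the caller's return type concrete.
    return Base.invokelatest(runtime_module, job)::Module
end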