
speed up expansion and lowering of ccall macro #50077

Merged 1 commit into master from jb/fasteratccall on Jun 12, 2023

Conversation

JeffBezanson (Member)

Test case:

julia> ex = quote
           function cublasCher2_v2_64(handle, uplo, n, alpha, x, incx, y, incy, A, lda)
               @ccall libcublas.cublasCher2_v2_64(handle::cublasHandle_t, uplo::cublasFillMode_t,
                                               n::Int64, alpha::RefOrCuRef{cuComplex},
                                               x::CuPtr{cuComplex}, incx::Int64,
                                               y::CuPtr{cuComplex}, incy::Int64,
                                               A::CuPtr{cuComplex}, lda::Int64)::cublasStatus_t
           end
       end

julia> using BenchmarkTools

julia> @btime Meta.lower(Main, ex)

Before: 1.33 ms, after: 0.680 ms.
I do like the idea of @ccall producing a foreigncall expression, but there are two problems leading to noticeable load-time differences in some packages: (1) it does some unnecessary work forming strings to make temporary identifier names, and (2) normal variables (slots) are more expensive to analyze than ssavalues. It would probably help if macros were somehow able to generate ssavalues. In the meantime, this cuts the lowering time in half by making @ccall produce a "classic" ccall call expression (plus a new cconv Expr head to retain the ability to express calling conventions and correct varargs).
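For reference, the expansion can be inspected with @macroexpand. A minimal sketch using the strlen example from the @ccall docstring (output elided, since its exact shape differs across Julia versions):

julia> @macroexpand @ccall strlen("hello"::Cstring)::Csize_t

Before this change the result is a begin...end block that binds gensym'd local variables (slots) for the function name and each converted argument, ending in $(Expr(:foreigncall, ...)); after it, the result is roughly :(ccall(:strlen, Csize_t, (Cstring,), "hello")), a plain call expression that lowering already handles cheaply, with the new Expr(:cconv, ...) head carrying the calling convention only when one is given.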

@JeffBezanson added the compiler:lowering (Syntax lowering: compiler front end, 2nd stage) and compiler:latency (Compiler latency) labels on Jun 6, 2023
maleadt (Member) commented on Jun 6, 2023

Thanks, this does indeed improve the time it takes to lower @ccall expressions! Still remarkably slow at 0.5ms per ccall (CUDA.jl has thousands), but I won't say no to a 50% speed-up 🙂
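(Back-of-the-envelope, from the numbers above: a few thousand @ccall sites at ~0.5 ms of lowering each comes to roughly 1-2 s spent in lowering alone, which is why this is noticeable in package load times.)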

(1) it does some unnecessary work forming strings to make temporary identifier names

Is that really significant? Benchmarking ccall_macro_lower directly takes only a couple of µs, vs. many hundreds of µs when calling all of lowering. Or does it have a knock-on effect on later lowering?

For reference, this was the benchmark script I was using:

using BenchmarkTools

call = :(
    libfoo.bar(a::A, b::B, c::C, d::D, e::E)::X
)

println("ccall_macro_parse:")
x = Base.ccall_macro_parse(call)
display(@benchmark Base.ccall_macro_parse(call))
# ~230 ns

println("ccall_macro_lower:")
Base.ccall_macro_lower(:ccall, x...)
display(@benchmark Base.ccall_macro_lower(:ccall, x...))
# ~4 µs

println("the above, but via lowering:")
macro_call = :(
    @ccall $call
)
lower(ex::Expr, mod::Module=Main, file::String="", line::Int=0) =
    ccall(:jl_expand_with_loc_warn, Any, (Any, Any, Cstring, Cint), ex, mod, file, line)
lower(macro_call)
display(@benchmark lower(macro_call))
# ~500 µs

println("plain ccall:")
plain_ccall = :(
    # note: ccall takes a literal tuple of argument types, not Tuple{...}
    ccall((:bar, :libfoo), X, (A, B, C, D, E), a, b, c, d, e)
)
lower(plain_ccall)
display(@benchmark lower(plain_ccall))
# ~8 µs
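(Reading these numbers: ccall_macro_parse and ccall_macro_lower together account for only ~5 µs, so the remaining ~495 µs is spent lowering the already-expanded code. That would point to a knock-on effect: the expansion introduces many named temporaries, which become slots that lowering must analyze as full variables.)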

JeffBezanson (Member, Author) commented on Jun 8, 2023

I guess not; all the cost may very well be from using "normal" identifiers (which have lots of features!) for labeling temporary values, plus the cost of macroexpansion.
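A quick way to see the slot-vs-ssavalue distinction (a generic illustration, not code from this PR): lower a small function and note that the named local shows up as a slot, while intermediate results are SSA values (%1, %2, ...):

julia> Meta.lower(Main, :(function demo(x)
           t = f(x)            # the named local `t` becomes a slot
           return g(t) + h(t)  # intermediate results become SSA values
       end))

Slots carry all the machinery of real variables (scoping, capture, and assignment analysis), which is why labeling temporaries with gensym'd identifiers, as the old @ccall expansion did, is more expensive than emitting values that are already in SSA form.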

@KristofferC merged commit 75bda64 into master on Jun 12, 2023
@KristofferC deleted the jb/fasteratccall branch on Jun 12, 2023