Performance of vectorized evaluations #96
Very quick comments:
Thanks, @spencerlyon2.
Given the high amount of memory allocation in the test run for Interpolations.jl, I'm suspicious of the splatting you use for the indexing... Will investigate further. Edit: splatting isn't the entire problem, but it's a big part of it; exchanging the indexing line for an explicit, non-splatted version helps a lot. Edit 2: the second half seems to be a type instability in the creation of…
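A minimal sketch of the kind of change meant here (illustrative only, not the actual benchmark code from the linked script; the function names and sizes are made up):

```julia
using Interpolations

# Hypothetical setup: a 3-D scalar interpolant and a list of evaluation points.
itp = interpolate(rand(50,50,50), BSpline(Quadratic(Reflect())), OnCell())
points = [2 + 46*rand(3) for i in 1:10^5]   # Vector of 3-element coordinate vectors

# Splatting each point into getindex has historically been a big allocation source:
function eval_splatted(itp, points)
    s = 0.0
    for x in points
        s += itp[x...]               # splat: may allocate on every evaluation
    end
    s
end

# Writing the indices out explicitly avoids it:
function eval_explicit(itp, points)
    s = 0.0
    for x in points
        s += itp[x[1], x[2], x[3]]   # no splatting
    end
    s
end
```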
(Cubic splines were tested with …) As you see, Interpolations.jl is now slightly faster than splines.jl.
I'm also guessing true vector-valued interpolation is actually a little broken at the moment:
We probably need to look at that closer... (ref #55)
Ah, excellent! @tlycken Your explanation about the splatting makes perfect sense. I just tried it and get similar results on my laptop.
For vector-valued interpolation, I recommend FixedSizeArrays. Check out this awe-inspiring result:

```julia
using FixedSizeArrays, Interpolations

a = rand(5, 200)
b = reinterpret(Vec{5,Float64}, a, (200,))   # view `a` as 200 Vec{5,Float64} elements
bitp = interpolate(b, BSpline(Quadratic(Reflect())), OnCell())

function foo(itp, X)
    s = zero(eltype(itp))
    for x in X
        s += itp[x]
    end
    s
end

X = 2 + 150*rand(10^6)
```

```
# After warming up with one call to `foo(bitp, X)`:
julia> @time foo(bitp, X)
  0.027119 seconds (5 allocations: 208 bytes)
FixedSizeArrays.Vec{5,Float64}((466133.70976438577,522291.33201962017,540706.6050416315,468993.60840671643,489241.67073413206))
```

(The most awe-inspiring part being that it allocated just 208 bytes.) CCing @SimonDanisch, just for fun.
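(The allocation savings presumably come from `Vec{5,Float64}` being an immutable, fixed-size type: the compiler can keep the intermediate 5-vectors in registers or on the stack instead of heap-allocating an `Array` for every arithmetic operation.)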
Ah, we're talking about different types of vector-valued. I was referring to evaluating a scalar-valued interpolant in many places by passing a vector of coordinates; you were talking about evaluating a vector-valued interpolant efficiently. For the latter, FixedSizeArrays is definitely the way to go.
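To make the distinction concrete, a small sketch (hypothetical names, using the same API as the example above):

```julia
using Interpolations

# Case 1: scalar-valued interpolant, evaluated at a vector of coordinates --
# one scalar result per coordinate.
itp_scalar = interpolate(rand(100), BSpline(Quadratic(Reflect())), OnCell())
xs = 2 + 96*rand(1000)
vals = [itp_scalar[x] for x in xs]

# Case 2: vector-valued interpolant (one Vec{5,Float64} per grid point),
# evaluated at a single coordinate -- the FixedSizeArrays example above,
# where bitp[x] returns a whole 5-vector at once.
```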
@timholy: very impressive example indeed.
(To be clear, the part I find awe-inspiring is being done by LLVM and Julia's code generation/compiler, in figuring out that it can elide the memory allocations needed for all the intermediate computations. Interpolations.jl's and FixedSizeArrays.jl's main contributions are "simply" to leverage that compiler effectively.)
Are there any more issues to address, or can this be closed? |
@timholy if you prefer to move the discussion somewhere else and close this issue, I'm OK with it. There are still two things I'm curious about:
If I build on my previous benchmark, I get a very different result from yours:
Only a 2-fold slowdown for interpolation with 5-vectors. I haven't looked at your benchmark, though, and am too busy with other things to dig into it now. Regarding SIMD, I think we only use it if it's automatically added by Julia (which it might be).
In the script you linked, you are comparing scalar quadratic interpolation with vector-valued cubic :) Fixing that, JuliaBox gives me
i.e. about a factor-11 slowdown, which is about what I'd expect. I don't think we're using SIMD instructions much, but we gain much of the speed from heavy use of metaprogramming; the compiled indexing expression consists literally of nothing but adds, multiplies and array lookups from the underlying data array.
With …, which makes me think it may be automatically added.
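The snippet referenced above didn't survive here, but one standard way to check (a sketch, reusing the `foo`, `bitp` and `X` definitions from earlier in the thread) is to inspect the LLVM IR for the hot loop:

```julia
# If Julia/LLVM auto-vectorized the loop, the printed IR will contain
# SIMD vector types such as <2 x double> or <4 x double>.
@code_llvm foo(bitp, X)
```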
@tlycken and @timholy: point taken, this quadratic vs. cubic thing had me confused from the beginning... As for the gains of using the vector version (apart from conceptual elegance), I would then conclude that it works well in the quadratic case (twice as fast), not so much in the cubic one. Is that expected?
And before this issue is closed, I have one last question: could you explain very briefly where the performance of Interpolations.jl comes from? @spencerlyon2 explained a bit to me about the precomputation of tensor products, which is something I've also tried to implement, but apparently it was not enough.
I think the easiest way to showcase why Interpolations.jl is so fast is to introspect the implementation. Take a look at the code we generate:
There are three main blocks of code here:
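The generated code and the description of its blocks were lost above, so here is a hand-written sketch of roughly what the specialized 1-D quadratic indexing looks like (an approximation, not the literal generator output), with the three blocks marked:

```julia
# Block 1: locate the interpolation cell.
ix = round(Int, x)          # nearest grid index
fx = x - ix                 # fractional offset in [-1/2, 1/2]

# Block 2: compute the three quadratic B-spline weights.
cm = (fx - 0.5)^2 / 2       # weight for c[ix-1]
c0 = 0.75 - fx^2            # weight for c[ix]
cp = (fx + 0.5)^2 / 2       # weight for c[ix+1]

# Block 3: the weighted sum -- nothing but adds, multiplies and lookups.
ret = cm*c[ix-1] + c0*c[ix] + cp*c[ix+1]
```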
Now, the beauty comes when we do this in more dimensions. I'll look at linear interpolation here, mainly for brevity, but the same principle applies at higher orders too:
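Again a hand-written approximation rather than the real generated output, assuming a coefficient matrix `c` and an evaluation point `(x, y)`:

```julia
# Locate the cell in each dimension.
ix = floor(Int, x); fx = x - ix
iy = floor(Int, y); fy = y - iy

# Per-dimension linear weights combined as a tensor product of the four
# surrounding coefficients -- the "precomputation of tensor products"
# Spencer mentioned.
ret = (1-fx) * ((1-fy)*c[ix,   iy] + fy*c[ix,   iy+1]) +
          fx * ((1-fy)*c[ix+1, iy] + fy*c[ix+1, iy+1])
```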
Basically the same code, only now it works for two dimensions (the main difference between the linear and cubic case is the number of coefficients used in the calculation). Still no cruft, looping or other machinery for handling interpolation in general - the code is specialized for this exact case. This is probably what Spencer was referring to as precomputation of tensor products. The speed of Interpolations.jl comes mainly from the fact that we're able to generate specialized code like this, coupled with Julia's ability to very efficiently dispatch to the right version of the code, and to optimize it aggressively (since it comes out quite simple).
...and with that short book, I think I will close this :)
Thank you, @tlycken! This looks rather close to what I was doing. I'll need to investigate a bit more to understand the difference fully.
Thanks @timholy, it's nice to see use cases for FixedSizeArrays!
Out of curiosity I just ran some performance comparisons between Interpolations.jl and splines.jl. The latter is a small library which does only cubic spline interpolation, in any dimension, with gradient evaluation, and which I recently updated to use Julia macros.
I found it to be much faster than Interpolations.jl when evaluated on a large number of points, even when compared with Interpolations.jl's quadratic interpolation. The code to generate the comparison is here: https://github.com/EconForge/splines.jl/blob/master/test/speed_comparison.jl
and here is the output I get on my machine (50×50×50 grid, evaluation at 100,000 points):
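(The timing numbers themselves didn't survive here. For context, a minimal sketch of the kind of loop being timed, assuming the Interpolations.jl API of that era; names and sizes are illustrative, not the linked script itself:)

```julia
using Interpolations

A = rand(50, 50, 50)
itp = interpolate(A, BSpline(Cubic(Line())), OnGrid())

points = 1 + 49*rand(100000, 3)    # 100,000 random points inside the grid

function evaluate_all(itp, points)
    s = 0.0
    for i in 1:size(points, 1)
        s += itp[points[i,1], points[i,2], points[i,3]]
    end
    s
end

evaluate_all(itp, points)          # warm-up call to trigger compilation
@time evaluate_all(itp, points)    # the timed run
```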
Here are a few remarks/questions that come to mind: