Export dotu with generic fallbacks #8300

Closed
dlfivefifty opened this issue Sep 10, 2014 · 18 comments
Labels
speculative: whether the change will be implemented is speculative

Comments

@dlfivefifty
Contributor

I have the following inconsistency in Julia v0.3.0:

julia> BLAS.dotu([1.+im,2.],[3.,4.+im])
11.0 + 5.0im

julia> BLAS.dotu([1.,2.],[3.,4.])
ERROR: dotu has no method matching dotu(::Int64, ::Array{Float64,1}, ::Int64, ::Array{Float64,1}, ::Int64)
in dotu at linalg/blas.jl:156

@simonster
Member

dotu is only defined for complex arguments, both in BLAS and in Julia. If you want to compute a dot product of real vectors, you can just use dot.
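A minimal illustration of that suggestion, reusing the real vectors from the opening example:

julia> dot([1.,2.],[3.,4.])
11.0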

@dlfivefifty
Contributor Author

What if you want to write a function that uses dot for real arguments and dotu for complex arguments? (I.e., always do dotu.)

@johnmyleswhite
Member

Write a wrapper that gets inlined to each of them?
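A minimal sketch of what such a wrapper could look like (the name dotu_any is hypothetical; Julia 0.3-era syntax):

# Hypothetical wrapper: complex BLAS element types go to BLAS.dotu;
# for real vectors conjugation is a no-op, so dot gives the same result.
dotu_any{T<:Union(Complex64,Complex128)}(x::Vector{T}, y::Vector{T}) = BLAS.dotu(x, y)
dotu_any{T<:Real}(x::Vector{T}, y::Vector{T}) = dot(x, y)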

@dlfivefifty
Contributor Author

Yes, of course, but I can't see a good reason why dotu(Real, Real) != dot(Real, Real), not to mention dotu(Complex, Real) and dotu(Real, Complex).

@simonster
Member

Because BLAS.dotu is a wrapper for BLAS ?dotu. If we were to create an exported dotu function it ought to do this and also have a pure Julia fallback for non-BLAS element types. Such a function currently does not exist, but it might make sense to have.
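A sketch of what the pure Julia fallback part might look like (the name dotu_fallback is hypothetical and not in Base):

# Hypothetical generic fallback: an unconjugated dot product that works
# for any element type (e.g. BigFloat, Complex{BigFloat}); an exported
# dotu would additionally dispatch to BLAS.dotu for BLAS element types.
function dotu_fallback(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || error("vectors must have the same length")
    s = zero(eltype(x))*zero(eltype(y))
    for i = 1:length(x)
        s += x[i]*y[i]   # no conj here, unlike dot
    end
    s
end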

@dlfivefifty
Contributor Author

OK, that's fair enough. I'd be in favour of adding an exported dotu: it's a commonly used operation, and dotu(u,v) looks cleaner than dot(conj(u),v) or (u.'*v)[1].

@simonster simonster changed the title BLAS.dotu(::Vector{Float64},::Vector{Float64}) not working Export dotu and dotc with generic fallbacks Sep 10, 2014
@simonster simonster added the feature and speculative labels Sep 10, 2014
@jiahao
Member

jiahao commented Sep 10, 2014

I think of dotu as semantically a special case of .*; we could just add the BLAS calls as special methods of .* called on StridedArray{<:BlasFloat}s.

@simonster simonster changed the title Export dotu and dotc with generic fallbacks Export dotu with generic fallbacks Sep 10, 2014
@timholy
Member

timholy commented Sep 10, 2014

Out of curiosity, are the BLAS versions any faster than the Julia versions? At least for Float32, the gap has narrowed a lot thanks to SIMD.

@simonster
Member

@jiahao It's really sum(x.*y) or (x.'y)[1], though, isn't it? It would be great if we could just make that call dotu, but we don't have the machinery to do that.
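Using the vectors from the opening example to illustrate that equivalence:

julia> x = [1.+im, 2.]; y = [3., 4.+im];

julia> sum(x.*y)
11.0 + 5.0im

julia> BLAS.dotu(x, y)
11.0 + 5.0im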

@jiahao
Member

jiahao commented Sep 10, 2014

Oh right, duh.

@simonster
Member

@timholy That's worth testing. @andreasnoack said that the difference wasn't big, but at the moment the generic version doesn't use @simd, which would be necessary to let LLVM reassociate the adds, so it's probably at least a bit slower. The loop vectorizer was also not that great at optimizing summation the last time I tried it (#6928), but I never got around to trying with LLVM trunk after @ArchRobison fixed vectorization there.
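For reference, a sketch of the @simd-annotated form being discussed (real element types only; the name dotu_simd is hypothetical):

# Sketch: @simd lets LLVM reassociate the additions so the
# reduction can be vectorized; @inbounds drops the bounds checks.
function dotu_simd{T<:FloatingPoint}(x::Vector{T}, y::Vector{T})
    s = zero(T)
    @simd for i = 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end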

@andreasnoack
Member

@dlfivefifty I vaguely remember a discussion of dotu some time ago. The result was that I wrote a wrapper for dotu. A Google search revealed that the person asking was actually you. Hence, it appears that you are the only person who has asked for that function so far. I don't think we want to export another function unless the demand goes up. Hopefully, x.'y will return a scalar at some point, so that you can get a correctly working dotu without another export. (Right now x.'y also seems to be much slower than BLAS.dotu.)

@simonster The Julia implementation is faster for small n with Complex{Float64}, although the crossover point is rather low. For native Julia I get:

n = 50, time = 0.336
n = 100, time = 0.298
n = 150, time = 0.276
n = 200, time = 0.268
n = 250, time = 0.262
n = 300, time = 0.258
n = 350, time = 0.257
n = 400, time = 0.256
n = 450, time = 0.252
n = 500, time = 0.252
n = 550, time = 0.252
n = 600, time = 0.251
n = 650, time = 0.250
n = 700, time = 0.249
n = 750, time = 0.249
n = 800, time = 0.251
n = 850, time = 0.248
n = 900, time = 0.248
n = 950, time = 0.247
n = 1000, time = 0.247

And for BLAS

n = 50, time = 0.643
n = 100, time = 0.341
n = 150, time = 0.223
n = 200, time = 0.189
n = 250, time = 0.109
n = 300, time = 0.099
n = 350, time = 0.090
n = 400, time = 0.087
n = 450, time = 0.081
n = 500, time = 0.073
n = 550, time = 0.069
n = 600, time = 0.069
n = 650, time = 0.067
n = 700, time = 0.062
n = 750, time = 0.065
n = 800, time = 0.061
n = 850, time = 0.060
n = 900, time = 0.056
n = 950, time = 0.058
n = 1000, time = 0.055

Of course, the native Julia implementation can be made much faster for all n if it only uses the first element of the vectors. 😃

@dlfivefifty
Contributor Author

It's definitely low priority as there are easy workarounds, but I can't be the only person who wants to multiply and sum the entries of two vectors that are possibly complex :P

@simonster
Member

@andreasnoack Some of the overhead for small n in the Complex{Float64} case may come from the fact that the BLAS wrappers allocate a 1-element array for the output. Given that, I'm actually kind of surprised the crossing point is so low. We could allocate that array outside the function, but that wouldn't be thread-safe so we risk incurring the wrath of @JeffBezanson. We might also be able to use jl_alloca.

@simonster
Member

#8134 seems like the clean way to avoid that allocation.

@andreasnoack
Member

I think you are right. For Float64, BLAS is faster even for n = 50, which fits well with that explanation.

An alternative solution could possibly be that we'll be able to return Complex{FloatX}s from ccalls.

@vtjnash
Member

vtjnash commented Sep 11, 2014

If I can get Jeff to merge it soon, #8134 is intended to address exactly that observation. (It allows the compiler to automatically use alloca for exactly this use case.)

@KristofferC
Member

Explicitly calling BLAS functions is an "expert" feature, and callers should conform to the argument types that the BLAS functions themselves take.
