Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broadcast over view doesn't compile to SIMD code #48890

Open
rmsrosa opened this issue Mar 4, 2023 · 3 comments
Open

broadcast over view doesn't compile to SIMD code #48890

rmsrosa opened this issue Mar 4, 2023 · 3 comments
Labels
broadcast Applying a function over a collection performance Must go faster

Comments

@rmsrosa
Copy link

rmsrosa commented Mar 4, 2023

I am opening this issue at the suggestion of @Moelf, after discussions on Slack.

Broadcasting over a view of a column of a matrix is quite slower than looping, apparently because SIMD optimization is not performed in this case. Might be related to issue #43153. Here is a MWE:

function foo!(u)
    u .*= 0.9
    nothing
end

function bar!(u)
    for i in eachindex(u)
        u[i] *= 0.9
    end
end
julia> @btime foo!(x) setup = (x = rand(1000))

  100.477 ns (0 allocations: 0 bytes)

julia> @btime foo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  326.419 ns (0 allocations: 0 bytes)

julia> @btime bar!(x) setup = (x = rand(1000))

  99.321 ns (0 allocations: 0 bytes)

julia> @btime bar!(view(x, :, 1)) setup = (x = rand(1000, 2))
  101.955 ns (0 allocations: 0 bytes)
@Moelf
Copy link
Contributor

Moelf commented Mar 4, 2023

#48801

@giordano giordano added performance Must go faster broadcast Applying a function over a collection labels Mar 7, 2023
@chriselrod
Copy link
Contributor

Do note that FastBroadcast.jl doesn't have this problem

julia> @time using FastBroadcast
  0.314388 seconds (1.09 M allocations: 71.700 MiB, 6.95% gc time, 3.68% compilation time)

julia> function foo!(u)
           u .*= 0.9
           nothing
       end
foo! (generic function with 1 method)

julia> function bar!(u)
           for i in eachindex(u)
               u[i] *= 0.9
           end
       end
bar! (generic function with 1 method)

julia> function goo!(u)
           @.. u *= 0.9
           nothing
       end
goo! (generic function with 1 method)

julia> @btime foo!(x) setup = (x = rand(1000))
  36.572 ns (0 allocations: 0 bytes)

julia> @btime foo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  369.223 ns (0 allocations: 0 bytes)

julia> @btime bar!(x) setup = (x = rand(1000))
  41.786 ns (0 allocations: 0 bytes)

julia> @btime bar!(view(x, :, 1)) setup = (x = rand(1000, 2))
  42.763 ns (0 allocations: 0 bytes)

julia> @btime goo!(x) setup = (x = rand(1000))
  35.522 ns (0 allocations: 0 bytes)

julia> @btime goo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  37.805 ns (0 allocations: 0 bytes)

@vtjnash
Copy link
Member

vtjnash commented Oct 10, 2023

Looking into this closer, the issue seems to be not that it can't SIMD, but that LLVM determines at runtime that vectorization of this code is illegal and would result in undefined behavior, because the input and output are detected to alias:

vector.memcheck:                                  ; preds = %L64.preheader                                                  
     %scevgep183 = getelementptr {}*, {}** %4, i64 %"u::SubArray.unbox.unpack153"                                                                                                                                                                       
     %5 = add i64 %0, %"u::SubArray.unbox.unpack153"                                                                                                                                                                                                    
     %scevgep169184 = getelementptr {}*, {}** %4, i64 %5                                                                                                                                                                                                
     %scevgep171185 = getelementptr {}*, {}** %3, i64 %"u::SubArray.unbox.unpack153"                                                                                                                                                                    
     %scevgep173186 = getelementptr {}*, {}** %3, i64 %5                                                                                                                                                                                                
     %bound0 = icmp ult {}** %scevgep183, %scevgep173186                                                                                                                                                                                                
     %bound1 = icmp ult {}** %scevgep171185, %scevgep169184                                                                                                                                                                                             
     %found.conflict = and i1 %bound0, %bound1                                                                              
     br i1 %found.conflict, label %scalar.ph, label %vector.ph                                                              

The original measurements also appear to turn out to be a benchmarking mistake, mis-interpreting DCE (from inlining) to SCEV. Here's a better picture of the performance:

julia> @btime @noinline(foo!(x)) setup = (x = rand(1000))
  69.681 ns (0 allocations: 0 bytes)

julia> @btime @noinline(foo!(view(x, :, 1))) setup = (x = rand(1000, 2))
  339.908 ns (1 allocation: 16 bytes)

julia> @btime @noinline(bar!(x)) setup = (x = rand(1000))
  69.681 ns (0 allocations: 0 bytes)

julia> @btime @noinline(bar!(view(x, :, 1))) setup = (x = rand(1000, 2))
  614.414 ns (1 allocation: 16 bytes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection performance Must go faster
Projects
None yet
Development

No branches or pull requests

5 participants