broadcast over view doesn't compile to SIMD code #48890

rmsrosa · 2023-03-04T09:16:04Z

I am opening this issue at the suggestion of @Moelf, after discussions on Slack.

Broadcasting over a view of a column of a matrix is quite slower than looping, apparently because SIMD optimization is not performed in this case. Might be related to issue #43153. Here is a MWE:

function foo!(u)
    u .*= 0.9
    nothing
end

function bar!(u)
    for i in eachindex(u)
        u[i] *= 0.9
    end
end

julia> @btime foo!(x) setup = (x = rand(1000))

  100.477 ns (0 allocations: 0 bytes)

julia> @btime foo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  326.419 ns (0 allocations: 0 bytes)

julia> @btime bar!(x) setup = (x = rand(1000))

  99.321 ns (0 allocations: 0 bytes)

julia> @btime bar!(view(x, :, 1)) setup = (x = rand(1000, 2))
  101.955 ns (0 allocations: 0 bytes)

Moelf · 2023-03-04T13:46:33Z

#48801

chriselrod · 2023-03-11T05:47:36Z

Do note that FastBroadcast.jl doesn't have this problem

julia> @time using FastBroadcast
  0.314388 seconds (1.09 M allocations: 71.700 MiB, 6.95% gc time, 3.68% compilation time)

julia> function foo!(u)
           u .*= 0.9
           nothing
       end
foo! (generic function with 1 method)

julia> function bar!(u)
           for i in eachindex(u)
               u[i] *= 0.9
           end
       end
bar! (generic function with 1 method)

julia> function goo!(u)
           @.. u *= 0.9
           nothing
       end
goo! (generic function with 1 method)

julia> @btime foo!(x) setup = (x = rand(1000))
  36.572 ns (0 allocations: 0 bytes)

julia> @btime foo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  369.223 ns (0 allocations: 0 bytes)

julia> @btime bar!(x) setup = (x = rand(1000))
  41.786 ns (0 allocations: 0 bytes)

julia> @btime bar!(view(x, :, 1)) setup = (x = rand(1000, 2))
  42.763 ns (0 allocations: 0 bytes)

julia> @btime goo!(x) setup = (x = rand(1000))
  35.522 ns (0 allocations: 0 bytes)

julia> @btime goo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  37.805 ns (0 allocations: 0 bytes)

vtjnash · 2023-10-10T19:00:19Z

Looking into this closer, the issue seems to be not that it can't SIMD, but that LLVM determines at runtime that vectorization of this code is illegal and would result in undefined behavior, because the input and output are detected to alias:

vector.memcheck:                                  ; preds = %L64.preheader                                                  
     %scevgep183 = getelementptr {}*, {}** %4, i64 %"u::SubArray.unbox.unpack153"                                                                                                                                                                       
     %5 = add i64 %0, %"u::SubArray.unbox.unpack153"                                                                                                                                                                                                    
     %scevgep169184 = getelementptr {}*, {}** %4, i64 %5                                                                                                                                                                                                
     %scevgep171185 = getelementptr {}*, {}** %3, i64 %"u::SubArray.unbox.unpack153"                                                                                                                                                                    
     %scevgep173186 = getelementptr {}*, {}** %3, i64 %5                                                                                                                                                                                                
     %bound0 = icmp ult {}** %scevgep183, %scevgep173186                                                                                                                                                                                                
     %bound1 = icmp ult {}** %scevgep171185, %scevgep169184                                                                                                                                                                                             
     %found.conflict = and i1 %bound0, %bound1                                                                              
     br i1 %found.conflict, label %scalar.ph, label %vector.ph

The original measurements also appear to turn out to be a benchmarking mistake, mis-interpreting DCE (from inlining) to SCEV. Here's a better picture of the performance:

julia> @btime @noinline(foo!(x)) setup = (x = rand(1000))
  69.681 ns (0 allocations: 0 bytes)

julia> @btime @noinline(foo!(view(x, :, 1))) setup = (x = rand(1000, 2))
  339.908 ns (1 allocation: 16 bytes)

julia> @btime @noinline(bar!(x)) setup = (x = rand(1000))
  69.681 ns (0 allocations: 0 bytes)

julia> @btime @noinline(bar!(view(x, :, 1))) setup = (x = rand(1000, 2))
  614.414 ns (1 allocation: 16 bytes)

giordano added performance Must go faster broadcast Applying a function over a collection labels Mar 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

broadcast over view doesn't compile to SIMD code #48890

broadcast over view doesn't compile to SIMD code #48890

rmsrosa commented Mar 4, 2023

Moelf commented Mar 4, 2023

chriselrod commented Mar 11, 2023

vtjnash commented Oct 10, 2023

broadcast over view doesn't compile to SIMD code #48890

broadcast over view doesn't compile to SIMD code #48890

Comments

rmsrosa commented Mar 4, 2023

Moelf commented Mar 4, 2023

chriselrod commented Mar 11, 2023

vtjnash commented Oct 10, 2023