broadcast over view doesn't compile to SIMD code #48890
Comments
Do note that

```julia
julia> @time using FastBroadcast
  0.314388 seconds (1.09 M allocations: 71.700 MiB, 6.95% gc time, 3.68% compilation time)

julia> function foo!(u)
           u .*= 0.9
           nothing
       end
foo! (generic function with 1 method)

julia> function bar!(u)
           for i in eachindex(u)
               u[i] *= 0.9
           end
       end
bar! (generic function with 1 method)

julia> function goo!(u)
           @.. u *= 0.9
           nothing
       end
goo! (generic function with 1 method)

julia> @btime foo!(x) setup = (x = rand(1000))
  36.572 ns (0 allocations: 0 bytes)

julia> @btime foo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  369.223 ns (0 allocations: 0 bytes)

julia> @btime bar!(x) setup = (x = rand(1000))
  41.786 ns (0 allocations: 0 bytes)

julia> @btime bar!(view(x, :, 1)) setup = (x = rand(1000, 2))
  42.763 ns (0 allocations: 0 bytes)

julia> @btime goo!(x) setup = (x = rand(1000))
  35.522 ns (0 allocations: 0 bytes)

julia> @btime goo!(view(x, :, 1)) setup = (x = rand(1000, 2))
  37.805 ns (0 allocations: 0 bytes)
```
Looking into this more closely, the issue seems to be not that the code can't use SIMD, but that LLVM determines at runtime that vectorizing it is illegal and would result in undefined behavior, because the input and output are detected to alias:
The original measurements also appear to turn out to be a benchmarking mistake, mis-interpreting DCE (from inlining) to SCEV. Here's a better picture of the performance:

```julia
julia> @btime @noinline(foo!(x)) setup = (x = rand(1000))
  69.681 ns (0 allocations: 0 bytes)

julia> @btime @noinline(foo!(view(x, :, 1))) setup = (x = rand(1000, 2))
  339.908 ns (1 allocation: 16 bytes)

julia> @btime @noinline(bar!(x)) setup = (x = rand(1000))
  69.681 ns (0 allocations: 0 bytes)

julia> @btime @noinline(bar!(view(x, :, 1))) setup = (x = rand(1000, 2))
  614.414 ns (1 allocation: 16 bytes)
```
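For anyone who wants to check what LLVM emits in each case, here is a rough sketch that dumps the optimized IR for `foo!` and `bar!` on a `Vector` and on a column view, then looks for a vector-of-`Float64` type. `has_vector_ir` and `ColView` are hypothetical names made up for this example, not part of Julia. The result depends on the Julia version and the CPU, so no expected output is given.

```julia
using InteractiveUtils  # stdlib; provides code_llvm

# Same kernels as in the comments above.
function foo!(u)
    u .*= 0.9
    nothing
end

function bar!(u)
    for i in eachindex(u)
        u[i] *= 0.9
    end
end

# Hypothetical helper (an assumption of this sketch, not a Julia API):
# returns true if the optimized LLVM IR for f(argtypes...) mentions a
# vector-of-double type such as <4 x double>.
function has_vector_ir(f, argtypes)
    ir = sprint(io -> code_llvm(io, f, argtypes; debuginfo = :none))
    return occursin(r"<\d+ x double>", ir)
end

# Concrete type of a view of one matrix column.
const ColView = typeof(view(rand(4, 2), :, 1))

for (name, f) in (("foo!", foo!), ("bar!", bar!)), T in (Vector{Float64}, ColView)
    println(name, " on ", T, ": vector IR present = ", has_vector_ir(f, (T,)))
end
```

Note that finding vector types only shows that a vectorized path was generated. If LLVM also emitted a runtime alias check, as described above, the scalar path may still be the one that runs, so this complements the timings rather than replacing them.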
I am opening this issue at the suggestion of @Moelf, after discussions on Slack.
Broadcasting over a view of a column of a matrix is considerably slower than looping, apparently because SIMD optimization is not performed in this case. This might be related to issue #43153. Here is an MWE: