broadcast!(f, x, x) slower, prevents SIMD? #43153
Comments
Gist with the LLVM code generated for a similar problem, on an M1 Mac (native): https://gist.github.com/gbaraldi/9ed4d47fd7fabc55f5d013399af2862e
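To look at the IR locally rather than through the gist, something along these lines should work. This is only a sketch: the kernel name kernel! is made up for illustration and stands in for the broadcast loop, and whether the output contains a vector.memcheck block depends on the Julia version and the CPU.

```julia
using InteractiveUtils  # provides code_llvm

# Hypothetical kernel (not from the issue): an elementwise loop similar in
# shape to what broadcasting lowers to.
function kernel!(dest, src, c)
    @inbounds @simd for i in eachindex(dest, src)
        dest[i] = src[i] + c
    end
    return dest
end

# Capture the optimized LLVM IR as a string instead of printing it.
ir = sprint() do io
    code_llvm(io, kernel!, (Vector{Float64}, Vector{Float64}, Float64); debuginfo=:none)
end

# Platform-dependent: report whether a runtime alias check and vector types appear.
println("has vector.memcheck: ", occursin("vector.memcheck", ir))
println("has vector types:    ", occursin(r"<\d+ x double>", ir))
```

Printing ir in full shows the same kind of listing as the gist.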
My guess is that this is a runtime memory check by LLVM that decides not to take the SIMD path when it sees that the memory aliases. From the LLVM IR:
vector.memcheck:                                  ; preds = %L178.us65.preheader
%scevgep99 = getelementptr float, float* %19, i64 %24
%26 = shl i64 %23, 2
%27 = add i64 %26, 4
%28 = mul i64 %10, %27
%uglygep = getelementptr i8, i8* %20, i64 %28
%bound0 = icmp ugt i8* %uglygep, %scevgep96
%bound1 = icmp ult float* %scevgep99, %scevgep97
%found.conflict = and i1 %bound0, %bound1
br i1 %found.conflict, label %L178.us65, label %vector.ph
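At the Julia level, the situation can be sketched with two hand-written loops. The function names add_const! and add_const_ivdep! are hypothetical and not part of Base. The second loop uses @simd ivdep, which asserts that there is no loop-carried memory dependence; that assertion holds here even when dest === src, because iteration i only reads and writes index i. It would not hold for partially overlapping views, so this is an illustration under that assumption, not a general fix.

```julia
# Plain @simd: the compiler may still need to guard against dest and src
# overlapping in memory before it can use the vectorized path.
function add_const!(dest, src, c)
    @inbounds @simd for i in eachindex(dest, src)
        dest[i] = src[i] + c
    end
    return dest
end

# @simd ivdep: we promise there is no loop-carried memory dependence.
# Assumption: dest and src are either identical or do not overlap at all.
function add_const_ivdep!(dest, src, c)
    @inbounds @simd ivdep for i in eachindex(dest, src)
        dest[i] = src[i] + c
    end
    return dest
end

a = collect(1.0:8.0)
b = similar(a)
add_const!(b, a, 1.0)        # separate destination
x = copy(a)
add_const_ivdep!(x, x, 1.0)  # writing back into the same array
```

Both calls compute the same values; any timing difference between them would need to be measured on the machine in question.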
Could confirm with simple
Maybe this is what you meant, but adding that to the loop in the method below:
julia> @eval Base.Broadcast @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
           if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
               A = bc.args[1]
               if axes(dest) == axes(A)
                   return copyto!(dest, A)
               end
           end
           bc′ = preprocess(dest, bc)
           # Performance may vary depending on whether `@inbounds` is placed outside the
           # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
           @simd ivdep for I in eachindex(dest)
               @inbounds dest[I] = bc′[I]
           end
           return dest
       end
copyto! (generic function with 60 methods)
julia> @btime $x .= f23.($x);
  1.562 μs (0 allocations: 0 bytes)
Of course, we can't add
if dest isa StridedArray{<:Base.HWNumber} # just an example
    @simd ivdep for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
else
    @simd for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
end
should be safe enough?
Some more examples:
julia> a = rand(1024); b = similar(a);
julia> @inline f(x, y, c) = x .= y .+ c # force inline
f (generic function with 1 method)
julia> ff(x, c) = f(x, x, c)
ff (generic function with 1 method)
julia> @btime f($b, $a, $0);
  67.857 ns (0 allocations: 0 bytes) # fast, as expected
julia> @btime f($a, $a, $0);
  310.117 ns (0 allocations: 0 bytes) # slow, as expected
julia> @btime ff($a, $0);
  62.322 ns (0 allocations: 0 bytes) # inlined version is fast
julia> @noinline f(x, y, c) = x .= y .+ c # force noinline
f (generic function with 1 method)
julia> @btime ff($a, $0);
  311.673 ns (0 allocations: 0 bytes) # noinlined version is slow
Some updated timings; still appears to reproduce
Can also reproduce on v1.9
I was surprised by this slowdown when writing back into x, instead of into another array y:
This is 1.7.0-rc2, but similar on 1.5 and master, and on other computers. I don't think it's a benchmarking artefact, since it persists with evals=1 setup=(x=...; y=...). I don't think it's a hardware limit, since @turbo $x .= f23.($x) with LoopVectorization.jl, or @.. $x = f23($x) with FastBroadcast.jl, don't show this difference.
For comparison, it seems that map! is never fast here, although map is: