
broadcast!(f, x, x) slower, prevents SIMD? #43153

Open
mcabbott opened this issue Nov 19, 2021 · 8 comments
Labels
broadcast Applying a function over a collection compiler:simd instruction-level vectorization performance Must go faster

Comments

@mcabbott
Contributor

I was surprised by this slowdown when writing back into x, instead of into another array y:

julia> f23(x) = ifelse(x>0, x^2, x^3);

julia> x = randn(Float32, 100, 100); y = similar(x);

julia> @btime $y .= f23.($x);
  1.958 μs (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  6.567 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);  # allocating; mean time 3.010 μs still faster than x .= case
  2.236 μs (2 allocations: 39.11 KiB)

This is on 1.7.0-rc2, but it is similar on 1.5 and master, and on other computers. I don't think it's a benchmarking artefact: it persists with evals=1 setup=(x=...; y=...). I don't think it's a hardware limit, since neither @turbo $x .= f23.($x) with LoopVectorization.jl nor @.. $x = f23($x) with FastBroadcast.jl shows this difference.

For comparison, it seems that map! is never fast here, although map is:

julia> @btime map!(f23, $y, $x);
  8.055 μs (0 allocations: 0 bytes)

julia> @btime map!(f23, $x, $x);
  8.055 μs (0 allocations: 0 bytes)

julia> @btime map(f23, $x);
  1.917 μs (2 allocations: 39.11 KiB)
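One workaround, a sketch based on the timings above rather than something proposed in the thread, is to break the aliasing explicitly by materializing the input first, trading an allocation for the vectorized path:

```julia
f23(x) = ifelse(x > 0, x^2, x^3)

x = randn(Float32, 100, 100)
expected = f23.(x)

# Broadcasting from a fresh copy means dest and src no longer alias,
# so LLVM's runtime memory check passes and the SIMD loop is taken.
x .= f23.(copy(x))

x == expected
```

This is essentially what the allocating `f23.(x)` benchmark already does, which is consistent with it beating the aliased in-place case.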
@gbaraldi
Member

Here is a gist with the LLVM code generated for a similar problem, run natively on an M1 Mac: https://gist.github.com/gbaraldi/9ed4d47fd7fabc55f5d013399af2862e

@KristofferC
Member

KristofferC commented Nov 19, 2021

My guess is that this is a runtime memory check inserted by LLVM, which decides not to take the SIMD path when it sees that the memory aliases. From the LLVM IR:

vector.memcheck:                                  ; preds = %L178.us65.preheader
     %scevgep99 = getelementptr float, float* %19, i64 %24
     %26 = shl i64 %23, 2
     %27 = add i64 %26, 4
     %28 = mul i64 %10, %27
     %uglygep = getelementptr i8, i8* %20, i64 %28
     %bound0 = icmp ugt i8* %uglygep, %scevgep96
     %bound1 = icmp ult float* %scevgep99, %scevgep97
     %found.conflict = and i1 %bound0, %bound1
     br i1 %found.conflict, label %L178.us65, label %vector.ph

The L178.us65 block holds the scalar version of the code.

@N5N3
Member

N5N3 commented Nov 20, 2021

I can confirm this with a simple @simd loop, and adding ivdep "fixes" the performance difference.
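The kind of loop meant here might look like the following (my reconstruction, not N5N3's exact code). The two functions differ only in the ivdep hint, which tells LLVM to assume there is no loop-carried memory dependence and thereby lets it skip the runtime alias check:

```julia
f23(x) = ifelse(x > 0, x^2, x^3)

# Plain @simd: LLVM still emits a runtime alias check (vector.memcheck)
# and falls back to a scalar loop when the memory overlaps.
function self_map!(x)
    @simd for i in eachindex(x)
        @inbounds x[i] = f23(x[i])
    end
    return x
end

# @simd ivdep: asserts no loop-carried dependence, so no runtime check.
function self_map_ivdep!(x)
    @simd ivdep for i in eachindex(x)
        @inbounds x[i] = f23(x[i])
    end
    return x
end
```

For `x[i] = f(x[i])` each iteration touches only its own element, so ivdep is safe here; it would not be safe if one iteration could write an element that another iteration reads.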

@mcabbott
Contributor Author

Maybe this is what you meant, but adding ivdep to the loop in the method of copyto! which is called here does speed up the broadcast! case above. Can this be done safely? IIRC you may need to avoid BitVectors; maybe there are other problems.

julia> @eval Base.Broadcast @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
           if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
               A = bc.args[1]
               if axes(dest) == axes(A)
                   return copyto!(dest, A)
               end
           end
           bc′ = preprocess(dest, bc)
           # Performance may vary depending on whether `@inbounds` is placed outside the
           # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
           @simd ivdep for I in eachindex(dest)
               @inbounds dest[I] = bc′[I]
           end
           return dest
       end
copyto! (generic function with 60 methods)

julia> @btime $x .= f23.($x);
  1.562 μs (0 allocations: 0 bytes)

@N5N3
Member

N5N3 commented Nov 20, 2021

Of course, we can't add ivdep to the most general fallback.
IIUC, we only need to be concerned with whether the dest array overlaps itself, since the srcs have already been unaliased if needed. So a check like:

if dest isa StridedArray{<:Base.HWNumber} # just an example
    @simd ivdep for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
else
    @simd for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
end

should be safe enough?
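For intuition on why the general fallback cannot use ivdep, here is a hedged example (mine, not from the thread) of a destination that overlaps itself: a view with a repeated index, where the order of the writes is observable and ivdep would make it unspecified:

```julia
a = zeros(Int, 2)
v = view(a, [1, 1, 2])   # v[1] and v[2] are both a[1]: dest overlaps itself

# Sequential semantics: a[1] = 10, then a[1] = 20, then a[2] = 30.
# With ivdep, LLVM would be free to reorder these stores.
v .= [10, 20, 30]

a                        # the later write to a[1] wins
```

A `StridedArray{<:Base.HWNumber}` destination cannot overlap itself this way, which is what makes the dispatch-based check above plausible.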

@N5N3
Member

N5N3 commented Nov 20, 2021

Some more examples:

julia> a = rand(1024); b = similar(a);

julia> @inline f(x, y, c) = x .= y .+ c # force inline
f (generic function with 1 method)

julia> ff(x, c) = f(x, x, c)
ff (generic function with 1 method)

julia> @btime f($b, $a, $0);
  67.857 ns (0 allocations: 0 bytes)   # fast, as expected

julia> @btime f($a, $a, $0);
  310.117 ns (0 allocations: 0 bytes)  # slow, as expected

julia> @btime ff($a, $0);
  62.322 ns (0 allocations: 0 bytes)   # inlined version is fast

julia> @noinline f(x, y, c) = x .= y .+ c  # force noinline
f (generic function with 1 method)

julia> @btime ff($a, $0);
  311.673 ns (0 allocations: 0 bytes)  # non-inlined version is slow

The benchmark above indicates that the inlined version can be vectorized correctly (@code_llvm also shows that the vector.memcheck block is omitted). Edit: further benchmarks show that only the inlined 1-d self-inplace broadcast is vectorized properly (this might be caused by the index transformation, as @simd separates out the innermost loop).
LLVM's auto-vectorization has a similar problem, so the best solution might be to relax the current runtime check on the LLVM side?
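One way to inspect this without timing anything (a workflow suggestion of mine, not from the thread) is to capture the optimized IR with code_llvm and search it for the vector.memcheck block:

```julia
using InteractiveUtils  # provides code_llvm

@inline f(x, y, c) = x .= y .+ c

io = IOBuffer()
code_llvm(io, f, Tuple{Vector{Float64}, Vector{Float64}, Int})
ir = String(take!(io))

# If "vector.memcheck" appears, LLVM emitted a runtime alias check and will
# branch to a scalar loop whenever x and y turn out to overlap.
occursin("vector.memcheck", ir)
```

The same check can be repeated on the @noinline definition to see whether the alias check reappears.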

@brenhinkeller
Contributor

brenhinkeller commented Nov 19, 2022

Some updated timings; the issue still appears to reproduce:

julia> using BenchmarkTools

julia> @btime $y .= f23.($x);
  935.345 ns (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  4.494 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);
  1.012 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

julia> @btime $y .= f23.($x);
  1.779 μs (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  6.442 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);  # allocating; mean time 3.010 μs still faster than x .= case
  1.972 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d7394 (2022-11-15 14:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.5.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, westmere)
  Threads: 1 on 10 virtual cores

@miguelraz
Contributor

Can also reproduce on v1.9.
