
broadcast!(f, x, x) slower, prevents SIMD? #43153

Open
mcabbott opened this issue Nov 19, 2021 · 8 comments
Labels
broadcast Applying a function over a collection compiler:simd instruction-level vectorization performance Must go faster

Comments

@mcabbott
Contributor

I was surprised by this slowdown when writing back into x, instead of into another array y:

julia> f23(x) = ifelse(x>0, x^2, x^3);

julia> x = randn(Float32, 100, 100); y = similar(x);

julia> @btime $y .= f23.($x);
  1.958 μs (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  6.567 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);  # allocating; mean time 3.010 μs still faster than x .= case
  2.236 μs (2 allocations: 39.11 KiB)

This is on 1.7.0-rc2, but it is similar on 1.5 and master, and on other computers. I don't think it's a benchmarking artefact: it persists with evals=1 setup=(x=...; y=...). I don't think it's a hardware limit, since neither @turbo $x .= f23.($x) with LoopVectorization.jl nor @.. $x = f23($x) with FastBroadcast.jl shows this difference.

For comparison, it seems that map! is never fast here, although map is:

julia> @btime map!(f23, $y, $x);
  8.055 μs (0 allocations: 0 bytes)

julia> @btime map!(f23, $x, $x);
  8.055 μs (0 allocations: 0 bytes)

julia> @btime map(f23, $x);
  1.917 μs (2 allocations: 39.11 KiB)
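One workaround, a sketch based on the timings above rather than something proposed in the thread, is to break the aliasing explicitly by materializing the input first, trading an allocation for the vectorized path:

```julia
f23(x) = ifelse(x > 0, x^2, x^3)

x = randn(Float32, 100, 100)
expected = f23.(x)

# Broadcasting from a fresh copy means dest and src no longer alias,
# so LLVM's runtime memory check passes and the SIMD loop is taken.
x .= f23.(copy(x))

x == expected
```

This is essentially what the allocating `f23.(x)` benchmark already does, which is consistent with it beating the aliased in-place case.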
@gbaraldi
Member

Here is a gist with the LLVM code generated for a similar problem, run natively on an M1 Mac: https://gist.github.com/gbaraldi/9ed4d47fd7fabc55f5d013399af2862e

@KristofferC
Member

KristofferC commented Nov 19, 2021

My guess is that this is a runtime memory check inserted by LLVM, which decides not to take the SIMD path when it sees that the memory aliases. From the LLVM IR:

vector.memcheck:                                  ; preds = %L178.us65.preheader
     %scevgep99 = getelementptr float, float* %19, i64 %24
     %26 = shl i64 %23, 2
     %27 = add i64 %26, 4
     %28 = mul i64 %10, %27
     %uglygep = getelementptr i8, i8* %20, i64 %28
     %bound0 = icmp ugt i8* %uglygep, %scevgep96
     %bound1 = icmp ult float* %scevgep99, %scevgep97
     %found.conflict = and i1 %bound0, %bound1
     br i1 %found.conflict, label %L178.us65, label %vector.ph

The L178.us65 block holds the scalar version of the code.

@N5N3
Member

N5N3 commented Nov 20, 2021

I can confirm this with a simple @simd loop, and adding ivdep "fixes" the performance difference.
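The kind of loop meant here might look like the following (my reconstruction, not N5N3's exact code). The two functions differ only in the ivdep hint, which tells LLVM to assume there is no loop-carried memory dependence and thereby lets it skip the runtime alias check:

```julia
f23(x) = ifelse(x > 0, x^2, x^3)

# Plain @simd: LLVM still emits a runtime alias check (vector.memcheck)
# and falls back to a scalar loop when the memory overlaps.
function self_map!(x)
    @simd for i in eachindex(x)
        @inbounds x[i] = f23(x[i])
    end
    return x
end

# @simd ivdep: asserts no loop-carried dependence, so no runtime check.
function self_map_ivdep!(x)
    @simd ivdep for i in eachindex(x)
        @inbounds x[i] = f23(x[i])
    end
    return x
end
```

For `x[i] = f(x[i])` each iteration touches only its own element, so ivdep is safe here; it would not be safe if one iteration could write an element that another iteration reads.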

@mcabbott
Contributor Author

Maybe this is what you meant, but adding ivdep to the loop in the method of copyto! which is called here does speed up the broadcast! case above. Can this be done safely? IIRC you may need to avoid BitVectors; maybe there are other problems.

julia> @eval Base.Broadcast @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
           if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
               A = bc.args[1]
               if axes(dest) == axes(A)
                   return copyto!(dest, A)
               end
           end
           bc′ = preprocess(dest, bc)
           # Performance may vary depending on whether `@inbounds` is placed outside the
           # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
           @simd ivdep for I in eachindex(dest)
               @inbounds dest[I] = bc′[I]
           end
           return dest
       end
copyto! (generic function with 60 methods)

julia> @btime $x .= f23.($x);
  1.562 μs (0 allocations: 0 bytes)

@N5N3
Member

N5N3 commented Nov 20, 2021

Of course, we can't add ivdep to the most general fallback.
IIUC, we only need to be concerned with whether the dest array overlaps itself, since the srcs have already been unaliased if needed. So a check like:

if dest isa StridedArray{<:Base.HWNumber} # just an example
    @simd ivdep for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
else
    @simd for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
end

should be safe enough?
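For intuition on why the general fallback cannot use ivdep, here is a hedged example (mine, not from the thread) of a destination that overlaps itself: a view with a repeated index, where the order of the writes is observable and ivdep would make it unspecified:

```julia
a = zeros(Int, 2)
v = view(a, [1, 1, 2])   # v[1] and v[2] are both a[1]: dest overlaps itself

# Sequential semantics: a[1] = 10, then a[1] = 20, then a[2] = 30.
# With ivdep, LLVM would be free to reorder these stores.
v .= [10, 20, 30]

a                        # the later write to a[1] wins
```

A `StridedArray{<:Base.HWNumber}` destination cannot overlap itself this way, which is what makes the dispatch-based check above plausible.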

@N5N3
Member

N5N3 commented Nov 20, 2021

Some more examples:

julia> a = rand(1024); b = similar(a);

julia> @inline f(x, y, c) = x .= y .+ c # force inline
f (generic function with 1 method)

julia> ff(x, c) = f(x, x, c)
ff (generic function with 1 method)

julia> @btime f($b, $a, $0);
  67.857 ns (0 allocations: 0 bytes)   # fast, as expected

julia> @btime f($a, $a, $0);
  310.117 ns (0 allocations: 0 bytes)  # slow, as expected

julia> @btime ff($a, $0);
  62.322 ns (0 allocations: 0 bytes)   # inlined version is fast

julia> @noinline f(x, y, c) = x .= y .+ c  # force noinline
f (generic function with 1 method)

julia> @btime ff($a, $0);
  311.673 ns (0 allocations: 0 bytes)  # non-inlined version is slow

The benchmark above indicates that the inlined version can be vectorized correctly (@code_llvm also shows that the vector.memcheck block is omitted). Edit: further benchmarks show that only the inlined 1-d self-inplace broadcast is vectorized properly (this might be caused by the index transformation, as @simd separates out the innermost loop).
LLVM's auto-vectorization has a similar problem, so the best solution might be to relax the current runtime check on the LLVM side?
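One way to inspect this without timing anything (a workflow suggestion of mine, not from the thread) is to capture the optimized IR with code_llvm and search it for the vector.memcheck block:

```julia
using InteractiveUtils  # provides code_llvm

@inline f(x, y, c) = x .= y .+ c

io = IOBuffer()
code_llvm(io, f, Tuple{Vector{Float64}, Vector{Float64}, Int})
ir = String(take!(io))

# If "vector.memcheck" appears, LLVM emitted a runtime alias check and will
# branch to a scalar loop whenever x and y turn out to overlap.
occursin("vector.memcheck", ir)
```

The same check can be repeated on the @noinline definition to see whether the alias check reappears.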

@brenhinkeller
Contributor

brenhinkeller commented Nov 19, 2022

Some updated timings; the issue still appears to reproduce:

julia> using BenchmarkTools

julia> @btime $y .= f23.($x);
  935.345 ns (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  4.494 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);
  1.012 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

julia> @btime $y .= f23.($x);
  1.779 μs (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  6.442 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);  # allocating; mean time 3.010 μs still faster than x .= case
  1.972 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d7394 (2022-11-15 14:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.5.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, westmere)
  Threads: 1 on 10 virtual cores

@miguelraz
Contributor

Can also reproduce on v1.9.
