Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

specialize copyto! for performance #23

Merged
merged 1 commit into from
Apr 22, 2024
Merged

specialize copyto! for performance #23

merged 1 commit into from
Apr 22, 2024

Conversation

nsajko
Copy link
Collaborator

@nsajko nsajko commented Apr 22, 2024

The fallback generic implementation of copyto! uses setindex! in an elementwise manner. Instead base the implementation on unsafe_copyto!. This improves the performance of copyto!, thus it also improves the performance of copy!, copy and of broadcasted operations.

This change only applies when copying between two FixedSizeArrays, or between a Memory and a FixedSizeArray.

Fixes #19

Copy link

codecov bot commented Apr 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (dd77e40) to head (721f6a2).

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #23   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            1         1           
  Lines           33        57   +24     
=========================================
+ Hits            33        57   +24     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nsajko
Copy link
Collaborator Author

nsajko commented Apr 22, 2024

With this change:

julia> using FixedSizeArrays, BenchmarkTools

julia> @benchmark z .= v setup=(v=convert(FixedSizeVector,randn(Float64, 1024*1024)); z=similar(v);)
BenchmarkTools.Trial: 491 samples with 1 evaluation.
 Range (min … max):  1.344 ms …   4.687 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.687 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.317 ms ± 684.430 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▅█▇                                 ▁ ▂▂                 
  ▂▃▃▄███▄▄▅▅▃▃▃▂▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄█▇▆████▆▄▅▃▆▆▃▄▄▃▁▃▁▂▂ ▃
  1.34 ms         Histogram: frequency by time        3.43 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark z .= v setup=(v=randn(Float64, 1024*1024); z=similar(v);)
BenchmarkTools.Trial: 793 samples with 1 evaluation.
 Range (min … max):  1.521 ms …   3.384 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.795 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.862 ms ± 271.172 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁▂▃▇██▆▄▄▃▃▂                                            
  ▆▄▁▅██████████████▆▆▇▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▄▄▁▅▁▇█▆▆▆ █
  1.52 ms      Histogram: log(frequency) by time       3.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.12.0-DEV.387
Commit b5bfd83a3d0 (2024-04-22 13:12 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

The fallback generic implementation of `copyto!` uses `setindex!` in an
elementwise manner. Instead base the implementation on
`unsafe_copyto!`. This improves the performance of `copyto!`, thus it
also improves the performance of `copy!`, `copy` and of broadcasted
operations.

This change only applies when copying between two `FixedSizeArray`s,
or between a `Memory` and a `FixedSizeArray`.

Fixes #19
@nsajko
Copy link
Collaborator Author

nsajko commented Apr 22, 2024

TBH the tests would ideally be more comprehensive, even though they should now (with the force push) have full source line coverage.

@nsajko
Copy link
Collaborator Author

nsajko commented Apr 22, 2024

BTW, the code in this PR is adapted from base/array.jl in the Julia source. From around here:

https://github.com/JuliaLang/julia/blob/4a4d85090b3bde224acec51477ab817cb4fb5298/base/array.jl#L306-L316

Copy link
Collaborator

@giordano giordano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this PR, for small arrays the broacast operations are pretty much equivalent:

julia> @benchmark z .= v setup=(v = randn(Float64, 1024); z = similar(v))
BenchmarkTools.Trial: 10000 samples with 943 evaluations.
 Range (min … max):  100.919 ns … 158.051 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     101.670 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   102.948 ns ±   3.040 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ███▅▅▅▄▃▄▄▂▁▂▂▃▂▂▁▁▂▁              ▁▁                        ▂
  ████████████████████████▇██▇▇██████████▇▆▆▆▇▆▆▇▆▆▆▇▇▆▆▆▆▅▆▆▆▆ █
  101 ns        Histogram: log(frequency) by time        115 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark z .= v setup=(v = convert(FixedSizeVector, randn(Float64, 1024)); z = similar(v))
BenchmarkTools.Trial: 10000 samples with 942 evaluations.
 Range (min … max):  101.025 ns … 188.960 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     101.911 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   103.364 ns ±   4.435 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄██▆▅▅▄▄▃▃▃▃▃▃▂▂▂▁▁                                           ▂
  █████████████████████▇██▇█▆█▇▇▇▇▆▆▆▆▆▆▆▆▆▇▆▆▇▆▆▅▆▆▅▆▅▅▄▅▃▅▅▅▄ █
  101 ns        Histogram: log(frequency) by time        119 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

For larger arrays there's still some difference, also visible in your benchmarks:

julia> @benchmark z .= v setup=(v = randn(Float64, 2 ^ 20); z = similar(v))
BenchmarkTools.Trial: 1601 samples with 1 evaluation.
 Range (min … max):  364.459 μs … 876.416 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     434.208 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   434.333 μs ±  50.884 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁█▅▄▃ ▃▁▃▂▃         ▃▄   ▇▄▁
  ▅████████████▇▄█▄▄▄▄▆█████████▆▆▅▅▆▅▅▃▃▃▃▂▂▂▃▂▃▃▂▃▂▂▂▂▃▁▁▂▂▁▂ ▄
  364 μs           Histogram: frequency by time          577 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark z .= v setup=(v = convert(FixedSizeVector, randn(Float64, 2 ^ 20)); z = similar(v))
BenchmarkTools.Trial: 1177 samples with 1 evaluation.
 Range (min … max):  432.250 μs … 873.959 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     492.125 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   513.554 μs ±  64.203 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▄▅█▇▄▂▁
  ▂▃▅████████▇▇▆▆▇▇▇▇█▇▅▆▆▄▅▄▄▅▃▃▃▃▃▄▃▃▄▃▃▃▃▂▃▃▂▂▁▂▂▂▂▁▂▂▂▁▁▂▂▂ ▄
  432 μs           Histogram: frequency by time          749 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Compare to current main (dd77e40) of this package:

julia> @benchmark z .= v setup=(v = convert(FixedSizeVector, randn(Float64, 2 ^ 20)); z = similar(v))
BenchmarkTools.Trial: 1093 samples with 1 evaluation.
 Range (min … max):  775.125 μs …  1.205 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     822.417 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   836.816 μs ± 42.380 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▃▆██▆▇▃▃▁▂▄▂
  ▂▁▁▂▃▄▆▇████████████▆▆▄▅▃▄▃▄▃▃▃▃▃▄▃▄▃▄▄▄▄▅▄▅▄▃▄▄▃▃▃▂▃▁▂▂▃▂▂▂ ▄
  775 μs          Histogram: frequency by time          958 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is definitely going in the right direction, nice work!

@nsajko
Copy link
Collaborator Author

nsajko commented Apr 22, 2024

On my machine, the benchmarks favor FixedSizeArray over Array, both in the above benchmark, and now when I tried to replicate your results:

julia> using FixedSizeArrays, BenchmarkTools

julia> @benchmark z .= v setup=(v = randn(Float64, 2 ^ 20); z = similar(v);)
BenchmarkTools.Trial: 801 samples with 1 evaluation.
 Range (min … max):  1.555 ms …   3.311 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.793 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.955 ms ± 407.934 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▃█▆▄                                                  
  ▃▃▂▅▄▄█████▇▄▃▃▃▄▃▂▃▂▂▂▂▂▂▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▂▂▁▁▂▁▂▂▃▃▂▃▃▃▃▃▃▄ ▃
  1.56 ms         Histogram: frequency by time        3.04 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark z .= v setup=(v = convert(FixedSizeVector, randn(Float64, 2 ^ 20)); z = similar(v);)
BenchmarkTools.Trial: 514 samples with 1 evaluation.
 Range (min … max):  1.270 ms …   3.203 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.870 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.355 ms ± 721.781 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▇▂                                             ▇▁ █▇    
  ▃▃▃▄▆██▆▆▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▅▇▄██▄██▃▃ ▃
  1.27 ms         Histogram: frequency by time        3.08 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.12.0-DEV.387
Commit b5bfd83a3d0 (2024-04-22 13:12 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

What's your versioninfo?

@nsajko nsajko merged commit 8c43170 into JuliaArrays:main Apr 22, 2024
8 checks passed
@nsajko nsajko deleted the copyto branch April 22, 2024 23:28
@giordano
Copy link
Collaborator

julia> versioninfo()
Julia Version 1.12.0-DEV.302
Commit 955b85b549 (2024-04-03 14:01 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

@nsajko
Copy link
Collaborator Author

nsajko commented Apr 22, 2024

Profiling shows that z .= v spends almost all of its time in unsafe_copyto! (genericmemory.jl:95) on my machine, for both Array and FixedSizeArray. (All samples end up in unsafe_copyto!, there's just one column in the flame graph instead of a tree.)

Perhaps this is just profiling noise? Perhaps it's because your Julia commit is slightly older? Not sure, just throwing around possibilities.

@giordano
Copy link
Collaborator

Yeah, maybe there's some noise, if I use a fixed vector as input, instead of random ones, I get more stable and comparable results:

julia> v = (1.0:2.0 ^ 20);

julia> @benchmark z .= v setup=(v = collect($v); z = similar(v))
BenchmarkTools.Trial: 2941 samples with 1 evaluation.
 Range (min … max):  391.083 μs … 877.709 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     412.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   423.975 μs ±  33.054 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅█▃▁▂▃▁
  ▂█████████▆▅▅▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  391 μs           Histogram: frequency by time          540 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark z .= v setup=(v = convert(FixedSizeVector, $v); z = similar(v))
BenchmarkTools.Trial: 2402 samples with 1 evaluation.
 Range (min … max):  391.333 μs … 733.917 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     411.417 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   426.627 μs ±  39.550 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁██▂
  ▃████▆▅▅▅▄▅▄▄▃▄▃▃▃▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  391 μs           Histogram: frequency by time          577 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simple broadcasting for FixedSizeArray is slower than regular Array
2 participants