
Non-ideal default number of BLAS threads on aarch64-apple-darwin #46071

Closed
ctkelley opened this issue Jul 16, 2022 · 13 comments · Fixed by #46085
Labels
linear algebra Linear algebra performance Must go faster

Comments

@ctkelley

ctkelley commented Jul 16, 2022

`lu!` is slower by a factor of 2 on 1.8.0-rc3 on M1 Macs.

Results from a 2020 MacBook Pro.

On 1.8.0-rc3

julia> A=rand(8192,8192);

julia> @btime lu!($A);
  4.087 s (2 allocations: 64.05 KiB)

and on 1.7.2

julia> A=rand(8192,8192);

julia> @btime lu!($A);
  1.929 s (2 allocations: 64.05 KiB)
@ctkelley ctkelley changed the title lu! slower in 1.8.0-rc3 lu! slower in 1.8.0-rc3: M1 Mac Jul 16, 2022
@ctkelley ctkelley changed the title lu! slower in 1.8.0-rc3: M1 Mac Performance regression: lu! slower in 1.8.0-rc3 on M1 Mac Jul 16, 2022
@ctkelley ctkelley changed the title Performance regression: lu! slower in 1.8.0-rc3 on M1 Mac Performance regression: lu! 2x slower in 1.8.0-rc3 on M1 Mac Jul 16, 2022
@ctkelley
Author

It seems to be better if I set the number of BLAS threads to 4. Wasn't this done automatically before?

julia> A1=rand(8192,8192); A2=copy(A1);

julia> @btime lu!($A1);
  4.101 s (2 allocations: 64.05 KiB)

julia> using LinearAlgebra.BLAS

julia> BLAS.set_num_threads(4)

julia> @btime lu!(A2);
  2.251 s (3 allocations: 64.08 KiB)

@inkydragon
Sponsor Member

inkydragon commented Jul 16, 2022

Confirmed in WSL2.
It looks like the default number of BLAS threads differs between 1.7 and 1.8.

Note: use `Random.seed!` to fix the test matrix so it is consistent between runs.

using Random
using LinearAlgebra
using BenchmarkTools

Random.seed!(46071)
A = rand(8192, 8192);

BLAS.get_num_threads()

sum(A)
@btime lu!(copy(A));
@benchmark lu!(B)  setup=( B=copy($A) )
sum(A)

B = copy(A);
@btime lu!(B);
  • Version 1.7.3 (2022-05-06)
julia> BLAS.get_num_threads()
6

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(copy(A));
  2.189 s (5 allocations: 512.06 MiB)

julia> @benchmark lu!(B)  setup=( B=copy($A) )
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.931 s …  2.026 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.948 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.969 s ± 50.677 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █         █                                             █
  █▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.93 s         Histogram: frequency by time        2.03 s <

 Memory estimate: 64.05 KiB, allocs estimate: 2.

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(B);
  1.897 s (3 allocations: 64.08 KiB)
  • Version 1.8.0-rc3 (2022-07-13)
julia> BLAS.get_num_threads()
3

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(copy(A));
  2.891 s (5 allocations: 512.06 MiB)

julia> @benchmark lu!(B)  setup=( B=copy($A) )
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.754 s …  2.872 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.813 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.813 s ± 83.977 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                       █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.75 s         Histogram: frequency by time        2.87 s <

 Memory estimate: 64.05 KiB, allocs estimate: 2.

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(B);
  2.636 s (3 allocations: 64.08 KiB)

Update:

  • Version 1.8.0-rc3 (2022-07-13) + 6 BLAS threads
julia> BLAS.get_num_threads()
3
julia> BLAS.set_num_threads(6)
julia> BLAS.get_num_threads()
6

julia> sum(A)
3.3553450714250512e7

julia> @btime lu!(copy(A));
  2.037 s (5 allocations: 512.06 MiB)

julia> @benchmark lu!(B)  setup=( B=copy($A) )
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.898 s …   2.348 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.995 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.080 s ± 237.145 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █           █                                            █
  █▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.9 s          Histogram: frequency by time         2.35 s <

 Memory estimate: 64.05 KiB, allocs estimate: 2.

julia> sum(A)
3.3553450714250512e7

julia> B = copy(A);

julia> @btime lu!(B);
  1.706 s (3 allocations: 64.08 KiB)

@inkydragon inkydragon added the linear algebra Linear algebra label Jul 16, 2022
@ViralBShah
Member

The default number of OpenBLAS threads is a challenge. We may need a special case for M1 because of its mix of performance and efficiency cores.

This was the latest update to that logic: #45412
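
One way such a special case could distinguish the two core types is by asking macOS directly. This is only a sketch, assuming the `hw.perflevel0.physicalcpu` sysctl key that Apple Silicon machines expose for the performance-core count (not code from the linked PR):

```julia
# Hypothetical sketch: query the Apple Silicon performance-core count.
# Assumes the hw.perflevel0.physicalcpu sysctl key is present
# (perflevel0 = performance cores, perflevel1 = efficiency cores).
function apple_performance_cores()
    n = tryparse(Int, readchomp(`sysctl -n hw.perflevel0.physicalcpu`))
    # Fall back to Julia's own detection if the key is missing.
    n === nothing ? Sys.CPU_THREADS : n
end
```

On a 2020 M1 (4 performance + 4 efficiency cores) this would return 4, which matches the thread count found to perform well in the benchmarks above.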

@ViralBShah ViralBShah added the performance Must go faster label Jul 16, 2022
@giordano
Contributor

giordano commented Jul 16, 2022

On M1 I get:

% julia-17 -e 'using LinearAlgebra; @info "" VERSION BLAS.get_num_threads()'
┌ Info:
│   VERSION = v"1.7.0"
└   BLAS.get_num_threads() = 8
% julia -e 'using LinearAlgebra; @info "" VERSION BLAS.get_num_threads()'
┌ Info:
│   VERSION = v"1.9.0-DEV.983"
└   BLAS.get_num_threads() = 2

The "right" number of threads should be 4, not 8 and not 2.

@giordano giordano changed the title Performance regression: lu! 2x slower in 1.8.0-rc3 on M1 Mac Non-ideal default number of threads on aarch64-apple-darwin Jul 16, 2022
@giordano
Contributor

giordano commented Jul 16, 2022

Also, I can confirm that I get identical performance between Julia v1.7 and master when using the same number of BLAS threads, so the only issue here is the default number of threads. I have renamed the issue accordingly.

@giordano giordano changed the title Non-ideal default number of threads on aarch64-apple-darwin Non-ideal default number of BLAS threads on aarch64-apple-darwin Jul 16, 2022
@ctkelley
Author

That seems to be it. I ran the experiment on 1.8.0-rc3:

using Random, LinearAlgebra, BenchmarkTools

Random.seed!(46071)
A = rand(8192, 8192);
A0 = copy(A)
for ith = 2:2:8
    A .= A0
    BLAS.set_num_threads(ith)
    lutime = @belapsed lu!($A)
    println("threads = $ith; time = $lutime")
end

and got

threads = 2; time = 4.09720e+00
threads = 4; time = 2.39923e+00
threads = 6; time = 2.26150e+00
threads = 8; time = 2.09293e+00

and the same results on 1.7.2.

Is 4 the right number? Is there a guarantee that if we ask for 4 threads they run on the performance cores?

@ViralBShah
Member

@chriselrod may know.

@chriselrod
Contributor

1 thread per performance core is optimal, which is what Julia detects: #44072

Note that each performance core has only a single hardware thread, so we're not getting twice the cores from `Sys.CPU_THREADS` like we do on most x86 systems.
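
The arithmetic behind this can be spelled out with illustrative numbers (a sketch of the reasoning, not actual detection code):

```julia
# Illustration: why halving Sys.CPU_THREADS misfires on M1.

# On a typical x86 machine with 2-way SMT, CPU_THREADS counts hardware
# threads, so halving recovers the physical core count:
x86_threads = 16                      # e.g. 8 cores × 2 SMT threads
x86_default = max(1, x86_threads ÷ 2) # 8: one BLAS thread per core

# On an M1 where Julia detects only the performance cores (#44072),
# which have no SMT, the same formula undercounts:
m1_threads = 4                        # 4 performance cores, 1 thread each
m1_default = max(1, m1_threads ÷ 2)   # 2: half the useful cores
```

This matches the numbers reported above: Julia 1.8/1.9 defaults to 2 or 3 BLAS threads on the M1, while 4 is optimal.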

@ViralBShah
Member

If you pick 4 cores, are you guaranteed to get scheduled on the performance cores?

@gbaraldi
Member

You aren't guaranteed anything, unfortunately. But the scheduler is pretty good at moving the right work to the right cores.

@ViralBShah
Member

So, basically, do we need to default to 4 OpenBLAS threads on M1? The PR I linked above is where we may need to specialize for M1.

@giordano
Contributor

Not 4, but simply don't divide by 2 in

BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))
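
A minimal sketch of the kind of special case being discussed follows. The actual fix landed in #46085; the function name here is illustrative, not the real code:

```julia
# Sketch: pick the default OpenBLAS thread count per platform.
# On aarch64-apple-darwin, Sys.CPU_THREADS already counts only the
# performance cores (no SMT), so use it directly; elsewhere, keep
# the halving that strips SMT/hyper-threads.
function default_blas_threads()
    if Sys.isapple() && Sys.ARCH === :aarch64
        max(1, Sys.CPU_THREADS)
    else
        max(1, Sys.CPU_THREADS ÷ 2)
    end
end
```

On an M1 with 4 detected performance cores this yields 4, the value the benchmarks in this thread identify as optimal.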

@ViralBShah
Member

It would be great if someone could make a PR quickly. We probably want to get this into 1.8.
