Use `max(1, Sys.CPU_THREADS)` BLAS threads for `aarch64`. #46085

chriselrod · 2022-07-18T13:59:04Z

None of the common ARM CPUs that come to mind have hyperthreading.

gbaraldi · 2022-07-18T14:51:47Z

Looks good to me. Probably should be backported to 1.8 too.

giordano · 2022-07-18T15:57:45Z

stdlib/LinearAlgebra/src/LinearAlgebra.jl

@@ -582,7 +582,11 @@ function __init__()
    Base.at_disable_library_threading(() -> BLAS.set_num_threads(1))

    if !haskey(ENV, "OPENBLAS_NUM_THREADS")
-        BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))
+        if Sys.ARCH === :aarch64


I'm not sure this is right. It's only on CPUs where we know Sys.CPU_THREADS is already the "right" number of threads for BLAS that we should use Sys.CPU_THREADS instead of Sys.CPU_THREADS ÷ 2. Unfortunately we don't have any way at the moment to query this, and the only system where we know for sure we have to do it is the M1 (because we have specific code for setting the value Sys.CPU_THREADS), which incidentally coincides with all known aarch64-apple-darwin platforms. Also, this should use @static and I think a better test is Sys.isapple() && Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64" (HostPlatform() does a normalisation of the arch, which in some cases may be arm64)

None of the common ARM CPUs that come to mind have hyperthreading.

Probably it's only/mainly Intel which does hyperthreading no? You're excluding 32-bit ARM, which I also doubt does hyperthreading

AMD Cpus also have it (it's called SMT for them). Power I think does it too. For someone running asahi for example this is probably wrong, same for someone running the new Intel ones. But it was wrong before and i don't think we figured out a way to query performance cores on linux.

Power I think does it too.

Power does 4x or 8x, depending on the chip, meaning we may want to divide by 4 or 8 for Power? Is that detectable?

For someone running asahi for example this is probably wrong

On asahi, would the 4big4L read as 8 threads, and 8big2L read as 10 threads?
Just by chance, 4 would've been correct for the former, but 5 for the latter isn't exactly ideal either. I'd be curious whether 5 or 10 are faster.

It's only on CPUs where we know Sys.CPU_THREADS is already the "right" number of threads for BLAS that we should use Sys.CPU_THREADS instead of Sys.CPU_THREADS ÷ 2

I think this is a rather strong statement, in that 5 doesn't seem obviously better than 10 for 8 big core + 2 little core systems.

Also, this should use @static

Seems unnecessary when using Sys.ARCH, given the syntax is valid and the compiler can optimize it away just fine.

julia> foo() = Sys.ARCH === :aarch64 foo (generic function with 1 method) julia> @code_typed foo() CodeInfo( 1 ─ return false ) => Bool

but I can use @static when switching to Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64"

I think a better test is Sys.isapple()

Do you think Asahi linux is more common than Raspberry Pi 4, Graviton, Ampere, A64FX?

Not sure about Nvidea's products.
If we start seeing more ARM laptops, they'll probably have performance + efficiency cores for Windows and Linux.

So it's possible, but I do not know the market shares here.

There are Intel CPUs without hyperthreading, and you can also turn it off in the bios for CPUs with it (this is reasonably common in HPC environments).
So combined with the P+E cores from Alder Lake and beyond, this all does get pretty messy. Would be interesting to see benchmarks there.
Intel advertises their E cores as being efficient in terms of performance/area, not performance/power, so they're marketing their E cores as improving multithreading performance. Their next generation of CPUs will add a lot more E cores without adding extra P cores. Maybe we really do want to use 1 BLAS thread per E core, hoping that the BLAS library does reasonable load balancing (which isn't a given; MKL seems to, but I wouldn't bet on OpenBLAS or BLIS). We'd still probably only want 1 thread per P core, and these unfortunately do have hyperthreading...

Alder lake is really annoying because the E cores don't have SMT but the P cores do. Ideally we should always find the number of P cores and set NUM_THREADS to that.

A quick test shows that, at least for BLAS, adding the E cores makes it quite a bit slower.

versioninfo() Julia Version 1.9.0-DEV.688 Commit 3eaed8b54c (2022-05-31 15:31 UTC) Platform Info: OS: Linux (x86_64-linux-gnu) CPU: 16 × 12th Gen Intel(R) Core(TM) i5-12600K WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-13.0.1 (ORCJIT, alderlake) Threads: 6 on 16 virtual cores Environment: JULIA_NUM_THREADS = 6 JULIA_NUM_PRECOMPILE_TASKS = 6 julia> using LinearAlgebra A1=rand(8192,8192); A2=copy(A1); julia> @btime lu!($A1); 1.022 s (2 allocations: 64.05 KiB) julia> BLAS.set_num_threads(10) julia> @btime lu!($A2); 1.847 s (2 allocations: 64.05 KiB)

What about with using MKL? I'm curious if MKL does better.

I get the argument for mixing cores optimized for absolute performance with cores optimized for performance/area, but with it hurting benchmarks, so far that sounds more like marketing than reality.

I see you also set JULIA_NUM_PRECOMPILE_TASKS = 6, do you think it even hurts this?

I set that because in macos there were some issues with the E cores and precompiling (it made the sound crackle). With MKL and 6 cores it didn't change much, but with 10 it didn't get worse interestingly. So mkl might do something not to use them

So mkl might do something not to use them

That seems likely.

To be clear, when I said "Intel" above I was referring to x86-64 in general, not chips made by Intel specifically.

Anyway, since it was said that most ARM CPUs (and not just Aaarch64, and not just on macOS) will typically not have multithreading, then the condition should change to

Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) ∈ ("aarch64", "armv6l", "armv7l")

? It remains the already mentioned problem of Asahi Linux, where we don't set correctly the total number of CPUs, but that's a more general problem of detecting the efficiency/power cores, and in any case this is only a default value of threads, which can be always changed (and people strongly caring about the precise number of threads to be used on their machine should set it manually to be safe against changes of defaults in any case).

Use binarybuilder to normalize arch, and for now only use 1 BLAS thread per CPU thread for `Sys.isapple()` and `aarch64`

(cherry picked from commit 97df6db)

giordano · 2022-07-20T14:10:43Z

stdlib/LinearAlgebra/src/LinearAlgebra.jl

@@ -582,7 +582,11 @@ function __init__()
    Base.at_disable_library_threading(() -> BLAS.set_num_threads(1))

    if !haskey(ENV, "OPENBLAS_NUM_THREADS")


Wait, this should look at GOTO_NUM_THREADS AND OMP_NUM_THREADS as well: https://github.com/xianyi/OpenBLAS/blob/c43ec53bdd00d9423fc609d7b7ecb35e7bf41b85/README.md#setting-the-number-of-threads-using-environment-variables

GOTO_NUM_THREADS comes from the original GOTOBLAS library. Nobody should be using it!

For OMP, I thought that only applies if you build with OpenMP for threading - which we don't. I think we should only look at OPENBLAS_NUM_THREADS.

As a matter of fact OpenBLAS respects those variables. If you set them you can see with BLAS.get_num_threads that OpenBLAS changes the setting accordingly

Yup it does. I was just saying that we may only want to use one of them and only document that. However, I don't have a strong opinion if we want to use the other ones - but what should we do if more than one of those environment variables are present?

If any of those variables is set, we shouldn't override it. See #46118 🙂

…aLang#46085)

Use max(1, Sys.CPU_THREADS) BLAS threads for aarch64.

aa27862

None of the common ARM CPUs that come to mind have hyperthreading.

KristofferC added the backport 1.8 Change should be backported to release-1.8 label Jul 18, 2022

giordano reviewed Jul 18, 2022

View reviewed changes

Update LinearAlgebra.jl

2b33c6b

Use binarybuilder to normalize arch, and for now only use 1 BLAS thread per CPU thread for `Sys.isapple()` and `aarch64`

KristofferC mentioned this pull request Jul 20, 2022

release-1.8: Backports for 1.8-rc4 #46075

Merged

32 tasks

KristofferC merged commit 97df6db into JuliaLang:master Jul 20, 2022

KristofferC pushed a commit that referenced this pull request Jul 20, 2022

Use max(1, Sys.CPU_THREADS) BLAS threads for apple aarch64. (#46085)

f3bc6bf

(cherry picked from commit 97df6db)

chriselrod deleted the patch-2 branch July 20, 2022 14:08

giordano reviewed Jul 20, 2022

View reviewed changes

giordano mentioned this pull request Jul 20, 2022

[LinearAlgebra] Support more variables to set OpenBLAS threads #46118

Merged

KristofferC removed the backport 1.8 Change should be backported to release-1.8 label Aug 7, 2022

ffucci pushed a commit to ffucci/julia that referenced this pull request Aug 11, 2022

Use max(1, Sys.CPU_THREADS) BLAS threads for apple aarch64. (Juli…

fec8608

…aLang#46085)

pcjentsch pushed a commit to pcjentsch/julia that referenced this pull request Aug 18, 2022

Use max(1, Sys.CPU_THREADS) BLAS threads for apple aarch64. (Juli…

49822f0

…aLang#46085)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `max(1, Sys.CPU_THREADS)` BLAS threads for `aarch64`. #46085

Use `max(1, Sys.CPU_THREADS)` BLAS threads for `aarch64`. #46085

chriselrod commented Jul 18, 2022

gbaraldi commented Jul 18, 2022

giordano Jul 18, 2022

giordano Jul 18, 2022

gbaraldi Jul 18, 2022

chriselrod Jul 18, 2022

gbaraldi Jul 18, 2022

chriselrod Jul 18, 2022 •

edited

Loading

chriselrod Jul 18, 2022

gbaraldi Jul 18, 2022

chriselrod Jul 18, 2022

giordano Jul 18, 2022

giordano Jul 20, 2022

ViralBShah Jul 20, 2022

giordano Jul 20, 2022

ViralBShah Jul 21, 2022 •

edited

Loading

giordano Jul 21, 2022

		@@ -582,7 +582,11 @@ function __init__()
		Base.at_disable_library_threading(() -> BLAS.set_num_threads(1))

		if !haskey(ENV, "OPENBLAS_NUM_THREADS")

Use max(1, Sys.CPU_THREADS) BLAS threads for aarch64. #46085

Use max(1, Sys.CPU_THREADS) BLAS threads for aarch64. #46085

Conversation

chriselrod commented Jul 18, 2022

gbaraldi commented Jul 18, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chriselrod Jul 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ViralBShah Jul 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Use `max(1, Sys.CPU_THREADS)` BLAS threads for `aarch64`. #46085

Use `max(1, Sys.CPU_THREADS)` BLAS threads for `aarch64`. #46085

chriselrod Jul 18, 2022 •

edited

Loading

ViralBShah Jul 21, 2022 •

edited

Loading