-
-
Notifications
You must be signed in to change notification settings - Fork 611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very slow "using Flux" #1961
Comments
That is indeed slower than expected. We'll want to see your versioninfo and what version of Flux you're trying to load at a minimum, but assuming your CPU is rather recent I would suspect an out-of-date Julia version to be the culprit. |
My cpu is some years old now, but I hope the problem is not that
and |
I don't think so, but your operating system might. Can you try timing without any AV or related programs running? |
If you use the Julia 1.9 dev version, then you can use the fancy julia> versioninfo()
Julia Version 1.9.0-DEV.351
Commit 385762b444 (2022-04-08 21:50 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 24 × 12th Gen Intel(R) Core(TM) i9-12900K
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
Threads: 1 on 24 virtual cores time_import resultjulia> @time_imports using Flux
8.3 ms ┌ MacroTools
15.9 ms ┌ ZygoteRules
0.2 ms ┌ DefineSingletons
0.1 ms ┌ IteratorInterfaceExtensions
0.3 ms ┌ TableTraits
0.6 ms ┌ Compat
4.5 ms ┌ OrderedCollections
0.3 ms ┌ Requires
0.1 ms ┌ DataValueInterfaces
0.7 ms ┌ DataAPI
8.1 ms ┌ Tables
0.5 ms ┌ ConstructionBase
18.7 ms ┌ Setfield
10.5 ms ┌ InitialValues
75.0 ms ┌ BangBang
329.3 ms ┌ StaticArrays
807.4 ms ┌ DiffResults
9.0 ms ┌ FunctionWrappers
41.8 ms ┌ DataStructures
0.4 ms ┌ NaNMath
0.1 ms ┌ CompositionsBase
34.3 ms ┌ Accessors
1.4 ms ┌ ContextVariablesX
0.2 ms ┌ RealDot
3.2 ms ┌ IrrationalConstants
48.0 ms ┌ ChainRulesCore
63.0 ms ┌ ChainRules
5.5 ms ┌ AbstractFFTs
35.3 ms ┌ Preferences
35.9 ms ┌ JLLWrappers
171.4 ms ┌ LLVMExtra_jll
0.3 ms ┌ Reexport
16.4 ms ┌ RandomNumbers
19.7 ms ┌ Random123
2.8 ms ┌ CEnum
133.3 ms ┌ LLVM
0.5 ms ┌ Adapt
600.2 ms ┌ GPUArrays
0.5 ms ┌ ExprTools
42.8 ms ┌ TimerOutputs
3.9 ms ┌ BFloat16s
278.0 ms ┌ GPUCompiler
20.3 ms ┌ NNlib
726.3 ms ┌ CUDA
2080.8 ms ┌ NNlibCUDA
115.0 ms ┌ FillArrays
0.6 ms ┌ ChangesOfVariables
0.8 ms ┌ InverseFunctions
4.1 ms ┌ DocStringExtensions
0.4 ms ┌ StatsAPI
0.4 ms ┌ SortingAlgorithms
0.7 ms ┌ LogExpFunctions
6.2 ms ┌ Missings
31.2 ms ┌ StatsBase
9.4 ms ┌ MicroCollections
0.4 ms ┌ ArgCheck
6.2 ms ┌ SplittablesBase
27.5 ms ┌ Baselet
93.7 ms ┌ Transducers
0.4 ms ┌ OpenLibm_jll
193.0 ms ┌ MLStyle
0.6 ms ┌ OpenSpecFun_jll
115.8 ms ┌ SpecialFunctions
0.5 ms ┌ CommonSubexpressions
0.7 ms ┌ DiffRules
417.8 ms ┌ ForwardDiff
0.2 ms ┌ IfElse
1.6 ms ┌ PrettyPrint
29.5 ms ┌ Static
132.2 ms ┌ ArrayInterface
2.8 ms ┌ ProgressLogging
6.5 ms ┌ ShowCases
349.2 ms ┌ FoldsThreads
0.3 ms ┌ FLoopsBase
0.5 ms ┌ NameResolution
3.5 ms ┌ JuliaVariables
6.8 ms ┌ FLoops
1.5 ms ┌ Functors
10.4 ms ┌ IRTools
1714.0 ms ┌ Zygote
7.8 ms ┌ Optimisers
5.6 ms ┌ MLUtils
6324.0 ms Flux |
I had wsl running, but now that I closed it it's even worse
Should I use a pre-release version of Julia? |
The presence/absence of WSL seems like a red herring to me. Just to confirm, these times are sans AV or any other program that might try to sniff actively running executables? |
I tried without AV but the time is again around 40 seconds. I don't notice strange performance problems for any other running software |
As a reference point, here's what I get on an older xeon (from 2012, while it seems i7-6500U is from 2015) with latest Flux. tl;dr is that I don't think that 20s implies something is wrong on your end, sadly. julia> @time using Flux # first time, after updating some packages
[ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c]
134.147325 seconds (20.18 M allocations: 1.405 GiB, 0.61% gc time, 7.73% compilation time: 49% of which was recompilation)
# restart Julia, no changes
julia> @time using Flux
22.720761 seconds (20.17 M allocations: 1.404 GiB, 3.77% gc time, 45.13% compilation time: 49% of which was recompilation)
julia> @time using CUDA # after re-starting again
8.594542 seconds (9.48 M allocations: 705.886 MiB, 3.71% gc time, 26.78% compilation time: 85% of which was recompilation)
julia> versioninfo() # cyclops, an elderly xeon
Julia Version 1.9.0-DEV.454
Commit 6026b5ff2a (2022-04-29 12:16 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
Threads: 4 on 12 virtual cores Julia 1.7.2 is a little quicker:
Running the fancy new macro on 1.9, and trimming its output a bit, here's who the culprits are: julia> @time_imports using Flux
25.7 ms ┌ MacroTools
44.8 ms ┌ ZygoteRules 40.40% compilation time
227.7 ms ┌ BangBang 29.65% compilation time (6% recompilation)
1470.1 ms ┌ StaticArrays
3226.1 ms ┌ DiffResults 53.16% compilation time (85% recompilation)
174.0 ms ┌ DataStructures
211.8 ms ┌ ChainRulesCore
271.2 ms ┌ ChainRules
545.6 ms ┌ LLVMExtra_jll 85.64% compilation time (99% recompilation)
429.4 ms ┌ LLVM 32.98% compilation time
2608.8 ms ┌ GPUArrays 5.43% compilation time
840.0 ms ┌ GPUCompiler 33.01% compilation time (94% recompilation)
2965.1 ms ┌ CUDA 0.69% compilation time
8070.0 ms ┌ NNlibCUDA 20.35% compilation time (84% recompilation)
479.0 ms ┌ FillArrays
2.8 ms ┌ ChangesOfVariables
111.6 ms ┌ StatsBase 5.19% compilation time
320.1 ms ┌ Transducers 12.56% compilation time
0.9 ms ┌ OpenLibm_jll
1085.4 ms ┌ MLStyle
1211.3 ms ┌ ForwardDiff 61.29% compilation time (85% recompilation)
97.5 ms ┌ Static
422.4 ms ┌ ArrayInterface 49.69% compilation time
8.4 ms ┌ ProgressLogging
22.6 ms ┌ ShowCases
1281.1 ms ┌ FoldsThreads 92.11% compilation time
0.8 ms ┌ FLoopsBase
1.2 ms ┌ NameResolution
9.3 ms ┌ JuliaVariables
20.2 ms ┌ FLoops
3.7 ms ┌ Functors
30.8 ms ┌ IRTools
5583.3 ms ┌ Zygote 92.38% compilation time (37% recompilation)
29.0 ms ┌ Optimisers
18.0 ms ┌ MLUtils
23181.4 ms Flux 46.75% compilation time (50% recompilation) NNlibCUDA and DiffResults jump out as surprises, is there something wrong with these? Xref JuliaDiff/ForwardDiff.jl#518 and JuliaArrays/StaticArrays.jl#1023 for recent discussion. Surely we could also trim a few of these, do we really need FoldsThreads, MLStyle? |
Is there a recommended set of hardware for running Flux, then?
Il giorno lun 9 mag 2022 alle ore 22:52 Michael Abbott <
***@***.***> ha scritto:
… As a reference point, here's what I get on an older xeon (from 2012, while
it seems i7-6500U is from 2015) with latest Flux. tl;dr is that I don't
think that 20s implies something is wrong on your end, sadly.
julia> @time using Flux # first time, after updating some packages
[ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c]
134.147325 seconds (20.18 M allocations: 1.405 GiB, 0.61% gc time, 7.73% compilation time: 49% of which was recompilation)
# restart Julia, no changes
julia> @time using Flux
22.720761 seconds (20.17 M allocations: 1.404 GiB, 3.77% gc time, 45.13% compilation time: 49% of which was recompilation)
julia> @time using CUDA # after re-starting again
8.594542 seconds (9.48 M allocations: 705.886 MiB, 3.71% gc time, 26.78% compilation time: 85% of which was recompilation)
julia> versioninfo() # cyclops, an elderly xeon
Julia Version 1.9.0-DEV.454
Commit 6026b5ff2a (2022-04-29 12:16 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
Threads: 4 on 12 virtual cores
Julia 1.7.2 is a little quicker:
julia> @time using Flux
18.600435 seconds (23.27 M allocations: 1.445 GiB, 6.86% gc time, 47.83% compilation time)
—
Reply to this email directly, view it on GitHub
<#1961 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA7LDIX2CWRXUDIED5CUP3DVJF3JVANCNFSM5VLDVA5Q>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
--
Mattia Antonino Di Gangi, PhD
Speech Scientist, AppTek Gmbh.
Aachen, NRW, Germany
|
Those are slow, but I would expect a Skylake chip (even a mobile one) to run faster. For example, here is my Haswell machine: julia> @time using Flux
7.409032 seconds (22.81 M allocations: 1.422 GiB, 7.05% gc time, 48.99% compilation time)
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, haswell) That's on spinning rust, not a SSD like I would expect to see in a laptop with the kind of CPU @mattiadg has. Since there's no parallelism here, I would not expect a 1.6x clock speed increase and a microarchitectural disadvantage to lead to a 3x (let alone 6x) speed increase. Hence I suspect some sort of OS or other software interference. Have you checked whether your machine is on spec in other single-threaded benchmarks? The only other factors I can think of are generic troubleshooting ones like making sure RAM/disk aren't at their limits and power settings are cranked all the way up (which IIRC makes a huge difference on U-series chips). |
When all the open apps, including vs code and the AV, and set the battery
to max performance, it got down to 11s. Without max performance it was at
19.
Maybe the rest comes from the old cpu. Isn't it able to compile on multiple
cores?
Il giorno lun 9 mag 2022 alle ore 23:07 Brian Chen ***@***.***>
ha scritto:
… Those are slow, but I would expect a Skylake chip (even a mobile one) to
run faster. For example, here is my Haswell machine:
julia> @time using Flux
7.409032 seconds (22.81 M allocations: 1.422 GiB, 7.05% gc time, 48.99% compilation time)
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
That's on spinning rust, not a SSD like I would expect to see in a laptop
with the kind of CPU @mattiadg <https://github.com/mattiadg> has. Since
there's no parallelism here, I would not expect a 1.6x clock speed increase
and a microarchitectural *disadvantage* to lead to a 3x (let alone 6x)
speed increase. Hence I suspect some sort of OS or other software
interference. Have you checked whether your machine is on spec in other
single-threaded benchmarks? The only other factors I can think of are
generic troubleshooting ones like making sure RAM/disk aren't at their
limits and power settings are cranked all the way up (which IIRC makes a
huge difference on U-series chips).
—
Reply to this email directly, view it on GitHub
<#1961 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA7LDITQ7MPN4F36DG52FUTVJF47RANCNFSM5VLDVA5Q>
.
You are receiving this because you were mentioned.Message ID: <FluxML/Flux
.***@***.***>
--
Mattia Antonino Di Gangi, PhD
Speech Scientist, AppTek Gmbh.
Aachen, NRW, Germany
|
Great! 11s looks more reasonable and is hopefully more tolerable. The AV may have been a red herring, which is also great.
I do not believe the import process in Julia can run multi-threaded, so probably not. |
I asked about this on Slack and the conclusion was rather unsatisfying. In short (@IanButterworth can correct me if I'm wrong), just because
Those are deps of MLUtils. I think the former is pretty fundamental, but the latter is only used in one place. @lorenzoh would it be easy to desugar that floops macro so that the dep can be dropped? |
Re NNlib, with more coffee, it's closer to the sum of the unavoidable packages above it than I first thought. So maybe it's innocent! Would be great if all GPU stuff could be skipped when not necessary, but maybe not easy. |
Surely possible to desugar and use base Transducers.jl + FoldsThreads.jl, though I wonder whether those two may not be the main compilation problem. |
Not sure if this is the right application for this, but for a new approach to loading a module only when it is actually used, see https://github.com/johnnychen94/LazyModules.jl |
For CUDA.jl it will almost certainly hit the world-age issue because CuArray is an alien type to the caller world-age. See the discussion in [ANN] Accouncing LazyModules.jl: delay the heavy dependency loading to its first use IIUC, CUDA.jl must be eagerly loaded. NNlibCUDA.jl, on the other hand, might be okay to be lazily loaded. But this trick introduces a ~80ns overhead for each function call, so if it's a trivial computation, it's not ideal. The ideal use case of LazyModules is some standalone dependency that doesn't overlap or deeply coupled with the main codebase. ImageIO is a perfect example of it JuliaIO/ImageIO.jl#49. |
The future of this is probably JuliaLang/julia#47695 . That PR has nice instructions... anyone want to make a branch using Requires.jl and see how much slower this actually is? (Since that's how pre-47695 versions of Julia would then have to use Flux.) |
Back compatability doesn't mean you have to use Requires. You can have the package as-is, and using extensions. That's explained in the Pkg docs. It would be great to make CUDA weak! |
Ah, just load it unconditionally on old Julia, and via extensions on new. That sounds even better, it won't get worse for anyone. |
no longer relevant on julia 1.9 + extensions |
Hi, I'm new here.
I would like to use Flux but everytime I run
using Flux
it takes 23 seconds on my i5 CPU. The following is the full output:
Then, it becomes considerably faster the second time
I found #1283 but it is 2 years old and I didn't really find a solution there. Is such a slow startup expected?
The text was updated successfully, but these errors were encountered: