
Very slow "using Flux" #1961

Closed
mattiadg opened this issue May 8, 2022 · 21 comments
@mattiadg commented May 8, 2022

Hi, I'm new here.
I would like to use Flux, but every time I run
using Flux
it takes 23 seconds on my i5 CPU. The following is the full output:

@time using Flux  
23.719183 seconds (22.00 M allocations: 1.437 GiB, 5.05% gc time, 43.13% compilation time)

Then, it becomes considerably faster the second time

@time using Flux  
0.467162 seconds (888.83 k allocations: 47.272 MiB, 99.61% compilation time)

I found #1283 but it is 2 years old and I didn't really find a solution there. Is such a slow startup expected?

@ToucheSir (Member)

That is indeed slower than expected. We'll want to see your versioninfo and what version of Flux you're trying to load at a minimum, but assuming your CPU is rather recent I would suspect an out-of-date Julia version to be the culprit.

@mattiadg (Author) commented May 8, 2022

My CPU is some years old now, but I hope that's not the problem:

> versioninfo()

Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

and Flux v0.12.10

@ToucheSir (Member)

I don't think so, but your operating system might. Can you try timing without any AV or related programs running?

@johnnychen94 (Contributor) commented May 9, 2022

If you use a Julia 1.9 dev version, you can use the fancy @time_imports macro to get a per-package profile of loading time, e.g.,

julia> versioninfo()
Julia Version 1.9.0-DEV.351
Commit 385762b444 (2022-04-08 21:50 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × 12th Gen Intel(R) Core(TM) i9-12900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 24 virtual cores
@time_imports result:
julia> @time_imports using Flux
      8.3 ms    ┌ MacroTools
     15.9 ms  ┌ ZygoteRules
      0.2 ms  ┌ DefineSingletons
      0.1 ms    ┌ IteratorInterfaceExtensions
      0.3 ms  ┌ TableTraits
      0.6 ms    ┌ Compat
      4.5 ms    ┌ OrderedCollections
      0.3 ms    ┌ Requires
      0.1 ms    ┌ DataValueInterfaces
      0.7 ms      ┌ DataAPI
      8.1 ms    ┌ Tables
      0.5 ms    ┌ ConstructionBase
     18.7 ms    ┌ Setfield
     10.5 ms    ┌ InitialValues
     75.0 ms  ┌ BangBang
    329.3 ms    ┌ StaticArrays
    807.4 ms  ┌ DiffResults
      9.0 ms  ┌ FunctionWrappers
     41.8 ms  ┌ DataStructures
      0.4 ms  ┌ NaNMath
      0.1 ms    ┌ CompositionsBase
     34.3 ms  ┌ Accessors
      1.4 ms  ┌ ContextVariablesX
      0.2 ms    ┌ RealDot
      3.2 ms    ┌ IrrationalConstants
     48.0 ms    ┌ ChainRulesCore
     63.0 ms  ┌ ChainRules
      5.5 ms    ┌ AbstractFFTs
     35.3 ms        ┌ Preferences
     35.9 ms      ┌ JLLWrappers
    171.4 ms    ┌ LLVMExtra_jll
      0.3 ms    ┌ Reexport
     16.4 ms      ┌ RandomNumbers
     19.7 ms    ┌ Random123
      2.8 ms        ┌ CEnum
    133.3 ms      ┌ LLVM
      0.5 ms      ┌ Adapt
    600.2 ms    ┌ GPUArrays
      0.5 ms    ┌ ExprTools
     42.8 ms    ┌ TimerOutputs
      3.9 ms    ┌ BFloat16s
    278.0 ms    ┌ GPUCompiler
     20.3 ms    ┌ NNlib
    726.3 ms    ┌ CUDA
   2080.8 ms  ┌ NNlibCUDA
    115.0 ms  ┌ FillArrays
      0.6 ms  ┌ ChangesOfVariables
      0.8 ms    ┌ InverseFunctions
      4.1 ms    ┌ DocStringExtensions
      0.4 ms    ┌ StatsAPI
      0.4 ms    ┌ SortingAlgorithms
      0.7 ms    ┌ LogExpFunctions
      6.2 ms    ┌ Missings
     31.2 ms  ┌ StatsBase
      9.4 ms    ┌ MicroCollections
      0.4 ms    ┌ ArgCheck
      6.2 ms    ┌ SplittablesBase
     27.5 ms    ┌ Baselet
     93.7 ms  ┌ Transducers
      0.4 ms  ┌ OpenLibm_jll
    193.0 ms  ┌ MLStyle
      0.6 ms      ┌ OpenSpecFun_jll
    115.8 ms    ┌ SpecialFunctions
      0.5 ms    ┌ CommonSubexpressions
      0.7 ms    ┌ DiffRules
    417.8 ms  ┌ ForwardDiff
      0.2 ms  ┌ IfElse
      1.6 ms  ┌ PrettyPrint
     29.5 ms    ┌ Static
    132.2 ms  ┌ ArrayInterface
      2.8 ms  ┌ ProgressLogging
      6.5 ms  ┌ ShowCases
    349.2 ms  ┌ FoldsThreads
      0.3 ms  ┌ FLoopsBase
      0.5 ms  ┌ NameResolution
      3.5 ms  ┌ JuliaVariables
      6.8 ms  ┌ FLoops
      1.5 ms  ┌ Functors
     10.4 ms  ┌ IRTools
   1714.0 ms  ┌ Zygote
      7.8 ms  ┌ Optimisers
      5.6 ms  ┌ MLUtils
   6324.0 ms  Flux

@mattiadg (Author) commented May 9, 2022

I had WSL running, but now that I've closed it, it's even worse:

> @time using Flux
 44.074991 seconds (20.31 M allocations: 1.349 GiB, 4.12% gc time, 28.61% compilation time)

Should I use a pre-release version of Julia?

@ToucheSir (Member)

The presence/absence of WSL seems like a red herring to me. Just to confirm, these times are sans AV or any other program that might try to sniff actively running executables?

@mattiadg (Author) commented May 9, 2022

I tried without AV, but the time is still around 40 seconds. I don't notice strange performance problems with any other running software.

@mcabbott (Member) commented May 9, 2022

As a reference point, here's what I get on an older Xeon (from 2012, while the i7-6500U seems to be from 2015) with the latest Flux.

tl;dr is that I don't think that 20s implies something is wrong on your end, sadly.

julia> @time using Flux  # first time, after updating some packages
[ Info: Precompiling Flux [587475ba-b771-5e3f-ad9e-33799f191a9c]
134.147325 seconds (20.18 M allocations: 1.405 GiB, 0.61% gc time, 7.73% compilation time: 49% of which was recompilation)

# restart Julia, no changes

julia> @time using Flux
 22.720761 seconds (20.17 M allocations: 1.404 GiB, 3.77% gc time, 45.13% compilation time: 49% of which was recompilation)

julia> @time using CUDA  # after re-starting again
  8.594542 seconds (9.48 M allocations: 705.886 MiB, 3.71% gc time, 26.78% compilation time: 85% of which was recompilation)

julia> versioninfo()  # cyclops, an elderly xeon
Julia Version 1.9.0-DEV.454
Commit 6026b5ff2a (2022-04-29 12:16 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 4 on 12 virtual cores

Julia 1.7.2 is a little quicker:

julia> @time using Flux
 18.600435 seconds (23.27 M allocations: 1.445 GiB, 6.86% gc time, 47.83% compilation time)

Running the fancy new macro on 1.9 and trimming its output a bit, here are the culprits:

julia> @time_imports using Flux
     25.7 ms    ┌ MacroTools
     44.8 ms  ┌ ZygoteRules 40.40% compilation time
    227.7 ms  ┌ BangBang 29.65% compilation time (6% recompilation)
   1470.1 ms    ┌ StaticArrays
   3226.1 ms  ┌ DiffResults 53.16% compilation time (85% recompilation)
    174.0 ms  ┌ DataStructures
    211.8 ms    ┌ ChainRulesCore
    271.2 ms  ┌ ChainRules
    545.6 ms    ┌ LLVMExtra_jll 85.64% compilation time (99% recompilation)
    429.4 ms      ┌ LLVM 32.98% compilation time
   2608.8 ms    ┌ GPUArrays 5.43% compilation time
    840.0 ms    ┌ GPUCompiler 33.01% compilation time (94% recompilation)
   2965.1 ms    ┌ CUDA 0.69% compilation time
   8070.0 ms  ┌ NNlibCUDA 20.35% compilation time (84% recompilation)
    479.0 ms  ┌ FillArrays
      2.8 ms  ┌ ChangesOfVariables
    111.6 ms  ┌ StatsBase 5.19% compilation time
    320.1 ms  ┌ Transducers 12.56% compilation time
      0.9 ms  ┌ OpenLibm_jll
   1085.4 ms  ┌ MLStyle
   1211.3 ms  ┌ ForwardDiff 61.29% compilation time (85% recompilation)
     97.5 ms    ┌ Static
    422.4 ms  ┌ ArrayInterface 49.69% compilation time
      8.4 ms  ┌ ProgressLogging
     22.6 ms  ┌ ShowCases
   1281.1 ms  ┌ FoldsThreads 92.11% compilation time
      0.8 ms  ┌ FLoopsBase
      1.2 ms  ┌ NameResolution
      9.3 ms  ┌ JuliaVariables
     20.2 ms  ┌ FLoops
      3.7 ms  ┌ Functors
     30.8 ms  ┌ IRTools
   5583.3 ms  ┌ Zygote 92.38% compilation time (37% recompilation)
     29.0 ms  ┌ Optimisers
     18.0 ms  ┌ MLUtils
  23181.4 ms  Flux 46.75% compilation time (50% recompilation)

NNlibCUDA and DiffResults jump out as surprises, is there something wrong with these? Xref JuliaDiff/ForwardDiff.jl#518 and JuliaArrays/StaticArrays.jl#1023 for recent discussion.

Surely we could also trim a few of these, do we really need FoldsThreads, MLStyle?

@mattiadg (Author) commented May 9, 2022 via email

@ToucheSir (Member)

Those are slow, but I would expect a Skylake chip (even a mobile one) to run faster. For example, here is my Haswell machine:

julia> @time using Flux
  7.409032 seconds (22.81 M allocations: 1.422 GiB, 7.05% gc time, 48.99% compilation time)

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)

That's on spinning rust, not an SSD like I would expect to see in a laptop with the kind of CPU @mattiadg has. Since there's no parallelism here, I would not expect a 1.6x clock speed increase and a microarchitectural disadvantage to lead to a 3x (let alone 6x) speed increase. Hence I suspect some sort of OS or other software interference.

Have you checked whether your machine is on spec in other single-threaded benchmarks? The only other factors I can think of are generic troubleshooting ones: making sure RAM/disk aren't at their limits and power settings are cranked all the way up (which IIRC makes a huge difference on U-series chips).

@mattiadg (Author) commented May 9, 2022 via email

@ToucheSir (Member)

Great! 11s looks more reasonable and is hopefully more tolerable. The AV may have been a red herring, which is also great.

Maybe the rest comes from the old CPU. Isn't it able to compile on multiple cores?

I do not believe the import process in Julia can run multi-threaded, so probably not.

@ToucheSir (Member)

NNlibCUDA and DiffResults jump out as surprises, is there something wrong with these? Xref JuliaDiff/ForwardDiff.jl#518 and JuliaArrays/StaticArrays.jl#1023 for recent discussion.

I asked about this on Slack and the conclusion was rather unsatisfying. In short (@IanButterworth can correct me if I'm wrong), just because @time_imports assigns a particular (re)compilation time to a dep doesn't mean that dep is personally responsible. Ref. JuliaLang/julia#45064 (comment): tracking provenance turns out to be non-trivial. For this specific case, SnoopCompile reports no invalidations caused by NNlibCUDA, so most (if not all) of the recompilation reported for it comes from deps or transitive deps.
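For anyone who wants to reproduce that kind of check, here's a sketch of the SnoopCompile workflow (assuming SnoopCompile and SnoopCompileCore are installed; run in a fresh session):

```julia
# Sketch: record method invalidations triggered while loading a package.
# SnoopCompileCore must be loaded *before* the package under test, and
# SnoopCompile (which itself invalidates things) only afterwards.
using SnoopCompileCore
invalidations = @snoopr using NNlibCUDA   # record invalidations during load
using SnoopCompile
trees = invalidation_trees(invalidations) # group them by the offending method
# An empty (or near-empty) result means NNlibCUDA itself invalidates little,
# so the recompilation time attributed to it comes from its dependencies.
```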

Surely we could also trim a few of these, do we really need FoldsThreads, MLStyle?

Those are deps of MLUtils. I think the former is pretty fundamental, but the latter is only used in one place. @lorenzoh would it be easy to desugar that floops macro so that the dep can be dropped?

@mcabbott (Member)

Re NNlib, with more coffee, it's closer to the sum of the unavoidable packages above it than I first thought. So maybe it's innocent!

Would be great if all GPU stuff could be skipped when not necessary, but maybe not easy.

@lorenzoh (Member)

Those are deps of MLUtils. I think the former is pretty fundamental, but the latter is only used in one place. @lorenzoh would it be easy to desugar that floops macro so that the dep can be dropped?

Surely possible to desugar and use base Transducers.jl + FoldsThreads.jl, though I wonder whether those two may not be the main compilation problem.

@lorenzoh (Member) commented May 12, 2022

Would be great if all GPU stuff could be skipped when not necessary, but maybe not easy.

Not sure if this is the right application for it, but for a new approach to loading a module only when it is actually used, see https://github.com/johnnychen94/LazyModules.jl

@johnnychen94 (Contributor) commented May 12, 2022

For CUDA.jl it will almost certainly hit the world-age issue, because CuArray is an alien type to the caller's world age. See the discussion in "[ANN] Announcing LazyModules.jl: delay the heavy dependency loading to its first use".

IIUC, CUDA.jl must be eagerly loaded. NNlibCUDA.jl, on the other hand, might be okay to load lazily. But this trick introduces a ~80 ns overhead on each function call, so it's not ideal for trivial computations.

The ideal use case for LazyModules is a standalone dependency that doesn't overlap with, and isn't deeply coupled to, the main codebase. ImageIO is a perfect example of it: JuliaIO/ImageIO.jl#49.
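For context, the usage pattern looks roughly like this (a sketch based on the LazyModules.jl README; JSON here is just an illustrative stand-in for a heavy dependency):

```julia
# Sketch of the LazyModules.jl pattern: `@lazy import` registers a proxy and
# defers the real load until a name from the module is first accessed.
using LazyModules
@lazy import JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"  # name = UUID
# Nothing is compiled or loaded for JSON at this point...
data = JSON.parse("{\"a\": 1}")  # ...the first access triggers the real import
```

Every call through the proxy pays the small dispatch overhead mentioned above, which is why this only pays off for coarse-grained, infrequently-called dependencies.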

@mcabbott (Member) commented Dec 8, 2022

The future of this is probably JuliaLang/julia#47695. That PR has nice instructions... anyone want to make a branch using Requires.jl and see how much slower this actually is? (Since that's how pre-#47695 versions of Julia would then have to use Flux.)

@IanButterworth (Contributor) commented Dec 8, 2022

Backward compatibility doesn't mean you have to use Requires. You can keep the package as-is and use extensions; that's explained in the Pkg docs.

It would be great to make CUDA weak!

@mcabbott (Member) commented Dec 8, 2022

Ah, just load it unconditionally on old Julia, and via extensions on new. That sounds even better, it won't get worse for anyone.
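Concretely, per the Pkg documentation that would mean a Project.toml along these lines (a sketch; the extension name is illustrative, though the CUDA UUID is the registered one):

```toml
# Project.toml (sketch): on Julia >= 1.9, CUDA moves from [deps] to [weakdeps]
# and the glue code lives in ext/FluxCUDAExt.jl, loaded automatically once the
# user does `using CUDA` themselves. On older Julia, Pkg treats weakdeps as
# ordinary deps, so CUDA is simply loaded unconditionally.
[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
FluxCUDAExt = "CUDA"
```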

@CarloLucibello (Member)

No longer relevant on Julia 1.9 + extensions.
