
Add macro for timing package and dependency imports #41612

Merged · 5 commits · Jul 21, 2021

Conversation

@IanButterworth (Member) commented Jul 16, 2021

Adds @time_imports (initially named @loadtime) for reporting the import time of packages and all of their dependencies (direct and indirect).

Edit: updated to show dependency nesting via indentation.

julia> @time_imports using CSV
      0.6 ms    ┌ IteratorInterfaceExtensions
    115.4 ms  ┌ TableTraits
    365.8 ms  ┌ SentinelArrays
    102.0 ms  ┌ Parsers
      0.7 ms  ┌ DataValueInterfaces
      1.7 ms    ┌ DataAPI
     34.0 ms  ┌ WeakRefStrings
     16.7 ms  ┌ Tables
     28.0 ms  ┌ PooledArrays
    698.7 ms  CSV

julia> @time_imports using Flux
     58.9 ms    ┌ MacroTools
     61.8 ms  ┌ ZygoteRules
      3.0 ms    ┌ Zlib_jll
      7.4 ms  ┌ ZipFile
    649.3 ms    ┌ StaticArrays
    655.9 ms  ┌ DiffResults
      3.0 ms  ┌ Compat
    106.8 ms  ┌ DataStructures
      0.9 ms  ┌ NaNMath
      1.1 ms  ┌ Requires
      3.4 ms    ┌ TranscodingStreams
      8.5 ms  ┌ CodecZlib
     83.6 ms  ┌ FixedPointNumbers
     65.0 ms    ┌ ChainRulesCore
     92.4 ms  ┌ ChainRules
     11.5 ms    ┌ AbstractFFTs
      1.0 ms        ┌ Preferences
      4.0 ms      ┌ JLLWrappers
    290.5 ms    ┌ LLVMExtra_jll
      0.7 ms    ┌ Reexport
     25.9 ms      ┌ RandomNumbers
     35.6 ms    ┌ Random123
      1.0 ms      ┌ Adapt
   1083.4 ms    ┌ GPUArrays
      4.1 ms    ┌ DocStringExtensions
      5.2 ms      ┌ CEnum
    148.9 ms    ┌ LLVM
      0.8 ms    ┌ ExprTools
     52.3 ms    ┌ TimerOutputs
     16.8 ms    ┌ CompilerSupportLibraries_jll
      9.1 ms    ┌ BFloat16s
      4.0 ms      ┌ LogExpFunctions
      2.6 ms      ┌ OpenSpecFun_jll
     28.7 ms    ┌ SpecialFunctions
    242.4 ms    ┌ GPUCompiler
    153.6 ms    ┌ NNlib
    828.2 ms    ┌ CUDA
   3017.7 ms  ┌ NNlibCUDA
    176.0 ms  ┌ FillArrays
      5.3 ms  ┌ Media
      0.7 ms    ┌ CommonSubexpressions
      1.0 ms    ┌ DiffRules
    117.1 ms  ┌ ForwardDiff
     15.4 ms  ┌ AbstractTrees
      0.9 ms  ┌ Functors
     34.5 ms  ┌ IRTools
   3026.2 ms  ┌ Zygote
    292.5 ms  ┌ ColorTypes
    221.1 ms  ┌ Colors
     42.4 ms  ┌ Juno
   8241.7 ms  Flux

julia> @time_imports begin
       using BenchmarkTools
       using PNGFiles
       end
      5.4 ms  ┌ JSON
     16.1 ms  BenchmarkTools
     76.9 ms      ┌ OffsetArrays
     88.6 ms    ┌ PaddedViews
      6.5 ms      ┌ StackViews
      6.0 ms      ┌ MappedArrays
     20.4 ms    ┌ MosaicViews
      3.1 ms    ┌ Graphics
      1.7 ms    ┌ TensorCore
    180.4 ms    ┌ ColorVectorSpace
    920.3 ms  ┌ ImageCore
      2.0 ms  ┌ libpng_jll
      4.2 ms  ┌ IndirectArrays
   1099.6 ms  PNGFiles

Note that it's real load time, not isolated load time: if a package's deps have already been loaded by an earlier package, it will accurately report a faster time than if that package were loaded in isolation.
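
For example (a hypothetical session; the CSV figures are from the run above, the rest is illustrative): DataFrames also depends on Tables, so importing it after CSV skips that cost entirely.

julia> @time_imports using CSV         # first load: Tables costs ~16.7 ms here
julia> @time_imports using DataFrames  # Tables is already resident, so it is not
                                       # re-imported and does not appear in this report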

@antoine-levitt (Contributor)

Great idea! Maybe print in seconds, so that it's clearer where the big offenders are?

@IanButterworth (Member, Author)

I don't think 1 second is the cut-off for being highlighted. Smaller load times can be problematic when there are lots of them.

I chose ms and aligned the decimals so that the output effectively forms a bar graph on the left. Hopefully that goes a fair way toward making it easier to parse.
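
For the curious, the alignment effect can be reproduced with a fixed-width, right-justified field (a minimal sketch; the field width of 9 is illustrative, not necessarily what the PR uses):

using Printf

# Right-justifying the millisecond value in a fixed-width field lines up
# the decimal points, so the digit count reads like a horizontal bar chart.
for (t, name) in ((0.6, "IteratorInterfaceExtensions"), (698.7, "CSV"))
    @printf "%9.1f ms  %s\n" t name
end
# prints:
#       0.6 ms  IteratorInterfaceExtensions
#     698.7 ms  CSV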

@IanButterworth (Member, Author)

Perhaps it should be @timeloading, which might make a bit more grammatical sense and would sit beside @time, @timev, and @timed.

@antoine-levitt (Contributor)

Or just @time_using pkg, since it's only really supposed to be used like that?

Would it be easy to add an indicator of the recursion level? Something like

X ms >> Pkg3
X ms > Pkg2
X ms  Pkg1

?

@PallHaraldsson (Contributor)

Great to have; I've wanted this for a long time, and I see the code to make it happen was surprisingly short.

I assume the timing is correct (just better on 1.8?). I load Zygote (same/latest version) in 4.5 s, 67% longer than you do. While it could be explained by system load (load average 2.73, with Firefox running, on a decent machine: 32 GB RAM, 16 virtual cores), that might not be the full story, as loading Flux is (consistently) only 38% slower, PNGFiles 41% slower, and CSV 60% slower.

I'm on 1.7-beta3; different versions could also explain it.

@IanButterworth (Member, Author)

I think what you're observing is isolated load time vs. interleaved load time:

Note that it's real load time, not isolated, so if a package's deps have been loaded by an earlier package it will accurately report a faster time than if it were to be loaded in isolation.

That's a key point. I'll add it to the docstring.

@IanButterworth (Member, Author)

Or just @time_using pkg, since it's only really supposed to be used like that?

Given that people can use either using or import, and that this works with code blocks, I wouldn't want to constrain users too narrowly.

Would it be easy to add an indicator of the recursion level?

Good idea, I've added this:

julia> @loadtime using Flux
     50.0 ms      MacroTools
    340.0 ms    ZygoteRules
      2.6 ms      Zlib_jll
    120.1 ms    ZipFile
    598.6 ms      StaticArrays
    604.8 ms    DiffResults
      2.5 ms    Compat
     80.5 ms    DataStructures
      0.8 ms    NaNMath
      0.9 ms    Requires
      2.9 ms      TranscodingStreams
      7.4 ms    CodecZlib
     59.0 ms    FixedPointNumbers
     58.7 ms      ChainRulesCore
     76.4 ms    ChainRules
      9.4 ms      AbstractFFTs
      0.8 ms          Preferences
      3.5 ms        JLLWrappers
     14.5 ms      LLVMExtra_jll
      0.6 ms      Reexport
     25.3 ms        RandomNumbers
     34.5 ms      Random123
      0.8 ms        Adapt
   1061.5 ms      GPUArrays
      4.2 ms      DocStringExtensions
      5.0 ms        CEnum
    139.0 ms      LLVM
      0.8 ms      ExprTools
     53.8 ms      TimerOutputs
     16.7 ms      CompilerSupportLibraries_jll
      6.1 ms      BFloat16s
      3.8 ms        LogExpFunctions
      2.3 ms        OpenSpecFun_jll
     26.5 ms      SpecialFunctions
    206.1 ms      GPUCompiler
     29.5 ms      NNlib
    897.0 ms      CUDA
   2597.4 ms    NNlibCUDA
    164.2 ms    FillArrays
      5.1 ms    Media
      0.8 ms      CommonSubexpressions
      1.3 ms      DiffRules
    115.2 ms    ForwardDiff
     14.8 ms    AbstractTrees
      1.0 ms    Functors
     33.1 ms    IRTools
   2669.6 ms    Zygote
    256.1 ms    ColorTypes
    177.5 ms    Colors
     64.1 ms    Juno
   7807.4 ms  Flux

@Drvi (Contributor) commented Jul 16, 2021

Great idea! :+1: Could the output be sorted, though (ideally within each nesting level)?

@IanButterworth (Member, Author)

Sorting would require much more invasive code, and I'm not sure it would even make sense: the order here is key to understanding the timings, given the !!! note in the docstring.

@PallHaraldsson (Contributor)

It's maybe not well known, but I've seen total loading time differ depending on the order in which dependencies are using-ed, so that might be an argument not to sort, and to keep this order instead.

@IanButterworth (Member, Author)

I've experienced total loading time to differ by ordering of using of dependencies

That's likely to be invalidation-related.

@IanButterworth (Member, Author)

Now with tree-branch indicators to highlight the direction of the dependencies:

[Screenshot: @loadtime output with tree-branch indicators]

base/exports.jl Outdated
@@ -993,6 +993,7 @@ export
@timev,
@elapsed,
@allocated,
@loadtime,
Review comment (Member):
What about @time_imports?

This only measures timing for serialized code, right? Anything marked __precompile__(false) will not be timed.

Reply (Member, Author):

Yeah, that's good! Thanks.

@vchuravy (Member) commented Jul 18, 2021

The nesting seems a bit off:

     76.4 ms    ChainRules
      9.4 ms      AbstractFFTs
      0.8 ms          Preferences
      3.5 ms        JLLWrappers
     14.5 ms      LLVMExtra_jll
      0.6 ms      Reexport
     25.3 ms        RandomNumbers
     34.5 ms      Random123
      0.8 ms        Adapt
   1061.5 ms      GPUArrays
      4.2 ms      DocStringExtensions
      5.0 ms        CEnum
    139.0 ms      LLVM

LLVMExtra_jll should be attributed to LLVM and not ChainRules.

@IanButterworth (Member, Author) commented Jul 18, 2021

@vchuravy The nesting goes the other way. It's a little counterintuitive. That's why I added the tree branch indicators.

But you're right that something is off. Judging by this, LLVMExtra_jll would have to be a direct dep of NNlibCUDA alongside LLVM, which it isn't.

I think it's to do with when the depth value is incremented

[Screenshot: output under discussion]

By the way, making the tree-branch characters a bit more aesthetic would be possible by threading a bit more information through the functions, but I thought it might be better to keep this as minimal as possible so that it has less overhead when switched off.

base/timing.jl Outdated
package will accurately report a faster load time than if it were to be loaded in isolation.

"""
macro loadtime(ex)
Review comment (Member):

What about stupid people like me who write:


@loadtime begin
    # lots of indirection
    @loadtime begin
        using ...
    end
    using A
end

Currently A won't be counted.

Reply (Member, Author):

True. I guess instead of a Bool it could be an Int that's incremented/decremented and checked for being > 0.
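
A minimal sketch of that counter idea (hypothetical names; not the PR's final code):

# A nesting depth instead of a Bool: the inner block's `finally` decrements
# the counter rather than switching timing off entirely, so `using A` in the
# outer block is still counted.
const LOADTIME_DEPTH = Ref(0)   # hypothetical global

macro loadtime_sketch(ex)
    quote
        LOADTIME_DEPTH[] += 1
        try
            $(esc(ex))
        finally
            LOADTIME_DEPTH[] -= 1
        end
    end
end

# The loading code would report timings whenever LOADTIME_DEPTH[] > 0.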

@IanButterworth (Member, Author)

Renamed to @time_imports.

I'm realizing that the nesting of _tryrequire_from_serialized isn't a reliable representation of the dependency tree, because each package tries to load all of its deps, not just the direct ones, and the load order isn't top-down; it just tries them in some sequence. Nesting only happens when a dep hits a dep that hasn't been seen before.

There may not be a way to reliably represent the dependency tree here without parsing the Manifest. Maybe that could be done in the presence of this macro, though.
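
For reference, a rough sketch of the Manifest route (not what this PR does; assumes the pre-1.7 manifest format, where top-level keys are package names):

using TOML

# Build a package => direct-deps map from the active environment's manifest.
manifest_path = joinpath(dirname(Base.active_project()), "Manifest.toml")
manifest = TOML.parsefile(manifest_path)
depmap = Dict{String,Vector{String}}()
for (name, entries) in manifest
    entries isa Vector || continue        # skip any non-package metadata keys
    deps = get(first(entries), "deps", String[])
    # "deps" is either a list of names or a name => UUID table
    depmap[name] = deps isa Vector ? deps : collect(keys(deps))
end

Walking depmap down from the top-level package would give true depths, but as noted there's no reliable way to match that against the order in which deps actually load.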

@PallHaraldsson (Contributor)

Can this be backported to 1.7? It would break no code, since it's only used in the REPL.

@oscardssmith (Member)

No. It will go in 1.8.

@IanButterworth (Member, Author)

I think it's going to be very hard to get the dependency tree nesting accurate, for the reason I mentioned above and given:

  • I tried reordering the dependencies at each level to put project deps first, which would make the current indenting approach work, but it broke precompilation, and even if it worked it would change the package load ordering, making the timings non-representative (i.e. package load order affecting invalidations?).
  • I thought about parsing the env manifest and attributing a depth to each dep, but deps appear in multiple places in the dep tree, and AFAICT it's not possible to link the order in which deps are loaded with the manifest deps map.

The indentation that this PR currently shows is most likely going to cause confusion, so I'm thinking about reverting to a flat list, unless someone sees an easy way to make it accurate?

@vchuravy (Member)

So right now the nesting represents the wavefront of parallelism? I would say that's still useful, but instead of a nested tree, maybe simply a list ordered by depth?

@IanButterworth (Member, Author)

The thing I was surprised by was that staledeps below (a confusingly named variable, as it's either a list of deps or a Bool stating that there are stale deps) isn't just the deps that are using/import-ed by the parent, but a list of all direct and indirect deps.

julia/base/loading.jl

Lines 839 to 850 in 9044222

for i in 1:length(staledeps)
    dep = staledeps[i]
    dep isa Module && continue
    modpath, modkey, build_id = dep::Tuple{String, PkgId, UInt64}
    dep = _tryrequire_from_serialized(modkey, build_id, modpath)
    if dep === nothing
        @debug "Required dependency $modkey failed to load from cache file for $modpath."
        staledeps = true
        break
    end
    staledeps[i] = dep::Module
end

I was expecting it to loop through only the imported direct dependencies, with the nesting handling things from there downwards.

maybe simply a list ordered by depth

With the depth not being accurate, adding sorting might obfuscate even further, and I think I like that this prints as soon as a dep is imported. It feels like peeking into the process.

@codecov (bot) commented Jul 19, 2021

Codecov Report

Merging #41612 (9044222) into master (004cb25) will decrease coverage by 4.47%. The diff coverage is 80.18%.

❗ Current head 9044222 differs from the pull request's most recent head 6cd8d97. Consider uploading reports for commit 6cd8d97 to get more accurate results.

@@            Coverage Diff             @@
##           master   #41612      +/-   ##
==========================================
- Coverage   86.15%   81.67%   -4.48%     
==========================================
  Files         350      350              
  Lines       65991    73977    +7986     
==========================================
+ Hits        56854    60424    +3570     
- Misses       9137    13553    +4416     
Impacted Files                   Coverage Δ
base/atomics.jl                  84.00% <ø> (+0.66%) ⬆️
base/baseext.jl                  94.73% <ø> (ø)
base/compiler/typeinfer.jl       76.61% <ø> (-14.25%) ⬇️
base/compiler/typelattice.jl     85.16% <ø> (-5.92%) ⬇️
base/compiler/typelimits.jl      81.86% <ø> (-11.07%) ⬇️
base/compiler/types.jl           65.51% <ø> (-4.49%) ⬇️
base/compiler/typeutils.jl       65.66% <ø> (-16.70%) ⬇️
base/compiler/utilities.jl       86.48% <ø> (-0.19%) ⬇️
base/compiler/validation.jl      88.99% <ø> (+0.64%) ⬆️
base/complex.jl                  99.43% <ø> (-0.16%) ⬇️
... and 514 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@IanButterworth changed the title from "Add loadtime macro for timing package and dependency loading" to "Add macro for timing package and dependency imports" (Jul 19, 2021)
@IanButterworth (Member, Author)

Given that import time accumulates when nested imports happen, it could also be confusing not to show the nesting. I added a note to the docstring that hopefully means the nesting can stay.

@JeffBezanson (Member)

Nice. It might seem like a small thing, but is it possible to move the identifier somewhere else, e.g. InteractiveUtils? I'm pretty concerned about bells-and-whistles creep in Base.

@IanButterworth (Member, Author)

Sounds good to me. I've moved it to InteractiveUtils and added a test.

@IanButterworth (Member, Author)

This is RTM (ready to merge) for me.

macro time_imports(ex)
    quote
        try
            Base.TIMING_IMPORTS[Threads.threadid()] += 1
Review comment (Member):

With Julia having thread migration now, this kind of code is a bit fishy: you can no longer guarantee that you receive the same value from threadid on the same task.

I would simplify this to a single Atomic in the system (which would also help with esoteric code doing using on a different task).

Reply (Member, Author):

OK, that sounds good. Thanks. Updated.
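
For reference, a minimal sketch of the single-atomic-counter pattern suggested above (illustrative names; not necessarily the exact code that landed):

# One process-wide atomic nesting counter: safe under task migration, since
# no per-thread state is keyed on Threads.threadid().
const TIME_IMPORTS_DEPTH = Threads.Atomic{Int}(0)

macro time_imports_sketch(ex)
    quote
        Threads.atomic_add!(TIME_IMPORTS_DEPTH, 1)
        try
            $(esc(ex))
        finally
            Threads.atomic_sub!(TIME_IMPORTS_DEPTH, 1)
        end
    end
end

# Loading code checks TIME_IMPORTS_DEPTH[] > 0 to decide whether to print.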

@JeffBezanson merged commit 69f8bce into JuliaLang:master on Jul 21, 2021
@IanButterworth deleted the ib/loadtime branch on Jul 21, 2021