Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jl_cumulative_compile_time_ns_before tracks compilation time per thread rather than per task #41271

Open
rgankema opened this issue Jun 18, 2021 · 6 comments
Labels
multithreading Base.Threads and related functionality

Comments

@rgankema
Copy link

rgankema commented Jun 18, 2021

For my work at RelationalAI I'm trying to gather metrics on how much time is spent in run-time compilation while evaluating queries. I was hoping to use jl_cumulative_compile_time_ns_[before/after] for this, a little bit like this:

function evaluate_query(query_fn)
    time_before = time_ns()
    compile_before = jl_cumulative_compile_time_ns_before()
    
    query_fn()
    
    compile_after = jl_cumulative_compile_time_ns_after()
    time_after = time_ns()
    inc_metric!(metrics, "query_eval_total_time", time_after - time_before)
    inc_metric!(metrics, "query_jl_compile_time", compile_after - compile_before)
end

This works fine if there are no background tasks and queries run sequentially. It also works fine in a multi-threaded scenario, as long as only a single task runs at-a-time at any given thread, because jl_cumulative_compile_time_ns_[before/after] seem to use separate counters per thread. However, it breaks when multiple tasks are multiplexed on the same thread. For instance, if one call of evaluate_query has a very long running query_fn, and other tasks that do work that incur compilation time are running at the same time on that thread, that first evaluate_query call will also record all the compilation time triggered by those others tasks. This means that we're overestimating how much time we spent in compilation for that given query, potentially by a very large margin.

To see this behavior in action, please consider the following MRE:

using Base.Threads: @async

function do_something_and_measure_compilation_time(a, b)
    compile_before = Base.cumulative_compile_time_ns_before()

    # Function that we want to measure.
    x = do_something(a, b)

    compile_after = Base.cumulative_compile_time_ns_after()
    compile_duration_s = (compile_after - compile_before) / 1_000_000_000
    @show compile_duration_s
    return x
end

# Simulates a long running function that costs a lot more time to run than to compile.
function do_something(a, b)
    x = foo(a, a + 3, b, b + 5)
    sleep(5)
    return x
end

# Something to compile.
function foo(x1, x2, y1, y2)
    xs = [x1:x2...]
    ys = [y1:y2...]
    return sum(xs) * max(xs...) * sum(ys) * max(ys...)
end

# Do a couple of runs. We should see compilation time at first, but that should quickly go
# away.
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)

# Call foo with types we haven't seen before, in the background.
@async begin
    sleep(1)
    for T1 in [Int32, Int16, Int8]
        for T2 in [Int32, Int16, Int8]
            @time foo(zero(T1), one(T1), zero(T2), one(T2))
        end
    end
end

# Final run, this time we'll measure compilation time from the above, even though it's
# not compilation that comes from this task.
do_something_and_measure_compilation_time(1, 2)

This outputs:

compile_duration_s = 0.024834394
compile_duration_s = 0.002964853
compile_duration_s = 0.0
compile_duration_s = 0.0
  0.037290 seconds (87.25 k allocations: 5.084 MiB, 99.79% compilation time)
  0.077509 seconds (169.89 k allocations: 10.369 MiB, 99.61% compilation time)
  0.068803 seconds (170.97 k allocations: 10.438 MiB, 99.69% compilation time)
  0.008911 seconds (4.97 k allocations: 271.748 KiB, 99.42% compilation time)
  0.009805 seconds (4.92 k allocations: 266.718 KiB, 99.52% compilation time)
  0.014783 seconds (4.97 k allocations: 269.733 KiB, 98.73% compilation time)
  0.009418 seconds (4.97 k allocations: 270.107 KiB, 99.38% compilation time)
  0.009743 seconds (4.97 k allocations: 269.671 KiB, 99.47% compilation time)
  0.013891 seconds (4.92 k allocations: 266.905 KiB, 99.33% compilation time)
compile_duration_s = 0.298222365

As you can see, the last call to do_something_and_measure_compilation_time recorded the compilation time of the background task. Unfortunately that behavior makes it unusable for the use-case described above. We'd therefore like to request that jl_cumulative_compile_time_ns_[before/after] is changed such that it keeps a counter per task, rather than per thread. Thanks!

@NHDaly
Copy link
Member

NHDaly commented Jun 18, 2021

Here's another, maybe a bit simpler example, which shows the same affect as above using @time. In this program, you can see that timing one Task which doesn't have any compilation time, running concurrently with another Task on the same thread that does reports the compilation time in the call to @time, even though it wasn't caused by that Task.

(Note that this example doesn't work if you copy/paste into the REPL; i think because it's compiling the REPL display output at the same time as the program runs, so both @time calls show some compilation time. So instead, I run this example as a script.)

# async-time-example.jl

const running = Threads.Atomic{Bool}(true)

loop() = while running[] ; yield(); GC.safepoint(); end

# Run once as warmup
running[] = true
t = @async loop()
sleep(0.1) ; running[] = false

wait(t)

# ----------------------------

running[] = true

# Start timing a task, which doesn't have any compilation time.
t = @async @time loop()

# Finish timing the task. Note no compilation time.
sleep(0.1) ; running[] = false

wait(t)

# ----------------------------

running[] = true

# Now start timing the *same* task, which shouldn't have any compilation time.
t = @async @time while running[] ; yield(); GC.safepoint(); end

# But start another task *on the same thread,* which DOES spend some compilation time.
t2 = @async (@eval f() = 2+2; @eval f())

# Finish timing the original task, and observe that the compilation time from `t2`
# was recorded by the `@time` in `t`.
sleep(0.1) ; running[] = false

wait(t) ; wait(t2)

And here's the output:

$ julia ~/Downloads/async-time-example.jl
  0.079854 seconds (72 allocations: 4.594 KiB)
  0.093527 seconds (2.27 k allocations: 158.235 KiB, 4.59% compilation time)

The request here is to record compilation time via task-local storage, instead of through thread-local storage.

@NHDaly NHDaly added the multithreading Base.Threads and related functionality label Jun 18, 2021
@NHDaly
Copy link
Member

NHDaly commented Jun 29, 2021

Worth noting that actually this is not unique to compilation time. The number of allocations are tracked globally as well, as shown in this simple example:

julia> function f(x::Array{Any})  x[1] + x[1]  end
f (generic function with 1 method)

julia> t = @async @time sleep(10)
Task (runnable) @0x000000010b059000

julia> @time for _ in 1:3000000 f(Any[2]) end
  0.204893 seconds (3.00 M allocations: 274.658 MiB, 16.42% gc time)

julia>  10.005931 seconds (3.04 M allocations: 277.261 MiB, 0.34% gc time, 0.19% compilation time)

This would be great to change for all the metric in @time, though i don't know how feasible that is.

@vtjnash
Copy link
Member

vtjnash commented Jun 29, 2021

IIRC, the code is currently taking extra effort to make this counter global (instead of per-thread). The challenge may be deciding if we should ever "charge" thread costs (time, memory, etc.) to the parent Tasks (all that called 'wait' on it), and how to handle task thread-migration.

@ElSeidy
Copy link

ElSeidy commented Jun 30, 2021

Tracking this discussion.
I just came across the same issue :)

@NHDaly
Copy link
Member

NHDaly commented Jul 1, 2021

Thanks @vtjnash - that makes sense. Interesting!

Yeah, after thinking more about it, i can see why the global accounting is actually really valuable for some situations.

It seems like maybe we'd ideally want both types of metrics? In addition to the global stuff we currently report, it would be great to also be able to track:

  • how many allocations did this Task and all child tasks involved in my span perform?
  • how much time was spent in compilation for this task and all child tasks?
  • how much CPU time did this task and my child tasks actually take when running on the CPU?
  • how much time did this task and child tasks spend blocking, not running?

Does that make sense to you? That we could (have the option to) record both types of metrics: task-local (and maybe including any child tasks that we wait on - as you say), and global?

@NHDaly
Copy link
Member

NHDaly commented Jul 8, 2021

@vtjnash -

IIRC, the code is currently taking extra effort to make this counter global (instead of per-thread).

After looking at this a bit more, i don't think the jl_cumulative_compile_time_ns_before is global though. It's per-thread.

And actually, @janrous-rai just pointed out to me that I think there's actually a race condition here, because if multiple Tasks scheduled on the same thread are both measuring @time, the first one to finish will disable jl_measure_compile_time for that thread, and the other one will miss some of its compilation time because it was disabled out from under it! 😮

jl_measure_compile_time[tid] = 0;

To fix the race condition, do you think we could at least change jl_measure_compile_time to be:
A) an thread-safe and/or atomic variable
B) a counter instead of a boolean? So that only the it will remain true as long as at least one Task on the thread is currently measuring compilation time?

Alternatively, is it really so expensive to measure the compilation time that we need to enable/disable it? Could we just keep this as a cumulative count of all compilation time, and remove the jl_measure_compile_time check entirely?

NHDaly added a commit to NHDaly/julia that referenced this issue Aug 18, 2021
Now, multiple tasks (on the same or different Threads) can start and stop compilation
time measurement, without interrupting each other.

* Makes jl_cumulative_compile_time_ns into a global, atomic variable.

Instead of keeping per-task compilation time, this change keeps a
global counter of compilation time, protected with atomic mutations.

Fixes JuliaLang#41739

```julia
julia> include("./compilation-task-migration-17-example.jl")
start thread: 2
end thread: 2
  5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time)

julia> include("./compilation-task-migration-17-example.jl")
start thread: 3
WARNING: replacing module M.
end thread: 1
  4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time)
```

Compilation time measurement originally added in: JuliaLang#38885

Problems addressed:
- This fixes JuliaLang#41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled.
- It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (JuliaLang#41271 (comment)).
  - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean.
  - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).
NHDaly added a commit to NHDaly/julia that referenced this issue Aug 23, 2021
Now, multiple tasks (on the same or different Threads) can start and stop compilation
time measurement, without interrupting each other.

* Makes jl_cumulative_compile_time_ns into a global, atomic variable.

Instead of keeping per-task compilation time, this change keeps a
global counter of compilation time, protected with atomic mutations.

Fixes JuliaLang#41739

```julia
julia> include("./compilation-task-migration-17-example.jl")
start thread: 2
end thread: 2
  5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time)

julia> include("./compilation-task-migration-17-example.jl")
start thread: 3
WARNING: replacing module M.
end thread: 1
  4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time)
```

Compilation time measurement originally added in: JuliaLang#38885

Problems addressed:
- This fixes JuliaLang#41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled.
- It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (JuliaLang#41271 (comment)).
  - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean.
  - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).
KristofferC pushed a commit that referenced this issue Aug 27, 2021
Now, multiple tasks (on the same or different Threads) can start and stop compilation
time measurement, without interrupting each other.

* Makes jl_cumulative_compile_time_ns into a global, atomic variable.

Instead of keeping per-task compilation time, this change keeps a
global counter of compilation time, protected with atomic mutations.

Fixes #41739

```julia
julia> include("./compilation-task-migration-17-example.jl")
start thread: 2
end thread: 2
  5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time)

julia> include("./compilation-task-migration-17-example.jl")
start thread: 3
WARNING: replacing module M.
end thread: 1
  4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time)
```

Compilation time measurement originally added in: #38885

Problems addressed:
- This fixes #41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled.
- It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (#41271 (comment)).
  - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean.
  - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).

(cherry picked from commit b4ca196)
janrous-rai pushed a commit to janrous-rai/julia that referenced this issue Oct 28, 2021
Now, multiple tasks (on the same or different Threads) can start and stop compilation
time measurement, without interrupting each other.

* Makes jl_cumulative_compile_time_ns into a global, atomic variable.

Instead of keeping per-task compilation time, this change keeps a
global counter of compilation time, protected with atomic mutations.

Fixes JuliaLang#41739

```julia
julia> include("./compilation-task-migration-17-example.jl")
start thread: 2
end thread: 2
  5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time)

julia> include("./compilation-task-migration-17-example.jl")
start thread: 3
WARNING: replacing module M.
end thread: 1
  4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time)
```

Compilation time measurement originally added in: JuliaLang#38885

Problems addressed:
- This fixes JuliaLang#41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled.
- It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (JuliaLang#41271 (comment)).
  - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean.
  - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Feb 22, 2022
Now, multiple tasks (on the same or different Threads) can start and stop compilation
time measurement, without interrupting each other.

* Makes jl_cumulative_compile_time_ns into a global, atomic variable.

Instead of keeping per-task compilation time, this change keeps a
global counter of compilation time, protected with atomic mutations.

Fixes JuliaLang#41739

```julia
julia> include("./compilation-task-migration-17-example.jl")
start thread: 2
end thread: 2
  5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time)

julia> include("./compilation-task-migration-17-example.jl")
start thread: 3
WARNING: replacing module M.
end thread: 1
  4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time)
```

Compilation time measurement originally added in: JuliaLang#38885

Problems addressed:
- This fixes JuliaLang#41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled.
- It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (JuliaLang#41271 (comment)).
  - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean.
  - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Mar 8, 2022
Now, multiple tasks (on the same or different Threads) can start and stop compilation
time measurement, without interrupting each other.

* Makes jl_cumulative_compile_time_ns into a global, atomic variable.

Instead of keeping per-task compilation time, this change keeps a
global counter of compilation time, protected with atomic mutations.

Fixes JuliaLang#41739

```julia
julia> include("./compilation-task-migration-17-example.jl")
start thread: 2
end thread: 2
  5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time)

julia> include("./compilation-task-migration-17-example.jl")
start thread: 3
WARNING: replacing module M.
end thread: 1
  4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time)
```

Compilation time measurement originally added in: JuliaLang#38885

Problems addressed:
- This fixes JuliaLang#41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled.
- It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (JuliaLang#41271 (comment)).
  - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean.
  - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

No branches or pull requests

4 participants