`jl_cumulative_compile_time_ns_before` tracks compilation time per thread rather than per task #41271

rgankema · 2021-06-18T09:03:08Z

For my work at RelationalAI I'm trying to gather metrics on how much time is spent in run-time compilation while evaluating queries. I was hoping to use jl_cumulative_compile_time_ns_[before/after] for this, a little bit like this:

function evaluate_query(query_fn)
    time_before = time_ns()
    compile_before = jl_cumulative_compile_time_ns_before()
    
    query_fn()
    
    compile_after = jl_cumulative_compile_time_ns_after()
    time_after = time_ns()
    inc_metric!(metrics, "query_eval_total_time", time_after - time_before)
    inc_metric!(metrics, "query_jl_compile_time", compile_after - compile_before)
end

This works fine if there are no background tasks and queries run sequentially. It also works fine in a multi-threaded scenario, as long as only a single task runs at-a-time at any given thread, because jl_cumulative_compile_time_ns_[before/after] seem to use separate counters per thread. However, it breaks when multiple tasks are multiplexed on the same thread. For instance, if one call of evaluate_query has a very long running query_fn, and other tasks that do work that incur compilation time are running at the same time on that thread, that first evaluate_query call will also record all the compilation time triggered by those others tasks. This means that we're overestimating how much time we spent in compilation for that given query, potentially by a very large margin.

To see this behavior in action, please consider the following MRE:

using Base.Threads: @async

function do_something_and_measure_compilation_time(a, b)
    compile_before = Base.cumulative_compile_time_ns_before()

    # Function that we want to measure.
    x = do_something(a, b)

    compile_after = Base.cumulative_compile_time_ns_after()
    compile_duration_s = (compile_after - compile_before) / 1_000_000_000
    @show compile_duration_s
    return x
end

# Simulates a long running function that costs a lot more time to run than to compile.
function do_something(a, b)
    x = foo(a, a + 3, b, b + 5)
    sleep(5)
    return x
end

# Something to compile.
function foo(x1, x2, y1, y2)
    xs = [x1:x2...]
    ys = [y1:y2...]
    return sum(xs) * max(xs...) * sum(ys) * max(ys...)
end

# Do a couple of runs. We should see compilation time at first, but that should quickly go
# away.
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)

# Call foo with types we haven't seen before, in the background.
@async begin
    sleep(1)
    for T1 in [Int32, Int16, Int8]
        for T2 in [Int32, Int16, Int8]
            @time foo(zero(T1), one(T1), zero(T2), one(T2))
        end
    end
end

# Final run, this time we'll measure compilation time from the above, even though it's
# not compilation that comes from this task.
do_something_and_measure_compilation_time(1, 2)

This outputs:

compile_duration_s = 0.024834394
compile_duration_s = 0.002964853
compile_duration_s = 0.0
compile_duration_s = 0.0
  0.037290 seconds (87.25 k allocations: 5.084 MiB, 99.79% compilation time)
  0.077509 seconds (169.89 k allocations: 10.369 MiB, 99.61% compilation time)
  0.068803 seconds (170.97 k allocations: 10.438 MiB, 99.69% compilation time)
  0.008911 seconds (4.97 k allocations: 271.748 KiB, 99.42% compilation time)
  0.009805 seconds (4.92 k allocations: 266.718 KiB, 99.52% compilation time)
  0.014783 seconds (4.97 k allocations: 269.733 KiB, 98.73% compilation time)
  0.009418 seconds (4.97 k allocations: 270.107 KiB, 99.38% compilation time)
  0.009743 seconds (4.97 k allocations: 269.671 KiB, 99.47% compilation time)
  0.013891 seconds (4.92 k allocations: 266.905 KiB, 99.33% compilation time)
compile_duration_s = 0.298222365

As you can see, the last call to do_something_and_measure_compilation_time recorded the compilation time of the background task. Unfortunately that behavior makes it unusable for the use-case described above. We'd therefore like to request that jl_cumulative_compile_time_ns_[before/after] is changed such that it keeps a counter per task, rather than per thread. Thanks!

The text was updated successfully, but these errors were encountered:

NHDaly · 2021-06-18T15:10:28Z

Here's another, maybe a bit simpler example, which shows the same affect as above using @time. In this program, you can see that timing one Task which doesn't have any compilation time, running concurrently with another Task on the same thread that does reports the compilation time in the call to @time, even though it wasn't caused by that Task.

(Note that this example doesn't work if you copy/paste into the REPL; i think because it's compiling the REPL display output at the same time as the program runs, so both @time calls show some compilation time. So instead, I run this example as a script.)

# async-time-example.jl

const running = Threads.Atomic{Bool}(true)

loop() = while running[] ; yield(); GC.safepoint(); end

# Run once as warmup
running[] = true
t = @async loop()
sleep(0.1) ; running[] = false

wait(t)

# ----------------------------

running[] = true

# Start timing a task, which doesn't have any compilation time.
t = @async @time loop()

# Finish timing the task. Note no compilation time.
sleep(0.1) ; running[] = false

wait(t)

# ----------------------------

running[] = true

# Now start timing the *same* task, which shouldn't have any compilation time.
t = @async @time while running[] ; yield(); GC.safepoint(); end

# But start another task *on the same thread,* which DOES spend some compilation time.
t2 = @async (@eval f() = 2+2; @eval f())

# Finish timing the original task, and observe that the compilation time from `t2`
# was recorded by the `@time` in `t`.
sleep(0.1) ; running[] = false

wait(t) ; wait(t2)

And here's the output:

$ julia ~/Downloads/async-time-example.jl
  0.079854 seconds (72 allocations: 4.594 KiB)
  0.093527 seconds (2.27 k allocations: 158.235 KiB, 4.59% compilation time)

The request here is to record compilation time via task-local storage, instead of through thread-local storage.

NHDaly · 2021-06-29T18:55:12Z

Worth noting that actually this is not unique to compilation time. The number of allocations are tracked globally as well, as shown in this simple example:

julia> function f(x::Array{Any})  x[1] + x[1]  end
f (generic function with 1 method)

julia> t = @async @time sleep(10)
Task (runnable) @0x000000010b059000

julia> @time for _ in 1:3000000 f(Any[2]) end
  0.204893 seconds (3.00 M allocations: 274.658 MiB, 16.42% gc time)

julia>  10.005931 seconds (3.04 M allocations: 277.261 MiB, 0.34% gc time, 0.19% compilation time)

This would be great to change for all the metric in @time, though i don't know how feasible that is.

vtjnash · 2021-06-29T21:54:47Z

IIRC, the code is currently taking extra effort to make this counter global (instead of per-thread). The challenge may be deciding if we should ever "charge" thread costs (time, memory, etc.) to the parent Tasks (all that called 'wait' on it), and how to handle task thread-migration.

ElSeidy · 2021-06-30T12:57:40Z

Tracking this discussion.
I just came across the same issue :)

NHDaly · 2021-07-01T02:18:46Z

Thanks @vtjnash - that makes sense. Interesting!

Yeah, after thinking more about it, i can see why the global accounting is actually really valuable for some situations.

It seems like maybe we'd ideally want both types of metrics? In addition to the global stuff we currently report, it would be great to also be able to track:

how many allocations did this Task and all child tasks involved in my span perform?
how much time was spent in compilation for this task and all child tasks?
how much CPU time did this task and my child tasks actually take when running on the CPU?
how much time did this task and child tasks spend blocking, not running?

Does that make sense to you? That we could (have the option to) record both types of metrics: task-local (and maybe including any child tasks that we wait on - as you say), and global?

NHDaly · 2021-07-08T16:08:30Z

@vtjnash -

IIRC, the code is currently taking extra effort to make this counter global (instead of per-thread).

After looking at this a bit more, i don't think the jl_cumulative_compile_time_ns_before is global though. It's per-thread.

And actually, @janrous-rai just pointed out to me that I think there's actually a race condition here, because if multiple Tasks scheduled on the same thread are both measuring @time, the first one to finish will disable jl_measure_compile_time for that thread, and the other one will miss some of its compilation time because it was disabled out from under it! 😮

julia/src/jitlayers.cpp

Line 89 in 5dbf45a

jl_measure_compile_time[tid] = 0;

To fix the race condition, do you think we could at least change jl_measure_compile_time to be:
A) an thread-safe and/or atomic variable
B) a counter instead of a boolean? So that only the it will remain true as long as at least one Task on the thread is currently measuring compilation time?

Alternatively, is it really so expensive to measure the compilation time that we need to enable/disable it? Could we just keep this as a cumulative count of all compilation time, and remove the jl_measure_compile_time check entirely?

Now, multiple tasks (on the same or different Threads) can start and stop compilation time measurement, without interrupting each other. * Makes jl_cumulative_compile_time_ns into a global, atomic variable. Instead of keeping per-task compilation time, this change keeps a global counter of compilation time, protected with atomic mutations. Fixes JuliaLang#41739 ```julia julia> include("./compilation-task-migration-17-example.jl") start thread: 2 end thread: 2 5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time) julia> include("./compilation-task-migration-17-example.jl") start thread: 3 WARNING: replacing module M. end thread: 1 4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time) ``` Compilation time measurement originally added in: JuliaLang#38885 Problems addressed: - This fixes JuliaLang#41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled. - It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (JuliaLang#41271 (comment)). - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean. - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).

Now, multiple tasks (on the same or different Threads) can start and stop compilation time measurement, without interrupting each other. * Makes jl_cumulative_compile_time_ns into a global, atomic variable. Instead of keeping per-task compilation time, this change keeps a global counter of compilation time, protected with atomic mutations. Fixes #41739 ```julia julia> include("./compilation-task-migration-17-example.jl") start thread: 2 end thread: 2 5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time) julia> include("./compilation-task-migration-17-example.jl") start thread: 3 WARNING: replacing module M. end thread: 1 4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time) ``` Compilation time measurement originally added in: #38885 Problems addressed: - This fixes #41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled. - It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (#41271 (comment)). - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean. - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics). (cherry picked from commit b4ca196)

Now, multiple tasks (on the same or different Threads) can start and stop compilation time measurement, without interrupting each other. * Makes jl_cumulative_compile_time_ns into a global, atomic variable. Instead of keeping per-task compilation time, this change keeps a global counter of compilation time, protected with atomic mutations. Fixes JuliaLang#41739 ```julia julia> include("./compilation-task-migration-17-example.jl") start thread: 2 end thread: 2 5.185706 seconds (3.53 M allocations: 2.570 GiB, 7.34% gc time, 15.57% compilation time) julia> include("./compilation-task-migration-17-example.jl") start thread: 3 WARNING: replacing module M. end thread: 1 4.110316 seconds (18.23 k allocations: 2.391 GiB, 5.67% gc time, 0.24% compilation time) ``` Compilation time measurement originally added in: JuliaLang#38885 Problems addressed: - This fixes JuliaLang#41739, meaning it fixes compilation time reporting in 1.7 after task migration was enabled. - It also fixes the race condition that existed previously, even on 1.6, where multiple Tasks on the thread measuring `@time` could break the measurement, as identified in (JuliaLang#41271 (comment)). - It fixes reentrant `@time` by making the `enable` flag a _counter,_ instead of a boolean. - It fixes `@time` called from multiple threads by making that flag thread-safe (via atomics).

NHDaly added the multithreading Base.Threads and related functionality label Jun 18, 2021

This was referenced Jul 30, 2021

Compilation Time reporting is broken on Julia 1.7, due to Task Migration #41739

Closed

Make jl_cumulative_compile_time_ns global (and reentrant). #41733

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`jl_cumulative_compile_time_ns_before` tracks compilation time per thread rather than per task #41271

`jl_cumulative_compile_time_ns_before` tracks compilation time per thread rather than per task #41271

rgankema commented Jun 18, 2021 •

edited

Loading

NHDaly commented Jun 18, 2021 •

edited

Loading

NHDaly commented Jun 29, 2021

vtjnash commented Jun 29, 2021

ElSeidy commented Jun 30, 2021

NHDaly commented Jul 1, 2021

NHDaly commented Jul 8, 2021

jl_cumulative_compile_time_ns_before tracks compilation time per thread rather than per task #41271

jl_cumulative_compile_time_ns_before tracks compilation time per thread rather than per task #41271

Comments

rgankema commented Jun 18, 2021 • edited Loading

NHDaly commented Jun 18, 2021 • edited Loading

NHDaly commented Jun 29, 2021

vtjnash commented Jun 29, 2021

ElSeidy commented Jun 30, 2021

NHDaly commented Jul 1, 2021

NHDaly commented Jul 8, 2021

`jl_cumulative_compile_time_ns_before` tracks compilation time per thread rather than per task #41271

`jl_cumulative_compile_time_ns_before` tracks compilation time per thread rather than per task #41271

rgankema commented Jun 18, 2021 •

edited

Loading

NHDaly commented Jun 18, 2021 •

edited

Loading