
Refetch PTLS via Task struct (enable task migration) #40715

Merged 1 commit into master from tkf/ptls-via-task on May 31, 2021

Conversation

@vtjnash (Member) commented May 4, 2021

@vtjnash vtjnash force-pushed the tkf/ptls-via-task branch 4 times, most recently from cb5a359 to ab08a29, on May 11, 2021 16:27
@vtjnash vtjnash requested review from JeffBezanson, tkf and yuyichao May 11, 2021 21:18
@tkf (Member) left a comment

This is exciting!

julia> function f()
           a = Threads.threadid()
           yield()
           b = Threads.threadid()
           (a, b)
       end
f (generic function with 1 method)

julia> function waste(d)
           t0 = time_ns()
           while time_ns() - t0 < d
           end
       end
waste (generic function with 1 method)

julia> begin
       tasks = [Threads.@spawn(f()) for _ in 1:10000]
       Threads.@threads for _ in 1:Threads.nthreads()
           waste(100_000_000)
       end
       end

julia> ids = fetch.(tasks);

julia> d = map(Base.splat(==), ids);

julia> using StatsBase  # countmap is provided by StatsBase

julia> countmap(d)
Dict{Bool, Int64} with 2 entries:
  0 => 8757
  1 => 1243

I looked at the patch and it made sense to me. I'm not familiar with a large part of it, though.

@JeffBezanson JeffBezanson self-assigned this May 13, 2021
@@ -298,12 +298,12 @@ JL_DLLEXPORT jl_array_t *jl_reshape_array(jl_value_t *atype, jl_array_t *data,

 JL_DLLEXPORT jl_array_t *jl_string_to_array(jl_value_t *str)
 {
-    jl_ptls_t ptls = jl_get_ptls_states();
+    jl_task_t *ct = jl_current_task;
Member commented on this hunk:
Could we avoid some of these changes, just to reduce code churn?

@vtjnash (Member, Author) replied:

No, these are pretty much all necessary. I think it would be a bad idea to analyze them on a case-by-case basis, since the old pattern costs both performance and correctness.

Member replied:

Yeah, I realized later that we also don't want anybody reading the code to get the wrong idea about how to do it.
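For context, the pattern these hunks converge on looks roughly like the following minimal sketch. jl_current_task is taken from the diff above; the ptls field access is an assumption about the C runtime of this era and may differ in later versions.

// Old pattern: cache the thread-local state directly. This implicitly ties
// the code to one OS thread's state and prevents the task from migrating.
//     jl_ptls_t ptls = jl_get_ptls_states();
// New pattern: reach the thread-local state through the current task, and
// refetch it after any point that can switch (and now migrate) the task.
jl_task_t *ct = jl_current_task;
jl_ptls_t ptls = ct->ptls;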

@JeffBezanson JeffBezanson added the multithreading Base.Threads and related functionality label May 13, 2021
@JeffBezanson (Member) commented:

This looks great to me so far. Let's try benchmarks for funsies.

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator) commented:

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @christopher-dG

@vtjnash (Member, Author) commented May 14, 2021

That seems to be just noise.

@vtjnash (Member, Author) commented May 14, 2021

Of note, this is possibly breaking linuxaarch64: unwind-info registration is known to be problematic there, and this change seems to be making it worse somehow.

@JeffBezanson (Member) commented:

Here is pfib(31) (4 trials) with 4 threads on master and this branch:

master:

  1.892977 seconds (13.20 M allocations: 1.237 GiB, 33.03% gc time)
  1.843148 seconds (13.24 M allocations: 1.379 GiB, 36.69% gc time)
  1.943728 seconds (13.19 M allocations: 1.211 GiB, 32.89% gc time)
  1.853310 seconds (13.16 M allocations: 1.109 GiB, 34.95% gc time)

PR:

  2.265880 seconds (14.11 M allocations: 2.426 GiB, 39.32% gc time)
  1.868710 seconds (14.23 M allocations: 1.734 GiB, 32.32% gc time)
  2.062129 seconds (14.20 M allocations: 1.878 GiB, 32.08% gc time)
  2.249262 seconds (14.18 M allocations: 2.060 GiB, 42.60% gc time)

So something seems to be a bit slower in pure task spawning/switching.

@tkf (Member) commented May 14, 2021

I can also observe a longer tail in spawn overheads in a different benchmark. OTOH, schedule (measured via notify) does not seem to have a drastic difference.

I'm somewhat surprised that this PR increases the overhead on spawn. I thought that the overhead might happen (also?) at re-schedule.

Spawn and sync (wait) overheads

Benchmark code: https://github.com/tkf/ThreadsAPIBenchmarks.jl/blob/4e6b8795334a45fd3ffa687a70bb68efec0c4839/benchmark/task_overheads.jl#L4-L18

It uses jl_set_task_tid explicitly. I don't know how it'd interfere with the scheduler in this PR.
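For readers unfamiliar with that call, here is a minimal sketch of the explicit-pinning pattern, assuming the internal ccall signature Base.Threads used around this time; jl_set_task_tid is unexported, unsupported API and may have changed since.

using Base.Threads

# Create a task, pin it to thread 1 (tids are 0-based on the C side), and run it.
t = Task(() -> println("running on thread ", threadid()))
t.sticky = true                                    # ask the scheduler not to migrate it
ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, 0)  # assumed signature; internal API
schedule(t)
wait(t)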

N (color) in the plots is nthreads().

This branch ab08a29:

[plot: spawn/sync overhead distributions, this branch]

1.7.0-DEV.1089 (d6c092d):

[plot: spawn/sync overhead distributions, 1.7.0-DEV.1089]

Overhead of notify (schedule)

Benchmark code: https://github.com/tkf/ThreadsAPIBenchmarks.jl/blob/4e6b8795334a45fd3ffa687a70bb68efec0c4839/benchmark/notify_overheads.jl#L4-L37

This branch ab08a29:

[plot: notify (schedule) overhead distributions, this branch]

1.7.0-DEV.1089 (d6c092d):

[plot: notify (schedule) overhead distributions, 1.7.0-DEV.1089]


You can find notebooks containing a bit more explanation of the benchmarks as well. The code running the benchmarks and generating the notebooks is in https://github.com/tkf/ThreadsAPIBenchmarks.jl

@vtjnash (Member, Author) commented May 24, 2021

@JeffBezanson I don't know how to reproduce that without pfib, but I note the greatly increased allocation rate, so I don't think you are measuring spawn/sync overhead; rather, you are measuring the performance overhead of partr. @tkf's code seems to do roughly the same, but additionally injects some unexpected internal state during spawn, which we expect to hurt performance significantly. Neither measurement is particularly relevant to this PR, however, as the effect on spawn performance is not what we are most concerned about here.

@JeffBezanson (Member) commented:

It's just this:

function pfib(n::Int)
    if n <= 1
        return n
    end
    t = Threads.@spawn pfib(n-2)
    return pfib(n-1) + fetch(t)::Int
end
@time pfib(31)

I agree it's a lot of scheduler overhead, which we'll probably have to address separately. Just wanted to get a sense of what effects we might see from this. I don't think it blocks merging.

@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label May 26, 2021
@vtjnash vtjnash force-pushed the tkf/ptls-via-task branch from f16b6e8 to 3d08b65 on May 27, 2021 21:55
Co-authored-by: Takafumi Arakaki <aka.tkf@gmail.com>
@vtjnash vtjnash force-pushed the tkf/ptls-via-task branch from 3d08b65 to 5f34035 on May 28, 2021 19:40
@vtjnash vtjnash merged commit b46df09 into master May 31, 2021
@vtjnash vtjnash deleted the tkf/ptls-via-task branch May 31, 2021 02:08
shirodkara pushed a commit to shirodkara/julia that referenced this pull request Jun 9, 2021
…Lang#40715)

Enables task-thread migration!

Co-authored-by: Takafumi Arakaki <aka.tkf@gmail.com>
@DilumAluthge DilumAluthge removed the merge me PR is reviewed. Merge when all tests are passing label Jun 18, 2021
@KristofferC (Member) commented:

Should the caveat in the manual (https://docs.julialang.org/en/v1/manual/multi-threading/#Caveats) be removed?

After a task starts running on a certain thread (e.g. via @spawn), it will always be restarted on the same thread after blocking. In the future this limitation will be removed, and tasks will migrate between threads.

@ancapdev (Contributor) commented Jun 21, 2021

I wasn't sure where to open a discussion on this, but my post on Discourse (https://discourse.julialang.org/t/task-scheduling-semantics/62950) didn't garner any attention, so perhaps here is a better place.

@vtjnash, @JeffBezanson As much as I'm excited about task migration, I think there are some open design questions to answer before pushing this functionality into a release.

My core concern is that there are two kinds of scheduler stickiness you might want: sticking to a particular system thread, or sticking to whatever thread your parent task is (at any moment) scheduled on. At present these are the same, but with task migration they will no longer be. The way I see it, you want to pin a task to a system thread when you need access to thread-local state or when you are interacting with libraries that must run on particular threads. For the more general Julia async use cases, however, what you want is to guarantee non-simultaneous execution of the tasks in some group (e.g., parent and sibling async tasks).

It would be nice to consider this requirement prior to making the scheduler more flexible, and potentially also to clean up the task scheduling API. I may be wrong, but I believe the current Threads.@spawn and @async APIs are separate mostly for legacy reasons, and a cleaner, more future-proof approach may be a single scheduling API with a scheduling-policy argument. Policies could include affinity (e.g., free, current-thread, parent-thread, initial-thread), and could later easily accommodate priority levels, NUMA-based affinities, or other features as the scope and functionality grow; a rough sketch of the idea follows below.
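To make the proposal concrete, here is a purely hypothetical sketch of what such a single entry point could look like; none of these names exist in Base, and only Task, schedule, and the sticky field are real today.

# Hypothetical scheduling-policy API; only an illustration of the idea above.
@enum SchedulingPolicy FreeMigration CurrentThread ParentThread InitialThread

function spawn_with_policy(f; policy::SchedulingPolicy = FreeMigration)
    t = Task(f)
    # Today this collapses onto the single sticky bit; a richer scheduler could
    # treat CurrentThread, ParentThread, and InitialThread as distinct affinities.
    t.sticky = policy !== FreeMigration
    schedule(t)
    return t
end

# Usage: spawn_with_policy(() -> sum(rand(10^6)); policy = FreeMigration)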

johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
…Lang#40715)

Enables task-thread migration!

Co-authored-by: Takafumi Arakaki <aka.tkf@gmail.com>
@kshyatt (Contributor) commented Jul 20, 2021

This PR seems to have broken precompilation for GPUCompiler.jl. Before this PR, doing this:

precomp_script.jl:

using LinearAlgebra
m = 10

compile.jl:

using PackageCompiler
default_compile_dir() = joinpath(homedir(), ".julia", "sysimages")

default_compile_filename() = "sys_gputest.so"

default_compile_path() = joinpath(default_compile_dir(), default_compile_filename())

function compile(;
  dir::AbstractString=default_compile_dir(),
  filename::AbstractString=default_compile_filename(),
)
  if !isdir(dir)
    println("""The directory "$dir" doesn't exist yet, creating it now.""")
    println()
    mkdir(dir)
  end
  path = joinpath(dir, filename)
  println(
    """Creating the system image "$path" containing the compiled version of GPUCompiler. This may take a few minutes.""",
  )
  create_sysimage(
    :GPUCompiler;
    sysimage_path=path,
    precompile_execution_file=joinpath(@__DIR__, "precomp_script.jl"),
  )
  return path
end

compile()

and running julia --project=~/.julia/dev/GPUCompiler compile.jl, then julia --sysimage ~/.julia/sysimages/sys_gputest.so, allowed Julia to start up without problems. After this PR, the compilation process succeeds, but as soon as we try to load the sysimage with a sanitized Julia build, we get:

$ ~/Julia/src/julia/build/sanitize/julia --sysimage /tmp/sys.so                                                                                                   [master] 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==260483==ERROR: AddressSanitizer: SEGV on unknown address 0x0000000001c8 (pc 0x7fb5d77e79a6 bp 0x7ffc28e2ed30 sp 0x7ffc28e2ed20 T0)
==260483==The signal is caused by a READ memory access.
==260483==Hint: address points to the zero page.
    #0 0x7fb5d77e79a6 in llvm::StringRef::StringRef(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /home/tim/Julia/src/julia/build/sanitize/usr/include/llvm/ADT/StringRef.h
    #1 0x7fb5d77e79a6 in jl_setup_module(llvm::Module*, jl_cgparams_t const*) /home/tim/Julia/src/julia/src/codegen.cpp:1740:24
    #2 0x7fb5d77e78b5 in jl_create_llvm_module(llvm::StringRef) /home/tim/Julia/src/julia/src/codegen.cpp:1746:5
    #3 0x7fb5d7af303c in jl_compile_extern_c /home/tim/Julia/src/julia/src/jitlayers.cpp:245:16
    #4 0x7fb5d7a171ae in jl_reinit_item /home/tim/Julia/src/julia/src/staticdata.c:1453:27
    #5 0x7fb5d7a171ae in jl_finalize_deserializer /home/tim/Julia/src/julia/src/staticdata.c:1473:9
    #6 0x7fb5d7a14c60 in jl_restore_system_image_from_stream /home/tim/Julia/src/julia/src/staticdata.c:1845:5
    #7 0x7fb5d7a134f6 in jl_restore_system_image_data /home/tim/Julia/src/julia/src/staticdata.c:1894:5
    #8 0x7fb5d7a134f6 in jl_load_sysimg_so /home/tim/Julia/src/julia/src/staticdata.c:375:5
    #9 0x7fb5d7a134f6 in jl_restore_system_image /home/tim/Julia/src/julia/src/staticdata.c:1867:9
    #10 0x7fb5d79dbb56 in julia_init /home/tim/Julia/src/julia/src/init.c:738:9
    #11 0x7fb5d7a8bd16 in jl_repl_entrypoint /home/tim/Julia/src/julia/src/jlapi.c:684:5
    #12 0x500c1d in main /home/tim/Julia/src/julia/cli/loader_exe.c:53:15
    #13 0x7fb5db52fb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
    #14 0x41f39d in _start (/home/tim/Julia/src/julia/build/sanitize/usr/bin/julia+0x41f39d)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/tim/Julia/src/julia/build/sanitize/usr/include/llvm/ADT/StringRef.h in llvm::StringRef::StringRef(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
==260483==ABORTING

(Thanks @maleadt for this log.)
The problem is a NULL pointer plus an offset:

(gdb) p jl_TargetMachine
$1 = (llvm::TargetMachine *) 0x0

Tim suggested that:

the GPUCompiler.ji deserialization triggers jl_compile_extern_c (for a @ccallable, I presume) from within jl_init before jl_init_codegen has initialized LLVM

@vtjnash any thoughts?

@DilumAluthge (Member) commented:

Should the caveat in the manual (https://docs.julialang.org/en/v1/manual/multi-threading/#Caveats) be removed?

After a task starts running on a certain thread (e.g. via @spawn), it will always be restarted on the same thread after blocking. In the future this limitation will be removed, and tasks will migrate between threads.

Implemented in #43496
