-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add docs on task-specific buffering using multithreading #48542
Add docs on task-specific buffering using multithreading #48542
Conversation
cd1070f
to
d6d0234
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this
doc/src/manual/multi-threading.md
Outdated
500000500000 | ||
``` | ||
|
||
Note that we do not use buffers based on the `threadid()` i.e. `buffers = zeros(Threads.nthreads())` because tasks can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also because nthreads is soft-deprecated (no warning, but may return incorrect answers now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add a note on this. I don't understand whether the docs should or shouldn't recommend nthreads()
To fix this, buffers that are specific to the task may be used to segment the sum into chunks that are race-free. | ||
|
||
```julia-repl | ||
julia> function sum_multi_good(a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert on multi-threaded parallelism, so I might just be wrong.
But this doesn't seem like it'll be particularly performant as the values of the WeakKeyDict
won't be contiguous in memory, and so the reduction at the needs to gather these from likely different parts of the heap => slow reduction. In fact, when I tried running something like this on an example of mine it was incredibly slow; probably partially due to what I just mentioned, and partially due to the slowness of WeakKeyDict
(I believe you can also use an IdDict
which at least should be faster that WeakKeyDict
).
An alternative is Atomic
as mentioned below, or, even better, you can do a Vector{Atomic{T}}
of some buffer_length
. In the latter scenario you can then just randomly pick an index for each Task
, and act on the corresponding Atomic{T}
atomically, e.g. atomic_add!(buffer[rand(1:buffer_size)], i)
. If the work within each of the tasks is fairly uniform, then this random picking of index to add to should result in fairly even congestion for the different buffers.
I'm sure there are much, much better ways of doing this, but the above is a quick-and-easy way to implement a much faster version of this kind of map-reduce approach using a buffer.
At least most of those had the decency to avoid having inbounds annotations, which can quickly (and often) turns mild wrong code into drastic wrong answers |
Those seem more impetus to continue looking into #48543 perhaps? |
I think many of those mimiced the implementation of the old default rng. |
d6d0234
to
2637437
Compare
1d330f7
to
1c0ebe6
Compare
Perhaps this is ready for final review? @MasonProtter perhaps? |
@threads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for suggested that code and now I'm suggesting changes, but I guess it would be nice and concise here to just re-use the single threaded version of the function and run it on each of the spawned chunks, and also use it for reducing the returned data?
1c0ebe6
to
0c21503
Compare
Co-Authored-By: Mason Protter <mason.protter@icloud.com>
0c21503
to
7f8d474
Compare
@MasonProtter I absorbed your suggestions. Can you give this a last review please. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it!
Co-authored-by: Mason Protter <mason.protter@icloud.com>
Backported PRs: - [x] #47782 <!-- Generalize Bool parse method to AbstractString --> - [x] #48634 <!-- Remove unused "deps" mechanism in internal sorting keywords [NFC] --> - [x] #49931 <!-- Lock finalizers' lists at exit --> - [x] #50064 <!-- Fix numbered prompt with input only with comment --> - [x] #50474 <!-- docs: Fix a `!!! note` which was miscapitalized --> - [x] #50516 <!-- Fix visibility of assert on GCC12/13 --> - [x] #50635 <!-- `versioninfo()`: include build info and unofficial warning --> - [x] #49915 <!-- Revert "Remove number / vector (#44358)" --> - [x] #50781 <!-- fix `bit_map!` with aliasing --> - [x] #50845 <!-- fix #50438, use default pool for at-threads --> - [x] #49031 <!-- Update inference.md --> - [x] #50289 <!-- Initialize prev_nold and nold in gc_reset_page --> - [x] #50559 <!-- Expand kwcall lowering positional default check to vararg --> - [x] #49582 <!-- Update HISTORY.md for `DelimitedFiles` --> - [x] #50341 <!-- invokelatest docs should say not exported before 1.9 --> - [x] #50525 <!-- only check that values are finite in `generic_lufact` when `check=true` --> - [x] #50444 <!-- Optimize getfield lowering to avoid boxing in some cases --> - [x] #50523 <!-- Avoid generic call in most cases for getproperty --> - [x] #50860 <!-- Add `Base.get_extension` to docs/API --> - [x] #50164 <!-- codegen: handle dead code with unsafe_store of FCA pointers --> - [x] #50568 <!-- `Array(::AbstractRange)` should return an `Array` --> - [x] #50871 <!-- macOS: Don't inspect dead threadtls during exception handling. --> Need manual backport: - [ ] #48542 <!-- Add docs on task-specific buffering using multithreading --> - [ ] #50591 <!-- build: fix various makefile bugs --> Non-merged PRs with backport label: - [ ] #50842 <!-- Avoid race conditions with recursive rm --> - [ ] #50823 <!-- Make ranges more robust with unsigned indexes. --> - [ ] #50663 <!-- Fix Expr(:loopinfo) codegen --> - [ ] #49716 <!-- Update varinfo() docstring signature --> - [ ] #49713 <!-- prevent REPL from erroring in numbered mode in some situations --> - [ ] #49573 <!-- Implement jl_cpu_pause on PPC64 --> - [ ] #48726 <!-- fix macro expansion of property destructuring --> - [ ] #48642 <!-- Use gc alloc instead of alloc typed in lowering --> - [ ] #48183 <!-- Don't use pkgimage for package if any includes fall in tracked path for coverage or alloc tracking --> - [ ] #48050 <!-- improve `--heap-size-hint` arg handling --> - [ ] #47615 <!-- Allow threadsafe access to buffer of type inference profiling trees -->
Co-authored-by: Mason Protter <mason.protter@icloud.com> (cherry picked from commit 02f80c6)
Co-authored-by: Mason Protter <mason.protter@icloud.com> (cherry picked from commit 02f80c6)
Backported PRs: - [x] #48625 <!-- add replace(io, str, patterns...) --> - [x] #48387 <!-- Improve documentation of sort-related functions --> - [x] #48363 <!-- Revise sort.md and docstrings in sort.jl --> - [x] #48977 <!-- Update SparseArrays.jl stdlib for SuiteSparse 7 --> - [x] #50719 <!-- fix `CyclePadding(::DataType)` --> - [x] #50694 <!-- inference: permit recursive type traits --> - [x] #50860 <!-- Add `Base.get_extension` to docs/API --> - [x] #50594 <!-- Disallow non-index Integer types in isassigned --> - [x] #50802 <!-- Makes IntrusiveLinkedListSynchronized mutable to avoid allocation on poptask --> - [x] #50858 <!-- Add a `threadpool` parameter to `Channel` constructor --> - [x] #50874 <!-- Restrict COFF to a single thread when symbol count is high --> - [x] #50822 <!-- Add default method for setmodifiers! --> - [x] #50730 <!-- Fix integer overflow in `isapprox` --> - [x] #50850 <!-- Remove weird Rational dispatch and add pi functions to list --> - [x] #50809 <!-- Limit type-printing in MethodError --> - [x] #50915 <!-- Add note the `Task` about sticky bit --> - [x] #50929 <!-- when widening tuple types in tmerge, only widen the complex parts --> - [x] #50928 <!-- Bump JuliaSyntax to 0.4.6 --> - [x] #50959 <!-- Update libssh2 patches --> - [x] #50823 <!-- Make ranges more robust with unsigned indexes. --> - [x] #48542 <!-- Add docs on task-specific buffering using multithreading --> - [x] #50912 <!-- Separate foreign threads into a :foreign threadpool --> - [x] #51010 <!-- Add ORIGIN to SuiteSparse rpath on Linux/FreeBSD --> - [x] #50753 <!-- cat: remove unused promote_eltype methods that confuse inference --> - [x] #51027 <!-- Implement realloc accounting correctly --> - [x] #51019 <!-- fix a case of potentially use of undefined variable when handling error in distributed message processing --> - [x] #51039 <!-- Revert "Optimize findall(f, ::AbstractArray{Bool}) (#42202)" --> - [x] #51036 <!-- add missing invoke edge for nospecialize targets --> - [x] #51042 <!-- inference: fix return_type_tfunc modeling of concrete functions --> - [x] #51114 <!-- Workaround upstream FreeBSD issue #272992 --> - [x] #50892 <!-- Add `JL_DLLIMPORT` to `small_typeof` declaration --> - [x] #51154 <!-- broadcast: use recursion rather than ntuple to map over a tuple --> - [x] #51153 <!-- fix debug typo in "add missing invoke edge for nospecialize targets (#51036)" --> - [x] #51222 <!-- Check again if the tty is open inside the IO lock --> - [x] #51236 <!-- Add lock around uv_unref during init --> - [x] #51243 <!-- GMP: Gracefully handle more overflows. --> - [x] #51254 <!-- Ryu: make sure adding zeros does not overwrite trailing dot --> - [x] #51175 <!-- shorten stale_age for cachefile lock --> - [x] #51300 <!-- fix method definition error for bad vararg --> - [x] #51307 <!-- fix force-throw ctrl-C on Windows --> - [x] #51303 <!-- ensure revising structs is safe --> - [x] #51393 - [x] #51403 Need manual backport: - [x] #51009 <!-- fix #50562, regression in `in` of tuple of Symbols --> - [x] #51053 <!-- Bump Statistics stdlib --> - [x] #51013 <!-- fix #50709, type bound error due to free typevar in sparam env --> - [x] #51305 <!-- reduce test time for rounding and floatfuncs --> Contains multiple commits, manual intervention needed: - [ ] #50663 <!-- Fix Expr(:loopinfo) codegen --> - [ ] #51035 <!-- refactor GC scanning code to reflect jl_binding_t are now first class --> - [ ] #51092 <!-- inference: fix bad effects for recursion --> - [x] #51247 <!-- Check if malloc has succeeded before incrementing gc stats --> - [x] #51294 <!-- use LibGit2_jll for LibGit2 library --> Non-merged PRs with backport label: - [ ] #51132 <!-- Handle `AbstractQ` in concatenation --> - [x] #51029 <!-- testdefs: make sure that if a test set changes the active project, they change it back when they're done --> - [ ] #50919 <!-- Code loading: do the "skipping mtime check for stdlib" check regardless of the value of `ispath(f)` --> - [ ] #50824 <!-- Add some aliasing warnings to docstrings for mutating functions --> - [x] #50385 <!-- Precompile pidlocks: add to NEWS and docs --> - [ ] #49805 <!-- Limit TimeType subtraction to AbstractDateTime -->
Co-authored-by: Mason Protter <mason.protter@icloud.com> (cherry picked from commit 02f80c6)
It's common to see people using
threadid()
-based buffers, for instance in a toy sum examplebut after task migration #40715 (which is undocumented AFAICT) this practice is not race-safe.
This is an attempt to document some best practice.
I expected the
task_local_storage
api to be the right answer here, but I couldn't figure out how to reduce it after the@threads
loop returned.Help appreciated in making this correct