Add a keyword argument to disable multithreading #3030
The first hunk gates the existing threading check behind the new `ismultithreaded()` switch:

```diff
@@ -261,7 +261,7 @@ function _combine_rows_with_first!((firstrow,)::Ref{Any},
     # Create up to one task per thread
     # This has lower overhead than creating one task per group,
     # but is optimal only if operations take roughly the same time for all groups
-    if VERSION >= v"1.4" && isthreadsafe(outcols, incols)
+    if VERSION >= v"1.4" && ismultithreaded() && isthreadsafe(outcols, incols)
         basesize = max(1, cld(len - 1, Threads.nthreads()))
         partitions = Iterators.partition(2:len, basesize)
     else
```
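For context, this branch sizes the work so that at most one chunk per thread is created. A minimal sketch of that arithmetic (illustrative only; the row count and thread count below are assumptions, not part of the diff):

```julia
# Illustrative only: how rows 2:len are split into at most nthreads() chunks.
len = 11                                   # assume 11 rows; row 1 is handled separately
nthreads = 4                               # assume 4 threads
basesize = max(1, cld(len - 1, nthreads))  # cld(10, 4) == 3 rows per chunk
partitions = Iterators.partition(2:len, basesize)
collect(partitions)                        # [2:4, 5:7, 8:10, 11:11] -- one chunk per thread
```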
Further down in the same function, task creation switches to the new macro:

```diff
@@ -273,11 +273,11 @@ function _combine_rows_with_first!((firstrow,)::Ref{Any},
         tasks = Vector{Task}(undef, length(partitions))
         for (tid, idx) in enumerate(partitions)
             tasks[tid] =
-                @spawn _combine_rows_with_first_task!(tid, first(idx), last(idx), first(idx),
-                                                      outcols, outcolsref,
-                                                      type_widened, widen_type_lock,
-                                                      f, gd, starts, ends, incols, colnames,
-                                                      firstcoltype(firstmulticol))
+                @spawn_or_async _combine_rows_with_first_task!(tid, first(idx), last(idx), first(idx),
+                                                               outcols, outcolsref,
+                                                               type_widened, widen_type_lock,
+                                                               f, gd, starts, ends, incols, colnames,
+                                                               firstcoltype(firstmulticol))
         end
 
     # Workaround JuliaLang/julia#38931:
```

Review discussion on the `@spawn_or_async` line:

A lot of code that isn't thread-safe is also not safe for `@async`:

```julia
julia> v = []
Any[]

julia> @sync for i = 1:100
           @async begin
               sleep(rand())
               push!(v, i)
           end
       end

julia> issorted(v)
false
```

Perhaps folks will push an ID to a list or something inside a […] My actual concern is loading data with Downloads.jl or HTTP.jl; both have a lot of issues with concurrency right now, and even […] So I think it would be great if there were a way to disallow concurrent execution of a function altogether, from green threads or threads.

@jrevels pointed out that we often don't really want to refer to threads but rather to tasks, which kind of generalizes my point above. Maybe it should be […]

This is a good point - I think […]

Ah, too bad. I hoped we could support this without adding branches all over the place, which make the code hard to read and add special code paths that need a specific test for each case. This applies in particular to splitapplycombine.jl:669. The only way to avoid adding branches I can see is having the […]

Actually we can simply run the passed expression and return a […]
A new single-threading switch and two user-facing helpers are added next:

```diff
@@ -131,6 +131,115 @@ end
 
 funname(c::ComposedFunction) = Symbol(funname(c.outer), :_, funname(c.inner))
 
+
+const SINGLETHREADING = Threads.Atomic{Bool}(false)
+const SINGLETHREADING_DEPTH = Threads.Atomic{Int}(0)
+
+ismultithreaded() = SINGLETHREADING_DEPTH[] == 0
+
+"""
+    DataFrames.singlethreaded(f)
+
+Run function `f` while disabling multithreading in all DataFrames.jl operations.
+This is useful in particular to run functions which are not thread-safe, or when
+distribution of work across threads is managed separately.
```
Review discussion on the `singlethreaded` docstring:

Thinking about this API design for a while, I personally find it limiting to mix thread-safety and resource control issues. A more granular fix for thread-safety issues is to make it configurable on a per-invocation basis. For example, the function […]

On the other hand, for resource control, a context-based solution is a good choice. However, for making a composable ecosystem for parallel computing, I think it is better to treat the "parallelized context" (= […])

Combining the above two comments, what I suggest is to only provide a keyword argument for supporting thread-unsafe user-defined functions as a surface DataFrames API. DataFrames can internally start relying on an external package for performance hints and let programmers who want to tune the performance invoke the external package. All that said, I also realize these subtle differences may be hard to understand for many (especially new) Julia programmers. I understand DataFrames.jl needs to keep the API very simple to be user-friendly.

The point of separating thread-safety and resource control is very good. @tkf - do I understand you correctly that you propose that: […] If this is correct, then is there any concrete plan/timeline that could be sketched for this "extra package"? @jpsamaroo - what do you think about this proposal?

@nalimilan - what @tkf proposes is very good. But maybe for simplicity what we should do now is: […] The reason is that I believe […] I know that this is not a super clean design, but we could signal to the users that we are planning to have something better in the future and that this is a simple temporary solution. What do you think? The alternative, if we follow @tkf's suggestion, is to add […]

Another idea (probably not very good but it came to my mind so I am sharing it 😄): for thread-safety we could define a wrapper […]

Yes, I think you do! Also, I created ParallelismHints.jl (ref tkf/ParallelismHints.jl#1) as a POC "external package" that I was talking about.

I actually like this idea very much 👍 and even actually wanted to suggest this. For example, not a DataFrames API but […]

Yes, Julia recently gained support for declaring various levels of purity (JuliaLang/julia#43852), but AFAICT there's no way to mark a function as not pure given that it's the default, nor to mark a function as thread-unsafe.
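To make the per-invocation alternative discussed above concrete, here is a hypothetical sketch; the keyword name `threads` and this call form are assumptions for illustration, not part of this diff:

```julia
# Hypothetical per-call opt-out (keyword name is an assumption, not this PR's API):
# the user function `f` is thread-unsafe, so this one call would run serially,
# while other DataFrames operations keep using multiple threads.
combine(gd, :y => f; threads=false)
```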
The diff continues with the rest of the `singlethreaded` docstring and the new definitions:

````diff
+
+*See also*: [`DataFrames.setmultithreading`](@ref) to disable multithreading globally
+
+!!! note
+
+    This function is considered as experimental and may change or be removed once
+    a cross-package mechanism for multithreading configuration is developed.
+
+    Currently, it disables multithreading for any DataFrames.jl
+    operations which may be run while `f` is running (e.g. if tasks using data
+    frames have been spawned on multiple threads).
+    This may change in the future.
+
+# Examples
+```jldoctest
+julia> df = DataFrame(x=repeat(1:5, inner=2), y=1:10);
+
+julia> gd = groupby(df, :x);
+
+julia> counter = 0;
+
+julia> f(x) = (sleep(0.1); global counter += 1); # Thread-unsafe function
+
+julia> DataFrames.singlethreaded() do
+           combine(gd, :y => f)
+       end
+5×2 DataFrame
+ Row │ x      y_f
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      1
+   2 │     2      2
+   3 │     3      3
+   4 │     4      4
+   5 │     5      5
+```
+"""
+function singlethreaded(f)
+    Threads.atomic_add!(SINGLETHREADING_DEPTH, 1)
+    try
+        return f()
+    finally
+        Threads.atomic_sub!(SINGLETHREADING_DEPTH, 1)
+    end
+end
+
+"""
+    DataFrames.setmultithreading(enable::Bool)
+
+Enable or disable multithreading permanently in all DataFrames.jl operations.
+This is useful in particular to run functions which are not thread-safe, or when
+distribution of work across threads is managed separately.
+
+*See also*: [`DataFrames.singlethreaded`](@ref) to disable multithreading only
+for a specific code block
+
+!!! note
+
+    This function is considered as experimental and may change or be removed once
+    a cross-package mechanism for multithreading configuration is developed.
+
+# Examples
+```jldoctest
+julia> df = DataFrame(x=repeat(1:5, inner=2), y=1:10);
+
+julia> gd = groupby(df, :x);
+
+julia> counter = 0;
+
+julia> f(x) = (sleep(0.1); global counter += 1); # Thread-unsafe function
+
+julia> DataFrames.setmultithreading(false);
+
+julia> combine(gd, :y => f)
+5×2 DataFrame
+ Row │ x      y_f
+     │ Int64  Int64
+─────┼──────────────
+   1 │     1      1
+   2 │     2      2
+   3 │     3      3
+   4 │     4      4
+   5 │     5      5
+
+julia> DataFrames.setmultithreading(true);
+```
+"""
+function setmultithreading(enable::Bool)
+    old_state = Threads.atomic_xchg!(SINGLETHREADING, !enable)
+    if !enable && !old_state
+        Threads.atomic_add!(SINGLETHREADING_DEPTH, 1)
+    elseif enable && old_state
+        Threads.atomic_sub!(SINGLETHREADING_DEPTH, 1)
+    end
+    return enable
+end
+
 # Compute chunks of indices, each with at least `basesize` entries
 # This method ensures balanced sizes by avoiding a small last chunk
 function split_indices(len::Integer, basesize::Integer)
````
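Because `singlethreaded` only increments and decrements the `SINGLETHREADING_DEPTH` counter in a `try`/`finally`, nested calls compose. A small sketch of the intended behaviour under the definitions added above (not a doctest from the PR; `ismultithreaded` and `singlethreaded` are unexported, experimental internals):

```julia
using DataFrames

DataFrames.singlethreaded() do
    @assert !DataFrames.ismultithreaded()      # depth == 1
    DataFrames.singlethreaded() do
        @assert !DataFrames.ismultithreaded()  # depth == 2
    end
    @assert !DataFrames.ismultithreaded()      # still disabled, depth back to 1
end
@assert DataFrames.ismultithreaded()           # depth == 0 again

# setmultithreading(false) adds one "permanent" increment,
# which setmultithreading(true) removes.
DataFrames.setmultithreading(false)
@assert !DataFrames.ismultithreaded()
DataFrames.setmultithreading(true)
@assert DataFrames.ismultithreaded()
```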
The threaded branch generated for `@spawn_for_chunks` is gated on `ismultithreaded()` as well:

```diff
@@ -159,7 +268,7 @@ if VERSION >= v"1.4"
 
             nt = Threads.nthreads()
             len = length(x)
-            if nt > 1 && len > basesize
+            if ismultithreaded() && nt > 1 && len > basesize
                 tasks = [Threads.@spawn begin
                              for i in p
                                  local $(esc(lidx)) = @inbounds x[i]
```
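For context, the branch above belongs to the code generated by the internal `@spawn_for_chunks` macro; judging from the next hunk's context line (`_spawn_for_chunks_helper(ex.args[1], ex.args[2], basesize)`), the macro takes a base chunk size followed by a `for` loop. A hedged usage sketch, with a made-up basesize and loop body:

```julia
# Hedged sketch of the internal macro's usage: iterate over a range, spawning
# at most one task per thread once the range is longer than the basesize.
# (Internal API; the basesize and loop body here are illustrative only.)
DataFrames.@spawn_for_chunks 1_000 for i in 1:10_000
    # ... per-element work using `i` ...
end
```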
Finally, the two new macros `@spawn_or_async` and `@spawn_or_run` are defined:

```diff
@@ -215,6 +324,89 @@ macro spawn_for_chunks(basesize, ex)
     return _spawn_for_chunks_helper(ex.args[1], ex.args[2], basesize)
 end
 
+"""
+    @spawn_or_async expr
+
+Equivalent to `Threads.@spawn` if `DataFrames.ismultithreaded() === true`
+and to `@async` otherwise.
+"""
+macro spawn_or_async end
+
+"""
+    @spawn_or_run expr
+
+Equivalent to `Threads.@spawn` if `DataFrames.ismultithreaded() === true`,
+otherwise simply runs `expr`.
+"""
+macro spawn_or_run end
+
+if VERSION >= v"1.4"
+    macro spawn_or_async(expr)
+        letargs = Base._lift_one_interp!(expr)
+
+        thunk = esc(:(()->($expr)))
+        var = esc(Base.sync_varname)
+        quote
+            let $(letargs...)
+                local task = Task($thunk)
+                task.sticky = !DataFrames.ismultithreaded()
+                if $(Expr(:islocal, var))
+                    @static if VERSION >= v"1.5.0"
+                        put!($var, task)
+                    else
+                        push!($var, task)
+                    end
+                end
+                schedule(task)
+                task
+            end
+        end
+    end
+
+    macro spawn_or_run(expr)
+        letargs = Base._lift_one_interp!(expr)
+
+        thunk = esc(:(()->($expr)))
+        var = esc(Base.sync_varname)
+        quote
+            let $(letargs...)
+                if DataFrames.ismultithreaded()
+                    local task = Task($thunk)
+                    task.sticky = false
+                    if $(Expr(:islocal, var))
+                        @static if VERSION >= v"1.5.0"
+                            put!($var, task)
+                        else
+                            push!($var, task)
+                        end
+                    end
+                    schedule(task)
+                else
+                    $thunk()
+                end
+                nothing
+            end
+        end
+    end
+else
+    # This is the definition of @async in Base
+    macro spawn_or_async(expr)
+        thunk = esc(:(()->($expr)))
+        var = esc(Base.sync_varname)
+        quote
+            local task = Task($thunk)
+            if $(Expr(:isdefined, var))
+                push!($var, task)
+            end
+            schedule(task)
+        end
+    end
+
+    macro spawn_or_run(expr)
+        esc(:($expr; nothing))
+    end
+end
+
 function _nt_like_hash(v, h::UInt)
     length(v) == 0 && return hash(NamedTuple(), h)
```
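To illustrate the intent of these macros (a hedged internal-usage sketch under the definitions above): inside an `@sync` block they register the created task just like `Threads.@spawn`/`@async` do, and the `task.sticky` flag decides whether the task may migrate across threads (`sticky = false`, `@spawn`-like) or stays pinned to the current thread (`sticky = true`, `@async`-like):

```julia
# Hedged sketch (internal, unexported macro): behaves like Threads.@spawn when
# DataFrames.ismultithreaded() is true, and like @async otherwise.
results = Vector{Int}(undef, 4)
@sync for k in 1:4
    DataFrames.@spawn_or_async begin
        results[k] = k^2   # writing to distinct slots is safe in either mode
    end
end
results  # [1, 4, 9, 16]
```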
A further review comment: `DataFrames.setmultithreading(true)` sounds like enabling multithreading, not disabling it. Maybe I'm misunderstanding the sentence though.