Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a keyword argument to disable multithreading #3030
Add a keyword argument to disable multithreading #3030
Changes from 3 commits
a5b1089
a31bf16
37604e6
afd9b8f
c24618a
f218189
39300fc
77e3475
7e3321f
fd665cc
b488196
10e262f
c21a1c8
aa391cd
4b5163c
d4dfb0f
27028e0
2c6488e
2d8d5f5
ce4a2cf
4bbbb27
e575338
1382ffe
f63fe97
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrames.setmultithreading(true)
sounds like enabling multithreading, not disabling it. Maybe I'm misunderstanding the sentence though.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a lot of code that isn't thread-safe is also not safe for
@async
concurrency. Silly example:Perhaps folks will push an ID to a list or something inside a
combine
, and expectsinglethreaded()
will give no concurrency so it works?My actual concern is loading data with Downloads.jl or HTTP.jl; both have a lot of issues with concurrency right now, and even
@async
is enough to quickly lock up release versions of HTTP.jl. So if you do something like "for each group, download a file associated to that group and compute some statistic of it" (a thing I actually do) then this will have issues.So I think it would be great if there was a way to disallow concurrent execution of a function altogether, from green threads or threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrevels pointed out we often don't really want to refer to threads but rather tasks, which kind of generalizes my point above. Maybe it should be
singletask
instead ofsinglethreaded
? (orNotPure
as suggested in the issue)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good point - I think
@async
should be also disabled.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, too bad. I hoped we could support this without adding branches all over the place which make the code hard to read and add special code paths that need a specific test for each case. This applies in particular to splitapplycombine.jl:669. The only way to avoid adding branches I can see is having the
@run_or_async
macro run the code immediately, and return a pseudo-Task
object that supportswith
andfetch
but makes them no-ops. But that's quite complex too.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we can simply run the passed expression and return a
Task
that returns its value. It's a bit of a waste but not worse than spawning multiple tasks when the number of threads is 1. The cost is negligible anyway, which I why we always spawn tasks in these functions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this API design for a while, I personally find it limiting to mix thread-safety and resource control issues.
A more granular fix for thread-safety issues is to make it configurable per-invocation basis. For example, the function
f
used ascombine(gd, :y => f)
may not be safe to be executed on multiple tasks. However,f
on its own may invoke DataFrames APIs that can be safely parallelized. However, since these APIs are executed within the invocation ofcombine(gd, :y => f)
, a context-based solution cannot (easily) be used. (Reading @nalimilan's comment #2988 (comment), there may be a PR to add the keyword argument later. But I think it'd be better to not advertisesinglethreaded
as a mechanism for thread-safety if so.)On the other hand, for resource control, a context-based solution is a good choice. However, for making composable ecosystem for parallel computing, I think it is better to treat "parallelized context" (=
singlethreaded
-like-API) as a hint. If you are doinggropuby
of a large DataFrame in a child task where there are only 4 parent tasks on 8 core machines, it probably still is a better choice to parallelizegroupby
. So, from this perspective, I'd just try to be very conservative when choosing basesize in the "parallelized context." Furthermore, since this works only if this API is used across packages, I suggest creating a simple package for this and use it in DataFrames. (A small upside is that the version of said API does not have to be tied to DataFrames version. Since it is purely a performance hint, it may not have to be considered breaking if DataFrames.jl stops supporting this.)Combining the above two comments, what I suggest is to only provide a keyword argument for supporting thread-unsafe user-defined functions as a surface DataFrames API. DataFrames can internally start relying on an external package for performance hints and let programmers who want to tune the performance invoke the external package.
All that said, I also realize these subtle differences may be hard to understand for many (especially new) Julia programmers. I understand DataFrames.jl needs to keep API very simple to be user-friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of separating thread-safety and resource control is very good.
@tkf - do I understand you correctly that you propose that:
If this is correct then is there any concrete plan/timeline that could be sketched for this "extra package"?
@jpsamaroo - what do you think about this proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - what @tkf proposes is a very good. But maybe for simplicity what we should do now is:
setmultithreading
only for now as a poor man's solution for both thread safety and resource control.The reason is that I believe
setmultithreading
API is simple for users to understand. It will be marked as experimental so we can remove it when we have something better.I know that this is not a super clean design but we could signal to the users that we are planning to have something better in the future but this is a simple temporary solution. What do you think?
The alternative, if we follow @tkf suggestion is to add
singlethreaded::Bool=false
kwarg to all functions that take a function and potentially do multi-threading.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea (probably not very good but it came to my mind so I am sharing it 😄):
for thread-safety we could define wrapper
NotPure(fun)
and then write e.g.:a => NotPure(x -> rand()) => :b
and then we would not even have to add any kwarg.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you do!
Also, I created ParallelismHints.jl (ref tkf/ParallelismHints.jl#1) as a POC "external package" that I was talking about.
I actually like this idea very much 👍 and even actually wanted to suggest this. For example, not a DataFrames API but
mapreduce(f, NotPure(op), xs)
is somewhat parallelizable. But I suggested the keyword argument because it may look "simpler" in some sense. OTOH, people use API likeByRow
so maybe it's not crazy?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the
NotPure
approach is interesting as it works generally without having to add arguments everywhere. It also allows passing a wrapped function through several call levels if needed. We already considered a similar approach formap
onPooledArray
(JuliaData/PooledArrays.jl#63).Julia recently gained support for declaring various levels of purity (JuliaLang/julia#43852), but AFAICT there's no way to mark a function as not pure given that it's the default, nor to mark a function as thread-unsafe.