
Workaround for @cfunction closures on unsupported platforms #131

Open · wants to merge 2 commits into base: master

Conversation

@Keluaa (Contributor) commented Jan 31, 2024

Fixes #88 by storing the BatchClosure object in the thread-local storage of each task.

Architectures that do not support @cfunction closures are detected in the same way as in this line, which should cover all affected platforms.
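For readers unfamiliar with the limitation: @cfunction closures require runtime trampolines, which Julia does not have on some architectures. A minimal sketch of the kind of check involved (the constant name here is an assumption, not Polyester's actual code; the real check is the linked line):

```julia
# Sketch: platforms where @cfunction cannot wrap a closure.
# ARM (32- and 64-bit) and PowerPC lack the trampoline support needed.
const HAS_CFUNCTION_CLOSURES =
    !(Sys.ARCH === :aarch64 || Sys.ARCH === :ppc64le ||
      startswith(string(Sys.ARCH), "arm"))
```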

Tested on Nvidia Grace, but not on Mac. Comparing this method with the @cfunction closure on x86, the overhead is almost the same, but with additional allocations when using threadlocal.

@Keluaa commented Jan 31, 2024

I might have opened this PR too soon, as I see some #undef entries when using threadlocal with more than 65 threads, here on Nvidia Grace with julia -t 72:

```julia
julia> using Polyester

julia> let
           @batch threadlocal=rand(10:99) for i in 0:Threads.nthreads()
               threadlocal += 1
           end
           findall(i -> !isassigned(threadlocal, i), 1:length(threadlocal))
       end
7-element Vector{Int64}:
 65
 66
 67
 68
 69
 70
 71
```

All threads after the 64th are #undef, apart from the very last one. This always happens with more than 65 threads.

@Keluaa commented Feb 1, 2024

Well... I might have forgotten to commit a single line from my PR a year ago; sorry about that. Somehow I didn't stumble upon this sooner.

Now everything is working fine on my side. CI is failing, but apparently only because of Aqua.jl; all multithreading tests pass.

@Keluaa commented Feb 1, 2024

I found that while this works, it allocates a bit more than I would like (ideally zero), and the allocations seem to be proportional to the number of threads, which is too detrimental for performance. More info tomorrow.

@chriselrod (Member) left a comment


Polyester is supposed to try to avoid closures closing over objects by identifying the objects that would be captured, and then passing them as function arguments instead.

Ideally, we'd be able to fix examples of cases where this doesn't work.

Behavior dependent on capturing and mutating isn't going to be thread-safe anyway, and is thus unsupported.
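To illustrate the design described above (a generic sketch, not Polyester's actual transformation), the difference between closing over objects and passing them as arguments looks like this:

```julia
# Closure version: `body` captures `xs` and `out` from the enclosing scope.
# On platforms without closure trampolines, @cfunction cannot wrap `body`.
function sum_with_closure(xs)
    out = Ref(0.0)
    body = i -> (out[] += xs[i])
    foreach(body, eachindex(xs))
    return out[]
end

# Capture-free version: the would-be captures are plain arguments, so the
# kernel is an ordinary generic function that @cfunction can handle.
kernel!(out, xs, i) = (out[] += xs[i])
function sum_without_closure(xs)
    out = Ref(0.0)
    for i in eachindex(xs)
        kernel!(out, xs, i)
    end
    return out[]
end
```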

Comment on lines 63 to +64:

```julia
fptr::Ptr{Cvoid},
closure_obj,
```
@chriselrod (Member):

Why would we want both fptr and closure_obj? Wouldn't it be one or the other?

@Keluaa (Contributor):

fptr allows us to infer the type of BatchClosure{F,A,C} thanks to the @cfunction.
We can't place the closure object directly into the TLS because it contains the captured values in a Core.Box, which doesn't have a fixed size at compile time, and therefore store! fails.
So I put the closure in a Reference, which does have a fixed size, and when calling FakeClosure{F,A,C} we know the type of the closure before load-ing it, allowing type inference.
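The mechanism can be sketched roughly as follows. All names here are simplified assumptions for illustration; the PR's actual store!/load functions and the real FakeClosure definition differ:

```julia
# Sketch, with hypothetical names: instead of passing the closure through
# @cfunction, a fixed-size Ref to it is stored per-thread, and the worker
# recovers the concrete closure type from the C type parameter before loading.
struct FakeClosure{F,A,C} end

const TLS = Vector{Any}(undef, Threads.nthreads())

function launch!(closure::C, tid) where {C}
    TLS[tid] = Ref(closure)  # a Ref has a fixed size, unlike the Core.Box inside
    return nothing
end

function (::FakeClosure{F,A,C})(tid, args...) where {F,A,C}
    ref = TLS[tid]::Base.RefValue{C}  # type known before loading, so this infers
    return ref[](args...)             # call the real closure
end
```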

@Keluaa commented Feb 2, 2024

Polyester is supposed to try and avoid closures closing over objects by identifying the objects that would be captured, and then passing them as function arguments instead.

But currently users are oblivious to this optimization on x86, even when it fails and falls back to a closure, since performance is only slightly affected. I think that having a working fallback on all platforms would be preferable to the current unhelpful error: cfunction: closures are not supported on this platform.
Maybe adding a warning would help? Something like: some captured variables couldn't be turned into function arguments, consider changing how those variables are used by the @batch loop.

I found that while this works, it does allocate a bit more than what I would like

It seems that the allocation issue is unrelated to my changes and only happens on ARM when using --check-bounds=no, because of a type-inference issue (#132).


Successfully merging this pull request may close these issues.

@batch with ARM