synchronize(blocking = false) hangs in julia 1.7 eventually #1350
I can confirm I see the exact same problem in my experiments. It is highly stochastic and happens on different machines and cards, but half of them eventually get stuck at different synchronization points, sometimes hours into an experiment. I have been trying all week to make sense of it and extract a minimal example, but that has proved difficult. On Julia 1.6.5 the problem does not arise. For example, Ctrl+C on a single-process, single-thread CUDA.jl application that has been stuck doing nothing for an hour gives:
Another one (after 1300 epochs of training) looks like this:
I can confirm now that moving from this pattern
to this
"Fixes" the problem in Julia 1.7. Run a test with 70 million calls +. All good. Whereas calling |
That's concerning. Can you confirm you are using the exact same packages across Julia versions? Another interesting datapoint would be to disable the nonblocking synchronization by commenting out lines 126 to 128 in 00955dd:
Of course, if you rely on multitasking (to perform other GPU operations while the sync is happening and blocking the thread), this will change the dynamics of your application.
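(As a small illustration of that point, not code from this thread: with the nonblocking form, a task waiting on the GPU yields, so other tasks on the same thread can keep queueing work; with a blocking synchronize they cannot.)

```julia
using CUDA

a = CUDA.rand(1_000_000)

# One task waits for the GPU without blocking the thread...
t = @async begin
    b = a .* 2f0
    synchronize(blocking = false)   # yields to the scheduler instead of blocking the thread
    sum(b)
end

# ...so this task can still queue its own GPU work in the meantime.
c = a .+ 1f0
fetch(t)
```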
This suggestion is coming from a place of complete ignorance, but I wonder if it can be related to JuliaLang/julia#44019
So it seems that these are all random outcomes of the same bug.
Are you using multiple threads? If so, it's possible there are some bugs lurking. But with plain multitasking we shouldn't be locking up.
I'm running a single process with a single thread. I am running different Julia instances on different GPUs, if that matters.
The issue you linked to is about use of …
Those are very different from a deadlock. Can you post the error messages and backtraces?
Just jumping on the train: I see similar deadlocks in an iterative algorithm where I mainly use operations like …
My questions remain though:
Ideally a MWE or reproducer would be most helpful, but if that doesn't work, a bisect of CUDA.jl (assuming it's an upgrade of the package that causes this) could also shed some light on this issue.
I am not seeing the issue on Julia 1.6 with CUDA v3.8.0, so it is not a pure CUDA.jl bug.
I'll try to execute it in the REPL; maybe it exposes some log.
For some reason it has occurred only in my Jupyter notebook so far but not in the REPL, despite executing the same code.
Maybe we should focus on the hang issue in this bug (with no threading/tasking)? As for the original case:
Now I am trying to "disable the nonblocking synchronization", but somehow I get this error while switching to package dev mode (Julia 1.7.2). Any hints on what I am doing wrong?
Here are the top-level packages I have,
and here is what the CUDA.jl dependencies look like:
Do you have an old CUDA.jl clone in your dev folder?
Correct. We can make it so that the synchronization doesn't consume CPU but blocks on an OS primitive instead; however, that still blocks other Julia tasks from making progress. @luraess's testing seems to imply this may be related to Julia 1.7.1 -- could you verify that the nonblocking sync hangs on that version but still works on 1.7.0?
The problem appeared in 1.7.0, and I tried 1.7.1 and 1.7.2, all with the same result.
OK, we'll have to debug this then. What could be useful is a backtrace of all the live tasks during the hang. That isn't easy to come by though, and needs some … Please try to reproduce the hang with this build of Julia. Once the deadlock happens, attach GDB.
Alternatively, if you don't have … Once you have GDB attached at the point of deadlock, we first need to find a thread that we can use to print the backtraces from. Typically that will just be thread 1, but if that thread happens to be doing GC (…).
So here thread 1 isn't doing GC and can be used to dump the task backtraces. First check how many live tasks there are:
Now we can print the backtraces for each of these (numbering starts at 0 and goes up to the length reported above):
This will print a backtrace in the process's terminal. For example, if I do a simple …
Please report those here for all live tasks. If you have any trouble with this, contact me on Slack.
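(The exact commands and the Julia-internal helpers they call were lost from this page and are not reproduced here. Purely as a generic illustration of the attach-and-inspect workflow with standard GDB commands, assuming the stuck process has PID 12345:)

```
gdb -p 12345                # attach to the hung Julia process

(gdb) info threads          # list OS threads; pick one that is not busy in GC
(gdb) thread 1              # switch to, e.g., thread 1
(gdb) bt                    # C-level backtrace of that thread
(gdb) thread apply all bt   # or dump backtraces for every thread at once
```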
Unfortunately, I run Windows for this project. I have Linux in Docker, but then the GPUs don't get exposed correctly there (at least with VirtualBox and my limited knowledge). Any chance of doing something similar on Windows?
What about WSL2? That should be easier than running …
Following up on #1350 (comment), after more testing on 1.6.5, 1.7.0, 1.7.1 and 1.7.2, it seems that both for Spack-built binaries and binaries downloaded from julialang.org:
Cool that WSL2 now supports GPUs. I managed to make it work. Good news: the test hangs in WSL2. If it helps, when I press Ctrl+C I get the following:
The bad news is that I can't seem to start the Julia version you sent me. I unzipped it and just tried to start julia from the bin folder; I get the following error:
A quick search on the web says that one shouldn't mess with this lib but instead ask the developer to build for the correct OS version. But I know very little about Linux; maybe there are other ways to make it work. For reference, here is the setup I have:
Ah yes, I'm building on a fairly recent Linux distro. Can you build it yourself? The patch you need:

```diff
diff --git a/src/julia_threads.h b/src/julia_threads.h
index 5727083212..9832fa9ac4 100644
--- a/src/julia_threads.h
+++ b/src/julia_threads.h
@@ -45,10 +45,10 @@ typedef win32_ucontext_t jl_ucontext_t;
 #endif
 #if 0
 // very slow, but more debugging
-//#elif defined(_OS_DARWIN_)
-//#define JL_HAVE_UNW_CONTEXT
-//#elif defined(_OS_LINUX_)
-//#define JL_HAVE_UNW_CONTEXT
+#elif defined(_OS_DARWIN_)
+#define JL_HAVE_UNW_CONTEXT
+#elif defined(_OS_LINUX_)
+#define JL_HAVE_UNW_CONTEXT
 #elif defined(_OS_EMSCRIPTEN_)
 #define JL_HAVE_ASYNCIFY
 #elif !defined(JL_HAVE_ASM)
```

If not, I can have the Julia buildbots generate a build instead.
@maleadt I will give it a try now as well, since it turns out that 1.7.0, 1.7.1 and 1.7.2 all hang. Using your debug Julia build, one encounters that …
@maleadt getting:
when running your Julia build using …
Could you please make a build for Ubuntu 20.04.3? Who knows how long it will take me to get the whole build chain for Julia running.
Thanks! Here is what I am getting. Hopefully I followed the instructions correctly:
I have run 4M loops with CUDA from tb/async_errors and it is running OK, so this looks very promising.
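(For anyone else following along: the branch mentioned above can be tried with standard Pkg syntax; a sketch, not taken from this thread:)

```julia
using Pkg
Pkg.add(PackageSpec(name = "CUDA", rev = "tb/async_errors"))  # equivalent to `] add CUDA#tb/async_errors`
```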
With that and #1369 (comment) I hope we can close this. Please re-open if the issue remains.
Thanks!
I have a general pattern like this:
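(The snippet itself was not preserved; a minimal runnable sketch of the pattern as described below, with the data source and GPU workload as hypothetical stand-ins:)

```julia
using CUDA

# hypothetical stand-ins for the real data source and GPU computation
get_next_input() = rand(Float32, 1024)
process(d) = d .* 2f0 .+ 1f0

function work(input)
    d = CuArray(input)
    r = process(d)
    synchronize(blocking = false)   # the call this issue is about
    return Array(r)
end

for _ in 1:1_000                    # the real service loops indefinitely, 60-100 calls/s
    output = work(get_next_input())
end
```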
It has worked 24/7 with Julia 1.6 for months already with no issues (I restart it about once a week due to new data I need to add; in general, I think I have seen only one unexplained hang over the past 2 years / 2B calls, so I am very happy about CUDA.jl's stability). However, with Julia 1.7, after a while (between 10,000 and 500,000 calls/loops, which in my case typically means once every 10-60 minutes) it just hangs. This "hang" happens on different data inputs and different cards on the server (if I let it run, eventually all 6 cards hang). It happens on my development PC as well.
"hang" means the code is stuck in
synchronize(blocking = false)
line, GPU stops showing any load, i.e. GPU does nothing yet the code doesn't return. if I callwork()
with the same input parameters as the call which just hang (i.e. pressing Ctrl+C first to get out of the loop), it works just fine.I am trying to create a simple snippet for the bug reproduction, but as you can imagine it is bit difficult so far. And then if issue is related to timing would not guaranteed to reproduce on different hardware.
So wonder if there are any tips how to debug it?
Or if CUDA.jl developers could be having ideas how julia 1.7.x could be affecting this pattern? (again same code/CUDA works fine in julia 1.6.5)
Meanwhile, maybe an interesting hint on what is going on: running the processing as a task seems to reduce the probability of a hang dramatically. Maybe this will give some ideas.
This code is about 5-10x less likely to hang, but I still managed to get a failure in my tests.
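(The exact snippet was not kept; a rough sketch of the workaround as described, wrapping each call in a task, reusing the hypothetical `work` and `get_next_input` from the sketch above:)

```julia
for _ in 1:1_000
    t = @async work(get_next_input())   # run the GPU work on its own task
    output = fetch(t)                   # wait for its result before the next iteration
end
```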
I currently run this code:
It has already done 10M+ calls with no issues, which kind of points to the problem not being in the data I ship but somewhere else.
Any input on how to debug it is appreciated. (The task workaround is a bit ugly and seems to cost a fair amount of extra CPU since I run 60-100 calls a second, and it obviously just hides a problem that shouldn't be there in the first place.)
This is what my dev PC looks like: