synchronize(blocking = false) hangs in julia 1.7 eventually #1350

Closed
anj00 opened this issue Feb 4, 2022 · 32 comments
Labels
bug Something isn't working


@anj00

anj00 commented Feb 4, 2022

I have a general pattern like this

function work()
   do_work_on_gpu()
   synchronize(blocking = false)
end

while true
   work()
end

It has been working 24/7 with Julia 1.6 for months with no issues (I restart it about once a week due to new data I need to add; over roughly the past 2 years / 2 billion calls I think I have seen only one unexplained hang, so I am very happy with CUDA.jl's stability). However, with Julia 1.7, after a while (between 10,000 and 500,000 calls/loops, which in my case typically means once every 10-60 minutes) it just hangs. The hang happens on different input data and on different cards in the server (if I let it run, eventually all 6 cards hang), and it happens on my development PC as well.
"Hang" means the code is stuck on the synchronize(blocking = false) line: the GPU stops showing any load, i.e. the GPU does nothing, yet the call doesn't return. If I press Ctrl+C to get out of the loop and then call work() again with the same input parameters as the call that just hung, it works just fine.

I am trying to create a simple snippet to reproduce the bug, but as you can imagine it has been difficult so far. And if the issue is timing-related, it is not guaranteed to reproduce on different hardware anyway.

So I wonder if there are any tips on how to debug this?
Or do the CUDA.jl developers have ideas on how Julia 1.7.x could be affecting this pattern? (Again, the same code/CUDA works fine on Julia 1.6.5.)

Meanwhile, here is maybe an interesting hint about what is going on: running the processing as a task seems to reduce the probability of a hang dramatically. Maybe it will give ideas.
This code is about 5-10x less likely to hang, but I still managed to make it fail in my tests:

function work()
   do_work_on_gpu()
   fetch(@async synchronize(blocking = false))
end

I currently run this code

function work()
   do_work_on_gpu()
   synchronize(blocking = false)
end

while true
   fetch(@async work())
end

It has already done 10M+ calls with no issues, which suggests the problem is probably not in the data I ship but somewhere else.

Any input on how to debug this is appreciated. (The task workaround is a bit ugly and appears to cost a fair amount of extra CPU when I run 60-100 calls a second, and it obviously just hides a problem that shouldn't be there in the first place.)

This is what my dev PC looks like:

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_CUDA_NSYS = C:\Program Files\NVIDIA Corporation\Nsight Systems 2021.2.1\target-windows-x64\nsys.exe

julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
Unknown NVIDIA driver, for CUDA 11.6
CUDA driver 11.6

Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: missing
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_NSYS: C:\Program Files\NVIDIA Corporation\Nsight Systems 2021.2.1\target-windows-x64\nsys.exe

1 device:
  0: NVIDIA GeForce RTX 2070 (sm_75, 7.013 GiB / 8.000 GiB available)
@anj00 added the bug label on Feb 4, 2022
@guyvdbroeck
Contributor

guyvdbroeck commented Feb 4, 2022

I can confirm I see the exact same problem in my experiments. It is highly stochastic and happens on different machines and cards, but about half of my runs eventually get stuck at different synchronization points, sometimes hours into an experiment. I have been trying all week to make sense of it and extract a minimal example, but that has proved difficult. On Julia 1.6.5 the problem does not arise.

For example, Ctrl+C on a single-process, single-thread CUDA.jl application that has been stuck doing nothing for an hour gives:

^C
signal (2): Interrupt
in expression starting at /home/guy/.julia/dev/ProbabilisticCircuits/example/bug0.jl:2         
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)                                   
uv__io_poll at /workspace/srcdir/libuv/src/unix/epoll.c:240                                    
uv_run at /workspace/srcdir/libuv/src/unix/core.c:383                                          
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:481                  
poptask at ./task.jl:827
wait at ./task.jl:836
task_done_hook at ./task.jl:544
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]                
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429                    
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]               
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:218                     
start_task at /buildworker/worker/package_linux64/build/src/task.c:888                         
unknown function (ip: (nil))
Allocations: 909734470 (Pool: 909281029; Big: 453441); GC: 322                    

Another one (after 1300 epochs of training) looks like this:

^CERROR: InterruptException:      
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))                                       
    @ Base ./task.jl:777
  [2] wait()
    @ Base ./task.jl:837        
  [3] wait(c::Base.GenericCondition{ReentrantLock})                                            
    @ Base ./condition.jl:123
  [4] wait(e::Base.Event)      
    @ Base ./lock.jl:366
  [5] nonblocking_synchronize
    @ ~/space/.julia/packages/CUDA/bki2w/lib/cudadrv/stream.jl:162 [inlined]
  [6] (::CUDA.var"#207#208"{Float32, Vector{Float32}, Int64, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Int64, Int64})()                                                                     
    @ CUDA ~/space/.julia/packages/CUDA/bki2w/src/array.jl:406                                 
  [7] #context!#59
    @ ~/space/.julia/packages/CUDA/bki2w/lib/cudadrv/state.jl:164 [inlined]                    
  [8] context!                                                                                 
    @ ~/space/.julia/packages/CUDA/bki2w/lib/cudadrv/state.jl:161 [inlined]                    
  [9] unsafe_copyto!(dest::Vector{Float32}, doffs::Int64, src::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)                                                            
    @ CUDA ~/space/.julia/packages/CUDA/bki2w/src/array.jl:402
 [10] copyto!
    @ ~/space/.julia/packages/CUDA/bki2w/src/array.jl:356 [inlined]
 [11] getindex
    @ ~/space/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:89 [inlined]
 [12] #25
    @ ~/space/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:75 [inlined]
 [13] task_local_storage(body::GPUArrays.var"#25#28"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, key::Symbol, val::Bool)
    @ Base ./task.jl:281
 [14] macro expansion
    @ ~/space/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:74 [inlined]
 [15] _mapreduce(f::typeof(identity), op::typeof(Base.add_sum), As::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}; dims::Colon, init::Nothing)
    @ GPUArrays ~/space/.julia/packages/GPUArrays/umZob/src/host/mapreduce.jl:65
 [16] #mapreduce#20
    @ ~/space/.julia/packages/GPUArrays/umZob/src/host/mapreduce.jl:28 [inlined]
 [17] mapreduce
    @ ~/space/.julia/packages/GPUArrays/umZob/src/host/mapreduce.jl:28 [inlined]
 [18] #_sum#735
    @ ./reducedim.jl:894 [inlined]
 [19] _sum
    @ ./reducedim.jl:894 [inlined]
 [20] #_sum#734
    @ ./reducedim.jl:893 [inlined]
 [21] _sum
    @ ./reducedim.jl:893 [inlined]
 [22] #sum#732
    @ ./reducedim.jl:889 [inlined]
 [23] sum
    @ ./reducedim.jl:889 [inlined]
 [24] mini_batch_em(bpc::CuBitsProbCircuit, raw_data::CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, num_epochs::Int64; batch_size::Int64, pseudocount::Float64, softness::Float64, param_inertia::Float64, param_inertia_end::Float64, flow_memory::Int64, flow_memory_end::Int64, shuffle::Symbol, mars_mem::Nothing, flows_mem::Nothing, node_aggr_mem::Nothing, edge_aggr_mem::Nothing, mine::Int64, maxe::Int64, debug::Bool)
    @ ProbabilisticCircuits ~/space/.julia/dev/ProbabilisticCircuits/src/bit_circuits/em.jl:296
 [25] macro expansion
    @ ./timing.jl:220 [inlined]
 [26] experiment(train::CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, test::CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, epochs1::Int64, epochs2::Int64, epochs3::Int64, latents::Int64; batch_size::Int64, latent_heuristic::String, pseudocount::Float64, softness::Float64, param_inertia1::Float64, param_inertia_end1::Float64, param_inertia2::Float64, param_inertia_end2::Float64, shuffle::Symbol)
    @ Main /scratch/guyvdb/.julia/dev/ProbabilisticCircuits/example/single_experiment.jl:25
 [27] top-level scope
    @ REPL[6]:1
 [28] top-level scope
    @ ~/space/.julia/packages/CUDA/bki2w/src/initialization.jl:52

@anj00
Author

anj00 commented Feb 6, 2022

I can confirm now that moving from this pattern

function work()
   do_work_on_gpu()
   synchronize(blocking = false)
end

while true
   work()
end

to this

while true
   fetch(@async work())
end

"Fixes" the problem in Julia 1.7. Run a test with 70 million calls +. All good. Whereas calling work() directly consistently hangs after 30k-500k calls/iterations.

@maleadt
Member

maleadt commented Feb 7, 2022

That's concerning. Can you confirm you are using the exact same packages across Julia versions?

Another interesting datapoint would be to disable the nonblocking synchronization by commenting out:

# perform as much of the sync as possible without blocking in CUDA.
# XXX: remove this using a yield callback, or by synchronizing on a dedicated stream?
nonblocking_synchronize(stream)

Of course, if you rely on multitasking (to perform other GPU operations while the sync is happening and blocking the thread) this will change the dynamics of your application.
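In case it is useful, here is a rough sketch of how to make that local edit (this assumes the standard Pkg.develop workflow; the path below is just the default dev directory):

using Pkg
Pkg.develop("CUDA")     # checks the package sources out into ~/.julia/dev/CUDA
# then open ~/.julia/dev/CUDA/lib/cudadrv/stream.jl, comment out the
# nonblocking_synchronize(stream) call quoted above, and restart Julia
# so the modified package gets recompiled and picked up

Whether the hang still occurs with that edit in place is the datapoint of interest.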

@guyvdbroeck
Contributor

This suggestion is coming from a place of complete ignorance, but I wonder whether it could be related to JuliaLang/julia#44019.
For context, I was randomly getting the exact same script to sometimes:

  • crash with signal (6): Aborted
  • crash with segfault
  • deadlock as shown above.

So it seems that these are all random outcomes of the same bug.

@maleadt
Member

maleadt commented Feb 8, 2022

Are you using multiple threads? If so, it's possible there are some bugs lurking. But with plain multitasking we shouldn't be locking up.

@guyvdbroeck
Contributor

I'm running a single process with a single thread. I am running different Julia instances on different GPUs if that matters.

@maleadt
Member

maleadt commented Feb 8, 2022

The issue you linked to is about use of @threads, so it's unlikely to be related.

  • crash with signal (6): Aborted

  • crash with segfault

Those are very different from a deadlock. Can you post the error messages and backtraces?

@roflmaostc

I'll just jump on the train: I see similar deadlocks in an iterative algorithm where I mainly use operations like abs2., sqrt. and broadcasting :/
Mean time until a lock-up is probably ~5-10 minutes.

@maleadt
Member

maleadt commented Feb 8, 2022

My questions remain though:

  • is this caused by an upgrade of Julia 1.6 to 1.7?
  • or is this caused by an upgrade of CUDA.jl, or any other package?
  • does disabling non-blocking synchronization help?
  • does this only deadlock, or also abort/segfault (as reported by @guyvdbroeck)?

Ideally an MWE or reproducer would be most helpful, but if that doesn't work, a bisect of CUDA.jl (assuming it's an upgrade of the package that causes this) could also shed some light on the issue.
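For what it's worth, a bare-bones reproducer following the pattern reported above could look something like this (the GPU workload is only a placeholder broadcast, and whether it actually triggers the hang is timing-dependent and unverified):

using CUDA

function work(x)
    y = x .^ 2 .+ 1                  # placeholder GPU workload
    synchronize(blocking = false)    # the call that eventually hangs
    return y
end

x = CUDA.rand(Float32, 1024)
for i in 1:1_000_000
    work(x)
    i % 100_000 == 0 && println("completed $i iterations")
end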

@guyvdbroeck
Contributor

I am not seeing the issue on Julia 1.6 with CUDA v3.8.0, so it is not a pure CUDA.jl bug.
Sorry I cannot be more helpful; my example takes hours to reach the bug. @roflmaostc's 5-minute example is the way to go.
I don't know how to get backtraces (--bug-report is too slow and I couldn't run a Julia debug instance, see the other report), but one error message from a run that aborted instead of locking up is:

Mini-batch EM iter 309; train LL -677.6212

signal (6): Aborted
in expression starting at REPL[2]:1
Allocations: 1006391627 (Pool: 1005937998; Big: 453629); GC: 416
Aborted (core dumped)

@roflmaostc

I'll try to execute it in the REPL; maybe it exposes some log.

@roflmaostc

For some reason it occurred only in my Jupyter notebook so far but not in the REPL, despite executing the same code.

@anj00
Author

anj00 commented Feb 9, 2022

Maybe we should focus on the hang issue in this bug (with no threading/tasking)?

As for the original case:

  • is this caused by an upgrade of Julia 1.6 to 1.7
    • yes, the very same CUDA.jl in both. In fact I have been trying to migrate to 1.7 for a while already. I think I had CUDA.jl 3.6.4 and already had issues with Julia 1.7, then tried several CUDA.jl versions between 3.6.4 and 3.8.0, all with the same results. It was happening in Julia 1.7.0 and 1.7.1, and still happens in 1.7.2.
  • or is this caused by an upgrade of CUDA.jl, or any other package?
    • I think some time ago I ran a test where I had exactly the same versions of all the packages (I have a dedicated env folder as part of my git repo) and just a different Julia version, and I still had the issue.

Now I am trying to disable the nonblocking synchronization, but I somehow get this error while switching to package dev mode (Julia 1.7.2). Any hints on what I am doing wrong?

dev CUDA
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package LLVM [929cbde3]:
 LLVM [929cbde3] log:
 ├─possible versions are: 0.9.0-4.7.1 or uninstalled
 ├─restricted to versions 1.5.2-2 by CUDA [052768ef], leaving only versions 1.5.2-2.0.0
 │ └─CUDA [052768ef] log:
 │   ├─possible versions are: 1.2.0 or uninstalled
 │   └─CUDA [052768ef] is fixed to version 1.2.0
 └─restricted by julia compatibility requirements to versions: 4.0.0-4.7.1 or uninstalled — no versions left

Here are the top-level packages I have:

  [6e4b80f9] BenchmarkTools v1.3.0
  [336ed68f] CSV v0.10.2
  [052768ef] CUDA v3.8.0
  [a93c6f00] DataFrames v1.3.2
  [5789e2e9] FileIO v1.13.0
  [708ec375] Gumbo v0.8.0
  [cd3eb016] HTTP v0.9.17
  [033835bb] JLD2 v0.4.20
  [682c06a0] JSON v0.21.2
  [bdcacae8] LoopVectorization v0.12.101
  [f0f68f2c] PlotlyJS v0.18.8
  [91a5bcdd] Plots v1.25.8
  [2913bbd2] StatsBase v0.33.14
  [f269a46b] TimeZones v1.7.1

and here is what the CUDA.jl dependencies look like:

CUDA : 3.8.0 
  RandomNumbers        : 1.5.3
    Requires             : 1.3.0
  AbstractFFTs         : 1.1.0
    ChainRulesCore       : 1.12.0
      Compat               : 3.41.0
  TimerOutputs         : 0.5.15
    ExprTools            : 0.1.8
  GPUCompiler          : 0.13.11
    LLVM                 : 4.7.1
      CEnum                : 0.4.1
      LLVMExtra_jll        : 0.0.13+1
        JLLWrappers          : 1.4.1
          Preferences          : 1.2.3
    ExprTools            : 0.1.8
    TimerOutputs         : 0.5.15
      ExprTools            : 0.1.8
  LLVM                 : 4.7.1
    CEnum                : 0.4.1
    LLVMExtra_jll        : 0.0.13+1
      JLLWrappers          : 1.4.1
        Preferences          : 1.2.3
  CEnum                : 0.4.1
  BFloat16s            : 0.2.0
  GPUArrays            : 8.2.1
    LLVM                 : 4.7.1
      CEnum                : 0.4.1
      LLVMExtra_jll        : 0.0.13+1
        JLLWrappers          : 1.4.1
          Preferences          : 1.2.3
    Adapt                : 3.3.3
  SpecialFunctions     : 2.1.2
    IrrationalConstants  : 0.1.1
    ChainRulesCore       : 1.12.0
      Compat               : 3.41.0
    LogExpFunctions      : 0.3.6
      IrrationalConstants  : 0.1.1
      ChainRulesCore       : 1.12.0
        Compat               : 3.41.0
      ChangesOfVariables   : 0.1.2
        ChainRulesCore       : 1.12.0
          Compat               : 3.41.0
      DocStringExtensions  : 0.8.6
      InverseFunctions     : 0.1.2
    OpenSpecFun_jll      : 0.5.5+0
      JLLWrappers          : 1.4.1
        Preferences          : 1.2.3
  ExprTools            : 0.1.8
  Requires             : 1.3.0
  Reexport             : 1.2.2
  Adapt                : 3.3.3
  Random123            : 1.4.2
    RandomNumbers        : 1.5.3
      Requires             : 1.3.0

@maleadt
Member

maleadt commented Feb 9, 2022

 LLVM [929cbde3] log:
 ├─possible versions are: 0.9.0-4.7.1 or uninstalled
 ├─restricted to versions 1.5.2-2 by CUDA [052768ef], leaving only versions 1.5.2-2.0.0

Do you have an old CUDA.jl clone in your dev folder?

@anj00
Author

anj00 commented Feb 9, 2022

Indeed, sorry about that. Had an old CUDA.jl in the dev folder. Forgot about it.

Now I commented out the line you suggested.

And it looks like the code no longer hangs. It has been running for 2.1 million loops (at least 10-20x longer than with that line), so at least we can say it is a significant improvement. I will let the test run a bit more.

Of course, as I understand it, commenting out that line makes the CPU busy-wait at 100% for GPU results, which I hope won't be the final solution. This test is one process, but in production I run 6-12 Julia processes (1-4 per card), so with such a solution the CPU would be 100% busy and would actually slow down the process that generates the data for the GPU to work on :) Kind of ironic that the GPU starves because other GPU processes use the CPU to wait :) But it is at least a hint about where to look for the solution.

@maleadt
Member

maleadt commented Feb 10, 2022

Of course, as I understand commenting that line is causing a CPU being 100% busy waiting for GPU results.

Correct. We can make the synchronization not consume CPU by blocking on an OS primitive instead, but that still blocks other Julia tasks from making progress.

@luraess's testing seems to imply this may be related to Julia 1.7.1 -- could you verify the nonblocking_sync hangs on that version but still works on 1.7.0?

@anj00
Author

anj00 commented Feb 10, 2022

The problem appeared in 1.7.0, and I tried 1.7.1 and 1.7.2, all with the same result.

@maleadt
Member

maleadt commented Feb 10, 2022

OK, we'll have to debug this then. What could be useful is a backtrace of all the live tasks during the hang. That isn't easy to come by though, and needs some gdb wrangling using a custom Julia build. I've prepared an appropriate build here, https://drive.google.com/file/d/1C3wtlaIzAw6kQuZ8JqubA4BCbOJwLkl8/view?usp=sharing, which is just Julia 1.7.3-pre (from release-1.7) with the necessary patch applied.

Please try to reproduce the hang with this build of Julia. Once the deadlock happens, attach gdb to the process (it may be useful to note the output of getpid() from Julia before launching your application):

sudo gdb --pid 83937

Alternatively, if you don't have sudo on that machine, you could run the custom Julia under gdb (gdb --args ./julia $ANY_OTHER_ARGS and then run). To get back to GDB once the deadlock happens, hit Ctrl-C. If GDB breaks before that, because of another unrelated signal (e.g. SIGSEGV as used by the GC) you can tell it to ignore that signal using handle SIGSEGV nostop and continue to continue.

Once you have GDB at the point of deadlock, we first need to find a thread that we can use to print the backtraces from. Typically that will just be thread 1, but if that thread happens to be doing GC (ptls->gc_state != 0) you can't use it (in that case, either continue and try again later, or try another thread, taking care to only go up to the number of threads Julia was launched with and not use the OpenBLAS threads):

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f3ccc3cab80 (LWP 83937))]
#0  0x00007f3ccc4eb92e in epoll_wait () from /usr/lib/libc.so.6
(gdb) print (int8_t) ((jl_ptls_t)jl_get_ptls_states())->gc_state
$6 = 0 '\000'

So here thread 1 isn't doing GC and can be used to dump the task backtraces. First check how many live tasks there are:

(gdb) print jl_live_tasks()->length
$7 = 3

Now we can print the backtraces for each of these (numbering starts at 0 and goes up to one less than the length reported above):

(gdb) call jlbacktracet(jl_arrayref(jl_live_tasks(), 0))

This will print a back-trace in the process' terminal. For example, if I do a simple wait(Condition()) from the REPL I get:

jl_unw_swapcontext at /tmp/julia/src/task.c:958 [inlined]
jl_swap_fiber at /tmp/julia/src/task.c:970
ctx_switch at /tmp/julia/src/task.c:437
jl_switch at /tmp/julia/src/task.c:502
try_yieldto at ./task.jl:767
wait at ./task.jl:837
wait at ./condition.jl:123
#134 at /tmp/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:281 [inlined]
lock at ./lock.jl:190
lock at ./condition.jl:78 [inlined]
macro expansion at /tmp/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:279 [inlined]
#133 at ./threadingconstructs.jl:178
jfptr_YY.133_50259 at /tmp/julia/usr/lib/julia/sys.so (unknown line)
jl_apply at /tmp/julia/src/julia.h:1788 [inlined]
start_task at /tmp/julia/src/task.c:877

Please report those here for all live tasks. If you have any troubles with this, contact me on Slack.

@anj00
Author

anj00 commented Feb 10, 2022

Unfortunately, I run Windows for this project. I have Linux in Docker, but GPUs don't get exposed correctly there (at least with VirtualBox and my limited knowledge).

Any chance of doing something similar on Windows?
A quick search shows there is a gdb port for Windows (I have zero experience with it, but fingers crossed it works). If you can build that special Julia for Windows as well, I can try to run it.

@maleadt
Member

maleadt commented Feb 10, 2022

What about WSL2? That should be easier than running gdb in Windows, I think.

@luraess

luraess commented Feb 10, 2022

Following up on #1350 (comment), after more testing on 1.6.5, 1.7.0, 1.7.1 and 1.7.2, it seems that both for Spack-built binaries and binaries downloaded from julialang.org:
1.6.5 - pass
1.7.0 - pass
1.7.1 - fail (freezing)
1.7.2 - pass (no Spack-build available yet)

@anj00
Author

anj00 commented Feb 10, 2022

Cool that WSL2 now supports GPUs. I managed to get it working.

Good news: the test hangs in WSL2 as well. If that helps, when I press Ctrl+C I get the following:

signal (2): Interrupt
in expression starting at /mnt/c/Src/test.jl:54
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/epoll.c:240
uv_run at /workspace/srcdir/libuv/src/unix/core.c:383
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:481
poptask at ./task.jl:827
wait at ./task.jl:836
task_done_hook at ./task.jl:544
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:218
start_task at /buildworker/worker/package_linux64/build/src/task.c:888
unknown function (ip: (nil))
Allocations: 2378839661 (Pool: 2378701340; Big: 138321); GC: 1885

The bad news is that I can't seem to start the Julia version you sent me. I unzipped it and just try to start julia from the bin folder, and I get the following error:

ERROR: Unable to load dependent library /_path_to_unzipped_julia_debug_/../lib/julia/libjulia-internal.so.1
Message:/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /_path_to_unzipped_julia_debug_/../lib/julia/libjulia-internal.so.1)

A quick web search says one shouldn't mess with this library but should instead ask the developer to build for the correct OS version. But I know very little about Linux; maybe there are other ways to make it work.

For reference, here is the setup I have: Windows WSL2, Ubuntu 20.04.3.

Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
CUDA toolkit 11.6, artifact installation
NVIDIA driver 511.65.0, for CUDA 11.6
CUDA driver 11.6

Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
  Downloaded artifact: CUDNN
- CUDNN: 8.30.2 (for CUDA 11.5.0)
  Downloaded artifact: CUTENSOR
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce RTX 2070 (sm_75, 7.843 GiB / 8.000 GiB available)

@maleadt
Member

maleadt commented Feb 10, 2022

Ah yes, I'm building on a fairly recent Linux distro. Can you build it yourself? The patch you need:

diff --git a/src/julia_threads.h b/src/julia_threads.h
index 5727083212..9832fa9ac4 100644
--- a/src/julia_threads.h
+++ b/src/julia_threads.h
@@ -45,10 +45,10 @@ typedef win32_ucontext_t jl_ucontext_t;
 #endif
 #if 0
 // very slow, but more debugging
-//#elif defined(_OS_DARWIN_)
-//#define JL_HAVE_UNW_CONTEXT
-//#elif defined(_OS_LINUX_)
-//#define JL_HAVE_UNW_CONTEXT
+#elif defined(_OS_DARWIN_)
+#define JL_HAVE_UNW_CONTEXT
+#elif defined(_OS_LINUX_)
+#define JL_HAVE_UNW_CONTEXT
 #elif defined(_OS_EMSCRIPTEN_)
 #define JL_HAVE_ASYNCIFY
 #elif !defined(JL_HAVE_ASM)

If not, I can have the Julia buildbots generate a build instead.

@luraess

luraess commented Feb 10, 2022

@maleadt I will give it a try now as well since it turns out that 1.7.0, 1.7.1 and 1.7.2 hang.

Using your debug Julia build, I also hit the missing GLIBC_2.32. I'll try to install it locally on my system, which has GLIBC_2.31, and see how far I can get.

@luraess

luraess commented Feb 10, 2022

@maleadt getting:

LD_LIBRARY_PATH=/home/luraess/scratch/julia_tmp/glibc/glibc-2.32-install/lib ./julia
Segmentation fault (core dumped)

when running your Julia build using GLIBC_2.32 installed in my tmp.

@anj00
Author

anj00 commented Feb 11, 2022

Ah yes, I'm building on a fairly recent Linux distro. Can you build it yourself?
....
If not, I can have the Julia buildbots generate a build instead.

Could you please make a build for Ubuntu 20.04.3? Who knows how long it would take me to get the whole Julia build chain running.

@maleadt
Member

maleadt commented Feb 11, 2022

@anj00
Author

anj00 commented Feb 11, 2022

Thanks! Here is what I am getting; hopefully I followed the instructions correctly.

Attaching to process 5038
[New LWP 5039]
[New LWP 5048]
[New LWP 5049]
[New LWP 5050]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f52535715ce in epoll_wait (epfd=3, events=0x7f523a5978c0, maxevents=1024, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f525322db80 (LWP 5038))]
#0  0x00007f52535715ce in epoll_wait (epfd=3, events=0x7f523a5978c0, maxevents=1024, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      in ../sysdeps/unix/sysv/linux/epoll_wait.c
(gdb) print jl_live_tasks()->length
$1 = 3


(gdb) call jlbacktracet(jl_arrayref(jl_live_tasks(), 0))
jl_start_fiber_swap at /buildworker/worker/package_linux64/build/src/task.c:1064 [inlined]
ctx_switch at /buildworker/worker/package_linux64/build/src/task.c:465
jl_switch at /buildworker/worker/package_linux64/build/src/task.c:502
try_yieldto at ./task.jl:767
wait at ./task.jl:837
wait at ./condition.jl:123
wait at ./lock.jl:366
nonblocking_synchronize at /home/wls_user/.julia/packages/CUDA/bki2w/lib/cudadrv/stream.jl:162 [inlined]
#synchronize#12 at /home/wls_user/.julia/packages/CUDA/bki2w/lib/cudadrv/stream.jl:128
synchronize##kw at /home/wls_user/.julia/packages/CUDA/bki2w/lib/cudadrv/stream.jl:122 [inlined]
synchronize##kw at /home/wls_user/.julia/packages/CUDA/bki2w/lib/cudadrv/stream.jl:122 [inlined]
...
user code pointing to synchronize(blocking = false)
...
unknown function (ip: 0x7f523a0aa851)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
....
user code 
....
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:876
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:830
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
_include at ./loading.jl:1253
include at ./Base.jl:418
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
exec_options at ./client.jl:292
_start at ./client.jl:495
jfptr__start_34903.clone_1 at /_julia_debug_install_/bin/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
true_main at /buildworker/worker/package_linux64/build/src/jlapi.c:559
jl_repl_entrypoint at /buildworker/worker/package_linux64/build/src/jlapi.c:701
main at /buildworker/worker/package_linux64/build/cli/loader_exe.c:42
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at bin/bin/julia (unknown line)


(gdb) call jlbacktracet(jl_arrayref(jl_live_tasks(), 1))
jl_unw_swapcontext at /buildworker/worker/package_linux64/build/src/task.c:958 [inlined]
jl_swap_fiber at /buildworker/worker/package_linux64/build/src/task.c:970
ctx_switch at /buildworker/worker/package_linux64/build/src/task.c:437
jl_switch at /buildworker/worker/package_linux64/build/src/task.c:502
try_yieldto at ./task.jl:767
wait at ./task.jl:837
wait at ./condition.jl:123
#134 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:281 [inlined]
lock at ./lock.jl:190
lock at ./condition.jl:78 [inlined]
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:279 [inlined]
#133 at ./threadingconstructs.jl:178
jfptr_YY.133_50291.clone_1 at //_julia_debug_install_/bin/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877


(gdb) call jlbacktracet(jl_arrayref(jl_live_tasks(), 2))
jl_rec_backtrace at /buildworker/worker/package_linux64/build/src/stackwalk.c:700 [inlined]
jlbacktracet at /buildworker/worker/package_linux64/build/src/stackwalk.c:770
unknown function (ip: 0x7f523a59775e)

@maleadt
Member

maleadt commented Feb 11, 2022

Thanks for the backtraces! They don't reveal anything new, or at least they confirm that the problem is strictly within nonblocking_synchronize (and not a complicated deadlock between different tasks). #1366 made me realize how such a hang can occur though, so could you try #1369?

@anj00
Author

anj00 commented Feb 12, 2022

I have run 4M loops with CUDA.jl from the tb/async_errors branch and it is running OK, so this looks very promising.
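For reference, in case others want to test the same branch, it can be installed straight from the repository (a sketch; the URL is the standard JuliaGPU repository):

using Pkg
Pkg.add(url="https://github.com/JuliaGPU/CUDA.jl", rev="tb/async_errors")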

@maleadt
Member

maleadt commented Feb 14, 2022

With that and #1369 (comment) I hope we can close this. Please re-open if the issue remains.

@maleadt maleadt closed this as completed Feb 14, 2022
@roflmaostc

Thanks!
Right now (not on the master version) it just got stuck again. I still only observe it inside Jupyter notebooks.
I'll report back on whether it is better with master.
