Segmentation fault sampling without instrumentation python app. #220

Closed
sfantao opened this issue Nov 29, 2022 · 4 comments · Fixed by #294

sfantao commented Nov 29, 2022

I get the segmentation fault below with the OpenSUSE build of omnitrace for ROCm 5.3.3; the same code built for ROCm 5.2.3 works well with the corresponding omnitrace release. The workload is a Python (PyTorch) application, run as omnitrace-sample --include rcclp -c $wd/omnitrace.cfg -- python -u ./train.py .... Including RCCLP makes no difference.

I appreciate it is hard to debug these things without the actual code, but at the same time it is not trivial to build. It might be easier to get some guidance on what I should look for to troubleshoot.

HSA_TOOLS_LIB=/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace-dl.so.1.7.3
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace-dl.so.1.7.3
OMNITRACE_CONFIG_FILE=/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/mlperf/omnitrace.cfg
OMNITRACE_CRITICAL_TRACE=false
OMNITRACE_USE_PROCESS_SAMPLING=false
OMNITRACE_USE_RCCLP=true
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace-dl.so.1.7.3
ROCP_HSA_INTERCEPT=1
ROCP_TOOL_LIB=/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so.1.7.3

[omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling

      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 3. Disabling PAPI (requires a value <= 2)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is <= 2
[232.253]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
:::MLLOG {"namespace": "", "time_ms": 1669757580460, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "train.py", "lineno": 481}}
4 Using seed = 248595656
/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/mlperf/miniconda3/envs/mlperf-ssd/lib/python3.7/site-packages/apex/contrib/groupbn/batch_norm.py:199: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:205.)
  my_handle = torch.cuda.ByteTensor(np.frombuffer(internal_cuda_mem[1], dtype=np.uint8))
/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/mlperf/miniconda3/envs/mlperf-ssd/lib/python3.7/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
/pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/mlperf/miniconda3/envs/mlperf-ssd/lib/python3.7/site-packages/apex/contrib/groupbn/batch_norm.py:85: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  shape = int(((nhw + 3) & ~3) * grid_dim_y)
loading annotations into memory...
Done (t=0.11s)
creating index...
index created!
Enable prefetch
terminate called after throwing an instance of 'std::out_of_range'
  what():  array::at: __n (which is 2048) >= _Nm (which is 2048)
terminate called recursively

[omnitrace][81497][2042] Signal 6 caught : Aborted (Signal sent by tkill() 81497 10015212)

### ERROR ### [omnitrace][PID=81497][TID=2042] signal=6 (SIGABRT) abort program (formerly SIGIOT). code: -6
Backtrace:
[PID=81497][TID=2042][0/20] __restore_rt
[PID=81497][TID=2042][3/20] _ZN3tim9component6gotchaILm3ESt5tupleIJEEN9omnitrace9component11exit_gotchaEE12replace_funcILm0EvJEEET0_DpT1_ +0x24
[PID=81497][TID=2042][4/20] _ZN9__gnu_cxx27__verbose_terminate_handlerEv +0x11a
[PID=81497][TID=2042][5/20] _ZSt17rethrow_exceptionNSt15__exception_ptr13exception_ptrE +0x7c
[PID=81497][TID=2042][6/20] __cxa_free_dependent_exception +0x79
[PID=81497][TID=2042][7/20] __gxx_personality_v0 +0x87
[PID=81497][TID=2042][8/20] _Unwind_RaiseException_Phase2 +0x43
[PID=81497][TID=2042][9/20] _Unwind_Resume +0x11e
[PID=81497][TID=2042][10/20] _ZN9omnitrace5debug4lockC2Ev +0x68
[PID=81497][TID=2042][11/20] _ZN9omnitrace9component12_GLOBAL__N_118invoke_exit_gotchaIPFvvEJEEEvRKN3tim9component11gotcha_dataET_DpT0_ +0x6a
[PID=81497][TID=2042][12/20] _ZN3tim9component6gotchaILm3ESt5tupleIJEEN9omnitrace9component11exit_gotchaEE12replace_funcILm0EvJEEET0_DpT1_ +0x4e
[PID=81497][TID=2042][13/20] __cxa_throw_bad_array_new_length +0x558
[PID=81497][TID=2042][14/20] _ZSt17rethrow_exceptionNSt15__exception_ptr13exception_ptrE +0x7c
[PID=81497][TID=2042][15/20] _ZSt9terminatev +0x17
[PID=81497][TID=2042][16/20] __cxa_throw +0x49
[PID=81497][TID=2042][17/20] _ZSt20__throw_out_of_rangePKc +0x6e
[PID=81497][TID=2042][18/20] _ZN9omnitrace12_GLOBAL__N_124get_thread_state_historyEl +0x108e9
[PID=81497][TID=2042][19/20] _ZN9omnitrace17push_thread_stateENS_11ThreadStateE +0x65
[PID=81497][TID=2042][20/20] _ZNK9omnitrace9component21pthread_create_gotcha7wrapperclEv +0x82
[PID=81497][TID=2042][21/20] start_thread +0xdc

Backtrace (demangled):
[PID=81497][TID=2042][0/23] /lib64/libpthread.so.0(+0x168c0) [0x1483ac89e8c0]
[PID=81497][TID=2042][1/23] /lib64/libc.so.6(gsignal+0x10d) [0x1483ac4ddcdb]
[PID=81497][TID=2042][2/23] /lib64/libc.so.6(abort+0x177) [0x1483ac4df375]
[PID=81497][TID=2042][3/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0xb93fa4) [0x1483aa0d9fa4]
[PID=81497][TID=2042][4/23] /usr/lib64/libstdc++.so.6(+0xb28aa) [0x1483ac1328aa]
[PID=81497][TID=2042][5/23] /usr/lib64/libstdc++.so.6(+0xb08dc) [0x1483ac1308dc]
[PID=81497][TID=2042][6/23] /usr/lib64/libstdc++.so.6(+0xaf939) [0x1483ac12f939]
[PID=81497][TID=2042][7/23] /usr/lib64/libstdc++.so.6(__gxx_personality_v0+0x87) [0x1483ac130067]
[PID=81497][TID=2042][8/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0x1ad4d93) [0x1483ab01ad93]
[PID=81497][TID=2042][9/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0x1ad58ce) [0x1483ab01b8ce]
[PID=81497][TID=2042][10/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0x5cb888) [0x1483a9b11888]
[PID=81497][TID=2042][11/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0xb8ebfa) [0x1483aa0d4bfa]
[PID=81497][TID=2042][12/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0xb93fce) [0x1483aa0d9fce]
[PID=81497][TID=2042][13/23] /usr/lib64/libstdc++.so.6(+0xa5016) [0x1483ac125016]
[PID=81497][TID=2042][14/23] /usr/lib64/libstdc++.so.6(+0xb08dc) [0x1483ac1308dc]
[PID=81497][TID=2042][15/23] /usr/lib64/libstdc++.so.6(+0xb0947) [0x1483ac130947]
[PID=81497][TID=2042][16/23] /usr/lib64/libstdc++.so.6(+0xb0be9) [0x1483ac130be9]
[PID=81497][TID=2042][17/23] /usr/lib64/libstdc++.so.6(+0xa7a34) [0x1483ac127a34]
[PID=81497][TID=2042][18/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0x90c979) [0x1483a9e52979]
[PID=81497][TID=2042][19/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0x4900f5) [0x1483a99d60f5]
[PID=81497][TID=2042][20/23] /pfs/lustrep2/projappl/project_462000125/samantao/apps-build-rocm-5.3.3/omnitrace/omnitrace-1.7.3-opensuse-15.3-ROCm-50300-PAPI-OMPT-Python3/lib/libomnitrace.so(+0xed7292) [0x1483aa41d292]
[PID=81497][TID=2042][21/23] /lib64/libpthread.so.0(+0xa6ea) [0x1483ac8926ea]
[PID=81497][TID=2042][22/23] /lib64/libc.so.6(clone+0x3f) [0x1483ac5aaa8f]

Backtrace (demangled):
[PID=81497][TID=2042][0/20] __restore_rt
[PID=81497][TID=2042][3/20] void tim::component::gotcha<3ul, std::tuple<>, omnitrace::component::exit_gotcha>::replace_func<0ul, void>() +0x24
[PID=81497][TID=2042][4/20] __gnu_cxx::__verbose_terminate_handler() +0x11a
[PID=81497][TID=2042][5/20] std::rethrow_exception(std::__exception_ptr::exception_ptr) +0x7c
[PID=81497][TID=2042][6/20] __cxa_free_dependent_exception +0x79
[PID=81497][TID=2042][7/20] __gxx_personality_v0 +0x87
[PID=81497][TID=2042][8/20] _Unwind_RaiseException_Phase2 +0x43
[PID=81497][TID=2042][9/20] _Unwind_Resume +0x11e
[PID=81497][TID=2042][10/20] omnitrace::debug::lock::lock() +0x68
[PID=81497][TID=2042][11/20] void omnitrace::component::(anonymous namespace)::invoke_exit_gotcha<void (*)()>(tim::component::gotcha_data const&, void (*)()) +0x6a
[PID=81497][TID=2042][12/20] void tim::component::gotcha<3ul, std::tuple<>, omnitrace::component::exit_gotcha>::replace_func<0ul, void>() +0x4e
[PID=81497][TID=2042][13/20] __cxa_throw_bad_array_new_length +0x558
[PID=81497][TID=2042][14/20] std::rethrow_exception(std::__exception_ptr::exception_ptr) +0x7c
[PID=81497][TID=2042][15/20] std::terminate() +0x17
[PID=81497][TID=2042][16/20] __cxa_throw +0x49
[PID=81497][TID=2042][17/20] std::__throw_out_of_range(char const*) +0x6e
[PID=81497][TID=2042][18/20] omnitrace::(anonymous namespace)::get_thread_state_history(long) +0x108e9
[PID=81497][TID=2042][19/20] omnitrace::push_thread_state(omnitrace::ThreadState) +0x65
[PID=81497][TID=2042][20/20] omnitrace::component::pthread_create_gotcha::wrapper::operator()() const +0x82
[PID=81497][TID=2042][21/20] start_thread +0xdc

Backtrace (lineinfo):
[PID=81497][TID=2042][0/21]
    [/lib64/libpthread-2.31.so:?] __restore_rt
[PID=81497][TID=2042][1/21]
    [/lib64/libc-2.31.so:?] no unwind info found
[PID=81497][TID=2042][2/21]
    [??:901] void tim::component::gotcha<3ul, std::tuple<>, omnitrace::component::exit_gotcha>::replace_func<0ul, void>()
[PID=81497][TID=2042][3/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] __gnu_cxx::__verbose_terminate_handler()
[PID=81497][TID=2042][4/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] std::rethrow_exception(std::__exception_ptr::exception_ptr)
[PID=81497][TID=2042][5/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] __cxa_free_dependent_exception
[PID=81497][TID=2042][6/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] __gxx_personality_v0
[PID=81497][TID=2042][7/21]
    [??:64] _Unwind_RaiseException_Phase2
[PID=81497][TID=2042][8/21]
    [??:234] _Unwind_Resume
[PID=81497][TID=2042][9/21]
    [??:64] omnitrace::debug::lock::lock()
    [/usr/include/c++/7/bits/std_mutex.h:264] std::unique_lock<std::recursive_mutex>::lock()
[PID=81497][TID=2042][10/21]
    [??:61] invoke_exit_gotcha<void (*)()>
    [??:109] tim::log::color::info()
[PID=81497][TID=2042][11/21]
    [??:916] void tim::component::gotcha<3ul, std::tuple<>, omnitrace::component::exit_gotcha>::replace_func<0ul, void>()
[PID=81497][TID=2042][12/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] __cxa_throw_bad_array_new_length
[PID=81497][TID=2042][13/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] std::rethrow_exception(std::__exception_ptr::exception_ptr)
[PID=81497][TID=2042][14/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] std::terminate()
[PID=81497][TID=2042][15/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] __cxa_throw
[PID=81497][TID=2042][16/21]
    [/usr/lib64/libstdc++.so.6.0.29:?] std::__throw_out_of_range(char const*)
[PID=81497][TID=2042][17/21]
    [??:52] get_thread_state_history
    [/usr/include/c++/7/array:94] omnitrace::(anonymous namespace)::get_thread_state_history(long)
[PID=81497][TID=2042][18/21]
    [??:96] omnitrace::push_thread_state(omnitrace::ThreadState)
[PID=81497][TID=2042][19/21]
    [??:164] omnitrace::component::pthread_create_gotcha::wrapper::operator()() const
    [/usr/include/c++/7/bits/stl_set.h:157] std::set<int, std::less<int>, std::allocator<int>>::set()
    [/usr/include/c++/7/bits/stl_tree.h:913] std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int>>::_Rb_tree()
    [/usr/include/c++/7/bits/stl_tree.h:688] std::_Rb_tree<int, int, std::_Identity<int>, std::less<int>, std::allocator<int>>::_Rb_tree_impl<std::less<int>, true>::_Rb_tree_impl()
    [/usr/include/c++/7/bits/stl_tree.h:176] std::_Rb_tree_header::_Rb_tree_header()
    [/usr/include/c++/7/bits/stl_tree.h:209] std::_Rb_tree_header::_M_reset()
[PID=81497][TID=2042][20/21]
    [/lib64/libpthread-2.31.so:?] start_thread

signal_settings::exit_action(6) threw an exception
array::at: __n (which is 2048) >= _Nm (which is 2048)

[81497]Killing process 81497 with signal 6...

@jrmadsen
Collaborator

How many threads is the application creating? This isn't a segfault. This is omnitrace hitting its limit for the number of active threads (which is set to 2048 for release builds).
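
To make the failure mode concrete, here is a minimal Python sketch of a bounds-checked lookup in a fixed-size table, which is roughly what the `array::at: __n (which is 2048) >= _Nm (which is 2048)` message reports. `FixedThreadTable` and its `at` method are hypothetical names for illustration only; this is not omnitrace's implementation, and the only number taken from the thread is the 2048 limit.

```python
CAPACITY = 2048  # limit quoted in this thread for release builds


class FixedThreadTable:
    """Hypothetical stand-in for a fixed-size, per-thread storage table."""

    def __init__(self, capacity=CAPACITY):
        # One empty history list per slot; the table never grows.
        self._slots = [[] for _ in range(capacity)]

    def at(self, index):
        # Bounds-checked access: valid indices are 0 .. capacity - 1.
        if not 0 <= index < len(self._slots):
            raise IndexError(
                f"index {index} is out of range for capacity {len(self._slots)}"
            )
        return self._slots[index]


table = FixedThreadTable()
table.at(2047).append("running")  # last valid slot

try:
    table.at(2048)  # one past the end, analogous to the failing lookup in the log
except IndexError as exc:
    print(exc)
```

An index equal to the capacity is already out of range, which is why the message shows the same number on both sides of the comparison.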

@sfantao
Author

sfantao commented Nov 30, 2022

This is what gdb is telling me when the exception is raised:

(gdb) bt
#0  0x00001555543ebcdb in raise () from /lib64/libc.so.6
#1  0x00001555543ed375 in abort () from /lib64/libc.so.6
#2  0x0000155551fe7fa4 in tim::component::gotcha<3ul, std::tuple<>, omnitrace::component::exit_gotcha>::replace_func<0ul, void>() () at /home/omnitrace/external/timemory/source/timemory/components/gotcha/components.cpp:899
#3  0x00001555540408aa in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#4  0x000015555403e8dc in ?? () from /usr/lib64/libstdc++.so.6
#5  0x000015555403d939 in ?? () from /usr/lib64/libstdc++.so.6
#6  0x000015555403e067 in __gxx_personality_v0 () from /usr/lib64/libstdc++.so.6
#7  0x0000155552f28d93 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x1543bc043900, context=context@entry=0x154a75374570) at ../../../libgcc/unwind.inc:62
#8  0x0000155552f298ce in _Unwind_Resume (exc=0x1543bc043900) at ../../../libgcc/unwind.inc:230
#9  0x0000155551a1f888 in std::unique_lock<std::recursive_mutex>::~unique_lock () at /usr/include/c++/7/bits/std_mutex.h:231
#10 omnitrace::debug::lock::lock () at /home/omnitrace/source/lib/omnitrace/library/debug.cpp:59
#11 0x0000155551fe2bfa in invoke_exit_gotcha<void (*)()> () at /home/omnitrace/source/lib/omnitrace/library/components/exit_gotcha.cpp:61
#12 0x0000155551fe7fce in omnitrace::component::exit_gotcha::operator() () at /home/omnitrace/source/lib/omnitrace/library/components/exit_gotcha.cpp:100
#13 tim::component::gotcha_invoker<omnitrace::component::exit_gotcha, void, true>::sfinae<tim::component::gotcha_data, void (*&)()> () at /home/omnitrace/external/timemory/source/timemory/components/gotcha/backends.hpp:132
#14 tim::component::gotcha_invoker<omnitrace::component::exit_gotcha, void, true>::invoke_sfinae<tim::component::gotcha_data, void (*&)()> () at /home/omnitrace/external/timemory/source/timemory/components/gotcha/backends.hpp:187
#15 tim::component::gotcha_invoker<omnitrace::component::exit_gotcha, void, true>::operator()<void (*&)()> () at /home/omnitrace/external/timemory/source/timemory/components/gotcha/backends.hpp:118
#16 tim::component::gotcha<3ul, std::tuple<>, omnitrace::component::exit_gotcha>::invoke<tim::component_tuple<omnitrace::component::exit_gotcha>, void>(tim::component::gotcha_data&&, tim::component_tuple<omnitrace::component::exit_gotcha>&, void (*)()) () at /home/omnitrace/external/timemory/source/timemory/components/gotcha/components.hpp:438
#17 tim::component::gotcha<3ul, std::tuple<>, omnitrace::component::exit_gotcha>::replace_func<0ul, void>() () at /home/omnitrace/external/timemory/source/timemory/components/gotcha/components.cpp:906
#18 0x0000155554033016 in ?? () from /usr/lib64/libstdc++.so.6
#19 0x000015555403e8dc in ?? () from /usr/lib64/libstdc++.so.6
#20 0x000015555403e947 in std::terminate() () from /usr/lib64/libstdc++.so.6
#21 0x000015555403ebe9 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#22 0x0000155554035a34 in ?? () from /usr/lib64/libstdc++.so.6
#23 0x0000155551d60979 in std::array<std::vector<omnitrace::ThreadState, std::allocator<omnitrace::ThreadState> >, 2048ul>::at () at /usr/include/c++/7/array:196
#24 get_thread_state_history () at /home/omnitrace/source/lib/omnitrace/library/state.cpp:54
#25 0x00001555518e40f5 in omnitrace::push_thread_state () at /home/omnitrace/source/lib/omnitrace/library/state.cpp:96
#26 0x000015555232b292 in omnitrace::component::pthread_create_gotcha::wrapper::operator() () at /home/omnitrace/source/lib/omnitrace/library/components/pthread_create_gotcha.cpp:158
#27 0x00001555547a06ea in start_thread () from /lib64/libpthread.so.0
#28 0x00001555544b8a8f in clone () from /lib64/libc.so.6

and

(gdb) info th                                                                                              
  Id   Target Id         Frame                                                                             
  1    Thread 6876.6876  0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 2    Thread 6876.25499 0x00001555543ebcdb in raise () from /lib64/libc.so.6                        
  3    Thread 6876.7639  0x00001555544ac1e9 in poll () from /lib64/libc.so.6                         
  4    Thread 6876.7640  0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5    Thread 6876.7641  0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6    Thread 6876.7642  0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  7    Thread 6876.7643  0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8    Thread 6876.7644  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  9    Thread 6876.7672  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  10   Thread 6876.7673  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  11   Thread 6876.7674  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  12   Thread 6876.7675  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  13   Thread 6876.7676  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  14   Thread 6876.7677  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  15   Thread 6876.7678  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  16   Thread 6876.7679  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  17   Thread 6876.7680  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  18   Thread 6876.7681  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  19   Thread 6876.7682  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  20   Thread 6876.7683  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  21   Thread 6876.7684  futex_wait (val=16, addr=0x55555c9eb2d4) at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/config/linux/x86/futex.h:44
  22   Thread 6876.7751  0x00001555544adc47 in ioctl () from /lib64/libc.so.6                        
  23   Thread 6876.7752  0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  24   Thread 6876.7754  0x00001555544ac1e9 in poll () from /lib64/libc.so.6                         
  25   Thread 6876.7755  0x00001555547a7a5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  26   Thread 6876.7770  0x00001555544ac1e9 in poll () from /lib64/libc.so.6                         
  27   Thread 6876.7789  0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  28   Thread 6876.7935  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  29   Thread 6876.7944  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  30   Thread 6876.7953  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  31   Thread 6876.8102  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  32   Thread 6876.8112  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  33   Thread 6876.8261  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  34   Thread 6876.8270  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  35   Thread 6876.8279  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  36   Thread 6876.8569  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  37   Thread 6876.8578  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  38   Thread 6876.8727  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  39   Thread 6876.8736  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  40   Thread 6876.8745  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  41   Thread 6876.8894  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  42   Thread 6876.8903  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  43   Thread 6876.9052  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  44   Thread 6876.9061  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  45   Thread 6876.9070  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  46   Thread 6876.9079  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  47   Thread 6876.9088  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  48   Thread 6876.9379  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  49   Thread 6876.9388  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  50   Thread 6876.9397  0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  51   Thread 6876.10048 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  52   Thread 6876.10057 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  53   Thread 6876.10090 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  54   Thread 6876.10099 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  55   Thread 6876.10116 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  56   Thread 6876.10153 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  57   Thread 6876.10190 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  58   Thread 6876.10375 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  59   Thread 6876.10404 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  60   Thread 6876.10434 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  61   Thread 6876.11545 0x00001555544adc47 in ioctl () from /lib64/libc.so.6                                                                                                                                         
  62   Thread 6876.11886 0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0                                                                                                          
  63   Thread 6876.11887 0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0                                                                                                          
  64   Thread 6876.11888 0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0                                                                                                          
  65   Thread 6876.11889 0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0                                                                                                          
  66   Thread 6876.11890 0x00001555544b1ec9 in syscall () from /lib64/libc.so.6                                                                                                                                       
  67   Thread 6876.11891 0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0                                                                                                          
  68   Thread 6876.11892 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0                                                                                                                 
  69   Thread 6876.11893 0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  70   Thread 6876.11894 0x00001555547a770c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  ...
  287  Thread 6876.25462 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  288  Thread 6876.25471 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  289  Thread 6876.25480 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  290  Thread 6876.25489 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  291  Thread 6876.25498 0x00001555547aa5f4 in do_futex_wait.constprop () from /lib64/libpthread.so.0

So there are 291 threads active at that point. This is Python code using GPUs, and Python is known for forking many processes. What other info should I try to capture?
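
For reference, the live thread count can also be read without attaching gdb. The sketch below assumes a Linux system, where `/proc/<pid>/task` holds one numeric subdirectory per live thread. `count_threads` and its `proc_root` parameter are hypothetical names introduced here. It gives a snapshot of live threads only and says nothing about threads that have already exited.

```python
import os


def count_threads(pid, proc_root="/proc"):
    """Return the number of live threads of `pid` (Linux procfs layout assumed)."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    # Each live thread appears as a subdirectory named after its thread ID.
    return sum(1 for name in os.listdir(task_dir) if name.isdigit())


if __name__ == "__main__" and os.path.isdir("/proc/self/task"):
    # Example: count the threads of the current Python process.
    print(count_threads(os.getpid()))
```

Calling it periodically on the PID of the profiled process would show how the live count evolves during the run.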

@sfantao
Author

sfantao commented Nov 30, 2022

If I run strace with omnitrace I count 864 occurrences of clone(.
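
Since `clone` is used both for new threads and for forked processes, it may help to split that count. A rough sketch, assuming each call appears on a single line of strace output with its flags decoded: thread creation carries the `CLONE_THREAD` flag, while a fork does not. `summarize_clone_calls` is a hypothetical helper and the sample lines are made up for illustration, not taken from the actual run.

```python
import re

# Matches the start of a clone() or clone3() call in strace output.
CLONE_CALL = re.compile(r"\bclone3?\(")


def summarize_clone_calls(lines):
    """Count clone()/clone3() calls, split by whether CLONE_THREAD is set."""
    counts = {"thread": 0, "other": 0}
    for line in lines:
        if not CLONE_CALL.search(line):
            continue
        kind = "thread" if "CLONE_THREAD" in line else "other"
        counts[kind] += 1
    return counts


# Made-up sample lines in the general shape of strace output.
sample = [
    "clone(child_stack=0x7f0000000000, flags=CLONE_VM|CLONE_FS|CLONE_FILES"
    "|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM, parent_tid=[7639]) = 7639",
    "clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID"
    "|SIGCHLD, child_tidptr=0x7f0000000a10) = 7700",
    'openat(AT_FDCWD, "/dev/kfd", O_RDWR|O_CLOEXEC) = 3',
]

print(summarize_clone_calls(sample))
```

To run it on a real log, pass an open file object instead of `sample`, for example `summarize_clone_calls(open("strace.log"))`, where `strace.log` is a placeholder file name.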

@jrmadsen
Collaborator

By the way, I have made some improvements to handling forks in #250, and I have a solution that allows omnitrace to support any number of threads.
