Segmentation faults on Fedora Rawhide with ucx 1.5.1 and openmpi 4.0.1 #3558

Closed
opoplawski opened this issue May 6, 2019 · 2 comments

Since updating to Open MPI 4.0.1, we're seeing a number of MPI tests segfault. One example from Scalasca, run under gdb:

Starting program: /home/orion/fedora/scalasca/scalasca-2.5/openmpi/build-mpi/.libs/pearl_ipc_Test.compute_mpi 
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29.9000-18.fc31.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 27392]
[New Thread 0x7ffff73b4700 (LWP 27397)]
[New Thread 0x7ffff69f0700 (LWP 27398)]

Thread 1 "pearl_ipc_Test." received signal SIGSEGV, Segmentation fault.
ucs_list_insert_replace (elem=0x7ffff58bd420 <entry>, next=<optimized out>, prev=0x7ffff5800940)
    at /usr/src/debug/ucx-1.5.1-1.fc31.x86_64/src/ucs/datastruct/list.h:48
48          prev->next = elem;
Missing separate debuginfos, use: dnf debuginfo-install hwloc-libs-1.11.12-2.fc30.x86_64 infinipath-psm-3.3-26_g604758e_open.fc30.1.x86_64 libevent-2.1.8-5.fc30.x86_64 libfabric-1.7.1-1.fc31.x86_64 libgcc-9.1.1-1.fc31.x86_64 libibumad-20.1-3.fc30.x86_64 libibverbs-20.1-3.fc30.x86_64 libnl3-3.4.0-8.fc30.x86_64 libpsm2-11.2.78-2.fc30.x86_64 librdmacm-20.1-3.fc30.x86_64 libstdc++-9.1.1-1.fc31.x86_64 libtool-ltdl-2.4.6-30.fc31.x86_64 libuuid-2.34-0.1.fc31.x86_64 munge-libs-0.5.13-3.fc30.x86_64 numactl-libs-2.0.12-2.fc30.x86_64 opensm-libs-3.3.22-1.fc31.x86_64 openssl-libs-1.1.1b-6.fc31.x86_64 pmix-3.0.2-2.fc30.x86_64 zlib-1.2.11-15.fc30.x86_64
(gdb) print *prev
$1 = {prev = 0x1988f8bfa1e0ff3, next = 0x1a8878b0000}
(gdb) print *prev->next
Cannot access memory at address 0x1a8878b0000
(gdb) bt
#0  ucs_list_insert_replace (elem=0x7ffff58bd420 <entry>, next=<optimized out>, prev=0x7ffff5800940)
    at /usr/src/debug/ucx-1.5.1-1.fc31.x86_64/src/ucs/datastruct/list.h:48
#1  ucs_list_insert_before (new_link=0x7ffff58bd420 <entry>, pos=<optimized out>)
    at /usr/src/debug/ucx-1.5.1-1.fc31.x86_64/src/ucs/datastruct/list.h:73
#2  ucs_initializer0 () at sm/self/self.c:315
#3  0x00007ffff7fe1e4a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7fe1f51 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7fe5eae in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7a4a3e9 in _dl_catch_exception () from /lib64/libc.so.6
#7  0x00007ffff7fe572e in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff779d39c in dlopen_doit () from /lib64/libdl.so.2
#9  0x00007ffff7a4a3e9 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7a4a483 in _dl_catch_error () from /lib64/libc.so.6
#11 0x00007ffff779daf9 in _dlerror_run () from /lib64/libdl.so.2
#12 0x00007ffff779d42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#13 0x00007ffff7810ad7 in do_dlopen (err_msg=0x7fffffffd8a8, handle=<synthetic pointer>, flags=257, fname=<optimized out>)
    at dl_dlopen_module.c:94
#14 dlopen_open (fname=0x4c6dd0 "/usr/lib64/openmpi/lib/openmpi/mca_pml_ucx", use_ext=<optimized out>, 
    private_namespace=<optimized out>, handle=0x4c6ec8, err_msg=0x7fffffffd8a8) at dl_dlopen_module.c:94
#15 0x00007ffff77ee524 in mca_base_component_repository_open (framework=framework@entry=0x7ffff7f85860 <ompi_pml_base_framework>, 
    ri=ri@entry=0x4c6e30) at mca_base_component_repository.c:416
#16 0x00007ffff77ed4eb in find_dyn_components (include_mode=<optimized out>, names=0x0, 
    framework=0x7ffff7f85860 <ompi_pml_base_framework>, path=0x0) at mca_base_component_find.c:264
#17 mca_base_component_find (directory=directory@entry=0x0, framework=framework@entry=0x7ffff7f85860 <ompi_pml_base_framework>, 
    ignore_requested=ignore_requested@entry=false, open_dso_components=open_dso_components@entry=true) at mca_base_component_find.c:135
#18 0x00007ffff77f8dfe in mca_base_framework_components_register (framework=framework@entry=0x7ffff7f85860 <ompi_pml_base_framework>, 
    flags=flags@entry=MCA_BASE_REGISTER_DEFAULT) at mca_base_components_register.c:55
#19 0x00007ffff77f92e6 in mca_base_framework_register (flags=MCA_BASE_REGISTER_DEFAULT, 
    framework=0x7ffff7f85860 <ompi_pml_base_framework>) at mca_base_framework.c:129
#20 mca_base_framework_register (framework=0x7ffff7f85860 <ompi_pml_base_framework>, flags=<optimized out>) at mca_base_framework.c:55
#21 0x00007ffff77f9344 in mca_base_framework_open (framework=0x7ffff7f85860 <ompi_pml_base_framework>, 
    flags=flags@entry=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:148
#22 0x00007ffff7ebc795 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=0, 
    provided=provided@entry=0x7fffffffdb64, reinit_ok=reinit_ok@entry=false) at runtime/ompi_mpi_init.c:617
#23 0x00007ffff7eeca53 in PMPI_Init (argc=argc@entry=0x7fffffffdbbc, argv=argv@entry=0x7fffffffdbb0) at pinit.c:67
#24 0x0000000000410dcf in main (argc=<optimized out>, argv=<optimized out>) at ../../build-mpi/../vendor/gtest/src/ext-main_mpi.cpp:24
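
For readers unfamiliar with intrusive lists, the faulting statement `prev->next = elem` is the usual splice step of a circular doubly-linked list insert. The gdb output above shows `*prev` holding values that do not look like valid pointers, which suggests `prev` was not pointing at a live list node when the insert ran. The following is a minimal sketch of that pattern, not the UCX source: every name in it (`link_t`, `list_insert_between`, `list_add_tail`, `list_head_looks_sane`) is hypothetical, and it makes no claim about the root cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical intrusive list link; names are illustrative, not UCX's. */
typedef struct link {
    struct link *prev;
    struct link *next;
} link_t;

/* An empty circular list is a head that points at itself. */
static void list_head_init(link_t *head)
{
    head->prev = head;
    head->next = head;
}

/* Splice elem between two adjacent nodes. The stores through prev and next
 * are only safe if both point at live, writable list nodes. */
static void list_insert_between(link_t *prev, link_t *next, link_t *elem)
{
    assert(prev != NULL && next != NULL && elem != NULL);
    elem->prev = prev;
    elem->next = next;
    prev->next = elem; /* same kind of store as the faulting line in the trace */
    next->prev = elem;
}

/* Append at the tail: insert just before the head. */
static void list_add_tail(link_t *head, link_t *elem)
{
    list_insert_between(head->prev, head, elem);
}

/* Cheap sanity check on the head's immediate neighbours only. */
static bool list_head_looks_sane(const link_t *head)
{
    return head->prev != NULL && head->next != NULL &&
           head->prev->next == head && head->next->prev == head;
}
```

In this sketch, `list_add_tail` trusts `head->prev` completely. If the head was never initialised, or its `prev` refers to memory that is no longer a valid node, the store through `prev` is undefined behaviour and can fault exactly as in frame #0.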

junghans commented May 7, 2019

We see the same crash when running the VOTCA pipeline: https://gitlab.com/votca/votca/-/jobs/208439789

/usr/lib64/openmpi/bin/mdrun_openmpi_d -s topol.tpr -c confout.gro -o traj.trr -x traj.xtc -multidir sim1 sim2 -nsteps 500 -v'
[runner-hLZEhTcM-project-10573529-concurrent-0:15081:0:15081] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f0b01637948)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x194a3) [0x7f0b015bf4a3]
    1  /lib64/libucs.so.0(+0x1965a) [0x7f0b015bf65a]
    2  /lib64/libuct.so.0(+0x1b72b) [0x7f0b018a572b]
    3  /lib64/ld-linux-x86-64.so.2(+0xfe4a) [0x7f0b04287e4a]
    4  /lib64/ld-linux-x86-64.so.2(+0xff51) [0x7f0b04287f51]
    5  /lib64/ld-linux-x86-64.so.2(+0x13eae) [0x7f0b0428beae]
    6  /lib64/libc.so.6(_dl_catch_exception+0x79) [0x7f0b03a5e3e9]
    7  /lib64/ld-linux-x86-64.so.2(+0x1372e) [0x7f0b0428b72e]
    8  /lib64/libdl.so.2(+0x239c) [0x7f0b037af39c]
    9  /lib64/libc.so.6(_dl_catch_exception+0x79) [0x7f0b03a5e3e9]
   10  /lib64/libc.so.6(_dl_catch_error+0x33) [0x7f0b03a5e483]
   11  /lib64/libdl.so.2(+0x2af9) [0x7f0b037afaf9]
   12  /lib64/libdl.so.2(dlopen+0x4a) [0x7f0b037af42a]
   13  /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6ead7) [0x7f0b03824ad7]
   14  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_repository_open+0x1f4) [0x7f0b03802524]
   15  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35b) [0x7f0b038014eb]
   16  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7f0b0380cdfe]
   17  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x256) [0x7f0b0380d2e6]
   18  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x14) [0x7f0b0380d344]
   19  /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x695) [0x7f0b0419a795]
   20  /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Init_thread+0x5b) [0x7f0b041cabbb]
   21  /usr/lib64/openmpi/bin/mdrun_openmpi_d(+0x14efa5) [0x55f447172fa5]
   22  /usr/lib64/openmpi/bin/mdrun_openmpi_d(+0xcc1dd) [0x55f4470f01dd]
   23  /usr/lib64/openmpi/bin/mdrun_openmpi_d(+0x6cff2) [0x55f447090ff2]
   24  /usr/lib64/openmpi/bin/mdrun_openmpi_d(+0x6d376) [0x55f447091376]
   25  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f0b0394bf73]
   26  /usr/lib64/openmpi/bin/mdrun_openmpi_d(+0x6835e) [0x55f44708c35e]
===================

The same test works on a variety of other platforms.
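
Both traces reach UCX from `dlopen()` through the dynamic loader's init path (`call_init` and `_dl_init` in the gdb trace, the matching `ld-linux-x86-64.so.2` offsets here), so the faulting code runs as a load-time initializer, before `MPI_Init` returns. A minimal sketch of that pattern follows, assuming the GCC/Clang `__attribute__((constructor))` extension. The registry and all names in it are hypothetical and far simpler than whatever UCX actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component record; not a UCX type. */
typedef struct component {
    const char       *name;
    struct component *next;
} component_t;

/* Registry head. Static storage is zero-initialised before any constructor
 * runs, so NULL reliably means "empty". */
static component_t *registry = NULL;

static component_t self_component = { "self", NULL };

/* GCC/Clang extension: runs when the containing object is loaded, i.e.
 * before main() for an executable or inside dlopen() for a shared object. */
__attribute__((constructor))
static void register_self_component(void)
{
    assert(self_component.next == NULL); /* guard against double registration */
    self_component.next = registry;
    registry = &self_component;
}

/* Count registered components by walking the list. */
static size_t registry_count(void)
{
    size_t n = 0;
    for (const component_t *c = registry; c != NULL; c = c->next) {
        n++;
    }
    return n;
}
```

Because such initializers run as a side effect of loading, a failure inside one surfaces in whichever call triggered the load, which is why the application frames above stop at `PMPI_Init` and `PMPI_Init_thread`.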

yosefe (Contributor) commented Jun 8, 2019

Fixed in v1.5.x (#3656), v1.6.x (#3671) and master (#3658)

yosefe closed this as completed Jun 8, 2019