Seg fault in Cholesky #181

Closed
haimav opened this issue Sep 13, 2016 · 30 comments

Comments

@haimav
Contributor

haimav commented Sep 13, 2016

In my larger code, the following snippet triggers a segfault in Elemental.

    El::Identity(C, 100, 100);
    El::Cholesky(El::LOWER, C);

(The original code was more complex, but even this snippet triggers the segfault.)

Here is a stack trace:

Program received signal SIGSEGV, Segmentation fault.
0x00007fff8af9a2da in stack_not_16_byte_aligned_error () from /usr/lib/system/libdyld.dylib
(gdb) where
#0 0x00007fff8af9a2da in stack_not_16_byte_aligned_error () from /usr/lib/system/libdyld.dylib
#1 0x00007fff5fbfcb80 in ?? ()
#2 0x00000001028d2398 in ?? () from /usr/local/lib/libEl.dylib
#3 0x0000000000137ad6 in ?? ()
#4 0x00000001012cbf6e in El::Matrix::operator()(El::Range, El::Range) () from /usr/local/lib/libEl.dylib
#5 0x000000010152ecfc in void El::cholesky::UVar3(El::Matrix&) () from /usr/local/lib/libEl.dylib
#6 0x0000000101537f15 in void El::cholesky::UVar3(El::AbstractDistMatrix&) () from /usr/local/lib/libEl.dylib
#7 0x00000001000a01b6 in skylark::ml::feature_map_precond_t<El::DistMatrix<double, (El::DistNS::Dist)0, (El::DistNS::Dist)2, (El::DistWrapNS::DistWrap)0> >::feature_map_precond_t<skylark::ml::kernel_container_t, El::DistMatrix<double, (El::DistNS::Dist)0, (El::DistNS::Dist)2, (El::DistWrapNS::DistWrap)0> > (this=0x105757490, k=..., lambda=, X=..., s=, context=..., params=...) at /Users/haimav/Coding/libskylark/ml/krr.hpp:385
#8 0x00000001000a44c1 in skylark::ml::FasterKernelRidge<double, skylark::ml::kernel_container_t> (direction=, k=..., X=..., Y=..., lambda=0.01, A=..., s=50, context=..., params=...) at /Users/haimav/Coding/libskylark/ml/krr.hpp:501
#9 0x000000010012a280 in skylark::ml::FasterKernelRLSC<double, int, skylark::ml::kernel_container_t> (direction=COLUMNS, k=..., X=..., L=..., lambda=0.01, A=..., rcoding=..., s=50, context=..., params=...) at /Users/haimav/Coding/libskylark/ml/rlsc.hpp:244

#10 0x000000010012f243 in execute_classification (context=...) at /Users/haimav/Coding/libskylark/examples/kernel_regression.cpp:401
#11 0x00000001001309b9 in main (argc=9, argv=0x7fff5fbff978) at /Users/haimav/Coding/libskylark/examples/kernel_regression.cpp:882

@poulson
Member

poulson commented Sep 13, 2016

None of my current builds seem to show any of these symptoms (with tests/lapack_like/Cholesky passing all tests with --uplo L and --uplo U and various numbers of MPI processes). Would you mind providing a bit more information?

@haimav
Contributor Author

haimav commented Sep 13, 2016

Just try a program that contains only

El::Initialize(argc, argv);

El::Matrix<double> C;
El::Identity(C, 1000, 1000);
El::Cholesky(El::LOWER, C);

and you get the seg fault, at least on my mac.
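For context, the input in this reproducer is numerically trivial: the Cholesky factor of an identity matrix is the identity itself, so the factorization cannot fail on the data. The following is a minimal, dependency-free sketch of an unblocked lower Cholesky that illustrates this. It is not Elemental's implementation, and the helper names `cholesky_lower` and `make_identity` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Elemental): build an n x n identity
// matrix stored in row-major order.
inline std::vector<double> make_identity(std::size_t n) {
    std::vector<double> a(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) a[i * n + i] = 1.0;
    return a;
}

// Hypothetical helper (not part of Elemental): overwrite the n x n row-major
// matrix `a` with its lower Cholesky factor L, where A = L * L^T.
// Only the lower triangle of `a` is read; the strict upper triangle is zeroed.
inline void cholesky_lower(std::vector<double>& a, std::size_t n) {
    if (a.size() != n * n) throw std::invalid_argument("matrix must be n x n");
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: L_jj = sqrt(A_jj - sum_{k<j} L_jk^2)
        double d = a[j * n + j];
        for (std::size_t k = 0; k < j; ++k) d -= a[j * n + k] * a[j * n + k];
        if (d <= 0.0) throw std::domain_error("matrix is not positive definite");
        const double ljj = std::sqrt(d);
        a[j * n + j] = ljj;
        // Entries below the diagonal: L_ij = (A_ij - sum_{k<j} L_ik L_jk) / L_jj
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i * n + j];
            for (std::size_t k = 0; k < j; ++k) s -= a[i * n + k] * a[j * n + k];
            a[i * n + j] = s / ljj;
        }
        // Zero the strict upper triangle of row j so the result is exactly L.
        for (std::size_t c = j + 1; c < n; ++c) a[j * n + c] = 0.0;
    }
}
```

Calling `cholesky_lower` on `make_identity(100)` leaves the matrix unchanged, which is consistent with the crash coming from somewhere other than the arithmetic.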

@rhl-
Member

rhl- commented Sep 13, 2016

Is this on master or a release?


@jeffhammond
Member

stack_not_16_byte_aligned_error looks like some kind of Mac toolchain issue.


@poulson
Member

poulson commented Sep 14, 2016

I unfortunately don't have a personal Mac to test this on, but extensive tests on my Linux box have not turned anything up.

@poulson
Member

poulson commented Sep 17, 2016

Has anyone seen this behavior on any other system? Many users rely on El::Cholesky regularly and none of my tests have turned anything up, which suggests that this is indeed a toolchain mismatch or issue, as @jeffhammond proposed.

@haimav
Contributor Author

haimav commented Sep 17, 2016

Maybe it is, but I don't have another Mac to test it on. Also, there is no reason why the toolchain would suddenly become mismatched; it was working fine until recently.

@poulson
Member

poulson commented Sep 17, 2016

For what it's worth, Cholesky has not been modified in quite some time.

@poulson
Member

poulson commented Sep 19, 2016

@haimav Am I understanding the discussion in xdata-skylark/libskylark#36 properly that the issue was in the implementation of Skylark's Gram and not Elemental's Cholesky?

@haimav
Contributor Author

haimav commented Sep 19, 2016

@poulson No, that was a separate issue.

@poulson
Member

poulson commented Sep 19, 2016

Thanks; could you share which compiler (and version) and MPI implementation (and version) were being used? If necessary I will buy a Mac to debug this.

@haimav
Contributor Author

haimav commented Sep 19, 2016

I have just updated gcc and I am recompiling Elemental. Will update you if it works...

@poulson
Member

poulson commented Sep 20, 2016

I bought a MacBook yesterday, compiled GCC 6.2.0 from scratch, built MPICH 3.2 on top of it, and built Elemental HEAD on top of those in Debug mode; tests/lapack_like/Cholesky passes every test I can throw at it. What version of GCC and/or MPICH/OpenMPI are your errors occurring with?

@haimav
Contributor Author

haimav commented Sep 20, 2016

I am actually using Homebrew to install gcc and mpi, and used it to compile gcc 5.2.0.

@poulson
Member

poulson commented Sep 20, 2016

Thanks; I can look into that toolchain tonight. My guess is that this is a homebrew compatibility issue. It would be good to verify that the same version of GCC and MPI implementation (is it MPICH or OpenMPI?) was used for each component.

@jeffhammond
Member

jeffhammond commented Sep 20, 2016

I homebrewed GCC 6.2.0 yesterday and can test that myself.

I wish I had the kind of money that let me buy a new computer just to debug GitHub issues 😄

@jeffhammond
Member

I too am unable to reproduce.

I saw the following issue:

/var/folders/tz/3sxkhvt90632mzr6fxm1cd0h0000gp/T//ccRiuRgy.s:235666:11: warning: section "__const_coal" is deprecated
        .section __DATA,__const_coal,coalesced
                 ^      ~~~~~~~~~~~~
/var/folders/tz/3sxkhvt90632mzr6fxm1cd0h0000gp/T//ccRiuRgy.s:235666:11: note: change section name to "__const"
        .section __DATA,__const_coal,coalesced
                 ^      ~~~~~~~~~~~~

This was solved exactly as described on http://stackoverflow.com/questions/39502921/warning-section-const-coal-is-deprecated-error-after-updating-xcode-to-la.

The test I ran successfully was:

    #include "El.hpp"
    int main(int argc, char* argv[]){
        El::Initialize(argc, argv);
        El::Matrix<double> C;
        El::Identity(C, 1000, 1000);
        El::Cholesky(El::LOWER, C);
    }

    /opt/mpich/dev/gcc/default/bin/mpicxx \
    -I$HOME/Work/Elemental/git/install-gcc/include \
    -std=c++11 toy.cc \
    -L$HOME/Work/Elemental/git/install-gcc/lib -lEl \
    -Wl,-rpath -Wl,$HOME/Work/Elemental/git/install-gcc/lib

I am running OS X 10.11.6 with the following GCC and MPICH.

$ g++-6 -v
Using built-in specs.
COLLECT_GCC=g++-6
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/6.2.0/libexec/gcc/x86_64-apple-darwin15.6.0/6.2.0/lto-wrapper
Target: x86_64-apple-darwin15.6.0
Configured with: ../configure --build=x86_64-apple-darwin15.6.0 --prefix=/usr/local/Cellar/gcc/6.2.0 --libdir=/usr/local/Cellar/gcc/6.2.0/lib/gcc/6 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-6 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-libstdcxx-time=yes --enable-stage1-checking --enable-checking=release --enable-lto --with-build-config=bootstrap-debug --disable-werror --with-pkgversion='Homebrew gcc 6.2.0 --without-multilib' --with-bugurl=https://github.com/Homebrew/homebrew/issues --enable-plugin --disable-nls --disable-multilib
Thread model: posix
gcc version 6.2.0 (Homebrew gcc 6.2.0 --without-multilib) 
$ /opt/mpich/dev/gcc/default/bin/mpichversion 
MPICH Version:      3.3a1
MPICH Release date: unreleased development copy
MPICH Device:       ch3:nemesis
MPICH configure:    CC=gcc-6 CXX=g++-6 FC=gfortran-6 F77=gfortran-6 --enable-cxx --enable-fortran --enable-threads=runtime --enable-g=dbg --with-pm=hydra --prefix=/opt/mpich/dev/gcc/default --enable-wrapper-rpath --disable-static --enable-shared
MPICH CC:   gcc-6    -g -O2
MPICH CXX:  g++-6   -g
MPICH F77:  gfortran-6   -g
MPICH FC:   gfortran-6   -g
MPICH Custom Information: 

@poulson
Member

poulson commented Sep 21, 2016

Thanks for looking into this, Jeff. I'm compiling with Homebrew's GCC 6 right now (with MPICH 3.2 manually built on top of said compiler) and hope to contribute another data point.

EDIT: All tests pass for me.

@rhl-
Member

rhl- commented Oct 13, 2016

Since we can't replicate it, I'm going to close it. Let's reopen if something changes.

@rhl- rhl- closed this as completed Oct 13, 2016
@poulson
Member

poulson commented Oct 31, 2016

I have been running into similar segfaults on Mac OS X El Capitan with Release builds (but not Debug builds). I believe that the following discussion might be relevant:
https://trac.macports.org/ticket/44596#comment:35

@poulson poulson reopened this Oct 31, 2016
@poulson
Member

poulson commented Oct 31, 2016

To nail this down a bit further: I only observe the alignment errors with Homebrew's GCC when building in Release mode. The issue seems to disappear when compiling with LLVM, so my current hypothesis is that this is caused by a faulty GCC toolchain (similar to the one discussed in the MacPorts ticket).

@poulson
Member

poulson commented Nov 13, 2016

I believe that this issue is due to a bug in GCC not always forcing the stack to be aligned to 16-byte boundaries on OS X when compiling with -O3. A similar issue was reported about 8 years ago: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271

The following output from lldb shows that the faulting instruction is a movdqa that stores an SSE register (%xmm0) to the top of the stack (%rsp), which is not 16-byte aligned at that point:

Process 86524 stopped
* thread #1: tid = 0x23d403, 0x00007fffb178c506 libdyld.dylib`stack_not_16_byte_aligned_error, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00007fffb178c506 libdyld.dylib`stack_not_16_byte_aligned_error
libdyld.dylib`stack_not_16_byte_aligned_error:
->  0x7fffb178c506 <+0>: movdqa %xmm0, (%rsp)
    0x7fffb178c50b <+5>: int3   

libdyld.dylib`_dyld_func_lookup:
    0x7fffb178c50c <+0>: pushq  %rbp
    0x7fffb178c50d <+1>: movq   %rsp, %rbp
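
To make the alignment requirement concrete: `movdqa` requires its memory operand to sit on a 16-byte boundary, i.e. the address must have its low four bits clear. The sketch below shows how such a check can be written in portable C++. The helper names are hypothetical and are not taken from dyld, GCC, or Elemental.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` is a multiple of `alignment` bytes.
// `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: round an address down to the nearest 16-byte boundary.
inline std::uintptr_t align_down_16(std::uintptr_t addr) {
    return addr & ~static_cast<std::uintptr_t>(15);
}

// A local buffer that the compiler is explicitly asked to place on a
// 16-byte boundary; the check should hold regardless of optimization level.
inline bool aligned_local_is_16_byte_aligned() {
    alignas(16) unsigned char buffer[32] = {0};
    return is_aligned(buffer, 16);
}
```

An address ending in `0x...8` fails `is_aligned(p, 16)`, which is the situation the aligned SSE store above cannot tolerate.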

poulson added a commit that referenced this issue Nov 14, 2016
…typo that leads to Debug builds not properly compiling
@poulson poulson closed this as completed Nov 19, 2016
@haimav
Contributor Author

haimav commented Nov 19, 2016

So, on OS X we should compile with -O2?

@poulson
Member

poulson commented Nov 19, 2016

Due to what is almost certainly a GCC optimization bug on OS X, the better recommendation would be to compile with Clang. But -O2 would also work (if you're okay with the performance hit).

EDIT: If you choose to go with a Release build with GCC and an -O2 optimization level, you will need to add the extra "I know what I'm doing" CMake flag detailed in 6fd612e
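
As a rough illustration of how a program could report whether it was built with the combination discussed here (GCC rather than Clang, on an Apple platform, with optimization enabled), here is a small sketch based on standard predefined macros. The `Toolchain` struct and function names are hypothetical, and the check is only a heuristic: `__OPTIMIZE__` cannot distinguish -O2 from -O3, and it is not the CMake flag mentioned above.

```cpp
#include <string>

// Hypothetical description of the compiler configuration of this build.
struct Toolchain {
    bool gcc;        // a non-Clang compiler defining __GNUC__ (usually GNU GCC)
    bool clang;      // Clang (which also defines __GNUC__, hence the ordering below)
    bool apple;      // building for an Apple platform
    bool optimized;  // __OPTIMIZE__ is defined whenever optimization is enabled
};

inline Toolchain current_toolchain() {
    Toolchain t = {false, false, false, false};
#if defined(__clang__)
    t.clang = true;
#elif defined(__GNUC__)
    t.gcc = true;
#endif
#if defined(__APPLE__)
    t.apple = true;
#endif
#if defined(__OPTIMIZE__)
    t.optimized = true;
#endif
    return t;
}

// Heuristic only: true for the GCC + Apple + optimized combination
// under which the segfault was reported in this thread.
inline bool matches_reported_setup(const Toolchain& t) {
    return t.gcc && t.apple && t.optimized;
}

// Human-readable summary, e.g. for a startup log line.
inline std::string describe(const Toolchain& t) {
    std::string s = t.gcc ? "gcc" : (t.clang ? "clang" : "other");
    s += t.apple ? " on Apple" : " on non-Apple";
    s += t.optimized ? ", optimized" : ", unoptimized";
    return s;
}
```

A program could log `describe(current_toolchain())` at startup so that bug reports carry this information automatically.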

@poulson
Member

poulson commented Nov 19, 2016

If anyone can come up with a minimal reproducible example (ideally not depending on Elemental), then I would be willing to shepherd the bug report through GCC.

@jwakely

jwakely commented Dec 2, 2016

A bug that was fixed 8 years ago is unlikely to be the problem, and should not have recurred.

Before reporting a new GCC bug, try building with ubsan, and asan, and see if they find any problems. A minimal reproducer is ideal, but not necessary. It should be enough to provide preprocessed source for the translation unit that segfaults, and details of the compiler flags that cause apparent miscompilation. See https://gcc.gnu.org/bugs/ (and please read it twice).
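
For readers unfamiliar with the sanitizers: they are enabled with `-fsanitize=undefined` (UBSan) and `-fsanitize=address` (ASan) in both GCC and Clang. One class of problem UBSan is designed to report is a misaligned access, such as dereferencing a `double*` that points into the middle of a byte buffer. The sketch below is a generic illustration with hypothetical helper names, not code from Elemental; it shows the well-defined alternative, which copies the bytes with `std::memcpy`.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: write a double at an arbitrary byte offset.
// std::memcpy has no alignment requirement, so this is well-defined even
// when `bytes + offset` is not a multiple of alignof(double).
inline void store_double(unsigned char* bytes, std::size_t offset, double value) {
    std::memcpy(bytes + offset, &value, sizeof value);
}

// Hypothetical helper: read a double back from an arbitrary byte offset.
inline double load_double(const unsigned char* bytes, std::size_t offset) {
    double value;
    std::memcpy(&value, bytes + offset, sizeof value);
    return value;
}

// By contrast, `*reinterpret_cast<const double*>(bytes + offset)` is undefined
// behaviour when the address is misaligned, and is the kind of access that
// -fsanitize=alignment (included in -fsanitize=undefined) is meant to flag.
```

If a sanitized build reports nothing, as happened on Linux later in this thread, that makes a source-level bug of this kind less likely but does not rule out a miscompilation.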

@poulson
Copy link
Member

poulson commented Dec 3, 2016

@jwakely I completely agree that this particular bug is unlikely to be the issue, but my guess is that there is a bug that is similar in spirit.

For what it's worth, I'm going down the path of running ubsan, but it is unfortunately known to be broken in the system clang on Sierra, and after manually building clang from git head I hit a compiler crash while compiling Elemental:

1.	/Users/poulson/Source/Elemental/include/El/macros/Instantiate.h:96:1 <Spelling=/Users/poulson/Source/Elemental/src/blas_like/level1/ColumnMinAbs.cpp:290:67>: current parser token ';'
2.	/Users/poulson/Source/Elemental/src/blas_like/level1/ColumnMinAbs.cpp:12:1: parsing namespace 'El'
3.	/Users/poulson/Source/Elemental/src/blas_like/level1/ColumnMinAbs.cpp:56:6: instantiating function definition 'El::ColumnMinAbs<float, El::DistNS::Dist::MC, El::DistNS::Dist::STAR>'
4.	/Users/poulson/Source/Elemental/include/El/core/DistMatrix/Element/VC_STAR.hpp:153:67: instantiating class definition 'El::DistMatrix<float, El::DistNS::Dist::MC, El::DistNS::Dist::STAR, El::DistWrapNS::DistWrap::ELEMENT>'
5.	/Users/poulson/Source/Elemental/include/El/core/DistMatrix/Element/STAR_MC.hpp:21:7: instantiating class definition 'El::DistMatrix<float, El::DistNS::Dist::STAR, El::DistNS::Dist::MC, El::DistWrapNS::DistWrap::ELEMENT>'
6.	/Users/poulson/Source/Elemental/include/El/core/DistMatrix/Element/STAR_MC.hpp:21:7: LLVM IR generation of declaration 'El::DistMatrix'
clang-4.0: error: unable to execute command: Abort trap: 6
clang-4.0: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 4.0.0 (trunk 288506)
Target: x86_64-apple-darwin16.1.0
Thread model: posix
InstalledDir: /Users/poulson/Source/build/bin
clang-4.0: note: diagnostic msg: PLEASE submit a bug report to http://llvm.org/bugs/ and include the crash backtrace, preprocessed source, and associated run script.
clang-4.0: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-4.0: note: diagnostic msg: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/ColumnMinAbs-946391.cpp
clang-4.0: note: diagnostic msg: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/ColumnMinAbs-946391.sh
clang-4.0: note: diagnostic msg: Crash backtrace is located in
clang-4.0: note: diagnostic msg: /Users/poulson/Library/Logs/DiagnosticReports/clang-4.0_<YYYY-MM-DD-HHMMSS>_<hostname>.crash
clang-4.0: note: diagnostic msg: (choose the .crash file that corresponds to your crash)
clang-4.0: note: diagnostic msg: 

********************
make[2]: *** [CMakeFiles/El.dir/src/blas_like/level1/ColumnMinAbs.cpp.o] Error 254
make[1]: *** [CMakeFiles/El.dir/all] Error 2
make: *** [all] Error 2

LLVM's user registration is currently down, but at some point I can start working up this compiler bug inception stack.

EDIT: In the meantime, I'll try GCC's ubsan on a Linux machine.

@jwakely

jwakely commented Dec 4, 2016

What about GCC's ubsan? Oh, just saw your edit. It works on OS X too.

@poulson
Member

poulson commented Dec 4, 2016

GCC 5 ubsan is clean on Linux. I can try on OS X as well.

@poulson
Member

poulson commented Dec 12, 2016

For what it's worth, there seems to be no issue with Homebrew's GCC 4.9 on OS X Sierra (with any optimization level).
