Seg fault in Cholesky #181
Comments
None of my current builds seem to show any of these symptoms (with |
Just try a code that only has
and you get the seg fault, at least on my mac. |
Is this on master or a release? On Tue, Sep 13, 2016 at 2:52 PM Haim Avron notifications@github.com wrote:
|
stack_not_16_byte_aligned_error looks like some kind of Mac tool chain Jeff Hammond |
I unfortunately don't have a personal Mac to test this on, but extensive tests on my Linux box have not turned anything up. |
Has anyone seen this behavior on any other system? A large number of users making regular use of |
Maybe it is, but I don't have any other Mac to test it on. Also, there is no reason why the toolchain would suddenly have a mismatch -- it was working fine until not so long ago |
For what it's worth, Cholesky has not been modified in quite some time. |
@haimav Am I understanding the discussion in xdata-skylark/libskylark#36 properly that the issue was in the implementation of Skylark's |
@poulson No that was a separate issue. |
Thanks; could you share what compiler (and version) and MPI (and version) was being used? If necessary I will buy a Mac to debug this. |
I have just updated gcc and I am recompiling Elemental. Will update you if it works... |
I bought a Macbook yesterday, compiled GCC 6.2.0 from scratch, MPICH 3.2 on top of it, and Elemental HEAD on top of those in Debug mode and |
I am actually using homebrew to install gcc and mpi, and used it to compile gcc 5.2.0. |
Thanks; I can look into that toolchain tonight. My guess is that this is a homebrew compatibility issue. It would be good to verify that the same version of GCC and MPI implementation (is it MPICH or OpenMPI?) was used for each component. |
I homebrewed GCC 6.2.0 yesterday and can test that myself. I wish I had the kind of money that let me buy a new computer just to debug |
I too am unable to reproduce. I saw the following issue:
This was solved exactly as described on http://stackoverflow.com/questions/39502921/warning-section-const-coal-is-deprecated-error-after-updating-xcode-to-la. The test I ran successfully was:

#include "El.hpp"
int main(int argc, char* argv[]){
    El::Initialize(argc, argv);
    El::Matrix<double> C;
    El::Identity(C, 1000, 1000);
    El::Cholesky(El::LOWER, C);
}

/opt/mpich/dev/gcc/default/bin/mpicxx \
    -I$HOME/Work/Elemental/git/install-gcc/include \
    -std=c++11 toy.cc \
    -L$HOME/Work/Elemental/git/install-gcc/lib -lEl \
    -Wl,-rpath -Wl,$HOME/Work/Elemental/git/install-gcc/lib

I am running OS X 10.11.6 with the following GCC and MPICH.
|
Thanks for looking into this Jeff. I'm compiling with Homebrew's GCC 6 right now (with MPICH 3.2 manually built on top of said compiler) and hope to contribute another datapoint. EDIT: All tests pass for me. |
Since we can't replicate it, I'm going to close it. Let's reopen if something changes. |
I have been running into similar segfaults on Mac OS X El Capitan with Release builds (but not Debug builds). I believe that the following discussion might be relevant: |
To nail this down a bit further: I only observe the alignment errors with homebrew's GCC when building in |
I believe that this issue is due to a bug in GCC not always forcing the stack to be aligned to 16-byte boundaries on OS X when compiling with The following output from
|
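To make the 16-byte alignment claim above concrete, here is a minimal sketch of a standalone check that does not depend on Elemental. The names `is_aligned` and `local_is_16_byte_aligned` are hypothetical helpers introduced for illustration. The probe only inspects the address the compiler assigns to a local that requests 16-byte alignment; it is not a reproducer for the crash reported in this issue, and an optimizer may fold the check to a constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when `p` sits on an `alignment`-byte boundary.
// `alignment` must be a nonzero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical probe: declares a local that requests 16-byte alignment
// and reports whether it actually landed on a 16-byte boundary.
bool local_is_16_byte_aligned() {
    alignas(16) double slot[2] = {0.0, 0.0};
    return is_aligned(slot, 16);
}
```

Calling `local_is_16_byte_aligned()` from `main` in builds with different compilers and optimization levels, and comparing the results, is one cheap first step before attempting a real reproducer.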
…typo that leads to Debug builds not properly compiling
So, on OS X we should compile with -O2? |
Due to what is almost certainly a GCC optimization bug on OS X, the better recommendation would be to compile with Clang. But -O2 would also work (if you're okay with the performance hit). EDIT: If you choose to go with a |
If anyone can come up with a Minimum Reproducible Example (ideally not depending on Elemental), then I would be willing to shepherd the bug report through GCC. |
A bug that was fixed 8 years ago is unlikely to be the problem, and should not have recurred. Before reporting a new GCC bug, try building with ubsan, and asan, and see if they find any problems. A minimal reproducer is ideal, but not necessary. It should be enough to provide preprocessed source for the translation unit that segfaults, and details of the compiler flags that cause apparent miscompilation. See https://gcc.gnu.org/bugs/ (and please read it twice). |
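As a rough illustration of the sanitizer advice above, the sketch below shows a small translation unit with the build commands in comments. The file name `repro.cc` and the helpers `store_u32` and `load_u32` are placeholders, not code from Elemental. `-fsanitize=address,undefined` and `-save-temps` are documented GCC options; the optimization level shown is an assumption and should be replaced by whichever level triggers the problem. The helpers use `memcpy`, which is well defined at any address; a direct dereference of a misaligned `std::uint32_t*` is the kind of access that ubsan's alignment check reports.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Build with sanitizers (repro.cc is a placeholder file name):
//   g++ -std=c++11 -O2 -g -fsanitize=address,undefined repro.cc
// Keep the preprocessed source (repro.ii) for a bug report:
//   g++ -std=c++11 -O2 -save-temps -c repro.cc

// Writes a 32-bit value to a possibly misaligned address.
inline void store_u32(unsigned char* p, std::uint32_t v) {
    assert(p != nullptr);
    std::memcpy(p, &v, sizeof v);
}

// Reads a 32-bit value from a possibly misaligned address.
// Writing *reinterpret_cast<const std::uint32_t*>(p) here instead would be
// undefined behavior when p is misaligned, which ubsan is designed to flag.
inline std::uint32_t load_u32(const unsigned char* p) {
    assert(p != nullptr);
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

A clean sanitizer run on the real code does not rule out a miscompilation, but it does make undefined behavior in the source a less likely explanation.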
@jwakely I completely agree that this particular bug is unlikely to be the issue, but my guess is that there is a bug that is similar in spirit. For what it's worth, I'm going down the path of running ubsan, but it is unfortunately known to be broken in the system clang on Sierra, and I hit a compiler bug from the git head in Elemental after manually compiling clang:
LLVM's user registration is currently down, but at some point I can start working up this compiler bug inception stack. EDIT: In the meantime, I'll try GCC's |
What about GCC's ubsan? Oh, just saw your edit. It works on OS X too. |
GCC 5 ubsan is clean on Linux. I can try on OS X as well. |
For what it's worth, there seems to be no issue with Homebrew's GCC 4.9 on OS X Sierra (with any optimization level). |
In my larger code, the following code generates a seg fault in Elemental.
(original code was more complex, but even this generated the seg fault).
Here is a stack trace:
Program received signal SIGSEGV, Segmentation fault.
0x00007fff8af9a2da in stack_not_16_byte_aligned_error () from /usr/lib/system/libdyld.dylib
(gdb) where
#0 0x00007fff8af9a2da in stack_not_16_byte_aligned_error () from /usr/lib/system/libdyld.dylib
#1 0x00007fff5fbfcb80 in ?? ()
#2 0x00000001028d2398 in ?? () from /usr/local/lib/libEl.dylib
#3 0x0000000000137ad6 in ?? ()
#4 0x00000001012cbf6e in El::Matrix::operator()(El::Range, El::Range) () from /usr/local/lib/libEl.dylib
#5 0x000000010152ecfc in void El::cholesky::UVar3(El::Matrix&) () from /usr/local/lib/libEl.dylib
#6 0x0000000101537f15 in void El::cholesky::UVar3(El::AbstractDistMatrix&) () from /usr/local/lib/libEl.dylib
#7 0x00000001000a01b6 in skylark::ml::feature_map_precond_t<El::DistMatrix<double, (El::DistNS::Dist)0, (El::DistNS::Dist)2, (El::DistWrapNS::DistWrap)0> >::feature_map_precond_t<skylark::ml::kernel_container_t, El::DistMatrix<double, (El::DistNS::Dist)0, (El::DistNS::Dist)2, (El::DistWrapNS::DistWrap)0> > (this=0x105757490, k=..., lambda=, X=..., s=, context=..., params=...)
#8 0x00000001000a44c1 in skylark::ml::FasterKernelRidge<double, skylark::ml::kernel_container_t> (direction=, k=..., X=...,
#9 0x000000010012a280 in skylark::ml::FasterKernelRLSC<double, int, skylark::ml::kernel_container_t> (direction=COLUMNS, k=..., X=..., L=...,
#10 0x000000010012f243 in execute_classification (context=...) at /Users/haimav/Coding/libskylark/examples/kernel_regression.cpp:401
#11 0x00000001001309b9 in main (argc=9, argv=0x7fff5fbff978) at /Users/haimav/Coding/libskylark/examples/kernel_regression.cpp:882