Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dill build fails on crusher@olcf w/ clang-14.0.0 during ADIOS2 build #27

Open
steindev opened this issue Apr 28, 2022 · 12 comments
Open

Comments

@steindev
Copy link

Building ADIOS2 2.8.0 on crusher (https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html) fails during step

[  3%] Building C object thirdparty/dill/dill/CMakeFiles/dill.dir/dill_util.c.o

I cannot draw any useful information from the error output. It is attached together with all output obtained during both configure and build phase.

configure.out.txt
configure.err.txt
build.out.txt
build.err.txt

Disabling SST functionality of ADIOS2 by using option -DADIOS2_USE_SST=OFF when building ADIOS2 workarounds the issue for me (as I do not want to use SST anyways).

I am happy to provide more information, e.g. regarding used modules and compiler flags, if needed.

@eisenhauer
Copy link
Member

eisenhauer commented Apr 28, 2022 via email

@steindev
Copy link
Author

Here is a build script. I had to add .txt to the filename in order to be able to upload it
install-adios2.sh.txt
.

@eisenhauer
Copy link
Member

So, here's what I've found out. First, this looks to be a compiler bug. llvm seems to be faulting in llvm::BasicBlock::isEntryBlock(). While it does give us an llvm stack dump, there's no real clue as to what in the source code might be triggering the bug. Second, I have modified your script to build dill outside of ADIOS (just substituting the dill github repo for the ADIOS spec), and I don't see the failure in that circumstance. So, while the issue seems to be inside the compiler, the ADIOS configuration may be adding flags or something else that are impacting whether or not it's triggered with this source. This obviously complicates things somewhat.

This is all a bit odd. Dill does dynamic code generation, so has need to use asm() specs and other odd things that might confuse an optimizer. But that generally happens in the machine-specific source files, not dill_util.c, which contains relatively generic code.

I think there are a couple of possible paths forward. Probably one needs to be submitting a bug report on llvm. For that, it would probably be helpful to try to sort out more details, narrowing down to the problematic part of dill_util.c, and or figuring out why this happen during the ADIOS build and not outside of it. The second path might be to sort out a workaround that would let you build ADIOS on crusher. The same experimenting required to narrow down the circumstances of the failure might also lend a clue to how it might be avoided. I'm both the maintainer of Dill and involved in ADIOS, so I'm positioned to work on this, but I've also got a stack of other things to work on, so progress might be slow. To the extent that you can help with the experimentation necessary to see why this is failing sometimes and not others, that would help speed resolution.

@khuck
Copy link

khuck commented May 18, 2022

Is there an update on this? I ran into the same issue, and I found that updating the ADIOS2/thirdparty/dill/CMakeLists.txt and forcing CMAKE_C_COMPILER to gcc (and CXX to g++ and Fortran to gfortran) allowed the build to continue just fine. Possibly a solution is for ADIOS2 to build dill with the GCC toolchain when configured with clang/clang++/flang and friends. Alternatively...is it possible this code in the main ADIOS2 CMakeLists.txt file could be the culprit?

if(CMAKE_C_COMPILER_ID MATCHES "^(GNU|Intel|Clang|AppleClang|MSVC)$")
  set(ADIOS2_C99_FEATURES c_restrict)
else()
  set(ADIOS2_C99_FEATURES c_std_99)
endif()

@khuck
Copy link

khuck commented May 18, 2022

update: I changed the ADIOS2 cmake configuration to not force -std=gnu99, and that didn't make a difference. I think this is a clang/llvm bug.

@eisenhauer
Copy link
Member

I tend to agree with that diagnosis... If anyone can identify a workaround, I'm happy to push it into dill and then ADIOS. But unfortunately I don't have cycles for it myself at the moment.

@jychoi-hpc
Copy link

It looks like "-O3" optimization is causing some problem. RelWithDebInfo or Debug option with cmake, I can compile.

@eisenhauer
Copy link
Member

Thanks Jong, that's helpful. I can maybe get on crusher, replicate the problem and see if there's a way to rearrange the code so that it doesn't trigger this compiler bug. Also, a quick google doesn't show any reports of clang segfaulting in IsEntryBlock, so this may be an unreported problem. If I can narrow down the problem I can also likely get a test case to submit a bug report. (Of course, finding a code-based workaround is probably the best short-term solution. Bug reporting is a long play.

@eisenhauer
Copy link
Member

OK, I have a code workaround:
diff --git a/dill_util.c b/dill_util.c
index e50c302..10e0f32 100644
--- a/dill_util.c
+++ b/dill_util.c
@@ -785,6 +785,7 @@ dill_start_proc(dill_stream s, char *name, int ret_type, char *arg_str)
(s->j->proc_start)(s, name, arg_count, args, NULL);
}

+attribute((optnone))
void
init_code_block(dill_stream s)
{

That is, just add "attribute((optnone))" to the init_code_block() function in dill_util.c. (The problem seems to be arising when that clang tries to inline that function in places where it's called. Adding the optnone disables that.)

I need to see how I can put this in the source so that it doesn't break on non-clang compilers, but that might not happen today as I have other things on my plate. But in the meantime, you have a workaround to compile ADIOS.

@khuck
Copy link

khuck commented May 19, 2022

@eisenhauer you might be able to wrap it with #ifdef __clang__ / #endif?

@eisenhauer
Copy link
Member

The recommendation seems to be like:
#ifndef __has_attribute // Optional of course.
#define __has_attribute(x) 0 // Compatibility with non-clang compilers.
#endif

...
#if __has_attribute(optnone)
#define OPTNONE_ATTR attribute((optnone))
#else
#define OPTNONE_ATTR
#endif

I'll give that a go and see how far it gets me...

@chuckatkins
Copy link
Contributor

FWIW I can easily reproduce this issue on my local workstation. It fails with the both 5.0.2 and 5.1.2 of the rocm stack, both based on llvm 14, but succeeds with 4.5.2, based on llvm 12. The following patch resolved the issue from the build system end:

diff --git a/thirdparty/dill/dill/CMakeLists.txt b/thirdparty/dill/dill/CMakeLists.txt
index db8183439..60bb35604 100644
--- a/thirdparty/dill/dill/CMakeLists.txt
+++ b/thirdparty/dill/dill/CMakeLists.txt
@@ -56,6 +56,15 @@ if(WIN32)
   endif()
 endif()
 
+# -O3 is known to crash some flavors of clang 14.
+if(CMAKE_C_COMPILER_ID MATCHES "Clang" AND
+   CMAKE_C_COMPILER_VERSION VERSION_GREATER_EQUAL 14.0.0)
+  foreach(_var_suffix IN ITEMS "" _DEBUG _RELEASE _RELWITHDEBINFO _MINSIZEREL)
+    string(REGEX REPLACE "-O(2|3)" "-O" CMAKE_C_FLAGS${_var_suffix}
+      "${CMAKE_C_FLAGS${_var_suffix}}")
+  endforeach()
+endif()
+
 # Setup shared library defaults.  If explicitly specified somehow, then default 
 # to that.  Otherwise base the default on whether or not shared libs are even
 # supported.

I'm more partial to the build solution over an in-source patch since my gut tells me to not trust that opt level for the rest of dill, even if the compiler doesn't segfault. Up to @eisenhauer though as to which approach he prefers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants