BLD: Fails to build: internal_error #4278

Closed
mcepl opened this issue Jan 2, 2023 · 14 comments · Fixed by #4894
Labels
bug build related to the build process
Comments

@mcepl

mcepl commented Jan 2, 2023

Bug report

Bug summary

When building the package as part of packaging yt for openSUSE, the build often fails (not always; roughly 50 % to 80 % of the time) with an error like this:

[  338s]   gcc -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -g -DOPENSSL_LOAD_CONF -fwrapv -fno-semantic-interposition -O2 -Wall -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -g -IVendor/ -O2 -Wall -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -g -IVendor/ -O2 -Wall -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -flto=auto -freport-bug -fPIC -Iyt/utilities/lib/ -Iyt/utilities/lib -I./yt/utilities/lib -I/usr/include/python3.10 -I/usr/lib64/python3.10/site-packages/numpy/core/include -c yt/utilities/lib/marching_cubes.cpp -o build/temp.linux-x86_64-cpython-310/yt/utilities/lib/marching_cubes.o
[  338s]   lto1: internal compiler error: resolution sub id 0xadc1906a211a636c not in object file
[  338s]   0x102f76b internal_error(char const*, ...)
[  338s]         ???:0
[  338s]   0x15cfb50 lto_main()
[  338s]         ???:0
[  338s]   Please submit a full bug report, with preprocessed source.
[  338s]   Please include the complete backtrace with any bug report.
[  338s]   See <https://bugs.opensuse.org/> for instructions.
[  338s]   lto-wrapper: fatal error: /usr/bin/g++ returned 1 exit status
[  338s]   compilation terminated.
[  338s]   /usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: error: lto-wrapper failed
[  338s]   collect2: error: ld returned 1 exit status

Code for reproduction

It gets triggered while building the package, so no obvious code snippet.

Complete build log with all packages used and steps taken to achieve this problem.

Actual outcome

See the build error above.

Expected outcome

Clean build.

Version Information

  • Operating System: Linux/openSUSE/Tumbleweed (rolling distro) as of 2023-01-02
  • Python Version: this one is for Python 3.10.9, but I don’t think it is interpreter driven
  • yt version: 4.1.2 from the PyPI tarball
  • Other Libraries (if applicable): see above log for the complete list

All packages from openSUSE packages.


When I have consulted this with our GCC expert (hi, @marxin!) I got this suggestion:

Hi, I'm almost sure that the problem is that there are some wrong dependencies and some .o file is used before its producer completes it. I have seen it many times; it manifests itself in different ways, and it is not deterministic:

[ 10s] lto1: internal compiler error: resolution sub id 0xb2bc78268f7df7a0 not in object file
[ 10s] 0x102f987 internal_error(char const*, ...)
[ 10s] ???:0
[ 10s] 0x15d02f0 lto_main()
[ 10s] ???:0

or that message of yours with lto1: internal compiler error: original not compressed with zstd

@welcome

welcome bot commented Jan 2, 2023

Hi, and welcome to yt! Thanks for opening your first issue. We have an issue template that helps us gather the information needed to diagnose and fix the issue.

@neutrinoceros neutrinoceros changed the title Fails to build: internal_error BLD: Fails to build: internal_error Jan 2, 2023
@neutrinoceros neutrinoceros added infrastructure Related to CI, versioning, websites, organizational issues, etc bug labels Jan 2, 2023
@neutrinoceros
Member

neutrinoceros commented Jan 2, 2023

Thank you very much for reporting this. I take it that the problem is still present for yt 4.1.3? Most likely the whole 4.1.x series is affected.

@mcepl
Author

mcepl commented Jan 2, 2023

Yeah, it is still the same: sometimes it passes (so fortunately, I could manage to push the update to openSUSE/Factory), but then next time it fails. I am afraid it could be something Cythonish or even GCC.

@neutrinoceros
Member

I'm almost sure that the problem is that there are some wrong dependencies and some .o file is used before its producer completes it.

I'm not really sure what can be done about it on our side. Even if we have incorrect (unused) dependencies between source files, it seems to me that compiler race conditions are probably out of our control. Otherwise, I currently have no idea how to look for what's causing it, whether in yt's source code or its infrastructure.

@marxin

marxin commented Jan 3, 2023

Well, the build runs in 2 steps: the first one compiles .c files into object files (.o), and later these objects are linked into shared libraries (.so). What happens is that, due to a missing dependency between a .so file and one of its inputs (a .o file), that object file is handed to the linker before it has actually been fully written. That's why we speak about broken dependencies. It has nothing to do with compiler race conditions. Does it make sense?
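
The failure mode described above can be imitated without a compiler. The following is only a sketch with made-up names (`compile_object`, `link`, `run`, and the `shared.o` file are illustrative, not part of yt, setuptools, or GCC): a pretend compile step writes an object file in two halves, and a pretend link step reads it either after the compile step has finished or while it is still half written.

```python
import os
import tempfile
import threading

PAYLOAD = b"OBJ" * 1000  # stand-in for the bytes of a complete object file


def compile_object(path, half_written, may_finish):
    """Pretend compiler: writes the object file in two halves."""
    half = len(PAYLOAD) // 2
    with open(path, "wb") as f:
        f.write(PAYLOAD[:half])
        f.flush()            # the file now exists on disk but is incomplete
        half_written.set()
        may_finish.wait()    # pause here, as a slow compile would
        f.write(PAYLOAD[half:])


def link(path):
    """Pretend linker: succeeds only if the object file is complete."""
    with open(path, "rb") as f:
        return f.read() == PAYLOAD


def run(wait_for_compiler):
    """Return True if the link step saw a complete object file."""
    with tempfile.TemporaryDirectory() as tmp:
        obj = os.path.join(tmp, "shared.o")
        half_written = threading.Event()
        may_finish = threading.Event()
        worker = threading.Thread(
            target=compile_object, args=(obj, half_written, may_finish)
        )
        worker.start()
        half_written.wait()
        if wait_for_compiler:
            # Correct dependency: link only after the compile step is done.
            may_finish.set()
            worker.join()
            return link(obj)
        # Missing dependency: link while the file is still half written.
        ok = link(obj)
        may_finish.set()
        worker.join()
        return ok
```

With the dependency respected, `run(True)` sees the whole file; without it, `run(False)` reads a truncated file, which is the situation in which a real linker would report a corrupt or unexpected object.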

@mcepl
Author

mcepl commented Jan 3, 2023

@marxin So, to translate it to the pure language of blame: it is most likely all the fault of Cython (which converts Python to C), right?

Writing the message to their email list as we speak (and, ugh, their current branch is considered legacy and they are all working on new shiny 3.0 version!).

@cphyc
Member

cphyc commented Jan 3, 2023

@mcepl one thing we could try is building with a single thread rather than with the number of cores available, to see whether we really have no leverage on the bug at all. Could you try building with

MAX_BUILD_CORES=0 python -m pip install .  # or whatever command you use to build the project

This will cause the build to be sequential rather than parallel and should™ make everything on yt's side deterministic.
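
For readers unfamiliar with this kind of switch, here is a minimal sketch of how a build script could turn such an environment variable into a job count. It is an assumption for illustration, not yt's actual code: the helper name `build_jobs`, its parameters, and the rule that any value below 1 means "one job" are all hypothetical.

```python
import os


def build_jobs(environ=None, cpu_count=None):
    """Pick the number of parallel build jobs (hypothetical helper).

    An unset or unparsable MAX_BUILD_CORES falls back to every available
    core; any other value is clamped to the range [1, cpu_count], so a
    value of 0 yields a sequential build.
    """
    environ = os.environ if environ is None else environ
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    raw = environ.get("MAX_BUILD_CORES")
    if raw is None:
        return cpu_count
    try:
        requested = int(raw)
    except ValueError:
        return cpu_count
    return max(1, min(requested, cpu_count))
```

Under these assumptions, `build_jobs({"MAX_BUILD_CORES": "0"}, cpu_count=8)` gives a single job, while an empty environment gives all 8.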

@mcepl
Author

mcepl commented Jan 4, 2023

Yes, that helped. Thank you.

@da-woods

da-woods commented Jan 7, 2023

Cython dev here (I also posted this response on the mailing list):

Assuming the race-condition theory (I think I misread this above though) is correct (if I understand correctly, it's trying to use a .o file that it's in the process of generating) then it's probably a distutils/setuptools problem. Cython's job is to translate a .pyx file to a .c file, and the compilation of the .c file is handled by something else (which is usually distutils or setuptools, but could be make, or something else).

Cython does do a small amount of monkey-patching to these packages though, so it's possible the problem is on our end. But probably unlikely.

I'm aware this sounds like trying to pass the buck further ;).


(and, ugh, their current branch is considered legacy and they are all working on new shiny 3.0 version!).

The current branch is maintained, and has had a release in the last few days. So this kind of bug would likely be fixed on it (if it were our fault), but new features are going into the shiny new version.

@da-woods

da-woods commented Jan 7, 2023

Just one more comment based on a quick reading of the log file (I could be wrong of course).

I think all your .so files are linked with fixed_interpolator. It looks to me like fixed_interpolator.cpp might be being rebuilt for every individual .so file, so you end up with something like:

a.so: compiles a.c -> compiles fixed_interpolator.cpp -> links a.o and fixed_interpolator.o
b.so: compiles b.c -> compiles fixed_interpolator.cpp -> links b.o and fixed_interpolator.o

And the b.so recompilation of fixed_interpolator conflicts with the a.so linking of fixed_interpolator.

I guess you want to compile it only once. I'm not quite sure how you do that in setup.py, though.
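
One way setuptools supports compiling a shared source only once is the `libraries` keyword of `setup()`, which builds a static library before the extensions are built. The sketch below only assembles the arguments; the helper `shared_library_config`, the module names `pkg.a` and `pkg.b`, and the path `src/fixed_interpolator.cpp` are placeholders, and this is not necessarily how the fix referenced in this issue is implemented.

```python
def shared_library_config(module_names, shared_source, lib_name="fixed_interpolator"):
    """Return (libraries, extension_kwargs) for a setup() call.

    The shared source is listed in exactly one place, the static library,
    so parallel extension builds never each compile their own copy of it.
    """
    # Format expected by setup(libraries=...): a list of (name, build_info).
    libraries = [(lib_name, {"sources": [shared_source]})]
    extension_kwargs = [
        {
            "name": name,
            # Hypothetical layout: one .pyx file per dotted module name.
            "sources": [name.replace(".", "/") + ".pyx"],
            # Link against the prebuilt library; shared_source is not listed here.
            "libraries": [lib_name],
        }
        for name in module_names
    ]
    return libraries, extension_kwargs


libraries, extension_kwargs = shared_library_config(
    ["pkg.a", "pkg.b"], "src/fixed_interpolator.cpp"
)
```

In a setup.py, these values would be passed along as `setup(libraries=libraries, ext_modules=[Extension(**kw) for kw in extension_kwargs])`, with `Extension` imported from setuptools.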

@mcepl
Author

mcepl commented Jan 7, 2023

(and, ugh, their current branch is considered legacy and they are all working on new shiny 3.0 version!).

The current branch is maintained, and has had a release in the last few days. So this kind of bug would likely be fixed on it (if it were our fault), but new features are going into the shiny new version.

I am sorry then. I really didn’t mean to slander you and I am sorry if I did so.

@da-woods

da-woods commented Jan 7, 2023

I am sorry then. I really didn’t mean to slander you and I am sorry if I did so.

No worries! There are definitely good reasons why people are frustrated with the somewhat stale state of the stable branch. We just don't want people to think it's completely neglected.

@neutrinoceros neutrinoceros added build related to the build process and removed infrastructure Related to CI, versioning, websites, organizational issues, etc labels Jul 31, 2023
@yut23
Member

yut23 commented May 6, 2024

I believe this is also the source of the intermittent failures on Jenkins that look like ImportError: .../marching_cubes.cpython-310-x86_64-linux-gnu.so: undefined symbol: _Z13eval_gradientPiPdS0_S0_ (e.g. https://tests.yt-project.org/job/yt_py310_git/8108/console).

yut23 added a commit to yut23/yt that referenced this issue May 6, 2024
Avoid a race condition on fixed_interpolator.o during parallel builds by
building it only once and storing it in a static library.

Hopefully fixes yt-project#4278.
@neutrinoceros neutrinoceros added this to the 4.4.0 milestone May 7, 2024
@neutrinoceros
Member

@mcepl we think we identified the cause and addressed it in #4894, which should land in our next release (4.4.0). Don't hesitate to re-open if it turns out the problem is still there.

henrynjones pushed a commit to henrynjones/yt that referenced this issue May 13, 2024
Avoid a race condition on fixed_interpolator.o during parallel builds by
building it only once and storing it in a static library.

Hopefully fixes yt-project#4278.