Bootstrap Reform and aarch64-linux's GCC Upgrade #208412

Closed
tpwrules opened this issue Dec 30, 2022 · 17 comments · Fixed by #209870
Labels
0.kind: bug (Something is broken)
6.topic: bootstrap (Bootstrapping, avoiding pre-built binaries. Often overlaps with cross-compilation.)

Comments

@tpwrules
Contributor

tpwrules commented Dec 30, 2022

This is the research I have done trying to figure out how to best upgrade aarch64-linux from GCC 9. I've collected everything here to make the problems clear, provide context for those who can help, and help the community decide on the right path forward. Some sources are linked where appropriate, but I have lots more available on request.

The Problem

Upgrading aarch64-linux past GCC 9 breaks large numbers of packages. As a result, there is a specific clause in all-packages.nix which keeps aarch64-linux at GCC 9, while allowing every other platform to use 11 (and soon 12).

GCC 9 is pretty old at this point and important packages, like KDE/Plasma, are demanding later versions to use modern C++ language features. We cannot practicably ship NixOS 23.05 with GCC 9. It must be upgraded before then, with sufficient time to test and fix up packages.

The two main breakages observed with more recent compilers are linker errors (e.g. with pkgs.icu) and random aborts (e.g. pkgs.expect during pkgs.dejagnu's test phase).

The Reason

The nixpkgs bootstrap sequence, which builds the latest GCC and stdenv using prebuilt seed binaries, is a bit sleazy. It compiles glibc with the old GCC, then builds the latest GCC and other utilities using that glibc. This results in the stdenv not being completely compiled by the new GCC.

In addition, and more importantly, GCC's low-level runtime library, libgcc_s.so, ends up simply copied from the GCC used in the bootstrap (currently 9 for aarch64-linux) to the glibc used in the stdenv (which would ordinarily be using GCC 11). This causes programs built with the later version of GCC to use the library of an earlier version, instead of the library expected by that GCC.

The library is linked in automatically by GCC (and can be linked in manually using -lgcc_s) when the compiled code needs certain support routines, e.g. SIMD math routines or atomics. When this goes wrong (e.g. because a more recent GCC expects additional functions), the result is linker errors and packages failing to build. The library is also loaded at runtime by glibc in certain circumstances, and a failure here (e.g. the library not being available in the rpath) results in runtime aborts, possibly with messages like "libgcc_s.so.1 must be installed for pthread_exit to work".
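As a rough illustration of the "additional functions" failure mode, one could compare the symbols exported by two builds of libgcc_s.so and list those that only the newer one provides. This is only a sketch: the helper name `symbols_only_in_new` is made up for this example, and it assumes each input file already holds one symbol name per line.

```shell
#!/bin/sh
# Print symbols that appear in the "new" list but not in the "old" list.
# Both arguments are plain-text files with one symbol name per line, e.g.
# produced by something like:
#   nm -D --defined-only libgcc_s.so.1 | awk '{print $NF}'
# (symbols_only_in_new is a hypothetical helper, not part of nixpkgs.)
symbols_only_in_new() {
    old_list=$1
    new_list=$2
    tmp_old=$(mktemp) || return 1
    tmp_new=$(mktemp) || { rm -f "$tmp_old"; return 1; }
    # comm requires sorted input; sort -u also drops duplicates.
    LC_ALL=C sort -u "$old_list" > "$tmp_old"
    LC_ALL=C sort -u "$new_list" > "$tmp_new"
    # -13 suppresses lines unique to the old list and lines common to both,
    # leaving only the lines unique to the new list.
    LC_ALL=C comm -13 "$tmp_old" "$tmp_new"
    rm -f "$tmp_old" "$tmp_new"
}
```

Any symbol printed here is one that code built by the newer GCC might reference but that the older, copied library cannot satisfy.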

This deficiency in the bootstrap happens to cause problems in a visible way for aarch64-linux and GCC 9->11, but copying libgcc_s.so around is unsafe and wrong for all architectures and GCC versions and needs to be fixed. However, it turns out to be a happy accident that libgcc_s.so is always available at runtime for glibc to use, and this needs to be preserved somehow too.

Possible Solutions

1. Ignore reason, upgrade bootstrap

Pros: Possible right now, pretty certain to actually fix the problem

Cons: Commits us to upgrading the bootstrap every time libgcc_s.so breaks compatibility on any architecture, and does not address the underlying cause. We continue to hope that this will never cause a subtle issue and will always break visibly for most packages.

2. Remove hack which copies around bootstrap libgcc_s.so, add -lgcc_s to wrapper

Pros: Tested and seems to work now, relatively certain to actually fix the problem

Cons: Could break in the future if e.g. dejagnu is needed in the bootstrap sequence again, adds 7.1 megabytes of GCC's library output to everyone's runtime closure (though this is already the case for C++ programs), doesn't actually improve bootstrap

It might be possible to patch GCC to detect when glibc could need libgcc_s.so too (i.e. if pthread support is enabled or exceptions are used?) and then include it only in that case, but that is kind of risky due to the failure mode. Maybe libgcc_s.so could be split into a separate output to avoid the size penalty.
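To check whether a given binary ended up with an explicit dependency on libgcc_s.so.1 (which is what adding -lgcc_s to the wrapper is meant to achieve), one could inspect its dynamic section. The sketch below parses `readelf -d` output from stdin; the function names `needed_libs` and `links_libgcc_s` are invented for this example.

```shell
#!/bin/sh
# Read `readelf -d <binary>` output on stdin and print the names of all
# DT_NEEDED shared libraries, one per line.
# (needed_libs and links_libgcc_s are hypothetical helper names.)
needed_libs() {
    sed -n 's/.*(NEEDED).*Shared library: \[\(.*\)\]/\1/p'
}

# Succeed (exit status 0) if the dynamic section read from stdin lists
# libgcc_s.so.1 as a needed library, fail otherwise.
links_libgcc_s() {
    needed_libs | grep -x 'libgcc_s\.so\.1' > /dev/null
}
```

Usage would look like `readelf -d ./prog | links_libgcc_s && echo "explicit libgcc_s dependency"`. Note that a missing NEEDED entry does not prove the library is unused, since glibc may still load it at runtime as described above.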

3. Add extra bootstrap stages to glue together a glibc that has the latest GCC's libgcc_s.so and a GCC which uses them

Pros: Should not add much overhead to the bootstrap process

Cons: Would likely require a lot of patchelfing, doesn't actually improve bootstrap

4. Add extra bootstrap stages to recompile glibc with the latest GCC (and its libgcc_s.so), then possibly GCC with that glibc

Pros: Solves the issue properly, cleanest and most correct bootstrap approach

Cons: Would complicate the life of people who work on the stdenv as bootstrap would be slower, complex to implement

It might be possible to reduce the overhead of this last solution especially if we need to build another GCC, as GCC already builds itself several times. We might be able to build the first GCC just once and the second GCC fewer times to keep the total number of builds less than double. There is also rumored to be a combined mode that can build GCC and glibc together which might be faster and a shortcut for the first GCC. This must also be careful to preserve correct operation for cross-compilation.

The Path Forward

We have the first solution essentially ready right now so that NixOS 23.05 is not held up, but it's the worst. The last solution is the correct one and needs to be done at some point for the benefit of nixpkgs as a whole. But it's also the most work and might cause problems for contributors if not done carefully.

cc: @K900, @trofi

@tpwrules tpwrules added the 0.kind: bug Something is broken label Dec 30, 2022
@winterqt winterqt added the 6.topic: bootstrap Bootstrapping, avoiding pre-built binaries. Often overlaps with cross-compilation. label Dec 30, 2022
@trofi
Contributor

trofi commented Jan 3, 2023

A bit of a progress update:

I added a small number of comments describing nixpkgs's convoluted bootstrap process in #208478.

I'm currently experimenting locally with a double gcc rebuild to update libgcc_s.so using the existing framework. Various minor issues pop up and I have not yet managed to produce a working stdenv, but I think I'll get there without too many failures.

I will post at least the non-controversial changes as proper PRs once I succeed in getting a working stdenv on x86_64.

The gist of the various bugs I have encountered so far is the incorrect inclusion order of library paths between bootstrapTools and recently built compilers. For example, libstdc++.so gets pulled from bootstrapTools even when gcc is rebuilt in early phases. So far I'm locally using the following hack (and many more less correct hacks):

--- a/pkgs/stdenv/linux/bootstrap-tools/scripts/unpack-bootstrap-tools.sh
+++ b/pkgs/stdenv/linux/bootstrap-tools/scripts/unpack-bootstrap-tools.sh
@@ -17,6 +17,15 @@ else
    LD_BINARY=$out/lib/ld-*so.?
 fi

+# path to version-specific libraries, like libstdc++.so
+LIBSTDCXX_SO_DIR=$(echo $out/lib/gcc/*/*)
+
+# Move version-specific libraries out to avoid library mix when we
+# upgrade gcc.
+# TODO(trofi): update bootstrap tarball script and tarballs to put them
+# into expected location directly.
+LD_LIBRARY_PATH=$out/lib $LD_BINARY $out/bin/mv $out/lib/libstdc++.* $LIBSTDCXX_SO_DIR/
+
 # On x86_64, ld-linux-x86-64.so.2 barfs on patchelf'ed programs.  So
 # use a copy of patchelf.
 LD_LIBRARY_PATH=$out/lib $LD_BINARY $out/bin/cp $out/bin/patchelf .
@@ -25,8 +34,8 @@ for i in $out/bin/* $out/libexec/gcc/*/*/*; do
     if [ -L "$i" ]; then continue; fi
     if [ -z "${i##*/liblto*}" ]; then continue; fi
     echo patching "$i"
-    LD_LIBRARY_PATH=$out/lib $LD_BINARY \
-        ./patchelf --set-interpreter $LD_BINARY --set-rpath $out/lib --force-rpath "$i"
+    LD_LIBRARY_PATH=$out/lib:$LIBSTDCXX_SO_DIR $LD_BINARY \
+        ./patchelf --set-interpreter $LD_BINARY --set-rpath $out/lib:$LIBSTDCXX_SO_DIR --force-rpath "$i"
 done

 for i in $out/lib/librt-*.so $out/lib/libpcre*; do
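The ordering problem behind this hack can be modelled very roughly as "the first directory in the search path that contains the library wins". The sketch below only implements that simplified rule; real linker and loader lookup involves more than this, and the function name `first_dir_with_lib` is made up for illustration.

```shell
#!/bin/sh
# Print the first directory in a colon-separated search path that contains
# the given file name, mimicking a left-to-right lookup order.
# Prints nothing and returns 1 if no directory contains it.
# (first_dir_with_lib is a hypothetical helper for illustration only.)
first_dir_with_lib() {
    search_path=$1
    lib_name=$2
    old_ifs=$IFS
    IFS=:
    set -f  # disable globbing while the path is split on ':'
    for dir in $search_path; do
        if [ -e "$dir/$lib_name" ]; then
            IFS=$old_ifs
            set +f
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    set +f
    return 1
}
```

With both a bootstrapTools directory and a freshly built gcc directory holding a copy of libstdc++.so, whichever is listed first is the one reported, which is why moving the version-specific libraries out of $out/lib changes what gets picked up.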

@trofi
Contributor

trofi commented Jan 4, 2023

I think I got a PoC:

The branch contains these non-controversial changes we could merge either as part of PoC or separately:

Please give it a go.

winterqt added a commit to winterqt/nixpkgs that referenced this issue Jan 4, 2023
This change switches to using GCC 11 by default on aarch64-linux, as well as passing `-lgcc` to the linker, per NixOS#201485.

See NixOS#201254 and NixOS#208412 for wider context on the issue.
@ghost

ghost commented Jan 7, 2023

Possible Solutions

  5. Enable rebootstrap on troublesome platforms

    • Cons:
      • it's a temporary band-aid like 1-4 are
      • extra builds of a bunch of packages on aarch64, but mostly small/fast ones
    • Pros:
      • don't need to upload new bootstrap-files to tarballs.nixos.org (cumbersome process, imposes a hosting burden for the rest of eternity)
      • zero added rebuilds on non-aarch64 platforms
      • a one-liner, easy to revert later when we fix things the right way
  6. Make gcc's stage1 a separate derivation and configure stage2 with --disable-bootstrap.

    • Pros:
      • zero additional rebuilds compared to the current approach
      • avoids the static-lib{mpfr,mpc,gmp,isl}.a hack in stage3 (stdenv stages.nix will automatically use gcc.stage1 to build them).
      • no more "frankenstein gcc" which is partly compiled by itself and partly compiled by the bootstrapFiles gcc.
      • allows the "rumored to be a combined mode that can build GCC and glibc together" that @tpwrules mentioned
    • Cons:
      • nobody has implemented this yet

IMHO 6 is the long-term solution.

@K900
Contributor

K900 commented Jan 7, 2023

6 definitely seems like the optimal solution, but it's going to need someone with enough knowledge of GCC's internals and enough free time, and I'm not sure that person exists right now. That said, I kind of like option 5 - it's a clever hack, which is a downside IMO, but it allows us to win some time in a less invasive way and keeps the hackery contained in the bootstrap.

@ghost

ghost commented Jan 7, 2023

I should also add that I ran into this problem six months ago here, and seriously considered taking on option 6.

But I was kind of demotivated by the general indifference to the problems caused by frankenstein-compilation. And a few of my recent major-project PRs have languished for 6+ months, requiring constant rebasing, which is further demotivating.

If there is now a general appreciation of why this is a problem, and people willing to allocate time to reviewing the resulting PR, I could take this up again. Would likely be aiming for right after 23.05, in the brief "it's (almost) okay to break stuff" window after the release.

@K900
Contributor

K900 commented Jan 7, 2023

I am absolutely willing to help with this, but my knowledge of GCC bootstrapping is stuck in the late 00s, so I'll have to catch up a lot to understand the specifics. It's also probably worth mentioning that we now have way more resources available to Hydra, so "just build everything with it" isn't just a viable way of testing changes such as this - it's something we can do in a few days, so it should be possible to land this at any time without incurring downstream breakage.

@tpwrules
Contributor Author

tpwrules commented Jan 7, 2023

I'm absolutely not a compiler guy, just been interested in pushing this along for the practical consequences. So the possibilities and fixes are not all obvious to me. Thank you @amjoseph-nixpkgs for your additional ideas and experiments.

Option 6 sounds like option 4 but implementing "We might be able to build the first GCC just once and the second GCC fewer times to keep the total number of builds less than double." It seems like the difficulty of reverting @trofi's proposal for option 4 is overstated, and I don't see how merging that proposal for 4 now makes achieving 6 any harder in the future, i.e. how it incurs technical debt.

I'm not a fan of 5 at the moment. At least on my machine, and with an earlier revision, building the stdenv from scratch with that option is dramatically slower than it was before. It also does not get us any closer to our goals for other architectures. It might be helpful in updating the bootstrap files without trusting a random contributor to build them on their own machine and without half-breaking aarch64-linux for a week to build new ones on Hydra. I'm re-rebuilding with the latest changes on that PR and will update when that completes.

@ghost

ghost commented Jan 7, 2023

> Option 6 sounds like option 4 but implementing "We might be able to build the first GCC just once and the second GCC fewer times to keep the total number of builds less than double."

No, they are completely different.

pkgs/development/compilers/gcc/ already builds gcc twice, internally, but never lets you get access to the first copy.

> At least on my machine and with an earlier revision

That obsolete version should not be used for build-time measurements.

> I don't see how merging that proposal for 4 now makes achieving 6 any harder in the future

Because it will have to be reverted at that future time.

It is far from being a one-line change (like #209462 is), so ability to revert cleanly depends on whether or not any other commit has touched the same lines, or lines near it.

@tpwrules
Contributor Author

tpwrules commented Jan 7, 2023

> pkgs/development/compilers/gcc/ already builds gcc twice, internally, but never lets you get access to the first copy.

Yes, I get this. Maybe my language was unclear. Each realization of the derivation generated by the expression pkgs/development/compilers/gcc/ builds GCC three times (not twice). Currently, we realize it once. trofi's PR #209063 (implementation of option 4) proposes to realize it twice, building six total copies of GCC. What I meant by that comment, and what you seem to want to do with option 6, is to split GCC's derivation up to get fewer total copies, ideally back to the original three.

> That obsolete version should not be used for build-time measurements.

I've completed measurement of the latest version and it does not substantially reduce the required build time (116 vs 103 minutes, compared to 40 before).

> Because it will have to be reverted at that future time.

I don't understand why this is true. It is claimed that the first 3 of 4 commits in trofi's PR are general improvements, which are prerequisites for the modification of the bootstrap sequence, but would not have to be reverted. The last commit to modify the bootstrap sequence does make substantial modifications, but I don't see why we would have to trust git to be able to mechanically revert them. The new sequence would have to be crafted to accommodate the split GCC derivation and the appropriate documentation would have to be written too. I don't see how that would be any harder with trofi's revised sequence. trofi commented more on this here, in case you missed it.

@ghost

ghost commented Jan 8, 2023

> > That obsolete version should not be used for build-time measurements.
>
> I've completed measurement of the latest version and it does not substantially reduce the required build time (116 vs 103 minutes, compared to 40 before).

This measurement was done against an obsolete commit; please be sure you built from 77c2173 -- it does drastically less rebuilding on aarch64, at the expense of some complexity.

@ghost

ghost commented Jan 8, 2023

> It is claimed that the first 3 of 4 commits in trofi's PR are general improvements,

I agree. Those should be broken out as a separate PR and merged immediately so we can focus on the important issues.

> trofi commented more

See my reply.

@ghost

ghost commented Jan 8, 2023

> builds GCC three times (not twice).

Well, technically the third build is a test -- for comparison with stage2. The stage2 compiler is the finished product; the stage3 compiler is only built as a sanity check. It really ought to be part of the checkPhase rather than the buildPhase but I guess nobody's done that yet.
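To illustrate the idea of the stage2/stage3 sanity check, here is a minimal sketch that compares two directories of build outputs file by file. It is not how GCC's own build system performs its comparison, which differs in details; the function name `compare_stage_dirs` and the directory layout are assumptions made for this example.

```shell
#!/bin/sh
# Compare every regular file under stage2_dir with the file of the same
# relative path under stage3_dir. Prints each differing or missing path and
# returns 1 if any mismatch was found, 0 otherwise.
# (compare_stage_dirs is a hypothetical helper, not GCC's real comparison.)
compare_stage_dirs() {
    stage2_dir=$1
    stage3_dir=$2
    status=0
    list=$(mktemp) || return 1
    # Collect relative paths first so the loop below runs in this shell
    # and can update $status.
    (cd "$stage2_dir" && find . -type f | LC_ALL=C sort) > "$list"
    while IFS= read -r rel; do
        if ! cmp -s "$stage2_dir/$rel" "$stage3_dir/$rel" 2>/dev/null; then
            printf 'mismatch: %s\n' "$rel"
            status=1
        fi
    done < "$list"
    rm -f "$list"
    return "$status"
}
```

The point of such a check is that a compiler built by itself (stage3) should produce the same output as the compiler that built it (stage2); any mismatch suggests a miscompilation somewhere in the chain.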

@ghost ghost mentioned this issue Jan 9, 2023
4 tasks
trofi added a commit to trofi/nixpkgs that referenced this issue Jan 10, 2023
… from libc

I would like to add an extra `gcc` build step during linux bootstrap
(NixOS#208412). This makes it early
bootstrap compiler linked and targeted against `bootstrapTools` `glibc`
including its headers.

Without this change `gcc`'s spec files always prefer `bootstrapTools` `glibc`
for header search path (passed in as --with-native-system-header-dir=). We can't
override it with:

- `-I` option as it gets stacked before gcc-specific headers, we need to keep
  glibc headers after gcc as gcc cleans namespace up for C standard by using
  #include_next and by undefining system macros.
- `-idirafter` option as it gets appended after existing `glibc`-includes

This `--sysroot=/nix/store/does/not/exist` hack allows us to remove existing
`glibc` headers and add new ones with `-idirafter`.

We use `cc-cflags-before` instead of `libc-cflags` to allow user to define
their own `--sysroot=` (like `firefox` does).

To keep it working prerequisite cross-symlink in gcc.libs is required:
NixOS#209153
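When experimenting with header ordering as described in this commit message, it can help to see the order in which gcc actually searches include directories. The sketch below extracts the `#include <...>` search list from gcc's verbose output read on stdin; the function name `angle_include_dirs` is invented for this example.

```shell
#!/bin/sh
# Read the verbose preprocessor output of gcc on stdin (for example from
# `gcc -E -v -x c /dev/null 2>&1`) and print the directories of the
# `#include <...>` search list, in search order, one per line.
# (angle_include_dirs is a hypothetical helper name.)
angle_include_dirs() {
    sed -n '/^#include <\.\.\.> search starts here:/,/^End of search list\./{
        /^#include /d
        /^End of search list\./d
        s/^ *//
        p
    }'
}
```

Running `gcc -E -v -x c /dev/null 2>&1 | angle_include_dirs` with and without extra flags such as `-idirafter` would show where a directory lands relative to gcc's own headers.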
@tpwrules
Contributor Author

I was able to come up with a proof of concept of option 2 here. Too janky and unknown to be a PR yet, but it does fix the two identified symptoms. Worth noting that guix does this same thing, but patches GCC instead of using a wrapper.

@zzywysm
Contributor

zzywysm commented Jan 21, 2023

Given I pointed you to a lot of this analysis, I would have appreciated a brief mention.

tpwrules/nixos-apple-silicon#11

@zzywysm
Contributor

zzywysm commented Jan 21, 2023

I don't see a comment to this effect, but it looks like the herculean effort by @amjoseph-nixpkgs in #209870 will address this issue.

@NickCao NickCao mentioned this issue Jan 21, 2023
13 tasks
edolstra added a commit to edolstra/nix that referenced this issue Feb 10, 2023
Nixpkgs on aarch64-linux is currently stuck on GCC 9
(NixOS/nixpkgs#208412) and using gcc11Stdenv
doesn't work either.

So use c++2a instead of c++20 for now. Unfortunately this means we
can't use some C++20 features for now (like std::span).
winterqt added a commit to winterqt/nixpkgs that referenced this issue Feb 15, 2023
This change switches to using GCC 11 by default on aarch64-linux, as well as passing `-lgcc` to the linker, per NixOS#201485.

See NixOS#201254 and NixOS#208412 for wider context on the issue.

(cherry picked from commit 8442601)
@flokli
Contributor

flokli commented Apr 3, 2023

#209870 got merged 5 hours ago; this should have been auto-closed (but wasn't?)

@vcunat
Member

vcunat commented Apr 3, 2023

Auto-closing normally happens when the thing reaches master. Anyway, by now we've implemented at least two of the listed "solutions" already. (and use gcc12 as default on aarch64-linux)
