Update CI for upcoming macOS changes #7821
Comments
I have a branch which implements this, but I am running into problems. See rust-lang/rust#68863 (comment) for details.
EDIT: Added logs here:
Logs
taskgated 11 libsystem_pthread.dylib 0x00007fff6522fe65 _pthread_start + 148
taskgated 12 libsystem_pthread.dylib 0x00007fff6522b83b thread_start + 15
taskgated no system signature for unsigned /Users/eric/Proj/rust/cargo/target/debug/cargo[30241]
taskgated close(3) err: 0
taskgated end request
taskgated begin request: 3331, 27001
taskgated UNIX error exception: 2
opendirectoryd PID: 30240, Client: 'clang', exited with 0 session(s), 0 node(s) and 0 active request(s)
opendirectoryd Trigger - cancelled
opendirectoryd Finalizing pidinfo (30240) object - 0x7fccecb44d00
taskgated 0 Security 0x00007fff3a1fea43 Security::CommonError::LogBacktrace() + 107
taskgated 1 Security 0x00007fff3a1fed5d Security::UnixError::UnixError(int, bool) + 263
taskgated 2 Security 0x00007fff3a1fedb8 Security::UnixError::throwMe(int) + 36
taskgated 3 Security 0x00007fff3a101c56 Security::CodeSigning::KernelCode::identifyGuest(Security::CodeSigning::SecCode*, __CFData const**) + 1006
taskgated 4 Security 0x00007fff3a0df0fa Security::CodeSigning::SecCode::identify() + 58
taskgated 5 Security 0x00007fff3a0df514 Security::CodeSigning::SecCode::autoLocateGuest(__CFDictionary const*, unsigned int) + 122
taskgated 6 Security 0x00007fff3a0e55a6 SecCodeCopyGuestWithAttributes + 78
taskgated 7 taskgated 0x000000010119f3b7 taskgated + 13239
taskgated 8 taskgated 0x000000010119facb taskgated + 15051
taskgated 9 taskgated 0x000000010119fe67 taskgated + 15975
taskgated 10 taskgated 0x000000010119fed3 taskgated + 16083
taskgated 11 taskgated 0x00000001011a20ea taskgated + 24810
taskgated 12 taskgated 0x00000001011a1617 taskgated + 22039
opendirectoryd PID: 30005, Client: 'cargo', exited with 0 session(s), 0 node(s) and 0 active request(s)
taskgated 13 taskgated 0x00000001011a0ce5 taskgated + 19685
syspolicyd GK process assessment: <-- (, )
opendirectoryd Trigger - cancelled
syspolicyd Unable (errno: 2) to read file at for process path: library path: (null)
taskgated 14 libsystem_pthread.dylib 0x00007fff6522fe65 _pthread_start + 148
opendirectoryd Finalizing pidinfo (30005) object - 0x7fccead43440
syspolicyd Dropping com.apple.syspolicy.Gatekeeper.Errors as it isn't used in any transform (not in the config or budgeted?)
kernel build_userspace_exit_reason: illegal flags passed from userspace (some masked off) 0x141, ns: 9, code 0x8
syspolicyd Terminating process due to Gatekeeper rejection: 30005,
taskgated 15 libsystem_pthread.dylib 0x00007fff6522b83b thread_start + 15
kernel Waking up reference: 66615
taskgated no signature for pid=30005 (cannot make code: UNIX[No such file or directory])
kernel Thread waiting on reference 66615 woke up
taskgated end request
kernel Sleep interrupted, signal 0x100
kernel Security policy would not allow process: 30005, /Users/eric/Proj/rust/cargo/target/cit/t1430/foo/target/debug/deps/foo
Most of these messages are unique to when it decides to kill a process. |
@ehuss it may be worthwhile opening an issue for that on https://github.com/actions/virtual-environments since that's probably an image thing that may need fixing? |
Nah, I can repro locally. I'm pretty sure it's a bug in Catalina (I wasn't able to repro in older macOS versions). Catalina had a bunch of new security changes, so it's not too surprising that a weird use case like Cargo's test suite causes a problem. I might file a radar, but in my experience it usually takes at least a few years to get a response (and since this is an obscure case, it's unlikely to get any response). I'm currently leaning towards changing Cargo's test suite to check for processes that exit with SIGKILL on macOS and retry them. I'm not sure how difficult that will be (it will probably need to scan the output for text like "exited with signal: 9"). My concern is that it would also happen with rustc's test suite, since it also rapidly creates new executables and runs them. But it seemed fine on my end, so maybe there's something unique or different in how Cargo is interacting. Or maybe Cargo's tests are more complex than rustc's, since they often have more complex linking requirements and do several rapid rebuilds of the same project (overwriting an executable several times might confuse Gatekeeper, for example). |
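To make the retry idea above concrete, here is a minimal sketch in bash. It is not what Cargo's test suite does; `run_with_sigkill_retry` is a hypothetical helper name, and it checks the exit status rather than scanning output text. It relies on the shell convention that a process killed by signal N is reported with exit status 128 + N, so SIGKILL (9) shows up as 137.

```bash
#!/bin/bash
# Hypothetical helper (not part of Cargo): run a command and retry it only
# when it was terminated by SIGKILL, up to a maximum number of attempts.
# Usage: run_with_sigkill_retry MAX_ATTEMPTS COMMAND [ARGS...]
run_with_sigkill_retry() {
    local max_attempts=$1
    shift
    local attempt status
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        status=0
        # "|| status=$?" captures the failure without tripping "set -e".
        "$@" || status=$?
        # The shell reports death by signal N as 128 + N; SIGKILL is 9.
        if [ "$status" -ne 137 ]; then
            return "$status"
        fi
        echo "attempt $attempt of $max_attempts: killed by SIGKILL" >&2
    done
    # Every attempt was killed; report the SIGKILL status to the caller.
    return 137
}
```

Something like `run_with_sigkill_retry 3 ./foo` would run `./foo` at most three times. Two caveats: this only makes sense for commands that are safe to run more than once, and a status of 137 does not prove Gatekeeper was the cause, since anything that sends SIGKILL produces the same status.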
Oh jeez that's a bummer :( I figured there was some way to disable this but it doesn't look like that's possible, at least from the UI... I agree that waiting for a fix from Apple probably isn't gonna happen, but I'm surprised others haven't run into this, in the sense that testing C/C++ projects (maybe even LLVM) or other big-ish build systems in general seems like it'd hit the same issues. Googling around "disable gatekeeper" it looks like there are some hits though. I've got Catalina myself and I can try to play with this soon and see if those can fix the issue. My thinking is that if we could disable whatever security things are there, then we could make the request to GitHub Actions to do that on their image. |
I tried disabling Gatekeeper with One thing did fix the problem, and that is disabling System Integrity Protection. That requires running I could maybe run the tests in a loop and collect data about which tests seem to fail. So far it seems pretty random, but maybe there is a pattern to which ones are affected? |
I ran the tests in a loop for a couple hours and counted which tests failed:
The only commonality I see is that these all run a process after it is built. |
@ehuss Do you have a script for generating that? It might be useful to try and have others run tests locally as well, I guess, and see if we get the ~same set. |
I just ran |
Hm, well, not really knowing much about why these tests are getting killed makes it sort of hard to figure out how to work around this. We may be able to get by with a 10ms sleep before spawning processes or something like that? The nondeterministic nature of this in particular is hard to wrap my head around. Given that, I'm not sure how we can handle this because it's hard to pinpoint a cause. Clearly we can run just-generated binaries, just not in some cases... |
In the reproductions I've managed to come across locally, "syspolicyd Unable (errno: 2) to read file at for process path: library path: (null)" is a commonality -- that suggests to me that some file is perhaps not being fully created yet? Maybe we need to (loosely) fsync before running a binary? Unfortunately it looks like locally I get the following, and turning these privates into non-private seems hard (the random blogs online suggests running reverse-engineered code, which seems... error prone, and I don't want to break something :)
Some rough googling for the error here leads me to flutter/flutter#38325 via https://github.com/christopherfujino/catalina-crasher-demo/, which has the interesting sounding:
That sounds at least plausibly like something that might happen to us here? I don't think we're deleting anything that is forking a subprocess off, but maybe there's a similar error condition. |
Yea, from what I've read (blog) it is impossible to reveal those I went strolling through Apple's opensource deposits to see if any of this code is available, but they have only released the kernel for 10.15, and I can't find syspolicyd in 10.14 (I found copies from much older releases (10.7?), but that is not helpful). I found the code for I created an isolated reproduction that does not use cargo: It seems like calling |
Heh, I didn't come across that particular blog! By chance, I think I'm still on .2, so I'll try that out. I will also try to fiddle with the reproduction you've noted. Losing hard linking is actually probably not too bad, especially on Macs, where most people have SSDs I'd guess (or at least fast disks). |
I was trying to help someone on the users forum who had an issue with something that seems suspiciously similar - https://users.rust-lang.org/t/github-actions-randomly-kill-a-test-program/37255/ |
I am the one @aidanhs helped at the Forum with a similar (if not the same) bug. I posted a workaround there, but I do not know what the actual bug is. |
With privacy disabled, here's the log. Initial guess is that we're trying to look at the old file to verify the new file? (This is with the scargo reproduction provided by @ehuss). I have not looked in detail at the scargo contents, and probably do not have time to dig in much more though.
|
Thanks for posting the unredacted logs. I don't see anything too surprising. Just wanted to update, I've been experimenting with different things over the past 2 days, but haven't found any great workarounds. I'd like to avoid copies since they can use a lot of disk space. I also suspect for normal usage Gatekeeper is unlikely to affect anyone. Maybe we could only copy in debug mode for cargo's test suite? I was also trying to think of more extreme options (like APFS clones), but nothing practical has come up. I created a repro just using shell scripts (just to rule out any particulars of Rust):
repro.sh
#!/bin/bash
# Gatekeeper crash reproduction.
set -e
echo "int main() {}" > foo.c
cc -o foo foo.c
N=8
for ((i=0;i<$N;i++))
do
    ./runner.sh $i &
    pids[$i]=$!
done
cleanup() {
    echo "Cleanup after exit..."
    kill -TERM "${pids[@]}"
    exit 1
}
# wait -n isn't available in this old version of bash.
trap "cleanup" CHLD
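wait
runner.sh
#!/bin/bash
# The {1..1000} brace expansion below is a bash feature, so use bash rather than a plain POSIX sh.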
set -e
root=t$1
for i in {1..1000}
do
    echo $i
    rm -rf $root
    mkdir -p $root/out
    cp foo $root/out/foo2
    ln $root/out/foo2 $root/foo
    $root/foo
done
What's crazy is that it is very particular to the exact commands here. The following variations don't exhibit the crash:
(Assuming I'm not being misled by timing variations.) Some things that I've tried that don't help:
|
As a random shot in the dark, what if, instead of hard-linking, we did something like:
Since B would then be the "original copy" does it then fix the issue? I'm also running those scripts locally to try to reproduce but nothing yet. How quickly does it reproduce for you? |
I could reproduce with cargo on just run tests in around 15, maybe 20 minutes, though I only tried a couple times. I think I managed a reproduction in around 5-10 minutes with @ehuss's rust script. Both of these are on .2, though, and I've since updated to .3 and have not tried to reproduce since then. I wonder if due to the migration to APFS which I presume is quite widely used, and I believe is CoW, we might be observing some side effect of that and it would be beneficial to write a few bytes (vs. messing with what gets hard linked). Obviously we could make these the same bytes, I guess, though maybe that doesn't defeat the CoW nature of the filesystem... If it is CoW, we should also check that just making a copy isn't already sharing disk space and as such isn't fast enough that we don't need to bother hardlinking at all? |
Usually takes anywhere from a few seconds to 5 minutes. I run my tests for 10-20 minutes just to make sure. I suspect it is very sensitive to timing. The system I'm running tests on is kinda old (~6 years). You can maybe tweak the parallelism to match the number of CPUs (I hard-coded it to 8). You can also maybe disable some CPUs to slow your machine down (Instruments > Settings > CPU). Interestingly, I tried your rename trick and it doesn't seem to repro with that. How strange! I tested with HFS, and I'm unable to repro on that, so it does seem to be related to APFS! I also verified that the 10.15 image on Azure switched to APFS (which doesn't seem to be documented anywhere 😠). @Mark-Simulacrum regarding the CoW stuff, I'm not too familiar with APFS. However, my understanding is that CoW only works if the program uses special options. That is, the |
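For anyone adjusting the parallelism as suggested above, one possible way to replace the hard-coded `N=8` in repro.sh is sketched below. `detect_cpu_count` is a hypothetical helper name, not something from the original scripts. It tries `sysctl -n hw.ncpu` (macOS), then `getconf _NPROCESSORS_ONLN`, and falls back to the thread's original value of 8 if neither works.

```bash
#!/bin/bash
# Hypothetical helper: print the number of CPUs, for use as N in repro.sh.
detect_cpu_count() {
    local n=""
    # macOS exposes the CPU count through sysctl.
    n=$(sysctl -n hw.ncpu 2>/dev/null) || n=""
    if [ -z "$n" ]; then
        # Fallback for systems without the hw.ncpu key (for example Linux).
        n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    fi
    # Last resort: the value that was hard-coded in the original script.
    echo "${n:-8}"
}
```

In repro.sh this would be used as `N=$(detect_cpu_count)`. Whether matching the CPU count actually makes the crash reproduce faster is untested here; the thread only suggests that timing matters.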
Oh I think I was actually having some kills locally, I just didn't see them because the script didn't stop at them (or I messed something up in how I ran the script)...
Do you think this is a viable way forward maybe? We could try to optimize this to not move files around if the hard links are already set up (I think we already |
I ran some tests with Cargo's full testsuite using the rename trick, but I still got failures. |
Switch azure to macOS 10.15.
Switches CI to the macOS 10.15 image. Since 32-bit support is no longer available, this changes how cross-compile testing works. I decided to use `x86_64-apple-ios` as a cross target, since it can easily build/link on macOS. `cargo run` won't work without a simulator, so some of the tests are restructured to check if `cargo run` is allowed. If you do have a simulator, it should Just Work. CI doesn't seem to be configured with a simulator installed, and I didn't bother to look if that would be possible (the simulators tend to be several gigabytes in size).
An alternative approach would be to use wasm as a cross target, which is also fairly easy to support. But wasm is a sufficiently different target that it can cause some issues in some tests, and is a bit harder to run as an executable.
This also adds some more help text on how to configure cross-compile tests.
Rustup is now installed on macOS by default, so no need to install it. Unfortunately self-updates are not allowed, but hopefully that won't be an issue.
Closes rust-lang#7821
So our mac builder is getting killed with sigkill (9). This is super mysterious:
* rust-lang/cargo#7821
* fortran-lang/fpm#16
* https://github.community/t/github-actions-on-macos-randomly-kill-my-test-program/17387
Azure will be removing support for macOS 10.13 in March (https://devblogs.microsoft.com/devops/azure-pipelines-hosted-pools-updates/). We will need to update our CI configuration to use the new image. This will also remove support for cross-compiling tests to i686-apple-darwin. I would like to retain some kind of cross-compile testing on macOS, and Alex suggested using wasm as the alternate target (possibly minus the cross-compile `cargo run` tests).