-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test stdout is sometimes missing #4608
Comments
After a lot of offline discussion, we think that the principled fix is to split test execution and test.xml generation (only to be executed if the test did not generate a test.xml file). I have a patch for that, but unfortunately, it's causing problems elsewhere. I have a patch for that, except it's causing problems in yet another place. As a short-term solution, we're going to put in a Linux-specific fix, which is the most critical case right now. |
Progress on #4608. PiperOrigin-RevId: 185126689
…here "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185374273
…here "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185374273
*** Reason for rollback *** #4625 What I thought was a short fix is turning into a long hunt, so I better roll this back to get the build green again. I'm not yet 100% certain what the interactions are, but there's a chance that it's back to the drawing board. *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185482604
…ess in a sub-shell. Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=presubmits, manual shell tests on new bazel RELNOTES: None PiperOrigin-RevId: 186312008
…here "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185374273
*** Reason for rollback *** #4625 What I thought was a short fix is turning into a long hunt, so I better roll this back to get the build green again. I'm not yet 100% certain what the interactions are, but there's a chance that it's back to the drawing board. *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185482604
…ess in a sub-shell. Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=presubmits, manual shell tests on new bazel RELNOTES: None PiperOrigin-RevId: 186312008
Is the remaining work on this a release blocker for 0.11.0? |
No, as long as 9fe02aa is in 0.11.0, we're good. the remaining work will take a few more weeks. |
That's fine, I am removing the release blocker label. |
…here "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185374273
…ess in a sub-shell. Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=presubmits, manual shell tests on new bazel RELNOTES: None PiperOrigin-RevId: 186312008
…here "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=manual scripts and new test case. RELNOTES: None PiperOrigin-RevId: 185374273
…ess in a sub-shell. Apparently, nested background processes interfere with SIGINT handling in bash. I don't 100% understand why and how, but I do have a small bash script that demonstrates the problem: script A that spawns a background process, sends it a SIGINT, and verifies it was received. The script works, *unless* run in the background by a process B; this extra layer of backgrounding cause process A's logic to stop working. See experimental/users/olaola/shell/ for examples. See also https://stackoverflow.com/questions/48847722/nested-background-processes-and-sigint-handling *** Original change description *** Fixing test-setup.sh occasionally missing stdout/stderr, on systems where "tail --pid" is supported. The solutions aren't mine, the new test was taken from Ola's unknown commit and the way to avoid race condition courtesy of sethkoehler@ Mitigates #4608 for compatible Linux systems. TESTED=presubmits, manual shell tests on new bazel RELNOTES: None PiperOrigin-RevId: 186312008
Assigning to myself for the long-term fix. |
I looked into the issue a bit with I'm not very familiar with Bazel internals but is there a good reason for using bash for this task? (Sometimes the short way is actually the long way). |
Thanks for looking into it! The "short way" implementation is in the works. Unfortunately, some non-trivial refactoring is involved so it's not coming as fast as we'd like - that's the main reason for the stop-gap fix in 9fe02aa |
Do you have a CL number of the change that needs to be rolled forward? Will you take care of this or should I? |
cl/207084179 needs to be rolled back (i.e., the original change is cl/205629237 and needs to be rolled forward). I already have a rollback of the rollback, but we were unable to determine how the change could have caused the problems that led to the rollback. @meisterT was pretty confident that the original change caused a problem. We later found another change that caused a similar problem, but it's unclear if that was the same problem. |
There was an intermediate change by Janak which may have resolved the underlying problem - cl/210731120. |
It's unlikely that I'll have any time this week to work on it. |
So, the rollback of the rollback is in, guarded with a flag: f2b260c Unfortunately, @meisterT was completely correct that the change was broken. Fortunately, we managed to track it down and I just submitted a fix: 9de2ea5 (See the description for a long explanation of what was going wrong.) If that second change sticks, then we can move forward with flipping the flag introduced in the first change and backing out the temporary code. That should happen some time next week. Assuming that that also sticks, i.e., we didn't overlook something else, we're done with the second-level Yak shaving. That means we can go up one level and flip --experimental_split_xml_generation to true. If next week looks good, then we can tackle that the week after. If that also sticks, then we can clean up the temporary code for that, and finally close this bug. |
Well, it took me a bit longer but I have a patch to flip the flag, as well as a patch to back out the temporary code. They should get submitted this week. |
To clarify: I already flipped the flag for all Google builds and we have not observed any problems. My patch is to flip the flag in Bazel as well, and then to back out the temporary code. |
Progress on #4608. PiperOrigin-RevId: 222205065
Temporary code removed in b7ae7b0. Next up is to flip the --experimental_split_xml_generation flag. |
Otherwise the test name is not generated correctly. This is caught by our existing integration tests, and I'm adding an assertion to our unit tests as well. Progress on #4608. PiperOrigin-RevId: 223338264
Progress on #4608. PiperOrigin-RevId: 223497462
This got delayed due to me being on vacation for all of December. Unfortunately, I just found yet another issue. As far as I can tell, our code to replace invalid xml characters doesn't actually work right now (at least with some versions of perl). On the plus side, I can now confirm that the upcoming replacement is equivalent to what we're doing right now. |
Make that two issues - there's still a bug in the code for which I have a fix, and the current XML character replacement function doesn't work, which I want to replace anyway. Ok, make that three issues - the test doesn't actually test whether XML character replacement works because it's broken. |
Actually, four - the script is silently ignoring the non-zero exit code from perl. |
test-setup.sh was setting a trap to call write_xml_output_file, but there's no test.log.xml if split xml generation is enabled. Add a test case for this and comment out a part of the test case that is not actually working (the test passes because it ignores the exit code from perl). Progress on #4608. PiperOrigin-RevId: 229167973
The good news is that I fixed most of the issues mentioned above, at least those that are critical for rolling this out. However, I found yet another issue with signal handling, which apparently isn't working correctly. |
If we don't set a trap here, then bash ignores the signal, and the test process also does not receive the signal, so the test runner has no chance of writing a test.xml output. However, the behavior of trap forwarding the signal to the subprocess is not at all documented in the bash documentation, and also inconsistent with the behavior reported in #7119. There is a similar problem in the Java stub template reported in #6338. This may or may not be progress on #4608. PiperOrigin-RevId: 232035930
It looks like there may still be an issue with remote execution. |
It is possible for Bazel to run test spawns that generate no output files (only stdout/stderr). Leave any such validation to the higher level. This was failing with --experimental_split_xml_generation. Progress on #4608. PiperOrigin-RevId: 232295958
It's in! It's amazing! (But don't celebrate yet, it could still be rolled back...) |
There is a race condition in test-setup.sh which will cause the test output to occasionally be (partially or fully) missing, from either the test log or from the test.xml, or both.
Sample Tensorflow tests revealed the stdout missing from xml in as much as 40% of the cases, although usual incidence for small tests is ~1%. Internal issue: b/65977241.
The text was updated successfully, but these errors were encountered: