cmd/go: "fork/exec foo.test: text file busy" flakes #22220
A similar one from a linux/386 trybot: https://storage.googleapis.com/go-build-log/a1da169f/linux-386_2279a6a1.log
Almost certainly related to the recent cmd/go changes. Russ is looking into it.
It's not really clear how cdbc363 could actually be causing this, though.
Makes no sense. I did have a missing f.Close, but on a file open for reading, not writing. The close of the file opened for writing is clearly there:
I am wondering if a child process is inheriting the file descriptor and so it stays open in the child after we close it, but os.OpenFile adds O_CLOEXEC, which should make that impossible. Trying to reproduce. I could just bypass that code path entirely for binaries we are going to run and discard. I assume that would fix the problem, but I'd like to understand the actual problem first.
Reproduced with
on my Mac. The code in question is not actually needed for ephemeral binaries, like the ones that go test runs, so I'll probably just disable it without understanding it and move on.
It occurs to me that this would happen if os.File.Close returned before the fd had actually been closed. @ianlancetaylor and I have been talking about things like this related to net.Conns, but since the use of the writable fd here is single-threaded, I don't see how the close would not be complete at the moment the test runs.
Change https://golang.org/cl/70390 mentions this issue:
If anyone can make this happen reliably - like five times in a row in a given time period reliably - then please patch in CL 70390 and see how that changes the frequency at which it happens. I've somehow lost the ability to reproduce this on my Mac in an unmodified client, so I can't see whether CL 70390 makes the situation better. I did, however, see the problem happen once on my Mac even with CL 70390 applied, so it doesn't appear to be a complete fix regardless. It should be safe, though, so if anyone can show evidence that it helps, it's fine to submit.
I got this to happen once on Linux using 'perf trace record'. Here is the trace - trace1.txt.gz - truncated right after the exec that got ETXTBSY. Literally every open call in the trace has CLOEXEC set.

But maybe the problem is that there is a separate fork+exec happening in parallel, and while that separate child's exec will close the writable fd, it hasn't exec'ed yet, so it still holds the fd in its forked copy of the fd tables. That would explain the problem. It would not explain why I saw this even after removing the "open binary for write" code from my Mac, but maybe there is a second problem on the Mac. Now that I got a trace on Linux with the old code I will try with my pending CL.
Backing up a step. On systems where close(2) actually does real work and can return a meaningful error, does that only apply to the final close of the underlying file? That is, if I do:
Does the close(fd) do any real work flushing writes, or is it only the close(fd2)? Because if it's only the close(fd2), then any time you have a file open for write and have a different goroutine doing fork+exec, there's a race where you might not see write errors because your close is not the last. The other goroutine can do a fork (essentially creating fd2 as above), and then close(fd) does nothing - and succeeds - because fd2 is still open, and then the new child process either execs or exits and closes fd2 implicitly. Are things really this bad?

The same logic means that cmd/go basically can never open an executable for write. It must always invoke a subprocess to do that, because if it opens an executable for write there is no guarantee about a timely close of the fd relative to a future attempted exec. Are things really this bad?
My understanding is that it's the first close. Unfortunately, I can't nail down what "serious work" is. It is definitely FS-specific, but I haven't found any notes about what is conventional for "fast" file systems. It sounds like it flushes to the kernel write buffers. The syscall documentation explicitly says it does not push the changes to physical media:
But that doesn't say anything about the visibility of the changes to other processes. There is also this alarming note:
http://man7.org/linux/man-pages/man2/close.2.html
It is not clear if this applies to close-on-exec descriptors, or what it would imply here if it does. None of that really helps though, sorry.
That's an easy enough experiment to see if it helps, though, assuming this is reproducible enough.
If a process has a file open for write access, and while the file is open the process calls fork, then there are two processes that have the file open for write access. So, yes, if the go tool is opening executables for writing in parallel with exec'ing other programs, and the go tool then wants to exec the executable that it just wrote, there is a race condition.
The usual fix is to not write EXE1 in the first place. Instead, write TMP1. Write to TMP1 as much as you like. Fork off other processes that temporarily hold TMP1 open for writing--it's OK. When you're done writing to TMP1, call rename to move TMP1 to EXE1.
Unfortunately, neither of the current fork/exec-related locks is applicable here. runtime.execLock is about calling Exec directly, to replace the entire Go process, which we're not doing. syscall.ForkLock is about making sure a child process does not inherit a temporarily not-close-on-exec fd. That is, consider:
Now P2 has a copy of fd, because the fork happened before fd was marked close-on-exec, so it didn't get marked close-on-exec in P2. The ForkLock protects this kind of sequence by keeping step 2 (the fork) from splitting the combination of steps 1+3 (creating the fd and then marking it close-on-exec). That's not what's happening here. For one thing, we're using open with O_CLOEXEC, so that 1+3 is actually just a single step 1. What's happening here is:
syscall.ForkLock guards against P2 inheriting fd non-close-on-exec. Here, P2 has inherited fd close-on-exec, and when it gets to the exec in step 6, it will close fd. But that doesn't do us any good at step 5, when we attempt the exec.

Right now we acquire ForkLock before the fork and release ForkLock after the fork, before waiting for the exec to complete. (The signal that exec has completed is that we get EOF on a close-on-exec pipe fd we passed to it.) We could potentially keep holding ForkLock until the exec completes. That would prevent the sequence above by making sure that two different fork+execs cannot be interleaved. In effect linux/amd64 already does this, because it has been changed to use vfork, which doesn't let the parent run again (and release ForkLock) until the exec completes. And that seems OK, so maybe extending ForkLock in general is OK too.
@ianlancetaylor I don't understand why rename would help. Changing the name of the directory entry pointing at the open file doesn't make it less open.
or maybe more to the point:
Bother, you're clearly right. I was thinking of a different problem. |
Why don't we have the linker handle the build ID, rather than cmd/go itself? |
Hey look! We've been here before - #3001, which turned into this very long comment 5+ years ago:
There's a specific problem: cmd/go writes a binary and then runs it, and a race in fork+exec makes that fail with ETXTBSY sometimes. The first question to answer is which problem we want to solve. Do we want to protect other Go programs from this race, or just work around it in cmd/go, like we did in #3001?

If we want to solve the general problem, then holding ForkLock until exec closes the old fds seems like the best solution, provided we're sure that seeing the pipe close means all the other close-on-exec fds are closed too.

If we want to solve the cmd/go problem only, I think it suffices to just stop calling updateBuildID for binaries we write+run+remove. The other case is binaries we install+run, but we only do that during 'go test' with an implicit install of the binary (like 'go test -cpuprofile'), and in that case the final link+updateBuildID+run is not running in parallel with any other possible fork+execs. We can also drop the cmd/cgo hack that I found, because we no longer install+run cmd/cgo.

Putting the updateBuildID into the linker would be an alternative way to solve the cmd/go problem only, but the code would have to be in gccgo's linker too, and we'd still need to keep the code in cmd/go for updating the build ID in package archives. It seems easier to have it in one place instead of three.
I think seeing the pipe close only ensures (on Linux) that lower-numbered FDs are closed, but on the implementation of |
I'd be surprised if that's a POSIX guarantee. If there were thousands of FDs to CLOEXEC and the pipe was early amongst them, I wouldn't be surprised if the reader saw EOF whilst other closes were still to happen.
I don't think it's necessary to solve the general problem. It's quite unusual for a program to write an executable. There is no problem if the executable is written by a child process, and that is exactly how most executables are written, the child process being the linker. |
I'm not sure I understand quite how you are using build IDs now, but it looks like you are creating a build ID that is a hash of the executable. gccgo's linker already has a way to do that: |
Change https://golang.org/cl/71570 mentions this issue:
I opened #22315 for the general problem. I'm less convinced that it's unnecessary to fix. It's easy to imagine an AppEngine-like service that runs code on demand by fetching a zip file, unpacking it, and running the program inside, and that would get exactly these kinds of flakes. Yes, it's not really our fault, but at the same time I wasted a day debugging this last week (and probably a similar amount of time five years ago), and there's an open Java bug report about the same problem, so people do hit this. If there's something simple we can do, we probably should do it. But let's move that discussion to #22315. I sent CL 71570 to stop writing the build ID for binaries we will run. That eliminates the race as currently being experienced by cmd/go today and should fix this issue (cmd/go flakes). |
While trybotting https://go-review.googlesource.com/c/go/+/70151, I got this flake: https://storage.googleapis.com/go-build-log/c8965fb8/openbsd-amd64-60_817d939c.log
Maybe recent cmd/go work?
/cc @rsc