-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert test directory cleaning change. #7042
Conversation
r? @nrc (rust_highfive has picked a reviewer for you, use r? to override) |
Thank you very much for working on this! I have 2 random thoughts,
|
@bors: r+ |
📌 Commit 0082290 has been approved by |
Revert test directory cleaning change. #6900 changed it so that the entire `cit` directory was cleaned once when tests started. Previously, each `t#` directory was deleted just before each test ran. This restores the old behavior due to problems on Windows. The problem is that the call to `rm_rf` would fail with various errors ("Not found", "directory not empty", etc.) if you run `cargo test` twice. The first panic would poison the lazy static initializer, causing all subsequent tests to fail. There are a variety of reasons deleting a file on Windows is difficult. My hypothesis in this case is that services like the indexing service and Defender swoop in and temporarily hold handles to files. This seems to be worse on slower systems, where presumably these services take longer to process all the files created by the test suite. It may also be related to how files are "marked for deletion" but are not immediately deleted. The solution here is to spread out the deletion over time, giving Windows more of an opportunity to release its handles. This is a poor solution, and should only help reduce the frequency, but not entirely fix it. I believe that this cannot be solved using `DeleteFileW`. There are more details at rust-lang/rust#29497, which is a long-standing problem that there are no good Rust implementations for recursively deleting a directory. An example of something that implements a "safe" delete is [Cygwin's unlink implementation](https://github.com/cygwin/cygwin/blob/ad101bcb0f55f0eb1a9f60187f949c3decd855e4/winsup/cygwin/syscalls.cc#L675-L1064). As you can see, it is quite complex. Of course our use case does not need to handle quite as many edge cases, but I think any implementation is going to be nontrivial, and require Windows-specific APIs not available in std. Note: Even before #6900 I still get a lot of errors on a slow VM (particularly "directory not empty"), with Defender and Indexing off. I'm not sure why. This PR should make it more bearable, though.
@Eh2406 It would be nice to have a Windows expert help. There is code to disable backups of |
☀️ Test successful - checks-travis, status-appveyor |
Hi, Indexing service does indeed honour FILE_ATTRIBUTE_NOT_CONTENT_INDEXED. I'm not sure if it will eliminate handle races from it because the hook will still need to examine the file to determine not to index it... but it will reduce system load, OTOH IndexService self throttles under high load anyway... Defender or other AV systems also do hold handles on the files at various times; we haven't spotted a Defender handle on windows 10 so far - it seems to do its access all in-line in IOPS from our processes, but Symantec and other scanner seem to actively race. This approach should be robust - rust-lang/rustup#1873 - its obviously not suitable for a stdlib API call, but should be fine for cargo or other higher layer crates. - in short, spin for some short period. The 28 seconds we use here is because we have long lived proxy processes; for the cargo test suite I'd suggest a spin period of no more than 50ms or so. If you can reproduce the problem, running under WPA or procmon and getting insight into what process is holding handles would perhaps allow more designs to be proposed for dealing with the issue. If, for instance, it is defender then in CI jobs, powershell can be used to programmatically disable Defender for a directory; this isn't safe to do for development machines which can be targetted, but for a one-off random directory in a cloud instance its pretty safe. Grab me on discord in the wg-rustup channel if you want to give me more context, I'm happy to help. |
Last time I investigated this about a year ago (#5481 (comment)), all I was able to determine is that one of the service processes had a handle open, and I have no idea how to tell which service was responsible. It also included the "system" process, which was not helpful. It doesn't really matter, the Also, from that same issue (5481) there are some really tricky problems. Deleting an executable can succeed, but leave a phantom behind (such that The jitter code also sounds like a good idea, but I don't really have time to implement it right now. @Eh2406 can you let me know if you run into problems after this PR? I'll make it a higher priority if it is still bugging you. |
#7011 means that I am not running into this much these days. I would not prioritize fixing the test sweet. |
@ehuss WPA or procmon would give much more detailed diagnostics about handles; your restart manager API technique was clever but too indirect I think. https://github.com/cygwin/cygwin/blob/ad101bcb0f55f0eb1a9f60187f949c3decd855e4/winsup/cygwin/syscalls.cc#L930 is the rm -r mitigation with https://github.com/cygwin/cygwin/blob/ad101bcb0f55f0eb1a9f60187f949c3decd855e4/winsup/cygwin/syscalls.cc#L636 being the referenced comment. I can't shed much light on those techniques (they post date my active involvement in cygwin. The moving to the recycle bin is a reasonable approach when direct unlinking doesn't work for test suites, but you would need the full code path - folk will run the test suite on network drives and so forth, where that doesn't work. My point about about gathering data is that making rm resilient to the situation depends on knowing what actually is happening. If cargo is encountering this well known WTF mode where unlinking a directory returns before the unlink actually happens, then its pretty straight forward to mitigate - that appears to be very robust with the move-then-unlink path in rustup. If it is a handle collision with a system utility (index service / virus scanner etc), then a jitter based retry can mitigate that. One could of course just through several techniques at it and hope :). Anyhow - my offer is open, let me know if you need help. |
Update cargo 17 commits in 807429e1b6da4e2ec52488ef2f59e77068c31e1f..4c1fa54d10f58d69ac9ff55be68e1b1c25ecb816 2019-06-11 14:06:10 +0000 to 2019-06-24 11:24:18 +0000 - Fix typo in comment (rust-lang/cargo#7066) - travis: enforce formatting of subcrates as well (rust-lang/cargo#7063) - _cargo: Make function style consistent (rust-lang/cargo#7060) - Update some fix comments. (rust-lang/cargo#7061) - Stabilize default-run (rust-lang/cargo#7056) - Fix typo in comment (rust-lang/cargo#7054) - fix(fingerpring): do not touch intermediate artifacts (rust-lang/cargo#7050) - Resolver test/debug cleanup (rust-lang/cargo#7045) - Rename to_url → into_url (rust-lang/cargo#7048) - Remove needless lifetimes (rust-lang/cargo#7047) - Revert test directory cleaning change. (rust-lang/cargo#7042) - cargo book /reference/manifest: fix typo (rust-lang/cargo#7041) - Extract resolver tests to their own crate (rust-lang/cargo#7011) - ci: Do not install addons on rustfmt build jobs (rust-lang/cargo#7038) - Support absolute paths in dep-info files (rust-lang/cargo#7030) - ci: Run cargo fmt on all workspaces (rust-lang/cargo#7033) - Deprecated ONCE_INIT in favor of Once::new() (rust-lang/cargo#7031)
Remove remove_dir_all This removes the `remove_dir_all` dependency. It is no longer needed, as there has been significant improvements to std's implementation over the years (particularly recently). This improves the performance of running cargo's testsuite on my machine from 8m to 2m 30s (!!). This was added in #7137 to deal with some issues with symbolic links. I believe those seem to be resolved now. Note that std's implementation is still not bullet proof. In particular, I think there are still some cases where a file can be locked by another process which prevents it from being removed. There are some more notes about this at #7042 (comment) (and various other places linked there). Closes #9585
#6900 changed it so that the entire
cit
directory was cleaned once when tests started. Previously, eacht#
directory was deleted just before each test ran. This restores the old behavior due to problems on Windows.The problem is that the call to
rm_rf
would fail with various errors ("Not found", "directory not empty", etc.) if you runcargo test
twice. The first panic would poison the lazy static initializer, causing all subsequent tests to fail.There are a variety of reasons deleting a file on Windows is difficult. My hypothesis in this case is that services like the indexing service and Defender swoop in and temporarily hold handles to files. This seems to be worse on slower systems, where presumably these services take longer to process all the files created by the test suite. It may also be related to how files are "marked for deletion" but are not immediately deleted.
The solution here is to spread out the deletion over time, giving Windows more of an opportunity to release its handles. This is a poor solution, and should only help reduce the frequency, but not entirely fix it.
I believe that this cannot be solved using
DeleteFileW
. There are more details at rust-lang/rust#29497, which is a long-standing problem that there are no good Rust implementations for recursively deleting a directory.An example of something that implements a "safe" delete is Cygwin's unlink implementation. As you can see, it is quite complex. Of course our use case does not need to handle quite as many edge cases, but I think any implementation is going to be nontrivial, and require Windows-specific APIs not available in std.
Note: Even before #6900 I still get a lot of errors on a slow VM (particularly "directory not empty"), with Defender and Indexing off. I'm not sure why. This PR should make it more bearable, though.