Revert test directory cleaning change. #7042

ehuss · 2019-06-18T21:17:59Z

#6900 changed it so that the entire cit directory was cleaned once when tests started. Previously, each t# directory was deleted just before each test ran. This restores the old behavior due to problems on Windows.

The problem is that the call to rm_rf would fail with various errors ("Not found", "directory not empty", etc.) if you run cargo test twice. The first panic would poison the lazy static initializer, causing all subsequent tests to fail.

There are a variety of reasons deleting a file on Windows is difficult. My hypothesis in this case is that services like the indexing service and Defender swoop in and temporarily hold handles to files. This seems to be worse on slower systems, where presumably these services take longer to process all the files created by the test suite. It may also be related to how files are "marked for deletion" but are not immediately deleted.

The solution here is to spread out the deletion over time, giving Windows more of an opportunity to release its handles. This is a poor solution, and should only help reduce the frequency, but not entirely fix it.

I believe that this cannot be solved using DeleteFileW. There are more details at rust-lang/rust#29497, which is a long-standing problem that there are no good Rust implementations for recursively deleting a directory.

An example of something that implements a "safe" delete is Cygwin's unlink implementation. As you can see, it is quite complex. Of course our use case does not need to handle quite as many edge cases, but I think any implementation is going to be nontrivial, and require Windows-specific APIs not available in std.

Note: Even before #6900 I still get a lot of errors on a slow VM (particularly "directory not empty"), with Defender and Indexing off. I'm not sure why. This PR should make it more bearable, though.

rust-highfive · 2019-06-18T21:18:10Z

r? @nrc

(rust_highfive has picked a reviewer for you, use r? to override)

Eh2406 · 2019-06-19T14:45:04Z

Thank you very much for working on this! I have 2 random thoughts,

I know that @rbtcollins has been doing great work to make rustup's IO faster on windows, I wonder if we ~~can steal some ideas~~ work with them to make this faster.
I have heard a rumor that there is a programmatic way to turn Indexing off on a per directory bases. If that rumor is true, then we should do that for cit. (If it works well there we should consider doing it for some or all of each profile directory and some or all of ~/cargo)

alexcrichton · 2019-06-19T16:22:55Z

@bors: r+

bors · 2019-06-19T16:22:56Z

📌 Commit 0082290 has been approved by alexcrichton

bors · 2019-06-19T16:23:04Z

⌛ Testing commit 0082290 with merge 2ee292f...

Revert test directory cleaning change. #6900 changed it so that the entire `cit` directory was cleaned once when tests started. Previously, each `t#` directory was deleted just before each test ran. This restores the old behavior due to problems on Windows. The problem is that the call to `rm_rf` would fail with various errors ("Not found", "directory not empty", etc.) if you run `cargo test` twice. The first panic would poison the lazy static initializer, causing all subsequent tests to fail. There are a variety of reasons deleting a file on Windows is difficult. My hypothesis in this case is that services like the indexing service and Defender swoop in and temporarily hold handles to files. This seems to be worse on slower systems, where presumably these services take longer to process all the files created by the test suite. It may also be related to how files are "marked for deletion" but are not immediately deleted. The solution here is to spread out the deletion over time, giving Windows more of an opportunity to release its handles. This is a poor solution, and should only help reduce the frequency, but not entirely fix it. I believe that this cannot be solved using `DeleteFileW`. There are more details at rust-lang/rust#29497, which is a long-standing problem that there are no good Rust implementations for recursively deleting a directory. An example of something that implements a "safe" delete is [Cygwin's unlink implementation](https://github.com/cygwin/cygwin/blob/ad101bcb0f55f0eb1a9f60187f949c3decd855e4/winsup/cygwin/syscalls.cc#L675-L1064). As you can see, it is quite complex. Of course our use case does not need to handle quite as many edge cases, but I think any implementation is going to be nontrivial, and require Windows-specific APIs not available in std. Note: Even before #6900 I still get a lot of errors on a slow VM (particularly "directory not empty"), with Defender and Indexing off. I'm not sure why. This PR should make it more bearable, though.

ehuss · 2019-06-19T16:45:23Z

@Eh2406 It would be nice to have a Windows expert help. There is code to disable backups of target on macos. Better support like that on windows would be good.

bors · 2019-06-19T16:46:34Z

☀️ Test successful - checks-travis, status-appveyor
Approved by: alexcrichton
Pushing 2ee292f to master...

rbtcollins · 2019-06-19T21:23:09Z

Hi, Indexing service does indeed honour FILE_ATTRIBUTE_NOT_CONTENT_INDEXED. I'm not sure if it will eliminate handle races from it because the hook will still need to examine the file to determine not to index it... but it will reduce system load, OTOH IndexService self throttles under high load anyway...

Defender or other AV systems also do hold handles on the files at various times; we haven't spotted a Defender handle on windows 10 so far - it seems to do its access all in-line in IOPS from our processes, but Symantec and other scanner seem to actively race.

This approach should be robust - rust-lang/rustup#1873 - its obviously not suitable for a stdlib API call, but should be fine for cargo or other higher layer crates. - in short, spin for some short period. The 28 seconds we use here is because we have long lived proxy processes; for the cargo test suite I'd suggest a spin period of no more than 50ms or so.

If you can reproduce the problem, running under WPA or procmon and getting insight into what process is holding handles would perhaps allow more designs to be proposed for dealing with the issue.

If, for instance, it is defender then in CI jobs, powershell can be used to programmatically disable Defender for a directory; this isn't safe to do for development machines which can be targetted, but for a one-off random directory in a cloud instance its pretty safe.

Grab me on discord in the wg-rustup channel if you want to give me more context, I'm happy to help.

ehuss · 2019-06-20T15:54:52Z

Last time I investigated this about a year ago (#5481 (comment)), all I was able to determine is that one of the service processes had a handle open, and I have no idea how to tell which service was responsible. It also included the "system" process, which was not helpful.

It doesn't really matter, the rm code should be resilient to this situation. For example, looking at the cygwin implementation, I think when it fails it moves it to the trash bin instead. Perhaps something like that is the solution, though I'm not sure if it's a good solution.

Also, from that same issue (5481) there are some really tricky problems. Deleting an executable can succeed, but leave a phantom behind (such that path.exists() is true immediately after a successful unlink call, and you cannot reuse the same filename for a few milliseconds). Executables (that were recently created and ran) are somehow special, and the cargo test suite creates a lot of them very rapidly.

The jitter code also sounds like a good idea, but I don't really have time to implement it right now.

@Eh2406 can you let me know if you run into problems after this PR? I'll make it a higher priority if it is still bugging you.

Eh2406 · 2019-06-20T16:47:50Z

#7011 means that I am not running into this much these days. I would not prioritize fixing the test sweet.

rbtcollins · 2019-06-20T19:20:49Z

@ehuss WPA or procmon would give much more detailed diagnostics about handles; your restart manager API technique was clever but too indirect I think.

https://github.com/cygwin/cygwin/blob/ad101bcb0f55f0eb1a9f60187f949c3decd855e4/winsup/cygwin/syscalls.cc#L930 is the rm -r mitigation with https://github.com/cygwin/cygwin/blob/ad101bcb0f55f0eb1a9f60187f949c3decd855e4/winsup/cygwin/syscalls.cc#L636 being the referenced comment. I can't shed much light on those techniques (they post date my active involvement in cygwin.

The moving to the recycle bin is a reasonable approach when direct unlinking doesn't work for test suites, but you would need the full code path - folk will run the test suite on network drives and so forth, where that doesn't work.

My point about about gathering data is that making rm resilient to the situation depends on knowing what actually is happening. If cargo is encountering this well known WTF mode where unlinking a directory returns before the unlink actually happens, then its pretty straight forward to mitigate - that appears to be very robust with the move-then-unlink path in rustup. If it is a handle collision with a system utility (index service / virus scanner etc), then a jitter based retry can mitigate that.

One could of course just through several techniques at it and hope :). Anyhow - my offer is open, let me know if you need help.

Update cargo 17 commits in 807429e1b6da4e2ec52488ef2f59e77068c31e1f..4c1fa54d10f58d69ac9ff55be68e1b1c25ecb816 2019-06-11 14:06:10 +0000 to 2019-06-24 11:24:18 +0000 - Fix typo in comment (rust-lang/cargo#7066) - travis: enforce formatting of subcrates as well (rust-lang/cargo#7063) - _cargo: Make function style consistent (rust-lang/cargo#7060) - Update some fix comments. (rust-lang/cargo#7061) - Stabilize default-run (rust-lang/cargo#7056) - Fix typo in comment (rust-lang/cargo#7054) - fix(fingerpring): do not touch intermediate artifacts (rust-lang/cargo#7050) - Resolver test/debug cleanup (rust-lang/cargo#7045) - Rename to_url → into_url (rust-lang/cargo#7048) - Remove needless lifetimes (rust-lang/cargo#7047) - Revert test directory cleaning change. (rust-lang/cargo#7042) - cargo book /reference/manifest: fix typo (rust-lang/cargo#7041) - Extract resolver tests to their own crate (rust-lang/cargo#7011) - ci: Do not install addons on rustfmt build jobs (rust-lang/cargo#7038) - Support absolute paths in dep-info files (rust-lang/cargo#7030) - ci: Run cargo fmt on all workspaces (rust-lang/cargo#7033) - Deprecated ONCE_INIT in favor of Once::new() (rust-lang/cargo#7031)

Remove remove_dir_all This removes the `remove_dir_all` dependency. It is no longer needed, as there has been significant improvements to std's implementation over the years (particularly recently). This improves the performance of running cargo's testsuite on my machine from 8m to 2m 30s (!!). This was added in #7137 to deal with some issues with symbolic links. I believe those seem to be resolved now. Note that std's implementation is still not bullet proof. In particular, I think there are still some cases where a file can be locked by another process which prevents it from being removed. There are some more notes about this at #7042 (comment) (and various other places linked there). Closes #9585

Revert test directory cleaning change.

0082290

rust-highfive assigned nrc Jun 18, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 18, 2019

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 19, 2019

bors merged commit 0082290 into rust-lang:master Jun 19, 2019

ehuss mentioned this pull request Jun 24, 2019

Update cargo rust-lang/rust#62100

Merged

ehuss mentioned this pull request Jan 30, 2020

fs::remove_dir_all rarely succeeds for large directories on windows rust-lang/rust#29497

Open

ehuss mentioned this pull request Jun 15, 2021

windows: testsuite is unreasonably slow #9585

Closed

ehuss mentioned this pull request Jul 16, 2021

cargo_test: remove test roots after success #9701

Closed

ehuss added this to the 1.37.0 milestone Feb 6, 2022

ehuss mentioned this pull request Nov 4, 2022

Remove remove_dir_all #11333

Merged

ehuss mentioned this pull request Nov 5, 2022

Reduce cargo's testsuite overhead #11341

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert test directory cleaning change. #7042

Revert test directory cleaning change. #7042

ehuss commented Jun 18, 2019

rust-highfive commented Jun 18, 2019

Eh2406 commented Jun 19, 2019

alexcrichton commented Jun 19, 2019

bors commented Jun 19, 2019

bors commented Jun 19, 2019

ehuss commented Jun 19, 2019

bors commented Jun 19, 2019

rbtcollins commented Jun 19, 2019

ehuss commented Jun 20, 2019

Eh2406 commented Jun 20, 2019

rbtcollins commented Jun 20, 2019

Revert test directory cleaning change. #7042

Revert test directory cleaning change. #7042

Conversation

ehuss commented Jun 18, 2019

rust-highfive commented Jun 18, 2019

Eh2406 commented Jun 19, 2019

alexcrichton commented Jun 19, 2019

bors commented Jun 19, 2019

bors commented Jun 19, 2019

ehuss commented Jun 19, 2019

bors commented Jun 19, 2019

rbtcollins commented Jun 19, 2019

ehuss commented Jun 20, 2019

Eh2406 commented Jun 20, 2019

rbtcollins commented Jun 20, 2019