Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tf] AtomicRenameUtil - add short retry period on windows #2570

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

pmolodo
Copy link
Contributor

@pmolodo pmolodo commented Aug 7, 2023

Description of Change(s)

On windows, things like virus checkers and indexing service can often grab a handle to newly-created files; they usually release them fairly quickly, though, so just need a short retry period

Fixes Issue(s)

  • I have verified that all unit tests pass with the proposed changes
  • I have submitted a signed Contributor License Agreement

@jesschimein
Copy link
Contributor

Filed as internal issue #USD-8565

@pmolodo
Copy link
Contributor Author

pmolodo commented Oct 17, 2023

Would love to get this merged... continues to cause problems when testing stock windows builds in our CI.

@gitamohr
Copy link
Contributor

gitamohr commented Nov 7, 2023

Hey Paul, looking at this one today. Oooof... this Windows behavior is horrid! Is the lock grabbed on the temporary file that we've written to after we close it but before we rename it over? Is there some way we can prevent that? I guess I'm just trying to see if there's some better way to do this b/c having to sit and retry for 300 msec (!) feels icky and unreliable.

@gitamohr
Copy link
Contributor

gitamohr commented Nov 7, 2023

Oh I see from reading the test that it's not the temp file, it's the actual output file. Does this mean we should do this kind of retrying always everytime we try to create a file on Windows?

@pmolodo
Copy link
Contributor Author

pmolodo commented Nov 7, 2023

Oh I see from reading the test that it's not the temp file, it's the actual output file. Does this mean we should do this kind of retrying always everytime we try to create a file on Windows?

So - I think this isn't a problem so much on initial file creation, as it is when you try to then to replace or remove a file shortly after creation.

ie, i think what happens is this:

  • Some unittest creates test.usda
  • This generally goes through ok, even without my changes
  • The virus checker or indexer or whatever notices a new file was created, starts inspecting it
  • The unittest then does some more stuff, saves over test.usda with new content
  • This write fails because the virus checker still has a lock on the original file

As I've implemented, though, I'm not doing any sort of checks to see if the file already exists before applying retrying logic... so it's doing the retry loop on both the initial creation and later attempt to overwrite.

Also - when I was writing the test, I was thinking in terms of the "final" file being the problem... but I suppose the temp file could also be causing issues: ie, we create the temp file, virus checker grabs a lock on it, then when we try to do a move operation on it, and it fails. The retry logic is on a move operation, so we don't really know if the root of the problem was the deletion of the old (temp) half of it, or the rewriting of the new (final) file half.

But to the larger point... yes, this would be a potential problem on any windows write or delete style operation, I think. I tackled it in atomicRenameUtil because that's what was popping up in my tracebacks when I was doing spot checks, so practically speaking, I think it gets "most" of the use cases?

@gitamohr
Copy link
Contributor

gitamohr commented Nov 7, 2023

Thanks for the info Paul -- are you saying that just running the shipped USD tests on Windows are failing for you like this? We have Windows CIs and don't see these kinds of failures. I also ran this by Mark T. at SideFX b/c I know he has a lot of Windows experience, and he says he hasn't seen behavior like this before either. Do you have a sense of which virus checker is causing this for you?

@pmolodo
Copy link
Contributor Author

pmolodo commented Nov 7, 2023

Yeah, the standard tests fail... not at a high rate, but high enough to be noticeable and annoying. Unfortunately never dedicated time to get exact figures, but at a rough guess I'd say about 10% of the complete test runs on machines that have the virus checker enabled would have at least one test fail due to something that looked like this.

As I understand it, the only virus checker we have enabled on those test machines is Windows Defender. Not sure if the windows indexing service is enabled or not.

@pmolodo
Copy link
Contributor Author

pmolodo commented Nov 7, 2023

Also - a bit of a chagrined mea culpa - while looking at the PR again I noticed that my modified ArchFileAccess wasn't always returning a value! I pushed a fix... but I'm surprised that this didn't show up earlier, because we've been using this code ourselves for a while now.
My best guess is that the undefined behavior in this particular case was resulting in it always returning 0, which meant that the file-access pre-checks always passed. Which, given some of the issues here, may not have been the worst side effect...
Still, apologies that that slipped through!

@gitamohr
Copy link
Contributor

gitamohr commented Nov 8, 2023

Cool -- thanks for all the info! And yeah I think we have a task to remove the file access checks on Windows too -- I know the SideFX folks delete those too. I will check to see if we have Windows Defender on in our CI setup. I'd love to repro this ourselves if possible.

@marktucker
Copy link
Contributor

The thing I keep coming back to here is the question of why this would be a "USD problem", if the explanation behind the source of the problem is correct? Why don't the unit tests of any other library have this problem? I'm fairly certain (having spent a little time ten years ago writing Windows kernel device drivers) that a properly written device driver of the sort anti-virus software uses will not cause this kind of problem. They either delay the operation until they are done with the file, or they instantly release their lock on the file so the requested operation can complete. If this wasn't the case, every library and application on Windows would face this problem, but I'm not aware of any other software that implements this kind of "retry loop" when writing files...

One complicating factor might be if this only happens on network file systems, which are generally problematic on Windows. But especially so if there are multiple machines with this drive/directory mounted? Maybe another machine's antivirus is scanning the new/temp file and getting in the way?

@pmolodo
Copy link
Contributor Author

pmolodo commented Nov 8, 2023

@marktucker - some very good questions! I'm relatively new to windows development, so I thought perhaps this sort of thing was common. My other thought was that perhaps most other companies simply don't have virus checking enabled on CI test runners, as perhaps it was just "known" to create problems.

If neither of those are true, then it does indeed make our situation a bit more singular.

I should also note that I haven't definitively placed the blame for this behavior on Virus Checking / Windows Defender - I wanted to do some A/B tests with some runners that had Defender disabled (but were otherwise identical to machines that typically display this problem), but I couldn't get support from the IT teams that handle our CI.

Here's what I know:

  • these write errors seem to happen happen on most of our locally-managed test runner groups
  • these test runner groups have Windows Defender enabled
  • they don't happen on any our AWS runner groups
  • which don't have Windows Defender
  • adding these retries reduced incidence of these issues

So "Windows Defender is the culprit" is my working hypothesis, but it's certainly possible there's some other pertinent difference in the runners which is causing this difference in behavior. If people are reluctant to implement these sort of changes without some more definitive information, I can try leaning on our IT team again to see if I can get some help for testing.

@pmolodo
Copy link
Contributor Author

pmolodo commented Dec 7, 2023

@marktucker - Haven't had a chance to do more digging on our end, but while doing an unrelated task, I stumbled across this bit of code in CMake, which at least indicates we're not the only people having to retry when using MoveFileEx:

https://github.com/Kitware/CMake/blob/master/Source/cmSystemTools.cxx#L1234:

 Try multiple times since we may be racing against another process
 creating/opening the destination file just before our MoveFileEx.  */

https://github.com/Kitware/CMake/blob/master/Source/cmSystemTools.cxx#L1252

// Try again only if failure was due to access/sharing permissions.
// Most often ERROR_ACCESS_DENIED (a.k.a. I/O error) for a directory, and
// ERROR_SHARING_VIOLATION for a file, are caused by one of the following:
// 1) Anti-Virus Software
// 2) Windows Search Indexer
// 3) Windows Explorer has an associated directory already opened.

@pmolodo
Copy link
Contributor Author

pmolodo commented Dec 7, 2023

That made me curious, and I did some quick searching on github - and found several reputable software packages that are doing similar retry loops.

ie, here's one in golang:

https://github.com/golang/go/blob/master/src/testing/testing.go#L1254

// removeAll is like os.RemoveAll, but retries Windows "Access is denied."
// errors up to an arbitrary timeout.

...and another in elastic:

Increase the time between RemovePath retries by AndersonQ · Pull Request #3349 · elastic/elastic-agent

@marktucker
Copy link
Contributor

Thanks @pmolodo. Sorry for the delayed response, but all of that is pretty compelling. Sometimes Microsoft does not make it easy to defend (pun intended) them :)

@pmolodo pmolodo force-pushed the pr/atomicRenameUtil-retry branch 2 times, most recently from cade3bc to f49ca5c Compare May 2, 2024 21:46
@pmolodo pmolodo force-pushed the pr/atomicRenameUtil-retry branch from f49ca5c to 074d772 Compare May 16, 2024 22:18
On windows, things like virus checkers and indexing service can often grab
a handle to newly-created files; they usually release them fairly quickly,
though, so just need a short retry period

Closes:
PixarAnimationStudios#2141
@jesschimein
Copy link
Contributor

/AzurePipelines run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants