[tf] AtomicRenameUtil - add short retry period on windows #2570
base: dev
Filed as internal issue #USD-8565
Would love to get this merged... it continues to cause problems when testing stock Windows builds in our CI.
Hey Paul, looking at this one today. Oooof... this Windows behavior is horrid! Is the lock grabbed on the temporary file that we've written to after we close it but before we rename it over? Is there some way we can prevent that? I guess I'm just trying to see if there's some better way to do this b/c having to sit and retry for 300 msec (!) feels icky and unreliable.
Oh I see from reading the test that it's not the temp file, it's the actual output file. Does this mean we should do this kind of retrying every time we try to create a file on Windows?
ebdd58f to f0ebdea
So - I think this isn't a problem so much on initial file creation as it is when you then try to replace or remove a file shortly after creation. i.e., I think what happens is this:
As I've implemented it, though, I'm not doing any sort of check to see if the file already exists before applying the retry logic... so it's doing the retry loop on both the initial creation and the later attempt to overwrite. Also - when I was writing the test, I was thinking in terms of the "final" file being the problem... but I suppose the temp file could also be causing issues: i.e., we create the temp file, the virus checker grabs a lock on it, and then when we try to do a move operation on it, it fails. The retry logic is on a move operation, so we don't really know if the root of the problem was the deletion of the old (temp) half of it, or the rewriting of the new (final) file half. But to the larger point... yes, this would be a potential problem on any Windows write or delete style operation, I think. I tackled it in
Thanks for the info Paul -- are you saying that just running the shipped USD tests on Windows is failing for you like this? We have Windows CIs and don't see these kinds of failures. I also ran this by Mark T. at SideFX b/c I know he has a lot of Windows experience, and he says he hasn't seen behavior like this before either. Do you have a sense of which virus checker is causing this for you?
Yeah, the standard tests fail... not at a high rate, but high enough to be noticeable and annoying. Unfortunately I never dedicated time to get exact figures, but at a rough guess I'd say about 10% of the complete test runs on machines that have the virus checker enabled would have at least one test fail due to something that looked like this. As I understand it, the only virus checker we have enabled on those test machines is Windows Defender. Not sure if the Windows indexing service is enabled or not.
Also - a bit of a chagrined mea culpa - while looking at the PR again I noticed that my modified
Cool -- thanks for all the info! And yeah I think we have a task to remove the file access checks on Windows too -- I know the SideFX folks delete those too. I will check to see if we have Windows Defender on in our CI setup. I'd love to repro this ourselves if possible.
The thing I keep coming back to here is the question of why this would be a "USD problem", if the explanation behind the source of the problem is correct. Why don't the unit tests of any other library have this problem? I'm fairly certain (having spent a little time ten years ago writing Windows kernel device drivers) that a properly written device driver of the sort anti-virus software uses will not cause this kind of problem. They either delay the operation until they are done with the file, or they instantly release their lock on the file so the requested operation can complete. If this weren't the case, every library and application on Windows would face this problem, but I'm not aware of any other software that implements this kind of "retry loop" when writing files... One complicating factor might be if this only happens on network file systems, which are generally problematic on Windows, and especially so if there are multiple machines with this drive/directory mounted. Maybe another machine's antivirus is scanning the new/temp file and getting in the way?
@marktucker - some very good questions! I'm relatively new to Windows development, so I thought perhaps this sort of thing was common. My other thought was that perhaps most other companies simply don't have virus checking enabled on CI test runners, as perhaps it was just "known" to create problems. If neither of those is true, then it does indeed make our situation a bit more singular. I should also note that I haven't definitively placed the blame for this behavior on Virus Checking / Windows Defender - I wanted to do some A/B tests with some runners that had Defender disabled (but were otherwise identical to machines that typically display this problem), but I couldn't get support from the IT teams that handle our CI. Here's what I know:
So "Windows Defender is the culprit" is my working hypothesis, but it's certainly possible there's some other pertinent difference in the runners which is causing this difference in behavior. If people are reluctant to implement this sort of change without some more definitive information, I can try leaning on our IT team again to see if I can get some help for testing.
@marktucker - Haven't had a chance to do more digging on our end, but while doing an unrelated task, I stumbled across this bit of code in CMake, which at least indicates we're not the only people having to retry when using https://github.com/Kitware/CMake/blob/master/Source/cmSystemTools.cxx#L1234:
https://github.com/Kitware/CMake/blob/master/Source/cmSystemTools.cxx#L1252
That made me curious, and I did some quick searching on GitHub - and found several reputable software packages that are doing similar retry loops. e.g., here's one in Go: https://github.com/golang/go/blob/master/src/testing/testing.go#L1254
...and another in elastic:
Thanks @pmolodo. Sorry for the delayed response, but all of that is pretty compelling. Sometimes Microsoft does not make it easy to defend (pun intended) them :)
cade3bc to f49ca5c
f49ca5c to 074d772
074d772 to 601a06d
On windows, things like virus checkers and indexing service can often grab a handle to newly-created files; they usually release them fairly quickly, though, so just need a short retry period Closes: PixarAnimationStudios#2141
601a06d to 6551d28
…ETRIES and TF_FILE_LOCK_RETRY_WAIT_MS,
1a51396 to 070245f
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Description of Change(s)
On Windows, things like virus checkers and the indexing service can often grab a handle to newly created files; they usually release them fairly quickly, though, so we just need a short retry period.
Fixes Issue(s)