FileStream.Windows.cs can freeze on Interop.Kernel32.Readfile #26850

Closed
houseofcat opened this issue Jul 17, 2018 · 11 comments

@houseofcat

Tracked down an infinite-loop situation triggered by BenchmarkDotNet.

https://github.com/dotnet/corefx/blob/015ee21f6fe1eadad478d8c9409c95cf98c2b121/src/Common/src/CoreLib/System/IO/FileStream.Windows.cs#L1197

My project, with the benchmark ready to go:
https://github.com/thyams/CookedRabbit

Opened an incident with BenchmarkDotNet and optimized their StreamReader loop:
dotnet/BenchmarkDotNet#830

The other issue is abated. Now it simply never returns from FileStream.Windows.cs line 1197, which suggests this was the underlying issue to begin with.

[debugger screenshots of the frozen call site]

Frozen on this:

00007FFA31861B1E  jmp         00007FFA31861B2D 

Corresponding line:

00007FFA31861B2D  xor         ecx,ecx  

Function Scope, Disassembly with Bytes:

  1186:             int numBytesRead = 0;
00007FFA31861AE0 55                   push        rbp  
00007FFA31861AE1 56                   push        rsi  
00007FFA31861AE2 48 83 EC 38          sub         rsp,38h  
00007FFA31861AE6 48 8D 6C 24 40       lea         rbp,[rsp+40h]  
00007FFA31861AEB 33 C0                xor         eax,eax  
00007FFA31861AED 48 89 45 E8          mov         qword ptr [rbp-18h],rax  
00007FFA31861AF1 48 8B F1             mov         rsi,rcx  
00007FFA31861AF4 48 8B CA             mov         rcx,rdx  
00007FFA31861AF7 49 8B 10             mov         rdx,qword ptr [r8]  
00007FFA31861AFA 45 8B 40 08          mov         r8d,dword ptr [r8+8]  
00007FFA31861AFE 33 C0                xor         eax,eax  
00007FFA31861B00 89 45 F0             mov         dword ptr [rbp-10h],eax  
00007FFA31861B03 48 89 55 E8          mov         qword ptr [rbp-18h],rdx  
  1189:             {
  1190:                 r = _useAsyncIO ?
  1191:                     Interop.Kernel32.ReadFile(handle, p, bytes.Length, IntPtr.Zero, overlapped) :
  1192:                     Interop.Kernel32.ReadFile(handle, p, bytes.Length, out numBytesRead, IntPtr.Zero);
00007FFA31861B07 80 7E 74 00          cmp         byte ptr [rsi+74h],0  
00007FFA31861B0B 75 13                jne         00007FFA31861B20  
00007FFA31861B0D 45 33 C9             xor         r9d,r9d  
00007FFA31861B10 4C 89 4C 24 20       mov         qword ptr [rsp+20h],r9  
00007FFA31861B15 4C 8D 4D F0          lea         r9,[rbp-10h]  
00007FFA31861B19 E8 52 B2 BD FF       call        00007FFA3143CD70  
00007FFA31861B1E EB 0D                jmp         00007FFA31861B2D  <- FROZEN HERE
00007FFA31861B20 4C 89 4C 24 20       mov         qword ptr [rsp+20h],r9  
00007FFA31861B25 45 33 C9             xor         r9d,r9d  
00007FFA31861B28 E8 53 B2 BD FF       call        00007FFA3143CD80  
00007FFA31861B2D 33 C9                xor         ecx,ecx  
00007FFA31861B2F 48 89 4D E8          mov         qword ptr [rbp-18h],rcx  
  1193:             }
  1194: 
  1195:             if (r == 0)
00007FFA31861B33 85 C0                test        eax,eax  
00007FFA31861B35 75 1C                jne         00007FFA31861B53  
00007FFA31861B37 48 8B CE             mov         rcx,rsi  
00007FFA31861B3A 33 D2                xor         edx,edx  
00007FFA31861B3C E8 BF 00 00 00       call        00007FFA31861C00  
00007FFA31861B41 48 8B 75 30          mov         rsi,qword ptr [rbp+30h]  
00007FFA31861B45 89 06                mov         dword ptr [rsi],eax  
  1198:                 return -1;
00007FFA31861B47 B8 FF FF FF FF       mov         eax,0FFFFFFFFh  
00007FFA31861B4C 48 8D 65 F8          lea         rsp,[rbp-8]  
00007FFA31861B50 5E                   pop         rsi  
00007FFA31861B51 5D                   pop         rbp  
00007FFA31861B52 C3                   ret  
  1199:             }
  1200:             else
  1201:             {
  1202:                 errorCode = 0;
00007FFA31861B53 33 C0                xor         eax,eax  
00007FFA31861B55 48 8B 75 30          mov         rsi,qword ptr [rbp+30h]  
00007FFA31861B59 89 06                mov         dword ptr [rsi],eax  
  1203:                 return numBytesRead;
00007FFA31861B5B 8B 45 F0             mov         eax,dword ptr [rbp-10h]  
00007FFA31861B5E 48 8D 65 F8          lea         rsp,[rbp-8]  
00007FFA31861B62 5E                   pop         rsi  
00007FFA31861B63 5D                   pop         rbp  
00007FFA31861B64 C3                   ret  
@danmoseley
Member

Do you actually mean an infinite loop (busy loop), i.e., the code continually returns to that jmp?
Is it using async IO?
What version of .NET Core is this on?
Also, is it possible to narrow down the repro?

@JeremyKuhne

@houseofcat
Author

houseofcat commented Jul 17, 2018

Great questions. No, the jump is frozen in place, not bouncing back and forth.
AsyncIO is false.
.NET Core 2.1.4 (SDK v2.1.302).
All Windows 10 updates (OS Build 17134.167).
Latest version of Visual Studio 2017 Professional (stable branch / non-preview).

Here is a minidump:
dotnet.zip

As far as a simpler reproduction goes... I will see what I can do.
StreamReader.ReadLine() or the StreamReader.EndOfStream getter can trigger it.

Additional information: the CPU is an Intel i7 4970K and the build is x64.

@houseofcat
Author

houseofcat commented Jul 18, 2018

Tracked it down to Process RedirectStandardInput and RedirectStandardOutput. If I had to guess, multiple dotnet hosts are writing to a central host while a file is being read from.

@adamsitnik Is my understanding correct? (Based on ConsoleHost -> SynchronousProcessOutputLoggerWithDiagnoser)

@stephentoub
Member

It's not returning from the read on the file stream because this is actually a pipe to another process and the other process still has its end of the pipe open / that process hasn't gone away... so this code reading from the pipe is just going to sit blocked waiting for something more to read or for the pipe to close.
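
To put that concretely, here is a minimal sketch of the behavior (the child executable name is just a placeholder, not anything from this issue): a synchronous read on redirected output is ultimately a ReadFile on the read end of a pipe, so it only returns when data arrives or the child closes its end.

using System;
using System.Diagnostics;

class PipeBlockDemo
{
    static void Main()
    {
        // Hypothetical child: a process that never writes to stdout and never exits.
        var psi = new ProcessStartInfo("some-long-running-tool")
        {
            RedirectStandardOutput = true, // the parent's reader is backed by an anonymous pipe
            UseShellExecute = false
        };

        using var child = Process.Start(psi);

        // Under the hood this is a ReadFile call on the pipe's read end.
        // It blocks until the child writes a line or closes its end / exits (EOF);
        // it does not throw just because no data is currently available.
        string line = child.StandardOutput.ReadLine();

        Console.WriteLine(line ?? "<EOF: child closed its end of the pipe>");
    }
}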

@houseofcat
Author

houseofcat commented Jul 18, 2018

Thanks Stephen. That makes sense, but I wonder why it only affects my library call and occasionally BenchmarkDotNet's build in Travis CI.

I will figure it out.

If a pipe is blocked, should it not throw an exception? I get one when trying to read or write a locked file handle in a stream... why not a pipe as well?

In the assembly, shouldn't the JMP hit the supposed lock check and fall back?

Anyhow, I believe the issue is the age-old RedirectStandardInput / RedirectStandardOutput deadlock from synchronous reads.

If it is the standard redirected-input/redirected-output deadlock writing to a file or a middleman, that middleman is then hung - which is what BenchmarkDotNet is here. All it needs is a flush and a close on the input side.
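
To sketch that fix (the process name and input string are placeholders, not actual BenchmarkDotNet code): close the input side so the child sees EOF, and drain the output before waiting for exit.

using System;
using System.Diagnostics;

class RedirectDeadlockSketch
{
    static void Main()
    {
        var psi = new ProcessStartInfo("child-tool")
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false
        };

        using var child = Process.Start(psi);

        child.StandardInput.WriteLine("do-work");

        // Deadlock-prone ordering: waiting for exit while the child is still blocked
        // writing to a full stdout pipe that nobody drains leaves both sides stuck.
        //
        // The fix described above: flush and close our end of stdin so the child
        // sees EOF, then drain stdout before (or while) waiting for exit.
        child.StandardInput.Flush();
        child.StandardInput.Close();          // the child's reads on stdin now return EOF

        string allOutput = child.StandardOutput.ReadToEnd(); // returns once the child closes stdout
        child.WaitForExit();

        Console.WriteLine(allOutput);
    }
}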

@stephentoub
Member

If a Pipe is blocked, should it not throw an exception?

No. The whole point of doing a read on a pipe is to wait for data to become available to read. So it'll block until either data is available or the pipe is closed.

@houseofcat
Author

houseofcat commented Jul 18, 2018

Thanks Stephen, this sounds like an implementation-related issue.

I will close this ticket. Armed with this knowledge, I should be able to resolve it now that I know for sure what I am looking for.

Would StreamReader.Peek() block?

For example, if the while loop was
while (!StreamReader.EndOfStream && StreamReader.Peek() > 0)
could it safely be reduced to just StreamReader.Peek() > 0?

The StreamReader.EndOfStream getter is the blocking call, but does Peek() block too?
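
For clarity, a sketch of the two loop shapes I am comparing (the names here are just illustrative, not the real code):

using System;
using System.IO;

static class PeekLoopSketch
{
    // Current shape: the EndOfStream getter is the call that blocks for me.
    static void DrainCurrent(StreamReader reader)
    {
        while (!reader.EndOfStream && reader.Peek() > 0)
            Console.WriteLine(reader.ReadLine());
    }

    // Proposed reduction: rely on Peek() alone.
    static void DrainReduced(StreamReader reader)
    {
        while (reader.Peek() > 0)
            Console.WriteLine(reader.ReadLine());
    }
}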

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@frank-dong-ms-zz

@houseofcat Could you please explain how you resolved this hanging issue? We hit a similar hang in our CI build.

@houseofcat
Author

houseofcat commented Mar 23, 2020

@frank-dong-ms It can be a hung process that blocks the read of the file because the preceding pipeline or file accessor holds an exclusive lock on writing. This normally doesn't occur these days, but when redirected input and output are being used, in some scenarios, it can.

In my case, I created an infinite loop that exposed a small bug in BenchmarkDotNet that never released the TestHost, so EOF was never written to the output pipeline, which froze the final output on read.

It's a read call, but Kernel32 was waiting indefinitely to read and EOF wasn't coming.

If you have some code, I'm more than happy to take a look and suggest a workaround.

Note this method is also used by ConsoleStream, so "file" is probably a bad word for me to use.

@frank-dong-ms-zz

frank-dong-ms-zz commented Mar 23, 2020

Thanks @houseofcat

Our benchmark hang only happens in CI on .NET Core 3.0 and is not reproducible locally. I got the call stack below when the benchmark hung for 20 minutes, and it looks exactly like yours:
[screenshot of the hanging call stack]

I've pasted several CI builds with hangs below; I'm not sure if you have access to them:
https://dev.azure.com/dnceng/public/_build/results?buildId=568308&view=logs&j=11c3dbcc-a5f4-5edd-335b-a8af5aa47d46&t=42ea9add-ee54-581f-d033-310ec15a7ff0
https://dev.azure.com/dnceng/public/_build/results?buildId=565081&view=logs&j=11c3dbcc-a5f4-5edd-335b-a8af5aa47d46&t=42ea9add-ee54-581f-d033-310ec15a7ff0
https://dev.azure.com/dnceng/public/_build/results?buildId=563397&view=logs&j=11c3dbcc-a5f4-5edd-335b-a8af5aa47d46&t=42ea9add-ee54-581f-d033-310ec15a7ff0

I have several questions:

  1. If BenchmarkDotNet releases the TestHost in time, will this hang go away?
  2. Regarding "If it is the standard redirected-input/redirected-output deadlock writing to a file or a middleman, that middleman is then hung... All it needs is a flush and a close on the input side" -
    does this flush and close need to be done on the user side (our end) or by BenchmarkDotNet?

  3. Is there any way to mitigate this issue from our code?

Below is the entry point of our benchmark test:
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Benchmarks.Tests/BenchmarksTest.cs

@houseofcat
Author

houseofcat commented Mar 24, 2020

1.) If BenchmarkDotNet has a test running in an unreleased host (dubbed TestHost) and that test never finishes - for programmatic reasons (i.e., a bug on the code side) - then fixing the bug allows that TestHost to finish. That will naturally release the stream (after publishing the EOF) and everything will work out. This only applies if the cause was a test that wasn't supposed to be long-running, or one that hit an infinite loop. Different machines process jobs at different rates - a CI build node may be much slower than your desktop at running code/tests - and that could factor into the behavior difference. Maybe it's not infinite, just a great deal slower than anticipated; that would usually cause a CI/CD pipeline to time out and look like the issue at hand.

2.) @stephentoub really pointed out that my reproduction code examples were trash. I dug and dug into the real code, optimizing where I could. What really made a difference was fixing my hot infinite loop, but what made the problem much more noticeable was over-encumbering the TaskScheduler with too many async Tasks that had hot body loops. In some cases it worked; in others it essentially never got past the first or second task. This is a common mistake for rookies and experts alike. Sometimes invoking await Task.Yield(); before a very hot, fast body loop lets the work be scheduled properly and run from a threadpool thread in the background (see the sketch after these points). Not knowing that greatly exacerbated my debugging experience. Before I learned about Yield(), simply throttling my hot loops with await Task.Delay(1); inside the iterations prevented the behavior, though a forced delay is a terrible solution for benchmarks. You mentioned again that this doesn't happen locally, and as in 1.), it's possible the CI/CD build machine is much weaker than your local machine. If I could describe my testing process back then - it was far too complicated, and I should have been testing smaller chunks of code. I was trying to get my C# code to beat Golang in throughput to my local RabbitMQ; I eventually did, but with a redesign.

3.) I would have to deep-dive into your code and see what comes up. I would reason through it first and, assuming it's related to .NET Core's behavior, isolate what is different code-wise with .NET Core. Is it possible to "dockerize" the CI/CD environment so you can run your code manually against Docker locally, where you can attach a debugger?
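
To illustrate the Yield pattern from 2.), here is a minimal sketch (the loop body is a stand-in for whatever hot work is being done; none of this is the actual benchmark code):

using System;
using System.Threading.Tasks;

class HotLoopSketch
{
    static async Task ProduceAsync()
    {
        // An async method with a hot synchronous body can run entirely on the
        // calling thread and starve other work. Awaiting Task.Yield() posts the
        // rest of the method to the scheduler (typically a threadpool thread)
        // before the loop starts spinning.
        await Task.Yield();

        for (int i = 0; i < 1_000_000; i++)
        {
            _ = i * i; // hot body: stand-in for the real work
        }
    }

    static async Task Main()
    {
        // The caller can fan out several of these without the first one
        // monopolizing the thread that started them.
        await Task.WhenAll(ProduceAsync(), ProduceAsync(), ProduceAsync());
        Console.WriteLine("done");
    }
}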

You could try moving to .NET Core 3.1 and see if there is a code adjustment or minor hotfix that solves your issue; I think SDK 3.1.2/200 are the latest versions.

We could look at the MSIL from the machine that built the code to run the benchmark and compare it to local. This could provide some insight into the behavior differences and may be the most beneficial step.

Looking at your suite of tests, I don't think 1 or 2 really apply. I apologize I couldn't be more helpful. If I had to guess, the Iris test is frozen awaiting data inputs or writing data inputs to the pipeline.

Here is a similar issue:
#28583

@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020