Mutex.TryOpenExisting intermittently throws IOException #76736

Closed
rokonec opened this issue Oct 7, 2022 · 16 comments

Comments

@rokonec
Member

rokonec commented Oct 7, 2022

Description

After introducing the .NET 7 RC1 SDK into the Runtime CI, we started seeing intermittent exceptions, System.IO.IOException: Connection timed out : 'Global\msbuild-server-launch-{45_random-chars}', in Runtime and Arcade on Linux CI agents and in Docker-based Linux builds.

Reproduction Steps

I tried hard to create a minimal repro, without success.
I believe the easiest repro would be to rerun some of the CI builds where it was seen:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=31675&view=logs&j=3fe1f0d5-61d6-5e8f-eead-4d3bcfb9dfc3&t=380e8ab7-dd79-5cad-d265-1eca160e9b82&s=526c4a30-42a9-575e-a58b-243c7c515350

https://dev.azure.com/dnceng-public/public/_build/results?buildId=41146&view=logs&j=190ad6c8-5950-568c-cadd-f2dfb7d5a79f&t=c0f6fdc1-ac5d-583c-8ae1-a18de0846552

Expected behavior

new Mutex(initiallyOwned: true, name: @"Global\UniqueName", out bool createdNew) and Mutex.TryOpenExisting should never intermittently throw IOException.

Actual behavior

new Mutex(initiallyOwned: true, name: @"Global\UniqueName", out bool createdNew) and Mutex.TryOpenExisting sometimes throw:

System.IO.IOException: Connection timed out : 'Global\msbuild-server-launch-{45_random-chars}'
     at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
     at Microsoft.Build.Experimental.MSBuildClient.TryLaunchServer()
     at Microsoft.Build.Experimental.MSBuildClient.Execute(CancellationToken cancellationToken)
     at Microsoft.Build.CommandLine.MSBuildClientApp.Execute(String[] commandLine, String msbuildLocation, CancellationToken cancellationToken)
     at Microsoft.Build.CommandLine.MSBuildApp.Main(String[] args)
     at Microsoft.DotNet.Cli.Utils.MSBuildForwardingAppWithoutLogging.ExecuteInProc(String[] arguments)
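
For readers less familiar with the two calls named above, here is a minimal sketch of the call pattern. It is not the MSBuild code from the stack trace: the `NamedMutexProbe` class and its method names are hypothetical, and the sketch only shows how a caller could capture the intermittent IOException instead of letting it propagate.

```csharp
#nullable enable
using System;
using System.IO;
using System.Threading;

// Hypothetical helper for illustration; not part of MSBuild or the runtime.
public static class NamedMutexProbe
{
    // Tries to create a named mutex that the calling thread initially owns.
    // Returns null and sets 'error' if the constructor throws IOException.
    public static Mutex? TryCreateOwned(string name, out bool createdNew, out IOException? error)
    {
        error = null;
        try
        {
            return new Mutex(initiallyOwned: true, name: name, createdNew: out createdNew);
        }
        catch (IOException ex)
        {
            createdNew = false;
            error = ex;
            return null;
        }
    }

    // Tries to open an existing named mutex.
    // Returns false and sets 'error' if TryOpenExisting throws IOException.
    public static bool TryOpen(string name, out Mutex? mutex, out IOException? error)
    {
        error = null;
        mutex = null;
        try
        {
            return Mutex.TryOpenExisting(name, out mutex);
        }
        catch (IOException ex)
        {
            error = ex;
            return false;
        }
    }
}
```

A caller that receives an owned mutex from `TryCreateOwned` would normally call `ReleaseMutex()` on the same thread before disposing it. Whether catching the exception is an acceptable workaround is an open question in this issue.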

Regression?

Unknown

Known Workarounds

Unknown.

Configuration

I have seen this mostly on:

  • OSX_x64
  • Linux_musl_x64
  • Linux_x64

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 7, 2022
@ghost

ghost commented Oct 7, 2022

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Author: rokonec
Assignees: -
Labels:

area-System.Threading

Milestone: -

@rokonec rokonec changed the title from 'Mutex.TryOpenExisting intermittently throws IOException with "Timeout' to 'Mutex.TryOpenExisting intermittently throws IOException' Oct 7, 2022
@rokonec
Member Author

rokonec commented Oct 7, 2022

@kouvel, @janvorli recommended that I tag you here. Please take a look.

@kouvel
Member

kouvel commented Oct 7, 2022

The TryOpenExisting failure seems to have occurred in Mono wasm builds. I'm guessing the create failure is also coming from Mono, based on the error message and its similarity to the other stack. I didn't find any core dumps, and from looking at the code I'm not sure where the error is being thrown from.

@jkotas
Member

jkotas commented Oct 7, 2022

@kouvel This was thrown from here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Threading/Mutex.Windows.cs#L35 .

Note that the "Connection timed out" message may be bogus. This code is mixing and matching Windows and Unix error codes.

The error comes from msbuild. msbuild always runs on CoreCLR. This is not a Mono problem.

@kouvel
Member

kouvel commented Oct 7, 2022

OK, makes sense. I wasn't aware of anything on these paths that would cause a timeout, but it may be possible depending on the setup. I'll try to see which errors may be leading to this and what may be causing them.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Oct 7, 2022
@mangod9 mangod9 added this to the 7.0.0 milestone Oct 7, 2022
@mangod9
Member

mangod9 commented Oct 7, 2022

Adding @janvorli here too, in case this is related to some recent changes.

@jkotas
Member

jkotas commented Oct 7, 2022

The error is ERROR_OPEN_FAILED from the Win32 PAL. The ERROR_OPEN_FAILED code is 110, and the Linux message for error code 110 is "Connection timed out".
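
To make the collision concrete, the following sketch asks the runtime for the message it associates with the raw value 110. The value comes from the comment above; the `ErrorCodeCollision` class is hypothetical and exists only for illustration. The output is platform-dependent: on Linux it is expected to be the errno text quoted above, while on Windows the same number maps to the Win32 ERROR_OPEN_FAILED text.

```csharp
using System;
using System.ComponentModel;

// Hypothetical helper for illustration; not part of the runtime.
public static class ErrorCodeCollision
{
    // Win32 ERROR_OPEN_FAILED, per the comment above.
    public const int ErrorOpenFailed = 110;

    // Returns the message the current platform associates with the raw number.
    // The number carries no information about whether it is a Win32 code or an errno value.
    public static string DescribeRawCode(int code) => new Win32Exception(code).Message;
}
```

Calling `Console.WriteLine(ErrorCodeCollision.DescribeRawCode(ErrorCodeCollision.ErrorOpenFailed));` on different operating systems shows why a Win32 code produced by the PAL can end up with an unrelated Unix message.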

@kouvel
Member

kouvel commented Oct 7, 2022

Yeah, it looks like the Windows error codes are no longer converted by the PAL, so the message is incorrect. This error could have occurred for various reasons. Maybe I'll try to preserve more accurate errors and try to get a failure again in a PR's CI.

@janvorli
Member

janvorli commented Oct 7, 2022

Kusto shows that this issue has happened about 50 times during the last 30 days, and it always occurs in the Mono wasm legs. It seems to occur there because, for wasm, each test is compiled while the test runs (when its generated .sh script is called).

@janvorli
Member

janvorli commented Oct 7, 2022

It is not clear to me why we mix in the Unix error messages there at all when the method is named GetExceptionForWin32Error. It doesn't seem right to do such a thing. @AaronRobinsonMSFT I can see you have added the code that makes that translation, what was the reasoning behind it?

@jkotas
Member

jkotas commented Oct 7, 2022

what was the reasoning behind it?

I guess nobody noticed that the changes in #70685 have an unintended interaction with the Win32 emulator PAL uses in CoreLib. It is very hard to keep in mind at all times that a few parts of CoreLib use the Win32 emulator PAL.

@jkotas
Member

jkotas commented Oct 7, 2022

Error message fix: #76768

@AaronRobinsonMSFT
Member

@AaronRobinsonMSFT I can see you have added the code that makes that translation, what was the reasoning behind it?

If I recall correctly, there was a desire to use the newest API when available, even if performance could be impacted. It seems the SPCL scenario needs to be considered as well. Thanks @jkotas for rooting it out.

@kouvel
Member

kouvel commented Oct 11, 2022

I haven't been able to repro the error locally using the same container image.

An strace would help to see which API is failing and with what error code, so one possibility is to run the msbuild command under strace to get that output. I'm not sure whether the container is set up for strace, though. Does anyone know where these containers are set up? It may just be a matter of adding commands to install strace as root.

It may also be useful to preserve more info about the errors such as the failing API, relevant parameters, and the error code, and to include that in the exception message. I can look into improving that, but I guess it wouldn't help until we have a better way of reproing the issue or until a newer version of the SDK with that change is used in the CI.
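
As a rough picture of the richer diagnostics suggested above, the sketch below builds an exception message from the failing API name, the object name, and the raw error code. Everything here is an assumption for illustration: `MutexErrorInfo`, `CreateException`, and the message layout are hypothetical and do not reflect what the runtime does or will do.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the runtime.
public static class MutexErrorInfo
{
    // Builds an IOException whose message keeps the failing API, the object
    // name, and the raw error code, instead of only a translated message.
    public static IOException CreateException(string failingApi, string objectName, int rawErrorCode)
    {
        string message =
            $"{failingApi} failed for '{objectName}' (raw error code {rawErrorCode}).";
        return new IOException(message);
    }
}
```

For example, `MutexErrorInfo.CreateException("open", @"Global\UniqueName", 110)` would keep the number 110 visible in the message even when the translated text is misleading. The API name `"open"` is a placeholder here, since the thread does not identify which call actually fails.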

@rokonec
Member Author

rokonec commented Jun 9, 2024

Inactive for a while. Let's close it.

@rokonec rokonec closed this as completed Jun 9, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 10, 2024