
[linux-arm64] Random and rare runtime crash System.ArgumentOutOfRangeException (System.Net.Sockets) #72365

Closed
NQ-Brewir opened this issue Jul 18, 2022 · 19 comments

Comments

@NQ-Brewir

Description
Random and rare crashes with this exception:

Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
  at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
  at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
  at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
  at System.Threading.ThreadPoolWorkQueue.Dispatch()
  at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

It seems to happen only in applications under load.

Exit signal: Abort (6)

Reproduction Steps

We don't have a reproduction yet. We probably need to heavily stress the network, since it seems to be a race condition.
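As a rough illustration of the kind of stress we mean, here is a minimal loopback echo sketch (this is not our actual Orleans workload; the client count, payload size, and iteration count below are arbitrary placeholders):

using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Echo server on loopback; the OS picks a free port.
var listener = new TcpListener(IPAddress.Loopback, 0);
listener.Start();
int port = ((IPEndPoint)listener.LocalEndpoint).Port;

_ = Task.Run(async () =>
{
    while (true)
    {
        Socket server = await listener.AcceptSocketAsync();
        _ = Task.Run(async () =>
        {
            var buffer = new byte[4096];
            int read;
            while ((read = await server.ReceiveAsync(buffer.AsMemory(), SocketFlags.None)) > 0)
                await server.SendAsync(buffer.AsMemory(0, read), SocketFlags.None);
        });
    }
});

// Many concurrent clients hammering async send/receive pairs to maximize
// contention on the socket completion path.
Task[] clients = Enumerable.Range(0, 64).Select(async _ =>
{
    using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    await client.ConnectAsync(IPAddress.Loopback, port);
    var payload = new byte[1024];
    var response = new byte[1024];
    for (int i = 0; i < 1_000_000; i++)
    {
        await client.SendAsync(payload.AsMemory(), SocketFlags.None);
        await client.ReceiveAsync(response.AsMemory(), SocketFlags.None);
    }
}).ToArray();

await Task.WhenAll(clients);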

Expected behavior

The runtime should not crash when we are using sockets.

Actual behavior

Random and rare crashes of the runtime.

Regression?

No response

Known Workarounds

No response

Configuration

Dotnet runtime version: 6.0.6
OS: GNU/Linux Debian 11 (Bullseye)
CPU: ARM64 Graviton 2 (AWS)
We are using Orleans with this application

Other information

Follow-up of #70486.
We triple-checked all usages of ValueTask and removed all of them, just to be sure.
This time, this is not some ValueTask awaited twice.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 18, 2022
@ghost

ghost commented Jul 18, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: NQ-Brewir
Assignees: -
Labels:

area-System.Net.Sockets, untriaged

Milestone: -

@karelz karelz added arch-arm64 os-linux Linux OS (any supported distro) labels Jul 19, 2022
@karelz
Member

karelz commented Jul 19, 2022

@NQ-Brewir are you working on getting a repro, or some more actionable information?
In its current state, the bug is not actionable for us -- the same arguments apply as in #70486 (comment).
Moreover, you are the only customer hitting the problem so far.

I would recommend closing the issue until there is actionable information.

@karelz karelz added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 19, 2022
@ghost

ghost commented Jul 19, 2022

This issue has been marked needs-author-action and may be missing some important information.

@antonfirsov antonfirsov removed the untriaged New issue has not been triaged by the area owner label Jul 26, 2022
@antonfirsov antonfirsov added this to the Future milestone Jul 26, 2022
@ghost ghost added the no-recent-activity label Aug 9, 2022
@ghost

ghost commented Aug 9, 2022

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@paulquinn

paulquinn commented Aug 16, 2022

I'm also getting the same error. It looks like it's also appearing here: aws/aws-lambda-dotnet#1244.

My config:
Dotnet runtime version: v7.0.100-preview.7 (ARM64)
OS : macOS 12.5 (Monterey)
CPU: ARM64 Apple Silicon M1 Max

This happens (again, intermittently) when I'm running/debugging a few microservices on Kestrel. I was unsure whether this was a Kestrel issue, but then saw it reported here.

One additional piece of info is that in the framework method that throws the exception:

throw new ArgumentOutOfRangeException(GetArgumentName(argument));

The argument parameter is always (int)40.

Just like the linked AWS issue, none of my exception handlers seem to be catching the error.

Any ideas on next steps? I can't seem to isolate the exception for a repro...

@ghost ghost removed the no-recent-activity label Aug 16, 2022
@paulquinn

I decided to run exactly the same code in the same way on a Windows machine:

Dotnet runtime version: v7.0.100-preview.7 (x64)
OS: Windows 11 (22H2)
CPU: AMD Ryzen 9 3900X 12-Core Processor

...it's been running for 24 hours now without error. I'll keep it running, but I'd normally get that exception thrown within a couple of hours on ARM64/macOS, so perhaps this is more of a platform issue?

@NQ-Brewir
Author

The issue is still happening, but far less often since we removed all ValueTask usage from our codebase.
We are still not able to create a clear repro, and the problem seems totally random.
In any case, managed code should not crash like that for this kind of problem.

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Aug 23, 2022
@NQ-Brewir
Author

Hello,
I still have this problem, even though we removed all the ValueTask usages we had. We have no real way to find a clear repro of this issue, but it is quite problematic as our code runs in a production environment.
Is it possible to add more information to the exception context to better track where this issue could come from?
Regards

@am11
Member

am11 commented Oct 4, 2022

@NQ-Brewir, could you try catching the unhandled exception via the AppDomain event handler and dumping the full exception object to the logger (with the inner exception)? Note that this can get noisy and costly in a production environment, so you may want to filter which exception objects to dump.
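For example, a minimal sketch of that suggestion (the ArgumentOutOfRangeException filter and the Console.Error destination are placeholders; wire it up to your real logger):

using System;

// Subscribe as early as possible (e.g. at the top of Main). The handler cannot
// prevent the crash, but Exception.ToString() includes the message, the stack
// trace, and the whole InnerException chain, which is what we are after here.
AppDomain.CurrentDomain.UnhandledException += (sender, eventArgs) =>
{
    if (eventArgs.ExceptionObject is ArgumentOutOfRangeException ex)
    {
        Console.Error.WriteLine($"Unhandled exception from sockets path: {ex}");
    }
};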

The call stack in the top post resembles the lower part of the exception @BrennanConroy logged here: aspnet/SignalR#1703 (comment). I'm not sure if it is the same (mysterious) issue. If it is, then going by SignalR's call stack, the inner exception of the ArgumentOutOfRangeException seems to be an InvalidOperationException coming from ThrowMultipleContinuationsException() under high concurrency:

Action<object>? prevContinuation = Interlocked.CompareExchange(ref _continuation, continuation, null);
if (ReferenceEquals(prevContinuation, s_completedSentinel))
{
    // Lost the race condition and the operation has now already completed.
    // We need to invoke the continuation, but it must be asynchronously to
    // avoid a stack dive. However, since all of the queueing mechanisms flow
    // ExecutionContext, and since we're still in the same context where we
    // captured it, we can just ignore the one we captured.
    bool requiresExecutionContextFlow = _executionContext != null;
    _executionContext = null;
    UserToken = null; // we have the state in "state"; no need for the one in UserToken
    InvokeContinuation(continuation, state, forceAsync: true, requiresExecutionContextFlow);
}
else if (prevContinuation != null)
{
    // Flag errors with the continuation being hooked up multiple times.
    // This is purely to help alert a developer to a bug they need to fix.
    ThrowMultipleContinuationsException();
}

Why it is happening more frequently on unix arm64 than on the other platforms is unclear.
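To make the multiple-continuations case concrete, here is an illustrative sketch (not code from this issue): consuming the same pending socket ValueTask twice installs two continuations on the same AwaitableSocketAsyncEventArgs and trips the guard above. This demonstrates the guard itself, not necessarily the crash reported here.

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

var listener = new TcpListener(IPAddress.Loopback, 0);
listener.Start();
using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
await client.ConnectAsync(listener.LocalEndpoint);
using Socket server = await listener.AcceptSocketAsync();

var buffer = new byte[16];
// Nothing has been sent yet, so this receive stays pending.
ValueTask<int> pending = client.ReceiveAsync(buffer.AsMemory(), SocketFlags.None);

Task<int> first = pending.AsTask();   // hooks up the first continuation
Task<int> second = pending.AsTask();  // second consumer of the same pending operation:
                                      // throws InvalidOperationException (multiple continuations)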

@BruceForstall
Member

BruceForstall commented Oct 6, 2022

I saw what looks like this issue in a CI pipeline run on osx/arm64:

Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
   at System.Threading.ThreadPool.<>c.<.cctor>b__78_0(Object state)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

https://dev.azure.com/dnceng-public/public/_build/results?buildId=42417&view=ms.vss-test-web.build-test-results-tab&runId=857504&resultId=111076&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

note: it looks like a crash dump was created

@antonfirsov
Member

antonfirsov commented Oct 6, 2022

This could be a bug; we should investigate.

note: it looks like a crash dump was created

Unfortunately, I see the following:

DumpFileToLarge: The dump /cores/core.84631 is 4671242240 bytes, which is larger than the supported 1610612736.0, so was not uploaded.

@dotnet/dnceng any chance the limit can be increased?

@paulquinn it's been a while, but any chance you can produce a dump for us?

@MattGal
Member

MattGal commented Oct 6, 2022

@dotnet/dnceng any chance the limit can be increased?

Unfortunately this limitation comes about from our support of on-premises machines; these tend to cost us lots of time and money uploading large dumps which are often ignored despite this time/financial cost.

If you need to check out a machine with the same specifications as the test one, that can likely be arranged; we'd just need to know the specific queue that this work item ran on (or have its full log linked, etc.).

@antonfirsov
Member

antonfirsov commented Oct 6, 2022

I just noticed that I missed the start of the conversation and the fact that this is technically a duplicate of #70486. It might be worth keeping it open because of the number of reports we see.

@antonfirsov antonfirsov removed the bug label Oct 6, 2022
@antonfirsov antonfirsov modified the milestones: 8.0.0, Future Oct 6, 2022
@wfurt
Member

wfurt commented Oct 7, 2022

We really should compress the dumps, @MattGal. They are often full of zeros, and we could probably get a 10:1 gain.

@MattGal
Member

MattGal commented Oct 7, 2022

We really should compress the dumps, @MattGal. They are often full of zeros, and we could probably get a 10:1 gain.

This was discussed in dotnet/dnceng#1219, feel free to reopen it and make your case.

@NQ-Brewir
Author

@am11 I tried logging more info using the AppDomain event handler, but the exception does not seem to go through it.

@NQ-Brewir
Author

We had to migrate back to amd64 due to some other reasons, and the server is not crashing anymore. This issue is thus really tied to ARM, and not to any wrongly used ValueTask.

@stephentoub
Member

the server is not crashing anymore

Thanks for the update. I'll close this and we can reopen if it reoccurs and we're able to get more information for debugging.

@stephentoub stephentoub closed this as not planned Won't fix, can't repro, duplicate, stale Jan 26, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Feb 25, 2023
@karelz karelz modified the milestones: Future, 8.0.0 Mar 22, 2023
@karelz
Member

karelz commented May 27, 2023

A problem in .NET was identified after all; this is a duplicate of #84407.

Fixed in 7.0.7 by PR #84641 and in 6.0.18 by PR #84432.
Main (8.0) is not affected; see the description in #84432.

@karelz karelz modified the milestones: 8.0.0, 6.0.x May 27, 2023