-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
InvalidCastException thrown after TimerQueueTimer.CallCallback #81211
Comments
Tagging subscribers to this area: @mangod9 Issue DetailsDescriptionAfter upgrading my iOS app from classic Xamarin.iOS to .NET7 I started observing frequent Reproduction StepsUnfortunately, I don't know what part of my code causes this behavior. Expected behaviorNo crash :) Actual behavior
this is here: runtime/src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs Line 35 in 215b39a
this is here: runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs Line 5632 in 26b58b9
Both crashes originate from here:
The "_state" variable is nullable, and based on the reports, the crashes are caused by the variable to be indeed null. Regression?I did not see those crashes when using "old" Xamarin.iOS Known WorkaroundsNo response Configuration.net7-ios Other informationNo response
|
Tagging subscribers to 'os-ios': @steveisok, @akoeplinger Issue DetailsDescriptionAfter upgrading my iOS app from classic Xamarin.iOS to .NET7 I started observing frequent Reproduction StepsUnfortunately, I don't know what part of my code causes this behavior. Expected behaviorNo crash :) Actual behavior
this is here: runtime/src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs Line 35 in 215b39a
this is here: runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs Line 5632 in 26b58b9
Both crashes originate from here:
The "_state" variable is nullable, and based on the reports, the crashes are caused by the variable to be indeed null. Regression?I did not see those crashes when using "old" Xamarin.iOS Known WorkaroundsNo response Configuration.net7-ios Other informationNo response
|
/cc @lambdageek |
Weird. Both CancellationTokenSource and DelayTimer pass Maybe there's a GC hole somewhere? |
@tipa when you try to reproduce the issue locally, are you trying it in a simulator or on a device? |
@lambdageek I tested my apps on both simulator and physical device numerous times - I was never able to observe this myself (edit: I now am, using a TestFlight app and repeatedly starting it). |
I am also seeing a lot of these exceptions (220 crashes from 80 devices , all originating from |
A lot more crash reports have been accumulated in the meantime. I expect the real number to be even higher because AppCenter only seems to report a fraction of the real crashes. I observed a crash with a similar behavior on Android (see here + fix here), maybe this is from a similar cause? Additionally, I also started seeing these stack traces::
Maybe I should have opened this bug in the xamarin-macios repo? |
No, it does look like a runtime issue, so you have the right place. |
Ok, I was just wondering because the same app runs without much issues on Android, I am only observing them on iOS at the moment. |
Same issue here with .net MAUI iOS. |
Same here. .net 7 MAUI iOS. Starting to see these in azure monitor. |
Same problem here - workaround - turn off llvm optimisations |
Same issue, .Net iOS(Native), llvm optimisation is already tuned off. SIGABRT: Arg_InvalidCastException |
I can confirm:
In my case though I was using an old CancellationTokenSource in a case where i should not have so I could easily fix this by fixing how I use CancellationTokenSource. |
@durandt Can you please elaborate on how exactly you fix this issue ? |
@suchithm I think I was over-optimistic. I could find which piece of code was crashing and there was I risk that code would sometimes have retained an old CancellationTokenSource and maybe call I have also reported this issue to the AppCenter folks as it seems this crash gets underreported. (something in the AppCenter libs may be silencing the crash, when it occurred I could not find the crashlog file on the iOS device). |
Update I don't think this is related. I think most of the time for @vargaz I'm looking at runtime/src/mono/mono/mini/mini-ops.h Lines 1326 to 1332 in cf3328c
And at the commit from 2015 that removed generic class trampolines mono/mono@bcf4f87 What guarantees now that |
I sincerely hope that this issue will be fixed as planned with the release of .NET 8 (and not moved to the |
We are also encountering this problem since "upgrading" from Xamarin to net7-ios. We have an install base of tens of thousands, and are seeing dozens of these reports coming in each day, having fully deployed our latest and greatest version. Rolling back would be extremely difficult so this puts us in an unfortunate position since the whole app crashes. Come on Microsoft, a little support? MonoTouch: 16.5.0 |
@divil5000 just for Info, do you have llvm turned on? |
What does your repro look like at the moment? We're only seeing a few exceptions per thousands of runs of our software in the wild. I can tell you the following, though:
If I were trying to develop a repro I would be firing off lots of HttpClient requests for large files, all with cancellation triggered after a certain timeout (we always use 20 seconds in case that helps focus efforts) then pressing the Home button on the iPhone hardware in question to see if it reproduces after a little while. |
Thanks @divil5000 that's pretty helpful! I've been focusing on this as a possible runtime or AOT compiler bug around the invocation of delegates on arm64 in concurrent settings (ie But actually we've had issues with backgrounded iOS apps and the threadpool before (e.g. xamarin/xamarin-macios#7080) Couple of questions:
|
As I noted earlier in this thread, I also see this crash reported for apps that do a lot of concurrent HTTP requests.
I am using
I am passing a cancellation token in some situations, but when the crashes happen (I sometimes experienced them myself), they seem to happen right when the http request is being sent out, before the token is cancelled or request timeouts would trigger. I am pretty sure the crashes just happen with a plain |
@tipa thanks! Another question, do you have something enabled that is collecting Http telemetry? (https://learn.microsoft.com/en-us/dotnet/core/diagnostics/well-known-event-providers#systemnethttp-provider) |
Not that I'm aware of, no. |
To answer your questions @lambdageek,
var cts = new System.Threading.CancellationTokenSource();
cts.CancelAfter(Timeout); // Timeout is always 20 seconds
httpResponse = await httpClient.SendAsync(message, HttpCompletionOption.ResponseHeadersRead, cts.Token).ConfigureAwait(false); // 2020/11/06 Added ConfigureAwait as it was necessary on Android, see https://github.com/xamarin/xamarin-android/issues/5264 I hope this is helpful for the repro. |
Update can't get it to happen more than once. I looked at the disassembly for I think I'm getting somewhere... Got this on an iPhone 15 (iOS 17.0) arm64 simulator on an M1 mac
And a native stack like this:
The app is a .NET 7 ios template app with this in the AppDelegate var f = typeof(HttpMessageInvoker).GetField("_handler", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
var o = f?.GetValue(client);
if (o?.GetType().FullName != "System.Net.Http.NSUrlSessionHandler") {
SetText($"handler: {o?.GetType().FullName}");
} else {
Task.Run (() => {
Thread.Sleep (3000);
Task.Run(RunRedraw);
for (int i = 0; i < NumTasks; i++) {
StartFetch(i);
//Thread.Sleep(10);
}
});
} where private static HttpClient client = new(new NSUrlSessionHandler());
private void SetText(string text) {
CoreFoundation.DispatchQueue.MainQueue.DispatchAsync(() => {
label.Text = text;
label.SetNeedsDisplay();
});
}
enum State : byte {
Unstarted,
Started,
Completed,
Cancelled,
}
private const int NumTasks = 100;
private byte[] states = new byte[NumTasks];
private string faulting = string.Empty;
private void UpdateState () {
var text = $"{DateTime.Now:HH:mm:ss} ";
for (int i = 0; i < NumTasks; i++) {
text += (State)Volatile.Read (ref states[i]) switch {
State.Unstarted => ".",
State.Started => "S",
State.Completed => "C",
State.Cancelled => "X",
_ => "?",
};
if (i == NumTasks / 2) {
text += "\n";
}
}
if (faulting != string.Empty) {
text += $"\nFAULT: {faulting}";
}
SetText(text);
}
private async Task RunFetch(int i)
{
try {
var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromSeconds(20));
try {
Volatile.Write(ref states[i], (byte)State.Started);
var message = new HttpRequestMessage(HttpMethod.Get, "https://www.example.com:81/");
var response = await client.SendAsync(message, HttpCompletionOption.ResponseHeadersRead, cts.Token);
Volatile.Write(ref states[i], (byte)State.Completed);
} catch (TaskCanceledException) {
Volatile.Write(ref states[i], (byte)State.Cancelled);
}
} catch (Exception e) {
faulting = e.ToString();
}
}
private void StartFetch(int i)
{
Task.Run(() => RunFetch(i));
}
private async Task RunRedraw()
{
while (true) {
UpdateState();
await Task.Delay(1000);
var _ = Task.Run(() => {
GC.Collect();
GC.WaitForPendingFinalizers();
GC.WaitForPendingFinalizers();
});
}
} To repro, launch the app and then hit "Home" in the first 3 seconds so the app is in the background before it launches the tasks. There's some nonsense in my csproj file (I was trying, fruitlessly, to try to get it launched with AOT on the simulator) that I'm not sure is relevant. I'll try cleaning things up before uploading a repro project. I wasn't getting anywhere with the app in the foreground. So folks guessing that it has to do with the app being in the background are very likely correct. |
I wonder if it's worth changing your repro so it waits for the entire HTTP response before continuing, and deliberately downloading a very large file, so the cancellation token actually fires while your app is backgrounded? This would simulate a poor internet connection situation. |
It actually is getting cancelled. a Get from It really doesn't feel like a problem with the http stack or with the threadpool. I think backgrounding the app just gives the threadpool a chance to lower the number of worker threads, and then the actual crash is some kind of a data race in the delegate implementation. |
Is there anything we can do, on the app side, to stop the entire app from crashing when this unhandled exception is thrown? At the moment we pick it up and record it for our own telemetry, but can't do much else. We are seeing over a hundred crashes on devices per day at the moment. |
Update (one other thing I forgot to write down: I have no idea why this would only show up in net7 and not in classic Xamarin. This part of Mono doesn't change very often) I managed to catch this in Xcode in the debugger (although mostly I was getting an seemingly spurious crash). As far as I can tell what is happening is we're getting garbage values in the For example in
to reach
which I think is this code runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs Lines 7092 to 7094 in a22dee1
and specifically the cast that sequence of assembly code looks like it is emitted by runtime/src/mono/mono/mini/type-checking.c Lines 347 to 349 in 5f5d860
(the offset of Annoyingly the debugger doesn't actually show me that the object in @divil5000 I think eventually you might try adding some code at app startup that does some gratuitous type casts of some [MethodImpl(MethodImplOptions.NoInlining)]
static CancellationTokenSource CastToCTS(object o)
{
return (CancellationTokenSource)o;
}
...
var cts = new CancellationTokenSource();
CastToCTS(cts).Dispose(); (and same with I'm not sure this will help (for one, I'm not sure this will get compiled the same way by the AOT compiler - I still need to verify that), but so far this is the only suggestion I have... |
Hmm... actually I think I'm wrong. In this code:
it actually looks like x9 is zero. So it's not the object that has the problem. it's that The workaround i proposed won't work |
@lateralusX I think this is the same as #75088 - see my previous two comments - this is some LLVM AOT code that is seeing a NULL value in a GOT aotconst. I think the fences and memory barriers in your PR would resolve this issue, too. What do you think? Also, what do you think about the risk of backporting to net7? |
Yes, if you have got passed method_init and still sees NULL values in used GOT slots for that method, that probably means you hit the race where that thread exited its call to method_init before the stores into needed GOT slot has happened or became visible by the other thread. This is an LLVM only issue, meaning that it won't reproduce when running without LLVM. When I originally investigate that issue it caused rare random crashes that could occur at any point during apps lifetime, because a method calls its method_init only on first call to to the method, and it needs to race with another thread doing the same thing to expose the potential race. This fix has been implemented and used in a downstream repo running in some large apps, installed and executed in very large quantities for over a year, and it eliminated the in-frequent crashes previously seen by those apps and didn't cause any regression (x64). It has been in dotnet/runtime main for over a year, so since the fix have been around for sometime and it has been applied to both mono/mono and dotnet/runtime repro's. I believe it should be a small risk of backporting it to net7, especially since we see issues potentially affected by it and its LLVM only. |
We have never seen this issue in the past, when running on the Xamarin framework. Only since our latest release where we changed to net7-ios. In both cases we are using LLVM. Is that expected? It seems to contradict what @lateralusX is saying. |
I haven't seen 1 crash of this type after we turned off LLVM. We had crash like this about every day. |
Xamarin runs on a Mono branch 2020-2 and the critical changes around this race was introduced later in both mono/mono as well as dotnet/runtime, so running on a later mono/mono branch as well as dotnet/runtime branch will hit it, but not the branch used by legacy Xamarin. That might explain what you have been identifying. On the product I originally hit and analyzed the issue we didn't see it until after upgrading to a later mono/mono commit, not included in Mono's 2020-2 branch. |
Thanks for the detailed explanation (and I missed that we changed this code relatively recently - that explains why Xamarin wasn't affected). I'm going to try a local build of #93006 to verify that it makes the crash go away (It's quite infrequent for me under Xcode so if I don't see if after a couple hundred launches, I'm going to consider it resolved). |
Backport of #75088 to release/7.0-staging Fixes #81211 ## Customer Impact Customers targeting Apple platforms using LLVM AOT codegen (the default) in highly concurrent settings (such as firing off multiple simultaneous async HTTP requests) may experience unexpected behavior such as InvalidCastExceptions, NullReferenceExceptions or crashes. ## Testing Manual testing ## Risk Low. This code has been running on .NET 8 `main` for over a year in CI, as well as on some other non-mobile platforms --- * [Mono] Race in init_method when using LLVM AOT. When using LLVM AOT codegen, init_method updates two GOT slots. These slots are initialized as part of init_method, but there is a race between initialization of the two slots. Current implementation can have two threads running init_method for the same method, but as soon as: [got_slots [pindex]] = addr store is visible, it will trigger other threads to return back from init_method, but since that could happen before the corresponding LLVM AOT const slot is set, second thread will return to method calling init_method, load the LLVM aot const, and crash when trying to use it (since its still NULL). This crash is very rare but have been identified on x86/x64 CPU's, when one thread is either preempted between updating regular GOT slot and LLVM GOT slot or store into LLVM GOT slot gets delayed in store buffer. I have also been able to emulate the scenario in debugger, triggering the issue and crashing in the method loading from LLVM aot const slot. Fix change order of updates and make sure the update of LLVM aot const slot happens before memory_barrier, since: got [got_slots [pindex]] = addr; have release semantics in relation to addr and update of LLVM aot const slot. Fix also add acquire/release semantics for ji->type in init_method since it is used to guard if a thread ignores a patch or not and it should not be re-ordered with previous stores, since it can cause similar race conditions with updated slots. * Move register_jump_target_got_slot above mono_memory_barrier. * revert unintentional branding change --------- Co-authored-by: vseanreesermsft <78103370+vseanreesermsft@users.noreply.github.com> Co-authored-by: lateralusX <lateralusx.github@gmail.com> Co-authored-by: Aleksey Kliger <alklig@microsoft.com>
So where are we with this issue? Will we have to move to net8-ios to fix it (when it's released) or will there be a patch to net7-ios at some point? |
It's fixed in .NET 8 RC2, and there is a backport that will be in an upcoming .NET 7 servicing release |
@lambdageek thanks a lot for the fix - the crashes I reported in my first post of this thread have gone away! |
Description
After upgrading my iOS app from classic Xamarin.iOS to .NET7 I started observing frequent
InvalidCastExeption
s in my crash reporting tool (AppCenter). I haven't changed much dependency-wise and I am unable to reproduce this locally.Reproduction Steps
Unfortunately, I don't know what part of my code causes this behavior.
For the first stack trace, I assume that my code (or some dependencies code) is creating a
CancellationTokenSource
with a delay parameter.For the second stack trace below, I assume that some call to
Task.Delay
is causing the crash.Expected behavior
No crash :)
Actual behavior
SIGABRT: Arg_InvalidCastException
this is here:
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs
Line 35 in 215b39a
SIGABRT: Arg_InvalidCastException
this is here:
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs
Line 5632 in 26b58b9
Both crashes originate from here:
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs
Line 706 in 26b58b9
The "_state" variable is nullable, and based on the reports, the crashes are caused by the variable to be indeed null.
Regression?
I did not see those crashes when using "old" Xamarin.iOS
Known Workarounds
No response
Configuration
.net7-ios
happens on various iPhone models and iOS versions
Other information
No response
The text was updated successfully, but these errors were encountered: