System.CodeDom.Tests nullref on arm64 #53042
Tagging subscribers to this area: @buyaa-n, @krwq

Issue Details: Started this morning. See https://runfo.azurewebsites.net/search/tests/?q=started%3A%7E7+definition%3Aruntime+name%3ASystem.CodeDom.Tests.CodeAttributeArgumentCollectionTests
Always net6.0-windows-Release-arm64-CoreCLR_release-Windows.10.Arm64.Open. No change in CodeDom code itself since May 18th.
The method is simply
where TItem is a type like CodeCatchClause. @dotnet/jit-contrib are you aware of any ARM64-specific codegen issue that might be causing random NREs? I know this was happening on Apple Silicon, but this is Windows.
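The method body itself did not survive the code embed above; purely for illustration, a collection AddRange of the shape being discussed looks roughly like the following Java sketch (the real code is C#, and the class and member names here are invented). The point is that the method only null-checks and loops, so an NRE thrown from inside it points at codegen rather than library logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the kind of trivial generic collection method
// described above: a null check followed by a plain loop over the array.
class CodeCollection<TItem> {
    private final List<TItem> items = new ArrayList<>();

    public void add(TItem item) {
        items.add(item);
    }

    public void addRange(TItem[] value) {
        if (value == null) {
            throw new IllegalArgumentException("value");
        }
        for (TItem item : value) {
            add(item);
        }
    }

    public int size() {
        return items.size();
    }
}
```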
@hoyosjs do we run on net6.0-windows-Debug-arm64-CoreCLR_release-Windows.10.Arm64.Open? If so, then it smells like an optimization issue.
The interesting part is that if you look at other tests within CodeDom, you see there are linux-arm64 issues too, with the same nullref, at https://runfo.azurewebsites.net/search/tests/?q=started%3A~7%20definition%3Aruntime%20name%3Asystem.codedom.tests&pagenumber=1, but never on Debug. As for running on Debug: no, we don't. We do run checked, and I see Linux checked failures: runtime/eng/pipelines/runtime.yml, lines 825 to 842 in f318f32
There's no ARM64 debug.
OK, so we know we get failures like this on Arm64 only: with release libraries + release CoreCLR on Windows, and release libraries + checked CoreCLR on Ubuntu (but not Alpine/musl). Apparently not with release libraries + checked CoreCLR on Alpine (musl) or Windows, nor release + release on Linux, nor release or debug libraries on checked CoreCLR on Apple Silicon, since I see those in runtime.yml as well. Nor do we see these failures in any Mono configuration. That's not much of a pattern, but it does make it even less likely to be a libraries code issue; it's either codegen or R2R (what do we R2R?). I see all variations have "0x6000423+0x0", which I guess means it's at the very first IL instruction in the method? (Why are there no PDBs deployed here to give line numbers?)
We R2R the product, but I think that doesn't apply to the tests. There are 10 failures today, so let me download a work item from today and inspect it.
The first failure was 5/20/2021 5:24:28 PM - not sure whether this is UTC. Nothing jumps out:
@AndyAyersMS can you imagine how your change (edit: I mean #52998) could cause this?
Looking at this, the commits before failures are: https://github.com/dotnet/runtime/commits/d81ad044fa6830f5f31f6b6e8224ebf66a3c298c
@danmoseley Your query above ran against 2020 and not 2021? My change could be quite relevant; it involved devirtualization and enumerators.
(ah you must have fixed it...)
I'll take a look, but it will be a few hours.
@danmoseley https://dev.azure.com/dnceng/public/_build/results?buildId=1147916&view=results failed, and it was a rolling build from before @AndyAyersMS's change went in.
Oh, well, so much for that theory. Still, I do believe codegen is most likely, based on the evidence above. Can you find the last commit that succeeded? Assuming this is consistent?
Let me verify consistency (although it looks like it is).
@hoyosjs was there a clever way you figured that out? Kusto?
#53042 suggests similar NREs in some non-CodeDom tests.
I am looking into it.
I have run the test 10000 times in a loop on a Windows ARM64 machine equivalent to the lab one (both checked and release versions of coreclr). I've also tried running 5 instances in parallel 1000 times, and I am unable to repro the failure.
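The stress setup described above (many iterations, several instances in parallel, counting failures) can be sketched as a small harness like the following. This is an illustrative Java stand-in, not the actual lab setup; `StressHarness` and its signature are invented for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal parallel stress harness: N worker threads each run the suspect
// operation many times, and any exception a worker observes is counted
// rather than crashing the harness.
class StressHarness {
    static int run(Runnable work, int workers, int iterations) {
        AtomicInteger failures = new AtomicInteger();
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            threads[w] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    try {
                        work.run();
                    } catch (Throwable t) {
                        failures.incrementAndGet();
                    }
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return failures.get();
    }
}
```

Rare codegen or runtime races often surface only under this kind of parallel hammering, which is why a single-threaded loop can run clean for thousands of iterations.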
Ah, my mistake: for some weird reason, I was building the stuff from a stale main. Trying again.
Alternatively, you can also use runfo to download the Helix payload of a failing test.
It's a nullref that gets swallowed by the test runner. That being said, I downloaded the payload from a failed run, ran it locally in a loop, and never got the test to fail. Worth a query on Kusto to see if it was a particular agent. Update: 32 different agents have seen the issue, in a uniform manner, so agent issues are unlikely.
Finally got it to fail after hundreds of iterations, though in a different manner (Fatal error. Internal CLR error. (0x80131506)).
This one is becoming a pretty hot issue in CI council. It doesn't affect PRs, but we have limited visibility into CI health, as this is taking down 8-9 rolling builds a day. Should we back out the change or disable the tests in the namespace?
Let's move these tests to staging for now.
Actually, after a bit more digging, this is also hitting Linux ARM64 runs equally hard, and those do affect PR runs. @agocke I didn't know you could move tests to staging. I thought you had to move the leg itself?
Ugh, we need to fix that -- staging isn't very useful if we can't move individual tests over. OK, then we should skip the tests.
@hoyosjs do you have a link for the Linux failures? I've been running the failing tests in a loop tens of thousands of times, in 10 shells in parallel, on an arm64 Windows machine since yesterday to repro it. It has reproed just once so far, but since the null reference is handled by xunit, there is no dump and thus no way to see what's wrong. So I have modified the runtime to fail fast on any null reference exception. It has been running with this modification for about 8 hours without any crash yet.
You can see some of the Linux arm64 failures here: |
This file lists all the failures with their queues and build links.
I got this to repro on a lab machine. The four tests that need disabling are: AddRange_CodeStatementArray_Works
It turns out my change did not cause the issue; it just exposed a bug that has been in the runtime at least since the first GitHub commit. There is a bug in runtime/src/coreclr/vm/arm64/stubs.cpp, lines 1325 to 1352 in d3ed5a9.
I'll send out a PR with a fix soon. |
Runfo Tracking Issue: system.codedom.tests.codeattributeargumentcollectiontests