C000007B in CoreCLR Pri0 Test Run Windows_NT arm checked #1097
/cc @dagood
@jkotas can this be closed now that you reverted the commit?
We have been dealing with several different CI breaks. The revert fixed one set of failures; this issue tracks a different set.
I experimentally tried to roll back two commits merged in between the last passing and the first failing run that seemed potentially related (Leandro's change to assembly resolution logic and JanV's merge of Viktor's build cleanup) but, as JanK rightly pointed out, sadly neither fixed the regression. Today I tried a different approach: I checked out a private branch off the "old coreclr" repo master, backported Jarret's change re-enabling ARM runs, and ran an "old coreclr" PR pipeline on the change; apparently it failed with exactly the same failure as the new runs. I believe this confirms that the failure is more likely related to the ambient state of the ARM machines than to "real" pipeline changes, just as JanK speculated in his previous response. It's probably time to start talking to the infra guys, initially perhaps @ilyas1974, @MattGal or @Chrisboh, but I doubt anyone will be able to shed more light on this before the end of the holidays.
Were there specific questions you had about the ARM hardware being used? The current hardware used for Windows ARM64 runs is running a newer build of Windows than we were previously using, as well as Python 3.7.0 (the x86 version).
Thanks so much, Ilya, for your quick response during the holiday period :-). As a first question, I'm wondering what happened to the Windows ARM pool between 10:45 AM and 12:14 PM PST on 2019/12/19 such that, without any relevant commits to the repo in that time interval (there are four commits there in total; I can provide more detailed info if needed), a run at 10:45 AM passed and a subsequent one at 12:14 PM failed.
Just for reference, I believe that the name of the Helix queue in question is "Windows.10.Arm64.Open".
That was around the time period when we migrated from the systems running the old Seattle chipset to the Centriq chipset. We also migrated from Python 2 to Python 3 and updated to a newer build of Windows. The old systems are still around (just disabled in Helix).
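As an aside (not from this thread): since the migration also switched the queue to an x86 build of Python, one quick way to confirm which interpreter a machine actually picks up is to print its bitness and architecture. A minimal sketch using only the standard library:

```python
# Minimal diagnostic sketch (not from this thread): print the version, bitness
# and machine architecture of whichever Python interpreter is on PATH, to
# confirm whether the x86 (32-bit) 3.7.0 build is the one actually being used.
import platform
import struct

print("python version:", platform.python_version())
print("pointer size  :", struct.calcsize("P") * 8, "bit")
print("machine       :", platform.machine())
```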
Great, looks like we're starting to untangle the problem. Have you got any idea whether we might be able to somehow divide the problem space to identify which of the three components of the update caused the regression?
I have a system in the windows.10.arm64 queue (we're still in the process of updating the rest of the systems in that queue) with the same software configuration but different hardware. If we are able to run these workloads on that system, that will tell us whether it's hardware specific or something with the software configuration.
Awesome, just please let me know what to alter in the queue definitions or whatnot and I can easily trigger a private run to evaluate that.
I have disabled all systems in the windows.10.arm64 queue except for the one with the latest build of Windows and Python 3. If you send your workloads to that queue, they will be run against this system.
Hmm, apparently I've messed something up in the experimental run:
Sending Job to Windows.10.Arm64...
F:\workspace\_work\1\s\.packages\microsoft.dotnet.helix.sdk\5.0.0-beta.19617.1\tools\Microsoft.DotNet.Helix.Sdk.MonoQueue.targets(47,5): error : ArgumentException: Unknown QueueId. Check that authentication is used and user is in correct groups. [F:\workspace\_work\1\s\src\coreclr\tests\helixpublishwitharcade.proj]
Please let me know what I need to change when you have a chance and I'll retry the run. Thanks a lot, Tomas
I talked with my team and it appears that running your workloads on the Windows.10.Arm64 queue is not as simple a task as I first thought. Why don't we do this: let's coordinate a good time for me to move the laptop to the Windows.10.Arm64.Open queue and make it the only available system in that queue. What time works best for you?
@JpratherMS @ilyas1974 @MattGal If the Windows ARM machine problem is a network issue, has anyone investigated the network slowness? @trylek and all: what are the next steps to get Windows ARM and Windows ARM64 jobs running again? Just troubleshoot the networking issues? Use different machines?
@BruceForstall - I worked with Matt on running a couple of experimental runs against the queues he suggested, but none has worked so far due to various infra issues. I believe that once we receive a queue we're able to run our pipelines on, that should basically be it. Please note that most of this discussion has dealt with 32-bit ARM; in fact, I haven't yet investigated whether we might already be able to re-enable the ARM64 runs using the Windows.10.Arm64.Open queue that more or less started this thread by no longer supporting ARM32 - I can easily look into that.
/cc @tommcdon for visibility
@trylek What's the current status of Windows arm32/arm64 testing?
@BruceForstall - I re-enabled the Windows.Arm64 PR job last Monday, see e.g. here: [I grabbed a PR that happened to succeed to make a good impression ;-).] For Windows arm32, there seems to be a new queue of the Galaxy Book machines DDFUN is standing up; @ilyas1974 sent out an update earlier today. I'll send an initial canary run to their new queue, and I guess next week we can decide on re-enabling the Windows arm32 PR jobs if it turns out to be stable enough.
There are currently 18 Galaxy Book systems online with this new queue.
Nice!
Windows arm32 is still disabled until #1097 is fully resolved.
Arm32 is now being run again.
arm32 library test runs are still disabled against this issue: runtime/eng/pipelines/runtime.yml, line 945 at 5092e10
The comment should be removed; we've removed almost all Windows arm32 testing: #39655
So we don't run our libraries tests against arm32 anywhere at all anymore?
Specifically Windows arm32; Linux arm32 testing should be everywhere. I don't know about the libraries. For coreclr, I think we still do builds (but not test runs) in CI, and test runs only in the "runtime-coreclr outerloop" pipeline (but none of the many stress pipelines).
For libraries, I believe we agreed to no longer test against arm32. We think the chances of finding an arm32 bug in libraries tests are very low; however, the JIT does need testing, so we scaled that testing back to the outerloop pipeline.
I think it was @jkotas' observation - I believe it, but it's not impossible, and it feels odd to offer first-class support without ever running libraries tests, even once a cycle. I'm not going to push on this if the consensus is that it's not necessary/feasible. Do we get coverage from the dotnet/iot repo?
I made this comment in the context of doing arm32 tests for every PR. I agree that we should have once-a-day or once-a-week test runs of the full matrix for everything we ship as officially supported.
@jaredpar thoughts about this? We should have no tests on Windows ARM32, and occasional but regular runs (that include regular library unit tests) on Linux ARM32.
I believe the reason we stopped running tests is that win-arm32 is not a supported platform on .NET 5: https://github.com/dotnet/core/blob/master/release-notes/5.0/5.0-supported-os.md
Actually, should we stop building the win-arm32 runtime pack?
We should have stopped all Windows ARM32 activities so far as I can see. |
First known occurrence:
https://dev.azure.com/dnceng/public/_build/results?buildId=462402&view=results
Commit info:
Proximate failure:
C000007B = STATUS_INVALID_IMAGE_FORMAT
As this happens on ARM, a usual cause is erroneously mixing up ARM and Intel binaries.
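A quick way to check for that kind of mixup is to read the COFF machine field from each suspect binary. A minimal sketch (not from the issue; pass whatever file paths you want to inspect):

```python
# Minimal sketch: read the COFF "Machine" field from a PE file to tell ARM and
# ARM64 binaries apart from x86/x64 ones, since mixing them up is a common
# cause of STATUS_INVALID_IMAGE_FORMAT (0xC000007B).
import struct
import sys

MACHINE_NAMES = {
    0x014C: "x86 (I386)",
    0x8664: "x64 (AMD64)",
    0x01C4: "ARM (ARMNT / Thumb-2)",
    0xAA64: "ARM64",
}

def pe_machine(path):
    with open(path, "rb") as f:
        f.seek(0x3C)                                  # e_lfanew: offset of the PE signature
        pe_offset = struct.unpack("<I", f.read(4))[0]
        f.seek(pe_offset)
        if f.read(4) != b"PE\x00\x00":
            raise ValueError("not a PE image: " + path)
        machine = struct.unpack("<H", f.read(2))[0]   # first field of the COFF header
    return MACHINE_NAMES.get(machine, hex(machine))

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "->", pe_machine(name))
```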