
C000007B in CoreCLR Pri0 Test Run Windows_NT arm checked #1097

Closed
trylek opened this issue Dec 20, 2019 · 54 comments

@trylek
Member

trylek commented Dec 20, 2019

First known occurrence:

https://dev.azure.com/dnceng/public/_build/results?buildId=462402&view=results

Commit info:

HEAD is now at 7dc30fc1 Merge 03693d335194dd76177cec10ac159f32de973082 into 0a29c61468f60f815119e52e477d0154351a4abf

Proximate failure:

C:\dotnetbuild\work\100ccf8f-adbf-49ff-87b2-37c7a2ef03ce\Work\c74808db-757a-43c3-a1f2-ec1c841e07ff\Exec>C:\dotnetbuild\work\100ccf8f-adbf-49ff-87b2-37c7a2ef03ce\Payload\CoreRun.exe C:\dotnetbuild\work\100ccf8f-adbf-49ff-87b2-37c7a2ef03ce\Payload\xunit.console.dll JIT\Stress\JIT.Stress.XUnitWrapper.dll -parallel collections -nocolor -noshadow -xml testResults.xml -trait TestGroup=JIT.Stress 

C:\dotnetbuild\work\100ccf8f-adbf-49ff-87b2-37c7a2ef03ce\Work\c74808db-757a-43c3-a1f2-ec1c841e07ff\Exec>set _commandExitCode=-1073741701 

C000007B = STATUS_INVALID_IMAGE_FORMAT
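For reference, the mapping from the raw exit code in the log above to the NTSTATUS value can be checked with a quick sketch like this (illustrative only, not part of the pipeline):

# Illustrative only: map the signed exit code reported by cmd.exe back to its
# unsigned NTSTATUS form.
exit_code = -1073741701            # _commandExitCode from the log above
ntstatus = exit_code & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
print(hex(ntstatus))               # 0xc000007b == STATUS_INVALID_IMAGE_FORMAT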

As this happens on ARM, a usual cause is accidentally mixing ARM and Intel binaries.
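One quick way to confirm a suspect binary's target architecture is to read the Machine field of its PE/COFF header. The sketch below is illustrative only; the helper name and the file it checks are hypothetical and not part of the test infrastructure:

# Illustrative sketch: report the COFF "Machine" field of a PE file, which is
# enough to spot an x86/x64 binary that ended up in an ARM payload.
import struct

MACHINES = {
    0x014C: "x86 (IMAGE_FILE_MACHINE_I386)",
    0x01C4: "ARM Thumb-2 (IMAGE_FILE_MACHINE_ARMNT)",
    0x8664: "x64 (IMAGE_FILE_MACHINE_AMD64)",
    0xAA64: "ARM64 (IMAGE_FILE_MACHINE_ARM64)",
}

def pe_machine(path):
    with open(path, "rb") as f:
        f.seek(0x3C)                                   # e_lfanew in the DOS header
        (pe_offset,) = struct.unpack("<I", f.read(4))
        f.seek(pe_offset)
        if f.read(4) != b"PE\x00\x00":                 # PE signature
            raise ValueError(f"{path} is not a PE image")
        (machine,) = struct.unpack("<H", f.read(2))    # first field of the COFF header
    return MACHINES.get(machine, hex(machine))

if __name__ == "__main__":
    print(pe_machine("CoreRun.exe"))                   # hypothetical example file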

@Dotnet-GitSync-Bot added the untriaged label ("New issue has not been triaged by the area owner") on Dec 20, 2019
@trylek
Member Author

trylek commented Dec 20, 2019

/cc @dagood

@danmoseley
Member

@jkotas can this be closed now that you've reverted the commit?

@jkotas
Member

jkotas commented Dec 23, 2019

We have been dealing with several different CI breaks. The revert fixed the failures in runtime-live-build (Runtime Build Windows_NT x86 release).

This issue tracks failures in CoreCLR Pri0 Test Run Windows_NT arm. It is not fixed yet. I was not able to pinpoint it to any single commit; I believe it was likely introduced by some kind of ambient infrastructure change.

@trylek
Member Author

trylek commented Dec 27, 2019

I experimentally tried to roll back two commits merged between the last passing and the first failing run that seemed potentially related (Leandro's change to assembly resolution logic and JanV's merge of Viktor's build cleanup), but, as JanK rightly pointed out, sadly neither fixed the regression.

Today I tried a different approach: I checked out a private branch off the "old coreclr" repo master, backported Jarret's change re-enabling ARM runs, and ran an "old coreclr" PR pipeline on the change:

https://dev.azure.com/dnceng/public/_build/results?buildId=466805&view=logs&j=41021207-15b4-5953-02cc-987654ff0f7b&t=786f87fb-d4be-5ad9-2d0b-e89af519eba6

Apparently it failed with exactly the same failure as the new runs. I believe this confirms that the failure is more likely related to the ambient state of the ARM machines than to "real" pipeline changes, just as JanK speculated in his previous response. It's probably time to start talking to infra guys, initially perhaps @ilyas1974, @MattGal or @Chrisboh, but I doubt anyone will be able to shed more light on this before the end of the holidays.

@ilyas1974

Were there specific questions you had about the ARM hardware being used? The current hardware being used for Windows ARM64 runs is running a newer build of Windows than we were previously using, as well as Python 3.7.0 (the x86 version).

@trylek
Member Author

trylek commented Dec 27, 2019

Thanks so much, Ilya, for your quick response during the holiday period :-). As a first question, I'm wondering what happened with the Windows ARM pool between 10:45 AM and 12:14 PM PST on 2019/12/19: without any relevant commits to the repo in that interval (there are four commits there in total, I can provide more detailed info if needed), a run at 10:45 AM passed and a subsequent one at 12:14 PM failed.

@trylek
Member Author

trylek commented Dec 27, 2019

Just for reference, I believe that the name of the Helix queue in question is "Windows.10.Arm64.Open".

@ilyas1974

That was around the time period when we migrated from the systems running the old Seattle chipset to the Centrix chipset. We also migrated from Python 2 to Python 3, and we updated to a newer build of Windows. The old systems are still around (just disabled in Helix).

@trylek
Member Author

trylek commented Dec 27, 2019

Great, it looks like we're starting to untangle the problem. Do you have any idea whether we might be able to divide the problem space somehow to identify which of the three components of the update caused the regression?

@ilyas1974

I have a system in the windows.10.arm64 queue (we're still in the process of updating the rest of the systems in that queue) with the same software configuration but different hardware. If we are able to run these workloads on that system, that will tell us whether the problem is hardware specific or something in the software configuration.

@trylek
Member Author

trylek commented Dec 27, 2019

Awesome, just please let me know what to alter in the queue definitions or whatnot and I can easily trigger a private run to evaluate that.

@ilyas1974

I have disabled all systems in the windows.10.arm64 queue except for the one with the latest build of Windows and Python 3. If you send your workloads to that queue, it will be run against this system.

@trylek
Member Author

trylek commented Dec 28, 2019

Hmm, apparently I've messed something up in the experimental run

https://dev.azure.com/dnceng/public/_build/results?buildId=466908&view=logs&j=41021207-15b4-5953-02cc-987654ff0f7b&t=786f87fb-d4be-5ad9-2d0b-e89af519eba6

Sending Job to Windows.10.Arm64...
F:\workspace\_work\1\s\.packages\microsoft.dotnet.helix.sdk\5.0.0-beta.19617.1\tools\Microsoft.DotNet.Helix.Sdk.MonoQueue.targets(47,5): error : ArgumentException: Unknown QueueId. Check that authentication is used and user is in correct groups. [F:\workspace\_work\1\s\src\coreclr\tests\helixpublishwitharcade.proj]

Please let me know what I need to change when you have a chance and I'll retry the run.

Thanks a lot

Tomas

@ilyas1974

I talked with my team, and it appears that running your workloads on the Windows.10.Arm64 queue is not as simple a task as I first thought. Why don't we do this: let's coordinate a good time for me to move the laptop to the Windows.10.Arm64.Open queue, and we can make it the only available system in that queue. What time works best for you?

@BruceForstall
Member

@JpratherMS @ilyas1974 @MattGal If the Windows ARM machine problem is a network issue, has anyone investigated the network slowness issue?

@trylek and all: what are the next steps to get Windows ARM and Windows ARM64 jobs running again? Just troubleshoot the networking issues? Use different machines?

@trylek
Member Author

trylek commented Jan 21, 2020

@BruceForstall - I worked with Matt on running a couple of experimental runs against the queues he suggested, but none has worked so far due to various infra issues. I believe that once we receive a queue we're able to run our pipelines on, that should be basically it. Please note that most of this discussion dealt with 32-bit ARM - in fact, I haven't yet investigated whether we might already be able to re-enable the ARM64 runs using the Windows.10.Arm64.Open queue, which initially started this thread because it no longer supports ARM32 - I can easily look into that.

@trylek
Member Author

trylek commented Jan 24, 2020

/cc @tommcdon for visibility

@BruceForstall
Member

@trylek What's the current status of Windows arm32/arm64 testing?

@trylek
Member Author

trylek commented Feb 13, 2020

@BruceForstall - I re-enabled the Windows.Arm64 PR job last Monday, see e.g. here:

https://dev.azure.com/dnceng/public/_build/results?buildId=519682&view=logs&j=2a67aa9c-b536-5994-b205-8dd60cf9ea5b&t=5c413307-f06e-59d7-136a-b9e14895fc4c

[I grabbed a PR that happened to succeed to make a good impression ;-).] For Windows arm32, there seems to be a new queue of Galaxy Book machines that DDFUN is standing up; @ilyas1974 sent out an update earlier today. I'll send an initial canary run to their new queue, and I guess next week we can decide on re-enabling the Windows arm32 PR jobs if it turns out to be stable enough.

@ilyas1974

There are currently 18 Galaxy Book systems online in this new queue.

@jashook
Contributor

jashook commented Feb 13, 2020

Nice!

@jashook removed the untriaged label ("New issue has not been triaged by the area owner") on Mar 2, 2020
BruceForstall added a commit to BruceForstall/runtime that referenced this issue on Mar 21, 2020: "Windows arm32 is still disabled until dotnet#1097 is fully resolved."
BruceForstall added a commit that referenced this issue on Mar 21, 2020: "Windows arm32 is still disabled until #1097 is fully resolved."
@jashook
Contributor

jashook commented Mar 26, 2020

Arm32 is now being run again

@jashook jashook closed this as completed Mar 26, 2020
@ViktorHofer
Member

arm32 library test runs are still disabled against this issue:

# - Windows_NT_arm return this when https://github.com/dotnet/runtime/issues/1097 is fixed

Is that intentional? cc @safern

@BruceForstall
Member

The comment should be removed; we've removed almost all Windows arm32 testing: #39655

@ViktorHofer
Member

ViktorHofer commented Sep 14, 2020

So we don't run our libraries tests against arm32 anywhere at all anymore?

@BruceForstall
Member

Specifically Windows arm32; Linux arm32 testing should be everywhere.

I don't know about the libraries. For coreclr, I think we still do builds (but not test runs) in CI, and test runs only in the "runtime-coreclr outerloop" pipeline (but none of the many stress pipelines).

@safern
Member

safern commented Sep 14, 2020

For libraries, I believe we agreed to no longer test against arm32. We think the chances of finding an arm32 bug in libraries tests are very low; the JIT does need testing, however, so we scaled that testing down to the outerloop pipeline.

@danmoseley
Member

chances of finding an arm32 bug on libraries tests is very low,

I think that was @jkotas's observation - I believe it, but it is not impossible, and it feels odd to offer first-class support without ever running the libraries tests, even once a cycle. I am not going to push on this if the consensus is that it's not necessary/feasible. Do we get coverage from the dotnet/iot repo?

@jkotas
Member

jkotas commented Sep 14, 2020

chances of finding an arm32 bug on libraries tests is very low,

I have made this comment in the context of doing arm32 tests for every PR.

I agree that we should have once-a-day or once-a-week test runs of the full matrix for everything we ship as officially supported.

@danmoseley
Member

@jaredpar - thoughts about this? We would have no tests on Windows ARM32, and occasional but regular runs (that include the regular library unit tests) on Linux ARM32.

@safern
Member

safern commented Sep 15, 2020

I believe the reason we stopped running tests is that win-arm32 is not a supported platform on .NET 5: https://github.com/dotnet/core/blob/master/release-notes/5.0/5.0-supported-os.md

Actually, should we stop building the win-arm32 runtime pack?

@danmoseley
Member

danmoseley commented Sep 15, 2020

We should have stopped all Windows ARM32 activities so far as I can see.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020