C000007B in CoreCLR Pri0 Test Run Windows_NT arm checked #1097
/cc @dagood
@jkotas can this be closed now that you reverted the commit?
We have been dealing with several different CI breaks. The revert fixed one set of failures; this issue tracks a different set.
I experimentally tried to roll back two commits merged in between the last passing and the first failing run that seemed potentially related (Leandro's change to assembly resolution logic and JanV's merge of Viktor's build cleanup) but, as JanK rightly pointed out, sadly neither fixed the regression. Today I tried a different approach: I checked out a private branch off the "old coreclr" repo master, backported Jarret's change re-enabling ARM runs, and ran an "old coreclr" PR pipeline on the change; apparently it failed with exactly the same failure as the new runs. I believe this confirms that the failure is more likely related to the ambient state of the ARM machines than to "real" pipeline changes, just as JanK speculated in his previous response. It's probably time to start talking to the infra guys, initially perhaps @ilyas1974, @MattGal or @Chrisboh, but I doubt anyone will be able to shed more light on this before the end of the holidays.
Were there specific questions you had about the ARM hardware being used? The current hardware used for Windows ARM64 runs is running a newer build of Windows than we were previously using, as well as Python 3.7.0 (the x86 version).
Thanks so much, Ilya, for your quick response during the holiday period :-). As a first question, I'm wondering what happened to the Windows ARM pool between 10:45 AM and 12:14 PM PST on 2019/12/19 such that, without any relevant commits to the repo in that time interval (there are four commits there in total; I can provide more detailed info if needed), a run at 10:45 AM passed and a subsequent one at 12:14 PM failed.
Just for reference, I believe that the name of the Helix queue in question is "Windows.10.Arm64.Open".
That was around the time period when we migrated from the systems running the old Seattle chipset to the Centriq chipset. We also migrated from Python 2 to Python 3 and updated to a newer build of Windows. The old systems are still around (just disabled in Helix).
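As an aside (not from this thread): since the migration also switched the queue to an x86 build of Python, one quick way to confirm which interpreter a machine actually picks up is to print its bitness and architecture. A minimal sketch using only the standard library:

```python
# Minimal diagnostic sketch (not from this thread): print the version, bitness
# and machine architecture of whichever Python interpreter is on PATH, to
# confirm whether the x86 (32-bit) 3.7.0 build is the one actually being used.
import platform
import struct

print("python version:", platform.python_version())
print("pointer size  :", struct.calcsize("P") * 8, "bit")
print("machine       :", platform.machine())
```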
Great, looks like we're starting to untangle the problem. Have you got any idea whether we might be able to somehow divide the problem space to identify which of the three components of the update caused the regression?
I have a system in the windows.10.arm64 queue (we're still in the process of updating the rest of the systems in that queue) with the same software configuration but different hardware. If we are able to run these workloads on that system, that will tell us whether it's hardware specific or something with the software configuration.
Awesome, just please let me know what to alter in the queue definitions or whatnot and I can easily trigger a private run to evaluate that.
I have disabled all systems in the windows.10.arm64 queue except for the one with the latest build of Windows and Python 3. If you send your workloads to that queue, they will be run against this system.
Hmm, apparently I've messed something up in the experimental run:
Sending Job to Windows.10.Arm64...
F:\workspace\_work\1\s\.packages\microsoft.dotnet.helix.sdk\5.0.0-beta.19617.1\tools\Microsoft.DotNet.Helix.Sdk.MonoQueue.targets(47,5): error : ArgumentException: Unknown QueueId. Check that authentication is used and user is in correct groups. [F:\workspace\_work\1\s\src\coreclr\tests\helixpublishwitharcade.proj]
Please let me know what I need to change when you have a chance and I'll retry the run. Thanks a lot, Tomas
I talked with my team and it appears that running your workloads on the Windows.10.Arm64 queue is not as simple a task as I first thought. Why don't we do this: let's coordinate a good time for me to move the laptop to the Windows.10.Arm64.Open queue and make it the only available system in that queue. What time works best for you?
@JpratherMS @ilyas1974 @MattGal If the Windows ARM machine problem is a network issue, has anyone investigated the network slowness? @trylek and all: what are the next steps to get Windows ARM and Windows ARM64 jobs running again? Just troubleshoot the networking issues? Use different machines?
@BruceForstall - I worked with Matt on running a couple of experimental runs against the queues he suggested, but none has worked so far due to various infra issues. I believe that once we receive a queue we're able to run our pipelines on, that should basically be it. Please note that most of this discussion has dealt with 32-bit ARM; in fact, I haven't yet investigated whether we might already be able to re-enable the ARM64 runs using the Windows.10.Arm64.Open queue that more or less started this thread by no longer supporting ARM32 - I can easily look into that.
/cc @tommcdon for visibility
@trylek What's the current status of Windows arm32/arm64 testing?
@BruceForstall - I re-enabled the Windows.Arm64 PR job last Monday, see e.g. here: [I grabbed a PR that happened to succeed to make a good impression ;-).] For Windows arm32, there seems to be a new queue of the Galaxy Book machines DDFUN is standing up; @ilyas1974 sent out an update earlier today. I'll send an initial canary run to their new queue, and I guess next week we can decide on re-enabling the Windows arm32 PR jobs if it turns out to be stable enough.
There are currently 18 Galaxy Book systems online with this new queue.
Nice!
Windows arm32 is still disabled until #1097 is fully resolved.
Arm32 is now being run again.
arm32 library test runs are still disabled against this issue: runtime/eng/pipelines/runtime.yml, line 945 at 5092e10
The comment should be removed; we've removed almost all Windows arm32 testing: #39655
So we don't run our libraries tests against arm32 anywhere at all anymore?
Specifically Windows arm32; Linux arm32 testing should be everywhere. I don't know about the libraries. For coreclr, I think we still do builds (but not test runs) in CI, and test runs only in the "runtime-coreclr outerloop" pipeline (but none of the many stress pipelines).
For libraries, I believe we agreed to no longer test against arm32. We think the chances of finding an arm32 bug in libraries tests are very low; however, the JIT does need testing, so we scaled that testing back to the outerloop pipeline.
I think it was @jkotas' observation - I believe it, but it's not impossible, and it feels odd to offer first-class support without ever running libraries tests, even once a cycle. I'm not going to push on this if the consensus is that it's not necessary/feasible. Do we get coverage from the dotnet/iot repo?
I made this comment in the context of doing arm32 tests for every PR. I agree that we should have once-a-day or once-a-week test runs of the full matrix for everything we ship as officially supported.
@jaredpar thoughts about this? We should have no tests on Windows ARM32, and occasional but regular runs (that include regular library unit tests) on Linux ARM32.
I believe the reason we stopped running tests is that win-arm32 is not a supported platform on .NET 5: https://github.com/dotnet/core/blob/master/release-notes/5.0/5.0-supported-os.md
Actually, should we stop building the win-arm32 runtime pack?
We should have stopped all Windows ARM32 activities so far as I can see. |
First known occurrence:
https://dev.azure.com/dnceng/public/_build/results?buildId=462402&view=results
Commit info:
Proximate failure:
C000007B = STATUS_INVALID_IMAGE_FORMAT
As this happens on ARM, a usual cause is erroneously mixing up ARM and Intel binaries.
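A quick way to check for that kind of mixup is to read the COFF machine field from each suspect binary. A minimal sketch (not from the issue; pass whatever file paths you want to inspect):

```python
# Minimal sketch: read the COFF "Machine" field from a PE file to tell ARM and
# ARM64 binaries apart from x86/x64 ones, since mixing them up is a common
# cause of STATUS_INVALID_IMAGE_FORMAT (0xC000007B).
import struct
import sys

MACHINE_NAMES = {
    0x014C: "x86 (I386)",
    0x8664: "x64 (AMD64)",
    0x01C4: "ARM (ARMNT / Thumb-2)",
    0xAA64: "ARM64",
}

def pe_machine(path):
    with open(path, "rb") as f:
        f.seek(0x3C)                                  # e_lfanew: offset of the PE signature
        pe_offset = struct.unpack("<I", f.read(4))[0]
        f.seek(pe_offset)
        if f.read(4) != b"PE\x00\x00":
            raise ValueError("not a PE image: " + path)
        machine = struct.unpack("<H", f.read(2))[0]   # first field of the COFF header
    return MACHINE_NAMES.get(machine, hex(machine))

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "->", pe_machine(name))
```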