Disabled Job - Libraries Test Run release mono windows x64 Release #45524

Closed
riarenas opened this issue Dec 3, 2020 · 20 comments
Labels: area-Infrastructure-mono, blocking-clean-ci (Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms')

Comments

@riarenas
Member

riarenas commented Dec 3, 2020

Almost all tests involving networking fail in that leg. The test runs have been disabled temporarily in #45529

An example in https://dnceng.visualstudio.com/public/_build/results?buildId=906071&view=ms.vss-test-web.build-test-results-tab

[6:10 PM] Alexander Köplinger
hmm this might be related to the helix rollout since it overlaps with when the rollout happened. all of the workitems have something to do with networking or hitting some http, and some have exceptions when trying to instantiate MsQuicApi
it's possible that a new Windows version got rolled out that enables http3/quic somehow and that triggers this codepath

Exploratory PR in #45520

@riarenas riarenas added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Dec 3, 2020
@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 3, 2020
@hoyosjs hoyosjs changed the title from "multiple tests in Libraries Test Run release mono windows x64 Release failing" to "Disabled Job - Libraries Test Run release mono windows x64 Release" Dec 3, 2020
@scalablecory
Contributor

scalablecory commented Dec 3, 2020

Looking at logs, failures appear to be related to reflection. Can you point me to where you are seeing MsQuicApi as possible suspect?

some have exceptions when trying to instantiate MsQuicApi

A first-chance exception is expected here as a P/Invoke call is used to test for a DLL's presence. Are you seeing the exception leak out in other ways?

it's possible that a new Windows version got rolled out that enables http3/quic somehow and that triggers this codepath

Even with an updated Windows, unless the new version of Windows includes the required msquic.dll to light up the .NET feature -- I have been told this is explicitly not the plan for MsQuic -- I would not expect any change in behavior. CC @nibanks

@akoeplinger
Member

akoeplinger commented Dec 3, 2020

One example is in this log:

System.Xml.Tests.TC_SchemaSet_Add_URL.v4 [FAIL]
      System.NullReferenceException : Object reference not set to an instance of an object
      Stack Trace:
        /_/src/mono/netcore/System.Private.CoreLib/src/System/Reflection/RuntimeMethodInfo.cs(378,0): at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)

Unhandled Exception:
System.TypeInitializationException: The type initializer for 'System.Net.Quic.Implementations.MsQuic.Internal.MsQuicApi' threw an exception.

 ---> System.NullReferenceException: Object reference not set to an instance of an object

   at System.Net.Quic.Implementations.MsQuic.Internal.MsQuicApi..ctor() in /_/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/Internal/MsQuicApi.cs:line 28

   at System.Net.Quic.Implementations.MsQuic.Internal.MsQuicApi..cctor() in /_/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/Internal/MsQuicApi.cs:line 156

   --- End of inner exception stack trace ---
[ERROR] FATAL UNHANDLED EXCEPTION: System.TypeInitializationException: The type initializer for 'System.Net.Quic.Implementations.MsQuic.Internal.MsQuicApi' threw an exception.

 ---> System.NullReferenceException: Object reference not set to an instance of an object

   at System.Net.Quic.Implementations.MsQuic.Internal.MsQuicApi..ctor() in /_/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/Internal/MsQuicApi.cs:line 28

   at System.Net.Quic.Implementations.MsQuic.Internal.MsQuicApi..cctor() in /_/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/Internal/MsQuicApi.cs:line 156

   --- End of inner exception stack trace ---

It sounds like something goes wrong when loading the MsQuic library but I don't see any related changes between the good/bad commits so my current hunch is that something in the OS changed.

@akoeplinger
Member

Hmm, after more digging this doesn't look related to MsQuic anymore. I tried disabling loading it with #45520, but other workitems still hit the reflection issue.

We seem to be hitting it on the release/5.0 branch as well so it definitely looks related to the recent Helix rollout.

@riarenas
Member Author

riarenas commented Dec 3, 2020

@dotnet/dnceng for visibility. This looks like it started failing on all windows queues as of the latest rollout.

@premun
Member

premun commented Dec 3, 2020

I opened https://github.com/dotnet/core-eng/issues/11569 for tracking

@eerhardt
Member

eerhardt commented Dec 3, 2020

I hit the same NullReferenceException in https://dev.azure.com/dnceng/public/_build/results?buildId=906896&view=logs&j=d81dbc02-ec0b-5bd5-14c1-0072bd710d3b&t=780f1b30-8338-5e4d-6b72-c57eeb2413c9&l=76

Executed on a000GUM

C:\h\w\A14F08F6\w\AB370972\e>call RunTests.cmd --runtime-path C:\h\w\A14F08F6\p 
----- start Thu 12/03/2020  4:22:57.90 ===============  To repro directly: ===================================================== 
pushd C:\h\w\A14F08F6\w\AB370972\e\
"C:\h\w\A14F08F6\p\dotnet.exe" exec --runtimeconfig System.Data.Common.Tests.runtimeconfig.json --depsfile System.Data.Common.Tests.deps.json xunit.console.dll System.Data.Common.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================

C:\h\w\A14F08F6\w\AB370972\e>"C:\h\w\A14F08F6\p\dotnet.exe" exec --runtimeconfig System.Data.Common.Tests.runtimeconfig.json --depsfile System.Data.Common.Tests.deps.json xunit.console.dll System.Data.Common.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: System.Data.Common.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Data.Common.Tests (found 1810 of 1813 test cases)
  Starting:    System.Data.Common.Tests (parallel test collections = on, max threads = 2)
    System.Data.Tests.SqlTypes.SqlStringTest.ReadWriteXmlTest [FAIL]
      System.NullReferenceException : Object reference not set to an instance of an object
      Stack Trace:
        /_/src/mono/netcore/System.Private.CoreLib/src/System/Reflection/RuntimeMethodInfo.cs(378,0): at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
    System.Data.Common.Tests.DbConnectionTests.CanBeFinalized [SKIP]
      Condition(s) not met: "IsPreciseGcSupported"
    System.Data.Common.Tests.DbCommandTests.CanBeFinalized [SKIP]
      Condition(s) not met: "IsPreciseGcSupported"
   System.Data.Common.Tests: [Long Running Test] 'System.Data.Tests.SqlTypes.SqlInt64Test.ReadWriteXmlTest', Elapsed: 00:02:11
   System.Data.Common.Tests: [Long Running Test] 'System.Data.Tests.SqlTypes.SqlInt64Test.ReadWriteXmlTest', Elapsed: 00:04:11
   System.Data.Common.Tests: [Long Running Test] 'System.Data.Tests.SqlTypes.SqlInt64Test.ReadWriteXmlTest', Elapsed: 00:06:11
   System.Data.Common.Tests: [Long Running Test] 'System.Data.Tests.SqlTypes.SqlInt64Test.ReadWriteXmlTest', Elapsed: 00:08:11
   System.Data.Common.Tests: [Long Running Test] 'System.Data.Tests.SqlTypes.SqlInt64Test.ReadWriteXmlTest', Elapsed: 00:10:11
   System.Data.Common.Tests: [Long Running Test] 'System.Data.Tests.SqlTypes.SqlInt64Test.ReadWriteXmlTest', Elapsed: 00:12:11
   System.Data.Common.Tests: [Long Running Test] 'System.Data.Tests.SqlTypes.SqlInt64Test.ReadWriteXmlTest', Elapsed: 00:14:11

@ghost

ghost commented Dec 3, 2020

Tagging subscribers to this area: @directhex
See info in area-owners.md if you want to be subscribed.


@MattGal
Member

MattGal commented Dec 3, 2020

Investigating via https://github.com/dotnet/core-eng/issues/11569

@akoeplinger
Member

@eerhardt those builds ran before the leg was disabled.

@akoeplinger
Member

I think this might be related to the VS update since I can reproduce it locally, i.e. the cause is not the Helix test machine but the build machine.

@akoeplinger
Member

Ok, I confirmed that this is indeed due to the VS update.

Good:

  **********************************************************************
  ** Visual Studio 2019 Developer Command Prompt v16.7.3
  ** Copyright (c) 2020 Microsoft Corporation
  **********************************************************************
  [vcvarsall.bat] Environment initialized for: 'x86_x64'
  -- Building for: Visual Studio 16 2019
  -- The C compiler identification is MSVC 19.27.29111.0
  -- The CXX compiler identification is MSVC 19.27.29111.0

Bad:

  **********************************************************************
  ** Visual Studio 2019 Developer Command Prompt v16.8.0
  ** Copyright (c) 2020 Microsoft Corporation
  **********************************************************************
  [vcvarsall.bat] Environment initialized for: 'x86_x64'
  -- Building for: Visual Studio 16 2019
  -- The C compiler identification is MSVC 19.28.29333.0
  -- The CXX compiler identification is MSVC 19.28.29333.0

@akoeplinger akoeplinger removed the untriaged New issue has not been triaged by the area owner label Dec 3, 2020
@akoeplinger
Member

To reproduce:

.\build.cmd mono+libs+libs.pretest -c Release

.\dotnet.cmd build /p:RuntimeFlavor=mono /t:Test /p:Configuration=Release .\src\libraries\System.Private.Xml\tests\XmlSchema\XmlSchemaSet\System.Xml.XmlSchemaSet.Tests.csproj

@radical
Member

radical commented Dec 3, 2020

BatchedCI build for 2ee13ec is also failing with NRE in RuntimeMethodInfo.Invoke.

lateralusX added a commit to lateralusX/runtime that referenced this issue Jan 5, 2021
After the upgrade to a later MSVC version on the CI bots (see
dotnet#45524 for details),
Windows x64 Release builds started to crash on libraries tests.

After investigation, it turns out that the new MSVC compiler handles an
expression differently from the previous version.

After the MSVC upgrade, the expression:

int _amd64_width_temp = ((guint64)(imm) == (guint64)(int)(guint64)(imm));

implemented in amd64_mov_reg_imm and called from tramp-amd64.c@500,
was transformed by the compiler into an always-true expression:

amd64_mov_reg_imm (code, AMD64_R11, (guint8*)mono_get_rethrow_preserve_exception_addr ());
lea rcx,[rethrow_preserve_exception_func (07FFB9E33A590h)]
mov word ptr [rbx+0Dh],0BB41h
mov byte ptr [rbx+0Fh],cl
mov rax,rcx
shr eax,8
mov byte ptr [rbx+10h],al
mov rax,rcx
shr eax,10h
shr ecx,18h
mov byte ptr [rbx+11h],al
lea rax,[rbx+13h]
mov byte ptr [rbx+12h],cl

As seen above, the condition and the handling of a 64-bit imm have been
dropped by the compiler.

This causes issues when the imm is a 64-bit value, since it always gets
truncated into a 32-bit imm. In this case it was a pointer to a function
within coreclr.dll (mono_get_rethrow_preserve_exception_addr)
loaded at a higher address (needing more than 32 bits).
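
To make the truncation concrete, here is a minimal C sketch. It is not Mono's emitter: `emit_imm_le`, `read_imm_le`, and `truncation_demo` are hypothetical names used only for illustration. It writes the address shown in the disassembly above as a 4-byte and as an 8-byte little-endian immediate, assuming the 4-byte form is read back zero-extended, as a 32-bit register move behaves on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the low `width` bytes of `imm` to `buf` in little-endian order,
   the way an instruction encoder stores an immediate operand. */
static void emit_imm_le(uint8_t *buf, uint64_t imm, size_t width)
{
    for (size_t i = 0; i < width; i++)
        buf[i] = (uint8_t)(imm >> (8 * i));
}

/* Read a `width`-byte little-endian immediate back, zero-extended to
   64 bits (assumption: models a move into a 32-bit register). */
static uint64_t read_imm_le(const uint8_t *buf, size_t width)
{
    uint64_t value = 0;
    for (size_t i = 0; i < width; i++)
        value |= (uint64_t)buf[i] << (8 * i);
    return value;
}

/* Hypothetical self-check using the address from the disassembly. */
static void truncation_demo(void)
{
    const uint64_t addr = UINT64_C(0x00007FFB9E33A590);
    uint8_t buf[8];

    /* A 4-byte immediate keeps only the low 32 bits of the pointer. */
    emit_imm_le(buf, addr, 4);
    assert(read_imm_le(buf, 4) == UINT64_C(0x9E33A590));

    /* An 8-byte immediate round-trips the full pointer. */
    emit_imm_le(buf, addr, 8);
    assert(read_imm_le(buf, 8) == addr);
}
```

With the 4-byte form, the upper bits of the pointer (0x7FFB here) are lost, so a call through the resulting value would not reach the intended function.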

This is most likely a compiler regression for this specific
construction. I tried a simpler construction (using the same type conversion) on
both the old and new compiler versions, and there it makes the right optimization.

The fix is to switch to a macro already available in amd64-codegen (amd64_is_imm32)
that detects whether an imm needs a 32-bit or 64-bit sized value. This is
correctly optimized by the new MSVC compiler, and even though it is a workaround
for what seems to be an optimization bug in the compiler, it is still
cleaner and better describes the intent than the current code.

The fix also re-enables the Windows x64 Release CI test lane.
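
The difference between the two checks can be sketched in portable C. This is an illustration under stated assumptions, not the Mono source: `fits_imm32_cast` is modeled on the expression quoted above, while `fits_imm32_range` is a hypothetical range comparison standing in for the idea behind amd64_is_imm32, whose actual definition is not shown in this thread. `imm32_checks_agree` is likewise a made-up helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cast-based check, modeled on the quoted expression: narrow to a
   32-bit signed value, widen again, and compare with the original.
   The narrowing conversion is implementation-defined in C; this
   sketch assumes the usual two's-complement wraparound. */
static int fits_imm32_cast(uint64_t imm)
{
    return imm == (uint64_t)(int32_t)imm;
}

/* Range-based check (hypothetical stand-in for the macro-based fix):
   ask directly whether the value lies in the signed 32-bit range. */
static int fits_imm32_range(uint64_t imm)
{
    int64_t v = (int64_t)imm;
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Both predicates should give the same answer for every input. */
static void imm32_checks_agree(void)
{
    const uint64_t samples[] = {
        UINT64_C(0),
        UINT64_C(0x7FFFFFFF),         /* largest positive imm32 */
        UINT64_C(0x80000000),         /* one past it: needs 64 bits */
        UINT64_C(0xFFFFFFFFFFFFFFFF), /* -1: fits as a signed imm32 */
        UINT64_C(0x00007FFB9E33A590), /* address from the disassembly */
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(fits_imm32_cast(samples[i]) == fits_imm32_range(samples[i]));
}
```

Where narrowing conversions wrap, the two predicates are equivalent, so the range form changes only how the intent is expressed to the compiler. The commit message reports that the newer MSVC folded the cast form to always true in that one construction.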
@lateralusX
Member

Crash should be fixed by #46573. PR also re-enables the Windows x64 Release CI lane.

monojenkins pushed a commit to monojenkins/mono that referenced this issue Jan 5, 2021
@danmoseley
Member

Do we need to report to compiler team?

@lateralusX
Member

@danmosemsft Yes that would be good, how do we best progress regarding that?

lateralusX added a commit to mono/mono that referenced this issue Jan 11, 2021
Co-authored-by: lateralusX <lateralusX@users.noreply.github.com>
lateralusX added a commit that referenced this issue Jan 11, 2021
@akoeplinger
Member

@lateralusX I'd start with https://docs.microsoft.com/en-us/cpp/overview/how-to-report-a-problem-with-the-visual-cpp-toolset which essentially means opening an issue on developer community via https://aka.ms/feedback/report?space=62.

If we don't get a response we can ping someone internally.

@akoeplinger
Member

Closing the issue since the problem on our end was fixed with #46573

@ghost ghost locked as resolved and limited conversation to collaborators Feb 10, 2021