Error in CI Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release #60234

Closed
BruceForstall opened this issue Oct 10, 2021 · 18 comments · Fixed by #63800

Comments

@BruceForstall
Member

BruceForstall commented Oct 10, 2021

in LLVM AOT cross-compile CoreCLR tests

Error message:
##[error]Exit code 137 returned from process: file name '/usr/bin/docker', arguments 'exec -i -u 1000 -w /home/cloudtest_azpcontainer 12e65458950cdcacb35a892c9bd140157953823f62a984dfc4c13e1632863c42 /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.

https://dev.azure.com/dnceng/public/_build/results?buildId=1411090&view=logs&j=b40c4ee0-d59d-50cb-cee3-6c499edc769b&t=24038e5f-e5d5-5f4a-9cc0-31eac914e6d6

Is this a real error or infrastructure/random? Is it known? What is the actual problem?

@dotnet/runtime-infrastructure

Runfo Tracking Issue: LLVM AOT build gets killed under docker

| Definition | Build | Kind | Job Name |
| --- | --- | --- | --- |
| runtime | 1569424 | PR 63800 | Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release |
| runtime | 1553349 | PR 63800 | Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release |
| runtime | 1551116 | PR 63689 | Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release |

Build Result Summary

| Day Hit Count | Week Hit Count | Month Hit Count |
| --- | --- | --- |
| 1 | 1 | 3 |

@ghost

ghost commented Oct 10, 2021

Tagging subscribers to this area: @directhex
See info in area-owners.md if you want to be subscribed.

Author: BruceForstall
Assignees: -
Labels:

area-Infrastructure-mono

Milestone: -

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Oct 10, 2021
@BruceForstall
Member Author

Failure from this PR: #60192

@agocke agocke added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' and removed untriaged New issue has not been triaged by the area owner labels Oct 18, 2021
@agocke
Member

agocke commented Oct 18, 2021

This is failing almost every runtime build @directhex @steveisok

@steveisok
Member

I've seen this pop up w/ wasm runs. Most likely some kind of OOM. We should take the leg down until we come up with a fix/workaround.
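
For context on the OOM suspicion: exit code 137 follows the common shell convention of 128 + signal number, so it corresponds to signal 9 (SIGKILL), which is the signal the Linux OOM killer sends. SIGKILL can come from other sources too, so this is a hint rather than proof. A minimal sketch of decoding such an exit code (the helper name `describe_exit_code` is hypothetical, not part of any pipeline tooling):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a process exit code using the 128 + N signal convention."""
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

On a Linux agent this reports SIGKILL for 137, which is consistent with (but does not confirm) the process being killed for exceeding available memory.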

steveisok pushed a commit to steveisok/runtime that referenced this issue Oct 18, 2021
Addresses dotnet#60234

Suspicion is that we're hitting some kind of OOM error within docker.  Unblocks CI while we investigate.
@lambdageek
Member

lambdageek commented Oct 18, 2021

The last 3 (as of right now) (and some randomly clicked older) runs from the runfo list all fail when AOTing Microsoft.Diagnostics.Tracing.TraceEvent.dll

aot-compile: compiling /__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/Microsoft.Diagnostics.Tracing.TraceEvent.dll; MONO_PATH: /__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root:/__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root

Is that one unexpectedly large?

@hoyosjs
Member

hoyosjs commented Oct 18, 2021

cc: @josalem for AOT of TraceEvent

akoeplinger pushed a commit that referenced this issue Oct 19, 2021
Addresses #60234

Suspicion is that we're hitting some kind of OOM error within docker.  Unblocks CI while we investigate.

Co-authored-by: Steve Pfister <steve.pfister@microsoft.com>
@josalem
Contributor

josalem commented Oct 22, 2021

Saw an interesting one where it's failing to locate DiaSymReader for compiling ReadyToRun binaries:
https://dev.azure.com/dnceng/public/_build/results?buildId=1434785&view=logs&j=d631b51d-8d68-5da3-5a13-95f6a763c7bb&t=800563d8-8639-597e-b5b2-a98a874df5bf&l=28344

 Mono Ahead of Time compiler - compiling assembly /__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/ILCompiler.TypeSystem.ReadyToRun.dll
  AOTID 8FE1236D-0DB8-5A35-FB2C-8F1305E04CD5
  Could not load signature of Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:.ctor due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.
  Could not load signature of Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:GetUrl due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.
  Could not load signature of Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:ProbeScopeForLocals due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.
  Could not load signature of Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:.ctor due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.
  Could not load signature of Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:GetUrl due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.
  Could not load signature of Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:ProbeScopeForLocals due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.
  Unable to compile method 'void Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:.cctor ()' due to: 'Could not resolve field token 0x040002f6, due to: Could not load type of field 'Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:s_symBinder' (1) due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. assembly:/__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/ILCompiler.TypeSystem.ReadyToRun.dll type:UnmanagedPdbSymbolReader member:(null)'.
  Unable to compile method 'Internal.TypeSystem.Ecma.PdbSymbolReader Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:TryOpenSymbolReaderForMetadataFile (string,string)' due to: 'Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.'.
  Unable to compile method 'void Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:Dispose ()' due to: 'Could not resolve field token 0x040002f8, due to: Could not load type of field 'Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:s_symBinder' (1) due to: Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. assembly:/__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/ILCompiler.TypeSystem.ReadyToRun.dll type:UnmanagedPdbSymbolReader member:(null)'.
  Unable to compile method 'System.Collections.Generic.IEnumerable`1<Internal.IL.ILLocalVariable> Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader:GetLocalVariableNamesForMethod (int)' due to: 'Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.'.
  Unable to compile method 'void Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader/<GetSequencePointsForMethod>d__15:.ctor (int)' due to: 'Could not resolve field token 0x040003c5, due to: Invalid type Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader for instance field Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader+<GetSequencePointsForMethod>d__15:<>4__this assembly:/__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/ILCompiler.TypeSystem.ReadyToRun.dll type:<GetSequencePointsForMethod>d__15 member:(null)'.
  Unable to compile method 'bool Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader/<GetSequencePointsForMethod>d__15:MoveNext ()' due to: 'Could not load file or assembly 'Microsoft.DiaSymReader, Version=1.3.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies.'.
  Unable to compile method 'Internal.IL.ILSequencePoint Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader/<GetSequencePointsForMethod>d__15:System.Collections.Generic.IEnumerator<Internal.IL.ILSequencePoint>.get_Current ()' due to: 'Could not resolve field token 0x040003c6, due to: Invalid type Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader for instance field Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader+<GetSequencePointsForMethod>d__15:<>4__this assembly:/__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/ILCompiler.TypeSystem.ReadyToRun.dll type:<GetSequencePointsForMethod>d__15 member:(null)'.
  Unable to compile method 'object Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader/<GetSequencePointsForMethod>d__15:System.Collections.IEnumerator.get_Current ()' due to: 'Could not resolve field token 0x040003c6, due to: Invalid type Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader for instance field Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader+<GetSequencePointsForMethod>d__15:<>4__this assembly:/__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/ILCompiler.TypeSystem.ReadyToRun.dll type:<GetSequencePointsForMethod>d__15 member:(null)'.
  Unable to compile method 'System.Collections.Generic.IEnumerator`1<Internal.IL.ILSequencePoint> Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader/<GetSequencePointsForMethod>d__15:System.Collections.Generic.IEnumerable<Internal.IL.ILSequencePoint>.GetEnumerator ()' due to: 'Could not resolve field token 0x040003c5, due to: Invalid type Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader for instance field Internal.TypeSystem.Ecma.UnmanagedPdbSymbolReader+<GetSequencePointsForMethod>d__15:<>4__this assembly:/__w/2/s/artifacts/tests/coreclr/Linux.arm64.Release/Tests/Core_Root/ILCompiler.TypeSystem.ReadyToRun.dll type:<GetSequencePointsForMethod>d__15 member:(null)'.

@krwq krwq removed the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Oct 25, 2021
@krwq
Member

krwq commented Oct 25, 2021

Removing blocking-clean-ci label because the legs are disabled

@SamMonoRT
Member

Who is investigating the actual cause of the OOM issue?

@steveisok
Member

I think it would be better to tweak docker and see if that helps fix the problem. @MattGal suggested tweaking vm.overcommit_memory to start.

@steveisok steveisok self-assigned this Oct 26, 2021
@steveisok
Member

@MattGal doesn't vm.overcommit_memory have to be set on the host?

@MattGal
Member

MattGal commented Oct 27, 2021

> @MattGal doesn't vm.overcommit_memory have to be set on the host?

Oops, yeah I think that may be true. If you can comfortably change the pipeline to drive docker itself that would be allowed to be set on the host... @MichaelSimons / @mthalman may have some suggestions how this could be achieved best.
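
For reference, vm.overcommit_memory is a kernel-wide setting exposed at /proc/sys/vm/overcommit_memory, and a container shares the host's kernel, which is why the change has to happen on the host. A small sketch for checking the current mode from inside a job, assuming a Linux agent (the function names are hypothetical; the mode meanings follow the Linux kernel documentation):

```python
from pathlib import Path

# Meaning of each vm.overcommit_memory value per the kernel docs.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the commit limit",
}


def describe_overcommit(value: int) -> str:
    """Return a human-readable description of an overcommit mode."""
    return OVERCOMMIT_MODES.get(value, f"unknown mode {value}")


def read_overcommit(path: str = "/proc/sys/vm/overcommit_memory"):
    """Read the current mode, or return None if the file is absent."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


mode = read_overcommit()
if mode is not None:
    print(describe_overcommit(mode))
```

This only reads the value; changing it still needs privileges on the host.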

@mthalman
Member

> If you can comfortably change the pipeline to drive docker itself that would be allowed to be set on the host... @MichaelSimons / @mthalman may have some suggestions how this could be achieved best.

I'm not understanding the request here. How "what" could be best achieved?

@steveisok
Member

steveisok commented Nov 9, 2021

> > If you can comfortably change the pipeline to drive docker itself that would be allowed to be set on the host... @MichaelSimons / @mthalman may have some suggestions how this could be achieved best.
>
> I'm not understanding the request here. How "what" could be best achieved?

The what in this case is being able to tweak the docker host setting vm.overcommit_memory in the pipeline we have. We currently don't have any way to do this nor any insight into how.

@mthalman
Member

mthalman commented Nov 9, 2021

> The what in this case is being able to tweak the docker host setting vm.overcommit_memory in the pipeline we have. We currently don't have any way to do this nor any insight into how.

Ok. I don't have any experience with the pipelines that are used in this repo so I can't comment on how to best execute logic on the host. Is the problem that the hook points you have available to you are running within the container?

@steveisok
Member

Yeah, I believe that's the case.

@steveisok
Member

@MattGal @mthalman This seems a bit uncertain. I think I would prefer bumping the machine memory. What options do we have?

@MattGal
Member

MattGal commented Nov 10, 2021

> @MattGal @mthalman This seems a bit uncertain. I think I would prefer bumping the machine memory. What options do we have?

Not sure I completely follow:

  • If you mean making the host just have more memory, this is probably a non-starter; we always use 4-core, 16 GB RAM machines, which is generally plenty for most builds, and the next size up represents a doubling in per-machine cost regardless of processor brand. Since 1ES pool provider machines are homogeneous (all the same SKU / size) this would have to happen for the entire pool provider.
  • If you would like to have the image configured differently, this is absolutely possible; make an issue in dotnet/core-eng specifying what you're asking for, apply the "first responders" label, and we'll see what we can do.

One final option, though ugly, would be to just drive docker directly from the build instead of specifying a build within a container. This lets you set all sorts of interesting settings from the outside, but getting files in and out, mounting volumes, etc. becomes your build's problem, so this should definitely be considered a last-resort option.
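
To illustrate what driving docker directly from the build could look like, here is a rough sketch that assembles a `docker run` command line with an explicit memory limit and a bind mount for the sources. Everything specific here is an assumption for illustration: the image name, paths, command, the `12g` limit, and the helper name `docker_run_argv` are placeholders, not values from this repo's pipelines.

```python
import shlex


def docker_run_argv(image, command, workdir, source_dir, memory="12g"):
    """Build an argv list for `docker run` with memory limits and a bind mount."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,               # hard memory limit for the container
        "--memory-swap", memory,          # same value as --memory: no extra swap
        "-v", f"{source_dir}:{workdir}",  # getting files in and out is the build's job
        "-w", workdir,
        image,
        *command,
    ]


argv = docker_run_argv(
    image="example.invalid/build-image:latest",  # placeholder image
    command=["echo", "hello"],                   # placeholder build command
    workdir="/src",
    source_dir="/home/agent/runtime",            # placeholder host path
)
print(shlex.join(argv))
```

The sketch only prints the command; a real build step would also have to handle the volume mounts, artifacts, and cleanup mentioned above.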

@steveisok steveisok assigned directhex and unassigned steveisok Jan 13, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 14, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 25, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Feb 25, 2022