Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA followed by process killed / return 137 #6978

Open
ericstj opened this issue Jan 30, 2024 · 2 comments
Labels

  • blocking-clean-ci: Blocking PR or rolling builds
  • bug: Something isn't working
  • Known Build Error: Use this to report build issues in the .NET Helix tab
  • untriaged: New issue has not been triaged

Comments

Member

ericstj commented Jan 30, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=530980&view=results
Build error leg or test failing: Microsoft.ML.TorchSharp.Tests Work Item
Pull Request #6976

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": [ "Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA", "+ export _commandExitCode=137" ],
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
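
Neither string in "ErrorMessage" is unique on its own, so the match relies on the pair. The sketch below shows one way such a two-line pattern could be checked against a console log. It assumes the array form is matched as substrings that must appear in order but need not be adjacent; that is my reading of the known-issues guidance, not something this issue confirms. The function name `matches_in_order` is hypothetical.

```python
def matches_in_order(log_lines, patterns):
    """Return True if each pattern is a substring of some log line, in order.

    Assumption: matched lines do not have to be adjacent, which is what lets
    the "Killed" line sit between the two patterns below.
    """
    remaining = iter(log_lines)  # shared iterator, so each search resumes after the previous hit
    return all(any(p in line for line in remaining) for p in patterns)


# Console excerpt quoted in this issue.
console_log = [
    "Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA",
    "Killed",
    "+ export _commandExitCode=137",
]

# The two strings from the "ErrorMessage" array above.
error_message = [
    "Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA",
    "+ export _commandExitCode=137",
]

print(matches_in_order(console_log, error_message))
```

Under that assumption, a log in which TestSimpleQA starts and the work item later exits with 137 would match, while a log where a different test was running when the process died would not.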

System Information (please complete the following information):

  • OS & Version: Ubuntu 18.04
  • ML.NET Version: latest
  • .NET Version: .NET 6.0

Describe the bug
This test is failing in CI somewhat regularly. The error pattern looks like the following:

Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA
Killed
+ export _commandExitCode=137
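
For context on the exit code: shells conventionally report 128 + N when a process is terminated by signal N, so 137 corresponds to signal 9 (SIGKILL). Together with the "Killed" line, that is consistent with the Linux out-of-memory killer, although the log itself does not say so. A minimal sketch of the decoding, with a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit code.

    By convention, values above 128 mean the process was terminated
    by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```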

Here are a few instances:
https://helixre107v0xd1eu3ibi6ka.blob.core.windows.net/dotnet-machinelearning-refs-pull-6974-merge-f61a125156aa4af1bd/Microsoft.ML.TorchSharp.Tests/1/console.83a6fa6c.log?helixlogtype=result
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6976-merge-0a13c2cd41724c3483/Microsoft.ML.TorchSharp.Tests/1/console.ff57f777.log?helixlogtype=result

I can't currently capture this failure in a known issue because there is no unique line logged. I've seen this failure numerous times - always when TestSimpleQA is running.

Report

| Build | Definition | Test | Pull Request |
| --- | --- | --- | --- |
| 897244 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 896815 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7342 |
| 896771 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 894089 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 894077 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7334 |
| 893349 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 893297 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 893201 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 890221 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 889490 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 888867 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 887667 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 887399 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7330 |
| 886721 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7328 |
| 886303 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 886295 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 885954 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |
| 884593 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7329 |
| 883840 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7328 |
| 881188 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7327 |
| 881190 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 878953 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7316 |
| 878533 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7266 |
| 877391 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7319 |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 3 | 24 |

Known issue validation

Build: 🔎
Result validation: ⚠️ Build internal information not found. This may happen if your build is too old. Please use a build that is no older than two weeks. If the problem persists, contact .NET Engineering Services Team and share this issue.
Validation performed at: 2/14/2024 10:25:46 PM UTC

ericstj added the bug and blocking-clean-ci labels Jan 30, 2024
ghost added the untriaged label Jan 30, 2024
Member Author

ericstj commented Jan 30, 2024

@michaelgsharp made a good observation offline - we're seeing memory usage go up quite a bit as the tests progress.

Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 2,077,020,160.00 and max memory usage 2,370,473,984.00

That's about 2 GB of memory still in use after the previous test has completed.

Member Author

ericstj commented Jan 31, 2024

Wow - the memory usage of this test is very high. Here's what I see from a local passing run on Windows.

  Discovering: Microsoft.ML.TorchSharp.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  Microsoft.ML.TorchSharp.Tests (found 12 test cases)
  Starting:    Microsoft.ML.TorchSharp.Tests (parallel test collections = on [20 threads], stop on fail = off)
Starting test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNer
Finished test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNer with memory usage 751,607,808.00 and max memory usage 751,607,808.00
Starting test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNerOptions
    Microsoft.ML.TorchSharp.Tests.NerTests.TestNERLargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
Finished test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNerOptions with memory usage 895,778,816.00 and max memory usage 895,778,816.00
Starting test: Microsoft.ML.TorchSharp.Tests.ObjectDetectionTests.SimpleObjDetectionTest
total : 171, filtered: 0, filter ratio: 0.00%
Finished test: Microsoft.ML.TorchSharp.Tests.ObjectDetectionTests.SimpleObjDetectionTest with memory usage 1,142,628,352.00 and max memory usage 1,155,977,216.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence3Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence3Classes with memory usage 1,111,171,072.00 and max memory usage 1,155,977,216.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestDoubleSentence2Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestDoubleSentence2Classes with memory usage 1,352,704,000.00 and max memory usage 1,352,818,688.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence2Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence2Classes with memory usage 1,365,450,752.00 and max memory usage 1,366,872,064.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 1,362,817,024.00 and max memory usage 1,368,600,576.00
    Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarityLargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
    Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestTextClassificationWithBigDataOnGpu [SKIP]
      Condition(s) not met: "EnableRunningGpuTest"
Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA
Finished test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA with memory usage 4,675,801,088.00 and max memory usage 5,540,958,208.00
    Microsoft.ML.TorchSharp.Tests.QATests.TestQALargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
  Finished:    Microsoft.ML.TorchSharp.Tests

So we may have a leak (memory usage still grows from about 0.75 GB to 1.36 GB across the earlier tests), but TestSimpleQA itself also uses a ton of memory: 4.68 GB when it finishes, with a peak of 5.54 GB.
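
To compare runs more easily, the "Finished test" lines can be parsed into per-test numbers. The sketch below is only an illustration: the regular expression is inferred from the log format shown above, the helper name `parse_memory` is hypothetical, and the sample contains three lines copied from the local run.

```python
import re

# Pattern inferred from the "Finished test: ..." lines in the log above.
LINE_RE = re.compile(
    r"Finished test: (?P<name>\S+) with memory usage (?P<used>[\d,.]+)"
    r" and max memory usage (?P<peak>[\d,.]+)"
)


def parse_memory(log_text):
    """Return (test name, bytes in use, peak bytes) for each finished test."""
    rows = []
    for m in LINE_RE.finditer(log_text):
        used = float(m["used"].replace(",", ""))
        peak = float(m["peak"].replace(",", ""))
        rows.append((m["name"], used, peak))
    return rows


sample = """\
Finished test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNer with memory usage 751,607,808.00 and max memory usage 751,607,808.00
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 1,362,817,024.00 and max memory usage 1,368,600,576.00
Finished test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA with memory usage 4,675,801,088.00 and max memory usage 5,540,958,208.00
"""

previous = 0.0
for name, used, peak in parse_memory(sample):
    short_name = name.rsplit(".", 1)[-1]
    delta = used - previous  # growth since the previously listed test
    print(f"{short_name:<24} used {used / 1e9:5.2f} GB  peak {peak / 1e9:5.2f} GB  delta {delta / 1e9:+5.2f} GB")
    previous = used
```

Running the same parser over a failing Helix console log and a passing local log would show whether the growth before TestSimpleQA differs between the two environments.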

ericstj added the Known Build Error label Feb 7, 2024