
Problems with Azure Devops Reporter #7371

Closed
1 of 2 tasks
MattGal opened this issue May 11, 2021 · 15 comments

@MattGal
Member

MattGal commented May 11, 2021

  • This issue is blocking
  • This issue is causing unreasonable pain

As I've written up in https://github.com/dotnet/core-eng/issues/13026, we introduced a threading bug in https://github.com/dotnet/arcade/pull/7310/files that can crash the reporter. It may be difficult to reproduce intentionally, since it only shows up during actual reporting, but we can inspect the code, or revert the lock part of the change and keep only the "don't let it pass if it doesn't finish" part.

@alexperovich
Member

The error stack trace in https://github.com/dotnet/core-eng/issues/13026 doesn't have anything to do with the lock I added. This error happens because the interpreter can't acquire the lock for stdout, and nothing in my change affects locking of stdout. It's just a very rare race condition where one of the worker threads (Thread 0x000070000f354000 in that stack trace) happens to be inside a _print call at the exact moment the process is trying to exit.

Looking at what it was trying to print: one of the worker threads never got a chance to start before all the work was completed and the process exited.
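To illustrate the failure mode, here is a minimal sketch of the worker pattern involved (hypothetical names, not the actual `run.py` code): workers pull batches from a queue and print progress. If the workers were daemon threads and the main thread exited without joining them, interpreter shutdown could catch a worker inside `print()` while it holds the stdout lock, producing exactly the "could not acquire lock for <_io.BufferedWriter name='<stdout>'>" fatal error in the stack trace below. Joining non-daemon workers, as shown here, avoids the race.

```python
import queue
import threading

processed = []  # records which batches the workers handled

def worker(q):
    # Each worker pulls batches until it sees the None sentinel.
    while True:
        batch = q.get()
        if batch is None:
            return
        processed.append(batch)
        # If this thread were a daemon and the interpreter were shutting
        # down right now, this print could die holding the stdout lock.
        print(f"processed {batch}")

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(3)]
for t in threads:
    t.start()
for batch in ("batch-0", "batch-1"):
    q.put(batch)
for _ in threads:
    q.put(None)  # one sentinel per worker
for t in threads:
    t.join()  # joining (rather than daemonizing) avoids the shutdown race
```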

@MattGal
Member Author

MattGal commented May 11, 2021

Cool, I was mistaken. We should still perhaps use this or another issue to discuss handling failure better.

@ilyas1974 ilyas1974 changed the title Mutex bug in Azure Devops Reporter can cause crashes (regression) Add Retries in Azure Devops Reporter can cause crashes (regression) May 12, 2021
@ilyas1974 ilyas1974 changed the title Add Retries in Azure Devops Reporter can cause crashes (regression) Add Retries in Azure Devops Reporter (regression) May 12, 2021
@ilyas1974 ilyas1974 changed the title Add Retries in Azure Devops Reporter (regression) Add Retries in Azure Devops Reporter May 12, 2021
@MattGal
Member Author

MattGal commented May 12, 2021

Just had a great conversation with @safern and he made a very keen insight: if we update the Arcade reporter behavior to return the real (not coerced) exit code when the Azure DevOps reporter fails, it's easy to implement and gives us the best of both worlds. When reporting fails, the work item can still pass, and when both the work item and reporting fail, we still fail the check, e.g.:

if (reporting fails)
    { work item returns the actual test case's exit code, i.e. 1 }
else
    { work item returns the normal coerced exit code (0 if the exit code was 1, the actual exit code otherwise) }
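The pseudocode above can be sketched as a small Python helper (a hypothetical function for illustration, not actual Arcade SDK code; the assumption that a test-failure exit code is 1 comes from the pseudocode):

```python
def final_exit_code(test_exit_code: int, reporting_succeeded: bool) -> int:
    """Hypothetical helper mirroring the pseudocode above."""
    if not reporting_succeeded:
        # Reporting failed: return the real result, so a failing work item
        # still fails the check even though nothing reached Azure DevOps.
        return test_exit_code
    # Reporting succeeded: coerce "tests failed" (1) to success, since the
    # failures are surfaced through Azure DevOps; keep other codes as-is.
    return 0 if test_exit_code == 1 else test_exit_code
```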

@MattGal MattGal self-assigned this May 17, 2021
@MattGal
Member Author

MattGal commented May 17, 2021

Unfortunately, while interesting, this idea doesn't work as simply as I'd hoped, because there's no way to influence the decision made when generating RunTests.cmd (example template).

Without removing these lines, there's no way to make the above pseudocode work, since the test's exit code is coerced in code not owned by the Helix Arcade SDK. I'll confer with some runtime folks and get an idea of what they'd like to do.

MattGal added a commit to MattGal/arcade that referenced this issue May 17, 2021
…ttempts) , and stop trying in the case of "run is deleted"
@MattGal
Member Author

MattGal commented May 17, 2021

Going with the "just retry for 503s" approach, #7399
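A minimal sketch of the retry policy described here (10 attempts, 3 seconds apart, giving up immediately on "run is deleted"). The function and exception names are hypothetical, and the HTTP call is abstracted behind a callable; the real PR (#7399) wires this into the reporter's Azure DevOps calls.

```python
import time

class TransientServerError(Exception):
    """Stand-in for an HTTP 503 response (hypothetical exception type)."""

class RunDeletedError(Exception):
    """Stand-in for Azure DevOps reporting the test run was deleted."""

def send_with_retries(send, max_attempts=10, delay_seconds=3):
    """Call send(), retrying transient 503s up to max_attempts times,
    but stop immediately when the run has been deleted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except RunDeletedError:
            # A deleted run can never succeed on retry; give up right away.
            raise
        except TransientServerError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```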

MattGal added a commit that referenced this issue May 18, 2021
* #7371 - retry in the case of HTTP 503 (10x, 3 seconds between attempts), and stop trying in the case of "run is deleted"

* Oops
@MattGal
Member Author

MattGal commented May 18, 2021

This is merged and should begin flowing to main-branch Arcade consumers soon; I'll keep this in Validate until I see some uptake.

@ericstj
Member

ericstj commented May 18, 2021

Another hit on the threading issue: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-52793-merge-d1edb0fae6604797b1/Microsoft.Extensions.Hosting.Functional.Tests/console.f1e93568.log?sv=2019-07-07&se=2021-06-07T20%3A07%3A07Z&sr=c&sp=rl&sig=fZdoSW%2Fcu7ZkOzwiZiYlIOE6rA4GERWu6EbZ98P3QFY%3D

2021-05-18 20:22:27,519: INFO: 140459850491712: run(89): main: Main thread starting 10 workers
Worker 0: starting...
2021-05-18 20:22:27,522: INFO: 140459850491712: run(96): main: Beginning reading of test results.
2021-05-18 20:22:27,532: INFO: 140459850491712: run(105): main: Uploading results in batches of size 1000
2021-05-18 20:22:27,533: INFO: 140459850491712: run(110): main: Main thread finished queueing batches
2021-05-18 20:22:27,533: INFO: 140459850491712: run(114): main: Main thread exiting
Worker 1: starting...
Worker 2: starting...
Worker 5: starting...
Worker 7: starting...
Worker 9: starting...
Searching '/datadisks/disk1/work/B842096F/w/BC1909DE/e/..' for log files
Found log '/datadisks/disk1/work/B842096F/w/BC1909DE/e/../console.f1e93568.log'
Uri 'https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-52793-merge-d1edb0fae6604797b1/Microsoft.Extensions.Hosting.Functional.Tests/console.f1e93568.log?sv=2019-07-07&se=2021-06-07T20%3A07%3A07Z&sr=c&sp=rl&sig=fZdoSW%2Fcu7ZkOzwiZiYlIOE6rA4GERWu6EbZ98P3QFY%3D'
Generated log list: console.f1e93568.log:
  https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-52793-merge-d1edb0fae6604797b1/Microsoft.Extensions.Hosting.Functional.Tests/console.f1e93568.log?sv=2019-07-07&se=2021-06-07T20%3A07%3A07Z&sr=c&sp=rl&sig=fZdoSW%2Fcu7ZkOzwiZiYlIOE6rA4GERWu6EbZ98P3QFY%3D

Searching '/datadisks/disk1/work/B842096F/w/BC1909DE/e' for test results files
Found results file /datadisks/disk1/work/B842096F/w/BC1909DE/e/testResults.xml with format xunit
Searching '/datadisks/disk1/work/B842096F/w/BC1909DE/uploads' for test results files
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads

Thread 0x00007fbf3e7fc700 (most recent call first):
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 31 in __print
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 40 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf3effd700 (most recent call first):
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 31 in __print
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 40 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf3f7fe700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 43 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf3ffff700 (most recent call first):
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 31 in __print
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 40 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf54bb1700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 43 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf553b2700 (most recent call first):
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 31 in __print
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 40 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf55bb3700 (most recent call first):
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 31 in __print
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 40 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf563b4700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 43 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf56bb5700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 43 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fbf573b6700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/datadisks/disk1/work/B842096F/p/reporter/run.py", line 43 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

@MattGal
Member Author

MattGal commented May 19, 2021

@ericstj I may take some stabs at this one tomorrow but it's much harder than simply adding retries to AzDO calls.

@MattGal MattGal changed the title Add Retries in Azure Devops Reporter Problems with Azure Devops Reporter May 20, 2021
MattGal added a commit to MattGal/arcade that referenced this issue May 20, 2021
…or failures if ADO fails for any reason, return 0 if we actually passed.
MattGal added a commit to MattGal/arcade that referenced this issue May 20, 2021
MattGal added a commit that referenced this issue May 21, 2021
* Workaround for #7371.  Have azure-pipelines reporter parse XML for failures if ADO fails for any reason, return 0 if we actually passed.

* PR feedbacks
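The workaround in the commit above can be sketched as follows (hypothetical function names, not the actual reporter code). When uploading to Azure DevOps fails for any reason, inspect the xUnit results XML directly and let the work item pass only if no test actually failed. In xUnit v2 XML each test case is a `<test>` element whose `result` attribute is `Pass`, `Fail`, or `Skip`.

```python
import xml.etree.ElementTree as ET

def failures_in_xunit_results(xml_text: str) -> int:
    """Count failed test cases in xUnit v2 results XML."""
    root = ET.fromstring(xml_text)
    return sum(1 for test in root.iter("test")
               if test.get("result") == "Fail")

def fallback_exit_code(xml_text: str) -> int:
    # Return 0 only if no test actually failed; otherwise fail the work item.
    return 0 if failures_in_xunit_results(xml_text) == 0 else 1
```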
@MattGal
Member Author

MattGal commented May 27, 2021

Added 65 unique test run instances to the IcM, since they seem to be unable to find them.

@MattGal
Member Author

MattGal commented May 27, 2021

There's not much to do here, and the mitigations seem to be helping, so I'm putting this in Tracking/Blocked.

MattGal added a commit to MattGal/arcade that referenced this issue May 28, 2021
- Don't have a threading contention for stdout by only using stdout in the exception case; "starting..." adds minimal value, one can assume if the script is called it starts.
- sleep the main thread 2s before accessing stdout in case the exception case wants to happen
@MattGal
Member Author

MattGal commented May 28, 2021

New PR up; we can easily avoid the race condition by eliminating the contention for stdout entirely.

@MattGal
Member Author

MattGal commented May 28, 2021

I accidentally made what was intended to be a 2-second sleep last 33 minutes, and oddly, pushing an updated version didn't seem to fix it; I'll pick this back up on Tuesday.

MattGal added a commit to MattGal/arcade that referenced this issue Jun 2, 2021
- Don't have a threading contention for stdout by only using stdout in the exception case; "starting..." adds minimal value, one can assume if the script is called it starts.
- sleep() the main thread 5s before accessing stdout in case the exception case wants to happen
- Remove most non-error-path logging since it adds little value and multiple threads competing for stdout can cause un-catch-able failure.
MattGal added a commit that referenced this issue Jun 2, 2021
- Don't have a threading contention for stdout by only using stdout in the exception case; "starting..." adds minimal value, one can assume if the script is called it starts.
- sleep() the main thread 5s before accessing stdout in case the exception case wants to happen
- Remove most non-error-path logging since it adds little value and multiple threads competing for stdout can cause un-catch-able failure.
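The mitigation described in the commit message above can be sketched like this (hypothetical names and a pluggable `upload` callable for illustration, not the actual `run.py` code): workers stay silent on the happy path so they never compete for stdout, and the main thread pauses briefly before its final output in case an error path still wants to print.

```python
import queue
import sys
import threading
import time

def report_results(batches, upload, worker_count=4, settle_seconds=5):
    """Run upload(batch) on worker threads with no happy-path logging."""
    q, errors = queue.Queue(), []

    def worker():
        while True:
            item = q.get()
            if item is None:
                return
            try:
                upload(item)
            except Exception as error:
                errors.append(error)
                # Only the error path touches the output streams.
                print(f"upload failed: {error}", file=sys.stderr)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for batch in batches:
        q.put(batch)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # Give any in-flight error printing a moment before we exit.
    time.sleep(settle_seconds)
    return 1 if errors else 0
```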
@MattGal
Member Author

MattGal commented Jun 2, 2021

Merged the PR with the attempted workaround; I'll leave this in Validate until it makes it to Runtime.

@MattGal
Member Author

MattGal commented Jun 9, 2021

The current workaround is in Runtime and seems to be holding, so I'm closing (somehow I thought I already had).
