Green CI pipeline on a failed run #85824

Closed
CarnaViire opened this issue May 5, 2023 · 7 comments · Fixed by #86342
@CarnaViire
Member

On at least two Linux distros, the tests are NOT run because of internal problems, BUT the test suite is still reported as PASSED externally. I reported this to CoreEng, and their reply was that we ignore the exit code in our scripts.

Below is an example of the problem:

The pipeline shows "Libraries Test Run release coreclr linux x64 Release" as green: Pipelines - Run 20230505.2 logs (azure.com)

CentOS 7 run: test suites have state "passed", e.g. https://helix.dot.net/api/jobs/0746b253-5b79-402b-b185-b63a8e921e92/workitems/System.Composition.Runtime.Tests?api-version=2019-06-17 (it's not only this test suite, it's all of them)

Console output shows exit code 1 on trying to run the tests: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-0746b2535b79402bb1/System.Composition.Runtime.Tests/1/console.c9dbb469.log?helixlogtype=result

/root/helix/work/correlation/dotnet: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /root/helix/work/correlation/dotnet)
/root/helix/work/correlation/dotnet: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /root/helix/work/correlation/dotnet)
/root/helix/work/workitem/e
----- end Fri May 5 08:50:03 UTC 2023 ----- exit code 1 ----------------------------------------------------------

The same happens on RHEL 7:
https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-a3e4a99b995b4511b7/ComInterfaceGenerator.Tests/1/console.6f507810.log?helixlogtype=result

/mnt/work/AC7808F5/p/dotnet: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /mnt/work/AC7808F5/p/dotnet)
/mnt/work/AC7808F5/p/dotnet: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /mnt/work/AC7808F5/p/dotnet)
/mnt/work/AC7808F5/w/AC0A09CF/e
----- end Fri May 5 08:46:05 UTC 2023 ----- exit code 1 ----------------------------------------------------------

Below is the comment from the CoreEng team:

So it seems your Helix job script ignores the exit code from the dotnet exec command because then the whole script exits with 0 and Helix then assumes the job succeeded. I think you need to change the script you're sending to Helix to not ignore the exit code?
I don't believe there's anything else to do from our side here if I understand this correctly.

https://helixde107v0xdeko0k025g8.blob.core.windows.net/helix-job-6c0bad34-07f1-4e6e-8e7b-d23a3d77070d2bf95559561482b8c/System.Composition.Runtime.Tests.zip

This is the payload that gets executed; inside it I can see the RunTests.sh script, which runs this:

echo pushd $EXECUTION_DIR
echo "$RUNTIME_PATH/dotnet exec --runtimeconfig System.Composition.Runtime.Tests.runtimeconfig.json --depsfile System.Composition.Runtime.Tests.deps.json xunit.console.dll System.Composition.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE"
echo popd
echo ===========================================================================================================
pushd $EXECUTION_DIR
"$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Composition.Runtime.Tests.runtimeconfig.json --depsfile System.Composition.Runtime.Tests.deps.json xunit.console.dll System.Composition.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
test_exitcode=$?
popd
echo ----- end $(date) ----- exit code $test_exitcode ----------------------------------------------------------

if [[ -n "${exitcode_list[$test_exitcode]}" ]]; then
  echo exit code $test_exitcode means ${exitcode_list[$test_exitcode]}
fi

Where the exit codes that it checks for are these:

exitcode_list[0]="Exited Successfully"
exitcode_list[130]="SIGINT  Ctrl-C occurred. Likely tests timed out."
exitcode_list[131]="SIGQUIT Ctrl-\ occurred. Core dumped."
exitcode_list[132]="SIGILL  Illegal Instruction. Core dumped. Likely codegen issue."
exitcode_list[133]="SIGTRAP Breakpoint hit. Core dumped."
exitcode_list[134]="SIGABRT Abort. Managed or native assert, or runtime check such as heap corruption, caused call to abort(). Core dumped."
exitcode_list[135]="SIGBUS  Unaligned memory access. Core dumped."
exitcode_list[136]="SIGFPE  Bad floating point arguments. Core dumped."
exitcode_list[137]="SIGKILL Killed eg by kill"
exitcode_list[139]="SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped."
exitcode_list[143]="SIGTERM Terminated. Usually before SIGKILL."
exitcode_list[159]="SIGSYS  Bad System Call."
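
Apart from 0, every entry in the table above is above 128, which matches the shell convention of reporting a process killed by signal N as exit status 128 + N. Exit code 1 is not in the table, so the script prints no explanation for it. A minimal sketch of that convention follows; `describe_exit_code` is a hypothetical helper for illustration, not part of RunTests.sh:

```shell
# Hypothetical helper (not part of RunTests.sh): classify an exit status
# using the shell convention "128 + signal number" for signal deaths.
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    # e.g. 139 - 128 = 11, which is SIGSEGV on Linux
    echo "terminated by signal $((code - 128))"
  else
    # Ordinary failure status; its meaning is defined by the program itself
    echo "ordinary failure status $code"
  fi
}

describe_exit_code 139   # a signal death, as listed in the table
describe_exit_code 1     # the status seen in the logs above
```

Under this reading, a status of 1 says nothing about why the process failed, which is why the console log alone does not distinguish a failed launch from failing tests.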
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 5, 2023
@ghost

ghost commented May 5, 2023

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: CarnaViire
Assignees: -
Labels:

area-Infrastructure

Milestone: -

@akoeplinger
Member

akoeplinger commented May 5, 2023

This is expected:

# The helix work item should not exit with non-zero if tests ran and produced results
# The special console runner for runtime returns 1 when tests fail
if [[ "$test_exitcode" == "1" ]]; then
  if [ -n "$HELIX_WORKITEM_PAYLOAD" ]; then
    exit 0
  fi
fi

Looks like we got unlucky and this case also returns exit code 1 which we then misinterpret as failing tests (which are intentionally a "passing" state as far as our helix usage is concerned)
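
To make the collision concrete, here is a minimal sketch that mimics the wrapper logic quoted above. `wrapper_status` is a hypothetical stand-in for illustration, and the `HELIX_WORKITEM_PAYLOAD` value is a placeholder; only the exit-code-1 rule is taken from the snippet:

```shell
# Hypothetical stand-in for the wrapper rule quoted above:
# under Helix, exit code 1 from the test command is turned into 0.
wrapper_status() {
  test_exitcode=$1
  if [ "$test_exitcode" = "1" ] && [ -n "$HELIX_WORKITEM_PAYLOAD" ]; then
    echo 0
  else
    echo "$test_exitcode"
  fi
}

HELIX_WORKITEM_PAYLOAD=/tmp/payload   # placeholder value, only needs to be non-empty

failing_tests=1      # console runner: tests ran and some failed
launcher_failure=1   # dotnet host could not start (the GLIBCXX case)

echo "failing tests    -> $(wrapper_status "$failing_tests")"
echo "launcher failure -> $(wrapper_status "$launcher_failure")"
```

Both lines report 0, so from the exit status alone Helix cannot tell the two situations apart.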

@CarnaViire
Member Author

CarnaViire commented May 5, 2023

Interesting, so I guess we need to find a way to tell the difference between a normal run with some failing tests and a run that actually didn't happen at all 🤔

@jkoritzinsky
Member

In any case, we've dropped support for CentOS7/RHEL7, so we should update any legs that are testing on it to test on distros like CentOS Stream 8 or AlmaLinux 8.

@CarnaViire
Member Author

I agree we need to move off unsupported distros, but I believe we should not ignore the issue at hand. This is literally unnoticeable from the outside; I wouldn't have discovered the problem if I hadn't gone all the way down to the console logs. If this happened again on some other distro, we wouldn't know that tests are not running, possibly for quite some time.

@akoeplinger
Member

akoeplinger commented May 6, 2023

I think an easy fix would be to add a check whether a testResults.xml was produced in the "exit code == 1" case.

Maybe also check whether it is > 0 bytes but that might already be handled by the arcade test reporter.

@wfurt
Member

wfurt commented May 15, 2023

That would make sense to me @akoeplinger. The fact that a failed run can be reported as a success is highly problematic IMHO, as it can make us blind to a whole range of issues; an obsolete distro is only one of them.

And for now it is a great waste of resources, as we do runs that yield no useful results.

We can test the presence of the result file with -f or conveniently for content with -s
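
A minimal sketch of that check, assuming the results file is named testResults.xml as in the command line quoted earlier. `workitem_status` is a hypothetical helper, the `HELIX_WORKITEM_PAYLOAD` condition is left out for brevity, and this is not necessarily how the eventual fix was written:

```shell
# Hypothetical helper: decide the work item's exit status.
# $1 = exit code of the test command, $2 = path to the results file.
workitem_status() {
  if [ "$1" = "1" ] && [ -s "$2" ]; then
    # Tests ran and wrote a non-empty results file; failures are
    # reported from the XML, so the work item itself "passes".
    echo 0
  else
    # No usable results: keep the original exit code so the failure is visible.
    echo "$1"
  fi
}

workdir=$(mktemp -d)
printf '<assemblies/>\n' > "$workdir/testResults.xml"   # placeholder content
: > "$workdir/empty.xml"                                # exists, zero bytes

workitem_status 1 "$workdir/testResults.xml"   # prints 0
workitem_status 1 "$workdir/empty.xml"         # prints 1
workitem_status 1 "$workdir/missing.xml"       # prints 1
rm -rf "$workdir"
```

Using `-s` rather than `-f` also catches a results file that was created but never written to, since `-f` is true for an empty regular file.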

@ManickaP ManickaP self-assigned this May 16, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 16, 2023
@ghost ghost removed in-pr There is an active PR which will close this issue when it is merged untriaged New issue has not been triaged by the area owner labels May 29, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 28, 2023