Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tests] Add more logging for runtime test timeouts #109998

Merged
merged 4 commits into from
Nov 25, 2024

Conversation

mdh1418
Copy link
Member

@mdh1418 mdh1418 commented Nov 20, 2024

The runtime testing infrastructure collects dumps of the test process and all of its corerun child processes whenever the runtime test times out.

foreach (var child in FindChildProcessesByName(process, "corerun"))
{
string crashDumpPath = Path.Combine(Path.GetFullPath(crashDumpFolder), string.Format("crashdump_{0}.dmp", child.Id));
Console.WriteLine($"Attempting to collect crash dump: {crashDumpPath}");
if (CollectCrashDump(child, crashDumpPath, outputWriter))
{
Console.WriteLine("Collected crash dump: {0}", crashDumpPath);
}
else
{
Console.WriteLine("Failed to collect crash dump");
}
}

However, there have been instances where a runtime test times out due to waiting for a subprocess to finish, yet there are logs indicating that the subprocess has finished. See #104670 (comment). Its not clear whether 1) the subprocess is indeed hanging (yet no crashdump is collected) or 2) the subprocess completed yet failed to signal back to the parent process (no crashdump to collect).

This PR aims to clarify, if the Id of the subprocess that the main test process is waiting for is known, whether that Id is part of the list of active processes at timeout.

Copy link
Contributor

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

@@ -642,6 +642,8 @@ public static bool TryPrintStackTraceFromDmp(string dmpFile, TextWriter outputWr
// The children are sorted in the order they should be dumped
static unsafe IEnumerable<Process> FindChildProcessesByName(Process process, string childName)
{
Console.WriteLine($"Finding all child processes of '{process.ProcessName}' (ID: {process.Id}) with name '{childName}'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere else in this code I see an outputWriter being used but in a few spots I also see Console.WriteLine() being used. Can you tell what the distinction is? Its not clear to me when we want to use one and when we use the other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the outputWriter + errorWriter are used to write each test's output.txt + error.txt.
I haven't yet pinpointed the logic for how each helix workitem's console.<id>.log is created, but it looks to me that the console.<id>.log will stitch together each individually generated xunit test wrapper's error.txt and output.txt in addition to the Console.WriteLines used.

I think the output.txt would be used to validate the test result, error.txt would be used to capture any test specific failures, and any other failures that show up in console.<id>.log are probably non test-specific (i.e. infrastructure or timeout?) failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the outputWriter + errorWriter are used to write each test's output.txt + error.txt.

Sounds reasonable to me. As long as the output will be visible in the CI output I'm happy but hopefully there is some owner for this code who can confirm we are following the intended convention :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forced a timeout in the BinderTracingTests in a previous commit and the logging added in this PR did show up in the console.*.log file in CI.

@@ -788,6 +791,12 @@ public int RunTest(string executable, string outputFile, string errorFile, strin
executable, timeout, (environmentVar != null) ? " from variable " : "", (environmentVar != null) ? TIMEOUT_ENVIRONMENT_VAR : "",
startTime.ToString(), endTime.ToString());

Console.WriteLine("Timed Out with Active Processes:");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log above said we timed out so I don't think we need to repeat that again. Perhaps output like this:

cmdLine: blablabla Timed Out (timeout in milliseconds: XYZ, start: A, end: B)
Collecting diagnostic information...
Snapshot of processes currently running:
   ID    ParentId  CmdLine
   145   9001      Foo.exe arg1 arg2
   1458  145       Blah.exe something something
   10632 2056      corerun.exe SomeTest.dll --whatever arg more_args
   ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already have a straightforward way to grab the ParentIds and the CommandLine args of the process?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping that either the existing enumeration code that handles GetChildren() would have it or you could get it from Process.GetProcesses(). If neither of them provide an easy option feel free to simplify it as much as needed. Maybe it would just be ID and ProcessName as a simpler alternative.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The platform specific implementation for GetChildren are also in this file

internal static class ProcessExtensions
{
public unsafe static IEnumerable<Process> GetChildren(this Process process)
{
var children = new List<Process>();
if (OperatingSystem.IsWindows())
{
return Windows_GetChildren(process);
}
else if (OperatingSystem.IsLinux())
{
return Linux_GetChildren(process);
}
else if (OperatingSystem.IsMacOS())
{
return MacOS_GetChildren(process);
}
return children;
}
private unsafe static IEnumerable<Process> Windows_GetChildren(Process process)
{
var children = new List<Process>();
IntPtr snapshot = Kernel32.CreateToolhelp32Snapshot(Kernel32.Toolhelp32Flags.TH32CS_SNAPPROCESS, 0);
if (snapshot != IntPtr.Zero && snapshot.ToInt64() != Kernel32.INVALID_HANDLE)
{
try
{
children = new List<Process>();
int ppid = process.Id;
var processEntry = new Kernel32.ProcessEntry32W { Size = sizeof(Kernel32.ProcessEntry32W) };
bool success = Kernel32.Process32FirstW(snapshot, ref processEntry);
while (success)
{
if (processEntry.ParentProcessID == ppid)
{
try
{
children.Add(Process.GetProcessById(processEntry.ProcessID));
}
catch {}
}
success = Kernel32.Process32NextW(snapshot, ref processEntry);
}
}
finally
{
Kernel32.CloseHandle(snapshot);
}
}
return children;
}
private static IEnumerable<Process> Linux_GetChildren(Process process)
{
var children = new List<Process>();
List<int>? childPids = null;
try
{
childPids = File.ReadAllText($"/proc/{process.Id}/task/{process.Id}/children")
.Split(' ', StringSplitOptions.RemoveEmptyEntries)
.Select(pidString => int.Parse(pidString))
.ToList();
}
catch (IOException e)
{
// Some distros might not have the /proc/pid/task/tid/children entry enabled in the kernel
// attempt to use pgrep then
var pgrepInfo = new ProcessStartInfo("pgrep");
pgrepInfo.RedirectStandardOutput = true;
pgrepInfo.Arguments = $"-P {process.Id}";
using Process pgrep = Process.Start(pgrepInfo)!;
string[] pidStrings = pgrep.StandardOutput.ReadToEnd().Split('\n', StringSplitOptions.RemoveEmptyEntries);
pgrep.WaitForExit();
childPids = new List<int>();
foreach (var pidString in pidStrings)
if (int.TryParse(pidString, out int childPid))
childPids.Add(childPid);
}
foreach (var pid in childPids)
{
try
{
children.Add(Process.GetProcessById(pid));
}
catch (ArgumentException)
{
// Ignore failure to get process, the process may have exited
}
}
return children;
}
private static IEnumerable<Process> MacOS_GetChildren(Process process)
{
var children = new List<Process>();
if (libproc.ListChildPids(process.Id, out int[] childPids))
{
foreach (var childPid in childPids)
{
children.Add(Process.GetProcessById(childPid));
}
}
return children;
}

I think just validating whether or not the subprocess timing out is a part of that list would be a good initial check, and if we do need to extend/add GetParent, that can be added in a separate PR.

@mdh1418 mdh1418 force-pushed the add_test_timeout_active_process_logs branch from e7a053c to 9a775c2 Compare November 20, 2024 20:20
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me, but hopefully at least one of the past editors/owners of this file will also be able to take a look. My knowledge of it is minimal.

@mdh1418 mdh1418 requested a review from jkoritzinsky November 22, 2024 17:44
@mdh1418 mdh1418 merged commit 4caf064 into dotnet:main Nov 25, 2024
70 checks passed
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this pull request Dec 10, 2024
* Add more logs

* Flush writers and format active process listing

* Temporarily force timeout

* Revert "Temporarily force timeout"

This reverts commit 9a775c2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants