[Tests] Add more logging for runtime test timeouts #109998

mdh1418 · 2024-11-20T00:50:13Z

The runtime testing infrastructure collects dumps of the test process and all of its corerun child processes whenever the runtime test times out.

runtime/src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs

Lines 795 to 807 in e3b3aaa

    
           foreach (var child in FindChildProcessesByName(process, "corerun")) 
        
           { 
        
               string crashDumpPath = Path.Combine(Path.GetFullPath(crashDumpFolder), string.Format("crashdump_{0}.dmp", child.Id)); 
        
               Console.WriteLine($"Attempting to collect crash dump: {crashDumpPath}"); 
        
               if (CollectCrashDump(child, crashDumpPath, outputWriter)) 
        
               { 
        
                   Console.WriteLine("Collected crash dump: {0}", crashDumpPath); 
        
               } 
        
               else 
        
               { 
        
                   Console.WriteLine("Failed to collect crash dump"); 
        
               } 
        
           }

However, there have been instances where a runtime test times out due to waiting for a subprocess to finish, yet there are logs indicating that the subprocess has finished. See #104670 (comment). Its not clear whether 1) the subprocess is indeed hanging (yet no crashdump is collected) or 2) the subprocess completed yet failed to signal back to the parent process (no crashdump to collect).

This PR aims to clarify, if the Id of the subprocess that the main test process is waiting for is known, whether that Id is part of the list of active processes at timeout.

dotnet-policy-service · 2024-11-20T00:50:55Z

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

noahfalk · 2024-11-20T01:08:26Z

src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs

@@ -642,6 +642,8 @@ public static bool TryPrintStackTraceFromDmp(string dmpFile, TextWriter outputWr
        // The children are sorted in the order they should be dumped
        static unsafe IEnumerable<Process> FindChildProcessesByName(Process process, string childName)
        {
+            Console.WriteLine($"Finding all child processes of '{process.ProcessName}' (ID: {process.Id}) with name '{childName}'");


Everywhere else in this code I see an outputWriter being used but in a few spots I also see Console.WriteLine() being used. Can you tell what the distinction is? Its not clear to me when we want to use one and when we use the other.

I believe the outputWriter + errorWriter are used to write each test's output.txt + error.txt.
I haven't yet pinpointed the logic for how each helix workitem's console.<id>.log is created, but it looks to me that the console.<id>.log will stitch together each individually generated xunit test wrapper's error.txt and output.txt in addition to the Console.WriteLines used.

I think the output.txt would be used to validate the test result, error.txt would be used to capture any test specific failures, and any other failures that show up in console.<id>.log are probably non test-specific (i.e. infrastructure or timeout?) failures.

I believe the outputWriter + errorWriter are used to write each test's output.txt + error.txt.

Sounds reasonable to me. As long as the output will be visible in the CI output I'm happy but hopefully there is some owner for this code who can confirm we are following the intended convention :)

I forced a timeout in the BinderTracingTests in a previous commit and the logging added in this PR did show up in the console.*.log file in CI.

noahfalk · 2024-11-20T01:23:04Z

src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs

@@ -788,6 +791,12 @@ public int RunTest(string executable, string outputFile, string errorFile, strin
                                executable, timeout, (environmentVar != null) ? " from variable " : "", (environmentVar != null) ? TIMEOUT_ENVIRONMENT_VAR : "",
                                startTime.ToString(), endTime.ToString());

+                        Console.WriteLine("Timed Out with Active Processes:");


The log above said we timed out so I don't think we need to repeat that again. Perhaps output like this:

cmdLine: blablabla Timed Out (timeout in milliseconds: XYZ, start: A, end: B) Collecting diagnostic information... Snapshot of processes currently running: ID ParentId CmdLine 145 9001 Foo.exe arg1 arg2 1458 145 Blah.exe something something 10632 2056 corerun.exe SomeTest.dll --whatever arg more_args ...

Do we already have a straightforward way to grab the ParentIds and the CommandLine args of the process?

I was hoping that either the existing enumeration code that handles GetChildren() would have it or you could get it from Process.GetProcesses(). If neither of them provide an easy option feel free to simplify it as much as needed. Maybe it would just be ID and ProcessName as a simpler alternative.

The platform specific implementation for GetChildren are also in this file

runtime/src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs

Lines 112 to 228 in eed57c3

internal static class ProcessExtensions

{

public unsafe static IEnumerable<Process> GetChildren(this Process process)

{

var children = new List<Process>();

if (OperatingSystem.IsWindows())

{

return Windows_GetChildren(process);

}

else if (OperatingSystem.IsLinux())

{

return Linux_GetChildren(process);

}

else if (OperatingSystem.IsMacOS())

{

return MacOS_GetChildren(process);

}

return children;

}

private unsafe static IEnumerable<Process> Windows_GetChildren(Process process)

{

var children = new List<Process>();

IntPtr snapshot = Kernel32.CreateToolhelp32Snapshot(Kernel32.Toolhelp32Flags.TH32CS_SNAPPROCESS, 0);

if (snapshot != IntPtr.Zero && snapshot.ToInt64() != Kernel32.INVALID_HANDLE)

{

try

{

children = new List<Process>();

int ppid = process.Id;

var processEntry = new Kernel32.ProcessEntry32W { Size = sizeof(Kernel32.ProcessEntry32W) };

bool success = Kernel32.Process32FirstW(snapshot, ref processEntry);

while (success)

{

if (processEntry.ParentProcessID == ppid)

{

try

{

children.Add(Process.GetProcessById(processEntry.ProcessID));

}

catch {}

}

success = Kernel32.Process32NextW(snapshot, ref processEntry);

}

}

finally

{

Kernel32.CloseHandle(snapshot);

}

}

return children;

}

private static IEnumerable<Process> Linux_GetChildren(Process process)

{

var children = new List<Process>();

List<int>? childPids = null;

try

{

childPids = File.ReadAllText($"/proc/{process.Id}/task/{process.Id}/children")

.Split(' ', StringSplitOptions.RemoveEmptyEntries)

.Select(pidString => int.Parse(pidString))

.ToList();

}

catch (IOException e)

{

// Some distros might not have the /proc/pid/task/tid/children entry enabled in the kernel

// attempt to use pgrep then

var pgrepInfo = new ProcessStartInfo("pgrep");

pgrepInfo.RedirectStandardOutput = true;

pgrepInfo.Arguments = $"-P {process.Id}";

using Process pgrep = Process.Start(pgrepInfo)!;

string[] pidStrings = pgrep.StandardOutput.ReadToEnd().Split('\n', StringSplitOptions.RemoveEmptyEntries);

pgrep.WaitForExit();

childPids = new List<int>();

foreach (var pidString in pidStrings)

if (int.TryParse(pidString, out int childPid))

childPids.Add(childPid);

}

foreach (var pid in childPids)

{

try

{

children.Add(Process.GetProcessById(pid));

}

catch (ArgumentException)

{

// Ignore failure to get process, the process may have exited

}

}

return children;

}

private static IEnumerable<Process> MacOS_GetChildren(Process process)

{

var children = new List<Process>();

if (libproc.ListChildPids(process.Id, out int[] childPids))

{

foreach (var childPid in childPids)

{

children.Add(Process.GetProcessById(childPid));

}

}

return children;

}

I think just validating whether or not the subprocess timing out is a part of that list would be a good initial check, and if we do need to extend/add GetParent, that can be added in a separate PR.

This reverts commit 9a775c2.

noahfalk

This looks fine to me, but hopefully at least one of the past editors/owners of this file will also be able to take a look. My knowledge of it is minimal.

* Add more logs * Flush writers and format active process listing * Temporarily force timeout * Revert "Temporarily force timeout" This reverts commit 9a775c2.

Add more logs

32e432c

dotnet-issue-labeler bot added the area-Infrastructure-coreclr label Nov 20, 2024

dotnet-policy-service bot assigned mdh1418 Nov 20, 2024

mdh1418 requested review from noahfalk, jkotas and hoyosjs November 20, 2024 00:50

noahfalk reviewed Nov 20, 2024

View reviewed changes

mdh1418 added 2 commits November 20, 2024 15:07

Flush writers and format active process listing

3770074

Temporarily force timeout

9a775c2

mdh1418 force-pushed the add_test_timeout_active_process_logs branch from e7a053c to 9a775c2 Compare November 20, 2024 20:20

build-analysis bot mentioned this pull request Nov 20, 2024

BinderTracingTest.ResolutionFlow times out #104670

Open

Revert "Temporarily force timeout"

23ede50

This reverts commit 9a775c2.

noahfalk approved these changes Nov 22, 2024

View reviewed changes

mdh1418 requested a review from jkoritzinsky November 22, 2024 17:44

mdh1418 merged commit 4caf064 into dotnet:main Nov 25, 2024
70 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Tests] Add more logging for runtime test timeouts #109998

[Tests] Add more logging for runtime test timeouts #109998

mdh1418 commented Nov 20, 2024

dotnet-policy-service bot commented Nov 20, 2024

noahfalk Nov 20, 2024

mdh1418 Nov 20, 2024

noahfalk Nov 22, 2024

mdh1418 Nov 22, 2024

noahfalk Nov 20, 2024

mdh1418 Nov 20, 2024

noahfalk Nov 22, 2024

mdh1418 Nov 22, 2024

noahfalk left a comment

	foreach (var child in FindChildProcessesByName(process, "corerun"))
	{
	string crashDumpPath = Path.Combine(Path.GetFullPath(crashDumpFolder), string.Format("crashdump_{0}.dmp", child.Id));
	Console.WriteLine($"Attempting to collect crash dump: {crashDumpPath}");
	if (CollectCrashDump(child, crashDumpPath, outputWriter))
	{
	Console.WriteLine("Collected crash dump: {0}", crashDumpPath);
	}
	else
	{
	Console.WriteLine("Failed to collect crash dump");
	}
	}

	internal static class ProcessExtensions
	{
	public unsafe static IEnumerable<Process> GetChildren(this Process process)
	{
	var children = new List<Process>();
	if (OperatingSystem.IsWindows())
	{
	return Windows_GetChildren(process);
	}
	else if (OperatingSystem.IsLinux())
	{
	return Linux_GetChildren(process);
	}
	else if (OperatingSystem.IsMacOS())
	{
	return MacOS_GetChildren(process);
	}
	return children;
	}

	private unsafe static IEnumerable<Process> Windows_GetChildren(Process process)
	{
	var children = new List<Process>();
	IntPtr snapshot = Kernel32.CreateToolhelp32Snapshot(Kernel32.Toolhelp32Flags.TH32CS_SNAPPROCESS, 0);
	if (snapshot != IntPtr.Zero && snapshot.ToInt64() != Kernel32.INVALID_HANDLE)
	{
	try
	{
	children = new List<Process>();
	int ppid = process.Id;

	var processEntry = new Kernel32.ProcessEntry32W { Size = sizeof(Kernel32.ProcessEntry32W) };

	bool success = Kernel32.Process32FirstW(snapshot, ref processEntry);
	while (success)
	{
	if (processEntry.ParentProcessID == ppid)
	{
	try
	{
	children.Add(Process.GetProcessById(processEntry.ProcessID));
	}
	catch {}
	}

	success = Kernel32.Process32NextW(snapshot, ref processEntry);
	}

	}
	finally
	{
	Kernel32.CloseHandle(snapshot);
	}
	}

	return children;
	}

	private static IEnumerable<Process> Linux_GetChildren(Process process)
	{
	var children = new List<Process>();
	List<int>? childPids = null;

	try
	{
	childPids = File.ReadAllText($"/proc/{process.Id}/task/{process.Id}/children")
	.Split(' ', StringSplitOptions.RemoveEmptyEntries)
	.Select(pidString => int.Parse(pidString))
	.ToList();
	}
	catch (IOException e)
	{
	// Some distros might not have the /proc/pid/task/tid/children entry enabled in the kernel
	// attempt to use pgrep then
	var pgrepInfo = new ProcessStartInfo("pgrep");
	pgrepInfo.RedirectStandardOutput = true;
	pgrepInfo.Arguments = $"-P {process.Id}";

	using Process pgrep = Process.Start(pgrepInfo)!;

	string[] pidStrings = pgrep.StandardOutput.ReadToEnd().Split('\n', StringSplitOptions.RemoveEmptyEntries);
	pgrep.WaitForExit();

	childPids = new List<int>();
	foreach (var pidString in pidStrings)
	if (int.TryParse(pidString, out int childPid))
	childPids.Add(childPid);
	}

	foreach (var pid in childPids)
	{
	try
	{
	children.Add(Process.GetProcessById(pid));
	}
	catch (ArgumentException)
	{
	// Ignore failure to get process, the process may have exited
	}
	}

	return children;
	}

	private static IEnumerable<Process> MacOS_GetChildren(Process process)
	{
	var children = new List<Process>();
	if (libproc.ListChildPids(process.Id, out int[] childPids))
	{
	foreach (var childPid in childPids)
	{
	children.Add(Process.GetProcessById(childPid));
	}
	}

	return children;
	}

[Tests] Add more logging for runtime test timeouts #109998

[Tests] Add more logging for runtime test timeouts #109998

Conversation

mdh1418 commented Nov 20, 2024

dotnet-policy-service bot commented Nov 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

noahfalk left a comment

Choose a reason for hiding this comment