-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tests] Add more logging for runtime test timeouts #109998
[Tests] Add more logging for runtime test timeouts #109998
Conversation
Tagging subscribers to this area: @hoyosjs |
@@ -642,6 +642,8 @@ public static bool TryPrintStackTraceFromDmp(string dmpFile, TextWriter outputWr | |||
// The children are sorted in the order they should be dumped | |||
static unsafe IEnumerable<Process> FindChildProcessesByName(Process process, string childName) | |||
{ | |||
Console.WriteLine($"Finding all child processes of '{process.ProcessName}' (ID: {process.Id}) with name '{childName}'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere else in this code I see an outputWriter being used but in a few spots I also see Console.WriteLine() being used. Can you tell what the distinction is? Its not clear to me when we want to use one and when we use the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the outputWriter
+ errorWriter
are used to write each test's output.txt
+ error.txt
.
I haven't yet pinpointed the logic for how each helix workitem's console.<id>.log
is created, but it looks to me that the console.<id>.log
will stitch together each individually generated xunit test wrapper's error.txt
and output.txt
in addition to the Console.WriteLine
s used.
I think the output.txt
would be used to validate the test result, error.txt
would be used to capture any test specific failures, and any other failures that show up in console.<id>.log
are probably non test-specific (i.e. infrastructure or timeout?) failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the outputWriter + errorWriter are used to write each test's output.txt + error.txt.
Sounds reasonable to me. As long as the output will be visible in the CI output I'm happy but hopefully there is some owner for this code who can confirm we are following the intended convention :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forced a timeout in the BinderTracingTests in a previous commit and the logging added in this PR did show up in the console.*.log
file in CI.
@@ -788,6 +791,12 @@ public int RunTest(string executable, string outputFile, string errorFile, strin | |||
executable, timeout, (environmentVar != null) ? " from variable " : "", (environmentVar != null) ? TIMEOUT_ENVIRONMENT_VAR : "", | |||
startTime.ToString(), endTime.ToString()); | |||
|
|||
Console.WriteLine("Timed Out with Active Processes:"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log above said we timed out so I don't think we need to repeat that again. Perhaps output like this:
cmdLine: blablabla Timed Out (timeout in milliseconds: XYZ, start: A, end: B)
Collecting diagnostic information...
Snapshot of processes currently running:
ID ParentId CmdLine
145 9001 Foo.exe arg1 arg2
1458 145 Blah.exe something something
10632 2056 corerun.exe SomeTest.dll --whatever arg more_args
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we already have a straightforward way to grab the ParentIds and the CommandLine args of the process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping that either the existing enumeration code that handles GetChildren() would have it or you could get it from Process.GetProcesses(). If neither of them provide an easy option feel free to simplify it as much as needed. Maybe it would just be ID and ProcessName as a simpler alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The platform specific implementation for GetChildren
are also in this file
runtime/src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs
Lines 112 to 228 in eed57c3
internal static class ProcessExtensions | |
{ | |
public unsafe static IEnumerable<Process> GetChildren(this Process process) | |
{ | |
var children = new List<Process>(); | |
if (OperatingSystem.IsWindows()) | |
{ | |
return Windows_GetChildren(process); | |
} | |
else if (OperatingSystem.IsLinux()) | |
{ | |
return Linux_GetChildren(process); | |
} | |
else if (OperatingSystem.IsMacOS()) | |
{ | |
return MacOS_GetChildren(process); | |
} | |
return children; | |
} | |
private unsafe static IEnumerable<Process> Windows_GetChildren(Process process) | |
{ | |
var children = new List<Process>(); | |
IntPtr snapshot = Kernel32.CreateToolhelp32Snapshot(Kernel32.Toolhelp32Flags.TH32CS_SNAPPROCESS, 0); | |
if (snapshot != IntPtr.Zero && snapshot.ToInt64() != Kernel32.INVALID_HANDLE) | |
{ | |
try | |
{ | |
children = new List<Process>(); | |
int ppid = process.Id; | |
var processEntry = new Kernel32.ProcessEntry32W { Size = sizeof(Kernel32.ProcessEntry32W) }; | |
bool success = Kernel32.Process32FirstW(snapshot, ref processEntry); | |
while (success) | |
{ | |
if (processEntry.ParentProcessID == ppid) | |
{ | |
try | |
{ | |
children.Add(Process.GetProcessById(processEntry.ProcessID)); | |
} | |
catch {} | |
} | |
success = Kernel32.Process32NextW(snapshot, ref processEntry); | |
} | |
} | |
finally | |
{ | |
Kernel32.CloseHandle(snapshot); | |
} | |
} | |
return children; | |
} | |
private static IEnumerable<Process> Linux_GetChildren(Process process) | |
{ | |
var children = new List<Process>(); | |
List<int>? childPids = null; | |
try | |
{ | |
childPids = File.ReadAllText($"/proc/{process.Id}/task/{process.Id}/children") | |
.Split(' ', StringSplitOptions.RemoveEmptyEntries) | |
.Select(pidString => int.Parse(pidString)) | |
.ToList(); | |
} | |
catch (IOException e) | |
{ | |
// Some distros might not have the /proc/pid/task/tid/children entry enabled in the kernel | |
// attempt to use pgrep then | |
var pgrepInfo = new ProcessStartInfo("pgrep"); | |
pgrepInfo.RedirectStandardOutput = true; | |
pgrepInfo.Arguments = $"-P {process.Id}"; | |
using Process pgrep = Process.Start(pgrepInfo)!; | |
string[] pidStrings = pgrep.StandardOutput.ReadToEnd().Split('\n', StringSplitOptions.RemoveEmptyEntries); | |
pgrep.WaitForExit(); | |
childPids = new List<int>(); | |
foreach (var pidString in pidStrings) | |
if (int.TryParse(pidString, out int childPid)) | |
childPids.Add(childPid); | |
} | |
foreach (var pid in childPids) | |
{ | |
try | |
{ | |
children.Add(Process.GetProcessById(pid)); | |
} | |
catch (ArgumentException) | |
{ | |
// Ignore failure to get process, the process may have exited | |
} | |
} | |
return children; | |
} | |
private static IEnumerable<Process> MacOS_GetChildren(Process process) | |
{ | |
var children = new List<Process>(); | |
if (libproc.ListChildPids(process.Id, out int[] childPids)) | |
{ | |
foreach (var childPid in childPids) | |
{ | |
children.Add(Process.GetProcessById(childPid)); | |
} | |
} | |
return children; | |
} |
I think just validating whether or not the subprocess timing out is a part of that list would be a good initial check, and if we do need to extend/add GetParent
, that can be added in a separate PR.
e7a053c
to
9a775c2
Compare
This reverts commit 9a775c2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me, but hopefully at least one of the past editors/owners of this file will also be able to take a look. My knowledge of it is minimal.
* Add more logs * Flush writers and format active process listing * Temporarily force timeout * Revert "Temporarily force timeout" This reverts commit 9a775c2.
The runtime testing infrastructure collects dumps of the test process and all of its
corerun
child processes whenever the runtime test times out.runtime/src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs
Lines 795 to 807 in e3b3aaa
However, there have been instances where a runtime test times out due to waiting for a subprocess to finish, yet there are logs indicating that the subprocess has finished. See #104670 (comment). Its not clear whether 1) the subprocess is indeed hanging (yet no crashdump is collected) or 2) the subprocess completed yet failed to signal back to the parent process (no crashdump to collect).
This PR aims to clarify, if the Id of the subprocess that the main test process is waiting for is known, whether that Id is part of the list of active processes at timeout.