Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merge-on-Red] - Implement Test Process Watcher #78742

Merged
merged 35 commits into from
Mar 10, 2023
Merged

[Merge-on-Red] - Implement Test Process Watcher #78742

merged 35 commits into from
Mar 10, 2023

Conversation

ivdiazsa
Copy link
Member

@ivdiazsa ivdiazsa commented Nov 23, 2022

This PR adds a new harness to run tests by means of corerun. This is the watchdog work item defined in issue #77735.

The main purpose of adding it, is to be able to have a way to monitor potential freezes and hangs when running tests. If a specified time frame lapses, and the test hasn't finished, then the watcher will automatically kill the process and report accordingly.

With this mechanism in place, we will no longer have incomplete test runs that froze somewhere, without any further information, making test failures of this kind much easier to begin investigating, and consequently fix them.

Remaining Tasks Until Completion:

  • Refactor accordingly to give it a more solid multi-platform shape.
  • Modify the test script generators to use the watcher, instead of directly calling corerun.
  • Add the watcher to the repo's build scripts, so it's addition is transparent.

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost assigned ivdiazsa Nov 23, 2022
@ivdiazsa
Copy link
Member Author

Hi @jkoritzinsky! Here's the initial draft of the test watcher. It works on Linux and Windows, and it's ready for sharing, so you can take a look and we can adjust/refactor/edit/etc as necessary.

@ghost
Copy link

ghost commented Nov 23, 2022

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR adds a new harness to run tests by means of corerun.

The main purpose of adding it, is to be able to have a way to monitor potential freezes and hangs when running tests. If a specified time frame lapses, and the test hasn't finished, then the watcher will automatically kill the process and report accordingly.

With this mechanism in place, we will no longer have incomplete test runs that froze somewhere, without any further information, making test failures of this kind much easier to begin investigating, and consequently fix them.

Remaining Tasks Until Completion:

  • Refactor accordingly to give it a more solid multi-platform shape.
  • Modify the test script generators to use the watcher, instead of directly calling corerun.
Author: ivdiazsa
Assignees: ivdiazsa
Labels:

area-Infrastructure-coreclr

Milestone: -

@ivdiazsa
Copy link
Member Author

ivdiazsa commented Dec 1, 2022

FYI. This is the draft PR of the watcher @agocke @tommcdon

… yet functional, but gotta save my progress :)
@jkoritzinsky
Copy link
Member

I think we can just include the watchdog CMakeLists.txt from the CoreCLR and Mono CMake builds. I don’t think we need to introduce a separate subset and native project build for it.

#else
const int check_interval = 1000;
int check_count = 0;
char **args = new char *[exe_argc];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a vector here. You forgot to delete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot the deletion, thanks for pointing it out Andy. The reason we're using a char*[] is because that's the type that execv() requires.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unique_ptr<char*[]> then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might work as well. I'll try it.

@agocke
Copy link
Member

agocke commented Dec 6, 2022

fwiw I'm not completely sold on C++ if we have to parse and fixup the XML file. I'd rather use... anything else for string manipulation tbh.

@ivdiazsa
Copy link
Member Author

ivdiazsa commented Dec 6, 2022

fwiw I'm not completely sold on C++ if we have to parse and fixup the XML file. I'd rather use... anything else for string manipulation tbh.

Yeah I agree, C++ is too much for this. We should do it in plain C like it should be :)

Jokes aside, thanks for pointing this out @agocke. We seem be on different pages on this, so let's take this chance to sync that. The YAML log is generated in C#, right here: https://github.com/ivdiazsa/runtime/blob/2f66c46c1948af53c447e1efe0d1cc32867244f3/src/tests/Common/XUnitWrapperLibrary/TestSummary.cs#L95

I was under the impression that the XML corrector, if we decide to go with that approach, would go somewhere around there as well.

@ivdiazsa ivdiazsa marked this pull request as ready for review February 28, 2023 20:30
…irectory, rather than just the executable, and reallowed tests to be run without the watcher.
Copy link
Member

@jkoritzinsky jkoritzinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small comments, but other than that, LGTM!


#else // !TARGET_WINDOWS

// TODO: Describe what the 'ms_factor' is, and why it's being used here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address this TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops forgot that. Thanks for the catch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually removed it altogether. I needed it because originally, I was dealing with some C stuff that combined microseconds and milliseconds. After switching to C++'s std::chrono::milliseconds, testing with and without it yielded virtually the same wait time, so it ended up being redundant.

src/tests/Common/helixpublishwitharcade.proj Outdated Show resolved Hide resolved
@lewing lewing requested a review from radical March 10, 2023 01:00
<HelixCorrelationPayload Include="$(XUnitLogCheckerDirectory)" />

<!-- Browser-Wasm follows a very different workflow, which is currently out of scope of the Log Checker. -->
<HelixCorrelationPayload Include="$(XUnitLogCheckerDirectory)" Condition="'$(TargetsBrowser)' != 'true'" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change the current behavior for wasm logs then?

Copy link
Member Author

@ivdiazsa ivdiazsa Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all. Wasm is out of scope of this project for the time being, hence we are excluding it. So wasm tests will remain unaffected.

@BruceForstall
Copy link
Member

It seems undesirable to hard-code yet another timeout value (300 seconds?) into a new location (and there isn't a big, obvious comment noting that's what the number is). Especially since the YAML files (?) already specify per-test timeouts. E.g., it wouldn't surprise me if some merged test cases (e.g., Hardware Intrinsics) run under GCStress=3 on Linux/arm could be very slow.

hoyosjs added a commit that referenced this pull request Mar 14, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 13, 2023
@ivdiazsa ivdiazsa deleted the the-native-eye branch April 24, 2023 22:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Merge-on-Red] - Accurately log catastrophic test failures and freezes in source-gened test infrastructure
6 participants