Skip to content
This repository has been archived by the owner on Apr 20, 2023. It is now read-only.

Implement child process reaping. #10720

Merged
merged 4 commits into from
Feb 8, 2019
Merged

Implement child process reaping. #10720

merged 4 commits into from
Feb 8, 2019

Conversation

peterhuene
Copy link

This PR implements child process reaping in the event of termination
of a CLI command.

On Windows, the child process is added to a job object that is set to
terminate the child (and its tree) upon the termination of the parent
dotnet process. On Windows 7 and Server 2008, the dotnet process cannot
already be associated with a job object for the reaping to occur. On
later Windows versions, a nested job will be created so the reaping will
still occur. After the child process exits, the job object is closed
without terminating the remaining processes in the job; this allows for
the child process to spawn additional processes that outlive the child.

On POSIX operating systems, a SIGTERM is intercepted and forwarded on to
the child process only. Like the SIGINT forwarding, it is up to the
child process to decide what to do with the SIGTERM signal (the default
is to abort).

Additionally, this fix further expands upon the previous fix to dotnet run to properly handle SIGINT so that all child processes now benefit
from the fixed behavior. This means MSBuild-forwarding commands like
dotnet build now behave as if MSBuild were directly being executed with
respect to Ctrl-C handling.

Fixes #7426.

Peter Huene added 2 commits February 6, 2019 17:43
This commit implements child process reaping in the event of termination
of a CLI command.

On Windows, the child process is added to a job object that is set to
terminate the child (and its tree) upon the termination of the parent
dotnet process.  On Windows 7 and Server 2008, the dotnet process cannot
already be associated with a job object for the reaping to occur.  On
later Windows versions, a nested job will be created so the reaping will
still occur.  After the child process exits, the job object is closed
without terminating the remaining processes in the job; this allows for
the child process to spawn additional processes that outlive the child.

On POSIX operating systems, a SIGTERM is intercepted and forwarded on to
the child process only. Like the SIGINT forwarding, it is up to the
child process to decide what to do with the SIGTERM signal (the default
is to abort).

Additionally, this fix further expands upon the previous fix to `dotnet
run` to properly handle SIGINT so that all child processes now benefit
from the fixed behavior.  This means MSBuild-forwarding commands like
`dotnet build` now behave as if MSBuild were directly being executed with
respect to Ctrl-C handling.

Fixes #7426.
This commit deletes the `DepsJsonCommandFactory`,
`ProjectDependenciesCommandFactory`, and `PublishedPathCommandFactory`
types as they are not instantiated from product code.
@peterhuene peterhuene added this to the 3.0.1xx milestone Feb 7, 2019
@peterhuene peterhuene requested a review from a team February 7, 2019 01:50
@peterhuene
Copy link
Author

peterhuene commented Feb 7, 2019

I will likely still implement one more thing for this PR: sending SIGTERM on POSIX to the parent process group if it matches the pid of the parent (i.e. the parent is in its own process group). That way POSIX behaves a little more like Windows in terms of it trying to reap the entire tree. There will still be the inherent difference that SIGKILL will do no reaping at all on POSIX, whereas terminating the process on Windows will reap.

@nguerrera
Copy link
Contributor

IIRC, msbuild has its own handler that kills child processes.

If so, how do these interact with each other? Should they use the same code/approach?

@peterhuene
Copy link
Author

It seems that invokes when a task times out.

Ours shouldn't interfere with that as we also don't kill any process unless the parent dotnet process terminates on Windows or is sent SIGTERM on POSIX (which doesn't kill, just forwards the signal).

cc @rainersigwald though.

@peterhuene
Copy link
Author

Looks like something is holding file locks on two Windows CI legs 😞

@peterhuene
Copy link
Author

No wait, the copy failed because of this Operation did not complete successfully because the file contains a virus or potentially unwanted software.. This is dotnet-msbuild.Tests.dll 🤔.

Copy link
Member

@rainersigwald rainersigwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to me.

The behavior on a primary-MSBuild-node crash will be slightly different from full-framework MSBuild (with no job object), but it's not really worse in my understanding:

full framework

Successful build (new nodes created)

  • Build starts
  • worker nodes launched
  • build completes
  • primary node shuts down
  • Workers live until idle timeout, then shut down

Successful build (node reuse)

  • Build starts
  • Connects to worker nodes
  • build completes
  • primary node shuts down
  • Workers live until idle timeout, then shut down

Crash (new nodes)

  • Build starts
  • Worker nodes launched
  • Primary node crashes
  • Worker nodes attempt to communicate with primary, fail, shut down

Crash (node reuse)

  • Build starts
  • Connects to worker nodes
  • Primary node crashes
  • Worker nodes attempt to communicate with primary, fail, shut down

With this

Successful build (new nodes created)

  • Build starts
  • worker nodes launched
  • build completes
  • primary node shuts down
  • job object kill disabled
  • Workers live until idle timeout, then shut down

Successful build (node reuse)

  • Build starts
  • Connects to worker nodes not in job
  • build completes
  • primary node shuts down
  • Workers live until idle timeout, then shut down

Crash (new nodes)

  • Build starts
  • Worker nodes launched
  • Primary node crashes
  • job object killed, killing workers and tool children

Crash (node reuse)

  • Build starts
  • Connects to worker nodes
  • Primary node crashes
  • job object killed, killing any tool children and new nodes but not reused workers or their children
  • Worker nodes attempt to communicate with primary, fail, shut down

@peterhuene
Copy link
Author

@rainersigwald to be clear, the termination would only occur if the root dotnet process is rudely terminated; if the child (primary) MSBuild process crashes or is otherwise rudely terminated, dotnet will gracefully terminate with a non-zero exit code, but will not reap any processes. dotnet is only guarding against its own termination (or a graceful SIGTERM on POSIX); if the direct child is killed then there is no behavior change from how things work today.

This commit implements sending SIGTERM to the process group if the current
process is the root of a group.  This ensures that all descendant processes
receive the SIGTERM signal.

If the current process is not at the root (e.g. a child of another root), then
only the direct child will be sent the SIGTERM signal.
@peterhuene
Copy link
Author

peterhuene commented Feb 7, 2019

Ok I've implemented sending SIGTERM to the current group (if the current process is the root of the group). This ensures that all descendants of dotnet receive the SIGTERM signal when a SIGTERM is sent to dotnet.

This should now be fully ready for review.

There's a race in `DotnetUnderTest.FullPath` on Windows that may result
in the path to dotnet to have a `.exe.exe` suffix when running tests in
parallel.

The root of the problem is that the method is not thread safe and relies
on static state to calculate the result; the state may have already
changed as a result of another thread of execution.

Rather than lock or otherwise change this code, this fix simply removes
the dependency on the static state to do the calculation.  The resulting
method is still not thread safe, but now no longer matters if the full
path is calculated concurrently.
@peterhuene peterhuene merged commit 14add78 into dotnet:master Feb 8, 2019
@peterhuene peterhuene deleted the fix-dotnet-run-termination branch February 8, 2019 02:09
@danmoseley
Copy link
Member

@tmds since you have worked on process killing stuff in CoreFX, could you possibly cast a quick eye over this one too?


// First, try to SIGTERM our process group
// If the pgid is not the same as pid, then this process is not the root of the group
if (NativeMethods.Posix.getpgid(currentPid) == currentPid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no code that calls `setpgid. Why deal with this specifically?

Copy link
Author

@peterhuene peterhuene Feb 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this process is a child of another that doesn't setpgid (e.g. another .NET process that spawns dotnet), then this check will be false. It was to avoid a failing call to kill for the group. If you think that's unnecessary and I should just call kill and check for ESRCH for the group, I'm fine with changing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering when this would be true, so my comment was more suggesting to remove the whole if block.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most shells setpgid though, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. If I start dotnet from a bash script though, they are not the same.
This may cause subtle differences in behaviour.
If we remove this, things will still work, and should be consistent independent of being process group leader or not.

Copy link
Author

@peterhuene peterhuene Feb 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So unfortunately setpgid must occur before the exec* call (for valid reasons), and I'm not going to duplicate the complex process creation code that the framework does just for this fix.

As such, I think I'll remove the pg kill.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So unfortunately setpgid must occur before the exec* call (for valid reasons), and I'm not going to duplicate the complex process creation code

Setting it before calling Process.Start would be enough.
That said, I think we should keep it simple and not change the process group unless we need to.

Copy link
Author

@peterhuene peterhuene Feb 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't setpgid require a fork prior to being called? I thought the framework only forks as a result of Process.Start (seems to happen in StartCore in the Unix implementation.

Or did you meant to setpgid the parent dotnet process before spawning the child? Given that executing from a shell usually means we're already in our own process group, setpgid would be ineffectual. In the case where the dotnet process is not in its own group, then it would probably be unwise to break out.

The only consistent way to do it would to be setpgid the child prior to the exec, which we can't do unless CoreFX implements dotnet/corefx#17412.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or did you meant to setpgid the parent dotnet process before spawning the child?

That is what I meant, so the child would inherit it from the parent.

In the case where the dotnet process is not in its own group, then it would probably be unwise to break out.

The only consistent way to do it would to be setpgid the child prior to the exec, which we can't do unless CoreFX implements dotnet/corefx#17412.

It's more tricky than I thought. If the dotnet process would change its own pgid, it would no longer receive signals (e.g. SIGINT) from the console. So setpgid needs to occur between fork and exec of the child. And the parent needs to take care of propagating the signals to the children.

Copy link
Contributor

@nguerrera nguerrera Feb 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus, no managed code can safely run between fork and exec. If we ever need something there, we’d need API changes in the BCL. (Édit: I see that was already pointed out.)

}
}

if (sendChildSIGTERM && NativeMethods.Posix.kill(_process.Id, NativeMethods.Posix.SIGTERM) != 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be safe for 'pid recycling', you can add a WaitForExit(0) and do nothing if the process has already exited.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that. Thanks!

@tmds
Copy link
Member

tmds commented Feb 12, 2019

@danmosemsft @peterhuene LGTM, added a few minor comments.

@danmoseley
Copy link
Member

Thanks @tmds. @peterhuene, @tmds is from RedHat and helps us with Linux subtleties.

@peterhuene
Copy link
Author

Very much appreciated for the additional review! I'll put up another PR with the feedback changes.

@steven-solomon
Copy link

Is it the case that this fix won't ship until some 3.0.1.X release?

@peterhuene
Copy link
Author

Hi @steven-solomon. That's correct. This fix is available in the current 3.0.100 SDK previews. Unfortunately, it doesn't meet the servicing bar to back port it to previous releases.

@steven-solomon
Copy link

@peterhuene Thank you for clarifying.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants