-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Implement child process reaping. #10720
Implement child process reaping. #10720
Conversation
This commit implements child process reaping in the event of termination of a CLI command. On Windows, the child process is added to a job object that is set to terminate the child (and its tree) upon the termination of the parent dotnet process. On Windows 7 and Server 2008, the dotnet process cannot already be associated with a job object for the reaping to occur. On later Windows versions, a nested job will be created so the reaping will still occur. After the child process exits, the job object is closed without terminating the remaining processes in the job; this allows for the child process to spawn additional processes that outlive the child. On POSIX operating systems, a SIGTERM is intercepted and forwarded on to the child process only. Like the SIGINT forwarding, it is up to the child process to decide what to do with the SIGTERM signal (the default is to abort). Additionally, this fix further expands upon the previous fix to `dotnet run` to properly handle SIGINT so that all child processes now benefit from the fixed behavior. This means MSBuild-forwarding commands like `dotnet build` now behave as if MSBuild were directly being executed with respect to Ctrl-C handling. Fixes #7426.
This commit deletes the `DepsJsonCommandFactory`, `ProjectDependenciesCommandFactory`, and `PublishedPathCommandFactory` types as they are not instantiated from product code.
I will likely still implement one more thing for this PR: sending |
IIRC, msbuild has its own handler that kills child processes. If so, how do these interact with each other? Should they use the same code/approach? |
It seems that invokes when a task times out. Ours shouldn't interfere with that as we also don't kill any process unless the parent dotnet process terminates on Windows or is sent cc @rainersigwald though. |
Looks like something is holding file locks on two Windows CI legs 😞 |
No wait, the copy failed because of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to me.
The behavior on a primary-MSBuild-node crash will be slightly different from full-framework MSBuild (with no job object), but it's not really worse in my understanding:
full framework
Successful build (new nodes created)
- Build starts
- worker nodes launched
- build completes
- primary node shuts down
- Workers live until idle timeout, then shut down
Successful build (node reuse)
- Build starts
- Connects to worker nodes
- build completes
- primary node shuts down
- Workers live until idle timeout, then shut down
Crash (new nodes)
- Build starts
- Worker nodes launched
- Primary node crashes
- Worker nodes attempt to communicate with primary, fail, shut down
Crash (node reuse)
- Build starts
- Connects to worker nodes
- Primary node crashes
- Worker nodes attempt to communicate with primary, fail, shut down
With this
Successful build (new nodes created)
- Build starts
- worker nodes launched
- build completes
- primary node shuts down
- job object kill disabled
- Workers live until idle timeout, then shut down
Successful build (node reuse)
- Build starts
- Connects to worker nodes not in job
- build completes
- primary node shuts down
- Workers live until idle timeout, then shut down
Crash (new nodes)
- Build starts
- Worker nodes launched
- Primary node crashes
- job object killed, killing workers and tool children
Crash (node reuse)
- Build starts
- Connects to worker nodes
- Primary node crashes
- job object killed, killing any tool children and new nodes but not reused workers or their children
- Worker nodes attempt to communicate with primary, fail, shut down
@rainersigwald to be clear, the termination would only occur if the root |
This commit implements sending SIGTERM to the process group if the current process is the root of a group. This ensures that all descendant processes receive the SIGTERM signal. If the current process is not at the root (e.g. a child of another root), then only the direct child will be sent the SIGTERM signal.
Ok I've implemented sending This should now be fully ready for review. |
There's a race in `DotnetUnderTest.FullPath` on Windows that may result in the path to dotnet to have a `.exe.exe` suffix when running tests in parallel. The root of the problem is that the method is not thread safe and relies on static state to calculate the result; the state may have already changed as a result of another thread of execution. Rather than lock or otherwise change this code, this fix simply removes the dependency on the static state to do the calculation. The resulting method is still not thread safe, but now no longer matters if the full path is calculated concurrently.
@tmds since you have worked on process killing stuff in CoreFX, could you possibly cast a quick eye over this one too? |
|
||
// First, try to SIGTERM our process group | ||
// If the pgid is not the same as pid, then this process is not the root of the group | ||
if (NativeMethods.Posix.getpgid(currentPid) == currentPid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no code that calls `setpgid. Why deal with this specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this process is a child of another that doesn't setpgid
(e.g. another .NET process that spawns dotnet
), then this check will be false. It was to avoid a failing call to kill
for the group. If you think that's unnecessary and I should just call kill
and check for ESRCH
for the group, I'm fine with changing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering when this would be true, so my comment was more suggesting to remove the whole if block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most shells setpgid
though, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. If I start dotnet from a bash script though, they are not the same.
This may cause subtle differences in behaviour.
If we remove this, things will still work, and should be consistent independent of being process group leader or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So unfortunately setpgid
must occur before the exec*
call (for valid reasons), and I'm not going to duplicate the complex process creation code that the framework does just for this fix.
As such, I think I'll remove the pg kill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So unfortunately setpgid must occur before the exec* call (for valid reasons), and I'm not going to duplicate the complex process creation code
Setting it before calling Process.Start
would be enough.
That said, I think we should keep it simple and not change the process group unless we need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't setpgid
require a fork
prior to being called? I thought the framework only forks as a result of Process.Start
(seems to happen in StartCore
in the Unix implementation.
Or did you meant to setpgid
the parent dotnet
process before spawning the child? Given that executing from a shell usually means we're already in our own process group, setpgid
would be ineffectual. In the case where the dotnet
process is not in its own group, then it would probably be unwise to break out.
The only consistent way to do it would to be setpgid
the child prior to the exec, which we can't do unless CoreFX implements dotnet/corefx#17412.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or did you meant to setpgid the parent dotnet process before spawning the child?
That is what I meant, so the child would inherit it from the parent.
In the case where the dotnet process is not in its own group, then it would probably be unwise to break out.
The only consistent way to do it would to be setpgid the child prior to the exec, which we can't do unless CoreFX implements dotnet/corefx#17412.
It's more tricky than I thought. If the dotnet process would change its own pgid, it would no longer receive signals (e.g. SIGINT) from the console. So setpgid needs to occur between fork and exec of the child. And the parent needs to take care of propagating the signals to the children.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus, no managed code can safely run between fork and exec. If we ever need something there, we’d need API changes in the BCL. (Édit: I see that was already pointed out.)
} | ||
} | ||
|
||
if (sendChildSIGTERM && NativeMethods.Posix.kill(_process.Id, NativeMethods.Posix.SIGTERM) != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be safe for 'pid recycling', you can add a WaitForExit(0)
and do nothing if the process has already exited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add that. Thanks!
@danmosemsft @peterhuene LGTM, added a few minor comments. |
Thanks @tmds. @peterhuene, @tmds is from RedHat and helps us with Linux subtleties. |
Very much appreciated for the additional review! I'll put up another PR with the feedback changes. |
Is it the case that this fix won't ship until some |
Hi @steven-solomon. That's correct. This fix is available in the current 3.0.100 SDK previews. Unfortunately, it doesn't meet the servicing bar to back port it to previous releases. |
@peterhuene Thank you for clarifying. |
This PR implements child process reaping in the event of termination
of a CLI command.
On Windows, the child process is added to a job object that is set to
terminate the child (and its tree) upon the termination of the parent
dotnet process. On Windows 7 and Server 2008, the dotnet process cannot
already be associated with a job object for the reaping to occur. On
later Windows versions, a nested job will be created so the reaping will
still occur. After the child process exits, the job object is closed
without terminating the remaining processes in the job; this allows for
the child process to spawn additional processes that outlive the child.
On POSIX operating systems, a SIGTERM is intercepted and forwarded on to
the child process only. Like the SIGINT forwarding, it is up to the
child process to decide what to do with the SIGTERM signal (the default
is to abort).
Additionally, this fix further expands upon the previous fix to
dotnet run
to properly handle SIGINT so that all child processes now benefitfrom the fixed behavior. This means MSBuild-forwarding commands like
dotnet build
now behave as if MSBuild were directly being executed withrespect to Ctrl-C handling.
Fixes #7426.