osproc.terminate broken on posix (when using fork at least) #1558

jovial · 2014-10-07T14:02:40Z

The following piece of code, from osproc.nim, results in signals being sent to invalid process group IDs (PGIDs) when used in conjunction with osproc.startProcess:

  proc terminate(p: Process) =
    if kill(-p.id, SIGTERM) == 0'i32:
      if p.running():
        if kill(-p.id, SIGKILL) != 0'i32: raiseOSError(osLastError())
    else: raiseOSError(osLastError())

Essentially, calling kill with a negative PID sends a signal to the process group with the PGID = |-p.id|. This can only be valid when p.id is the PID of a process group leader. This can never be true for a process started with osproc.startProcess as there is no call to setpgid, which could make the child a process group leader.
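The group-leader condition can be checked directly with getpgid. The following is a minimal sketch, assuming only the standard posix module; isGroupLeader and killTarget are hypothetical helper names used for illustration and are not part of osproc.

```nim
import posix

proc isGroupLeader(pid: Pid): bool =
  ## A process leads its group exactly when its PGID equals its own PID.
  ## getpgid returns -1 on error, so an unknown PID is reported as "not a leader".
  getpgid(pid) == pid

proc killTarget(pid: Pid): Pid =
  ## Hypothetical helper: the first argument to hand to kill().
  ## A negative value addresses the whole group, which is only
  ## meaningful when `pid` is a process group leader.
  if isGroupLeader(pid): -pid else: pid
```

For a child that was forked without a call to setpgid, getpgid returns the parent's group, so killTarget yields the plain PID and the signal reaches only that process.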

I believe sending a signal to the whole group is also inconsistent with the Windows implementation. Furthermore, it is unreliable, as children can escape the process group with calls to setpgid. I think the only reliable way to achieve this would be to use Linux cgroups, Windows job objects, or equivalents, which prevent children from escaping the group.

Another issue is that, as a SIGKILL is sent almost immediately after sending a SIGTERM, you aren't giving the process the opportunity to shut down cleanly. I believe calls to osproc.terminate should be paired with calls to waitForExit to give it time to shut down. This also cleans up zombie processes, which don't receive signals.

waitForExit should also respect the timeout parameter on POSIX by polling waitpid with the WNOHANG option.


simonkrauter · 2014-10-20T20:13:08Z

I think the label should be High Priority.

How should the issue be solved?
Ideas:

kill only the process itself (not a good solution)
test whether the process is a process group leader: if yes, kill the process group; if no, kill only the process itself
retrieve the Process Group ID via getpgid() and kill the group (not a good solution)
use pkill() (I have no idea)
use Linux-cgroups (I have no idea)
start new processes in own Process Group (using setpgid())

simonkrauter · 2014-10-25T18:42:45Z

Impact of this bug in Aporia: When you click "Terminate running process", Aporia will be terminated, but not the running process.

simonkrauter · 2014-10-25T18:43:54Z

"-p.id" is also used in suspend() and resume().

simonkrauter · 2014-10-25T23:40:54Z

Possible solution number 1:

Part 1: Improve terminate(), suspend() and resume() - #1590

test whether the process is a process group leader: if yes, kill the process group; if no, kill only the process itself
same approach for suspend() and resume()
Added some wait time between SIGTERM and SIGKILL (can be changed by optional parameter)

Part 2: Improve startProcess() - #1591

Optional feature
Added start option poCreateNewGroup, if set, the new process will be moved to a new process group

Open issues:

How to terminate sub-processes under Windows?
How to implement an alternative for SIGTERM under Windows?

Possible solution number 2:

PR: #1620

terminate only the process itself (consistency for POSIX and Windows)
terminate() only sends SIGTERM under POSIX
kill() sends SIGKILL under POSIX
consistency with Python (see https://docs.python.org/3/library/subprocess.html )

jovial · 2014-10-27T17:50:18Z

As suggested, repeating comment from #1590 here:

Is it worth having a "stronger" function, kill, like in the Python multiprocess API, and having terminate only send a SIGTERM? That way, if you have a load of processes that take a while to shut down, you just spam them all with a terminate and check their exit codes at a later time (instead of waiting on each individually).

Also, you might not always want to send a SIGTERM to the whole group. What if the process takes care of sending a signal to its own children?

The Python implementation of kill on Windows is just another alias for TerminateProcess. It's worth noting that this only terminates the one process, so for cross-platform code, we might want to replicate this in the POSIX version (at least by default).
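With terminate and kill split this way, the caller decides how long to wait between the polite request and the forced stop. Here is a rough sketch of that pattern, assuming the osproc API as proposed in solution 2 (terminate sends only SIGTERM on POSIX, kill sends SIGKILL). stopGracefully, graceMs and pollMs are hypothetical names chosen for this example.

```nim
import os, osproc

proc stopGracefully(p: Process, graceMs = 2000, pollMs = 50): int =
  ## Hypothetical helper, not part of osproc.
  ## Asks the process to exit, waits up to `graceMs`, then forces it.
  terminate(p)                 # polite request (SIGTERM on POSIX)
  var waited = 0
  while waited < graceMs and running(p):
    sleep(pollMs)              # os.sleep takes milliseconds
    inc waited, pollMs
  if running(p):
    kill(p)                    # forced stop (SIGKILL on POSIX)
  result = waitForExit(p)      # reap the child so no zombie is left behind
```

To stop many processes at once, one could call terminate on all of them first and only then run the polling loop, so the grace periods overlap instead of adding up.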

simonkrauter · 2014-11-02T02:05:40Z

Please comment on "Possible solution number 2".

jovial · 2014-11-02T13:01:03Z

Should kill peek the exit code (after sending the signal) to ensure it also cleans up a zombie? Here is a possible waitForExit which allows you to reap a process after calling terminate (requires: from times import epochTime):

  proc waitForExit(p: Process, timeout: int = -1): int =
    # ``waitpid`` fails if the process has already been reaped. But then
    # ``running`` probably set ``p.exitCode`` for us. Since ``p.exitCode`` is
    # initialized with -3, wrong success exit codes are prevented.
    if p.exitCode != -3: return p.exitCode
    let start = epochTime()
    var ret = 0
    while true:
      # with WNOHANG, waitpid returns 0 at once while the child is still running
      ret = waitpid(p.id, p.exitCode, WNOHANG)
      # stop polling as soon as the child is reaped (ret > 0) or on error (ret < 0)
      if ret != 0: break
      if timeout >= 0 and (epochTime() - start) * 1000 >= timeout.float:
        break
      # sleep for 50 milliseconds before polling again
      if usleep(50*1000) != 0'i32:
        raiseOSError(osLastError())

    if ret == 0:
      # indicates timeout; the Windows code doesn't signal a timeout in any
      # way, so we don't for now
      discard
    elif ret < 0:
      p.exitCode = -3
      raiseOSError(osLastError())

    result = int(p.exitCode) shr 8

There is a way to declare you aren't interested in the process' return code, see here:

http://www.win.tue.nl/~aeb/linux/lk/lk-5.html#ss5.5 ,

but apparently, this has compatibility issues with BSD / sysvinit. Another solution would be to define your own SIGCHLD handler, see:

http://www.microhowto.info/howto/reap_zombie_processes_using_a_sigchld_handler.html ,

but you would have to call peekExitCode for all processes, because if two children exit in quick succession you may receive only one SIGCHLD.
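Because SIGCHLD notifications can coalesce, a reaping routine usually loops until no exited child remains rather than reaping once per signal. A minimal sketch of such a loop, assuming only the posix module, follows; reapExitedChildren is a hypothetical name. Note that waitpid with -1 also collects children that osproc is tracking, so their exit status would be lost to the Process object.

```nim
import posix

proc reapExitedChildren(): int =
  ## Hypothetical helper: reaps every child that has already exited and
  ## returns how many were reaped. It never blocks.
  var status: cint
  while true:
    # Pid(-1) means "any child"; WNOHANG makes waitpid return 0
    # instead of blocking when no child has exited yet.
    let pid = waitpid(Pid(-1), status, WNOHANG)
    if pid <= 0: break   # 0: children still running, -1: no children left
    inc result
```

Such a routine could be called in response to SIGCHLD or polled periodically; either way, one call may reap several children.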

simonkrauter · 2014-11-03T11:14:42Z

How to reproduce this issue:

https://gist.github.com/trustable-code/acbfadfb88e5927a98c2
Insert this code in a nim file
open it with Aporia
Choose "Compile & run current file"

Result:
Aporia will be terminated.

Expected result:
Only the running test program will be terminated, not Aporia.

simonkrauter · 2014-11-03T13:29:18Z

Issue can be closed.

dom96 added Standard Library Medium Priority labels Oct 8, 2014

dom96 added Severe and removed Medium Priority labels Oct 21, 2014

This was referenced Oct 25, 2014

Fix terminate() issue #1590

Closed

Add process start option poCreateNewGroup #1591

Closed

simonkrauter mentioned this issue Nov 2, 2014

Fix terminate() and add kill() #1620

Merged

Varriount closed this as completed Nov 3, 2014

This was referenced Nov 3, 2014

"Terminate running process" doesn't work under Linux dom96/Aporia#55

Closed

[linux] Running an infinit loop crashes Aporia dom96/Aporia#52

Closed
