Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osproc.terminate broken on posix (when using fork at least) #1558

Closed
jovial opened this issue Oct 7, 2014 · 9 comments
Closed

osproc.terminate broken on posix (when using fork at least) #1558

jovial opened this issue Oct 7, 2014 · 9 comments

Comments

@jovial
Copy link
Contributor

jovial commented Oct 7, 2014

The following piece of code, from osproc.nim, results in signals being sent to invalid process GIDs when used in conjunction with osproc.startProcess:

  proc terminate(p: Process) =
    if kill(-p.id, SIGTERM) == 0'i32:
      if p.running():
        if kill(-p.id, SIGKILL) != 0'i32: raiseOSError(osLastError())
    else: raiseOSError(osLastError())

Essentially, calling kill with a negative PID sends a signal to the process group with the PGID = |-p.id|. This can only be valid when p.id is the PID of a process group leader. This can never be true for a process started with osproc.startProcess as there is no call to setpgid, which could make the child a process group leader.

I believe sending a signal to the whole group is also inconsistent with the windows implementation. Furthermore, it is unreliable, as children can escape the process group with calls to setpgid. I think the only reliable way to achieve this would be to use: Linux-cgroups ,Windows-jobs, and equivalents, which prevent children escaping the process group.

Another issue is that, as a SIGKILL is sent almost immediately after sending a SIGTERM, you aren't giving the process the opportunity to shut down cleanly. I believe calls to osproc.terminate should be paired with calls to waitForExit to give it time to shut down. This also cleans up zombie processes, which don't receive signals.

waitForExit should also respect the timeout parameter on posix by polling with the option: WNOHANG

@simonkrauter
Copy link
Contributor

I think the label should be High Priority.

How should the issue be solved?
Ideas:

  • kill only the process itself (not a good solution)
  • test, whether the Process is a Process Group Leader, if yes, kill the Process Group, if no, kill only the process itself
  • retrieve the Process Group ID via getpgid() and kill the group (not a good solution)
  • use pkill() (I have no idea)
  • use Linux-cgroups (I have no idea)
  • start new processes in own Process Group (using setpgid())

@simonkrauter
Copy link
Contributor

Impact of this bug in Aporia: When you click "Terminate running process", Aporia will be terminated, but not the running process.

@simonkrauter
Copy link
Contributor

"-p.id" is also used in suspend() and resume().

@simonkrauter
Copy link
Contributor

Possible solution number 1:

Part 1: Improve terminate(), suspend() and resume() - #1590

  • test, whether the Process is a Process Group Leader, if yes, kill the Process Group, if no, kill only the process itself
  • same approach for suspend() and resume()
  • Added some wait time between SIGTERM and SIGKILL (can be changed by optional parameter)

Part 2: Improve startProcess() - #1591

  • Optional feature
  • Added start option poCreateNewGroup, if set, the new process will be moved to a new process group

Open issues:

  • How to terminate sub-processes under Windows?
  • How to implement an alternative for SIGTERM under Windows?

Possible solution number 2:

PR: #1620

@jovial
Copy link
Contributor Author

jovial commented Oct 27, 2014

As suggested, repeating comment from #1590 here:

Is it worth having a "stronger" function, kill, like in the python multiprocess api? Then have terminate only send a sigterm? That way if you have a load of processes that take a while to shutdown, you just spam them all with a terminate and check their exit codes at later time (instead of waiting on each individually).

Also, you might not always want to send a sigterm to the whole group. What if the process takes care of sending a signal to its own children?

The python implementation of kill on windows is just another alias for TerminateProcess. It's worth noting that this only terminates the one process, so for cross-platform code, we might want to replicate this in the posix version (at least by default).

@simonkrauter
Copy link
Contributor

Please give comments to "Possible solution number 2".

@jovial
Copy link
Contributor Author

jovial commented Nov 2, 2014

Should kill peek the exit code (after sending the signal) to ensure it also cleans up a zombie? Here is a possible waitForExit which allows you to reap a process after calling terminate (requires: from times import epochTime):

  proc waitForExit(p: Process, timeout: int = -1): int =
    #if waitPid(p.id, p.exitCode, 0) == int(p.id):
    # ``waitPid`` fails if the process is not running anymore. But then
    # ``running`` probably set ``p.exitCode`` for us. Since ``p.exitCode`` is
    # initialized with -3, wrong success exit codes are prevented.
    if p.exitCode != -3: return p.exitCode
    let start = epochTime()
    var ret = 0
    while ret == 0:
      ret = waitpid(p.id, p.exitCode, WNOHANG)
      let now = epochTime()
      if timeout >= 0 and (now - start) * 1000 >= timeout.float:
        break
      # sleep for 50 milliseconds
      if usleep(50*1000) != 0'i32:
        raiseOSError(osLastError())  

    if ret == 0:
      # indicates timeout, windows code doesn't signal timeout in anyway, so we
      # don't for now
      discard
    elif ret < 0:
      p.exitCode = -3
      raiseOSError(osLastError())

    result = int(p.exitCode) shr 8

There is a way to declare you aren't interested in the process' return code, see here:

http://www.win.tue.nl/~aeb/linux/lk/lk-5.html#ss5.5 ,

but apparently, this has compatibility issues with BSD / sysvinit. Another solution would be to define your own SIGCHLD handler, see:

http://www.microhowto.info/howto/reap_zombie_processes_using_a_sigchld_handler.html ,

but you would have to peekExitCode for all processes (as if two exit in quick succession you may only receive one SIGCHLD).

@simonkrauter
Copy link
Contributor

How to reproduce this issue:

  1. https://gist.github.com/trustable-code/acbfadfb88e5927a98c2
  2. Insert this code in a nim file
  3. open it with Aporia
  4. Choose "Compile & run current file"

Result:
Aporia will be terminated.

Expected result:
Only the running test program will be terminated, not Aporia.

@simonkrauter
Copy link
Contributor

Issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants