Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: mark the work as failed if exact command runner does not exist after receptor restart #716

Open
wants to merge 2 commits into
base: devel
Choose a base branch
from

Conversation

kurokobo
Copy link
Contributor

@kurokobo kurokobo commented Dec 24, 2022

Description

Closes #711

During Restart() of command work, this PR makes the receptor to check if command runner for the unit exists or not, and if not exists, mark the unit as failed.

  1. Check if PID exists; throw process does not exist error if negative
  2. If PID found, check if the process is receptor; throw is not command runner error if negative
  3. If the process is receptor, check if the args for the process contains unitdir=<expected unitdir>; throw is command runner for different unit error if negative
  4. UPDATE: In any error pattern, as well as mark it failed, wipe stored PID for the unit. Because if the PID remains in status, the SIGINT sent to the PID that may already used different process during release/cancel after Receptor restart.

Using shirou/gopsutil.

I don't know if this is the best way but works anyway.
Please feel free to reject this if there is alternative better way 😃

Tests

Configuration file

---
- local-only:
- node:
    id: foo
- control-service:
    service: control
    filename: /tmp/receptor.sock
- work-command:
    worktype: sleep
    command: sh
    params: -c "sleep 60; echo WORKER DONE"

Case 01: Resumed correctly

  1. Start Receptor

    ./receptor -c foo.yml &
  2. Start command work

    $ receptorctl --socket /tmp/receptor.sock work submit --no-payload sleep
    Result: Job Started
    Unit ID: Fm7fQoAs
  3. Kill and restart Receptor main process

    pkill -f "./receptor -c foo.yml" && ./receptor -c foo.yml &
  4. Monitor work unit and ensure work state will be updated from Running to Succeeded

    $ receptorctl --socket /tmp/receptor.sock work list
    {
        "Fm7fQoAs": {
            "Detail": "exit status 0",
            "ExtraData": {
                "Params": "-c \"sleep 60; echo WORKER DONE\"",
                "Pid": 939252
            },
            "State": 2,
            "StateName": "Succeeded",
            "StdoutSize": 12,
            "WorkType": "sleep"
        }
    }
  5. Release units and stop receptor

    receptorctl --socket /tmp/receptor.sock work release --all
    pkill -f "./receptor -c foo.yml"

Case 02: Missing command runner (process does not exist)

  1. Start Receptor

    ./receptor -c foo.yml &
  2. Start command work

    $ receptorctl --socket /tmp/receptor.sock work submit --no-payload sleep
    Result: Job Started
    Unit ID: EqMr5iww
  3. Send SIGKILL both Receptor main process and command runner

    pkill -9 -f "./receptor -c foo.yml"
    kill -9 $(pgrep -f EqMr5iww)
  4. Ensure the status file for the unit is in Running state

    $ cat /tmp/receptor/foo/EqMr5iww/status
    {"State":1,"Detail":"Running: PID 939879","StdoutSize":0,"WorkType":"sleep","ExtraData":{"Pid":939871,"Params":"-c \"sleep 60; echo WORKER DONE\""}}
  5. Restart Receptor main process

    ./receptor -c foo.yml &
  6. Ensure work state has been updated as Failed with Orphaned: process does not exist, and PID has been wiped as 0

    $ receptorctl --socket /tmp/receptor.sock work list
    {
        "EqMr5iww": {
            "Detail": "Orphaned: process does not exist",
            "ExtraData": {
                "Params": "-c \"sleep 60; echo WORKER DONE\"",
                "Pid": 0
            },
            "State": 3,
            "StateName": "Failed",
            "StdoutSize": 0,
            "WorkType": "sleep"
        }
    }
  7. Release units and stop receptor

    receptorctl --socket /tmp/receptor.sock work release --all
    pkill -f "./receptor -c foo.yml"

Case 03: Missing command runner (process exists but not receptor)

  1. Start Receptor

    ./receptor -c foo.yml &
  2. Start command work

    $ receptorctl --socket /tmp/receptor.sock work submit --no-payload sleep
    Result: Job Started
    Unit ID: JuYT8el0
  3. Send SIGKILL both Receptor main process and command runner

    pkill -9 -f "./receptor -c foo.yml"
    kill -9 $(pgrep -f JuYT8el0)
  4. Ensure the status file for the unit is in Running state

    $ cat /tmp/receptor/foo/JuYT8el0/status
    {"State":1,"Detail":"Running: PID 940207","StdoutSize":0,"WorkType":"sleep","ExtraData":{"Pid":940197,"Params":"-c \"sleep 60; echo WORKER DONE\""}}
  5. Fake PID in status file to invalid process 1 (systemd)

    $ sed -i 's/"Pid"\:[0-9]*/"Pid"\:1/g' /tmp/receptor/foo/JuYT8el0/status
    $ cat /tmp/receptor/foo/JuYT8el0/status
    {"State":1,"Detail":"Running: PID 940207","StdoutSize":0,"WorkType":"sleep","ExtraData":{"Pid":1,"Params":"-c \"sleep 60; echo WORKER DONE\""}}
  6. Restart Receptor main process

    ./receptor -c foo.yml &
  7. Ensure work state has been updated as Failed with Orphaned: pid X is not command runner

    $ receptorctl --socket /tmp/receptor.sock work list
    {
        "JuYT8el0": {
            "Detail": "Orphaned: pid 1 is not command runner",
            "ExtraData": {
                "Params": "-c \"sleep 60; echo WORKER DONE\"",
                "Pid": 0
            },
            "State": 3,
            "StateName": "Failed",
            "StdoutSize": 12,
            "WorkType": "sleep"
        }
    }
  8. Release units and stop receptor. Ensure work is releasable since SIGINT never sent to PID 1 (If PID was not wiped, SIGINT was sent to 1 and we got Operation not permitted error)

    receptorctl --socket /tmp/receptor.sock work release --all
    pkill -f "./receptor -c foo.yml"

Case 04: Missing command runner (process exists but is not for exact unit)

  1. Start Receptor

    ./receptor -c foo.yml &
  2. Start two command works

    $ receptorctl --socket /tmp/receptor.sock work submit --no-payload sleep
    Result: Job Started
    Unit ID: EqMr5iww
    
    $ receptorctl --socket /tmp/receptor.sock work submit --no-payload sleep
    Result: Job Started
    Unit ID: FYyQ6Ifn
  3. Send SIGKILL both Receptor main process and one of the command runner

    pkill -9 -f "./receptor -c foo.yml"
    kill -9 $(pgrep -f FYyQ6Ifn)
  4. Ensure the status file for the unit is in Running state

    $ cat /tmp/receptor/foo/FYyQ6Ifn/status
    {"State":1,"Detail":"Running: PID 941175","StdoutSize":0,"WorkType":"sleep","ExtraData":{"Pid":941166,"Params":"-c \"sleep 60; echo WORKER DONE\""}}
  5. Fake PID in status file to pid for another runner

    $ pgrep -f EqMr5iww
    941083
    $ sed -i 's/"Pid"\:[0-9]*/"Pid"\:941083/g' /tmp/receptor/foo/FYyQ6Ifn/status
    $ cat /tmp/receptor/foo/FYyQ6Ifn/status
    {"State":1,"Detail":"Running: PID 941175","StdoutSize":0,"WorkType":"sleep","ExtraData":{"Pid":941083,"Params":"-c \"sleep 60; echo WORKER DONE\""}}
  6. Restart Receptor main process

    ./receptor -c foo.yml &
  7. Ensure work state has been updated as Failed with Orphaned: pid X is command runner for different unit

    $ receptorctl --socket /tmp/receptor.sock work list
    {
        "EqMr5iww": {
            "Detail": "exit status 0",
            "ExtraData": {
                "Params": "-c \"sleep 60; echo WORKER DONE\"",
                "Pid": 941083
            },
            "State": 2,
            "StateName": "Succeeded",
            "StdoutSize": 12,
            "WorkType": "sleep"
        },
        "FYyQ6Ifn": {
            "Detail": "Orphaned: pid 941083 is command runner for different unit",
            "ExtraData": {
                "Params": "-c \"sleep 60; echo WORKER DONE\"",
                "Pid": 0
            },
            "State": 3,
            "StateName": "Failed",
            "StdoutSize": 0,
            "WorkType": "sleep"
        }
    }
  8. Release units and stop receptor

    receptorctl --socket /tmp/receptor.sock work release --all
    pkill -f "./receptor -c foo.yml"

Case 05: Node down

Invoke completely the same test as described in #711.
At the last step, I can get Failed state for the work.

$ docker compose exec foo receptorctl work list
{
    "rN7KMwTD": {
        "Detail": "Orphaned: process does not exist",
        "ExtraData": {
            "Expiration": "0001-01-01T00:00:00Z",
            "LocalCancelled": false,
            "LocalReleased": false,
            "RemoteNode": "bar",
            "RemoteParams": {},
            "RemoteStarted": true,
            "RemoteUnitID": "8ve1Hdiz",
            "RemoteWorkType": "sleep",
            "SignWork": false,
            "TLSClient": ""
        },
        "State": 3,
        "StateName": "Failed",
        "StdoutSize": 0,
        "WorkType": "remote"
    }
}

$ docker compose exec bar cat /tmp/receptor/bar/8ve1Hdiz/status
{"State":3,"Detail":"Orphaned: process does not exist","StdoutSize":0,"WorkType":"sleep","ExtraData":{"Pid":0,"Params":"-c \"sleep 60; echo WORKER DONE\""}}

$ docker compose exec foo receptorctl work release --all
Released:
(rN7KMwTD, released)

@@ -16,6 +16,7 @@ import (
"github.com/ansible/receptor/pkg/logger"
"github.com/ghjm/cmdline"
"github.com/google/shlex"
"github.com/shirou/gopsutil/v3/process"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what user elevation is needed for this library to work as expected?


cmd, err := p.CmdlineSlice()
if err != nil {
return true, err
Copy link
Contributor

@AaronH88 AaronH88 Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im worried here that receptor may not have privilege to view the cmd line slice, do you know if their are any restrictions on which type of user can see this data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. do we need sudo? do we need to be in a user group etc.

p, err := process.NewProcess(int32(pid))
if err != nil {
// process does not exist
return true, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im finding the name of the function a little confusing here. If the process dosent exist, how can it also be an orphaned process?

Perhaps changing the name to something like "isPidValid()" and reversing the bools returned will make a little more sense? What do you think?

@kurokobo
Copy link
Contributor Author

@AaronH88
Thanks for your review here too, will look into this!

@AaronH88
Copy link
Contributor

Thanks @kurokobo !
If you can find some way to automate the tests you outlined in the comments here, that would be amazing also!

@kurokobo
Copy link
Contributor Author

I agree, when I was testing by hand as described in my first comment, I thought these tests should be automated :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Work remains in "Running" state after the failure of a remote worker node
2 participants