-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: mark the work as failed if exact command runner does not exist after receptor restart #716
base: devel
Are you sure you want to change the base?
Conversation
…fter receptor restart
@@ -16,6 +16,7 @@ import ( | |||
"github.com/ansible/receptor/pkg/logger" | |||
"github.com/ghjm/cmdline" | |||
"github.com/google/shlex" | |||
"github.com/shirou/gopsutil/v3/process" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what user elevation is needed for this library to work as expected?
|
||
cmd, err := p.CmdlineSlice() | ||
if err != nil { | ||
return true, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im worried here that receptor may not have privilege to view the cmd line slice, do you know if their are any restrictions on which type of user can see this data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. do we need sudo? do we need to be in a user group etc.
p, err := process.NewProcess(int32(pid)) | ||
if err != nil { | ||
// process does not exist | ||
return true, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im finding the name of the function a little confusing here. If the process dosent exist, how can it also be an orphaned process?
Perhaps changing the name to something like "isPidValid()" and reversing the bools returned will make a little more sense? What do you think?
@AaronH88 |
Thanks @kurokobo ! |
I agree, when I was testing by hand as described in my first comment, I thought these tests should be automated :P |
Description
Closes #711
During
Restart()
of command work, this PR makes the receptor to check if command runner for the unit exists or not, and if not exists, mark the unit as failed.process does not exist
error if negativeis not command runner
error if negativeunitdir=<expected unitdir>
; throwis command runner for different unit
error if negativeUsing shirou/gopsutil.
I don't know if this is the best way but works anyway.
Please feel free to reject this if there is alternative better way 😃
Tests
Configuration file
Case 01: Resumed correctly
Start Receptor
./receptor -c foo.yml &
Start command work
Kill and restart Receptor main process
Monitor work unit and ensure work state will be updated from
Running
toSucceeded
Release units and stop receptor
receptorctl --socket /tmp/receptor.sock work release --all pkill -f "./receptor -c foo.yml"
Case 02: Missing command runner (process does not exist)
Start Receptor
./receptor -c foo.yml &
Start command work
Send SIGKILL both Receptor main process and command runner
Ensure the status file for the unit is in
Running
stateRestart Receptor main process
./receptor -c foo.yml &
Ensure work state has been updated as
Failed
withOrphaned: process does not exist
, and PID has been wiped as0
Release units and stop receptor
receptorctl --socket /tmp/receptor.sock work release --all pkill -f "./receptor -c foo.yml"
Case 03: Missing command runner (process exists but not receptor)
Start Receptor
./receptor -c foo.yml &
Start command work
Send SIGKILL both Receptor main process and command runner
Ensure the status file for the unit is in
Running
stateFake PID in
status
file to invalid process1
(systemd
)Restart Receptor main process
./receptor -c foo.yml &
Ensure work state has been updated as
Failed
withOrphaned: pid X is not command runner
Release units and stop receptor. Ensure work is releasable since SIGINT never sent to PID
1
(If PID was not wiped, SIGINT was sent to1
and we gotOperation not permitted
error)receptorctl --socket /tmp/receptor.sock work release --all pkill -f "./receptor -c foo.yml"
Case 04: Missing command runner (process exists but is not for exact unit)
Start Receptor
./receptor -c foo.yml &
Start two command works
Send SIGKILL both Receptor main process and one of the command runner
Ensure the status file for the unit is in
Running
stateFake PID in
status
file to pid for another runnerRestart Receptor main process
./receptor -c foo.yml &
Ensure work state has been updated as
Failed
withOrphaned: pid X is command runner for different unit
Release units and stop receptor
receptorctl --socket /tmp/receptor.sock work release --all pkill -f "./receptor -c foo.yml"
Case 05: Node down
Invoke completely the same test as described in #711.
At the last step, I can get
Failed
state for the work.