Configurable job timeout #906

Merged: 4 commits, Mar 2, 2021

andreygolev
Contributor

Sometimes jobs get stuck for one reason or another, and we expect them to finish within a specified time.
The default behaviour is to do nothing.

if jobTimeout != "" {
	t, err := time.ParseDuration(jobTimeout)
	if err != nil {
		return nil, errors.New("shell: Error parsing job timeout")
Collaborator

At this point, the job has already started. If the timeout is misconfigured, any output from the job will be lost and I suspect it will be unclear to the user whether the job did or did not run.

slowTimer := time.AfterFunc(t, func() {
	j := fmt.Sprintf("shell: Job '%s' execution time exceeding defined timeout %v. Killing job.", command, t)
	log.Print(j)
	_, err := output.Write([]byte(j))
Collaborator

Adding the message to the process output before it is killed could lead to:

  1. output from the process being written in the output buffer after the message but before termination, which can be confusing
  2. the log containing a message suggesting it is killed, even though it wasn't, if killing failed for some reason.

Instead, I suggest adding a (different) message after the process has actually terminated. In this case you can also account for any truncation of the output buffer to prevent the message from being lost.
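
The suggestion above could look roughly like the sketch below. It is not the code that was merged: `runWithTimeout`, `appendNotice`, the byte limit, and the notice text are all hypothetical, and keeping the most recent output when trimming is an assumption. The timer only records the timeout and kills the process; the notice is appended after `cmd.Wait()` has returned, so no job output can land after it.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"sync/atomic"
	"time"
)

// appendNotice (hypothetical) adds notice to the end of out. If the result
// would exceed limit bytes, the oldest job output is dropped first so that
// the notice itself is never lost to truncation.
func appendNotice(out []byte, notice string, limit int) []byte {
	if limit <= 0 {
		return nil
	}
	if len(notice) > limit {
		notice = notice[:limit]
	}
	if room := limit - len(notice); len(out) > room {
		out = out[len(out)-room:] // keep the most recent output
	}
	res := make([]byte, 0, len(out)+len(notice))
	res = append(res, out...)
	return append(res, notice...)
}

// runWithTimeout (hypothetical) runs a command and kills it once timeout has
// elapsed. The notice is written only after the process has really exited.
func runWithTimeout(timeout time.Duration, limit int, name string, args ...string) ([]byte, error) {
	var buf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = &buf, &buf
	if err := cmd.Start(); err != nil {
		return nil, err
	}

	var timedOut atomic.Bool
	timer := time.AfterFunc(timeout, func() {
		timedOut.Store(true)
		_ = cmd.Process.Kill() // error handling omitted in this sketch
	})
	err := cmd.Wait() // returns once the process has exited and output is copied
	timer.Stop()

	out := buf.Bytes()
	if timedOut.Load() {
		notice := fmt.Sprintf("\nshell: job timed out after %v", timeout)
		out = appendNotice(out, notice, limit)
	}
	return out, err
}

func main() {
	// Assumes a Unix-like system where the "sleep" command is available.
	out, err := runWithTimeout(100*time.Millisecond, 256, "sleep", "5")
	fmt.Printf("output: %q\nwait error: %v\n", out, err)
}
```

There is still a small window in which the timer can fire just after the process exits on its own; a real implementation would need to decide how to report that case.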

@andreygolev
Contributor Author

Thank you for the review. I've reworked the feature. Please take another look. Thanks!

		return fmt.Errorf("Error parsing job timeout value")
	}
}

Collaborator

Great! Would you mind creating a test in api_test.go for this validation, analogous to the other validation tests?

Contributor Author

sure


// Warn if buffer is overritten
// Warn if buffer is ovewritten
Collaborator

You made a typo correcting the typo ;D

Contributor Author

lol :)

slowTimer := time.AfterFunc(jt, func() {
	err = cmd.Process.Kill()
	if err != nil {
		jobTimeoutMessage = fmt.Sprintf("shell: Job '%s' execution time exceeding defined timeout %v. SIGKILL returned error. Job probably was not killed", command, jt)
Collaborator

"Job probably was not killed" - the probability of this is debatable.
Errors returned can include e.g. Process Done.
In order to prevent digging deep to find out whether it was actually killed or not (on several platforms), maybe just say "Job may not have been killed".

err = cmd.Start()
if err != nil {
	return nil, err
}

// Warn if buffer is overritten
var jobTimeoutMessage string
var jobWasKilled bool
Collaborator

The jobWasKilled value actually signifies something different. I suggest renaming it to jobTimedOut, so it is in line with jobTimeoutMessage.

@andreygolev
Contributor Author

Thanks :) here's new commit.

P.S. btw, tests were failing with this one:

=== RUN   Test_processFilteredNodes
test3 29.7 %
test2 35.6 %
test1 34.699999999999996 %
    assertion_compare.go:313: 
        	Error Trace:	agent_test.go:326
        	Error:      	"0.297" is not greater than "0.3"
        	Test:       	Test_processFilteredNodes
        	Messages:   	[]
--- FAIL: Test_processFilteredNodes (10.16s)

@yvanoers
Collaborator

> P.S. btw, tests were failing with this one:

Yes, that happens sometimes. It is due to randomness.
I'm looking into resolving that.

Collaborator

@yvanoers yvanoers left a comment

LGTM

Member

@vcastellm vcastellm left a comment

Great work here @yvanoers, thank you!

@vcastellm vcastellm merged commit c874129 into distribworks:master Mar 2, 2021
@yvanoers
Collaborator

yvanoers commented Mar 3, 2021

Thanks @Victorcoder , but let's not forget @andreygolev 's contribution, he did the work!
Thanks @andreygolev !

var jobTimeoutMessage string
var jobTimedOut bool

slowTimer := time.AfterFunc(jt, func() {
Member

@vcastellm vcastellm Mar 3, 2021

I just realized that this will be 0s if no timeout is set, so the timer fires immediately. There should be no timeout when none is specified.

Could you please open a new PR with the change @andreygolev?

And yes! sorry for my fast typing, thanks for the work here @andreygolev of course!

Contributor Author

Yeah, sure! :)
