pbs: handle poll if PBS client cannot connect #2691

Merged: 2 commits, Jun 26, 2018

matthewrmshin
Contributor

If PBS qstat cannot connect to its server, assume that jobs managed by it are still OK.
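
A minimal sketch of the behaviour described above, not the actual cylc implementation: if the poll command's stderr contains the connection error, every polled job is treated as still present; otherwise only the job IDs that appear in the poll output are kept. The function name `surviving_job_ids` and its arguments are hypothetical, and the sketch assumes each job ID appears as a whitespace-separated token in the poll output. The constant mirrors the `POLL_CANT_CONNECT_ERR` value in this PR's diff.

```python
POLL_CANT_CONNECT_ERR = "Connection refused"  # message matched in this PR


def surviving_job_ids(job_ids, poll_out, poll_err):
    """Return the job IDs that should still be treated as present.

    Hypothetical helper, for illustration only.
    """
    if POLL_CANT_CONNECT_ERR in poll_err:
        # The server could not be reached, so the poll says nothing
        # about the jobs: assume they are all still OK.
        return list(job_ids)
    # Otherwise trust the poll output: keep only the IDs it mentions.
    # Assumption: each job ID shows up as a whitespace-separated token.
    seen = set(poll_out.split())
    return [job_id for job_id in job_ids if job_id in seen]
```

The point of the design is that a failed connection is treated as "no information" rather than as "job gone", so a temporary server outage does not cause live jobs to be marked as failed.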

Member

@hjoliver hjoliver left a comment

Looks fine. Can this be tested, e.g. by temporarily aliasing the batch system poll command to a custom or non-existent command?
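
One way to exercise this without a real PBS server, sketched under assumptions: substitute a stand-in poll command that prints the connection error to stderr and exits non-zero, then inspect what comes back. The names `run_poll` and `FAKE_QSTAT` are hypothetical, and the stand-in uses the current Python interpreter rather than a shell alias; this is not the test that was added to the PR.

```python
import subprocess
import sys

# Hypothetical stand-in for "qstat": writes the error to stderr, exits 1.
FAKE_QSTAT = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('Connection refused\\n'); sys.exit(1)",
]


def run_poll(cmd):
    """Run a poll command; return (returncode, stdout, stderr).

    A command that cannot be started (e.g. a non-existent one) is
    reported with a returncode of None and the OS error text as stderr.
    """
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True)
    except OSError as exc:
        return None, "", str(exc)
    return proc.returncode, proc.stdout, proc.stderr


ret_code, out, err = run_poll(FAKE_QSTAT)
print(ret_code, "Connection refused" in err)
```

The non-existent command case is covered by the `OSError` branch, so both variants suggested above can be driven through the same helper.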

@matthewrmshin
Contributor Author

Yes, thinking about how to write an effective test.

@matthewrmshin matthewrmshin force-pushed the handle-poll-when-pbs-not-avail branch from f0d0905 to fa1c7b6 Compare June 18, 2018 10:04
@matthewrmshin
Contributor Author

Test added.

@@ -39,6 +39,7 @@ class PBSHandler(object):
     # N.B. The "qstat JOB_ID" command returns 1 if JOB_ID is no longer in the
     # system, so there is no need to filter its output.
     POLL_CMD = "qstat"
+    POLL_CANT_CONNECT_ERR = "Connection refused"
Member

How stable is this message across different pbs versions?

Contributor Author

We only have our own site's PBS installation to test on, so no idea. Our admin team is confident that it is a good message to use, though. @hjoliver may have better insight through his connections with the people who develop PBS.

Member

I'll ping my PBS connections...

Member

(however, we can always adapt to other PBS versions in future releases if need be)

Member

(... pinged)

@hjoliver
Member

From a contact at Altair:

"Is this response standard across PBS versions?"
That is generally correct. I've experienced this across a few versions. I also re-tested on Azure, and the message is:

------------
Connection refused
qstat: cannot connect to server xxxxxx (errno=111)
------------

It is also correct to assume that, if you get the message above, it is no reflection on the jobs themselves. It simply means the PBS daemons can't be contacted. As usual, PBS jobs that are already running will keep running. Not so for array sub-jobs.

But wait... for PBS v18.x and beyond:

  • Even sub-jobs of array jobs, when they are running, will keep running if the PBS daemons crash or whatever.
  • Qstat will have JSON format output as an option, and customizable delimiters.
  • A job's stdout/stderr can be retrieved while the job is still running.

.... just in case you want to plan for Cylc 8, 9, or beyond.
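
The sample output quoted above has two parts: a generic message line and a `qstat:` line carrying the server name and an errno. A minimal sketch, assuming that two-line shape holds, of pulling the errno out as a secondary check alongside the message match. The regular expression and the function name `parse_connect_errno` are assumptions for illustration, not part of this PR.

```python
import re

# Sample taken from the output quoted in this thread.
SAMPLE_ERR = (
    "Connection refused\n"
    "qstat: cannot connect to server xxxxxx (errno=111)\n"
)

# Assumed shape of the second line; other PBS versions may differ.
REC_ERRNO = re.compile(r"cannot connect to server \S+ \(errno=(\d+)\)")


def parse_connect_errno(poll_err):
    """Return the errno from a "cannot connect" line, or None."""
    match = REC_ERRNO.search(poll_err)
    return int(match.group(1)) if match else None


print(parse_connect_errno(SAMPLE_ERR))
```

On Linux, errno 111 is `ECONNREFUSED`, which is consistent with the "Connection refused" text in the first line.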

@matthewrmshin
Contributor Author

(The PBS v18.x Qstat new features are pretty exciting, but I don't think they would affect the purpose of this PR.)

@hjoliver hjoliver merged commit a36d9a8 into cylc:master Jun 26, 2018
@matthewrmshin matthewrmshin deleted the handle-poll-when-pbs-not-avail branch July 6, 2018 19:57