CalcJob: always call Scheduler.parse_output #5458

Merged
merged 2 commits into from
Mar 21, 2022

Contributor

@sphuber sphuber commented Mar 21, 2022

Fixes #4840

When the functionality was added for a `Scheduler` plugin to parse the
output that was written to `stdout` and `stderr` or retrieved from a
specialized status command, it was decided to only call `parse_output`
if all three information streams were successfully retrieved. The main
reasoning was that the `detailed_info` should be required, since it
returns well-structured data that makes it possible to reliably
determine what happened, whereas parsing free text from `stdout` and
`stderr` is error prone.

Although probably a safe choice, the direct result was that no
scheduler output parsing was available for setups where the scheduler
didn't have the necessary implementation to return the `detailed_info`.
This caused many out-of-memory (OOM) and out-of-walltime (OOW) problems
to go unnoticed, with the engine retrying the submission without any
error handling.

Here we remove the requirement that all three information streams for
the parser (`detailed_info`, `stderr` and `stdout`) be known. Instead,
`parse_output` is always called, with a default of `None` for each
argument. The advantage is that scheduler plugins can now implement
parsing from `stderr`, on top of `detailed_info` wherever applicable,
to increase the chances of catching basic problems.

The major downside is that this is a backwards-incompatible change for
scheduler plugins that rely on the assumption that the arguments always
have values that are not `None`.
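
For plugin authors, the practical consequence is that `parse_output` has to tolerate missing inputs. The following is a minimal sketch of a defensive implementation, not the actual AiiDA code. The class `ExampleScheduler` and the returned string label are hypothetical stand-ins (a real plugin would subclass the AiiDA `Scheduler` and return an exit code); only the three stream names, their `None` defaults, and the `State == 'OUT_OF_MEMORY'` check are taken from this PR.

```python
from typing import Optional


class ExampleScheduler:
    """Hypothetical stand-in for a scheduler plugin; not the AiiDA base class."""

    def parse_output(
        self,
        detailed_info: Optional[dict] = None,
        stdout: Optional[str] = None,
        stderr: Optional[str] = None,
    ) -> Optional[str]:
        """Return a failure label if one can be detected, else None."""
        # Prefer the structured data when the scheduler provides it.
        if detailed_info is not None:
            if detailed_info.get('State') == 'OUT_OF_MEMORY':
                return 'out_of_memory'

        # Fall back to free-text parsing only if stderr was actually retrieved.
        # The substring checked here is an assumption made for illustration.
        if stderr is not None and 'memory limit' in stderr.lower():
            return 'out_of_memory'

        # Nothing recognized, or no information was available at all.
        return None
```

A plugin written before this change may index into `detailed_info` or call `.lower()` on `stderr` unconditionally; with this PR either of those can now fail because the value is `None`, which is the backwards incompatibility mentioned above.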

@sphuber sphuber requested review from ltalirz and chrisjsewell March 21, 2022 18:24
Member

@ltalirz ltalirz left a comment

Fantastic, thanks a lot @sphuber !

Just some minor suggestions/questions

aiida/engine/processes/calcjobs/calcjob.py (review thread resolved)
aiida/schedulers/plugins/slurm.py (review thread resolved)
aiida/schedulers/plugins/slurm.py (review thread resolved)
sphuber added 2 commits March 21, 2022 22:44
The first commit also fixes a bug in `CalcJob.parse_scheduler_output`,
which passed the result of `node.get_option('scheduler_stderr')`
straight to `retrieved.get_object_content`. If the option is not
defined on the node, `get_option` returns `None`, which results in a
`TypeError` from the `get_object_content` call. This is fixed by
explicitly checking for `None`, in which case a warning is logged.
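
The `None` guard described for this bug fix can be sketched as follows. This is an illustration under stated assumptions, not the code in `calcjob.py`: the helper `read_scheduler_stream` is hypothetical, and plain dictionaries stand in for the node options and for the retrieved folder.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def read_scheduler_stream(options: dict, files: dict, option_name: str) -> Optional[str]:
    """Return the content of a retrieved scheduler file, or None if the option is unset.

    `options` stands in for the node options and `files` for the retrieved
    folder; both being plain dicts is an assumption of this sketch.
    """
    # Like `get_option` in the PR description, this yields None when unset.
    filename = options.get(option_name)

    if filename is None:
        # Guard first: passing None on as a filename is what caused the error.
        logger.warning('option `%s` is not set, cannot read the file', option_name)
        return None

    return files[filename]
```

Returning `None` here fits the new contract: the caller can pass the value on to `parse_output` unchanged, and the plugin decides what to do when a stream is missing.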
The `CalcJob` implementation was changed to always call the
`parse_output` method of the scheduler, even if `detailed_info` is
`None`. This means that we can now attempt to parse errors from
`stderr` as well.

Here we add simple regexes to try to detect OOM and OOW errors. They
return exactly the same exit codes as if the errors had been detected
from the `detailed_info`. Note that since this is done with regexes, it
opens the door to false positives. It is not known how likely these are
to occur.
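
A minimal sketch of this kind of regex-based detection is given below. It is not the merged implementation. The time-limit pattern is the one quoted in this PR; the memory pattern and the `ERROR_SCHEDULER_OUT_OF_WALLTIME` label are assumptions for illustration, and the function `classify_stderr` is hypothetical and returns plain strings rather than `CalcJob` exit codes. It also uses `re.search` on compiled patterns instead of `re.match` with a leading `.*`, so that a message on any line of a multi-line `stderr` can be found.

```python
import re
from typing import Optional

# Patterns are applied to lower-cased text. The time-limit pattern is the one
# quoted in this PR; the memory pattern is an assumption made for this sketch.
OUT_OF_WALLTIME = re.compile(r'cancelled at.*due to time limit')
OUT_OF_MEMORY = re.compile(r'exceeded.*memory limit')


def classify_stderr(stderr: Optional[str]) -> Optional[str]:
    """Return a label for a recognized scheduler failure, else None."""
    if stderr is None:
        # The stream was not retrieved, so there is nothing to parse.
        return None

    text = stderr.lower()

    # `search` scans the whole text, so the message may appear on any line.
    if OUT_OF_MEMORY.search(text):
        return 'ERROR_SCHEDULER_OUT_OF_MEMORY'
    if OUT_OF_WALLTIME.search(text):
        return 'ERROR_SCHEDULER_OUT_OF_WALLTIME'

    return None
```

Because the patterns match free text, output written by the job itself that happens to contain the same phrases would also be classified as a scheduler failure, which is the false-positive risk noted above.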
@sphuber sphuber force-pushed the fix/4840/scheduler-parse-output branch from 399aee4 to 85b78c3 Compare March 21, 2022 21:46
Member

@ltalirz ltalirz left a comment

thanks a lot!

@ltalirz ltalirz enabled auto-merge (squash) March 21, 2022 22:11
@ltalirz ltalirz merged this pull request into aiidateam:develop Mar 21, 2022
if data['State'] == 'OUT_OF_MEMORY':
    return CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_MEMORY  # pylint: disable=no-member
if re.match(r'.*cancelled at.*due to time limit.*', stderr_lower):
    return CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_MEMORY
Member

Damn, just noticed this small error here.

Member

Eurgh, would it not have been good to add some tests in this PR, to catch things like this?

Member

Yep; I think it's fair to do this in the follow-up PR that adds support for the remaining schedulers.
Did not want to add more work for @sphuber who jumped in quickly here

Member

Slow and steady wins the race 😅

Member

Nordic walking?

Contributor Author

And I wasn't really thinking this was ready to be merged. Would have suggested to test this in the wild with SLURM to see if the regex patterns actually match. Had just taken them from the issue discussion, but not sure that these are actually correct.

Member

Yeh, I moaned at @sphuber about this before lol; for maintainers, let's not merge each other's PRs unless given "permission". At least I don't want people merging my PRs 😜

Contributor Author

Anyway, all good now. Corrected the commits on develop.

Member

@ltalirz ltalirz Mar 21, 2022

> for maintainers lets not merge each other's PRs, unless given "permission"

Ok, will keep it in mind as a policy on this repository for the future.

I typically mark my PRs as "draft" if I don't want them to be merged yet, so I didn't pick up the hint - sorry for the confusion.

Member

It's not that they are not ready to be merged, it's that I want to make sure that it is merged correctly (squash/merge/rebase), and that the commit message is correct

@sphuber sphuber deleted the fix/4840/scheduler-parse-output branch March 21, 2022 22:47
Development

Successfully merging this pull request may close these issues.

Parse stdout and stderr from schedulers even if detailed job info is not available
3 participants