-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CalcJob
: always call Scheduler.parse_output
#5458
CalcJob
: always call Scheduler.parse_output
#5458
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic, thanks a lot @sphuber !
Just some minor suggestions/questions
When the functionality was added for a `Scheduler` plugin to parse the output that was written to `stdout` and `stderr` or retrieved from a specialized status command, it was decided to only call `parse_output` if all three information streams were successfully retrieved. The reasoning was especially that the `detailed_info` should be required as that would return well-structured data and would allow to reliably determine what had happened, whereas parsing free text from stdout and stderr would be error prone. Although probably a safe choice, the direct result was that setups where the scheduler didn't have the necessary implementation to return the `detailed_info`, no scheduler output parsing would be available. This would lead to many OOM and OOW problems to go unnoticed and the engine retrying to submit without any error handling. Here we remove the requirement that all three information streams from the parser, `detailed_info`, `stderr` and `stdout` should be known, but that `parse_output` will always be called, with a default of `None` for each. The advantage is that this will allow scheduler plugins to implement parsing from `stderr`, on top of `detailed_info` wherever applicable, to increase the chances of catching basic problems. The major downside is that this is a backwards incompatible change for scheduler plugins that rely on the assumption that the arguments always have values that are not `None`. This commit also fixes a bug in `CalcJob.parse_scheduler_output` where it would pass the result of `node.get_option('scheduler_stderr')` straight to `retrieved.get_object_content`. However, if the option was not defined on the node, `get_option` will return `None` which will result in a `TypeError` from the `get_object_content` call. This is fixed by explicitly checking for `None` in which case a warning is logged.
The `CalcJob` implementation was changed to always call the `parse_output` method of the scheduler, even if `detailed_info` is `None`. This means that now we can attempt to parse errors from the `stderr` as well. Here we add simple regexes to try and detect OOM and OOW errors. They return the exact same exit code as if they would have been detected from the `detailed_info`. Note that since this is done with regexes, this opens the door to false positives. It is not know how likely these are to occur.
399aee4
to
85b78c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot!
if data['State'] == 'OUT_OF_MEMORY': | ||
return CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_MEMORY # pylint: disable=no-member | ||
if re.match(r'.*cancelled at.*due to time limit.*', stderr_lower): | ||
return CalcJob.exit_codes.ERROR_SCHEDULER_OUT_OF_MEMORY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn, just noticed this small error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eurgh, would it not have been good to add some tests in this PR, to catch things like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep; I think it's fair to do this in the follow-up PR that adds support for the remaining schedulers.
Did not want to add more work for @sphuber who jumped in quickly here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slow and steady wins the race 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nordic walking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I wasn't really thinking this was ready to be merged. Would have suggested to test this in the wild with SLURM to see if the regex patterns actually match. Had just taken them from the issue discussion, but not sure that these are actually correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh, I moaned at @sphuber about this before lol; for maintainers lets not merge each other's PRs, unless given "permission". At least I don't want people merging my PRs 😜
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, all good now. Corrected the commits on develop
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for maintainers lets not merge each other's PRs, unless given "permission"
Ok, will keep it in mind as a policy on this repository for the future.
I typically mark my PRs as "draft" if I don't want them to be merged yet, so I didn't pick up the hint - sorry for the confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that they are not ready to be merged, it's that I want to make sure that it is merged correctly (squash/merge/rebase), and that the commit message is correct
Fixes #4840
When the functionality was added for a
Scheduler
plugin to parse theoutput that was written to
stdout
andstderr
or retrieved from aspecialized status command, it was decided to only call
parse_output
if all three information streams were successfully retrieved. The
reasoning was especially that the
detailed_info
should be required asthat would return well-structured data and would allow to reliably
determine what had happened, whereas parsing free text from stdout and
stderr would be error prone.
Although probably a safe choice, the direct result was that setups where
the scheduler didn't have the necessary implementation to return the
detailed_info
, no scheduler output parsing would be available. Thiswould lead to many OOM and OOW problems to go unnoticed and the engine
retrying to submit without any error handling.
Here we remove the requirement that all three information streams from
the parser,
detailed_info
,stderr
andstdout
should be known, butthat
parse_output
will always be called, with a default ofNone
foreach. The advantage is that this will allow scheduler plugins to
implement parsing from
stderr
, on top ofdetailed_info
whereverapplicable, to increase the chances of catching basic problems.
The major downside is that this is a backwards incompatible change for
scheduler plugins that rely on the assumption that the arguments always
have values that are not
None
.