Add infrastructure to parse scheduler output for CalcJobs
#3906
@espenfl on top of this PR, I have a working branch that implements the new I think this would then also fully address the open issue and supersede PR #3261 and PR #3647, or is there still some functionality in there that would not be covered with this PR? Some of the functionality in your PRs actually is already present in
Codecov Report
@@ Coverage Diff @@
## develop #3906 +/- ##
===========================================
+ Coverage 79.08% 79.09% +0.01%
===========================================
Files 468 468
Lines 34622 34642 +20
===========================================
+ Hits 27377 27395 +18
- Misses 7245 7247 +2
Thanks a lot @sphuber , I think this is a crucial addition to the scheduler handling and will be very useful!
Just to make sure I understand you here: while, technically, a scheduler plugin could return any exit code that is recognized by a custom calcjob class, I guess we should strongly advise to only return I think the approach chosen by you is very sensible (as opposed to, e.g., build some machinery that would allow scheduler plugins to define their own exit codes), since it gives AiiDA users the ability to deal with scheduler errors in a more abstract way.
While I get that this approach is the most flexible one, and should probably be supported, it does mean extra work for developers of existing plugins. As the developer of an existing plugin, I would much prefer to see an exit code referring to an "OOM" rather than some generic "file not found / parsing failed" error. What if, instead, there was a flag Of course one could think of more flexible ways than a flag (a whitelist / blacklist of scheduler errors); perhaps this would be too much. |
Thanks for the comments @ltalirz
Yes, there is no check as of yet that any exit code returned is one that we define, but even if we did, there is no way we can check that what is returned is actually correct. I therefore think that checking does not matter and that it is in everyone's interest to have all scheduler plugins behave as coherently as possible. We already control a big part of them through
I don't think this is so much a problem for plugin developers (as long as they are aware of it) as for the users of it, but yes I see your point.
This is absolutely true, albeit not too much work. Adding the following at the top would suffice:
I would probably provide a utility method on the The biggest reason why I have opted for this approach now, is that this is backwards compatible, while your proposed solution is not. Now it would be a good idea to present these two options to the mailing list to see what users and developers would prefer. For a kick-off: @chrisjsewell @greschd @espenfl and @giovannipizzi what do you think about these two possibilities? Exit on scheduler exit code by default or not? |
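For illustration, the early-return check that a parser plugin could add at the top of its parsing routine might look roughly like the sketch below. It does not import `aiida-core`: `ExitCode` and `FakeNode` are hypothetical stand-ins defined here so that the snippet runs on its own, and the attribute names `exit_status` and `exit_message` are assumed to mirror those of a real calculation node.

```python
from typing import NamedTuple, Optional


class ExitCode(NamedTuple):
    """Stand-in for an exit code: an integer status plus an optional message."""
    status: int = 0
    message: Optional[str] = None


class FakeNode:
    """Hypothetical stand-in for the calculation node that a parser can inspect."""

    def __init__(self, exit_status=None, exit_message=None):
        self.exit_status = exit_status
        self.exit_message = exit_message


def parse(node: FakeNode) -> Optional[ExitCode]:
    """Sketch of a parser that stops early if the scheduler already set an exit code."""
    # Early return: propagate whatever the scheduler parser stored on the node.
    if node.exit_status is not None:
        return ExitCode(node.exit_status, node.exit_message)

    # ... normal parsing of the retrieved files would go here ...
    return None


# The status 110 and its message are made-up values for this example.
early = parse(FakeNode(exit_status=110, exit_message='out of memory'))
normal = parse(FakeNode())
```

With this check in place, a plugin opts in to "exit on scheduler error" without any change in the default behavior for other plugins.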
Yes, I will have a look at this and suggest improvements if need be. Thanks. |
Thanks @sphuber for collecting all the scattered work on this into one PR and also the additional improvements. I think it makes total sense to integrate it with the current framework for exit codes. When it comes to the flow of this I would certainly advise that at the parsing step we have the possibility to know what kind of exit code was returned from the scheduler. Sometimes you might need to parse a different file, check some extra parameters before loading data into a node or similar. This also means we should not close shop once we see an exit code from the scheduler. We should continue with the parsing step as I believe there are plenty of scheduling errors that can appear, but can leave the calculation salvageable, or in a state where we need more info from the parser to act. In the end this might result in a different final exit code that is returned or no exit code at all.
Another issue we should think about already now is: can we call this machinery from a transport monitor? At some point we need to monitor the job info, scheduler stdout/stderr and code stdout/stderr while it is running. It would be nice if we could reuse this, or rewrite it now so that it is general enough to be used in such a context.
That's clear - however with 49 plugins registered, any change we require from plugin developers will take a significant amount of time to propagate (and some otherwise working plugins may never see it).
@espenfl I think we anyhow all agree that this must be possible. The question is: what should happen by default? Exiting on non-zero scheduler exit codes would have the advantage that all plugins will benefit from the functionality automatically - however, as @sphuber mentions, this is a change of behavior. Do any of you guys have an existing example in mind, where this change in behavior would be detrimental?
Thinking more about this, there is even an alternative route, where the switch to "exit immediately" exists but is turned off by default. Once we feel it is safe, we can then still change the default value in a later release. |
This is also what I think. Adding a default to break on non-zero exit codes in case developers have not added other functionality is maybe formally not so easy. If the exit codes and whatever we choose to implement in the scheduler plugins are not overly aggressive this is easy (say OOM and walltime). In the end I think we will end up with a rather extensive set of exit codes, some of which are not that critical, possibly even just a scheduler error where the calculation finished just fine. The scheduler stdout/stderr will house a lot of things which might not be linked to the scheduler, say MPI stuff. We might get errors here, which are not critical, but that we want to know about and possibly act on, say by adjusting the node/cpu spread or whatever. This would be a use case. Probably one of many. We can of course argue that these types of exit codes should maybe formally not be on the scheduler, but right now, we have no other place to put them. But it is a while before we reach this and in the meantime I think most plugin developers would like the benefits from OOM and walltime etc. So I guess in practical terms this is easier.
Thanks for all the work, @sphuber! I just have one question/suggestion and spotted a typo.
Regarding the discussion on whether to exit immediately if the |
Thanks @mbercx , I addressed the two comments and rebased on latest
This might potentially be a useful feature, but I would not implement it now. Once we have people asking for it we can open a feature request and implement it. |
Good for me as well to skip adding this option for the time being. |
I just realized a subtlety that we might want to change. If the parser does not return a specific error code, but just |
Hmm, good point. I'd say that in the (not very common?) case that the calculation parser doesn't return an exit code and the scheduler does, we should return the scheduler exit code. Else the user won't see that there was an issue unless he/she checks the logs.
I'm not sure what you mean here... 😅 |
Thanks for picking up on this again. What I would expect is that the exit code penetrates, regardless of source. If the scheduler sets an exit code I would expect it to be rather serious (as some do not for many errors that are relevant). If we do a parse and detect nothing I would trust the exit code more than the parsing, at least from a programmatic perspective. Also, I would expect it is more common to change the formatting of the text than the exit code. So I would make sure the exit code from the parser is what sticks, unless you manually override it during parsing.
Yeah, I thought about it some more and the current behavior is not what we want. For example, if there is an OOM, but the parser doesn't check or doesn't notice and so returns |
Add a new method `Scheduler.parse_output` that takes three arguments: `detailed_job_info`, `stdout` and `stderr`, which are the dictionary returned by `Scheduler.get_detailed_job_info` and the content of scheduler stdout and stderr files from the repository, respectively. A scheduler plugin can implement this method to parse the content of these data sources to detect standard scheduler problems such as node failures and out of memory errors. If such an error is detected, the method can return an `ExitCode` that should be defined on the calculation job class. The `CalcJob` base class already defines certain exit codes for common errors, such as an out of memory error.

If the detailed job info, stdout and stderr from the scheduler output are available after the job has been retrieved, and the scheduler plugin that is used has implemented `parse_output`, it will be called by the `CalcJob.parse` method. If an exit code is returned, it is set on the corresponding node and a warning is logged. Subsequently, the normal output parser is called, if any was defined in the inputs, which can then of course check the node for the presence of an exit code. It then has the opportunity to parse the retrieved output files, if any, to try and determine a more specific error code, if applicable. Returning an exit code from the output parser will override the exit code set by the scheduler parser. This is why that exit code is also logged as a warning so that the information is not completely lost.

This choice does change the old behavior when an output parser would return `None` which would be interpreted as `ExitCode(0)`. However, now if the scheduler parser returned an exit code, it will not be overridden by the `None` of the output parser, which is then essentially ignored. This is necessary, because otherwise, basic parsers that don't return anything even if an error might have occurred will always just override the scheduler exit code, which is not desirable.
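The precedence rules described in this commit message can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code of the PR: `ExitCode` is a local stand-in rather than the `aiida-core` class, the two exit codes and the matched error string are invented for the example, and the hypothetical `parse` function merely mimics the order of steps attributed to `CalcJob.parse`.

```python
import logging
from typing import Callable, NamedTuple, Optional

LOGGER = logging.getLogger(__name__)


class ExitCode(NamedTuple):
    """Stand-in for an exit code (not the aiida-core class)."""
    status: int = 0
    message: Optional[str] = None


# Hypothetical exit codes, defined for this example only.
ERROR_OUT_OF_MEMORY = ExitCode(110, 'The job ran out of memory.')
ERROR_CODE_CRASHED = ExitCode(310, 'The code crashed while allocating arrays.')

# An output parser receives the scheduler exit code (or None) and may return its own.
OutputParser = Callable[[Optional[ExitCode]], Optional[ExitCode]]


def parse_scheduler_output(detailed_job_info: dict, stdout: str, stderr: str) -> Optional[ExitCode]:
    """Toy scheduler parser; the matched message is invented for this sketch."""
    if 'out of memory' in stderr.lower():
        return ERROR_OUT_OF_MEMORY
    return None


def parse(stderr: str, output_parser: Optional[OutputParser] = None) -> ExitCode:
    """Combine the scheduler exit code with the one from the output parser."""
    scheduler_exit_code = parse_scheduler_output({}, '', stderr)
    if scheduler_exit_code is not None:
        # Logged so the information survives if the output parser overrides it.
        LOGGER.warning('scheduler parser returned exit code %d', scheduler_exit_code.status)

    parser_exit_code = output_parser(scheduler_exit_code) if output_parser else None

    if parser_exit_code is not None:
        return parser_exit_code      # an explicit parser result wins
    if scheduler_exit_code is not None:
        return scheduler_exit_code   # a `None` from the parser does not override
    return ExitCode(0)


oom_stderr = 'error: job killed: out of memory'  # made-up scheduler message
silent = parse(oom_stderr, lambda scheduler_code: None)
specific = parse(oom_stderr, lambda scheduler_code: ERROR_CODE_CRASHED)
clean = parse('', lambda scheduler_code: None)
```

Here `silent` keeps the scheduler exit code because the parser returned `None`, `specific` carries the parser's more specific code, and `clean` falls back to a zero exit status.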
The `ERROR_NO_RETRIEVED_FOLDER` is now defined on the `CalcJob` base class and the `CalcJob.parse` method already checks for the presence of the retrieved folder and returns the exit code if it is missing. This allows us to remove the similar exit codes that are currently defined on the calculation plugins shipped with `aiida-core`, `ArithmeticAddCalculation` and `TemplateReplacerCalculation`, as well as the check for the presence of the `retrieved` output from the corresponding parsers. The fact that this is now checked in the `CalcJob` base class means that `Parser` implementations can safely assume that the retrieved output node exists.
Great work, @sphuber! I've run a few tests for the slurm scheduler (after applying fe31e42438ba770a1c5363c2e6cc6d0ed265bc7a to this PR's branch) with I have noticed that the Other than that, I have no further comments, so this is ready to go for me. |
@@ -192,6 +193,14 @@ def define(cls, spec: CalcJobProcessSpec):
help='Files that are retrieved by the daemon will be stored in this node. By default the stdout and stderr '
'of the scheduler will be added, but one can add more by specifying them in `CalcInfo.retrieve_list`.')

# Errors caused or returned by the scheduler
@sphuber sorry, only saw this now: should some (all?) of these exit codes set `invalidates_cache=True`?
Good point, I think they probably should. I will make a PR to correct it; we are now testing the implementation of the scheduler output parser for the SLURM plugin.
Yeah.. I'm pretty sure about the `100` error. For `110` and `120`, I think we should check if the inputs related to how much memory / walltime is requested go into the caching hash.
If they do not, `invalidates_cache=True` seems right to me. If they do, `False` is probably right - because the job is unlikely to succeed with the same resources.
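The reasoning about the caching hash can be made concrete with a toy example. The sketch below is not AiiDA's hashing implementation: `input_hash` is a hypothetical helper, and the input names `code` and `max_memory_kb` are chosen only for illustration. It shows that when the requested resources are left out of the hash, a job that failed with too little memory becomes indistinguishable from a rerun with more memory, which is the case where `invalidates_cache=True` would be needed.

```python
import hashlib
import json


def input_hash(inputs: dict, ignored_keys: tuple = ()) -> str:
    """Toy hash over the inputs, skipping any keys listed in `ignored_keys`."""
    relevant = {key: value for key, value in inputs.items() if key not in ignored_keys}
    return hashlib.sha256(json.dumps(relevant, sort_keys=True).encode()).hexdigest()


# Two otherwise identical jobs that request different amounts of memory.
small = {'code': 'example-code', 'max_memory_kb': 1_000_000}
large = {'code': 'example-code', 'max_memory_kb': 8_000_000}

# Resources excluded from the hash: both jobs collide, so a cached out-of-memory
# failure of `small` could wrongly be reused for `large`.
same_when_ignored = input_hash(small, ('max_memory_kb',)) == input_hash(large, ('max_memory_kb',))

# Resources included in the hash: the jobs differ, so a cache hit implies the
# same resources were requested and the job would likely fail again.
same_when_included = input_hash(small) == input_hash(large)
```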
Fixes #4331