Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AEP: Allow CalcJobs to be actively monitored and interrupted #36

Merged
merged 2 commits into from
Oct 13, 2022

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Sep 20, 2022

  • Used AEP template from AEP 0
  • Status is draft
  • Added type & status labels to PR
  • Added AEP to README.md
  • Provided github handles for authors

@sphuber sphuber force-pushed the aep/calcjob-monitors branch from 43ca6be to 9bbac77 Compare September 26, 2022 11:03
@sphuber
Copy link
Contributor Author

sphuber commented Sep 26, 2022

This AEP is implemented and can be seen here: aiidateam/aiida-core#5659

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is all good in general, I just have 2 questions / comments

008_calcjob_monitors/readme.md Outdated Show resolved Hide resolved
008_calcjob_monitors/readme.md Outdated Show resolved Hide resolved
008_calcjob_monitors/readme.md Outdated Show resolved Hide resolved
008_calcjob_monitors/readme.md Show resolved Hide resolved
008_calcjob_monitors/readme.md Show resolved Hide resolved

Since this would significantly complicate the implementation as well as the user-interface, and currently it is not clear whether this design goal addresses an actual use-case, it was decided to table this direction.

### Alternatives
Copy link
Member

@ltalirz ltalirz Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least as far as control flow is considered (e.g. shutting down / restarting a calculation based on something seen in the output), it may be worth mentioning here or in the introduction that there is a third alternative: implement the application-specific control flow in a python script and let AiiDA run that python script instead of the simulation code.

This is the path taken by @astamminger in aiida-cusp with custodian.

This approach has several important advantages:

  • no additional load on aiida (even allows to do remote post-processing that scales with the number of jobs)
  • maximum performance (no need to go through slow network connections)
  • portability: the same custodian script can be used by aiida, fireworks or other workflow managers
  • simplicity: no extra features needed in aiida

Of course, it also has some downsides:

  • no way to adjust computational resources based on output (note: it would be important for an AiiDA-centric implementation to support this; this can be a selling point)
  • python + dependencies required on the compute resource

The above only applies to the question of how to do control flow.

Making it more convenient for AiiDA users to monitor a calculation (e.g. a shortcut to stream the output file of a calculation) can be useful even in this scenario. But from the current implementation I understand that control flow is the main objective of the implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the "Alternatives" section with a reference to aiida-cusp. Unfortunately I am not super familiar with the implementation and intricate details of how Custodian and aiida-cusp work. If there is anything that could be changed in aiida-core that would make the concept of aiida-cusp more generically applicable or easier to implement, then it would be great if you or @astamminger could comment on this. Then I could include a more in-depth discussion on aiida-cusp in this AEP.

The reason I didn't discuss it in more depth for now is because:

  • This solution is already possible without an AEP, as evidenced by aiida-cusp working.
  • This solution is not generic. Even though Custodian could be used for various codes and is (at least not as I understood) not specific to VASP (although it is mostly used for this since it comes with predefined handlers), it was still necessary to write a separate AiiDA plugin for it to make it work.

This last point is a significant disadvantage in my opinion. Not only does it make the barrier to start using the functionality high by having to write an entire package of plugins (CalcJob, Parser and maybe more?) it also means that is not compatible with the existing aiida-vasp plugin. Unless I misunderstood, it is not possible to simply have custodian as a drop-in replacement for the aiida-vasp plugin, and so all its workflows can also not be used with custodian. As a consequence, all the work of the aiida-common-workflows is also not compatible with aiida-cusp. Contrast this with this AEP, where a user can simply take any of the existing workflows (directly from aiida-vasp or the common workflows) and simply add monitors in the steps where they would like. Due to the design of port exposing, it is possible to add monitors without the common workflows or aiida-vasp having to change any of their workflows. I think this is pretty powerful.

Making it more convenient for AiiDA users to monitor a calculation (e.g. a shortcut to stream the output file of a calculation) can be useful even in this scenario. But from the current implementation I understand that control flow is the main objective of the implementation.

Do you mean that a user could launch a calcjob and then AiiDA provide some API to retrieve the output from one of the output files in the working directory of the job? If that is the case, then that is already possible:

node = load_node()   # Load the calcjob node
with node.computer.get_transport() as transport:
    workdir = node.get_remote_workdir()
    transport.chdir(workdir)
    with tempfile.NamedTemporaryFile() as tmpfile:
        transport.get('stdout.txt', tmpfile.name)
        print(tmpfile.read())

Admittedly this is very verbose, but it could easily be wrapped in a utility function. It would also already be made a lot simpler if the transport interface would supporting streaming, which I intend to add at some point.

Anyway, the point is, if that is the use case you had in mind, then no change in aiida-core is necessary.

Copy link
Member

@ltalirz ltalirz Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This solution is not generic. Even though Custodian could be used for various codes and is (at least not as I understood) not specific to VASP (although it is mostly used for this since it comes with predefined handlers), it was still necessary to write a separate AiiDA plugin for it to make it work.

This last point is a significant disadvantage in my opinion. Not only does it make the barrier to start using the functionality high by having to write an entire package of plugins (CalcJob, Parser and maybe more?) it also means that is not compatible with the existing aiida-vasp plugin

I think in the end it boils down to the question: which logic should be implemented at which level?
Let's call implementing everything in AiiDA route A.
Route B is to use a "smaller" (intentionally limited) workflow language like custodian (or maybe jobflow/...) to do basic job monitoring/management directly inside the compute job that is running the calculation, and to integrate that with AiiDA.

I would argue that, from an ecosystem perspective, route B is objectively more generic: the logic that can live directly inside the compute job is dealt with by the local workflow manager, and the higher-level workflow manager (e.g. AiiDA) integrates with it directly. Different high-level workflow managers will be able to reuse and benefit from this locally implemented logic (users can even run it manually stand-alone if they have no need for more fancy job management). This is what has enabled the aiida-cusp plugin.

Route A (implementing everything in AiiDA) does not allow this type of reuse: e.g. there isn't a straightforward way to reuse the logic encoded in aiida-vasp inside, say, fireworks or some other job orchestrator.

This is not to say that we should not offer the capability outlined in this AEP in AiiDA; also having everything in one software package has its own advantages.
In my view, however, we should give serious consideration to route B going forward, if we want to maximize the impact of our work ("coaxing" simulation codes to do what we want) on accelerating computational materials science for everyone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that a user could launch a calcjob and then AiiDA provide some API to retrieve the output from one of the output files in the working directory of the job? If that is the case, then that is already possible:

Cheers, yes - it's just what I intuitively thought of when I read "job monitoring".
Indeed it's possible; and while it might be useful to have a shortcut for this I understand that this is not the purpose of this AEP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair points, and I am not saying that route B shouldn't be considered, but my point is that this is already possible, without any changes to AiiDA. Of course, if there are changes in AiiDA that would make it easier to implement that route, then that would be interesting for an AEP, but I am not familiar enough with those methods to see what might be missing. Feedback from @astamminger may be very valuable here.

I just want to restate though that if you go down route B, although you might develop workflows that can be run with multiple workflow managers, you will be abandoning the rest of the AiiDA ecosystem, as they will need their own plugins and won't be compatible with the plugins of the codes that they are running, e.g., aiida-quantumespresso, aiida-vasp etc. This is a choice of course, but we have already invested a lot of time to develop high level workflows with advanced functionality in those packages and you would be abandoning this. The same goes for the common workflow project.

Again, in the end it is a choice, but I think for the current state, it makes a lot of sense to have an integrated approach that works with all the code that is already out there in the current ecosystem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good. I'll leave this thread open for others to see, but I consider my comment addressed.

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments have been dealt with and now this is ok to merge for me (I'll leave @ltalirz to give feedback on the corrections for his comments)

@sphuber sphuber merged commit 19a7e4d into master Oct 13, 2022
@sphuber sphuber deleted the aep/calcjob-monitors branch October 13, 2022 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants