
Add data migration for old calculation jobs #2854

Closed · sphuber opened this issue May 9, 2019 · 14 comments · Fixed by #3090

@sphuber (Contributor) commented May 9, 2019

JobCalculation nodes in old databases are currently migrated automatically to CalcJobNode instances. However, these nodes still have some old attributes that are not migrated, and they miss many attributes that they would have had, had they been run with aiida-core==1.0.0, e.g. the process state and process label attributes.
We should consider adding a data migration that transforms these old calculations, as best as possible, into something compatible.

  1. The old DbCalcJobState was dropped, but the proxy attribute state remains. The CalcJobNode.get_state method will try to cast this into the new CalcJobState enum, but the old calculations can contain values that are not supported by this enum, causing a ValueError to be raised. In the new system, this state attribute is a temporary sub-state, in addition to the process state, that only applies to CalcJob processes while they are active; it is deleted once the process terminates. Therefore, I propose we simply delete this state for all CalcJobNodes that are in a terminal state. We should be careful to apply this only to terminated process nodes, because there might be active ones during the migration that we do not want to touch.
  2. All CalcJobNodes should have a process_state, a process_label and optionally an exit_status. The process_label is currently set to the name of the process class (or the function name, in the case of a process function). I am not sure if an equivalent was stored on the old node, but otherwise we could maybe use the value in process_type, which should have been added during existing migrations. The problem here will be how to set the process_state. The old calculations only had the state attribute, the old job calculation state, but those values do not map one-to-one onto the new process states. For example, should any of the old "failed" states, such as SUBMISSIONFAILED or FAILED, become Excepted, or Finished with a non-zero exit status? And in the case of the latter, which exit status should be used?
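The ValueError from point 1 can be illustrated with a minimal, self-contained sketch. The enum members below are assumptions based on the transport tasks discussed in this thread, not the exact aiida-core definition:

```python
from enum import Enum

# Hypothetical, trimmed-down stand-in for the new CalcJobState enum;
# the actual member list in aiida-core may differ.
class CalcJobState(Enum):
    UPLOADING = 'uploading'
    SUBMITTING = 'submitting'
    WITHSCHEDULER = 'withscheduler'
    RETRIEVING = 'retrieving'
    PARSING = 'parsing'

def get_state(raw_attribute):
    """Cast the raw `state` attribute, as CalcJobNode.get_state would."""
    return CalcJobState(raw_attribute)

# A value from the old JobCalculation state set is not a member of the
# new enum, so the cast raises ValueError.
try:
    get_state('submissionfailed')
except ValueError as exc:
    print(exc)
```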

The following table summarizes how the state attribute should be translated into the correct process_state and exit_status:

Old state          Process state   Exit status   Process status
NEW                Killed          None          Legacy JobCalculation with state 'NEW'
TOSUBMIT           Killed          None          Legacy JobCalculation with state 'TOSUBMIT'
SUBMITTING         Killed          None          Legacy JobCalculation with state 'SUBMITTING'
WITHSCHEDULER      Killed          None          Legacy JobCalculation with state 'WITHSCHEDULER'
COMPUTED           Killed          None          Legacy JobCalculation with state 'COMPUTED'
RETRIEVING         Killed          None          Legacy JobCalculation with state 'RETRIEVING'
PARSING            Killed          None          Legacy JobCalculation with state 'PARSING'
FINISHED           Finished        0             -
SUBMISSIONFAILED   Excepted        None          Legacy JobCalculation with state 'SUBMISSIONFAILED'
RETRIEVALFAILED    Excepted        None          Legacy JobCalculation with state 'RETRIEVALFAILED'
PARSINGFAILED      Excepted        None          Legacy JobCalculation with state 'PARSINGFAILED'
FAILED             Finished        1             -
IMPORTED           -               -             -

Notes:

  • Note the IMPORTED state was never actually stored in the state attribute, so we do not have to consider it in the migration.
  • All the active states will be set to Killed. Before the migration is performed, the user will be warned (note: this still needs to be added) to first finish all running calculations and workflows, then stop the daemon and perform the migration. In a sense, then, if we find these "active" calculation states during the migration, it means that the user either ignored the warning and is effectively telling us to kill them, or that the calculation got stuck in an active state and was lost, in which case the Killed state is also appropriate.
  • For the process_label, inferring the original class name is difficult, although the process_type could potentially be used. However, since this is complicated to do in SQL, we simply use Legacy JobCalculation.
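As a sketch, the table translates into a plain mapping like the following. This is illustrative only: the real migration would be implemented in SQL, and the lowercase strings assume the convention that the stored attribute holds the lowercase enum value:

```python
# Mapping of old JobCalculation `state` values to the new attributes,
# following the table above: (process_state, exit_status).
# None means the attribute is not set.
STATE_MAPPING = {
    'NEW': ('killed', None),
    'TOSUBMIT': ('killed', None),
    'SUBMITTING': ('killed', None),
    'WITHSCHEDULER': ('killed', None),
    'COMPUTED': ('killed', None),
    'RETRIEVING': ('killed', None),
    'PARSING': ('killed', None),
    'FINISHED': ('finished', 0),
    'SUBMISSIONFAILED': ('excepted', None),
    'RETRIEVALFAILED': ('excepted', None),
    'PARSINGFAILED': ('excepted', None),
    'FAILED': ('finished', 1),
}

def migrate_attributes(old_state):
    """Return the new (process_state, exit_status, process_status) triple."""
    process_state, exit_status = STATE_MAPPING[old_state]
    # Only the rows without an exit status get the legacy process status.
    process_status = None
    if exit_status is None:
        process_status = "Legacy JobCalculation with state '{}'".format(old_state)
    return process_state, exit_status, process_status
```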
@ltalirz (Member) commented May 9, 2019

@giovannipizzi Seb is happy to implement the migration once we've agreed on a mapping from old to new states. Something like

  • SUBMISSIONFAILED => Excepted
  • FAILED => Excepted

We're waiting until we have some input from you here.

@sphuber On a side note, having one place to look for the state of a job seems, on the surface, preferable to having multiple states spread across different classes (with one of them temporary/conditional, if I understand correctly). Could you maybe explain in a few words the reasoning behind this current implementation?

@sphuber (Contributor, Author) commented May 10, 2019

I initially did not want to put it in, because I think the new process state should be sufficient. However, the change did introduce a loss of granularity. For example, throughout all the stages of a CalcJob its process state is simply Running or Waiting. There is the process status, which contains a human-readable message, for example Waiting for scheduler update, to tell you which transport task it is at. However, this is not queryable.

Just in case there might be a use case of people wanting to query for CalcJobs that are currently 'with the scheduler' as opposed to being uploaded, submitted or retrieved, this CalcJobState can be used for that purpose. Since this only makes sense during its lifetime, it is automatically set by the engine as the calculation job goes through the transport tasks and then gets deleted once the process terminates.
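A toy model of that lifecycle, purely illustrative (the task names and attribute handling are assumptions, not the actual engine code):

```python
# Illustrative model of a node's attribute dictionary as the engine
# drives a calculation job through its transport tasks.
def run_transport_tasks(attributes):
    for task in ('uploading', 'submitting', 'withscheduler', 'retrieving', 'parsing'):
        attributes['state'] = task  # queryable sub-state, set while active
        # ... the actual transport task would run here ...
    attributes.pop('state', None)   # deleted once the process terminates
    attributes['process_state'] = 'finished'
    return attributes

attributes = run_transport_tasks({'process_state': 'running'})
assert 'state' not in attributes  # only the process state survives termination
```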

@ltalirz (Member) commented May 11, 2019

Just a naive question: Is it necessary to have this split between CalcJobNode.state and CalcJobNode.process_state?
Can't we just decide for one and have all relevant information in there + queryable?

@sphuber (Contributor, Author) commented May 12, 2019

Yes, I think it's necessary (at least in order not to change the behaviour of the process state just for the calcjob), because state is specific to CalcJobs only, unlike the process state which, as the name implies, applies to all processes. The whole point of the exercise over the last two years was to homogenise all the different parts, such that we wouldn't have all sorts of different ways of dealing with things. Granted, there are still differences between a calculation function and a calculation job that require some differences in implementation, but they should be additive to the common base, as I have done with the new calc job state. It is an additional attribute that provides some extra queryable granularity for the active state of calc jobs.

@zhubonan (Contributor) commented Jun 3, 2019

I would support mapping FINISHED to exit_status=0. For the FAILED states, perhaps they can be mapped to a special exit_status (in the range 0-99, which is recommended to be reserved for AiiDA internals)?

The convention in aiida-quantumespresso, if I recall correctly, is to mark the calculation FAILED if one of the critical warnings is detected (such as the SCF not being able to reach convergence). My plugin follows the same logic, and in the 1.0 version I set the exit_status according to the critical warnings.
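That plugin-side logic might look roughly like this; the warning names and exit codes are made up for illustration, and the numbers deliberately avoid the 0-99 range:

```python
# Hypothetical mapping from critical warnings detected by the parser to
# plugin-defined exit statuses (numbers outside the reserved 0-99 range).
CRITICAL_WARNINGS = {
    'scf_not_converged': 410,
    'out_of_walltime': 400,
}

def exit_status_from_warnings(detected):
    """Return the exit status for the first critical warning found, else 0."""
    for warning, exit_status in CRITICAL_WARNINGS.items():
        if warning in detected:
            return exit_status
    return 0  # no critical warning: the calculation finished successfully
```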

PS: Sorry I clicked the wrong button and closed the issue...

@zhubonan zhubonan closed this as completed Jun 3, 2019
@zhubonan zhubonan reopened this Jun 3, 2019
@sphuber (Contributor, Author) commented Jun 3, 2019

Thanks for your input @zhubonan . I agree, the FINISHED to Finished [0] mapping seems straightforward. I have also considered your second suggestion of mapping the various failed states to a pre-defined exit status in the range 0-99, except that since the implementation of the exponential backoff retry mechanism, the concepts of SUBMISSIONFAILED and RETRIEVALFAILED no longer really apply, because in principle these should never occur. It would then be a bit strange to introduce a reserved exit status for them. But it might be an option, yes.

As for the further use of the exit_status: the aiida-quantumespresso plugin currently does not yet make use of it for the CalcJobs, but this is something that I am implementing. I am rewriting the parser to always try to parse as much as possible and then return the most appropriate error code. Note that I say "most appropriate" because sometimes there may be more than one individual error, yet we can only return a single exit status. Often you can define which error is the most important one, though.

@zhubonan (Contributor) commented Jun 3, 2019

I would suggest just letting JobCalculations in SUBMISSIONFAILED and RETRIEVALFAILED have the Excepted process state. Since 0.12.x does not have exponential backoff, the user would have to create new calculations to retry and would see calculations in these states as unrecoverable.

Regarding the exit_status, I am taking the same route: if more than one critical error exists, I set the exit_status to the most specific error (if possible) while still saving all of them in the created Dict output, just like before.

@sphuber (Contributor, Author) commented Jun 3, 2019

@zhubonan I think I agree with your suggestion regarding the specific old failed states; I have adapted the table. The only two that are missing now are the FAILED and IMPORTED states. The IMPORTED state was invented at some point because the daemon actually started running calculations based on this job state: importing calculations with an "active" state would have caused the daemon to start running them. To prevent this, imported calculations always had their state set to IMPORTED. Since the information is already lost, maybe the best we can do is to leave it empty?

Then for FAILED I think the most appropriate value would be Finished plus a non-zero exit_status. The question is just which one. Maybe it is appropriate here to reserve a built-in exit status in the range 0-99 for this case.

@zhubonan (Contributor) commented Jun 5, 2019

I had the impression that 0-99 should not be used by plugins, as those values are supposed to be reserved. Maybe it was mentioned in the migration workshop or somewhere in the documentation.

I have not worked with any IMPORTED nodes, but I think it makes sense to leave it empty since the information is lost.

By the way, for nodes in the NEW/TOSUBMIT/SUBMITTING/PARSING etc. states there is obviously a problem, as the plugins in 0.12.x and 1.0.x are unlikely to be compatible, and migrating the database in that state may cause problems. Perhaps it makes sense to warn the user and, if they choose to proceed, set the process state of those nodes to Excepted? Otherwise they would have the process state Running while there isn't any actual process.

@sphuber (Contributor, Author) commented Jun 5, 2019

The 0-99 range was indeed a suggestion I gave at some point, but it is not actually enforced yet. I am also not sure there will be a way to do so, so at best it will be a strong guideline. But I will send an email to plugin developers at some point to discuss some of these questions.

About the old active states: good point. We have to think about what to do here; just setting Excepted might not be OK. Imagine someone was actually running some calculations and then stopped the daemon. Then they migrate. This is a perfectly legal and possible scenario. There is no way to migrate this correctly, no matter what we do. However, there is a bigger problem: even if we could migrate them properly, they would not be able to continue, because the system they were based on has been completely removed in 1.0.0. So the real solution is that people should first finish all their calculations before migrating to 1.0.0. In principle, then, there should no longer be active job calculation states. If there are, they are in a broken state and can indeed be set to Excepted.

@ltalirz (Member) commented Jun 5, 2019

Imagine someone was actually running some calculations and then stopped the daemon. Then he migrates. This is a perfectly legal and possible scenario.

I usually side with the user on such issues, but here I have to say: with all the changes in 1.0, expecting that you can stop your daemon, migrate to 1.0 and just continue your calculation (plugins changed, etc. etc.) would be a bit ... insane ;-)

@sphuber (Contributor, Author) commented Jun 5, 2019

I would agree, but users might not necessarily realize this limitation. As far as I know we never communicated it explicitly, nor are there any warnings or safeguards in the code. I am wondering what the best way to accomplish this would be, but I am coming up short so far...

@zhubonan (Contributor)

The first thing I can think of would be to print some CAPITALIZED warnings before the migration. The current version already has a good one reminding the user to back up the database. Perhaps it could give extra warnings when the migration is detected to be from 0.12.x to 1.0.0?

It is probably not possible to detect the state of the 0.12 database from 1.0.0. The other way I can think of would be to add a check to 0.12 that inspects the database and only allows migration if the check passes (but overridable). This could perhaps be included in a minor release of 0.12.

@sphuber (Contributor, Author) commented Jun 22, 2019

This will be taken care of on the django_jsonb branch, since that will make the migration easier.
