Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a PluginVersionProvider for processes #3131

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jul 4, 2019

Fixes #2664 and fixes #2988

This new utility class is used by the Runner class to keep a mapping
of certain process classes, either Process sub classes or process
functions onto a dictionary of "version" information. This dictionary
now includes the version of aiida-core that is running as well as the
version of the top level module of the plugin, if defined. If the latter
cannot be determined for whatever reason, only the version of
aiida-core is returned.

This information is then retrieved by the Process class during the
creation process. It will store this information in the attributes of
the process node. Currently, no logic in aiida-core will act on this
information. Its sole purpose is to give the user slightly more info on
with what versions of core and the plugin a certain process node was
generated. The attributes can also be used in querying to filter nodes
for a sub set generated with a specific version of the plugin.

Note that "plugin" here refers to the entire package, e.g the entire
aiida-quantumespresso plugin. Each plugin can contain multiple entry
points for the various entry point categories, that are sometimes
individually also referred to as plugins. In the future, the version
dictionary returned by the PluginVersionProvider may be enriched with
a specific entry_point version, once that level of version granulariy
will become supported. For the time being it is not included.

@sphuber sphuber requested a review from giovannipizzi July 4, 2019 14:53
@sphuber
Copy link
Contributor Author

sphuber commented Jul 4, 2019

I tried to couple the PluginVersionProvider as loosely as possible from the Process and the Node class, such that if it needs to be removed or changed it impacts those parts of the code as little as possible. The difficulty was that since the version information needs to be stored ideally as "normal" attributes, meaning before the node gets stored, this information needs to be injected before the store method is called. I thought about something like an event hook on the Node.store method, but for that to work, any listeners need to be attached to an instance, but we will never have access to those instances.

This implementation only adds the version information to process nodes and therefore the best place is to have this information inserted in the Process._setup_db_record class. It tries to access the version_provider of its runner instance and if it has one, asks it for the version dictionary and stores it in the attributes. The Runner class now then by default creates an instance of PluginVersionProvider such that the determining of the plugin version info is cached properly for daemon runners (so that the overhead is minimal). Currently, the following structure will be set in the attributes:

'version': {
    'core': '1.0.0b4',
    'plugin': '3.0.0a3'
}

In the future the key entry_point can be added if we decide to start supporting sub versioning per entry point resource.

N.B.: in this implementation, only ProcessNodes get this version information. Do we also want this for all Data nodes. Thinking about it now, I don't see why not. But if we want this, it would require a different way of hooking the information in. We then need to somehow manage to interject this in the Node.store method, because the creation of Data nodes is not manager by the Process class and so the current implementation won't work.

Pinging @chrisjsewell @giovannipizzi @ltalirz @zhubonan for feedback about current proposal

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small change required, thanks!

@@ -633,6 +633,12 @@ def _setup_db_record(self):

def _setup_metadata(self):
"""Store the metadata on the ProcessNode."""
try:
version_info = self.runner.plugin_version_provider.get_version_info(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current implementation I don't think the try/except is necessary

@giovannipizzi
Copy link
Member

A comment: if we want to support this also for Data nodes (not a bad idea) we should probably have another column (node_version) or similar, possibly JSONB, including this data. The reason is that e.g. for Dict all attributes are considered the content of the node - these new attributes would show up and collide with the keys defined by the users (this already happened in the past when we wanted to define e.g. the "license" of a node). In that case, we might want to decide to make the column more general (e.g. node_metadata) that include also version={...} if we think that this can contain other non-version info. But if we are not sure better to keep only node_version and in the future we can migrate it.

@giovannipizzi
Copy link
Member

@chrisjsewell @ltalirz @zhubonan for me this (apart for a minor fix) is good to merge. Do you see anything we are missing in terms of important information to keep? Otherwise we'll merge mid next week.

KEY_VERSION_ROOT = 'version'
KEY_VERSION_CORE = 'core' # The version of `aiida-core`
KEY_VERSION_PLUGIN = 'plugin' # The version of the plugin top level module, e.g. `aiida-quantumespresso`
KEY_VERSION_ENTRY_POINT = 'entry_point' # The version of the specific corresponding entry point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove this line since it's not yet used

@chrisjsewell
Copy link
Member

chrisjsewell commented Jul 5, 2019

Hey @sphuber, looks good ta :) Yeh Data nodes would be a 'nice-to-have', but if they are going to be a big hassle, then I don't think it is as much of a priority, since they are better defined just by their content.

My only other thought, which is probably overkill, is that it might be nice to also read/store an (optional) __version__ from the processes specific module (i.e. the actual file its contained in). The motivation would be for later use of cached calculations: within your package you might have a certain calculation plugin that is 'stable', and doesn't change during version increases (for example because you are just adding extra calculation/workflow plugins to your package). You might want to denote that these calculations are still valid to use as a cache, despite a version increase of your package.
Actually, thinking about it now, I guess the converse approach would be to allow compatible core/plugin version ranges for a node type, to be specified in the caching.yaml, or something like that (maybe a todo for #2549)

@sphuber sphuber force-pushed the fix_2664_process_plugin_version_attribute branch from fcb7f98 to 1624fcc Compare July 5, 2019 08:06
@zhubonan
Copy link
Contributor

zhubonan commented Jul 5, 2019

I think it is a nice thing to have as at the moment no explicit information has been stored about the version.

@chrisjsewell I think that the module level __version__ is already being used for creating the hash when the node is stored.

def _get_objects_to_hash(self):
"""Return a list of objects which should be included in the hash."""
objects = [
importlib.import_module(self.__module__.split('.', 1)[0]).__version__,
{
key: val
for key, val in self.attributes_items()
if key not in self._hash_ignored_attributes and key not in self._updatable_attributes # pylint: disable=unsupported-membership-test
},

However, this does mean that if you bump up the module level __version__ of a process at later date (by updating the plugin code), the hash will change, e.g get_hash returns a different one other than that stored in extra. I imagine this potentially can cause false positives when the database is rehashed with updated plugin versions. Once the process has its version attributes stored in the database this is not a problem anymore.

I also agree with @chrisjsewell that it is quite likely that a CalcJob stays stable between different plugin versions, hence it may be useful if we only consider the module level __version__ when computing hashes for Process.

@sphuber
Copy link
Contributor Author

sphuber commented Jul 5, 2019

Thanks a lot for the feedback guys.

I see the use-case of being able to re-use CalcJobNodes for caching even if the plugin (i.e. the entire package) version has changed, but the implementation of the CalcJob hasn't. As @zhubonan has pointed out, this already does not work, because of the way the hash is currently computed. In that sense this commit does not worsen the situation but already paves the road to it being solved in the future. In that future, we might also envision sub versioning of not just CalcJobs but also WorkChain classes and Data classes, even though it won't affect the caching mechanism. Since the purpose of this commit is to provide a minimal starting point that will already provide a lot of added value, without tying ourselves too much down, I tempted not to include the module specific versioning now. I think it will be easy to add in the future with the current setup. Would you agree?

The only thing that bothers me here is a simple problem of "language". I find it becomes difficult to talk about the different versions because things like "package", "module" and "plugin" are ambiguous and can refer to various things. I think it would be very valuable if we can define these now, such that extending the functionality in the future will be easier, when we know what we are talking about.
Problem is really twofold:

  1. In python a module can be a single file or a collection of files in a nested hierarchy, often called a package. But ultimately a package is also a
  2. We refer to an entire repository with one or multiple entry points for different aiida entry point groups a "plugin", but individual resources, pointed to by those entry points, can also be referred to as a "plugin".

So how do we define the versions and call the following things? (using aiida-quantumespresso as an example)

aiida_core.__version__: version.core
aiida_quantumespresso.__version__: version.plugin / version.repository/ version.package ?
aiida_quantumespresso.workflows.pw.__version__: version.package / version.module ?
aiida_quantumespresso.workflows.pw.base.__version__: version.module
PwBaseWorkChain.__version__: version.entry_point / version.plugin / version.module ?

The latter could be called version.module but that kind of assumes that there is only one "resource" in that module. For example, you can also add a calcfunction as an entry point in the aiida.calculations group. If you define that in the same python module as the PwBaseWorkChain class, does it get the same version attribute? Other option would be to allow, somehow, to define a specific version per "entry_point", i.e. for the aiida.workflows:quantumespresso.pw.base one. For class based resources this could be done by defining a __version__ class attribute. However, for process functions this becomes impossible, I think.

@ltalirz
Copy link
Member

ltalirz commented Jul 5, 2019

I tempted not to include the module specific versioning now. I think it will be easy to add in the future with the current setup. Would you agree?

I would vote not to include module-level versioning at this point.
We should think carefully before introducing such a complication and be sure that it actually solves more problems than it creates.

So how do we define the versions and call the following things?

I'm not sure we even need to plan ahead to include such versioning at this point.
I think there are arguments to limit ourselves to existing versions (which are the versions of the python packages).

On comment do I have is that providing the version of two packages (core + package where the process is defined) is not only an incomplete approximation of the python package environment but even of the "aiida python package environment".

E.g. consider I publish a workflow package aiida-cp2k-lsmo, which depends on the "aiida-cp2k" plugin (but does not hardcode its version). A workflow from aiida-cp2k-lsmo will use the calculation plugin defined in aiida-cp2k.

I'm not suggesting to go beyond 2 packages now - but this is just to say that, in the medium term, we may not just want to have a core and plugin version, but e.g. a list of package versions.
I.e. to be forward-compatible, why not do something like

{ 'package-versions': {
    'aiida-core': '1.0.0b1',
    'aiida-cp2k': '...'
  }
}

@ltalirz
Copy link
Member

ltalirz commented Jul 5, 2019

P.S. Perhaps more important than defining the naming scheme or the implementation would be to define the purpose of storing the version information.

Its sole purpose is to give the user slightly more info on
with what versions of core and the plugin a certain process node was
generated.

This is quite vague - if we can agree what we would like to use the version information for (e.g. its relation to hashing), we also get a clearer picture of where this will go in the future.

@sphuber
Copy link
Contributor Author

sphuber commented Jul 5, 2019

On comment do I have is that providing the version of two packages (core + package where the process is defined)

Regarding the name "package": the aiida_cp2k is a python package, but so is aiida_cp2k.calculations and aiida_cp2k.workflows. They just happen to be bundled in another package aiida_cp2k that we then refer to as a "plugin".

A workflow from aiida-cp2k-lsmo will use the calculation plugin defined in aiida-cp2k.

This is fine though, because the CalcJobNode that will be created when the WorkChain calls the corresponding Cp2kCalculation, will record the version of aiida-cp2k that was installed in that environment. This is why I do not think that we should store the versions of all installed packages in the environment on every node. I mean where would you stop, would you also include the versions of aiida-codtools and aiida-quantumespresso on a CalcJobNode of a Cp2kCalculation? I don't think that is necessary as they typically won't have any connections. I think we just need to define the version of aiida-core and the version of the package that the "resource", be it a WorkChain, CalcJob class or a process function, was defined in.

P.S. Perhaps more important than defining the naming scheme or the implementation would be to define the purpose of storing the version information.

The amount of discussion here shows that just coming up with the definitions of the various versions and what they should mean is difficult. Any business logic that should then act on those versions will inherently be even more complicated and probably error prone. The goal here was then precisely not to tackle the whole problem, but rather provide a minimal subset, with the least constraints/complications for the greatest informational gain. This is what I meant with the statement:

Its sole purpose is to give the user slightly more info on with what versions of core and the plugin a certain process node was generated.

The protocol for how these version attributes are determined (in this current proposal) is very easy to define (but therefore is very limited in terms of configurability/granularity).
In addition, the data stored is merely for informational purposes. That is to say: no business logic in aiida-core will act on these values. Whether the plugin itself does this is up to them. Mostly I envision this information being useful for querying to filter out certain sub sets if you know the behavior changed after a certain version of either aiida-core, but most likely a given plugin.

@ltalirz
Copy link
Member

ltalirz commented Jul 5, 2019

Regarding the name "package": the aiida_cp2k is a python package, but so is aiida_cp2k.calculations and aiida_cp2k.workflows. They just happen to be bundled in another package aiida_cp2k that we then refer to as a "plugin".

You're right - but I'm not aware of any aiida plugin that versions subfolders separately.
I.e. I would say the term "package version" is quite well defined.

I mean where would you stop, would you also include the versions of aiida-codtools and aiida-quantumespresso on a CalcJobNode of a Cp2kCalculation?

Indeed - any point chosen to stop here is arbitrary to some degree. You chose to stop after aiida-core and the version of the package that defines the process, and I agree that it makes the most sense for now (it's just not the only possible choice).

@sphuber sphuber force-pushed the fix_2664_process_plugin_version_attribute branch 2 times, most recently from d2acba4 to ee505ab Compare July 15, 2019 16:39
This new utility class is used by the `Runner` class to keep a mapping
of certain process classes, either `Process` sub classes or process
functions onto a dictionary of "version" information. This dictionary
now includes the version of `aiida-core` that is running as well as the
version of the top level module of the plugin, if defined. If the latter
cannot be determined for whatever reason, only the version of
`aiida-core` is returned.

This information is then retrieved by the `Process` class during the
creation process. It will store this information in the attributes of
the process node. Currently, no logic in `aiida-core` will act on this
information. Its sole purpose is to give the user slightly more info on
with what versions of core and the plugin a certain process node was
generated. The attributes can also be used in querying to filter nodes
for a sub set generated with a specific version of the plugin.

Note that "plugin" here refers to the entire package, e.g the entire
`aiida-quantumespresso` plugin. Each plugin can contain multiple entry
points for the various entry point categories, that are sometimes
individually also referred to as plugins. In the future, the version
dictionary returned by the `PluginVersionProvider` may be enriched with
a specific `entry_point` version, once that level of version granulariy
will become supported. For the time being it is not included.
giovannipizzi
giovannipizzi previously approved these changes Jul 16, 2019
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@sphuber sphuber force-pushed the fix_2664_process_plugin_version_attribute branch from ee505ab to 5a0e2d2 Compare July 16, 2019 07:11
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again ;-)

@sphuber sphuber merged commit 5058e23 into aiidateam:develop Jul 16, 2019
@sphuber sphuber deleted the fix_2664_process_plugin_version_attribute branch July 16, 2019 07:48
d-tomerini pushed a commit to d-tomerini/aiida_core that referenced this pull request Sep 30, 2019
This new utility class is used by the `Runner` class to keep a mapping
of certain process classes, either `Process` sub classes or process
functions onto a dictionary of "version" information. This dictionary
now includes the version of `aiida-core` that is running as well as the
version of the top level module of the plugin, if defined. If the latter
cannot be determined for whatever reason, only the version of
`aiida-core` is returned.

This information is then retrieved by the `Process` class during the
creation process. It will store this information in the attributes of
the process node. Currently, no logic in `aiida-core` will act on this
information. Its sole purpose is to give the user slightly more info on
with what versions of core and the plugin a certain process node was
generated. The attributes can also be used in querying to filter nodes
for a sub set generated with a specific version of the plugin.

Note that "plugin" here refers to the entire package, e.g the entire
`aiida-quantumespresso` plugin. Each plugin can contain multiple entry
points for the various entry point categories, that are sometimes
individually also referred to as plugins. In the future, the version
dictionary returned by the `PluginVersionProvider` may be enriched with
a specific `entry_point` version, once that level of version granulariy
will become supported. For the time being it is not included.
d-tomerini pushed a commit to d-tomerini/aiida_core that referenced this pull request Oct 16, 2019
This new utility class is used by the `Runner` class to keep a mapping
of certain process classes, either `Process` sub classes or process
functions onto a dictionary of "version" information. This dictionary
now includes the version of `aiida-core` that is running as well as the
version of the top level module of the plugin, if defined. If the latter
cannot be determined for whatever reason, only the version of
`aiida-core` is returned.

This information is then retrieved by the `Process` class during the
creation process. It will store this information in the attributes of
the process node. Currently, no logic in `aiida-core` will act on this
information. Its sole purpose is to give the user slightly more info on
with what versions of core and the plugin a certain process node was
generated. The attributes can also be used in querying to filter nodes
for a sub set generated with a specific version of the plugin.

Note that "plugin" here refers to the entire package, e.g the entire
`aiida-quantumespresso` plugin. Each plugin can contain multiple entry
points for the various entry point categories, that are sometimes
individually also referred to as plugins. In the future, the version
dictionary returned by the `PluginVersionProvider` may be enriched with
a specific `entry_point` version, once that level of version granulariy
will become supported. For the time being it is not included.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants