Migrating node attributes and extras to JSONB for the Django backend #3090
Conversation
With Django adding support for the Postgres JSONB field, the old custom implementation of the entity-attribute-value (EAV) model, used for the attributes and extras of the `DbNode` table, has become obsolete. Here we change the `DbNode` model for the Django backend to use the `JSONB` type directly, and migrations are added to change the schema and migrate the existing data. The `DbSetting` table also used the custom EAV model and has likewise been migrated to a JSONB field. The hard-coded dummy model that the query builder used to map the Django database models onto SqlAlchemy variants has been removed and replaced by Aldjemy models that are generated on the fly.
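For reference, the schema side of such a change boils down to something like the following Django migration sketch. This is only an illustration: the dependency name is hypothetical and the data migration out of the old EAV tables is omitted entirely.

```python
# Simplified sketch of the schema change, not the actual migration shipped in this PR.
from django.contrib.postgres.fields import JSONField
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [('db', '0036_previous_migration')]  # hypothetical predecessor

    operations = [
        # Add native JSONB columns that replace the EAV-based attributes and extras
        migrations.AddField(model_name='dbnode', name='attributes', field=JSONField(default=dict, null=True)),
        migrations.AddField(model_name='dbnode', name='extras', field=JSONField(default=dict, null=True)),
    ]
```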
Getting, setting and deleting global settings in the `DbSetting` table used to be implemented by means of free functions. This is refactored into a `SettingsManager` utility class.
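Schematically, such a utility class could expose an interface along these lines; the method names here are illustrative and the real implementation talks to the `DbSetting` table of each backend.

```python
# Minimal interface sketch of the settings manager; not the actual AiiDA class.
import abc


class SettingsManager(abc.ABC):
    """Replace the former free functions for getting, setting and deleting global settings."""

    @abc.abstractmethod
    def get(self, key):
        """Return the setting with the given key, or raise if it does not exist."""

    @abc.abstractmethod
    def set(self, key, value, description=None):
        """Create or update the setting with the given key."""

    @abc.abstractmethod
    def delete(self, key):
        """Delete the setting with the given key, or raise if it does not exist."""
```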
The migration of the old attributes and extras from the EAV tables to a JSONB column on the node table used a lazy fetch method to retrieve nodes in batches. However, the consecutive fetches did not sort on anything, so the same nodes could be fetched multiple times while others were never fetched at all, meaning their attributes and extras were never migrated and their data was lost after the migration. Sorting the lazy fetches by the pk of the nodes guarantees that each and every node is migrated, and only once.
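The fix amounts to keyset pagination on the primary key. A rough sketch of the idea, using the Django ORM with hypothetical names, not the literal migration code:

```python
# Illustrative sketch of paging over the node table by primary key.
BATCH_SIZE = 1000


def iter_nodes_in_batches(node_model):
    """Yield every node exactly once by paging on the primary key."""
    last_pk = 0
    while True:
        # Without the explicit ordering, consecutive batches can overlap or skip rows.
        batch = list(node_model.objects.filter(pk__gt=last_pk).order_by('pk')[:BATCH_SIZE])
        if not batch:
            break
        for node in batch:
            yield node
        last_pk = batch[-1].pk
```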
Datetime objects were already not supported by SqlAlchemy, since it always used JSONB, which has no native support for this datatype. Now that Django is also migrated to JSONB, the old datetime support of the EAV implementation needs to be removed. The `clean_value` utility function, which pre-processes each value before it is stored in the database, now checks a whitelist of acceptable values, which are essentially the JSON-serializable ones, and raises an exception for anything else. The `aiida.common.timezone` module provides two utility functions, `datetime_to_isoformat` and `isoformat_to_datetime`, to serialize datetime objects to strings that can be stored in the database and to deserialize them back, respectively.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
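As a rough idea of what such helpers do, here is a minimal pure-Python sketch; the actual functions in `aiida.common.timezone` may differ in detail, for example in how they handle timezones.

```python
# Sketch of ISO 8601 round-tripping for datetime values stored in JSONB.
from datetime import datetime


def datetime_to_isoformat(value):
    """Serialize a datetime object to an ISO 8601 string, passing through None."""
    if value is None:
        return None
    return value.isoformat()


def isoformat_to_datetime(value):
    """Deserialize an ISO 8601 string back to a datetime object, passing through None."""
    if value is None:
        return None
    return datetime.fromisoformat(value)
```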
Extras could originally only be set and mutated on stored nodes. The reason for this limitation was that the original Django implementation of extras used the custom EAV schema to set arbitrary key-value pairs on a node. Since any extra requires a foreign key to the node, the corresponding node needed to be stored. Attributes, which used the same mechanism, had this restriction alleviated by an in-memory cache on the node instance while it was not yet stored. Since this cache was never implemented for extras, setting them on unstored nodes was simply not allowed. Now that the Django implementation has switched from the custom EAV schema to a JSONB field for attributes and extras, the foreign key restriction no longer exists and the database model instance naturally acts as a cache while the node is not stored. The restriction that a node be stored before its extras can be mutated can therefore be lifted.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
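In practice this means something like the following now works, whereas it previously raised; this is just an illustrative usage example assuming a loaded AiiDA profile with this change included.

```python
# Setting an extra on an unstored node, then storing it.
from aiida import orm

node = orm.Data()
node.set_extra('source', 'example')  # previously not allowed before `store()`
node.store()
assert node.get_extra('source') == 'example'
```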
Now that the Django implementation of node attributes and extras uses a JSONB field, just like SqlAlchemy, the interface for interacting with these node properties can be homogenized and corrected. The clear division between the frontend and backend classes `Node` and `BackendNode` allows us to define a clear separation of responsibilities as well:

* `BackendNode`: ensures serializability of the values stored in the fields, as they will have to be stored as JSON. The implementation needs to ensure that cleaning of values is done as little as possible and scales linearly when appending to existing values.
* `Node`: implements AiiDA-specific business logic, such as the rules for the mutability of attributes and the validity of key names.

The use of the `BackendNode` allows us to remove all business logic from the database models, which in turn allows us to write exhaustive tests for the interface in a backend-independent way.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
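Schematically, the split could look like the following self-contained sketch; names, checks and details are illustrative and not the actual implementation.

```python
# Toy illustration of the frontend/backend division of responsibilities.
import json


class ModificationNotAllowed(Exception):
    """Raised when trying to mutate an immutable property."""


class BackendNode:
    """Backend class: only guarantees that stored values are JSON-serializable."""

    def __init__(self):
        self.attributes = {}
        self.is_stored = False

    def set_attribute(self, key, value):
        json.dumps(value)  # stand-in for value cleaning: reject non-serializable values
        self.attributes[key] = value


class Node:
    """Frontend class: enforces AiiDA business rules, then delegates to the backend."""

    def __init__(self):
        self._backend_entity = BackendNode()

    def set_attribute(self, key, value):
        if '.' in key:
            raise ValueError('attribute keys may not contain dots')
        if self._backend_entity.is_stored:
            raise ModificationNotAllowed('the attributes of a stored node are immutable')
        self._backend_entity.set_attribute(key, value)
```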
The goal of the `ModelWrapper` is to ensure that getting and setting fields of a database model instance automatically takes care of refreshing the state from the database and flushing any changes. This comes at the cost of more database operations, which should be kept to a minimum, especially when they are useless, for example the refreshing and flushing of immutable model fields when the model is already stored. The get and set attribute methods are improved to skip the refresh and flush if the model is already saved and the field is immutable. Currently these fields are hard-coded to `pk`, `id` and `uuid`, since these are always immutable for all the database models implemented in AiiDA.
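A condensed sketch of the idea behind this optimization is shown below; the real `ModelWrapper` handles more cases and supports both backends, so take the Django-flavoured calls here as an assumption.

```python
# Sketch: skip refresh/flush for immutable fields of an already saved model.
IMMUTABLE_MODEL_FIELDS = {'id', 'pk', 'uuid'}


class ModelWrapper:

    def __init__(self, model):
        object.__setattr__(self, '_model', model)

    def __getattr__(self, name):
        if self._is_saved() and name not in IMMUTABLE_MODEL_FIELDS:
            self._refresh(fields=(name,))  # refresh only values that can actually change
        return getattr(self._model, name)

    def __setattr__(self, name, value):
        setattr(self._model, name, value)
        if self._is_saved() and name not in IMMUTABLE_MODEL_FIELDS:
            self._flush(fields=(name,))  # no need to flush immutable fields

    def _is_saved(self):
        return self._model.pk is not None

    def _refresh(self, fields=None):
        self._model.refresh_from_db(fields=fields)  # Django flavour; SqlAlchemy would refresh via the session

    def _flush(self, fields=None):
        self._model.save(update_fields=fields)
```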
The Django and SqlAlchemy implementations of the `BackendUser` and `BackendComment` classes reimplemented the `from_dbmodel` method. However, the implementation was exactly that of the base class `BackendEntity` so it is better to rely on that. All that had to be done was to set the `ENTITY_CLASS` class attribute to the correct class.
The old `JobCalculation` maintained its state in the `DbCalcState` table, whose value was also stored as an attribute with the key `state` as a proxy. With the change to processes, all `JobCalculation` nodes have been converted to `CalcJobNodes`, but they did not receive any of the typical process attributes, such as `process_state` and `exit_status`. Moreover, they kept their `state` attribute, which now contains deprecated values. The `state` attribute is still set for `CalcJobs`, but it serves a different purpose: it is now a sub-state of the process state while the process is active, for extra granularity. Here we add a data migration for those old calculation jobs in which some process attributes are inferred from the old `state` attribute, which is then discarded. This makes sure that even for old, migrated calculations, commands like `verdi process list` will show some information that allows a user to discern what the entry was.
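Conceptually, the data migration boils down to a lookup table from the legacy state to the new process attributes. The sketch below is only illustrative: the exact states and the values assigned by the real migration may differ.

```python
# Illustrative mapping from the legacy `state` attribute to inferred process attributes.
LEGACY_STATE_TO_PROCESS_ATTRIBUTES = {
    'FINISHED': {'process_state': 'finished', 'exit_status': 0},
    'FAILED': {'process_state': 'excepted'},
    'SUBMISSIONFAILED': {'process_state': 'excepted'},
    'RETRIEVALFAILED': {'process_state': 'excepted'},
    'PARSINGFAILED': {'process_state': 'excepted'},
    'WITHSCHEDULER': {'process_state': 'waiting', 'process_status': 'Legacy `WITHSCHEDULER` calculation'},
}


def infer_process_attributes(attributes):
    """Return new process attributes inferred from a node's legacy `state` attribute."""
    state = attributes.pop('state', None)  # the old attribute is discarded by the migration
    return LEGACY_STATE_TO_PROCESS_ATTRIBUTES.get(state, {})
```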
The change to the export archive is necessary to take into account two database migrations:

* the move of node attributes and extras to JSONB for the Django backend
* the data migration of the `state` attribute of legacy `JobCalculation` nodes

Since the new JSONB columns do not support datetime objects, in contrast with the old database schema, the deserialization of the export JSON data is no longer necessary. The only serialized types were datetime objects, but these are no longer supported in the database anyway, just as was already the case for SqlAlchemy. This allows us to remove the attribute and extra conversion dictionaries in their entirety. A second migration deals with the conversion of legacy job calculation states by inferring the `process_state`, `exit_status` and `process_status` from them.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
Huge thanks to @szoupanos and @sphuber !
I have reviewed this together with @sphuber and we already fixed a few things before the PR was created. So this is ready to be merged for me.
@szoupanos in order to let you make your review, if you want to, we are not going to merge this until Friday lunchtime, because we would like to merge this before Monday, when we start a few more coding days. If we don't hear from you, we will merge then. Of course, if major issues surface, we will postpone.
Very good work @sphuber and @giovannipizzi
Overall it looks nice. I will discuss 1-2 points with @sphuber but I am also OK with this PR
Fantastic! Killing those issues like flies ;-)
Fixes #2821
Fixes #2788
Fixes #2789
Fixes #2786
Fixes #864
Fixes #3067
Fixes #2815
Fixes #2854
Fixes #1559
Fixes #3056
The main purpose of this PR is to migrate the old EAV schema for node attributes and extras for the Django backend. The change of the models and the database migrations were implemented in the first commit by @szoupanos. This change allowed us to homogenize and streamline a lot of the rest of the code, since now both backends use JSONB for node attributes and extras. For example, the attribute cache mechanism, which allowed one to store attributes before the node itself was stored, can be removed: this cache is now in a sense delegated to the database model, where the attributes and extras columns are represented as dictionaries. In addition, the restriction of only being able to store extras after a node was stored could also be lifted. To simplify and robustify the access of attributes and extras, the `BackendNode` interface has been significantly improved and tested.

Then an unrelated commit was added to migrate old `state` attributes from legacy `JobCalculations`. We decided to do this on this branch, since the migration was made a lot easier after both backends used JSONB, meaning the SQL used in the migration is identical for both backends.

The final step was to add migrations for existing export archives to account for the two database migrations added in this PR, while upping the archive version number to `v0.6`. The migrations will find `state` attributes of `CalcJobNodes` and migrate the data. Serialized datetime strings in attributes are also updated to reattach the timezone information, which was originally never included since everything was first converted to UTC before serialization.
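For the timezone fix, the idea is that the stored strings were naive but known to be in UTC, so the archive migration can reattach an explicit UTC offset. A rough sketch under that assumption; the real migration code may handle this differently.

```python
# Sketch: reattach the UTC offset to naive ISO 8601 datetime strings in old archives.
def attach_utc_timezone(serialized):
    """Append a UTC offset to a naive ISO 8601 datetime string, leaving aware strings untouched."""
    if serialized.endswith('+00:00') or serialized.endswith('Z'):
        return serialized
    return serialized + '+00:00'


assert attach_utc_timezone('2018-01-01T12:00:00.000000') == '2018-01-01T12:00:00.000000+00:00'
```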