Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrating node attributes and extras to JSONB for the Django backend #3090

Merged
merged 11 commits into from
Jun 27, 2019

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jun 26, 2019

Fixes #2821
Fixes #2788
Fixes #2789
Fixes #2786
Fixes #864
Fixes #3067
Fixes #2815
Fixes #2854
Fixes #1559
Fixes #3056

The main purpose of this PR is to migrate the old EAV schema for node attributes and extras for the Django backend. The change of the models and the database migrations have been implemented in the first commit by @szoupanos . But this change allowed to homogenize and streamline a lot of the rest of the code, since now both backends use JSONB for node attributes and extras. For example this now means that the attribute cache mechanism, that allowed one to store attributes, before the node itself was stored, can be removed. This cache is now in a sense delegated to the database model, where the attributes and extras column is represented as a dictionary. In addition, the restriction of only being able to store extras after a node was stored could also be lifted. To simplify and robustify the access of attributes and extras, the BackendNode interface has been significantly improved and tested.

Then an unrelated commit was added to migrate old state attributes from legacy JobCalculations. We decided to do this on this branch, since the migration was made a lot easier after both backends used JSONB, meaning the SQL used in the migration was identical for both backends.

The final step was to add migrations for existing export archives to account for the two database migrations added in this PR, while upping the version number to v0.6. The migrations will find state attributes of CalcJobNodes and migrate the data. Serialized datetime strings in attributes are also updated to reattach the timezone information, which was originally never included since everything was first converted to UTC before converting.

szoupanos and others added 11 commits June 26, 2019 11:14
With Django adding support for the Posgres JSONB field, the old custom
implementation of the entity-attribute-value (EAV) model, used for the
attributes and extras for the `DbNode` table, has been made obsolete.

Here we change the `DbNode` model for the Django backend to use the
`JSONB` type directly and migrations are added to change the schema and
migrate the existing data.

The `DbSetting` table also used the custom EAV model and has likewise
been migrated to use a JSONB field instead.

The hard coded dummy model that the query builder uses to map the Django
database models onto SqlAlchemy variants, has been removed and replaced
by Aldjemy models that are generated on the fly.
The way to get, set and delete global settings in the `DbSetting` table
was implemented by means of free functions. This is refactored into a
`SettingsManager` utility class.
The migrations of the old attributes and extras from the EAV tables to a
JSONB column on the node table, used a lazy fetch method to fetch nodes
in batches. However, the consecutive fetches do not sort on anything so
the same nodes may be fetched multiple times and as a result, some nodes
are never fetched. Their attributes and extras are therefore then also
never migrated. This caused certain nodes to have lost their data after
the migration. Sorting the lazy fetches by the pk of the nodes,
guarantees that each and every node is migrated, and only once.
Datetime objects were already not supported for SqlAlchemy since that
always used JSONB, which does not have native support for this datatype.
But now that Django also is migrated to use JSONB, the old support of
the EAV implementation for datetime objects needs to be removed.

The `clean_value` utility function that pre-processes each value that is
entered to be stored in the database now checks a whitelist of
acceptable value, which are essentially json-serializable, or raises an
exception for anything else that is not recognized.

The `aiida.common.timezone` module provides two utility functions named
`datetime_to_isoformat` and `isoformat_to_datetime` to serialize and
deserialize datetime objects to strings, respectively, that can be
stored in the database.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
Extras could originally only be set and mutated on stored nodes. The
reason for this limitation was that the original implementation of
extras in Django used the custom EAV schema to set arbitrary key value
pairs on a node. Since this requires a foreign key of the node for any
extra to be set, the corresponding node needed to be stored. Attributes
that used the same mechanism had this restriction alleviated by using an
in memory cache on the node instance while it was not stored. Since this
cache was never implemented for extras it was simply not allowed to set
them on unstored nodes.

Now that the Django implementation has switched from the custom EAV
schema for attributes and extras to a JSONB field, the foreign key
restriction no longer exists and the database model instance naturally
comes with a cache while it is not stored. Therefore the restriction on
nodes being stored for extras being mutable can now be lifted.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
Now that the Django implementation for the node attributes and extras
uses a JSONB field, just like SqlAlchemy, the interface for interacting
with these node properties can be homogenized and corrected.

The clear division between the front and back end classes `Node` and
`BackendNode` allows us to define a clear separation of responsabilities
as well.

 * `BackendNode`: ensure serializability of values stored in the fields
   as they will have to be stored as JSON. The implementation needs to
   ensure that cleaning of values is done as little as possible and
   scales linearly when appending to existing values.

 * `Node`: implements AiiDA specific business logic such as the rules
   for the mutability of attributes as well as the validity of key names.

The use of the `BackendNode` allows us to remove all business logic from
the database models, which in turn allows us to write exhaustive tests
for the interface in a backend independent way.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
The goal of the `ModelWrapper` is to ensure that getting and setting
fields of a database model instance automatically take care of
refreshing the state from the database and flushing any changes.
This is at the cost of more database operations which should be kept to
a minimum especially if they are useless.

For example, the refreshing and flushing of fields that are immutable
model fields when the model is already stored. The get and set attribute
methods are improved to ignore the refresh and flush if the model is
already saved and the field is immutable. Currently these are hard coded
to the set of `pk`, `id` and `uuid` since these are always immutable for
all the database models implemented in AiiDA.
The Django and SqlAlchemy implementations of the `BackendUser` and
`BackendComment` classes reimplemented the `from_dbmodel` method.
However, the implementation was exactly that of the base class
`BackendEntity` so it is better to rely on that. All that had to be done
was to set the `ENTITY_CLASS` class attribute to the correct class.
The old `JobCalculation` maintained its state in the `DbCalcState` table
whose value was also stored as an attribute with the key `state` as a
proxy. With the change to processes, all `JobCalculation` nodes have
been converted to `CalcJobNodes` but did not receive any of the typical
process attributes such as `process_state` and `exit_status`. Moreover,
they kept their `state` attribute which now contains deprecated values.
The `state` attribute is still set for `CalcJobs` but it serves a
different purpose as it is now a sub state to the process state, while
the process is active for extra granularity.

Here we add a data migration for those old calculation jobs where some
process attributes are inferred from the old state attribute, which is
then discarded. This will make sure that even for old migration
calculations, commands like `verdi process list` will show some
information that allows a user to discern what the entry was.
The change in the export archive is necessary to take the changes into
account of two database migrations:

 * Moving of node attributes and extras to JSONB for the Django backend
 * Data migration of `state` attribute of legacy `JobCalculation` nodes

Due to the migration of node attributes and extras to JSONB, which does
not support datetime objects, in contrast with the old database schema,
means that the deserialization from the export JSON data is no longer
necessary. The only serialized types were datetime objects, but these
are now no longer supported anyway in the database, just like SqlAlchemy
already was. This allows us to remove the attribute and extra conversion
dictionaries in its entirety.

A second migration deals with the conversion of legacy job calculation
states, by inferring the `process_state`, `exit_status` and
`process_status` from it.

Co-authored-by: Giovanni Pizzi <giovanni.pizzi@epfl.ch>
Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huge thanks to @szoupanos and @sphuber !
I have reviewed this together with @sphuber and we already fixed a few things before the PR was created. So this is ready to be merged for me.

@szoupanos in order to let you make your review, if you want to, we are not going to merge this until Friday lunchtime, because we would like to merge this before Monday, when we start a few more coding days. If we don't hear from you, we will merge then. Of course, if major issues surface, we will postpone.

@coveralls
Copy link

Coverage Status

Coverage increased (+3.2%) to 73.95% when pulling 83e02fd on sphuber:django_jsonb into 5e59da4 on aiidateam:develop.

Copy link
Contributor

@szoupanos szoupanos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work @sphuber and @giovannipizzi
Overall it looks nice. I will discuss 1-2 points with @sphuber but I am also OK with this PR

@sphuber sphuber merged commit a62d664 into aiidateam:develop Jun 27, 2019
@sphuber sphuber deleted the django_jsonb branch June 27, 2019 14:55
@ltalirz
Copy link
Member

ltalirz commented Jun 27, 2019

Fantastic! Killing those issues like flies ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment