♻️ REFACTOR: New archive format (#5145)
Implement the new archive format,
as discussed in `aiidateam/AEP/005_exportformat`.

To address shortcomings in cpu/memory performance for export/import,
the archive format has been re-designed.
In particular,

1. The `data.json` has been replaced with an sqlite database,
   using the same schema as the sqlalchemy backend,
   meaning it is no longer required to be fully read into memory.
2. The archive utilises the repository redesign,
   with binary files stored by hashkeys (removing duplication).
3. The archive is only saved as zip (not tar),
   meaning internal files can be decompressed+streamed independently,
   without the need to uncompress the entire archive file.
4. The archive is implemented as a full (read-only) backend,
   meaning it can be queried without the need to import to a profile.
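
Points 2 and 3 above can be illustrated with a minimal sketch using only the Python standard library. This is not the AiiDA implementation: the `repo/<key>` member layout, the choice of SHA-256, and the helper names `add_object` and `stream_object` are assumptions made for illustration. It shows that files stored under a content hash are written once even when added twice, and that a single zip member can be read in chunks without unpacking the rest of the archive.

```python
import hashlib
import io
import zipfile


def add_object(archive: zipfile.ZipFile, content: bytes) -> str:
    """Store ``content`` under its SHA-256 hex digest, skipping duplicates."""
    key = hashlib.sha256(content).hexdigest()
    name = f"repo/{key}"  # hypothetical layout, not the real archive layout
    if name not in archive.namelist():
        archive.writestr(name, content)
    return key


def stream_object(archive: zipfile.ZipFile, key: str, chunk_size: int = 1024):
    """Yield one stored object in chunks, decompressing only that member."""
    with archive.open(f"repo/{key}") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


# Write an archive to an in-memory buffer (a file path would work the same way).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    key_a = add_object(archive, b"x" * 5000)
    key_b = add_object(archive, b"x" * 5000)  # identical content, same key
    key_c = add_object(archive, b"different content")
    stored_names = archive.namelist()

# Re-open read-only and stream a single member back out.
with zipfile.ZipFile(buffer, "r") as archive:
    restored = b"".join(stream_object(archive, key_a))
```

Because each zip member is compressed on its own, `archive.open` only has to decompress the requested entry, which a compressed tar file does not allow without reading through the preceding data.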

Additionally, the entire export/import code has been re-written
to utilise these changes.

These changes have reduced the export times by ~250%, export peak RAM by ~400%,
import times by ~400%, and import peak RAM by ~500%.
The changes also allow for future push/pull mechanisms.
chrisjsewell authored Dec 1, 2021
2 parents d5084d6 + b1a7c46 commit 0acfb2d
Showing 187 changed files with 9,408 additions and 11,525 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -34,3 +34,5 @@ pip-wheel-metadata
# Docs
docs/build
docs/source/reference/apidoc

pplot_out/
4 changes: 1 addition & 3 deletions .pre-commit-config.yaml
@@ -105,9 +105,7 @@ repos:
aiida/repository/.*py|
aiida/tools/graph/graph_traversers.py|
aiida/tools/groups/paths.py|
aiida/tools/importexport/archive/.*py|
aiida/tools/importexport/dbexport/__init__.py|
aiida/tools/importexport/dbimport/backends/.*.py|
aiida/tools/archive/.*py|
)$
- id: pylint
3 changes: 2 additions & 1 deletion aiida/backends/sqlalchemy/models/__init__.py
@@ -9,6 +9,7 @@
###########################################################################
"""Module to define the database models for the SqlAlchemy backend."""
import sqlalchemy as sa
from sqlalchemy.orm import mapper

# SqlAlchemy does not set default values for table columns upon construction of a new instance, but will only do so
# when storing the instance. Any attributes that do not have a value but have a defined default, will be populated with
@@ -34,4 +35,4 @@ def instant_defaults_listener(target, _, __):
setattr(target, key, column.default.arg)


sa.event.listen(sa.orm.mapper, 'init', instant_defaults_listener)
sa.event.listen(mapper, 'init', instant_defaults_listener)
7 changes: 6 additions & 1 deletion aiida/backends/sqlalchemy/models/authinfo.py
@@ -20,7 +20,12 @@


class DbAuthInfo(Base):
"""Class that keeps the authernification data."""
"""Database model to keep computer authentication data, per user.
Specifications are user-specific of how to submit jobs in the computer.
The model also has an ``enabled`` logical switch that indicates whether the device is available for use or not.
This last one can be set and unset by the user.
"""
__tablename__ = 'db_dbauthinfo'

id = Column(Integer, primary_key=True) # pylint: disable=invalid-name
3 changes: 2 additions & 1 deletion aiida/backends/sqlalchemy/models/comment.py
@@ -22,7 +22,8 @@


class DbComment(Base):
"""Class to store comments using SQLA backend."""
"""Database model to store comments, relating to a node."""

__tablename__ = 'db_dbcomment'

id = Column(Integer, primary_key=True) # pylint: disable=invalid-name
11 changes: 10 additions & 1 deletion aiida/backends/sqlalchemy/models/computer.py
@@ -18,7 +18,16 @@


class DbComputer(Base):
"""Class to store computers using SQLA backend."""
"""Database model to store computers.
Computers are identified within AiiDA by their ``label`` (and thus it must be unique for each one in the database),
whereas the ``hostname`` is the label that identifies the computer within the network from which one can access it.
The ``scheduler_type`` column contains the information of the scheduler (and plugin)
that the computer uses to manage jobs, whereas the ``transport_type`` the information of the transport
(and plugin) required to copy files and communicate to and from the computer.
The ``metadata`` contains some general settings for these communication and management protocols.
"""
__tablename__ = 'db_dbcomputer'

id = Column(Integer, primary_key=True) # pylint: disable=invalid-name
11 changes: 9 additions & 2 deletions aiida/backends/sqlalchemy/models/group.py
@@ -32,12 +32,19 @@


class DbGroupNode(Base):
"""Class to store group to nodes relation using SQLA backend."""
"""Database model to store group-to-nodes relations."""
__tablename__ = table_groups_nodes.name
__table__ = table_groups_nodes


class DbGroup(Base):
"""Class to store groups using SQLA backend."""
"""Database model to store groups of nodes.
Users will typically identify and handle groups by using their ``label``
(which, unlike the ``labels`` in other models, must be unique).
Groups also have a ``type``, which serves to identify what plugin is being instanced,
and the ``extras`` property for users to set any relevant information.
"""

__tablename__ = 'db_dbgroup'

6 changes: 3 additions & 3 deletions aiida/backends/sqlalchemy/models/log.py
@@ -22,14 +22,14 @@


class DbLog(Base):
"""Class to store logs using SQLA backend."""
"""Database model to store log levels and messages relating to a process node."""
__tablename__ = 'db_dblog'

id = Column(Integer, primary_key=True) # pylint: disable=invalid-name
uuid = Column(UUID(as_uuid=True), default=get_new_uuid, unique=True)
time = Column(DateTime(timezone=True), default=timezone.now)
loggername = Column(String(255), index=True)
levelname = Column(String(255), index=True)
loggername = Column(String(255), index=True, doc='What process recorded the message')
levelname = Column(String(255), index=True, doc='How critical the message is')
dbnode_id = Column(
Integer, ForeignKey('db_dbnode.id', deferrable=True, initially='DEFERRED', ondelete='CASCADE'), nullable=False
)
24 changes: 22 additions & 2 deletions aiida/backends/sqlalchemy/models/node.py
@@ -25,7 +25,20 @@


class DbNode(Base):
"""Class to store nodes using SQLA backend."""
"""Database model to store nodes.
Each node can be categorized according to its ``node_type``,
which indicates what kind of data or process node it is.
Additionally, process nodes also have a ``process_type`` that further indicates what is the specific plugin it uses.
Nodes can also store two kind of properties:
- ``attributes`` are determined by the ``node_type``,
and are set before storing the node and can't be modified afterwards.
- ``extras``, on the other hand,
can be added and removed after the node has been stored and are usually set by the user.
"""

__tablename__ = 'db_dbnode'

@@ -146,7 +159,14 @@ def __str__(self):


class DbLink(Base):
"""Class to store links between nodes using SQLA backend."""
"""Database model to store links between nodes.
Each entry in this table contains not only the ``id`` information of the two nodes that are linked,
but also some extra properties of the link themselves.
This includes the ``type`` of the link (see the :ref:`topics:provenance:concepts` section for all possible types)
as well as a ``label`` which is more specific and typically determined by
the procedure generating the process node that links the data nodes.
"""

__tablename__ = 'db_dblink'

2 changes: 1 addition & 1 deletion aiida/backends/sqlalchemy/models/settings.py
@@ -22,7 +22,7 @@


class DbSetting(Base):
"""Class to store node settings using the SQLA backend."""
"""Database model to store global settings."""
__tablename__ = 'db_dbsetting'
__table_args__ = (UniqueConstraint('key'),)
id = Column(Integer, primary_key=True) # pylint: disable=invalid-name
5 changes: 4 additions & 1 deletion aiida/backends/sqlalchemy/models/user.py
@@ -17,7 +17,10 @@


class DbUser(Base):
"""Store users using the SQLA backend."""
"""Database model to store users.
The user information consists of the most basic personal contact details.
"""
__tablename__ = 'db_dbuser'

id = Column(Integer, primary_key=True) # pylint: disable=invalid-name