Implement skip_orm option for SqlAlchemy Group.remove_nodes #4214

39 changes: 29 additions & 10 deletions aiida/orm/implementation/sqlalchemy/groups.py
@@ -228,37 +228,56 @@ def check_node(given_node):
             # Commit everything as up till now we've just flushed
             session.commit()

-    def remove_nodes(self, nodes):
+    def remove_nodes(self, nodes, **kwargs):
         """Remove a node or a set of nodes from the group.

         :note: all the nodes *and* the group itself have to be stored.

         :param nodes: a list of `BackendNode` instance to be added to this group
+        :param kwargs:
+            skip_orm: When the flag is set to `True`, the SQLA ORM is skipped and SQLA is used to create a direct SQL
+            DELETE statement to the group-node relationship table in order to improve speed.
         """
+        from sqlalchemy import and_
+        from aiida.backends.sqlalchemy import get_scoped_session
+        from aiida.backends.sqlalchemy.models.base import Base
         from aiida.orm.implementation.sqlalchemy.nodes import SqlaNode

         super().remove_nodes(nodes)

         # Get dbnodes here ONCE, otherwise each call to dbnodes will re-read the current value in the database
         dbnodes = self._dbmodel.dbnodes
+        skip_orm = kwargs.get('skip_orm', False)

-        list_nodes = []
-
-        for node in nodes:
+        def check_node(node):
             if not isinstance(node, SqlaNode):
                 raise TypeError('invalid type {}, has to be {}'.format(type(node), SqlaNode))

             if node.id is None:
                 raise ValueError('At least one of the provided nodes is unstored, stopping...')

-            # If we don't check first, SqlA might issue a DELETE statement for an unexisting key, resulting in an error
-            if node.dbmodel in dbnodes:
-                list_nodes.append(node.dbmodel)
+        list_nodes = []

-        for node in list_nodes:
-            dbnodes.remove(node)
+        with utils.disable_expire_on_commit(get_scoped_session()) as session:
+            if not skip_orm:
+                for node in nodes:
+                    check_node(node)
+
+                    # Check first, if SqlA issues a DELETE statement for an unexisting key it will result in an error
+                    if node.dbmodel in dbnodes:
+                        list_nodes.append(node.dbmodel)
+
+                for node in list_nodes:
+                    dbnodes.remove(node)
+            else:
+                table = Base.metadata.tables['db_dbgroup_dbnodes']
+                for node in nodes:
+                    check_node(node)
+                    clause = and_(table.c.dbnode_id == node.id, table.c.dbgroup_id == self.id)
+                    statement = table.delete().where(clause)
+                    session.execute(statement)

-        sa.get_scoped_session().commit()
+            session.commit()


 class SqlaGroupCollection(BackendGroupCollection):
30 changes: 30 additions & 0 deletions tests/backends/aiida_sqlalchemy/test_generic.py
@@ -164,3 +164,33 @@ def test_group_batch_size(self):
             group = Group(label='test_batches_' + str(batch_size)).store()
             group.backend_entity.add_nodes(nodes, skip_orm=True, batch_size=batch_size)
             self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes))
+
+    def test_remove_nodes_bulk(self):
+        """Test node removal."""
+        backend = self.backend
+
+        node_01 = Data().store().backend_entity
+        node_02 = Data().store().backend_entity
+        node_03 = Data().store().backend_entity
+        node_04 = Data().store().backend_entity
+        nodes = [node_01, node_02, node_03]
+        group = backend.groups.create(label='test_remove_nodes', user=backend.users.create('simple2@ton.com')).store()
+
+        # Add initial nodes
+        group.add_nodes(nodes)
+        self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes))
+
+        # Remove a node that is not in the group: nothing should happen
Member

  1. Is this the desired behaviour? I know the end result is the same whether the node is removed or was never there in the first place, but I'm wary of this hiding user mistakes (e.g. they mean to remove a node from a group but specify the wrong group).

  2. (Provided we agree on 1.) If the checking is mainly responsible for the poor performance and there is no way around it, perhaps we could use a more explicit ignore_missing/skip_checks keyword (and do the checks when performance does not matter). On the other hand, if this is just a side effect of the current fix and we can improve the performance while keeping the checks, maybe it is not necessary to test for this (since the test would indicate this is the expected behaviour).

Contributor Author

Good point. This is the current behavior, so I am not changing anything. Adding the check would (possibly) make the implementation even less performant than it already was. In the end it is a trade-off between performance and information. If we were to add a flag like ignore_missing, we would probably have to set it to False by default to prevent the unnoticed mistake in the example you mention, because if missing nodes are ignored by default, users would still not notice. This means, however, that performance is bad by default. Personally, I would prefer to have more performance.

Member

@ramirezfranciscof ramirezfranciscof Jul 23, 2020

Mmm, but right now the default performance is bad as well, because the default of skip_orm is False. The only difference would be that, instead of just pruning the provided list into a clean list_nodes, it raises if it finds nodes that don't belong there. Did I misunderstand anything?

Contributor Author

Checking the nodes' existence is not the only slow part of the operation. Removing them one by one is also suboptimal. You could argue that we might as well change the behavior now to raise for a non-existing node, but then you would have to do the same for the skip_orm path, since I would argue that it is probably not a good idea to have different behavior for the two paths. The problem is that we would then have to add membership checks to the fast skip_orm algorithm, which requires retrieving the entire set of pks that currently exist in the database and then making sure that all provided node pks are part of that set. This operation has a non-negligible cost.
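
As a rough illustration of the membership check described above, the following sketch uses the standard-library sqlite3 module as a stand-in for the real backend. The helper `find_missing_members` is hypothetical and not part of aiida-core; only the table and column names are taken from the diff, and the check is assumed to run against the group's own membership rows.

```python
import sqlite3


def find_missing_members(conn, group_id, node_ids):
    """Return the ids in `node_ids` that have no link to the group (hypothetical helper)."""
    rows = conn.execute(
        'SELECT dbnode_id FROM db_dbgroup_dbnodes WHERE dbgroup_id = ?', (group_id,)
    ).fetchall()
    # The whole membership set is loaded before anything can be compared:
    # this extra round trip is the cost the fast path currently avoids.
    members = {row[0] for row in rows}
    return set(node_ids) - members


# Toy stand-in for the group-node relationship table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE db_dbgroup_dbnodes (dbnode_id INTEGER, dbgroup_id INTEGER)')
conn.executemany('INSERT INTO db_dbgroup_dbnodes VALUES (?, ?)', [(1, 10), (2, 10), (3, 10)])

missing = find_missing_members(conn, 10, [2, 3, 4])
```

Here node 4 is reported as missing, which is the information a raising variant of skip_orm would need before issuing any DELETE.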

Member

OK, perhaps I was not clear: I was not proposing to have two skip_orm paths with different ignore_missing defaults; I was thinking of replacing the skip_orm keyword with ignore_missing. This is because I think the user of the function should be more concerned with whether missing nodes are ignored (the exposed behaviour) than with whether the ORM is skipped (the internal workings). The function is in charge of providing the optimal way of performing both behaviours: right now this means that it will skip the ORM if the user doesn't care about missing nodes and will go through the less performant ORM if they do, but in the future both paths would ideally use the ORM.

There are only 2 cases not covered by this that would warrant two keywords. Both of them are not only temporary (automatically solved once we have the performant ORM), but also a bit... "micromanagey"?

  1. A user wanting to gain a bit of performance while retaining the check (skip_orm=True + ignore_missing=False). I'm personally comfortable with only offering "max checks" vs. "max performance" options, and if users want a compromise they can check the nodes themselves before calling this.

  2. A user wanting to get whatever check is in the ORM but not wanting to deal with errors for nonexistent nodes (skip_orm=False + ignore_missing=True). Again, this is micromanaging which checks are performed, but in this case there is not even any performance to be gained (as nodes need to be checked anyway to be pruned before being passed to the ORM, even if no error is raised).
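
A minimal sketch of the single-keyword semantics proposed here, with a plain Python set standing in for the group's membership. The function `remove_members` and its `ignore_missing` keyword are hypothetical and illustrate only the exposed behaviour, not how aiida-core would implement it.

```python
def remove_members(members, node_ids, ignore_missing=False):
    """Remove `node_ids` from the `members` set in place; return the ids actually removed."""
    requested = set(node_ids)
    missing = requested - members
    if missing and not ignore_missing:
        # "max checks": surface the likely user mistake instead of hiding it
        raise ValueError('nodes not in group: {}'.format(sorted(missing)))
    removed = requested & members
    members -= removed
    return removed


members = {1, 2, 3}
# "max performance" semantics: node 4 is not a member and is silently skipped
removed = remove_members(members, [3, 4], ignore_missing=True)
```

With the default `ignore_missing=False`, the same call would raise a `ValueError` naming node 4 and leave the set untouched.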

Contributor Author

I think there may be some confusion about which part of the ORM we are speaking about right now. Just to be sure, we are talking about the interface of the backend ORM. A user never interacts with this interface. If you load a group, you get an instance of Group. Here we are talking about the interface of BackendGroup, which a user never sees. The skip_orm argument is therefore never available to users either; it is only available to aiida-core itself.

So I think this discussion is not really relevant here. The whole reason that we put skip_orm in the backend interface is exactly because we agree with you that the user should not have to think about it in this way. I also want to repeat that checking the existence of nodes before removing them is not the only factor slowing things down. If you look at the implementation of skip_orm=True, we are also skipping the SqlAlchemy ORM, not just our own, and are constructing an SQL query directly and executing it. This is the most direct and therefore most efficient.

I am not sure if I misunderstood your post again, if so, maybe it is easiest to discuss this in person.

+        group.remove_nodes([node_04], skip_orm=True)
+        self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes))
+
+        # Remove one Node
+        nodes.remove(node_03)
+        group.remove_nodes([node_03], skip_orm=True)
+        self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes))
+
+        # Remove a list of Nodes and check
+        nodes.remove(node_01)
+        nodes.remove(node_02)
+        group.remove_nodes([node_01, node_02], skip_orm=True)
+        self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes))
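
The behaviour exercised by this test (removing a node that is not in the group is a no-op on the skip_orm path) follows from how a plain SQL DELETE works: a WHERE clause that matches no rows deletes nothing and raises no error. The sketch below shows this with the standard-library sqlite3 module rather than SQLAlchemy. The helper `remove_nodes_direct` is hypothetical; only the table and column names come from the diff.

```python
import sqlite3


def remove_nodes_direct(conn, group_id, node_ids):
    """Issue one DELETE per node against the link table; return the number of rows deleted."""
    deleted = 0
    for node_id in node_ids:
        cursor = conn.execute(
            'DELETE FROM db_dbgroup_dbnodes WHERE dbnode_id = ? AND dbgroup_id = ?',
            (node_id, group_id),
        )
        deleted += cursor.rowcount  # 0 when the node was not linked to the group
    conn.commit()
    return deleted


def group_members(conn, group_id):
    """Return the set of node ids currently linked to the group."""
    rows = conn.execute(
        'SELECT dbnode_id FROM db_dbgroup_dbnodes WHERE dbgroup_id = ?', (group_id,)
    )
    return {row[0] for row in rows}


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE db_dbgroup_dbnodes (dbnode_id INTEGER, dbgroup_id INTEGER)')
conn.executemany('INSERT INTO db_dbgroup_dbnodes VALUES (?, ?)', [(1, 10), (2, 10), (3, 10)])

deleted_non_member = remove_nodes_direct(conn, 10, [4])  # node 4 is not in group 10
deleted_member = remove_nodes_direct(conn, 10, [3])
remaining = group_members(conn, 10)
```

The row count returned by each DELETE is the only signal that a requested node was not a member, which is why raising on missing nodes would need either an extra membership query or a count comparison after the fact.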