Implement `skip_orm` option for SqlAlchemy `Group.remove_nodes` #4214

sphuber · 2020-07-02T09:18:50Z

Addresses #4159 partially

The current implementation of Group.remove_nodes is very slow. For a
group of a few tens of thousands of nodes, removing a thousand can take
more than a day. The same problem exists for add_nodes which is why a
shortcut was added to the backend implementation for SqlAlchemy. Here,
we do the same for remove_nodes. The SqlaGroup.remove_nodes now
accepts a keyword argument skip_orm that, when True, will delete the
nodes by directly constructing a delete query on the join table.

codecov · 2020-07-16T11:35:31Z

Codecov Report

Merging #4214 into develop will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop    #4214      +/-   ##
===========================================
- Coverage    79.23%   79.21%   -0.02%     
===========================================
  Files          468      468              
  Lines        34488    34594     +106     
===========================================
+ Hits         27323    27400      +77     
- Misses        7165     7194      +29

Flag	Coverage Δ
#django	`71.12% <0.00%> (-0.03%)`	⬇️
#sqlalchemy	`71.99% <100.00%> (-<0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
aiida/orm/implementation/sqlalchemy/groups.py	`87.63% <100.00%> (+0.97%)`	⬆️
aiida/transports/transport.py	`63.81% <0.00%> (-1.48%)`	⬇️
aiida/transports/plugins/ssh.py	`76.39% <0.00%> (-1.37%)`	⬇️
aiida/transports/plugins/local.py	`80.47% <0.00%> (-0.81%)`	⬇️
aiida/engine/daemon/execmanager.py	`63.10% <0.00%> (+0.56%)`	⬆️
aiida/engine/daemon/client.py	`75.55% <0.00%> (+3.14%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a4eff7...f0c4623. Read the comment docs.

sphuber · 2020-07-22T11:18:20Z

@ramirezfranciscof if you have the time, this PR might be a good one to review as well.

ramirezfranciscof · 2020-07-23T12:22:52Z

Ok. One general question though: this is meant to be a temporary fix, correct? Ideally we should optimize the orm so it doesn't have this performance problem instead of skipping it when we need good performance, no?

ramirezfranciscof

I only have a couple of questions before approving. Two I already made on the code, but there is a third general one: why are you putting tests for something that lives in aiida/orm/... into tests/backends/...?

tests/backends/aiida_sqlalchemy/test_generic.py

ramirezfranciscof · 2020-07-23T12:59:07Z

tests/backends/aiida_sqlalchemy/test_generic.py

+        group.add_nodes(nodes)
+        self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes))
+
+        # Remove a node that is not in the group: nothing should happen


Is this desired behaviour? I know the result is the same for the node being deleted or not being there in the first place, but I'm wary of this obfuscating user mistakes from them (i.e. they meaning to delete a node in a group but putting the wrong group).

(provided we agree in 1) If the checking was the main responsible for the lack of performance and there is no way around it, perhaps we could use a more explicit ignore_missing/skip_checks keyword (and do the checks when not caring about performance). On the other hand, if this is just a side effect of the current fix and we can improve the performance keeping the checks maybe it is not necessary to test for this (since it would indicate this is the expected behaviour).

Good point. This is the current behavior so I am not changing anything. Adding the check would make the implementation (possibly) even less performant then it already was. In the end it is a trade-off between performance and information. If we were to add a flag like ignore_missing we would probably have to set it to False by default, in order to prevent the unnoticed mistake of the example you are mentioning. Because if they are ignored by default, they would still not notice. This means however, that by default the performance is bad. I think I would personally prefer to have more performance.

Mmm but by default now the performance is bad as well, because the default of skip_orm is False. The only difference is that instead of just prunning the list provided into a clean list_nodes, it raise if it finds node that don't belong there. Did I misunderstand anything?

The checking of the nodes existence is not the only part of the query that is slow. It is also the removing them one by one that is suboptimal. You could argue that we might as well just change the behavior now to raise for a non-existing node, but then you would have to do the same for the skip_orm path. I would argue that it is probably not a good idea to have different behavior for the two paths. The problem with this is though that we would have to add membership-checks to the fast skip_orm algorith, which requires retrieving the entire set of pks that are currently existing in the database and then making sure that all provided node pks are a part of that set. This operation has a non-negligible cost.

Ok, perhaps I was not clear, but I was not proposing to have two skip_orm paths with different ignore_missing defaults, I was thinking of replacing the skip_orm keyword by the ignore_missing. This is because I think the user of the function should be more worried about the ignoring of the missing nodes (the exposed behaviour) rather than the skipping of the ORM (the internal workings). The function is in charge of providing the optimal way of performing both behaviors: right now this means that it will skip the orm if the user doesn't care about the missing nodes and will go through a less performant ORM if he does, but in the future both paths would ideally use the ORM.

There are only 2 cases that would not be included by this that would warrant two keywords. Both of them are not only temporary (automatically solved once we have the performant ORM), but also a bit..."micromanagey"?

A user wanting to gain a bit of performance while retaining the check (skip_orm=True + ignore_missing=False). I'm personally comfortable with only offering "max checks" VS "max performance" options, and if the user wants a compromise then he can check the nodes himself before calling this.

A user wanting to get whatever check is in the ORM but not wanting to deal with errors for inexisting nodes (skip_orm=False + ignore_missing=True). Again, micromanaging what checks are performed, but in this case there is even no performance to be gained (as nodes need to be checked anyways to be pruned before being passed to the ORM even if no error is raised).

I think there may be some confusion on which part of the ORM we are speaking about right now. Just to be sure, we are talking about the interface of the backend ORM. A user never interacts with this interface. If you load a group, you get an instance of Group. Here we are talking about the interface of BackendGroup, which a user never sees. The argument skip_orm is then also never available to users. It is only available for aiida-core itself.

So I think this discussion is not really relevant here. The whole reason that we put skip_orm in the backend interface is exactly because we agree with you that the user should not have to think about it in this way. I also want to repeat that checking the existence of nodes before removing them is not the only factor slowing things down. If you look at the implementation of skip_orm=True, we are also skipping the SqlAlchemy ORM, not just our own, and are constructing an SQL query directly and executing it. This is the most direct and therefore most efficient.

I am not sure if I misunderstood your post again, if so, maybe it is easiest to discuss this in person.

sphuber · 2020-07-23T13:16:45Z

Ok. One general question though: this is meant to be a temporary fix, correct? Ideally we should optimize the orm so it doesn't have this performance problem instead of skipping it when we need good performance, no?

We put it in the backend ORM because we were not sure what the possible downsides were of skipping the ORM. By doing so, we may be circumventing checks that we are not aware off. I think it is for that reason that we decided to play it safe and "hide" it in the backend. Ideally, if this is a safe operation and respects all rules of our ORM, we should just have this as the standard implementation and always use it, without a switch. Since this workaround already existed since a long time for add_nodes I am just bringing remove_nodes up to speed. We can look at a later time, to use the new implementation as the default without needing the switch

sphuber · 2020-07-23T13:19:41Z

I only have a couple of questions before approving. Two I already made on the code, but there is a third general one: why are you putting tests for something that lives in aiida/orm/... into tests/backends/...?

The reason is kind of historical: the code we are testing here is specific to the SqlAlchemy backend. If we were to run this code for the Django backend, it would raise, because the skip_orm keyword doesn't exist there. Historically, these backend specific tests were placed in tests/backends. I actually have another open PR #4279 that implements a pytest fixture that allows skipping a test if they are not being run with a specific backend. If we merge this, I would update that PR to move the tests added here to aiida.orm.implementation.

The current implementation of `Group.remove_nodes` is very slow. For a group of a few tens of thousands of nodes, removing a thousand can take more than a day. The same problem exists for `add_nodes` which is why a shortcut was added to the backend implementation for SqlAlchemy. Here, we do the same for `remove_nodes`. The `SqlaGroup.remove_nodes` now accepts a keyword argument `skip_orm` that, when True, will delete the nodes by directly constructing a delete query on the join table.

ramirezfranciscof

Ok, I'll approve because it doesn't make sense to keep this on hold because of the detail that we are still discussing. I would anyways be interested in finishing the discussion, maybe in next wednesday's maintainers meeting.

sphuber force-pushed the fix/4159/group-remove-nodes-performance branch from 4419443 to 2ca1974 Compare July 16, 2020 11:20

sphuber force-pushed the fix/4159/group-remove-nodes-performance branch from 2ca1974 to c664866 Compare July 16, 2020 14:18

sphuber requested review from chrisjsewell, giovannipizzi and ramirezfranciscof July 16, 2020 14:19

sphuber force-pushed the fix/4159/group-remove-nodes-performance branch 2 times, most recently from 229d53f to c850391 Compare July 22, 2020 10:59

ramirezfranciscof reviewed Jul 23, 2020

View reviewed changes

sphuber force-pushed the fix/4159/group-remove-nodes-performance branch from c850391 to f0c4623 Compare July 24, 2020 07:06

ramirezfranciscof self-requested a review July 24, 2020 13:18

ramirezfranciscof approved these changes Jul 24, 2020

View reviewed changes

sphuber merged commit bced84e into aiidateam:develop Jul 24, 2020

sphuber deleted the fix/4159/group-remove-nodes-performance branch July 24, 2020 13:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `skip_orm` option for SqlAlchemy `Group.remove_nodes` #4214

Implement `skip_orm` option for SqlAlchemy `Group.remove_nodes` #4214

sphuber commented Jul 2, 2020

codecov bot commented Jul 16, 2020 •

edited

Loading

sphuber commented Jul 22, 2020

ramirezfranciscof commented Jul 23, 2020

ramirezfranciscof left a comment

ramirezfranciscof Jul 23, 2020

sphuber Jul 23, 2020

ramirezfranciscof Jul 23, 2020 •

edited

Loading

sphuber Jul 23, 2020

ramirezfranciscof Jul 23, 2020

sphuber Jul 24, 2020

sphuber commented Jul 23, 2020

sphuber commented Jul 23, 2020

ramirezfranciscof left a comment

Implement skip_orm option for SqlAlchemy Group.remove_nodes #4214

Implement skip_orm option for SqlAlchemy Group.remove_nodes #4214

Conversation

sphuber commented Jul 2, 2020

codecov bot commented Jul 16, 2020 • edited Loading

Codecov Report

sphuber commented Jul 22, 2020

ramirezfranciscof commented Jul 23, 2020

ramirezfranciscof left a comment

Choose a reason for hiding this comment

ramirezfranciscof Jul 23, 2020

Choose a reason for hiding this comment

sphuber Jul 23, 2020

Choose a reason for hiding this comment

ramirezfranciscof Jul 23, 2020 • edited Loading

Choose a reason for hiding this comment

sphuber Jul 23, 2020

Choose a reason for hiding this comment

ramirezfranciscof Jul 23, 2020

Choose a reason for hiding this comment

sphuber Jul 24, 2020

Choose a reason for hiding this comment

sphuber commented Jul 23, 2020

sphuber commented Jul 23, 2020

ramirezfranciscof left a comment

Choose a reason for hiding this comment

Implement `skip_orm` option for SqlAlchemy `Group.remove_nodes` #4214

Implement `skip_orm` option for SqlAlchemy `Group.remove_nodes` #4214

codecov bot commented Jul 16, 2020 •

edited

Loading

ramirezfranciscof Jul 23, 2020 •

edited

Loading