Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement skip_orm option for SqlAlchemy Group.remove_nodes #4214

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jul 2, 2020

Addresses #4159 partially

The current implementation of Group.remove_nodes is very slow. For a
group of a few tens of thousands of nodes, removing a thousand can take
more than a day. The same problem exists for add_nodes which is why a
shortcut was added to the backend implementation for SqlAlchemy. Here,
we do the same for remove_nodes. The SqlaGroup.remove_nodes now
accepts a keyword argument skip_orm that, when True, will delete the
nodes by directly constructing a delete query on the join table.

@sphuber sphuber force-pushed the fix/4159/group-remove-nodes-performance branch from 4419443 to 2ca1974 Compare July 16, 2020 11:20
@codecov
Copy link

codecov bot commented Jul 16, 2020

Codecov Report

Merging #4214 into develop will decrease coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4214      +/-   ##
===========================================
- Coverage    79.23%   79.21%   -0.02%     
===========================================
  Files          468      468              
  Lines        34488    34594     +106     
===========================================
+ Hits         27323    27400      +77     
- Misses        7165     7194      +29     
Flag Coverage Δ
#django 71.12% <0.00%> (-0.03%) ⬇️
#sqlalchemy 71.99% <100.00%> (-<0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/orm/implementation/sqlalchemy/groups.py 87.63% <100.00%> (+0.97%) ⬆️
aiida/transports/transport.py 63.81% <0.00%> (-1.48%) ⬇️
aiida/transports/plugins/ssh.py 76.39% <0.00%> (-1.37%) ⬇️
aiida/transports/plugins/local.py 80.47% <0.00%> (-0.81%) ⬇️
aiida/engine/daemon/execmanager.py 63.10% <0.00%> (+0.56%) ⬆️
aiida/engine/daemon/client.py 75.55% <0.00%> (+3.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a4eff7...f0c4623. Read the comment docs.

@sphuber sphuber force-pushed the fix/4159/group-remove-nodes-performance branch from 2ca1974 to c664866 Compare July 16, 2020 14:18
@sphuber sphuber force-pushed the fix/4159/group-remove-nodes-performance branch 2 times, most recently from 229d53f to c850391 Compare July 22, 2020 10:59
@sphuber
Copy link
Contributor Author

sphuber commented Jul 22, 2020

@ramirezfranciscof if you have the time, this PR might be a good one to review as well.

@ramirezfranciscof
Copy link
Member

Ok. One general question though: this is meant to be a temporary fix, correct? Ideally we should optimize the orm so it doesn't have this performance problem instead of skipping it when we need good performance, no?

Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only have a couple of questions before approving. Two I already made on the code, but there is a third general one: why are you putting tests for something that lives in aiida/orm/... into tests/backends/...?

tests/backends/aiida_sqlalchemy/test_generic.py Outdated Show resolved Hide resolved
group.add_nodes(nodes)
self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes))

# Remove a node that is not in the group: nothing should happen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is this desired behaviour? I know the result is the same for the node being deleted or not being there in the first place, but I'm wary of this obfuscating user mistakes from them (i.e. they meaning to delete a node in a group but putting the wrong group).

  2. (provided we agree in 1) If the checking was the main responsible for the lack of performance and there is no way around it, perhaps we could use a more explicit ignore_missing/skip_checks keyword (and do the checks when not caring about performance). On the other hand, if this is just a side effect of the current fix and we can improve the performance keeping the checks maybe it is not necessary to test for this (since it would indicate this is the expected behaviour).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This is the current behavior so I am not changing anything. Adding the check would make the implementation (possibly) even less performant then it already was. In the end it is a trade-off between performance and information. If we were to add a flag like ignore_missing we would probably have to set it to False by default, in order to prevent the unnoticed mistake of the example you are mentioning. Because if they are ignored by default, they would still not notice. This means however, that by default the performance is bad. I think I would personally prefer to have more performance.

Copy link
Member

@ramirezfranciscof ramirezfranciscof Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm but by default now the performance is bad as well, because the default of skip_orm is False. The only difference is that instead of just prunning the list provided into a clean list_nodes, it raise if it finds node that don't belong there. Did I misunderstand anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checking of the nodes existence is not the only part of the query that is slow. It is also the removing them one by one that is suboptimal. You could argue that we might as well just change the behavior now to raise for a non-existing node, but then you would have to do the same for the skip_orm path. I would argue that it is probably not a good idea to have different behavior for the two paths. The problem with this is though that we would have to add membership-checks to the fast skip_orm algorith, which requires retrieving the entire set of pks that are currently existing in the database and then making sure that all provided node pks are a part of that set. This operation has a non-negligible cost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, perhaps I was not clear, but I was not proposing to have two skip_orm paths with different ignore_missing defaults, I was thinking of replacing the skip_orm keyword by the ignore_missing. This is because I think the user of the function should be more worried about the ignoring of the missing nodes (the exposed behaviour) rather than the skipping of the ORM (the internal workings). The function is in charge of providing the optimal way of performing both behaviors: right now this means that it will skip the orm if the user doesn't care about the missing nodes and will go through a less performant ORM if he does, but in the future both paths would ideally use the ORM.

There are only 2 cases that would not be included by this that would warrant two keywords. Both of them are not only temporary (automatically solved once we have the performant ORM), but also a bit..."micromanagey"?

  1. A user wanting to gain a bit of performance while retaining the check (skip_orm=True + ignore_missing=False). I'm personally comfortable with only offering "max checks" VS "max performance" options, and if the user wants a compromise then he can check the nodes himself before calling this.

  2. A user wanting to get whatever check is in the ORM but not wanting to deal with errors for inexisting nodes (skip_orm=False + ignore_missing=True). Again, micromanaging what checks are performed, but in this case there is even no performance to be gained (as nodes need to be checked anyways to be pruned before being passed to the ORM even if no error is raised).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there may be some confusion on which part of the ORM we are speaking about right now. Just to be sure, we are talking about the interface of the backend ORM. A user never interacts with this interface. If you load a group, you get an instance of Group. Here we are talking about the interface of BackendGroup, which a user never sees. The argument skip_orm is then also never available to users. It is only available for aiida-core itself.

So I think this discussion is not really relevant here. The whole reason that we put skip_orm in the backend interface is exactly because we agree with you that the user should not have to think about it in this way. I also want to repeat that checking the existence of nodes before removing them is not the only factor slowing things down. If you look at the implementation of skip_orm=True, we are also skipping the SqlAlchemy ORM, not just our own, and are constructing an SQL query directly and executing it. This is the most direct and therefore most efficient.

I am not sure if I misunderstood your post again, if so, maybe it is easiest to discuss this in person.

@sphuber
Copy link
Contributor Author

sphuber commented Jul 23, 2020

Ok. One general question though: this is meant to be a temporary fix, correct? Ideally we should optimize the orm so it doesn't have this performance problem instead of skipping it when we need good performance, no?

We put it in the backend ORM because we were not sure what the possible downsides were of skipping the ORM. By doing so, we may be circumventing checks that we are not aware off. I think it is for that reason that we decided to play it safe and "hide" it in the backend. Ideally, if this is a safe operation and respects all rules of our ORM, we should just have this as the standard implementation and always use it, without a switch. Since this workaround already existed since a long time for add_nodes I am just bringing remove_nodes up to speed. We can look at a later time, to use the new implementation as the default without needing the switch

@sphuber
Copy link
Contributor Author

sphuber commented Jul 23, 2020

I only have a couple of questions before approving. Two I already made on the code, but there is a third general one: why are you putting tests for something that lives in aiida/orm/... into tests/backends/...?

The reason is kind of historical: the code we are testing here is specific to the SqlAlchemy backend. If we were to run this code for the Django backend, it would raise, because the skip_orm keyword doesn't exist there. Historically, these backend specific tests were placed in tests/backends. I actually have another open PR #4279 that implements a pytest fixture that allows skipping a test if they are not being run with a specific backend. If we merge this, I would update that PR to move the tests added here to aiida.orm.implementation.

The current implementation of `Group.remove_nodes` is very slow. For a
group of a few tens of thousands of nodes, removing a thousand can take
more than a day. The same problem exists for `add_nodes` which is why a
shortcut was added to the backend implementation for SqlAlchemy. Here,
we do the same for `remove_nodes`. The `SqlaGroup.remove_nodes` now
accepts a keyword argument `skip_orm` that, when True, will delete the
nodes by directly constructing a delete query on the join table.
@sphuber sphuber force-pushed the fix/4159/group-remove-nodes-performance branch from c850391 to f0c4623 Compare July 24, 2020 07:06
@ramirezfranciscof ramirezfranciscof self-requested a review July 24, 2020 13:18
Copy link
Member

@ramirezfranciscof ramirezfranciscof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll approve because it doesn't make sense to keep this on hold because of the detail that we are still discussing. I would anyways be interested in finishing the discussion, maybe in next wednesday's maintainers meeting.

@sphuber sphuber merged commit bced84e into aiidateam:develop Jul 24, 2020
@sphuber sphuber deleted the fix/4159/group-remove-nodes-performance branch July 24, 2020 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants