-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement skip_orm
option for SqlAlchemy Group.remove_nodes
#4214
Implement skip_orm
option for SqlAlchemy Group.remove_nodes
#4214
Conversation
4419443
to
2ca1974
Compare
Codecov Report
@@ Coverage Diff @@
## develop #4214 +/- ##
===========================================
- Coverage 79.23% 79.21% -0.02%
===========================================
Files 468 468
Lines 34488 34594 +106
===========================================
+ Hits 27323 27400 +77
- Misses 7165 7194 +29
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
2ca1974
to
c664866
Compare
229d53f
to
c850391
Compare
@ramirezfranciscof if you have the time, this PR might be a good one to review as well. |
Ok. One general question though: this is meant to be a temporary fix, correct? Ideally we should optimize the orm so it doesn't have this performance problem instead of skipping it when we need good performance, no? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only have a couple of questions before approving. Two I already made on the code, but there is a third general one: why are you putting tests for something that lives in aiida/orm/...
into tests/backends/...
?
group.add_nodes(nodes) | ||
self.assertEqual(set(_.pk for _ in nodes), set(_.pk for _ in group.nodes)) | ||
|
||
# Remove a node that is not in the group: nothing should happen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Is this desired behaviour? I know the result is the same for the node being deleted or not being there in the first place, but I'm wary of this obfuscating user mistakes from them (i.e. they meaning to delete a node in a group but putting the wrong group).
-
(provided we agree in 1) If the checking was the main responsible for the lack of performance and there is no way around it, perhaps we could use a more explicit
ignore_missing
/skip_checks
keyword (and do the checks when not caring about performance). On the other hand, if this is just a side effect of the current fix and we can improve the performance keeping the checks maybe it is not necessary to test for this (since it would indicate this is the expected behaviour).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is the current behavior so I am not changing anything. Adding the check would make the implementation (possibly) even less performant then it already was. In the end it is a trade-off between performance and information. If we were to add a flag like ignore_missing
we would probably have to set it to False
by default, in order to prevent the unnoticed mistake of the example you are mentioning. Because if they are ignored by default, they would still not notice. This means however, that by default the performance is bad. I think I would personally prefer to have more performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm but by default now the performance is bad as well, because the default of skip_orm
is False
. The only difference is that instead of just prunning the list provided into a clean list_nodes
, it raise if it finds node that don't belong there. Did I misunderstand anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checking of the nodes existence is not the only part of the query that is slow. It is also the removing them one by one that is suboptimal. You could argue that we might as well just change the behavior now to raise for a non-existing node, but then you would have to do the same for the skip_orm
path. I would argue that it is probably not a good idea to have different behavior for the two paths. The problem with this is though that we would have to add membership-checks to the fast skip_orm
algorith, which requires retrieving the entire set of pks that are currently existing in the database and then making sure that all provided node pks are a part of that set. This operation has a non-negligible cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, perhaps I was not clear, but I was not proposing to have two skip_orm
paths with different ignore_missing
defaults, I was thinking of replacing the skip_orm
keyword by the ignore_missing
. This is because I think the user of the function should be more worried about the ignoring of the missing nodes (the exposed behaviour) rather than the skipping of the ORM (the internal workings). The function is in charge of providing the optimal way of performing both behaviors: right now this means that it will skip the orm if the user doesn't care about the missing nodes and will go through a less performant ORM if he does, but in the future both paths would ideally use the ORM.
There are only 2 cases that would not be included by this that would warrant two keywords. Both of them are not only temporary (automatically solved once we have the performant ORM), but also a bit..."micromanagey"?
-
A user wanting to gain a bit of performance while retaining the check (
skip_orm=True
+ignore_missing=False
). I'm personally comfortable with only offering "max checks" VS "max performance" options, and if the user wants a compromise then he can check the nodes himself before calling this. -
A user wanting to get whatever check is in the ORM but not wanting to deal with errors for inexisting nodes (
skip_orm=False
+ignore_missing=True
). Again, micromanaging what checks are performed, but in this case there is even no performance to be gained (as nodes need to be checked anyways to be pruned before being passed to the ORM even if no error is raised).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there may be some confusion on which part of the ORM we are speaking about right now. Just to be sure, we are talking about the interface of the backend ORM. A user never interacts with this interface. If you load a group, you get an instance of Group
. Here we are talking about the interface of BackendGroup
, which a user never sees. The argument skip_orm
is then also never available to users. It is only available for aiida-core
itself.
So I think this discussion is not really relevant here. The whole reason that we put skip_orm
in the backend interface is exactly because we agree with you that the user should not have to think about it in this way. I also want to repeat that checking the existence of nodes before removing them is not the only factor slowing things down. If you look at the implementation of skip_orm=True
, we are also skipping the SqlAlchemy ORM, not just our own, and are constructing an SQL query directly and executing it. This is the most direct and therefore most efficient.
I am not sure if I misunderstood your post again, if so, maybe it is easiest to discuss this in person.
We put it in the backend ORM because we were not sure what the possible downsides were of skipping the ORM. By doing so, we may be circumventing checks that we are not aware off. I think it is for that reason that we decided to play it safe and "hide" it in the backend. Ideally, if this is a safe operation and respects all rules of our ORM, we should just have this as the standard implementation and always use it, without a switch. Since this workaround already existed since a long time for |
The reason is kind of historical: the code we are testing here is specific to the SqlAlchemy backend. If we were to run this code for the Django backend, it would raise, because the |
The current implementation of `Group.remove_nodes` is very slow. For a group of a few tens of thousands of nodes, removing a thousand can take more than a day. The same problem exists for `add_nodes` which is why a shortcut was added to the backend implementation for SqlAlchemy. Here, we do the same for `remove_nodes`. The `SqlaGroup.remove_nodes` now accepts a keyword argument `skip_orm` that, when True, will delete the nodes by directly constructing a delete query on the join table.
c850391
to
f0c4623
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll approve because it doesn't make sense to keep this on hold because of the detail that we are still discussing. I would anyways be interested in finishing the discussion, maybe in next wednesday's maintainers meeting.
Addresses #4159 partially
The current implementation of
Group.remove_nodes
is very slow. For agroup of a few tens of thousands of nodes, removing a thousand can take
more than a day. The same problem exists for
add_nodes
which is why ashortcut was added to the backend implementation for SqlAlchemy. Here,
we do the same for
remove_nodes
. TheSqlaGroup.remove_nodes
nowaccepts a keyword argument
skip_orm
that, when True, will delete thenodes by directly constructing a delete query on the join table.