-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement skip_orm
option for SqlAlchemy Group.remove_nodes
#4214
Merged
sphuber
merged 1 commit into
aiidateam:develop
from
sphuber:fix/4159/group-remove-nodes-performance
Jul 24, 2020
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this desired behaviour? I know the result is the same for the node being deleted or not being there in the first place, but I'm wary of this obfuscating user mistakes from them (i.e. they meaning to delete a node in a group but putting the wrong group).
(provided we agree in 1) If the checking was the main responsible for the lack of performance and there is no way around it, perhaps we could use a more explicit
ignore_missing
/skip_checks
keyword (and do the checks when not caring about performance). On the other hand, if this is just a side effect of the current fix and we can improve the performance keeping the checks maybe it is not necessary to test for this (since it would indicate this is the expected behaviour).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is the current behavior so I am not changing anything. Adding the check would make the implementation (possibly) even less performant then it already was. In the end it is a trade-off between performance and information. If we were to add a flag like
ignore_missing
we would probably have to set it toFalse
by default, in order to prevent the unnoticed mistake of the example you are mentioning. Because if they are ignored by default, they would still not notice. This means however, that by default the performance is bad. I think I would personally prefer to have more performance.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm but by default now the performance is bad as well, because the default of
skip_orm
isFalse
. The only difference is that instead of just prunning the list provided into a cleanlist_nodes
, it raise if it finds node that don't belong there. Did I misunderstand anything?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checking of the nodes existence is not the only part of the query that is slow. It is also the removing them one by one that is suboptimal. You could argue that we might as well just change the behavior now to raise for a non-existing node, but then you would have to do the same for the
skip_orm
path. I would argue that it is probably not a good idea to have different behavior for the two paths. The problem with this is though that we would have to add membership-checks to the fastskip_orm
algorith, which requires retrieving the entire set of pks that are currently existing in the database and then making sure that all provided node pks are a part of that set. This operation has a non-negligible cost.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, perhaps I was not clear, but I was not proposing to have two
skip_orm
paths with differentignore_missing
defaults, I was thinking of replacing theskip_orm
keyword by theignore_missing
. This is because I think the user of the function should be more worried about the ignoring of the missing nodes (the exposed behaviour) rather than the skipping of the ORM (the internal workings). The function is in charge of providing the optimal way of performing both behaviors: right now this means that it will skip the orm if the user doesn't care about the missing nodes and will go through a less performant ORM if he does, but in the future both paths would ideally use the ORM.There are only 2 cases that would not be included by this that would warrant two keywords. Both of them are not only temporary (automatically solved once we have the performant ORM), but also a bit..."micromanagey"?
A user wanting to gain a bit of performance while retaining the check (
skip_orm=True
+ignore_missing=False
). I'm personally comfortable with only offering "max checks" VS "max performance" options, and if the user wants a compromise then he can check the nodes himself before calling this.A user wanting to get whatever check is in the ORM but not wanting to deal with errors for inexisting nodes (
skip_orm=False
+ignore_missing=True
). Again, micromanaging what checks are performed, but in this case there is even no performance to be gained (as nodes need to be checked anyways to be pruned before being passed to the ORM even if no error is raised).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there may be some confusion on which part of the ORM we are speaking about right now. Just to be sure, we are talking about the interface of the backend ORM. A user never interacts with this interface. If you load a group, you get an instance of
Group
. Here we are talking about the interface ofBackendGroup
, which a user never sees. The argumentskip_orm
is then also never available to users. It is only available foraiida-core
itself.So I think this discussion is not really relevant here. The whole reason that we put
skip_orm
in the backend interface is exactly because we agree with you that the user should not have to think about it in this way. I also want to repeat that checking the existence of nodes before removing them is not the only factor slowing things down. If you look at the implementation ofskip_orm=True
, we are also skipping the SqlAlchemy ORM, not just our own, and are constructing an SQL query directly and executing it. This is the most direct and therefore most efficient.I am not sure if I misunderstood your post again, if so, maybe it is easiest to discuss this in person.