♻️ REFACTOR: Remove QueryManager
#5101
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           develop    #5101      +/-   ##
===========================================
+ Coverage    80.64%   80.81%    +0.18%
===========================================
  Files          537      534        -3
  Lines        37168    36967      -201
===========================================
- Hits         29969    29873       -96
+ Misses        7199     7094      -105
```
Flags with carried forward coverage won't be shown.
Force-pushed fd16b97 to 845b112.
I am personally not a fan of having these very domain-specific and backend-dependent queries in …
I am fine with these changes in principle (except for two comments). When those are addressed, I am happy to approve.
```diff
-def deduplicate_uuids(table=None, dry_run=True):
+def deduplicate_uuids(table=None, dry_run=False):
```
I wouldn't change this default. Is there a particular reason you did so?
Ok done. It doesn't make too much difference, because this function is only used in two places in the code: `aiida/cmdline/commands/cmd_database.py` (where `dry_run` is set explicitly) and `tests/backends/aiida_django/migrations/test_migrations_many.py` (where it was assuming `dry_run=False`, but now I've set it explicitly in 351c5d0).
```diff
@@ -1072,3 +1072,34 @@ def analyze_query(self, data: QueryDictType, execute: bool = True, verbose: bool
         options = f' ({options})' if options else ''
         rows = self.get_session().execute(f'EXPLAIN{options} {compiled.string}').fetchall()
         return '\n'.join(row.values()[0] for row in rows)
+
+    def get_creation_statistics(self, user_pk: Optional[int] = None) -> Dict[str, Any]:
```
Wondering if this is the best place to put this? Since this is a very specific query, I don't think this should be on the `QueryBuilder` itself. If I am not mistaken, this is exclusively used by the REST API. So why not just implement it there as a utility function for the endpoint that calls it?
Because it requires a backend-specific implementation, and there should be no backend-specific query code that does not go through this abstraction. Essentially, IMO there should be no sqlalchemy imports outside of `aiida/backends/sqlalchemy` and `aiida/orm/implementation/sqlalchemy` (and even then, I find this split a bit weird and want to look at whether there can be a consolidation).
> Because it requires a backend specific implementation

Does it though? I see that it does now, but I don't really see why it should be to begin with. This can be implemented just using the `QueryBuilder`, couldn't it? And it shouldn't necessarily be that slow if you project the keys you need directly.
Err, I feel it is a lot less efficient with the general `QueryBuilder` approach, because you have to fetch every row in the node table and iterate over it, rather than the backend implementation just using SQL `COUNT` queries, which can do a lot more work on the server side and significantly reduce data transfer.
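To make that contrast concrete, here is a minimal sketch (not the code from this PR) of the two approaches, assuming direct access to a SQLAlchemy `session` bound to the backend database and the standard `db_dbnode` table:

```python
# Illustrative only: assumes a SQLAlchemy `session` for the AiiDA database
# and the db_dbnode table; not the implementation added in this PR.
from collections import Counter
from sqlalchemy import text

# Client-side counting: every node_type value is transferred and counted locally.
types_client = Counter(
    row[0] for row in session.execute(text('SELECT node_type FROM db_dbnode'))
)

# Server-side counting: the database aggregates; only one row per type is returned.
types_server = dict(
    session.execute(text('SELECT node_type, COUNT(*) FROM db_dbnode GROUP BY node_type'))
)
```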
After a closer look, it would indeed be slower, but only because we don't have a distinct-count functionality on the `QueryBuilder`. Currently, the best you can do is:
```python
def get_creation_statistics(user_pk=None):
    from collections import Counter
    from aiida import orm

    builder_total = orm.QueryBuilder().append(orm.Node, tag='node', project='*')
    builder_types = orm.QueryBuilder().append(orm.Node, tag='node', project='node_type')
    builder_ctime = orm.QueryBuilder().append(orm.Node, tag='node', project='ctime').order_by({'node': 'ctime'})

    if user_pk is not None:
        builder_total.append(orm.User, filters={'id': user_pk}, with_node='node')
        builder_types.append(orm.User, filters={'id': user_pk}, with_node='node')
        builder_ctime.append(orm.User, filters={'id': user_pk}, with_node='node')

    return {
        'total': builder_total.count(),
        'types': dict(Counter(builder_types.all(flat=True))),
        'ctime_by_day': dict(Counter([ctime.strftime('%Y-%m-%d') for ctime in builder_ctime.all(flat=True)])),
    }
```
If we were to add the possibility to do `.distinct(count=True)` on the `QueryBuilder`, we could have the same performance and not have an ad-hoc solution. This wouldn't have to be done in this PR, of course. But after refactoring the QB, do you think adding this feature would be complex? If not, I would do this in a separate PR and then remove the custom `get_creation_statistics`, replace it with my suggestion above, and move it to the REST API.
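Purely as an illustration of the interface being floated here (the `count=True` flag is hypothetical and does not exist on the current `QueryBuilder`):

```python
# Hypothetical interface, sketched only from the discussion above; the
# count=True argument is NOT part of the current QueryBuilder API.
from aiida import orm

builder = orm.QueryBuilder().append(orm.Node, project='node_type')
per_type = builder.distinct(count=True).all()  # hypothetical flag
# The idea: return one (value, count) row per distinct node_type,
# aggregated server side, instead of one row per node.
```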
Ha, you read my mind; yes, I do think there is a possibility to do some further generalization (in a separate PR). I'm not quite sure what you mean by `.distinct(count=True)`, though; are you expecting `distinct` to actually run a query here and return a result? (Because that is not how `distinct` works; it just adds an additional parameter to the query.)
What I mean is that in sqlalchemy you can do something like `session.query(DbNode.node_type).distinct().count()`, which is more or less what the current function is doing (nvm, that actually doesn't work as I thought, but it should still be possible somehow). We would just need to forward this concept to the frontend `QueryBuilder` class. The direct translation is already possible, i.e. `QueryBuilder().append(Node, project='node_type').distinct().count()`, but this will return the number of distinct `node_type`s and not a count for each distinct `node_type`. So my first thought was to add a boolean arg to `distinct` which then gets passed to the implementation and does the right thing. I would be fine with other interfaces; it would just be an interesting feature to have.
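For context, a sketch of the distinction being made here, assuming the backend's `DbNode` model and an open SQLAlchemy session (this is not aiida-core code): `distinct().count()` collapses to a single integer, whereas per-type counts need a `GROUP BY`:

```python
# Illustrative only: DbNode and session are assumed to come from the
# SQLAlchemy backend; not code from aiida-core.
from sqlalchemy import func

# Number of distinct node types: a single integer.
n_distinct_types = session.query(DbNode.node_type).distinct().count()

# Count per distinct node type: one (node_type, count) row per type.
per_type = dict(
    session.query(DbNode.node_type, func.count()).group_by(DbNode.node_type).all()
)
```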
Yeh, I think this would be confusing for what `.count()` does, and it would be better to have a completely separate method, but yeh, let's open an issue on that and I can have more of a think. Just to note, `.distinct()` does not touch the backend at all now, i.e. it just stores `{"distinct": True}` in the dictionary passed to the backend when you call one of the "query" methods (`count`, `all`, `dict`, etc.). In fact, you can also do e.g. `.distinct(True).distinct(False)`.
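A small sketch of what that means in practice, based only on the behaviour described above:

```python
# distinct() just records a flag in the query specification; nothing runs
# until one of the "query" methods is called.
from aiida import orm

qb = orm.QueryBuilder().append(orm.Node, project='node_type')
qb.distinct(True).distinct(False)  # allowed: only the stored flag changes
qb.distinct()                      # re-enable the flag
n_types = qb.count()               # the query is only executed here
```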
Force-pushed 962175d to e58d7e2.
The `AbstractQueryManager` class and its implementations (`DjangoQueryManager`, `SqlaQueryManager`) are an odd mix of a few methods with no real conceptual purpose; essentially a catch-all for things that didn't have another home.

In this PR, I have removed the classes and reassigned their methods as appropriate:

- `get_creation_statistics` has been moved to the `BackendQueryBuilder` (in fact, since #5093, I would say this is maybe more appropriately termed `BackendQueryManager`).
- `get_duplicate_uuids` and `apply_new_uuid_mapping` have been moved to `aiida/backends/general/migrations/utils.py` (they directly use SQL, so cannot be easily abstracted on `BackendQueryBuilder`).
  In fact, there is actually a further duplication here with `aiida/manage/database/integrity/duplicate_uuid.py`; @sphuber, in ee7a65e you updated `aiida/backends/general/migrations/utils.py::deduplicate_uuids`, but actually this is only used in `tests/backends/aiida_django/migrations/test_migrations_many.py`, whereas `aiida/manage/database/integrity/duplicate_uuid.py::deduplicate_uuids` (which has not been updated) is the one used in `verdi database detect-duplicate-uuid`. Should `aiida/manage/database/integrity/duplicate_uuid.py` also be removed? (Done in 845b112, and extracted to `aiida/backends/general/migrations/duplicate_uuids.py` to avoid `verdi` imports of `aiida.repository`.)
- Finally, `get_bands_and_parents_structure` has been moved to `aiida/orm/nodes/data/array/bands.py`. This also had a "backend optimised" implementation (on `DjangoQueryManager`, but not on `SqlaQueryManager`), so the question would be: is `verdi data band list` used particularly ubiquitously, and is the backend optimised implementation that much faster, to warrant this backend specific code? If so, is there some extra method(s) we can add to the `BackendQueryBuilder` to allow this to not have to be "special cased"?

cc @sphuber, @ltalirz for comment