Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

♻️ REFACTOR: Remove QueryManager #5101

Merged
merged 3 commits into from
Sep 2, 2021

Conversation

chrisjsewell
Copy link
Member

@chrisjsewell chrisjsewell commented Aug 26, 2021

The AbstractQueryManager class and its implementations (DjabgoQueryManager, SqlaQueryManager), is an odd mix of a few methods, with no real conceptual purpose; essentially a catch-all for things that didn't have another home.

In this PR, I have removed the classes, and reassigned their methods as appropriate:

get_creation_statistics has been moved to the BackendQueryBuilder (in fact, since #5093, I would say this is maybe more appropriately termed BackendQueryManager)

get_duplicate_uuids and apply_new_uuid_mapping have been moved to aiida/backends/general/migrations/utils.py (they directly use SQL, so cannot be easily abstracted on BackendQueryBuilder).

In fact, there is actually a further duplication here with aiida/manage/database/integrity/duplicate_uuid.py; @sphuber in ee7a65e you updated aiida/backends/general/migrations/utils.p::deduplicate_uuids, but actually this is only used in tests/backends/aiida_django/migrations/test_migrations_many.py, whereas aiida/manage/database/integrity/duplicate_uuid.py::deduplicate_uuids (which has not been updated) is the one used in verdi database detect-duplicate-uuid.

  • I would suggest aiida/manage/database/integrity/duplicate_uuid.py should also be removed? (done in 845b112, and extracted to aiida/backends/general/migrations/duplicate_uuids.py to avoid verdi imports of aiida.repository)

Finally, get_bands_and_parents_structure has been moved to aiida/orm/nodes/data/array/bands.py.
This also had a "backend optimised" implementation (on DjabgoQueryManager, but not on SqlaQueryManager), so the question would be:

  • Is verdi data band list used particularly ubiquitously and is the backend optimised implementation that much faster, to warrant this backend specific code? If so, is there some extra method(s) we can add to the BackendQueryBuilder, to allow this to not have to be "special cased"?

cc @sphuber, @ltalirz for comment

@codecov
Copy link

codecov bot commented Aug 26, 2021

Codecov Report

Merging #5101 (e58d7e2) into develop (aea9bfe) will increase coverage by 0.18%.
The diff coverage is 87.72%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5101      +/-   ##
===========================================
+ Coverage    80.64%   80.81%   +0.18%     
===========================================
  Files          537      534       -3     
  Lines        37168    36967     -201     
===========================================
- Hits         29969    29873      -96     
+ Misses        7199     7094     -105     
Flag Coverage Δ
django 75.52% <87.72%> (+0.22%) ⬆️
sqlalchemy 74.75% <71.06%> (+0.43%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/backends/general/migrations/utils.py 87.50% <ø> (-1.68%) ⬇️
...sions/37f3d4882837_make_all_uuid_columns_unique.py 91.31% <ø> (ø)
aiida/manage/__init__.py 100.00% <ø> (ø)
aiida/manage/database/__init__.py 100.00% <ø> (ø)
aiida/manage/database/integrity/__init__.py 100.00% <ø> (ø)
aiida/orm/implementation/backends.py 94.29% <ø> (+1.98%) ⬆️
aiida/orm/implementation/django/backend.py 100.00% <ø> (ø)
aiida/orm/implementation/sqlalchemy/backend.py 97.44% <ø> (-0.12%) ⬇️
aiida/cmdline/commands/cmd_database.py 62.41% <50.00%> (ø)
...orm/implementation/sqlalchemy/querybuilder/main.py 81.90% <81.25%> (-0.01%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aea9bfe...e58d7e2. Read the comment docs.

@sphuber
Copy link
Contributor

sphuber commented Aug 31, 2021

I am personally not a fan of having these very domain specific and backend dependent queries in aiida-core. This is mostly a relic from the olden days when this repo was all there was, and even all QE code was hosted here. As to whether it is being used, I am not sure. But I do know that @dev-zero recently opened this PR to fix a bug with this very functionality. So that would suggest that at least one person is using it. Still I would be interested to see the performance increase compared to the backend-agnostic version and we really need a dedicated query for this. If not, I would be in favor for removing this.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with these changes in principle (except for two comments). When those are addressed I am happy to approve



def deduplicate_uuids(table=None, dry_run=True):
def deduplicate_uuids(table=None, dry_run=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't change this default. Is there a particular reason you did so?

Copy link
Member Author

@chrisjsewell chrisjsewell Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok done, it doesn't make too much difference because this function is only used two places in the code aiida/cmdline/commands/cmd_database.py (where dry_run is set explicitly) and tests/backends/aiida_django/migrations/test_migrations_many.py (where it was assuming dry_run=False, but now I've set it explicitly 351c5d0)

@@ -1072,3 +1072,34 @@ def analyze_query(self, data: QueryDictType, execute: bool = True, verbose: bool
options = f' ({options})' if options else ''
rows = self.get_session().execute(f'EXPLAIN{options} {compiled.string}').fetchall()
return '\n'.join(row.values()[0] for row in rows)

def get_creation_statistics(self, user_pk: Optional[int] = None) -> Dict[str, Any]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this is the best place to put this? Since this is a very specific query, I don't think this should be on the QueryBuilder itself. If I am not mistaken, this is exclusively used by the REST API. So why not just implement it there as a utility function for the endpoint that calls it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it requires a backend specific implementation, and there should be no backend specific query code that does not go through this abstraction.
Essentially, IMO there should be no sqlalchemy imports outside of aiida/backends/sqlalchemy, aiida/orm/implementation/sqlalchemy (and even then, I find this split a bit weird and want to look if there can be a consolidation)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it requires a backend specific implementation

Does it though? I see that it does now, but don't really see why it should be to begin with. This can be implemented just using the QueryBuilder couldn't it? And it shouldn't necessarily be that slow if you project the keys you need directly.

Copy link
Member Author

@chrisjsewell chrisjsewell Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, I feel it is a lot less efficient with the general QueryBuilder approach, because you have to fetch every row in the node table, and iterate on it. Rather than with the backend implementation, just using SQL COUNT queries that can do a lot more work on the server side, and significantly reduce data transfer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a closer look, it would indeed be slower, but only because we don't have a distinct-count functionality on the QueryBuilder. Currently, the best you can do is:

def get_creation_statistics(user_pk=None):
    from collections import Counter
    from aiida import orm

    builder_total = orm.QueryBuilder().append(orm.Node, tag='node', project='*')
    builder_types = orm.QueryBuilder().append(orm.Node, tag='node', project='node_type')
    builder_ctime = orm.QueryBuilder().append(orm.Node, tag='node', project='ctime').order_by({'node': 'ctime'})

    if user_pk is not None:
        builder_total.append(orm.User, filters={'id': user_pk}, with_node='node')
        builder_types.append(orm.User, filters={'id': user_pk}, with_node='node')
        builder_ctime.append(orm.User, filters={'id': user_pk}, with_node='node')

    return {
        'total': builder_total.count(),
        'types': dict(Counter(builder_types.all(flat=True))),
        'ctime_by_day': dict(Counter([ctime.strftime('%Y-%m-%d') for ctime in builder_ctime.all(flat=True)])),
    }

If we were to add the possibility to do .distinct(count=True) on the QueryBuilder we could have the same performance and not have an ad-hoc solution. This wouldn't have to be done in this PR of course. But after refactoring the QB do you think adding this feature would be complex? If not I would do this in a separate PR and then remove the custom get_creation_statistics and replace it with my suggestion above and move it to the REST API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, you read my mind, yes I do think there is a possibility to do some further generalization (in a separate PR).
I'm not quite sure what you mean by .distinct(count=True); are you expecting distinct to actually run a query here, and return a result?
(because that is not how distinct works, it just adds an additional parameter for the query)

Copy link
Contributor

@sphuber sphuber Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that in sqlalchemy you can do something like session.query(DbNode.node_type).distinct().count() which is more or less what the current function is doing nvm, that actually doesn't work as I though, still it should be possible somehow. We would just need to forward this concept to the frontend QueryBuilder class. The direct translation is already possible, i.e. QueryBuilder().append(Node, project='node_type').distinct().count() but this will return the number of distinct node_types and not a count for each distinct node_type. So my first though was to add a boolean arg to distinct which then gets passed to the implementation and it does the right thing. Would be fine with other interfaces, just would be an interesting feature to have.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh I think this would be confusing for what .count() does and it would be better to have a completely separate method, but yeh lets open an issue on that and I can have more of a think

Just to note, .distinct() does not touch the backend at all now, i.e. it just stores {"distinct": True} in the dictionary passed to the backend when you call one of the "query" methods (count, all, dict, etc). In fact you can also do e.g. .distinct(True).distinct(False)

@sphuber sphuber merged commit 1dac2fe into aiidateam:develop Sep 2, 2021
@chrisjsewell chrisjsewell deleted the remove-query-manager branch September 16, 2021 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants