✨ NEW: Add Node.objects.iter_repo_keys
#4922
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           develop    #4922     +/-   ##
===========================================
+ Coverage    81.20%   81.22%   +0.03%
===========================================
  Files          532      532
  Lines        37307    37347      +40
===========================================
+ Hits         30290    30330      +40
  Misses        7017     7017
===========================================
```
Thanks @chrisjsewell! The location seems possibly OK. However, I'm wondering if we actually need the functionality to iterate over only a subset of node types. If this is not a requirement, in the end we just need some iterator over all hash keys; maybe there's a different place to put it (but I don't know where, so in that case we can keep it here). As a second comment, on my mid-size DB (~320k nodes, ~340k different hash keys, ~510k objects listed in the node `repository_metadata`), I get this timing:

```python
import time

from aiida.manage.manager import get_manager
from aiida.orm import Node, QueryBuilder  # imports added for completeness

def iter_object_keys():
    from aiida.repository import Repository
    profile = get_manager().get_profile()
    backend = profile.get_repository().backend
    query = QueryBuilder()
    query.append(Node, project=['repository_metadata'])
    for metadata, in query.iterall():
        repo = Repository.from_serialized(backend=backend, serialized=metadata)
        for hash_key in repo.get_hash_keys():
            yield hash_key

t = time.time()
b = set(iter_object_keys())
print(time.time() - t)
```

This runs in ~8.2s. The following "hardcoded" logic would take only 2.1s:

```python
import time

import aiida.backends.djsite.db.models as djmodels

def flatten_hashonly(dictionary):
    """Recursively collect all hash keys from a serialized repository-metadata dict."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])
        except KeyError:  # no 'k': a directory entry, recurse into it
            items.update(flatten_hashonly(value))
    return items

t = time.time()
results = djmodels.DbNode.objects.values_list('repository_metadata', flat=True)
hashes = set()
for data in results:
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)
```

(This might also give a sense of how long it takes to list all of them; I imagine this time scales linearly with the size of the DB, see also the discussions in #4321 and #4919.) I don't know why the first implementation (from this PR) is 4x slower (maybe the class creation?). I know the second snippet is quite hard-coded and has no checks, but maybe there is a way to implement a version that is a compromise between the two?
A partially unrelated question for @sphuber: should the "correct" default be …
I can have a look, but yeah, the point was I went for the "abstractly best" solution.
Yes yes, I know. As I said, it's OK performance-wise and we can improve later.
Note, if you just replace …
Well, it's nice for summary statistics; do you gain anything by not having it?
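For instance, an illustrative sketch of the kind of summary statistics meant here, using the method this PR adds (the chosen classes are just examples):

```python
from aiida.orm import CalcJobNode, Data, Node

# Count the unique repository keys referenced by each node class.
for cls in (Node, Data, CalcJobNode):
    print(cls.__name__, len(set(cls.objects.iter_repo_keys())))
```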
You mean this?

```python
import time

from aiida.orm import Node, QueryBuilder  # imports added for completeness

def iter_object_keys():
    # reuses flatten_hashonly from the snippet above
    query = QueryBuilder()
    query.append(Node, project=['repository_metadata'])
    for metadata, in query.iterall():
        for item in flatten_hashonly(metadata):
            yield item

t = time.time()
b = set(iter_object_keys())
print(time.time() - t)
```

For reference, I get 5.1s, so in between. But again, I wouldn't block this on performance; it's still acceptable (and uses existing code), and we can improve later.

Statistics is a good point. I'm just thinking whether, by lifting that requirement, we could put the logic somewhere else, in a more 'global' place (like a class method of …)
Err, I feel that would be kind of misleading, because obviously the output does not actually come from the repository.
I realise I've been unfair in my timings: due to some copy-paste from earlier code, in the faster snippet I was using Django directly rather than the QueryBuilder. Going through the QueryBuilder gives a timing of ~5.1s, much closer to the implementation in this PR:

```python
import time

from aiida.orm import Node, QueryBuilder  # imports added for completeness

def flatten_hashonly(dictionary):
    """Recursively collect all hash keys from a serialized repository-metadata dict."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])
        except KeyError:  # no 'k': a directory entry, recurse into it
            items.update(flatten_hashonly(value))
    return items

t = time.time()
query = QueryBuilder()
query.append(Node, project=['repository_metadata'])
hashes = set()
for data, in query.iterall():
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)
```

Sorry about this. Also, I see your point @chrisjsewell about not putting it in …
A few points: …
Oh yeh, cheers, I forgot that, but... this is certainly not evident in the names/docstrings (and there's a copy-pasta foobar):

```python
def list_objects(self, path: str = None) -> typing.List[File]:
    """Return a list of the objects contained in this repository sorted by name, optionally in given sub directory.

    :param path: the relative path where to store the object in the repository.
    """

def list_object_names(self, path: str = None) -> typing.List[str]:
    """Return a sorted list of the object names contained in this repository, optionally in the given sub directory.

    :param path: the relative path where to store the object in the repository.
    """
```

(Note the `:param path:` descriptions: these are listing methods, yet the text talks about where to *store* the object.) This mirrors my comments in #4321 (comment): I feel it may be better to have a …
As a general note about recursive functions in Python: Python does not do tail-call optimization, hence I've come to avoid recursive functions in performance-critical code (or code where the depth is not predetermined, to avoid hitting the recursion limit). An iterative variant:

```python
def flatten_hashonly(dictionary):
    items = set()
    if not dictionary:
        return items
    stack = [dictionary]
    while stack:
        value = stack.pop()
        try:
            items.add(value['k'])
        except KeyError:  # a directory entry: push its children onto the stack
            stack += list(value['o'].values())
    return items
```

Maybe this could be of use here as well.
Thanks @dev-zero!

```python
def flatten_hashonly_nonrec(dictionary):
    items = set()
    if not dictionary:
        return items
    stack = [dictionary]
    while stack:
        values = stack.pop()['o'].values()
        for value in values:
            try:
                items.add(value['k'])
            except KeyError:  # a directory entry: process it later
                stack += [value]
    return items
```

(One important note where it took a moment for me to realise: if I …)

I checked and the timing seems the same as the recursive version, but I agree that a non-recursive version is better, to avoid the RecursionError exception.
Yep, fair 👍
OK, I've updated this: …

As already mentioned, this function will be used both in the backend repository cleaning and the archive code.
cc @giovannipizzi @sphuber @dev-zero for re-review
I'm merging this soon |
I'll leave the check on the recursive implementation to @giovannipizzi or @dev-zero, since they commented on this. I just have to note that I think you haven't addressed my comment on the naming of `iter_object_keys`. When defined on the collection, it leads to `Node.objects.iter_object_keys()`, which strongly suggests that you are somehow iterating over the objects (i.e. nodes) in the collection. At no point is it clear that it is dealing with objects from the repository of the node. I really think we should name it `iter_repository_object_keys`.
Cheers, changed the name to `iter_repo_keys`.
This PR implements `NodeCollection.iter_repo_keys`, to retrieve all the repository object keys for the given `Node` subclass. For example:
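A minimal sketch of the intended usage (assuming a loaded profile):

```python
from aiida.orm import Data, Node

# Iterate over every repository object key referenced by any node ...
for key in Node.objects.iter_repo_keys():
    print(key)

# ... or restrict the iteration to a given subclass, e.g. `Data` nodes.
data_keys = set(Data.objects.iter_repo_keys())
```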
It is primarily intended to form part of the solution for #4321 (retrieving all object keys on the AiiDA DB, to decide what needs to be deleted from the object store).
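As a rough, hypothetical sketch of that use case (how the full set of stored keys is obtained is backend-specific and not part of this PR):

```python
from aiida.orm import Node

# All keys still referenced by at least one node in the database.
referenced = set(Node.objects.iter_repo_keys())

# `stored_keys` is a stand-in for the complete set of keys present in the
# object store; listing them is left to the store's own API.
stored_keys = referenced | {'orphaned-key-1', 'orphaned-key-2'}

# Anything stored but no longer referenced is a candidate for deletion.
unreferenced = stored_keys - referenced
print(f'{len(unreferenced)} objects could be deleted from the store')
```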
I think this is a good location in the API to expose this, in a generic way, so that we can always change the implementation at a later date.

Thoughts, @sphuber, @giovannipizzi?
(If you agree I will add some tests)