
✨ NEW: Add Node.objects.iter_repo_keys #4922

Merged: 16 commits merged into develop, Nov 2, 2021

Conversation

@chrisjsewell (Member) commented May 6, 2021

This PR implements NodeCollection.iter_repo_keys, to retrieve all the repository object keys for the given Node subclass.
For example:

from aiida.orm import Data
len(set(Data.objects.iter_repo_keys()))

It is primarily intended to form part of the solution for #4321 (retrieving all object keys on the AiiDA DB, to decide what needs to be deleted from the object store).

I think this is a good location in the API to expose this: it is generic enough that we can always change the implementation at a later date.

thoughts @sphuber, @giovannipizzi?

(If you agree I will add some tests)

@codecov (bot) commented May 6, 2021

Codecov Report

Merging #4922 (2579d04) into develop (bb92dd5) will increase coverage by 0.03%.
The diff coverage is 97.73%.


@@             Coverage Diff             @@
##           develop    #4922      +/-   ##
===========================================
+ Coverage    81.20%   81.22%   +0.03%     
===========================================
  Files          532      532              
  Lines        37307    37347      +40     
===========================================
+ Hits         30290    30330      +40     
  Misses        7017     7017              
Flag         Coverage Δ
django       76.07% <97.73%> (+0.03%) ⬆️
sqlalchemy   75.17% <97.73%> (+0.03%) ⬆️

Flags with carried-forward coverage won't be shown.

Impacted Files                                   Coverage Δ
aiida/repository/repository.py                   96.54% <94.12%> (-0.19%) ⬇️
aiida/cmdline/commands/cmd_code.py               89.69% <100.00%> (+0.15%) ⬆️
aiida/cmdline/params/options/commands/code.py    100.00% <100.00%> (ø)
aiida/orm/nodes/node.py                          96.36% <100.00%> (+0.09%) ⬆️
aiida/transports/plugins/local.py                81.66% <0.00%> (+0.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d25339d...2579d04.

@giovannipizzi (Member) commented May 6, 2021

Thanks @chrisjsewell! The location seems OK. However, I'm wondering if we actually need the functionality to iterate over only a subset of node types. If that is not a requirement, then in the end we just need some iterator over all hash keys, and maybe there's a different place to put it (though I don't know where, so failing that we can keep it here).

As a second comment: on my mid-size DB (~320k nodes, ~340k distinct hash keys, ~510k objects listed in the nodes' repository_metadata), I get this timing:

import time

from aiida.manage.manager import get_manager
from aiida.orm import Node, QueryBuilder  # these imports were missing in the original snippet


def iter_object_keys():
    """Yield every repository hash key referenced by any node."""
    from aiida.repository import Repository

    profile = get_manager().get_profile()
    backend = profile.get_repository().backend

    query = QueryBuilder()
    query.append(Node, project=['repository_metadata'])
    for metadata, in query.iterall():
        # rebuild the virtual file hierarchy of each node, then walk it for hash keys
        repo = Repository.from_serialized(backend=backend, serialized=metadata)
        yield from repo.get_hash_keys()


t = time.time()
b = set(iter_object_keys())
print(time.time() - t)

This runs in ~8.2s.

The following "hardcoded" logic would take only 2.1s:

import time

import aiida.backends.djsite.db.models as djmodels


def flatten_hashonly(dictionary):
    """Recursively collect the hash keys ('k' entries) from a node's repository_metadata."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])  # a file: record its hash key
        except KeyError:
            items.update(flatten_hashonly(value))  # a directory: recurse
    return items


t = time.time()
results = djmodels.DbNode.objects.values_list('repository_metadata', flat=True)
hashes = set()
for data in results:
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)

(This might also give a sense of how long it takes to list all of them; I imagine the time scales roughly linearly with the size of the DB. See also the discussions in #4321 and #4919.)

I don't know why the first implementation (from this PR) is 4x slower (maybe the class creation?). I know the second snippet is quite hard-coded and has no checks, but maybe there is a way to implement a version that is a compromise between the two?
(If it's not obvious how to speed it up, I think we can go ahead and optimise later, as long as we're happy with the interface.)

@giovannipizzi (Member)

A partially unrelated question for @sphuber: shouldn't the "correct" default be {'o': {}} rather than {}? (I know you've already been moving from None to {}.) I thought that in my first version I had to patch {} to {'o': {}}, but this does not seem to be the case, so maybe I'm just misunderstanding the syntax.
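
(For context, an editor's sketch of the repository_metadata serialization these snippets assume: 'o' holds a directory's entries and 'k' holds a file's object key, as used by flatten_hashonly above; the concrete names and hash are made up for illustration.)

# hypothetical repository_metadata for a node with one file inside one folder
metadata = {
    'o': {
        'sub': {  # a folder: carries its own 'o' mapping of entries
            'o': {
                'file.txt': {'k': 'abc123'},  # a file: 'k' is its object key (made up here)
            },
        },
    },
}

assert flatten_hashonly(metadata) == {'abc123'}
# an empty repository serializes as {} (or, per the question above, perhaps {'o': {}})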

@chrisjsewell (Member, author)

I know the second snippet is quite hard-coded and with no checks; but maybe there is a way to implement a version that is a compromise between the two?

I can have a look, but yeah, the point was that I went for the "abstractly best" solution.

@giovannipizzi (Member)

Yes, yes, I know. As I said, it's OK performance-wise and we can improve later.
I'd still like to hear from @sphuber whether he has better suggestions for the location (do we need filtering by type?).

@chrisjsewell (Member, author)

Note: if in my function you just replace Repository.from_serialized(backend=backend, serialized=metadata).get_hash_keys() with your flatten_hashonly(metadata), you get the same speed-up.

@chrisjsewell (Member, author) commented May 6, 2021

do we need filtering by type?

Well, it's nice for summary statistics; do you gain anything by not having it?
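
(For illustration, an editor's sketch of the kind of per-type summary statistics this enables, using the method's eventual name iter_repo_keys; the choice of subclasses here is arbitrary.)

from aiida.orm import CalcJobNode, Data

# count the distinct repository objects referenced by each node subclass
for cls in (Data, CalcJobNode):
    print(cls.__name__, len(set(cls.objects.iter_repo_keys())))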

@giovannipizzi (Member)

Note, if you just replace Repository.from_serialized(backend=backend, serialized=metadata).get_hash_keys() in my function, with your flatten_hashonly(metadata), you get the same speed up

You mean this?

import time

from aiida.orm import Node, QueryBuilder  # these imports were missing in the original snippet


def iter_object_keys():
    query = QueryBuilder()
    query.append(Node, project=['repository_metadata'])
    for metadata, in query.iterall():
        # flatten_hashonly is the helper defined in the earlier comment
        yield from flatten_hashonly(metadata)


t = time.time()
b = set(iter_object_keys())
print(time.time() - t)

For reference, I get 5.1s, so in between. But again, I wouldn't block this on performance; it's still acceptable (and uses existing code), and we can improve later.

well its nice for summary statistics; do you gain anything by not having it?

Statistics is a good point. I'm just wondering whether, by lifting that requirement, we could put the logic somewhere else, in a more 'global' place (like a class method of Repository, just to give an example), but I'm not convinced; maybe the place you suggest is the best one.

@chrisjsewell (Member, author)

like a class method of Repository just to give an example

Err, I feel that would be kind of misleading, because obviously the output does not actually come from the repository.

@giovannipizzi (Member)

I realise I've been unfair in my timings: due to some copy-paste from earlier code, the faster snippet uses Django directly rather than the QueryBuilder. Using the QueryBuilder gives ~5.1s, much closer to the implementation in this PR:

import time

from aiida.orm import Node, QueryBuilder  # the Django models import is no longer needed here


def flatten_hashonly(dictionary):
    """Recursively collect the hash keys ('k' entries) from a node's repository_metadata."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])  # a file: record its hash key
        except KeyError:
            items.update(flatten_hashonly(value))  # a directory: recurse
    return items


t = time.time()
query = QueryBuilder()
query.append(Node, project=['repository_metadata'])
hashes = set()
for data, in query.iterall():
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)

Sorry about this.

Also, I see your point, @chrisjsewell, about not putting it in Repository.

@sphuber (Contributor) commented May 6, 2021

A few points:

  • I agree that this shouldn't go in the Repository class, as it explicitly does not have a connection to the database. It provides an interface to a repo on disk, but those repos are not expected to keep a record of the virtual hierarchy. That hierarchy needs to be inserted into the Repository (through from_serialized), which keeps it in memory for the duration of its lifetime. In AiiDA, however, the actual hierarchy is stored in our database, so I wouldn't add methods to "retrieve" it on the Repository class.

  • I am not sure why this needs to be defined centrally on the Node collection. Would this really be used anywhere other than the maintenance operation for the repository? Anyway, I won't block this if you think it makes sense here. I would just recommend a slight name change, because I was really confused by the name iter_object_keys. Since this is called as Node.objects.iter_object_keys, you would think that objects is some hashmap and that you are iterating over the keys that map to the nodes in that collection. Maybe iter_repository_object_keys would be better? Also, I can see why we need the keys (in order to know which keys are no longer referenced), but why do we need the files? What would be the point of iterating over all filenames in the entire repository?

  • The implementation of iter_object_names is incorrect. The list_object_names method is not recursive: it merely lists the names of the entries in the root directory of that node. You can pass a relative path as an argument to get the names in that subdirectory. Note also that list_object_names returns the names of both files and directories (see the sketch below).
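
(For illustration, an editor's sketch of a recursive walk built on top of the non-recursive listing; is_directory is a hypothetical helper standing in for whatever file/directory check the real API provides.)

def iter_object_names_recursive(node, path=''):
    # list_object_names is not recursive, but it accepts a relative path,
    # so the hierarchy can be walked level by level
    for name in node.list_object_names(path):
        subpath = f'{path}/{name}' if path else name
        if is_directory(node, subpath):  # hypothetical check, see lead-in
            yield from iter_object_names_recursive(node, subpath)
        else:
            yield subpath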

@chrisjsewell (Member, author) commented May 7, 2021

The implementation of iter_object_names is incorrect. The list_object_names method is not recursive and merely lists the names of the files in the root directory of that node.

Oh yeah, cheers, I forgot that, but... this is certainly not evident in the names/docstrings (and there's a copy-paste error in the :param path: descriptions):

    def list_objects(self, path: str = None) -> typing.List[File]:
        """Return a list of the objects contained in this repository sorted by name, optionally in given sub directory.

        :param path: the relative path where to store the object in the repository.
        """

    def list_object_names(self, path: str = None) -> typing.List[str]:
        """Return a sorted list of the object names contained in this repository, optionally in the given sub directory.

        :param path: the relative path where to store the object in the repository.
        """

This mirrors my comments in #4321 (comment): I feel it may be better to have a pathlib.PurePath-like API.

@dev-zero (Contributor)

def flatten_hashonly(dictionary):
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])
        except KeyError:
            items.update(flatten_hashonly(value))
    return items

As a general note about recursive functions in Python: Python does not do tail-call optimization, hence I've come to avoid recursive functions in performance-critical code (or in code where the depth is not predetermined, to avoid hitting RuntimeError: maximum recursion depth exceeded). There are several ways to do it; my favourite is usually an explicit stack, which exploits the fact that Python uses references for anything but simple types. Untested, but the following is roughly what I mean:

def flatten_hashonly(dictionary):
    items = set()

    if not dictionary:
        return items

    stack = [dictionary]

    while stack:
        value = stack.pop()

        try:
            items.add(value['k'])
        except KeyError:
            stack += list(dictionary['o'].values())

    return items

Maybe this could be of use here as well.

@giovannipizzi (Member)

Thanks @dev-zero! The correct code should be this:

def flatten_hashonly_nonrec(dictionary):
    """Non-recursive variant: collect the hash keys using an explicit stack."""
    items = set()

    if not dictionary:
        return items

    stack = [dictionary]

    while stack:
        # each stack entry is a (sub-)directory; look at its entries
        values = stack.pop()['o'].values()

        for value in values:
            try:
                items.add(value['k'])  # a file: record its hash key
            except KeyError:
                stack += [value]  # a directory: push it for later processing

    return items

(One important note where it took a moment for me to realise: if d is a dictionary, [d] is a list with d as an entry, while list(d) is a list of its keys).
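
(A quick demonstration of that difference:)

d = {'a': 1, 'b': 2}
print([d])      # [{'a': 1, 'b': 2}]  -> a one-element list containing the dict
print(list(d))  # ['a', 'b']          -> a list of the dict's keys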

I checked, and the timing seems the same as the recursive version's, but I agree that a non-recursive version is better, to avoid the recursion-depth error.

@chrisjsewell (Member, author)

Yep fair 👍

@ramirezfranciscof mentioned this pull request May 31, 2021
@chrisjsewell (Member, author) commented Oct 26, 2021

Ok, I've updated this:

  • I've removed iter_object_names for now, so it doesn't complicate things.
  • I've added the Repository.flatten class method, which implements logic similar to the non-recursive functions discussed above.
    • As you can see from the code/test, it returns a dict mapping path -> key/None (see the sketch after this list).
    • The logic of appending a delimiter to the end of folder paths is copied from zippath/tarpath, and avoids key clashes between a file and a folder with the same name in the same parent folder.
    • Incidentally, this format is how I feel the hierarchy should actually be stored in the database (not including the sub-folders): it is a lot simpler and easier to run Postgres JSONB queries on (like has_key).
  • NodeCollection.iter_object_keys then uses Repository.flatten.

As already mentioned, this function will be used both in the backend repository cleaning and in the archive code.
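
(For illustration, an editor's sketch of the flattened form described above; the delimiter, paths and key are assumptions for the example, not taken from the PR code.)

# hypothetical serialized metadata for one file inside one folder
serialized = {'o': {'sub': {'o': {'file.txt': {'k': 'abc123'}}}}}

# the flattened mapping of path -> key/None, with folder paths suffixed
# by the delimiter so they cannot clash with file paths of the same name:
flattened = {'sub/': None, 'sub/file.txt': 'abc123'}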

@chrisjsewell (Member, author)

cc @giovannipizzi @sphuber @dev-zero for re-review

@chrisjsewell (Member, author)

I'm merging this soon

@sphuber (Contributor) left a review comment

I'll leave the check on the recursive implementation to @giovannipizzi or @dev-zero, since they commented on it. I just have to note that I think you haven't addressed my comment on the naming of iter_object_keys. When defined on the collection, it leads to Node.objects.iter_object_keys(), and that strongly suggests you are somehow iterating over the objects (i.e. nodes) in the collection. At no point is it clear that it is dealing with objects from the repository of the node. I really think we should name it iter_repository_object_keys.

@chrisjsewell changed the title from "✨ NEW: Add Node.objects.iter_object_keys" to "✨ NEW: Add Node.objects.iter_repo_keys" on Nov 1, 2021
@chrisjsewell (Member, author)

I think you haven't addressed my comment on the naming for iter_object_keys

Cheers, changed the name to iter_repo_keys.

@chrisjsewell merged commit 2d6df12 into develop Nov 2, 2021
@chrisjsewell deleted the iter_object_keys branch November 2, 2021 16:21
chrisjsewell added a commit to chrisjsewell/aiida_core that referenced this pull request Nov 2, 2021