
✨ NEW: Add Node.objects.iter_repo_keys #4922

Merged: 16 commits merged into develop, Nov 2, 2021

Conversation

@chrisjsewell (Member) commented May 6, 2021

This PR implements NodeCollection.iter_repo_keys, to retrieve all the repository object keys for the given Node subclass.
For example:

from aiida.orm import Data
len(set(Data.objects.iter_repo_keys()))

It is primarily intended to form part of the solution for #4321 (retrieving all object keys on the AiiDA DB, to decide what needs to be deleted from the object store).

I think this is a good location in the API to expose this: it is generic enough that we can always change the implementation at a later date.

thoughts @sphuber, @giovannipizzi?

(If you agree I will add some tests)

@codecov (bot) commented May 6, 2021

Codecov Report

Merging #4922 (2579d04) into develop (bb92dd5) will increase coverage by 0.03%.
The diff coverage is 97.73%.


@@             Coverage Diff             @@
##           develop    #4922      +/-   ##
===========================================
+ Coverage    81.20%   81.22%   +0.03%     
===========================================
  Files          532      532              
  Lines        37307    37347      +40     
===========================================
+ Hits         30290    30330      +40     
  Misses        7017     7017              
Flag         Coverage Δ
django       76.07% <97.73%> (+0.03%) ⬆️
sqlalchemy   75.17% <97.73%> (+0.03%) ⬆️

Flags with carried-forward coverage won't be shown.

Impacted Files                                   Coverage Δ
aiida/repository/repository.py                   96.54% <94.12%> (-0.19%) ⬇️
aiida/cmdline/commands/cmd_code.py               89.69% <100.00%> (+0.15%) ⬆️
aiida/cmdline/params/options/commands/code.py    100.00% <100.00%> (ø)
aiida/orm/nodes/node.py                          96.36% <100.00%> (+0.09%) ⬆️
aiida/transports/plugins/local.py                81.66% <0.00%> (+0.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d25339d...2579d04.

@giovannipizzi (Member) commented May 6, 2021

Thanks @chrisjsewell! The location seems OK. However, I'm wondering if we actually need the functionality to iterate over only a subset of node types. If that is not a requirement, then in the end we just need some iterator over all hash keys, and maybe there's a different place to put it (though I don't know where, so failing that we can keep it here).

As a second comment: on my mid-size DB (~320k nodes, ~340k distinct hash keys, ~510k objects listed in the nodes' repository_metadata), I get this timing:

import time

from aiida.manage.manager import get_manager
from aiida.orm import Node, QueryBuilder  # these imports were missing in the original snippet


def iter_object_keys():
    """Yield every repository hash key referenced by any node."""
    from aiida.repository import Repository

    profile = get_manager().get_profile()
    backend = profile.get_repository().backend

    query = QueryBuilder()
    query.append(Node, project=['repository_metadata'])
    for metadata, in query.iterall():
        # rebuild the virtual file hierarchy of each node, then walk it for hash keys
        repo = Repository.from_serialized(backend=backend, serialized=metadata)
        yield from repo.get_hash_keys()


t = time.time()
b = set(iter_object_keys())
print(time.time() - t)

This runs in ~8.2s.

The following "hardcoded" logic would take only 2.1s:

import time

import aiida.backends.djsite.db.models as djmodels


def flatten_hashonly(dictionary):
    """Recursively collect the hash keys ('k' entries) from a node's repository_metadata."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])  # a file: record its hash key
        except KeyError:
            items.update(flatten_hashonly(value))  # a directory: recurse
    return items


t = time.time()
results = djmodels.DbNode.objects.values_list('repository_metadata', flat=True)
hashes = set()
for data in results:
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)

(This might also give a sense of how long it takes to list all of them; I imagine the time scales roughly linearly with the size of the DB. See also the discussions in #4321 and #4919.)

I don't know why the first implementation (from this PR) is 4x slower (maybe the class creation?). I know the second snippet is quite hard-coded and has no checks, but maybe there is a way to implement a version that is a compromise between the two?
(If it's not obvious how to speed it up, I think we can go ahead and optimise later, as long as we're happy with the interface.)

@giovannipizzi (Member)

A partially unrelated question for @sphuber: shouldn't the "correct" default be {'o': {}} rather than {}? (I know you've already been moving from None to {}.) I thought that in my first version I had to patch {} to {'o': {}}, but this does not seem to be the case, so maybe I'm just misunderstanding the syntax.
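
(For context, an editor's sketch of the repository_metadata serialization these snippets assume: 'o' holds a directory's entries and 'k' holds a file's object key, as used by flatten_hashonly above; the concrete names and hash are made up for illustration.)

# hypothetical repository_metadata for a node with one file inside one folder
metadata = {
    'o': {
        'sub': {  # a folder: carries its own 'o' mapping of entries
            'o': {
                'file.txt': {'k': 'abc123'},  # a file: 'k' is its object key (made up here)
            },
        },
    },
}

assert flatten_hashonly(metadata) == {'abc123'}
# an empty repository serializes as {} (or, per the question above, perhaps {'o': {}})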

@chrisjsewell (Member, author)

I know the second snippet is quite hard-coded and with no checks; but maybe there is a way to implement a version that is a compromise between the two?

I can have a look, but yeah, the point was that I went for the "abstractly best" solution.

@giovannipizzi (Member)

Yes, yes, I know. As I said, it's OK performance-wise and we can improve later.
I'd still like to hear from @sphuber whether he has better suggestions for the location (do we need filtering by type?).

@chrisjsewell (Member, author)

Note: if in my function you just replace Repository.from_serialized(backend=backend, serialized=metadata).get_hash_keys() with your flatten_hashonly(metadata), you get the same speed-up.

@chrisjsewell (Member, author) commented May 6, 2021

do we need filtering by type?

Well, it's nice for summary statistics; do you gain anything by not having it?
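
(For illustration, an editor's sketch of the kind of per-type summary statistics this enables, using the method's eventual name iter_repo_keys; the choice of subclasses here is arbitrary.)

from aiida.orm import CalcJobNode, Data

# count the distinct repository objects referenced by each node subclass
for cls in (Data, CalcJobNode):
    print(cls.__name__, len(set(cls.objects.iter_repo_keys())))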

@giovannipizzi (Member)

Note, if you just replace Repository.from_serialized(backend=backend, serialized=metadata).get_hash_keys() in my function, with your flatten_hashonly(metadata), you get the same speed up

You mean this?

import time

from aiida.orm import Node, QueryBuilder  # these imports were missing in the original snippet


def iter_object_keys():
    query = QueryBuilder()
    query.append(Node, project=['repository_metadata'])
    for metadata, in query.iterall():
        # flatten_hashonly is the helper defined in the earlier comment
        yield from flatten_hashonly(metadata)


t = time.time()
b = set(iter_object_keys())
print(time.time() - t)

For reference, I get 5.1s, so in between. But again, I wouldn't block this on performance; it's still acceptable (and uses existing code), and we can improve later.

well its nice for summary statistics; do you gain anything by not having it?

Statistics is a good point. I'm just wondering whether, by lifting that requirement, we could put the logic somewhere else, in a more 'global' place (like a class method of Repository, just to give an example), but I'm not convinced; maybe the place you suggest is the best one.

@chrisjsewell (Member, author)

like a class method of Repository just to give an example

Err, I feel that would be kind of misleading, because obviously the output does not actually come from the repository.

@giovannipizzi (Member)

I realise I've been unfair in my timings: due to some copy-paste from earlier code, the faster snippet uses Django directly rather than the QueryBuilder. Using the QueryBuilder gives ~5.1s, much closer to the implementation in this PR:

import time

from aiida.orm import Node, QueryBuilder  # the Django models import is no longer needed here


def flatten_hashonly(dictionary):
    """Recursively collect the hash keys ('k' entries) from a node's repository_metadata."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])  # a file: record its hash key
        except KeyError:
            items.update(flatten_hashonly(value))  # a directory: recurse
    return items


t = time.time()
query = QueryBuilder()
query.append(Node, project=['repository_metadata'])
hashes = set()
for data, in query.iterall():
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)

Sorry about this.

Also, I see your point, @chrisjsewell, about not putting it in Repository.

@sphuber (Contributor) commented May 6, 2021

A few points:

  • I agree that this shouldn't go in the Repository class, as it explicitly does not have a connection to the database. It provides an interface to a repo on disk, but those repos are not expected to keep a record of the virtual hierarchy. That hierarchy needs to be inserted into the Repository (through from_serialized), which keeps it in memory for the duration of its lifetime. In AiiDA, however, the actual hierarchy is stored in our database, so I wouldn't add methods to "retrieve" it on the Repository class.

  • I am not sure why this needs to be defined centrally on the Node collection. Would this really be used anywhere other than the maintenance operation for the repository? Anyway, I won't block this if you think it makes sense here. I would just recommend a slight name change, because I was really confused by the name iter_object_keys. Since this is called as Node.objects.iter_object_keys, you would think that objects is some hashmap and that you are iterating over the keys that map to the nodes in that collection. Maybe iter_repository_object_keys would be better? Also, I can see why we need the keys (in order to know which keys are no longer referenced), but why do we need the files? What would be the point of iterating over all filenames in the entire repository?

  • The implementation of iter_object_names is incorrect. The list_object_names method is not recursive: it merely lists the names of the entries in the root directory of that node. You can pass a relative path as an argument to get the names in that subdirectory. Note also that list_object_names returns the names of both files and directories (see the sketch below).
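
(For illustration, an editor's sketch of a recursive walk built on top of the non-recursive listing; is_directory is a hypothetical helper standing in for whatever file/directory check the real API provides.)

def iter_object_names_recursive(node, path=''):
    # list_object_names is not recursive, but it accepts a relative path,
    # so the hierarchy can be walked level by level
    for name in node.list_object_names(path):
        subpath = f'{path}/{name}' if path else name
        if is_directory(node, subpath):  # hypothetical check, see lead-in
            yield from iter_object_names_recursive(node, subpath)
        else:
            yield subpath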

@chrisjsewell (Member, author) commented May 7, 2021

The implementation of iter_object_names is incorrect. The list_object_names method is not recursive and merely lists the names of the files in the root directory of that node.

Oh yeah, cheers, I forgot that, but... this is certainly not evident in the names/docstrings (and there's a copy-paste error in the :param path: descriptions):

    def list_objects(self, path: str = None) -> typing.List[File]:
        """Return a list of the objects contained in this repository sorted by name, optionally in given sub directory.

        :param path: the relative path where to store the object in the repository.
        """

    def list_object_names(self, path: str = None) -> typing.List[str]:
        """Return a sorted list of the object names contained in this repository, optionally in the given sub directory.

        :param path: the relative path where to store the object in the repository.
        """

This mirrors my comments in #4321 (comment): I feel it may be better to have a pathlib.PurePath-like API.

@dev-zero (Contributor)

def flatten_hashonly(dictionary):
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])
        except KeyError:
            items.update(flatten_hashonly(value))
    return items

As a general note about recursive functions in Python: Python does not do tail-call optimization, hence I've come to avoid recursive functions in performance-critical code (or in code where the depth is not predetermined, to avoid hitting RuntimeError: maximum recursion depth exceeded). There are several ways to do it; my favourite is usually an explicit stack, which exploits the fact that Python uses references for anything but simple types. Untested, but the following is roughly what I mean:

def flatten_hashonly(dictionary):
    items = set()

    if not dictionary:
        return items

    stack = [dictionary]

    while stack:
        value = stack.pop()

        try:
            items.add(value['k'])
        except KeyError:
            stack += list(dictionary['o'].values())

    return items

Maybe this could be of use here as well.

@giovannipizzi (Member)

Thanks @dev-zero! The correct code should be this:

def flatten_hashonly_nonrec(dictionary):
    """Non-recursive variant: collect the hash keys using an explicit stack."""
    items = set()

    if not dictionary:
        return items

    stack = [dictionary]

    while stack:
        # each stack entry is a (sub-)directory; look at its entries
        values = stack.pop()['o'].values()

        for value in values:
            try:
                items.add(value['k'])  # a file: record its hash key
            except KeyError:
                stack += [value]  # a directory: push it for later processing

    return items

(One important note where it took a moment for me to realise: if d is a dictionary, [d] is a list with d as an entry, while list(d) is a list of its keys).
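
(A quick demonstration of that difference:)

d = {'a': 1, 'b': 2}
print([d])      # [{'a': 1, 'b': 2}]  -> a one-element list containing the dict
print(list(d))  # ['a', 'b']          -> a list of the dict's keys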

I checked, and the timing seems the same as the recursive version's, but I agree that a non-recursive version is better, to avoid the recursion-depth error.

@chrisjsewell (Member, author)

Yep fair 👍

@ramirezfranciscof mentioned this pull request May 31, 2021
@chrisjsewell (Member, author) commented Oct 26, 2021

Ok, I've updated this:

  • I've removed iter_object_names for now, so it doesn't complicate things.
  • I've added the Repository.flatten class method, which implements logic similar to the non-recursive functions discussed above.
    • As you can see from the code/test, it returns a dict mapping path -> key/None (see the sketch after this list).
    • The logic of appending a delimiter to the end of folder paths is copied from zippath/tarpath, and avoids key clashes between a file and a folder with the same name in the same parent folder.
    • Incidentally, this format is how I feel the hierarchy should actually be stored in the database (not including the sub-folders): it is a lot simpler and easier to run Postgres JSONB queries on (like has_key).
  • NodeCollection.iter_object_keys then uses Repository.flatten.

As already mentioned, this function will be used both in the backend repository cleaning and in the archive code.
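
(For illustration, an editor's sketch of the flattened form described above; the delimiter, paths and key are assumptions for the example, not taken from the PR code.)

# hypothetical serialized metadata for one file inside one folder
serialized = {'o': {'sub': {'o': {'file.txt': {'k': 'abc123'}}}}}

# the flattened mapping of path -> key/None, with folder paths suffixed
# by the delimiter so they cannot clash with file paths of the same name:
flattened = {'sub/': None, 'sub/file.txt': 'abc123'}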

@chrisjsewell (Member, author)

cc @giovannipizzi @sphuber @dev-zero for re-review

@chrisjsewell (Member, author)

I'm merging this soon

@sphuber (Contributor) left a review comment

I'll leave the check on the recursive implementation to @giovannipizzi or @dev-zero, since they commented on it. I just have to note that I think you haven't addressed my comment on the naming of iter_object_keys. When defined on the collection, it leads to Node.objects.iter_object_keys(), and that strongly suggests you are somehow iterating over the objects (i.e. nodes) in the collection. At no point is it clear that it is dealing with objects from the repository of the node. I really think we should name it iter_repository_object_keys.

@chrisjsewell changed the title from "✨ NEW: Add Node.objects.iter_object_keys" to "✨ NEW: Add Node.objects.iter_repo_keys" on Nov 1, 2021
@chrisjsewell (Member, author)

I think you haven't addressed my comment on the naming for iter_object_keys

Cheers, changed the name to iter_repo_keys.

@chrisjsewell merged commit 2d6df12 into develop Nov 2, 2021
@chrisjsewell deleted the iter_object_keys branch November 2, 2021 16:21
chrisjsewell added a commit to chrisjsewell/aiida_core that referenced this pull request Nov 2, 2021