
Proposal to add DbFile table #4919
Closed · chrisjsewell opened this issue May 6, 2021 · 10 comments

@chrisjsewell (Member) commented May 6, 2021:

In line with my comments in #4321 (comment),
I feel that with the current implementation of DbNode.repository_metadata it is quite difficult to perform any task on repository files in an efficient manner.

If we had a separate table, something like:

from django.db import models as m

class DbFile(m.Model):
    node = m.ForeignKey('DbNode', on_delete=m.CASCADE)
    path = m.CharField(max_length=255, db_index=True)
    key = m.CharField(max_length=255)
    # also a type column, if we need to store directories?

    class Meta:
        unique_together = (('node', 'path'),)

This would make file queries far more performant and straightforward, for example (a rough sketch follows the list):

  • Generating the list of unique keys (i.e. hashes) in the database (required to clean the repository)
  • Performing file-system-like operations that were previously possible when the files were stored on disk, e.g. searching for all files named data-file.xml
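
For instance, a minimal sketch of what these two queries could look like with the hypothetical DbFile model above (Django ORM; purely illustrative, not existing AiiDA API):

from django.db.models import Q

# unique object keys (hashes) referenced anywhere in the database
unique_keys = set(DbFile.objects.values_list('key', flat=True).distinct())

# all files named data-file.xml, at the top level or in any subdirectory
matches = DbFile.objects.filter(
    Q(path='data-file.xml') | Q(path__endswith='/data-file.xml')
).values_list('node_id', 'path')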

@sphuber, @giovannipizzi, @ltalirz?

@giovannipizzi (Member):

Hi @chrisjsewell - indeed the table was one of our options for the implementation.
Here I list the reasons (that I remember) for the current choice. Personally, I'm happy to re-evaluate it, but I want to note other operations that we should benchmark before deciding to switch.

  • consistency at the DB level: with the current nested dictionary, it's very clear which entries are directories and which are files, and there is no possibility to store inconsistent data, e.g. a//b (double slash), the same entry stored twice under similar syntax (a//b and a/b), or a file without its parent folder (a/b/c without a/b), and other similar things. [This is something I would accept to compromise on if really needed]
  • Your proposal would of course make such queries more efficient (but note that you would also need to add a db_index to key, since you will often want to check whether a given hash exists, e.g. before deciding to delete an object from the container). However, it's not clear (meaning: preliminary tests showed it to be inefficient, but I don't have the numbers, so one would need to double-check) whether, while storing objects, the additional cost of storing many rows (at least one per file) would instead be a big performance hit (and similarly, to get the list of all files in the repo, one would need to fetch many rows, which is often inefficient, at least if not done properly). Before deciding to switch, I'd like to see a performance comparison also on these kinds of operations, which are quite common in daily AiiDA usage (a rough sketch of such a write-path benchmark is at the end of this comment).
  • To address the consistency point above, we tried to design a table that could represent either folders or files, with one-to-many links in the DB to represent the "lives in this folder" relationship, but we couldn't find a working scheme that doesn't mix files and folders in inefficient ways (and it would increase even further the number of rows to fetch for every file stored in a nested subfolder).
  • We thought that searching for a file with a given filename was a very rare operation, so we didn't need to optimise for it.

I see, however, that checking whether an object with a given hash key is still referenced anywhere is important, so we might need to find a compromise solution.
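
To make the write-path concern above concrete, a rough sketch of the kind of benchmark meant here, assuming the hypothetical DbFile model from the proposal (names such as nodes_with_files are placeholders, not existing AiiDA code):

import time

def store_node_files(node, flattened):
    """Store one DbFile row per file for a single node; flattened is an iterable of (path, key) pairs."""
    DbFile.objects.bulk_create(
        [DbFile(node=node, path=path, key=key) for path, key in flattened]
    )

start = time.time()
for node, flattened in nodes_with_files:  # placeholder: however the benchmark provides its data
    store_node_files(node, flattened)
print('write time:', time.time() - start)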

@chrisjsewell (Member, author) commented May 6, 2021:

thanks for the response @giovannipizzi

consistency at the DB level: with the current nested dictionary, it's very clear which entries are directories and which are files, and there is no possibility to store inconsistent data, e.g. a//b (double slash), the same entry stored twice under similar syntax (a//b and a/b), or a file without its parent folder (a/b/c without a/b), and other similar things. [This is something I would accept to compromise on if really needed]

So firstly, do we actually need to store the directories? I.e. is it important to store empty directories (since non-empty directories can be reproduced from the file paths)?

On your point of consistency, this is why I mentioned PurePosixPath in #4321:

from pathlib import PurePosixPath
path = PurePosixPath("a//b")
str(path) == "a/b"  # True

or

import posixpath
posixpath.normpath("a//b")  # 'a/b'

@chrisjsewell (Member, author):

Oh, so it appears a similar thing was proposed in https://github.com/aiidateam/AEP/pull/7/files#r627163946? It feels like, if we don't go this route, that AEP should in some sense be marked as rejected (and the reasons noted), or at least it should be updated and merged, @sphuber?
(I'm not sure if we have quite got the hang of these AEPs yet lol)

@sphuber (Contributor) commented May 6, 2021:

Yeah, that AEP indeed reflects one of the first designs, which had a separate table for the files. I never finished it because it was rejected after Giovanni made an alternative one, and I never got around to wrapping it up. I think the main problem of the AEPs is that in order to do them properly it just takes a lot of time, and since there is no forcing mechanism yet, it is too easy to not properly finish them before starting to code.

@chrisjsewell (Member, author):

I think the main problem of the AEPs is that in order to do them properly it just takes a lot of time, and since there is no forcing mechanism yet, it is too easy to not properly finish them before starting to code.

Yeah, personally I don't feel they necessarily need to be "finished" before coding, because most of this stuff fails on first contact anyway and you end up changing a lot of things. But it would certainly be handy to record why you ended up not doing it exactly that way, e.g. what unanticipated issues you ran into, and what decisions were made (and why) for the final iteration.

@giovannipizzi (Member):

do we need to actually store the directories?

Yes. There are cases in which you want to store an empty directory, e.g. during the plugin prepare_for_submission phase, and keep it in the file repo, because the empty directory is required by the underlying code. This was the case at least with QE. Of course, with the new implementation, the object store will not "see" it.

On your point of consistency, this is why I mentioned PurePosixPath in #4321

Fully agree on using that for consistency at the Python level. My point is that, if possible, I'd love to favour an implementation that guarantees consistency at the DB level, all other parameters (storage cost, efficiency) being similar, i.e. one where you cannot even store two conflicting files/folders/...
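
For illustration only, a minimal sketch of how some of these invariants could be enforced at the DB level with Django constraints, assuming the hypothetical DbFile model from the proposal (not an agreed design):

from django.db import models as m
from django.db.models import Q

# inside the hypothetical DbFile model:
class Meta:
    constraints = [
        # no duplicate paths within one node
        m.UniqueConstraint(fields=['node', 'path'], name='dbfile_unique_node_path'),
        # reject non-normalised paths: no double slashes, no leading or trailing slash
        m.CheckConstraint(
            check=~Q(path__contains='//') & ~Q(path__startswith='/') & ~Q(path__endswith='/'),
            name='dbfile_path_normalised',
        ),
    ]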

@giovannipizzi (Member):

As a comment on this, for reference in case we end up going this way.

For the empty folders, one could decide that if key is None, the row represents an empty folder. We'll need to be careful with the validation of new entries, though.

E.g. one could enforce normalisation of paths (no double /, no leading or trailing /, etc.), and then the uniqueness constraint in the DB starts helping.
We would also need to decide whether it's valid to have a/b/c without a folder a/b: maybe OK, and its specification is optional and only needed when an empty folder is required. Maybe this is all acceptable in terms of validation of the content.
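
Purely illustrative, a rough sketch of what such Python-level validation might look like (validate_entry is a made-up name; it assumes the DbFile-style rows discussed above, with key=None meaning an empty folder):

import posixpath

def validate_entry(path: str, key) -> None:
    """Validate one (path, key) row before insertion; key=None marks an empty folder."""
    if posixpath.normpath(path) != path:
        raise ValueError(f'path {path!r} is not normalised (no double slashes, no trailing slash, ...)')
    if path.startswith('/') or path == '..' or path.startswith('../'):
        raise ValueError(f'path {path!r} must be relative and stay inside the repository')
    # key is None -> the row represents an (empty) directory; otherwise it must be an object hash
    if key is not None and not isinstance(key, str):
        raise ValueError(f'key {key!r} must be a string hash or None')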

In this case (after checking, as said above, the performance of creating new nodes with all these rows), if we end up going in this direction, I would then try to return the hash keys from an iterator (like e.g. #4922), but with a guarantee on ordering.

In this way, we won't need to load the whole list of keys in memory to compare with those in the container; instead, one can compare two sorted lists more efficiently and incrementally. See e.g. the implementation I'm using in disk-objectstore: it iterates over two iterators in parallel (say a left and a right one) and yields each value in sorted order, together with whether it was found only in the left iterator, only in the right one, or in both. This essentially allows one to compute (sorted) unions, intersections and differences of two sorted sets without needing to hold more than two elements in memory at any given time.
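
A minimal sketch of this kind of sorted-merge iteration (illustrative only; the actual disk-objectstore implementation may differ in its details):

from typing import Any, Iterable, Iterator, Tuple

_SENTINEL = object()

def merge_sorted(left: Iterable, right: Iterable) -> Iterator[Tuple[Any, str]]:
    """Yield (value, where) in sorted order, where 'where' is 'left', 'right' or 'both'.

    Both inputs must already be sorted and free of duplicates; at most two
    elements are held in memory at any given time.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, _SENTINEL)
    r = next(right_it, _SENTINEL)
    while l is not _SENTINEL or r is not _SENTINEL:
        if r is _SENTINEL or (l is not _SENTINEL and l < r):
            yield l, 'left'
            l = next(left_it, _SENTINEL)
        elif l is _SENTINEL or r < l:
            yield r, 'right'
            r = next(right_it, _SENTINEL)
        else:  # equal values: present in both iterators
            yield l, 'both'
            l = next(left_it, _SENTINEL)
            r = next(right_it, _SENTINEL)

For example, with merge_sorted(sorted_db_keys, sorted_container_keys), the objects that are safe to delete from the container would be those yielded with 'right' (present in the container but never referenced in the DB).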

@giovannipizzi (Member):

As a further comment, this time possibly in favour of the current implementation, I did a more detailed timing than the one reported in #4922 (comment), using:

# 'djmodels' is the Django backend models module of AiiDA (e.g. aiida.backends.djsite.db.models)
import time

def flatten_hashonly(dictionary):
    """Collect all object hash keys ('k' entries) from a nested repository_metadata dictionary."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])
        except KeyError:
            # no 'k' key: this entry is a directory, recurse into it
            items.update(flatten_hashonly(value))
    return items

t = time.time()
results = list(djmodels.DbNode.objects.values_list('repository_metadata', flat=True))
print(time.time() - t)

hashes = set()
t = time.time()
for data in results:
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)

It's the same mid-size DB (~320k nodes, ~340k different hash keys, ~510k objects listed in the nodes' repository_metadata).

The timings I get are 1.7-1.9s for the query part (Django directly) and 0.22-0.23s for the Python flattening (total: ~1.9-2.1s).

Since in my case the number of nodes in the DB is approximately half the number of objects listed in the metadata, a quick guess by linear extrapolation (since we're listing everything and there's no index involved; but I might be wrong, of course) is that switching to the suggested proposal might increase the querying time by a factor of ~2x (say to 3.5s), while removing the need for the Python flattening, so a total of ~3.5s, which is larger than the current time (~2s).

Of course, it would be good to have the actual timing on a DB that is at least 10x larger, with millions of nodes.

@giovannipizzi (Member):

And note that if you use the QueryBuilder instead of Django directly, the timing becomes even more unfavourable:

import time
from aiida.orm import Node, QueryBuilder

t = time.time()
query = QueryBuilder()
query.append(Node, project=['repository_metadata'])
results = list(query.all(flat=True))
print(time.time() - t)

hashes = set()
t = time.time()
for data in results:
    hashes.update(flatten_hashonly(data))  # same helper as in the previous comment
print(len(hashes), time.time() - t)

(~6.0-7.0s for the QueryBuilder part, ~0.22s for the Python flattening)

@sphuber (Contributor) commented Apr 28, 2022:

Closing this. With v2.0 and the new repository released, I think changing things like this would require a more detailed up-front discussion and maybe even an AEP.

@sphuber closed this as completed Apr 28, 2022