Proposal to add DbFile table #4919

In line with my comments in #4321 (comment), I feel that with the current implementation of DbNode.repository_metadata it is really quite difficult to perform any task on repository files in an efficient manner. If we had a separate table, something like a DbFile table mapping each node's file paths to their object hash keys (a sketch follows below), file queries, such as finding every node that contains a data-file.xml, would become considerably more performant and simpler.

@sphuber, @giovannipizzi, @ltalirz?
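A minimal sketch of what such a table could look like, written here as a Django-style model; the column names, types, and constraints are hypothetical illustrations, not a definitive schema:

```python
# Hypothetical sketch only: column names, types, and constraints are
# illustrative, not a concrete proposal.
from django.db import models


class DbFile(models.Model):
    # The node this repository file belongs to.
    node = models.ForeignKey('DbNode', on_delete=models.CASCADE, related_name='files')
    # Normalised POSIX path of the file within the node's repository.
    path = models.TextField()
    # Hash key of the object in the disk-objectstore container
    # (a NULL key could denote an empty directory).
    key = models.CharField(max_length=64, null=True, db_index=True)

    class Meta:
        # DB-level guarantee that a node cannot hold two conflicting
        # entries for the same path.
        unique_together = ('node', 'path')
```

With an index on key, checking whether a given hash key is still referenced, or finding every node that contains a given file name, becomes a single indexed query rather than a scan over all repository_metadata values.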
Hi @chrisjsewell - indeed the table was one of our options for the implementation. I see, however, that searching whether an object with a given hash key is ever referenced is important, so we might need to find a compromise solution.
Thanks for the response @giovannipizzi.

So firstly, do we need to actually store the directories? i.e. is it important to store empty directories (since all other directories can be reproduced from the file paths; see the sketch below)? On your point of consistency, this is why I mentioned

```python
from pathlib import PurePosixPath

path = PurePosixPath("a//b")
assert str(path) == "a/b"
```

or

```python
import posixpath

assert posixpath.normpath("a//b") == "a/b"
```
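To illustrate the first point (a sketch, not code from this thread): every ancestor directory of a stored file path can be reconstructed on demand, so only files, plus any deliberately empty directories, would strictly need their own rows.

```python
from pathlib import PurePosixPath

# All ancestor directories are implied by the file path itself.
path = PurePosixPath("calc/out/data-file.xml")
print(list(path.parents))
# [PurePosixPath('calc/out'), PurePosixPath('calc'), PurePosixPath('.')]
```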
Oh, so it appears a kind of similar thing was proposed in https://github.com/aiidateam/AEP/pull/7/files#r627163946? It feels like, if we don't go this route, then that AEP should in some sense be marked as rejected (and the reasons noted), or at least it should be updated and merged, @sphuber?
Yeah, that AEP indeed reflects one of the first designs, which had a separate table for the files. I never finished it because it was rejected after Giovanni made an alternative one, and I never got around to wrapping it up. I think the main problem with the AEPs is that doing them properly just takes a lot of time, and since there is no forcing mechanism yet, it is too easy not to finish them properly before starting to code.
Yeh, personally I don't feel they necessarily need to be "finished" before coding, because most of this stuff fails on first contact anyway, and you end up changing a lot of things. But it would certainly be handy to record why you ended up not doing it exactly that way, e.g. what unanticipated issues you ran into, and what decisions were made for the final iteration, and why.
Yes. There are cases in which you want to store an empty directory, e.g. during the plugin …
Fully agree on using that for consistency at the Python level. My point is that, if possible, I'd love to favour an implementation that guarantees consistency at the DB level, all other parameters (storage cost, efficiency) being similar, i.e. such that you cannot even store two conflicting files/folders in the first place.
As a comment on this, for reference if we end up going this way: for the empty folders, one could decide that a row with an empty hash key represents a directory. E.g. one could enforce normalisation of paths (no double slashes, etc.) at the DB level.

In this case (after, as said above, checking the performance of creating new nodes with all these rows), if we end up going in this direction, I would then try to return the hash keys from an iterator (like e.g. #4922) but with a guarantee on ordering. In this way, we won't need to load the whole list of keys in memory to compare with those in the container; instead, one can compare the difference of two sorted lists more efficiently and incrementally (see e.g. the implementation I'm using in …; a sketch follows below).
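A minimal sketch of that kind of incremental comparison, assuming two iterators that both yield hash keys in sorted order (illustrative only, not the implementation referenced above):

```python
def sorted_difference(left, right):
    """Yield the items of sorted iterable `left` that are missing from sorted iterable `right`.

    Both inputs are consumed lazily, so neither list of hash keys ever
    needs to be held fully in memory.
    """
    sentinel = object()
    right_iter = iter(right)
    current = next(right_iter, sentinel)
    for item in left:
        # Advance the right iterator until it catches up with `item`.
        while current is not sentinel and current < item:
            current = next(right_iter, sentinel)
        if current is sentinel or current != item:
            yield item


# E.g., hash keys referenced in the database but absent from the container:
# dangling = list(sorted_difference(db_keys_sorted, container_keys_sorted))
```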
As a further comment, this time possibly in favour of the current implementation, I did a more detailed timing:

```python
import time

# Django models of the AiiDA Django backend; the exact import path may
# differ between AiiDA versions.
from aiida.backends.djsite.db import models as djmodels


def flatten_hashonly(dictionary):
    """Recursively collect all hash keys ('k') from a repository_metadata dict."""
    items = set()
    if not dictionary:
        return items
    for value in dictionary['o'].values():
        try:
            items.add(value['k'])
        except KeyError:
            items.update(flatten_hashonly(value))
    return items


t = time.time()
results = list(djmodels.DbNode.objects.values_list('repository_metadata', flat=True))
print(time.time() - t)

hashes = set()
t = time.time()
for data in results:
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)
```

than what was reported here: #4922 (comment). The timings I get are 1.7-1.9s for the Django query part and 0.22-0.23s for the Python flattening (total: ~1.9-2.1s). Since in my case the number of nodes in the DB is actually approx. half the number of objects listed in the metadata, a quick guess from linear extrapolation (since we're listing everything and there's no index involved; but I might be wrong of course) is that switching to the suggested proposal might increase the querying time by a factor of ~2x (say, to 3.5s) while removing the need for the Python flattening, for a total of 3.5s, which is larger than the current time (~2s). Of course, it would be good to have the actual timing on a DB that is at least 10x larger, with millions of nodes.
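For context, a minimal example of the nested structure that flatten_hashonly above assumes for repository_metadata ('o' maps entry names to sub-objects, 'k' holds a file's hash key; the values here are made up):

```python
repository_metadata = {
    'o': {
        'calc': {
            'o': {
                'data-file.xml': {'k': '6a5e...'},  # a file: 'k' is the object hash key
                'scratch': {},                      # an empty directory
            },
        },
    },
}

assert flatten_hashonly(repository_metadata) == {'6a5e...'}
```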
And note that if you use the QueryBuilder instead of Django directly, the timing becomes even more unfavourable:

```python
import time

from aiida.orm import Node, QueryBuilder

t = time.time()
query = QueryBuilder()
query.append(Node, project=['repository_metadata'])
results = list(query.all(flat=True))
print(time.time() - t)

hashes = set()
t = time.time()
for data in results:
    hashes.update(flatten_hashonly(data))
print(len(hashes), time.time() - t)
```

(~6.0-7.0s for the QueryBuilder part, ~0.22s for the Python flattening)
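For contrast, with a table along the lines of the hypothetical DbFile sketch at the top of this thread, the motivating question (is a given hash key still referenced anywhere?) would reduce to a single indexed query, with no flattening step at all:

```python
# Hypothetical, reusing the illustrative DbFile model sketched above.
def is_referenced(hash_key):
    """Check whether any node still references the object with this hash key."""
    return DbFile.objects.filter(key=hash_key).exists()


# Similarly, all nodes containing a file named data-file.xml:
node_pks = DbFile.objects.filter(path__endswith='data-file.xml').values_list('node_id', flat=True)
```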
Closing this. With v2.0 and the new repository released, I think changing things like this would require a more detailed up-front discussion and maybe even an AEP.