🐛 FIX: Import archive into large DB #5740
Conversation
As detailed in https://www.sqlite.org/limits.html, SQLITE_MAX_VARIABLE_NUMBER puts a limit on how many variables can be used in a single SQL query. This limit can easily be reached when filtering by nodes in a large database. Therefore, this commit changes the filtering of UUIDs to be done on the client side, then batches the queries for the full nodes by a fixed number.
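To illustrate the approach described above (a minimal sketch only; the helpers new_uuids and batch_iter are invented for this example and are not the actual aiida-core code): the UUID set difference is computed in Python rather than in a single WHERE uuid IN (...) clause, and the follow-up queries are then issued in chunks of at most filter_size bind parameters.

FILTER_SIZE = 999  # SQLite's default SQLITE_MAX_VARIABLE_NUMBER before 3.32.0

def new_uuids(archive_uuids, profile_uuids):
    """Hypothetical helper: UUIDs present in the archive but not yet in the profile, computed client side."""
    return list(set(archive_uuids) - set(profile_uuids))

def batch_iter(items, size=FILTER_SIZE):
    """Hypothetical helper: yield fixed-size chunks, so each query binds at most `size` variables."""
    for start in range(0, len(items), size):
        yield items[start:start + size]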
Thanks @chrisjsewell. Apart from the comments in the code, some other questions:
- Given that this is fixing a problem for importing from large sqlite archives, does the same problem not also likely exist for exporting to/from an sqlite database? I think that since these are often temporary databases, exporting from one is unlikely, but exporting to one would actually be a common use case, would it not?
- I presume you have tested this manually? Would it be impossible or too costly to implement a test? Or can we have a test that reproduces this, simply by creating an archive with >999 nodes and importing that?
def import_archive(
    path: Union[str, Path],
    *,
    archive_format: Optional[ArchiveFormatAbstract] = None,
    filter_size: int = 999,
Might be useful to put a comment here (or in the commit message) saying that this is the default limit in SQLite versions prior to 3.32.0 (2020-05-22).
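As an aside (not part of the PR): on Python 3.11+ the limit actually enforced by the linked sqlite3 build can be inspected directly, which may help when choosing a value for filter_size.

import sqlite3

conn = sqlite3.connect(':memory:')
# Requires Python 3.11+ for Connection.getlimit; the value is build dependent
# (999 for SQLite < 3.32.0, 32766 by default in newer versions).
print(conn.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER))
conn.close()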
aiida/tools/archive/imports.py
@@ -117,6 +129,7 @@ def import_archive(
    type_check(test_run, bool)
    backend = backend or get_manager().get_profile_storage()
    type_check(backend, StorageBackend)
    qparams = QueryParams(batch_size=batch_size, filter_size=filter_size)
nitpick: why not just have query_params for increased legibility? The few extra characters shouldn't hurt, should they?
    ).iterdict(batch_size=batch_size)

    # collect the unique entities from the input backend to be added to the output backend
    ufields = []
Do I understand correctly that here you now read all the unique fields for this entity into memory from the input backend, whereas before this was streamed directly into the bulk_insert? Could this lead to memory problems, given that this very fix is meant to deal with large imports?
Well, only those that are not already in the output backend, e.g. it will be a list of all node UUIDs that are in the archive but not in the profile.
So no, I don't think this is a particularly big hit on memory usage, because you are only reading the UUIDs, as opposed to the full node content (attributes etc.).
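As a rough illustration of that point (a sketch under my own naming, not the exact code in this PR): only the UUID strings are held in memory, and the full node fields are then fetched batch by batch.

from aiida import orm

def iter_full_nodes(missing_uuids, filter_size=999):
    """Sketch: query full node rows in chunks, so each query stays under the variable limit."""
    for start in range(0, len(missing_uuids), filter_size):
        batch = missing_uuids[start:start + filter_size]
        query = orm.QueryBuilder().append(orm.Node, filters={'uuid': {'in': batch}}, project=['**'])
        yield from query.iterdict()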
Co-authored-by: Sebastiaan Huber <mail@sphuber.net>
To clarify, this is a problem importing into large profiles, not from large archives.
I see, I didn't get that clearly. Could you perhaps add more information in the comments in the relevant code changes and the commit message? And if it is an sqlite limitation, does that mean there is someone with a large sqlite database that they are importing data into? This should anyway not be recommended, should it? Where did you come across this bug?
Looking at the code, I don't really understand why the problem is with the target backend. The only point where the target backend (
This reproduces the problem, and is fixed after the PR, for Python 3.8 from conda-forge, including sqlite v3.39.4 (h9ae0607_0). You'll note that I couldn't actually recreate it until there were 300,000 nodes (at least 200,000 was fine), so perhaps
The test takes ~105 seconds to run, so maybe not great to include:

from aiida import orm, manage, tools
from aiida.tools.archive import create_archive, import_archive


def test_import_into_large_profile(aiida_profile_clean, tmp_path):
    backend = manage.get_manager().get_profile_storage()
    user_id = backend.default_user.id
    pks = backend.bulk_insert(orm.EntityTypes.NODE, [{"user_id": user_id} for _ in range(300_000)], allow_defaults=True)
    assert orm.QueryBuilder().append(orm.Node).count() == 300_000
    create_archive(None, filename=tmp_path / 'archive.aiida')
    tools.delete_nodes([pks[0]], dry_run=False)
    assert orm.QueryBuilder().append(orm.Node).count() == 300_000 - 1
    import_archive(tmp_path / 'archive.aiida')
@sphuber it's not a problem with the target backend, it's when the target backend and the input backend have lots of nodes in common, so that
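For reference, the underlying sqlite error can be reproduced outside of AiiDA with a standalone snippet (my own illustration, not part of the PR) that binds more parameters in a single query than the build allows:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE node (uuid TEXT)')
params = [str(i) for i in range(100_000)]
placeholders = ','.join('?' for _ in params)
try:
    conn.execute(f'SELECT uuid FROM node WHERE uuid IN ({placeholders})', params)
except sqlite3.OperationalError as exc:
    # On builds where SQLITE_MAX_VARIABLE_NUMBER is below the parameter count,
    # this prints something like "too many SQL variables".
    print(exc)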
Cheers, appreciate that. Confirmed that it also breaks for me locally. I agree that we don't want to add this to the unit tests 😅
Thanks, that really helps, I now understand the problem. Think it would be great to add a comment at the line with
Last thing: I noticed you change
All good, thanks @chrisjsewell
Thanks for the review 😄
As detailed in https://www.sqlite.org/limits.html, SQLITE_MAX_VARIABLE_NUMBER puts a limit on how many variables can be used in a single SQL query. This limit can easily be reached when filtering by nodes in a large database, leading to:
Therefore, this commit changes the filtering of UUIDs to be done on the client side, then batches the queries for the full node fields by a fixed number (filter_size).