Refactor Import Archive #4510
Conversation
```python
# @@ -218,6 +218,8 @@ def deserialize_attributes(attributes_data, conversion_data):

def deserialize_field(key, value, fields_info, import_unique_ids_mappings, foreign_ids_reverse_mappings):
    """Deserialize field using deserialize attributes"""
    if key in ('attributes', 'extras'):
```
Was this broken before?
Is there no test for this?
So before, the attributes and extras fields were appended to the data after this deserialization. This was really just an implementation detail of the current archive format, whereby the node attributes and extras are "special-cased" and stored in separate sections of the `data.json`.

In `ReaderJsonZip.iter_entity_fields`, you will see that I "remove" this special-casing by re-merging these fields into the full fields dict before yielding it.
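A minimal sketch of that re-merge, assuming the legacy `data.json` layout with top-level `export_data`, `node_attributes` and `node_extras` sections (this is not the actual `ReaderJsonZip` code):

```python
def iter_entity_fields(data):
    """Yield ``(pk, fields)`` per node, with attributes/extras merged back in."""
    for pk, fields in data['export_data']['Node'].items():
        fields = dict(fields)  # shallow copy, so the source dict stays untouched
        # fold the special-cased sections back into the full fields dict
        fields['attributes'] = data['node_attributes'].get(pk, {})
        fields['extras'] = data['node_extras'].get(pk, {})
        yield pk, fields
```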
I guess rather than having this test here, ideally we would change `aiida/tools/importexport/common/config.py::get_all_fields_info` to include these fields in the returned dict. This would, though, necessitate a migration to update this dict where it is stored in the 0.9 archive (in the `all_fields_info` key of `metadata.json`).
One thing I should mention here, actually, is that I eventually want to do a "second round" of refactoring of `import_data_dj`/`import_data_sqla`, to find a way for the process to actually make use of the `iter_entity_fields` iterator, and not have to read all the entities and their fields into memory at the same time. This would not affect the current archive format, because you have to read the whole of `data.json` into memory anyway. But obviously with a new format this does not necessarily have to be the case.
> I guess rather than having this test here, ideally we would change `aiida/tools/importexport/common/config.py::get_all_fields_info` to include these fields in the returned dict. This would, though, necessitate a migration to update this dict where it is stored in the 0.9 archive (in the `all_fields_info` key of `metadata.json`).
Ok. I will anyhow need a migration for the group extras, so before merging this, I could give you write access to my fork to add this.
> One thing I should mention here, actually, is that I eventually want to do a "second round" of refactoring of `import_data_dj`/`import_data_sqla`, to find a way for the process to actually make use of the `iter_entity_fields` iterator, and not have to read all the entities and their fields into memory at the same time. This would not affect the current archive format, because you have to read the whole of `data.json` into memory anyway. But obviously with a new format this does not necessarily have to be the case.
Right. I agree it's good to hold this off for a later round/PR (unless it just "falls out" naturally).
> Ok. I will anyhow need a migration for the group extras, so before merging this, I could give you write access to my fork to add this.
Yep, makes sense to do it in a single migration. I guess we want to give these fields a new `converter_type` that basically tells `deserialize_attributes` to do nothing to the value.
Something like:

```python
def deserialize_attributes(attributes_data, conversion_data):
    """Deserialize attributes"""
    import datetime
    import pytz

    if conversion_data == "null":
        return attributes_data
    # ... existing conversion logic (datetime/pytz handling) follows
```
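For context, a hypothetical `all_fields_info` entry using such a pass-through converter could look like the following, modelled on the existing `convert_type: date` entries (the real `metadata.json` schema may differ):

```python
# Illustrative shape only, not the actual metadata.json contents.
all_fields_info = {
    'Node': {
        'attributes': {'convert_type': 'null'},  # proposed: leave value unchanged
        'extras': {'convert_type': 'null'},      # proposed: leave value unchanged
        'ctime': {'convert_type': 'date'},       # existing datetime conversion
    },
}
```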
> Right. I agree it's good to hold this off for a later round/PR
Yep indeed 👍

FYI, this will also probably include passing `ArchiveData.node_data` to the writer as a query or generator for node extras/attributes (as opposed to the current iterator), so that it can perform a chunked `query.iterall(batch_size=x)` write into the archive and potentially (with a new format) not have to read all that data into memory at the same time (which I reckon is the largest source of data from the database).
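A minimal sketch of that chunked streaming, using AiiDA's `QueryBuilder.iterall` (names other than `iterall`/`batch_size` are illustrative, and this is not the PR's code):

```python
from aiida.orm import Node, QueryBuilder

def iter_node_fields(batch_size=1000):
    """Yield ``(pk, attributes, extras)`` per node, streamed in database chunks."""
    query = QueryBuilder().append(Node, project=['id', 'attributes', 'extras'])
    # iterall fetches results lazily in chunks of ``batch_size``, so the writer
    # never needs to hold every node's attributes/extras in memory at once
    yield from query.iterall(batch_size=batch_size)
```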
> Ok. I will anyhow need a migration for the group extras,
Just wanted to mention that Casper pointed out that, since the group extras are a simple addition (and we don't offer back-migrations), a migration is actually not needed to support group extras, i.e. we could also defer this migration to a later point.
- Remove progressbar module
- Rewrite `archive.extract_zip`/`extract_tar`/`extract_tree` to use progress reporter
- Remove `archive.Archive` class
- Rewrite `cmd_export.inspect` to work with readers
- Rewrite migration tests to work without `Archive`
- Move `dbexport.zip` contents to `common.zip_folder` (and deprecate)
- Move `.dbexport.utils.ExportFileFormat` to `common.config`
- Refactor django importer to work with progress reporter
- Move readers and writers to separate module
- Add files to mypy

Still todo: sqlalchemy import and refactor migration
Codecov Report
```diff
@@            Coverage Diff             @@
##           develop    #4510    +/-   ##
===========================================
+ Coverage    79.30%   79.40%   +0.11%
===========================================
  Files          476      480       +4
  Lines        34959    35073     +114
===========================================
+ Hits         27719    27845     +126
+ Misses        7240     7228      -12
```
Hi @chrisjsewell, I've reviewed up to `importexport/common/zip_folder.py`. Before I continue, it would be great if you could rebase this PR on `develop`. In the meantime, I will test importing stuff and check the timings.
```python
if nodes_export_subfolder:
    if not isinstance(nodes_export_subfolder, str):
        raise TypeError('nodes_export_subfolder must be a string')
else:
    nodes_export_subfolder = NODES_EXPORT_SUBFOLDER

if not kwargs.get('silent', False):
```
In the export, the progress is now being set by the CLI command.
Is there a reason to take a different approach here?
This was simply for backwards compatibility. The functions in this module are no longer actually used for read/write, only in the archive migrations, where the CLI has not yet been changed. When I refactor that (I think now in a separate PR) I will probably remove this.
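For reference, a sketch of the "CLI sets the progress reporter" pattern mentioned above, assuming the `aiida.common.progress_reporter` interface introduced alongside this refactor (treat the exact names as an assumption):

```python
from aiida.common.progress_reporter import get_progress_reporter, set_progress_bar_tqdm

# The CLI opts in to a visible bar once, rather than threading a ``silent``
# kwarg through every function; the default reporter is a no-op.
set_progress_bar_tqdm()

with get_progress_reporter()(total=100, desc='Importing archive') as progress:
    for _ in range(100):
        progress.update()
```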
I'll keep this open in case you ever look back to this PR.
```python
        return self._zipfile.open(self._get_internal_path(fname), mode)

    def _get_internal_path(self, filename):
        return os.path.normpath(os.path.join(self.pwd, filename))
```
Shouldn't we be using pathlib? (There are several other uses of os.path in this class.)
Err, I'm going to be lazy here and say not now. I have anyhow allowed for the filename in the init to be a `Path`, and that's the main one.
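For illustration, a hypothetical pathlib variant of `_get_internal_path` (not what was merged); note that pure paths do not collapse `..` segments the way `os.path.normpath` does, so the swap would not be purely cosmetic:

```python
from pathlib import PurePosixPath

def _get_internal_path(self, filename):
    # zip-internal paths always use '/', so PurePosixPath fits; unlike
    # os.path.normpath, a pure path leaves '..' segments untouched
    return str(PurePosixPath(self.pwd) / filename)
```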
Tests: timings comparing the new implementation against the old, followed by conclusions (benchmark details elided).
Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
@chrisjsewell I'm basically through now - looks all good; only minor comments.
```python
    *, group, existing_entries: Dict[str, Dict[str, dict]], new_entries: Dict[str, Dict[str, dict]],
    foreign_ids_reverse_mappings: Dict[str, Dict[str, int]]
):
    """Make an import group containing all imported nodes."""
```
Could you add a docstring including the parameters & return value?
Just added.
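For illustration, the requested docstring could look roughly like this (the function name and the parameter descriptions are hypothetical, inferred from the signature shown above, not the merged text):

```python
from typing import Dict

def _make_import_group(
    *, group, existing_entries: Dict[str, Dict[str, dict]],
    new_entries: Dict[str, Dict[str, dict]],
    foreign_ids_reverse_mappings: Dict[str, Dict[str, int]]
):
    """Make an import group containing all imported nodes.

    :param group: an existing group to add the imported nodes to, or None to create one
    :param existing_entries: entities already present in the database, keyed by entity type, then unique id
    :param new_entries: entities created during this import, keyed by entity type, then unique id
    :param foreign_ids_reverse_mappings: map of entity type -> unique id -> database pk
    :return: the group containing the imported nodes
    """
```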
Thanks a lot for the dedicated effort @chrisjsewell! The import/export has definitely been one of the most "smelly" subpackages of AiiDA for a long time; it's really great to see it return to a stage where others will feel comfortable improving it as well.

And thanks for the prompt reviews!
Fixes #2979
Fixes #3100

Refactor import of AiiDA archive files. Mainly see `aiida/tools/importexport/dbimport/readers.py` and `tests/tools/importexport/test_reader.py`.
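As a rough illustration of the new reader-based flow (the import and method signature below are inferred from the module path and the discussion above; see `readers.py`/`test_reader.py` for the actual interface):

```python
# Hypothetical usage; exact names/signatures may differ from the merged code.
from aiida.tools.importexport.dbimport.readers import ReaderJsonZip

with ReaderJsonZip('export.aiida') as reader:
    # iterate entity fields lazily, with attributes/extras already merged in
    for pk, fields in reader.iter_entity_fields('Node'):
        print(pk, sorted(fields))
```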