
Refactor Import Archive #4510

Merged: 36 commits, Oct 27, 2020

Conversation

@chrisjsewell (Member) commented Oct 24, 2020:

fixes #2979
fixes #3100

Refactor import of AiiDA archive files.

Mainly see aiida/tools/importexport/dbimport/readers.py and tests/tools/importexport/test_reader.py

@@ -218,6 +218,8 @@ def deserialize_attributes(attributes_data, conversion_data):

def deserialize_field(key, value, fields_info, import_unique_ids_mappings, foreign_ids_reverse_mappings):
    """Deserialize field using deserialize attributes"""
    if key in ('attributes', 'extras'):
@ltalirz (Member) commented Oct 25, 2020:

was this broken before?
is there no test for this?

@chrisjsewell (Member, Author) commented Oct 25, 2020:

So before, the attributes and extras fields were appended to the data after this deserialization. This was really just due to the implementation detail of the current archive format, whereby the node attributes and extras are "special cased" and stored in separate sections of the data.json.

In ReaderJsonZip.iter_entity_fields, you will see that I "remove" this special-casing, by re-merging these fields into the full fields dict, before yielding it.

I guess rather than having this test here, ideally we would change aiida/tools/importexport/common/config.py::get_all_fields_info to include these fields in the returned dict.
This would though necessitate a migration, to update this dict where it is stored in the 0.9 archive (in the all_fields_info key of metadata.json).

One thing I should mention here actually, is that really I want to eventually do a "second round" of refactoring of import_data_dj/import_data_sqla, to find a way for the process to actually make use of the iter_entity_fields iterator, and not have to read all the entities and their fields into memory at the same time.
This would not affect the current archive format, because you have to read the whole of data.json into memory anyway. But obviously with a new format this does not necessarily have to be the case.
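The re-merging described here could be sketched roughly as follows. This is a simplified standalone sketch: the function name follows the comment above, but the exact data.json layout and the real ReaderJsonZip implementation are assumed, not copied from the PR.

```python
def iter_entity_fields(data):
    """Yield (entity_name, pk, fields), re-merging node attributes/extras.

    ``data`` mimics the data.json layout, in which node attributes and
    extras are special-cased into separate top-level sections.
    """
    for entity_name, entities in data['export_data'].items():
        for pk, fields in entities.items():
            if entity_name == 'Node':
                # undo the special-casing: fold the separate sections
                # back into the full fields dict before yielding
                fields = dict(
                    fields,
                    attributes=data['node_attributes'].get(pk, {}),
                    extras=data['node_extras'].get(pk, {}),
                )
            yield entity_name, pk, fields
```

With this shape, downstream import code no longer needs to know that node attributes and extras were ever stored separately.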

ltalirz (Member) replied:

> I guess rather than having this test here, ideally we would change aiida/tools/importexport/common/config.py::get_all_fields_info to include these fields in the returned dict.
> This would though necessitate a migration, to update this dict where it is stored in the 0.9 archive (in the all_fields_info key of metadata.json).

Ok. I will anyhow need a migration for the group extras, so before merging this, I could give you write access to my fork to add this

> One thing I should mention here actually, is that really I want to eventually do a "second round" of refactoring of import_data_dj/import_data_sqla, to find a way for the process to actually make use of the iter_entity_fields iterator, and not have to read all the entities and their fields into memory at the same time.
> This would not affect the current archive format, because you have to read the whole of data.json into memory anyway. But obviously with a new format this does not necessarily have to be the case.

Right. I agree it's good to hold this off for a later round/PR (unless it just "falls out" naturally).

@chrisjsewell (Member, Author) replied:

> Ok. I will anyhow need a migration for the group extras, so before merging this, I could give you write access to my fork to add this

yep makes sense to do it in a single migration. I guess we want to give these fields a new converter_type, that basically tells deserialize_attributes to do nothing to the value.
Something like:

def deserialize_attributes(attributes_data, conversion_data):
    """Deserialize attributes"""
    import datetime
    import pytz

    if conversion_data == "null":
        # new converter type: pass the value through untouched
        return attributes_data

@chrisjsewell (Member, Author) replied:

> Right. I agree it's good to hold this off for a later round/PR

yep indeed 👍

FYI, this will probably also include passing ArchiveData.node_data to the writer as a query or generator for node extras/attributes (as opposed to the current iterator), so that it can perform a chunked query.iterall(batch_size=x) write into the archive and potentially (with a new format) not have to read all that data into memory at the same time.
(which I reckon is the largest source of data from the database).
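The chunked write described here might look something like the following. This is a generic sketch with stand-in names (`chunked`, `write_in_batches`); the real AiiDA writer API and the QueryBuilder.iterall signature are not reproduced here.

```python
from itertools import islice

def chunked(iterable, batch_size):
    """Yield successive lists of at most ``batch_size`` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def write_in_batches(rows, write_batch, batch_size=1000):
    """Stream rows to a writer callback without materialising them all."""
    for batch in chunked(rows, batch_size):
        write_batch(batch)
```

The point of accepting a query/generator rather than a fully-read iterator is that at most ``batch_size`` rows are held in memory at once.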

ltalirz (Member) replied:

> Ok. I will anyhow need a migration for the group extras,

Just wanted to mention that Casper pointed out that, since the group extras are a simple addition (and we don't offer back-migrations), a migration is actually not needed to support group extras, i.e. we could also defer this migration to a later point.

- Remove progressbar module
- Rewrite archive.extract_zip/extract_tar/extract_tree to use progress reporter
- Remove archive.Archive class
- Rewrite cmd_export.inspect to work with readers
- Rewrite migration tests to work without Archive
- Move dbexport.zip contents to common.zip_folder (and deprecate)
- Move .dbexport.utils.ExportFileFormat to common.config
- Refactor django importer to work with progress reporter
- Move readers and writers to separate module
- Add files to mypy

Still todo: sqlalchemy import and refactor migration
codecov bot commented Oct 27, 2020:
Codecov Report

Merging #4510 into develop will increase coverage by 0.11%.
The diff coverage is 91.73%.


@@             Coverage Diff             @@
##           develop    #4510      +/-   ##
===========================================
+ Coverage    79.30%   79.40%   +0.11%     
===========================================
  Files          476      480       +4     
  Lines        34959    35073     +114     
===========================================
+ Hits         27719    27845     +126     
+ Misses        7240     7228      -12     
Flag Coverage Δ
#django 73.52% <68.75%> (+0.36%) ⬆️
#sqlalchemy 72.68% <69.18%> (+0.29%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/tools/importexport/dbexport/utils.py 81.76% <ø> (-0.51%) ⬇️
aiida/cmdline/commands/cmd_export.py 91.16% <75.00%> (+0.19%) ⬆️
aiida/tools/importexport/common/archive.py 77.22% <82.98%> (-4.82%) ⬇️
aiida/tools/importexport/archive/readers.py 86.20% <86.20%> (ø)
aiida/tools/importexport/common/zip_folder.py 90.22% <90.00%> (ø)
aiida/cmdline/commands/cmd_import.py 79.53% <91.67%> (+2.03%) ⬆️
...ida/tools/importexport/dbimport/backends/django.py 93.01% <93.01%> (ø)
aiida/tools/importexport/dbimport/backends/sqla.py 93.78% <93.78%> (ø)
aiida/tools/importexport/archive/common.py 96.30% <96.30%> (ø)
aiida/tools/importexport/archive/writers.py 97.30% <97.30%> (ø)
... and 21 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update 02c8a0c...af6833e.

@chrisjsewell chrisjsewell marked this pull request as ready for review October 27, 2020 07:18
@chrisjsewell chrisjsewell requested a review from ltalirz October 27, 2020 07:19
@ltalirz ltalirz mentioned this pull request Oct 27, 2020
@ltalirz (Member) left a comment:

Hi @chrisjsewell, I've reviewed up to importexport/common/zip_folder.py.

Before I continue, it would be great if you could rebase this PR on develop.
In the meantime, I will test importing stuff and check the timings.

Review threads:
- aiida/cmdline/commands/cmd_export.py (resolved)
- aiida/cmdline/commands/cmd_export.py (resolved)
- aiida/tools/importexport/archive/common.py (outdated, resolved)
- aiida/tools/importexport/archive/common.py (outdated, resolved)
- aiida/tools/importexport/archive/common.py (outdated, resolved)
if nodes_export_subfolder:
    if not isinstance(nodes_export_subfolder, str):
        raise TypeError('nodes_export_subfolder must be a string')
else:
    nodes_export_subfolder = NODES_EXPORT_SUBFOLDER

if not kwargs.get('silent', False):
ltalirz (Member) commented:

In the export, the progress is now being set by the CLI command.
Is there a reason to take a different approach here?

@chrisjsewell (Member, Author) commented Oct 27, 2020:

This was simply for backward compatibility. The functions in this module are no longer actually used for read/write, only in the archive migrations, where the CLI has not yet been changed.
When I refactor that (I think now in a separate PR) I will probably remove this.

ltalirz (Member) replied:

I'll keep this open in case you ever look back to this PR

Review threads:
- aiida/tools/importexport/common/zip_folder.py (outdated, resolved)
- aiida/tools/importexport/common/zip_folder.py (outdated, resolved)
- aiida/tools/importexport/common/zip_folder.py (outdated, resolved)
        return self._zipfile.open(self._get_internal_path(fname), mode)

    def _get_internal_path(self, filename):
        return os.path.normpath(os.path.join(self.pwd, filename))
ltalirz (Member) commented:

shouldn't we be using pathlib?
(several other uses of os.path in this class)

@chrisjsewell (Member, Author) replied:

Err, I'm going to be lazy here and say not now. I have anyhow allowed for the filename in the init to be a Path, and that's the main one.
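For reference, a pathlib-flavoured equivalent of the method above might look like this. This is a sketch, not the PR's code; the key detail is that zip member names always use forward slashes, so the POSIX flavour is the safe choice even on Windows.

```python
import posixpath
from pathlib import PurePosixPath

def get_internal_path(pwd: str, filename: str) -> str:
    """Join ``filename`` onto ``pwd`` and normalise, using '/' separators."""
    # PurePosixPath never inserts backslashes; posixpath.normpath collapses
    # '..' and '.' segments, which PurePosixPath deliberately does not do.
    return posixpath.normpath(str(PurePosixPath(pwd) / filename))
```

Because pure paths don't resolve '..' on their own, the normpath call is still needed to match the behaviour of the os.path version.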

@ltalirz (Member) commented Oct 27, 2020:

Tests:

  • archive with 25.8k links, 9.4k nodes
  • django backend

New implementation:

  • Importing chris.aiida into empty profile: 3m25s
  • Importing chris.aiida again into same profile: 14s
  • Importing chris.aiida into new empty profile: 2m12s
    • ~60% of the time spent importing links
    • Note: it's not entirely clear in which way the disk has "warmed up" here... I'll just check again
  • Importing chris.aiida into another new profile: 2m7s

Old implementation:

  • Importing chris.aiida into empty profile: 3m24s
  • Importing chris.aiida into same profile: 40s
  • Importing chris.aiida into new empty profile: 3m21s

Conclusions

  • The new implementation is a bit faster both on first import (65% of original time) and, for obvious reasons, significantly faster for re-import (35% of original time; this probably goes down further for larger archives)
  • In both the old and the new implementation, the import of links is slow.
    In the new implementation, importing links now dominates total import time (~60%).
    This can probably be optimized at a later point (not this PR).

@ltalirz (Member) commented Oct 27, 2020:

Memory usage during import:

Old implementation: [memory-usage plot]

New implementation: [memory-usage plot]

Also looks fine (new implementation uses a tiny bit less).

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
@ltalirz (Member) left a comment:

@chrisjsewell I'm basically through now - looks all good; only minor comments.

Review threads:
- aiida/tools/importexport/dbimport/__init__.py (outdated, resolved)
- aiida/tools/importexport/dbimport/backends/common.py (outdated, resolved)
    *, group, existing_entries: Dict[str, Dict[str, dict]], new_entries: Dict[str, Dict[str, dict]],
    foreign_ids_reverse_mappings: Dict[str, Dict[str, int]]
):
    """Make an import group containing all imported nodes."""
@ltalirz (Member) commented Oct 27, 2020:

Could you add a docstring including the parameters & return value?

@chrisjsewell (Member, Author) replied:

just added
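A docstring of the kind requested might look something like the following. This is a hypothetical sketch: the function name and the exact parameter semantics are assumed from the signature shown above, not copied from the PR.

```python
from typing import Dict, Optional

def make_import_group(
    *, group: Optional[object],
    existing_entries: Dict[str, Dict[str, dict]],
    new_entries: Dict[str, Dict[str, dict]],
    foreign_ids_reverse_mappings: Dict[str, Dict[str, int]]
) -> None:
    """Make an import group containing all imported nodes.

    :param group: an existing group to add the imported nodes to,
        or None to create a fresh import group
    :param existing_entries: mapping of entity name -> {unique id: fields}
        for entities that already existed in the database
    :param new_entries: mapping of entity name -> {unique id: fields}
        for entities created by this import
    :param foreign_ids_reverse_mappings: mapping of entity name ->
        {unique id: database pk}, used to resolve node pks
    """
```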

Review threads:
- aiida/tools/importexport/dbimport/backends/common.py (outdated, resolved)
- aiida/tools/importexport/dbimport/backends/common.py (outdated, resolved)
- tests/tools/importexport/test_reader.py (outdated, resolved)
- tests/tools/importexport/test_reader.py (resolved)
- tests/tools/importexport/test_reader.py (outdated, resolved)
@chrisjsewell chrisjsewell requested a review from ltalirz October 27, 2020 18:11
ltalirz previously approved these changes Oct 27, 2020
@ltalirz (Member) left a comment:

Thanks a lot for the dedicated effort @chrisjsewell !

The import/export has definitely been one of the most "smelly" subpackages of AiiDA for a long time; it's really great to see it return to a stage where others will feel comfortable improving it as well.

@chrisjsewell (Member, Author) replied:
> Thanks a lot for the dedicated effort @chrisjsewell !

and thanks for the prompt reviews!
Good teamwork; this is how you implement a PR and don't let it stagnate like some of the currently open ones 😜

@chrisjsewell chrisjsewell changed the title from "♻️ Refactor Import Archive" to "Refactor Import Archive" on Oct 27, 2020
@chrisjsewell chrisjsewell merged commit 2f8e845 into aiidateam:develop Oct 27, 2020
@chrisjsewell chrisjsewell deleted the import-refactor branch October 27, 2020 21:18
Successfully merging this pull request may close these issues.

- Unify zip/tar archive format export functions
- Inefficient order of checks in verdi import
2 participants