
Refactor Import Archive #4510

Merged: 36 commits, Oct 27, 2020

Conversation

@chrisjsewell (Member) commented Oct 24, 2020:

fixes #2979
fixes #3100

Refactor import of AiiDA archive files.

Mainly see aiida/tools/importexport/dbimport/readers.py and tests/tools/importexport/test_reader.py

@@ -218,6 +218,8 @@ def deserialize_attributes(attributes_data, conversion_data):

def deserialize_field(key, value, fields_info, import_unique_ids_mappings, foreign_ids_reverse_mappings):
    """Deserialize field using deserialize attributes"""
    if key in ('attributes', 'extras'):
@ltalirz (Member) commented Oct 25, 2020:

was this broken before?
is there no test for this?

@chrisjsewell (Member, Author) commented Oct 25, 2020:

So before, the attributes and extras fields were appended to the data after this deserialization. This was really just due to the implementation detail of the current archive format, whereby the node attributes and extras are "special cased" and stored in separate sections of the data.json.

In ReaderJsonZip.iter_entity_fields, you will see that I "remove" this special-casing, by re-merging these fields into the full fields dict, before yielding it.

I guess rather than having this test here, ideally we would change aiida/tools/importexport/common/config.py::get_all_fields_info to include these fields in the returned dict.
This would though necessitate a migration, to update this dict where it is stored in the 0.9 archive (in the all_fields_info key of metadata.json).

One thing I should mention here actually, is that really I want to eventually do a "second round" of refactoring of import_data_dj/import_data_sqla, to find a way for the process to actually make use of the iter_entity_fields iterator, and not have to read all the entities and their fields into memory at the same time.
This would not affect the current archive format, because you have to read the whole of data.json into memory anyway. But obviously with a new format this does not necessarily have to be the case.
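The re-merging described here could be sketched roughly as follows. This is a simplified standalone sketch: the function name follows the comment above, but the exact data.json layout and the real ReaderJsonZip implementation are assumed, not copied from the PR.

```python
def iter_entity_fields(data):
    """Yield (entity_name, pk, fields), re-merging node attributes/extras.

    ``data`` mimics the data.json layout, in which node attributes and
    extras are special-cased into separate top-level sections.
    """
    for entity_name, entities in data['export_data'].items():
        for pk, fields in entities.items():
            if entity_name == 'Node':
                # undo the special-casing: fold the separate sections
                # back into the full fields dict before yielding
                fields = dict(
                    fields,
                    attributes=data['node_attributes'].get(pk, {}),
                    extras=data['node_extras'].get(pk, {}),
                )
            yield entity_name, pk, fields
```

With this shape, downstream import code no longer needs to know that node attributes and extras were ever stored separately.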

ltalirz (Member) replied:

> I guess rather than having this test here, ideally we would change aiida/tools/importexport/common/config.py::get_all_fields_info to include these fields in the returned dict.
> This would though necessitate a migration, to update this dict where it is stored in the 0.9 archive (in the all_fields_info key of metadata.json).

Ok. I will anyhow need a migration for the group extras, so before merging this, I could give you write access to my fork to add this

> One thing I should mention here actually, is that really I want to eventually do a "second round" of refactoring of import_data_dj/import_data_sqla, to find a way for the process to actually make use of the iter_entity_fields iterator, and not have to read all the entities and their fields into memory at the same time.
> This would not affect the current archive format, because you have to read the whole of data.json into memory anyway. But obviously with a new format this does not necessarily have to be the case.

Right. I agree it's good to hold this off for a later round/PR (unless it just "falls out" naturally).

@chrisjsewell (Member, Author) replied:

> Ok. I will anyhow need a migration for the group extras, so before merging this, I could give you write access to my fork to add this

yep makes sense to do it in a single migration. I guess we want to give these fields a new converter_type, that basically tells deserialize_attributes to do nothing to the value.
Something like:

def deserialize_attributes(attributes_data, conversion_data):
    """Deserialize attributes"""
    import datetime
    import pytz

    if conversion_data == "null":
        # new converter type: pass the value through untouched
        return attributes_data

@chrisjsewell (Member, Author) replied:

> Right. I agree it's good to hold this off for a later round/PR

yep indeed 👍

FYI, this will probably also include passing ArchiveData.node_data to the writer as a query or generator for node extras/attributes (as opposed to the current iterator), so that it can perform a chunked query.iterall(batch_size=x) write into the archive and potentially (with a new format) not have to read all that data into memory at the same time.
(which I reckon is the largest source of data from the database).
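The chunked write described here might look something like the following. This is a generic sketch with stand-in names (`chunked`, `write_in_batches`); the real AiiDA writer API and the QueryBuilder.iterall signature are not reproduced here.

```python
from itertools import islice

def chunked(iterable, batch_size):
    """Yield successive lists of at most ``batch_size`` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def write_in_batches(rows, write_batch, batch_size=1000):
    """Stream rows to a writer callback without materialising them all."""
    for batch in chunked(rows, batch_size):
        write_batch(batch)
```

The point of accepting a query/generator rather than a fully-read iterator is that at most ``batch_size`` rows are held in memory at once.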

ltalirz (Member) replied:

> Ok. I will anyhow need a migration for the group extras,

Just wanted to mention that Casper pointed out that, since the group extras are a simple addition (and we don't offer back-migrations), a migration is actually not needed to support group extras, i.e. we could also defer this migration to a later point.

- Remove progressbar module
- Rewrite archive.extract_zip/extract_tar/extract_tree to use progress reporter
- Remove archive.Archive class
- Rewrite cmd_export.inspect to work with readers
- Rewrite migration tests to work without Archive
- Move dbexport.zip contents to common.zip_folder (and deprecate)
- Move .dbexport.utils.ExportFileFormat to common.config
- Refactor django importer to work with progress reporter
- Move readers and writers to separate module
- Add files to mypy

Still todo: sqlalchemy import and refactor migration
codecov bot commented Oct 27, 2020:
Codecov Report

Merging #4510 into develop will increase coverage by 0.11%.
The diff coverage is 91.73%.


@@             Coverage Diff             @@
##           develop    #4510      +/-   ##
===========================================
+ Coverage    79.30%   79.40%   +0.11%     
===========================================
  Files          476      480       +4     
  Lines        34959    35073     +114     
===========================================
+ Hits         27719    27845     +126     
+ Misses        7240     7228      -12     
Flag Coverage Δ
#django 73.52% <68.75%> (+0.36%) ⬆️
#sqlalchemy 72.68% <69.18%> (+0.29%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/tools/importexport/dbexport/utils.py 81.76% <ø> (-0.51%) ⬇️
aiida/cmdline/commands/cmd_export.py 91.16% <75.00%> (+0.19%) ⬆️
aiida/tools/importexport/common/archive.py 77.22% <82.98%> (-4.82%) ⬇️
aiida/tools/importexport/archive/readers.py 86.20% <86.20%> (ø)
aiida/tools/importexport/common/zip_folder.py 90.22% <90.00%> (ø)
aiida/cmdline/commands/cmd_import.py 79.53% <91.67%> (+2.03%) ⬆️
...ida/tools/importexport/dbimport/backends/django.py 93.01% <93.01%> (ø)
aiida/tools/importexport/dbimport/backends/sqla.py 93.78% <93.78%> (ø)
aiida/tools/importexport/archive/common.py 96.30% <96.30%> (ø)
aiida/tools/importexport/archive/writers.py 97.30% <97.30%> (ø)
... and 21 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update 02c8a0c...af6833e.

@chrisjsewell chrisjsewell marked this pull request as ready for review October 27, 2020 07:18
@chrisjsewell chrisjsewell requested a review from ltalirz October 27, 2020 07:19
@ltalirz ltalirz mentioned this pull request Oct 27, 2020
@ltalirz (Member) left a comment:

Hi @chrisjsewell, I've reviewed up to importexport/common/zip_folder.py.

Before I continue, it would be great if you could rebase this PR on develop.
In the meantime, I will test importing stuff and check the timings.

Review threads:
- aiida/cmdline/commands/cmd_export.py (resolved)
- aiida/cmdline/commands/cmd_export.py (resolved)
- aiida/tools/importexport/archive/common.py (outdated, resolved)
- aiida/tools/importexport/archive/common.py (outdated, resolved)
- aiida/tools/importexport/archive/common.py (outdated, resolved)
if nodes_export_subfolder:
    if not isinstance(nodes_export_subfolder, str):
        raise TypeError('nodes_export_subfolder must be a string')
else:
    nodes_export_subfolder = NODES_EXPORT_SUBFOLDER

if not kwargs.get('silent', False):
ltalirz (Member) commented:

In the export, the progress is now being set by the CLI command.
Is there a reason to take a different approach here?

@chrisjsewell (Member, Author) commented Oct 27, 2020:

This was simply for backward compatibility. The functions in this module are no longer actually used for read/write, only in the archive migrations, where the CLI has not yet been changed.
When I refactor that (I think now in a separate PR) I will probably remove this.

ltalirz (Member) replied:

I'll keep this open in case you ever look back to this PR

Review threads:
- aiida/tools/importexport/common/zip_folder.py (outdated, resolved)
- aiida/tools/importexport/common/zip_folder.py (outdated, resolved)
- aiida/tools/importexport/common/zip_folder.py (outdated, resolved)
        return self._zipfile.open(self._get_internal_path(fname), mode)

    def _get_internal_path(self, filename):
        return os.path.normpath(os.path.join(self.pwd, filename))
ltalirz (Member) commented:

shouldn't we be using pathlib?
(several other uses of os.path in this class)

@chrisjsewell (Member, Author) replied:

Err, I'm going to be lazy here and say not now. I have anyhow allowed for the filename in the init to be a Path, and that's the main one.
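For reference, a pathlib-flavoured equivalent of the method above might look like this. This is a sketch, not the PR's code; the key detail is that zip member names always use forward slashes, so the POSIX flavour is the safe choice even on Windows.

```python
import posixpath
from pathlib import PurePosixPath

def get_internal_path(pwd: str, filename: str) -> str:
    """Join ``filename`` onto ``pwd`` and normalise, using '/' separators."""
    # PurePosixPath never inserts backslashes; posixpath.normpath collapses
    # '..' and '.' segments, which PurePosixPath deliberately does not do.
    return posixpath.normpath(str(PurePosixPath(pwd) / filename))
```

Because pure paths don't resolve '..' on their own, the normpath call is still needed to match the behaviour of the os.path version.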

@ltalirz (Member) commented Oct 27, 2020:

Tests:

  • archive with 25.8k links, 9.4k nodes
  • django backend

New implementation:

  • Importing chris.aiida into empty profile: 3m25s
  • Importing chris.aiida again into same profile: 14s
  • Importing chris.aiida into new empty profile: 2m12s
    • ~60% of the time spent importing links
    • Note: it's not entirely clear in which way the disk has "warmed up" here... I'll just check again
  • Importing chris.aiida into another new profile: 2m7s

Old implementation:

  • Importing chris.aiida into empty profile: 3m24s
  • Importing chris.aiida into same profile: 40s
  • Importing chris.aiida into new empty profile: 3m21s

Conclusions

  • The new implementation is a bit faster both on first import (65% of original time) and, for obvious reasons, significantly faster for re-import (35% of original time; this probably goes down further for larger archives)
  • In both the old and the new implementation, the import of links is slow.
    In the new implementation, importing links now dominates total import time (~60%).
    This can probably be optimized at a later point (not this PR).

@ltalirz (Member) commented Oct 27, 2020:

Memory usage during import:

Old implementation: [memory-usage plot]

New implementation: [memory-usage plot]

Also looks fine (new implementation uses a tiny bit less).

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
@ltalirz (Member) left a comment:

@chrisjsewell I'm basically through now - looks all good; only minor comments.

Review threads:
- aiida/tools/importexport/dbimport/__init__.py (outdated, resolved)
- aiida/tools/importexport/dbimport/backends/common.py (outdated, resolved)
    *, group, existing_entries: Dict[str, Dict[str, dict]], new_entries: Dict[str, Dict[str, dict]],
    foreign_ids_reverse_mappings: Dict[str, Dict[str, int]]
):
    """Make an import group containing all imported nodes."""
@ltalirz (Member) commented Oct 27, 2020:

Could you add a docstring including the parameters & return value?

@chrisjsewell (Member, Author) replied:

just added
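A docstring of the kind requested might look something like the following. This is a hypothetical sketch: the function name and the exact parameter semantics are assumed from the signature shown above, not copied from the PR.

```python
from typing import Dict, Optional

def make_import_group(
    *, group: Optional[object],
    existing_entries: Dict[str, Dict[str, dict]],
    new_entries: Dict[str, Dict[str, dict]],
    foreign_ids_reverse_mappings: Dict[str, Dict[str, int]]
) -> None:
    """Make an import group containing all imported nodes.

    :param group: an existing group to add the imported nodes to,
        or None to create a fresh import group
    :param existing_entries: mapping of entity name -> {unique id: fields}
        for entities that already existed in the database
    :param new_entries: mapping of entity name -> {unique id: fields}
        for entities created by this import
    :param foreign_ids_reverse_mappings: mapping of entity name ->
        {unique id: database pk}, used to resolve node pks
    """
```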

Review threads:
- aiida/tools/importexport/dbimport/backends/common.py (outdated, resolved)
- aiida/tools/importexport/dbimport/backends/common.py (outdated, resolved)
- tests/tools/importexport/test_reader.py (outdated, resolved)
- tests/tools/importexport/test_reader.py (resolved)
- tests/tools/importexport/test_reader.py (outdated, resolved)
@chrisjsewell chrisjsewell requested a review from ltalirz October 27, 2020 18:11
ltalirz previously approved these changes Oct 27, 2020
@ltalirz (Member) left a comment:

Thanks a lot for the dedicated effort @chrisjsewell !

The import/export has definitely been one of the most "smelly" subpackages of AiiDA for a long time; it's really great to see it return to a stage where others will feel comfortable improving it as well.

@chrisjsewell (Member, Author) replied:
> Thanks a lot for the dedicated effort @chrisjsewell !

and thanks for the prompt reviews!
Good teamwork; this is how you implement a PR and don't let it stagnate like some of the currently open ones 😜

@chrisjsewell chrisjsewell changed the title from "♻️ Refactor Import Archive" to "Refactor Import Archive" on Oct 27, 2020
@chrisjsewell chrisjsewell merged commit 2f8e845 into aiidateam:develop Oct 27, 2020
@chrisjsewell chrisjsewell deleted the import-refactor branch October 27, 2020 21:18
Successfully merging this pull request may close these issues.

- Unify zip/tar archive format export functions
- Inefficient order of checks in verdi import
2 participants