Refactor archive migrations #4532
Conversation
@ltalirz I have not yet updated the tests, but you can start looking at this 😄
Hm yeah, I think this should be deprecated.
Looks very nice!
Oh and I also need to update
By the way, the
Came across this #3156 - just in case it's easy to fix
and this #3193
And curious to see how migration tests will be affected in #3678
some of them should definitely be quicker, since
ok @ltalirz I have updated all the tests! I moved all migration ones to pytest, and additionally I have consolidated the extracting of tar/zip into
In terms of test timings, here are the top ones. The long-running ones all belong to
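For reference, this is only an illustrative sketch of what a callback-driven tar extraction helper of this kind could look like; the function name, signature and callback protocol here are assumptions, not the actual `safe_extract_tar`/`safe_extract_zip` implementation in aiida-core:

```python
import tarfile
from pathlib import Path
from typing import Callable, Optional


def safe_extract_tar_sketch(
    archive: Path,
    destination: Path,
    callback: Optional[Callable[[str, int], None]] = None,
) -> None:
    """Extract a tar archive into ``destination``, reporting progress via ``callback``.

    Hypothetical sketch: the real aiida-core helpers may use a different callback protocol.
    """
    destination = destination.resolve()
    with tarfile.open(archive, 'r:*') as handle:
        members = handle.getmembers()
        if callback is not None:
            callback('init', len(members))  # announce the total number of members
        for member in members:
            # guard against members that would extract outside the destination folder
            target = (destination / member.name).resolve()
            if target != destination and destination not in target.parents:
                raise ValueError(f'unsafe member path in archive: {member.name}')
            handle.extract(member, destination)
            if callback is not None:
                callback('update', 1)  # one more member extracted
```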
thanks @chrisjsewell, looks great!
I started going through this; will continue tomorrow at some point
Codecov Report
@@ Coverage Diff @@
## develop #4532 +/- ##
===========================================
+ Coverage 79.40% 79.51% +0.12%
===========================================
Files 480 482 +2
Lines 35087 35333 +246
===========================================
+ Hits 27856 28092 +236
- Misses 7231 7241 +10
thanks @chrisjsewell
I went through the code except for the individual tests.
As discussed, we may want to think a bit more about the CacheFolder approach.
@pytest.fixture()
def core_archive():
what is the core archive?
and in general, do we want to define constants as fixtures as well?
it's true that then you don't need to import them, but perhaps importing constants actually makes it easier to understand.
the core archives are ones that point to files in the aiida-core repository, as opposed to the external ones, which point to the aiida-export-migration-tests package.
Although I did point out that having the archives in a separate repo may not really be necessary anymore: aiidateam/aiida-export-migration-tests#13
it was just the easiest way really, to minimise the changes required during the conversion to pytest. I don't think it's too bad, if they are tightly scoped (i.e. the conftest only applies to this single folder);
plus it's a lot easier now that VS Code's IntelliSense recognises pytest fixtures, so they behave exactly the same as if you had imported them 😄
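For concreteness, a minimal sketch of what such a directory-scoped conftest.py could look like; the paths and fixture contents here are illustrative assumptions, not the actual fixtures in this PR:

```python
from pathlib import Path

import pytest

# assumed location of the test archives; the real layout in aiida-core may differ
ARCHIVE_DIR = Path(__file__).parent / 'archives'


@pytest.fixture()
def core_archive():
    """Folder with archives shipped inside the aiida-core repository (hypothetical path)."""
    return ARCHIVE_DIR / 'core'


@pytest.fixture()
def external_archive():
    """Folder with archives from the aiida-export-migration-tests package (hypothetical path)."""
    return ARCHIVE_DIR / 'external'
```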
Although I did point out that having the archives in a separate repo may not really be necessary anymore: aiidateam/aiida-export-migration-tests#13
Right, you are very welcome to move them back to the repo (not sure when someone will touch these tests next).
By the way, there is also git-lfs for storing larger binary files, so it wouldn't even clog up the repository size; I think git now automatically suggests it when trying to store larger files.
Interesting, thanks, I will check that out.
allow for only a single copy of a dict to exist in memory at any time
@ltalirz in b44e350, I made the alterations to
Unscientific test of caching performance, with test duration for:
0.64s, 0.65s, 0.63s, 0.67s, 0.66s
0.65s, 0.70s, 0.98s, 0.68s, 0.77s
But obviously it's a bit hard to tell definitively from such a small archive. What do you think?
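To illustrate the idea being discussed, a minimal sketch of an in-memory caching folder; the real CacheFolder class in aiida-core may have a different interface, but the point is the same: each JSON file is parsed once and written back once, so only a single copy of each dict lives in memory at any time:

```python
import json
from pathlib import Path


class InMemoryCacheFolder:
    """Sketch of a folder wrapper that caches JSON reads/writes in memory."""

    def __init__(self, path: Path):
        self._path = Path(path)
        self._cache = {}  # filename -> parsed dict

    def load_json(self, name: str) -> dict:
        """Return the parsed content, reading from disk only on first access."""
        if name not in self._cache:
            self._cache[name] = json.loads((self._path / name).read_text())
        return self._cache[name]

    def write_json(self, name: str, data: dict) -> None:
        """Replace the cached content; nothing touches the disk yet."""
        self._cache[name] = data

    def flush(self) -> None:
        """Write all cached files back to disk, e.g. after all migrations have run."""
        for name, data in self._cache.items():
            (self._path / name).write_text(json.dumps(data))
```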
@ltalirz, ok I've moved the compression code to the
Run through below with both zip and tar, and speeds seem as expected.
Ok done. So there is no longer a compression/extraction during an
@ltalirz ready for your review 👍
thanks @chrisjsewell - just one request: could you check whether there currently are tar files in the test set?
If not, could you please add one (e.g. use one of the early zipped archives and tar it), so that we are safe from accidentally breaking tar in the future.
I'll be checking the timings & memory consumption in the meantime.
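For example, one way to repack one of the early zipped archives as a tar file is sketched below; the function and file names are placeholders, not the actual test-set filenames:

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path


def zip_to_targz(zip_path: Path, tar_path: Path) -> None:
    """Repack a zipped export archive as a .tar.gz with the same content."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)
        with tarfile.open(tar_path, 'w:gz') as tf:
            # add the extracted content at the archive root, mirroring the zip layout
            for item in Path(tmp).iterdir():
                tf.add(item, arcname=item.name)


# e.g. zip_to_targz(Path('export_v0.4.aiida'), Path('export_v0.4.tar.gz'))  # hypothetical names
```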
@chrisjsewell In my test, the run in this PR takes quite a bit more memory than both
It also runs slower than
This PR (5d62743): 16m18s; top memory consumption ~2200 MB
Firstly I looked at procpath and it seems way over-complicated 😬. In 4dd7fa8 I have reduced the memory usage to the same as develop (and also it looks to have reduced the time a bit).
This is the run when only making the
$ cd tmp
$ syrupy.py -i 1 --separator=, --no-align verdi export migrate mount_folder/two_dimensional_database.tar.gz mount_folder/output_09.zip
SYRUPY: Writing process resource usage samples to 'syrupy_20201103014840.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201103014840.ps.raw'
SYRUPY: Executing command 'verdi export migrate mount_folder/two_dimensional_database.tar.gz mount_folder/output_09.zip'
SYRUPY: Redirecting command output stream to 'syrupy_20201103014840.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201103014840.err.log'
SYRUPY: Completed running: verdi export migrate mount_folder/two_dimensional_database.tar.gz mount_folder/output_09.zip
SYRUPY: Started at 2020-11-03 01:48:40.621727
SYRUPY: Ended at 2020-11-03 02:12:45.366769
SYRUPY: Total run time: 0 hour(s), 24 minute(s), 04.745042 second(s)
$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201103014840.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output.png")'
This is the run when also removing the full
$ syrupy.py -i 1 --separator=, --no-align verdi export migrate mount_folder/two_dimensional_database.tar.gz mount_folder/output_09.zip
SYRUPY: Writing process resource usage samples to 'syrupy_20201103023904.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201103023904.ps.raw'
SYRUPY: Executing command 'verdi export migrate mount_folder/two_dimensional_database.tar.gz mount_folder/output_09.zip'
SYRUPY: Redirecting command output stream to 'syrupy_20201103023904.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201103023904.err.log'
SYRUPY: Completed running: verdi export migrate mount_folder/two_dimensional_database.tar.gz mount_folder/output_09.zip
SYRUPY: Started at 2020-11-03 02:39:04.003449
SYRUPY: Ended at 2020-11-03 03:01:28.654327
SYRUPY: Total run time: 0 hour(s), 22 minute(s), 24.650878 second(s)
$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201103023904.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output.png")'
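As an alternative to plotting, a short sketch for summarising the syrupy samples numerically; it assumes the RSS column of the log is in KiB, as ps usually reports it:

```python
import pandas as pd

# summarise the resource-usage samples recorded by syrupy above
df = pd.read_csv('syrupy_20201103023904.ps.log')
print(f"samples:  {len(df)}")
print(f"peak RSS: {df['RSS'].max() / 1024:.0f} MiB")  # assumes RSS reported in KiB
print(f"mean RSS: {df['RSS'].mean() / 1024:.0f} MiB")
```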
No, the bug was in this PR, for unpacking tar. You would not be able to add a normal test, since it was just writing the same files multiple times, but not actually causing an error (just overwriting them). Obviously this could be picked up in a benchmark timing test. Generally though, I think as long as the cpu/memory performance is equal to
Added in 348242f
Thanks for fixing!
Ah, I see... I was wondering where this weird oscillation came from; so it's throwing the JSON string away but only after it reaches some buffer size?
Ok.
thanks @chrisjsewell, this is good to go from my side
Thanks!
Yeh I guess something like that 🤷♂️
This PR primarily refactors the archive migrations, to provide an ArchiveMigratorAbstract interface, which is agnostic to the internal implementation of the archive (i.e. not dependent on the presence of metadata.json and data.json). This will allow for subsequent changes to the archive format.
To facilitate this:
- MIGRATE_FUNCTIONS now includes both the to and from versions of the migration,
- a CacheFolder class caches file writes in memory, so reading of the files from the file system only happens once and they are written after all the migrations have finished,
- a --verbose flag has been added to verdi export migrate (unfortunately only the long format can be used, since -v is already reserved for the version),
- tar/zip extraction is consolidated into safe_extract_tar / safe_extract_zip; these include callbacks, for which I created aiida.common.progress_reporter::create_callback to provide a convenience function.
This is the output of an example migration: