Rewrite CEMS archiver to use quarterly partitions rather than state-year. #181
This reverts commit 0ddabae.
For more information, see https://pre-commit.ci
I looked at the datapackage.json, and the archive metadata, and downloaded a file and opened it up, and everything looked good.
This isn't blocking, but I think it would be clearer if the variable names, logging, and file names always used "year-quarter" rather than sometimes mixing in "quarter-year". That would match the desired sorting order, ISO 8601 date formats, and the naming we're already using on the files.
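To illustrate the sorting point, here is a minimal sketch. The label formats (`2021q1`, `q1-2021`) are hypothetical and only stand in for whatever the archiver actually uses.

```python
from itertools import product

years = [2021, 2022]
quarters = [1, 2, 3, 4]

# Hypothetical label formats, for illustration only.
year_quarter = [f"{y}q{q}" for y, q in product(years, quarters)]
quarter_year = [f"q{q}-{y}" for y, q in product(years, quarters)]

# Year-quarter labels sort lexicographically into chronological order.
assert sorted(year_quarter) == year_quarter

# Quarter-year labels group all the Q1s together, so a plain sort
# no longer matches chronological order.
print(sorted(quarter_year))
```

With year-first labels, a plain alphabetical sort of file names is also a chronological sort.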
Do you know if we're doomed to have to run this on a VM? Or is the resource use such that we could do it via a larger GitHub runner?
Weirdly, it looked like the files were getting sorted alphabetically in both the datapackage.json and the display of the archive on Zenodo. Is that just chance? It only looked sorted when viewing the record, not when editing it.
- concurrency_limit = 10 # Number of files to concurrently download
+ concurrency_limit = 2 # Number of files to concurrently download
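For context, one common way to enforce a limit like this is an `asyncio.Semaphore`. The sketch below is an assumption about the general pattern, not the archiver's actual code; `download_all` and `fake_fetch` are hypothetical names.

```python
import asyncio


async def download_all(items, fetch, concurrency_limit=2):
    """Run fetch(item) for every item, at most concurrency_limit at a time."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def bounded(item):
        # Only concurrency_limit coroutines can hold the semaphore at once.
        async with semaphore:
            return await fetch(item)

    return await asyncio.gather(*(bounded(item) for item in items))


async def _demo():
    active = 0
    peak = 0

    async def fake_fetch(name):
        # Stand-in for a real download; tracks how many run at once.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return name

    names = ["2022q1", "2022q2", "2022q3", "2022q4"]
    results = await download_all(names, fake_fetch, concurrency_limit=2)
    return results, peak


results, peak = asyncio.run(_demo())
```

Lowering the limit trades run-time for peak memory, since fewer files are held in flight at once.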
Did it end up just being impossible to run this archive via GitHub Actions? What's the bottleneck? Would we be able to run it on a 4CPU 16GB runner?
I haven't tried yet, let me see! I think the GitHub Action we'd want to use is being defined in #197, and the existing action will fail because the sandbox DOIs have changed.
Relatedly, @jdangerx, our branches are in a bit of an intertwined state, so let me know how you want to handle merge order in this instance.
I manually sorted the files in the |
See epic. Closes issue #2929. To make quarterly updates of CEMS data possible, switch from archiving state-year data from the CEMS API to archiving quarterly data for all states and years, since the data is released quarterly rather than annually.
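As a rough sketch of what year-quarter partitioning looks like, the snippet below enumerates partition labels. The function name `year_quarters` and the `YYYYqN` label format are assumptions for illustration, not the archiver's actual API.

```python
def year_quarters(start_year, end_year, last_quarter=4):
    """List "YYYYqN" labels from Q1 of start_year through last_quarter of end_year."""
    labels = []
    for year in range(start_year, end_year + 1):
        # Only the final year may be incomplete.
        final = last_quarter if year == end_year else 4
        for quarter in range(1, final + 1):
            labels.append(f"{year}q{quarter}")
    return labels


partitions = year_quarters(2022, 2023, last_quarter=2)
print(partitions)
```

Under this scheme a new quarterly release adds one partition, rather than changing one file per state for the current year.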
Still to do:
conda-test
GCS.
File size too large, try using force_zip64 zipfile error

Current run-time for the archiver is about 1.5 hours on a GCE VM, with concurrency limited to 2 files to prevent memory overload. A successful sandbox archive can be seen here.
To generate a successful production archive this will need to be run with the "missing files" validation turned off in order to accommodate the change in file structure. Waiting for #177 to be merged first would provide additional validation of zipfile contents and improve the datapackage.
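On the `force_zip64` error in the to-do list: Python's `zipfile` raises it when a member written through `ZipFile.open(..., "w")` grows past 2 GiB without ZIP64 enabled for that member. A minimal sketch of the usual fix follows; `write_member` is a hypothetical helper, not the archiver's code.

```python
import io
import zipfile


def write_member(archive_file, member_name, chunks):
    """Stream chunks of bytes into one zip member, allowing ZIP64 sizes."""
    with zipfile.ZipFile(archive_file, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # force_zip64=True lets the member exceed 2 GiB even though its
        # final size is not known when writing starts.
        with archive.open(member_name, "w", force_zip64=True) as member:
            for chunk in chunks:
                member.write(chunk)


# Small in-memory demonstration; a real archive would be written to disk.
buffer = io.BytesIO()
write_member(buffer, "example.csv", [b"a,b\n", b"1,2\n"])
with zipfile.ZipFile(buffer) as archive:
    content = archive.read("example.csv")
```

Streaming in chunks also avoids holding a whole quarterly file in memory at once.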