Rewrite CEMS archiver to use quarterly partitions rather than state-year. #181

Merged
merged 18 commits into from
Dec 4, 2023

Conversation

@e-belfer e-belfer commented Oct 11, 2023

See epic. Closes issue #2929. To make quarterly updates of CEMS data possible, this PR switches the archiver from pulling state-year data from the CEMS API to pulling quarterly data covering all states and years, since the data is released quarterly rather than annually.
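
As a rough illustration of the partition change, the sketch below enumerates year-quarter partitions instead of state-year pairs. The helper names `year_quarters` and `partition_filename`, the year range, and the `epacems-{year}q{quarter}.zip` filename pattern are all assumptions made for this example, not the archiver's actual code.

```python
from itertools import product


def year_quarters(first_year: int, last_year: int) -> list[tuple[int, int]]:
    """Return every (year, quarter) pair in the inclusive year range."""
    return list(product(range(first_year, last_year + 1), range(1, 5)))


def partition_filename(year: int, quarter: int) -> str:
    """Build a hypothetical archive filename for one year-quarter partition."""
    return f"epacems-{year}q{quarter}.zip"


# One file per year-quarter, rather than one per state-year.
partitions = year_quarters(2022, 2023)
filenames = [partition_filename(year, quarter) for year, quarter in partitions]
```

With this layout, a new quarterly release adds one new partition rather than touching a file for every state.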

Still to do:

  • Test using conda-test GCS.
  • Resolve the zipfile error "File size too large, try using force_zip64"
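
For the second item, Python's standard `zipfile` module accepts `force_zip64=True` in `ZipFile.open()` when writing a member whose final size is not known up front. The sketch below shows that call on a tiny in-memory archive. Whether this matches the code path that raises the error in the archiver is an assumption; `write_member` and the member name are hypothetical.

```python
import io
import zipfile


def write_member(archive: zipfile.ZipFile, name: str, chunks) -> None:
    """Stream byte chunks into one archive member using ZIP64 headers."""
    # force_zip64=True makes zipfile write ZIP64 headers from the start, so a
    # streamed member can grow beyond the classic 32-bit size limit.
    with archive.open(name, mode="w", force_zip64=True) as member:
        for chunk in chunks:
            member.write(chunk)


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    write_member(archive, "epacems-2023q1.csv", [b"a,b\n", b"1,2\n"])

# Read the member back to confirm the archive is valid.
with zipfile.ZipFile(buffer) as archive:
    contents = archive.read("epacems-2023q1.csv")
```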

Current run-time for the archiver is about 1.5 hours on a GCE VM, with concurrency limited to 2 files to prevent memory overload. A successful sandbox archive can be seen here.
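
A minimal sketch of how a concurrency limit of 2 can be enforced with `asyncio.Semaphore` follows. It is not the archiver's implementation: `fetch_all` and `fetch_one` are hypothetical names, and the `asyncio.sleep` call stands in for the real download.

```python
import asyncio

CONCURRENCY_LIMIT = 2  # Number of files to download concurrently


async def fetch_all(names: list[str], limit: int = CONCURRENCY_LIMIT) -> tuple[list[str], int]:
    """Run placeholder downloads with at most `limit` in flight.

    Returns the results in input order plus the peak number of simultaneous
    downloads that was observed.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fetch_one(name: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # blocks while `limit` downloads are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real download
            in_flight -= 1
            return name

    results = await asyncio.gather(*(fetch_one(name) for name in names))
    return list(results), peak


results, peak = asyncio.run(fetch_all([f"2023q{q}" for q in range(1, 5)]))
```

Lowering the limit trades run-time for memory: fewer files are held in memory at once, but the downloads take longer overall.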

To generate a successful production archive, this will need to be run with the "missing files" validation turned off, since the file structure has changed. Waiting for #177 to be merged first would provide additional validation of zipfile contents and improve the datapackage.

@e-belfer e-belfer changed the base branch from main to zenodo-migration November 1, 2023 20:58
@e-belfer e-belfer self-assigned this Nov 6, 2023
@e-belfer e-belfer changed the base branch from zenodo-migration to main November 22, 2023 20:35

@zaneselvans zaneselvans left a comment

I looked at the datapackage.json, and the archive metadata, and downloaded a file and opened it up, and everything looked good.

This isn't blocking, but I think it would be clearer if in the variable names, logging, and file names we always used "year-quarter" rather than mixing it with "quarter-year" sometimes, as this would match the desired sorting order, ISO8601 date formats, and the naming we're using on the files already.
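
The sorting argument can be checked with a few lines of Python. The label formats below are made up for illustration and are not necessarily the ones used in the archiver.

```python
# Hypothetical partition labels in the two competing orders.
year_quarter = ["2023q1", "2022q4", "2022q2", "2023q2"]
quarter_year = ["q1-2023", "q4-2022", "q2-2022", "q2-2023"]

# Plain string sorting is chronological only when the year comes first.
sorted_yq = sorted(year_quarter)
sorted_qy = sorted(quarter_year)
```

Here `sorted_yq` comes out in chronological order, while `sorted_qy` groups by quarter number first and interleaves the years.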

Do you know if we're doomed to have to run this on a VM? Or is the resource use such that we could do it via a larger GitHub runner?

Weirdly, it looked like the files were getting sorted alphabetically in both the datapackage.json and the display of the archive on Zenodo. Is that just chance? It only looked sorted when viewing the record, not when editing it.

Comment on lines -77 to +22
- concurrency_limit = 10 # Number of files to concurrently download
+ concurrency_limit = 2 # Number of files to concurrently download

Did it end up just being impossible to run this archive via GitHub Actions? What's the bottleneck? Would we be able to run it on a 4CPU 16GB runner?

@e-belfer e-belfer Nov 24, 2023

I haven't tried yet, let me see! I think the GitHub action we'd want to use is being defined in #197, and the existing action will fail because the sandbox DOIs have changed.

Relatedly, @jdangerx, our branches are somewhat intertwined, so let me know how you want to handle merge order in this instance.

src/pudl_archiver/archivers/epacems.py (two outdated review threads, resolved)
@e-belfer

> Weirdly, it looked like the files were getting sorted alphabetically in both the datapackage.json and the display of the archive on Zenodo. Is that just chance? It only looked sorted when viewing the record, not when editing it.

I manually sorted the files in the datapackage.json here, since in the present version of Zenodo the files automatically sort when you publish them (helpfully). We'll need to revisit this when the new API is released; I'm not sure what its behavior will be or whether it will change.
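
Roughly, sorting the datapackage resources by name looks like the sketch below. The package contents are a made-up minimal example, not the real archive metadata.

```python
import json

# Hypothetical, minimal datapackage with resources in arbitrary order.
datapackage = {
    "name": "example-archive",
    "resources": [
        {"name": "epacems-2023q1.zip"},
        {"name": "epacems-2022q4.zip"},
        {"name": "epacems-2022q3.zip"},
    ],
}

# Sort resources alphabetically by name, which is also chronological for
# year-quarter filenames, so the order matches the published Zenodo record.
datapackage["resources"].sort(key=lambda resource: resource["name"])
serialized = json.dumps(datapackage, indent=2)
resource_names = [resource["name"] for resource in datapackage["resources"]]
```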

@e-belfer e-belfer merged commit 9eab759 into main Dec 4, 2023
3 checks passed
@e-belfer e-belfer deleted the cems_quarterly branch December 4, 2023 20:56