Rewrite CEMS archiver to use quarterly partitions rather than state-year. #181

e-belfer · 2023-10-11T13:56:34Z

See epic. Closes issue #2929. To make quarterly updates of CEMS data possible, switch from archiving state-year data from the CEMS API to archiving quarterly data for all states and years, since the data is released quarterly rather than annually.

Still to do:

Test using conda-test GCS.
Resolve the "File size too large, try using force_zip64" zipfile error.
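
For reference, the zipfile error in the to-do list above is raised by Python's standard library when an archive member written through a file handle grows past the ZIP64 limit without ZIP64 headers having been reserved. A minimal sketch of the usual workaround follows. It assumes the data arrives as byte chunks; the function name `write_stream_to_zip` and the member name are hypothetical and are not taken from the archiver's code.

```python
import io
import zipfile


def write_stream_to_zip(archive: zipfile.ZipFile, member_name: str, chunks) -> int:
    """Write an iterable of byte chunks into a single archive member.

    force_zip64=True reserves ZIP64 headers up front, which is needed when the
    final size is not known in advance and may exceed 2 GiB.
    """
    written = 0
    with archive.open(member_name, mode="w", force_zip64=True) as member:
        for chunk in chunks:
            member.write(chunk)
            written += len(chunk)
    return written


# Tiny in-memory demonstration; a real quarterly file would be far larger.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    # "epacems-2023q1.csv" is a hypothetical member name.
    total = write_stream_to_zip(archive, "epacems-2023q1.csv", [b"a,b\n", b"1,2\n"])
```

Streaming chunks into the member also avoids holding a whole quarterly file in memory, though whether that matches the approach taken in this PR is not confirmed here.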

Current run-time for the archiver is about 1.5 hours on a GCE VM, with concurrency limited to 2 files to prevent memory overload. A successful sandbox archive can be seen here.

To generate a successful production archive, this will need to be run with the "missing files" validation turned off to accommodate the change in file structure. Waiting for #177 to be merged first would provide additional validation of zipfile contents and improve the datapackage.

zaneselvans

I looked at the datapackage.json, and the archive metadata, and downloaded a file and opened it up, and everything looked good.

This isn't blocking, but I think it would be clearer if we always used "year-quarter" in variable names, logging, and file names rather than sometimes mixing in "quarter-year". That would match the desired sort order, ISO 8601 date formats, and the naming we're already using on the files.
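
The sorting argument can be illustrated with a small sketch. The label format `2021q1` and the helper name `year_quarter_labels` are assumptions made for this example, not the archiver's confirmed naming.

```python
def year_quarter_labels(start_year: int, end_year: int) -> list[str]:
    """Build year-first labels such as '2021q1' in chronological order."""
    return [
        f"{year}q{quarter}"
        for year in range(start_year, end_year + 1)
        for quarter in range(1, 5)
    ]


chronological = year_quarter_labels(2021, 2022)

# The same periods written quarter-first, e.g. 'q1-2021' (hypothetical format).
quarter_first = [f"q{label[-1]}-{label[:4]}" for label in chronological]

# Year-first labels keep chronological order under a plain string sort.
year_first_sorts_correctly = sorted(chronological) == chronological
# Quarter-first labels do not: all the q1 entries group together.
quarter_first_sorts_correctly = sorted(quarter_first) == quarter_first
```

With year-first labels, an alphabetical listing of files is also a chronological one, which is what makes the convention convenient for archive listings.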

Do you know if we're doomed to have to run this on a VM? Or is the resource use such that we could do it via a larger GitHub runner?

Weirdly, it looked like the files were getting sorted alphabetically in both the datapackage.json and the display of the archive on Zenodo. Is that just chance? It only looked sorted when viewing the record, not when editing it.

zaneselvans · 2023-11-23T22:46:57Z

src/pudl_archiver/archivers/epacems.py

-    concurrency_limit = 10  # Number of files to concurrently download
+    concurrency_limit = 2  # Number of files to concurrently download
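
As an illustration of what a limit like this does, here is a minimal sketch that caps concurrent downloads with `asyncio.Semaphore`. It is not the archiver's implementation: how `concurrency_limit` is enforced inside `pudl_archiver` is not shown in this PR, and the names `fetch_quarter` and `fetch_all` are hypothetical.

```python
import asyncio

CONCURRENCY_LIMIT = 2  # mirrors the new value in the diff above


async def fetch_quarter(name: str, limiter: asyncio.Semaphore, active: list[int]) -> str:
    """Stand-in for a real download; tracks how many run at once."""
    async with limiter:
        active[0] += 1
        active[1] = max(active[1], active[0])
        await asyncio.sleep(0.01)  # placeholder for network and zipping work
        active[0] -= 1
    return name


async def fetch_all(names: list[str]) -> tuple[list[str], int]:
    limiter = asyncio.Semaphore(CONCURRENCY_LIMIT)
    active = [0, 0]  # [currently running, peak observed]
    results = await asyncio.gather(
        *(fetch_quarter(name, limiter, active) for name in names)
    )
    return list(results), active[1]


names, peak = asyncio.run(fetch_all([f"2023q{q}" for q in range(1, 5)]))
```

Lowering the limit trades run-time for peak memory, since fewer large files are held in flight at the same moment.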


Did it end up just being impossible to run this archive via GitHub Actions? What's the bottleneck? Would we be able to run it on a 4CPU 16GB runner?

I haven't tried yet, let me see! I think the GitHub action we'd want to use is getting defined in #197, and the existing action will fail because the sandbox DOIs have changed.

Relatedly, @jdangerx, our branches are a bit intertwined, so let me know how you want to handle merge order in this instance.

e-belfer · 2023-11-24T15:00:32Z

> Weirdly, it looked like the files were getting sorted alphabetically in both the datapackage.json and the display of the archive on Zenodo. Is that just chance? It only looked sorted when viewing the record, not when editing it.

I manually sorted the files in the datapackage.json here, since the present version of Zenodo (helpfully) sorts the files automatically when you publish them. We'll need to revisit this when the new API is released; I'm not sure what the behavior will be or whether it will change.
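
A sort like the one described could be sketched as follows, assuming the datapackage follows the usual layout of a top-level `resources` list whose entries carry a `name`. The resource names below are hypothetical, and this is not necessarily how the archiver itself does it.

```python
import json

# Hypothetical, abbreviated datapackage with resources out of order.
datapackage = {
    "name": "epacems",
    "resources": [
        {"name": "epacems-2023q2.zip"},
        {"name": "epacems-1995q1.zip"},
        {"name": "epacems-2023q1.zip"},
    ],
}

# Sort resources by name so the JSON matches the order Zenodo displays.
datapackage["resources"].sort(key=lambda resource: resource["name"])

serialized = json.dumps(datapackage, indent=2)
sorted_names = [resource["name"] for resource in datapackage["resources"]]
```

Because the names are year-first, the alphabetical sort is also chronological.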

WIP needs testing, rewrite to use quarterly rather than state-year partitions

7b805be

e-belfer mentioned this pull request Oct 11, 2023

CEMS: Rewrite and test archiver to pull quarterly files rather than state-year catalyst-cooperative/pudl#2929

Closed

e-belfer added 3 commits October 11, 2023 18:01

Add handling for giant zipfiles in download_and_zip

38098b8

Try a new way to write bytes to zipfile

8bd7fc7

Change file count # for legibility

7d5a4e1

e-belfer changed the base branch from main to zenodo-migration November 1, 2023 20:58

Zip quarters by year

0ddabae

e-belfer self-assigned this Nov 6, 2023

e-belfer changed the base branch from zenodo-migration to main November 22, 2023 20:35

e-belfer and others added 5 commits November 22, 2023 15:41

Fix merge conflicts

660dbe6

Revert "Zip quarters by year"

3414cf9

This reverts commit 0ddabae.

Merge branch 'main' into cems_quarterly

d64a2e2

Reduce concurrency to prevent memory overload

9669bc4

[pre-commit.ci] auto fixes from pre-commit.com hooks

a9e3415

For more information, see https://pre-commit.ci

e-belfer requested a review from zaneselvans November 23, 2023 18:54

e-belfer added the epacems label Nov 23, 2023

e-belfer marked this pull request as ready for review November 23, 2023 19:01

e-belfer mentioned this pull request Nov 23, 2023

CEMS: Repartition extraction process and parquet files. catalyst-cooperative/pudl#2973

Closed

14 tasks

Merge branch 'main' into cems_quarterly

78bec3e

zaneselvans approved these changes Nov 23, 2023

e-belfer and others added 7 commits November 24, 2023 10:01

Standardize to year-quarter

64428e4

Correct year-quarter for missed logging statement

ff1fcb0

update sandbox concept doi

e7d2ffc

Merge branch 'main' into cems_quarterly

8cd0170

Merge branch 'main' into cems_quarterly

45280b4

Merge branch 'main' into cems_quarterly

fd3d3a0

Merge branch 'main' into cems_quarterly

3f52550

e-belfer merged commit 9eab759 into main Dec 4, 2023
3 checks passed

e-belfer deleted the cems_quarterly branch December 4, 2023 20:56
