
Improve and Automate raw data archiving/access #1418

Closed · 6 of 12 tasks
bendnorman opened this issue Jan 21, 2022 · 5 comments
Labels: epic (Any issue whose primary purpose is to organize other issues into a group.)

bendnorman (Member) commented Jan 21, 2022

Description

This Epic tracks updates to the data archiving and access processes. The previous process for creating new archives involved first running the scraper to download new data locally; the archiver could then be used to upload that data to Zenodo and create a new archive version. This manual process makes updating archives somewhat difficult and requires someone to be aware of upstream updates, which often leads to stale data. Combining the archiver and scrapers will not only simplify this process but also make automation much easier.

Once new data archives are created, there is still no easy way to access these raw archives outside of PUDL, because the Datastore that PUDL uses to access them is embedded within PUDL. Making the Datastore a standalone software package would allow client projects and individual users to access these archives.
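
A minimal sketch of what that could look like from a client project, assuming a hypothetical standalone package called `pudl_datastore`; the class and method names are illustrative, not the actual PUDL interface.

```python
# Sketch: using a standalone Datastore package from a client project. The
# package name, class, and fetch() method are hypothetical assumptions.
from pathlib import Path

from pudl_datastore import Datastore  # hypothetical standalone package

# Point the datastore at a local cache; anything missing would be pulled
# down from the pinned Zenodo archive for that dataset.
ds = Datastore(local_cache_path=Path.home() / ".cache" / "pudl")

# Fetch a raw resource by dataset name and filter, getting a local path back.
ferc1_path = ds.fetch("ferc1", year=2020)
print(f"Raw FERC Form 1 data cached at: {ferc1_path}")
```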

Scope

- How do we know when we are done? This epic is done when dataset archives are updated automatically.
- What is out of scope? Integrating specific datasets.

Tasks

Archiver

PUDL Integration

  • Notify PUDL when a new archive is created (see the sketch after this list)
  • Kick off nightly build to detect problems stemming from new data
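
One plausible mechanism for the notification task above is GitHub's repository_dispatch API, which a workflow in the PUDL repo could listen for. This is a sketch, not the implemented design; the event name and payload fields are assumptions.

```python
# Sketch: notify the PUDL repo that a new raw archive exists by firing a
# repository_dispatch event. The event_type and client_payload fields are
# illustrative assumptions; only the GitHub REST endpoint itself is real.
import os

import requests


def notify_pudl(dataset: str, doi: str) -> None:
    """Fire a repository_dispatch event at the PUDL repo (illustrative)."""
    resp = requests.post(
        "https://api.github.com/repos/catalyst-cooperative/pudl/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "event_type": "new-raw-archive",  # hypothetical event name
            "client_payload": {"dataset": dataset, "doi": doi},
        },
        timeout=30,
    )
    resp.raise_for_status()
```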

Create standalone Datastore

  • Move datastore source code to a new repo so it can be used as a library
  • Pull over tests from PUDL and set up CI
  • Implement a basic CLI for accessing data (see the sketch after this list)
  • Package the Datastore on PyPI and conda-forge
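
A sketch of the basic data-access CLI from the task above, using argparse; the command name, flags, and Datastore API are hypothetical assumptions consistent with the library sketch earlier.

```python
# Sketch: a minimal CLI wrapper around the hypothetical standalone Datastore.
import argparse

from pudl_datastore import Datastore  # hypothetical standalone package


def main() -> None:
    parser = argparse.ArgumentParser(
        prog="pudl-datastore",
        description="Fetch raw PUDL input data from versioned archives.",
    )
    parser.add_argument("dataset", help="Dataset to fetch, e.g. ferc1 or epacems.")
    parser.add_argument("--year", type=int, help="Optional year filter.")
    args = parser.parse_args()

    filters = {"year": args.year} if args.year is not None else {}
    print(Datastore().fetch(args.dataset, **filters))


if __name__ == "__main__":
    main()
```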
bendnorman self-assigned this Jan 21, 2022
bendnorman added the epic label Jan 21, 2022
zschira self-assigned this and unassigned bendnorman Sep 13, 2022
zschira changed the title from "Automate our scraping and archiving" to "Improve and Automate raw data archiving/access" Jan 9, 2023
jdangerx (Member) commented

We had mentioned maybe "Try adding a new dataset and see if our automation picks it up and archives it" as the final definition of done. What do you think @zschira? Or is that just part of catalyst-cooperative/pudl-archiver#2?

zaneselvans (Member) commented

> Kick off nightly build to detect problems stemming from new data

I feel like there are 2 ways we could approach this.

  • Use freshly archived data to do a nightly build, but only using the previously covered range of data. This would still integrate any corrections or tweaks made to older data, which happens quite frequently; it could be valuable, and I think it usually wouldn't require any human intervention. The only thing we would need to do to make the PR is update the list of DOIs referenced, which would be very easy if they were in a YAML settings file rather than in the code itself (see the sketch after this list).
  • Try to process all the newly archived data, including new years that were previously unavailable. This will almost always require human intervention for some data sources (e.g. for FERC to EIA plant/util ID mapping, dealing with changes to EIA spreadsheet column names), but not for others (EPA CEMS seems very well behaved).
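
A minimal sketch of the YAML settings file idea from the first option; the file name, structure, and DOI values are all illustrative, not PUDL's actual layout.

```python
# Sketch: keep the pinned Zenodo DOIs in a YAML settings file instead of in
# the code, so updating to a fresh archive is a one-line diff in one file.
import yaml  # PyYAML

# Example contents of a hypothetical zenodo_dois.yaml:
#   ferc1: 10.5281/zenodo.1111111
#   eia860: 10.5281/zenodo.2222222
#   epacems: 10.5281/zenodo.3333333
with open("zenodo_dois.yaml") as f:
    dois: dict[str, str] = yaml.safe_load(f)

# The nightly build would resolve each dataset's raw inputs via these DOIs.
print(dois["epacems"])
```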

bendnorman (Member, Author) commented

Can this be closed?

zaneselvans (Member) commented

We should probably carve out the unfinished work in another issue or issues.

  • Better reporting & notification of archive creation and validation failures: enumerate all the failures without needing someone to re-run to debug.
  • Have the monthly archiving action automatically create the archive approval checklist issue and populate it based on the archiving or validation failures.
  • Splitting out the Datastore seems like a totally separable thing.
  • There are some data sources that are well behaved enough that an automated PR to update the DOI seems reasonable, including eia930, eia_bulk_elec, eia860m, and epacems (see the sketch after this list).
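
For those well-behaved sources, the automated DOI-update PR could start from a check like the one below. The record ID is a placeholder, and the `links.latest` field should be verified against Zenodo's REST API before relying on it.

```python
# Sketch: ask Zenodo whether a pinned record has a newer version in its
# lineage. Record IDs are placeholders; verify the response fields against
# Zenodo's REST API documentation before depending on them.
import requests


def latest_record_url(record_id: int) -> str:
    """Return the API URL of the newest version of this Zenodo record."""
    resp = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()["links"]["latest"]


pinned = "https://zenodo.org/api/records/1111111"  # illustrative pinned version
newest = latest_record_url(1111111)
if newest != pinned:
    print(f"Newer archive available: {newest}; open a DOI-update PR.")
```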

jdangerx (Member) commented

I've carved those out, minus the Datastore split, which is a persistent, larger piece of work we've been thinking about.

catalyst-cooperative/pudl-archiver#346
catalyst-cooperative/pudl-archiver#347
#3639

Closing!
