
Improve and Automate raw data archiving/access #1418

Closed · 6 of 12 tasks
bendnorman opened this issue Jan 21, 2022 · 5 comments
Labels: epic (Any issue whose primary purpose is to organize other issues into a group.)

bendnorman (Member) commented Jan 21, 2022

Description

This Epic tracks updates to the data archiving and access processes. The previous process for creating new archives involved first running the scraper to download new data locally; the archiver could then be used to upload that data to Zenodo and create a new archive version. This manual process makes updating archives somewhat difficult and requires someone to be aware of upstream updates, which often leads to stale data. Combining the archiver and scrapers will not only simplify this process but also make automation much easier.

Once new data archives are created, there is still no easy way to access these raw archives outside of PUDL, because the Datastore that PUDL uses to access them is embedded within PUDL. Making the Datastore a standalone software package would allow client projects and individual users to access these archives.
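
A minimal sketch of what that could look like from a client project, assuming a hypothetical standalone package called `pudl_datastore`; the class and method names are illustrative, not the actual PUDL interface.

```python
# Sketch: using a standalone Datastore package from a client project. The
# package name, class, and fetch() method are hypothetical assumptions.
from pathlib import Path

from pudl_datastore import Datastore  # hypothetical standalone package

# Point the datastore at a local cache; anything missing would be pulled
# down from the pinned Zenodo archive for that dataset.
ds = Datastore(local_cache_path=Path.home() / ".cache" / "pudl")

# Fetch a raw resource by dataset name and filter, getting a local path back.
ferc1_path = ds.fetch("ferc1", year=2020)
print(f"Raw FERC Form 1 data cached at: {ferc1_path}")
```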

Scope

- How do we know when we are done? This epic is done when dataset archives are updated automatically.
- What is out of scope? Integrating specific datasets.

Tasks

Archiver

PUDL Integration

  • Notify PUDL when a new archive is created (see the sketch after this list)
  • Kick off nightly build to detect problems stemming from new data
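
One plausible mechanism for the notification task above is GitHub's repository_dispatch API, which a workflow in the PUDL repo could listen for. This is a sketch, not the implemented design; the event name and payload fields are assumptions.

```python
# Sketch: notify the PUDL repo that a new raw archive exists by firing a
# repository_dispatch event. The event_type and client_payload fields are
# illustrative assumptions; only the GitHub REST endpoint itself is real.
import os

import requests


def notify_pudl(dataset: str, doi: str) -> None:
    """Fire a repository_dispatch event at the PUDL repo (illustrative)."""
    resp = requests.post(
        "https://api.github.com/repos/catalyst-cooperative/pudl/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "event_type": "new-raw-archive",  # hypothetical event name
            "client_payload": {"dataset": dataset, "doi": doi},
        },
        timeout=30,
    )
    resp.raise_for_status()
```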

Create standalone Datastore

  • Move datastore source code to a new repo so it can be used as a library
  • Pull over tests from PUDL and set up CI
  • Implement a basic CLI for accessing data (see the sketch after this list)
  • Package the Datastore on PyPI and conda-forge
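
A sketch of the basic data-access CLI from the task above, using argparse; the command name, flags, and Datastore API are hypothetical assumptions consistent with the library sketch earlier.

```python
# Sketch: a minimal CLI wrapper around the hypothetical standalone Datastore.
import argparse

from pudl_datastore import Datastore  # hypothetical standalone package


def main() -> None:
    parser = argparse.ArgumentParser(
        prog="pudl-datastore",
        description="Fetch raw PUDL input data from versioned archives.",
    )
    parser.add_argument("dataset", help="Dataset to fetch, e.g. ferc1 or epacems.")
    parser.add_argument("--year", type=int, help="Optional year filter.")
    args = parser.parse_args()

    filters = {"year": args.year} if args.year is not None else {}
    print(Datastore().fetch(args.dataset, **filters))


if __name__ == "__main__":
    main()
```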
bendnorman self-assigned this Jan 21, 2022
bendnorman added the epic label Jan 21, 2022
zschira self-assigned this and unassigned bendnorman Sep 13, 2022
zschira changed the title from "Automate our scraping and archiving" to "Improve and Automate raw data archiving/access" Jan 9, 2023
jdangerx (Member) commented

We had mentioned maybe "Try adding a new dataset and see if our automation picks it up and archives it" as the final definition of done. What do you think @zschira? Or is that just part of catalyst-cooperative/pudl-archiver#2?

zaneselvans (Member) commented

> Kick off nightly build to detect problems stemming from new data

I feel like there are 2 ways we could approach this.

  • Use freshly archived data to do a nightly build, but only using the previously covered range of data. This would still integrate any corrections or tweaks made to older data, which happens quite frequently; it could be valuable, and I think it usually wouldn't require any human intervention. The only thing we would need to do to make the PR is update the list of DOIs referenced, which would be very easy if they were in a YAML settings file rather than in the code itself (see the sketch after this list).
  • Try to process all the newly archived data, including new years that were previously unavailable. This will almost always require human intervention for some data sources (e.g. for FERC to EIA plant/util ID mapping, dealing with changes to EIA spreadsheet column names), but not for others (EPA CEMS seems very well behaved).
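
A minimal sketch of the YAML settings file idea from the first option; the file name, structure, and DOI values are all illustrative, not PUDL's actual layout.

```python
# Sketch: keep the pinned Zenodo DOIs in a YAML settings file instead of in
# the code, so updating to a fresh archive is a one-line diff in one file.
import yaml  # PyYAML

# Example contents of a hypothetical zenodo_dois.yaml:
#   ferc1: 10.5281/zenodo.1111111
#   eia860: 10.5281/zenodo.2222222
#   epacems: 10.5281/zenodo.3333333
with open("zenodo_dois.yaml") as f:
    dois: dict[str, str] = yaml.safe_load(f)

# The nightly build would resolve each dataset's raw inputs via these DOIs.
print(dois["epacems"])
```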

bendnorman (Member, Author) commented

Can this be closed?

zaneselvans (Member) commented

We should probably carve out the unfinished work in another issue or issues.

  • Better reporting & notification of archive creation and validation failures: enumerate all the failures without needing someone to re-run to debug.
  • Have the monthly archiving action automatically create the archive approval checklist issue and populate it based on the archiving or validation failures.
  • Splitting out the Datastore seems like a totally separable thing.
  • There are some data sources that are well behaved enough that an automated PR to update the DOI seems reasonable, including eia930, eia_bulk_elec, eia860m, and epacems (see the sketch after this list).
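
For those well-behaved sources, the automated DOI-update PR could start from a check like the one below. The record ID is a placeholder, and the `links.latest` field should be verified against Zenodo's REST API before relying on it.

```python
# Sketch: ask Zenodo whether a pinned record has a newer version in its
# lineage. Record IDs are placeholders; verify the response fields against
# Zenodo's REST API documentation before depending on them.
import requests


def latest_record_url(record_id: int) -> str:
    """Return the API URL of the newest version of this Zenodo record."""
    resp = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()["links"]["latest"]


pinned = "https://zenodo.org/api/records/1111111"  # illustrative pinned version
newest = latest_record_url(1111111)
if newest != pinned:
    print(f"Newer archive available: {newest}; open a DOI-update PR.")
```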

jdangerx (Member) commented

I've carved those out, minus the Datastore split, which is a persistent, larger piece of work we've been thinking about.

catalyst-cooperative/pudl-archiver#346
catalyst-cooperative/pudl-archiver#347
#3639

Closing!
