Make quicker quarterly updates to PUDL EPA CEMS, EIA 923 and EIA 860 data #2902

Closed
18 tasks done
e-belfer opened this issue Sep 28, 2023 · 1 comment · Fixed by #3085
Assignees
Labels
eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 enhancement Improvements in existing functionality. epacems Integration and analysis of the EPA CEMS dataset. epic Any issue whose primary purpose is to organize other issues into a group. github-actions Pull requests that update GitHub Actions code new-data Requests for integration of new data. rmi

Comments

@e-belfer
Member

e-belfer commented Sep 28, 2023

Description

With the support of RMI, this project aims to make it possible to integrate quarterly updates for CEMS, EIA 923M and EIA 860M data within 1-2 weeks of each new data release. To do this, we are going to 1) automate archiving of new data, supported by additional data validation checks in our archiver, 2) redesign PUDL to handle quarterly and monthly data formats, and 3) integrate YTD data to test it all.

Archiver infrastructure updates

Goal: Generate a robust report of archiver results to enable quick (under 15 minutes per dataset) manual review and approval of draft production archives run by a GitHub Action.

Current state: Archive runs check for 1) missing files, 2) valid file types, and 3) empty zips (in progress), and produce a summary of all changed files. The new default behavior of the ‘auto-publish’ flag produces a draft production archive in Zenodo for manual approval, removing the need for sandbox runs of an archiver that is known to work. @zschira has seriously improved the archiver’s robustness to large file uploads, but this is always a trouble spot, and we should anticipate some time required to handle issues that come up.
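
To make the three checks above concrete, here is a minimal sketch of what a per-run validation report could look like. It is not the archiver's actual implementation: the function name `validate_archive`, the `VALID_SUFFIXES` set, and the report keys are all hypothetical and chosen only for illustration.

```python
import zipfile
from pathlib import Path

# Hypothetical set of acceptable extensions; the real archiver's list may differ.
VALID_SUFFIXES = {".zip", ".csv", ".xlsx", ".xls", ".parquet"}


def validate_archive(files: list[Path], expected_names: set[str]) -> dict[str, list[str]]:
    """Return a report mapping each check name to the file names that failed it."""
    present = {f.name for f in files}
    report: dict[str, list[str]] = {
        "missing_files": sorted(expected_names - present),
        "invalid_file_types": [],
        "empty_zips": [],
    }
    for f in files:
        suffix = f.suffix.lower()
        if suffix not in VALID_SUFFIXES:
            report["invalid_file_types"].append(f.name)
        elif suffix == ".zip":
            # A zip with no members opens fine but has an empty name list.
            with zipfile.ZipFile(f) as zf:
                if not zf.namelist():
                    report["empty_zips"].append(f.name)
    return report
```

A reviewer would only need to look at the non-empty lists in the returned report, which is the kind of quick manual approval step the goal describes.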

Non-coding tasks required (likely RMI tasks):

  • Identify expected frequency of dataset releases in order to schedule automated archiving
  • Identify any known dataset quirks relevant to the archiving process (e.g., 2015 Q1 CEMS data is mislabelled).

Handle monthly data in PUDL

Goal: Design a mechanism to handle monthly data in a system that is designed for annual data. Make structural changes required for each dataset to make this possible, designating new data as YTD data and excluding it from annually aggregated tables.
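
One way to picture "designating new data as YTD and excluding it from annually aggregated tables" is sketched below. This is an illustration under assumptions, not PUDL's design: the record keys `report_date` and `net_generation_mwh`, and both function names, are placeholders.

```python
from collections import defaultdict
from datetime import date


def is_complete_year(report_year: int, latest_month: date) -> bool:
    """A year is complete once data through December of that year is available."""
    return (latest_month.year, latest_month.month) >= (report_year, 12)


def annual_totals(records: list[dict], latest_month: date) -> dict[int, float]:
    """Sum monthly values by year, skipping years that are still year-to-date."""
    totals: dict[int, float] = defaultdict(float)
    for rec in records:
        year = rec["report_date"].year
        # YTD years are left out so partial sums never look like annual totals.
        if is_complete_year(year, latest_month):
            totals[year] += rec["net_generation_mwh"]
    return dict(totals)
```

Under this approach the monthly records for the YTD year still exist in the monthly tables; they are simply filtered out before any annual aggregation.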

Current state:
EIA 860M data is ‘annual’ in nature and already appended on to EIA 860 data. No changes required here.
EIA 923M data has the same format as EIA 923 data, with slightly fewer ‘pages’ (Schedules 2-5 only, which means no emission control table). The column names and layout are the same, with YTD data and blank rows for the months not yet covered.
CEMS is currently downloaded by year-state and will need to be downloaded by quarter instead (one file per quarter, ~2-3 GB per file).
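
Moving CEMS from year-state to quarterly files implies a new partition key. The sketch below enumerates year-quarter labels for a date range; the `2023q1` label format and the function name `year_quarters` are assumptions made for illustration, not the format PUDL or the archiver necessarily uses.

```python
from datetime import date


def year_quarters(start: date, end: date) -> list[str]:
    """List year-quarter labels (e.g. '2023q1') from start's quarter through end's quarter."""
    labels = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    end_key = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= end_key:
        labels.append(f"{year}q{quarter}")
        # Roll over to Q1 of the next year after Q4.
        year, quarter = (year, quarter + 1) if quarter < 4 else (year + 1, 1)
    return labels
```

For example, `year_quarters(date(2022, 11, 1), date(2023, 8, 15))` yields the four labels from `2022q4` through `2023q3`, one per quarterly file to download.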

Doc updates

Goal: Make it easier for external contributors to make progress on the annual updates.
Current state: The annual update docs are relatively up to date but need more detail for a few steps.

Tasks

  1. docs
    aesharpe
  2. new-data rmi
    aesharpe

Integrate YTD data

Goal: test new infrastructure on YTD data (anticipated Q3 2023).

Tasks

  1. eia923 new-data rmi
    aesharpe

If time, but not essential to project success

  • Explicitly specify expected missing partitions for ingested datasets (e.g. no HI data in 2013).
  • Write a suite of tests to characterize changes in ingested data at the end of the ETL (data coverage, number of entities, other metrics of completeness) and output the results as a report. This will speed up assessing how ‘complete’ monthly data is and how revisions affect the data.
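
As a rough sketch of the first item, expected gaps could be declared in a small registry and subtracted before reporting missing partitions. Everything here is hypothetical: the `EXPECTED_MISSING` structure, the `unexpected_gaps` name, and attaching the 2013 HI example to a dataset called `epacems` are assumptions for illustration only.

```python
# Hypothetical registry of partitions known to be absent from the source data.
# Assigning the (2013, "HI") example to "epacems" is an assumption.
EXPECTED_MISSING: dict[str, set[tuple[int, str]]] = {
    "epacems": {(2013, "HI")},
}


def unexpected_gaps(
    dataset: str,
    years: list[int],
    states: list[str],
    found: set[tuple[int, str]],
) -> list[tuple[int, str]]:
    """Return (year, state) partitions that are absent and not declared as expected gaps."""
    wanted = {(y, s) for y in years for s in states}
    known_missing = EXPECTED_MISSING.get(dataset, set())
    return sorted(wanted - found - known_missing)
```

An empty result means every absent partition was anticipated, so a non-empty list is the only thing a reviewer needs to investigate.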

Future projects that could complement this work

@e-belfer e-belfer added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 epacems Integration and analysis of the EPA CEMS dataset. new-data Requests for integration of new data. enhancement Improvements in existing functionality. epic Any issue whose primary purpose is to organize other issues into a group. github-actions Pull requests that update GitHub Actions code rmi labels Sep 28, 2023
@e-belfer e-belfer moved this from New to Backlog in Catalyst Megaproject Sep 28, 2023
@e-belfer e-belfer linked a pull request Nov 24, 2023 that will close this issue
@e-belfer e-belfer moved this from Backlog to In progress in Catalyst Megaproject Dec 4, 2023
@e-belfer
Member Author

With the merging of #3185 and #3186, this issue is complete. Further ongoing quarterly updates will be tracked in individual issues.
