Start EIA-176 pipelines: company data #2949
Conversation
Also I just wanted to capture a number of things I noticed while getting familiar with the codebase. Potential systemic issues: …
…ema for CSV extractor
Thanks for the easy-to-read code + accompanying tests! I think we led you a bit astray with the FERC DBF extractors - hopefully my comment about using the EIA 860 patterns makes sense. If not, I'm happy to get on a call with you and hash stuff out!
src/pudl/extract/csv.py (outdated):
```python
)(extract)


def raw_df_factory(extractor_cls: type[CsvExtractor], name: str) -> AssetsDefinition:
```
This follows the pattern established in excel.raw_df_factory, i.e., what we do for extracting EIA-860 in Dagster. Looks like the preexisting logic is covered by the Dagster nightly tests and I'm inclined to rely on the same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EIA 860 is a little more complicated than EIA 176, since each table corresponds to a number of files spread across a number of different zip files (see src/pudl/package_data/eia860/file_map.csv for a look into the mess...). The upshot is, I think this `raw_df_factory` and `extractor_factory` can be combined into one layer of abstraction. See above for more details.
Sweet! I ran this in Dagster and got a very reasonable looking dataframe out when I then ran `defs.load_asset_value`. Our first gas dataset!
I think we could save you from some annoying subclassing in future table extractions + provide a slightly nicer generic CSV extraction API, but I could definitely be missing something here so let me know what you think!
I could be convinced to merge this with no changes - my core worry is that the `CsvExtractor` interface doesn't cleanly reflect its purpose (turn a zipfile + table -> file map into a table name -> dataframe map), so I'm happy to merge once that worry is assuaged.
src/pudl/extract/csv.py (outdated):
```python
class CsvExtractor:
    """Generalized class for extracting dataframes from CSV files.
```
So this CsvExtractor class requires subclassing to set the `DATASET` - that seems a little clunky. If we can pass the dataset name to `__init__`, we'll have a `CsvExtractor` class which exposes:
```python
extractor = CsvExtractor(datastore, dataset_name)
extractor.read_source(filename)  # get one file
extractor.extract()  # get all files
```
Which seems like a nice generic CSV extraction interface that doesn't require subclassing to read a variety of different collections of CSV files.
I think we might want to tweak that API a bit to expose table-level operations:
```python
extractor.extract_one(table_name)  # pd.DataFrame
extractor.extract_all()  # dict[str, pd.DataFrame]
```
Because then we can read multiple tables' files in parallel - each table could have its own asset like
```python
@asset
def raw_eia176__table_name(context):
    ...
    extractor.extract_one(table_name)
```
Which is pretty straightforward to factory-ize if you want to make a bunch of these assets (see the sketch after the next paragraph).
In the event we need to read & combine multiple files for a single table (like we see in EIA 860), we can turn that simple asset above into a graph-backed asset. But for EIA176 that seems like overkill.
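A possible shape for that factory; a sketch only, assuming the `CsvExtractor(datastore, dataset_name)` constructor and `extract_one()` method proposed above, with an illustrative table list:
```python
import pandas as pd
from dagster import AssetsDefinition, asset

from pudl.extract.csv import CsvExtractor  # module under discussion in this PR


def raw_eia176_asset_factory(table_name: str) -> AssetsDefinition:
    """Build one raw-extraction asset per EIA-176 table (sketch)."""

    @asset(name=f"raw_eia176__{table_name}", required_resource_keys={"datastore"})
    def _raw_table(context) -> pd.DataFrame:
        # Assumes the parameterized constructor proposed in this thread.
        extractor = CsvExtractor(context.resources.datastore, "eia176")
        return extractor.extract_one(table_name)

    return _raw_table


# e.g. build an asset per table; the table list here is illustrative.
raw_eia176_assets = [raw_eia176_asset_factory(name) for name in ["company"]]
```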
I think there's also an argument for passing in the table -> file(s) map and the path to file as init params, and then using the datastore at the call site - this lets people use the extractor to explore data sets that we haven't actually integrated into our upstream work yet, and lets you pass stuff in for testing without extensive patching. Could be something like
```python
class CsvExtractor:
    def __init__(self, path: pathlib.Path | zipfile.Path, table_files_map: dict[str, list[str]]):
        ...

    @classmethod
    def from_resource(cls, datastore: Datastore, resource_id: str):
        # existing logic to get zipfile + table/file map
        return cls(...)
```
What do you think?
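Filled out, that shape might look something like the following. A sketch under stated assumptions: `Datastore.get_zipfile_resource` and `get_table_file_map` are names assumed from this discussion, not confirmed APIs, and the map is taken as one file per table:
```python
import zipfile

import pandas as pd


class CsvExtractor:
    """Extract dataframes from CSV files inside a zip archive (sketch)."""

    def __init__(self, zf: zipfile.ZipFile, table_file_map: dict[str, str]):
        self._zipfile = zf
        self._table_file_map = table_file_map

    def extract_one(self, table_name: str) -> pd.DataFrame:
        # Look up the archive member for this table and parse it.
        with self._zipfile.open(self._table_file_map[table_name]) as f:
            return pd.read_csv(f)

    def extract_all(self) -> dict[str, pd.DataFrame]:
        return {name: self.extract_one(name) for name in self._table_file_map}

    @classmethod
    def from_resource(cls, datastore, resource_id: str) -> "CsvExtractor":
        # Hypothetical: assumes the datastore hands back a ZipFile and a
        # get_table_file_map() helper exists for the dataset.
        return cls(
            datastore.get_zipfile_resource(resource_id),
            get_table_file_map(resource_id),
        )
```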
> So this CsvExtractor class requires subclassing to set the DATASET - that seems a little clunky. If we can pass the dataset name to `__init__`

Meant to comment on this. I actually started using dataset as a construction parameter but stopped because I was introducing a competing pattern, e.g., vs `FercDbfExtractor`'s inheritance tree. I'll proceed with parameterizing over inheritance now that someone else is also inclined.

> I think there's also an argument for passing in the table -> file(s) map and the path to file as init params, and then using the datastore at the call site

I don't know what the datastore would be used for in that case, since right now it's only used to get the zipfile path. But generally you're talking about providing a class that lets a user/client point to any zip file and get dataframes out of it based on a `table_files_map`? Right now the zipfile path and the table-file(s) map are coupled on the dataset name, e.g., `eia176`. Decoupling opens a wider window for invalid combos, i.e., table filenames that do not exist in the zip archive, but yeah, it would be nice to be able to develop against data without needing the source published on Zenodo. I'll take a pass in that direction.
…plify Dagster asset definition
```python
logger = pudl.logging_helpers.get_logger(__name__)


def open_csv_resource(dataset: str, base_filename: str) -> DictReader:
```
This can be moved to an even more general space at some point but I didn't want to introduce another moving part in this PR.
Yeah, I could see this moving to `pudl.helpers` or something but this is a fine place for it.
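For context, a helper like this can stay small; a minimal sketch, assuming the file maps ship inside the `pudl.package_data` package (the exact layout is an assumption):
```python
import importlib.resources
from csv import DictReader


def open_csv_resource(dataset: str, base_filename: str) -> DictReader:
    """Open a packaged CSV, e.g. pudl/package_data/eia176/<base_filename>."""
    csv_path = importlib.resources.files(f"pudl.package_data.{dataset}") / base_filename
    return DictReader(csv_path.open())
```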
```python
@asset(required_resource_keys={"datastore"})
def raw_eia176__company(context):
```
I wrote a concrete asset like this for now, after trying to get a factory pattern down. Happy to learn more about the Dagster components and write a factory at some point.
I was able to materialize in Dagster and my updated integration test passed for this latest iteration.
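For readers following along, the concrete asset being discussed is roughly this shape; a sketch only, where the `get_zipfile_resource` call and `get_table_file_map` helper are assumptions from this thread, not verified against the final PR:
```python
import pandas as pd
from dagster import asset

from pudl.extract.csv import CsvExtractor, get_table_file_map  # names assumed


@asset(required_resource_keys={"datastore"})
def raw_eia176__company(context) -> pd.DataFrame:
    """Extract raw EIA-176 company data (sketch, per the discussion above)."""
    ds = context.resources.datastore
    # Assumed API: get a ZipFile for the eia176 resource, then pull one table.
    extractor = CsvExtractor(ds.get_zipfile_resource("eia176"), get_table_file_map("eia176"))
    return extractor.extract_one("company")
```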
🎉 looks great! Definitely a good point re: trying to make sure we can't represent illegal states.
```python
            table_file_map: map of table name to source file in zipfile archive
        """
        self._zipfile = zipfile
        self._table_file_map = table_file_map
```
If you're worried about the table file map and the zipfile not matching up, you could validate that the files in `table_file_map` do show up in `zipfile.namelist()`. Doesn't have to happen in this pass, but should probably happen at some point.
The integration test failures appear to be due to Zenodo flakiness plus outdatedly including … The zenodo-cache-sync failed on git pull, which seems funky. And the notification failed because we're missing some Slack webhook secret, which I can set up.
@davidmudrauskas I think if you merge … We also took a look at the zenodo-cache-sync errors from a different PR - the root cause is some credential stuff that isn't particularly important to fix now, so we can just ignore those checks.
FYI … It did fine locally, so hopefully it's more consistent and the above was an anomaly. Running a new build with latest …
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
```diff
@@           Coverage Diff           @@
##             dev   #2949     +/-   ##
=======================================
- Coverage   92.6%   92.6%    -0.0%
=======================================
  Files        140     143       +3
  Lines      12841   12925      +84
=======================================
+ Hits       11894   11969      +75
- Misses       947     956       +9
```
☔ View full report in Codecov by Sentry.
This happens sometimes with our single property-based test. It generates a bunch of dataframes for inputs, which is apparently non-deterministically slow. I'll set some ridiculous deadline like 2s.
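In Hypothesis that's a one-line settings change; a sketch where the strategy and test body are stand-ins, not the real property-based test:
```python
from datetime import timedelta

from hypothesis import given, settings
from hypothesis import strategies as st


@settings(deadline=timedelta(seconds=2))  # generous deadline for slow input generation
@given(st.integers())
def test_example_property(x):
    assert x == x
```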
```python
*load_assets_from_modules([eia_bulk_elec_assets], group_name="core_eia_bulk_elec"),
*load_assets_from_modules([epacems_assets], group_name="core_epacems"),
```
Captured the recent addition of the `core_` prefix here in the resolution of the last merge conflict.
Hey @davidmudrauskas, I'm sorry, I didn't realize that when I got rid of the …
PR Overview
This PR starts the EIA-176 pipelines with the `company` data. We can extend this pattern to process all EIA-176 data, but that is not included in this PR.
PR Checklist
- Merged the most recent version of the branch being merged into (dev).