Start EIA-176 pipelines: company data #2949
Conversation
Also I just wanted to capture a number of things I noticed while getting familiar with the codebase. Potential systemic issues: …
…ema for CSV extractor
Thanks for the easy-to-read code + accompanying tests! I think we led you a bit astray with the FERC DBF extractors - hopefully my comment about using the EIA 860 patterns makes sense. If not, I'm happy to get on a call with you and hash stuff out!
src/pudl/extract/csv.py (outdated):
```python
)(extract)


def raw_df_factory(extractor_cls: type[CsvExtractor], name: str) -> AssetsDefinition:
```
This follows the pattern established in excel.raw_df_factory, i.e., what we do for extracting EIA-860 in Dagster. Looks like the preexisting logic is covered by the Dagster nightly tests and I'm inclined to rely on the same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EIA 860 is a little more complicated than EIA 176, since each table corresponds to a number of files spread across a number of different zip files (see src/pudl/package_data/eia860/file_map.csv for a look into the mess...). The upshot is, I think this `raw_df_factory` and `extractor_factory` can be combined into one layer of abstraction. See above for more details.
Sweet! I ran this in Dagster and got a very reasonable looking dataframe out when I then ran `defs.load_asset_value`. Our first gas dataset!
I think we could save you from some annoying subclassing in future table extractions + provide a slightly nicer generic CSV extraction API, but I could definitely be missing something here so let me know what you think!
I could be convinced to merge this with no changes - my core worry is that the `CsvExtractor` interface doesn't cleanly reflect its purpose (turn a zipfile + table -> file map into a table name -> dataframe map), so I'm happy to merge once that worry is assuaged.
src/pudl/extract/csv.py (outdated):
```python
class CsvExtractor:
    """Generalized class for extracting dataframes from CSV files.
```
So this CsvExtractor class requires subclassing to set the `DATASET` - that seems a little clunky. If we can pass the dataset name to `__init__`, we'll have a `CsvExtractor` class which exposes:
```python
extractor = CsvExtractor(datastore, dataset_name)
extractor.read_source(filename)  # get one file
extractor.extract()  # get all files
```
Which seems like a nice generic CSV extraction interface that doesn't require subclassing to read a variety of different collections of CSV files.
I think we might want to tweak that API a bit to expose table-level operations:
```python
extractor.extract_one(table_name)  # pd.DataFrame
extractor.extract_all()  # dict[str, pd.DataFrame]
```
Because then we can read multiple tables' files in parallel - each table could have its own asset like
```python
@asset
def raw_eia176__table_name(context):
    ...
    extractor.extract_one(table_name)
```
Which is pretty straightforward to factory-ize if you want to make a bunch of these assets (see the sketch after the next paragraph).
In the event we need to read & combine multiple files for a single table (like we see in EIA 860), we can turn that simple asset above into a graph-backed asset. But for EIA176 that seems like overkill.
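A possible shape for that factory; a sketch only, assuming the `CsvExtractor(datastore, dataset_name)` constructor and `extract_one()` method proposed above, with an illustrative table list:
```python
import pandas as pd
from dagster import AssetsDefinition, asset

from pudl.extract.csv import CsvExtractor  # module under discussion in this PR


def raw_eia176_asset_factory(table_name: str) -> AssetsDefinition:
    """Build one raw-extraction asset per EIA-176 table (sketch)."""

    @asset(name=f"raw_eia176__{table_name}", required_resource_keys={"datastore"})
    def _raw_table(context) -> pd.DataFrame:
        # Assumes the parameterized constructor proposed in this thread.
        extractor = CsvExtractor(context.resources.datastore, "eia176")
        return extractor.extract_one(table_name)

    return _raw_table


# e.g. build an asset per table; the table list here is illustrative.
raw_eia176_assets = [raw_eia176_asset_factory(name) for name in ["company"]]
```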
I think there's also an argument for passing in the table -> file(s) map and the path to file as init params, and then using the datastore at the call site - this lets people use the extractor to explore data sets that we haven't actually integrated into our upstream work yet, and lets you pass stuff in for testing without extensive patching. Could be something like
```python
class CsvExtractor:
    def __init__(self, path: pathlib.Path | zipfile.Path, table_files_map: dict[str, list[str]]):
        ...

    @classmethod
    def from_resource(cls, datastore: Datastore, resource_id: str):
        # existing logic to get zipfile + table/file map
        return cls(...)
```
What do you think?
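Filled out, that shape might look something like the following. A sketch under stated assumptions: `Datastore.get_zipfile_resource` and `get_table_file_map` are names assumed from this discussion, not confirmed APIs, and the map is taken as one file per table:
```python
import zipfile

import pandas as pd


class CsvExtractor:
    """Extract dataframes from CSV files inside a zip archive (sketch)."""

    def __init__(self, zf: zipfile.ZipFile, table_file_map: dict[str, str]):
        self._zipfile = zf
        self._table_file_map = table_file_map

    def extract_one(self, table_name: str) -> pd.DataFrame:
        # Look up the archive member for this table and parse it.
        with self._zipfile.open(self._table_file_map[table_name]) as f:
            return pd.read_csv(f)

    def extract_all(self) -> dict[str, pd.DataFrame]:
        return {name: self.extract_one(name) for name in self._table_file_map}

    @classmethod
    def from_resource(cls, datastore, resource_id: str) -> "CsvExtractor":
        # Hypothetical: assumes the datastore hands back a ZipFile and a
        # get_table_file_map() helper exists for the dataset.
        return cls(
            datastore.get_zipfile_resource(resource_id),
            get_table_file_map(resource_id),
        )
```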
> So this CsvExtractor class requires subclassing to set the DATASET - that seems a little clunky. If we can pass the dataset name to `__init__`

Meant to comment on this. I actually started using dataset as a construction parameter but stopped because I was introducing a competing pattern, e.g., vs `FercDbfExtractor`'s inheritance tree. I'll proceed with parameterizing over inheritance now that someone else is also inclined.

> I think there's also an argument for passing in the table -> file(s) map and the path to file as init params, and then using the datastore at the call site

I don't know what the datastore would be used for in that case, since right now it's only used to get the zipfile path. But generally you're talking about providing a class that lets a user/client point to any zip file and get dataframes out of it based on a `table_files_map`? Right now the zipfile path and the table-file(s) map are coupled on the dataset name, e.g., `eia176`. Decoupling opens a wider window for invalid combos, i.e., table filenames that do not exist in the zip archive, but yeah, it would be nice to be able to develop against data without needing the source published on Zenodo. I'll take a pass in that direction.
…plify Dagster asset definition
```python
logger = pudl.logging_helpers.get_logger(__name__)


def open_csv_resource(dataset: str, base_filename: str) -> DictReader:
```
This can be moved to an even more general space at some point but I didn't want to introduce another moving part in this PR.
Yeah, I could see this moving to `pudl.helpers` or something but this is a fine place for it.
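For context, a helper like this can stay small; a minimal sketch, assuming the file maps ship inside the `pudl.package_data` package (the exact layout is an assumption):
```python
import importlib.resources
from csv import DictReader


def open_csv_resource(dataset: str, base_filename: str) -> DictReader:
    """Open a packaged CSV, e.g. pudl/package_data/eia176/<base_filename>."""
    csv_path = importlib.resources.files(f"pudl.package_data.{dataset}") / base_filename
    return DictReader(csv_path.open())
```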
```python
@asset(required_resource_keys={"datastore"})
def raw_eia176__company(context):
```
I wrote a concrete asset like this for now, after trying to get a factory pattern down. Happy to learn more about the Dagster components and write a factory at some point.
I was able to materialize in Dagster and my updated integration test passed for this latest iteration.
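For readers following along, the concrete asset being discussed is roughly this shape; a sketch only, where the `get_zipfile_resource` call and `get_table_file_map` helper are assumptions from this thread, not verified against the final PR:
```python
import pandas as pd
from dagster import asset

from pudl.extract.csv import CsvExtractor, get_table_file_map  # names assumed


@asset(required_resource_keys={"datastore"})
def raw_eia176__company(context) -> pd.DataFrame:
    """Extract raw EIA-176 company data (sketch, per the discussion above)."""
    ds = context.resources.datastore
    # Assumed API: get a ZipFile for the eia176 resource, then pull one table.
    extractor = CsvExtractor(ds.get_zipfile_resource("eia176"), get_table_file_map("eia176"))
    return extractor.extract_one("company")
```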
🎉 looks great! Definitely a good point re: trying to make sure we can't represent illegal states.
```python
            table_file_map: map of table name to source file in zipfile archive
        """
        self._zipfile = zipfile
        self._table_file_map = table_file_map
```
If you're worried about the table file map and the zipfile not matching up, you could validate that the files in `table_file_map` do show up in `zipfile.namelist()`. Doesn't have to happen in this pass, but should probably happen at some point.
The integration test failures appear to be due to Zenodo flakiness plus outdatedly including … The zenodo-cache-sync failed on git pull, which seems funky. And the notification failed because we're missing some Slack webhook secret, which I can set up.
@davidmudrauskas I think if you merge … We also took a look at the zenodo-cache-sync errors from a different PR - the root cause is some credential stuff that isn't particularly important to fix now, so we can just ignore those checks.
FYI … It did fine locally, so hopefully it's more consistent and the above was an anomaly. Running a new build with latest …
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
```diff
@@           Coverage Diff           @@
##             dev   #2949     +/-   ##
=======================================
- Coverage   92.6%   92.6%    -0.0%
=======================================
  Files        140     143       +3
  Lines      12841   12925      +84
=======================================
+ Hits       11894   11969      +75
- Misses       947     956       +9
```
☔ View full report in Codecov by Sentry.
This happens sometimes with our single property-based test. It generates a bunch of dataframes for inputs, which is apparently non-deterministically slow. I'll set some ridiculous deadline like 2s.
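In Hypothesis that's a one-line settings change; a sketch where the strategy and test body are stand-ins, not the real property-based test:
```python
from datetime import timedelta

from hypothesis import given, settings
from hypothesis import strategies as st


@settings(deadline=timedelta(seconds=2))  # generous deadline for slow input generation
@given(st.integers())
def test_example_property(x):
    assert x == x
```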
```python
*load_assets_from_modules([eia_bulk_elec_assets], group_name="core_eia_bulk_elec"),
*load_assets_from_modules([epacems_assets], group_name="core_epacems"),
```
Captured the recent addition of the `core_` prefix here in the resolution of the last merge conflict.
Hey @davidmudrauskas, I'm sorry, I didn't realize that when I got rid of the …
PR Overview
This PR starts the EIA-176 pipelines with the `company` data. We can extend this pattern to process all EIA-176 data, but that is not included in this PR.
PR Checklist
- Merged the most recent version of the branch being merged into (dev).