Start EIA-176 pipelines: company data #2949
@@ -8,6 +8,7 @@
:mod:`pudl.transform` subpackage.
"""
from . import (
    eia176,
    eia860,
    eia860m,
    eia861,
@@ -0,0 +1,106 @@
"""Extractor for CSV data.""" | ||
from csv import DictReader | ||
from importlib import resources | ||
|
||
import pandas as pd | ||
from dagster import AssetsDefinition, OpDefinition, graph_asset, op | ||
|
||
import pudl.logging_helpers | ||
from pudl.workspace.datastore import Datastore | ||
|
||
logger = pudl.logging_helpers.get_logger(__name__) | ||
|
||
|
||
class CsvExtractor: | ||
"""Generalized class for extracting dataframes from CSV files. | ||
|
||
When subclassing from this generic extractor, one should implement dataset specific | ||
logic in the following manner: | ||
|
||
2. Set DATASET class attribute. This is used to load metadata from package_data/{dataset} subdirectory. | ||
|
||
The extraction logic is invoked by calling extract() method of this class. | ||
""" | ||
|
||
DATASET = None | ||
|
||
def __init__(self, datastore: Datastore): | ||
"""Create a new instance of CsvExtractor. | ||
|
||
This can be used for retrieving data from CSV files. | ||
|
||
Args: | ||
datastore: provides access to raw files on disk. | ||
""" | ||
self._zipfile = datastore.get_zipfile_resource(self.DATASET) | ||
self._table_file_map = { | ||
row["table"]: row["filename"] | ||
for row in self._open_csv_resource("table_file_map.csv") | ||
} | ||
|
||
def _open_csv_resource(self, base_filename: str) -> DictReader: | ||
"""Open the given resource file as :class:`csv.DictReader`.""" | ||
csv_path = resources.files(f"pudl.package_data.{self.DATASET}") / base_filename | ||
return DictReader(csv_path.open()) | ||
|
||
def read_source(self, filename: str) -> pd.DataFrame: | ||
"""Read the data from the CSV source file and return as a dataframe.""" | ||
logger.info(f"Extracting {filename} from CSV into pandas DataFrame.") | ||
with self._zipfile.open(filename) as f: | ||
df = pd.read_csv(f) | ||
return df | ||
|
||
def extract(self) -> dict[str, pd.DataFrame]: | ||
"""Extracts a dictionary of table names and dataframes from CSV source files.""" | ||
data = {} | ||
for table in self._table_file_map: | ||
filename = self._table_file_map[table] | ||
df = self.read_source(filename) | ||
data[table] = df | ||
return data | ||
|
||
|
||
def extractor_factory(extractor_cls: type[CsvExtractor], name: str) -> OpDefinition:
    """Construct a Dagster op that extracts data given an extractor class.

    Args:
        extractor_cls: Class of type :class:`CsvExtractor` used to extract the data.
        name: Name of a CSV-based dataset (e.g. "eia176").
    """

    def extract(context) -> dict[str, pd.DataFrame]:
        """A function that extracts data from a CSV file.

        This function will be decorated with a Dagster op and returned.

        Args:
            context: Dagster keyword that provides access to resources and config.

        Returns:
            A dictionary of DataFrames extracted from CSV, keyed by table name.
        """
        ds = context.resources.datastore
        return extractor_cls(ds).extract()

    return op(
        required_resource_keys={"datastore", "dataset_settings"},
        name=f"extract_single_{name}_year",
    )(extract)


def raw_df_factory(extractor_cls: type[CsvExtractor], name: str) -> AssetsDefinition:
    """Return a dagster graph asset to extract a set of raw DataFrames from CSV files.

    Args:
        extractor_cls: The dataset-specific CSV extractor used to extract the data.
            Needs to correspond to the dataset identified by ``name``.
        name: Name of a CSV-based dataset (e.g. "eia176"). Currently this must be
            one of the attributes of :class:`pudl.settings.EiaSettings`.
    """
    extractor = extractor_factory(extractor_cls, name)

    def raw_dfs() -> dict[str, pd.DataFrame]:
        """Produce a dictionary of extracted EIA dataframes."""
        return extractor()

    return graph_asset(name=f"{name}_raw_dfs")(raw_dfs)

Review comment: This follows the pattern established in excel.raw_df_factory, i.e., what we do for extracting EIA-860 in Dagster. Looks like the preexisting logic is covered by the Dagster nightly tests, and I'm inclined to rely on the same here.

Review comment: EIA 860 is a little more complicated than EIA 176, since each table corresponds to a number of files spread across a number of different zip files (see ...). The upshot is, I think this ...
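For orientation, here is a minimal usage sketch of the two entry points above, using the Eia176CsvExtractor subclass added later in this diff. The bare `Datastore()` construction is an assumption; the real setup may require cache or sandbox configuration.

```python
# Minimal usage sketch (not part of the diff). Assumes Datastore() can be
# constructed with defaults; real setup may require configuration.
from pudl.extract.csv import raw_df_factory
from pudl.extract.eia176 import Eia176CsvExtractor
from pudl.workspace.datastore import Datastore

# Direct, non-Dagster use: instantiate with a datastore and pull every table
# listed in package_data/eia176/table_file_map.csv.
raw_dfs = Eia176CsvExtractor(Datastore()).extract()
print(raw_dfs.keys())  # e.g. dict_keys(['company'])

# Dagster use: wrap the same extraction in a graph asset named "eia176_raw_dfs".
eia176_raw_dfs = raw_df_factory(Eia176CsvExtractor, name="eia176")
```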
@@ -0,0 +1,46 @@
"""Extract EIA Form 176 data from CSVs. | ||
|
||
The EIA Form 176 archive also contains CSVs for EIA Form 191 and EIA Form 757. | ||
davidmudrauskas marked this conversation as resolved.
Show resolved
Hide resolved
|
||
""" | ||
|
||
from dagster import AssetOut, Output, multi_asset | ||
|
||
from pudl.extract.csv import CsvExtractor, raw_df_factory | ||
|
||
DATASET = "eia176" | ||
|
||
|
||
class Eia176CsvExtractor(CsvExtractor): | ||
"""Extractor for EIA Form 176 data.""" | ||
|
||
DATASET = DATASET | ||
|
||
|
||
# TODO (davidmudrauskas): Add this information to the metadata | ||
raw_table_names = (f"raw_{DATASET}__company",) | ||
|
||
eia176_raw_dfs = raw_df_factory(Eia176CsvExtractor, name=DATASET) | ||
|
||
|
||
@multi_asset( | ||
outs={table_name: AssetOut() for table_name in sorted(raw_table_names)}, | ||
required_resource_keys={"datastore", "dataset_settings"}, | ||
) | ||
def extract_eia176(context, eia176_raw_dfs): | ||
"""Extract EIA-176 data from CSV source and return dataframes. | ||
|
||
Args: | ||
context: dagster keyword that provides access to resources and config. | ||
|
||
Returns: | ||
A tuple of extracted EIA dataframes. | ||
""" | ||
eia176_raw_dfs = { | ||
f"raw_{DATASET}__" + table_name: df for table_name, df in eia176_raw_dfs.items() | ||
} | ||
eia176_raw_dfs = dict(sorted(eia176_raw_dfs.items())) | ||
|
||
return ( | ||
Output(output_name=table_name, value=df) | ||
for table_name, df in eia176_raw_dfs.items() | ||
) |
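To make the multi_asset's effect concrete: each AssetOut declared above becomes an independently addressable asset, so downstream code can depend on just the table it needs. A hypothetical downstream asset (the name and transform logic below are invented for illustration) might look like:

```python
# Hypothetical consumer (not part of the diff): depends only on the
# raw_eia176__company output produced by extract_eia176 above.
import pandas as pd
from dagster import asset


@asset
def _core_eia176__company(raw_eia176__company: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a transform step consuming the raw company table."""
    return raw_eia176__company.rename(columns=str.lower)
```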
@@ -0,0 +1 @@
"""Extract EIA Form 191 data from CSVs."""
@@ -0,0 +1 @@
"""Extract EIA Form 757 data from CSVs."""
@@ -0,0 +1,2 @@
table,filename
company,all_company_176.csv
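This two-line CSV is what CsvExtractor._open_csv_resource feeds into the dict comprehension in __init__. A small self-contained sketch of that parsing, using an inline string instead of the real package data:

```python
# Self-contained sketch of how table_file_map.csv becomes the extractor's
# table -> filename mapping (the real code reads it via importlib.resources).
import io
from csv import DictReader

table_file_map_csv = "table,filename\ncompany,all_company_176.csv\n"
table_file_map = {
    row["table"]: row["filename"]
    for row in DictReader(io.StringIO(table_file_map_csv))
}
assert table_file_map == {"company": "all_company_176.csv"}
```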
@@ -0,0 +1,41 @@
"""Unit tests for pudl.extract.csv module.""" | ||
from unittest.mock import MagicMock, patch | ||
|
||
from pudl.extract.csv import CsvExtractor | ||
|
||
TABLE_NAME = "company" | ||
|
||
FILENAME = "all_company_176.csv" | ||
TABLE_FILE_MAP = {TABLE_NAME: FILENAME} | ||
|
||
DATASET = "eia176" | ||
|
||
|
||
class FakeCsvExtractor(CsvExtractor): | ||
DATASET = DATASET | ||
|
||
|
||
def get_csv_extractor(): | ||
datastore = MagicMock() | ||
return FakeCsvExtractor(datastore) | ||
|
||
|
||
@patch("pudl.extract.csv.pd") | ||
def test_csv_extractor_read_source(mock_pd): | ||
extractor = get_csv_extractor() | ||
res = extractor.read_source(FILENAME) | ||
mock_zipfile = extractor._zipfile | ||
mock_zipfile.open.assert_called_once_with(FILENAME) | ||
f = mock_zipfile.open.return_value.__enter__.return_value | ||
mock_pd.read_csv.assert_called_once_with(f) | ||
df = mock_pd.read_csv() | ||
assert df == res | ||
|
||
|
||
def test_csv_extractor_extract(): | ||
extractor = get_csv_extractor() | ||
df = MagicMock() | ||
with patch.object(CsvExtractor, "read_source", return_value=df) as mock_read_source: | ||
raw_dfs = extractor.extract() | ||
mock_read_source.assert_called_once_with(FILENAME) | ||
assert {TABLE_NAME: df} == raw_dfs |
Review comment: So this CsvExtractor class requires subclassing to set the DATASET - that seems a little clunky. If we can pass the dataset name to __init__, we'll have a CsvExtractor class which exposes a nice generic CSV extraction interface that doesn't require subclassing to read a variety of different collections of CSV files.

I think we might want to tweak that API a bit to expose table-level operations, because then we can read in multiple tables' files in parallel - each table could have its own asset (a hedged sketch of this shape follows below), which is pretty straightforward to factory-ize if you want to make a bunch of these assets.

In the event we need to read & combine multiple files for a single table (like we see in EIA 860), we can turn that simple asset above into a graph-backed asset. But for EIA 176 that seems like overkill.
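The code snippets in the comment above did not survive in this view; the following sketch reconstructs the shape being suggested. All names here (raw_csv_asset_factory, extract_one, the reworked CsvExtractor constructor) are hypothetical, not part of the PR.

```python
# Hypothetical sketch of the table-level, non-subclassing API suggested above.
import pandas as pd
from dagster import AssetsDefinition, asset


def raw_csv_asset_factory(dataset: str, table: str) -> AssetsDefinition:
    """Make one asset per table, so tables can be extracted in parallel."""

    @asset(name=f"raw_{dataset}__{table}", required_resource_keys={"datastore"})
    def _extract(context) -> pd.DataFrame:
        # Assumes CsvExtractor has been reworked to take the dataset name in
        # __init__ and to expose a per-table extract_one() method.
        from pudl.extract.csv import CsvExtractor

        extractor = CsvExtractor(context.resources.datastore, dataset)
        return extractor.extract_one(table)

    return _extract


company_asset = raw_csv_asset_factory("eia176", "company")
```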
Review comment: I think there's also an argument for passing in the table -> file(s) map and the path to the file as init params, and then using the datastore at the call site - this lets people use the extractor to explore data sets that we haven't actually integrated into our upstream work yet, and lets you pass stuff in for testing without extensive patching. Could be something like the sketch below.

What do you think?
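Since the code block after "Could be something like" did not survive, here is a hedged reconstruction of the proposed constructor; the parameter names are assumptions.

```python
# Hypothetical sketch: the table -> file(s) map and the zip archive path are
# passed in directly, and the Datastore (if any) is used by the caller to
# locate the archive.
from pathlib import Path
from zipfile import ZipFile

import pandas as pd


class CsvExtractor:
    def __init__(self, zip_path: Path, table_file_map: dict[str, str]):
        self._zipfile = ZipFile(zip_path)
        self._table_file_map = table_file_map

    def extract(self) -> dict[str, pd.DataFrame]:
        return {
            table: pd.read_csv(self._zipfile.open(filename))
            for table, filename in self._table_file_map.items()
        }


# Exploring a not-yet-integrated dataset, or testing, without patching:
# extractor = CsvExtractor(Path("eia176.zip"), {"company": "all_company_176.csv"})
```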
Review comment: Meant to comment on this. I actually started using dataset as a construction parameter but stopped because I was introducing a competing pattern, e.g., vs FercDbfExtractor's inheritance tree. I'll proceed with parameterizing over inheritance now that someone else is also inclined.

I don't know what the datastore would be used for in that case, since right now it's only used to get the zipfile path. But generally you're talking about providing a class that lets a user/client point to any zip file and get dataframes out of it based on a table_files_map? Right now the zipfile path and the table-file(s) map are coupled on the dataset name, e.g., eia176. Decoupling opens a wider window for invalid combos, i.e., table filenames that do not exist in the zip archive, but yeah, it would be nice to be able to develop against data without needing the source published on Zenodo. I'll take a pass in that direction.
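One way to narrow the "invalid combos" window mentioned above would be a validation step at construction time. A hypothetical sketch (the function name is invented, not part of the PR):

```python
# Hypothetical guard against table filenames that do not exist in the archive,
# addressing the decoupling concern raised above.
from zipfile import ZipFile


def validate_table_file_map(zipfile: ZipFile, table_file_map: dict[str, str]) -> None:
    """Fail fast if any mapped filename is missing from the zip archive."""
    missing = set(table_file_map.values()) - set(zipfile.namelist())
    if missing:
        raise ValueError(f"Files not found in archive: {sorted(missing)}")
```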