FERC1 2022 report year fix #2947
Conversation
I re-ran the integration tests because they failed for some weird SQLite reason that seemed unrelated to these code changes and hopefully ephemeral. Checking back in, I see that 40 minutes in, the integration tests are still extracting XBRL data and haven't even gotten to the PUDL ETL. Welp, same error, so I guess it's real.
Looks like there are a bunch of new plants to remap. That makes sense, because there weren't very many to remap earlier, possibly related to the report year issue. Working on that.
We're running into one validation error still:
If we look at the
The largest PUDL plant ID on dev right now is 18020, so anything smaller than that is "old" and anything larger is "new." In addition, any PUDL plant ID >= 18028 is a new small plant (<5 MW), which I did not attempt to match to older plants in the sheet. Of the mapping, only 6 FERC IDs have multiple "old" PUDL IDs mapped to them, which is exactly the number we expect. Combining the FERC IDs above with the new plants: FERC ID 32 corresponds to PUDL ID 773 ("West Phoenix" / "Arizona Public Service Company") and PUDL ID 18120 ("west phoenix 1 combined cycle" / "arizona public service company"); FERC ID 352 corresponds to PUDL ID 600 ("Sweatt" / "Mississippi Power Company") and PUDL ID 18061 ("sweat - steam" / "mississippi power company"); etc. Should I be mapping all these new plants to their old versions? I thought there was no need to map plants <5 MW, but maybe I'm confused.
Ugh, well, the CI is still blowing up on that "can't operate on a closed database" thing... It might be related to the huge amount of alembic logs pouring out of the tests? Are we trying to run migrations on every single test somehow? The upshot is that I ran the full ETL on this branch last night and all validations passed this morning!
@zaneselvans once you've reviewed this (and the CI stuff works again...), feel free to merge this into 2811 and then merge #2948. Or make whatever changes you deem necessary!
This reverts commit c93bfec.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff            @@
##   2811-ferc1-2022   #2947   +/-   ##
=========================================
  Coverage         ?   88.6%
  Files            ?      91
  Lines            ?   10854
  Branches         ?       0
  Hits             ?    9618
  Misses           ?    1236
  Partials         ?       0
=========================================

☔ View full report in Codecov by Sentry.
@@ -1815,10 +1815,10 @@ def assert_cols_areclose(
     # instead of just whether or not there are matches.
     mismatch = df.loc[
         ~np.isclose(
-            df[a_cols],
             df[b_cols],
+            np.ma.masked_where(np.isnan(df[a_cols]), df[a_cols]),
Mask all NaN's so that if one year's starting balance is NaN and the previous year's ending balance is not, it will not be treated as a failure.
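A minimal sketch of why the masking matters, using toy arrays in place of the real balance columns (plain `np.isclose` treats NaN as "not close", so an unmasked NaN would be flagged as a mismatch):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])  # e.g. starting balances
b = np.array([1.0, 2.0, 9.9])     # e.g. prior-year ending balances

# Without masking, the NaN at index 1 reads as a mismatch:
plain = np.isclose(a, b)  # [True, False, False]

# Masking NaNs first (as in the diff above) excludes them from the
# comparison; filling masked entries with True treats them as "close":
masked = np.isclose(np.ma.masked_where(np.isnan(a), a), b)
mismatch = ~masked.filled(True)
print(mismatch.tolist())  # only index 2 is a genuine numeric mismatch
```

Only the real numeric disagreement (3.0 vs 9.9) survives as a mismatch; the NaN is ignored rather than counted as a failure.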
@@ -191,6 +191,6 @@ def test_extract_xbrl(self, ferc1_engine_dbf):
     for table_type, df in xbrl_tables.items():
         # Some raw xbrl tables are empty
         if not df.empty and table_type == "duration":
-            assert (df.report_year >= 2021).all() and (
+            assert (df.report_year >= 2020).all() and (
Records which pertain to 2020 but were reported in 2021 will now have a report_year of 2020, so make this check more permissive.
As stated in the docstring, we had some issues setting facts' report years.
This didn't come up with only the 2021 data: the 2020 data in the XBRLs was set to 2021, but we had 2020 data from DBF which masked the error.
One corner I cut that we should probably handle in a separate PR: I initialize Ferc1Settings() to get the XBRL years, but that's not guaranteed to match the XBRL years in Dagster's execution context, depending on whatever config Dagster is running with. So we should pass the relevant bits of the Dagster context into the transformers, which seems like a slightly larger refactor; hence, a separate PR.
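The pitfall can be illustrated without Dagster at all. The names below (`Ferc1Settings`, `transform`) are simplified stand-ins for the real PUDL objects, not the actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Ferc1Settings:
    # Default XBRL years, as a fresh Ferc1Settings() would carry them.
    xbrl_years: list = field(default_factory=lambda: [2021, 2022])

def transform(raw_years, settings=None):
    # The cut corner: falling back to a fresh Ferc1Settings() can silently
    # disagree with the years the orchestrator was actually configured with.
    settings = settings or Ferc1Settings()
    return [y for y in raw_years if y in settings.xbrl_years]

# If the execution context was configured for 2022 only, the explicit
# settings and the default disagree on the same input:
configured = Ferc1Settings(xbrl_years=[2022])
print(transform([2021, 2022], settings=configured))  # [2022]
print(transform([2021, 2022]))                       # [2021, 2022]
```

Threading the configured settings through (the second argument) is the refactor proposed for the follow-up PR; relying on the default constructor is what works today but can drift from the run config.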