FERC1 2022 report year fix #2947

Merged
merged 11 commits into from
Oct 26, 2023


jdangerx
Member

@jdangerx jdangerx commented Oct 18, 2023

As stated in the docstring, we had some issues setting facts' report years.

This didn't come up with only the 2021 data - the 2020 data in the XBRLs was set to 2021, but we had 2020 data from DBF which masked this error.
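A minimal sketch of the failure mode described above. The record layout, the field names (`filing_year`, `end_date`), and the rule "use the calendar year of the fact's own end date" are illustrative assumptions, not code taken from this PR:

```python
from datetime import date

# Hypothetical facts extracted from one filing submitted for 2021.
# A single filing can carry prior-year values next to current-year ones.
facts = [
    {"filing_year": 2021, "end_date": date(2020, 12, 31), "value": 10.0},
    {"filing_year": 2021, "end_date": date(2021, 12, 31), "value": 12.0},
]


def report_year_from_filing(fact):
    """Problematic approach: stamp every fact with the year of the filing."""
    return fact["filing_year"]


def report_year_from_fact(fact):
    """Assumed fix: use the year that the fact itself pertains to."""
    return fact["end_date"].year


buggy_years = [report_year_from_filing(f) for f in facts]
fixed_years = [report_year_from_fact(f) for f in facts]

print(buggy_years)  # [2021, 2021]
print(fixed_years)  # [2020, 2021]
```

With only 2021 filings loaded, the mislabeled 2020 fact collides with nothing visible, because separate 2020 records from DBF already fill that year. Once 2022 filings arrive, their 2021 facts are stamped 2022 and the problem surfaces.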

One corner I cut that we should probably handle in a separate PR: I initialize Ferc1Settings() to get the XBRL years, but those are not guaranteed to match the XBRL years in Dagster's execution context, which depend on whatever config Dagster is running with. So we should pass the relevant bits of the Dagster context into the transformers, which seems like a slightly larger refactor - hence, separate PR.

@jdangerx jdangerx requested a review from zaneselvans October 18, 2023 15:40
src/pudl/io_managers.py Outdated
@zaneselvans
Member

zaneselvans commented Oct 19, 2023

I re-ran the integration tests because they failed for some weird SQLite reason that seemed like it was probably not related to these code changes and is hopefully ephemeral.

Checking back in on this, I see that 40 minutes into the integration tests running, it is still working on extracting XBRL data and hasn't even gotten to the PUDL ETL.

Welp, same error so I guess it's real.

test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_pudl_in_utilities_ferc1] ERROR [ 37%]
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_ferc1_in_utilities_ferc1_dbf] 
-------------------------------- live log setup --------------------------------
2023-10-19 02:17:39 [   ERROR] sqlalchemy.pool.impl.NullPool:791 Exception during reset or similar
Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 293, in ephemeral_instance_if_missing
    yield ephemeral_instance
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 763, in _finalize_fairy
    fairy._reset(pool, transaction_was_reset)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 1038, in _reset
    pool._dialect.do_rollback(self)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: Cannot operate on a closed database.
2023-10-19 02:17:39 [   ERROR] sqlalchemy.pool.impl.NullPool:791 Exception during reset or similar
Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 293, in ephemeral_instance_if_missing
    yield ephemeral_instance
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 763, in _finalize_fairy
    fairy._reset(pool, transaction_was_reset)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 1038, in _reset
    pool._dialect.do_rollback(self)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: Cannot operate on a closed database.

@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 new-data Requests for integration of new data. xbrl Related to the FERC XBRL transition labels Oct 19, 2023
@zaneselvans zaneselvans linked an issue Oct 19, 2023 that may be closed by this pull request
12 tasks
@jdangerx jdangerx force-pushed the ferc1-2022-report_year_fix branch from b1f560c to dbaaf37 Compare October 20, 2023 17:38
@jdangerx jdangerx mentioned this pull request Oct 20, 2023
@jdangerx
Member Author

Looks like there's a bunch of new plants to remap. That makes sense because there weren't very many to remap earlier, maybe related to the report year issue. Working on that.

@jdangerx jdangerx force-pushed the ferc1-2022-report_year_fix branch from 99a5660 to 05a31f3 Compare October 20, 2023 18:53
@jdangerx
Member Author

We're running into one validation error still:

E           AssertionError: Found 51 (expected 6) plant_id_ferc1 values associated with 81 non-unique plant_id_pudl values.     
E           plant_id_ferc1: [32, 52, 344, 352, 353, 354, 386, 612, 740, 745, 967, 994, 1031, 1032, 1119, 1285, 1314, 1467, 1468, 1473, 1474, 1475, 1488, 1550, 1569, 1574, 1575, 1599, 1614, 1615, 1624, 1625, 1626, 1627, 1628, 1629, 1630, 1631, 1632, 1666, 1667, 1815, 1903, 2001, 2002, 2003, 2004, 2005, 2034, 2035, 2053] 
E           plant_id_pudl: [352, 220, 51, 18069, 8468, 8659, 343, 18074, 373, 18075, 502, 16858, 18076, 560, 18080, 496, 18085, 18086, 18087, 18090, 1803, 11410, 11413, 317, 18096, 733, 18120, 110, 18098, 163, 18100, 464, 18103, 565, 18115, 18119, 1013, 18126, 564, 18118, 18124, 18117, 18099, 18129, 18104, 18125, 18121, 506, 18114, 18116, 437, 18112, 18128, 18127, 18113, 18131, 18130, 18109, 18105, 18106, 18108, 18107, 112, 18133, 655, 18137, 18138, 72, 18136, 146, 138, 585, 600, 18061, 18139, 656, 18140, 12, 18143, 221, 11544].

If we look at the denormalized_plants_steam_ferc1 output, filter by the bad plant_id_ferc1 values, and then see what the actual mapping of FERC1 to PUDL IDs is, we get:

plant_id_ferc1
32             [733, 18120]
52              [496, 1803]
344              [220, 221]
352            [600, 18061]
353            [600, 18139]
354            [656, 18140]
386            [352, 18080]
612             [51, 18069]
740            [373, 18075]
745     [502, 16858, 18076]
967            [146, 18085]
994            [343, 18074]
1031           [655, 18137]
1032           [655, 18138]
1119         [11410, 11413]
1285          [8659, 11544]
1314           [110, 18098]
1467           [163, 18100]
1468           [464, 18103]
1473           [565, 18115]
1474           [733, 18119]
1475          [1013, 18126]
1488           [317, 18096]
1550            [560, 8468]
1569           [564, 18118]
1574           [733, 18124]
1575           [565, 18117]
1599            [72, 18136]
1614           [110, 18099]
1615          [1013, 18129]
1624           [464, 18104]
1625           [733, 18125]
1626           [733, 18121]
1627           [506, 18114]
1628           [565, 18116]
1629           [437, 18112]
1630          [1013, 18128]
1631          [1013, 18127]
1632           [506, 18113]
1666          [1013, 18131]
1667          [1013, 18130]
1815           [112, 18133]
1903            [12, 18143]
2001           [464, 18109]
2002           [464, 18105]
2003           [464, 18106]
2004           [464, 18108]
2005           [464, 18107]
2034           [138, 18086]
2035           [138, 18087]
2053           [585, 18090]

The largest PUDL plant ID on dev right now is 18020 - so anything at or below that is "old" and anything above it is "new." In addition, any PUDL plant ID >= 18028 is a new small plant, <5MW, which I did not attempt to match to older plants in the sheet.

Of these mappings, only 6 have multiple "old" PUDL IDs mapped to the same FERC ID - which is exactly the number we expect.
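The old/new split can be sketched in plain Python. This uses a hand-copied subset of the table above and treats 18020 as the cutoff stated in this comment; the variable and function names are made up for illustration:

```python
# Subset of the plant_id_ferc1 -> plant_id_pudl lists shown above.
ferc_to_pudl = {
    32: [733, 18120],
    52: [496, 1803],
    344: [220, 221],
    745: [502, 16858, 18076],
    1119: [11410, 11413],
    1285: [8659, 11544],
    1550: [560, 8468],
    2053: [585, 18090],
}

# Largest PUDL plant ID on dev, per the comment above.
MAX_OLD_PUDL_ID = 18020


def old_ids(pudl_ids):
    """Keep only the PUDL IDs that already existed on dev."""
    return [pudl_id for pudl_id in pudl_ids if pudl_id <= MAX_OLD_PUDL_ID]


# FERC IDs that map to more than one pre-existing PUDL ID.
multi_old = {
    ferc_id: old_ids(pudl_ids)
    for ferc_id, pudl_ids in ferc_to_pudl.items()
    if len(old_ids(pudl_ids)) > 1
}

print(sorted(multi_old))  # [52, 344, 745, 1119, 1285, 1550]
```

Every other row in the full table pairs exactly one ID at or below the cutoff with IDs above it, so the same six FERC IDs come out when the filter is applied to all 51 rows.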

If we combine the FERC IDs above with the pudl_id_mapping.xlsx:

FERC ID 32 corresponds to PUDL ID 733 ("West Phoenix" / "Arizona Public Service Company") and PUDL ID 18120 ("plant name: west phoenix 1 combined cycle" / "arizona public service company")

FERC ID 352 corresponds to PUDL ID 600 ("Sweatt" / "Mississippi Power Company") and PUDL ID 18061 ("sweat - steam" / "mississippi power company")

etc.

Should I be mapping all these new plants to their old versions? I thought there was no need to map plants <5MW, but maybe I'm confused.

@jdangerx
Member Author

Ugh, well the CI is still blowing up on that "can't operate on a closed database" thing... it might be related to the huge amount of alembic logs pouring out of the tests? Are we trying to run migrations on every single test somehow?

The upshot is that I ran the full ETL on this branch last night and validations all passed this morning!

@jdangerx
Member Author

@zaneselvans once you've reviewed this (and the CI stuff works again...) feel free to merge this into 2811 and then merge #2948 . Or to make changes that you deem necessary!

@codecov

codecov bot commented Oct 26, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (2811-ferc1-2022@a34a3fe).

Additional details and impacted files
@@                Coverage Diff                @@
##             2811-ferc1-2022   #2947   +/-   ##
=================================================
  Coverage                   ?   88.6%           
=================================================
  Files                      ?      91           
  Lines                      ?   10854           
  Branches                   ?       0           
=================================================
  Hits                       ?    9618           
  Misses                     ?    1236           
  Partials                   ?       0           


@@ -1815,10 +1815,10 @@ def assert_cols_areclose(
# instead of just whether or not there are matches.
mismatch = df.loc[
~np.isclose(
- df[a_cols],
- df[b_cols],
+ np.ma.masked_where(np.isnan(df[a_cols]), df[a_cols]),
Member

Mask all NaNs so that if one year's starting balance is NaN and the previous year's ending balance is not, the pair is not treated as a failure.
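The intent of that comment, sketched in plain Python rather than with the `np.ma.masked_where` call from the diff. The function name, the sample balances, and the choice to skip a pair when either side is NaN are assumptions made for illustration:

```python
import math


def mismatched_rows(a_vals, b_vals, rel_tol=1e-9):
    """Return indices where both values are present and not close."""
    bad = []
    for i, (a, b) in enumerate(zip(a_vals, b_vals)):
        # A missing value on either side means there is nothing to compare.
        if math.isnan(a) or math.isnan(b):
            continue
        if not math.isclose(a, b, rel_tol=rel_tol):
            bad.append(i)
    return bad


nan = float("nan")
starting_balance = [100.0, nan, 80.0, nan]
prev_ending_balance = [100.0, 250.0, 75.0, nan]

result = mismatched_rows(starting_balance, prev_ending_balance)
print(result)  # [2]
```

Only index 2 is reported, because it is the only pair where both balances exist and disagree. Index 1, where the starting balance is NaN, is skipped instead of counted as a mismatch.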

@@ -191,6 +191,6 @@ def test_extract_xbrl(self, ferc1_engine_dbf):
for table_type, df in xbrl_tables.items():
# Some raw xbrl tables are empty
if not df.empty and table_type == "duration":
- assert (df.report_year >= 2021).all() and (
+ assert (df.report_year >= 2020).all() and (
Member

Records that pertain to 2020 but were reported in 2021 will now have a report_year of 2020, so make this check more permissive.

@zschira zschira merged commit 6cc735d into 2811-ferc1-2022 Oct 26, 2023
9 checks passed
@zschira zschira deleted the ferc1-2022-report_year_fix branch October 26, 2023 19:34
Labels
ferc1 Anything having to do with FERC Form 1 new-data Requests for integration of new data. xbrl Related to the FERC XBRL transition
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Integrate FERC 2021-2022 data with new extractor
3 participants