FERC1 2022 report year fix #2947

jdangerx · 2023-10-18T15:39:58Z

As stated in the docstring, we had some issues setting facts' report years.

This didn't come up when we only had the 2021 data: the 2020 data in the XBRLs was assigned a report year of 2021, but we also had 2020 data from DBF, which masked the error.

One corner I cut that we should probably handle in a separate PR: I initialize Ferc1Settings() to get the XBRL years, but those are not guaranteed to match the XBRL years in Dagster's execution context, which depend on whatever config Dagster is running with. We should pass the relevant bits of the Dagster context in to the transformers, which seems like a slightly larger refactor - hence, separate PR.
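A minimal sketch of the refactor direction described above, assuming the transformer receives the configured years explicitly instead of building default settings itself. `XbrlYearsConfig` and `filter_to_configured_years` are hypothetical names for illustration only; they are not PUDL or Dagster APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XbrlYearsConfig:
    """Hypothetical stand-in for the XBRL years carried by the run config."""

    years: tuple[int, ...]


def filter_to_configured_years(records, config):
    """Keep only records whose report_year is in the years passed in by the caller."""
    return [r for r in records if r["report_year"] in config.years]


# The caller (e.g. whatever owns the execution context) supplies the years,
# so the transformer never constructs its own default settings object.
records = [{"report_year": 2020}, {"report_year": 2021}, {"report_year": 2022}]
run_config = XbrlYearsConfig(years=(2021, 2022))
kept = filter_to_configured_years(records, run_config)
```

With this shape, a run configured for a subset of years cannot silently disagree with the transformer about which years exist.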

src/pudl/io_managers.py

zaneselvans · 2023-10-19T01:29:09Z

I re-ran the integration tests because they failed for some weird SQLite reason that seemed like it was probably not related to these code changes and is hopefully ephemeral.

Checking back in on this, I see that 40 minutes into the integration tests running, it is still working on extracting XBRL data and hasn't even gotten to the PUDL ETL.

Welp, same error so I guess it's real.

test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_pudl_in_utilities_ferc1] ERROR [ 37%]
test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_utility_id_ferc1_in_utilities_ferc1_dbf] 
-------------------------------- live log setup --------------------------------
2023-10-19 02:17:39 [   ERROR] sqlalchemy.pool.impl.NullPool:791 Exception during reset or similar
Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 293, in ephemeral_instance_if_missing
    yield ephemeral_instance
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 763, in _finalize_fairy
    fairy._reset(pool, transaction_was_reset)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 1038, in _reset
    pool._dialect.do_rollback(self)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: Cannot operate on a closed database.
2023-10-19 02:17:39 [   ERROR] sqlalchemy.pool.impl.NullPool:791 Exception during reset or similar
Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/dagster/_core/execution/api.py", line 293, in ephemeral_instance_if_missing
    yield ephemeral_instance
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 763, in _finalize_fairy
    fairy._reset(pool, transaction_was_reset)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 1038, in _reset
    pool._dialect.do_rollback(self)
  File "/home/runner/work/pudl/pudl/.env_tox/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: Cannot operate on a closed database.

jdangerx · 2023-10-20T18:22:52Z

Looks like there's a bunch of new plants to remap. That makes sense because there weren't very many to remap earlier, maybe related to the report year issue. Working on that.

jdangerx · 2023-10-20T21:05:12Z

We're running into one validation error still:

E           AssertionError: Found 51 (expected 6) plant_id_ferc1 values associated with 81 non-unique plant_id_pudl values.     
E           plant_id_ferc1: [32, 52, 344, 352, 353, 354, 386, 612, 740, 745, 967, 994, 1031, 1032, 1119, 1285, 1314, 1467, 1468, 1473, 1474, 1475, 1488, 1550, 1569, 1574, 1575, 1599, 1614, 1615, 1624, 1625, 1626, 1627, 1628, 1629, 1630, 1631, 1632, 1666, 1667, 1815, 1903, 2001, 2002, 2003, 2004, 2005, 2034, 2035, 2053] 
E           plant_id_pudl: [352, 220, 51, 18069, 8468, 8659, 343, 18074, 373, 18075, 502, 16858, 18076, 560, 18080, 496, 18085, 18086, 18087, 18090, 1803, 11410, 11413, 317, 18096, 733, 18120, 110, 18098, 163, 18100, 464, 18103, 565, 18115, 18119, 1013, 18126, 564, 18118, 18124, 18117, 18099, 18129, 18104, 18125, 18121, 506, 18114, 18116, 437, 18112, 18128, 18127, 18113, 18131, 18130, 18109, 18105, 18106, 18108, 18107, 112, 18133, 655, 18137, 18138, 72, 18136, 146, 138, 585, 600, 18061, 18139, 656, 18140, 12, 18143, 221, 11544].

If we look at the denormalized_plants_steam_ferc1 output, filter by the bad plant_id_ferc1 values, and then see what the actual mapping of ferc1 to pudl IDs are, we get:

plant_id_ferc1
32             [733, 18120]
52              [496, 1803]
344              [220, 221]
352            [600, 18061]
353            [600, 18139]
354            [656, 18140]
386            [352, 18080]
612             [51, 18069]
740            [373, 18075]
745     [502, 16858, 18076]
967            [146, 18085]
994            [343, 18074]
1031           [655, 18137]
1032           [655, 18138]
1119         [11410, 11413]
1285          [8659, 11544]
1314           [110, 18098]
1467           [163, 18100]
1468           [464, 18103]
1473           [565, 18115]
1474           [733, 18119]
1475          [1013, 18126]
1488           [317, 18096]
1550            [560, 8468]
1569           [564, 18118]
1574           [733, 18124]
1575           [565, 18117]
1599            [72, 18136]
1614           [110, 18099]
1615          [1013, 18129]
1624           [464, 18104]
1625           [733, 18125]
1626           [733, 18121]
1627           [506, 18114]
1628           [565, 18116]
1629           [437, 18112]
1630          [1013, 18128]
1631          [1013, 18127]
1632           [506, 18113]
1666          [1013, 18131]
1667          [1013, 18130]
1815           [112, 18133]
1903            [12, 18143]
2001           [464, 18109]
2002           [464, 18105]
2003           [464, 18106]
2004           [464, 18108]
2005           [464, 18107]
2034           [138, 18086]
2035           [138, 18087]
2053           [585, 18090]
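The listing above can be produced with a filter and a groupby. This is a sketch on a toy frame, assuming only that the output table has `plant_id_ferc1` and `plant_id_pudl` columns; the rows here are illustrative, not the real table contents.

```python
import pandas as pd

# Toy stand-in for denormalized_plants_steam_ferc1 (the real table has many more columns).
plants = pd.DataFrame(
    {
        "plant_id_ferc1": [344, 344, 352, 352, 352, 9999],
        "plant_id_pudl": [220, 221, 600, 18061, 600, 1],
    }
)
# IDs reported by the failing validation (truncated here for the example).
bad_ferc1_ids = [344, 352]

# For each flagged FERC ID, collect the distinct PUDL IDs it is associated with.
ferc1_to_pudl = (
    plants[plants["plant_id_ferc1"].isin(bad_ferc1_ids)]
    .groupby("plant_id_ferc1")["plant_id_pudl"]
    .agg(lambda ids: sorted(int(i) for i in ids.unique()))
)
print(ferc1_to_pudl)
```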

The largest PUDL plant ID on dev right now is 18020 - so any ID at or below that is "old" and any ID above it is "new." In addition, any PUDL plant ID >= 18028 is a new small plant (<5MW), which I did not attempt to match to older plants in the sheet.

Of the mapping, only 6 of them have multiple "old" PUDL IDs being mapped to the same FERC ID - which is exactly the number we expect.
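As a quick check of that count, the sketch below copies the mapping from the listing above and keeps only FERC IDs associated with more than one "old" PUDL ID. The 18020 cutoff is the one stated in this comment; the variable names are just for illustration.

```python
# plant_id_ferc1 -> plant_id_pudl values, copied from the listing above.
ferc1_to_pudl_ids = {
    32: [733, 18120], 52: [496, 1803], 344: [220, 221], 352: [600, 18061],
    353: [600, 18139], 354: [656, 18140], 386: [352, 18080], 612: [51, 18069],
    740: [373, 18075], 745: [502, 16858, 18076], 967: [146, 18085], 994: [343, 18074],
    1031: [655, 18137], 1032: [655, 18138], 1119: [11410, 11413], 1285: [8659, 11544],
    1314: [110, 18098], 1467: [163, 18100], 1468: [464, 18103], 1473: [565, 18115],
    1474: [733, 18119], 1475: [1013, 18126], 1488: [317, 18096], 1550: [560, 8468],
    1569: [564, 18118], 1574: [733, 18124], 1575: [565, 18117], 1599: [72, 18136],
    1614: [110, 18099], 1615: [1013, 18129], 1624: [464, 18104], 1625: [733, 18125],
    1626: [733, 18121], 1627: [506, 18114], 1628: [565, 18116], 1629: [437, 18112],
    1630: [1013, 18128], 1631: [1013, 18127], 1632: [506, 18113], 1666: [1013, 18131],
    1667: [1013, 18130], 1815: [112, 18133], 1903: [12, 18143], 2001: [464, 18109],
    2002: [464, 18105], 2003: [464, 18106], 2004: [464, 18108], 2005: [464, 18107],
    2034: [138, 18086], 2035: [138, 18087], 2053: [585, 18090],
}

# Largest PUDL plant ID currently on dev; IDs at or below this are "old".
MAX_OLD_PLANT_ID_PUDL = 18020

# Keep FERC IDs that map to two or more old PUDL IDs.
multiple_old = {
    ferc_id: old_ids
    for ferc_id, pudl_ids in ferc1_to_pudl_ids.items()
    if len(old_ids := [p for p in pudl_ids if p <= MAX_OLD_PLANT_ID_PUDL]) > 1
}
print(sorted(multiple_old))  # [52, 344, 745, 1119, 1285, 1550]
```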

If we combine the FERC IDs above with the pudl_id_mapping.xlsx:

FERC ID 32 corresponds to PUDL ID 733 ("West Phoenix" / "Arizona Public Service Company") and PUDL ID 18120 ("plant name: west phoenix 1 combined cycle" / "arizona public service company")

FERC ID 352 corresponds to PUDL ID 600 ("Sweatt" / "Mississippi Power Company") and PUDL ID 18061 ("sweat - steam" / "mississippi power company")

etc.

Should I be mapping all these new plants to their old versions? I thought there was no need to map plants <5MW, but maybe I'm confused.

jdangerx · 2023-10-21T12:59:45Z

Ugh, well the CI is still blowing up on that "can't operate on a closed database" thing... it might be related to the huge amount of alembic logs pouring out of the tests? Are we trying to run migrations on every single test somehow?

The upshot is that I ran the full ETL on this branch last night and validations all passed this morning!

jdangerx · 2023-10-21T13:06:06Z

@zaneselvans once you've reviewed this (and the CI stuff works again...) feel free to merge this into 2811 and then merge #2948. Or to make changes that you deem necessary!

codecov · 2023-10-26T19:20:12Z

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (2811-ferc1-2022@a34a3fe).

Additional details and impacted files

@@                Coverage Diff                @@
##             2811-ferc1-2022   #2947   +/-   ##
=================================================
  Coverage                   ?   88.6%           
=================================================
  Files                      ?      91           
  Lines                      ?   10854           
  Branches                   ?       0           
=================================================
  Hits                       ?    9618           
  Misses                     ?    1236           
  Partials                   ?       0

zschira · 2023-10-26T19:23:40Z

src/pudl/helpers.py

@@ -1815,10 +1815,10 @@ def assert_cols_areclose(
    # instead of just whether or not there are matches.
    mismatch = df.loc[
        ~np.isclose(
-            df[a_cols],
-            df[b_cols],
+            np.ma.masked_where(np.isnan(df[a_cols]), df[a_cols]),


Mask all NaN's so if one year's starting balance is NaN and previous year's ending balance is not, it will not be treated as a failure
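To illustrate the intent, here is a small sketch of "ignore NaNs in either column" using explicit boolean masks. This is a swapped-in equivalent for illustration, not the `np.ma.masked_where` call from the diff, and the column names are made up.

```python
import numpy as np
import pandas as pd

# Toy balances; the column names are hypothetical.
df = pd.DataFrame(
    {
        "ending_balance_prev": [100.0, 200.0, 300.0, np.nan],
        "starting_balance": [100.0, np.nan, 305.0, 50.0],
    }
)
a = df["ending_balance_prev"].to_numpy()
b = df["starting_balance"].to_numpy()

# Rows where either side is missing should not count as mismatches.
either_nan = np.isnan(a) | np.isnan(b)
# np.isclose treats NaN as not close by default, so NaN rows need the extra mask.
close = np.isclose(a, b)
mismatch = df.loc[~close & ~either_nan]
print(mismatch)
```

Only the 300.0 vs 305.0 row is reported; the two rows with a NaN on one side are skipped.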

zschira · 2023-10-26T19:26:39Z

test/integration/etl_test.py

@@ -191,6 +191,6 @@ def test_extract_xbrl(self, ferc1_engine_dbf):
            for table_type, df in xbrl_tables.items():
                # Some raw xbrl tables are empty
                if not df.empty and table_type == "duration":
-                    assert (df.report_year >= 2021).all() and (
+                    assert (df.report_year >= 2020).all() and (


Records that pertain to 2020 but were reported in 2021 will now have a report_year of 2020, so make this check more permissive.
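A rough sketch of why the lower bound moves, assuming a duration fact's report year is taken from the period it describes rather than from the filing it arrived in. The `end_date` and `filing_year` columns are hypothetical, and this is not the extractor's actual logic.

```python
import pandas as pd

# Two duration facts from the same 2021 filing: one describes 2020, one describes 2021.
duration_facts = pd.DataFrame(
    {
        "start_date": ["2020-01-01", "2021-01-01"],
        "end_date": ["2020-12-31", "2021-12-31"],
        "filing_year": [2021, 2021],
    }
)

# "Report year" as the year the data is describing, not the year it was filed.
duration_facts["report_year"] = pd.to_datetime(duration_facts["end_date"]).dt.year

# The old lower bound of 2021 would fail here; 2020 is the new floor.
lower_bound_ok = bool((duration_facts["report_year"] >= 2020).all())
```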

jdangerx requested a review from zaneselvans October 18, 2023 15:40

jdangerx mentioned this pull request Oct 18, 2023

Integrate FERC 2021-2022 data with new extractor #2811

Closed

12 tasks

zaneselvans approved these changes Oct 19, 2023

zaneselvans added the ferc1, new-data, and xbrl labels Oct 19, 2023

zaneselvans linked an issue Oct 19, 2023 that may be closed by this pull request

Integrate FERC 2021-2022 data with new extractor #2811

Closed

12 tasks

jdangerx force-pushed the 2811-ferc1-2022 branch from 65ea0ad to 093b6f8 Compare October 20, 2023 17:36

jdangerx added 2 commits October 20, 2023 13:37

Redefine "report year" as "the year the data is describing."

3c6e8c9

Handle correct report_years in instant-to-duration

dbaaf37

jdangerx force-pushed the ferc1-2022-report_year_fix branch from b1f560c to dbaaf37 Compare October 20, 2023 17:38

jdangerx mentioned this pull request Oct 20, 2023

FERC1 2022 #2948

Merged

Update pudl ID mapping with new plants from FERC 2022

05a31f3

jdangerx force-pushed the ferc1-2022-report_year_fix branch from 99a5660 to 05a31f3 Compare October 20, 2023 18:53

jdangerx added 2 commits October 20, 2023 15:05

Remove obsolete ID table join.

7d64f60

Update validation tests with new plants.

8093750

Re-map more plants.

a09d920

e-belfer assigned zschira Oct 23, 2023

zschira added 5 commits October 24, 2023 14:45

Test new extractor version

c93bfec

Merge branch '2811-ferc1-2022' into ferc1-2022-report_year_fix

88d6734

Change assert_cols_areclose to ignore nans in either col

2481e41

Revert "Test new extractor version"

05e8921

This reverts commit c93bfec.

Make XBRL report years more lenient in extraction test to reflected new logic

8fc0e81

zschira reviewed Oct 26, 2023

zschira merged commit 6cc735d into 2811-ferc1-2022 Oct 26, 2023
9 checks passed

zschira deleted the ferc1-2022-report_year_fix branch October 26, 2023 19:34

zschira approved these changes Oct 26, 2023
