Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data maturity for 923m #2936

Merged
merged 18 commits into from
Nov 1, 2023
Merged

Add data maturity for 923m #2936

merged 18 commits into from
Nov 1, 2023

Conversation

aesharpe
Copy link
Member

@aesharpe aesharpe commented Oct 12, 2023

Explanation:

From the issue #2930:

Our current 923 archiver contains annual and YTD data. We need to come up with a way to distinguish the data maturity of each of the partitions. The way we currently determine data_maturity is via the add_data_maturity() function in the extract/excel.py module. In the future it might make sense to add a data_maturity field to the archive metadata, but for now, we'll update the add_data_maturity() to account for the EIA923 data. There's a category for the data_maturity column called incremental_ytd that was created for the purpose of this type of data.

The raw files are formatted as such (examples):

EIA923_Schedules_2_3_4_5_M_07_2023_20SEP2023.xlsx
EIA923_Schedules_2_3_4_5_M_12_2022_Early_Release.xlsx

Where the value after M is the the last month the have data. Data with 12 is annual, data with values less than 12 are YTD. We can use regex to extract the month from here and add the incremental_ytd flag if it's not 12.

This issue will also inherently add the 2023 quarterly data for 923.

Errors to Fix:

  • Unit tests are failing because there are more years of EIA923 data, now that we have the 2023 monthly data in there, than EIA860. How do we want to handle this miss-match? test/unit/settings_test.py::test_eia923_dependency.
=================================== FAILURES ===================================
____________________ TestEiaSettings.test_eia923_dependency ____________________

self = <test.unit.settings_test.TestEiaSettings object at 0x281055ad0>

    def test_eia923_dependency(self):
        """Test 860 is added if 923 is specified and 860 is not."""
        eia923_settings = Eia923Settings()
>       settings = EiaSettings(eia923=eia923_settings)

test/unit/settings_test.py:139: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pydantic.error_wrappers.ValidationError: 1 validation error for EiaSettings
E   __root__ -> __root__
E     '[2023]' years are not available. (type=value_error)

pydantic/main.py:341: ValidationError
  • Validation test_minmax_rows tests for gen_eia923 monthly and annual are negative which seems wrong. When I looked at the table equivalent in the database, the row numbers were completely different, so I need to investigate that too. Would like some help understanding what I should be looking at because this test still seems tailored to the old pudl_out system with monthly and annual resolution, and it's a bit confusing.
eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5178892-433286] - ValueError: gen_eia923: found 433277 rows, expected 433286. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/

eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5178892-433286] - ValueError: gen_eia923: found 5178828 rows, expected 5178892. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/

It appears that 64 ["report_date", "plant_id_eia", "utility_id_eia", "generator_id"] groupings are dropped from this gen table at some point in the process.

…e EIA data. This includes updating the package data to account for the 2023 year and updating the way to assign data maturities to 923 data. This also updates some of the expected row counts for the data. It should still fail on the gen_eia923 table because the row count was going down which doesn't seem right. There are also some failures related to check_date_freq as there are now less than 12 months expected in a given round of updates. Will handle those errors in another commit.
@aesharpe aesharpe added the eia923 Anything having to do with EIA Form 923 label Oct 12, 2023
@aesharpe aesharpe self-assigned this Oct 12, 2023
@aesharpe aesharpe added the rmi label Oct 12, 2023
@aesharpe aesharpe linked an issue Oct 12, 2023 that may be closed by this pull request
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all looks pleasantly straightforward! I sprinkled a few minor suggestions.

also i didn't do a super tight look at the CSVs but it looks like they were basically all copy/paste?!? cool if so! i wonder how straightforward adding the non-ytd release is gonna be in this regard.

src/pudl/output/eia923.py Outdated Show resolved Hide resolved
Comment on lines 200 to 207
# this conditional is a hacky way to get around the fact that the formatting
# for the file names didn't always look like "EIA923_Schedules_2_3_4_5_M_"
# But, because we only want YTD data, it's fine to set a year floor.
if partition["year"] > 2022:
release_month = re.search(
r"EIA923_Schedules_2_3_4_5_M_(\d{2})",
self.excel_filename(page, **partition),
).group(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the filename for the YTD file always contains EIA923_Schedules_2_3_4_5_M_ you could do something like:

Suggested change
# this conditional is a hacky way to get around the fact that the formatting
# for the file names didn't always look like "EIA923_Schedules_2_3_4_5_M_"
# But, because we only want YTD data, it's fine to set a year floor.
if partition["year"] > 2022:
release_month = re.search(
r"EIA923_Schedules_2_3_4_5_M_(\d{2})",
self.excel_filename(page, **partition),
).group(1)
file_name = self.excel_filename(page, **partition)
ytd_file_name_start = "EIA923_Schedules_2_3_4_5_M_"
if ytd_file_name_start in file_name:
release_month = re.search(
r"ytd_file_name_start(\d{2})", file_name,
).group(1)

i think you could probably even use the if ytd_file_name_start in file_name: as the elif instead of checking for 923 first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense, but also it won't fail if there is no capture group found. I feel like it would be good if this failed if, for example, the file naming changed and we needed to come up with a new way to grab it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't think you can use a variable in re.search if you also have the {2} in brackets because it will think it's also an f-string variable.

I.e., release_month = re.search(r"ytd_file_name_start(\d{2})", file_name).group(1) won't work.

src/pudl/package_data/eia923/file_map.csv Outdated Show resolved Hide resolved
test/validate/eia_test.py Outdated Show resolved Hide resolved
@aesharpe
Copy link
Member Author

  • Unit tests are failing because there are more years of EIA923 data, now that we have the 2023 monthly data in there, than EIA860. How do we want to handle this miss-match? test/unit/settings_test.py::test_eia923_dependency.

For now, I commented out the test_eia923_dependency functions. Idk if I should outright delete them or not?

@aesharpe
Copy link
Member Author

aesharpe commented Oct 25, 2023

  • Validation test_minmax_rows tests for gen_eia923 monthly and annual are negative which seems wrong. When I looked at the table equivalent in the database, the row numbers were completely different, so I need to investigate that too. Would like some help understanding what I should be looking at because this test still seems tailored to the old pudl_out system with monthly and annual resolution, and it's a bit confusing.

This is an ongoing and unknown issue documented here: #2978

- Add a note about how the plants are getting dropped in the gen_eia923 output table and link to the issue.

- Update the way we tell whether an EIA923 filing is monthly or annual based on feedback in the PR
… of EIA923 and EIA860 data. This is causing issues for the monthly EIA923 data that gets integrated ahead of any available 860 data. This might cause issues elsewhere which is why I haven't committed to fully deleting it yet.
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several nits + the ask to update the settings test + an ask to keep the data_maturity in the boiler generators table + two question below:

  • Do you need to update the package_data/settings files? (i thought i saw that updated the first time around)
  • Do you need to push a schema update as well??

src/pudl/metadata/resources/eia860.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/eia923.py Outdated Show resolved Hide resolved
src/pudl/output/eia.py Show resolved Hide resolved
src/pudl/transform/eia.py Outdated Show resolved Hide resolved
Comment on lines 136 to 149
def test_eia923_dependency(self):
"""Test 860 is added if 923 is specified and 860 is not."""
eia923_settings = Eia923Settings()
settings = EiaSettings(eia923=eia923_settings)
data_source = DataSource.from_id("eia860")
# def test_eia923_dependency(self):
# """Test 860 is added if 923 is specified and 860 is not."""
# eia923_settings = Eia923Settings()
# settings = EiaSettings(eia923=eia923_settings)
# data_source = DataSource.from_id("eia860")

assert settings.eia860
# assert settings.eia860

assert settings.eia860.years == data_source.working_partitions["years"]
# assert settings.eia860.years == data_source.working_partitions["years"]

def test_eia860_dependency(self):
"""Test 923 tables are added to eia860 if 923 is not specified."""
eia860_settings = Eia860Settings()
settings = EiaSettings(eia860=eia860_settings)
assert settings.eia923.years == eia860_settings.years
# def test_eia860_dependency(self):
# """Test 923 tables are added to eia860 if 923 is not specified."""
# eia860_settings = Eia860Settings()
# settings = EiaSettings(eia860=eia860_settings)
# assert settings.eia923.years == eia860_settings.years
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we chatted about this on the ☎️ and came to the understanding that despite this restriction not being technically necessary across the board, it'd be good to require an overlap between 923 and 860 years.

Copy link
Member

@cmgosnell cmgosnell Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both of these tests could be converted to use isdisjoint. We want to make sure they are not disjoint, so i think we cold do something like this for both tests:

first test

assert settings.eia860
# assign both EIA form years
eia860_years = settings.eia860.years
eia923_years = data_source.working_partitions["years"]
# assert that there is some overlap between EIA years
assert not set(eia860_years).isdisjoint(eia923_years)

second test

# assign both EIA form years
eia860_years = settings.eia860.years
eia923_years = settings.eia923.years
# assert that there is some overlap between EIA years
assert not set(eia860_years).isdisjoint(eia923_years)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the difference is between these two tests? Did you mean to reverse the 860 and the 923 labels? Also, what is the difference between eia923_years = data_source.working_partitions["years"] and eia923_years = settings.eia923.years?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they just some from different places: the working partitions comes from the stored working partitions while the settings.eia923.years come from the settings file that the test is checking.

Copy link
Member Author

@aesharpe aesharpe Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand what you mean by 1st and 2nd test then. They seem to be doing the same thing? Is the idea to test that there is overlap in general and then secondly to test that there is overlap in this specific test you're running?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe these are both checked because of the way the settings classes themselves are constructed. in the pudl.settings.EiaSettings.check_eia_dependencies we are setting default values for 860 and 923 years which can either come from the settings file or is by default grabbed from the working partitions,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also test this in the reverse as in:

set(eia923_years).isdisjoint(eia860_years)

src/pudl/output/eia923.py Outdated Show resolved Hide resolved
src/pudl/output/eia923.py Outdated Show resolved Hide resolved
- Restructure the way that the data_maturity field is dropped from certain tables when merging multiple tables together that each have that field. Previously it was ad-hoc, now it just gets dropped in the denorm_by_plant function.

- This also entails changing how the data_maturity field gets passed through to the agg tables: adds the data_maturity field to the agg function, selecting the 'first' instance of the data_maturity per agg because the fields are aggregated by date which is how data_maturity is determined. The annual aggregations drop the ytd rows before the aggregation happens so taking the first data_maturity value per year works in this case.

- Remove some comment fields

- Add new migration
@aesharpe aesharpe marked this pull request as ready for review October 30, 2023 14:35
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes looks great! It looks like the settings tests weren't converted but I made some suggestions over there. Once the tests all pass let's merge this bb

  • there is a trailing whitespace error in the release notes that is making the docs and pre-commit tests fail
  • also same question about the settings files from above.

@codecov
Copy link

codecov bot commented Oct 31, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (b39eaac) 88.6% compared to head (b376141) 88.7%.
Report is 69 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2936     +/-   ##
=======================================
+ Coverage   88.6%   88.7%   +0.1%     
=======================================
  Files         91      91             
  Lines      10854   11007    +153     
=======================================
+ Hits        9618    9766    +148     
- Misses      1236    1241      +5     
Files Coverage Δ
src/pudl/analysis/allocate_gen_fuel.py 91.3% <ø> (ø)
src/pudl/extract/eia923.py 100.0% <100.0%> (ø)
src/pudl/extract/excel.py 96.9% <100.0%> (+0.1%) ⬆️
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia923.py 100.0% <ø> (ø)
src/pudl/metadata/sources.py 100.0% <ø> (ø)
src/pudl/output/eia.py 59.0% <ø> (ø)
src/pudl/output/eia923.py 97.8% <100.0%> (+0.1%) ⬆️
src/pudl/settings.py 99.2% <100.0%> (+0.3%) ⬆️
src/pudl/transform/eia923.py 85.2% <ø> (ø)
... and 1 more

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment re the settings tests.

also i still don't see any updates to the package_data/settings/etl_fast or package_data/settings/etl_full so i don't think the CI will have actually tested the ytd info yet.

Comment on lines 135 to 146
# def test_eia923_dependency(self):
# """Test 860 is added if 923 is specified and 860 is not."""
# eia923_settings = Eia923Settings()
# settings = EiaSettings(eia923=eia923_settings)
# data_source = DataSource.from_id("eia860")

# assert settings.eia860

# assert settings.eia860.years == data_source.working_partitions["years"]

# def test_eia860_dependency(self):
# """Test 923 tables are added to eia860 if 923 is not specified."""
# eia860_settings = Eia860Settings()
# settings = EiaSettings(eia860=eia860_settings)
# assert settings.eia923.years == eia860_settings.years
def test_eia923_dependency(self):
"""Test that there is some overlap between EIA860 and EIA923 data."""
settings = EiaSettings()
data_source = DataSource.from_id("eia860")
assert settings.eia860
# assign both EIA form years
eia860_years = settings.eia860.years
eia923_years_partition = data_source.working_partitions["years"]
eia923_years_settings = settings.eia923.years
# assert that there is some overlap between EIA years
assert not set(eia860_years).isdisjoint(eia923_years_partition)
assert not set(eia860_years).isdisjoint(eia923_years_settings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous version of these two test were checking if the default years of 923 and 860 were getting added in when there was only 860 or 923 data in the settings respectively.

The first test_eia923_dependency test was checking this scenario:

  • Previously: If you fed PUDL settings EIA 923 only, our EiaSettings knows that 860 also needs to exist and gives it some default years. Are the 860 years the same as the 923 years?
  • What we could update it to now: If you fed PUDL settings EIA 923 only, our EiaSettings knows that 860 also needs to exist and gives it some default 860 years. Are the 860 years appropriate i.e.: do they have some overlap w/ 923? OR are the assigned default years the years in the partitions?

(honestly i'm not sure now why the previous setup for test_eia923_dependency didn't work bc you didn't touch the working partitions for 860 in this PR)

for test_eia860_dependency:

  • Previously: If you fed PUDL settings w/ EIA 860 only, our EiaSettings knows that 923 also needs to exist and gives it some default years. Are the 923 years the same as the 860 years?
  • Now: If you fed PUDL settings w/ EIA 860 only, our EiaSettings knows that 923 also needs to exist and gives it some default years. Do the 923 default years at least overlap with the 860 years?

tl;dr version is i think the combining of these two tests removes some important information. Your test builds a default EiaSettings object without feeding it anything specific with ends ups triggering EiaSettings.default_load_all. The intention previously as I understand it was to test the EiaSettings.check_eia_dependencies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm ok, my version doesn't combine the two tests (test_eia860_dependency and test_eia923_dependency) it just combines the two tests that you suggested above? I fixed the weird empty EiaSettings thing which was a mistake on my end. I also re-added the test_eia860_dependency test with the same but opposite code from the test_eia923_dependency.

I also had to update the check_eia_dependencies in settings.py to create default values for eia860 that were based off of whatever was in EIA923 but also what is actually available in 860. It was failing because it was trying to create a default suite of 860 data based on 923 data that included 2023 (which 860 doesn't have). Lmk if this feels like an ok fix.

… tests so that they aren't dependent on having the same years of EIA923 and EIA860 data
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great updates to the settings!! wahoo 🤞🏻 that the ci does the things

@aesharpe aesharpe merged commit dbf686f into dev Nov 1, 2023
9 checks passed
@aesharpe aesharpe deleted the add-data-maturity-for-923m branch November 1, 2023 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
eia923 Anything having to do with EIA Form 923 rmi
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

EIA923M: Add data maturity flag and 2023 quarterly data
2 participants