-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data maturity for 923m #2936
Conversation
…e EIA data. This includes updating the package data to account for the 2023 year and updating the way to assign data maturities to 923 data. This also updates some of the expected row counts for the data. It should still fail on the gen_eia923 table because the row count was going down which doesn't seem right. There are also some failures related to check_date_freq as there are now less than 12 months expected in a given round of updates. Will handle those errors in another commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this all looks pleasantly straightforward! I sprinkled a few minor suggestions.
also i didn't do a super tight look at the CSVs but it looks like they were basically all copy/paste?!? cool if so! i wonder how straightforward adding the non-ytd release is gonna be in this regard.
src/pudl/extract/excel.py
Outdated
# this conditional is a hacky way to get around the fact that the formatting | ||
# for the file names didn't always look like "EIA923_Schedules_2_3_4_5_M_" | ||
# But, because we only want YTD data, it's fine to set a year floor. | ||
if partition["year"] > 2022: | ||
release_month = re.search( | ||
r"EIA923_Schedules_2_3_4_5_M_(\d{2})", | ||
self.excel_filename(page, **partition), | ||
).group(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the filename for the YTD file always contains EIA923_Schedules_2_3_4_5_M_
you could do something like:
# this conditional is a hacky way to get around the fact that the formatting | |
# for the file names didn't always look like "EIA923_Schedules_2_3_4_5_M_" | |
# But, because we only want YTD data, it's fine to set a year floor. | |
if partition["year"] > 2022: | |
release_month = re.search( | |
r"EIA923_Schedules_2_3_4_5_M_(\d{2})", | |
self.excel_filename(page, **partition), | |
).group(1) | |
file_name = self.excel_filename(page, **partition) | |
ytd_file_name_start = "EIA923_Schedules_2_3_4_5_M_" | |
if ytd_file_name_start in file_name: | |
release_month = re.search( | |
r"ytd_file_name_start(\d{2})", file_name, | |
).group(1) |
i think you could probably even use the if ytd_file_name_start in file_name:
as the elif
instead of checking for 923 first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, but also it won't fail if there is no capture group found. I feel like it would be good if this failed if, for example, the file naming changed and we needed to come up with a new way to grab it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I don't think you can use a variable in re.search
if you also have the {2} in brackets because it will think it's also an f-string variable.
I.e., release_month = re.search(r"ytd_file_name_start(\d{2})", file_name).group(1)
won't work.
…move double returns from the drop_ytd_for_annual_tables function
For now, I commented out the test_eia923_dependency functions. Idk if I should outright delete them or not? |
This is an ongoing and unknown issue documented here: #2978 |
- Add a note about how the plants are getting dropped in the gen_eia923 output table and link to the issue. - Update the way we tell whether an EIA923 filing is monthly or annual based on feedback in the PR
… of EIA923 and EIA860 data. This is causing issues for the monthly EIA923 data that gets integrated ahead of any available 860 data. This might cause issues elsewhere which is why I haven't committed to fully deleting it yet.
…td records from annual EIA tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several nits + the ask to update the settings test + an ask to keep the data_maturity
in the boiler generators table + two question below:
- Do you need to update the
package_data/settings
files? (i thought i saw that updated the first time around) - Do you need to push a schema update as well??
test/unit/settings_test.py
Outdated
def test_eia923_dependency(self): | ||
"""Test 860 is added if 923 is specified and 860 is not.""" | ||
eia923_settings = Eia923Settings() | ||
settings = EiaSettings(eia923=eia923_settings) | ||
data_source = DataSource.from_id("eia860") | ||
# def test_eia923_dependency(self): | ||
# """Test 860 is added if 923 is specified and 860 is not.""" | ||
# eia923_settings = Eia923Settings() | ||
# settings = EiaSettings(eia923=eia923_settings) | ||
# data_source = DataSource.from_id("eia860") | ||
|
||
assert settings.eia860 | ||
# assert settings.eia860 | ||
|
||
assert settings.eia860.years == data_source.working_partitions["years"] | ||
# assert settings.eia860.years == data_source.working_partitions["years"] | ||
|
||
def test_eia860_dependency(self): | ||
"""Test 923 tables are added to eia860 if 923 is not specified.""" | ||
eia860_settings = Eia860Settings() | ||
settings = EiaSettings(eia860=eia860_settings) | ||
assert settings.eia923.years == eia860_settings.years | ||
# def test_eia860_dependency(self): | ||
# """Test 923 tables are added to eia860 if 923 is not specified.""" | ||
# eia860_settings = Eia860Settings() | ||
# settings = EiaSettings(eia860=eia860_settings) | ||
# assert settings.eia923.years == eia860_settings.years |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we chatted about this on the ☎️ and came to the understanding that despite this restriction not being technically necessary across the board, it'd be good to require an overlap between 923 and 860 years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both of these tests could be converted to use isdisjoint
. We want to make sure they are not disjoint, so i think we cold do something like this for both tests:
first test
assert settings.eia860
# assign both EIA form years
eia860_years = settings.eia860.years
eia923_years = data_source.working_partitions["years"]
# assert that there is some overlap between EIA years
assert not set(eia860_years).isdisjoint(eia923_years)
second test
# assign both EIA form years
eia860_years = settings.eia860.years
eia923_years = settings.eia923.years
# assert that there is some overlap between EIA years
assert not set(eia860_years).isdisjoint(eia923_years)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the difference is between these two tests? Did you mean to reverse the 860 and the 923 labels? Also, what is the difference between eia923_years = data_source.working_partitions["years"]
and eia923_years = settings.eia923.years
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they just some from different places: the working partitions comes from the stored working partitions while the settings.eia923.years
come from the settings file that the test is checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I don't understand what you mean by 1st and 2nd test then. They seem to be doing the same thing? Is the idea to test that there is overlap in general and then secondly to test that there is overlap in this specific test you're running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe these are both checked because of the way the settings classes themselves are constructed. in the pudl.settings.EiaSettings.check_eia_dependencies
we are setting default values for 860 and 923 years which can either come from the settings file or is by default grabbed from the working partitions,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test this in the reverse as in:
set(eia923_years).isdisjoint(eia860_years)
- Restructure the way that the data_maturity field is dropped from certain tables when merging multiple tables together that each have that field. Previously it was ad-hoc, now it just gets dropped in the denorm_by_plant function. - This also entails changing how the data_maturity field gets passed through to the agg tables: adds the data_maturity field to the agg function, selecting the 'first' instance of the data_maturity per agg because the fields are aggregated by date which is how data_maturity is determined. The annual aggregations drop the ytd rows before the aggregation happens so taking the first data_maturity value per year works in this case. - Remove some comment fields - Add new migration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes looks great! It looks like the settings tests weren't converted but I made some suggestions over there. Once the tests all pass let's merge this bb
- there is a trailing whitespace error in the release notes that is making the docs and pre-commit tests fail
- also same question about the settings files from above.
…years overlap but don't need to be the same
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## dev #2936 +/- ##
=======================================
+ Coverage 88.6% 88.7% +0.1%
=======================================
Files 91 91
Lines 10854 11007 +153
=======================================
+ Hits 9618 9766 +148
- Misses 1236 1241 +5
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment re the settings tests.
also i still don't see any updates to the package_data/settings/etl_fast
or package_data/settings/etl_full
so i don't think the CI will have actually tested the ytd info yet.
test/unit/settings_test.py
Outdated
# def test_eia923_dependency(self): | ||
# """Test 860 is added if 923 is specified and 860 is not.""" | ||
# eia923_settings = Eia923Settings() | ||
# settings = EiaSettings(eia923=eia923_settings) | ||
# data_source = DataSource.from_id("eia860") | ||
|
||
# assert settings.eia860 | ||
|
||
# assert settings.eia860.years == data_source.working_partitions["years"] | ||
|
||
# def test_eia860_dependency(self): | ||
# """Test 923 tables are added to eia860 if 923 is not specified.""" | ||
# eia860_settings = Eia860Settings() | ||
# settings = EiaSettings(eia860=eia860_settings) | ||
# assert settings.eia923.years == eia860_settings.years | ||
def test_eia923_dependency(self): | ||
"""Test that there is some overlap between EIA860 and EIA923 data.""" | ||
settings = EiaSettings() | ||
data_source = DataSource.from_id("eia860") | ||
assert settings.eia860 | ||
# assign both EIA form years | ||
eia860_years = settings.eia860.years | ||
eia923_years_partition = data_source.working_partitions["years"] | ||
eia923_years_settings = settings.eia923.years | ||
# assert that there is some overlap between EIA years | ||
assert not set(eia860_years).isdisjoint(eia923_years_partition) | ||
assert not set(eia860_years).isdisjoint(eia923_years_settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous version of these two test were checking if the default years of 923 and 860 were getting added in when there was only 860 or 923 data in the settings respectively.
The first test_eia923_dependency
test was checking this scenario:
- Previously: If you fed PUDL settings EIA 923 only, our
EiaSettings
knows that 860 also needs to exist and gives it some default years. Are the 860 years the same as the 923 years? - What we could update it to now: If you fed PUDL settings EIA 923 only, our
EiaSettings
knows that 860 also needs to exist and gives it some default 860 years. Are the 860 years appropriate i.e.: do they have some overlap w/ 923? OR are the assigned default years the years in the partitions?
(honestly i'm not sure now why the previous setup for test_eia923_dependency
didn't work bc you didn't touch the working partitions for 860 in this PR)
for test_eia860_dependency
:
- Previously: If you fed PUDL settings w/ EIA 860 only, our
EiaSettings
knows that 923 also needs to exist and gives it some default years. Are the 923 years the same as the 860 years? - Now: If you fed PUDL settings w/ EIA 860 only, our
EiaSettings
knows that 923 also needs to exist and gives it some default years. Do the 923 default years at least overlap with the 860 years?
tl;dr version is i think the combining of these two tests removes some important information. Your test builds a default EiaSettings
object without feeding it anything specific with ends ups triggering EiaSettings.default_load_all
. The intention previously as I understand it was to test the EiaSettings.check_eia_dependencies
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm ok, my version doesn't combine the two tests (test_eia860_dependency
and test_eia923_dependency
) it just combines the two tests that you suggested above? I fixed the weird empty EiaSettings thing which was a mistake on my end. I also re-added the test_eia860_dependency
test with the same but opposite code from the test_eia923_dependency
.
I also had to update the check_eia_dependencies
in settings.py
to create default values for eia860 that were based off of whatever was in EIA923 but also what is actually available in 860. It was failing because it was trying to create a default suite of 860 data based on 923 data that included 2023 (which 860 doesn't have). Lmk if this feels like an ok fix.
… tests so that they aren't dependent on having the same years of EIA923 and EIA860 data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great updates to the settings!! wahoo 🤞🏻 that the ci does the things
…maturity is not ytd_incremental
Explanation:
From the issue #2930:
Errors to Fix:
test/unit/settings_test.py::test_eia923_dependency
.test_minmax_rows
tests forgen_eia923
monthly and annual are negative which seems wrong. When I looked at the table equivalent in the database, the row numbers were completely different, so I need to investigate that too. Would like some help understanding what I should be looking at because this test still seems tailored to the oldpudl_out
system with monthly and annual resolution, and it's a bit confusing.It appears that 64 ["report_date", "plant_id_eia", "utility_id_eia", "generator_id"] groupings are dropped from this gen table at some point in the process.