
Update load processing; part of #67 #170

Conversation

JanFrederickUnnewehr (Contributor)

Is part of #67

Changes proposed in this Pull Request

The following update concerns the processing of load data in pypsa-eur.
Added a rule (build_load_data) that downloads the latest load data from the OPSD website.
The resulting data are cleaned and gaps are filled based on manual filling methods. Before and after gap filling, the rule provides information about the gaps (length, frequency) via the function nan_statistics(df); a sketch of such a helper is shown below.
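For illustration, a minimal sketch of what such a gap-statistics helper could look like (hypothetical; the actual nan_statistics in build_load_data.py may differ, and the monthly aggregation shown here is an assumption):

import pandas as pd

def nan_statistics(df):
    # longest run of consecutive NaNs per column
    def max_consecutive_nans(s):
        return (s.isnull().astype(int)
                 .groupby(s.notnull().astype(int).cumsum())
                 .sum().max())
    consecutive = df.apply(max_consecutive_nans)
    total = df.isnull().sum()
    max_total_per_month = df.isnull().resample('M').sum().max()
    return pd.concat([total, consecutive, max_total_per_month],
                     keys=['total', 'consecutive', 'max_total_per_month'], axis=1)

The consecutive column is what later threshold checks like nan_stats.consecutive.max() operate on.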

Checklist

  • I tested my contribution locally and it seems to work fine.
  • Code and workflow changes are sufficiently documented.
  • Newly introduced dependencies are added to environment.yaml and environment.docs.yaml.
  • Changes in configuration options are added in all of config.default.yaml, config.tutorial.yaml, and test/config.test1.yaml.
  • Changes in configuration options are also documented in doc/configtables/*.csv and line references are adjusted in doc/configuration.rst and doc/tutorial.rst.
  • A note for the release notes doc/release_notes.rst is amended in the format of previous release notes.

Is there a way to automate the generation of the file doc/configuration.rst?
Let's first discuss the proposed changes before I add them to doc/release_notes.rst.

@FabianHofmann (Contributor) left a comment

Many thanks @JanFrederickUnnewehr for contributing again, good job! I reviewed the first bit; the next part is coming.

scripts/add_electricity.py (outdated review thread, resolved)
scripts/build_load_data.py (3 outdated review threads, resolved)
Comment on lines 408 to 421

# Save location
to_fn = Path(f"{rootpath}/data/time_series_60min_singleindex.csv")

logger.info(f"Downloading load data from '{url}'.")

progress_retrieve(url, to_fn)

logger.info(f"Raw load data available at '{to_fn}'.")

opsd_load = (load_timeseries_opsd(years=slice(*pd.date_range(freq='y', **snakemake.config['snapshots'])[[0,-1]].year.astype(str)),
                                  fn=to_fn,
                                  countries=snakemake.config['countries'],
                                  source=snakemake.config['load']['source']))
Contributor

Suggested change
-# Save location
-to_fn = Path(f"{rootpath}/data/time_series_60min_singleindex.csv")
-logger.info(f"Downloading load data from '{url}'.")
-progress_retrieve(url, to_fn)
-logger.info(f"Raw load data available at '{to_fn}'.")
-opsd_load = (load_timeseries_opsd(years=slice(*pd.date_range(freq='y', **snakemake.config['snapshots'])[[0,-1]].year.astype(str)),
-                                  fn=to_fn,
-                                  countries=snakemake.config['countries'],
-                                  source=snakemake.config['load']['source']))
+opsd_load = (load_timeseries_opsd(years=slice(*pd.date_range(freq='y', **snakemake.config['snapshots'])[[0,-1]].year.astype(str)),
+                                  fn=url,
+                                  countries=snakemake.config['countries'],
+                                  source=snakemake.config['load']['source']))

Directly reading from the URL is fine, as we do not really need the raw data.

Contributor Author

OK, I had also thought that it might be useful to separate the downloading and the processing of the load data; then we could control the automatic downloading of the data like in the data bundle. Maybe this is a bit too much work just for the load data, though. Maybe it makes sense to integrate everything into the load data bundle function.

Contributor

I think it's fine to do it here. It's good to have as many things as possible outside of the data bundle; it causes extra work to maintain it properly. The more that is retrieved automatically, the better.

JanFrederickUnnewehr and others added 8 commits July 15, 2020 14:54
Co-authored-by: FabianHofmann <hofmann@fias.uni-frankfurt.de>
@martacki (Member)

Maybe we could add a switch for whether to manually fill the gaps as in the old timeseries_opsd? See the PR in FRESNA https://github.com/FRESNA/vresutils/pull/14/files ("manual_alterations"): there are many gaps lasting only a couple of hours, but others span many months. Maybe we could add a config switch manual_load_alterations (or similar): True/False? Then it's up to the user.
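For example, such a switch could be read from the config to gate the manual filling; a minimal sketch, assuming a hypothetical key manual_load_alterations under the load section:

# hypothetical config key following the suggestion above, defaulting to False
if snakemake.config['load'].get('manual_load_alterations', False):
    # manual gap filling as in the old timeseries_opsd
    load = manual_adjustment(load=load, source=snakemake.config['load']['source'])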

@JanFrederickUnnewehr (Contributor Author)

Moin, check out the new version of build_load_data. I have tried to include all suggestions in the code. Did I forget anything? We could keep the big manual adjustment part as an example but also shorten it, as it is not complete; see 'GB'.

@martacki (Member)

martacki commented Jul 16, 2020

You often copy from start to stop; maybe you can use my function from vresutils for that:

def copy_timeslice(load, cntry, start, stop, delta):
    start = pd.Timestamp(start)
    stop = pd.Timestamp(stop)
    if start in load.index and stop in load.index:
        load.loc[start:stop, cntry] = load.loc[start-delta:stop-delta, cntry].values
    return load

and then use it as, for example, copy_timeslice(load, 'GR', '2015-08-11 21:00', '2015-08-15 20:00', pd.Timedelta(weeks=1)). That saves many lines of code and makes it more readable.

Also, if you interpolate, you can add a limit for how many hours should be interpolated at most. Before, we used 4 hours:
load[interpolate_countries] = load[interpolate_countries].interpolate(limit=4)

You update Kosovo and Albania twice, line 142+

load['KV'] = load['RS'] * (4.8 / 27.)
load['AL'] = load['MK'] * (4.1 / 7.4)

and then in 370+

load['KV'] = load['RS'] * (5. / 33.)
load['AL'] = load['MK'] * (6.0 / 7.0)

but with different factors. Which one is correct?

You can try to use my suggestions from the vresutils PR I mentioned before; I also discussed them a while ago with @coroa. But your "alterations" seem more complete. Did you take holidays, weekends, etc. into account?

Now we have duplicated code, once in vresutils and once here... not sure what to do about it. Is someone else using vresutils? If it's just PyPSA-Eur, then we can delete it there and keep only this version; otherwise we have two different sources people could use that are not exactly the same, so they produce different results. Super bad for debugging later on.

Comment on lines 200 to 210
def load_opsd_loaddata(load_fn=None, countries=None):
    if load_fn is None:
        load_fn = snakemake.input.load

    if countries is None:
        countries = snakemake.config['countries']

    load = pd.read_csv(load_fn, index_col=0, parse_dates=True)
    load = load.filter(items=countries)

    return load
Contributor

this is not needed anymore, right?

Contributor Author

Yes, right!

Comment on lines 406 to 427
# # check the number and lenght of gaps
nan_stats = nan_statistics(opsd_load)

gap_filling_threshold = snakemake.config['load']['gap_filling_threshold']

if nan_stats.consecutive.max() > gap_filling_threshold:
    logger.warning(f"Load data contains consecutive gaps of longer than '{gap_filling_threshold}' hours! Check dataset carefully!")

# adjust gaps and interpolate load data
logger.info(f"Gaps of {gap_filling_threshold} hours filled with data from previous week. Smaler gaps interpolated linearly.")
opsd_load = opsd_load.apply(fill_large_gaps, gapsize=gap_filling_threshold).interpolate(method='linear', limit=gap_filling_threshold)

# adjust gaps manuel
if snakemake.config['load']['adjust_gaps_manuel']:
    logger.info(f"Load data are adjusted manual.")
    opsd_load = manual_adjustment(load=opsd_load, source=snakemake.config['load']['source'])

# check the number and lenght of gaps after adjustment and interpolating
nan_stats = nan_statistics(opsd_load)

if nan_stats.consecutive.max() > gap_filling_threshold:
    logger.warning(f'Load data contains gaps after manuel adjustment. Modify manual_adjustment() function!')
Contributor

I think we're almost there. I would restructure this part a bit (sorry for the many iterations):

  1. manual corrections, if enabled
  2. interpolate gaps < gap_size_interpolated (with an info message), if NaNs exist
  3. fill by a weekly shift (with a warning), if NaNs exist
  4. raise an error if NaNs still exist

This way we prioritize the manual corrections and do the heuristic sanitizing afterwards, roughly as in the sketch below.
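A minimal sketch of that ordering (not the PR's final code; it reuses manual_adjustment, fill_large_gaps and the existing gap_filling_threshold config key as a stand-in for gap_size_interpolated, and the adjust_gaps_manual key name is an assumption):

threshold = snakemake.config['load']['gap_filling_threshold']

# 1. manual corrections, if enabled (config key name is an assumption)
if snakemake.config['load']['adjust_gaps_manual']:
    opsd_load = manual_adjustment(load=opsd_load, source=snakemake.config['load']['source'])

# 2. interpolate short gaps linearly, if NaNs exist
if opsd_load.isna().any().any():
    logger.info(f"Interpolating gaps of up to {threshold} hours linearly.")
    opsd_load = opsd_load.interpolate(method='linear', limit=threshold)

# 3. fill remaining gaps by a weekly shift, if NaNs exist
if opsd_load.isna().any().any():
    logger.warning("Filling remaining gaps with data from the previous week.")
    opsd_load = opsd_load.apply(fill_large_gaps, gapsize=threshold)

# 4. stop if NaNs still exist
if opsd_load.isna().any().any():
    raise ValueError("Load data still contains gaps after sanitizing.")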

Contributor Author

Should we keep the warning at the beginning that the data set contains gaps (lines 411 & 412)?
The new order is a good idea!

@FabianHofmann (Contributor)

FabianHofmann commented Jul 16, 2020

sorry, I hit the wrong button :D

@JanFrederickUnnewehr (Contributor Author)

> You often copy from start to stop; maybe you can use my function from vresutils for that:
>
> def copy_timeslice(load, cntry, start, stop, delta):
>     start = pd.Timestamp(start)
>     stop = pd.Timestamp(stop)
>     if start in load.index and stop in load.index:
>         load.loc[start:stop, cntry] = load.loc[start-delta:stop-delta, cntry].values
>     return load
>
> and then use it as, for example, copy_timeslice(load, 'GR', '2015-08-11 21:00', '2015-08-15 20:00', pd.Timedelta(weeks=1)). That saves many lines of code and makes it more readable.

That is a good idea. I will integrate it into my code.

> Also, if you interpolate, you can add a limit for how many hours should be interpolated at most. Before, we used 4 hours:
> load[interpolate_countries] = load[interpolate_countries].interpolate(limit=4)

The limit is now implemented as gap_filling_threshold.

> You update Kosovo and Albania twice, line 142+
>
> load['KV'] = load['RS'] * (4.8 / 27.)
> load['AL'] = load['MK'] * (4.1 / 7.4)
>
> and then in 370+
>
> load['KV'] = load['RS'] * (5. / 33.)
> load['AL'] = load['MK'] * (6.0 / 7.0)
>
> but with different factors. Which one is correct?

The corrections are for two different years: the first one is only for years before 2016, the second one for the year 2018.
"Scale parameter selected by energy consumption ratio from IEA Data browser for the year 2017"

> You can try to use my suggestions from the vresutils PR I mentioned before; I also discussed them a while ago with @coroa. But your "alterations" seem more complete. Did you take holidays, weekends, etc. into account?

Holidays are not included. Any idea how to integrate them in an automated way?

> Now we have duplicated code, once in vresutils and once here... not sure what to do about it. Is someone else using vresutils? If it's just PyPSA-Eur, then we can delete it there and keep only this version; otherwise we have two different sources people could use that are not exactly the same, so they produce different results. Super bad for debugging later on.

When the changes are merged into master, the previously used functions from "vresutils.load" are simply no longer imported. Is that what you mean?

@martacki (Member)

martacki commented Jul 16, 2020

> The corrections are for two different years: the first one is only for years before 2016, the second one for the year 2018.
> "Scale parameter selected by energy consumption ratio from IEA Data browser for the year 2017"

Will that work if you run e.g. 2013-2018? Wouldn't years 2013-2017 then be adapted with the 2018 value?

> Holidays are not included. Any idea how to integrate them in an automated way?

No, sorry. Unfortunately I did it by brute force. But maybe, if you add a "holidays" array and then check whether the gap you want to fill is in range of this array, the timedelta could be increased by one. Also, weekend days should not be filled with working days... would that work?
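One hypothetical way to automate that (purely illustrative, not part of this PR) is to shift the copied slice back by a further week when the source window would overlap a holiday, e.g. using the third-party holidays package:

import holidays
import pandas as pd

country_holidays = holidays.Germany(years=[2015])  # example country/year, assumption

def copy_timeslice_avoiding_holidays(load, cntry, start, stop, delta):
    start, stop = pd.Timestamp(start), pd.Timestamp(stop)
    source = pd.date_range(start - delta, stop - delta, freq='H')
    # if the source window touches a holiday, shift back one more week
    if any(ts.date() in country_holidays for ts in source):
        delta += pd.Timedelta(weeks=1)
    load.loc[start:stop, cntry] = load.loc[start - delta:stop - delta, cntry].values
    return load

Weekends could be handled similarly by comparing the weekday pattern of the source and target windows.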

> When the changes are merged into master, the previously used functions from "vresutils.load" are simply no longer imported. Is that what you mean?

I think we should remove one of them; otherwise it's just more code to maintain, which should basically be 1:1 the same, and that's just confusing.

@JanFrederickUnnewehr (Contributor Author)

JanFrederickUnnewehr commented Jul 16, 2020

> Will that work if you run e.g. 2013-2018? Wouldn't years 2013-2017 then be adapted with the 2018 value?

Is it intended that the simulation period of one calculation (one model run) is longer than one year? If so, the function manual_adjustment() must be modified. The old code allows multiple years but only one data source ("ENTSOE_power_statistics"), and the manual adjustment is not complete for all countries and years. This code currently only allows one year at a time but both data sources ("ENTSOE_transparency" or "ENTSOE_power_statistics").
"ENTSOE_power_statistics" is only available until mid-2019, as far as I know.

> Holidays are not included. Any idea how to integrate them in an automated way?

I would ignore that for the moment.

"vresutils.load"

Surely we should have only one version of load processing at the end. Let's see what @coroa has to say about this.

@JanFrederickUnnewehr (Contributor Author)

JanFrederickUnnewehr commented Jul 21, 2020

After a fruitful discussion with @FabianHofmann we agreed on the following procedure:
  • When downloading the data, it is automatically filtered for the simulation period (the snapshots from the config); see the snippet below.
  • Missing countries are "added" in the manual adjustment function.
  • The scaling parameters are not automatically adjusted to the selected simulation year.
  • The scaling factor for the manual "adding" of countries must be changed manually by the user in the code.
  • In the manual adjustment function, a distinction is still made between the two sources in order to be able to include "ENTSOE_transparency" data for future load time series.
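For the first point, a rough illustration (with hypothetical variable names, not necessarily the final code) of restricting the downloaded OPSD time series to the snapshots defined in the config:

import pandas as pd

snapshots = pd.date_range(freq='H', **snakemake.config['snapshots'])
opsd_load = pd.read_csv(url, index_col=0, parse_dates=True)  # url as in the retrieval step above
opsd_load = opsd_load.loc[snapshots[0]:snapshots[-1]]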

Code structure:

  1. Check the raw data and show a warning if the data contains gaps longer than gap_filling_threshold.
  2. Manually adjust the data (if the user sets adjust_gaps_manual: true).
  3. For larger gaps (min = 3 hours, max = 7 days), copy the previous week (see the sketch below).
  4. Interpolate the data (respecting gap_filling_threshold: 3 hours).
  5. Test whether there are still gaps; if yes, stop with a warning and point to the manual adjustment function.
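A hedged sketch of step 3 (a hypothetical helper, not necessarily the PR's code): copy the previous week into a gap, but only for gaps between 3 hours and 7 days long:

import pandas as pd

def fill_gap_from_previous_week(load, cntry, gap_start, gap_end,
                                min_gap=pd.Timedelta(hours=3),
                                max_gap=pd.Timedelta(days=7)):
    gap_start, gap_end = pd.Timestamp(gap_start), pd.Timestamp(gap_end)
    if not (min_gap <= gap_end - gap_start <= max_gap):
        return load  # shorter gaps are interpolated, longer ones need manual adjustment
    week = pd.Timedelta(weeks=1)
    load.loc[gap_start:gap_end, cntry] = load.loc[gap_start - week:gap_end - week, cntry].values
    return load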

Is there a way to automate the generation of the file doc/configuration.rst?
Let's first discuss the proposed changes before I add them to doc/release_notes.rst.

@fneum added this to the Release v0.2.1 milestone on Sep 26, 2020
@FabianHofmann mentioned this pull request on Dec 2, 2020
@FabianHofmann (Contributor)

Hey @JanFrederickUnnewehr, I'm closing this in favour of #211. If you still want to integrate the power statistics time series, there should be a good way to do so. But we can discuss it in the other PR or via mail/phone.
