Should variable-unit pairs be unique in a scenario (ensemble)? #338

Open
danielhuppmann opened this issue Feb 26, 2020 · 8 comments

@danielhuppmann
Member

Description

A comment by @gidden in #335 raises a thorny question that deserves its own issue for discussion (I think):

Should any variable in a scenario (or scenario ensemble) be linked to a specific unit, or should we allow non-unique combinations?

Why could that happen

  • a model reports several units for easier visualization workflows
  • an IamDataFrame combines multiple models reporting natively in different units
  • a user converts data from one unit to another for some reason, but wants to keep the "original data" as is, e.g., by using df.append(df.convert_unit(...)) (the default behaviour of unit conversion is to replace the data, not to append).
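
To make the third case concrete, here is a minimal sketch using plain Python tuples rather than pyam itself. The row layout, the made-up values, and the helper name `units_per_variable` are assumptions for illustration only, not pyam API. It mimics "convert and append" and shows that one variable then maps to two units.

```python
from collections import defaultdict

# Hypothetical long-format rows with made-up values:
# (model, scenario, region, variable, unit, year, value)
rows = [
    ("model_a", "scen_1", "World", "Primary Energy", "EJ/yr", 2020, 1.0),
    ("model_a", "scen_1", "World", "Emissions|CO2", "Mt CO2/yr", 2020, 2.0),
]

# Mimic "convert and append": keep the EJ/yr row and add a PJ/yr copy of it.
converted = [
    (m, s, r, v, "PJ/yr", y, value * 1000.0)
    for (m, s, r, v, u, y, value) in rows
    if u == "EJ/yr"
]
combined = rows + converted


def units_per_variable(data):
    """Return {variable: sorted list of units} for the given rows."""
    mapping = defaultdict(set)
    for _model, _scenario, _region, variable, unit, _year, _value in data:
        mapping[variable].add(unit)
    return {variable: sorted(units) for variable, units in mapping.items()}


print(units_per_variable(combined))
```

After the append, "Primary Energy" is linked to both EJ/yr and PJ/yr, which is exactly the non-unique combination under discussion.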

Immediate problems?

AFAIK all functions (aggregation, etc.) work on both variable and unit separately, so there is no risk of accidentally summing apples and oranges and getting fruit punch. But I might be wrong...

Relevant example as food for thought

One relevant example from the SR15: we had three variables

Emissions|Kyoto Gases (AR4-GWP100)
Emissions|Kyoto Gases (AR5-GWP100)
Emissions|Kyoto Gases (SAR-GWP100)

each with the unit Mt CO2-equiv/yr. But we could also have had one variable Emissions|Kyoto Gases and "stored" the GWP-metric in the unit.

Way forward

I see three options:

  1. no validation or constraints in pyam
  2. no hard constraints, but add something like assert_variable_consistency() which returns all non-unique pairs
  3. validation of consistency (raise an error if [variable, unit] is non-unique)
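
Options 2 and 3 could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not an existing pyam implementation: the function names `non_unique_variable_units` and `validate_variable_units` are hypothetical, and the input is simply a list of (variable, unit) pairs.

```python
from collections import defaultdict


def non_unique_variable_units(pairs):
    """Option 2: return {variable: sorted units} for variables with several units."""
    units = defaultdict(set)
    for variable, unit in pairs:
        units[variable].add(unit)
    return {v: sorted(u) for v, u in units.items() if len(u) > 1}


def validate_variable_units(pairs):
    """Option 3: raise an error if any variable maps to more than one unit."""
    conflicts = non_unique_variable_units(pairs)
    if conflicts:
        raise ValueError(f"Non-unique variable-unit pairs: {conflicts}")


pairs = [
    ("Emissions|CH4", "Mt CH4/yr"),
    ("Emissions|CH4", "Mt CO2-equiv/yr"),
    ("Emissions|CO2", "Mt CO2/yr"),
]
print(non_unique_variable_units(pairs))
```

Option 1 corresponds to calling neither function. Option 3 is just option 2 plus a raise, so both could share one code path.
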
@danielhuppmann
Member Author

@gidden @znicholls @byersiiasa @khaeru @Rlamboll - any comments or preferences?

@khaeru
Contributor

khaeru commented Feb 26, 2020

Comments but no preferences:

  • Write tests for the unusual cases described in the OP.
  • About the IAMC format, my only thought is there should be a spec of what is required, suggested, and optional.
    • Then, you can choose behaviour for pyam that matches or is more strict than the spec requirements.
    • Validation tools like (2) and (3) in the OP are a good way to start.
    • Use a function argument to control whether they generate warnings, or exceptions.
    • Use configuration to control whether they are applied automatically, or not.
  • In SDMX terms:
    • From what I've seen, UNITS are typically stored as an Attribute, rather than a Dimension.
    • If the Attribute is attached to a Series or Dataset, then it's the same for all Observations therein. But the different kinds of attachment can be mixed in the same Dataset/message.
    • If storing an Emissions measure, I would use:
      • a Dimension for the SPECIES, but
      • a separate Attribute for GWP_METHOD or something like that, with values like "AR4-GWP100", when UNITS == "t CO2-equiv/year"; but omit the attribute when e.g. UNITS == "t CH4/year" (no conversion based on GWP has been applied).
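
The emissions layout in the last bullets could look roughly like this in plain Python. This is only a sketch and uses neither an SDMX library nor pyam. The class name `EmissionsSeries` and the substring check on "CO2-equiv" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmissionsSeries:
    """Hypothetical series key: SPECIES as a dimension, the rest as attributes."""

    species: str                      # dimension, e.g. "CH4"
    units: str                        # attribute, e.g. "t CH4/year"
    gwp_method: Optional[str] = None  # attribute, only for CO2-equivalent units

    def __post_init__(self):
        # Assumption: CO2-equivalent units are recognised by this substring.
        uses_gwp = "CO2-equiv" in self.units
        if uses_gwp and self.gwp_method is None:
            raise ValueError("CO2-equivalent units need a GWP_METHOD attribute")
        if not uses_gwp and self.gwp_method is not None:
            raise ValueError("GWP_METHOD only applies to CO2-equivalent units")


ch4_native = EmissionsSeries(species="CH4", units="t CH4/year")
ch4_ar4 = EmissionsSeries(
    species="CH4", units="t CO2-equiv/year", gwp_method="AR4-GWP100"
)
print(ch4_native, ch4_ar4)
```

Keeping the GWP method out of both the variable name and the unit string means the same species can appear once natively and once per GWP metric without any string parsing.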

@Rlamboll
Collaborator

Rlamboll commented Feb 26, 2020

I've been assuming model / scenario / region / variable / time is a unique key, and throwing errors if I see multiple units for a variable, so everything would need to be filtered before being put into Silicone if you allow multiple unit/variable pairs. It's worth focusing on the fact that the values we use in calculations are mostly the gas concentrations themselves (no -equiv), so I'd be happier with always specifying what equivalence we mean in the units than having a separate column for AR5 GWP100 etc., and extending our unique keys further. I don't know that we gain much by moving this from the variable name to the unit, although it is slightly more accurate. It looks like there have been problems with different equivalence factors in the past, though, so maybe a carefully thought-through approach is good here.

My general response to "Why could that happen" is that any specific sub-task should use a specific, filtered DataFrame anyway. I encounter a version of this problem when constructing the basket values and then downscaling them, and simply have one DataFrame with CO2-equiv values to calculate the baskets and another with the gas-specific measurements to downscale.

@danielhuppmann
Member Author

Well, so at least we identified one potential misunderstanding between the different tools!

One immediate reaction, though - the current filter() in pyam only works per column, so it is not possible to say: give me Emissions|CO2 in CO2 and Emissions|CH4 in CH4 (e.g. to filter out CH4 emissions in CO2-equivalent). You'd have to do both filters separately and then concatenate again... If there is a specific need for that, please open an(other) issue.
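
A small sketch of the "filter separately, then concatenate" workaround, using plain Python rows instead of pyam. The helper `filter_rows` only imitates a per-column filter and is not pyam's `filter()`. The values are made up, and the selection differs slightly from the one above (CH4 in its native unit plus N2O in CO2-equivalent) because that case cannot be expressed in a single column-wise pass.

```python
# Hypothetical rows with made-up values: (variable, unit, value)
rows = [
    ("Emissions|CH4", "Mt CH4/yr", 1.0),
    ("Emissions|CH4", "Mt CO2-equiv/yr", 2.0),
    ("Emissions|N2O", "kt N2O/yr", 3.0),
    ("Emissions|N2O", "Mt CO2-equiv/yr", 4.0),
]


def filter_rows(data, variable=None, unit=None):
    """Imitate a per-column filter: each criterion is applied independently."""
    return [
        row
        for row in data
        if (variable is None or row[0] in variable)
        and (unit is None or row[1] in unit)
    ]


# One column-wise pass also returns CH4 in CO2-equivalent, which is unwanted.
crossed = filter_rows(
    rows,
    variable=["Emissions|CH4", "Emissions|N2O"],
    unit=["Mt CH4/yr", "Mt CO2-equiv/yr"],
)

# Workaround: filter each (variable, unit) pair separately, then concatenate.
selected = filter_rows(
    rows, variable=["Emissions|CH4"], unit=["Mt CH4/yr"]
) + filter_rows(rows, variable=["Emissions|N2O"], unit=["Mt CO2-equiv/yr"])

print(len(crossed), selected)
```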

@znicholls
Collaborator

No strong thoughts from me! Discussion here looks good.

@gidden
Member

gidden commented Mar 16, 2020

Hi all - picking this back up now that our units-related PRs are starting to wrap up. First, perhaps this should be picked up at a higher level than this toolchain, in the IAMC data group, as suggested by @khaeru:

About the IAMC format, my only thought is there should be a spec of what is required, suggested, and optional.

My immediate gut reaction is that variables should have a unique unit mapping (in pyam). Reasons for this are:

  • it has been our implicit assumption thus far
  • it simplifies internal design
  • it can be circumvented as needed by putting specific conversion information in variable names without breaking the data model

But that is only my gut feeling, and those who have already implemented the unit conversions might have more informed opinions. @danielhuppmann, @khaeru, others?

@danielhuppmann
Member Author

danielhuppmann commented Mar 16, 2020

I'm not opposed to that approach in principle, just slightly worried about how big the refactoring will be...

If someone volunteers to go ahead (maybe jointly with switching to pd.Series or xarray), please keep in mind:

  • implement validation during initialisation
  • append() should raise an error when trying to merge timeseries with inconsistent variable-unit mappings (currently, this wouldn't be a concern)
  • rename(unit=<mapping>) should continue to work as is
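
The three points above could be sketched like this. `TimeseriesSet` is a hypothetical stand-in for an IamDataFrame-like container, and `rename_unit` stands in for rename(unit=<mapping>). None of this is the actual pyam implementation.

```python
class TimeseriesSet:
    """Hypothetical container that enforces one unit per variable."""

    def __init__(self, rows):
        # rows: iterable of (variable, unit, value); validated on initialisation
        self.rows = list(rows)
        self.unit_of = self._build_mapping(self.rows)

    @staticmethod
    def _build_mapping(rows):
        mapping = {}
        for variable, unit, _value in rows:
            known = mapping.setdefault(variable, unit)
            if known != unit:
                raise ValueError(
                    f"{variable!r} has inconsistent units {known!r} and {unit!r}"
                )
        return mapping

    def append(self, other):
        # Re-running the constructor raises if the merged mappings conflict.
        return TimeseriesSet(self.rows + other.rows)

    def rename_unit(self, mapping):
        # Renaming whole units keeps the variable-unit mapping unique.
        return TimeseriesSet(
            [(v, mapping.get(u, u), value) for v, u, value in self.rows]
        )


in_ej = TimeseriesSet([("Primary Energy", "EJ/yr", 1.0)])
in_pj = TimeseriesSet([("Primary Energy", "PJ/yr", 1000.0)])
renamed = in_ej.rename_unit({"EJ/yr": "exajoule/yr"})
print(renamed.unit_of)
```

With this design, in_ej.append(in_pj) raises a ValueError, while appending data for a different variable or with the same unit goes through unchanged.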

@khaeru
Contributor

khaeru commented Mar 17, 2020

About the IAMC format, my only thought is there should be a spec of what is required, suggested, and optional.

My immediate gut reaction is that variables should have a unique unit mapping (in pyam)

Nothing new to add, and still no preferences, since I don't use pyam. If you go this route, then I would again suggest making a clear distinction in your docs re: what is (a) part of the IAMC format spec and (b) pyam behaviour (limitations/automated helpful checks).

ETA: it's fine for (a) and (b) to be different, so long as the difference is advertised!


5 participants