
Warning about IamDataFrame.data deprecation #109

Closed
danielhuppmann opened this issue Jun 15, 2020 · 6 comments

Comments

@danielhuppmann

Per the discussion in IAMconsortium/pyam#397, it seems that the data attribute of an IamDataFrame is handled not as intended in silicone.

Could you please point to the instance(s) where manipulating data is necessary such that the resulting IamDataFrame is inconsistent.

I can't guarantee that we won't refactor data in a way that breaks the current usage in silicone.

@Rlamboll
Collaborator

The majority of tests involve constructing and modifying dataframes in this way, e.g. from test_utils:
```python
df_low = simple_df.copy()
df_low.data["scenario"].loc[df_low.data["scenario"] == "scen_a"] = "right_scenario"
df_low.data["scenario"].loc[df_low.data["scenario"] == "scen_b"] = "wrong_scenario"
df_high = df_low.copy()
df_high["model"] = "high_model"
df_low.data["value"] = df_low.data["value"] - 10
df_high.data["value"] = df_high.data["value"] + 11
df_near = simple_df.copy()
df_near.data["value"] = df_near["value"] + 1
df_to_test = df_low.append(df_near).append(df_high)

# We need to refresh the metadata in order to proceed.
df_to_test = pyam.IamDataFrame(df_to_test.data)
```

If you change this, you will be going out of your way to make work for me without adding functionality for anyone. This would be annoying under any circumstances, but particularly so now: silicone is on hiatus while the paper is under review, and I am working on other, time-sensitive projects.

@danielhuppmann
Author

danielhuppmann commented Jun 15, 2020

First, this is not going to happen in the next few days - so you should have plenty of time to refactor when it's convenient. But I thought it better to give early warning of a possible deprecation.

Second, it will actually provide additional functionality (to you) by using pyam as intended - for example, using the meta attributes.

Third, it might be a good opportunity to refactor your code:

```python
df_low.data["scenario"].loc[df_low.data["scenario"] == "scen_a"] = "right_scenario"
df_low.data["scenario"].loc[df_low.data["scenario"] == "scen_b"] = "wrong_scenario"
```

seems to be doing the same as

```python
df_low.rename(scenario={'scen_a': 'right_scenario', 'scen_b': 'wrong_scenario'})
```

(with the second option being more readable and maintaining internal consistency).
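For reference, the mapping semantics behind that rename call can be sketched on a bare pandas DataFrame (a hedged illustration only; `df` below is a stand-in, not the actual `simple_df` fixture from the silicone tests):

```python
import pandas as pd

# stand-in long-format data with a "scenario" column
df = pd.DataFrame({"scenario": ["scen_a", "scen_b"], "value": [1.0, 2.0]})

# nested-dict replace: map values within the "scenario" column only,
# returning a new DataFrame and leaving the original untouched
renamed = df.replace({"scenario": {"scen_a": "right_scenario",
                                   "scen_b": "wrong_scenario"}})
```

This mirrors the advantage of `rename` over chained `.loc` assignment: the whole mapping is applied in one pass and the source object is not mutated.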

@znicholls
Collaborator

hi hi, I think there might be some misunderstanding here so just want to jump in to see if I can help clear things up at all.

Firstly, thanks @danielhuppmann for pointing out the rename function, which avoids the problem we've been hitting. I think refactoring to _data would make clear that users shouldn't be touching data directly. Maybe you could then have a data property which only has a getter, but no setter (similarly for meta and _meta).
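A minimal, dependency-free sketch of that getter-only idea (class and attribute names are hypothetical; pyam would of course hold a DataFrame internally rather than a list of dicts):

```python
class FrameWrapper:
    """Hypothetical sketch: internals live in _data; data is read-only."""

    def __init__(self, records):
        self._data = [dict(r) for r in records]

    @property
    def data(self):
        # hand back a copy, so in-place edits cannot corrupt internals
        return [dict(r) for r in self._data]


w = FrameWrapper([{"scenario": "scen_a", "value": 1.0}])
view = w.data
view[0]["scenario"] = "changed"  # only the copy is touched
# w._data is unchanged, and `w.data = ...` raises AttributeError,
# because the property defines no setter
```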

Secondly, @Rlamboll, I wouldn't worry about having extra work all of a sudden. @danielhuppmann and many others have gone to great lengths to make sure pyam adheres to semantic versioning. In short, that means that if they make a change which doesn't work with silicone, we can simply pin the version of pyam. We can leave it pinned until we're ready to update our code for the newer version, hence no nasty surprises or extra work when there are other tight deadlines, ever :) Whilst we don't get to decide the direction pyam develops in, we do get the guarantee that it will be versioned appropriately, so we can avoid any changes we don't like.

@Rlamboll
Collaborator

  1. Silicone creates data. This intrinsically does not preserve metadata.
  2. In all calculations, I extract the data (using, but not assigning to, df.data) to do more serious operations on it. It's wrong to imply that that's an operation to be frowned upon.
  3. Here, using rename is better than how the code is currently. However for this code, as for most of my code, pyam is a (sometimes useful, sometimes annoying) wrapper around a dataframe, and I don't really want to switch between these ways of thinking about it all the time. Rename does not help with changing values.
  4. In the code quoted, I also use the (metadata-breaking, but not .data calling) line
    df_high["model"] = "high_model"
    This is much simpler than a rename operation where I have to specify everything I want renamed.
  5. This toolkit is to be used as part of a pipeline, so semantic versioning is unhelpful unless you can guarantee that no part of the pipeline will use the new function/syntax. In which case you should question why you are introducing it.
    Point 1 is a reminder of why you should not introduce breaking changes to code without a serious need. I could have replaced every instance of .data with ._data and fixed the ~two bug cases in less time than it has taken me to write this. However, it would also break every script that uses pyam in the way I do, of which I suspect there are many.

@znicholls
Collaborator

tl;dr

I think there's a pretty easy fix here: .data becomes read-only, so people who directly extract the dataframe can still do so, but internal inconsistencies can no longer occur as a result of direct edits to .data (or to .meta, for that matter).


Having said the above, I think there's a few things which are worth clarifying.

2. It's wrong to imply that that's an operation to be frowned upon

I'm not sure that's what is being said. I think we're just clarifying that one shouldn't operate on .data in place. I acknowledge that's not clear from the way pyam is written, but I think it is clear that operating in place on .data has unexpected consequences, so it's best avoided. It is then up to the authors of pyam to decide whether to make this explicit by changing the code base. Yes, such changes should be made in consultation with users, but flat-out rejecting the idea is not helpful.

4. In the code quoted, I also use the (metadata-breaking, but not .data calling) line
df_high["model"] = "high_model"

I think the point is that being able to set values in this way is an (I suspect unintended) consequence of pyam's convenience access (see https://github.com/IAMconsortium/pyam/blob/0a7931f4fd983c462deba8654e82c5c0f7d520d9/pyam/core.py#L141). I think @danielhuppmann's point is that this is a relic of earlier work and shouldn't actually be used in practice (as you can see in the code, only meta is altered, which breaks the internal consistency). I guess this is the nature of bugs: you put stuff in and only work out later that it wasn't what was intended. At that point your options are: a) leave it forever, so downstream users don't have headaches but your own development stalls and you have permanently buggy code; b) fix it immediately and don't tell anyone, and downstream users will work it out; or c) flag it as a bug, raise deprecation warnings so people can change code that uses deprecated features, and use clear versioning so that downstream users can identify and update to the new behaviour as suits them. Of those options, I think c) is clearly best and am happy pyam does it this way. Of course, if you want to offer a fix so that the direct access (e.g. data["scenario"] = "interesting scenario") actually works as intended, I think such a PR would be gladly welcomed into pyam.

5. This toolkit is to be used as part of a pipeline, so semantic versioning is unhelpful unless you can guarantee that no part of the pipeline will use the new function/syntax

I don't think it's fair to suggest that pyam should have to think of every single place it's used before it makes a change. If we're writing a pipeline elsewhere, then it's our job to control the versions of packages within the pipeline. In such a case the semantic versioning is essential for quickly identifying behaviour conflicts. It's our job as pipeline maintainers to deal with conflicting requirements, not pyam's job.

  1. why you do not introduce breaking changes to code without a serious need

With all due respect, I think you're underestimating the pyam team here. They have a massive user base which they are looking after, and they don't make changes like this lightly. If they say the change needs to happen, then I think they should be believed, and/or an issue should be raised in pyam where they have a chance to respond.

In this case, I think the need is pretty clear. Directly fiddling with data means that meta can become out of date, which then blows up a whole bunch of pyam functionality. The other problem with data is that it eats memory (see IAMconsortium/pyam#335) hence a better solution will be needed at some point.

As discussed above, one very easy change would be that .data becomes read-only (so people who directly extract the dataframe can still do so).

@danielhuppmann
Author

Thanks @znicholls for the detailed explanation - I agree with everything you say! Indeed, having data as a getter is a good solution. I guess I should have started with why we are planning to refactor our internals (better performance, less memory), which will have this unfortunate side-effect.

If silicone is indeed only working on the data and does not require any extra features, refactoring to

```python
data = df.data
# ... operations on data ...
df_new = IamDataFrame(data)
```

is a pain, but should be manageable. I apologize for the inconvenience!
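That extract → operate → rebuild pattern might look like the following on a bare pandas DataFrame (a sketch only; `simple` stands in for an IamDataFrame's `.data`, and the final rebuild step would be `pyam.IamDataFrame(data)` in real code):

```python
import pandas as pd

# stand-in for df.data (long-format IAMC-style data, simplified)
simple = pd.DataFrame({"scenario": ["scen_a", "scen_b"],
                       "value": [1.0, 2.0]})

data = simple.copy()  # extract a working copy
# ... operations on the plain DataFrame ...
data["scenario"] = data["scenario"].replace(
    {"scen_a": "right_scenario", "scen_b": "wrong_scenario"})
data["value"] = data["value"] - 10
# rebuild: df_new = pyam.IamDataFrame(data) in real code
```

Doing all edits on the extracted DataFrame and reconstructing at the end means the IamDataFrame's internal metadata is never out of sync with its data.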
