Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: is_copy #18801

Closed
jreback opened this issue Dec 15, 2017 · 22 comments
Closed

DEPR: is_copy #18801

jreback opened this issue Dec 15, 2017 · 22 comments
Labels
Deprecate Functionality to remove in pandas
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Dec 15, 2017

this has always been an internal attribute. We can simply replace by ._is_copy and provide a deprecation warning on the property.

@jreback jreback added the Deprecate Functionality to remove in pandas label Dec 15, 2017
@jreback jreback added this to the 0.22.0 milestone Dec 15, 2017
cbertinato pushed a commit to cbertinato/pandas that referenced this issue Dec 17, 2017
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 18, 2017

is_copy is used in the most popular google hits about SettingWithCopyWarning, eg https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas and https://www.dataquest.io/blog/settingwithcopywarning/
(actually in the first you are using it yourself in an answer :-))

What is the alternative for power users in library code if they want to avoid an extra unnecessary copy?

(I never use it myself, as I just do copy() to avoid if needed, so I am personally not attached to this method, but typically don't handle with data where doing an extra copy() is a problem. But I think the use case mentioned by Andreas in #18799 is a valid one)

cbertinato pushed a commit to cbertinato/pandas that referenced this issue Dec 18, 2017
- Renamed 'is_copy' attribute to '_is_copy' for internal use
- Setup getter and setter for 'is_copy'
- Added tests for deprecation warning
@amueller
Copy link

So the suggested solution is to use copy() and that introduces an additional copy? Can you please document that? It seems a bit counter-intuitive that a library needs to do an expensive operation to avoid a warning in user-space, but if that's the suggested (and documented) fix I'll live with it.

@jreback
Copy link
Contributor Author

jreback commented Dec 21, 2017

@amueller not sure there is anything to add to: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy

you are chain indexing, which violates view semantics. the point is it may work, but there are cases where it won't. without copy-on-write, you must copy.

@amueller
Copy link

Ok maybe I just really don't understand the documentation, which is entirely possible. My reading of the warning is that we are returning a copy here, which is the intent. Are you saying it might sometimes return a view instead?

I don't want to use view semantics, and it tells me I got a copy. I'm very happy I got a copy, it's what I wanted. If I got a view instead, I would need to copy. But I thought the warning said I got a copy, not a view.

@jreback
Copy link
Contributor Author

jreback commented Dec 21, 2017

exactly, you have understood the point. you don't now whether it is a copy or a view on the original. That is the problem. you are doing chained operations and we can't be sure, so you get the warning. it is up to you to: 1) not chain operations, 2) defensively copy.

@amueller
Copy link

So would the warning also be thrown if it is a view?

@amueller
Copy link

amueller commented Dec 21, 2017

If it's also thrown if it's a view, then the warning is misleading, it says "A value is trying to be set on a copy of a slice from a DataFrame". If it's not thrown on a view, then it seems like I can distinguish between view and copy, and then I should only copy if I got a view.

@jreback
Copy link
Contributor Author

jreback commented Dec 21, 2017

no, if you only have a single dtyped dataframe you won't get this. it only occurs when you filter then add a column on multiple dtypes.

@amueller
Copy link

The question is: can I not find out at runtime if I got a copy or a view and only copy if I got a view?

@jreback
Copy link
Contributor Author

jreback commented Dec 21, 2017

you can try by introspecting the underlying arrays (not .values)
if anything is a view you must copy

@amueller
Copy link

ok. Does that mean that the warning might have been raised even though there is memory sharing?

@amueller
Copy link

Sorry if that question was answered by

no, if you only have a single dtyped dataframe you won't get this. it only occurs when you filter then add a column on multiple dtypes.

but I don't know how that relates to what happens to the memory. I assume it was meant as a reply to #18801 (comment) but I don't understand how it relates to it.

@jreback
Copy link
Contributor Author

jreback commented Dec 21, 2017

because someone could have chained indexed and we don’t know if views are created we
cant be sure that you are not actually looking at a view of something else
and more insidious is that you may have some columns with a view and some without
so we carry around this _is_copy flag which is a weakref to something that is referant
when a copy is made we can clear this
but until then some operations may not know if it’s a view or a copy
now it doesn’t matter until you actually try to assign something to a frame
when this happens

it’s jt trivial and mostly edge cases but if you are seeing the warning then you have incorrect code
it may still work but it IS chained indexing

use at your own risk - you should copy after filtering

@amueller
Copy link

Alright. I feel the warning is pretty confusing since it seems to imply that we made a copy, but it only implies that there is some part of the dataframe that was copied, and we don't actually know whether we made a copy or not.

use at your own risk - you should copy after filtering

Maybe the section in the docs that discusses this warning should say that? I don't think it says that now.

@jorisvandenbossche
Copy link
Member

To repeat myself from the issue: I think @amueller use case is valid one that we should try to support. If not through is_copy, then in another way.
(btw, @jreback it would be nice to at least answer to my objection on the issue here before merging)

In case of sklearn's train_test_split, they are using integer positional indexing, which will (as far as I understand fancy indexing in numpy) never return a view, not even in case of DataFrames with single dtypes. So they can be sure that their subset of a frame is a copy (which they want) and a SettingWithCopyWarning should never be raised on the frames returned by that function.

@amueller not sure there is anything to add to: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy

Explicitly taking a copy is not mentioned in those docs, so could certainly be added.

@has2k1
Copy link
Contributor

has2k1 commented Dec 31, 2017

Although is_copy was meant to be used internally, it leaked out because it was a solution to a justifiable problem i.e you slice a dataframe, you know you created independent dataframe and you want no complaints. A copy() operation on the new dataframe is wasteful, especially in library code.

plotnine uses is_copy in about 20 locations, and almost every call from the user will hit is_copy at least 10 times, and the number goes up linearly depending on different factors. The example on the documentation page goes through about 80 of them.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 22, 2018

@jreback Can you answer to the objections of me and others? (edit: see now there is a little bit more discussion in #19102)

@jreback
Copy link
Contributor Author

jreback commented Jan 22, 2018

Until copy-on-write, his is simply not possible in pandas in a reliable way. We don't have full control over memory allocations or when views are actually made.

@jorisvandenbossche
Copy link
Member

@has2k1 I see for plotnine you switched to using a contextmanager (with pd.option_context('mode.chained_assignment', None):) around the plotting code (has2k1/plotnine@9b068b4).
This is a satisfying solution for you ?

@has2k1
Copy link
Contributor

has2k1 commented Jan 22, 2018

It is an okay stopgap measure until copy-on-write is available, but as it implicitly assumes user cognisance it is not a good long term solution. Also, since the package aims to be extensible in many ways, the effects of a context manager may extend to other packages.

On the other-hand 'is_copy' was explicit, it forced the user to acknowledge the potential problem at every instance and I think it was better in an open source environment.

@znd4
Copy link

znd4 commented May 23, 2018

I have a different reason to want this:

I'm working on a data pipeline with large enough datasets that I'm worried about the performance hit from repeated copies. An easy way to try to control that would be something like assert transformed_dataframe.is_copy == False at the end of each unit test.

@sam-cohan
Copy link
Contributor

Yet another feasible use case can be when trying to do multi-processing where portions of a DataFrame are processed in different processes. I was under the assumption that if I take a view, when a process is spawned, only the view will be copied over taking 2X memory. In contrast, if I make a copy, then essentially the original process now has two full copies and each process will also have the partial copy so we will end up with 3X memory requirement...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas
Projects
None yet
Development

No branches or pull requests

6 participants